Speeding up package installation

Can’t be bothered reading, tell me now

A simple one line tweak can significantly speed up package installation and updates.

The wonder of CRAN

One of the best features of R is CRAN. When a package is submitted to CRAN, not only is it checked under three versions of R

  • R-past, R-release and R-devel

but also three different operating systems

  • Windows, Linux and Mac (with multiple flavours of each)

CRAN also checks that the updated package doesn’t break existing packages. This last part is particularly tricky when you consider all the dependencies a package like ggplot2 or Rcpp have. Furthermore, it performs all these checks within 24 hours, ready for the next set packages.

What many people don’t realise is that for CRAN to perform this miracle of package checking, it builds and checks these packages in parallel; so rather than installing a single package at a time, it checks multiple packages at once. Obviously some care has to be taken when checking/installing packages due to the connectivity between packages, but R takes care of these details.

Parallel package installation: Ncpus

If you examine the help package of ?install.packages, there’s a sneaky argument called Ncpus. From the help page:

Ncpus: The number of parallel processes to use for a parallel install of more than one source package.

The default value of this argument is

Ncpus = getOption(‘Ncpus’, 1L)

The getOption() part determines if the value has been set in options(). If no value is found, the default number of processes to use is 1. If you haven’t heard of Ncpus, it’s almost certainly 1, but you can check using

getOption("Ncpus", 1L)
## [1] 6

Does it work?

To test if changing the value of Ncpus makes a difference, we’ll install the tidyverse package with all it’s associated dependencies. On my machine, all packages live in a directory called /rpackages/, for each test below I deleted /rpackages/ so all tidyverse dependencies were reinstalled.

My machine has eight cores

parallel::detectCores()
# [1] 8

So it doesn’t make sense to set Ncpus above 8. Another point is that although R reports that I have 8 cores, I only have 4 physical cores; the other 4 are due to hyper-threading. In practice, this means that I’m only likely to get at most a 6 fold speed-up.

For this experiment, I used the RStudio CRAN repository, set via

options(repos = c("CRAN" = "https://cran.rstudio.com/"))

To time the installation procedure, I just use the standard system.time() function.

After removing the /rpackages/ directory, I set Ncpus equal to 1 and installed the tidyverse package with dependencies

options(Ncpus = 1)
system.time(install.packages("tidyverse"))
## Time in seconds
#    user  system elapsed 
#372.252  15.468 409.364 

So a standard installation takes almost 7 minutes (409/60)!

Before we go on, it’s worth noting a couple of caveats:

  • This timing also includes the download time of the packages; however for simplicity I’m ignoring this. Downloading the packages takes around 20 seconds
  • The above number uses a sample size of 1 to estimate the time; repeating the above experiment, resulted in a remarkably consistent installation time of 410 seconds.

Repeating this experiment with different values of Ncpus gives the table below:

Ncpus Elapsed (Secs) Ratio
1 409 2.26
2 224 1.24
4 196 1.08
6 181 1.00

So setting Ncpus to 2 allows us to half the installation time from 409 seconds to around 224 (seconds). Increasing Ncpus to 4 gives a further speed boost. Due to the dependencies between packages, we’ll never achieve a perfect speed-up, e.g. if package X depends on Y, then we have to install X first. However, for a simple change we get an easy speed boost.

Furthermore, setting Ncpus gives a speed boost when updating packages via update.packages().

A permanent change: .Rprofile

Clearly writing options(Ncpus = 6) before you install a package is a pain. However, you can just add it to your .Rprofile file. In a future blog post, we cover the .Rprofile in more detail, but for the purposes of this post, your .Rprofile is a file that contains R code that runs whenever R starts. You can test whether you have an .Rprofile file using the command

file.exists("~/.Rprofile")

If you don’t have an .Rprofile file, create one in your home area

Sys.getenv("HOME")

Then simply add options(Ncpus = XX) to your file.

The one remaining question is what value should you set XX. I typically set it to six since I have eight cores. This allows packages to be installed in parallel, while giving me a little bit of wiggle room to check email and listen to music.

References

If you are interested in how CRAN handles the phenomenal number of package submissions, check out this recent talk:


comments powered by Disqus