Sluggish system or client code?
Over the course of several weeks, we worked to deploy a one-stop data science platform for data analysis and visualisation for one of our clients. The platform consisted of interconnected applications: the engine that drives the productivity of the data scientists sitting at the wheel.
The components of the platform were:
- GitLab: where data scientists can develop and share their code with all the benefits of Git version control.
- Posit Workbench: which hosts development environments such as RStudio on beefy servers, with far more computational power than a typical local machine.
- Posit Connect: which allows data scientists to easily share their work. It supports publishing documents, reports, dashboards and interactive web applications, as well as hosting Application Programming Interfaces (APIs).
- Posit Package Manager: which allows for the organisation, centralisation and distribution of code packages. It provides a mirror of R and Python packages downloaded from external sources such as CRAN (the Comprehensive R Archive Network). It also provides a way for internally-developed R and Python packages to be shared, if the client wishes.
Our deployment philosophy
When we deploy these components together, we do so in such a way that they enhance each other’s functionality; the whole is greater than the sum of its parts. For instance, we:
- Allow users to sign in with the same authentication across all of these applications.
- Ensure that users are able to publish documents from Workbench to Connect out-of-the-box. Users don’t have to worry about specifying the correct URLs or ports for all of this to work.
- Ensure that users in Workbench can access any package they need (developed internally, or from popular external package repositories such as CRAN) via Package Manager, without any extra configuration required.
Having all of this set up out-of-the-box means users can get straight to exploring and utilising the many ways in which Posit can increase productivity, without spending time on set-up.
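As a small illustration of what this means in practice (the URL below is a placeholder, not a real Package Manager address): in a fresh Workbench session, the default repository option already points at the internal Package Manager, so installing packages requires no extra set-up from the user.

```r
# Run inside a fresh RStudio session on Workbench.
# The default repository is pre-configured to point at Posit Package Manager,
# so no options(repos = ...) call is needed before installing packages.
getOption("repos")
# e.g. c(CRAN = "https://packagemanager.example.com/cran/latest")  # placeholder URL

install.packages("dplyr")  # served transparently via Package Manager
```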
We also put disaster recovery measures in place, ensuring that in the event of the unexpected (server failure or data corruption, for example), we can recover all data from a backup.
Finally, we carry out security hardening. Each component in our system is checked to ensure it operates to appropriate security standards. This means our infrastructure is secured to UK Government (National Cyber Security Centre) standards, and certified by CREST-accredited cloud security professionals.
Workbench system performance
One of the key selling points of doing computations on a cloud-hosted server – as opposed to a data scientist’s laptop – is that it’s possible to access very powerful machines in the cloud. This improves the speed at which data scientists’ code gets executed, meaning quicker iterations on analysis. When commands take more than a few seconds to run, the wait can break an analyst’s train of thought.
If users perceive that the system provided to them is less-than-performant, they may not use it. It is important that we demonstrate to our users that the platform we provide has excellent performance.
Fast Feedback
Now, back to the client project. We had nearly finished the project, and had given the client a testing environment. Out of the blue, we received this message:
Hi all, is there any reason why rowwise() is performing particularly slow in Workbench?
- Time on Workbench: 5.3 minutes
- Time to execute the example code on the client’s laptop: 8 seconds
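The client’s exact code isn’t reproduced here, but a minimal example of the kind of rowwise() workload that exposes this gap might look something like the following (the data frame, column names and sizes are purely illustrative):

```r
library(dplyr)

# Illustrative data: 100,000 rows and five numeric columns
n <- 1e5
df <- tibble(x1 = runif(n), x2 = runif(n), x3 = runif(n),
             x4 = runif(n), x5 = runif(n))

# A typical rowwise() pattern: an aggregate computed one row at a time
system.time(
  df |>
    rowwise() |>
    mutate(total = sum(c_across(x1:x5))) |>
    ungroup()
)
```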
Oh no!
We were shocked. We pride ourselves on providing applications that are useful to our data scientist users. It seemed that – even though the CPUs in our cloud instances are far more powerful than those in a typical laptop – our system was the slower of the two. Clearly some configuration must be wrong, something we could change to put things right!
What we tried first
We tried everything we could think of to trace the root cause of the problem. We evaluated the code on our own laptops, and on other Workbench servers. Every test pointed the same way: running on the powerful Workbench machines was much slower than running on laptops. It perplexed us!
Ok, what next?
There were a few more places that we could look:
- The specifications of the CPUs in our servers, compared with those in our laptops, to see whether that could explain the slowdown (a quick way of checking this from within R is sketched after this list). It did not.
- R sessions outside of Workbench, to see whether the slowdown was somehow related to the RStudio Workbench application itself. It was not.
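For the hardware comparison, a quick in-session check is often enough. Here is a sketch of the kind of thing we mean, using base R and the {benchmarkme} package (assumed to be installed; it is not part of base R):

```r
# Inspect the hardware the current R process is running on
library(benchmarkme)

get_cpu()                # CPU model and clock speed
get_ram()                # total RAM
parallel::detectCores()  # number of logical cores

# Run a standard set of benchmarks to compare machines like-for-like
results <- benchmark_std(runs = 3)
```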
R packages
One thing left to try: was the version of the R package in question, {dplyr}, the same on all machines?
The Workbench servers we provided had newer versions of that package (v1.1.0 and later), while our laptops all had older versions cached. This makes sense: the server had only just been set up, and users installing the packages they needed for the first time would naturally pick up the latest releases. On their laptops, they may have installed {dplyr} or {tidyverse} some time ago.
By downgrading the version of {dplyr}, we were able to execute the given reproducible example in 3 seconds – faster than the client’s existing solution!
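Checking and pinning the package version is straightforward. A sketch of the comparison, using {remotes} to install a specific release (v1.0.10 being the last release before the 1.1.0 series):

```r
# Compare the installed dplyr version on each machine
packageVersion("dplyr")
# Workbench: a 1.1.x release; laptops: older, pre-1.1.0 versions

# Install a specific older release for testing ({remotes} assumed available)
remotes::install_version("dplyr", version = "1.0.10")
```

Since the platform already includes Package Manager, another option would be to install from one of its dated snapshots, which pins every package to a consistent point in time.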
Obvious solution: downgrade? Check Diffify first
You may think that the obvious solution would be to encourage the client to downgrade the version of {dplyr} that they use in production to one before v1.1.0, in which dplyr::rowwise() is much faster.
However, we had one last thing to check: what would be lost if we did this? The later versions might contain improvements that we would give up by downgrading. Would this break existing code?
Enter Diffify.
Diffify provides a comparison between different versions of R packages stored on CRAN or Python packages stored on PyPI. It allows users to select the versions of packages they want to compare, and presents the differences in a human-readable way, making it easy to pick out anything relevant quickly.
It does this by looking at things such as:
- the NEWS files included with packages,
- changes to the functions included as part of the package,
- the arguments which those functions take.
Diffify was useful in this case! Had we downgraded {dplyr}, we would have given up recent performance improvements in other dplyr functions. A particular example is the case_when() function: in version 1.1.0 it was significantly slower, an important fact given that the client was moving across to case_when() as an alternative to rowwise(), which is being deprecated. Downgrading and pinning an older version would have cut us off from these ongoing improvements.
The release notes said:
Fixed a major performance regression in case_when(). It is still a little slower than in dplyr 1.0.10, but we plan to improve this further in the future (#6674).
So perhaps both the client’s current approach of using rowwise() and their future approach of using case_when() would perform well on v1.0.10. But this would have to be tested.
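To make the trade-off concrete, here is an illustrative sketch (not the client’s code) of the sort of migration involved: the same derived column computed row by row with rowwise(), and then vectorised with case_when().

```r
library(dplyr)

n <- 1e5
df <- tibble(x = runif(n), y = runif(n))

# Current approach: rowwise() evaluates the mutate() once per row
rowwise_version <- df |>
  rowwise() |>
  mutate(band = if (x > y) "high" else "low") |>
  ungroup()

# Future approach: case_when() works on whole columns at once; its speed
# depends heavily on which {dplyr} version is installed
case_when_version <- df |>
  mutate(band = case_when(
    x > y ~ "high",
    TRUE  ~ "low"
  ))

identical(rowwise_version$band, case_when_version$band)  # both give the same result
```

Benchmarking both forms under the candidate {dplyr} versions is what “this would have to be tested” amounts to.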
Final recommendation we made to client
For this particular function, rowwise(), it turns out the key determinant of performance is the version of the {dplyr} package being used. Although downgrading the version would solve this particular problem, it’s important to make sure that doing so doesn’t affect other functions under active development, such as case_when().
In fact, the functions used in the client’s previous approach were moving to a suspended development stage. Downgrading would therefore have solved a problem that would soon no longer exist, while introducing a new one for the code being migrated to the better-supported case_when() function.
Summary
Here we see some of the extra support we provided our client for a problem we hadn’t anticipated at the beginning of the project. Sometimes an issue appears to be in one place, but further investigation reveals it’s in another. We are glad to have a good relationship with our client, who mentioned the slowdown to us and allowed us to get straight to problem solving.
How can we help?
If you are looking for a data science platform, or require support maintaining your existing set-up, get in touch! As Full Service Certified Posit Partners, we are trusted by Posit to provide installation, support and maintenance services for their products, as well as to resell Posit licences at no extra cost – with great deals on our services.