Vetiver: First steps in MLOps

Published: June 13, 2024

tags: r, vetiver, machine-learning, production, mlops

This is Part 1 of a series of blog posts on {vetiver}. Future posts will be linked here as they are released.

Part 1: Vetiver: First steps in MLOps (This post)
Part 2: Vetiver: Model Deployment
Part 3: Vetiver: Monitoring Models in Production
Part 4: Vetiver: MLOps for Python

Most R users are familiar with the classic workflow popularised by R for Data Science. Data scientists begin by importing and cleaning the data, then iteratively transform, model, and visualise it. Visualisation drives the modeling process, which in turn prompts new visualisations, and periodically, they summarise their work and report results.

Traditional data science workflow diagram. Stages are import, tidy, then transform, visualise, model in a loop, then communicate.

This workflow stems partly from classical statistical modeling, where we are interested in a limited number of models and understanding the system behind the data. In contrast, machine learning prioritises prediction, necessitating the consideration and updating of many models. Machine Learning Operations (MLOps) expands the modeling component of the traditional data science workflow, providing a framework to continuously build, deploy, and maintain machine learning models in production.

Data: Importing and Tidying

The first step in deploying your model is automating data import and tidying. Although this step is a standard part of the data science workflow, a few considerations are worth highlighting.

File formats: Consider moving from large CSV files to a more efficient format like Parquet, which reduces storage costs and, because it stores column types alongside the data, simplifies the tidying step.

Moving to packages: As your analysis matures, consider creating an R package to encourage proper documentation, testing, and dependency management.

Tidying & cleaning: With your code in a package and tests in place, optimise bottlenecks to improve efficiency.

Versioning data: Ensure reproducibility by including timestamps in your database queries or otherwise ensuring you can retrieve the same dataset in the future.
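As a rough illustration of the file format and versioning points above, here is a minimal sketch that writes a date-stamped snapshot as both CSV and Parquet. It assumes the {arrow} package is installed; the file names and the use of tempdir() are placeholders, not part of the original workflow.

```r
library("palmerpenguins")

# Hypothetical snapshot name: a date stamp lets us find this exact extract later
snapshot_date = format(Sys.Date(), "%Y-%m-%d")
csv_path = file.path(tempdir(), paste0("penguins-", snapshot_date, ".csv"))
parquet_path = file.path(tempdir(), paste0("penguins-", snapshot_date, ".parquet"))

# Write the same data in both formats
write.csv(penguins, csv_path, row.names = FALSE)
arrow::write_parquet(penguins, parquet_path)

# Read both back and compare how the species column is typed
csv_data = read.csv(csv_path)
parquet_data = arrow::read_parquet(parquet_path)
class(csv_data$species)
class(parquet_data$species)

# Compare the size on disk (in bytes)
file.size(csv_path)
file.size(parquet_path)
```

With a CSV, column types have to be re-declared on every read; the Parquet file carries them with the data, which is one less tidying step to maintain.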

Modelling

This post isn’t focused on modeling frameworks, so we’ll use {tidymodels} and the {palmerpenguins} dataset for brevity.

library("palmerpenguins")
library("tidymodels")
# Remove missing values
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

We aim to predict penguin species using island, flipper_length_mm, and body_mass_g. A scatter plot indicates this should be feasible.

Plot of body mass (g) vs flipper length (mm). The species of penguin is shown by the colour and the island is shown by the shape. There is a visible split between the Gentoo penguins and the others, with Gentoo being larger overall on both measures.

The scatter plot shows an obvious separation between Gentoo and the other species, but pulling apart Adelie and Chinstrap looks a little trickier.

Modelling-wise, we'll again keep things simple: a straightforward nearest neighbour model, where we use the island, flipper length and body mass to predict species:

model = recipe(species ~ island + flipper_length_mm + body_mass_g, 
               data = penguins_data) |>
  workflow(nearest_neighbor(mode = "classification")) |> 
  fit(penguins_data)

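The model object can now be used to predict species. Reusing the same data that we fitted the model to, we get an accuracy of around 95%. Because the model has already seen this data, treat the figure as an optimistic estimate.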

model_pred = predict(model, penguins_data)
mean(model_pred$.pred_class == as.character(penguins_data$species))
#> [1] 0.9474
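A held-out test set gives a fairer estimate of accuracy. The following is a minimal sketch rather than part of the original analysis: the 75/25 split, the stratification, and the seed are all arbitrary choices made for illustration.

```r
library("palmerpenguins")
library("tidymodels")

set.seed(2024)  # arbitrary seed, only so the split is repeatable
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

# Hold back a quarter of the rows, keeping species proportions similar
penguins_split = initial_split(penguins_data, prop = 0.75, strata = species)
penguins_train = training(penguins_split)
penguins_test = testing(penguins_split)

# Same recipe and model specification as above, fitted to the training rows only
holdout_model = recipe(species ~ island + flipper_length_mm + body_mass_g,
                       data = penguins_train) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(penguins_train)

# Accuracy on rows the model has not seen
holdout_pred = predict(holdout_model, penguins_test)
holdout_accuracy = mean(
  as.character(holdout_pred$.pred_class) == as.character(penguins_test$species)
)
holdout_accuracy
```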

Vetiver Model

Now that we have a model, we can start with MLOps and {vetiver}. First, collate all the necessary information to store, deploy, and version the model.

v_model = vetiver::vetiver_model(model, 
                           model_name = "k-nn", 
                           description = "blog-test")
v_model
#> 
#> ── k-nn ─ <bundled_workflow> model for deployment 
#> blog-test using 3 features

The v_model object is a list with six elements, including our description.

names(v_model)
#> [1] "model"       "model_name"  "description" "metadata"    "prototype"  
#> [6] "versioned"

v_model$description
#> [1] "blog-test"

The metadata contains various model-related components.

v_model$metadata
#> $user
#> list()
#> 
#> $version
#> NULL
#> 
#> $url
#> NULL
#> 
#> $required_pkgs
#> [1] "kknn"      "parsnip"   "recipes"   "workflows"
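The empty user element is where your own metadata ends up. Here is a minimal sketch, using a throwaway lm() fit on mtcars so that the block runs on its own; the model name and the metadata fields are made up for illustration.

```r
# Stand-in model so the example is self-contained (not the k-nn model above)
cars_model = lm(mpg ~ wt + cyl, data = mtcars)

cars_v = vetiver::vetiver_model(
  cars_model,
  model_name = "cars-lm",                 # hypothetical name
  description = "metadata demo",
  metadata = list(trained_on = "mtcars",  # hypothetical fields
                  owner = "data-team")
)

# Custom fields are stored under the `user` element of the metadata
cars_v$metadata$user
```

Anything that helps you trace a deployed model back to its origin (training data snapshot, git commit, owner) is a reasonable candidate for this list.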

Storing your Model

To deploy a {vetiver} model object, we use a pin from the {pins} package. A pin is simply an R (or Python!) object that is stored for reuse at a later date. The most common use case of the {pins} package (at least for me) is caching data for a Shiny application or Quarto document.

However, we can pin any R object, including a pre-built model. We pin objects to “boards”, which can live in many places, including Azure, Google Drive, or a simple S3 bucket. For this example, I’m using Posit Connect:

vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)

To retrieve the object, use:

# Not something you would normally do with a {vetiver} model
pins::pin_read(pins::board_connect(), "colin/k-nn")
#> $model
#> bundled workflow object.
#> 
#> $prototype
#> # A tibble: 0 × 3
#> # ℹ 3 variables: island <fct>, flipper_length_mm <int>, body_mass_g <int>
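If you don't have a Posit Connect server to hand, a temporary local board is enough to try the write/read cycle. This is a sketch under a few assumptions: it uses a stand-in lm() model and a made-up model name so the block runs on its own, and the board disappears when the R session ends.

```r
# Stand-in model so the example is self-contained (not the k-nn model above)
cars_v = vetiver::vetiver_model(lm(mpg ~ wt + cyl, data = mtcars),
                                model_name = "cars-lm")  # hypothetical name

# A throwaway board in a temporary directory, with versioning switched on
board = pins::board_temp(versioned = TRUE)
vetiver::vetiver_pin_write(board, cars_v)

# Each write creates a new version of the pin
cars_versions = pins::pin_versions(board, "cars-lm")
cars_versions

# The usual way to get a {vetiver} model back from a board
cars_back = vetiver::vetiver_pin_read(board, "cars-lm")
class(cars_back)
```

Swapping pins::board_temp() for pins::board_connect() (or another board) should be the only change needed to move from this experiment to shared storage.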

Deploying as an API

The final step is to construct an API around your stored model. This is achieved using the {plumber} package. To deploy locally, i.e. on your own computer, we create a plumber instance and pass the model to it using {vetiver}:

plumber::pr() |>
  vetiver::vetiver_api(v_model) |>
  plumber::pr_run()

This serves the API locally. When you run the code, a browser window will likely open. If it doesn’t, simply navigate to http://127.0.0.1:7764/__docs__/ (the port is printed in the console and may differ on your machine).
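Since the examples here refer to port 7764, one option is to fix the port rather than rely on whichever one {plumber} picks. A minimal sketch, again with a stand-in lm() model and a made-up model name so that it runs on its own; the interactive() guard is only there so the block doesn't hang when run as a script.

```r
# Stand-in model so the example is self-contained (not the k-nn model above)
cars_v = vetiver::vetiver_model(lm(mpg ~ wt + cyl, data = mtcars),
                                model_name = "cars-lm")  # hypothetical name

# Build the router first; nothing is served until pr_run() is called
api = plumber::pr() |>
  vetiver::vetiver_api(cars_v)

# pr_run() blocks the R session while the API is up, so only start it interactively
if (interactive()) {
  plumber::pr_run(api, port = 7764)
}
```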

If the API has deployed successfully, then

# Include the scheme so the URL is unambiguous
base_url = "http://127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)

should return

#> $status
#> [1] "online"
#> 
#> $time
#> [1] "2024-05-27 17:15:39"

The API also has metadata and pin-url endpoints, allowing you to programmatically query the model. The key endpoint for MLOps is predict. This endpoint allows you to pass new data to your model and predict the outcome:

url = paste0(base_url, "predict")
endpoint = vetiver::vetiver_endpoint(url)
pred_data = penguins_data |>
  dplyr::select("island", "flipper_length_mm", "body_mass_g") |>
  dplyr::slice_sample(n = 10)
predict(endpoint, pred_data)
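Because the API speaks JSON over HTTP, clients don't have to use {vetiver} or even R. The sketch below shows roughly what a request body could look like, built by hand with {jsonlite}. The two penguins are invented values, the api_running flag is a hypothetical switch, and the address assumes the local API from above is listening on port 7764.

```r
# Two made-up penguins with the three predictors the model expects
new_penguins = data.frame(
  island = c("Biscoe", "Dream"),
  flipper_length_mm = c(210L, 190L),
  body_mass_g = c(5000L, 3700L)
)

# A JSON array with one record per row
body = jsonlite::toJSON(new_penguins)
body

# Hypothetical switch: set to TRUE only once the local API is actually running
api_running = FALSE
if (api_running) {
  r = httr::POST("http://127.0.0.1:7764/predict",
                 body = body,
                 httr::content_type_json())
  httr::content(r)
}
```

In day-to-day R work, vetiver_endpoint() plus predict() is the more convenient route; the raw request is mainly useful when handing the API over to colleagues working in other languages.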

Summary

This post introduced MLOps and walked through creating, versioning, storing, and locally serving a model with {vetiver}. In the next post, we’ll discuss deploying models in production.