Vetiver: First steps in MLOps
This is part one of a two-part series on {vetiver}. Future blogs will be linked here as they are released.
- Part 1: Vetiver: First steps in MLOps (This post)
- Part 2: Vetiver: Model Deployment
Most R users are familiar with the classic workflow popularised by R for Data Science. Data scientists begin by importing and cleaning the data, then iteratively transform, model, and visualise it. Visualisation drives the modelling process, which in turn prompts new visualisations. Periodically, they summarise their work and report results.
This workflow stems partly from classical statistical modelling, where we are interested in a limited number of models and in understanding the system behind the data. In contrast, machine learning prioritises prediction, which means considering and updating many models. Machine Learning Operations (MLOps) expands the modelling component of the traditional data science workflow, providing a framework to continuously build, deploy, and maintain machine learning models in production.
Data: Importing and Tidying
The first step in deploying your model is automating data importation and tidying. Although this step is a standard part of the data science workflow, a few considerations are worth highlighting.
File formats: Consider moving from large CSV files to a more efficient format like Parquet, which reduces storage costs and, because column types are stored with the data, simplifies the tidying step.
Moving to packages: As your analysis matures, consider creating an R package to encourage proper documentation, testing, and dependency management.
Tidying & cleaning: With your code in a package and tests in place, optimise bottlenecks to improve efficiency.
Versioning data: Ensure reproducibility by including timestamps in your database queries or otherwise ensuring you can retrieve the same dataset in the future.
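To make the versioning point concrete, here is a minimal sketch in base R. It assumes every row carries a load timestamp. The table, the column name loaded_at, the cutoff, and the helper get_snapshot() are all hypothetical; with a real database the same filter would live in the query.

```r
# Hypothetical raw table with a load timestamp on every row
measurements = data.frame(
  id = 1:4,
  body_mass_g = c(3750, 3800, 3250, 3450),
  loaded_at = as.POSIXct(
    c("2024-05-01 09:00:00", "2024-05-10 09:00:00",
      "2024-05-20 09:00:00", "2024-06-01 09:00:00"),
    tz = "UTC"
  )
)

# Store this cutoff alongside the model; reusing it later returns the same rows
snapshot_time = as.POSIXct("2024-05-27 00:00:00", tz = "UTC")

# Hypothetical helper: keep only the rows loaded on or before the cutoff
get_snapshot = function(data, cutoff) {
  data[data$loaded_at <= cutoff, , drop = FALSE]
}

training_data = get_snapshot(measurements, snapshot_time)
nrow(training_data)
#> [1] 3
```

Rows loaded after the cutoff are ignored, so rerunning the import next month still reproduces the dataset the model was trained on.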
Modelling
This post isn’t focused on modelling frameworks, so we’ll use {tidymodels} and the {palmerpenguins} dataset for brevity.
library("palmerpenguins")
library("tidymodels")
# Remove missing values
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)
We aim to predict penguin species using island, flipper_length_mm, and body_mass_g. A scatter plot indicates this should be feasible: Gentoo is clearly separated from the other species, but pulling apart Adelie and Chinstrap looks a little trickier.
For the model, we’ll again keep things simple: a straightforward nearest neighbour model that uses island, flipper length and body mass to predict species:
model = recipe(species ~ island + flipper_length_mm + body_mass_g,
               data = penguins_data) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(penguins_data)
The model object can now be used to predict species. Predicting on the same data that we used to fit the model gives an accuracy of around 95%. As this is the training data, the figure is likely to be optimistic.
model_pred = predict(model, penguins_data)
mean(model_pred$.pred_class == as.character(penguins_data$species))
#> [1] 0.9474
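Since the accuracy above is measured on the data the model was fitted to, a held-out test set gives a more honest estimate. The following is a minimal sketch rather than part of the workflow above: the 80/20 split, the seed, and the object names are my assumptions. It uses initial_split(), training() and testing() from {rsample}, which is loaded with {tidymodels}.

```r
library("palmerpenguins")
library("tidymodels")

penguins_data = tidyr::drop_na(penguins, flipper_length_mm)

# Hold back 20% of the rows, stratified by species (proportion is an assumption)
set.seed(2024)
penguins_split = initial_split(penguins_data, prop = 0.8, strata = species)
train_data = training(penguins_split)
test_data = testing(penguins_split)

# Same recipe and model specification, fitted on the training rows only
holdout_model = recipe(species ~ island + flipper_length_mm + body_mass_g,
                       data = train_data) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(train_data)

# Accuracy on rows the model has not seen during fitting
holdout_pred = predict(holdout_model, test_data)
holdout_accuracy = mean(
  holdout_pred$.pred_class == as.character(test_data$species)
)
holdout_accuracy
```

The exact value depends on the seed and the split, so treat it as an estimate rather than a fixed number.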
Vetiver Model
Now that we have a model, we can start with MLOps and {vetiver}. First, collate all the necessary information to store, deploy, and version the model.
v_model = vetiver::vetiver_model(model,
                                 model_name = "k-nn",
                                 description = "blog-test")
v_model
#>
#> ── k-nn ─ <bundled_workflow> model for deployment
#> blog-test using 3 features
The v_model object is a list with six elements, including our description.
names(v_model)
#> [1] "model" "model_name" "description" "metadata" "prototype"
#> [6] "versioned"
v_model$description
#> [1] "blog-test"
The metadata contains various model-related components.
v_model$metadata
#> $user
#> list()
#>
#> $version
#> NULL
#>
#> $url
#> NULL
#>
#> $required_pkgs
#> [1] "kknn" "parsnip" "recipes" "workflows"
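The user element is empty in the output above because we didn't supply any metadata of our own. As a sketch, vetiver_model() has a metadata argument that takes a list. The field names training_rows and data_source below are made up for illustration, and the model is refitted only so that the example runs on its own.

```r
library("palmerpenguins")
library("tidymodels")

# Same data and model as above, refitted so that this sketch is self-contained
penguins_data = tidyr::drop_na(penguins, flipper_length_mm)
model = recipe(species ~ island + flipper_length_mm + body_mass_g,
               data = penguins_data) |>
  workflow(nearest_neighbor(mode = "classification")) |>
  fit(penguins_data)

# Hypothetical metadata fields; choose whatever is useful to record
v_model_meta = vetiver::vetiver_model(
  model,
  model_name = "k-nn",
  description = "blog-test",
  metadata = list(training_rows = nrow(penguins_data),
                  data_source = "palmerpenguins")
)

# Custom fields are stored under the user element
v_model_meta$metadata$user
```

Recording details such as the size or source of the training data here means they travel with the model when it is stored.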
Storing your Model
To deploy a {vetiver} model object, we use a pin from the {pins} package. A pin is simply an R (or Python!) object that is stored for reuse at a later date. The most common use case of the {pins} package (at least for me) is as an easy way to cache data for a Shiny application or Quarto document.
However, we can pin any R object, including a pre-built model. We pin objects to “boards”, which can exist in many places, including Azure, Google Drive, or a simple S3 bucket. For this example, I’m using Posit Connect:
vetiver::vetiver_pin_write(board = pins::board_connect(), v_model)
To retrieve the object, use:
# Not something you would normally do with a {vetiver} model
pins::pin_read(pins::board_connect(), "colin/k-nn")
#> $model
#> bundled workflow object.
#>
#> $prototype
#> # A tibble: 0 × 3
#> # ℹ 3 variables: island <fct>, flipper_length_mm <int>, body_mass_g <int>
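If you don't have access to Posit Connect, the same steps can be tried on a temporary local board. This is only a sketch: the linear model on the built-in mtcars data is a stand-in chosen to keep the example light, and the names cars_model, v_cars and "cars-lm" are made up.

```r
# Stand-in model (assumption): a small linear model instead of the k-nn workflow
cars_model = lm(mpg ~ wt + cyl, data = mtcars)
v_cars = vetiver::vetiver_model(cars_model, model_name = "cars-lm")

# A temporary board on the local file system, with versioning switched on
board = pins::board_temp(versioned = TRUE)

# Store the model; the pin takes its name from model_name
vetiver::vetiver_pin_write(board, v_cars)

# List the stored versions of the pin
versions = pins::pin_versions(board, "cars-lm")
versions

# Read the pin back as a vetiver model, rather than as a plain list
v_restored = vetiver::vetiver_pin_read(board, "cars-lm")
v_restored$model_name
```

On a versioned board, writing an updated model under the same name adds a new version instead of overwriting the old one, which is what makes rolling back possible.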
Deploying as an API
The final step is to construct an API around your stored model. This is achieved using the {plumber} package. To deploy locally, i.e. on your own computer, we create a plumber instance and pass the model using {vetiver}:
# Fix the port so that the URLs below match; otherwise {plumber} picks one at random
plumber::pr() |>
  vetiver::vetiver_api(v_model) |>
  plumber::pr_run(port = 7764)
This deploys the API locally. When you run the code, a browser window will likely open. If it doesn’t, simply navigate to http://127.0.0.1:7764/__docs__/.
If the API has successfully deployed, then
base_url = "127.0.0.1:7764/"
url = paste0(base_url, "ping")
r = httr::GET(url)
metadata = httr::content(r, as = "text", encoding = "UTF-8")
jsonlite::fromJSON(metadata)
should return
#> $status
#> [1] "online"
#>
#> $time
#> [1] "2024-05-27 17:15:39"
The API also has the endpoints metadata and pin-url, allowing you to programmatically query the model. The key endpoint for MLOps is predict. This endpoint allows you to pass new data to your model and predict the outcome:
url = paste0(base_url, "predict")
endpoint = vetiver::vetiver_endpoint(url)
pred_data = penguins_data |>
  dplyr::select("island", "flipper_length_mm", "body_mass_g") |>
  dplyr::slice_sample(n = 10)
predict(endpoint, pred_data)
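Because predict is an ordinary HTTP endpoint, clients that don't use {vetiver} can call it too. The sketch below shows roughly what such a request looks like. It assumes the endpoint accepts a JSON array with one object per row, which is how {jsonlite} serialises a data frame. The helper predict_raw() is hypothetical, and the final call is commented out because it needs the local API to be running.

```r
library("palmerpenguins")

# Two rows of new data with the three predictors the model expects
new_data = penguins[1:2, c("island", "flipper_length_mm", "body_mass_g")]

# A data frame serialises to a JSON array with one object per row
body_json = jsonlite::toJSON(new_data)
body_json

# Hypothetical helper: POST the JSON to the predict endpoint and parse the reply
predict_raw = function(url, json) {
  r = httr::POST(url, body = json, httr::content_type_json())
  httr::stop_for_status(r)
  jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
}

# Requires the local API to be running on port 7764:
# predict_raw("http://127.0.0.1:7764/predict", body_json)
```

In R, predict() on a vetiver endpoint is the more convenient route; the raw request is mainly useful when the caller is written in another language.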
Summary
This post introduces MLOps and its applications. In the next post, we’ll discuss deploying models in production.