Understanding the Parquet file format
This is part of a series of related posts on Apache Arrow.
Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to explain how parquet works and the tricks it uses to efficiently store data.
Key features of parquet are:
- it’s cross platform
- it’s a recognised file format used by many systems
- it stores data in a column layout
- it stores metadata
The latter two points allow for efficient storage and querying of data.
Column Storage
Suppose we have a simple data frame:
tibble::tibble(id = 1:3,
name = c("n1", "n2", "n3"),
age = c(20, 35, 62))
#> # A tibble: 3 × 3
#> id name age
#> <int> <chr> <dbl>
#> 1 1 n1 20
#> 2 2 n2 35
#> 3 3 n3 62
If we stored this data set as a CSV file, what we see in the R terminal is mirrored in the file storage format. This is row storage. This is efficient for file queries such as
SELECT * FROM table_name WHERE id = 2
We simply go to the 2nd row and retrieve that data. It's also very easy to append rows to the data set - we just add a row to the bottom of the file. However, if we want to sum the data in the age column, then this is potentially inefficient. We would need to determine which value on each row relates to age, and extract that value.
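To make this concrete, here is a quick sketch of what the CSV file actually contains - each line of the file corresponds to one row of the data:
tf = tempfile(fileext = ".csv")
readr::write_csv(tibble::tibble(id = 1:3,
                                name = c("n1", "n2", "n3"),
                                age = c(20, 35, 62)),
                 file = tf)
# Inspect the raw text of the file: one line per row
cat(readLines(tf), sep = "\n")
#> id,name,age
#> 1,n1,20
#> 2,n2,35
#> 3,n3,62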
Parquet uses column storage. In column layouts, column data are stored sequentially:
1 2 3
n1 n2 n3
20 35 62
With this layout, queries such as
SELECT * FROM table_name WHERE id = 2
are now inconvenient. But if we want to sum up all ages, we simply go to the third row of the layout and add up the numbers.
Reading and writing parquet files
In R, we read and write parquet files using the {arrow} package.
# install.packages("arrow")
library("arrow")
packageVersion("arrow")
#> [1] '14.0.0.2'
To create a parquet file, we use write_parquet():
# Use the penguins data set
data(penguins, package = "palmerpenguins")
# Create a temporary file for the output
parquet = tempfile(fileext = ".parquet")
write_parquet(penguins, sink = parquet)
To read the file, we use read_parquet(); an example follows the list below. One of the benefits of using parquet is small file sizes. This is important when dealing with large data sets, especially once you start incorporating the cost of cloud storage. Reduced file size is achieved via two methods:
- File compression. This is specified via the compression argument in write_parquet(). The default is snappy.
- Clever storage of values (the next section).
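For example, we can read the penguins data back in. And because parquet stores data by column, read_parquet() can pull back just the columns we ask for via its col_select argument, without touching the rest of the file:
# Read the whole file back as a data frame
penguins2 = read_parquet(parquet)
# Read a subset of columns; column storage makes this cheap
read_parquet(parquet, col_select = c("species", "body_mass_g"))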
Parquet Encoding
Since parquet uses column storage, values of the same type are now stored together. This opens up a whole world of optimisation tricks that aren't available when we save data as rows, e.g. as CSV files.
Run length encoding
Suppose a column just contains a single value repeated on every row. Instead of storing the same number over and over (as a CSV file would), we can just record "value X repeated N times". This means that even when N gets very large, the storage costs remain small. If we have more than one value in a column, then we can use a simple look-up table. In parquet, this is known as run length encoding. If we have the following column
c(4, 4, 4, 4, 4, 1, 2, 2, 2, 2)
#> [1] 4 4 4 4 4 1 2 2 2 2
This would be stored as
- value 4, repeated 5 times
- value 1, repeated once
- value 2, repeated 4 times
To see this in action, let's create a simple example where the character A is repeated multiple times in a data frame column:
x = data.frame(x = rep("A", 1e6))
We can then create a couple of temporary files for our experiment
parquet = tempfile(fileext = ".parquet")
csv = tempfile(fileext = ".csv")
and write the data to the files
arrow::write_parquet(x, sink = parquet, compression = "uncompressed")
readr::write_csv(x, file = csv)
Using the {fs} package, we extract the file sizes:
# Could also use file.info()
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#> size
#> <fs::bytes>
#> 1 808
#> 2 1.91M
We see that the parquet file is tiny, whereas the CSV file is almost 2 MB. This is roughly a 2,500-fold reduction in file size.
Dictionary encoding
Suppose we had the following character vector
c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers")
#> [1] "Jumping Rivers" "Jumping Rivers" "Jumping Rivers"
If we want to save storage, then we could replace Jumping Rivers with the number 0 and have a table that maps between 0 and Jumping Rivers. This would significantly reduce storage, especially for long vectors.
x = data.frame(x = rep("Jumping Rivers", 1e6))
arrow::write_parquet(x, sink = parquet)
readr::write_csv(x, file = csv)
fs::file_info(c(parquet, csv))[, "size"]
#> # A tibble: 2 × 1
#> size
#> <fs::bytes>
#> 1 909
#> 2 14.3M
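This is the same trick base R uses for factors: the distinct values are stored once as levels, and each element is just an integer code pointing into that table. A quick illustration using base R (not parquet itself, but the same dictionary idea):
f = factor(c("Jumping Rivers", "Jumping Rivers", "Jumping Rivers"))
# Underneath, a factor is integer codes plus a levels look-up table
unclass(f)
#> [1] 1 1 1
#> attr(,"levels")
#> [1] "Jumping Rivers"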
Delta encoding
This encoding is typically used in conjunction with timestamps. Times are usually stored as Unix time: the number of seconds that have elapsed since January 1st, 1970. This storage format isn't particularly helpful for humans, so it is usually pretty-printed to make it more palatable for us. For example,
(time = Sys.time())
#> [1] "2024-02-01 13:02:03 CET"
unclass(time)
#> [1] 1706788923
If we have a large number of time stamps in a column, one method for reducing file size is to simply subtract the minimum time stamp from all values. For example, instead of storing
c(1628426074, 1628426078, 1628426080)
#> [1] 1628426074 1628426078 1628426080
we would store
c(0, 4, 6)
#> [1] 0 4 6
with the corresponding offset 1628426074. (Strictly speaking, parquet's delta encoding stores the difference between consecutive values rather than from a single offset, but the principle is the same: small numbers are cheaper to store than large ones.)
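We can mimic both ideas with a couple of lines of R:
x = c(1628426074, 1628426078, 1628426080)
# Offset encoding: subtract the minimum and store it separately
x - min(x)
#> [1] 0 4 6
# Delta encoding proper: store consecutive differences
diff(x)
#> [1] 4 2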
Other encodings
There are a few other tricks that parquet uses; the Apache Parquet GitHub page gives a complete overview.
If you have a parquet file, you can use parquet-mr to investigate the encoding used within a file. However, installing the tool isn’t trivial and does take some time.
Feather vs Parquet
The obvious question that comes to mind when discussing parquet is how it compares to the feather format. Feather is optimised for speed, whereas parquet is optimised for storage. It's also worth noting that feather is the Apache Arrow file format.
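If you want to compare the two formats on your own data, {arrow} also provides write_feather(). A minimal sketch - the resulting sizes will depend on your data and package versions, so no output is shown:
feather = tempfile(fileext = ".feather")
write_feather(penguins, sink = feather)
# Compare the on-disk sizes of the two formats
fs::file_info(c(parquet, feather))[, "size"]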
Parquet vs RDS Formats
The RDS file format is used by readRDS()/saveRDS(); the closely related RData format is used by load()/save(). It is a file format native to R and can only be read by R. The main benefit of using RDS is that it can store any R object - environments, lists, and functions.
If we are solely interested in rectangular data structures, e.g. data frames, then the reasons for using RDS files are:
- the file format has been around for a long time and isn't likely to change, so it is backwards compatible
- it doesn't depend on any external packages; just base R.
The advantages of using parquet are:
- parquet file sizes are slightly smaller. If you want to compare file sizes, make sure you set compression = "gzip" in write_parquet() for a fair comparison, as in the sketch after this list
- parquet files are cross platform
- in my experiments, parquet files were, as you would expect, slightly smaller. For some use cases, an additional saving of 5% may be worth it. But, as always, it depends on your particular use case.
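As a minimal sketch of such a comparison (saveRDS() compresses with gzip by default, which is why gzip makes for a fair comparison; exact sizes will depend on your data, so no output is shown):
rds = tempfile(fileext = ".rds")
saveRDS(penguins, file = rds)
# Match saveRDS()'s default gzip compression for a fair comparison
write_parquet(penguins, sink = parquet, compression = "gzip")
fs::file_info(c(parquet, rds))[, "size"]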
References
- The default compression algorithm used by Parquet.
- A nice talk by Raoul-Gabriel Urma.
- Parquet-tools for interrogating Parquet files.
- The official list of file optimisations.
- Stack Overflow questions on Parquet: Feather & Parquet, and Arrow and Parquet.