nanoparquet 0.5.1

We’re very chuffed to announce the release of nanoparquet 0.5.1 (and 0.5.0). nanoparquet is a small, self-sufficient R package for reading and writing Parquet files.

You can install it from CRAN with:

install.packages("nanoparquet")

This blog post will go over some of the improvements in nanoparquet 0.5.0 and 0.5.1.

You can see a full list of changes in the release notes here and here.

library(nanoparquet)

List columns

Parquet has a LIST type for columns whose values are variable-length sequences of scalars. nanoparquet 0.5.0 adds support for reading and writing such columns.

Note: for now nanoparquet supports one level of nesting: each element of a list column must be an atomic vector of a single type, not a list of lists. All elements in a column must have the same scalar type.

To write a list column, put a regular R list into your data frame. Each element must be an atomic vector (integer, double, or character), NULL for a missing list, or an empty vector for an empty list. NA values inside an element vector encode missing elements.

df <- data.frame(id = 1:4)
df$scores <- list(c(80L, 95L, 70L), c(100L), NULL, integer(0))
write_parquet(df, tmp <- tempfile(fileext = ".parquet"))

read_parquet_schema(tmp)

# A data frame: 5 × 14
  file_name  r_col name  r_type type  type_length repetition_type converted_type
  <chr>      <int> <chr> <chr>  <chr>       <int> <chr>           <chr>         
1 /var/fold…    NA sche… <NA>   <NA>           NA <NA>            <NA>          
2 /var/fold…     1 id    integ… INT32          NA REQUIRED        INT_32        
3 /var/fold…     2 scor… list(… <NA>           NA OPTIONAL        LIST          
4 /var/fold…     2 list  <NA>   <NA>           NA REPEATED        <NA>          
5 /var/fold…     2 elem… <NA>   INT32          NA OPTIONAL        INT_32        
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#   precision <int>, field_id <int>, children <list>

read_parquet() reads LIST columns back as R list columns:

as.data.frame(read_parquet(tmp))

  id     scores
1  1 80, 95, 70
2  2        100
3  3       NULL
4  4
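
The same round trip should work for other element types. Here is a minimal sketch with a character list column, exercising the three kinds of missingness described above (an NA inside an element, a NULL element, and an empty vector). The names df_tags, tmp_tags and back are made up for this illustration.

```r
library(nanoparquet)

# A character list column:
# - NA inside an element marks a missing value within that list,
# - NULL marks a missing list,
# - character(0) marks an empty list.
df_tags <- data.frame(id = 1:3)
df_tags$tags <- list(c("red", NA, "blue"), NULL, character(0))

tmp_tags <- tempfile(fileext = ".parquet")
write_parquet(df_tags, tmp_tags)

# Read the file back and inspect the list column.
back <- as.data.frame(read_parquet(tmp_tags))
str(back$tags)
lengths(back$tags)
```

Note that lengths() cannot tell a missing list from an empty one, since both have length zero; use is.null() on the individual elements if the distinction matters.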

infer_parquet_schema() shows how nanoparquet maps each column to a Parquet type. For list columns, the r_type shows e.g. list(integer):

infer_parquet_schema(df)[2:7]

# A data frame: 4 × 6
  r_col name    r_type        type  type_length repetition_type
  <int> <chr>   <chr>         <chr>       <int> <chr>          
1     1 id      integer       INT32          NA REQUIRED       
2     2 scores  list(integer) <NA>           NA OPTIONAL       
3     2 list    <NA>          <NA>           NA REPEATED       
4     2 element <NA>          INT32          NA OPTIONAL

A LIST column occupies three rows in the schema: the outer list node, a repeated group node, and the leaf element node.

When you need to specify the element type explicitly, you can use parquet_schema():

schema <- parquet_schema(
  id     = "INT32",
  scores = list("LIST", element = "INT32")
)
write_parquet(df, tmp2 <- tempfile(fileext = ".parquet"), schema = schema)

read_parquet_schema(tmp2)

# A data frame: 5 × 14
  file_name  r_col name  r_type type  type_length repetition_type converted_type
  <chr>      <int> <chr> <chr>  <chr>       <int> <chr>           <chr>         
1 /var/fold…    NA sche… <NA>   <NA>           NA <NA>            <NA>          
2 /var/fold…     1 id    integ… INT32          NA REQUIRED        <NA>          
3 /var/fold…     2 scor… list(… <NA>           NA OPTIONAL        LIST          
4 /var/fold…     2 list  <NA>   <NA>           NA REPEATED        <NA>          
5 /var/fold…     2 elem… <NA>   INT32          NA OPTIONAL        INT_32        
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#   precision <int>, field_id <int>, children <list>

New types

`bit64::integer64`

Parquet’s INT64 type holds 64-bit integers. R’s native integer is only 32 bits, so nanoparquet has mapped INT64 to double by default. nanoparquet 0.5.1 adds support for bit64::integer64, which gives you true 64-bit integer arithmetic in R.

write_parquet() now writes bit64::integer64 columns as INT64:

library(bit64)
df2 <- data.frame(id = as.integer64(c(1e18, 2e18, 3e18)))
write_parquet(df2, tmp3 <- tempfile(fileext = ".parquet"))
read_parquet_schema(tmp3)

# A data frame: 2 × 14
  file_name  r_col name  r_type type  type_length repetition_type converted_type
  <chr>      <int> <chr> <chr>  <chr>       <int> <chr>           <chr>         
1 /var/fold…    NA sche… <NA>   <NA>           NA <NA>            <NA>          
2 /var/fold…     1 id    double INT64          NA REQUIRED        INT_64        
# ℹ 6 more variables: logical_type <I<list>>, num_children <int>, scale <int>,
#   precision <int>, field_id <int>, children <list>

To read INT64 columns back as bit64::integer64 instead of the default double, use the read_int64_type option. The bit64 package must be installed; if it isn’t, nanoparquet throws a clear error.

read_parquet(tmp3, options = parquet_options(read_int64_type = "integer64"))

# A data frame: 3 × 1
       id
  <int64>
1    1e18
2    2e18
3    3e18
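
To see why the choice of read type matters, here is a small sketch using a value that does not fit into a double exactly. It assumes bit64 is installed and uses only the read_int64_type option shown above; df_big, tmp_big, as_double and as_int64 are names invented for this example.

```r
library(nanoparquet)
library(bit64)

# 2^53 + 1 is the smallest positive integer that a double cannot
# represent exactly, so we build it from a string.
big <- as.integer64("9007199254740993")
df_big <- data.frame(id = big)

tmp_big <- tempfile(fileext = ".parquet")
write_parquet(df_big, tmp_big)

# Default: INT64 is read as double, which cannot hold this value exactly.
as_double <- as.data.frame(read_parquet(tmp_big))

# Opt-in: INT64 is read as bit64::integer64, which keeps every digit.
as_int64 <- as.data.frame(read_parquet(
  tmp_big,
  options = parquet_options(read_int64_type = "integer64")
))

format(as_double$id, scientific = FALSE)
as.character(as_int64$id)
```

If your INT64 columns hold identifiers or counters that can exceed 2^53, reading them as integer64 is the safer choice.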

`blob::blob`

read_parquet() previously returned raw BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY columns (i.e. those without a string, UUID, or decimal annotation) as plain lists of raw vectors. They are now returned as blob::blob objects, which print more neatly and come with the full set of blob helpers. write_parquet() now also accepts blob::blob columns, so round-tripping binary data is straightforward:

library(blob)
df3 <- data.frame(id = 1:3)
df3$payload <- blob::blob(
  charToRaw("hello"),
  charToRaw("world"),
  charToRaw("!")
)
write_parquet(df3, tmp4 <- tempfile(fileext = ".parquet"))
as.data.frame(read_parquet(tmp4))

  id   payload
1  1 blob[5 B]
2  2 blob[5 B]
3  3 blob[1 B]
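
Each element of a blob column is an ordinary raw vector, so you can work with the payloads using base R. The sketch below, which assumes the blob package is installed, round-trips two payloads and decodes them back to text; df_bin, tmp_bin, back and decoded are illustrative names.

```r
library(nanoparquet)
library(blob)

df_bin <- data.frame(id = 1:2)
df_bin$payload <- blob(charToRaw("hello"), charToRaw("world"))

tmp_bin <- tempfile(fileext = ".parquet")
write_parquet(df_bin, tmp_bin)
back <- as.data.frame(read_parquet(tmp_bin))

# `[[` on a blob returns the raw vector for that row;
# rawToChar() turns it back into a string.
decoded <- vapply(
  seq_along(back$payload),
  function(i) rawToChar(back$payload[[i]]),
  character(1)
)
decoded
```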

nanoparquet as a filter

In Unix, a filter is a program that reads from standard input and writes to standard output, making it a composable building block in shell pipelines. write_parquet() now supports writing to standard output via file = ":stdout:":

write_parquet(mtcars, ":stdout:")

The most common use case is from the command line:

Rscript --quiet -e 'nanoparquet::write_parquet(mtcars, ":stdout:")' > mtcars.parquet

You can build this into a data pipeline. For example, to convert a CSV file to Parquet and hand the result straight to another tool, without an intermediate .parquet file on disk, you can do:

cat data.csv |
  Rscript --quiet -e '
    df <- read.csv(file("stdin"))
    nanoparquet::write_parquet(df, ":stdout:")
  ' | another-parquet-tool

Since nanoparquet 0.4.0, read_parquet() can also read from an R connection, so you can pipe Parquet data in as well:

url <- "https://raw.githubusercontent.com/r-lib/nanoparquet/main/inst/extdata/userdata1.parquet"
con <- pipe(paste("curl --silent", url))
df <- read_parquet(con)