Writing partitioned parquet (part 1)

19 Feb, 2024

One of the most complicated things about data engineering is that there are very few problems with an industry-accepted solution: there’s no “right” way to build out an infrastructure or solve a problem.

Even when there is a consensus, it’s really hard to find out about it (this is a big reason why you should talk to people and go to conferences).

In part, that’s a good thing, because there are few universal solutions. It wouldn't make sense for a 50 person start-up to adopt Uber’s data engineering architecture.

However, there are a number of common data engineering tasks that almost everyone has to do— reading data, transforming it, and writing it. Performance demands can be different, but for the most part we all just want to simplify and optimize.

In this article, I’ll focus on a very specific part of data engineering— writing partitioned datasets in the parquet format. It’s something that’s come up time and time again for me and something you might find value in, regardless of your chosen solution.

The Prompt

Our task will be to read a list of failed banks from the FDIC and write it to a dataset partitioned by state. It’s a pretty small set of data, so this should be a good example.

Our loading script:

import pandas as pd
import os

url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/banklist.csv'

bank_df = pd.read_csv(url, encoding="windows-1251")

bank_df.columns = [c.strip() for c in bank_df.columns]

bank_df.head()

And we want to create an output like:

bank_data/
├── State=AL
│   └── data_0.parquet
├── State=AR
│   └── data_0.parquet
├── State=AZ
│   └── data_0.parquet
...
├── State=WI
│   └── data_0.parquet
├── State=WV
│   └── data_0.parquet
└── State=WY
    └── data_0.parquet

We’ll walk through four options—DuckDB, Pandas, Polars, and PyArrow—covering differences, similarities, and benchmarks.

DuckDB

DuckDB makes this easy— using SQL, we just write a COPY statement to export the data to a folder:

import duckdb

conn = duckdb.connect()

conn.sql("COPY bank_df TO 'bank_data' (FORMAT PARQUET, PARTITION_BY (State), OVERWRITE_OR_IGNORE 1)")

To read the data:

conn.sql("SELECT * FROM read_parquet('bank_data/**/*.parquet')")

Pandas

Under the hood, Pandas is using fastparquet or PyArrow. With PyArrow, we see similar performance characteristics to DuckDB:

bank_df.to_parquet(
    "pandas_bank_data",
    partition_cols=["State"],
)

Polars

Again, we note that, under the hood, Polars is using PyArrow to build our partitioned output. Still, this is a pretty neat one-liner:

import polars as pl

pl_bank_df = pl.from_pandas(bank_df)

pl_bank_df.write_parquet(
    "polars_bank_data",
    use_pyarrow=True,
    pyarrow_options={"partition_cols": ["State"]},
)

PyArrow

While we pick up finer-grained control with PyArrow, this is at the cost of simplicity. We have to first read the data into a PyArrow Table:

import pyarrow as pa

bank_table = pa.Table.from_pandas(bank_df)

We can then write our output as expected:

import pyarrow.parquet as pq

pq.write_to_dataset(bank_table, root_path='pyarrow_bank_data',
                    partition_cols=['State'])

Performance

If we isolate & execute our commands in a jupyter notebook with the %%timeit magic function, we can get execution times and standard deviations for each write.

Takeaways

It shouldn’t come as a huge surprise that Polars, Pandas, and PyArrow are nearly identical— they’re all using the same backend. With overlapping standard deviations, we should consider these results identical. More surprising is the performance of DuckDB.

Of course, we have to consider that almost every library requires conversion before writing.

If you already have a Pandas dataframe, you’ll need to consider the time to import another library, convert, and write the dataframe.

Because Pandas shares the same backend as PyArrow and Polars, you should never convert a Pandas dataframe to PyArrow/Polars just to write it. That’s just adding complexity and additional overhead.

Additionally, this is a really tiny dataset. We shouldn’t assume the same results for production-size workloads. But, we can draw some preliminary conclusions:

With Pandas, if df.parquet() does the trick, stick with it.

If you’re using Polars, stick with Polars.

If you need more granular control, toggling to PyArrow might be useful.

In a follow-up post, we’ll explore if converting your dataframe to DuckDB allows you to write partitioned files faster.

In that example, we’ll use some real-world sized data (think GBs). We’ll also consider implications for writing to cloud storage— for example, is it just as simple to write a PyArrow table as a dataframe?

#data #duckdb #engineering