Writing Partitioned Parquet (Part 1)
The best ways I've found to write clean, partitioned datasets in just a few lines, with Pandas, PyArrow, Polars, and DuckDB.
One of the most complicated things about data engineering is that very few problems have an industry-accepted solution: there's no "right" way to build out an infrastructure or solve a problem.
Even when there is a consensus, it's really hard to find out about it (this is a big reason why you should talk to people and go to conferences).
In part, that's a good thing, because there are few universal solutions. It wouldn't make sense for a 50-person start-up to adopt Uber's data engineering architecture.
However, there are a number of common data engineering tasks that almost everyone has to do: reading data, transforming it, and writing it. Performance demands can be different, but for the most part we all just want to simplify and optimize.
In this article, I'll focus on a very specific part of data engineering: writing partitioned datasets in the Parquet format. It's something that's come up time and time again for me, and something you might find value in, regardless of your chosen solution.
The Prompt
Our task will be to read a list of failed banks from the FDIC and write it to a dataset partitioned by state. It's a pretty small set of data, so this should be a good example.
Our loading script:
import pandas as pd
import os
url = 'https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/banklist.csv'
bank_df = pd.read_csv(url, encoding="windows-1251")
bank_df.columns = [c.strip() for c in bank_df.columns]
bank_df.head()
And we want to create an output like:
bank_data/
├── State=AL
│   └── data_0.parquet
├── State=AR
│   └── data_0.parquet
├── State=AZ
│   └── data_0.parquet
...
├── State=WI
│   └── data_0.parquet
├── State=WV
│   └── data_0.parquet
└── State=WY
    └── data_0.parquet
We'll walk through four options (wow, so many options): DuckDB, Pandas, Polars, and PyArrow, covering differences, similarities, and benchmarks.
DuckDB
DuckDB makes this easy: using SQL, we just write a COPY statement to export the data to a folder:
import duckdb
conn = duckdb.connect()
conn.sql("COPY bank_df TO 'bank_data' (FORMAT PARQUET, PARTITION_BY (State), OVERWRITE_OR_IGNORE 1)")
If properly configured, you can do this just as easily with a destination in your cloud provider of choice. I cover this in my in-depth article on DuckDB here.
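As a minimal sketch of what that could look like for S3 (the bucket name and credentials are placeholders, not from the original article, and DuckDB's httpfs extension handles the remote write):
conn.sql("INSTALL httpfs; LOAD httpfs;")
conn.sql("SET s3_region = 'us-east-1';")
conn.sql("SET s3_access_key_id = 'YOUR_KEY'; SET s3_secret_access_key = 'YOUR_SECRET';")
conn.sql("COPY bank_df TO 's3://your-bucket/bank_data' (FORMAT PARQUET, PARTITION_BY (State))")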
To read the data:
conn.sql("SELECT * FROM read_parquet('bank_data/**/*.parquet')")
Behind the scenes, DuckDB streams the Parquet files with parallel processing and optimizations like automatic filter and projection pushdown.
That means you can query larger-than-memory datasets and get performance gains over single-threaded alternatives. You can read more about those concepts in a 2021 DuckDB blog post here.
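For instance, a filter on the partition column lets DuckDB skip every State directory that doesn't match; the state value below is just an illustration:
conn.sql("""
    SELECT *
    FROM read_parquet('bank_data/**/*.parquet', hive_partitioning = 1)
    WHERE State = 'GA'
""")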
Pandas
Under the hood, Pandas is using fastparquet or PyArrow. With PyArrow, we see similar performance characteristics to DuckDB:
bank_df.to_parquet(
"pandas_bank_data",
partition_cols=["State"],
)
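If you want to be explicit about which of those backends does the writing, to_parquet takes an engine argument; a small variant of the call above:
bank_df.to_parquet(
    "pandas_bank_data",
    engine="pyarrow",  # or "fastparquet"
    partition_cols=["State"],
)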
As of Pandas 2.2.0, there isn't a way to scan a directory of parquet files using pattern matching, so you'll need DuckDB/Polars or a for-loop to do this easily.
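For reference, here's one hedged sketch of that for-loop approach: glob the Hive-style directories and re-attach the State value from each path (Pandas leaves the partition column out of the files themselves, so we rebuild it from the directory names):
from pathlib import Path

frames = []
for path in Path("pandas_bank_data").glob("State=*/*.parquet"):
    part = pd.read_parquet(path)
    # The partition value only lives in the directory name, e.g. State=GA
    part["State"] = path.parent.name.split("=", 1)[1]
    frames.append(part)
bank_df_roundtrip = pd.concat(frames, ignore_index=True)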
Polars
Again, we note that, under the hood, Polars is using PyArrow to build our partitioned output. Still, this is a pretty neat one-liner:
import polars as pl
pl_bank_df = pl.from_pandas(bank_df)
pl_bank_df.write_parquet(
"polars_bank_data",
use_pyarrow=True,
pyarrow_options={"partition_cols": ["State"]},
)
Again, this is interoperable with S3, GCP, etc. Reading our dataset:
pl.scan_parquet("polars_bank_data/**/*.parquet").collect()
Simple and easy! Good news for Polars early adopters.
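Because scan_parquet is lazy, you can also push a filter on the partition column down before collecting. A minimal sketch, assuming your Polars version infers the Hive-style State directory as a column (hive_partitioning is passed explicitly here, and the state value is arbitrary); depending on the version, you may need to point scan_parquet at the directory instead of a glob:
(
    pl.scan_parquet("polars_bank_data/**/*.parquet", hive_partitioning=True)
    .filter(pl.col("State") == "GA")
    .collect()
)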
PyArrow
OK, so we've noted that Pandas and Polars both use a PyArrow backend for this operation.
While we pick up finer-grained control with PyArrow, it comes at the cost of simplicity. The PyArrow docs can be complex, and we first have to read the data into a PyArrow Table:
import pyarrow as pa
bank_table = pa.Table.from_pandas(bank_df)
We can then write our output as expected:
import pyarrow.parquet as pq
pq.write_to_dataset(
    bank_table,
    root_path='pyarrow_bank_data',
    partition_cols=['State'],
)
And read:
table = pq.read_table('pyarrow_bank_data')
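That finer-grained control pays off on the read side too: read_table accepts a filters argument, so only the matching partition directories get scanned (the state value is just an example):
ga_table = pq.read_table('pyarrow_bank_data', filters=[('State', '=', 'GA')])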
Performance
If we isolate & execute our commands in a Jupyter notebook with the %%timeit magic function, we can get execution times and standard deviations for each write.
You can find the complete notebook here.
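As a rough illustration of the setup (the actual cells are in the linked notebook), each write goes in its own cell, for example:
%%timeit
bank_df.to_parquet(
    "pandas_bank_data",
    partition_cols=["State"],
)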
Takeaways
It shouldn't come as a huge surprise that Polars, Pandas, and PyArrow are nearly identical: they're all using the same backend. With overlapping standard deviations, we should consider these results identical. More surprising is the performance of DuckDB.
Of course, we have to consider that almost every library requires conversion before writing.
If you already have a Pandas dataframe, you'll need to consider the time to import another library, convert, and write the dataframe.
Because Pandas shares the same backend as PyArrow and Polars, you should never convert a Pandas dataframe to PyArrow/Polars just to write it. That's just adding complexity and additional overhead.
Additionally, this is a really tiny dataset. We shouldn't assume the same results for production-size workloads. But, we can draw some preliminary conclusions:
With Pandas, if df.to_parquet() does the trick, stick with it.
If you're using Polars, stick with Polars.
If you need more granular control, toggling to PyArrow might be useful.
In a follow-up post, we'll explore if converting your dataframe to DuckDB allows you to write partitioned files faster.
In that example, we'll use some real-world-sized data (think GBs). We'll also consider implications for writing to cloud storage: for example, is it just as simple to write a PyArrow table as a dataframe?