I actually worked on what is below in June of last year. Chris suggested I do more extensive timings and give more results on the timings. But I am old and slow and already have too much to do, so I thought I would just post this as is. Parquet files are becoming very popular, and if, for example, you have a very static database or are serving .csv files, you might look at using parquet files instead, at least when used with ERDDAP™.
-Roy
Modern parquet libraries allow for the creation of partitioned parquet “files,” which can be useful for speeding up access to very large parquet datasets. Here I look at how it is done, what the dataset structure looks like, and what speed boost, if any, I can expect in ERDDAP.
I am starting with the detects.csv dataset I used previously. It is a 305MB csv file with these columns, as DuckDB sees them:
Study_ID TEXT,
TagCode TEXT,
DateTime_PST TIMESTAMP,
recv BIGINT,
location TEXT,
general_location TEXT,
latitude DOUBLE,
longitude DOUBLE,
rkm DOUBLE,
RelDT TIMESTAMP,
Weight DOUBLE,
Length DOUBLE,
Rel_rkm DOUBLE,
Rel_loc TEXT,
tag_life BIGINT,
Rel_latitude DOUBLE,
Rel_longitude DOUBLE,
time TEXT
In choosing how to partition, you want to partition on one of the column values (or something that can be derived from a value, such as deriving the year from RelDT), considering:
1. The cardinality of the unique values.
2. A relatively even distribution of the data, so that the file sizes are similar.
3. File sizes that are neither too large nor too small - tiny files mean a request may have to open a lot of files, which is slow because it adds a lot of overhead, while very large files can be slow to process.
I created an unpartitioned parquet file, as well as two partitioned datasets, one partitioned by year and one partitioned by study_id. I partitioned the dataset using the “brute-force” method to make certain that I understood the required structure and that there weren’t any gotchas, since there did not seem to be any need for a manifest or a json file or the like. The “brute-force” method proceeds by:
1. Reading the data into a dataframe
2. Setting up an enclosing directory and one subdirectory for each element in the partition.
3. Writing the subsetted data to the appropriate sub-directory, which might involve code such as:
import os
for study_id, group in df.groupby('Study_ID'):
    os.makedirs(f"study_id={study_id}", exist_ok=True)
    group.to_parquet(f"study_id={study_id}/data.parquet")
That is it, you subset the dataframe and write it to a parquet file in the appropriate sub-directory.
The partial structure of the one partitioned by study_id is:
├── partitioned_by_study_id
│ ├── study_id=ButteSink_2024
│ │ └── data.parquet
│ ├── study_id=FR_Spring_2024
│ │ └── data.parquet
│ ├── study_id=Georg_Sl_Barrier_2024_Apr
│ │ └── data.parquet
│ ├── study_id=Georg_Sl_Barrier_2024_Dec
│ │ └── data.parquet
│ ├── study_id=Georg_Sl_Barrier_2024_Feb
│ │ └── data.parquet
│ ├── study_id=Georg_Sl_Barrier_2024_Jan
│ │ └── data.parquet
│ ├── study_id=LAR_NIM_FRCS_2024
│ │ └── data.parquet
We can check that modern parquet libraries can read this in one swoop by trying to open the top-level directory in pandas and seeing if all of the data are there:
>>> import pandas as pd
>>> study_df = pd.read_parquet('partitioned_by_study_id')
>>> study_df.shape
(1495649, 19)
>>>
That is the expected result. However, it is not necessary to use the “brute-force” approach to create the partitioned datasets. In pandas you can do:
df.to_parquet(
    'partitioned_by_study_id',
    partition_cols=['Study_ID']
)
and in pyarrow you can do:
pq.write_to_dataset(
    table,
    root_path='partitioned_output',
    partition_cols=['Study_ID']  # tell it what to partition on
)
Partitioned parquet files can also have hierarchical partitioning (as does zarr format v3), say by study_id and then year. Depending on how users access the data, this could speed things up even more. The downside is that it can create a lot of small parquet files, and typical requests may require opening a lot of files, which could offset the benefit. So good partitioning requires a lot of thought about typical data access as well as file size.
To access the partitioned dataset in ERDDAP, just point to the base directory of the partition and tell it:
<fileNameRegex>.*\.parquet</fileNameRegex>
<recursive>true</recursive>
and the dataset loads in ERDDAP as expected.
One thing to notice is that the partitioned parquet dataset has no extra files - no json or xml or Avro files about the dataset as in iceberg or zarr. So for example when I do the command:
study_df = pd.read_parquet('partitioned_by_study_id')
the library scans through the directory and subdirectories for all parquet files and loads metadata about the files. This of course has a cost.
Given how the parquet library works on partitioned datasets, comparing the speed of the single-file parquet dataset versus the partitioned parquet dataset reveals a lot about speed tradeoffs, dataset design, and how different use cases can produce differing results. If I make a request that involves only one specific Study_ID, that theoretically should be faster than using the single-file dataset, because the file to be read is much smaller, but there is a cost to learning the file structure and finding which file(s) to open. So the single-file dataset can be faster if the request is small enough, particularly if I time not just the subsetting command but the entire process of opening and subsetting the file(s). This can be offset if I am in Python or another programming language and, once the dataset is open, make multiple requests to the file(s), because the metadata is read only once. A lesson here is that the access pattern matters, and access using ERDDAP, say, differs from access when doing an analysis using the data.
And in fact, this is exactly what I found in my tests. Even in Python, if I make only one small, simple subset, the single-file dataset can be as fast or faster, but if I make a large, complex subset, or if I make multiple requests, then the partitioned dataset is faster (as long, of course, as the request does not cut across all the files - such as all the observations during a time period. It is the exact same problem as with chunks in netcdf4/hdf5/zarr: chunking for one request pattern can be faster but slower for another).
ERDDAP of course does not keep the file(s) open after a subsetting operation is done, but it does store some information about the parquet files on initial dataset load. Despite this, my tests revealed the same pattern - for small, simple requests the single file can be as fast if not faster; for larger, more complex requests the partitioned dataset is faster, though the single-file dataset still showed good performance.
How might you make better use of the partitioning and/or chunking? If I use duckDB, for example, to access the partitioned parquet dataset, similar to what I previously showed for .csv and iceberg files, and save it to a .duckDB file, all the metadata is stored in the duckDB file. So if I open the partitioned dataset through duckDB rather than directly, it reads the metadata from the single, quite fast duckDB file rather than opening all the parquet files (and though ERDDAP also stores some of the metadata, it is not as extensive). Zarr v2 does this by writing text files that contain all the metadata about the chunks of a dataset, so initially only those files need to be read. Iceberg takes it even further, writing several layers of Avro and json files, but to be fair iceberg is also doing versioning and time travel, so it is a more complicated format.
What is interesting about iceberg is that users have found that for large datasets having all those text files is both slow and fragile, so the solution is to have a persistent database, say postgresql, keeping track of all the catalog information. The initial information gets read only once, and databases are good at tracking transactions. And then there is duckLake (https://duckdb.org/2025/05/27/ducklake.html), which does away with the text files completely, keeps the data in parquet files or related formats, and has a database that keeps track of everything the Avro and json files do in iceberg. While ERDDAP does not store as much information as iceberg or duckLake does, the model is the same - a persistent application reads the information needed for fast access at startup and then keeps track of updates. Of course ERDDAP also serves the data and a whole lot more, while iceberg is more of a format specification.
How does partitioning translate when used by ERDDAP? In my not very extensive tests, the results are consistent with the discussion above. Given a request that can take advantage of the partitioning, say a request that uses only one Study_ID, for a small request the unpartitioned file can be faster, but as the request gets larger the partitioned dataset becomes faster, mainly because you are accessing a much smaller file. For a request that cuts across the partitions, say one using many Study_IDs, the results can really vary, depending on the size and nature of the request, with the unpartitioned file often being faster. This should remind you of something else I have discussed recently: chunking and chunk size. Partitioning essentially breaks the parquet file into chunks, and one would expect that the benefits and problems will be similar. Like chunking in netcdf4/zarr files, there is no one best way to chunk, or in this case partition, the dataset. A partitioning that benefits one use of the data will slow down another use. For example, if I partition by study_id and most users select data by year or year/month, then I have likely slowed down access, not sped it up. So a good partitioning should be based on how you expect most of your users to access the data.
So the overall lessons seem to be that partitioning/chunking is most beneficial when doing applications such as data analysis but may be less so when doing one-time extracts such as in ERDDAP, depending on the typical request to ERDDAP for that dataset; and that if you do partition/chunk, having a persistent application that keeps track of various metadata, access information, and changes can speed up access. I will be curious to see whether, as the icechunk format develops, it eventually also moves to a persistent application to store the needed information, rather than relying on a potentially large number of text files.
And yes, feel free to disagree with or add to anything here. The object of this post, previous posts, and ones to come is to expand the dataset formats that ERDDAP can access without having to modify ERDDAP, and to use these options as efficiently as possible. And also to show that ERDDAP can work with many of the more “modern” formats as well as with data lakes.
-Roy