I think the HPC use case heavily depends on the type of filesystem employed. For example, some filesystems such as Lustre are known to handle many small files poorly, because every file access has to go through metadata-server lookups (https://www.weka.io/learn/glossary/file-storage/lustre-file-system-explained/#The-Lustre-Infrastructure). Object storage, on the other hand, thrives on many small parallel requests.
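To make the contrast concrete, here is a minimal sketch of what "many small parallel requests" looks like against object storage. The bucket name and chunk keys are hypothetical placeholders; the point is that fsspec's async S3 backend can batch the GETs concurrently, whereas a POSIX filesystem pays a metadata lookup for each open.

```python
# Minimal sketch: issue many small reads against object storage in parallel.
# The bucket name and chunk keys below are hypothetical placeholders.
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # assumes a public-read bucket

# With Zarr, each chunk is stored as its own object, so reads like these
# map naturally onto independent HTTP GET requests.
keys = [f"example-bucket/data.zarr/temperature/0.0.{i}" for i in range(64)]

# Passing a list to cat() lets the async backend fetch all objects
# concurrently instead of serializing one lookup per file.
chunks = fs.cat(keys, on_error="omit")
print(f"retrieved {len(chunks)} chunk objects")
```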
The other architectural reality to consider is the cost of transferring data both to and from the HPC system: how quickly can the initialization data be retrieved, and how long does it take to make the output data ready for analysis? Cloud-centric workflows eliminate that data movement problem. The right choice heavily depends on the problem being solved, of course.
The work toward Zarr V3 (https://zarr.dev/blog/zarr-python-v3-update/) is going to unlock even more performance benefits and optimizations. I'm curious whether Unidata will update NCZarr to be more compatible with Zarr V3 in the future. Today we are still fairly constrained by the file format choices of the past, but I think we will eventually have even more effective ways of accessing the data through large-scale virtual indexes.
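As one concrete direction for those virtual indexes, here is a hedged sketch of opening a dataset through a kerchunk-style reference file. The "combined.json" index and the S3 location are hypothetical; the idea is that the index maps Zarr keys onto byte ranges inside existing files, so the data can be accessed cloud-natively without rewriting it.

```python
# Sketch: read data through a virtual index (kerchunk-style reference file).
# "combined.json" is a hypothetical index mapping Zarr chunk keys to byte
# ranges inside existing netCDF/HDF5 files that live in object storage.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",          # the virtual index itself
            "remote_protocol": "s3",        # where the referenced bytes live
            "remote_options": {"anon": True},
        },
    },
)
print(ds)
```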
-Jonathan