netCDF and Zarr: a Unidata perspective


Tiffany C. Vance

Oct 1, 2024, 11:13:42 AM
to <ioos_tech@googlegroups.com>

--
Tiffany C. Vance, Ph.D.  
Ocean Technology Transition (OTT) Program Manager                                
US IOOS Program  Seattle, WA

"There are no boring landscapes, only landscapes we haven't learned to see." - Paul Groth

"To be what you are is one thing, to be what you want -- now that's something else."
Ferron

"Got to kick at the darkness 'til it bleeds daylight"
- Bruce Cockburn

Richard Signell

Oct 1, 2024, 11:39:44 AM
to ioos...@googlegroups.com
I read this blog post. It says "having a large amount of files might degrade performance", but I didn't see where the post actually demonstrates degraded performance. Did I miss it?

Rich Signell
180 County Rd
Bourne, MA 02532


Jonathan Joyce

Oct 2, 2024, 12:44:15 PM
to ioos_tech
I think the HPC use case depends heavily on the type of filesystem employed. For example, some filesystems, such as Lustre, are known to handle many small files poorly because file access relies on metadata lookups (https://www.weka.io/learn/glossary/file-storage/lustre-file-system-explained/#The-Lustre-Infrastructure). Object storage, on the other hand, thrives on many small parallel requests.
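
To put rough numbers on the "many small files" point, here is a minimal sketch that counts how many chunks a chunked array splits into. It assumes an unsharded Zarr-style layout in which each chunk is stored as at most one file or object; the array shape, the chunk shapes, and the `chunk_count` helper are all hypothetical and chosen only for illustration.

```python
import math


def chunk_count(shape, chunks):
    """Return the number of chunks needed to tile an array of `shape`.

    Assumption: an unsharded, Zarr-style layout where each chunk maps to
    at most one file (filesystem) or one object (object storage).
    """
    # Partial chunks at the array edge still count as a whole chunk.
    return math.prod(math.ceil(n / c) for n, c in zip(shape, chunks))


# Hypothetical hourly model output for one year: (time, depth, y, x)
shape = (8760, 40, 500, 500)

# One time step per chunk, small spatial tiles
small_chunks = chunk_count(shape, (1, 40, 100, 100))

# One day per chunk, larger spatial tiles
large_chunks = chunk_count(shape, (24, 10, 250, 250))

print(f"small chunks: {small_chunks} files/objects")
print(f"large chunks: {large_chunks} files/objects")
```

The same array can end up as a few thousand or a few hundred thousand files depending only on the chunk shape, which is why the filesystem's metadata behavior matters so much here.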

The other architectural reality to consider is the cost of transferring data both to and from the HPC system: how quickly can the initialization data be retrieved, and how long does it take to make the output data ready for analysis? Cloud-centric workflows eliminate the data movement problem. Of course, the right choice depends heavily on the problem being solved.

The work toward Zarr V3 (https://zarr.dev/blog/zarr-python-v3-update/) is going to unlock even more performance benefits and optimizations. I'm curious whether Unidata is going to update NCZarr to be more compatible with Zarr V3 in the future. Today, we are still pretty constrained by the file format choices of the past, but I think in the future we will have even more effective ways of accessing the data through large-scale virtual indexes.

-Jonathan
