We have been busy working on our little dataset library, and I wanted to talk about some of the upgrades that I think are important and interesting.
We now have first-class support for Apache Arrow, which means I took the time to actually understand the binary on-disk format, byte by byte. I also found a memory mapping library that I think is great: larray.
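To give a feel for what "byte-by-byte" means here, below is a minimal sketch in Python using only the standard library. It is not the library's code and it does not use larray. It memory maps a file, checks the `ARROW1` magic bytes that open and close an Arrow IPC file, and reads the little-endian int32 footer length stored just before the trailing magic. The function name `arrow_footer_span` is made up for illustration, and the footer itself (a flatbuffer) is not decoded.

```python
import mmap
import struct

ARROW_MAGIC = b"ARROW1"  # 6-byte magic at both ends of an Arrow IPC file


def arrow_footer_span(path):
    """Return (footer_offset, footer_length) for an Arrow IPC file.

    Hypothetical helper for illustration: it only inspects the framing
    bytes and never decodes the footer or the record batches.
    """
    with open(path, "rb") as f:
        # Map the file read-only; slices are paged in by the OS on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            size = len(buf)
            # Leading magic padded to 8 bytes, then int32 size + trailing magic.
            if size < 8 + 4 + 6:
                raise ValueError("file too small to be an Arrow file")
            if buf[:6] != ARROW_MAGIC or buf[size - 6:size] != ARROW_MAGIC:
                raise ValueError("missing ARROW1 magic bytes")
            # The footer length sits in the 4 bytes before the trailing magic.
            (footer_len,) = struct.unpack("<i", buf[size - 10:size - 6])
            footer_off = size - 10 - footer_len
            if footer_len < 0 or footer_off < 8:
                raise ValueError("corrupt footer length")
            return footer_off, footer_len
```

Everything between the leading 8 bytes and the footer is the stream of messages (schema, then record batches), which is the part a real reader would walk through.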
Arrow files are really sequences of datasets, and we have a brand new namespace, which will grow over time, devoted to really large (multiple GB, out-of-memory) reductions over sequences of datasets. Its performance characteristics are competitive with anything out there.
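The general idea of reducing over a sequence of datasets can be sketched in a few lines of Python. This is only an illustration of the technique, not the new namespace's API: `batches` and `group_mean` are hypothetical names, and the in-memory dicts stand in for record batches that would really be read lazily from a file. Only the small per-group accumulator stays in memory, never the whole dataset.

```python
from collections import defaultdict


def batches():
    """Hypothetical stand-in for record batches read lazily from a file."""
    yield {"symbol": ["A", "B", "A"], "price": [1.0, 2.0, 3.0]}
    yield {"symbol": ["B", "A"], "price": [4.0, 5.0]}


def group_mean(batch_seq, key_col, val_col):
    """Mean of val_col per key_col value, visiting one batch at a time."""
    acc = defaultdict(lambda: [0.0, 0])  # key -> [running sum, running count]
    for batch in batch_seq:
        for key, value in zip(batch[key_col], batch[val_col]):
            slot = acc[key]
            slot[0] += value
            slot[1] += 1
    # Finalize only after every batch has been seen.
    return {key: total / count for key, (total, count) in acc.items()}


result = group_mean(batches(), "symbol", "price")
```

Because each batch is dropped as soon as it has been folded into the accumulator, the same loop works whether the sequence holds two batches or two thousand.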
Lastly, we have a (great!) blog post exploring memory mapping, Apache Arrow, and the tech.datatype bindings to larray. It specifically highlights how nice Clojure is for exploring binary file formats. Doing this brought back memories of doing similar things in C++ and wow, with Clojure I get the same performance and I can actually see what I am working with! Much appreciation to Alex, Rich, and the team!
https://techascent.com/blog/memory-mapping-arrow.html
Enjoy!
Chris