ANN: RawArray.jl


David Smith

Sep 25, 2016, 12:26:12 PM
to julia-users
Hi, all:

I finally pushed this out, and it might satisfy some of your needs for a simple way to store N-d arrays to disk. Hope you enjoy it.

RawArray (.ra) is a simple file format for storing n-dimensional arrays. RawArray was designed to be portable, fast, storage efficient, and future proof. Basically it writes the binary array data directly to disk with a short header that is used to recreate type and dimension information. 

RawArray is faster than HDF5 and supports complex numbers out of the box, which HDF5 does not. RawArray supports all basic `Int`, `UInt`, `Float`, and `Complex{}` types, and more can be easily added in the future, such as Rational or Big*. It can also handle derived types, but the serialization of them is currently left up to the user.

A system of version numbers and flags is implemented to future-proof the data files as well, in case the implementation needs to change for some reason.
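To make the "short header plus raw data" idea concrete, here is a minimal sketch in Julia 1.x syntax. It is not the actual .ra layout: the magic number, the type codes, and the names `write_raw` and `read_raw` are all made up for illustration, and the bytes are written in the machine's native byte order.

```julia
# Hypothetical layout for illustration only; this is NOT the real .ra header.
const MAGIC = 0x7961727261776172      # arbitrary 8-byte tag (assumption)

# Made-up integer codes for a few element types (assumption).
const TYPE_CODES = Dict(Int32 => 1, Int64 => 2, Float32 => 3, Float64 => 4,
                        ComplexF32 => 5, ComplexF64 => 6)
const CODE_TYPES = Dict(v => k for (k, v) in TYPE_CODES)

function write_raw(path::AbstractString, A::Array{T,N}) where {T,N}
    open(path, "w") do io
        write(io, MAGIC)                   # format tag
        write(io, UInt64(1))               # format version
        write(io, UInt64(TYPE_CODES[T]))   # element type code
        write(io, UInt64(N))               # number of dimensions
        for d in size(A)
            write(io, UInt64(d))           # length of each dimension
        end
        write(io, A)                       # raw array bytes, column-major
    end
    return path
end

function read_raw(path::AbstractString)
    open(path, "r") do io
        read(io, UInt64) == MAGIC || error("not a file written by write_raw")
        version = read(io, UInt64)
        version == 1 || error("unsupported version $version")
        T = CODE_TYPES[Int(read(io, UInt64))]
        N = Int(read(io, UInt64))
        dims = [Int(read(io, UInt64)) for _ in 1:N]
        A = Array{T}(undef, dims...)
        read!(io, A)                       # fill A straight from the file
        return A
    end
end
```

The version field is what lets a reader reject, or special-case, files written under a later revision of the layout.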

You can grab it with `Pkg.add("RawArray")`. A minimum of Julia 0.4 is required.


Cheers,
Dave

Isaiah Norton

Sep 25, 2016, 10:59:45 PM
to julia...@googlegroups.com
Is there a reason to use this file format over NRRD [1]? To borrow a wise phrasing: I wonder if the world needs another lightweight raw data format ;)

For what it's worth, NRRD is already supported by JuliaIO/Images.jl, and I believe addresses the use-cases identified in your readme, but with a number of technical and non-technical advantages (not least: a number of independent implementations, and a substantial user base, at least as far as these things go).

I say this -- very selfishly I admit -- as someone who has been on the receiving end of far too many files in home-brewed formats.

David Smith

Sep 26, 2016, 9:59:15 AM
to julia-users
Hi, Isaiah. This is a valid question.

0. As a preface, I'd like to say I'm not trying to replace anything. I wrote RawArray to solve a problem we have in magnetic resonance imaging (quickly saving and loading large complex float arrays), and then I decided to share it so if other people like it and find it useful, then cool beans.

Now for the mild stumping...

1. I don't think NRRD is as substantially used as you might think. I've worked in imaging science for years on the data processing/file format end, and I've never seen anyone use it, and I've never even heard of it.  (Pity, because it looks nice enough. :-\)

2. RawArray is simpler to handle and trivial to understand. I believe all you need from an I/O library is I/O.* I don't want my file I/O library performing transformations on my data. 

I also don't need it to read image formats. Part of the reason behind RawArray is to avoid standard image formats because they are not optimized for large complex-float arrays. I just want to save multi-GB data arrays to disk quickly and read them back quickly on a different machine, five years later. 

I have other implementations (https://github.com/davidssmith/ra), and all are super short and platform agnostic.

3. RawArray is surely faster. All it does is read. It doesn't perform any transformations or encoding, so it can't possibly be slower than NRRD. There is a C library at (https://github.com/davidssmith/ra) if you think a pure Julia implementation isn't fast enough. 

Cheers,
Dave

[*] That said, I'm not completely ruling out having transformations available in RawArray between the RAM and disk. For example, when I first wrote it, I had included Blosc compression as an option, signaled by a flag in the header. But in general most transformations are best made in RAM after reading or on disk with already existing, battle-proven tools, such as gzip, uuencode, tar, etc.

David Smith

Sep 26, 2016, 10:03:20 AM
to julia-users
Sorry I forgot to add: 

JuliaIO/Images.jl relies on having ImageMagick installed, whereas RawArray.jl is a pure Julia solution without any dependencies. 

Isaiah Norton

Sep 26, 2016, 3:17:59 PM
to julia...@googlegroups.com
Thanks for the response.

David Smith wrote:
1. I don't think NRRD is as substantially used as you might think. I've worked in imaging science for years on the data processing/file format end, and I've never seen anyone use it, and I've never even heard of it.  (Pity, because it looks nice enough. :-\)

I think I can make a solid argument that the install-base of software supporting NRRD is on the order of 30-50k -- probably larger than anything but NIfTI (which is substantial, to me, as far as obscure, non-DICOM medical imaging formats go...)

But that's neither here nor there. I'm not going to change your mind, but perhaps someone else will see this thread and think twice before creating `RawArraysWithSlightlyMoreMetadata.jl`.

David Smith wrote:
JuliaIO/Images.jl relies on having ImageMagick installed, whereas RawArray.jl is a pure Julia solution without any dependencies.

Indeed, that was a major pain point, especially for Windows users. However, Images.jl has been modularized, and no longer requires ImageMagick. NRRD.jl does have more dependencies (~10) than might be expected, in order to support Images inter-op, but from what I can tell they are all pure-Julia except for Rmath (the binary portions of which are distributed as part of base Julia).

I'll take my other, minor, comments off-list since this curmudgeonly pet peeve probably isn't of much general interest!

Steven G. Johnson

Oct 1, 2016, 2:48:37 PM
to julia-users


On Monday, September 26, 2016 at 9:59:15 AM UTC-4, David Smith wrote:
I also don't need it to read image formats. Part of the reason behind RawArray is to avoid standard image formats because they are not optimized for large complex-float arrays. I just want to save multi-GB data arrays to disk quickly and read them back quickly on a different machine, five years later. 

 Aside from a small ASCII header, it looks (from the specs) like NRRD can save a multidimensional complex floating-point array as just the raw data, i.e. a single "write" call.  So I'm not sure what you mean by "not optimized".

As for being able to read something 5 years later, using a pre-existing format with some kind of userbase seems to improve the odds of that.

On Monday, September 26, 2016 at 10:03:20 AM UTC-4, David Smith wrote:
Sorry I forgot to add: 

JuliaIO/Images.jl relies on having ImageMagick installed, whereas RawArray.jl is a pure Julia solution without any dependencies. 

The NRRD spec is not that complicated at first glance; it looks like it wouldn't be too hard to write a pure-Julia implementation of it.   If you only want to support the subset of NRRD's functionality provided by RawArray, the implementation effort wouldn't be much harder than RawArray.
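As a rough illustration of how small that subset could be, here is a sketch of a writer for raw-encoded NRRD only, in Julia 1.x syntax. The function name `write_nrrd_raw` is hypothetical, and the header fields follow my reading of the NRRD spec, so check them against the spec before relying on this. It also assumes that complex data are stored as `float` with a leading axis of length 2 holding the real and imaginary parts.

```julia
# Minimal sketch: writes only the raw-encoded subset of NRRD.
# Assumption: complex values are stored as Float32 pairs on a leading axis of length 2.
function write_nrrd_raw(path::AbstractString, A::Array{ComplexF32})
    sizes = (2, size(A)...)                  # (re, im) becomes the fastest axis
    endian = ENDIAN_BOM == 0x04030201 ? "little" : "big"
    open(path, "w") do io
        print(io, "NRRD0004\n")              # magic line with format version
        print(io, "type: float\n")
        print(io, "dimension: ", length(sizes), "\n")
        print(io, "sizes: ", join(sizes, " "), "\n")
        print(io, "endian: ", endian, "\n")
        print(io, "encoding: raw\n")
        print(io, "\n")                      # blank line ends the header
        write(io, A)                         # the single raw write
    end
    return path
end
```

Because `ComplexF32` is laid out in memory as consecutive (real, imaginary) `Float32` pairs, writing `A` directly yields the same bytes as writing the reinterpreted real array.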

Tobias Knopp

Oct 1, 2016, 4:43:35 PM
to julia-users

Páll Haraldsson

Oct 8, 2016, 8:04:35 PM
to julia-users


On Monday, September 26, 2016 at 1:59:15 PM UTC, David Smith wrote:
Hi, Isaiah. This is a valid question.

0. As a preface, I'd like to say I'm not trying to replace anything. I wrote RawArray to solve a problem we have in magnetic resonance imaging (quickly saving and loading large complex float arrays), and then I decided to share it so if other people like it and find it useful, then cool beans.

Now for the mild stumping...

1. I don't think NRRD is as substantially used as you might think. I've worked in imaging science for years on the data processing/file format end, and I've never seen anyone use it, and I've never even heard of it.  (Pity, because it looks nice enough. :-\)

2. RawArray is simpler to handle and trivial to understand. I believe all you need from an I/O library is I/O.* I don't want my file I/O library performing transformations on my data. 

I also don't need it to read image formats. Part of the reason behind RawArray is to avoid standard image formats because they are not optimized for large complex-float arrays. I just want to save multi-GB data arrays to disk quickly and read them back quickly on a different machine, five years later. 

I have other implementations (https://github.com/davidssmith/ra), and all are super short and platform agnostic.

3. RawArray is surely faster. All it does is read. It doesn't perform any transformations or encoding, so it can't possibly be slower than NRRD.

Maybe not compared to NRRD, but it can be slower than lossless image compression.

I did read (short.. good):
https://github.com/davidssmith/ra/blob/master/doc/ra-sedona-abstract.pdf

https://en.wikipedia.org/wiki/Free_Lossless_Image_Format

FLIF is not a replacement for all uses (it is not multidimensional; it would be interesting to know whether it could be extended to that), but it seems to be the best option for non-lossy image compression:

http://flif.info/index.html
"
    53% smaller than lossless JPEG 2000 compression,
    74% smaller than lossless JPEG XR compression.

Even if the best image format was picked out of PNG, JPEG 2000, WebP or BPG for a given image corpus, depending on the type of images (photograph, line art, 8 bit or higher bit depth, etc), then FLIF still beats that by 12% on a median corpus
[..]
    FLIF does away with knowing what image format performs the best at any given task.
[..]
Other lossless formats also support progressive decoding (e.g. PNG with Adam7 interlacing), but FLIF is better at it. Here is a simple demonstration video, which shows an image as it is slowly being downloaded:
[..]
No patents, Free

    Unlike some other image formats (e.g. BPG and JPEG 2000), FLIF is completely royalty-free and it is not known to be encumbered by software patents. At least as far as we know. FLIF is uses arithmetic coding, just like FFV1 (which inspired FLIF), but as far as we know, all patents related to arithmetic coding are expired. Other than that, we do not think FLIF uses any techniques on which patents are claimed. However, we are not lawyers. There are a stunning number of software patents, some of which are very broad and vague; it is impossible to read them all, let alone guarantee that nobody will ever claim part of FLIF to be covered by some patent. All we know is that we did not knowingly use any technique which is (still) patented, and we did not patent FLIF ourselves either.

    The reference implementation of FLIF is Free Software. It is released under the terms of the GNU Lesser General Public License (LGPL), version 3 or any later version.
[..]
    The reference FLIF decoder is also available as a shared library, released under the more permissive (non-copyleft) terms of the Apache 2.0 license. Public domain example code is available to illustrate how to use the decoder library.

    Moreover, the reference implementation is available free of charge (gratis) under these terms.
[..]
FLIF currently has the following features:

    Lossless compression
    Lossy compression (encoder preprocessing option, format itself is lossless so no generation loss)
    Greyscale, RGB, RGBA (also palette and color-bucket modes)
    Color depth: up to 16 bits per channel (high bit depth)"

--
Palli.

Páll Haraldsson

Oct 13, 2016, 9:05:29 AM
to julia-users

[Explaining more, and correcting typo..]

On Sunday, October 9, 2016 at 12:04:35 AM UTC, Páll Haraldsson wrote:
FLIF is not a replacement for all uses (it is not multidimensional; it would be interesting to know whether it could be extended to that), but it seems to be the best option for non-lossy image compression:

Of course, a three-dimensional array, say, could trivially be stored as concatenated FLIF files/bytestreams of 2D slices.

But what I really meant was: could there be a way to compress the whole array/"image" at once, similar to how wavelets (and the FFT) can be generalized to more dimensions?

Tim Holy

Oct 13, 2016, 2:00:30 PM
to julia...@googlegroups.com
A couple of points:
- As mentioned by others, NRRD doesn't have performance issues (except likely for very small arrays, due to the overhead of parsing). If you just want a raw dump of memory, you can get that, and if it's big it uses `Mmap.mmap` when it reads the data back in. So you can read terabyte-sized arrays.
- in my opinion, NRRD does have a few deficiencies (https://sourceforge.net/p/teem/bugs/14/), of which the most serious is probably the lack of a real test suite. (I'd guess this is probably the main reason it hasn't taken over the world when it comes to storing images.) But I don't think it would affect how you would use it for storing `Complex{Float32}`s or whatever.
- the `images-next` branch is the current state-of-the-art in julia's support for NRRD: it's almost a complete rewrite from previous versions, as part of https://github.com/timholy/Images.jl/issues/542. Aside from being better-designed and having better support for the standard, it no longer does any reinterpretation of data unless the metadata are such that it is certain this is what should happen.
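
The memory-mapped read mentioned in the first point can be sketched with the standard library alone. This is not how the NRRD reader is implemented, just a headerless raw dump in Julia 1.x syntax, with arbitrary array dimensions:

```julia
using Mmap

dims = (1000, 200)                 # arbitrary size for the demonstration
A = rand(Float64, dims)
path = tempname()
write(path, A)                     # raw bytes only; no header

# Map the file instead of reading it; pages are loaded lazily on access.
B = open(path, "r") do io
    Mmap.mmap(io, Matrix{Float64}, dims)
end
```

Since the file is opened read-only here, `B` should be treated as read-only as well. For a file that starts with a header, the optional fourth argument of `Mmap.mmap` gives the byte offset at which the array data begin.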

That said, I'm not evangelizing its use, just wanting to make sure you understand what NRRD is and isn't.

Best,
--Tim

Páll Haraldsson

Oct 14, 2016, 12:51:14 PM
to julia-users
On Thursday, October 13, 2016 at 6:00:30 PM UTC, Tim Holy wrote:
If you just want a raw dump of memory, you can get that, and if it's big it uses `Mmap.mmap` when it reads the data back in. So you can read terabyte-sized arrays.

[I'm not clear on mmap: is it just a possibility, or more or less a requirement when arrays are this big?]

Good to know; I have another thread on array sizes (and the 2 GB limit).

Do you mean that one could read terabyte-sized arrays, not that it's common, or that you know of cases with 1D arrays?

[Would those be arrays of big structs? With fewer than 2G elements, would e.g. a 32-bit index do?]

I'm not at all worried for 2D (or more dimensions).
