RFC for the new bloscpack header

Valentin Haenel

Jun 27, 2012, 4:59:09 PM
to Blosc
Dear all subscribers,

Francesc and I have met in real life and come up with a proposal for a
new, revised bloscpack header. It will allow easy interoperability with
the new CArray persistence layer that Francesc is working on.

I have put the details on github as a gist:

https://gist.github.com/3006723

V-

Francesc Alted

Jun 28, 2012, 5:00:17 AM
to bl...@googlegroups.com
Awesome! Regarding the open questions:

> The file-size in the header could potentially be replaced with the
> last-chunk-size, which would be an int32 and thus 4 bytes smaller. The
> total file-size could then be inferred from the three values nchunks,
> chunk-size and last-chunk-size.

I don't care too much about this, so +0 here.

> Is the index as uint32 really big enough? If the file-size is kept as
> int64, the indices would be too small to index the entire file.

Good point. Yes, I think indexes (should we rather call them offsets?)
should be int64 too.

> Should indexes be converted to a bitfield, to allow for storing
> additional settings in the future?

Uh? Can you explain with a bit more detail what you are proposing here?

> Do we want to store the blosc typesize, for example in the single
> reserved byte? This would allow calculating in which chunk to find an
> element or a sequence of elements.

Not sure about this. For accessing the element in a chunk you will need
to use Blosc to access data in it, and this typesize info is certainly
in the Blosc header already. So I'd say -0 to this.

--
Francesc Alted

Valentin Haenel

Jun 30, 2012, 10:40:08 AM
to bl...@googlegroups.com
Hey Francesc,

* Francesc Alted <fal...@gmail.com> [2012-06-28]:
> On 6/27/12 10:59 PM, Valentin Haenel wrote:
> >Francesc and I have met in real life and come up with a proposal for a
> >new, revised bloscpack header. It will allow easy interoperability with
> >the new CArray persistence layer that Francesc is working on.
> >
> >I have put the details on github as a gist:
> >
> >https://gist.github.com/3006723
>
> Awesome! Regarding the open questions:
>
> >The file-size in the header could potentially be replaced with the
> >last-chunk-size, which would be an int32 and thus 4 bytes smaller.
> >The total file-size could then be inferred from the three values
> >nchunks, chunk-size and last-chunk-size.
>
> I don't care too much about this, so +0 here.

Okay. Since you are +0 I'll update this.
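
A minimal sketch of how the total size could be inferred from the three
values, assuming that chunk-size and last-chunk-size both refer to
uncompressed sizes in bytes and that only the last chunk may be shorter
than the others (the function name is purely illustrative):

```python
def total_size(nchunks, chunk_size, last_chunk_size):
    """Infer the total (uncompressed) size from the three header values."""
    if nchunks < 1:
        raise ValueError("need at least one chunk")
    if not 0 < last_chunk_size <= chunk_size:
        raise ValueError("last chunk must be non-empty and at most chunk_size")
    # All chunks except the last one are full; the last one may be shorter.
    return (nchunks - 1) * chunk_size + last_chunk_size


# Two full chunks of 1024 bytes plus a final chunk of 100 bytes.
size = total_size(nchunks=3, chunk_size=1024, last_chunk_size=100)
```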

> >Is the index as uint32 really big enough? If the file-size is kept
> >as int64, the indices would be too small to index the entire file.
>
> Good point. Yes, I think indexes (should we rather call them
> offsets?) should be int64 too.

Okay for calling them offsets and int64. I propose to use '-1' to
denote an unknown offset.
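
As a rough sketch of what this could look like on disk, assuming the
offsets are stored as consecutive little-endian signed 64-bit integers
(the byte order and the helper names are assumptions, not part of the
proposal):

```python
import struct

UNKNOWN_OFFSET = -1  # sentinel for an offset that has not been written yet


def pack_offsets(offsets):
    # '<' = little-endian (assumed here), 'q' = signed 64-bit integer
    return struct.pack('<%dq' % len(offsets), *offsets)


def unpack_offsets(raw):
    count = len(raw) // 8  # each int64 offset occupies 8 bytes
    return list(struct.unpack('<%dq' % count, raw))


# Pre-allocate three slots as unknown, then fill in the first one.
offsets = [UNKNOWN_OFFSET] * 3
offsets[0] = 32
raw = pack_offsets(offsets)
```

A signed type is what makes the '-1' sentinel possible, and int64 can
still address far more than the 4 GiB limit of a uint32.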

> >Should indexes be converted to a bitfield, to allow for storing
> >additional settings in the future?
>
> Uh? Can you explain with a bit more detail what you are proposing here?

Sure. What I mean is to have a bitfield, similar to the bitfield in the
blosc header. The first bit would signify the presence of the offsets.
The other 7 bits would be reserved for signifying other things about the
compressed file. For example, if there are reserved, empty chunks for
expansion.
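
A small sketch of such a one-byte bitfield, assuming that "the first
bit" means the least significant bit (the constant and function names
are made up for illustration):

```python
OFFSETS_PRESENT = 0b00000001  # assumed: bit 0 signals that offsets are stored
# Bits 1-7 are left unassigned, reserved for future settings.


def encode_flags(offsets_present):
    flags = 0
    if offsets_present:
        flags |= OFFSETS_PRESENT
    return flags


def has_offsets(flags):
    # Mask out everything except bit 0, so reserved bits are ignored.
    return bool(flags & OFFSETS_PRESENT)
```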

Incidentally, we have not yet decided how to handle reserved chunks for
expansion. One way would be to store the number of additional chunks in
the header and simply add them after the last chunk. The size would be
chunk-size plus 16 bytes for the blosc header (in case the data turns
out to be non-compressible), plus space for the checksum if requested.
The main reason for adding them at file creation time is that we can
pre-allocate space for the offsets.
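
To put numbers on the reserved-chunk idea, here is a sketch of the
arithmetic, assuming all sizes are in bytes and that the checksum size
is zero when no checksum is requested (the names are hypothetical):

```python
BLOSC_HEADER_SIZE = 16  # bytes of blosc header per chunk, as stated above


def reserved_chunk_size(chunk_size, checksum_size=0):
    # Worst case: the data does not compress, so the full chunk-size is
    # needed on top of the blosc header and the optional checksum.
    return chunk_size + BLOSC_HEADER_SIZE + checksum_size


def reserved_space(n_reserved, chunk_size, checksum_size=0):
    # Total space to pre-allocate after the last real chunk.
    return n_reserved * reserved_chunk_size(chunk_size, checksum_size)
```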

> >Do we want to store the blosc typesize, for example in the single
> >reserved byte? This would allow calculating in which chunk to
> >find an element or a sequence of elements.
>
> Not sure about this. For accessing the element in a chunk you will
> need to use Blosc to access data in it, and this typesize info is
> certainly in the Blosc header already. So I'd say -0 to this.

Would the typesize not be needed for locating the chunk that contains
the items which need to be fetched? Something like:

chunk_index = (item_index * typesize) // chunk_size

I vaguely remember having discussed this with you; but unfortunately I
do not remember the outcome :(
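
To make the formula above concrete, here is a sketch that also returns
the byte position inside the chunk. It assumes that chunk_size is the
uncompressed chunk size in bytes and a multiple of typesize, so that no
item straddles two chunks (the function name is illustrative only):

```python
def locate_item(item_index, typesize, chunk_size):
    """Return (chunk_index, byte offset within the uncompressed chunk)."""
    byte_offset = item_index * typesize
    chunk_index = byte_offset // chunk_size
    offset_in_chunk = byte_offset % chunk_size
    return chunk_index, offset_in_chunk


# With 8-byte items and 1024-byte chunks there are 128 items per chunk,
# so item 300 lives in the third chunk (index 2).
location = locate_item(item_index=300, typesize=8, chunk_size=1024)
```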

I'll now proceed to update the header specification to include the
results from this discussion.

best,

V-

Valentin Haenel

Jun 30, 2012, 11:00:15 AM
to bl...@googlegroups.com
Hi,

* Valentin Haenel <valenti...@gmx.de> [2012-06-30]:
> > >Is the index as uint32 really big enough? If the file-size is kept
> > >as int64, the indices would be too small to index the entire file.
> >
> > Good point. Yes, I think indexes (should we rather call them
> > offsets?) should be int64 too.
>
> Okay for calling them offsets and int64. I propose to use '-1' to
> denote an unknown offset.

https://gist.github.com/3006723/ec179e3b7bdffdb65dc1799fbf5aa141ad6288c9

V-

Valentin Haenel

Jun 30, 2012, 11:14:48 AM
to bl...@googlegroups.com
* Valentin Haenel <valenti...@gmx.de> [2012-06-30]:
> * Francesc Alted <fal...@gmail.com> [2012-06-28]:
> > >The file-size in the header could potentially be replaced with the
> > >last-chunk-size, which would be an int32 and thus 4 bytes smaller.
> > >The total file-size could then be inferred from the three values
> > >nchunks, chunk-size and last-chunk-size.
> >
> > I don't care too much about this, so +0 here.
>
> Okay. Since you are +0 I'll update this.

https://gist.github.com/3006723/9834af520257db7b7d4aaeb4af2ce9cdf5662fc9

(the syntax is a bit foobared, will fix this shortly)

V-

Francesc Alted

Jul 6, 2012, 5:30:55 AM
to bl...@googlegroups.com
Hey Valentin, I forgot to answer these, sorry.

On 6/30/12 4:40 PM, Valentin Haenel wrote:
>
>>> Is the index as uint32 really big enough? If the file-size is kept
>>> as int64, the indices would be too small to index the entire file.
>> Good point. Yes, I think indexes (should we rather call them
>> offsets?) should be int64 too.
> Okay for calling them offsets and int64. I propose to use '-1' to
> denote an unknown offset.

That's fine with me.

>>> Should indexes be converted to a bitfield, to allow for storing
>>> additional settings in the future?
>> Uh? Can you explain with a bit more detail what you are proposing here?
> Sure. What I mean is to have a bitfield, similar to the bitfield in the
> blosc header. The first bit would signify the presence of the offsets.
> The other 7 bits would be reserved for signifying other things about the
> compressed file. For example, if there are reserved, empty chunks for
> expansion.

That's a good idea. +1 for including such a bitfield.

>
> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time is that we can
> pre-allocate space for the offsets.

Definitely, pre-allocating space and then filling the offset info would
be the way to go, IMO.

>
>>> Do we want to store the blosc typesize, for example in the single
>>> reserved byte? This would allow calculating in which chunk to
>>> find an element or a sequence of elements.
>> Not sure about this. For accessing the element in a chunk you will
>> need to use Blosc to access data in it, and this typesize info is
>> certainly in the Blosc header already. So I'd say -0 to this.
> Would the typesize not be needed for locating the chunk that contains
> the items which need to be fetched? Something like:
>
> chunk_index = (item_index * typesize) // chunk_size
>
> I vaguely remember having discussed this with you; but unfortunately I
> do not remember the outcome :(

Good point. So I change my mind to a clear +1 on this.

--
Francesc Alted

Valentin Haenel

Jul 7, 2012, 12:40:18 PM
to bl...@googlegroups.com
Ola Francesc,

* Francesc Alted <fal...@gmail.com> [2012-07-06]:
> Hey Valentin, I forgot to answer these, sorry.

No problem! :)

> >>>Should indexes be converted to a bitfield, to allow for storing
> >>>additional settings in the future?
> >>Uh? Can you explain with a bit more detail what you are proposing here?
> >Sure. What I mean is to have a bitfield, similar to the bitfield in the
> >blosc header. The first bit would signify the presence of the offsets.
> >The other 7 bits would be reserved for signifying other things about the
> >compressed file. For example, if there are reserved, empty chunks for
> >expansion.
>
> That's a good idea. +1 for including such a bitfield.

OK. I have included it in the latest version.

> >Incidentally, we have not yet decided how to handle reserved chunks for
> >expansion. One way would be to store the number of additional chunks in
> >the header and simply add them after the last chunk. The size would be
> >chunk-size plus 16 bytes for the blosc header (in case the data turns
> >out to be non-compressible), plus space for the checksum if requested.
> >The main reason for adding them at file creation time is that we can
> >pre-allocate space for the offsets.
>
> Definitely, pre-allocating space and then filling the offset info
> would be the way to go, IMO.

Yeah I think so too. Though we would need to designate the number and
size of empty chunks perhaps? How exactly to handle the issue of
pre-allocated blocks is not clear to me.

> >>>Do we want to store the blosc typesize, for example in the single
> >>>reserved byte? This would allow calculating in which chunk to
> >>>find an element or a sequence of elements.
> >>Not sure about this. For accessing the element in a chunk you will
> >>need to use Blosc to access data in it, and this typesize info is
> >>certainly in the Blosc header already. So I'd say -0 to this.
> >Would the typesize not be needed for locating the chunk that contains
> >the items which need to be fetched? Something like:
> >
> > chunk_index = (item_index * typesize) // chunk_size
> >
> >I vaguely remember having discussed this with you; but unfortunately I
> >do not remember the outcome :(
>
> Good point. So I change my mind to a clear +1 on this.

OK. I have included it in the latest version.

V-

Francesc Alted

Jul 9, 2012, 5:29:32 AM
to bl...@googlegroups.com
On 7/7/12 6:40 PM, Valentin Haenel wrote:
>
> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time is that we can
> pre-allocate space for the offsets.
>> Definitely, pre-allocating space and then filling the offset info
>> would be the way to go, IMO.
> Yeah I think so too. Though we would need to designate the number and
> size of empty chunks perhaps? How exactly to handle the issue of
> pre-allocated blocks is not clear to me.

Empty chunks? Why do you want to book empty chunks? I don't get it, sorry.

--
Francesc Alted

Valentin Haenel

Jul 9, 2012, 5:39:07 AM
to bl...@googlegroups.com
* Francesc Alted <fal...@gmail.com> [2012-07-09]:
The empty chunks are to be used in the file-format as pre-allocated
space which can be used to enlarge the file w/o having to copy it. Or do
we not need to pre-allocate the space in the file and can we just grow
the file as we add new chunks to it?

V-

Francesc Alted

Jul 9, 2012, 5:49:39 AM
to bl...@googlegroups.com
Well, my plan was to use the bloscpack format just to keep a *fixed*
number of chunks. In case we want to enlarge a dataset, my plan was to
add a new file. You know, pre-allocating in the same file will always
be tricky. Let's leverage the filesystem capabilities for doing this.

--
Francesc Alted

Valentin Haenel

Jul 9, 2012, 7:18:54 AM
to bl...@googlegroups.com
Ola!
Agreed, this also keeps the format simple. So, all remaining questions
regarding the file-format have been cleared up. Hence I will proceed to
implement the new format for bloscpack very soon.

V-