Francesc and I have met in real-life and come up with a proposal for a
new, revised bloscpack header. It will allow easy interoperability with
the new CArray persistence layer that Francesc is working on.
> The file-size in the header could potentially be replaced with the
> last-chunk-size, which would be an int32 and thus 4 bytes smaller. The
> total file-size could then be inferred from the three values nchunks,
> chunk-size and last-chunk-size.
I don't care too much about this, so +0 here.
> Is the index as uint32 really big enough? If the file-size is kept as
> int64, the indices would be too small to index the entire file.
Good point. Yes, I think indexes (would it be better to call them offsets?) should be int64 too.
> Should indexes be converted to a bitfield, to allow for storing
> additional settings in the future?
Uh? Can you explain in a bit more detail what you are proposing here?
> Do we want to store the blosc typesize, for example in the single
> reserved byte? This would allow calculating in which chunk to find an
> element or a sequence of elements.
Not sure about this. To access an element in a chunk you will need to use Blosc to read the data in it, and this typesize info is certainly in the Blosc header already. So I'd say -0 to this.
> On 6/27/12 10:59 PM, Valentin Haenel wrote:
> >Francesc and I have met in real-life and come up with a proposal for a
> >new, revised bloscpack header. It will allow easy interoperability with
> >the new CArray persistence layer that Francesc is working on.
> >The file-size in the header could be potentially replaced with the
> >last-chunk-size which would be an int32 and thus 4 bytes smaller.
> >The total file-size could then be inferred from the three values
> >nchunks, chunk-size and last-chunk-size.
> I don't care too much about this, so +0 here.
Okay. Since you are +0 I'll update this.
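As a rough illustration of how the total size could be inferred from the three values, here is a minimal Python sketch. The function name is hypothetical, and it assumes that every chunk except the last holds exactly chunk-size bytes of uncompressed data.

```python
def total_size(nchunks, chunk_size, last_chunk_size):
    """Infer the total uncompressed size from the three header values.

    Assumption: all chunks but the last contain exactly `chunk_size`
    bytes; the last one contains `last_chunk_size` bytes.
    """
    if nchunks == 0:
        return 0
    return (nchunks - 1) * chunk_size + last_chunk_size


# Example: 4 chunks of 1024 bytes each, the last one holding only 100 bytes.
print(total_size(4, 1024, 100))
```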
> >Is the index as uint32 really big enough? If the file-size is kept
> >as int64 the indices would be too small to index the entire file.
> Good point. Yes, I think indexes (should we call them offsets
> better?) should be int64 too.
Okay for calling them offsets and int64. I propose to use '-1' to
denote an unknown offset.
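A minimal sketch of what int64 offsets with '-1' as the "unknown" marker could look like, using Python's struct module. The helper names and the little-endian byte order are assumptions made for the illustration, not part of the proposal.

```python
import struct

UNKNOWN_OFFSET = -1      # marker for an offset that is not known (yet)
UINT32_MAX = 2**32 - 1   # largest position a uint32 could address


def encode_offsets(offsets):
    """Pack a list of offsets as signed 64-bit integers (8 bytes each).

    Little-endian ('<') is an assumption made for this sketch.
    """
    return struct.pack('<%dq' % len(offsets), *offsets)


def decode_offsets(raw):
    """Unpack a bytes object produced by encode_offsets."""
    return list(struct.unpack('<%dq' % (len(raw) // 8), raw))


# A chunk starting 5 GiB into the file does not fit into a uint32,
# but it is no problem for an int64.
offsets = [0, 5 * 2**30, UNKNOWN_OFFSET]
assert offsets[1] > UINT32_MAX
print(decode_offsets(encode_offsets(offsets)))
```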
> >Should indexes be converted to a bitfield. To allow for storing
> >additional settings in the future?
> Uh? Can you explain with a bit more detail what you are proposing here?
Sure. What I mean is to have a bitfield, similar to the bitfield in the
blosc header. The first bit would signify the presence of the offsets.
The other 7 bits would be reserved for signifying other things about the
compressed file, for example whether there are reserved, empty chunks
for expansion.
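To make the idea a bit more concrete, here is a minimal sketch of such a one-byte bitfield in Python 3. The flag names are hypothetical, "first bit" is taken to mean the least-significant bit, and the second flag only illustrates how one of the reserved bits might be used later.

```python
# Hypothetical flag layout; bit 0 is the least-significant bit.
HAS_OFFSETS = 1 << 0
HAS_RESERVED_CHUNKS = 1 << 1  # illustrative use of a reserved bit only


def pack_flags(has_offsets, has_reserved_chunks=False):
    """Encode the options into a single header byte."""
    flags = 0
    if has_offsets:
        flags |= HAS_OFFSETS
    if has_reserved_chunks:
        flags |= HAS_RESERVED_CHUNKS
    return bytes([flags])


def unpack_flags(raw):
    """Decode the single header byte back into a dict of options."""
    flags = raw[0]
    return {
        'offsets': bool(flags & HAS_OFFSETS),
        'reserved_chunks': bool(flags & HAS_RESERVED_CHUNKS),
    }


print(unpack_flags(pack_flags(has_offsets=True)))
```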
Incidentally, we have not yet decided how to handle reserved chunks for
expansion. One way would be to store the number of additional chunks in
the header and simply add them after the last chunk. The size of each
would be chunk-size plus 16 bytes for the blosc header (in case the data
turns out to be non-compressible), plus space for the checksum if
requested. The main reason for adding them at file creation time is that
we can pre-allocate space for the offsets.
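The arithmetic for the reserved space could be sketched like this in Python. The function names are hypothetical; the 16 bytes of blosc header come from the paragraph above, and the 8 bytes per offset assume that offsets are stored as int64.

```python
BLOSC_HEADER_LENGTH = 16  # worst-case overhead for non-compressible data
OFFSET_LENGTH = 8         # assumes one int64 offset per chunk


def reserved_chunk_size(chunk_size, checksum_size=0):
    """Worst-case on-disk size of a single reserved chunk, in bytes."""
    return chunk_size + BLOSC_HEADER_LENGTH + checksum_size


def reserved_space(n_reserved, chunk_size, checksum_size=0):
    """Bytes to pre-allocate for n_reserved chunks plus their offsets."""
    chunks = n_reserved * reserved_chunk_size(chunk_size, checksum_size)
    offsets = n_reserved * OFFSET_LENGTH
    return chunks + offsets


# Example: 10 reserved chunks of 1 MiB each with a 4 byte checksum.
print(reserved_space(10, 2**20, checksum_size=4))
```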
> >Do we want to store the blosc typesize, for example in the single
> >reserved byte? This would allow to calculate in which chunk to
> >find an element or a sequence of elements.
> Not sure about this. For accessing the element in a chunk you will
> need to use Blosc to access data on it, and this typesize info is
> certainly in the Blosc header already. So I'd say -0 to this.
Would the typesize not be needed for locating the chunk that contains
the items which need to be fetched? Something like:
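A minimal Python sketch of that kind of calculation, with hypothetical names, assuming a fixed chunk-size in bytes that is a multiple of the typesize:

```python
def locate_item(item_index, typesize, chunk_size):
    """Return (chunk_number, index_within_chunk) for a given item.

    Assumptions: every chunk holds `chunk_size` bytes of uncompressed
    data and `chunk_size` is a multiple of `typesize`.
    """
    items_per_chunk = chunk_size // typesize
    return divmod(item_index, items_per_chunk)


# Example: 8 byte items in 1 MiB chunks, looking for item 300000.
print(locate_item(300000, 8, 2**20))
```

Without the typesize in the bloscpack header, the same lookup would first have to read a blosc header from one of the chunks.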
> * Francesc Alted <fal...@gmail.com> [2012-06-28]:
> > >The file-size in the header could be potentially replaced with the
> > >last-chunk-size which would be a int32 and thus 4 bytes smaller.
> > >The total file-size could then be inferred from the three values
> > >nchunks, chunk-size and last-chunk-size.
>>> Is the index as uint32 really big enough? If the file-size is kept
>>> as int64 the indices would be too small to index the entire file.
>> Good point. Yes, I think indexes (should we call them offsets
>> better?) should be int64 too.
> Okay for calling them offsets and int64. I propose to use '-1' to
> denote an unknown offset.
That's fine with me.
>>> Should indexes be converted to a bitfield. To allow for storing
>>> additional settings in the future?
>> Uh? Can you explain with a bit more detail what you are proposing here?
> Sure. What I mean is to have a bitfield, similar to the bitfield in the
> blosc header. The first bit would signify the presence of the offsets.
> The other 7 bits would be reserved for signifying other things about the
> compressed file. For example, if there are reserved, empty chunks for
> expansion.
That's a good idea. +1 for including such a bitfield.
> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way, would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time, is that we can
> pre-allocate space for the offsets.
Definitely, pre-allocating space and then filling the offset info would be the way to go, IMO.
>>> Do we want to store the blosc typesize, for example in the single
>>> reserved byte? This would allow to calculate in which chunk to
>>> find an element or a sequence of elements.
>> Not sure about this. For accessing the element in a chunk you will
>> need to use Blosc to access data on it, and this typesize info is
>> certainly in the Blosc header already. So I'd say -0 to this.
> Would the typesize not be needed for locating the chunk that contains
> the items which need to be fetched? Something like:
> >>>Should indexes be converted to a bitfield. To allow for storing
> >>>additional settings in the future?
> >>Uh? Can you explain with a bit more detail what you are proposing here?
> >Sure. What I mean is to have a bitfield, similar to the bitfield in the
> >blosc header. The first bit would signify the presence of the offsets.
> >The other 7 bits would be reserved for signifying other things about the
> >compressed file. For example, if there are reserved, empty chunks for
> >expansion.
> That's a good idea. +1 for including such a bitfield.
OK. I have included it in the latest version.
> >Incidentally, we have not yet decided how to handle reserved chunks for
> >expansion. One way, would be to store the number of additional chunks in
> >the header and simply add them after the last chunk. The size would be
> >chunk-size plus 16 bytes for the blosc header (in case the data turns
> >out to be non-compressible), plus space for the checksum if requested.
> >The main reason for adding them at file creation time, is that we can
> >pre-allocate space for the offsets.
> Definitely, pre-allocating space and then filling the offset info
> would be the way to go, IMO.
Yeah, I think so too. Though perhaps we would need to specify the number
and size of the empty chunks? How exactly to handle the issue of
pre-allocated blocks is not clear to me.
> >>>Do we want to store the blosc typesize, for example in the single
> >>>reserved byte? This would allow to calculate in which chunk to
> >>>find an element or a sequence of elements.
> >>Not sure about this. For accessing the element in a chunk you will
> >>need to use Blosc to access data on it, and this typesize info is
> >>certainly in the Blosc header already. So I'd say -0 to this.
> >Would the typesize not be needed for locating the chunk that contains
> >the items which need to be fetched? Something like:
> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way, would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time, is that we can
> pre-allocate space for the offsets.
>> Definitely, pre-allocating space and then filling the offset info
>> would be the way to go, IMO.
> Yeah I think so too. Though we would need to designate the number and
> size of empty chunks perhaps? How exactly to handle the issue of
> pre-allocated blocks is not clear to me.
Empty chunks? Why do you want to reserve empty chunks? I don't get it, sorry.
> >Incidentally, we have not yet decided how to handle reserved chunks for
> >expansion. One way, would be to store the number of additional chunks in
> >the header and simply add them after the last chunk. The size would be
> >chunk-size plus 16 bytes for the blosc header (in case the data turns
> >out to be non-compressible), plus space for the checksum if requested.
> >The main reason for adding them at file creation time, is that we can
> >pre-allocate space for the offsets.
> >>Definitely, pre-allocating space and then filling the offset info
> >>would be the way to go, IMO.
> >Yeah I think so too. Though we would need to designate the number and
> >size of empty chunks perhaps? How exactly to handle the issue of
> >pre-allocated blocks is not clear to me.
> Empty chunks? Why do you want to book empty chunks? I don't get it, sorry.
The empty chunks are meant to serve as pre-allocated space in the
file-format, which can be used to enlarge the file without having to copy
it. Or do we not need to pre-allocate the space in the file at all, and
can we just grow the file as we add new chunks to it?
> * Francesc Alted <fal...@gmail.com> [2012-07-09]:
>> On 7/7/12 6:40 PM, Valentin Haenel wrote:
>>> Incidentally, we have not yet decided how to handle reserved chunks for
>>> expansion. One way, would be to store the number of additional chunks in
>>> the header and simply add them after the last chunk. The size would be
>>> chunk-size plus 16 bytes for the blosc header (in case the data turns
>>> out to be non-compressible), plus space for the checksum if requested.
>>> The main reason for adding them at file creation time, is that we can
>>> pre-allocate space for the offsets.
>>>> Definitely, pre-allocating space and then filling the offset info
>>>> would be the way to go, IMO.
>>> Yeah I think so too. Though we would need to designate the number and
>>> size of empty chunks perhaps? How exactly to handle the issue of
>>> pre-allocated blocks is not clear to me.
>> Empty chunks? Why do you want to book empty chunks? I don't get it, sorry.
> The empty chunks are to be used in the file-format as pre-allocated
> space which can be used to enlarge the file w/o having to copy it. Or do
> we not need to pre-allocate the space in the file and can we just grow
> the file as we add new chunks to it?
Well, my plan was to use the bloscpack format just to keep a *fixed* number of chunks. In case we want to enlarge a dataset, my plan was to add a new file. You know, pre-allocating in the same file will always be tricky. Let's leverage the filesystem capabilities for doing this.
> On 7/9/12 11:39 AM, Valentin Haenel wrote:
> >* Francesc Alted <fal...@gmail.com> [2012-07-09]:
> >>On 7/7/12 6:40 PM, Valentin Haenel wrote:
> >>>Incidentally, we have not yet decided how to handle reserved chunks for
> >>>expansion. One way, would be to store the number of additional chunks in
> >>>the header and simply add them after the last chunk. The size would be
> >>>chunk-size plus 16 bytes for the blosc header (in case the data turns
> >>>out to be non-compressible), plus space for the checksum if requested.
> >>>The main reason for adding them at file creation time, is that we can
> >>>pre-allocate space for the offsets.
> >>>>Definitely, pre-allocating space and then filling the offset info
> >>>>would be the way to go, IMO.
> >>>Yeah I think so too. Though we would need to designate the number and
> >>>size of empty chunks perhaps? How exactly to handle the issue of
> >>>pre-allocated blocks is not clear to me.
> >>Empty chunks? Why do you want to book empty chunks? I don't get it, sorry.
> >The empty chunks are to be used in the file-format as pre-allocated
> >space which can be used to enlarge the file w/o having to copy it. Or do
> >we not need to pre-allocate the space in the file and can we just grow
> >the file as we add new chunks to it?
> Well, my plan was to use the bloscpack format just to keep a *fixed*
> number of chunks. In case we want to enlarge a dataset, my plan was
> to add a new file. You know, pre-allocating in the same file will
> always be tricky. Let's leverage the filesystem capabilities for
> doing this.
Agreed, this also keeps the format simple. So, all remaining questions
regarding the file-format have been cleared up. Hence I will proceed to
implement the new format for bloscpack very soon.