Hey Francesc,
* Francesc Alted <fal...@gmail.com> [2012-06-28]:
> On 6/27/12 10:59 PM, Valentin Haenel wrote:
> >Francesc and I have met in real-life and come up with a proposal for a
> >new, revised bloscpack header. It will allow easy interoperability with
> >the new CArray persistence layer that Francesc is working on.
> >
> >I have put the details on github as a gist:
> >
> >https://gist.github.com/3006723
>
> Awesome! Regarding the open questions:
>
> >The file-size in the header could be potentially replaced with the
> >last-chunk-size which would be an int32 and thus 4 bytes smaller.
> >The total file-size could then be inferred from the three values
> >nchunks, chunk-size and last-chunk-size.
>
> I don't care too much about this, so +0 here.
Okay. Since you are +0 I'll update this.
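To make the inference concrete, here is a minimal sketch. It assumes the
size in question is that of the uncompressed data and that every chunk
except the last is exactly chunk-size bytes long; the function name is
hypothetical and not part of bloscpack:

```python
def infer_total_size(nchunks, chunk_size, last_chunk_size):
    """Total size implied by the three header values (assumed semantics)."""
    if nchunks == 0:
        return 0
    # All chunks but the last are full; the last one may be shorter.
    return (nchunks - 1) * chunk_size + last_chunk_size


# Example: three full 1024-byte chunks plus a 100-byte tail.
print(infer_total_size(4, 1024, 100))
```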
> >Is the index as uint32 really big enough? If the file-size is kept
> >as int64 the indices would be too small to index the entire file.
>
> Good point. Yes, I think indexes (should we call them offsets
> better?) should be int64 too.
Okay for calling them offsets and int64. I propose to use '-1' to
denote an unknown offset.
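A rough sketch of how such an offsets section could be serialized, just
to illustrate the idea. The little-endian byte order and the helper
names are my assumptions here, nothing of this is decided yet:

```python
import struct

UNKNOWN_OFFSET = -1  # proposed marker for an offset that is not yet known


def pack_offsets(offsets):
    """Serialize offsets as signed 64-bit integers (little-endian assumed)."""
    return struct.pack('<%dq' % len(offsets), *offsets)


def unpack_offsets(raw):
    """Inverse of pack_offsets; expects a multiple of 8 bytes."""
    count = len(raw) // 8
    return list(struct.unpack('<%dq' % count, raw))


# An offset beyond the uint32 range and an unknown one both round-trip.
raw = pack_offsets([32, 2 ** 40, UNKNOWN_OFFSET])
print(unpack_offsets(raw))
```

A signed type is what makes '-1' usable as a marker; with an unsigned
type we would need a different sentinel.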
> >Should indexes be converted to a bitfield, to allow for storing
> >additional settings in the future?
>
> Uh? Can you explain with a bit more detail what you are proposing here?
Sure. What I mean is to have a bitfield, similar to the bitfield in the
blosc header. The first bit would signify the presence of the offsets.
The other 7 bits would be reserved for signifying other things about the
compressed file, for example whether there are reserved, empty chunks for
expansion.
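As a sketch of what I have in mind: a single options byte where one bit
flags the offsets. Taking "first bit" to mean the least-significant bit
is an assumption on my part, and the names are hypothetical:

```python
import struct

FLAG_OFFSETS = 1 << 0  # assumed: lowest bit signals that offsets are present
# Bits 1..7 stay reserved for future settings.


def encode_flags(has_offsets):
    """Build the one-byte bitfield; reserved bits are left at zero."""
    flags = FLAG_OFFSETS if has_offsets else 0
    return struct.pack('B', flags)


def decode_flags(raw):
    """Return whether the offsets bit is set in the one-byte bitfield."""
    (flags,) = struct.unpack('B', raw)
    return bool(flags & FLAG_OFFSETS)


print(decode_flags(encode_flags(True)))
```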
Incidentally, we have not yet decided how to handle reserved chunks for
expansion. One way would be to store the number of additional chunks in
the header and simply add them after the last chunk. The size of each
would be chunk-size plus 16 bytes for the blosc header (in case the data
turns out to be non-compressible), plus space for the checksum if
requested. The main reason for adding them at file creation time is that
we can pre-allocate space for the offsets.
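The space calculation could look roughly like this. The helper names are
hypothetical, and counting 8 bytes per offset assumes the int64 offsets
from above:

```python
BLOSC_HEADER_SIZE = 16  # worst-case overhead when data is non-compressible
OFFSET_SIZE = 8         # assumed: one int64 offset per chunk


def reserved_chunk_size(chunk_size, checksum_size=0):
    """Bytes to set aside for one reserved, still empty chunk."""
    return chunk_size + BLOSC_HEADER_SIZE + checksum_size


def reserved_space(n_reserved, chunk_size, checksum_size=0):
    """Return (bytes for reserved chunks, bytes for their offsets)."""
    chunks = n_reserved * reserved_chunk_size(chunk_size, checksum_size)
    offsets = n_reserved * OFFSET_SIZE
    return chunks, offsets


# Example: 4 reserved chunks of 1 MiB each with a 4-byte checksum.
print(reserved_space(4, 2 ** 20, checksum_size=4))
```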
> >Do we want to store the blosc typesize, for example in the single
> >reserved byte? This would allow calculating in which chunk to
> >find an element or a sequence of elements.
>
> Not sure about this. For accessing the element in a chunk you will
> need to use Blosc to access data on it, and this typesize info is
> certainly in the Blosc header already. So I'd say -0 to this.
Would the typesize not be needed for locating the chunk that contains
the items which need to be fetched? Something like:
    chunk_index = (item_index * typesize) // chunk_size
I vaguely remember having discussed this with you; but unfortunately I
do not remember the outcome :(
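Spelled out a bit more, assuming chunk_size refers to the uncompressed
chunk size in bytes and that items never straddle a chunk boundary
(i.e. chunk_size is a multiple of typesize); the function name is
hypothetical:

```python
def locate_item(item_index, typesize, chunk_size):
    """Return (chunk_index, byte offset inside the uncompressed chunk)."""
    byte_position = item_index * typesize
    return divmod(byte_position, chunk_size)


# Item 300 of a 4-byte type with 1024-byte chunks starts at byte 1200.
print(locate_item(300, 4, 1024))
```

Without the typesize in the bloscpack header, one would have to read the
blosc header of some chunk first to do this calculation.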
I'll now proceed to update the header specification to include the
results from this discussion.
best,
V-