RFC for the new bloscpack header
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Wed, 27 Jun 2012 22:59:09 +0200
Local: Wed, Jun 27 2012 4:59 pm
Subject: RFC for the new bloscpack header
Dear all subscribers,

Francesc and I have met in real life and come up with a proposal for a
new, revised bloscpack header. It will allow easy interoperability with
the new CArray persistence layer that Francesc is working on.

I have put the details on github as a gist:

https://gist.github.com/3006723

V-


 
From: Francesc Alted <fal...@gmail.com>
Date: Thu, 28 Jun 2012 11:00:17 +0200
Local: Thurs, Jun 28 2012 5:00 am
Subject: Re: [blosc] RFC for the new bloscpack header
On 6/27/12 10:59 PM, Valentin Haenel wrote:

> Dear all subscribers,

> Francesc and I have met in real-life and come up with a proposal for a
> new, revised bloscpack header. It will allow easy interoperability with
> the new CArray persistence layer that Francesc is working on.

> I have put the details on github as a gist:

> https://gist.github.com/3006723

Awesome!  Regarding the open questions:

> The file-size in the header could potentially be replaced with the
> last-chunk-size, which would be an int32 and thus 4 bytes smaller. The
> total file-size could then be inferred from the three values nchunks,
> chunk-size and last-chunk-size.

I don't care too much about this, so +0 here.
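
For reference, a minimal sketch of that inference in Python. It assumes
that every chunk except the last holds exactly chunk-size bytes and that
all sizes refer to uncompressed data; the function name is illustrative
and not part of bloscpack:

```python
def total_size(nchunks: int, chunk_size: int, last_chunk_size: int) -> int:
    """Size implied by the three header values.

    Assumes all chunks but the last are exactly chunk_size bytes long.
    """
    if nchunks < 1:
        return 0
    return (nchunks - 1) * chunk_size + last_chunk_size
```

With 4 chunks of 1024 bytes and a last chunk of 100 bytes, this gives
3 * 1024 + 100 = 3172 bytes.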

> Is the index as uint32 really big enough? If the file-size is kept as
> int64, the indices would be too small to index the entire file.

Good point.  Yes, I think indexes (should we rather call them offsets?)
should be int64 too.

> Should indexes be converted to a bitfield, to allow for storing
> additional settings in the future?

Uh?  Can you explain with a bit more detail what you are proposing here?

> Do we want to store the blosc typesize, for example in the single
> reserved byte? This would allow calculating in which chunk to find an
> element or a sequence of elements.

Not sure about this.  To access an element in a chunk you will need
to go through Blosc anyway, and this typesize info is certainly
in the Blosc header already.  So I'd say -0 to this.

--
Francesc Alted


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Sat, 30 Jun 2012 16:40:08 +0200
Local: Sat, Jun 30 2012 10:40 am
Subject: Re: [blosc] RFC for the new bloscpack header
Hey Francesc,

* Francesc Alted <fal...@gmail.com> [2012-06-28]:

Okay. Since you are +0 I'll update this.

> >Is the index as uint32 really big enough? If the file-size is kept
> >as int64, the indices would be too small to index the entire file.

> Good point.  Yes, I think indexes (should we call them offsets
> better?) should be int64 too.

Okay for calling them offsets and int64. I propose to use '-1' to
denote an unknown offset.
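
A minimal sketch of how such offsets might be serialized, using Python's
struct module. The little-endian byte order and the helper names are
assumptions for illustration; the discussion so far only settles the
int64 type and the '-1' sentinel:

```python
import struct

UNKNOWN_OFFSET = -1      # sentinel for "offset not known", as proposed above
MAX_UINT32 = 2**32 - 1   # largest offset a uint32 could hold (just under 4 GiB)


def pack_offsets(offsets):
    """Serialize chunk offsets as signed 64-bit integers.

    '<' (little-endian) is an assumption; no byte order has been agreed on.
    """
    return struct.pack('<%dq' % len(offsets), *offsets)


def unpack_offsets(raw):
    """Inverse of pack_offsets: 8 bytes per offset."""
    return list(struct.unpack('<%dq' % (len(raw) // 8), raw))
```

An offset beyond 4 GiB, such as 5000000000, round-trips through these
helpers but would not fit into a uint32.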

> >Should indexes be converted to a bitfield. To allow for storing
> >additional settings in the future?

> Uh?  Can you explain with a bit more detail what you are proposing here?

Sure. What I mean is to have a bitfield, similar to the bitfield in the
blosc header. The first bit would signify the presence of the offsets.
The other 7 bits would be reserved for signifying other things about the
compressed file, for example whether there are reserved, empty chunks
for expansion.
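
A minimal sketch of such a bitfield, assuming a single byte in which the
"first bit" is the least significant one; the constant and function
names are hypothetical:

```python
OFFSETS_PRESENT = 0x01  # bit 0: offsets are stored (bit position is an assumption)
RESERVED_MASK = 0xFE    # bits 1-7: reserved for future settings


def encode_options(offsets_present: bool) -> int:
    """Build the one-byte options field; reserved bits stay zero."""
    return OFFSETS_PRESENT if offsets_present else 0


def has_offsets(options: int) -> bool:
    """Check the offsets flag, ignoring whatever the reserved bits hold."""
    return bool(options & OFFSETS_PRESENT)
```

Masking with a named constant means a reader keeps working if one of the
reserved bits is given a meaning later.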

Incidentally, we have not yet decided how to handle reserved chunks for
expansion. One way would be to store the number of additional chunks in
the header and simply add them after the last chunk. The size of each
would be chunk-size plus 16 bytes for the blosc header (in case the data
turns out to be non-compressible), plus space for the checksum if
requested. The main reason for adding them at file creation time is that
we can pre-allocate space for the offsets.
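
As a rough sketch of that size calculation (the function names and the
checksum_size parameter are illustrative; only the 16-byte blosc header
figure comes from the paragraph above):

```python
BLOSC_HEADER_LENGTH = 16  # bytes, worst case when data is non-compressible


def reserved_chunk_size(chunk_size: int, checksum_size: int = 0) -> int:
    """Bytes to set aside for one reserved chunk."""
    return chunk_size + BLOSC_HEADER_LENGTH + checksum_size


def reserved_space(n_reserved: int, chunk_size: int, checksum_size: int = 0) -> int:
    """Total bytes to pre-allocate for n_reserved empty chunks."""
    return n_reserved * reserved_chunk_size(chunk_size, checksum_size)
```

For a 1 MiB chunk-size and a hypothetical 4-byte checksum, each reserved
chunk would take 1048576 + 16 + 4 = 1048596 bytes.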

> >Do we want to store the blosc typesize, for example in the single
> >reserved byte? This would allow to calculate in which chunk to
> >find an element or a sequence of elements.

> Not sure about this.  For accessing the element in a chunk you will
> need to use Blosc to access data on it, and this typesize info is
> certainly in the Blosc header already.  So I'd say -0 to this.

Would the typesize not be needed for locating the chunk that contains
the items which need to be fetched? Something like:

    chunk_index = (item_index * typesize) // chunk_size

I vaguely remember having discussed this with you; but unfortunately I
do not remember the outcome :(
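
As a worked instance of the expression above, here is a small sketch
that also covers a sequence of items. It assumes chunk_size counts
uncompressed bytes and is a multiple of typesize; the function names are
illustrative:

```python
def chunk_index(item_index: int, typesize: int, chunk_size: int) -> int:
    """Index of the chunk holding the item at item_index."""
    return (item_index * typesize) // chunk_size


def chunks_for_range(start: int, stop: int, typesize: int, chunk_size: int) -> range:
    """Chunk indices touched by the items in [start, stop)."""
    if stop <= start:
        return range(0)
    first = chunk_index(start, typesize, chunk_size)
    # Last byte of the last requested item decides the final chunk.
    last = (stop * typesize - 1) // chunk_size
    return range(first, last + 1)
```

With 8-byte items and 1 MiB chunks, item 300000 starts at byte 2400000
and therefore lives in chunk 2, while items 131071 and 131072 straddle
the boundary between chunks 0 and 1.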

I'll now proceed to update the header specification to include the
results from this discussion.

best,

V-


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Sat, 30 Jun 2012 17:00:15 +0200
Local: Sat, Jun 30 2012 11:00 am
Subject: Re: [blosc] RFC for the new bloscpack header
Hi,

* Valentin Haenel <valentin.hae...@gmx.de> [2012-06-30]:

> > >Is the index as uint32 really big enough? If the file-size is kept
> > >as int64, the indices would be too small to index the entire file.

> > Good point.  Yes, I think indexes (should we call them offsets
> > better?) should be int64 too.

> Okay for calling them offsets and int64. I propose to use '-1' to
> denote an unknown offset.

https://gist.github.com/3006723/ec179e3b7bdffdb65dc1799fbf5aa141ad6288c9

V-


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Sat, 30 Jun 2012 17:14:48 +0200
Local: Sat, Jun 30 2012 11:14 am
Subject: Re: [blosc] RFC for the new bloscpack header
* Valentin Haenel <valentin.hae...@gmx.de> [2012-06-30]:

> * Francesc Alted <fal...@gmail.com> [2012-06-28]:
> > >The file-size in the header could be potentially replaced with the
> > >last-chunk-size which would be an int32 and thus 4 bytes smaller.
> > >The total file-size could then be inferred from the three values
> > >nchunks, chunk-size and last-chunk-size.

> > I don't care too much about this, so +0 here.

> Okay. Since you are +0 I'll update this.

https://gist.github.com/3006723/9834af520257db7b7d4aaeb4af2ce9cdf5662fc9

(the syntax is a bit foobared, will fix this shortly)

V-


 
From: Francesc Alted <fal...@gmail.com>
Date: Fri, 06 Jul 2012 11:30:55 +0200
Local: Fri, Jul 6 2012 5:30 am
Subject: Re: [blosc] RFC for the new bloscpack header
Hey Valentin, I forgot to answer these, sorry.

On 6/30/12 4:40 PM, Valentin Haenel wrote:

>>> Is the index as uint32 really big enough? If the file-size is kept
>>> as int64, the indices would be too small to index the entire file.
>> Good point.  Yes, I think indexes (should we call them offsets
>> better?) should be int64 too.
> Okay for calling them offsets and int64. I propose to use '-1' to
> denote an unknown offset.

That's fine with me.

>>> Should indexes be converted to a bitfield. To allow for storing
>>> additional settings in the future?
>> Uh?  Can you explain with a bit more detail what you are proposing here?
> Sure. What I mean is to have a bitfield, similar to the bitfield in the
> blosc header. The first bit would signify the presence of the offsets.
> The other 7 bits would be reserved for signifying other things about the
> compressed file. For example, if there are reserved, empty chunks for
> expansion.

That's a good idea.  +1 for including such a bitfield.

> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way, would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time, is that we can
> pre-allocate space for the offsets.

Definitely, pre-allocating space and then filling the offset info would
be the way to go, IMO.

>>> Do we want to store the blosc typesize, for example in the single
>>> reserved byte? This would allow to calculate in which chunk to
>>> find an element or a sequence of elements.
>> Not sure about this.  For accessing the element in a chunk you will
>> need to use Blosc to access data on it, and this typesize info is
>> certainly in the Blosc header already.  So I'd say -0 to this.
> Would the typesize not be needed for locating the chunk that contains
> the items which need to be fetched? Something like:

>      chunk_index = (item_index * typesize) // chunk_size

> I vaguely remember having discussed this with you; but unfortunately I
> do not remember the outcome :(

Good point.  So I change my mind to a clear +1 on this.

--
Francesc Alted


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Sat, 7 Jul 2012 18:40:18 +0200
Local: Sat, Jul 7 2012 12:40 pm
Subject: Re: [blosc] RFC for the new bloscpack header
Ola Francesc,

* Francesc Alted <fal...@gmail.com> [2012-07-06]:

> Hey Valentin, I forgot to answer these, sorry.

No problem! :)

> >>>Should indexes be converted to a bitfield. To allow for storing
> >>>additional settings in the future?
> >>Uh?  Can you explain with a bit more detail what you are proposing here?
> >Sure. What I mean is to have a bitfield, similar to the bitfield in the
> >blosc header. The first bit would signify the presence of the offsets.
> >The other 7 bits would be reserved for signifying other things about the
> >compressed file. For example, if there are reserved, empty chunks for
> >expansion.

> That's a good idea.  +1 for including such a bitfield.

OK. I have included it in the latest version.

> >Incidentally, we have not yet decided how to handle reserved chunks for
> >expansion. One way, would be to store the number of additional chunks in
> >the header and simply add them after the last chunk. The size would be
> >chunk-size plus 16 bytes for the blosc header (in case the data turns
> >out to be non-compressible), plus space for the checksum if requested.
> >The main reason for adding them at file creation time, is that we can
> >pre-allocate space for the offsets.

> Definitely, pre-allocating space and then filling the offset info
> would be the way to go, IMO.

Yeah I think so too. Though we would need to designate the number and
size of empty chunks perhaps? How exactly to handle the issue of
pre-allocated blocks is not clear to me.

> >>>Do we want to store the blosc typesize, for example in the single
> >>>reserved byte? This would allow to calculate in which chunk to
> >>>find an element or a sequence of elements.
> >>Not sure about this.  For accessing the element in a chunk you will
> >>need to use Blosc to access data on it, and this typesize info is
> >>certainly in the Blosc header already.  So I'd say -0 to this.
> >Would the typesize not be needed for locating the chunk that contains
> >the items which need to be fetched? Something like:

> >     chunk_index = (item_index * typesize) // chunk_size

> >I vaguely remember having discussed this with you; but unfortunately I
> >do not remember the outcome :(

> Good point.  So I change my mind to a clear +1 on this.

OK. I have included it in the latest version.

V-


 
From: Francesc Alted <fal...@gmail.com>
Date: Mon, 09 Jul 2012 11:29:32 +0200
Local: Mon, Jul 9 2012 5:29 am
Subject: Re: [blosc] RFC for the new bloscpack header
On 7/7/12 6:40 PM, Valentin Haenel wrote:

> Incidentally, we have not yet decided how to handle reserved chunks for
> expansion. One way, would be to store the number of additional chunks in
> the header and simply add them after the last chunk. The size would be
> chunk-size plus 16 bytes for the blosc header (in case the data turns
> out to be non-compressible), plus space for the checksum if requested.
> The main reason for adding them at file creation time, is that we can
> pre-allocate space for the offsets.
>> Definitely, pre-allocating space and then filling the offset info
>> would be the way to go, IMO.
> Yeah I think so too. Though we would need to designate the number and
> size of empty chunks perhaps? How exactly to handle the issue of
> pre-allocated blocks is not clear to me.

Empty chunks?  Why do you want to reserve empty chunks?  I don't get it, sorry.

--
Francesc Alted


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Mon, 9 Jul 2012 11:39:07 +0200
Local: Mon, Jul 9 2012 5:39 am
Subject: Re: [blosc] RFC for the new bloscpack header
* Francesc Alted <fal...@gmail.com> [2012-07-09]:

The empty chunks are to be used in the file-format as pre-allocated
space which can be used to enlarge the file without having to copy it. Or
do we not need to pre-allocate the space in the file, and can we just
grow the file as we add new chunks to it?

V-


 
From: Francesc Alted <fal...@gmail.com>
Date: Mon, 09 Jul 2012 11:49:39 +0200
Local: Mon, Jul 9 2012 5:49 am
Subject: Re: [blosc] RFC for the new bloscpack header
On 7/9/12 11:39 AM, Valentin Haenel wrote:

Well, my plan was to use the bloscpack format just to keep a *fixed*
number of chunks.  In case we want to enlarge a dataset, my plan was to
add a new file.  You know, pre-allocating in the same file will always
be tricky.  Let's leverage the filesystem capabilities for doing this.

--
Francesc Alted


 
From: Valentin Haenel <valentin.hae...@gmx.de>
Date: Mon, 9 Jul 2012 13:18:54 +0200
Local: Mon, Jul 9 2012 7:18 am
Subject: Re: [blosc] RFC for the new bloscpack header
Ola!

* Francesc Alted <fal...@gmail.com> [2012-07-09]:

Agreed, this also keeps the format simple. So, all remaining questions
regarding the file-format have been cleared up.  Hence I will proceed to
implement the new format for bloscpack very soon.

V-


 