usage question

Brent Pedersen

May 11, 2020, 4:48:43 PM
to blosc
Hi, thanks for developing blosc(2).
I wondered if I could get some feedback on a use-case I have in mind for blosc.
The data is genomic data. I currently encode chromosome, position within the chromosome, reference allele, and alternate allele into a single uint64 (where reference and alternate are each A, C, T, or G). For each uint64, I also store several (2-10) int32 or float32 values, depending on user input. The final tool is for annotation and user access is always in sorted order, so I currently decompress an entire chromosome into memory, do a binary search on the uint64 position array to get the index, and then use that index to grab the int32/float32 values. The problem is that even a single chromosome could have ~1 billion positions (including all possible alts), and that's too much to keep in memory. So, I'd like to use blosc.
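
To make the lookup described above concrete, here is a minimal Python sketch. The bit layout (8 bits for chromosome, 32 for position, 2 each for ref and alt) and the names `encode_variant` and `find_index` are assumptions made for illustration; they are not the layout used by the actual tool.

```python
from bisect import bisect_left

BASES = "ACGT"  # 2 bits per allele (assumes single-base ref/alt)

def encode_variant(chrom: int, pos: int, ref: str, alt: str) -> int:
    """Pack (chrom, pos, ref, alt) into one integer that fits in a uint64.

    Assumed layout, most significant first: chrom (8 bits), pos (32 bits),
    ref (2 bits), alt (2 bits). Sorting the keys sorts by chrom, then pos.
    """
    assert 0 <= chrom < 2**8 and 0 <= pos < 2**32
    return (chrom << 36) | (pos << 4) | (BASES.index(ref) << 2) | BASES.index(alt)

def find_index(keys, chrom, pos, ref, alt):
    """Binary-search a sorted key list; return the index, or -1 if absent."""
    key = encode_variant(chrom, pos, ref, alt)
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted(encode_variant(*v) for v in [
    (1, 100, "A", "C"), (1, 100, "A", "G"), (1, 250, "T", "C"), (2, 7, "G", "A"),
])
scores = [0.1, 0.2, 0.3, 0.4]  # one float "column", aligned with keys

idx = find_index(keys, 1, 250, "T", "C")
```

The index found in the key array (`idx`) is then reused unchanged in every value column, e.g. `scores[idx]`.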

From what I understand of the new API, for each chromosome, I could create a new superchunk for every e.g. 10 million positions. Then, if position 70,000,001 is requested, I'd decompress chunk 7 (counting from zero) and keep it in memory until reaching 80 million (or higher). I would have a different frame for each of the 2-10 float32/int32 values. Those would also be decompressed as the position crosses a 10 million boundary. With a chunk size of 10 million, I'd have a max of ~1.44 GB in memory, assuming 10 values and 3 entries for every genomic position.
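
The chunk bookkeeping and the memory estimate above can be sketched as follows. This is only an illustration under stated assumptions: `ChunkCache` and its `load_chunk` callable are hypothetical names, and the loader below is a stand-in for real decompression rather than a Blosc call. The memory figure uses one reading that reproduces the 1.44 GB number: each entry holds one 8-byte key plus ten 4-byte values.

```python
CHUNK_ITEMS = 10_000_000  # entries per chunk, as in the post

def chunk_of(index: int, chunk_items: int = CHUNK_ITEMS):
    """Return (chunk number, offset inside that chunk) for a 0-based index."""
    return divmod(index, chunk_items)

class ChunkCache:
    """Keep only the most recently used chunk decompressed in memory."""

    def __init__(self, load_chunk, chunk_items=CHUNK_ITEMS):
        self.load_chunk = load_chunk  # hypothetical callable: nchunk -> sequence
        self.chunk_items = chunk_items
        self.nchunk = None
        self.data = None
        self.loads = 0  # how many times a chunk had to be (re)loaded

    def __getitem__(self, index):
        nchunk, offset = chunk_of(index, self.chunk_items)
        if nchunk != self.nchunk:  # crossed a chunk boundary: swap chunks
            self.data = self.load_chunk(nchunk)
            self.nchunk = nchunk
            self.loads += 1
        return self.data[offset]

# Stand-in for real decompression: chunk n holds the values n*4 .. n*4+3.
cache = ChunkCache(lambda n: list(range(n * 4, n * 4 + 4)), chunk_items=4)
values = [cache[i] for i in (0, 1, 5, 6, 9)]

# Memory estimate: 10 M positions x 3 entries, each entry holding
# one uint64 key plus ten 4-byte values.
peak_bytes = 10_000_000 * 3 * (8 + 10 * 4)
```

Because access is sorted, each chunk is loaded once and then reused until the index moves past its end, which is why a single-slot cache is enough here.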

I think an alternative would be to use blosc_getitem with a largish nitems. But then I'd still need to track the superchunks as they relate to genomic positions.

Does this seem like a reasonable use of blosc2? Is there some machinery in blosc that will handle decompressing new chunks as they are needed (and re-using the currently decompressed chunk)?


Thanks for any feedback,
-Brent

Francesc Alted

unread,
May 13, 2020, 4:27:59 AM5/13/20
to Blosc
Hi Brent,

If I get this correctly, yes, I think creating persistent frames with chunks of 10 million elements is perfectly fine for your binary search.  Having said this, I'd suggest having a look at caterva (https://github.com/Blosc/Caterva), a multidimensional array container that uses c-blosc2 behind the scenes, so more of the boilerplate is done for you. By making use of slicing in caterva (https://caterva.readthedocs.io/en/latest/reference/array.html#slicing) you can access an element in a caterva array very easily (i.e. you don't need to walk over the different chunks and call getitem() at the end).  Also, there is no need to decompress an entire chunk, but only the block that contains the interesting information; caterva will take care of that too.

Although the API for Caterva is well documented (https://caterva.readthedocs.io/), we don't have an example section yet, but you can have a look at the code in the test suite:


Finally, note that the API for Caterva is still kind of alpha, so there is no guarantee that we don't change it in the future (although we recently did a big API refactoring and we are happy with that so far).

Hope this helps,
Francesc


Brent Pedersen

May 14, 2020, 6:01:36 PM
to bl...@googlegroups.com
Hi Francesc, thanks for the reply.
Re caterva, does it support a container with, for example, a uint64
col and a mix of int32 and float32 columns? The examples I see are
all for the same data-type. I wasn't sure if it's actually like a
column store, or if it's compressing the (homogeneous) N-D array as
contiguous data.

My main use-case is query-by-position and the position data (uint64)
is sparse, so I'll need to do a binary search to find the index, then
use that index in the other float32/int32 "columns". So maybe I could
store position as a blosc2 frame, and then have a caterva structure
for the float32s and one for the int32s?

-Brent

Francesc Alted

May 15, 2020, 3:16:11 AM
to Blosc
Yes, caterva is a homogeneously *sized* N-D array stored in chunks, and it does not have the notion of data types (only the typesize).  With that, you can store a struct with fields and pass the size of the struct as the `typesize`.  As you retrieve data, you retrieve the whole struct and it is up to you to unpack it correctly.  So yes, you could have a couple of frames, one for the position data and the other for the struct (by the way, as your struct is 8 bytes long, caterva can still make use of an accelerated path for the shuffle or bitshuffle filters).
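
As a rough illustration of "store a struct and unpack it yourself", the sketch below uses only Python's standard `struct` module; no Caterva or Blosc call is involved. The record layout (one int32 followed by one float32, little-endian) and the helper names are assumptions chosen to match the 8-byte case mentioned above.

```python
import struct

# One item = int32 followed by float32, little-endian, no padding: 8 bytes.
RECORD = struct.Struct("<if")
typesize = RECORD.size  # the value one would pass as the container's typesize

def pack_records(rows):
    """Serialize (int, float) rows into one contiguous buffer."""
    return b"".join(RECORD.pack(i, f) for i, f in rows)

def unpack_record(buf, index):
    """Read item `index` back out of the buffer as (int32, float32)."""
    return RECORD.unpack_from(buf, index * typesize)

buf = pack_records([(7, 0.5), (-3, 2.25), (42, -1.0)])
depth, score = unpack_record(buf, 1)
```

The container only ever sees opaque 8-byte items; the interpretation as two fields lives entirely in `RECORD`.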

Francesc

Brent Pedersen

Jun 19, 2020, 4:01:07 PM
to bl...@googlegroups.com
Hi Francesc, thanks for the reply. I'm slowly working on this and have
encountered a couple of questions.
I am writing a blosc2 wrapper for the Nim language.

For just blosc2_contexts, I can see how to get the size of a
compressed block and set my output size accordingly:
https://github.com/brentp/blosc2-nim/blob/2efd19807cb52f71280c7d9ffa345c87e489101b/src/blosc2.nim#L302

however, I don't see how to get the uncompressed size if my data is in
chunk_i within a (frame-backed) schunk. I see that given a
frame-backed schunk,
I can't access schunk->data directly. Is there an API function or
example I am missing?

thanks,
-Brent

Brent Pedersen

Jun 29, 2020, 11:59:44 AM
to bl...@googlegroups.com
Hi again, I'd appreciate any help on this. Thank you.

Francesc Alted

Jun 29, 2020, 1:46:10 PM
to Blosc
Hi Brent,

On Fri, May 15, 2020 at 12:01 AM Brent Pedersen <bped...@gmail.com> wrote:
> Hi Francesc, thanks for the reply.
> Re caterva, does it support a container with for example a uint64
> col, and a mix of int32 and float32 columns? The examples I see are
> all for the same data-type. I wasn't sure if it's actually like a
> column store, or if it's compressing the (homogeneous) N-D array as
> continuous data.

Well, actually Caterva is type agnostic, so you just declare a container having items of a certain size.  This size can be either for homogeneous or heterogeneous data (e.g. 8 bytes can be interpreted as either 1 double, or 1 int32 + 1 float32).  Then, by using [caterva_array_get_slice_buffer](https://github.com/Blosc/Caterva/blob/master/caterva/caterva.h#L449-L463), you can retrieve slices and interpret the buffer as you wish.  Such a type-agnostic feature is by design and allows Caterva to be much simpler and more efficient.
 

> My main use-case is query-by-position and the position data (uint64)
> is sparse, so i'll need to do binary search to find the index. Then
> use that
> index in the other float32/int32 "columns". So maybe I could store
> position as a blosc2 frame, and then have a caterva structure for the
> float32s and one for
> the int32s?

Sounds reasonable to me, or as said above, if you are always interested in retrieving the float32 and int32 fields in one go, you can store items of size 8 and then map them as an array of structs once retrieved.  Another approach is to create a large dataset with your float32/int32 columns in the 'interesting' uint64 positions and fill with zeros all the rest.  Compression will get rid of the zeros quite efficiently.  This way you can avoid the binary search step.  Of course, this will only be feasible for a range of position data that is 'reasonably' large (i.e. something that does not exceed some tens of billions, depending on your hardware).
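
The zero-filled alternative can be illustrated with a toy example. Note the substitution: `zlib` from the Python standard library stands in for Blosc here, purely to show that long runs of zeros compress well and that the position can then serve directly as the index. The range size, the sample values, and the `lookup` helper are made up for the sketch.

```python
import struct
import zlib

ITEM = struct.Struct("<f")   # one float32 per position
N_POSITIONS = 1_000_000      # size of the (toy) coordinate range

# Sparse input: position -> value. Every other slot stays zero.
sparse = {12: 0.5, 70_001: 1.25, 999_999: -2.0}

dense = bytearray(N_POSITIONS * ITEM.size)  # zero-initialized
for pos, value in sparse.items():
    ITEM.pack_into(dense, pos * ITEM.size, value)

# zlib stands in for Blosc only to show that runs of zeros shrink well.
compressed = zlib.compress(bytes(dense))

def lookup(buf, pos):
    """Direct indexing: no binary search, the position is the index."""
    return ITEM.unpack_from(buf, pos * ITEM.size)[0]

restored = zlib.decompress(compressed)
value = lookup(restored, 70_001)
```

One caveat with this layout is that a stored value of exactly zero cannot be told apart from "no data at this position" unless a separate sentinel or mask is used.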

Hope this helps,

Francesc
 

Francesc Alted

Jun 29, 2020, 1:46:51 PM
to Blosc
Thanks for the reminder.  I did read your message before and I forgot to reply.

Brent Pedersen

Jun 29, 2020, 2:21:03 PM
to bl...@googlegroups.com
Hi Francesc,
I have been top-posting (shame on me). So my latest question was actually:

Francesc Alted

Jun 29, 2020, 2:41:58 PM
to Blosc
Oops, stupid of me.

If I understand you correctly, you can use `blosc_cbuffer_sizes()` for getting the uncompressed size.  This is `nbytes` in the docstrings:

https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc2.h#L527-L544

Of course, you need first to access the chunk from the frame via [blosc2_schunk_get_chunk](https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc2.h#L964-L982).
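
For wrapper authors who want to see what those sizes are, the sketch below reads them straight from a chunk's leading bytes in Python. It rests on an assumption: that the chunk starts with the 16-byte header described for the Blosc1 format (four single-byte fields, then `nbytes`, `blocksize`, and the compressed size as little-endian uint32). The function name `cbuffer_sizes` is illustrative and the chunk is hand-built, not real compressed data; in C, `blosc_cbuffer_sizes()` remains the supported way to get these values.

```python
import struct

# Assumed header layout: version, versionlz, flags, typesize (1 byte each),
# then nbytes, blocksize, cbytes as little-endian uint32. Total: 16 bytes.
HEADER = struct.Struct("<BBBBIII")

def cbuffer_sizes(chunk: bytes):
    """Return (nbytes, cbytes, blocksize) read from a chunk header."""
    _ver, _verlz, _flags, _typesize, nbytes, blocksize, cbytes = (
        HEADER.unpack_from(chunk)
    )
    return nbytes, cbytes, blocksize

# Hand-built header for illustration; not a real compressed chunk.
fake_chunk = HEADER.pack(2, 1, 1, 8, 4_000_000, 262_144, 123_456)
nbytes, cbytes, blocksize = cbuffer_sizes(fake_chunk)
out = bytearray(nbytes)  # size the destination buffer before decompressing
```

The point for the frame-backed case is the order of operations: fetch the compressed chunk first, read `nbytes` from it, and only then allocate the output buffer.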

Does this help?
Francesc

 


Brent Pedersen

Jul 7, 2020, 11:45:21 AM
to bl...@googlegroups.com
On Mon, Jun 29, 2020 at 12:42 PM Francesc Alted <fal...@gmail.com> wrote:
>
> Ups, stupid of me.
>
> On Mon, Jun 29, 2020 at 8:21 PM Brent Pedersen <bped...@gmail.com> wrote:
>>
>> Hi Francesc,
>> I have been top-posting (shame on me). So my latest question was actually:
>>
>> For just blosc2_contexts, I can see how to get the size of a
>> compressed block and set my output size accordingly:
>> https://github.com/brentp/blosc2-nim/blob/2efd19807cb52f71280c7d9ffa345c87e489101b/src/blosc2.nim#L302
>>
>> however, I don't see how to get the uncompressed size if my data is in
>> chunk_i within a (frame-backed) schunk. I see that given a frame-backed schunk,
>> I can't access schunk->data directly. Is there an API function or
>> example I am missing?
>
>
> If I understand you correctly, you can use `blosc_cbuffer_sizes()` for getting the uncompressed size. This is `nbytes` in the docstrings:
>
> https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc2.h#L527-L544
>
> Of course, you need first to access the chunk from the frame via [blosc2_schunk_get_chunk](https://github.com/Blosc/c-blosc2/blob/master/blosc/blosc2.h#L964-L982).
>
> Does this help?
> Francesc
>

Yes, this works. Thank you.
I just want to clarify one more thing. You noted that a caterva
container must be of a homogeneous type (well, or at least size).
Is this true of a frame also? I thought that I might add an int32
superchunk, followed by an int64 superchunk, but I get an error with
that.
So, just to verify, I'll need a separate frame for each data size,
correct? E.g. a 4-byte frame for int32 and float32, and an 8-byte
frame for int64 and float64, yes?

thanks again for your help, and the software. I have something that
mostly works now.
-Brent


Francesc Alted

Jul 7, 2020, 12:06:08 PM
to Blosc
That's mostly correct.  The itemsize cannot change inside a superchunk or a frame because a fixed itemsize is critical for being able to access values quickly.  As said, in a single superchunk/frame you can store heterogeneous data (e.g. 4 bytes followed by 8 bytes, for a total of 12-byte items, although power-of-2 itemsizes are recommended for efficiency), or you can have a list of them with different itemsizes (e.g. one frame made of 4-byte items and another one made of 8-byte items).  It depends on the requirements of the problem.

Finally, a superchunk is very similar to a frame, but the storage for the former is sparse (essentially, a list of chunks), whereas the latter is completely sequential (hence, apt for storing in a file).  Generally speaking, you want a superchunk for storing data in memory and a frame for storing data persistently in a file.
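
One way to see why a fixed itemsize matters for fast access: with it, the location of any item is pure arithmetic. The sketch below is illustrative only; `locate` is a hypothetical helper, and the assumption that every chunk has the same uncompressed size (a multiple of the itemsize) is mine, not something stated in this thread.

```python
def locate(item_index: int, itemsize: int, chunk_nbytes: int):
    """Map an item index to (chunk number, byte offset inside that chunk).

    Works only because every item has the same size; chunk_nbytes is the
    uncompressed chunk size and is assumed to be a multiple of itemsize.
    """
    assert chunk_nbytes % itemsize == 0
    items_per_chunk = chunk_nbytes // itemsize
    nchunk, within = divmod(item_index, items_per_chunk)
    return nchunk, within * itemsize

a = locate(1_000_000, itemsize=4, chunk_nbytes=1 << 20)  # 4-byte items
b = locate(1_000_000, itemsize=8, chunk_nbytes=1 << 20)  # 8-byte items
```

The same item index lands in different chunks for the two itemsizes, so mixing 4-byte and 8-byte items in one container would leave no single formula for finding item `i`.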
 

> thanks again for your help, and the software. I have something that
> mostly works now.

Cool!  As C-Blosc2/Caterva are relatively new and not yet ready for production (mainly because the format might change, although the core is quite well tested), we would be grateful if you could send some pointers to what you are doing so that we can link back to it from our sites.

Francesc

 

Brent Pedersen

Jul 7, 2020, 2:59:50 PM
to bl...@googlegroups.com
>
>
> That's mostly correct. The itemsize cannot change inside a superchunk or a frame because this is critical for being able to access values quickly. As said, on a single superchunk/frame you can store either heterogeneous data (e.g. 4-bytes followed by 8-bytes, for a total of 12-byte items, although powers of 2 itemsizes are recommended for efficiency) or, you can have a list of them with different itemsizes (e.g. a frame made of 4-byte items and then another one made by 8-byte items). It is up to the requirements of the problem.
>
> Finally, a superchunk is very similar to a frame, but the storage for the former is sparse (essentially, a list of chunks), whereas the latter is completely sequential (hence, apt for storing it in a file). Generally speaking you want a superchunk for storing data in-memory and a frame for storing data on a file on a persistent way.
>
>>
>>
>> thanks again for your help, and the software. I have something that
>> mostly works now.
>
>
> Cool! As C-Blosc2/Caterva are relatively new and still not apt for production (mainly because the format might change, but the core is quite well tested), we would be grateful if you can send some pointers to what you are doing for back-reference from our sites.
>

For now I have this: https://github.com/brentp/blosc2-nim . Once I
have the actual software that does what I want, I'll send an email.

