Columnar data ingest/export


Leo Meyerovich

Sep 15, 2017, 11:57:35 AM
to cstore users
Hi, we're examining how to get Apache Arrow columnar format data quickly in and out of cstore. A v1 might look something like this: externally convert Arrow into ORC, load the ORC directly into cstore, compute, write ORC results to disk, and convert the ORC back to Arrow.

Any pointers on APIs or SQL we could use to do this?

Thanks!

Murat Tuncer

Sep 15, 2017, 12:40:39 PM
to Leo Meyerovich, cstore users
Hello Leo,

The cstore_fdw file format is not 100% ORC compliant. Even if you are able to convert your data into ORC, cstore might not be able to read it.

I think your best bet is to export into a CSV file and import that into cstore using COPY.

Could you explain a bit about your use case? Why do you need two-way conversion here?
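As a rough illustration of the CSV route suggested here, the sketch below turns in-memory columnar data into CSV text and builds a matching COPY statement. It uses only the Python standard library; the table name `reviews`, the column names, and the helper names `columns_to_csv` and `copy_statement` are hypothetical, and the plain dict of lists merely stands in for whatever columnar structure (such as an Arrow table) is actually in use.

```python
import csv
import io


def columns_to_csv(columns):
    """Serialize a dict of equal-length columns into CSV text, one row per line."""
    names = list(columns)
    lengths = {len(columns[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # zip(*...) transposes the column-major data into the rows COPY expects.
    writer.writerows(zip(*(columns[name] for name in names)))
    return buf.getvalue()


def copy_statement(table, names):
    """Build a COPY ... FROM STDIN statement for the given table and columns."""
    return f"COPY {table} ({', '.join(names)}) FROM STDIN WITH CSV"


# Hypothetical data and table name, for illustration only.
data = {"id": [1, 2], "name": ["a", "b,c"]}
csv_text = columns_to_csv(data)
statement = copy_statement("reviews", list(data))
```

With a driver such as psycopg2 (an assumption, the thread does not name one), the text could then be streamed to the server with `cursor.copy_expert(statement, io.StringIO(csv_text))`. This still pays the cost of text serialization and parsing on both sides.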

Murat




--
You received this message because you are subscribed to the Google Groups "cstore users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cstore-users+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Murat Tuncer
Software Engineer | Citus Data
mtu...@citusdata.com

Leo Meyerovich

Sep 15, 2017, 12:49:34 PM
to cstore users
Hi Murat, thanks for the quick reply!

We are trying to use postgres/cstore as a phase in a bursty compute pipeline, not as the persistent data store. We already have the columnar data, so we are looking to get it in, compute, and get it out quickly. Converting to CSV and loading that is proving slow: more than 1-3s for 100K-10M rows. We can control what the Arrow types are, so if cstore has a predictable subset of ORC and known APIs for load/export, that may get us far. Or maybe postgres has another form of fast bulk columnar import...

Murat Tuncer

Sep 15, 2017, 1:06:56 PM
to Leo Meyerovich, cstore users
cstore uses a two-layer approach to store data:

1 - blocks
- A column block contains 10K rows of data for a single column. The column blocks are stored next to one another. Compression is done at the block level, and each block also has a header.

2 - stripes
- A stripe contains the blocks for 150K (configurable) rows. Each stripe has header/footer metadata.

A separate file (cstore.footer) stores metadata about where each stripe starts and how long it is.
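To make the row arithmetic of this layout concrete, here is a small sketch that works out how many stripes a table of a given size would need and how many blocks each column gets per stripe. It models row counts only, not the on-disk byte format, and it assumes the final stripe and final block simply hold whatever rows remain. The function and parameter names are hypothetical.

```python
import math

BLOCK_ROWS = 10_000    # rows per column block, as described in the post
STRIPE_ROWS = 150_000  # rows per stripe (configurable, per the post)


def layout(total_rows, stripe_rows=STRIPE_ROWS, block_rows=BLOCK_ROWS):
    """Return one (rows_in_stripe, blocks_per_column) tuple per stripe."""
    stripes = []
    remaining = total_rows
    while remaining > 0:
        rows = min(remaining, stripe_rows)
        # A partially filled block still counts as a block, hence the ceiling.
        stripes.append((rows, math.ceil(rows / block_rows)))
        remaining -= rows
    return stripes


# 1M rows: six full stripes of 15 blocks per column, plus one stripe of 10.
example = layout(1_000_000)
```

Under these assumptions, a 1M-row load spans 7 stripes and 100 blocks per column, so a writer that reuses the serialization code would emit 7 stripe entries in the footer metadata.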

You can take a look at the cstore serialization code and copy the relevant parts to produce data files. cstore_writer.c contains everything you need, though it would require some coding on your part.

Please let me know if you want to go down that path.


