data size blowing out when storing in bcolz


Daniel Mahler

Feb 26, 2016, 7:59:06 AM
to bcolz
I have a dataset with ~7M rows and 3 columns: 2 numeric and 1 consisting of ~20M distinct uuid strings.
The data takes around 3 GB as a CSV file, and castra can store it in about 2 GB.
I would like to test out bcolz with this data.

I tried

    odo(dask.dataframe.from_castra('data.castra'), 'data.bcolz')

which generated ~70G of data before exhausting inodes on the disk and crashing.

How should I import this data into bcolz?

thanks
Daniel

Francesc Alted

Feb 26, 2016, 10:50:06 AM
to Bcolz
70 GB?  Hmm, I suppose this is odo doing strange things.  For importing the CSV data directly into bcolz I would use the pandas CSV reader in conjunction with the ctable.append() method.  Look at:

https://github.com/Blosc/bcolz/blob/master/doc/tutorial_ctable.ipynb

for a tutorial on how to create and populate ctable objects in bcolz.
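
Something along these lines should give the idea (an untested sketch; the file name, chunk size and column handling are just placeholders, and note that a column of Python string objects will not store compactly, so a fixed-width string dtype works much better):

import bcolz
import pandas as pd

# stream the CSV with pandas and append chunk by chunk into an on-disk ctable
ct = None
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    if ct is None:
        # create the ctable on disk from the first chunk
        ct = bcolz.ctable.fromdataframe(chunk, rootdir='data.bcolz')
    else:
        ct.append([chunk[c].values for c in chunk.columns])
ct.flush()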

Francesc


Francesc Alted

Feb 26, 2016, 10:53:48 AM
to Bcolz
Also, as your tables are quite large, make sure to read the chapter on optimization tips at http://bcolz.blosc.org/opt-tips.html, and especially the "Informing about the length of your carrays" section.
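
For example (a toy sketch; 7000000 is just a stand-in for your actual row count):

import bcolz
import numpy as np

# telling bcolz how many elements to expect up front lets it pick a sensible
# chunk size instead of growing through lots of tiny chunks
col = bcolz.carray(np.empty(0, dtype='f8'), expectedlen=7000000,
                   rootdir='mycol.bcolz', mode='w')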

HTH
--
Francesc Alted

Daniel Mahler

Feb 26, 2016, 2:58:57 PM
to bcolz, fal...@gmail.com
Thanks Francesc. What is a good way to import from a dask dataframe (backed by castra)?

thanks
Daniel

Matthew Rocklin

Feb 26, 2016, 3:06:40 PM
to bc...@googlegroups.com, Francesc Alted
You could use the dask.dataframe.to_hdf function and then use bcolz code to migrate.  One could also add a to_bcolz function to dask.dataframe.  I won't personally have time to do this for the next few weeks though.
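
Roughly like this (untested sketch; the file names, HDF key and chunk size are made up):

import bcolz
import dask.dataframe as ddf
import pandas as pd

# 1) spill the dask dataframe to an HDF5 table on disk
df = ddf.from_castra('data.castra')
df.to_hdf('data.h5', '/data')

# 2) stream it back with pandas and append into a bcolz ctable
ct = None
for chunk in pd.read_hdf('data.h5', '/data', chunksize=1000000):
    if ct is None:
        ct = bcolz.ctable.fromdataframe(chunk, rootdir='data.bcolz')
    else:
        ct.append([chunk[c].values for c in chunk.columns])
ct.flush()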

Daniel Mahler

Feb 26, 2016, 4:03:21 PM
to bcolz, fal...@gmail.com
Hi Matthew

There is no to_bcolz on dask dataframe, only from_bcolz.

cheers
D

Matthew Rocklin

Feb 26, 2016, 4:04:23 PM
to bc...@googlegroups.com, Francesc Alted
Yes, that's why I said we could consider adding it.  I won't personally have time to do this for the next few weeks though.

Daniel Mahler

Feb 26, 2016, 4:08:43 PM
to bcolz, fal...@gmail.com
Sorry, I misread your message.
Long day/week ...

Matthew Rocklin

Feb 26, 2016, 4:10:27 PM
to bc...@googlegroups.com, Francesc Alted
Same.  No worries.

Daniel Mahler

Feb 26, 2016, 5:06:32 PM
to bcolz, fal...@gmail.com

The data actually fits in memory on my EC2 instance, so I tried to do the import via pandas.
I still get the same problem even when I incorporate the advice from the tips page:

df0 = ddf.from_castra("data.castra")
df = odo.odo(df0, pd.DataFrame)
cnames = df.columns.tolist()
cols = [df[c] for c in cnames]
bc = bcolz.ctable(cols, names=cnames, expectedlen=df.shape[0], rootdir="data.bcolz", cparams=bcolz.cparams(clevel=9))

The final statement seems to take forever and generates data several times larger than the input file.
It also generates over a million files.
I thought the tips page said that the expectedlen parameter should prevent that.
Am I doing something wrong?

thanks
Daniel

Daniel Mahler

Feb 26, 2016, 5:14:47 PM
to bcolz, fal...@gmail.com

> It also generates over a million files
Make that over 15M and counting.

Daniel Mahler

Feb 26, 2016, 5:50:35 PM
to bcolz, fal...@gmail.com
Does bcolz generate a file for every distinct value in a column?
It looks like the vast majority of the files are in the data subdirectory for the column that stores the uuid strings.
Is there a way to turn that off?

thanks
Daniel

Kilian Mie

Feb 26, 2016, 8:22:25 PM
to bcolz
Read the CSV in chunks via pandas.read_csv(), convert your string column from Python object dtype to a fixed-length numpy dtype, say 'S20', then append it as a numpy array to the ctable.

Also, set chunklen=1000000 (or similar) at ctable creation, which will avoid creating hundreds of files under the /data folder (probably not optimal for compression, though).

The 2 steps above worked well for me (20 million rows, 40-60 columns).
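
In sketch form (the column names, dtypes, file name and chunk size below are placeholders; a standard uuid string is 36 characters, hence 'S36' here):

import bcolz
import numpy as np
import pandas as pd

dt = np.dtype([('a', 'f4'), ('b', 'f4'), ('uuid', 'S36')])
ct = bcolz.zeros(0, dtype=dt, mode='w', chunklen=1000000, rootdir='data.bcolz')
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    rec = np.empty(len(chunk), dtype=dt)
    rec['a'] = chunk['a'].values
    rec['b'] = chunk['b'].values
    # fixed-width bytes instead of Python objects keeps the column compact
    rec['uuid'] = chunk['uuid'].values.astype('S36')
    ct.append(rec)
ct.flush()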

Kilian Mie

Feb 28, 2016, 1:14:04 PM
to bcolz, fal...@gmail.com
Daniel, try this:

import bcolz
import dask.dataframe as ddf
import numpy as np
import odo
import pandas as pd

df0 = ddf.from_castra("data.castra")
df = odo.odo(df0, pd.DataFrame)
names = df.columns.tolist()
types = ['float32', 'float32', 'S20']  # adjust 'S20' to your max string length needs
# convert each column to a fixed-width numpy dtype (no Python object columns)
cols = [bcolz.carray(df[c].values, dtype=dt) for c, dt in zip(names, types)]

ct = bcolz.zeros(0, dtype=np.dtype(list(zip(names, types))), mode='w', chunklen=1000000, rootdir="data.bcolz")
ct.append(cols)

Daniel Mahler

Feb 29, 2016, 3:22:38 PM
to kil...@gmail.com, fal...@gmail.com, bc...@googlegroups.com
Thanks Kilian!! That did it.