data size blowing out when storing in bcolz


Daniel Mahler

Feb 26, 2016, 7:59:06 AM
to bcolz
I have a dataset with ~7M rows and 3 columns: 2 numeric and 1 consisting of ~20M distinct uuid strings.
The data takes around 3 GB as a CSV file, and castra can store it in about 2 GB.
I would like to test out bcolz with this data.

I tried

    odo(dask.dataframe.from_castra('data.castra'), 'data.bcolz')

which generated ~70G of data before exhausting inodes on the disk and crashing.

How should I import this data into bcolz?

thanks
Daniel

Francesc Alted

Feb 26, 2016, 10:50:06 AM
to Bcolz
70 GB?  Hmm, I suppose this is odo doing strange things.  For importing the CSV data directly into bcolz I would use the pandas CSV reader in conjunction with the ctable.append() method.  Look at:

https://github.com/Blosc/bcolz/blob/master/doc/tutorial_ctable.ipynb

for a tutorial on how to create and populate ctable objects in bcolz.
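
Something along these lines should give the idea (an untested sketch; the file name, chunk size and column handling are just placeholders, and note that a column of Python string objects will not store compactly, so a fixed-width string dtype works much better):

import bcolz
import pandas as pd

# stream the CSV with pandas and append chunk by chunk into an on-disk ctable
ct = None
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    if ct is None:
        # create the ctable on disk from the first chunk
        ct = bcolz.ctable.fromdataframe(chunk, rootdir='data.bcolz')
    else:
        ct.append([chunk[c].values for c in chunk.columns])
ct.flush()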

Francesc


Francesc Alted

Feb 26, 2016, 10:53:48 AM
to Bcolz
Also, as your tables are quite large, make sure to read the chapter on optimization tips at http://bcolz.blosc.org/opt-tips.html, and especially the "Informing about the length of your carrays" section.
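
For example (a toy sketch; 7000000 is just a stand-in for your actual row count):

import bcolz
import numpy as np

# telling bcolz how many elements to expect up front lets it pick a sensible
# chunk size instead of growing through lots of tiny chunks
col = bcolz.carray(np.empty(0, dtype='f8'), expectedlen=7000000,
                   rootdir='mycol.bcolz', mode='w')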

HTH
--
Francesc Alted

Daniel Mahler

Feb 26, 2016, 2:58:57 PM
to bcolz, fal...@gmail.com
Thanks Francesc. What is a good way to import from a dask dataframe (backed by castra)?

thanks
Daniel

Matthew Rocklin

Feb 26, 2016, 3:06:40 PM
to bc...@googlegroups.com, Francesc Alted
You could use the dask.dataframe.to_hdf function and then use bcolz code to migrate.  One could also add a to_bcolz function to dask.dataframe.  I won't personally have time to do this for the next few weeks though.
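
Roughly like this (untested sketch; the file names, HDF key and chunk size are made up):

import bcolz
import dask.dataframe as ddf
import pandas as pd

# 1) spill the dask dataframe to an HDF5 table on disk
df = ddf.from_castra('data.castra')
df.to_hdf('data.h5', '/data')

# 2) stream it back with pandas and append into a bcolz ctable
ct = None
for chunk in pd.read_hdf('data.h5', '/data', chunksize=1000000):
    if ct is None:
        ct = bcolz.ctable.fromdataframe(chunk, rootdir='data.bcolz')
    else:
        ct.append([chunk[c].values for c in chunk.columns])
ct.flush()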

Daniel Mahler

Feb 26, 2016, 4:03:21 PM
to bcolz, fal...@gmail.com
Hi Matthew

There is no to_bcolz on dask dataframe, only from_bcolz.

cheers
D

Matthew Rocklin

Feb 26, 2016, 4:04:23 PM
to bc...@googlegroups.com, Francesc Alted
Yes, that's why I said we could consider adding it.  I won't personally have time to do this for the next few weeks though.

Daniel Mahler

Feb 26, 2016, 4:08:43 PM
to bcolz, fal...@gmail.com
Sorry, I misread your message.
Long day/week ...

Matthew Rocklin

Feb 26, 2016, 4:10:27 PM
to bc...@googlegroups.com, Francesc Alted
Same.  No worries.

Daniel Mahler

Feb 26, 2016, 5:06:32 PM
to bcolz, fal...@gmail.com

The data actually fits in memory on my EC2 instance, so I tried to do the import via pandas.
I still get the same problem even when I incorporate the advice from the tips page:

df0 = ddf.from_castra("data.castra")
df = odo.odo(df0, pd.DataFrame)
cnames = df.columns.tolist()
cols = [df[c] for c in cnames]
bc = bcolz.ctable(cols, names=cnames, expectedlen=df.shape[0], rootdir="data.bcolz", cparams=bcolz.cparams(clevel=9))

The final statement seems to take forever and generates data several times larger than the input file.
It also generates over a million files.
I thought the tips page said that the expectedlen parameter should prevent that.
Am I doing something wrong?

thanks
Daniel

Daniel Mahler

Feb 26, 2016, 5:14:47 PM
to bcolz, fal...@gmail.com

> It also generates over a million files
Make that over 15M and counting.

Daniel Mahler

Feb 26, 2016, 5:50:35 PM
to bcolz, fal...@gmail.com
Does bcolz generate a file for every distinct value in a column?
It looks like the vast majority of the files are in the data subdirectory for the column that stores the uuid strings.
Is there a way to turn that off?

thanks
Daniel

Kilian Mie

Feb 26, 2016, 8:22:25 PM
to bcolz
Read the CSV in chunks via pandas.read_csv(), convert your string column from Python object dtype to a fixed-length numpy dtype, say 'S20', then append it as a numpy array to the ctable.

Also, set chunklen=1000000 (or similar) at ctable creation, which will avoid creating hundreds of files under the /data folder (probably not optimal for compression, though).

The 2 steps above worked well for me (20 million rows, 40-60 columns).
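
In sketch form (the column names, dtypes, file name and chunk size below are placeholders; a standard uuid string is 36 characters, hence 'S36' here):

import bcolz
import numpy as np
import pandas as pd

dt = np.dtype([('a', 'f4'), ('b', 'f4'), ('uuid', 'S36')])
ct = bcolz.zeros(0, dtype=dt, mode='w', chunklen=1000000, rootdir='data.bcolz')
for chunk in pd.read_csv('data.csv', chunksize=1000000):
    rec = np.empty(len(chunk), dtype=dt)
    rec['a'] = chunk['a'].values
    rec['b'] = chunk['b'].values
    # fixed-width bytes instead of Python objects keeps the column compact
    rec['uuid'] = chunk['uuid'].values.astype('S36')
    ct.append(rec)
ct.flush()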

Kilian Mie

Feb 28, 2016, 1:14:04 PM
to bcolz, fal...@gmail.com
Daniel, try this:

import bcolz
import dask.dataframe as ddf
import numpy as np
import odo
import pandas as pd

df0 = ddf.from_castra("data.castra")
df = odo.odo(df0, pd.DataFrame)
names = df.columns.tolist()
types = ['float32', 'float32', 'S20']  # adjust 'S20' to your max string length needs
# convert each column to a fixed-width numpy dtype (no Python object columns)
cols = [bcolz.carray(df[c].values, dtype=dt) for c, dt in zip(names, types)]

ct = bcolz.zeros(0, dtype=np.dtype(list(zip(names, types))), mode='w', chunklen=1000000, rootdir="data.bcolz")
ct.append(cols)

Daniel Mahler

Feb 29, 2016, 3:22:38 PM
to kil...@gmail.com, fal...@gmail.com, bc...@googlegroups.com
Thanks Kilian!! That did it.