Benchmark for some CSV readers


Dilawar Singh

Sep 10, 2016, 2:44:50 AM
to Web and Coding Club IIT Bombay

CSV is the simplest format for storing data. Compression (gzip) can reduce the file size to roughly 20% of the original (sometimes comparable to a binary format). CSV is easy to visualize, and it is easy to look at the real values during the test phase. HDF is a binary format with extremely good support (MATLAB's default .mat files are HDF5 files).
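As a quick sanity check of the compression claim, a file can be gzipped and compared from Python (the file name `data.csv` is a placeholder, not a file from this thread):

    import gzip
    import os
    import shutil

    # Compress the CSV and compare on-disk sizes.
    with open("data.csv", "rb") as src, gzip.open("data.csv.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    ratio = os.path.getsize("data.csv.gz") / os.path.getsize("data.csv")
    print(f"compressed size is {ratio:.0%} of the original")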


On large CSV files (from ~10000 lines up to GB size), it becomes a pain to load data using Python's built-in csv module. NumPy helps a lot, but its performance also degrades. Haskell still does not have great out-of-the-box support for CSV.


For Haskell, I implemented my own CSV reader using the cassava library (all values are parsed as `Double` by default, except the header). The Parsec-based reader from the MissingH library was taking too long (~17 seconds) for a file with 43200 lines. I compared the result with the Python NumPy and pandas CSV readers. Below is a rough comparison for a file with ~43000 lines (20 columns of floating-point values in scientific notation, %g).


    cassava (ignoring `#` comments)        3.3 sec
    cassava (no `#` comment handling)      2.7 sec
    numpy loadtxt                        > 10 sec
    pandas read_csv                        1.5 sec
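For reference, here is a minimal sketch of how the Python side of these timings could be reproduced (the file name `data.csv` and the one-shot timing harness are my assumptions, not the original benchmark script):

    import time
    import numpy as np
    import pandas as pd

    def timed(label, fn):
        # One run is enough for a rough comparison like the table above.
        t0 = time.time()
        fn()
        print(f"{label:20s} {time.time() - t0:.2f} sec")

    # Assumed input: ~43000 rows, 20 float columns, '#' comment lines.
    timed("numpy loadtxt", lambda: np.loadtxt("data.csv", delimiter=",", comments="#"))
    timed("pandas read_csv", lambda: pd.read_csv("data.csv", comment="#", header=None))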

pandas does really well at reading CSV files. Also, pandas and NumPy play along quite well together.

I was hoping that my CSV reader would do better than pandas, but it didn't. It still beats the Parsec-based reader hands down. The code is here: https://github.com/dilawar/HBatteries/blob/master/src/HBatteries/CSV.hs. However, this code is a pain to use because, to optimize for performance, it uses Haskell ByteString instead of Text (i.e. only Latin/ASCII encoding is supported).
 
    Dilawar

Saket Choudhary

Sep 10, 2016, 3:13:18 AM
to wncc...@googlegroups.com
> On large CSV files (from ~10000 lines up to GB size), it becomes a pain
> to load data using Python's built-in csv module. NumPy helps a lot, but
> its performance also degrades. Haskell still does not have great
> out-of-the-box support for CSV.

Have you also tried benchmarking dask[1]?

[1] http://dask.pydata.org/en/latest/dataframe.html

Dilawar Singh

Sep 10, 2016, 4:20:34 AM
to wncc...@googlegroups.com
Thanks Saket. I just had a look and changed the script to use dask for this task. Here are a few observations.

- It seems designed to work on clusters (big +1), but it works on a laptop/desktop as well.
- `pip install dask[complete] --user` pulls in quite a lot of other libraries, but it installed flawlessly on openSUSE-42.1/Ubuntu-14.04.
- It reads the CSV file lazily, i.e. nothing is done until a computation is performed; the `read_csv` call executed in milliseconds (see the sketch after this list).
- When I called `data['time'].mean().compute()` (nothing is evaluated until `compute()` is called, much like Haskell), it took ~1.4 seconds, which is as good as pandas. The documentation seems to suggest that its performance is comparable to pandas. pandas is one of its dependencies, so I am guessing it uses the pandas CSV reader to read the file.
- The `.compute()` function can take an additional argument to evaluate in parallel, which is a big plus.
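A minimal sketch of the usage described above (the file name and the `time` column are assumptions from my test script; `scheduler="threads"` is how recent dask versions spell the parallel-execution option):

    import dask.dataframe as dd

    # Lazy: returns almost immediately; nothing is actually read yet.
    df = dd.read_csv("data.csv")

    # Evaluation happens only when .compute() is called (much like Haskell).
    mean_time = df["time"].mean().compute()

    # The scheduler can be chosen at compute time, e.g. thread-based parallelism.
    mean_time_threads = df["time"].mean().compute(scheduler="threads")

    print(mean_time, mean_time_threads)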

I was writing the CSV parser in Haskell just because parallelizing is easier in Haskell. But this seems to work just fine.

Thanks again,
     Dilawar
