CSV is the simplest format for storing data. Compression (gzip) can reduce the file size to roughly 20% of the original (sometimes comparable to a binary format). CSV is easy to visualize, and it is easy to inspect the actual values during the test phase. HDF is a binary format with extremely good tool support (MATLAB's default .mat files, since v7.3, are HDF5 files).
On large CSV files (from ~10,000 lines up to GB size), loading the data with Python's built-in csv module becomes a pain. NumPy helps a lot, but its performance also degrades. Haskell still does not have great out-of-the-box support for CSV.
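As a rough illustration of the two Python options mentioned above (this is a minimal sketch with a small in-memory file, not the original benchmark; the sizes and data are made up):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory CSV as a stand-in for the large file
# (the real benchmark file had ~43,000 lines x 20 columns).
rows = "\n".join(
    ",".join("%g" % (i * 20 + j) for j in range(20)) for i in range(100)
)

# numpy.loadtxt works, but its parsing loop is comparatively slow on big files.
arr = np.loadtxt(io.StringIO(rows), delimiter=",")

# pandas.read_csv uses a C parser and is usually much faster at this scale.
df = pd.read_csv(io.StringIO(rows), header=None)

print(arr.shape, df.shape)  # both (100, 20)
```

Both readers produce the same 100 x 20 table here; the difference only shows up in wall-clock time on large files.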
For Haskell, I implemented my own CSV reader using the cassava library (every field is parsed as a `Double`, except the header). The parsec-based reader from the MissingH library was taking too long (~17 seconds) for a file with 43,200 lines. I compared the result with the python-numpy and python-pandas CSV readers. Below is a rough comparison for a file with ~43,000 lines (20 columns of floating-point values in scientific notation, %g).
cassava (ignore #)                  | 3.3 sec
cassava (no support for ignoring #) | 2.7 sec
numpy loadtxt                       | > 10 sec
pandas read_csv                     | 1.5 sec
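For comparison with the "ignore #" case above: pandas can skip comment lines natively via the `comment` parameter of `read_csv`. A minimal sketch (the data here is invented, not the benchmark file):

```python
import io
import pandas as pd

csv_text = """# generated data; lines starting with '#' are comments
1.5e-3,2.25e+1
# another comment in the middle of the data
3.0e0,4.0e0
"""

# comment='#' tells the C parser to drop everything after a '#',
# so both full-line and trailing comments are ignored.
df = pd.read_csv(io.StringIO(csv_text), header=None, comment="#")
print(df.shape)  # (2, 2): two data rows survive, both comment lines dropped
```

Because the filtering happens inside pandas' C parser, there is no extra Python-level pass over the file.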
The cassava-based reader beats the parsec-based one hands down. The code is here: https://github.com/dilawar/HBatteries/blob/master/src/HBatteries/CSV.hs. But this code is a pain to use because, to optimize for performance, it uses Haskell's ByteString instead of Text (i.e. only Latin/ASCII encoding is supported).
--
The website for the club is http://wncc-iitb.org/
You received this message because you are subscribed to the Google Groups "Web and Coding Club IIT Bombay" group.