CSV is the simplest format for storing data. Compression (gzip) can
reduce the file size to approximately 20% of the original, which is
sometimes comparable to a binary format. CSV is easy to visualize, and it
is easy to inspect the real values during the test phase. HDF is a binary
format with extremely good support (MATLAB's v7.3 .mat files are
HDF5-based).
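As a rough illustration of the gzip point, here is a minimal sketch that writes the same rows as plain and gzip-compressed CSV using only the standard library, then compares the file sizes. The helper names `write_csv` and `read_csv`, the file names, and the generated values are made up for this example; the actual compression ratio depends heavily on the data.

```python
import csv
import gzip
import os
import tempfile


def write_csv(path, rows, compress=False):
    """Write rows to a CSV file, optionally gzip-compressed."""
    opener = gzip.open if compress else open
    # Text mode with newline="" is what the csv module expects.
    with opener(path, "wt", newline="") as handle:
        csv.writer(handle).writerows(rows)


def read_csv(path, compress=False):
    """Read a CSV file of numbers back into a list of float rows."""
    opener = gzip.open if compress else open
    with opener(path, "rt", newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle)]


# Made-up data: 1000 rows of 20 values formatted with %g-style notation.
rows = [[f"{(i * 20 + j) * 1.5e-3:g}" for j in range(20)] for i in range(1000)]

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.csv")
gz_path = os.path.join(tmpdir, "data.csv.gz")

write_csv(plain_path, rows)
write_csv(gz_path, rows, compress=True)

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(f"plain: {plain_size} bytes, gzip: {gz_size} bytes")
```

The compressed file stays readable with the same code path (`read_csv(gz_path, compress=True)`), so nothing else in a pipeline needs to change apart from the opener.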
On large CSV files (from roughly 10,000 lines up to gigabytes in size),
loading the data with Python's built-in csv module becomes a pain. NumPy
helps a lot, but its performance also degrades.
I compared the NumPy and pandas CSV readers. Below is a rough comparison
for a file with ~43,000 lines (20 columns of floating-point values in
scientific notation, %g).
Reader | Load time |
numpy loadtxt | > 10 sec |
pandas read_csv | 1.5 sec |
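A comparison along these lines can be reproduced with a short script. This is only a sketch under stated assumptions: the original file is not available, so random values stand in for it, and `time_call` and the file name `bench.csv` are hypothetical. Timings will vary with hardware and library versions and should not be expected to match the figures above.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in data (assumption): 43000 rows x 20 columns of random floats,
# written with the %g format mentioned in the text.
rng = np.random.default_rng(0)
data = rng.random((43000, 20))
path = os.path.join(tempfile.mkdtemp(), "bench.csv")
np.savetxt(path, data, fmt="%g", delimiter=",")

# Time each reader once on the same file.
np_array, np_seconds = time_call(np.loadtxt, path, delimiter=",")
frame, pd_seconds = time_call(pd.read_csv, path, header=None)

print(f"numpy loadtxt:   {np_seconds:.2f} s")
print(f"pandas read_csv: {pd_seconds:.2f} s")
```

A single run per reader is noisy; repeating each call a few times and taking the best time gives a steadier number.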