Working with CSV files in Python

Dilawar Singh

Sep 10, 2016, 5:03:51 AM

to Programming and Computing Group, NCBS Bangalore

CSV is the simplest format for storing data. Compression (gzip) can reduce the file size to approximately 20% of the original (sometimes comparable to a binary format). CSV is easy to visualize, and it is easy to look at the real values during the test phase. HDF is a binary format with extremely good support (MATLAB's v7.3 .mat files are HDF5-based).
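To illustrate the compression point, here is a minimal sketch that writes the same table as plain and gzip-compressed CSV and compares the sizes. The data is synthetic and the file names are made up for the example; the ratio you get will depend on your data, so do not expect exactly 20%.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic data standing in for real measurements (an assumption for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 5)), columns=[f"c{i}" for i in range(5)])

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "data.csv")     # hypothetical file name
gz_path = os.path.join(tmpdir, "data.csv.gz")     # hypothetical file name

# pandas infers gzip compression from the ".gz" suffix, on write and on read.
df.to_csv(plain_path, index=False)
df.to_csv(gz_path, index=False)

plain_size = os.path.getsize(plain_path)
gz_size = os.path.getsize(gz_path)
print(f"plain: {plain_size} bytes, gzip: {gz_size} bytes")

# The compressed file can be read back directly, without unpacking it first.
df_back = pd.read_csv(gz_path)
```

Random floats compress poorly compared to real data with repeated or slowly varying values, so this probably shows a worse ratio than a typical simulation output would.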

On large CSV files (from roughly 10000 lines up to GB size), it becomes a pain to load the data using Python's built-in csv module. NumPy helps a lot, but its performance also degrades.

I compared the NumPy and pandas CSV readers. Below is a rough comparison for a file with ~43000 lines (20 columns of floating-point values in scientific notation, %g).



numpy `loadtxt`	> 10 sec
pandas `read_csv`	1.5 sec
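If you want to try this on your own machine, something like the following should do. It is only a rough sketch: the file is generated with random values (not my original data), and the timings depend heavily on hardware and library versions. Newer NumPy releases may narrow the gap considerably.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Generate a file of similar shape: ~43000 lines, 20 columns, %g format.
rows, cols = 43000, 20
data = np.random.default_rng(0).random((rows, cols))
path = os.path.join(tempfile.mkdtemp(), "bench.csv")  # hypothetical file name
np.savetxt(path, data, delimiter=",", fmt="%g")

# Time numpy.loadtxt.
t0 = time.perf_counter()
a = np.loadtxt(path, delimiter=",")
t_numpy = time.perf_counter() - t0

# Time pandas.read_csv; header=None because the file has no header row.
t0 = time.perf_counter()
b = pd.read_csv(path, header=None).to_numpy()
t_pandas = time.perf_counter() - t0

print(f"numpy loadtxt:   {t_numpy:.2f} s")
print(f"pandas read_csv: {t_pandas:.2f} s")
```

A single run like this is noisy; repeat it a few times before drawing conclusions.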

pandas does really well at reading CSV files. pandas and NumPy also play along quite well together. Dask (python-dask) is another upcoming library which uses pandas/NumPy as its backend. Dask is really good at parallelizing computation. It is aimed at clusters but can also be used on a desktop or laptop.
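As a small illustration of how pandas and NumPy play together, here is a sketch that moves data back and forth between the two. The column names and values are made up for the example.

```python
import io

import numpy as np
import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical data).
csv_text = "t,v\n0.0,1.5\n0.1,2.5\n0.2,3.5\n"
df = pd.read_csv(io.StringIO(csv_text))

# DataFrame -> 2-D NumPy array.
arr = df.to_numpy()

# NumPy functions accept a pandas column directly.
v_mean = np.mean(df["v"])

# NumPy array -> DataFrame, reusing the original column names.
df2 = pd.DataFrame(arr * 2, columns=df.columns)

print(arr.shape, v_mean)
print(df2)
```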

Dilawar