How to Convert Huge CSV to HDF5

venki

Jul 4, 2011, 4:49:58 AM

to h5py

Hi All,
I would like to convert a huge CSV file (2.5 GB) into an HDF5 file, but
the problem is that I have only 3 GB of memory, so the entire data set may
not fit into a NumPy array in memory. Is there any way, or any sample code
available, that can help me convert the CSV to HDF5 by processing the CSV
line by line? Thanks in advance.

Thanks,
Venkatesh

Paul Anton Letnes

Jul 6, 2011, 5:01:41 AM

to h5...@googlegroups.com

Hi!

Just a few quick thoughts:
- Since the file is in ASCII format, the binary representation should take less space (1 line with 1 number of the format 1.234567e12 is 12 bytes in ASCII including the newline, 4 bytes in numpy.float32). I'd try numpy.loadtxt (or numpy.genfromtxt, or similar) on a smaller portion of the file and see how large the array actually turns out to be.
- If you have a multicolumn file, can you convert one column at a time? Each resulting array will be much smaller than the full file, so problem solved.
- If you are only doing this once, you could simply let your RAM "swap out" to disk (the OS does this automagically) and leave your script running overnight, no problem.
- Most scalable, probably: read and append chunks (say, 10k lines at a time) to the HDF5 file. Simply use a couple of loops and some numpy-style slicing, and you're good. One line at a time will work, but will probably be much slower. For a one-off thing you might be fine with that, too.
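
The last suggestion could look roughly like the sketch below. It is only a sketch under some assumptions: the CSV is purely numeric and comma-separated, every row has the same number of columns, and float32 precision is enough. The function name csv_to_hdf5, the dataset name "data" and the default chunk size are made up for illustration and are not part of h5py. It reads a block of lines, parses it with numpy.loadtxt, grows a resizable dataset and writes the block with a slice assignment.

```python
import itertools

import h5py
import numpy as np


def csv_to_hdf5(csv_path, h5_path, dataset_name="data", chunk_rows=10_000,
                delimiter=",", dtype=np.float32, skip_header=0):
    """Stream a numeric CSV into a resizable HDF5 dataset, one block at a time.

    Hypothetical helper: names and defaults are illustrative assumptions.
    Returns the number of rows written.
    """
    rows_written = 0
    with open(csv_path, "r") as src, h5py.File(h5_path, "w") as dst:
        # Skip header lines, if the file has any.
        for _ in range(skip_header):
            next(src, None)

        dset = None
        while True:
            # Pull at most chunk_rows lines; only this block is held in memory.
            raw = list(itertools.islice(src, chunk_rows))
            if not raw:
                break  # end of file
            lines = [ln for ln in raw if ln.strip()]  # drop blank lines
            if not lines:
                continue

            # ndmin=2 keeps a (rows, columns) shape even for a single row.
            block = np.loadtxt(lines, delimiter=delimiter, dtype=dtype, ndmin=2)

            if dset is None:
                # Create an empty dataset that can grow along the first axis.
                ncols = block.shape[1]
                dset = dst.create_dataset(dataset_name, shape=(0, ncols),
                                          maxshape=(None, ncols),
                                          dtype=dtype, chunks=True)

            # Grow the dataset, then write the new rows with a slice.
            n = block.shape[0]
            dset.resize(rows_written + n, axis=0)
            dset[rows_written:rows_written + n, :] = block
            rows_written += n
    return rows_written
```

A call such as csv_to_hdf5("big.csv", "big.h5") (both file names are placeholders) should keep memory use bounded by the block size rather than the file size. Larger values of chunk_rows generally mean fewer, bigger writes, so it is worth trying a few values on a small portion of the file first.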

Did that help?
Paul


venki

Jul 8, 2011, 1:14:01 PM

to h5py

Thanks, Paul. That was a great help.

