
Newbie - converting csv files to arrays in NumPy


oyekomova

Jan 9, 2007, 3:08:00 PM
to
I would like to know how to convert a csv file with a header row into a
floating point array without the header row.

Marc 'BlackJack' Rintsch

Jan 9, 2007, 3:36:18 PM
to
In <1168373279.9...@o58g2000hsb.googlegroups.com>, oyekomova
wrote:

> I would like to know how to convert a csv file with a header row into a
> floating point array without the header row.

Take a look at the `csv` module in the standard library.

Ciao,
Marc 'BlackJack' Rintsch

Robert Kern

Jan 9, 2007, 3:36:47 PM
to pytho...@python.org
oyekomova wrote:
> I would like to know how to convert a csv file with a header row into a
> floating point array without the header row.

Use the standard library module csv. Something like the following is a cheap and
cheerful solution:


import csv
import numpy

def float_array_from_csv(filename, skip_header=True):
    f = open(filename)
    try:
        reader = csv.reader(f)
        floats = []
        if skip_header:
            reader.next()
        for row in reader:
            floats.append(map(float, row))
    finally:
        f.close()

    return numpy.array(floats)
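For example, it might be called like this (the file name is just a placeholder):

data = float_array_from_csv('somename.csv')   # placeholder file name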

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco

oyekomova

Jan 10, 2007, 2:48:06 PM
to
Thanks for your help. I compared the following code in NumPy with the
csvread in Matlab for a very large csv file. Matlab read the file in
577 seconds. On the other hand, this code below kept running for over 2
hours. Can this program be made more efficient? FYI - The csv file was
a simple 6 column file with a header row and more than a million
records.


import csv
from numpy import array
import time
t1=time.clock()
file_to_read = file('somename.csv','r')
read_from = csv.reader(file_to_read)
read_from.next()

datalist = [ map(float, row[:]) for row in read_from ]

# now the real data
data = array(datalist, dtype = float)

elapsed=time.clock()-t1
print elapsed

sturlamolden

Jan 10, 2007, 3:43:42 PM
to

oyekomova wrote:
> Thanks for your help. I compared the following code in NumPy with the
> csvread in Matlab for a very large csv file. Matlab read the file in
> 577 seconds. On the other hand, this code below kept running for over 2
> hours. Can this program be made more efficient? FYI - The csv file was
> a simple 6 column file with a header row and more than a million
> records.
>
>
> import csv
> from numpy import array
> import time
> t1=time.clock()
> file_to_read = file('somename.csv','r')
> read_from = csv.reader(file_to_read)
> read_from.next()

> datalist = [ map(float, row[:]) for row in read_from ]

I'm willing to bet that this is your problem. Python lists are arrays
under the hood!

Try something like this instead:


from numpy import empty, arange   # needed below

# read the whole file in one chunk
lines = file_to_read.readlines()
# count the number of columns
n = 1
for c in lines[1]:
    if c == ',': n += 1
# count the number of rows
m = len(lines[1:])
#allocate
data = empty((m,n),dtype=float)
# create csv reader, skip header
reader = csv.reader(lines[1:])
# read
for i in arange(0,m):
    data[i,:] = map(float,reader.next())

And if this is too slow, you may consider vectorizing the last loop:

data = empty((m,n),dtype=float)
newstr = ",".join(lines[1:])
flatdata = data.reshape((n*m)) # flatdata is a view of data, not a copy
reader = csv.reader([newstr])
flatdata[:] = map(float,reader.next())

I hope this helps!

Gabriel Genellina

Jan 10, 2007, 11:46:34 PM
to pytho...@python.org
At Wednesday 10/1/2007 16:48, oyekomova wrote:

>Thanks for your help. I compared the following code in NumPy with the
>csvread in Matlab for a very large csv file. Matlab read the file in
>577 seconds. On the other hand, this code below kept running for over 2
>hours. Can this program be made more efficient? FYI - The csv file was
>a simple 6 column file with a header row and more than a million
>records.
>
>
>import csv
>from numpy import array
>import time
>t1=time.clock()
>file_to_read = file('somename.csv','r')
>read_from = csv.reader(file_to_read)
>read_from.next()
>
>datalist = [ map(float, row[:]) for row in read_from ]
>
># now the real data
>data = array(datalist, dtype = float)
>
>elapsed=time.clock()-t1
>print elapsed

Replace that row[:] by row; the slice is just a waste of time and memory.
And see http://www.scipy.org/Cookbook/InputOutput
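Applied to the original snippet, that one-line change reads:

datalist = [ map(float, row) for row in read_from ]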


--
Gabriel Genellina
Softlab SRL





John Machin

Jan 11, 2007, 3:11:25 AM
to

Instead of m = len(lines[1:]), please consider using
m = len(lines) - 1
(the slice copies the whole list just to count its elements).

> #allocate
> data = empty((m,n),dtype=float)
> # create csv reader, skip header
> reader = csv.reader(lines[1:])

lines[1:] again?
The OP set you an example:
read_from.next()
so you could use:
reader = csv.reader(lines)
_unused = reader.next()

Istvan Albert

Jan 11, 2007, 9:54:12 AM
to

oyekomova wrote:

> csvread in Matlab for a very large csv file. Matlab read the file in
> 577 seconds. On the other hand, this code below kept running for over 2
> hours. Can this program be made more efficient? FYI

There must be something wrong with your setup/program. I work with
large csv files as well and I never have performance problems of that
magnitude. Make sure you are not doing something else while parsing
your data.

Parsing 1 million lines with six columns with the program below takes
87 seconds on my laptop. Even your original version with extra slices
and all would still only take about 50% more time.

import time, csv, random
from numpy import array

def make_data(rows=1E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    reader = csv.reader( file('data.txt') )
    data = [ map(float, row) for row in reader ]
    data = array(data, dtype = float)
    print 'Data size', len(data)
    print 'Elapsed', time.clock() - start

#make_data()
read_test()

Travis E. Oliphant

Jan 12, 2007, 12:59:43 AM
to pytho...@python.org
oyekomova wrote:
> Thanks for your help. I compared the following code in NumPy with the
> csvread in Matlab for a very large csv file. Matlab read the file in
> 577 seconds. On the other hand, this code below kept running for over 2
> hours. Can this program be made more efficient? FYI - The csv file was
> a simple 6 column file with a header row and more than a million
> records.
>
>

There is some facility to read simply-formatted files directly into NumPy.

You might try something like this.

numpy.fromfile('somename.csv', sep=',')

and then reshape the array.

-Travis

Travis E. Oliphant

Jan 12, 2007, 1:07:32 AM
to pytho...@python.org
oyekomova wrote:
> Thanks for your help. I compared the following code in NumPy with the
> csvread in Matlab for a very large csv file. Matlab read the file in
> 577 seconds. On the other hand, this code below kept running for over 2
> hours. Can this program be made more efficient? FYI - The csv file was
> a simple 6 column file with a header row and more than a million
> records.
>
>
> import csv
> from numpy import array
> import time
> t1=time.clock()
> file_to_read = file('somename.csv','r')
> read_from = csv.reader(file_to_read)
> read_from.next()
>
> datalist = [ map(float, row[:]) for row in read_from ]
>
> # now the real data
> data = array(datalist, dtype = float)
>
> elapsed=time.clock()-t1
> print elapsed
>


If you use numpy.fromfile, you need to skip past the initial header row
yourself. Something like this:

fid = open('somename.csv')
data = numpy.fromfile(fid, sep=',').reshape(-1,6)
# for 6-column data.

-Travis

Robert Kern

Jan 12, 2007, 1:28:43 AM
to pytho...@python.org
Travis E. Oliphant wrote:

> If you use numpy.fromfile, you need to skip past the initial header row
> yourself. Something like this:
>
> fid = open('somename.csv')

# I think you also meant to include this line:
header = fid.readline()

> data = numpy.fromfile(fid, sep=',').reshape(-1,6)
> # for 6-column data.

--

oyekomova

Jan 13, 2007, 2:13:57 PM
to
Thanks to everyone for their excellent suggestions. I was able to
achieve the following results with all your suggestions. However, I am
unable to cross a file size of 6 million rows. I would appreciate any
helpful suggestions on avoiding memory errors. None of the solutions
posted was able to cross this limit.

>>> Data size 999999
Elapsed 31.60352213
>>> ================================ RESTART ================================
>>>
Data size 1999999
Elapsed 63.4050884573
>>> ================================ RESTART ================================
>>>
Data size 4999999
Elapsed 177.888915777
>>>
Data size 5999999
Traceback (most recent call last):
  File "C:/Documents/some.py", line 27, in <module>
    read_test()
  File "C:/Documents/some.py", line 21, in read_test
    data = array(data, dtype = float)
MemoryError

sturlamolden

Jan 13, 2007, 5:47:53 PM
to

oyekomova wrote:
> Thanks to everyone for their excellent suggestions. I was able to
> achieve the following results with all your suggestions. However, I am
> unable to cross a file size of 6 million rows. I would appreciate any
> helpful suggestions on avoiding memory errors. None of the solutions
> posted was able to cross this limit.

The error message means you are running out of RAM.

With 6 million rows and 6 columns, the size of the data array is (only)
274 MiB. I have no problem allocating it on my laptop. How large is the
csv file and how much RAM do you have?
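For reference, the arithmetic behind that 274 MiB figure, assuming 8-byte floats:

>>> 6e6 * 6 * 8 / 2.0**20   # rows * columns * bytes per float64, in MiB
274.658203125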

Also it helps to post the whole code you are trying to run. I don't
care much for guesswork.

oyekomova

Jan 13, 2007, 7:39:34 PM
to
Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
in reading the file into memory. I am just running Istvan's code that
was posted earlier.

import time, csv, random
from numpy import array


def make_data(rows=1E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    reader = csv.reader( file('data.txt') )
    data = [ map(float, row) for row in reader ]
    data = array(data, dtype = float)
    print 'Data size', len(data)
    print 'Elapsed', time.clock() - start

#make_data()
read_test()

On Jan 13, 5:47 pm, "sturlamolden" <sturlamol...@yahoo.no> wrote:
> oyekomova wrote:
> > Thanks to everyone for their excellent suggestions. I was able to
> > achieve the following results with all your suggestions. However, I am
> > unable to cross a file size of 6 million rows. I would appreciate any
> > helpful suggestions on avoiding memory errors. None of the solutions
> > posted was able to cross this limit.
>
> The error message means you are running out of RAM.

sk...@pobox.com

Jan 13, 2007, 7:58:45 PM
to oyekomova, pytho...@python.org

oyekomova> def read_test():
oyekomova>     start = time.clock()
oyekomova>     reader = csv.reader( file('data.txt') )
oyekomova>     data = [ map(float, row) for row in reader ]
oyekomova>     data = array(data, dtype = float)
oyekomova>     print 'Data size', len(data)
oyekomova>     print 'Elapsed', time.clock() - start

You have the entire file in memory as well as the entire array. Try
operating line-by-line.

#!/usr/bin/env python

import array
import time
import random
import csv

def make_data(nrows=1000000, cols=6):
    counter = range(cols)
    writer = csv.writer(open('data.txt', 'wt'))
    for row in xrange(nrows):
        writer.writerow([random.random() for x in counter])

def read_test():
    reader = csv.reader( file('data.txt') )
    data = array.array('f')
    for row in reader:
        data.extend(map(float, row))
    print 'Data size', len(data)

start = time.clock()
make_data()
print "generate data:", (time.clock()-start)

start = time.clock()
read_test()
print "read data:", (time.clock()-start)

Skip

sturlamolden

Jan 14, 2007, 1:37:04 AM
to
oyekomova wrote:

> Thanks for your note. I have 1Gig of RAM. Also, Matlab has no problem
> in reading the file into memory. I am just running Istvan's code that
> was posted earlier.

You have a CSV file of about 520 MiB, which is read into memory. Then
you have a list of lists of floats, created by list comprehension, which
is larger than 274 MiB. Additionally you try to allocate a NumPy array
slightly larger than 274 MiB. Now your process is already exceeding 1
GiB, and you are probably running other processes too. That is why you
run out of memory.

So you have three options:

1. Buy more RAM.

2. Low-level code a csv-reader in C.

3. Read the data in chunks. That would mean something like this:


import time, csv, random
import numpy

def make_data(rows=6E6, cols=6):
    fp = open('data.txt', 'wt')
    counter = range(cols)
    for row in xrange( int(rows) ):
        vals = map(str, [ random.random() for x in counter ] )
        fp.write( '%s\n' % ','.join( vals ) )
    fp.close()

def read_test():
    start = time.clock()
    arrlist = None
    r = 0
    CHUNK_SIZE_HINT = 4096 * 4 # seems to be good
    fid = file('data.txt')
    while 1:
        chunk = fid.readlines(CHUNK_SIZE_HINT)
        if not chunk: break
        reader = csv.reader(chunk)
        data = [ map(float, row) for row in reader ]
        arrlist = [ numpy.array(data,dtype=float), arrlist ]
        r += arrlist[0].shape[0]
        del data
        del reader
        del chunk
    print 'Created list of chunks, elapsed time so far:', time.clock() - start
    print 'Joining list...'
    data = numpy.empty((r,arrlist[0].shape[1]),dtype=float)
    r1 = r
    while arrlist:
        r0 = r1 - arrlist[0].shape[0]
        data[r0:r1,:] = arrlist[0]
        r1 = r0
        del arrlist[0]
        arrlist = arrlist[0]
    print 'Elapsed time:', time.clock() - start

make_data()
read_test()

This can process a CSV file of 6 million rows in about 150 seconds on
my laptop. A CSV file of 1 million rows takes about 25 seconds.

Just reading the 6 million row CSV file ( using fid.readlines() ) takes
about 40 seconds on my laptop. Python lists are not particularly
efficient. You can probably reduce the time to ~60 seconds by writing a
new CSV reader for NumPy arrays in a C extension.

oyekomova

Jan 14, 2007, 12:56:34 PM
to
Thank you so much. Your solution works! I greatly appreciate your
help.

Travis E. Oliphant

Jan 15, 2007, 4:06:56 PM
to pytho...@python.org
oyekomova wrote:
> Thanks to everyone for their excellent suggestions. I was able to
> achieve the following results with all your suggestions. However, I am
> unable to cross a file size of 6 million rows. I would appreciate any
> helpful suggestions on avoiding memory errors. None of the solutions
> posted was able to cross this limit.

Did you try using numpy.fromfile ?

This will not require you to allocate more memory than needed. If you
specify a count, it will also not have to re-allocate memory in blocks
as the array size grows.

Its limitation is that it is not a very sophisticated csv reader: it
only understands a single separator (plus line-feeds are typically seen
as a separator).
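
A minimal sketch of that approach, assuming the 6-column file discussed earlier and a row count known in advance (both assumptions):

import numpy

nrows, ncols = 6000000, 6               # assumed shape of the data
fid = open('somename.csv')              # placeholder file name, as earlier in the thread
fid.readline()                          # skip the header row by hand
data = numpy.fromfile(fid, dtype=float, sep=',', count=nrows*ncols)
data = data.reshape(-1, ncols)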

-Travis

oyekomova

Jan 16, 2007, 8:33:30 PM
to
Travis-
Yes, I tried your suggestion, but found that it took longer to read a
large file. Thanks for your help.