fastest way to read a text file in to a numpy array

Heli

Jun 28, 2016, 9:46:05 AM
Hi,

I need to read a file with a large number of lines into a 2D numpy array.
I was wondering what is the fastest way to do this?

Is reading the file into a numpy array even the best method, or are there better approaches?

Thanks for your suggestions,

Michael Selik

Jun 28, 2016, 10:00:36 AM
On Tue, Jun 28, 2016 at 9:51 AM Heli <hem...@gmail.com> wrote:

> Is reading the file into a numpy array even the best method, or are there
> better approaches?
>

What are you trying to accomplish?
Summary statistics, data transformation, analysis...?

Michael Selik

Jun 28, 2016, 10:30:42 AM
On Tue, Jun 28, 2016 at 10:08 AM Hedieh Ebrahimi <hem...@gmail.com> wrote:

> File 1 has :
> x1,y1,z1
> x2,y2,z2
> ....
>
> and file2 has :
> x1,y1,z1,value1
> x2,y2,z2,value2
> x3,y3,z3,value3
> ...
>
> I need to read the coordinates from file 1 and then interpolate a value
> for these coordinates on file 2 to the closest coordinate possible. The
> problem is that file 2 has around 5M lines. So I was wondering what would
> be the fastest approach?
>

Is this a one-time task, or something you'll need to repeat frequently?
How many points need to be interpolated?
How do you define distance? Euclidean 3d distance? K-nearest?

5 million can probably fit into memory, so it's not so bad.

NumPy is a good option for broadcasting the distance function across all 5
million labeled points for each unlabeled point. Given that file format,
NumPy can probably read from file directly into an array.

http://stackoverflow.com/questions/3518778/how-to-read-csv-into-record-array-in-numpy
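
A minimal sketch of that broadcasting approach, assuming comma-separated
files laid out as described above (the file names and column layout are
placeholders, not from your message):

import numpy as np

labeled = np.loadtxt("file2.csv", delimiter=",")   # ~5M rows: x, y, z, value
queries = np.loadtxt("file1.csv", delimiter=",")   # rows of x, y, z to look up

coords = labeled[:, :3]
values = labeled[:, 3]

for q in queries:
    # Squared Euclidean distance from q to every labeled point,
    # computed in one vectorized pass via broadcasting.
    d2 = ((coords - q) ** 2).sum(axis=1)
    print(q, values[d2.argmin()])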

Cody Piersall

Jun 28, 2016, 10:45:53 AM
On Tue, Jun 28, 2016 at 8:45 AM, Heli <hem...@gmail.com> wrote:
> Hi,
>
> I need to read a file with a large number of lines into a 2D numpy array.
> I was wondering what is the fastest way to do this?
>
> Is reading the file into a numpy array even the best method, or are there better approaches?
>

numpy.genfromtxt[1] is a pretty robust function for reading text files.

If you're generating the file from a numpy array already, you should
use numpy.save()[2] and numpy.load()[3] instead.

[1]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
[2]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
[3]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html
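
For example (file names here are placeholders):

import numpy as np

# Parsing text is robust but relatively slow.
arr = np.genfromtxt("data.csv", delimiter=",")

# A binary round-trip is much faster if you control the file format.
np.save("data.npy", arr)
arr2 = np.load("data.npy")   # pass mmap_mode="r" to avoid loading it all at once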

Cody

Heli

Jun 30, 2016, 11:49:57 AM
Dear all,

After a few tests, I think I need to correct my question a bit. I will give an example here.

I have file 1 with 250 lines:
X1,Y1,Z1
X2,Y2,Z2
....

Then I have file 2 with 3M lines:
X1,Y1,Z1,value11,value12, value13,....
X2,Y2,Z2,value21,value22, value23,...
....

I need to interpolate values for the coordinates in file 1 from file 2 (using nearest).
I am using scipy.interpolate.griddata for this.

scipy.interpolate.griddata(points, values, xi, method='linear', fill_value=nan, rescale=False)

When timing the different parts of the code, reading the files into numpy is not the culprit; the griddata call is.

time to read file2= 2 min
time to interpolate= 48 min

I need to repeat the griddata call above to get an interpolation for each of the columns of values. I was wondering if there are any ways to improve the time spent in interpolation.


Thank you very much in advance for your help,


Christian Gollwitzer

Jun 30, 2016, 5:02:59 PM
Am 30.06.16 um 17:49 schrieb Heli:
> Dear all,
>
> After a few tests, I think I need to correct my question a bit. I will give an example here.
>
> I have file 1 with 250 lines:
> X1,Y1,Z1
> X2,Y2,Z2
> ....
>
> Then I have file 2 with 3M lines:
> X1,Y1,Z1,value11,value12, value13,....
> X2,Y2,Z2,value21,value22, value23,...
> ....
>
> I need to interpolate values for the coordinates in file 1 from file 2 (using nearest).
> I am using scipy.interpolate.griddata for this.
>
> scipy.interpolate.griddata(points, values, xi, method='linear', fill_value=nan, rescale=False)

This constructs a Delaunay triangulation, so it is no wonder it takes some
time when you run it over 3M data points. You can probably save a factor
of three, because:

> I need to repeat the griddata call above to get an interpolation for each of the columns of values.

I think this is wrong. According to the docs, it should happily
interpolate from a 2D array of values. BTW, you stated you want nearest
interpolation, but you chose "linear". I don't think it will make a big
difference in runtime, though (nearest uses a k-d tree, linear uses Qhull).
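
For example, something along these lines should interpolate all the value
columns in one call (file names and shapes are assumed from the layout you
described):

import numpy as np
from scipy.interpolate import griddata

file2 = np.loadtxt("file2.csv", delimiter=",")   # 3M rows: X, Y, Z, value1, value2, ...
points = file2[:, :3]
values = file2[:, 3:]                            # all value columns at once
xi = np.loadtxt("file1.csv", delimiter=",")      # 250 rows: X, Y, Z

# One tree/triangulation build; every column interpolated together.
result = griddata(points, values, xi, method="nearest")   # shape (250, n_columns)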

> I was wondering if there are any ways to improve the time spent in interpolation.

Are you sure you need the full generality of this algorithm? I.e., are
your values given on a scattered cloud of points in 3D space, or are the
X,Y,Z in file2 in fact on a rectangular grid? In the former case, there is
probably nothing you can really do. In the latter, there is a more
efficient approach: look up the nearest index from X,Y,Z by index
arithmetic, or maybe even reshape the values into a 3D array. A sketch of
that idea is below.
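
A sketch, assuming an axis-aligned grid with uniform spacing (the origin,
spacing, and shape below are made up for illustration; the values array
would come from file2, reshaped to the grid):

import numpy as np

x0, y0, z0 = 0.0, 0.0, 0.0      # grid origin (assumed)
dx, dy, dz = 1.0, 1.0, 1.0      # grid spacing (assumed)
nx, ny, nz = 100, 100, 300      # grid shape (assumed)

vals = np.zeros((nx, ny, nz))   # dummy values on the grid

def nearest_value(x, y, z):
    # Round each coordinate to the closest grid index and clamp to the
    # array bounds; no search over the 3M points is needed at all.
    i = min(max(int(round((x - x0) / dx)), 0), nx - 1)
    j = min(max(int(round((y - y0) / dy)), 0), ny - 1)
    k = min(max(int(round((z - z0) / dz)), 0), nz - 1)
    return vals[i, j, k]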

Christian
