For my application, I need to use quite large data arrays
(100,000 x 4,000 values) of floating point numbers, with fast
row-wise and column-wise access (main case: return a column with the
sum over a number of selected rows, and vice versa).
I would use a numpy array for that, but numpy arrays seem to be
memory-resident. One of these arrays would use about 1.6 GB of memory
(at 4 bytes per value), which is far too much. So I was thinking
about a memory-mapped file, and as far as I understand, numpy
provides one.
For this, I have two questions:
1. Are "numpy.memmap" arrays unlimited in size (i.e. limited only by
the maximal file size)? And do they count against the system's memory
limit (~3 GB on 32-bit systems)?
2. Since I need row-wise as well as column-wise access, naively using
one big array as a memory-mapped file will probably lead to very poor
performance, since one of the two access patterns would have to read
values scattered around the whole file. Are there any "plug and play"
solutions for that? If not: what would be the best way to solve this
problem? Probably one needs something like the "Morton layout" for
the data. Would one then build a subclass of memmap (or ndarray?)
that implements this specific layout? How would one do that? (Sorry,
I am still a beginner with respect to Python.)
Best regards
Ole
mmaps come out of your application's memory space, so out of that
3 GB limit. You don't need that much RAM, of course, but a mapping
does use up address space.
> 2. Since I need row-wise as well as column-wise access, naively using
> one big array as a memory-mapped file will probably lead to very poor
> performance, since one of the two access patterns would have to read
> values scattered around the whole file. Are there any "plug and play"
> solutions for that? If not: what would be the best way to solve this
> problem? Probably one needs something like the "Morton layout" for
> the data. Would one then build a subclass of memmap (or ndarray?)
> that implements this specific layout? How would one do that? (Sorry,
> I am still a beginner with respect to Python.)
Sorry, I don't know very much about numpy, but it occurs to me that
you could keep two copies of your mmapped array, one the transpose of
the other, which would speed up the two access patterns enormously.
You needn't mmap the two arrays (files) at the same time either.
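Something like this rough numpy sketch (untested; the file names are
made up, and float32 is assumed from your 1.6 GB figure):

import numpy as np

n_rows, n_cols = 100000, 4000
rows = np.memmap('rows.dat', dtype=np.float32, mode='w+',
                 shape=(n_rows, n_cols))
cols = np.memmap('cols.dat', dtype=np.float32, mode='w+',
                 shape=(n_cols, n_rows))
# Fill the transposed copy in bands of rows to bound memory use.
band = 1000
for start in range(0, n_rows, band):
    cols[:, start:start + band] = rows[start:start + band].T
cols.flush()
# A column sum over selected rows is then a mostly-contiguous read:
# cols[j, selected_rows].sum()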
--
Nick Craig-Wood <ni...@craig-wood.com> -- http://www.craig-wood.com/nick
Nick Craig-Wood <ni...@craig-wood.com> writes:
> mmaps come out of your application's memory space, so out of that
> 3 GB limit. You don't need that much RAM, of course, but a mapping
> does use up address space.
Hmm. So I have no chance to use >= 2 of these arrays simultaneously?
> Sorry, I don't know very much about numpy, but it occurs to me that
> you could keep two copies of your mmapped array, one the transpose of
> the other, which would speed up the two access patterns enormously.
That would be a solution, but it takes twice the amount of address
space (which already seems to be the limiting factor). In my case
(1.6 GB per array), I could not even use one array.
Also, I would need to fill two large files at program start: one for
each orientation (row-wise or column-wise). Depending on the input
data (which are also either row-wise or column-wise), filling the
array with the opposite orientation would take a lot of time because
of the inefficient access pattern.
For that reason, using both orientations would probably not be a good
solution. What I found is the "Morton layout", which uses a kind of
fractal interleaving and sounds not that complicated. But I have no
idea how to turn it into "numpy" style: can I just subclass
numpy.ndarray (or numpy.memmap), and which functions/methods would
then need to be overridden? The best, of course, would be an existing
implementation that I could use without falling into all the pitfalls
that occur when one implements a very generic algorithm.
Best regards
Ole
Looks like you are out of luck here. I believe trying to do this
without managing the I/O yourself is likely to involve someone else's
trade-offs, which will probably fail to match your use case. You
might check
gmane.comp.python.scientific.user
which I think is gmane's mirror of comp.python.scipy.user, but I'm
not certain of the mailing list name. Certainly the people over there
regularly deal with accessing large data.
--Scott David Daniels
Scott....@Acm.Org
You don't need them mapped at the same time, so you could get away
with just one copy mapped at once.
Also, you can map the array in parts and use dramatically less
address space.
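For instance (a sketch; file name, dtype and band size are just
assumptions), numpy.memmap takes an offset argument, so you can map
one band of rows at a time:

import numpy as np

n_rows, n_cols = 100000, 4000
band = 10000                              # rows mapped at once
itemsize = np.dtype(np.float32).itemsize  # 4 bytes per value
for start in range(0, n_rows, band):
    nb = min(band, n_rows - start)
    part = np.memmap('rows.dat', dtype=np.float32, mode='r',
                     offset=start * n_cols * itemsize,
                     shape=(nb, n_cols))
    # ... work on `part` ...
    del part   # release the mapping before the next band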
> Also, I would need to fill two large files at program start: one for
> each orientation (row-wise or column-wise). Depending on the input
> data (which are also either row-wise or column-wise), filling the
> array with the opposite orientation would take a lot of time because
> of the inefficient access pattern.
>
> For that reason, using both orientations would probably not be a good
> solution. What I found is the "Morton layout", which uses a kind of
> fractal interleaving and sounds not that complicated.
It sounds cool!
> But I have no idea how to turn it into "numpy" style: can I just
> subclass numpy.ndarray (or numpy.memmap), and which functions/methods
> would then need to be overridden? The best, of course, would be an
> existing implementation that I could use without falling into all
> the pitfalls that occur when one implements a very generic algorithm.
I'd start by writing a function which takes (x, y) in array
co-ordinates and transforms them into a single index z in the Morton
layout.
Then instead of accessing array[x][y] you access
morton_array[f(x, y)]. That doesn't require any subclassing and is
relatively easy to implement. I'd try that and see if it works first!
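A sketch of such a function (assuming both dimensions are powers of
two; untested):

def morton_index(x, y, bits=12):
    """Interleave the low `bits` bits of x and y into one index."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)      # bits of x at even positions
        z |= ((y >> b) & 1) << (2 * b + 1)  # bits of y at odd positions
    return z

morton_array is then a flat (possibly mmapped) array of length
2**(2 * bits), accessed as morton_array[morton_index(x, y)].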
Alternatively you could install a 64bit OS on your machine and use my
scheme!
You probably want to ask on a NumPy or SciPy list:
http://scipy.org/Mailing_Lists
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
"If you think it's expensive to hire a professional to do the job, wait
until you hire an amateur." --Red Adair
The Morton layout wastes space if the matrix is not square. Your 100K
x 4K is very non-square. Looks like you might want to use e.g. 25
Morton arrays, each 4K x 4K.
Cheers,
John
John Machin <sjma...@lexicon.net> writes:
> The Morton layout wastes space if the matrix is not square. Your 100K
> x 4K is very non-square. Looks like you might want to use e.g. 25
> Morton arrays, each 4K x 4K.
What I found was that the Morton layout should be usable if the shape
is rectangular and both dimensions are powers of two. But all the
examples were done with equal dimensions, so I am a bit confused
here.
Given my access pattern, it would probably be better to combine 25
rows into one slice and have one matrix where every cell contains 25
rows. Are there any objections to that?
Best regards
Ole
Nick Craig-Wood <ni...@craig-wood.com> writes:
> I'd start by writing a function which takes (x, y) in array
> co-ordinates and transforms them into a single index z in the Morton
> layout.
This removes the possibility of using numpy's sum() and similar
methods. Implementing them myself would probably be much worse than
using numpy's own.
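For example, summing one column over selected rows would degrade to
an element-by-element Python loop like this (morton_index and
rows_of_interest are just placeholder names):

col_sum = sum(morton_array[morton_index(x, j)] for x in rows_of_interest)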
> Alternatively you could install a 64bit OS on your machine and use
> my scheme!
Well: I am just the developer. Of course I could raise the
requirements for using my software, but I think it is good style to
keep them as low as possible.
Best regards
Ole
Yes, you are right, it can be done in one hit with a rectangular
array. How it is done: in your case you have a 2**17 x 2**12 array,
so the Morton index corresponding to (i, j) would have the top 5 bits
of i followed by the remaining 12 bits of i interleaved with the 12
bits of j -- scarcely distinguishable from my original suggestion of
25 4Kx4K arrays, once you've ignored the trailing approx.
2**17 - 100000 rows that you don't need to allocate space for ;-)
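In code, that index might look roughly like this (a sketch, not
tested):

def morton_index_rect(i, j, j_bits=12):
    """Morton index for a tall 2**17 x 2**12 array: interleave the
    low j_bits of i with the bits of j, then prepend the remaining
    high bits of i."""
    z = 0
    for b in range(j_bits):
        z |= ((j >> b) & 1) << (2 * b)
        z |= ((i >> b) & 1) << (2 * b + 1)
    z |= (i >> j_bits) << (2 * j_bits)  # the top 5 bits of i
    return z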
>
> Given my access pattern, it would probably be better to combine 25
> rows into one slice and have one matrix where every cell contains 25
> rows.
>
> Are there any objections to that?
Can't object, because I'm not sure what you mean ... how many elements
in a "cell"?
Cheers,
John
John Machin <sjma...@lexicon.net> writes:
>> Given my access pattern, it would probably be better to combine 25
>> rows into one slice and have one matrix where every cell contains
>> 25 rows. Are there any objections to that?
> Can't object, because I'm not sure what you mean ... how many elements
> in a "cell"?
Well, a matrix consists of "cells"? A 10x10 matrix has 100 "cells".
Regards
Ole
Yes, yes, but you said "every cell contains 25 rows" ... what's in a
cell? 25 rows, with each row containing what?
I mean: original cells.
I have 100,000 x 4096 entries:
(0,0) (0,1) ... (0,4095)
(1,0) (1,1) ... (1,4095)
...
(99999,0) (99999,1) ... (99999,4095)
This will be re-organized into a new matrix with 4096 columns (as
before) and 4000 rows. The leftmost cell (first row, first column) in
the new matrix then contains the array
(0,0)
(1,0)
...
(24,0)
The second column of the first row contains the array
(0,1)
(1,1)
...
(24,1)
and so on. The first column of the second row contains
(25,0)
...
(49,0)
That way, I get a new matrix where every cell contains an array of 25
"original" cells. The disadvantage (which I see now that I write it
down) is that this is bad for numpy, since it would have to deal with
arrays instead of single numbers at the matrix positions.
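If I understand numpy correctly, though, that blocked layout might be
had with a plain reshape, which would keep numpy's fast reductions
(just a sketch; file name and dtype are made up):

import numpy as np

a = np.memmap('data.dat', dtype=np.float32, mode='r',
              shape=(100000, 4096))
blocked = a.reshape(4000, 25, 4096)  # 4000 blocks of 25 consecutive rows
r0, r1 = 0, 10                       # example: blocks 0..9 = rows 0..249
col_sums = blocked[r0:r1].sum(axis=0).sum(axis=0)  # one sum per column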
Best regards
Ole
Choose a block size, and store each block contiguously in your output
file. For example, using 128 KB blocks (and assuming each cell holds
a single 8-byte number), we could decide each block is a 128 x 128
sub-matrix of your original. Then to get to a particular block, seek
to its base address and use:
import numpy

block_number = 0  # which 128 x 128 block to read (example value)
src = open('data.file', 'rb')
src.seek(block_number * 128 * 128 * 8)  # 8 bytes per float64 value
block = numpy.fromfile(src, dtype=numpy.float64, count=128 * 128)
block.shape = (128, 128)
and then you've got your sub-block.
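Per-column sums within a block are then just block.sum(axis=0) (and
block.sum(axis=1) per row); accumulate those over the blocks that
cover your selected rows.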
--Scott David Daniels
Scott....@Acm.Org