
Re: Populating huge data structures from disk


Chris Mellon

unread,
Nov 6, 2007, 2:04:57 PM11/6/07
to pytho...@python.org
On Nov 6, 2007 12:18 PM, Michael Bacarella <mb...@gpshopper.com> wrote:
>
>
>
>
> For various reasons I need to cache about 8GB of data from disk into core on
> application startup.
>

Are you sure? On PC hardware, at least, doing this doesn't guarantee
that accessing it is actually going to be any faster. Is just
mmap()ing the file a problem for some reason?

I assume you're on a 64 bit machine.
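
For the record, mmap()ing it is only a few lines. A minimal sketch
(the path is made up):

import mmap, os

# Map the file read-only; the OS pages it in on demand and keeps hot
# pages cached in RAM, so "startup" is nearly instant.
f = open('/path/to/data', 'rb')                 # hypothetical path
size = os.fstat(f.fileno()).st_size
m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

first_word = m[0:8]                             # served from the page cache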

> Building this cache takes nearly 2 hours on modern hardware. I am surprised
> to discover that the bottleneck here is CPU.
>
>
>
> The reason this is surprising is because I expect something like this to be
> very fast:
>
>
>
> #!python
>
>
>
> import array
>
> a = array.array('L')
>
> f = open('/dev/zero','r')
>
> while True:
>     a.fromstring(f.read(8))
>
>

This just creates the same array over and over, forever. Is this
really the code you meant to write? I don't know why you'd expect an
infinite loop to be "fast"...

>
>
>
> Profiling this application shows all of the time is spent inside
> a.fromstring.
>

Obviously, because that's all that's inside your while True loop.
There's nothing else that it could spend time on.

> Little difference if I use list instead of array.
>
>
>
> Is there anything I could tell the Python runtime to help it run this
> pathologically slanted case faster?
>

This code executes in a couple seconds for me (size reduced to fit in
my 32 bit memory space):

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> s = '\x00' * ((1024 **3)/2)
>>> len(s)
536870912
>>> a = array.array('L')
>>> a.fromstring(s)
>>>

You might also want to look at array.fromfile()

Michael Bacarella

unread,
Nov 6, 2007, 3:40:26 PM11/6/07
to pytho...@python.org

> > For various reasons I need to cache about 8GB of data from disk into
> > core on application startup.
>
> Are you sure? On PC hardware, at least, doing this doesn't guarantee
> that accessing it is actually going to be any faster. Is just
> mmap()ing the file a problem for some reason?
>
> I assume you're on a 64 bit machine.

Very sure. If we hit the disk at all performance drops unacceptably.
The application has low locality of reference, so on-demand caching
isn't an option. We get the behavior we want when we pre-cache; the
issue is simply that it takes so long to build this cache.



> > Building this cache takes nearly 2 hours on modern hardware. I am
> > surprised to discover that the bottleneck here is CPU.
> >
> > The reason this is surprising is because I expect something like this
> > to be very fast:
> >
> > #!python
> > import array
> >
> > a = array.array('L')
> >
> > f = open('/dev/zero','r')
> >
> > while True:
> >     a.fromstring(f.read(8))
>
> This just creates the same array over and over, forever. Is this
> really the code you meant to write? I don't know why you'd expect an
> infinite loop to be "fast"...

Not exactly. fromstring() appends to the array. It's growing the array
towards infinity. Since infinity never finishes, it's hard to get an
idea of how slow this looks. Let's do 800MB instead.

Here's an example of loading 800MB in C:

$ time ./eat800

real 0m44.939s
user 0m10.620s
sys 0m34.303s

$ cat eat800.c
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>  /* for read() */

int main(void)
{
    int f = open("/dev/zero", O_RDONLY);
    int vlen = 8;
    long *v = malloc((sizeof (long)) * vlen);
    int i;

    for (i = 0; i < 100000000; i++) {
        if (i >= vlen) {
            vlen *= 2;
            v = (long *)realloc(v, (sizeof (long)) * vlen);
        }
        read(f, v + i, sizeof (long));
    }
    return 0;
}

Here's the similar operation in Python:
$ time python eat800.py

real 3m8.407s
user 2m40.189s
sys 0m27.934s

$ cat eat800.py
#!/usr/bin/python

import array
a = array.array('L')

f = open('/dev/zero')
for i in xrange(100000000):
    a.fromstring(f.read(8))


They spend about the same amount of time in system, but Python spends
4.7x as much CPU in userland as C does.

And there's no solace in lists either:

$ time python eat800.py

real 4m2.796s
user 3m57.865s
sys 0m3.638s

$ cat eat800.py
#!/usr/bin/python

import struct

d = []
f = open('/dev/zero')
for i in xrange(100000000):
    d.append(struct.unpack('L',f.read(8))[0])


cPickle with protocol 2 has some promise but is more complicated because
arrays can't be pickled. In a perfect world I could do something like this
somewhere in the backroom:

x = lengthy_number_crunching()
magic.save_mmap("/important-data")

and in the application do...

x = magic.mmap("/important-data")
magic.mlock("/important-data")

and once the mlock finishes bringing important-data into RAM, at
the speed of your disk I/O subsystem, all accesses to x will be
hits against RAM.
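
The closest real code I can sketch for the array case (path made up;
nothing here actually pins the pages the way mlock would):

import array, mmap, os

# backroom: dump the crunched numbers to disk as raw machine words
x = array.array('L', xrange(1000))        # stand-in for the real data
open('/important-data', 'wb').write(x.tostring())

# application: map the words back in and fault every page once
f = open('/important-data', 'rb')
size = os.fstat(f.fileno()).st_size
m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
PAGE = 4096                               # assumed page size
for off in xrange(0, size, PAGE):
    m[off]                                # touch one byte to pull the page in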

Any thoughts?


Chris Mellon

unread,
Nov 6, 2007, 4:00:44 PM11/6/07
to pytho...@python.org
On Nov 6, 2007 2:40 PM, Michael Bacarella <mb...@gpshopper.com> wrote:
>
> > > For various reasons I need to cache about 8GB of data from disk into
> > > core on application startup.
> >
> > Are you sure? On PC hardware, at least, doing this doesn't guarantee
> > that accessing it is actually going to be any faster. Is just
> > mmap()ing the file a problem for some reason?
> >
> > I assume you're on a 64 bit machine.
>
> Very sure. If we hit the disk at all performance drops unacceptably.
> The application has low locality of reference, so on-demand caching
> isn't an option. We get the behavior we want when we pre-cache; the
> issue is simply that it takes so long to build this cache.
>

You're not going to avoid hitting disk just by reading into your
memory space. If your performance needs are really so tight that you
can't rely on the VM system to keep pages you're using in memory,
you're going to need to do this at a much lower (and system specific)
level.

mmap() with a reasonable VM system shouldn't be any slower than
reading it all into memory.

> > > Building this cache takes nearly 2 hours on modern hardware. I am
> > > surprised to discover that the bottleneck here is CPU.
> > >
> > > The reason this is surprising is because I expect something like this
> > > to be very fast:
> > >
> > > #!python
> > > import array
> > >
> > > a = array.array('L')
> > >
> > > f = open('/dev/zero','r')
> > >
> > > while True:
> > >     a.fromstring(f.read(8))
> >
> > This just creates the same array over and over, forever. Is this
> > really the code you meant to write? I don't know why you'd expect an
> > infinite loop to be "fast"...
>
> Not exactly. fromstring() appends to the array. It's growing the array
> towards infinity.

You're correct, I misread the results of my testing.

> Since infinity never finishes, it's hard to get an idea of how slow
> this looks. Let's do 800MB instead.
>

That makes this a useless benchmark, though...

Note that you're not doing the same thing at all. You're
pre-allocating the array in the C code, but not in Python (and I don't
think you can). Is there some reason you're growing an 8 gig array 8
bytes at a time?

> They spend about the same amount of time in system, but Python spends
> 4.7x as much CPU in userland as C does.
>

Python has to grow the array. It's possible that this is tripping a
degenerate case in the gc behavior also (I don't know if array uses
PyObjects for its internal buffer), and if it is you'll see an
improvement by disabling GC.
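
That is, something like this sketch (same loop as yours, just with the
collector off):

import array, gc

gc.disable()                    # no cyclic-GC passes during the load
try:
    a = array.array('L')
    f = open('/dev/zero')
    for i in xrange(100000000):
        a.fromstring(f.read(8))
finally:
    gc.enable()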

> And there's no solace in lists either:
>
> $ time python eat800.py
>
> real 4m2.796s
> user 3m57.865s
> sys 0m3.638s
>
> $ cat eat800.py
> #!/usr/bin/python
>
> import struct
>
> d = []
> f = open('/dev/zero')
> for i in xrange(100000000):
>     d.append(struct.unpack('L',f.read(8))[0])
>
>
> cPickle with protocol 2 has some promise but is more complicated because
> arrays can't be pickled. In a perfect world I could do something like this
> somewhere in the backroom:
>
> x = lengthy_number_crunching()
> magic.save_mmap("/important-data")
>
> and in the application do...
>
> x = magic.mmap("/important-data")
> magic.mlock("/important-data")
>
> and once the mlock finishes bringing important-data into RAM, at
> the speed of your disk I/O subsystem, all accesses to x will be
> hits against RAM.
>

You've basically described what mmap does, as far as I can tell. Have
you tried just mmapping the file?

>
> Any thoughts?
>
>

Did you try array.fromfile like I suggested?


Neil Cerutti

unread,
Nov 6, 2007, 4:43:26 PM11/6/07
to

Disable the garbage collector, use a while loop and manual index
instead of an iterator, preallocate your list, e.g.,
[None]*100000000, and hope they don't have blasters!
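
A sketch of all three together, for the list version:

import gc, struct

gc.disable()                     # 1. no garbage collector
N = 100000000
d = [None] * N                   # 3. preallocated list, so no resizing
f = open('/dev/zero')
i = 0
while i < N:                     # 2. while loop with a manual index
    d[i] = struct.unpack('L', f.read(8))[0]
    i += 1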

--
Neil Cerutti

Michael Bacarella

unread,
Nov 6, 2007, 4:42:04 PM11/6/07
to pytho...@python.org

> Note that you're not doing the same thing at all. You're
> pre-allocating the array in the C code, but not in Python (and I don't
> think you can). Is there some reason you're growing an 8 gig array 8
> bytes at a time?
>
> They spend about the same amount of time in system, but Python spends
> 4.7x as much CPU in userland as C does.
>
> Python has to grow the array. It's possible that this is tripping a
> degenerate case in the gc behavior also (I don't know if array uses
> PyObjects for its internal buffer), and if it is you'll see an
> improvement by disabling GC.

That does explain why it's consuming 4.7x as much CPU.

> > x = lengthy_number_crunching()
> > magic.save_mmap("/important-data")
> >
> > and in the application do...
> >
> > x = magic.mmap("/important-data")
> > magic.mlock("/important-data")
> >
> > and once the mlock finishes bringing important-data into RAM, at
> > the speed of your disk I/O subsystem, all accesses to x will be
> > hits against RAM.
>

> You've basically described what mmap does, as far as I can tell. Have
> you tried just mmapping the file?

Yes, that would be why my fantasy functions have 'mmap' in their names.

However, in C you can mmap arbitrarily complex data structures whereas
in Python all you can mmap without transformations is an array or a string.
I didn't say this earlier, but I do need to pull more than arrays
and strings into RAM. Not being able to pre-allocate storage is a big
loser for this approach.

Hrvoje Niksic

unread,
Nov 6, 2007, 4:44:04 PM11/6/07
to
"Michael Bacarella" <mb...@gpshopper.com> writes:

> cPickle with protocol 2 has some promise but is more complicated because
> arrays can't be pickled.

This is not true:

>>> import array
>>> a = array.array('L')
>>> a.extend(xrange(10))
>>> a
array('L', [0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L])
>>> import cPickle as pickle
>>> s = pickle.dumps(a, -1)
>>> pickle.loads(s)
array('L', [0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L])

But I don't think unpickling will be any faster than array.fromstring.

Anyway, why not use array.fromfile instead? It's actually *faster*
than the C code you posted:

$ time ./eat80 # not eat800 because I didn't feel like waiting
./eat80 0.58s user 2.43s system 93% cpu 3.226 total

$ cat eat80.py


#!/usr/bin/python
import array
a = array.array('L')
f = open('/dev/zero')

a.fromfile(f, 10000000)
print len(a)

$ time ./eat80.py
10000000
./eat80.py 0.02s user 0.00s system 48% cpu 0.058 total

Paul Rubin

unread,
Nov 6, 2007, 5:14:04 PM11/6/07
to
"Michael Bacarella" <mb...@gpshopper.com> writes:
> Very sure. If we hit the disk at all performance drops
> unacceptably. The application has low locality of reference so
> on-demand caching isn't an option. We get the behavior we want when
> we pre-cache; the issue is simply that it takes so long to build
> this cache.

The way I do it is run a separate process that mmaps the file and
reads one byte from each page every half hour or so. You are right
that it makes a huge difference.
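
Something like this sketch (page size and path are assumptions):

import mmap, os, time

PAGE = 4096                          # assumed page size

f = open('/important-data', 'rb')    # made-up path
size = os.fstat(f.fileno()).st_size
m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)

while True:
    for off in xrange(0, size, PAGE):
        m[off]                       # one byte per page keeps it resident
    time.sleep(30 * 60)              # every half hour or so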

Michael Bacarella

unread,
Nov 6, 2007, 5:36:50 PM11/6/07
to pytho...@python.org

Why not just disable swap?

Chris Mellon

unread,
Nov 6, 2007, 5:43:38 PM11/6/07
to pytho...@python.org

Well, for certain limited values of "arbitrary", but okay. It's true
that you can cast pointers into the mmapped region into, say, pointers
to a struct of structs.

The Python equivalent would be to have your Python classes be wrappers
around access to this memory buffer, calculating the offset needed to
get any particular field. ctypes.Structure is probably a good starting
point if you want to implement this. Sadly, it looks like mmap.mmap()
doesn't expose the address of its buffer so you'll either need to use
the C mmap (via ctypes, probably) or use array.array to load the
bytes and use its buffer_info() to get the address to load your
structs from.
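
A sketch of what I mean (the record layout is obviously made up):

import array, ctypes

class Record(ctypes.Structure):          # hypothetical on-disk layout
    _fields_ = [('key',   ctypes.c_ulong),
                ('value', ctypes.c_ulong)]

a = array.array('c', open('/important-data', 'rb').read())
addr, nbytes = a.buffer_info()           # base address of the array's buffer
# Overlay a Record on the bytes at index i, without copying:
i = 0
rec = Record.from_address(addr + i * ctypes.sizeof(Record))
print rec.key, rec.value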

> whereas in Python all you can mmap without transformations is an
> array or a string.

Read "array or string" as "stream of bytes" in this context.

> I didn't say this earlier, but I do need to pull more than arrays
> and strings into RAM. Not being able to pre-allocate storage is a big
> loser for this approach.
>

It is a little annoying that there's no way to pre-allocate an array.
It doesn't over-allocate, either, so building it up a few bytes at a
time is pretty much worst case behavior.

Paul Rubin

unread,
Nov 6, 2007, 7:31:51 PM11/6/07
to
"Michael Bacarella" <mb...@gpshopper.com> writes:
> > The way I do it is run a separate process that mmaps the file and
> > reads one byte from each page every half hour or so. You are right
> > that it makes a huge difference.
>
> Why not just disable swap?

The system is demand paged. If swap is disabled, pages can still get
released from memory since they can be paged back in later. Remember
we are talking about an actual disk file (persistent data) that
happens to be mmap'ed into a running program. So paging would be
to the file system and not the swap area.

Hrvoje Niksic

unread,
Nov 7, 2007, 2:41:04 AM11/7/07
to
"Chris Mellon" <ark...@gmail.com> writes:

> It is a little annoying that there's no way to pre-allocate an
> array. It doesn't over-allocate, either, so building on a few bytes
> at a time is pretty much worst case behavior.

The fine source says:

array_resize(arrayobject *self, Py_ssize_t newsize)
{
    ...
    /* This over-allocates proportional to the array size, making room
     * for additional growth. The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * The growth pattern is: 0, 4, 8, 16, 25, 34, 46, 56, 67, 79, ...
     * Note, the pattern starts out the same as for lists but then
     * grows at a smaller rate so that larger arrays only overallocate
     * by about 1/16th -- this is done because arrays are presumed to be
     * more memory critical.
     */

    _new_size = (newsize >> 4) + (self->ob_size < 8 ? 3 : 7) + newsize;

Rhamphoryncus

unread,
Nov 7, 2007, 3:42:11 PM11/7/07
to

I don't see how needing transformations is an issue, as it's just a
constant overhead (in big-O terms).

The bigger concern is whether what you're storing is self-contained
(numbers, strings) or contains inter-object references. The latter may
require you to translate between indexes and temporary handle
objects. This in turn may require some sort of garbage collection
scheme (refcounting, tracing). Note that the addresses of the mmap'd
region can change each time the program runs, so even in C you may
need to use indexes. (You may also want to eliminate C's padding,
although that would make the objects not directly accessible, due to
hardware alignment requirements.)

Having a separate process (or thread) that occasionally pokes each
page (as suggested by Paul Rubin) seems like the cleanest way to
ensure they stay "hot". One pass during startup is insufficient, as
unused portions of a long-running program may get swapped out. Also
note that poking need only touch 1 byte per page, much cheaper than
copying the entire page (so long as the page is already loaded from
disk).


--
Adam Olsen, aka Rhamphoryncus
