RangeIter(fill_cache=False) still uses lots of memory


Dustin Boswell

Aug 7, 2012, 4:07:06 PM
to py-le...@googlegroups.com
I am iterating over a large table, simply counting the number of rows, and I noticed that the memory usage is very high (about 2GB).  I used the fill_cache=False param, but that didn't help.

Here is some code to replicate the problem:

import leveldb
import os

# this creates a large table with random data (the exact data doesn't matter)
db = leveldb.LevelDB("/mnt/dustin/test_level_db")
for x in xrange(1000*1000*1000):
    db.Put(key="%010d"%x, value=str(x**3))

Once this table is created, I ran a new process:

db = leveldb.LevelDB("/mnt/dustin/test_level_db")
num_rows = 0
for (key, value) in db.RangeIter(include_value=True, fill_cache=False):
    num_rows += 1
    if num_rows % 10000000 == 0:
        print "Iterated over %dM rows" % (num_rows/1000000)
        os.system("ps u -p %d" % os.getpid())

which produced the following output:

Iterated over 10M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 97.8  1.2 250968 219576 pts/1   S+   13:03   0:06 python test_level_db.py
Iterated over 20M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 98.5  2.4 479872 448408 pts/1   R+   13:03   0:13 python test_level_db.py
Iterated over 30M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 98.5  3.8 713816 682164 pts/1   S+   13:03   0:20 python test_level_db.py
Iterated over 40M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  102  5.1 949700 917876 pts/1   S+   13:03   0:27 python test_level_db.py
Iterated over 50M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  101  6.4 1186420 1154532 pts/1 S+   13:03   0:34 python test_level_db.py
Iterated over 60M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  7.6 1396724 1364628 pts/1 S+   13:03   0:41 python test_level_db.py
Iterated over 70M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  8.1 1491676 1460924 pts/1 S+   13:03   0:48 python test_level_db.py
Iterated over 80M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  8.2 1511856 1481052 pts/1 S+   13:03   0:55 python test_level_db.py
Iterated over 90M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 99.7  8.3 1523380 1492436 pts/1 S+   13:03   1:01 python test_level_db.py
Iterated over 100M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 99.5  8.3 1532064 1501004 pts/1 S+   13:03   1:08 python test_level_db.py


As you can see, the memory usage (both the VSZ and RSS) just grows and grows...
Any ideas on why this is happening, or how to stop it?
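As an aside, instead of shelling out to ps at every checkpoint, the current resident set size can be sampled directly. A minimal, Linux-only sketch (it parses /proc/self/statm, which is not portable; the helper name is mine):

```python
import resource

def current_rss_kb():
    # Linux-specific: the second field of /proc/self/statm is the
    # number of resident pages for this process
    with open("/proc/self/statm") as f:
        pages = int(f.read().split()[1])
    return pages * resource.getpagesize() // 1024

print("current RSS: %d kB" % current_rss_kb())
```

Calling this inside the iteration loop gives the same trend as the ps output above without forking a subprocess every 10M rows.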

Árni Már Jónsson

Aug 7, 2012, 5:53:19 PM
to py-le...@googlegroups.com
This looks very weird. I just ripped out pretty much all the code in py-leveldb, and still, this behaviour persists.

I think I've ruled out memory leaks wrt the Python API. The remaining code seems to match the iteration example in the leveldb documentation.

I'll report back once I've done a valgrind pass on this.

Arni

Dustin Boswell

Aug 7, 2012, 6:13:56 PM
to py-le...@googlegroups.com
Thanks for looking into it.  If you can reproduce this in straight C/C++, I'll repost my email to the leveldb group.

Árni Már Jónsson

Aug 7, 2012, 6:57:09 PM
to py-le...@googlegroups.com
This is memory associated with leveldb. When the LevelDB object is GC'ed after the iteration, the memory goes down to regular levels. Haven't figured out why it's using way more than the cache settings would indicate.

import leveldb, os, time

db = leveldb.LevelDB("/home/arni/py-leveldb/justin")

for i, key_value in enumerate(db.RangeIter(include_value=True, fill_cache=False)):
    if i % 1000000 == 0:
        print "Iterated over %dM rows" % (i/1000000)
        os.system("ps u -p %d" % os.getpid())

del db

while True:
    os.system("ps u -p %d" % os.getpid())
    time.sleep(1.0)

Árni Már Jónsson

Aug 7, 2012, 7:23:06 PM
to py-le...@googlegroups.com
A straightforward C++ translation behaves the same. I'm using just the default cache settings. See http://leveldb.googlecode.com/svn/trunk/doc/index.html for more options.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>

#include <leveldb/db.h>
#include <leveldb/write_batch.h>
#include <leveldb/comparator.h>
#include <leveldb/cache.h>

void checkmem()
{
    char command[64];
    sprintf(command, "ps u -p %i", (int)getpid());
    system(command);
}

int main()
{
    leveldb::DB* db = 0;
    leveldb::Options options;
    leveldb::Status status;

    status = leveldb::DB::Open(options, "/home/arni/py-leveldb/justin", &db);
    assert(status.ok());

    // iteration
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    size_t n = 0;

    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        n += 1;

        if (n % 1000000 == 0) {
            printf("%iM scanned\n", (int)(n / 1000000));
            checkmem();
        }
    }

    assert(it->status().ok());

    printf("\n\ndeleting iterator\n");
    delete it;
    checkmem();

    printf("\n\ndeleting db\n");
    delete db;
    checkmem();

    for (int i = 0; i < 30; i++) {
        checkmem();
        sleep(1);
    }

    exit(EXIT_SUCCESS);
}

I compiled from ./py-leveldb/testing using: g++ -I../leveldb/include test.cc ../leveldb/libleveldb.a ../snappy-read-only/.libs/libsnappy.a -lpthread
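One speculative thing worth trying: fill_cache only bypasses the block cache, so what's growing could be the table cache, which holds index/filter blocks for up to max_open_files open sstables. Both knobs can be tightened when opening the database via py-leveldb; a sketch (the keyword names below are taken from the py-leveldb README and should be double-checked against your version):

```python
import leveldb

# Speculative: bound the block cache and the table cache at open time.
# block_cache_size / max_open_files are assumed py-leveldb keyword names.
db = leveldb.LevelDB(
    "/home/arni/py-leveldb/justin",
    block_cache_size=8 * (2 << 20),  # block cache bound, in bytes
    max_open_files=100,              # bounds the table cache (open sstables)
)
```

If memory stops tracking the number of distinct sstables visited with max_open_files lowered, that would point at the table cache rather than a leak.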

Dustin Boswell

Aug 7, 2012, 7:37:01 PM
to py-le...@googlegroups.com
Ok, I'll post it to leveldb google group.  Thanks again!