RangeIter(fill_cache=False) still uses lots of memory


Dustin Boswell

Aug 7, 2012, 4:07:06 PM
to py-le...@googlegroups.com
I am iterating over a large table, simply counting the number of rows, and I noticed that the memory usage is very high (about 2GB).  I used the fill_cache=False param, but that didn't help.

Here is some code to replicate the problem:

import leveldb
import os

# this creates a large table with random data (the exact data doesn't matter)
db = leveldb.LevelDB("/mnt/dustin/test_level_db")
for x in xrange(1000*1000*1000):
    db.Put(key="%010d"%x, value=str(x**3))

Once this table is created, I ran a new process:

db = leveldb.LevelDB("/mnt/dustin/test_level_db")
num_rows = 0
for (key, value) in db.RangeIter(include_value=True, fill_cache=False):
    num_rows += 1
    if num_rows % 10000000 == 0:
        print "Iterated over %dM rows" % (num_rows/1000000)
        os.system("ps u -p %d" % os.getpid())

which produced the following output:

Iterated over 10M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 97.8  1.2 250968 219576 pts/1   S+   13:03   0:06 python test_level_db.py
Iterated over 20M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 98.5  2.4 479872 448408 pts/1   R+   13:03   0:13 python test_level_db.py
Iterated over 30M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 98.5  3.8 713816 682164 pts/1   S+   13:03   0:20 python test_level_db.py
Iterated over 40M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  102  5.1 949700 917876 pts/1   S+   13:03   0:27 python test_level_db.py
Iterated over 50M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  101  6.4 1186420 1154532 pts/1 S+   13:03   0:34 python test_level_db.py
Iterated over 60M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  7.6 1396724 1364628 pts/1 S+   13:03   0:41 python test_level_db.py
Iterated over 70M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  8.1 1491676 1460924 pts/1 S+   13:03   0:48 python test_level_db.py
Iterated over 80M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964  100  8.2 1511856 1481052 pts/1 S+   13:03   0:55 python test_level_db.py
Iterated over 90M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 99.7  8.3 1523380 1492436 pts/1 S+   13:03   1:01 python test_level_db.py
Iterated over 100M rows
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
dustin   28964 99.5  8.3 1532064 1501004 pts/1 S+   13:03   1:08 python test_level_db.py


As you can see, the memory usage (both the VSZ and RSS) just grows and grows...
Any ideas on why this is happening, or how to stop it?
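As an aside, instead of shelling out to ps at every checkpoint, the current resident set size can be sampled directly. A minimal, Linux-only sketch (it parses /proc/self/statm, which is not portable; the helper name is mine):

```python
import resource

def current_rss_kb():
    # Linux-specific: the second field of /proc/self/statm is the
    # number of resident pages for this process
    with open("/proc/self/statm") as f:
        pages = int(f.read().split()[1])
    return pages * resource.getpagesize() // 1024

print("current RSS: %d kB" % current_rss_kb())
```

Calling this inside the iteration loop gives the same trend as the ps output above without forking a subprocess every 10M rows.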

Árni Már Jónsson

Aug 7, 2012, 5:53:19 PM
to py-le...@googlegroups.com
This looks very weird. I just ripped out pretty much all the code in py-leveldb, and still, this behaviour persists.

I think I've ruled out memory leaks wrt the Python API. The remaining code seems to match the iteration example in the leveldb documentation.

I'll report back once I've done a valgrind pass on this.

Arni

Dustin Boswell

Aug 7, 2012, 6:13:56 PM
to py-le...@googlegroups.com
Thanks for looking into it.  If you can reproduce this in straight C/C++, I'll repost my email to the leveldb group.

Árni Már Jónsson

Aug 7, 2012, 6:57:09 PM
to py-le...@googlegroups.com
This is memory associated with leveldb. When the LevelDB object is GC'ed after the iteration, the memory goes down to regular levels. Haven't figured out why it's using way more than the cache settings would indicate.

import leveldb, os, time

db = leveldb.LevelDB("/home/arni/py-leveldb/justin")

for i, key_value in enumerate(db.RangeIter(include_value=True, fill_cache=False)):
    if i % 1000000 == 0:
        print "Iterated over %dM rows" % (i/1000000)
        os.system("ps u -p %d" % os.getpid())

del db

while True:
    os.system("ps u -p %d" % os.getpid())
    time.sleep(1.0)

Árni Már Jónsson

Aug 7, 2012, 7:23:06 PM
to py-le...@googlegroups.com
A straightforward C++ translation behaves the same. I'm using just the default cache settings. See http://leveldb.googlecode.com/svn/trunk/doc/index.html for more options.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>

#include <leveldb/db.h>
#include <leveldb/write_batch.h>
#include <leveldb/comparator.h>
#include <leveldb/cache.h>

void checkmem()
{
    char command[64];
    sprintf(command, "ps u -p %i", (int)getpid());
    system(command);
}

int main()
{
    leveldb::DB* db = 0;
    leveldb::Options options;
    leveldb::Status status;

    status = leveldb::DB::Open(options, "/home/arni/py-leveldb/justin", &db);
    assert(status.ok());

    // iteration
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    size_t n = 0;

    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        n += 1;

        if (n % 1000000 == 0) {
            printf("%iM scanned\n", (int)(n / 1000000));
            checkmem();
        }
    }

    assert(it->status().ok());

    printf("\n\ndeleting iterator\n");
    delete it;
    checkmem();

    printf("\n\ndeleting db\n");
    delete db;
    checkmem();

    for (int i = 0; i < 30; i++) {
        checkmem();
        sleep(1);
    }

    exit(EXIT_SUCCESS);
}

I compiled from ./py-leveldb/testing using: g++ -I../leveldb/include test.cc ../leveldb/libleveldb.a ../snappy-read-only/.libs/libsnappy.a -lpthread
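One speculative thing worth trying: fill_cache only bypasses the block cache, so what's growing could be the table cache, which holds index/filter blocks for up to max_open_files open sstables. Both knobs can be tightened when opening the database via py-leveldb; a sketch (the keyword names below are taken from the py-leveldb README and should be double-checked against your version):

```python
import leveldb

# Speculative: bound the block cache and the table cache at open time.
# block_cache_size / max_open_files are assumed py-leveldb keyword names.
db = leveldb.LevelDB(
    "/home/arni/py-leveldb/justin",
    block_cache_size=8 * (2 << 20),  # block cache bound, in bytes
    max_open_files=100,              # bounds the table cache (open sstables)
)
```

If memory stops tracking the number of distinct sstables visited with max_open_files lowered, that would point at the table cache rather than a leak.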

Dustin Boswell

Aug 7, 2012, 7:37:01 PM
to py-le...@googlegroups.com
Ok, I'll post it to leveldb google group.  Thanks again!