python bindings order of magnitude slower than java bindings?


Sam

Mar 28, 2012, 4:26:22 PM
to mongodb-user
Hello,

I've run a simple experiment comparing the performance of cursor
iteration using python vs. java and have found that the python
implementation is about 10x slower. I was hoping someone could tell me
if this difference is expected or if I'm doing something clearly
inefficient on the python side.

The benchmark is simple: it performs a query, iterates over the
cursor, and inspects the same field in each document. In the python
version, I can inspect about 22k documents per second. In the java
version, I can inspect about 220k documents per second.

I've seen a few similar questions about python performance and I've
taken the advice and made sure I'm using the C extensions:

>>> import pymongo
>>> pymongo.has_c()
True
>>> import bson
>>> bson.has_c()
True
>>>

Finally, I don't believe the discrepancy is due to fundamental
differences between python and java, at least at the level of my test
code. For example, if I store the queried documents in a python list,
I can iterate over that list very quickly. In other words, it's not an
inefficient python for-loop that accounts for the difference.

Here are some more details about the query:
- Both the python and java implementations use the same query on the
same collection and run on the same machine.
- The collection contains about 20 million documents.
- The query returns about 2 million documents, i.e., I'm retrieving
about 10% of the collection.
- Each document contains three simple fields: a date and two strings.
- The query is indexed and the time spent in the actual query is
negligible for both the python and java implementations. It's the
cursor iteration that accounts for the runtime.
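
For anyone who wants to reproduce the "iterating a plain list is fast" check, here is a minimal sketch of how the documents-per-second figure could be measured. The helper name `docs_per_second` and the in-memory stand-in data are assumptions for illustration, not part of the original benchmark; no database is involved.

```python
import time


def docs_per_second(iterable, field):
    """Iterate once, touch `field` in every item, and return (count, rate)."""
    count = 0
    start = time.time()
    for doc in iterable:
        doc[field]  # access the field so each document is actually inspected
        count += 1
    elapsed = time.time() - start
    # Guard against a zero-length interval on very small inputs.
    rate = count / elapsed if elapsed > 0 else float('inf')
    return count, rate


# Stand-in data: plain dicts already held in memory (hypothetical documents).
docs = [{'val': i} for i in range(100000)]
count, rate = docs_per_second(docs, 'val')
print('Inspected %d docs at %.0f docs/sec' % (count, rate))
```

Passing a pymongo cursor in place of `docs` should give the cursor-iteration rate with the same loop body, which keeps the two measurements comparable.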

Dan Crosta

Mar 28, 2012, 4:39:43 PM
to mongod...@googlegroups.com
Hi Sam,

Can you post your sample code that you're using to execute this benchmark? (If it's long, please use gist/pastebin/pastie/etc).

- Dan


Sam

Mar 28, 2012, 5:14:01 PM
to mongodb-user
Hi Dan,

Here is the python code:

------

#!/usr/bin/python

import time
import pymongo
import datetime

conn = pymongo.Connection('10.0.1.20', 27017)
coll = conn.testdb.test

print 'Collection size is: %d' % (coll.count())

year = 2010
count = 0

t1 = time.time()

#
# Query a year's worth of data, one month at a time.
#
for ii in range(1, 13):
    start = datetime.datetime(year, ii, 1)
    if ii == 12:
        end = datetime.datetime(year+1, 1, 1)
    else:
        end = datetime.datetime(year, ii+1, 1)

    query = {'date': {'$gte': start, '$lt': end}}
    curs = coll.find(query)

    for doc in curs:
        if doc['val'] < 10000000:
            count += 1

t2 = time.time()
print 'Found %d docs in %s seconds' % (count, t2 - t1)
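
The month-by-month windows in the script can be isolated and checked without a database. This is a minimal sketch, assuming the same half-open `[start, end)` ranges as the loop above; the generator name `month_ranges` is hypothetical and not part of the original benchmark.

```python
import datetime


def month_ranges(year):
    """Yield (start, end) datetimes for each month of `year`, end exclusive."""
    for month in range(1, 13):
        start = datetime.datetime(year, month, 1)
        if month == 12:
            # December rolls over into January of the following year.
            end = datetime.datetime(year + 1, 1, 1)
        else:
            end = datetime.datetime(year, month + 1, 1)
        yield start, end


ranges = list(month_ranges(2010))
for start, end in ranges:
    print('%s -> %s' % (start.date(), end.date()))
```

Each pair would slot into the query as `{'date': {'$gte': start, '$lt': end}}`, so consecutive windows neither overlap nor leave gaps.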

Sam

Mar 28, 2012, 5:15:15 PM
to mongodb-user
Here is the java code:

----------------

package com.test.app;

import java.util.Date;
import java.util.List;
import java.util.Set;
import java.util.ArrayList;

import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;
import com.mongodb.DBCursor;

public class App
{
    private void run()
    {
        try {
            Mongo m = new Mongo("10.0.1.20", 27017);
            DB db = m.getDB("testdb");
            DBCollection coll = db.getCollection("test");

            System.out.println("Collection size is: " + coll.getCount());

            int year = 2010;
            long count = 0;

            long t1 = System.currentTimeMillis();

            // Query a year's worth of data, one month at a time.
            for (int ii = 0; ii < 12; ii++) {
                Date start = new Date(year - 1900, ii, 1);
                Date end;
                if (ii == 11) {
                    end = new Date(year - 1900 + 1, 0, 1);
                } else {
                    end = new Date(year - 1900, ii + 1, 2);
                }

                BasicDBObject query = new BasicDBObject();
                query.put("date", new BasicDBObject("$gte", start).append("$lt", end));
                DBCursor cur = coll.find(query);

                while (cur.hasNext()) {
                    DBObject obj = cur.next();
                    Integer val = (Integer)obj.get("val");
                    if (val < 1000000) {
                        count++;
                    }
                }
            }

            long t2 = System.currentTimeMillis();
            float diff = (float)(t2 - t1) / 1000;

            System.out.println("Found " + count + " docs in " + diff + " seconds");

        } catch (Exception e) {
            System.err.println("Exception: " + e);
        }
    }

    public static void main(String[] args)
    {
        App app = new App();
        app.run();
    }
}

Dan Crosta

Mar 28, 2012, 5:31:41 PM
to mongod...@googlegroups.com
I doubt this could account for such a dramatic change in the results, but I notice that you're comparing the doc's 'val' field to 10 million in the Python example and to 1 million in the Java example. On the one hand, this should mean the "count" variable gets a much higher value in Python, which would tend to exaggerate the performance of Python; on the other, it means that Python (potentially) has to do many more additions than Java does, depending on the distribution of your data.

If it doesn't take too long, I'd be curious to see how they stack up once both languages are comparing with the same number.

- Dan

Sam

Mar 28, 2012, 5:43:42 PM
to mongodb-user
Thanks for looking at this so closely. This difference between the two
implementations is actually a harmless oversight. I was just choosing
a number large enough that 'count' would always be incremented. The
purpose of the test-and-increment was just to ensure that I actually
accessed each document. If I fix the code to compare to the same
number, I still get the 10x slowdown of python vs. java.

A. Jesse Jiryu Davis

Apr 2, 2012, 5:43:07 PM
to mongod...@googlegroups.com
Sam, I've made some test data and reran your benchmark; you can see my code here:


I noticed that neither your Python nor your Java code actually exits its loop when the count of documents exceeds 1 million, so it doesn't seem like you're timing what your print statements claim you are; I don't know if that's intentional or not. In any case, in my tests Java completes in 6 seconds and Python (with C extensions) in 13 seconds, which seems like a reasonable difference between the two languages.

Sam

Apr 4, 2012, 1:35:10 PM
to mongodb-user
Hmm, running your benchmarks, I see 4.5s for Java and 17s for Python.
That's a ratio of about 3.7 compared to the ratio of 2.1 that you
report. Perhaps the discrepancy is due to tool versions? For
example, I'm using python2.6 with Ubuntu 10.04. Maybe newer versions
of python perform better?

I was able to get close to my original 10x slowdown by making slight
modifications to your benchmarks.
See: https://gist.github.com/2304006

The changes are:
1) I insert shorter strings in populate.py ('a' instead of 'a' * 100)
2) I iterate over 2M documents instead of 1M.

My results are:
1M / long strings: Java 4.6s, Python 17.0s, ratio = 3.7
1M / short strings: Java 4.6s, Python 33.5s, ratio = 7.2
2M / long strings: Java 7.9s, Python 33.2s, ratio = 4.2
2M / short strings: Java 7.8s, Python 66.5s, ratio = 8.5

The big surprise in the results is how the Python benchmark
performance *degrades* when I insert *shorter* values. If anything, I
would have expected the opposite. Comparatively, the Java numbers are
essentially the same for long vs. short strings.
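
The ratios can be rechecked directly from the timings listed above. This is only arithmetic on the rounded figures in this post, so the last digit may differ slightly from the ratios quoted (which were presumably computed from unrounded times); the variable names are made up for illustration.

```python
# (label, java_seconds, python_seconds), copied from the results above.
timings = [
    ('1M / long strings', 4.6, 17.0),
    ('1M / short strings', 4.6, 33.5),
    ('2M / long strings', 7.9, 33.2),
    ('2M / short strings', 7.8, 66.5),
]

# Python-to-Java slowdown for each configuration.
ratios = [python_s / java_s for _, java_s, python_s in timings]
for (label, _, _), ratio in zip(timings, ratios):
    print('%-20s Python/Java = %.1f' % (label, ratio))

# How much slower Python gets when the strings are shortened.
short_vs_long_1m = 33.5 / 17.0
short_vs_long_2m = 66.5 / 33.2
print('Python short/long: 1M = %.2f, 2M = %.2f'
      % (short_vs_long_1m, short_vs_long_2m))
```

On these numbers the Python run takes roughly twice as long with short strings at both sizes, while the Java times barely move.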