I then wrote a small Ruby program that does the exact same thing.
I ran the Java program using the standard settings (regular heap, regular stack size, etc.) and mongo driver 2.1-master on Mac OS X 10.6.4 using the 64-bit JVM. I also ran the Ruby program using the 1.1 ruby driver on MRI 1.9.2p0 and JRuby 1.5.1 (same JVM).
Imagine my surprise when the Ruby program blew away the Java program while reading over 150 million documents! I find it shocking that the Ruby C extension was able to deserialize some of these documents, many of which contain a "vals" array with thousands of elements, in less time. I am so surprised that I am certain I did something wrong.
Please check the gist and tell me what I did wrong with this comparison.
cr
PS - I'm not including results for the 2.1 release driver (from 20100819), but I did notice that it was consistently about 10% faster than the current java git master.
If you are just trying to test the bson parsing (de-serialization)
then it seems like sorting on the server is not needed and will only
add more variance to the results.
Also, this test doesn't test the serialization (saving) of data from
the language to bson. Just something to keep in mind.
I would not be surprised if the java code needs to be optimized to
reduce the number of object creations and such. It would be
interesting to see what a profiler has to say is the hot spot in the
java code.
> --
> You received this message because you are subscribed to the Google Groups "mongodb-dev" group.
> To post to this group, send email to mongo...@googlegroups.com.
> To unsubscribe from this group, send email to mongodb-dev...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/mongodb-dev?hl=en.
>
>
> Can you post your sample data and the stats from your jvm (version,
> options, etc)? It is hard to tell what is being tested.
>
> If you are just trying to test the bson parsing (de-serialization)
> then it seems like sorting on the server is not needed and will only
> add more variance to the results.
>
> Also, this test doesn't test the serialization (saving) of data from
> the language to bson. Just something to keep in mind.
>
> I would not be surprised if the java code needs to be optimized to
> reduce the number of object creations and such. It would be
> interesting to see what a profiler has to say is the hot spot in the
> java code.
I have updated the gist to include the options as well as the JVM details.
I also modified the test programs to remove the sort operation. Results were unchanged.
BTW, I do realize this is really a deserialization test. That's by design. If there is a similar program already written in Java that exercises only the BSON stuff, please point it out and I'll use it instead.
cr
--
> Why is it surprising that C is faster than Java? C is a systems language, whereas Java sits on top of its virtual machine. The virtual machine introduces some overhead to code execution, and that is reflected in the execution time.
Because it isn't just C. As the BSON types are decoded, the runtime needs to create Ruby strings, hashes, arrays, etc. The deserialization has to work through the Ruby runtime's C API which has some significant overhead compared to what you would find in the C or C++ mongo driver.
So yes, it is still surprising to me.
cr
Chuck, I put together a sample data set like yours, ran the scripts, and got a very similar result: the Ruby driver was much faster. Here are my slightly modified scripts, plus a script to generate the sample data, in case anyone is interested in trying to reproduce:
> I submitted a patch to BSONDecoder which improves performance of
> decoding (http://github.com/theunique/mongo-java-driver/commit/
> efb91dbc42cff1d9c138bf27eda4062d27458741)
>
> Could someone repeat benchmark with the patch?
>
> ciao.hans.
I get a fatal error when trying to run a driver built from that commit.
I cloned your repository and reset to that commit before building. The build was clean but the run wasn't.
cr
> Sorry, was a bug.
> New BSONDecoder does readahead but beyond object boundary which is not
> allowed with multiple objects.
> So I had to limit the readahead to object boundary.
>
> Look at new commit of BSONDecoder
> http://github.com/theunique/mongo-java-driver/commit/c30892cc06baef0a7df1877f1d2efc941ec15142
>
> ciao.hans.
Running your code versus the 2.2 release showed some sizable differences.
2.2 took 342 seconds.
Your patch took 216 seconds.
That works out to roughly 40% faster.
However, the Java driver is still slower than the Ruby C extension, which ran the same test in 180 seconds.
cr
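For context on the patch being discussed: every top-level BSON document begins with a little-endian int32 giving its total size in bytes, so a decoder that buffers ahead has to cap its reads at that boundary or it will consume bytes belonging to the next document in the stream. A minimal sketch of that boundary logic (the stream contents and method names here are illustrative, not the driver's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BsonBoundary {
    // Read the little-endian int32 length prefix of a BSON document.
    static int readInt32LE(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read(), b2 = in.read(), b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) throw new IOException("EOF in length prefix");
        return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
    }

    // Read exactly one document's bytes, never reading past its boundary.
    static byte[] readOneDocument(InputStream in) throws IOException {
        int len = readInt32LE(in);       // total size, including the 4 prefix bytes
        byte[] doc = new byte[len];
        doc[0] = (byte) len;             // restore the prefix in the output buffer
        doc[1] = (byte) (len >> 8);
        doc[2] = (byte) (len >> 16);
        doc[3] = (byte) (len >> 24);
        int off = 4;
        while (off < len) {              // fill only up to the boundary
            int n = in.read(doc, off, len - off);
            if (n < 0) throw new IOException("EOF inside document");
            off += n;
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        // Two back-to-back "documents": a 5-byte one and a 6-byte one (dummy payloads).
        byte[] stream = {5, 0, 0, 0, 0, 6, 0, 0, 0, 0, 0};
        InputStream in = new ByteArrayInputStream(stream);
        System.out.println(readOneDocument(in).length); // 5
        System.out.println(readOneDocument(in).length); // 6
    }
}
```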
> The reason the ruby driver is faster on that test is that ruby 1.9 has
> a native internal utf-8 representation and java does not.
> Almost all the time in java is spent converting utf-8 into java's
> string classes.
> On larger objects, or other types of tests, the result will be
> different of course.
I'm not sure I follow. The only strings in the test are the key names '_id', 'ts', 'drn', 'cid' and 'vals'.
Why would keys added by a Ruby program be saved as utf-8 if that is going to cause performance problems for other drivers? Shouldn't there be one string representation enforced for keys across all drivers?
Or am I misunderstanding this?
cr
So now I'm really confused. I don't think the test I have been doing uses utf-8 strings at all, so I think Eliot's explanation for the performance difference is in error.
cr
I don't know about the performance issue Eliot speaks of.
BTW, I did a quick google on 'java utf-8' and ran across an article about faster *encoding* of Java strings to utf-8.
http://blog.rapleaf.com/dev/2010/04/26/faster-string-to-utf-8-encoding-in-java/
I don't know if that's helpful or not. I imagine the driver is already using whatever tricks it can to minimize this cost. Too bad that String.getBytes("utf-8") is so slow.
cr
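The trick in articles like that one is an ASCII fast path: scan the string, and if every char is below 0x80 the UTF-8 bytes are just the chars narrowed to bytes, so the general-purpose encoder can be skipped entirely. A rough sketch of the idea (not the driver's code; whether it actually beats `String.getBytes` depends on the JDK version and the string contents):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FastUtf8Encode {
    // Encode a String to UTF-8, taking a cheap path when it is pure ASCII.
    static byte[] encodeUtf8(String s) {
        int n = s.length();
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            if (c >= 0x80) {
                // Non-ASCII char found: fall back to the general encoder.
                return s.getBytes(StandardCharsets.UTF_8);
            }
            out[i] = (byte) c;           // ASCII: the UTF-8 byte IS the char value
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(
                encodeUtf8("vals"), "vals".getBytes(StandardCharsets.UTF_8)));   // true
        System.out.println(Arrays.equals(
                encodeUtf8("héllo"), "héllo".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```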
Trivial would be all low-ASCII chars; then maybe some other, faster Java path applies.
I don't know the details; just brainstorming...
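In the decode direction, that brainstorm looks like: scan the raw bytes, and if all of them are low ASCII build the String directly from a char array instead of going through the charset machinery. A hedged sketch (illustrative only; whether it wins over the JDK decoder depends heavily on the JDK version):

```java
import java.nio.charset.StandardCharsets;

public class FastUtf8Decode {
    // Decode UTF-8 bytes to a String, with a direct path for pure-ASCII input.
    static String decodeUtf8(byte[] b, int off, int len) {
        char[] chars = new char[len];
        for (int i = 0; i < len; i++) {
            byte v = b[off + i];
            if (v < 0) {
                // High bit set: real multi-byte UTF-8, use the general decoder.
                return new String(b, off, len, StandardCharsets.UTF_8);
            }
            chars[i] = (char) v;         // ASCII byte maps 1:1 to a char
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        byte[] key = {'v', 'a', 'l', 's'};
        System.out.println(decodeUtf8(key, 0, key.length)); // vals
    }
}
```

Short document keys like '_id' and 'vals' would always take the direct path here.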
See links from earlier in this thread to results, test programs, etc.
cr
> FYI, I tested again against the latest master this weekend. My original test showed that the Ruby C extension completed the test in 3 minutes (give or take a second or two). The latest Java driver clocks in at 5:45, which makes it roughly twice as slow. It's a tad better than 2.1 but still far behind.
>
> See links from earlier in this thread to results, test programs, etc.
I just tested the latest Java driver 2.4rc0 against the same dataset. *Vast* improvement in the results, very likely due to the DBList change [1]. It's nearly at parity with Ruby and its C extension; it's only about 10% slower now.
Nice work! Hopefully this improvement will make its way into the next Ruby driver release for JRuby.
cr
I'll integrate as soon as we have a final 2.4 release.