Some more progress on Lucene integration

Grzegorz Kossakowski

Apr 11, 2010, 8:38:05 PM

to gimd-d...@googlegroups.com

Hi,

I just wanted to report that I've made some more progress on Lucene
integration. I didn't submit the changes for review because they are
not quite ready for formal review. Integrating Lucene was quite a lot
of work and it produced a number of commits that I'll need to clean up.

Anyway, I decided to make my changes public; they can be pulled from here:

http://github.com/gkossakowski/gimd/tree/lucene

At the moment the tests for this code fail due to a strange bug in
Formatter that I cannot track down. That the test case fails because
of Formatter is easier to see if you check out the following branch:

http://github.com/gkossakowski/gimd/tree/formatter-bug

I've shared this code in case you, Shawn, could have a look at it.
I'm clueless at the moment because the bug only manifests itself with
rather large test data, so debugging is quite painful.

This leads me to two conclusions:
1. Integrating Lucene is a lot of work and I'm very far from a complete
integration, so I'd appreciate some help.
2. Situations like the one above with Formatter make me even more
motivated to start writing ScalaCheck[1] test cases, which are more
likely to detect such problems early on than hand-written test cases.
However, this is again a lot of work...

[1] http://code.google.com/p/scalacheck/
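
To make point 2 concrete, here is a minimal sketch of the kind of
round-trip property such tests would check: format random data, parse
it back, and expect the original. Everything in it is an assumption
made for illustration: the `format`/`parse` pair is a toy stand-in, not
Gimd's Formatter or parser, and the random generation is hand-rolled
with scala.util.Random to keep the example dependency-free rather than
using ScalaCheck's own generators.

```scala
import scala.util.Random

// Hypothetical stand-ins for a formatter/parser pair; NOT Gimd's real API.
object RoundTripSketch {
  type Field = (String, String)

  // Format each field as a "name value" line.
  def format(fields: Seq[Field]): String =
    fields.map { case (k, v) => k + " " + v + "\n" }.mkString

  // Parse text produced by `format` back into fields.
  // Assumes names are non-empty and contain no spaces.
  def parse(text: String): Seq[Field] =
    text.split("\n").toSeq.filter(_.nonEmpty).map { line =>
      val i = line.indexOf(' ')
      (line.substring(0, i), line.substring(i + 1))
    }

  private val keyChars: Vector[Char] = ('a' to 'z').toVector
  private val valueChars: Vector[Char] =
    (('a' to 'z') ++ ('0' to '9') ++ Seq(' ')).toVector

  // Random string with a length between minLen and maxLen (inclusive).
  def randomString(rnd: Random, chars: Vector[Char], minLen: Int, maxLen: Int): String = {
    val len = minLen + rnd.nextInt(maxLen - minLen + 1)
    Seq.fill(len)(chars(rnd.nextInt(chars.size))).mkString
  }

  // Between 0 and 49 random fields; values may be empty or contain spaces.
  def randomFields(rnd: Random): Seq[Field] =
    Seq.fill(rnd.nextInt(50))(
      (randomString(rnd, keyChars, 1, 8), randomString(rnd, valueChars, 0, 20)))

  // Runs the property `parse(format(fs)) == fs` on random inputs.
  // Returns the first counterexample, or None if every run passes.
  def check(runs: Int, seed: Long): Option[Seq[Field]] = {
    val rnd = new Random(seed)
    Iterator.fill(runs)(randomFields(rnd)).find(fs => parse(format(fs)) != fs)
  }
}
```

Calling `RoundTripSketch.check(200, 42L)` returns `None` when the
property holds for all generated inputs. A real ScalaCheck property
would additionally shrink a failing input to a small counterexample,
which is exactly what is missing when a bug only shows up in big test
data.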

--
Best regards,
Grzegorz Kossakowski

Grzegorz Kossakowski

Apr 19, 2010, 1:44:44 PM

to gimd-d...@googlegroups.com

2010/4/12 Grzegorz Kossakowski <grzegorz.k...@gmail.com>

> Hi,
>
> I just wanted to report that I've made some more progress on Lucene
> integration. I didn't submit the changes for review because they are
> not quite ready for formal review. Integrating Lucene was quite a lot
> of work and it produced a number of commits that I'll need to clean up.
>
> Anyway, I decided to make my changes public; they can be pulled from here:
>
> http://github.com/gkossakowski/gimd/tree/lucene

This branch is now updated to incorporate the bug fix that I describe
below. With the bug fixed, the small test case now passes. Since it
doubles as a performance check, it reveals two things:
a) Even a very basic Lucene integration already provides a good
speed-up for simple queries on data as large as 14k nodes.
b) Parsing seems to be very slow in Gimd. I don't know why, but I
believe it should be faster. Profiling (with YourKit) indicates that
functional constructs should not have that much impact on performance:
GC uses less than 1% of CPU time during parsing.
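
As a rough illustration of why an index helps with simple queries in
point a), the sketch below compares a linear scan with a lookup in a
prebuilt map from (field, value) pairs to node ids. It is only a toy
model of the idea: the `Node` type and all function names are
hypothetical, and none of it reflects Lucene's API or how Gimd
actually stores its data.

```scala
object IndexSketch {
  // Hypothetical node type; not Gimd's data model.
  final case class Node(id: Int, fields: Map[String, String])

  // Linear scan: every query has to look at every node.
  def scan(nodes: Seq[Node], field: String, value: String): Set[Int] =
    nodes.iterator.filter(_.fields.get(field).contains(value)).map(_.id).toSet

  // Built once up front: (field, value) -> ids of the nodes holding that pair.
  def buildIndex(nodes: Seq[Node]): Map[(String, String), Set[Int]] =
    nodes
      .flatMap(n => n.fields.toSeq.map(kv => kv -> n.id))
      .groupBy(_._1)
      .map { case (kv, pairs) => kv -> pairs.map(_._2).toSet }

  // Indexed query: a single map lookup, independent of the number of nodes.
  def lookup(index: Map[(String, String), Set[Int]], field: String, value: String): Set[Int] =
    index.getOrElse((field, value), Set.empty[Int])
}
```

Both `scan` and `lookup` return the same ids for an exact-match query;
the difference is that the indexed version pays the cost of walking
all nodes once, when the index is built, instead of on every query.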

> At the moment the tests for this code fail due to a strange bug in
> Formatter that I cannot track down. That the test case fails because
> of Formatter is easier to see if you check out the following branch:
>
> http://github.com/gkossakowski/gimd/tree/formatter-bug
>
> I've shared this code in case you, Shawn, could have a look at it.
> I'm clueless at the moment because the bug only manifests itself with
> rather large test data, so debugging is quite painful.
>
> This leads me to two conclusions:
> 1. Integrating Lucene is a lot of work and I'm very far from a complete
> integration, so I'd appreciate some help.
> 2. Situations like the one above with Formatter make me even more
> motivated to start writing ScalaCheck[1] test cases, which are more
> likely to detect such problems early on than hand-written test cases.
> However, this is again a lot of work...

I've since integrated ScalaCheck and found the cause of the bug I described above. The bug report can be found here:

http://code.google.com/p/gimd/issues/detail?id=2

I've posted my fix for review and it can be found here:

http://review.source.android.com/14320

I'd be grateful for comments on this.

Grzegorz Kossakowski

May 10, 2010, 5:32:09 PM

to Gimd

On Apr 19, 7:44 pm, Grzegorz Kossakowski
<grzegorz.kossakow...@gmail.com> wrote:
> 2010/4/12 Grzegorz Kossakowski <grzegorz.kossakow...@gmail.com>

> b) parsing seems to be very slow in Gimd, I don't know why but I believe it
> should be faster. Profiling (with yourkit) reveals that functional
> constructs should not have that much of impact on performance. Namely,
> gc uses less than 1% of CPU during parsing.

I found the cause: it was a bug in Scala's collections that was
exposed by Scala's sorting function.

Fixing that bug would probably be quite difficult and, frankly
speaking, nobody in the Scala community is that interested in 2.7.x.
Most people are focused on getting 2.8.0 ready for release, which is a
good thing.

After RC1 of Scala 2.8.0 was released, I decided to try porting Gimd
to Scala 2.8.0, as it brings lots of useful features and bug fixes.
The migration wasn't that hard. The biggest issues I had were related
to collections, which were completely redesigned in 2.8. The pain was
worth it, as they seem to be the best collections I have ever used.

As for parsing, it became much faster. After some optimizations of
the parser I managed to reduce parsing time to an acceptable level.
Specifically, a 23k file with a few thousand fields takes around 18ms
on average to parse on my three-year-old desktop machine.

I think this is quite a good result.
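
For anyone who wants to reproduce this kind of number, here is a
minimal sketch of how an average time per parse can be measured on
the JVM. The helper name and its parameters are made up for this
example and are not part of Gimd; it simply runs the code a few times
to warm up the JIT and then averages wall-clock time over many runs.

```scala
object ParseTiming {
  // Average wall-clock milliseconds per evaluation of `body`.
  // The first `warmup` evaluations are not timed, so the JIT compiler
  // has a chance to optimize the code before measurement starts.
  def averageMillis(warmup: Int, runs: Int)(body: => Any): Double = {
    require(runs > 0, "runs must be positive")
    var i = 0
    while (i < warmup) { body; i += 1 }
    val start = System.nanoTime()
    i = 0
    while (i < runs) { body; i += 1 }
    (System.nanoTime() - start) / 1e6 / runs
  }
}
```

With a hypothetical `parse` function and an input string `text`, the
call would look like `ParseTiming.averageMillis(50, 200) { parse(text) }`.
This is only good enough for rough figures; a dedicated benchmark
harness such as JMH handles warm-up and dead-code elimination far
more carefully.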

I'll elaborate on the 2.8.0 migration and other work in a separate thread.
