replace document slow?


Bruno Rezende

Feb 3, 2011, 7:40:07 AM
to xappy-discuss
(it seems I'm having problems emailing xappy-discuss, so sorry if this
message is sent twice)

Hi,

I'm doing some incremental updates to a xapian database using the xappy
API. The changes to the documents are minimal, just adding/removing
some terms. What I'm doing is roughly:

1. get the documents from a search connection
2. change the terms with ProcessedDocument.add_term /
ProcessedDocument.remove_term
3. call IndexerConnection.replace(changed_doc)

I'm getting an average of ~200 items/sec. If I instead get the document
from the indexer connection (rather than using the one returned by the
search connection) and still call replace(doc), I see no real gain.
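
For reference, the three steps above can be sketched as a small helper. This is only a minimal sketch, not code from this thread: `apply_term_changes` and its parameters are hypothetical names, and it assumes only that `indexer` offers `get_document(id)` and `replace(doc)` and that documents offer `add_term(field, term)` and `remove_term(field, term)`, in the way xappy's IndexerConnection and ProcessedDocument are used here.

```python
def apply_term_changes(indexer, doc_ids, field, add=(), remove=()):
    """Add/remove terms on each document and write it back with replace().

    `indexer` is assumed to behave like xappy's IndexerConnection;
    the function name and parameters are illustrative only.
    """
    updated = 0
    for doc_id in doc_ids:
        doc = indexer.get_document(doc_id)
        for term in remove:
            doc.remove_term(field, term)
        for term in add:
            doc.add_term(field, term)
        # Without this call the modified document never reaches the database.
        indexer.replace(doc)
        updated += 1
    return updated
```

Because the helper only relies on those four methods, it can be exercised against a stub object before pointing it at a real index.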

I tried this too:

1. get the documents from a search connection
2. get each document from indexer connection
3. change the terms with ProcessedDocument.add_term /
ProcessedDocument.remove_term
4. see if the changes would be applied to the index, without calling
IndexerConnection.replace(changed_doc)

With this approach the throughput rose to ~1900 items/sec, but the
changes were not applied to the db.

Searching a bit, I found this closed ticket:

http://trac.xapian.org/ticket/250 (replace_document should make
minimal changes to database file).

It seems this ticket is exactly about what I'm doing. The changes were
backported to version 1.0.18. Are these changes available in the
xapian version xappy uses (the one that get_xapian.py,
http://code.google.com/p/xappy/source/browse/trunk/libs/get_xapian.py,
retrieves)?


--
Bruno

boult...@gmail.com

Feb 4, 2011, 5:13:10 AM
to xappy-discuss
On Feb 3, 12:40 pm, Bruno Rezende <brunovianareze...@gmail.com> wrote:
> I'm doing some incremental updates in a xapian database using xappy
> api. The changes to the documents are minimal, just adding/removing
> some terms. The way I'm doing is something like:
>
> 1. get the documents from a search connection
> 2. change the terms with ProcessedDocument.add_term /
> ProcessedDocument.remove_term
> 3. call IndexerConnection.replace(changed_doc)
>
> I'm getting an average of ~200 items/sec. If instead of using the
> document returned by search connection I get the document from the
> indexer connection and continue using replace(doc), I see no real
> gain.
>
> I tried this too:
>
> 1. get the documents from a search connection
> 2. get each document from indexer connection
> 3. change the terms with ProcessedDocument.add_term /
> ProcessedDocument.remove_term
> 4. see if the changes would be applied to the index, without calling
> IndexerConnection.replace(changed_doc)
>
> with this approach the number of items per second was raised to ~1900
> items/sec. But, then the changes were not applied to the db.

Indeed, you need to call replace() to put the changes back into the
database.

> Searching a bit I could find this ticket that is closed:
>
> http://trac.xapian.org/ticket/250 (replace_document should make
> minimal changes to database file).
>
> it seems this ticket is exactly about what I'm doing. The changes were
> backported to version 1.0.18. Are these changes available to the
> xapian version xappy uses (the one that get_xapian.py,
> http://code.google.com/p/xappy/source/browse/trunk/libs/get_xapian.py,
> retrieves)?

Yes, they're definitely included in that version. I assume you're
using chert databases, too (the improvements didn't work so well with
flint, due to the way document lengths were stored).

What sort of speed do you get if you change your code to delete the
old document and then add it back, rather than replacing it? I'd
expect that to be much slower, since that's what the old code path did
(i.e., before xapian ticket 250 was fixed).
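
One way to run that comparison is sketched below. It is a rough timing harness, not a rigorous benchmark: `items_per_second` and `make_strategies` are hypothetical helper names, and the sketch assumes the indexer object exposes `replace(doc)`, `delete(id)` and `add(doc)`, and that each document has an `id` attribute.

```python
from timeit import default_timer


def items_per_second(update_one, docs):
    """Apply `update_one` to every document and return the throughput."""
    start = default_timer()
    count = 0
    for doc in docs:
        update_one(doc)
        count += 1
    elapsed = default_timer() - start
    # Guard against a zero reading on very coarse clocks.
    return count / elapsed if elapsed > 0 else float("inf")


def make_strategies(indexer):
    """Return the two update strategies to compare (illustrative names)."""
    def via_replace(doc):
        indexer.replace(doc)

    def via_delete_add(doc):
        # The pre-ticket-250 behaviour: drop the document, then re-add it.
        indexer.delete(doc.id)
        indexer.add(doc)

    return via_replace, via_delete_add
```

For a fair comparison, both runs should use the same documents and the same flush policy, since the cost of a flush would otherwise land outside the timed region.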

Are you flushing frequently when doing this update, or not at all
during the update?

What sort of speed do you get when doing the initial update?

One thought occurs; do you have a query cache enabled on this index?
I think that may be being updated when you call replace(), and could
account for some of the time.

It's possible that there's a lot of unnecessary parsing going on in
python here; I think some profiling output will be needed to dig into
this (at the least, finding out whether the time is being spent in
Python, or in the Xapian C++ code).
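
A minimal way to gather that profiling output with the standard library is sketched here. `profile_update` is a hypothetical helper name; it wraps whatever update function is in use and returns a text report sorted by cumulative time. In such a report, time spent in compiled extension code typically shows up under built-in entries rather than under Python function names, which gives a first hint about where the time goes.

```python
import cProfile
import io
import pstats


def profile_update(func, *args, **kwargs):
    """Run `func` under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Show the 20 entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(20)
    return result, out.getvalue()
```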

--
Richard

Bruno Rezende

Feb 4, 2011, 6:03:46 AM
to xappy-...@googlegroups.com
Hi,


On Fri, Feb 4, 2011 at 8:13 AM, boult...@googlemail.com
<boult...@gmail.com> wrote:
> On Feb 3, 12:40 pm, Bruno Rezende <brunovianareze...@gmail.com> wrote:
...

>>
>> http://trac.xapian.org/ticket/250 (replace_document should make
>> minimal changes to database file).
>>
>> it seems this ticket is exactly about what I'm doing. The changes were
>> backported to version 1.0.18. Are these changes available to the
>> xapian version xappy uses (the one that get_xapian.py,
>> http://code.google.com/p/xappy/source/browse/trunk/libs/get_xapian.py,
>> retrieves)?
>
> Yes, they're definitely included in that version.  I assume you're
> using chert databases, too (the improvements didn't work so well with
> flint, due to the way document lengths were stored).

Yes, I'm using chert.


>
> What sort of speed do you get if you change your code to delete the
> old document and then add it back, rather than replacing it?  I'd
> expect that to be much slower, since that's what the old code path did
> (ie, before xapian ticket 250 was fixed).
>

I don't have this info now. I'll do a test and report back.


> Are you flushing frequently when doing this update, or not at all
> during the update?
>

I flush every 10K items. I chose this value to try to keep memory
usage low; we had a case where memory usage went up to 15GB. But I
don't think it worked very well: a few days ago we had a case of 4GB
memory usage.
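
As a sketch of that flush policy (not the actual code used here), the loop below calls `flush()` after every `flush_every` replaced documents and once more at the end for any remainder. `replace_in_batches` is a hypothetical name; the only assumptions are that the indexer offers `replace(doc)` and `flush()`.

```python
def replace_in_batches(indexer, docs, flush_every=10000):
    """Replace documents, flushing after every `flush_every` items."""
    pending = 0
    total = 0
    for doc in docs:
        indexer.replace(doc)
        pending += 1
        total += 1
        if pending >= flush_every:
            indexer.flush()
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        indexer.flush()
    return total
```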


> What sort of speed do you get when doing the initial update?

By initial update, do you mean when I add the documents for the first
time? I'll need to check on this machine.


>
> One thought occurs; do you have a query cache enabled on this index?
> I think that may be being updated when you call replace(), and could
> account for some of the time.
>

Yes, I have. I'll try to disable the cache and test this too.


> It's possible that there's a lot of unnecessary parsing going on in
> python here; I think some profiling output will be needed to dig into
> this (at the least, finding out whether the time is being spent in
> Python, or in the Xapian C++ code).
>

OK. I'll do some more testing and see if I can get some profiling info.

--
Bruno

Bruno Rezende

Feb 17, 2011, 7:29:52 AM
to xappy-...@googlegroups.com
Just a follow-up: bulk_update performance is OK. My memory problem was
caused by iterating over a search result, updating the index, and
re-opening the connection that generated the search result. I now avoid
doing this and memory usage is OK, so I will hold off on doing any
timing measurements. Thanks for the help, Richard!
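
One possible way to avoid that pattern is sketched below; it is an assumption about a workable structure, not necessarily the change made here. The ids are collected into a plain list before the index is touched, so no search result is still being iterated while documents are replaced. `update_after_collecting` and `change` are hypothetical names, and the sketch assumes each search result has an `id` attribute and that the indexer offers `get_document(id)`, `replace(doc)` and `flush()`.

```python
def update_after_collecting(results, indexer, change):
    """Collect all result ids first, then update the index."""
    # Materialise the ids so the search results are fully consumed
    # before any write to the index happens.
    doc_ids = [result.id for result in results]
    for doc_id in doc_ids:
        doc = indexer.get_document(doc_id)
        change(doc)
        indexer.replace(doc)
    indexer.flush()
    return len(doc_ids)
```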

--
Bruno

Bruno Rezende

Feb 17, 2011, 8:13:38 AM
to xappy-discuss
Oops, replace 'bulk_update' with 'incremental indexing'...

On Feb 17, 10:29 am, Bruno Rezende <brunovianareze...@gmail.com>
wrote: