Search Indexing Problems: Over 48 hours

Anne-Marie

May 4, 2011, 9:02:49 AM
to ICA-AtoM Users
Currently, after about 53:30 hrs, our indexing process is still
running.

We imported 56 XML files totaling 11.38 MB last Friday (with indexing
disabled) and then set up the indexing. We came back Monday morning to
find an error message time-stamped around Sunday 4 am that the server
had terminated the indexing process because of memory limits, so my
programmer increased the allocation from 512 MB to 1024 MB and
restarted it. It's now been going for over two days.

Is anyone able to estimate how long this process may take?

Anne-Marie Viola
Archival Consultant
International Centre for the Study of the Preservation and Restoration
of Cultural Property (ICCROM)
Rome, Italy
http://www.iccrom.org

David Juhasz

May 4, 2011, 5:56:32 PM
to ica-ato...@googlegroups.com
Hi Anne-Marie,

It's impossible to know exactly how long it will take to index your
data, as it depends on the processing power of the computer, how many
descriptions and actors you have, the organization of your descriptions,
etc. It is not unusual for the index to take a day for a medium-sized
collection (several tens of thousands of descriptions) but two days is
unusual. Do you know how many descriptions and actors you are indexing?

In the end, as long as there is no error message, then the process is
working - so it's probably best to let it keep going. If you do
encounter an error before the index is finished building, then please
let us know (on this list) and we can provide some tools for helping the
indexing process in the future.

Regards,
David

--
David Juhasz,
Software Engineer

Artefactual Systems Inc.
www.artefactual.com

Anne-Marie

May 6, 2011, 6:12:46 AM
to ICA-AtoM Users
After almost 92 hours of processing our indexing failed this morning
at 11,926 of 12,048 records. We received the following error message
regarding data load:
PHP Fatal error: Allowed memory size of 1,073,741,824 bytes (1024MB)
exhausted (tried to allocate 47 bytes) in
/var/www/html/ica-atom/plugins/sfLucenePlugin/lib/vendor/Zend/Search/Lucene/Storage/Directory/Filesystem.php
on line 161

Our programmer understands this error to be caused by a memory limit
of the operating system (regarding the open file limit) so he has
increased this amount to 50,000 in accordance with the suggestions
made in the forum post here
https://groups.google.com/group/ica-atom-users/msg/11e8b81ed0495c1a?hl=en%3Ae2deecfa1249c.
He has also increased the PHP memory limit to 2 GB and the RAM to 3
GB.
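
For anyone checking the same settings: the byte count in the error message corresponds to the 1024 MB PHP memory limit, while the open-file limit is a separate operating-system setting that each process inherits. A minimal sketch in Python, using only the standard library (Unix only). It reports the limit for the shell it is run from, which is an assumption about how the indexing task is started, and it is not part of ICA-AtoM:

```python
import resource

# 1024 MB expressed in bytes, for comparison with the figure in the PHP error.
php_limit_bytes = 1024 * 1024 * 1024
print(f"1024 MB = {php_limit_bytes:,} bytes")

# Soft and hard open-file limits of the current process (inherited from the shell).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Open-file limit: soft={soft_limit}, hard={hard_limit}")
```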

Is there some way to resume indexing where the previous effort
stopped, or must we start over?

Best regards,
Anne-Marie Viola
Project Archivist
International Centre for the Study of the Preservation and Restoration
of Cultural Property (ICCROM)
Rome, Italy
http://www.iccrom.org

David Juhasz

May 6, 2011, 4:48:31 PM
to ica-ato...@googlegroups.com
Hi Anne-Marie,

Sorry to hear this.  The good news is that there is a way to resume the index build.  Please see this issue report:

http://code.google.com/p/qubit-toolkit/issues/detail?id=1128

Near the bottom of the page are two patches that can be used to restart the indexing process - if you are using Release 1.1 then you want the patch in comment #10.

After applying the patch, you can resume the index build at the 11,926th archival description like so:

php symfony search:populate QubitSearch --actorOffset=-1 --ioOffset=11925

N.B. You *must* include the --actorOffset=-1 directive or your existing index will be erased!

Anne-Marie

May 10, 2011, 4:26:45 AM
to ICA-AtoM Users
Thank you for your feedback, David. As it was a Friday -- rather than
wait until Monday for guidance -- we elected to resume the build from
scratch to capitalize on the three days we would have before getting
back into the office. Unfortunately though, increasing our memory
limits seems to have slowed our process -- that or I'm concerned
something is wrong with our database now after our initial indexing
efforts. Now after four days, when we had previously indexed nearly
all 12,000 records, we have only indexed 7,000 of 12,000 records.

Do you have any suggestions? Having spoken to other programmers
outside of this project, they wondered if there could be some
recursive aspect slowing down the process. I see that in the Qubit
documentation you note that the process is very slow. Have you found
any of the following helpful, http://wiki.apache.org/jakarta-lucene/ImproveIndexingSpeed?

Appreciate the assistance,
Anne-Marie Viola
Project Archivist
International Centre for the Study of the Preservation and Restoration
of Cultural Property (ICCROM)
Rome, Italy
http://www.iccrom.org

Maria Mata Caravaca

May 10, 2011, 9:35:34 AM
to ICA-AtoM Users
Dear ICA-AtoM users,
 
This email is a follow-up to the one that Anne-Marie sent you this morning regarding our Search Indexing Problems (see Anne-Marie's earlier email). Here is the update, written by our IT expert:
 
Index advancement:
Time elapsed: 354,232 seconds (approx. 4 days)
Progress: 7,342 of 12,048
 
That's really slow. Could you ask the list for an opinion, giving them the comparison with the previous run? It might be the max open files setting (too high?), or the cache from the previous run?
And if any good advice comes out of it, what do they suggest we do? Is stopping and restarting an option, given the stage we will have reached by tomorrow (between 9,000 and 10,000 at this speed, I think)?
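
For the comparison with the previous run, the per-record rates can be worked out from the figures already reported in this thread (almost 92 hours for 11,926 records in the first run; 354,232 seconds for 7,342 records in the second). A minimal sketch in Python, assuming the rate stays constant for the rest of the build, which is a simplification:

```python
# Figures reported in this thread.
TOTAL_RECORDS = 12_048

# First run: failed after almost 92 hours at record 11,926.
first_run_seconds = 92 * 3600
first_run_records = 11_926

# Second run: 354,232 seconds elapsed at record 7,342.
second_run_seconds = 354_232
second_run_records = 7_342


def seconds_per_record(elapsed_seconds, records_done):
    """Average time spent per indexed record."""
    return elapsed_seconds / records_done


first_rate = seconds_per_record(first_run_seconds, first_run_records)
second_rate = seconds_per_record(second_run_seconds, second_run_records)

# Assumption: the second run keeps its current average rate until the end.
remaining_records = TOTAL_RECORDS - second_run_records
remaining_hours = remaining_records * second_rate / 3600

print(f"First run:  {first_rate:.1f} s/record")
print(f"Second run: {second_rate:.1f} s/record")
print(f"Remaining:  {remaining_records} records, about {remaining_hours:.0f} hours")
```

On these figures the second run is averaging noticeably more time per record than the first, which matches the impression that it has slowed down.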
 
Could you please give us any advice?
Many thanks,
María


Jesús García Crespo

May 10, 2011, 12:39:51 PM
to ica-ato...@googlegroups.com
Hi María, 

On Tue, May 10, 2011 at 6:35 AM, Maria Mata Caravaca <m...@iccrom.org> wrote:
> This email is a follow-up to the one that Anne-Marie sent you this morning regarding our Search Indexing Problems (see Anne-Marie's earlier email). Here is the update, written by our IT expert:
>
> Index advancement:
> Time elapsed: 354,232 seconds (approx. 4 days)
> Progress: 7,342 of 12,048
>
> That's really slow. Could you ask the list for an opinion, giving them the comparison with the previous run? It might be the max open files setting (too high?), or the cache from the previous run?
> And if any good advice comes out of it, what do they suggest we do? Is stopping and restarting an option, given the stage we will have reached by tomorrow (between 9,000 and 10,000 at this speed, I think)?

Unfortunately, the current ICA-AtoM search engine is really slow while building the index. We hope to improve this as soon as possible. The best workaround that I know of is to optimize the search index a few times while it is being built. When the index is optimized, the number of files it contains decreases drastically, which makes the build run considerably faster. That is the reason why your second attempt at building the index is running slower.

Please, follow these instructions:

1) Press CTRL+C to interrupt the index build.
2) Optimize the index: php symfony search:optimize QubitSearch
3) Restart the build from the last record indexed.

You'll need to apply this patch [1], which makes it possible to restart the search build from a given offset (please follow these instructions [2]).
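
The steps above can be sketched as a small helper that only builds the command lines for one optimize-and-resume cycle, so they can be reviewed before being run by hand. This is an illustration in Python, not part of ICA-AtoM: `resume_commands` is a hypothetical name, the `--actorOffset` and `--ioOffset` options are the ones quoted earlier in this thread and only exist once the patch is applied, and the "offset is one less than the record to resume at" rule is an assumption taken from the earlier example (record 11,926, offset 11925).

```python
def resume_commands(resume_at_record, index_name="QubitSearch"):
    """Return the shell commands for one optimize-and-resume cycle.

    resume_at_record is the 1-based position of the archival description
    to resume at. Assumption: the offset is one less than that position,
    as in the example quoted in this thread (11,926 -> 11925).
    """
    if resume_at_record < 1:
        raise ValueError("resume_at_record must be 1 or greater")
    io_offset = resume_at_record - 1
    return [
        # Step 2: optimize the partially built index.
        f"php symfony search:optimize {index_name}",
        # Step 3: resume the build; --actorOffset=-1 is the option the
        # thread says must be included so the existing index is not erased.
        f"php symfony search:populate {index_name} --actorOffset=-1 --ioOffset={io_offset}",
    ]


# 7,342 is only the progress figure quoted above; use the count shown
# when the build is actually interrupted.
for command in resume_commands(7_342):
    print(command)
```

Nothing is executed here. Step 1 (CTRL+C) still has to be done in the terminal where the build is running.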


Regards,

--
Jesús García Crespo,
Software Engineer, Artefactual Systems Inc.
http://www.artefactual.com | +1.604.527.2056