indexing time - RAM, etc.


Naomi Dushay

Jan 4, 2010, 1:58:52 PM
to solrma...@googlegroups.com, Erik Hatcher
Our index is taking a long time to build -- now over 12 hours.   I'm wondering about other sites with lots of records -- how long does it take to generate a full index, and what are your params set to?

Here's our info:

number of marc records:    more than 6 million

chunk size for indexing:    <500,000 records   (records broken into chunks by id numbers, lots of gaps)  

optimization of index:  only at end of all indexing (i.e.  after last chunk only).

jvm options:      -Xmx16g -Xms16g

java version "1.5.0"
Java(TM) 2 Runtime Environment, Standard Edition (build pxa64dev-20080315 (SR7))
IBM J9 VM (build 2.3, J2RE 1.5.0 IBM J9 2.3 Linux amd64-64 j9vmxa6423-20080315 (JIT enabled)
J9VM - 20080314_17962_LHdSMr
JIT  - 20080130_0718ifx2_r8
GC   - 200802_08)
JCL  - 20080314

solr config settings:

    <indexDefaults>
        <useCompoundFile>false</useCompoundFile>
        <mergeFactor>20</mergeFactor>
        <ramBufferSizeMB>10240</ramBufferSizeMB>
        <maxMergeDocs>2147483647</maxMergeDocs>
        <writeLockTimeout>1000</writeLockTimeout>
        <commitLockTimeout>10000</commitLockTimeout>
        <lockType>single</lockType>
    </indexDefaults>

    <mainIndex>
        <useCompoundFile>false</useCompoundFile>
        <ramBufferSizeMB>10240</ramBufferSizeMB>
        <mergeFactor>20</mergeFactor>
        <maxMergeDocs>2147483647</maxMergeDocs>
        <maxFieldLength>10000</maxFieldLength>
        <unlockOnStartup>false</unlockOnStartup>
        <deletionPolicy class="solr.SolrDeletionPolicy">
            <str name="keepOptimizedOnly">false</str>
            <str name="maxCommitsToKeep">1</str>
        </deletionPolicy>
    </mainIndex>


Ideas:

1.  increase jvm RAM to 20g
2.  increase ramBufferSizeMB  to  a bit less than 20g (19g?  19.5g)
3.  increase mergeFactor to ... ??  30 ??

other suggestions?


FYI, for RUNNING Solr, we needed to do some serious tweaking and ended up with:

java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)

java opts:    -server -Xmx12g -Xms12g -d64 -XX:+UseParallelGC -XX:+AggressiveOpts -XX:NewRatio=5


- Naomi

Andrew Nagy

Jan 4, 2010, 5:48:00 PM
to solrma...@googlegroups.com
How often do you commit?

> --
>
> You received this message because you are subscribed to the Google Groups
> "solrmarc-tech" group.
> To post to this group, send email to solrma...@googlegroups.com.
> To unsubscribe from this group, send email to
> solrmarc-tec...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/solrmarc-tech?hl=en.

--
Sent from my mobile device

Naomi Dushay

Jan 4, 2010, 6:55:45 PM
to solrma...@googlegroups.com
I use the defaults for the SolrMarc code, which is once per file (once
per 500K chunk).

- Naomi

Alan Rykhus

Jan 5, 2010, 9:41:33 AM
to solrma...@googlegroups.com
Hello Naomi,

I happened to be rebuilding our database from scratch yesterday,
implementing a couple of new things and a couple of fixes.

We also have just over 6 million bibs. There were 2 other almost
identical instances of Solr running on the machine. It took 13 hours to
build. My settings are slightly different.

The only option we give to Java 1.6.0 in a 64-bit environment is -Xmx1024m

<indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
</indexDefaults>


<mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
</mainIndex>

As for the configuration of the actual Solr instance that is running,
would the settings there make any difference? When you fire off SolrMarc,
Solr doesn't have to be running to build the index. SolrMarc (I don't
have the right term here) builds/replenishes the index in its own
space, I believe?

al

Alan Rykhus
PALS, A Program of the Minnesota State Colleges and Universities
(507)389-1975
alan....@mnsu.edu

Erik Hatcher

Jan 5, 2010, 6:14:41 AM
to Naomi Dushay, solrma...@googlegroups.com
On Jan 4, 2010, at 1:58 PM, Naomi Dushay wrote:
> Our index is taking a long time to build -- now over 12 hours.

Woah, that is quite a long time for the "small" number of documents
you have. Let's fix that...

> I'm wondering about other sites with lots of records -- how long
> does it take to generate a full index, and what are your params set
> to?

I've seen projects index tens of millions of documents in a couple of
hours or less. Using the CSV indexer, I've indexed nearly 2M docs in
20 minutes on my laptop.

But in SolrMarc's case there is a lot of additional processing going on,
so we have to account for that. Perhaps the processing could be
segregated and parallelized. One option is to pre-process all the MARC
records, have that processing output Solr XML files, and then have
Solr slurp those files directly. But a better option to speed things
up is parallelization - have SolrMarc process MARC files in multiple
threads (or multiple processes), rather than serially running through
6M documents and sending them to Solr one by one.
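The multithreaded option above can be sketched with a fixed-size thread pool. This is a minimal illustration, not SolrMarc's actual code: `processChunk` is a hypothetical stand-in for parsing one MARC chunk and sending its documents to Solr, and the chunk names and record count are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexSketch {

    // Hypothetical stand-in for the per-chunk work: in SolrMarc this would
    // parse a MARC file and send the resulting documents to Solr. Here it
    // just returns a fake "records indexed" count so the pattern is runnable.
    static int processChunk(String chunkName) {
        return 100_000; // assumed record count per chunk, for illustration
    }

    // Submit each MARC chunk to a fixed-size thread pool instead of
    // processing the chunks serially.
    public static int indexInParallel(List<String> chunks, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        for (String chunk : chunks) {
            results.add(pool.submit(() -> processChunk(chunk)));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // get() blocks until that chunk is done
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = Arrays.asList("chunk1.mrc", "chunk2.mrc", "chunk3.mrc");
        System.out.println(indexInParallel(chunks, 2)); // prints 300000
    }
}
```

Each chunk becomes an independent task, so the pool keeps all CPUs busy while Solr absorbs the concurrent update requests.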

> Here's our info:
>
> number of marc records: more than 6 million
>
> chunk size for indexing: <500,000 records (records broken into
> chunks by id numbers, lots of gaps)

This means you're sending in 500k or fewer documents per commit. I've
seen cases where a commit is warranted more often, but that is when
there are a lot of document updates going on (the same document id being
indexed again).

Some other process data needed:

- SolrMarc is rebuilding the entire index itself from scratch? Is
it hitting a Solr server or using the EmbeddedSolrServer "offline"?
Is the Solr server you're indexing into being hit for searches at the
same time? Do you have a master/slave setup where you have an
indexing server that then replicates to a search server? Are you
running the indexer and Solr on the same box (if you're using
embedded, the answer is of course yes)?

I know those questions overlap, sorry - just trying to ask as much as
I can think of here to understand fully what you've got going on.

> optimization of index: only at end of all indexing (i.e. after
> last chunk only).

And optimization isn't always necessary - it depends on whether search
performance is acceptable or not, and you can mitigate that a bit with
the merge factor (see below).

> jvm options: -Xmx16g -Xms16g

Good grief, that's a lot of RAM! That is definitely not needed for
Solr, and it may be so much that the operating system is left with no
breathing room. How much physical RAM do you have on the box?

One example here - The Motley Fool, fool.com, has 22M documents, and
16G RAM servers, allocating 8GB RAM to their JVMs. And when I first
worked with them, they only had 8GB RAM servers with 4GB RAM to the
JVM and everything worked just fine.

You may be getting bitten by long, excessive GC pauses. More RAM for a
JVM does not equate to better performance, and oftentimes it can be
problematic. Especially with the large ramBufferSizeMB you've got
specified!

Indexing, from Solr's perspective, is not a heavy RAM hog. It's more
so on the search side with filter caches, FieldCache (for sorting),
etc. So if this is your indexing server only, and not being hit for
searches, I think you're giving it too much.

But profiling is always warranted for these types of questions to see
what is being used. I recommend pointing jconsole at it at least and
reporting the numbers you see there. Please send along all the stats
from the VM Summary tab (screenshot works).

> java version "1.5.0"

That's now an outdated JVM. Can you bump up to the latest 1.6 JVM
instead? Oh, I see you did later. Good!

Also, run with the -server switch, which I also see you did later.

> <indexDefaults>
> <useCompoundFile>false</useCompoundFile>
> <mergeFactor>20</mergeFactor>

That merge factor is too high. Set it back to the default of 10, or
maybe even less. But start with 10.

> <ramBufferSizeMB>10240</ramBufferSizeMB>

That's enormous! No need for it to be that big. 10G of buffered
docs? Let it flush to disk more frequently. The default setting is
32 (megabytes). Your Solr documents aren't really all that big (the
MARC ones) and RAM is better utilized for other purposes, like within
your indexer itself perhaps. Maybe set this down to 128 and see what
happens.

> <maxMergeDocs>2147483647</maxMergeDocs>

I'd comment that line out. This is telling Lucene not to flush until
it gets this many documents. Go by RAM rather than number of
documents. But not so much RAM!

> <mainIndex>
> <useCompoundFile>false</useCompoundFile>
> <ramBufferSizeMB>10240</ramBufferSizeMB>
> <mergeFactor>20</mergeFactor>
> <maxMergeDocs>2147483647</maxMergeDocs>

Likewise here - it's confusing, but just set the same index settings
here as above.

> <maxFieldLength>10000</maxFieldLength>

Not related to your indexing speed, but... I'd set this number to
Integer.MAX_VALUE = 2147483647

You never want any terms not indexed, and very likely you're not even
getting close to this number anyway.


> Ideas:
> 1. increase jvm RAM to 20g

NO! Unnecessary. RAM isn't the problem here.

> 2. increase ramBufferSizeMB to a bit less than 20g (19g? 19.5g)

NO! Let it flush more frequently, no need to tie up so much RAM.

> 3. increase mergeFactor to ... ?? 30 ??

NO! Decrease it back to 10.

> other suggestions?

I'm not certain, as this stuff is as much art as science, but I'd
guess that your merge factor, RAM buffer size, and max merge docs
settings are a large part of the problem here.

> FYI, for RUNNING Solr, we needed to do some serious tweaking and
> ended up with:
>
> java version "1.6.0_07"
> Java(TM) SE Runtime Environment (build 1.6.0_07-b06)
> Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode)
>
> java opts: -server -Xmx12g -Xms12g -d64 -XX:+UseParallelGC -XX:
> +AggressiveOpts -XX:NewRatio=5

I'm no JVM tuning expert, so I can't comment on those settings at the
moment but I'll look around for some info on it. But again, 12G of
RAM to Solr? Is that really necessary? Profile and see how much RAM
you need for the facets in the filter cache and the sort keys. But
don't go overboard and give the JVM too much memory, it adversely
affects GC time.

But wait, Java 1.5 for indexing, and Java 1.6 for searching, right?

But, so you are running the indexer in "embedded" mode (I still need
to dig into the indexer and see what makes it tick in more detail).
I definitely recommend indexing via HTTP instead. And parallelize
your indexing if possible - index more than one MARC file at a time,
either multithreaded or multiprocessed. Solr can handle many
simultaneous indexing requests, so no worries there.
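As a sketch of what indexing via HTTP involves: Solr accepts XML documents POSTed to its update handler. The snippet below builds a minimal `<add>` payload using only the JDK; the URL, field names, and `post` helper are illustrative assumptions, not SolrMarc's API, and real code should also XML-escape field values.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class SolrHttpPost {

    // Build a minimal <add> payload for Solr's XML update handler.
    // NOTE: field values are assumed to be XML-safe; real code must escape them.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // POST the payload to a (hypothetical) Solr update URL, e.g.
    // http://localhost:8983/solr/update - requires a running Solr.
    static int post(String solrUpdateUrl, String xml) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "u12345");                       // hypothetical field names
        doc.put("title_display", "Example record");
        System.out.println(toAddXml(doc));             // post() is not called here
    }
}
```

Because each POST is independent, several indexer threads or processes can call `post` concurrently against the same Solr instance.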

Erik

p.s. Naomi - if you can get me the MARC files, I'd like to do some
testing of this and help out even more. Even a single MARC file and
some indexing instructions and I'll give it a go to tune some of this
better.

Jonathan Rochkind

Jan 5, 2010, 10:04:23 AM
to solrma...@googlegroups.com, Naomi Dushay
Wow, thanks so much for your advice Erik. Anything you can do to help Naomi's case will undoubtedly help the rest of us too, as we have very similar cases.

Adding threads to SolrMarc seems like a bit of a chore for the non-Java expert (me), but seems worth setting as a goal. I know you mentioned to me in channel about making sure SolrMarc is using the "streaming solr updater" (or something like that) when it accesses Solr over HTTP. I'd be interested in more information about how to do that.

Jonathan

Erik Hatcher

Jan 5, 2010, 10:10:55 AM
to solrma...@googlegroups.com
From Solr 1.4's solrconfig.xml:

    <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene
         will flush based on whichever limit is hit first. -->

I strongly recommend you comment out the maxBufferedDocs setting, and let
the RAM buffer be the deciding factor on when to flush.

But, also profiling will be important here to see where the time is
being spent. How much time is being spent in the actual MARC
processing versus just the indexing into Solr?

I can tell you that 12 or 13 hours is way off base for just slamming
6M docs into Solr. It can handle that in an hour or less. So
something is way off, or the MARC processing is very heavy.

Erik


Robert Haschart

Jan 5, 2010, 1:00:02 PM
to solrma...@googlegroups.com
As a datapoint, here at UVA we have about 4.2M bibliographic records in our index.
I just ran an indextest on all of the records, which does all of the SolrMarc processing up to the point where it would send the indexed record to Solr, and then instead merely prints that info to stdout, which I piped to /dev/null. So this should be a pretty fair indication of the amount of work SolrMarc does separate from the indexing-into-Solr work.

The time for processing all 4.2M records in this way was reported as

 real    115m18.480s

so just under two hours. A recent full indexing run started at 23:42 on 12/29 and finished at 06:19 on 12/30, which works out to about 6 hours and 40 minutes.

This seems to indicate that the indexing-into-Solr portion of a SolrMarc full re-index of UVA's 4.2M records takes about 4 hours and 45 minutes.

-Bob Haschart



Erik Hatcher

Jan 5, 2010, 1:52:50 PM
to solrma...@googlegroups.com
Bob,

What does UVa have in their solrconfig? What JVM settings? What kind
of machine and RAM? This is using EmbeddedSolrServer, right?

4h45m is still way too long for this. We can do better.

Can someone give me a dataset that is reasonably large enough to try
in my own environment?

Erik

Cicer0

Jan 5, 2010, 2:03:48 PM
to solrmarc-tech
Sorry to add to the "bad" data points, but we have very poor indexing
numbers at Yale as well (8.2 million records), and it seems to get
exponentially worse as the index grows. The first million takes about
an hour, but the last million takes 8 hours, and after that incremental
loads of 10-20,000 records take two hours (including optimization). We are
using some of the same large memory settings as Stanford, which I will
soon try to cut back as Erik suggests, but otherwise we use pretty
much out-of-the-box solrconfig settings. We are in the middle of a
VuFind RC2 upgrade, so right now there is no free time to experiment
with tuning, but I look forward to seeing what the final
recommendation turns out to be.

Ross Singer

Jan 5, 2010, 2:12:59 PM
to solrma...@googlegroups.com
Erik, the Talis MARC corpus at internet archive is 5.5 million bibs:

http://www.archive.org/details/talis_openlibrary_contribution

Western Washington's is (right around 1 million):
http://www.archive.org/details/marc_western_washington_univ

Boston Public Library (who knows):
http://www.archive.org/details/bpl_marc

IA seems hella slow at the moment, though.

-Ross.


Robert Haschart

Jan 5, 2010, 2:47:45 PM
to solrma...@googlegroups.com

Erik,

The machine we run it on is a four-processor 3.33 GHz Xeon Linux
machine with something like 36GB of RAM.
We are running: java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)

The JVM args we use for the SolrMarc invocation are pretty minimal:

-Xmx2048m

Our solrconfig.xml is largely taken from one of the examples; the major
changes we've made mostly involve custom query handlers.

I'd have to ask Bess, but we can probably get you access to a full dump
of the MARC records from our system, and get you set up with a runnable

<?xml version="1.0" encoding="UTF-8" ?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

<config>
<!-- Set this to 'false' if you want solr to continue working after it
has
encountered a severe configuration error. In a production
environment,
you may want solr to keep working even if one handler is
mis-configured.

You may also set this to false using by setting the system property:
-Dsolr.abortOnConfigurationError=false
-->

<abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

<!-- Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home.
If replication is in use, this should match the replication
configuration. -->
<!-- <dataDir>${solr.data.dir:./solr/data}</dataDir> -->


<indexDefaults>
<!-- Values here affect all index writers and act as a default unless
overridden. -->
<useCompoundFile>false</useCompoundFile>

<mergeFactor>10</mergeFactor>


<!--
If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene
will flush based on whichever limit is hit first.

-->
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->
<!-- Tell Lucene when to flush documents to disk.
Giving Lucene more memory for indexing means faster indexing at the
cost of more RAM

If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene will
flush based on whichever limit is hit first.

-->


<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>

<!--
Expert: Turn on Lucene's auto commit capability.
This causes intermediate segment flushes to write a new lucene
index descriptor, enabling it to be opened by an external
IndexReader.
NOTE: Despite the name, this value does not have any relation to
Solr's autoCommit functionality
-->
<!--<luceneAutoCommit>false</luceneAutoCommit>-->
<!--
Expert:
The Merge Policy in Lucene controls how merging is handled by
Lucene. The default in 2.3 is the LogByteSizeMergePolicy, previous
versions used LogDocMergePolicy.

LogByteSizeMergePolicy chooses segments to merge based on their
size. The Lucene 2.2 default, LogDocMergePolicy chose when
to merge based on number of documents

Other implementations of MergePolicy must have a no-argument
constructor
-->

<!--<mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>-->

<!--
Expert:
The Merge Scheduler in Lucene controls how merges are performed.
The ConcurrentMergeScheduler (Lucene 2.3 default)
can perform merges in the background using separate threads. The
SerialMergeScheduler (Lucene 2.2 default) does not.
-->

<!--<mergeScheduler>org.apache.lucene.index.ConcurrentMergeScheduler</mergeScheduler>-->

<!--
This option specifies which Lucene LockFactory implementation to use.

single = SingleInstanceLockFactory - suggested for a read-only index
or when there is no possibility of another process trying
to modify the index.
native = NativeFSLockFactory
simple = SimpleFSLockFactory

(For backwards compatibility with Solr 1.2, 'simple' is the default
if not specified.)
-->
<lockType>single</lockType>
</indexDefaults>

<mainIndex>
<!-- options specific to the main on-disk lucene index -->


<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>32</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

<!-- Deprecated -->
<!--<maxBufferedDocs>1000</maxBufferedDocs>-->


<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>10000</maxFieldLength>

<!-- If true, unlock any held write or commit locks on startup.
This defeats the locking mechanism that allows multiple
processes to safely access a lucene index, and should be
used with care.
This is not needed if lock type is 'none' or 'single'
-->
<unlockOnStartup>false</unlockOnStartup>
</mainIndex>

<!-- Enables JMX if and only if an existing MBeanServer is found, use
this if you want to configure JMX through JVM parameters. Remove
this to disable exposing Solr configuration and statistics to JMX.

If you want to connect to a particular server, specify the agentId
e.g. <jmx agentId="myAgent" />

If you want to start a new MBeanServer, specify the serviceUrl
e.g <jmx
serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:9999/solr" />

For more details see http://wiki.apache.org/solr/SolrJmx
-->
<jmx />

<!-- the default high-performance update handler -->
<updateHandler class="solr.DirectUpdateHandler2">

<!-- A prefix of "solr." for class names is an alias that
causes solr to search appropriate packages, including
org.apache.solr.(search|update|request|core|analysis)
-->

<!-- Perform a <commit/> automatically under certain conditions:
maxDocs - number of updates since last commit is greater than this
maxTime - oldest uncommitted update (in ms) is this long ago
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>1000</maxTime>
</autoCommit>
-->

<!-- The RunExecutableListener executes an external command.
exe - the name of the executable to run
dir - dir to use as the current working directory. default="."
wait - the calling thread waits until the executable returns.
default="true"
args - the arguments to pass to the program. default=nothing
env - environment variables to set. default=nothing
-->
<!-- A postCommit event is fired after every commit or optimize command
<listener event="postCommit" class="solr.RunExecutableListener">
<str name="exe">solr/bin/snapshooter</str>
<str name="dir">.</str>
<bool name="wait">true</bool>
<arr name="args"> <str>arg1</str> <str>arg2</str> </arr>
<arr name="env"> <str>MYVAR=val1</str> </arr>
</listener>
-->
<!-- A postOptimize event is fired only after every optimize
command, useful
in conjunction with index distribution to only distribute
optimized indices
<listener event="postOptimize" class="solr.RunExecutableListener">
<str name="exe">snapshooter</str>
<str name="dir">solr/bin</str>
<bool name="wait">true</bool>
</listener>
-->

</updateHandler>


<query>
<!-- Maximum number of clauses in a boolean query... can affect
range or prefix queries that expand to big boolean
queries. An exception is thrown if exceeded. -->
<maxBooleanClauses>1024</maxBooleanClauses>


<!-- There are two implementations of cache available for Solr,
LRUCache, based on a synchronized LinkedHashMap, and
FastLRUCache, based on a ConcurrentHashMap. FastLRUCache has
faster gets
and slower puts in single threaded operation and thus is
generally faster
than LRUCache when the hit ratio of the cache is high (> 75%),
and may be
faster under other scenarios on multi-cpu systems. -->
<!-- Cache used by SolrIndexSearcher for filters (DocSets),
unordered sets of *all* documents that match a query.
When a new searcher is opened, its caches may be prepopulated
or "autowarmed" using data from caches in the old searcher.
autowarmCount is the number of items to prepopulate. For LRUCache,
the autowarmed items will be the most recently accessed items.
Parameters:
class - the SolrCache implementation LRUCache or FastLRUCache
size - the maximum number of entries in the cache
initialSize - the initial capacity (number of entries) of
the cache. (see java.util.HashMap)
autowarmCount - the number of entries to prepopulate from
and old cache.
-->
<filterCache
class="solr.FastLRUCache"
size="1000000"
initialSize="100000"
autowarmCount="50000"/>

<!-- queryResultCache caches results of searches - ordered lists of
document ids (DocList) based on a query, a sort, and the range
of documents requested. -->
<queryResultCache
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="32"/>

<!-- documentCache caches Lucene Document objects (the stored fields
for each document).
Since Lucene internal document ids are transient, this cache will
not be autowarmed. -->
<documentCache
class="solr.LRUCache"
size="512"
initialSize="512"
autowarmCount="0"/>

<!-- If true, stored fields that are not requested will be loaded
lazily.

This can result in a significant speed improvement if the usual case
is to
not load all stored fields, especially if the skipped fields are
large compressed
text fields.
-->
<enableLazyFieldLoading>true</enableLazyFieldLoading>

<!-- Example of a generic cache. These caches may be accessed by name
through SolrIndexSearcher.getCache(),cacheLookup(), and
cacheInsert().
The purpose is to enable easy caching of user/application level
data.
The regenerator argument should be specified as an implementation
of solr.search.CacheRegenerator if autowarming is desired. -->
<!--
<cache name="myUserCache"
class="solr.LRUCache"
size="4096"
initialSize="1024"
autowarmCount="1024"
regenerator="org.mycompany.mypackage.MyRegenerator"
/>
-->

<!-- An optimization that attempts to use a filter to satisfy a search.
If the requested sort does not include score, then the filterCache
will be checked for a filter matching the query. If found, the
filter
will be used as the source of document ids, and then the sort
will be
applied to that.
<useFilterForSortedQuery>true</useFilterForSortedQuery>
-->

<!-- An optimization for use with the queryResultCache. When a search
is requested, a superset of the requested number of document ids
are collected. For example, if a search for a particular query
requests matching documents 10 through 19, and queryWindowSize
is 50,
then documents 0 through 49 will be collected and cached. Any
further
requests in that range can be satisfied via the cache. -->
<queryResultWindowSize>50</queryResultWindowSize>

<!-- Maximum number of documents to cache for any entry in the
queryResultCache. -->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>

<!-- This entry enables an int hash representation for filters (DocSets)
when the number of items in the set is less than maxSize. For
smaller
sets, this representation is more memory efficient, more
efficient to
iterate over, and faster to take intersections. -->
<HashDocSet maxSize="3000" loadFactor="0.75"/>

<!-- a newSearcher event is fired whenever a new searcher is being
prepared
and there is a current searcher handling requests (aka
registered). -->
<!-- QuerySenderListener takes an array of NamedList and executes a
local query request for each NamedList in sequence. -->
<listener event="newSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst> <str name="q">solr</str> <str name="start">0</str> <str
name="rows">10</str> </lst>
<lst> <str name="q">rocks</str> <str name="start">0</str> <str
name="rows">10</str> </lst>
<lst><str name="q">static newSearcher warming query from
solrconfig.xml</str></lst>
</arr>
</listener>

<!-- a firstSearcher event is fired whenever a new searcher is being
prepared but there is no current registered searcher to handle
requests or to gain autowarming data from. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
<arr name="queries">
<lst> <str name="q">fast_warm</str> <str name="start">0</str>
<str name="rows">10</str> </lst>
<lst><str name="q">static firstSearcher warming query from
solrconfig.xml</str></lst>
</arr>
</listener>

<!-- If a search request comes in and there is no current registered
searcher,
then immediately register the still warming searcher and use
it. If
"false" then all requests will block until the first searcher
is done
warming. -->
<useColdSearcher>false</useColdSearcher>

<!-- Maximum number of searchers that may be warming in the background
concurrently. An error is returned if this limit is exceeded.
Recommend
1-2 for read-only slaves, higher for masters w/o cache warming. -->
<maxWarmingSearchers>2</maxWarmingSearchers>

</query>

<!--
Let the dispatch filter handle /select?qt=XXX
handleSelect=true will use consistent error handling for /select and
/update
handleSelect=false will use solr1.1 style error formatting
-->
<requestDispatcher handleSelect="true" >
<!--Make sure your system has some authentication before enabling
remote streaming! -->
<requestParsers enableRemoteStreaming="false"
multipartUploadLimitInKB="2048" />

<!-- Set HTTP caching related parameters (for proxy caches and clients).

To get the behaviour of Solr 1.2 (ie: no caching related headers)
use the never304="true" option and do not specify a value for
<cacheControl>
-->
<!-- <httpCaching never304="true"> -->
<httpCaching lastModifiedFrom="openTime"
etagSeed="Solr">
<!-- lastModFrom="openTime" is the default, the Last-Modified value
(and validation against If-Modified-Since requests) will all be
relative to when the current Searcher was opened.
You can change it to lastModFrom="dirLastMod" if you want the
value to exactly correspond to when the physical index was last
modified.

etagSeed="..." is an option you can change to force the ETag
header (and validation against If-None-Match requests) to be
different even if the index has not changed (ie: when making
significant changes to your config file)

lastModifiedFrom and etagSeed are both ignored if you use the
never304="true" option.
-->
<!-- If you include a <cacheControl> directive, it will be used to
generate a Cache-Control header, as well as an Expires header
if the value contains "max-age="

By default, no Cache-Control header is generated.

You can use the <cacheControl> option even if you have set
never304="true"
-->
<!-- <cacheControl>max-age=30, public</cacheControl> -->
</httpCaching>
</requestDispatcher>


<!-- requestHandler plugins... incoming queries will be dispatched to the
correct handler based on the path or the qt (query type) param.
Names starting with a '/' are accessed with a path equal to the
registered name. Names without a leading '/' are accessed with:
http://host/app/select?qt=name
If no qt is defined, the requestHandler that declares default="true"
will be used.
-->
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<!-- default values for query parameters -->
<lst name="defaults">
<str name="echoParams">explicit</str>
<!--
<int name="rows">10</int>
<str name="fl">*</str>
<str name="version">2.1</str>
-->
</lst>
</requestHandler>


<!-- DisMaxRequestHandler allows easy searching across multiple fields
for simple user-entered phrases. Its implementation is now
just the standard SearchHandler with a default query type
of "dismax".
see http://wiki.apache.org/solr/DisMaxRequestHandler
-->
<requestHandler name="dismax" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="q.alt">*:*</str>
<str name="qf">text</str>
</lst>
<str name="qf">text</str>
<!-- omp = Only More Popular -->
<str name="spellcheck.onlyMorePopular">false</str>
<!-- exr = Extended Results -->
<str name="spellcheck.extendedResults">false</str>
<!-- The number of suggestions to return -->
<str name="spellcheck.count">5</str>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
title_text^20.0 author_text^10.0 text
</str>
<str name="pf">
title_facet^100 title_unstem_text^50 title_text^20
author_unstem_text^20 subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="author_search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
author_text^20.0 author_added_entry_text author_facet
</str>
<str name="pf">
title_unstem_text^20.0 author_unstem_text^20
subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="title_search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
title_text^20.0 subtitle_text^15 title_added_entry_text
series_title_text notes_text uniform_title_text
</str>
<str name="pf">
title_facet^100 title_unstem_text^20.0 author_unstem_text^20
subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="subject_search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
subject_text^20.0 text^15 subject_facet subject_era_facet
</str>
<str name="pf">
title_unstem_text^20.0 author_unstem_text^20
subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="call_number_search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
call_number_text^20.0 call_number_facet
</str>
<str name="pf">
title_unstem_text^20.0 author_unstem_text^20
subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<requestHandler name="music_search" class="solr.SearchHandler" >
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<float name="tie">0.01</float>
<str name="qf">
author_text^20 author_added_entry_text^10 title_text^1.0 text
</str>
<str name="pf">
title_facet^50 title_unstem_text^20.0 author_unstem_text^20
subject_unstem_text^20 text^15.0
</str>
<str name="fl">
id, score, media_resource_id_display, media_description_display,
title_display, subtitle_display, main_title_display,
series_title_display, part_display, date_display,
date_received_facet, author_display, creator_display,
digital_collection_facet,
datafile_name_display, format_facet, location_facet,
call_number_display, isbn_display,
published_display, source_facet, content_model_facet,
mint_display, accession_display,
thumb_obv_display, thumb_rev_display, year_facet, year_display,
published_date_display,
linked_author_display, linked_title_display,
linked_responsibility_statement_display, url_display
</str>
<str name="facet">on</str>
<str name="facet.mincount">1</str>
<str name="facet.limit">10</str>
<int name="ps">100</int>
<str name="q.alt">*:*</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

<!-- very simple param defaults for a single document -->
<requestHandler name="document" class="solr.SearchHandler" >
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="fl">
accession_display, material_display, media_resource_id_display,
media_description_display, media_retrieval_id_display,
marc_display, format_facet, location_facet, call_number_display,
isbn_display,
edition_display, call_number_facet, score, datafile_name_display,
digital_collection_facet, collection_facet, timestamp,
subject_facet, subject_genre_facet,
content_model_facet, id, composition_era_facet,
content_type_facet, author_facet, author_display,
linked_author_display, source_facet, language_facet,
title_display, subtitle_display, main_title_display,
series_title_display, alternate_title_display,
linked_title_display, linked_responsibility_statement_display,
year_facet, year_display,
accession_display, thumb_obv_display, thumb_rev_display,
material_display,
denomination_display, url_display, region_facet, subject_facet,
mint_display,
shadowed_location_facet, desc_meta_file_display,
admin_meta_file_display
</str>
<str name="rows">1</str>
<str name="q">{!raw f=id v=$id}</str>
<str name="q.alt">*:*</str>
</lst>
</requestHandler>

<!-- very simple param defaults for a single document -->
<requestHandler name="document_lean" class="solr.SearchHandler" >
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="fl">
id, title_display, format_facet, source_facet,
content_model_facet, isbn_display, author_display, thumb_obv_display,
media_resource_id_display
</str>
<str name="rows">1</str>
<str name="q">{!raw f=id v=$id}</str>
<str name="q.alt">*:*</str>
</lst>
</requestHandler>



<!--
Search components are registered to SolrCore and used by Search Handlers

By default, the following components are available:

<searchComponent name="query"
class="org.apache.solr.handler.component.QueryComponent" />
<searchComponent name="facet"
class="org.apache.solr.handler.component.FacetComponent" />
<searchComponent name="mlt"
class="org.apache.solr.handler.component.MoreLikeThisComponent" />
<searchComponent name="highlight"
class="org.apache.solr.handler.component.HighlightComponent" />
<searchComponent name="stats"
class="org.apache.solr.handler.component.StatsComponent" />
<searchComponent name="debug"
class="org.apache.solr.handler.component.DebugComponent" />

Default configuration in a requestHandler would look like:
<arr name="components">
<str>query</str>
<str>facet</str>
<str>mlt</str>
<str>highlight</str>
<str>stats</str>
<str>debug</str>
</arr>

If you register a searchComponent to one of the standard names, that
will be used instead.
To insert components before or after the 'standard' components, use:

<arr name="first-components">
<str>myFirstComponentName</str>
</arr>

<arr name="last-components">
<str>myLastComponentName</str>
</arr>
-->

<!-- The spell check component can return a list of alternative spelling
suggestions. -->

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

<str name="queryAnalyzerFieldType">textSpell</str>

<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spell</str>
<str name="spellcheckIndexDir">./spellchecker1</str>
</lst>
<lst name="spellchecker">
<str name="name">jarowinkler</str>
<str name="field">spell</str>
<!-- Use a different Distance Measure -->
<str
name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
<str name="spellcheckIndexDir">./spellchecker2</str>
</lst>
<!--
<lst name="spellchecker">
<str name="classname">solr.FileBasedSpellChecker</str>
<str name="name">file</str>
<str name="sourceLocation">spellings.txt</str>
<str name="characterEncoding">UTF-8</str>
<str name="spellcheckIndexDir">./spellcheckerFile</str>
</lst>
-->
</searchComponent>

<!-- a request handler utilizing the elevator component -->
<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>


<!-- Update request handler.

Note: Since solr1.1, requestHandlers require a valid content type
header if posted in
the body. For example, curl now requires: -H
'Content-type:text/xml; charset=utf-8'
The response format differs from solr1.1 formatting and returns a
standard error code.

To enable solr1.1 behavior, remove the /update handler or change
its path
-->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />


<!-- <requestHandler name="/update/javabin"
class="solr.BinaryUpdateRequestHandler" /> -->

<!--
Analysis request handler. Since Solr 1.3. Use to return how a
document is analyzed. Useful
for debugging and as a token server for other types of applications
-->
<requestHandler name="/analysis" class="solr.AnalysisRequestHandler" />


<!-- CSV update handler, loaded on demand -->
<requestHandler name="/update/csv" class="solr.CSVRequestHandler"
startup="lazy" />


<!--
Admin Handlers - This will register all the standard admin
RequestHandlers. Adding
this single handler is equivalent to registering:

<requestHandler name="/admin/luke"
class="org.apache.solr.handler.admin.LukeRequestHandler" />
<requestHandler name="/admin/system"
class="org.apache.solr.handler.admin.SystemInfoHandler" />
<requestHandler name="/admin/plugins"
class="org.apache.solr.handler.admin.PluginInfoHandler" />
<requestHandler name="/admin/threads"
class="org.apache.solr.handler.admin.ThreadDumpHandler" />
<requestHandler name="/admin/properties"
class="org.apache.solr.handler.admin.PropertiesRequestHandler" />
<requestHandler name="/admin/file"
class="org.apache.solr.handler.admin.ShowFileRequestHandler" >

If you wish to hide files under ${solr.home}/conf, explicitly register
the ShowFileRequestHandler using:
<requestHandler name="/admin/file"
class="org.apache.solr.handler.admin.ShowFileRequestHandler" >
<lst name="invariants">
<str name="hidden">synonyms.txt</str>
<str name="hidden">anotherfile.txt</str>
</lst>
</requestHandler>
-->
<requestHandler name="/admin/"
class="org.apache.solr.handler.admin.AdminHandlers" />

<!-- ping/healthcheck -->
<requestHandler name="/admin/ping" class="PingRequestHandler">
<lst name="defaults">
<str name="qt">standard</str>
<str name="q">solrpingquery</str>
<str name="echoParams">all</str>
</lst>
</requestHandler>

<!-- Echo the request contents back to the client -->
<requestHandler name="/debug/dump" class="solr.DumpRequestHandler" >
<lst name="defaults">
<str name="echoParams">explicit</str> <!-- for all params
(including the default etc) use: 'all' -->
<str name="echoHandler">true</str>
</lst>
</requestHandler>

<highlighting>
<!-- Configure the standard fragmenter -->
<!-- This could most likely be commented out in the "default" case -->
<fragmenter name="gap"
class="org.apache.solr.highlight.GapFragmenter" default="true">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>

<!-- A regular-expression-based fragmenter (f.i., for sentence
extraction) -->
<fragmenter name="regex"
class="org.apache.solr.highlight.RegexFragmenter">
<lst name="defaults">
<!-- slightly smaller fragsizes work better because of slop -->
<int name="hl.fragsize">70</int>
<!-- allow 50% slop on fragment sizes -->
<float name="hl.regex.slop">0.5</float>
<!-- a basic sentence pattern -->
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>

<!-- Configure the standard formatter -->
<formatter name="html"
class="org.apache.solr.highlight.HtmlFormatter" default="true">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
</highlighting>


<!-- queryResponseWriter plugins... query responses will be written
using the
writer specified by the 'wt' request parameter matching the name of
a registered
writer.
The "default" writer is the default and will be used if 'wt' is not
specified
in the request. XMLResponseWriter will be used if nothing is
specified here.
The json, python, and ruby writers are also available by default.

<queryResponseWriter name="xml"
class="org.apache.solr.request.XMLResponseWriter" default="true"/>
<queryResponseWriter name="json"
class="org.apache.solr.request.JSONResponseWriter"/>
<queryResponseWriter name="python"
class="org.apache.solr.request.PythonResponseWriter"/>
<queryResponseWriter name="ruby"
class="org.apache.solr.request.RubyResponseWriter"/>
<queryResponseWriter name="php"
class="org.apache.solr.request.PHPResponseWriter"/>
<queryResponseWriter name="phps"
class="org.apache.solr.request.PHPSerializedResponseWriter"/>

<queryResponseWriter name="custom"
class="com.example.MyResponseWriter"/>
-->

<!-- XSLT response writer transforms the XML output by any xslt file found
in Solr's conf/xslt directory. Changes to xslt files are checked for
every xsltCacheLifetimeSeconds.
-->
<queryResponseWriter name="xslt"
class="org.apache.solr.request.XSLTResponseWriter">
<int name="xsltCacheLifetimeSeconds">5</int>
</queryResponseWriter>


<!-- example of registering a query parser
<queryParser name="lucene"
class="org.apache.solr.search.LuceneQParserPlugin"/>
-->

<!-- example of registering a custom function parser
<valueSourceParser name="myfunc"
class="com.mycompany.MyValueSourceParser" />
-->

<!-- config for the admin interface -->
<admin>
<defaultQuery>solr</defaultQuery>

<!-- configure a healthcheck file for servers behind a loadbalancer
<healthcheck type="file">server-enabled</healthcheck>
-->
</admin>

</config>
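As a side note on the queryResultWindowSize comment near the top of this config: the rounding it describes is easy to model. Here is a minimal sketch (Python, purely illustrative; the function name is made up, this is not Solr code):

```python
import math

def cached_window(start, rows, window_size):
    """Model the queryResultWindowSize behaviour: a page request for
    documents [start, start+rows) causes documents [0, N) to be
    collected and cached, where N is start+rows rounded up to a
    multiple of window_size."""
    return window_size * math.ceil((start + rows) / window_size)

# A request for docs 10-19 with a window of 50 collects docs 0-49,
# so later pages inside that range are served from the queryResultCache.
```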

Alan Rykhus

Jan 5, 2010, 5:47:00 PM
to solrma...@googlegroups.com
Hello Erik,

How do you want the records? Our marc data is currently in 72 files, a
little over 9 Megs total in size.

al

Till Kinstler

Jan 6, 2010, 3:33:09 AM
to solrma...@googlegroups.com
Cicer0 wrote:

> Sorry to add to the "bad" data points, but we have very poor indexing
> numbers at Yale as well (8.2 million records), and it seems to get
> exponentially worse as the index grows.

Same here. Our index currently has about 20 million records. In November
I posted some numbers on this list in
http://groups.google.com/group/solrmarc-tech/msg/2e30799adbc3e81b
Those numbers were taken using a pre-1.4 Solr version and solrmarc 2.0.
I may now report that things haven't changed with Solr 1.4 final and a
recent solrmarc SVN checkout.
Some more words on the environment: Indexing is done on a 16 GB machine,
which is serving searches at the same time. The JVM running the Solr
search server gets up to 8 GB of heap. It is a multicore installation
serving multiple indexes of different sizes, though I haven't done much
index-specific tuning of solrconfig settings; this is what the index
configurations look like:

<indexDefaults>
<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>

<ramBufferSizeMB>1024</ramBufferSizeMB>


<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
<lockType>single</lockType>
</indexDefaults>


<mainIndex>


<!-- options specific to the main on-disk lucene index -->
<useCompoundFile>false</useCompoundFile>

<ramBufferSizeMB>128</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>
<maxFieldLength>2147483647</maxFieldLength>
<unlockOnStartup>false</unlockOnStartup>

<deletionPolicy class="solr.SolrDeletionPolicy">
<!-- Keep only optimized commit points -->


<str name="keepOptimizedOnly">false</str>

<!-- The maximum number of commit points to be kept -->


<str name="maxCommitsToKeep">1</str>

<!--
Delete all commit points once they have reached the given age.
Supports DateMathParser syntax e.g.

<str name="maxCommitAge">30MINUTES</str>
<str name="maxCommitAge">1DAY</str>
-->
</deletionPolicy>
</mainIndex>

solrmarc's JVM gets another 4GB of heap for indexing. More heap for
solrmarc seems to help a bit, but the overall effect (significant
slowdown as index size grows) stays the same. solrmarc is always using
"EmbeddedSolr".

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kins...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de

Cicer0

Jan 6, 2010, 1:22:22 PM
to solrmarc-tech
For a variety of reasons, it would be difficult to just "hand out" our
records. But if the controlling factor is indeed (size of index) x
(size of import), couldn't you just cook up an "input generator" that
cranks out arbitrarily large dummy record files and keep feeding them
to an index until it chokes? I guess one other thing that might be
important is that we do populate a much larger set of marc tags than
most libraries (though possibly not more than Stanford). Perhaps this
copy of our added fields might be relevant: (from
marc_local.properties)

language = 008[35-37]:041a:041d:041j, language_map.properties
author2-role = 700e:710e
author_additional = 505r
title_alt = 130adfgklnpst:240a:246a:730adfgklnpst:740a
physical = 300abcefg
issn = 022a:440x:490x:730x:776x:780x:785x
format = script(format.bsh), getFormat, format_map.properties
callnumber-a = 090a:050a, first
#callnumber = script(callnumber.bsh), getFullCallNumber
#callnumber-subject = script(callnumber.bsh), getCallNumberSubject, callnumber_subject_map.properties
#callnumber-subject-code = script(callnumber.bsh), getCallNumberSubject
#callnumber-label = script(callnumber.bsh), getCallNumberLabel
publishDate = 008[7-10]:008[11-14]:260c,(pattern_map.date_cleanup_2), first
pattern_map.date_cleanup_2.pattern_0 = [^0-9]*((20|1[98765432])[0-9][0-9]).*=>$1
pattern_map.date_cleanup_2.pattern_1 = [^0-9]*((20|1[98765432])[0-9])u.*=>$10
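For anyone puzzling over those two date_cleanup_2 patterns: they pull a plausible 4-digit year out of the raw field values, with pattern_1 turning a trailing 'u' (unknown final digit) into a 0. A rough re-implementation in Python, just to show the intent (SolrMarc applies these through its pattern_map mechanism, not this code; the function and constant names here are made up):

```python
import re

# Pattern 0: a full 4-digit year somewhere in the value, e.g. 'c1987.' -> '1987'
PATTERN_0 = re.compile(r'[^0-9]*((20|1[98765432])[0-9][0-9]).*')
# Pattern 1: three digits plus 'u' (unknown final digit), e.g. '198u' -> '1980'
PATTERN_1 = re.compile(r'[^0-9]*((20|1[98765432])[0-9])u.*')

def clean_date(raw):
    """Return a cleaned 4-digit year, or None if no year is recognizable."""
    m = PATTERN_0.fullmatch(raw)
    if m:
        return m.group(1)
    m = PATTERN_1.fullmatch(raw)
    if m:
        return m.group(1) + '0'   # the '=>$10' replacement: $1 plus a literal 0
    return None
```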

On Jan 5, 1:52 pm, Erik Hatcher <erik.hatc...@gmail.com> wrote:

Erik Hatcher

Jan 6, 2010, 4:00:21 PM
to solrma...@googlegroups.com
Just a quick update, I've now got the 2.1 branch of SolrMarc checked
out, and thanks to Bob's help I've gotten it built and tested. Now
I'm downloading one of the large MARC files Ross pointed to, and
will do some testing with this over the next week or so. I'll report
back as soon as I have some info.

Erik

Greg Pendlebury

Jan 6, 2010, 9:55:02 PM
to solrma...@googlegroups.com
On an otherwise excellent list of information I wanted to interject on a couple of points.

2010/1/5 Erik Hatcher <erik.h...@gmail.com>

>     <indexDefaults>
>         <useCompoundFile>false</useCompoundFile>
>         <mergeFactor>20</mergeFactor>

That merge factor is too high.  Set it back to the default of 10, or
maybe even less.  But start with 10.


At the risk of sounding like a goose... is that right? I thought (and my local setup would agree) that mergeFactor is useless so long as useCompoundFile = false. Once it's true I agree it's too high, but not for performance: the suggested value of 30 would index faster than 20 (or 10), but all three would index slower than currently, because no merging is currently being done.

I agree that 10 is the better option though if you want merging... or even lower if willing to take the hit. I run my (admittedly smaller data set) all the way down to 2 so I have one index segment to optimise search speeds. I haven't got any data to highlight the practical (if any) improvements you'll see below a value of 10, but theoretically it's better.

When we optimised our index we did pretty much everything Erik has listed, but deliberately chose to sacrifice speed at index time for query speed with the mergeFactor value.

The other point:

>>  -server -Xmx12g -Xms12g -d64 -XX:+UseParallelGC -XX:+AggressiveOpts -XX:NewRatio=5

-server and -d64 are mutually exclusive, you only choose one. Testing on our Solaris server (and the results are very much OS dependant) showed -server to be better. Once you turn on -d64 Solaris throws way too much RAM into each thread's stack space, even after manually dropping the amount (-Xss) -server still performed better.

Because each OS has a slightly different JVM (and default command line options), it's worth benchmarking the three setups (-server, -d64 and nothing) to see what works best... but I suspect it will be -server. Just keep in mind that putting in nothing falls back to the default (-server on Solaris).

Ta,
Greg

Greg Pendlebury

Jan 6, 2010, 10:01:52 PM
to solrma...@googlegroups.com

2010/1/7 Greg Pendlebury <greg.pe...@gmail.com>

-server and -d64 are mutually exclusive, you only choose one. Testing on our Solaris server (and the results are very much OS dependant) showed -server to be better. Once you turn on -d64 Solaris throws way too much RAM into each thread's stack space, even after manually dropping the amount (-Xss) -server still performed better.

Because each OS has a slightly different JVM (and default command line options), it's worth benchmarking the three setups (-server, -d64 and nothing) to see what works best... but I suspect it will be -server. Just keep in mind that putting in nothing falls back to the default (-server on Solaris).

Oops, forgot to mention the RAM. I'm sure you're already aware that -d64 is required to throw that much RAM at the process, but I totally agree with Erik that RAM isn't usually the problem (especially not at index time; solr might benefit from more as a server... not so sure there). I'd try a -server setup with 3.5g of RAM (or 3.8, I can never remember the limit) to eliminate -d64 and see what happens.

Erik Hatcher

Jan 7, 2010, 9:57:25 AM
to solrma...@googlegroups.com

On Jan 6, 2010, at 9:55 PM, Greg Pendlebury wrote:
> On an otherwise excellent list of information I wanted to interject
> on a couple of points.
>
> 2010/1/5 Erik Hatcher <erik.h...@gmail.com>
> > <indexDefaults>
> > <useCompoundFile>false</useCompoundFile>
> > <mergeFactor>20</mergeFactor>
>
> That merge factor is too high. Set it back to the default of 10, or
> maybe even less. But start with 10.
>
>
> At the risk of sounding like a goose... is that right? I thought
> (and my local setup would agree) that mergeFactor is useless so long
> as useCompoundFile = false. Once it's true I agree it's too high,
> but not for performance, the suggested value of 30 would index
> faster than 20 (or 10) but all three would index slower than
> currently because no merging is currently being done.

With Lucene 2.9/Solr 1.4, merges happen concurrently in the
background. So there is no need to worry too much about the merge
factor. Too high and you get a lot of files and a painful optimize.

compound file does not, as you suggest, prohibit merging... a compound
file is like the directory except inside a single file. merges still
happen on the internal "files".

> I agree that 10 is the better option though if you want merging...
> or even lower if willing to take the hit. I run my (admittedly
> smaller data set) all the way down to 2 so I have one index segment
> to optimise search speeds. I haven't got any data to highlight the
> practical (if any) improvements you'll see below a value of 10, but
> theoretically it's better.

Reducing it lower than 10 is often a recommendation we make to get
quicker searching without having to optimize. But again, now that
merging happens concurrently and Lucene can be told to merge only a
few segments at a time to incrementally optimize, it's often best not
to second-guess this too much: stick with 10 unless there are really
some issues.
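(For reference, a sketch of what that advice looks like in a Solr 1.4
solrconfig.xml; the mergeScheduler line only makes the 1.4 default
explicit.)

```xml
<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <!-- 10 is the default; higher means more segments and a slower optimize -->
  <mergeFactor>10</mergeFactor>
  <!-- explicit here, but already the default in Solr 1.4:
       merges run in background threads instead of blocking adds -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
</indexDefaults>
```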

Erik

Greg Pendlebury

unread,
Jan 7, 2010, 5:25:36 PM1/7/10
to solrma...@googlegroups.com
Aaah, thank-you. The compound file info in particular was news to me.

Ta,
Greg

2010/1/8 Erik Hatcher <erik.h...@gmail.com>
--
You received this message because you are subscribed to the Google Groups "solrmarc-tech" group.
To post to this group, send email to solrma...@googlegroups.com.
To unsubscribe from this group, send email to solrmarc-tec...@googlegroups.com.

Cicer0

unread,
Jan 11, 2010, 3:24:25 PM1/11/10
to solrmarc-tech
Wait! Erik says here "With Lucene 2.9/Solr 1.4, merges happen
concurrently in the background."

I don't think that is happening here (even though we use mergeFactor
10). When we finish an import (before optimizing) there are easily a
hundred un-merged segments in our index, and then the optimize alone
will take an additional hour to clean up the mess. Is there a way
that the automerge could have been accidentally disabled?

For reference I include the <indexDefaults> and <mainIndex> portions
of our solrconfig.xml:


<config>
  <!-- Set this to 'false' if you want solr to continue working after
       it has encountered a severe configuration error. In a production
       environment, you may want solr to keep working even if one
       handler is mis-configured.

       You may also set this to false by setting the system property:
         -Dsolr.abortOnConfigurationError=false
  -->
  <abortOnConfigurationError>${solr.abortOnConfigurationError:false}</abortOnConfigurationError>

  <!-- Used to specify an alternate directory to hold all index data
       other than the default ./data under the Solr home.
       If replication is in use, this should match the replication
       configuration. -->
  <dataDir>${solr.solr.home:./solr}/biblio</dataDir>

  <indexDefaults>
    <!-- Values here affect all index writers and act as a default
         unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>

    <!-- Tell Lucene when to flush documents to disk.
         Giving Lucene more memory for indexing means faster indexing
         at the cost of more RAM.

         If both ramBufferSizeMB and maxBufferedDocs are set, Lucene
         will flush based on whichever limit is hit first. -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <ramBufferSizeMB>32</ramBufferSizeMB>

    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>

    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>

    <!-- Expert: Turn on Lucene's auto commit capability.
         TODO: Add recommendations on why you would want to do this. -->
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>

    <!-- Deprecated -->
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->

    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>

    <!-- If true, unlock any held write or commit locks on startup.
         This defeats the locking mechanism that allows multiple
         processes to safely access a lucene index, and should be
         used with care.
         This is not needed if lock type is 'none' or 'single' -->
    <unlockOnStartup>false</unlockOnStartup>

    <!-- Custom deletion policies can be specified here. The class must
         implement org.apache.lucene.index.IndexDeletionPolicy.
         http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/index/IndexDeletionPolicy.html

         The standard Solr IndexDeletionPolicy implementation supports
         deleting index commit points on number of commits, age of
         commit point and optimized status.

         The latest commit point should always be preserved regardless
         of the criteria. -->
    <deletionPolicy class="solr.SolrDeletionPolicy">
      <!-- Keep only optimized commit points -->
      <str name="keepOptimizedOnly">false</str>

      <!-- The maximum number of commit points to be kept -->
      <str name="maxCommitsToKeep">1</str>

      <!-- Delete all commit points once they have reached the given age.
           Supports DateMathParser syntax e.g.
             <str name="maxCommitAge">30MINUTES</str>
             <str name="maxCommitAge">1DAY</str>
      -->
    </deletionPolicy>
  </mainIndex>


Naomi Dushay

unread,
Jan 13, 2010, 6:53:41 PM1/13/10
to solrma...@googlegroups.com
Our experience is that each "chunk" of records takes about an hour,
all the way through the process. We only optimize after the last
chunk:

note timestamps on files are roughly an hour apart

(started job roughly 19:00)
Jan 12 19:58 log000-049.txt 486088 records
Jan 12 20:51 log050-099.txt 491110 records
Jan 12 21:43 log100-149.txt 477976 records
Jan 12 22:38 log150-199.txt 490835 records
Jan 12 23:28 log200-249.txt 469292 records
Jan 13 00:25 log250-299.txt 477624
Jan 13 01:24 log300-349.txt 428567
Jan 13 01:47 log350-399.txt 166421
Jan 13 02:31 log400-449.txt 335662
Jan 13 03:05 log450-499.txt 280054
Jan 13 03:34 log500-549.txt 207567
Jan 13 04:10 log550-599.txt 277984
Jan 13 04:34 log600-649.txt 172344
Jan 13 05:19 log650-699.txt 304653
Jan 13 05:47 log700-749.txt 248123
Jan 13 06:35 log750-799.txt 388703
Jan 13 08:09 log800-849.txt 416710 (also includes optimize)

- Naomi

jbarnett

unread,
Jan 14, 2010, 1:58:03 PM1/14/10
to solrmarc-tech
We use larger files (800K records), and within them, the first 400,000
do seem to be fairly consistent. We optimize between each file. Maybe
the secret is just smaller files.


Naomi Dushay

unread,
Jan 14, 2010, 4:28:35 PM1/14/10
to solrma...@googlegroups.com
The secret is not to optimize after each file, unless you are updating
a solr index currently in use by an application. The optimization
would account for the length of time growing as more records are added
to the index. If you optimize only at the end, the timing will
improve greatly.
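(As a sketch of that last step: once the final chunk is in, the single
end-of-run optimize is just an XML message POSTed to Solr's update
handler; the localhost URL below is a placeholder.)

```xml
<!-- POST to http://localhost:8983/solr/update after the last chunk -->
<optimize waitFlush="true" waitSearcher="true"/>
```

A plain <commit/> message works the same way after each intermediate
chunk, without the segment-merging cost of an optimize.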

- Naomi


Jonathan Rochkind

unread,
Jan 14, 2010, 4:33:54 PM1/14/10
to solrma...@googlegroups.com
If you guys wanted to write up how you manage to update a solr index
that _isn't_ the one currently in use by the app... and then somehow
swap your solr index so it will be live... that recipe would probably
be useful to us.

Although I guess if one of us can find time to investigate the Erik
Hatcher simpler Solr update handler approach, and it really does have
the order-of-magnitude speed increase he anticipates, it might not be
necessary.

Jonathan

Till Kinstler

unread,
Jan 15, 2010, 3:41:52 AM1/15/10
to solrma...@googlegroups.com
Jonathan Rochkind wrote:

> If you guys wanted to write up how you manage to update a solr index
> that _isn't_ the one currently in use by the app... and then somehow
> swap your solr index so it will be live.... that recipe would probably
> be useful to us.

Hmmm, general ways to do that:
1) copy/rsync/... the production index to a separate Solr index core or
Solr installation
2) apply your updates to that copy
3) copy/rsync the updated index back to your production Solr core or
Solr installation
4) send a <commit/> to the production core/installation
5) sleep n, goto 1.

To avoid the somewhat risky overwriting of the production index while
in use (step 3; I am not sure what happens if copy/rsync/... fails
midway), replace steps 3-5 with:
3) point your application to the updated index core/installation
4) copy/rsync the updated index over the old production core
5) apply the next updates to that now unused core
6) goto 3)
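(A no-copy variant of the second recipe, as a sketch: if both indexes
live in one multicore Solr 1.3+/1.4 instance, the CoreAdmin SWAP
action exchanges the cores in place. The core names "live" and "build"
below are made up.)

```xml
<!-- solr.xml: one core serving queries, one being rebuilt -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="live"  instanceDir="live"/>
    <core name="build" instanceDir="build"/>
  </cores>
</solr>
```

After indexing and optimizing into the build core, a request to
/solr/admin/cores?action=SWAP&core=live&other=build makes the freshly
built index the one serving queries, with no files copied.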

But try HTTP POST in solrmarc-2.1 first; after that you'll have a clue
whether MARC processing in solrmarc or index writing is the
bottleneck. For us it's clearly index writing, though I am not sure
why (I guess because of memory issues)...

Erik Hatcher

unread,
Jan 15, 2010, 5:00:44 AM1/15/10
to solrma...@googlegroups.com
Rather than these involved steps, I recommend using Solr's replication
feature. It is specifically designed to have master and slaves
configuration, where you index into the master, and then it is
replicated to the slaves (either automatically or by request).

Solr 1.4's replication is over HTTP rather than rsync, so it makes for
a lot easier configuration and deployment. Also note that you will
likely want to replicate configuration files also, not just the index,
so that the schema/solrconfig are kept in sync with the index (granted
changes to these will be rare).

For more details, see http://wiki.apache.org/solr/SolrReplication
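(A minimal sketch of what that looks like in each side's
solrconfig.xml; the master hostname and poll interval below are
placeholders.)

```xml
<!-- On the master (the instance you index into): -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">optimize</str>
    <str name="confFiles">schema.xml,solrconfig.xml</str>
  </lst>
</requestHandler>

<!-- On each slave (the instances serving queries): -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:20:00</str>
  </lst>
</requestHandler>
```

With replicateAfter set to optimize, slaves pull a new copy only after
the end-of-run optimize, so queries never hit a half-built index.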

Erik


Jonathan Rochkind

unread,
Jan 15, 2010, 1:29:58 PM1/15/10
to solrma...@googlegroups.com
Awesome, thanks.
________________________________________
From: solrma...@googlegroups.com [solrma...@googlegroups.com] On Behalf Of Erik Hatcher [erik.h...@gmail.com]
Sent: Friday, January 15, 2010 5:00 AM
To: solrma...@googlegroups.com
Subject: Re: [solrmarc-tech] Re: indexing time - RAM, etc.