Fwd: [islandora] gSearch not inserting into solr index?


Aaron Coburn

May 22, 2015, 2:41:02 PM

to <islandora@googlegroups.com>

Hi, Brad,

When your application sends an update to Solr, if it includes the `?commit=true` parameter, you will simulate a synchronous (lag-free) interaction with Solr. That means that every update will immediately be available in your searchable index.

I would *strongly* caution you against that approach, however. Solr is not like a transactional RDBMS (like MySQL or Postgres), and there are actually a lot of advantages to having a degree of lag in the system.

That last sentence may sound strange, but let me explain. When Solr accepts an update, that data is buffered in memory but not flushed into the (in-memory) search index until a commit occurs. That process (the "soft commit"), however, is actually somewhat expensive for Solr, especially when it is under any sort of load. It is even more expensive to flush that data into persistent (file-based) storage -- that is, a "hard commit". To support higher throughput for your indexing and searching operations, it is almost always desirable to let Solr handle these operations in larger batches. By setting the `commitWithin` parameter to some value, you're telling Solr the maximum time span in which these buffers must be flushed (Solr may choose to do this earlier).
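As a rough illustration of the `commitWithin` approach, here is a minimal Python sketch that builds the same kind of `<add commitWithin="...">` update message shown in the 2013 reply quoted below. The helper name `build_add_message` and the field names and values (`PID`, `title`, `demo:1`) are made up for the example; your gsearch transform will produce its own fields, and this only builds the XML without sending it anywhere.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields, commit_within_ms):
    """Build a Solr XML <add> message carrying a commitWithin attribute.

    `fields` maps field names to string values (hypothetical names here).
    `commit_within_ms` is the maximum delay, in milliseconds, before Solr
    must make the document searchable.
    """
    add = ET.Element("add", {"commitWithin": str(commit_within_ms)})
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; 15000 ms matches the value used in the quoted example.
message = build_add_message({"PID": "demo:1", "title": "Example"}, 15000)
print(message)
```

The point of the design is that the client never asks for a commit itself; it only states how long it is willing to wait, and Solr is free to fold many updates into one commit.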

This asynchronous interaction is actually really beneficial for the application that is pushing the data into Solr. For instance (and I'm making up numbers here), suppose every Solr update takes 100 ms, every Solr commit takes 1000 ms, and you want to index 1000 documents. It would take 100 seconds for all of the updates (if run serially -- faster if done in parallel) plus an additional ~5 seconds if the commitWithin is set at 20 seconds, since that works out to about five commits. For the same set, with a commit at every update, it would take ~1100 seconds (or ~18 minutes) -- and those couldn't run much faster in parallel, because the commit is a blocking (synchronized) operation.
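The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope model using the same made-up numbers (100 ms per update, 1000 ms per commit, 1000 documents, serial updates); the function name `indexing_time_seconds` is invented for the example and real Solr timings will differ.

```python
import math


def indexing_time_seconds(num_docs, update_ms, commit_ms, commit_within_s=None):
    """Estimate total time for serial indexing under a simple model.

    With commit_within_s=None, every update is followed by its own commit
    (the commit=true case). Otherwise one commit is assumed per
    commit_within_s window of update time.
    """
    update_total_s = num_docs * update_ms / 1000
    if commit_within_s is None:
        commits = num_docs
    else:
        commits = math.ceil(update_total_s / commit_within_s)
    return update_total_s + commits * commit_ms / 1000


batched = indexing_time_seconds(1000, 100, 1000, commit_within_s=20)
per_update = indexing_time_seconds(1000, 100, 1000)
print(batched, per_update)  # 105.0 1100.0
```

Under this model the per-update commit case is roughly ten times slower, and almost all of that extra time is spent in commits.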

Basically, the point is, if you treat Solr operations as synchronous (using commit=true for every update), your entire application will slow down significantly -- both for search and update.

Hope that helps,
Aaron

> On May 22, 2015, at 2:06 PM, Brad Spry <brad...@gmail.com> wrote:
>
> Aaron,
>
> It's obvious you placed an OR between your explanations of the options. Are the three options mutually exclusive?
>
> I've tweaked all three options and I'm still not satisfied... I'm wondering if there is a point of diminishing returns when all three are set?
>
> I want all the lag out of the system, there's no excuse for it from an infrastructure perspective. It's gotta be software where the lag lies, not hardware.
>
>
> Brad
>
>
>
> On Thursday, May 9, 2013 at 1:52:24 PM UTC-4, Aaron Coburn wrote:
> When documents are added to Solr, they are not visible to new search requests until a "commit" operation has been executed. [1]
>
> When you ask gsearch to run an "optimize" operation, it is a type of "hard commit" on Solr, and then the new items will be available to search requests.
>
> There are numerous ways to address this, depending on your needs. You can either run a 'commit' or 'optimize' command manually after bulk ingests.
>
> Or, you can add a "commitWithin" attribute to the <add> element of the Solr DocumentXML:
>
> <add commitWithin="15000">
> <doc>
> ...
> </doc>
> </add>
>
> Or, you can update the solrconfig.xml file inside Solr. For that, you will want to configure an <autoCommit> or <autoSoftCommit> clause. For example (in Solr 4.2):
>
> <autoCommit>
> <maxTime>15000</maxTime>
> <openSearcher>true</openSearcher>
> </autoCommit>
> (commits every 15 seconds)
>
> Or:
>
> <autoSoftCommit>
> <maxTime>1000</maxTime>
> </autoSoftCommit>
> (commits every 1 second)
>
> Aaron
>
> [1] http://wiki.apache.org/solr/UpdateXmlMessages
>
> On May 9, 2013, at 1:39 PM, John <jyo...@gmail.com> wrote:
>
>> We are having some unexpected behaviour with gsearch. When we add a record I can see in the logs that updateIndex is run and that the index has been updated. Here is a snippet from the logs.
>>
>> <updateIndex xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:foxml="info:fedora/fedora-system:def/foxml#" xmlns:zs="http://www.loc.gov/zing/srw/" warnCount="0" docCount="3" deleteTotal="0" updateTotal="0" insertTotal="1" indexName="FgsIndex"/>
>>
>> The problem is that it is not available in Solr until I go to the fedoragsearch/rest interface and press 'updateIndex optimize'.
>>
>> Looking at Solr I can see that a few new files are created when I ingest a new item and that they disappear when I optimize. Should Solr be looking at these? Or, should gsearch be running optimize after ingest? Did I miss something during configuration?
>>
>> Any ideas would be greatly appreciated.
>>
>> John
>>
>

Brad Spry

May 26, 2015, 4:27:29 PM

to isla...@googlegroups.com

Thank you so much for your thorough explanation, Aaron!

Your Solr insight is invaluable...

Thank you again,

Brad
