Looks to be 5,786,920 records (reported from my DIH-based indexer, see later). Good enough size for testing here.
First I gave the 2.1 branch of SolrMarc a try. I chatted with Bob, got the exact scoop to index a MARC file into Solr over HTTP. I'm not even gonna fiddle with the embedded indexer, as I don't believe it is the way to go.
I kicked it off at 5pm last night. At 2am, it was still going (9 HOURS!!!!). I ctrl-c'd it as I was tired of my computer churning so hard. It was mostly done, I found out, as it reported:
...
INFO [main] (MarcImporter.java:256) - Added record 5051841 read from file: 8f89ebfab7654a4b8e923d3d893481be
INFO [Thread-1] (MarcImporter.java:455) - Starting Shutdown hook
INFO [Thread-1] (MarcImporter.java:459) - Stopping main loop
INFO [main] (MarcImporter.java:256) - Added record 5051842 read from file: ddd9ca40d8a24cea865bdafa588877c9
INFO [main] (MarcImporter.java:506) - Adding 5051842 of 5051842 documents to index
INFO [main] (MarcImporter.java:507) - Deleting 0 documents from index
INFO [main] (MarcImporter.java:381) - Calling commit
INFO [main] (MarcImporter.java:392) - Done with the commit, closing Solr
INFO [main] (MarcImporter.java:395) - Setting Solr closed flag
INFO [Thread-1] (MarcImporter.java:474) - Finished Shutdown hook
INFO [main] (MarcImporter.java:516) - Finished indexing in 573:02.00
INFO [main] (MarcImporter.java:525) - Indexed 5051842 at a rate of about 22.0 per sec
INFO [main] (MarcImporter.java:526) - Deleted 0 records
Confirmed the number of documents via SolrMarc's cool tools:

~/dev/solrmarc/dist/bin: printrecord /Users/erikhatcher/Downloads/talis-openlibrary-contribution.mrc | egrep '^001' | wc -l
5786920
Thanks Bob for the help to get this rolling!
As I was kicking off the indexer, I perused the code and was a bit aghast at the SolrProxy stuff, and all the reflection going on in there. SolrServer is the abstraction I recommend using, in the SolrJ library. And yes, I understand the rationale (class loader, Solr version, etc), but let's go ahead and jump to the real win here....
So rather than digging into the code in depth, I wanted to prove out a simpler way of indexing MARC. I had done this earlier (1.5 years ago or so?) using a custom update request handler, very simple loop using plain ol' MARC4J. I just wanted to see how fast the simplest possible thing that could work runs. But rather than the update handler approach, I figured I'd go for the gusto and frame it within the DataImportHandler framework as a custom EntityProcessor.
So I've done that. My results:
Indexed that entire set in 55 minutes, 15 seconds. Yes, you read that right.
What's the difference? Ok, some shortcuts here: I'm not using the SolrMarc mappings, just MARC4J to index the 001 id value, and I took the Record object and toString'd it into an indexed text field. That makes the whole thing searchable at least. That's all. So no heavy lifting on the mappings, just pure MARC -> Solr. With my SolrMarc indexing try, I pointed at the UvaBlacklight Solr config and mappings, and with my "DIHmarc" (die MARC? dim-mak?) implementation I used Solr's example schema. So the analysis work differs between the two runs as well.
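To make the shortcut concrete, here is a minimal sketch of that mapping step in plain Java. It is not the prototype's code: the `Field` class is a hypothetical stand-in for a parsed MARC field (the prototype uses MARC4J's Record instead), and the field names `id` and `text` are assumptions about the target schema. In a DIH EntityProcessor, a row of this shape would be what `nextRow()` hands back.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the "pure MARC -> Solr" mapping step: one id field plus one catch-all text field. */
final class MarcRowSketch {

    /** Hypothetical stand-in for a parsed MARC field (not a MARC4J class). */
    static final class Field {
        final String tag;
        final String value;

        Field(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    /**
     * Builds a flat row: "id" from the first 001 field, "text" from every field joined by newlines.
     * The names "id" and "text" are assumed to match fields in the target schema.
     */
    static Map<String, Object> toRow(List<Field> fields) {
        Map<String, Object> row = new LinkedHashMap<>();
        StringBuilder text = new StringBuilder();
        for (Field field : fields) {
            if ("001".equals(field.tag) && !row.containsKey("id")) {
                row.put("id", field.value.trim());
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(field.tag).append(' ').append(field.value);
        }
        row.put("text", text.toString());
        return row;
    }

    public static void main(String[] args) {
        // Made-up sample values, purely for illustration.
        List<Field> record = Arrays.asList(
                new Field("001", "ocm12345"),
                new Field("245", "An example title"));
        System.out.println(toRow(record));
    }
}
```

Any richer mapping (SolrMarc's, or DIH Transformers) would replace the body of `toRow` while keeping the same record-in, map-out shape.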
I'm posting a .zip file to the files section shortly with the full (incredibly tiny) deal and a README.txt showing how easy this is to use with Solr 1.4.
What's left? The README.txt has these points:
* The MARC file path is currently hardcoded into MarcEntityProcessor.
* SolrMarc mappings aren't wired in, see placeholder in MarcEntityProcessor
* Parameterization is needed, and integration to allow MarcEntityProcessor to work on a list of files or input streams
* Consider whether SolrMarc mappings are the way to go for straightforward things, or use DIH Transformers
What now? Maybe we commit this and iterate on it on a branch or somewhere where I can commit to it also? And we make it parameterizable and then wire in SolrMarc where this placeholder resides in my prototype:
// Insert Record => Map SolrMarc magic API call here
I imagine there is some MARC4J iteration work that might need to be tweaked - I just used the stock marc4j.jar, an old one I had lying around from when I did the update handler a couple of years ago (oh the pains of that collecting dust!).
Why is my implementation so much faster? I don't know yet, and frankly I don't think it matters. :) It surely isn't simply the SolrMarc mappings, it's got to be in the indexing code, maybe with the reflection code? Maybe with schema.xml (that version="0.4" is way wrong). Maybe other complexities? Mainly I think DIH is a better way anyway, so I started my work with that.
The difference in speeds:
SolrMarc: 22 docs / s
DIHmarc: 1,745 docs / s
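As a sanity check on these rates, the sketch below recomputes throughput from the document counts and elapsed times quoted in this message. It assumes the log's "573:02.00" means minutes:seconds, which fits a 5pm start and a ctrl-c some time after 2am; the class and method names are made up for illustration. Under that assumption the DIH figure matches the 1,745 docs / s above, while the SolrMarc run works out to roughly 147 docs / s rather than the 22.0 its log printed, so the gap may be closer to 12x than to 80x.

```java
import java.util.Locale;

/** Recomputes indexing throughput from the counts and wall-clock times quoted in the thread. */
final class IndexingThroughput {

    /** Documents per second for a run that took the given minutes and seconds. */
    static double docsPerSecond(long docs, long minutes, long seconds) {
        long elapsedSeconds = minutes * 60 + seconds;
        return (double) docs / elapsedSeconds;
    }

    public static void main(String[] args) {
        // SolrMarc run: 5,051,842 docs, "Finished indexing in 573:02.00" (assumed mm:ss).
        double solrMarc = docsPerSecond(5_051_842L, 573, 2);
        // DIH run: 5,786,920 docs in 55 minutes, 15 seconds.
        double dih = docsPerSecond(5_786_920L, 55, 15);

        System.out.printf(Locale.ROOT, "SolrMarc: %.1f docs/s%n", solrMarc);
        System.out.printf(Locale.ROOT, "DIHmarc:  %.1f docs/s%n", dih);
        System.out.printf(Locale.ROOT, "Speedup:  %.1fx%n", dih / solrMarc);
    }
}
```

Either way the DIH run is much faster; the arithmetic only changes how large the difference is.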
Furthermore, with this in the DataImportHandler framework it allows blending with other data sources. So one next step could be to build a config to iterate through a directory of files and index them linearly. Or hit a relational database for lookup values to incorporate (the SQL result can be cached, so don't worry about performance here!), or to fetch a URL that might be pointed to from the MARC record and incorporate its text into a field, or.... well the possibilities are pretty wide open. Sure, SolrMarc's flexible config made some of this possible too, but DIH is more heavily used simply by having a broader user base. So, keep the SolrMarc mapping magic, but consider how it fits into the DIH framework. DIH has Transformers.
And yes, DIH is not without faults - its API is awkward, and many others feel the same way. We'll see DIH evolve even further soon. But we're finding at Lucid engagements that DIH solves the problem, and while it may look a little uglier, it's way quicker and easier to use it than to write a bunch of custom stuff for every project.
Ok, so, who's up for testing this out next on their data sets? Let's see some other numbers here.
> You guys make me manic... up all night because I couldn't sleep until this problem was well under control. And I think I pretty much kicked its butt.
This is very interesting. But I'm wondering... for most of our actual cases, it's not feasible to give up on all the SolrMarc mappings and just map everything as full text. It's an interesting demonstration, but.... we actually need the power of SolrMarc to do sophisticated mappings.
So... I'm left unsure how to proceed that would be useful for me. Simply using your solution with no mappings is not a solution. So theoretically we can add features to your custom Solr update handler to allow just as sophisticated mappings as SolrMarc... but will that just make it slow again, depending on how they're added?
Comparing: SolrMarc with sophisticated mappings; TO: Request handler with no mappings. We have (at least) two dimensions; without profiling we can't really be sure that the speed gain is due to the 'custom update handler', rather than just abandoning the sophisticated mappings. But we actually need the sophisticated mappings for our applications. So I'm not following why "it doesn't matter" why your code is so much faster; because if someone spent lots of time adding the ability for sophisticated locally defined mappings in your request handler (when that ability is already implemented in SolrMarc), only to find that they just wound up with the same performance problems... that would be kind of mis-spent development time, no?
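One cheap way to separate those two dimensions before investing in either is to time the mapping stage and the send-to-Solr stage independently. The following is only a sketch of such a harness: `StageTimer` is a made-up name, and the upper-casing mapper and list-backed sender in `main` are placeholders for a real mapping call and a real Solr client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Accumulates wall-clock time spent in the mapping stage and the sending stage separately. */
final class StageTimer {
    long mapNanos;
    long sendNanos;
    int count;

    /** Runs every record through mapper then sender, timing each stage on its own. */
    <R, D> void run(Iterable<R> records, Function<R, D> mapper, Consumer<D> sender) {
        for (R record : records) {
            long start = System.nanoTime();
            D doc = mapper.apply(record);
            long mapped = System.nanoTime();
            sender.accept(doc);
            long sent = System.nanoTime();

            mapNanos += mapped - start;
            sendNanos += sent - mapped;
            count++;
        }
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add("record-" + i);
        }

        // Placeholder stages: a real run would plug in the mapping code and the indexing call.
        List<String> sink = new ArrayList<>();
        Function<String, String> mapper = r -> r.toUpperCase();
        Consumer<String> sender = sink::add;

        StageTimer timer = new StageTimer();
        timer.run(records, mapper, sender);
        System.out.printf("records=%d map=%d us send=%d us%n",
                timer.count, timer.mapNanos / 1_000, timer.sendNanos / 1_000);
    }
}
```

If the mapping stage dominates, moving to a different indexing framework would not help much; if the send stage dominates, the mappings are unlikely to be the bottleneck.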
Does what I'm saying make sense? The thing is that the ability to locally define sophisticated mappings is kind of crucial to us.
Erik Hatcher wrote:
> You guys make me manic... up all night because I couldn't sleep until this problem was well under control. And I think I pretty much kicked its butt.
He's not talking about removing the mappings, he just didn't include them in his proof-of-concept. The idea would be to swap out MARC4J with SolrMarc and have it create the mappings -- the difference is that it wouldn't do all the other stuff it does, just take the MARC and make it a Solr doc.
Let SolrMarc do its job (take a MARC record, analyze, turn it into a Hash -- basically) and let Solr do its job (add/delete/commit/optimize).
On Thu, Jan 7, 2010 at 10:57 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
> This is very interesting. But I'm wondering... for most of our actual cases, it's not feasible to give up on all the SolrMarc mappings and just map everything as full text. It's an interesting demonstration, but.... we actually need the power of solrmarc to do sophisticated mappings.
Aha, I see, you're right, I did misunderstand. If the idea is that it's potentially possible to use the actually existing SolrMarc code with this new approach too, that sounds promising. It would be interesting to see whether, once you do that, your bottleneck comes back. But it's definitely worth investigating; if anyone does it they'll definitely have my gratitude.
Ross Singer wrote:
> Jonathan, I think you're missing Erik's point.
> He's not talking about removing the mappings, he just didn't include > them in his proof-of-concept. The idea would be to swap out MARC4J > with SolrMarc and have it create the mappings -- the difference is > that it wouldn't do all the other stuff it does, just take the MARC > and make it a Solr doc.
> Let SolrMarc do its job (take a MARC record, analyze, turn it into a > Hash -- basically) and let Solr do its job > (add/delete/commit/optimize).
> -Ross.
> On Thu, Jan 7, 2010 at 10:57 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>> This is very interseting. But I'm wondering... for most of our actual >> cases, it's not feasible to give up on all the SolrMarc mappings and just >> map everything as full text. It's an interesting demonstration, but.... >> we actually need the power of solrmarc to do sophisticated mappings.
>> So... I'm left unsure how to proceed that would be useful for me. Simply >> using your solution with no mappings is not a solution. So theoretically we >> can add features to your custom Solr update handler to allow just as >> sophisticated mappings as SolrMarc... but will that just make it slow again, >> depending on how they're added?
>> Comparing: SolrMarc with sophisticated mappings; TO: Request handler with >> no mappings. We have (at least) two dimensions, without profiling we can't >> really be sure that speed gain is due to 'custom update handler', rather >> than just abandoning the sophisticated mappings. But we actually need the >> sophisticated mappings for our applications. So I'm not following why "it >> doesn't matter" why your code is so much faster; because if someoen spent >> lots of time adding the ability for sophisticated locally defined mappings >> in your request handler (when that ability is already implemented in >> SolrMarc), only to find that they just wound up with the same performance >> problems... that would be kind of mis-spent development time, no?
>> Does what I'm saying make sense? The thing is that the ability to locally >> define sophisticated mappings is kind of crucial to us.
>> Jonathan
>> Erik Hatcher wrote:
>>> You guys make me manic... up all night because I couldn't sleep until >>> this problem was well under control. And I think I pretty much kicked >>> its butt.
>>> Looks to be 5,786,920 records (reported from my DIH-based indexer, see >>> later). Good enough size for testing here.
>>> First I gave the 2.1 branch of SolrMarc a try. I chatted with Bob, >>> got the exact scoop to index a MARC file into Solr over HTTP. I'm not >>> even gonna fiddle with the embedded indexer, as I don't believe it is >>> the way to go.
>>> I kicked it off at 5pm last night. At 2am, it's still going (9 >>> HOURS!!!!). I ctrl-c'd it as I was tired of my computer churning so >>> hard. It was mostly done I found out as it reported:
>>> ...
>>> INFO [main] (MarcImporter.java:256) - Added record 5051841 read from file: 8f89ebfab7654a4b8e923d3d893481be
>>> INFO [Thread-1] (MarcImporter.java:455) - Starting Shutdown hook
>>> INFO [Thread-1] (MarcImporter.java:459) - Stopping main loop
>>> INFO [main] (MarcImporter.java:256) - Added record 5051842 read from file: ddd9ca40d8a24cea865bdafa588877c9
>>> INFO [main] (MarcImporter.java:506) - Adding 5051842 of 5051842 documents to index
>>> INFO [main] (MarcImporter.java:507) - Deleting 0 documents from index
>>> INFO [main] (MarcImporter.java:381) - Calling commit
>>> INFO [main] (MarcImporter.java:392) - Done with the commit, closing Solr
>>> INFO [main] (MarcImporter.java:395) - Setting Solr closed flag
>>> INFO [Thread-1] (MarcImporter.java:474) - Finished Shutdown hook
>>> INFO [main] (MarcImporter.java:516) - Finished indexing in 573:02.00
>>> INFO [main] (MarcImporter.java:525) - Indexed 5051842 at a rate of about 22.0 per sec
>>> INFO [main] (MarcImporter.java:526) - Deleted 0 records
>>> Confirmed the number of documents via SolrMarc's cool tools:
>>> ~/dev/solrmarc/dist/bin: printrecord /Users/erikhatcher/Downloads/talis-openlibrary-contribution.mrc | egrep '^001' | wc -l
>>> 5786920
>>> Thanks Bob for the help to get this rolling!
>>> As I was kicking off the indexer, I perused the code and was a bit aghast at the SolrProxy stuff, and all the reflection going on in there. SolrServer is the abstraction I recommend using, in the SolrJ library. And yes, I understand the rationale (class loader, Solr version, etc), but let's go ahead and jump to the real win here....
>>> So rather than digging into the code in depth, I wanted to prove out a simpler way of indexing MARC. I had done this earlier (1.5 years ago or so?) using a custom update request handler, very simple loop using plain ol' MARC4J. I just wanted to see how fast the simplest possible thing that could work runs. But rather than the update handler approach, I figured I'd go for the gusto and frame it within the DataImportHandler framework as a custom EntityProcessor.
>>> So I've done that. My results:
>>> Indexed that entire set in 55 minutes, 15 seconds. Yes, you read that right.
>>> What's the difference? Ok, some shortcuts here - I'm not using the SolrMarc mappings, just MARC4J to index the 001 id value, and took the Record object and toString'd it into an indexed text field. Makes the whole thing searchable at least. That's all. So no heavy lifting on the mappings, just pure MARC -> Solr. With my SolrMarc indexing try, I pointed at the UvaBlacklight Solr config and mappings, and with my "DIHmarc" (die MARC? dim-mak?) implementation I used Solr's example schema. So some difference analysis work too.
>>> I'm posting a .zip file to the files section shortly with the full (incredibly tiny) deal and a README.txt showing how easy this is to use with Solr 1.4.
>>> What's left? The README.txt has these points:
>>> * The MARC file path is currently hardcoded into MarcEntityProcessor.
>>> * SolrMarc mappings aren't wired in, see placeholder in MarcEntityProcessor
>>> * Parameterization is needed, and integration to allow MarcEntityProcessor to work on a list of files or input streams
>>> * Consider whether SolrMarc mappings are the way to go for straightforward things, or use DIH Transformers
>>> What now? Maybe we commit this and iterate on it on a branch or somewhere where I can commit to it also? And we make it parameterizable and then wire in SolrMarc where this placeholder resides in my prototype:
>>> // Insert Record => Map SolrMarc magic API call here
>>> I imagine there is some MARC4J iteration work that might need to be tweaked - I just used the stock marc4j.jar, an old one I had lying around when I did the update handler a couple of years ago (oh the pains of that collecting dust!).
>>> Why is my implementation so much faster? I don't know yet, and frankly I don't think it matters. :) It surely isn't simply the SolrMarc mappings, it's got to be in the indexing code, maybe with the reflection code? Maybe with schema.xml (that version="0.4" is way wrong). Maybe other complexities? Mainly I think DIH is a better way anyway, so I started my work with that.
>>> The difference in speeds:
>>> SolrMarc: 22 docs / s
>>> DIHmarc: 1,745 docs / s
>>> Furthermore, with this in the DataImportHandler framework it allows blending with other data sources. So one next step could be to build a config to iterate through a directory of files and index them linearly. Or hit a relational database for lookup values to incorporate (the SQL result can be cached, so don't worry about performance here!), or to fetch a URL that might be pointed to from the MARC record and incorporate its text into a field, or.... well the possibilities are pretty wide open. Sure, SolrMarc's flexible config made some of this possible too, but DIH is more heavily used simply by having a broader user base. So, keep the SolrMarc mapping magic, but consider how it fits into the DIH framework. DIH has Transformers.
>>> And yes, DIH is not without faults - its API is awkward, many others feel the same way. We'll see DIH evolve even further soon. But, we're finding at Lucid engagements that DIH solves the problem and while it may look a little uglier, it's way quicker and easier to use it than writing a bunch of custom stuff for every project.
>>> Ok, so, who's up for testing this out next on their data sets? Let's see some other numbers here.
>>> Erik
> Aha, I see, you're right, I did misunderstand, if the idea is it's potentially possible to use actually existing SolrMarc code with this new approach too, that sounds promising. It would be interesting to see if once you do that... your bottleneck comes back.
I wouldn't expect that. My stupid little profiling efforts (I am in no way a proficient Java profiler) seem to indicate that the increasing slowness of solrmarc indexing comes from the actual indexing in its embedded Solr/Lucene part, not from the record processing in solrmarc. Of course record processing in solrmarc eats additional processing time (depending on what you do there), but that should be pretty independent of index size. And currently indexing speed seems to depend on index size (at least for my data, index configuration and hardware). Of course we shouldn't throw away the sophisticated options solrmarc offers to map MARCish records to Solr indexes. That's really useful!
Till
-- Till Kinstler Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG) Platz der Göttinger Sieben 1, D 37073 Göttingen kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
> I wouldn't expect that. My stupid little profiling efforts (I am in no way a proficient Java profiler) seem to indicate that the increasing slowness of solrmarc indexing comes from the actual indexing in its embedded Solr/Lucene part, not from the record processing in solrmarc.
Has anyone tried benchmarking/profiling SolrMarc 2.1 using its ability to simply HTTP POST to Solr instead of using the embedded solr/lucene? Some have always been suspicious of the embedded solr/lucene.
> Of course record processing in solrmarc eats additional processing time (depending on what you do there), but that should be pretty independent of index size. And currently indexing speed seems to depend on index size (at least for my data, index configuration and hardware).
> Of course we shouldn't throw away the sophisticated options solrmarc offers to map MARCish records to Solr indexes. That's really useful!
Oops, never mind. Erik's test _was_ of SolrMarc using HTTP POST indexing, not embedded solr/lucene. So the slowness of SolrMarc in Erik's test definitely wasn't related to the embedded solr/lucene. Hmm.
On Jan 7, 2010, at 12:03 PM, Jonathan Rochkind wrote:
> Has anyone tried benchmarking/profiling SolrMarc 2.1 using its ability to simply http post to solr instead of using the embedded solr/lucene? Some have always been suspicious of the embedded solr/lucene.
My 9 hour aborted run on the 5.7M .mrc file last night was with SolrMarc 2.1 over HTTP.
I posted a possibly related question on the "indexing time -RAM" thread, but for completeness, could you confirm that your 9 hour aborted run did not have an excessive number of un-merged segments (which I currently speculate is behind the relationship of index size to index time)?
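One rough way to answer the segment question is to count the distinct segment names in the index directory. The following is only a minimal sketch: it assumes Lucene's usual on-disk naming, where each segment's files begin with an underscore followed by a base-36 counter (for example _0.cfs or _1_1.del), while segments_N and segments.gen are bookkeeping files. The function name count_segments and the command-line usage are hypothetical, chosen for illustration.

```python
import re
import sys
from pathlib import Path

# Assumption: segment files look like "_<base36>.<ext>" or "_<base36>_<gen>.<ext>".
# "segments_N" and "segments.gen" do not start with "_" and are ignored.
SEGMENT_FILE = re.compile(r"^_([0-9a-z]+)[._]")


def count_segments(index_dir):
    """Return the number of distinct segment names found in index_dir."""
    names = set()
    for path in Path(index_dir).iterdir():
        match = SEGMENT_FILE.match(path.name)
        if match:
            names.add(match.group(1))
    return len(names)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical): python count_segments.py /path/to/solr/data/index
    print(count_segments(sys.argv[1]))
```

A count that keeps growing into the dozens or hundreds during a long run would be consistent with the un-merged segment theory; a small, stable count would argue against it.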
On Jan 7, 12:26 pm, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> On Jan 7, 2010, at 12:03 PM, Jonathan Rochkind wrote:
> > Has anyone tried benchmarking/profiling SolrMarc 2.1 using it's ability to simply http post to solr instead of using the embedded solr/lucene? Some have always been suspicious of the embedded solr/lucene.
> My 9 hour aborted run on the 5.7M .mrc file last night was with SolrMarc 2.1 over HTTP.
> Has anyone tried benchmarking/profiling SolrMarc 2.1 using it's ability to simply http post to solr instead of using the embedded solr/lucene?
I finally found some minutes to upgrade my solrmarc zoo to a checkout of the solrmarc-2.1 branch (following the instructions by Bob and Willem). I just did a short test using both methods: I added 132835 MARC records to an index holding 21504573 records.
using embedded Solr:
INFO [main] (MarcImporter.java:516) - Finished indexing in 5:00,00
INFO [main] (MarcImporter.java:525) - Indexed 132835 at a rate of about 441.0 per sec
using HTTP POST (by unsetting the solr.path property):
INFO [main] (MarcImporter.java:516) - Finished indexing in 10:15,00
INFO [main] (MarcImporter.java:525) - Indexed 132835 at a rate of about 215.0 per sec
As expected, memory consumption is much lower using HTTP POST, because solrmarc doesn't load a second Solr instance on the machine.
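As a quick sanity check on these two runs, the rates can be recomputed from the record count and the elapsed times. This is a minimal sketch, assuming the "Finished indexing in" value is minutes:seconds with a comma as the decimal separator; the helper names parse_elapsed and docs_per_second are hypothetical.

```python
def parse_elapsed(text):
    """Convert a 'minutes:seconds' string such as '10:15,00' to seconds.

    Assumption: the log prints minutes:seconds with a decimal comma.
    """
    minutes, seconds = text.replace(",", ".").split(":")
    return int(minutes) * 60 + float(seconds)


def docs_per_second(records, elapsed):
    """Average indexing throughput over the whole run."""
    return records / parse_elapsed(elapsed)


# Figures taken from the two log excerpts above.
RECORDS = 132835
runs = {"embedded": "5:00,00", "http": "10:15,00"}

rates = {name: docs_per_second(RECORDS, elapsed) for name, elapsed in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} docs/s")
print(f"embedded is {rates['embedded'] / rates['http']:.2f}x faster")
```

Under that assumption the recomputed rates come out close to the logged 441 and 215 per second, with embedded indexing roughly twice as fast as HTTP POST for this batch.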
Till
-- Till Kinstler Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG) Platz der Göttinger Sieben 1, D 37073 Göttingen kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de