ExtractingRequestHandler

J

Feb 12, 2010, 9:14:59 AM
to SolrNet
Hey guys,

First off, great job and thank you! SolrNet is so cool and we're
really coming up with some awesome stuff with it :)

Regarding support for ExtractingRequestHandler, is this coming soon or
should we research alternate methods for parsing documents?

Thanks!

Mauricio Scheffer

Feb 12, 2010, 9:25:00 AM
to SolrNet
I will implement it eventually, but not in the near future.
It would be great if you could implement it and send a patch.

Cheers,
Mauricio

J

Feb 12, 2010, 3:55:02 PM
to SolrNet
What are the main steps required? Would you say it's complex to
implement?

Thanks.

Mauricio Scheffer

Feb 13, 2010, 1:36:55 PM
to SolrNet
I can't tell for sure since I've never used that feature of Solr. From
a cursory look at the wiki:

* at the top level (ISolrOperations) there should be a method
AddFile(Stream content, IDictionary<string, string> parameters). I
have no idea what this returns.
* in order to accommodate these parameters, ISolrConnection.Post has
to be modified to accept querystring parameters in addition to form
parameters
* I don't know if Solr supports sending several files in a single
request. If it does, there should be a method
AddFile(IEnumerable<Stream> contents, IDictionary<string, string>
parameters)
* Integration tests are a must. I don't know if unit-tests apply
here.
* The extract-only feature could be implemented later, or maybe it
could just be a special case of AddFile() with extractOnly=true in the
parameters.
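
To make the querystring point and the extractOnly point above concrete,
here is a rough sketch of the HTTP request such an AddFile() would end
up sending. It is written in Python only to show the request shape and
is not SolrNet code. The names build_extract_request and extract_only
are hypothetical, and the /update/extract path is an assumption that
depends on how the handler is mapped in solrconfig.xml. The literal.id
and extractOnly names are parameters of Solr's extracting handler.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, content, parameters,
                          content_type="application/octet-stream"):
    """Build (but do not send) a POST for Solr's extracting handler.

    The handler parameters travel in the querystring and the raw file
    bytes travel in the request body.
    """
    # Assumption: the handler is mapped to /update/extract.
    url = solr_url.rstrip("/") + "/update/extract"
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(parameters.items()))
    if query:
        url += "?" + query
    return Request(url, data=content,
                   headers={"Content-Type": content_type},
                   method="POST")


def extract_only(parameters):
    """Return a copy of the parameters with extractOnly=true added."""
    merged = dict(parameters)
    merged["extractOnly"] = "true"
    return merged


# Hypothetical usage: index a document, then an extract-only variant.
index_req = build_extract_request(
    "http://localhost:8983/solr", b"%PDF-1.4", {"literal.id": "doc1"})
preview_req = build_extract_request(
    "http://localhost:8983/solr", b"%PDF-1.4",
    extract_only({"literal.id": "doc1"}))
```

Treating extract-only as just another parameter keeps a single code
path, which is the "special case of AddFile()" option mentioned above.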

J

Feb 26, 2010, 10:05:35 AM
to SolrNet
Thanks for the rundown. Just an update on my research: I tested Tika
on its own and didn't get the performance I was after. Right now I'm
using a combination of iTextSharp and Aspose to get the job done, but
that could change in the future, especially as Tika matures.

- J

KevM

Mar 19, 2010, 2:28:45 PM
to SolrNet
One of the reasons I am looking at Solr in our .NET shop is its
ability to use Tika to index rich documents, among other interesting
features. I am curious what performance limitations you ran into.
What is your usage scenario? Can you tell me about your experience
integrating with iTextSharp and Aspose?

J

Apr 3, 2010, 1:03:16 AM
to SolrNet
Using Tika is probably fine when you don't need to quickly index many
files at once. I was indexing thousands of PDF, Word, and PowerPoint
files. The PDF documents were the biggest headache. I guess many of
them were poorly structured or something, which made the text
parsing/extraction much more difficult. Aspose couldn't even extract
the text
without throwing an exception. I got iTextSharp to perform decently,
but the quality of the text extracted was lower. I then used Aspose
for extracting text from Word and PowerPoint files. This feature was
put on hold in our system, so I haven't looked at it in a while. I
just built some quick prototypes, so I still have to run more tests.