ExtractingRequestHandler support


Naz

Feb 18, 2011, 12:58:59 PM
to SolrNet
Hi
I have just pushed the ExtractingRequestHandler work I have done so
far into my fork https://github.com/nazjunaid/SolrNet

I think I have resolved most of the issues raised by Mauricio about
the previous patch: http://code.google.com/p/solrnet/issues/detail?id=79#hc5

I still need to sort out the unit tests and maybe add some
commenting.

When I'm happy with what I've got I will submit a pull request. In the
meantime feel free to check it out; I'm open to suggestions or issues
you may have with it.

Naz

Mauricio Scheffer

Feb 18, 2011, 1:32:28 PM
to sol...@googlegroups.com
Looks good, Naz! Let us know if you need any help with this.

Cheers,
Mauricio






Naz

Feb 21, 2011, 12:45:31 PM
to SolrNet
Just pushed some more updates to https://github.com/nazjunaid/SolrNet

Renamed the AddFile operation to Extract, as this is more in line with
the Solr operation.
Added support for extract only (http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput)
with an ExtractResponseParser.
Fixed unit tests and added some for ExtractResponse.

I think I've covered everything now. I might have a go at writing some
docs to help people get started.

Any issues let me know.

Naz

Ed

Mar 18, 2011, 10:32:22 AM
to SolrNet
Great work on SolrNet and the new ExtractingRequestHandler, thanks
Mauricio and Naz!

A couple of questions - I'm new to this code so please excuse me if
I'm missing something obvious.

I want to call ISolrOperations<T>.Extract() and pass in some fields to
accompany the metadata and content that the ExtractingRequestHandler
pulls out of my binary file.

As far as I can tell, the ExtractParams.Fields property isn't being
used. I would expect it to add literal.* parameters for each of the
fields and include them in the call to Solr's
ExtractingRequestHandler. But ExtractCommand seems to be ignoring the
ExtractParams.Fields property. I can see it wouldn't be hard to add
this into ExtractCommand.ConvertToQueryParameters() but am I missing
something that's already there?
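
The mapping described above, one literal.* request parameter per supplied field, can be sketched roughly as follows. This is a standalone illustration, not SolrNet's ExtractCommand: the LiteralParams class and its method names are hypothetical, and only the literal.<fieldname> parameter convention comes from Solr's ExtractingRequestHandler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of SolrNet.
public static class LiteralParams
{
    // Turns caller-supplied (name, value) fields into the
    // literal.<name>=<value> parameters accepted by ExtractingRequestHandler.
    public static IEnumerable<KeyValuePair<string, string>> FromFields(
        IEnumerable<KeyValuePair<string, string>> fields)
    {
        if (fields == null)
            yield break;

        foreach (var field in fields)
        {
            yield return new KeyValuePair<string, string>(
                "literal." + field.Key, field.Value);
        }
    }

    // URL-encodes the parameters and joins them into a query string.
    public static string ToQueryString(
        IEnumerable<KeyValuePair<string, string>> parameters)
    {
        return string.Join("&", parameters.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" +
            Uri.EscapeDataString(p.Value ?? string.Empty)));
    }
}
```

With fields named author and title, for example, this would produce parameters named literal.author and literal.title to send alongside the uploaded file.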

Even better, I would like to call an overload of
ISolrOperations<MyProduct>.Extract() which takes my MyProduct DTO
which has been decorated with [SolrField] attribs as per your mapping
example at http://code.google.com/p/solrnet/wiki/Mapping. It would
then pull the field mappings and content from my DTO, in the same way
that the Add() and Query() methods do - something like
ISolrOperations<T>.Extract(T, ExtractParams)

Is there some support for this approach that I haven't seen, or would
it be new functionality?
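
One way such a DTO-based overload could gather its fields is by reflecting over attributed properties. The sketch below is only an illustration under stated assumptions: FieldNameAttribute is a hypothetical stand-in for SolrNet's [SolrField] so that the code compiles on its own, DtoFields and the MyProduct properties are invented for the example, the id field is skipped on the assumption that it is passed separately in the extract parameters, and multi-valued fields are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Reflection;

// Hypothetical stand-in for SolrNet's [SolrField] attribute.
[AttributeUsage(AttributeTargets.Property)]
public sealed class FieldNameAttribute : Attribute
{
    public string Name { get; private set; }
    public FieldNameAttribute(string name) { Name = name; }
}

// Hypothetical DTO used only to demonstrate the helper.
public class MyProduct
{
    [FieldName("id")] public string Id { get; set; }
    [FieldName("name")] public string Name { get; set; }
    [FieldName("price")] public decimal Price { get; set; }
    [FieldName("notes")] public string Notes { get; set; }
}

public static class DtoFields
{
    // Collects (field name, value) pairs from attributed properties,
    // skipping the id field and any null or empty values.
    public static IList<KeyValuePair<string, string>> FromDto<T>(
        T item, string idFieldName)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (PropertyInfo prop in typeof(T).GetProperties())
        {
            var attr = (FieldNameAttribute)Attribute.GetCustomAttribute(
                prop, typeof(FieldNameAttribute));
            if (attr == null || attr.Name == idFieldName)
                continue;

            object value = prop.GetValue(item, null);
            if (value == null)
                continue;

            // Invariant culture keeps numbers formatted consistently.
            string text = Convert.ToString(value, CultureInfo.InvariantCulture);
            if (string.IsNullOrEmpty(text))
                continue;

            result.Add(new KeyValuePair<string, string>(attr.Name, text));
        }
        return result;
    }
}
```

The resulting pairs could then be copied into whatever field collection the extract call accepts.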

thanks

Ed

Naz

Mar 18, 2011, 12:00:05 PM
to SolrNet
Hi Ed
The Fields property should be used, but I can see it's not wired up in
ExtractCommand. This is a bug and should be easy to fix; I'll try to
have a look at it this weekend.

Regarding using it with your existing DTO: I'm actually doing something
similar in my app, but the way I've done it is to call Solr to get the
extracted results only, and then put them into my DTO in a separate
property called Doc.

I've attached an extract of the documentation I was working on below;
it should explain how to do this.

Naz

---------------------------------------------------------

Extract Only
Solr can return the extracted content from Tika without indexing the
document. This can be returned as text or XHTML. This feature is
useful for when you want to leverage Solr for extraction but use the
result in an existing separate index or elsewhere. By default if
ExtractOnly is set to true the output is XHTML unless you specify
another ExtractType.
<code>
// Extract only: Solr returns the Tika output without indexing the document.
solr.Extract(new ExtractParameters("1", File.OpenRead(@"C:\MyDocument.pdf"))
{
    ExtractOnly = true,
    ExtractType = ExtractType.Text
});
</code>

Ed

Mar 18, 2011, 12:31:34 PM
to SolrNet
Thanks Naz, that makes a lot of sense.

I can see it's feasible to use ExtractOnly, and then populate the
metadata fields along with the extracted content. I was hoping to
reduce the number of hits on the server by indexing the metadata and
extracting the content at the same time.

Grateful if you can look at wiring up ExtractParams.Fields. I'll have
a think about using a DTO directly in the Extract() method. It's good
to know that I wasn't missing something obvious.

Cheers

Ed

Naz

Mar 22, 2011, 7:51:42 AM
to sol...@googlegroups.com, Ed
I've fixed the issue of the fields not being passed to Solr, and also added a test so we can check that all parameters are being passed correctly:

https://github.com/mausch/SolrNet/pull/11


vikky

Apr 11, 2011, 12:32:28 PM
to SolrNet
Great work! I have been looking for this.
I think this fix is part of SolrNet 0.4.0. When will that be
available for download?

Mauricio Scheffer

Apr 11, 2011, 2:02:16 PM
to SolrNet
Currently, you have to download the source code from https://github.com/mausch/SolrNet
and compile it yourself.

--
Mauricio

vikky

Apr 12, 2011, 2:03:27 AM
to SolrNet
Hi Ed,
By populating the metadata fields along with the extracted content, do
you mean that you create XML fields/docs for posting the info into
Solr? Please clarify.
I have a scenario where I have to index the contents of my PDF
file along with my own metadata fields.
It would be great if you could tell me whether there is any other way
of doing this without creating the XML files.
I am also concerned about data backup. If my system crashes at some
point and corrupts the index, it is going to be tough to rebuild the
index without the help of these XML files. Please advise on this.

Regards
Vikky

Ed

Apr 15, 2011, 10:31:24 AM
to SolrNet
Hello Vikky

Yes I think we're talking about the same thing, but I'm not using XML
files for my metadata.

My code calls Extract() with ExtractParameters that include:

- Content = a filestream (eg. the PDF file)
- Id = a document id
- Fields = a collection of ExtractFields which are my own metadata
fields

The ExtractingRequestHandler parses the content of the PDF file and
pulls out any metadata fields it can find. With Naz's changes above,
SolrNet is also passing the contents of the Fields collection in as
additional metadata fields to be indexed alongside what
ExtractingRequestHandler has found.

So there's no need to create XML files.

I'm populating my metadata fields from a database, so I can get them
back if my Solr index dies.

For advice on Solr backups you probably need the solr-user mailing
list, because this group is specifically for SolrNet.

Hope this helps

Ed

scott

May 16, 2011, 12:24:37 AM
to SolrNet
Hello Everyone,

This is great work, exactly what I needed! Very useful discussion
too.

I just wanted to chime in on a topic that Ed brought up regarding
passing in his DTO directly instead of mapping to an ExtractField list
manually. This capability would definitely make my code cleaner as
well. For what it is worth I wrote a little utility method that does
this mapping, albeit via the XML serializer.

using System.Collections.Generic;
using System.Xml;
using Microsoft.Practices.ServiceLocation;
using SolrNet;

// Builds ExtractFields from a mapped DTO by serializing it with the
// registered SolrNet document serializer and reading back the <field> nodes.
private static IEnumerable<ExtractField> BuildExtractFieldsFromItem<T>(T item)
{
    var extractFields = new List<ExtractField>();
    var serializer = ServiceLocator.Current.GetInstance<ISolrDocumentSerializer<T>>();
    XmlDocument doc = serializer.Serialize(item, null);
    XmlNodeList fields = doc.GetElementsByTagName("field");
    foreach (XmlNode field in fields)
    {
        string name = null;
        XmlAttribute nameAttr = field.Attributes["name"];
        if (nameAttr != null)
        {
            name = nameAttr.InnerText;
        }

        // ID needs to be passed in ExtractParameters.Id; doing so here causes an exception.
        if (!string.IsNullOrEmpty(name) && !name.Equals("id") &&
            !string.IsNullOrEmpty(field.InnerText))
        {
            extractFields.Add(new ExtractField(name, field.InnerText));
        }
    }
    return extractFields;
}

Kind regards,
Scott