Re: [SolrNet] Extract method using Stream


Mauricio Scheffer

Oct 9, 2012, 11:00:27 AM
to sol...@googlegroups.com
The resourceName is a "file name" (which can be real or not) given as a hint to Tika to infer the content type. E.g. "john.pdf" would indicate that it's PDF content.

Most of the parameters in SolrNet are one-to-one mappings of the corresponding Solr parameters, so you can usually use the Solr documentation as a reference. In this case: http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
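
To make the mapping concrete, here is a minimal sketch that builds the query string by hand from the Solr parameter names in the ExtractingRequestHandler documentation (literal.id, resource.name, stream.type, extractOnly, extractFormat). The class ExtractQuerySketch and the method BuildQuery are hypothetical names made up for illustration. This is not SolrNet's implementation, only a picture of which value ends up under which Solr parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExtractQuerySketch
{
    // Hypothetical helper (not part of SolrNet): pairs each value with the
    // Solr parameter name documented for the ExtractingRequestHandler.
    public static string BuildQuery(string id, string resourceName, string streamType)
    {
        var parameters = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("literal.id", id),
            // File-name hint that Tika uses to infer the content type.
            new KeyValuePair<string, string>("resource.name", resourceName),
            new KeyValuePair<string, string>("stream.type", streamType),
            new KeyValuePair<string, string>("extractOnly", "true"),
            new KeyValuePair<string, string>("extractFormat", "text"),
        };

        // URL-encode keys and values, then join them as key=value pairs.
        return string.Join("&", parameters.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
    }

    public static void Main()
    {
        Console.WriteLine(BuildQuery("doc1", "john.pdf", "application/pdf"));
    }
}
```

The resource name does not have to point at a real file; it only needs an extension that matches the content.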

I just created a new issue about ExtractParams, since I don't quite like the current design: http://code.google.com/p/solrnet/issues/detail?id=194

Cheers,
Mauricio

On Mon, Oct 8, 2012 at 9:08 PM, Shreejay <shre...@gmail.com> wrote:

Hi Mauricio, 

I am trying to use SolrNet (0.4.0.2002) to extract the contents of a PDF file. The PDF file is on a server and I am reading it into a memory stream, using the URL. 

What should be the "resourceName" param in this case for ExtractParameters? I could not find any examples for using the Extract method using Streams. 



C# code :

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            byte[] b = null;
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (MemoryStream ms = new MemoryStream())
            {
                // Copy the HTTP response into memory, 1 KB at a time.
                byte[] buf = new byte[1024];
                int count;
                while ((count = stream.Read(buf, 0, buf.Length)) > 0)
                {
                    ms.Write(buf, 0, count);
                }
                b = ms.ToArray();
                // "???" is the resourceName argument I am asking about.
                ExtractParameters exParams = new ExtractParameters(ms, "doc1", "???")
                {
                    ExtractOnly = true,
                    ExtractFormat = ExtractFormat.Text,
                    StreamType = "application/pdf"
                };
                var content = solr.Extract(exParams);
            }


Thanks.

--Jay

--
You received this message because you are subscribed to the Google Groups "SolrNet" group.
To view this discussion on the web visit https://groups.google.com/d/msg/solrnet/-/9nxiNzgRj5QJ.
To post to this group, send email to sol...@googlegroups.com.
To unsubscribe from this group, send email to solrnet+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/solrnet?hl=en.

Shreejay

Oct 9, 2012, 1:06:40 PM
to sol...@googlegroups.com
Thanks for the reply, Mauricio. I had gone through the Solr wiki and tried setting resource.name = "abc.pdf", but I got the exception "The request was aborted: The request was canceled."

The Solr call was http://localhost:8983/solr/update/extract?literal.id=doc1&resource.name=abc.pdf&stream.type=application/pdf&extractOnly=true&extractFormat=text&version=2.2

Running this call gives a "missing content stream" error. Is there a parameter I am missing, or is something wrong with the way I am calling ExtractParams?




<response>
<lst name="responseHeader">
<int name="status">400</int>
<int name="QTime">1</int>
</lst>
<lst name="error">
<str name="msg">missing content stream</str>
<int name="code">400</int>
</lst>
</response>

Thanks,
Shreejay

Mauricio Scheffer

Oct 9, 2012, 2:45:17 PM
to sol...@googlegroups.com
I think you have to 'rewind' your MemoryStream, i.e. ms.Position = 0; before passing it to Solr.
Either that or just pass the response stream directly to Solr.
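
A minimal sketch of why the rewind matters, using only System.IO (no SolrNet involved). The class RewindSketch and the helper CountRemainingBytes are hypothetical names for illustration. After writing to a MemoryStream, its position sits at the end, so any reader that starts from the current position sees no bytes until the position is reset.

```csharp
using System;
using System.IO;

public static class RewindSketch
{
    // Reads from the stream's current position to its end and
    // returns how many bytes were available.
    public static int CountRemainingBytes(Stream s)
    {
        var buf = new byte[1024];
        int total = 0;
        int n;
        while ((n = s.Read(buf, 0, buf.Length)) > 0)
        {
            total += n;
        }
        return total;
    }

    public static void Main()
    {
        using (var ms = new MemoryStream())
        {
            byte[] payload = { 1, 2, 3, 4 };
            ms.Write(payload, 0, payload.Length);

            // Position is at the end after writing, so nothing is left to read.
            int before = CountRemainingBytes(ms);

            // Rewind, then the whole payload is readable again.
            ms.Position = 0;
            int after = CountRemainingBytes(ms);

            Console.WriteLine(before + " " + after); // prints "0 4"
        }
    }
}
```

This would be consistent with the "missing content stream" error if the stream is read from its current position when the request is sent, though that is an assumption about the internals rather than something verified here.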

--
Mauricio



Shreejay

Oct 9, 2012, 4:09:55 PM
to sol...@googlegroups.com
Thank you so much. ms.Position = 0 worked like a charm. 

--Shreejay