SolrNet: document content length exceeded while extracting

Usman Zahid

Sep 21, 2015, 4:50:54 AM
to SolrNet

I am using SolrNet to extract content from documents and store it in a dynamic field. My Solr document class (SolrDoc):

[SolrUniqueKey("id")]
public int Id { get; set; }

[SolrField("*")]
public Dictionary<string, object> DynamicFields { get; set; }

I can extract the contents and add them to the dynamic field

solrDoc.DynamicFields.Add("content", extractResponse.Content);

The problem occurs when the document content length exceeds 35000. To extract the contents I do:

ExtractResponse extractResponse =
    solr.Extract(new ExtractParameters(fileStream, solrDoc.Id)
    {
        ExtractFormat = ExtractFormat.Text,
        ExtractOnly = true,
        AutoCommit = true,
        StreamType = mimeType,
    });

Is there any way I can index the complete document content using this dynamic-field approach, or is there a better way to extract complete document contents?

I am using Solr 5.3.0

Thanks.

Mauricio Scheffer

Sep 21, 2015, 5:23:41 AM
to sol...@googlegroups.com
Hi, could you describe in detail what happens when the document content length exceeds 35000 (I assume that's bytes)?

Cheers



--
Mauricio

--
You received this message because you are subscribed to the Google Groups "SolrNet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to solrnet+u...@googlegroups.com.
To post to this group, send email to sol...@googlegroups.com.
Visit this group at http://groups.google.com/group/solrnet.
For more options, visit https://groups.google.com/d/optout.

Usman Zahid

Sep 22, 2015, 3:23:35 AM
to SolrNet
No, that is just the word count in the document. The actual document is 250 KB.

Usman Zahid

Sep 22, 2015, 3:33:31 AM
to SolrNet
I guess the problem is related to adding the content field to the Solr document.
When the field length exceeds 35000 it returns a 400 Bad Request: "Possible analysis error".

Any thoughts on this?


Mauricio Scheffer

Sep 22, 2015, 6:42:11 AM
to sol...@googlegroups.com
Can you check the full stack trace in the Solr log?
Googling around for "Possible analysis error" I see that it could be either some issue in a nested exception or something incorrectly configured in the analysis definitions.


--
Mauricio

Usman Zahid

Sep 22, 2015, 8:07:15 AM
to SolrNet
I looked at the stack trace in the Solr logs and found this exception:

Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 84, 104, 101, 32, 85, 110]...', original message: bytes can be at most 32766 in length; got 42576

Is there any workaround for this, using SolrNet?

Mauricio Scheffer

Sep 22, 2015, 8:25:35 AM
to sol...@googlegroups.com
That's related to Lucene, not to SolrNet or even Solr itself. 

See http://stackoverflow.com/questions/24019868/utf8-encoding-is-longer-than-the-max-length-32766 . You'll probably want to change your analyzer to avoid producing such huge terms; perhaps you're using a keyword tokenizer for that field?
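For illustration (the type and field names below are assumptions for the sketch, not taken from this thread): a field whose type indexes the entire value as a single term, such as Solr's built-in string type, will trip the 32766-byte term limit on large content, whereas a tokenized text type splits the content into ordinary-sized terms:

```xml
<!-- Problematic: a string-typed field indexes the whole value as one term,
     so any value over 32766 bytes triggers the "immense term" error. -->
<dynamicField name="*" type="string" indexed="true" stored="true"/>

<!-- A tokenized type breaks the content into small terms instead: -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```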



--
Mauricio

Usman Zahid

Sep 23, 2015, 1:53:12 AM
to SolrNet
Ahhhh... so dynamically I am not allowed to index huge fields. I have not defined any schema in Solr 5.3.0, because that is the main requirement: I do not know which fields I will be indexing or what data to expect.

Any idea how I should proceed with Solr 5.3.0 in schemaless mode? Thanks


Mauricio Scheffer

Sep 23, 2015, 4:50:15 AM
to sol...@googlegroups.com
You *can* definitely index huge fields (that's what Lucene and Solr are built for!) but you can't have huge *terms/tokens* since they don't make sense in a full-text search context.
I recommend defining a schema with the text analysis you need. You could use dynamic fields for example.
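As a sketch of that suggestion (the dynamic-field suffix and type name are hypothetical, not from this thread): a dynamic field rule in schema.xml can route any field matching a naming convention through a tokenized text type, and on the SolrNet side you only need to follow the convention when populating the dictionary:

```csharp
// Assumed schema.xml declaration (an illustration, not the poster's schema):
//   <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
// Any field whose name ends in "_txt" then gets full-text analysis, so large
// extracted content is split into ordinary-sized terms rather than indexed
// as one immense token.
solrDoc.DynamicFields.Add("content_txt", extractResponse.Content);
solr.Add(solrDoc);
solr.Commit();
```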

Cheers



--
Mauricio
