Handling Non-XML friendly Characters


mike

Apr 23, 2009, 11:21:07 AM
to SolrNet
How does SolrNet, or Solr in general, handle characters that are not
valid in XML? Do I need to add CDATA or anything? Does Tomcat handle it
for me if I set the connector to UTF-8?

Thanks,
Mike

mike

Apr 23, 2009, 12:03:26 PM
to SolrNet
I think setting the UTF-8 setting in Tomcat fixed this. Now I need to
handle Unicode. Any ideas?

Mauricio Scheffer

Apr 23, 2009, 12:17:14 PM
to SolrNet
Solr itself has no problems handling UTF-8.
I use it with Jetty, it just works with UTF-8 out of the box, no extra
config needed.
I never tried it with Tomcat, but it seems that setting the connector
should be enough (http://www.jspwiki.org/wiki/TomcatAndUTF8 and
http://wiki.apache.org/tomcat/Tomcat/UTF-8)

Now for SolrNet, the sample app includes a sample document with UTF-8
values (http://code.google.com/p/solrnet/source/browse/trunk/
SampleSolrApp/exampledocs/utf8-example.xml) and querying works fine
(run the app and visit http://localhost:8082/?q=%C3%AA%C3%A2%C3%AE%C3%B4%C3%BB)
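
For reference, the `q=` value in the URL above is the percent-encoded UTF-8 form of the search term. A minimal C# sketch, assuming the term is êâîôû; the class and method names are illustrative only and are not part of SolrNet or the sample app:

```csharp
using System;

public static class QueryEncodingSketch
{
    // Percent-encodes a query term as UTF-8 bytes so it can be placed in a URL.
    public static string EncodeTerm(string term)
    {
        return Uri.EscapeDataString(term);
    }

    // Builds the sample app URL for a given term (host and port taken from the post).
    public static string BuildUrl(string term)
    {
        return "http://localhost:8082/?q=" + EncodeTerm(term);
    }
}
```

Calling `QueryEncodingSketch.EncodeTerm("\u00EA\u00E2\u00EE\u00F4\u00FB")` returns `%C3%AA%C3%A2%C3%AE%C3%B4%C3%BB`, two bytes per accented letter.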

I also have an integration test that adds a document with UTF-8 values
(http://code.google.com/p/solrnet/source/browse/trunk/SolrNet.Tests/
Integration.Sample/Tests.cs#39)

Let me know if you see anything unusual. UTF-8 support should be
transparent to the dev.

Mauricio Scheffer

Apr 23, 2009, 12:19:00 PM
to SolrNet
Are you getting any errors? What doesn't work? Adding a document,
querying?

mike

Apr 23, 2009, 12:39:05 PM
to SolrNet
I'm getting:

SEVERE: [com.ctc.wstx.exc.WstxLazyException]
com.ctc.wstx.exc.WstxParsingException: Illegal character entity:
expansion character (code 0x7) not a valid XML character

From the Tomcat log file. It only happens with Unicode data. Example:
"Có| w tym"

I set the connector to UTF-8 and it didn't seem to do anything. I'm
continuing the search...

Mauricio Scheffer

Apr 23, 2009, 12:45:54 PM
to sol...@googlegroups.com
Could you try the same operation against a Solr+Jetty instance? There's one in trunk, just copy your config and schema, let's see what happens.

Thanks

mike

Apr 23, 2009, 12:53:25 PM
to SolrNet
Seems like it may be something with illegal UTF-8 characters. Example: #243


Mauricio Scheffer

Apr 23, 2009, 1:27:33 PM
to sol...@googlegroups.com
Yep, 0x7 is not a valid XML character according to the spec (http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char).
It's weird that the framework's XML lib doesn't catch that... or maybe I'm constructing the XML manually in the place where you're getting the error.
What operation(s) are you executing?
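
For reference, the `Char` production in the linked XML 1.0 spec can be written as a small predicate. This is a sketch only; the class and method names are made up for illustration and are not SolrNet code:

```csharp
public static class XmlCharSketch
{
    // True if the Unicode code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsLegalXmlCodePoint(int codePoint)
    {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

With this predicate, `IsLegalXmlCodePoint(0x07)` is false (the bell character from the log), while `IsLegalXmlCodePoint(243)` is true: code 243 is ó, an ordinary letter that is legal in XML.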

mike

Apr 23, 2009, 4:05:16 PM
to SolrNet
Update - fixed with this before indexing:

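// Extension method: it must be declared inside a static class.
public static string StripIllegalXMLChars(this string s)
{
    // Remove the control characters below 0x20 that XML 1.0 forbids.
    // Tab (\x09), LF (\x0A) and CR (\x0D) are legal XML and are kept.
    return System.Text.RegularExpressions.Regex.Replace(s, @"[\x00-\x08\x0B\x0C\x0E-\x1F]", "");
}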

This is working now.

Thanks for the help,
Mike
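
A regex-free variant of the same idea, shown here as a hedged sketch: it also drops lone surrogates and U+FFFE/U+FFFF, which the XML 1.0 spec forbids as well. The names are hypothetical and this is not necessarily what SolrNet does internally:

```csharp
using System.Text;

public static class XmlSanitizerSketch
{
    // Returns a copy of s without characters that XML 1.0 does not allow.
    public static string RemoveIllegalXmlChars(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;
        var sb = new StringBuilder(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // A well-formed surrogate pair encodes U+10000..U+10FFFF, which is legal.
                sb.Append(c).Append(s[i + 1]);
                i++;
            }
            else if (c == '\t' || c == '\n' || c == '\r'
                     || (c >= 0x20 && c <= 0xD7FF)
                     || (c >= 0xE000 && c <= 0xFFFD))
            {
                sb.Append(c);
            }
            // Everything else (control characters, lone surrogates, U+FFFE, U+FFFF) is dropped.
        }
        return sb.ToString();
    }
}
```

For example, `RemoveIllegalXmlChars("C\u00F3\u0007 w tym")` returns `"Có w tym"`, with the bell character removed and the accented letter untouched.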

Mauricio Scheffer

Apr 23, 2009, 4:14:35 PM
to sol...@googlegroups.com
I just created an issue to deal with this internally:


Thanks.

Mauricio Scheffer

Apr 27, 2009, 12:15:25 AM
to SolrNet
Hi Mike, I've been trying to repro this to no avail. Could you send me
the exact problematic text? Or was this binary data?

I've tried "Có| w tym" but it works just fine.

Thanks!

mike

Apr 27, 2009, 10:58:29 AM
to SolrNet
Try this:

Don't know what it means so hopefully nothing bad:

Có| w tym przypadku wiele nie trzeba tBumaczy - idea jest niezwykle
prosta i nie trudna do zauwa|enia. R cznik podzielony jest na dwie
strefy wycierania "Face" i "Arse". Wszystko jest dobrze póki Twoja
twarz nie wygl da jak dolna ...

Thanks,
Mike


mike

Apr 27, 2009, 11:31:56 AM
to SolrNet
Looks like Google stripped them out. One character is the Bell
character = Hex 07 or x07. You need to somehow get that into a string,
since I can't paste it for you here.
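
One way to get the bell character into a C# string without pasting it is an escape or a cast. A small sketch; the position of the bell inside the text is an assumption made for illustration, since the original data was not preserved:

```csharp
public static class ReproStringSketch
{
    // Builds a test string containing the bell character (U+0007).
    // Where exactly the bell sat in the real data is unknown; this placement is a guess.
    public static string WithBell()
    {
        return "C\u00F3" + (char)0x07 + " w tym";
    }
}
```

Writing `"\u0007"` inside the literal is equivalent to the `(char)0x07` cast used here.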

Others:
http://groups.google.com/group/acts_as_solr/browse_thread/thread/686274fea8a74750

-mike

Mauricio Scheffer

Apr 27, 2009, 9:45:18 PM
to SolrNet
Fixed in r353

http://code.google.com/p/solrnet/source/detail?r=353


Antharas

May 8, 2009, 7:36:58 PM
to SolrNet
Hi,

I was having the same problem, but with characters x00 and x04.

Thanks for your fix, but what about x04?

Cheers,
Duc.


Mauricio Scheffer

May 8, 2009, 9:02:55 PM
to SolrNet
Could you post here the solr exception from the log and the code that
generates this exception?
Note that if this is binary data, you should probably base64-encode
it, see:

http://www.mail-archive.com/solr...@lucene.apache.org/msg00571.html
https://issues.apache.org/jira/browse/SOLR-1116
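
A minimal sketch of the base64 suggestion, assuming the binary value is available as a byte array before it is assigned to a document field. The class and method names are illustrative only:

```csharp
using System;

public static class BinaryFieldSketch
{
    // Encodes raw bytes as base64 text, which contains only XML-safe ASCII characters.
    public static string Encode(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    // Decodes the stored text back into the original bytes after reading it from Solr.
    public static byte[] Decode(string text)
    {
        return Convert.FromBase64String(text);
    }
}
```

For instance, the bytes 0x00, 0x04, 0x07 encode to `AAQH`, so the control characters never reach the XML sent to Solr.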

All characters from 0x0 to 0x6 are filtered (
http://code.google.com/p/solrnet/source/diff?spec=svn353&r=353&format=side&path=/trunk/SolrNet.Tests/Integration.Sample/Tests.cs
)

Antharas

Jun 4, 2009, 11:34:30 AM
to SolrNet
I can't reproduce that error again with my custom filter and the new
version of SolrNet.


Thanks for your great work with SolrNet. I can't wait to see the
support for Solr 1.4 :)


Cheers,
Antharas.


Mauricio Scheffer

Jun 4, 2009, 11:24:42 PM
to sol...@googlegroups.com
Thanks! I'd say that multi-select facets ( http://issues.apache.org/jira/browse/SOLR-911 ) is *the* killer feature of Solr 1.4 (or at least the feature I could most use right now). 