RDFXMLParser emit ASCII N-Triples

Andreas Harth

May 26, 2013, 7:08:09 AM

to ldsp...@googlegroups.com, Tobias Käfer, Aidan Hogan

Hi,

there's a bug in LDSpider: the emitted N-Quads data is not ASCII-encoded
but contains raw Unicode characters, although N-Quads is supposed to be
ASCII-encoded.

Is there a way to tell the RDFXMLParser to emit escaped output for Nx Nodes?

The same issue may appear in conjunction with Any23, but I haven't tested
that.

Cheers,
Andreas.

Tobias Käfer

May 28, 2013, 3:52:01 AM

to ldsp...@googlegroups.com, Andreas Harth, Aidan Hogan

Hi,

I can find plenty of \uxxxx character sequences in the files it emits for me, so what mal-encoded character sequences have you come across?

Cheers,

Tobias

Andreas Harth

May 28, 2013, 5:07:46 AM

to ldsp...@googlegroups.com

Hi Tobias,

On 28/05/13 09:52, Tobias Käfer wrote:
> I can find plenty of \uxxxx character sequences in the files it emits
> for me, so what mal-encoded character sequences have you come
> across?

ah, you're right there, literals are ok.

But there are problems with IRIs.

For example, I get the following triple for [1]:

<http://dbpedia.org/resource/Garden_Grove,_California> <http://www.w3.org/2002/07/owl#sameAs> <http://ko.dbpedia.org/resource/가든그로브> .

On systems with Unicode support that works (although I have to do a bit
of sed wizardry to get rapper to parse the N-Quads files). But on
systems which do not have Unicode as default encoding, I get a lot of
????'s in IRIs, which messes up the dataset big time.
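
The ????'s are what one would expect when text is written with a charset that cannot represent the characters. A minimal Python sketch of that effect, assuming (this is not confirmed anywhere in the thread) that the output is encoded with a non-Unicode default charset and that unencodable characters are replaced rather than rejected:

```python
# Hypothetical illustration, not LDSpider code: encode a non-ASCII IRI
# with a charset that cannot represent Hangul, replacing what it cannot encode.
iri = "http://ko.dbpedia.org/resource/가든그로브"

as_utf8 = iri.encode("utf-8")                        # lossless
as_latin1 = iri.encode("latin-1", errors="replace")  # each Hangul syllable becomes '?'

print(as_utf8.decode("utf-8"))
print(as_latin1.decode("latin-1"))
```

Once the characters have been replaced, the original IRI cannot be recovered from the file, which is why escaping before writing matters.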

There are presumably issues with BNodes as well, as BNodes get
constructed from IRIs. But I haven't tested that; DBpedia is the
only dataset with non-ASCII IRIs I've come across, and they are skimpy
with blank nodes.

Cheers,
Andreas.

[1] http://dbpedia.org/resource/Garden_Grove,_California

Tobias Käfer

Jul 15, 2013, 8:36:49 AM

to ldsp...@googlegroups.com

Hi Andreas,

It seems you are using an outdated version of NxParser for your crawling. This was fixed in r110 (Aug 28, 2012) of the NxParser source.

By the way, we discussed how to encode them here [1] and decided on rapper compatibility: we take the URIs as they are and escape non-ASCII characters according to the Unicode-related rules of N-Triples (\uXXXX escapes), instead of percent-encoding them.
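
To make the difference between the two options concrete, here is a minimal Python sketch. It is not NxParser's implementation; the function name `escape_non_ascii` is made up for illustration, and it only handles non-ASCII characters (it assumes the ASCII part of the IRI needs no further escaping):

```python
from urllib.parse import quote


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with an N-Triples style escape.

    Hypothetical helper: \\uXXXX for code points up to U+FFFF,
    \\UXXXXXXXX for code points above that. ASCII is passed through.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)
        elif cp <= 0xFFFF:
            parts.append(f"\\u{cp:04X}")
        else:
            parts.append(f"\\U{cp:08X}")
    return "".join(parts)


iri = "http://ko.dbpedia.org/resource/가든그로브"

# Option chosen in the thread: keep the IRI, escape non-ASCII characters.
escaped = escape_non_ascii(iri)

# Alternative that was decided against: percent-encode the UTF-8 bytes.
percent_encoded = quote(iri, safe=":/")

print(f"<{escaped}>")
print(f"<{percent_encoded}>")
```

Both results are pure ASCII, but they are different strings: the escaped form still denotes the original IRI once a parser unescapes it, while the percent-encoded form is a different sequence of characters.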

Cheers,

Tobias

[1] http://answers.semanticweb.com/questions/18508/best-way-to-encode-uri-refsiris-for-n-triples
