RDFXMLParser emit ASCII N-Triples


Andreas Harth

May 26, 2013, 7:08:09 AM
to ldsp...@googlegroups.com, Tobias Käfer, Aidan Hogan
Hi,

there's a bug in LDSpider: the emitted N-Quads data is not in ASCII
format but contains raw Unicode characters, although N-Quads is supposed
to be ASCII-encoded.

Is there a way to tell the RDFXMLParser to emit escaped output for Nx Nodes?

The same issue may appear in conjunction with Any23, but I haven't tested
that.

Cheers,
Andreas.

Tobias Käfer

May 28, 2013, 3:52:01 AM
to ldsp...@googlegroups.com, Andreas Harth, Aidan Hogan
Hi,

I can find plenty of \uxxxx character sequences in the files it emits for me, so what mal-encoded character sequences have you come across?

Cheers,

Tobias

Andreas Harth

May 28, 2013, 5:07:46 AM
to ldsp...@googlegroups.com
Hi Tobias,

On 28/05/13 09:52, Tobias Käfer wrote:
> I can find plenty of \uxxxx character sequences in the files it emits
> for me, so what mal-encoded character sequences have you come
> across?

ah, you're right there, literals are ok.

But there are problems with IRIs.

For example, I get the following triple for [1]:

<http://dbpedia.org/resource/Garden_Grove,_California>
<http://www.w3.org/2002/07/owl#sameAs>
<http://ko.dbpedia.org/resource/가든그로브> .

On systems with Unicode support that works (although I have to do a bit
of sed wizardry to get rapper to parse the N-Quads files). But on
systems which do not have Unicode as default encoding, I get a lot of
????'s in IRIs, which messes up the dataset big time.
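
One plausible way such question marks arise (an assumption, not a confirmed diagnosis of LDSpider) is that raw non-ASCII characters are pushed through a charset that cannot represent them. A minimal Java sketch of that effect follows; the class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of LDSpider or NxParser.
class AsciiLossDemo {
    // Encode to US-ASCII and decode again. String.getBytes(Charset)
    // substitutes the charset's replacement byte ('?') for every
    // character that US-ASCII cannot represent.
    static String roundTripThroughAscii(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // The Korean local name from the triple above, written with
        // Java Unicode escapes so the source file itself stays ASCII.
        String iri = "http://ko.dbpedia.org/resource/"
                + "\uAC00\uB4E0\uADF8\uB85C\uBE0C";
        System.out.println(roundTripThroughAscii(iri));
    }
}
```

Each of the five Hangul syllables comes back as a single `?`, and the original characters cannot be recovered from the output.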

There are presumably issues with BNodes as well, as BNodes get
constructed from IRIs. But I haven't tested that; DBpedia is the
only dataset with IRIs I've come across, and they are skimpy with
blank nodes.

Cheers,
Andreas.

[1] http://dbpedia.org/resource/Garden_Grove,_California

Tobias Käfer

Jul 15, 2013, 8:36:49 AM
to ldsp...@googlegroups.com
Hi Andreas,

seems like you are using an outdated version of NxParser for your crawling. In r110 (Aug 28, 2012) of the source of NxParser, this has been fixed.

Btw: We discussed the way of encoding here [1] and decided in favour of rapper compatibility, so we take the URIs as they are and encode them according to the Unicode-related rules of N-Triples, instead of percent-encoding.
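
To illustrate the kind of escaping meant here, the following is a rough Java sketch of N-Triples-style Unicode escaping (`\uXXXX` for the Basic Multilingual Plane, `\UXXXXXXXX` above it). It is not NxParser's actual code; the class and method names are hypothetical, and it deliberately leaves all ASCII characters untouched, so escaping of backslashes, quotes and control characters is out of scope:

```java
// Hypothetical helper for illustration only, not NxParser's API.
class NTriplesEscape {
    // Replace every code point above 0x7F by an N-Triples style
    // escape: backslash, 'u' and four hex digits for the BMP,
    // backslash, 'U' and eight hex digits beyond it.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs
            if (cp <= 0x7F) {
                sb.append((char) cp);     // ASCII passes through as-is
            } else if (cp <= 0xFFFF) {
                sb.append(String.format("\\u%04X", cp));
            } else {
                sb.append(String.format("\\U%08X", cp));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First syllable of the Korean DBpedia IRI, as a Java escape.
        System.out.println(escapeNonAscii("resource/\uAC00"));
    }
}
```

Unlike percent-encoding, this leaves the IRI's characters recoverable by any parser that understands the N-Triples escapes, which is presumably what makes the output readable by rapper.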

Cheers,

Tobias
