Hi Tobias,
On 28/05/13 09:52, Tobias Käfer wrote:
> I can find plenty of \uxxxx character sequences in the files it emits
> for me, so what mal-encoded character sequences have you come
> across?
Ah, you're right there: literals are OK.
But there are problems with IRIs.
For example, I get the following triple for [1]:
<http://dbpedia.org/resource/Garden_Grove,_California> <http://www.w3.org/2002/07/owl#sameAs> <http://ko.dbpedia.org/resource/가든그로브> .
On systems with Unicode support that works (although I have to do a bit
of sed wizardry to get rapper to parse the N-Quads files). But on
systems which do not have Unicode as default encoding, I get a lot of
????'s in IRIs, which messes up the dataset big time.
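One possible workaround, sketched here in Python, is to rewrite every non-ASCII character as a \uXXXX (or \UXXXXXXXX) escape before the file reaches a system with a non-Unicode default encoding, so the IRIs survive as plain ASCII. This is only an illustration: escape_non_ascii is a hypothetical helper, not part of any tool mentioned in this thread, and it assumes the consuming parser accepts such escapes inside IRIs.

```python
def escape_non_ascii(line: str) -> str:
    """Return `line` with every non-ASCII character written as an
    N-Triples style escape (\\uXXXX, or \\UXXXXXXXX above U+FFFF).

    Hypothetical helper for illustration only; it assumes the parser
    reading the result accepts these escapes inside IRIs.
    """
    out = []
    for ch in line:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)    # Basic Multilingual Plane
        else:
            out.append("\\U%08X" % cp)    # supplementary planes
    return "".join(out)


triple = (
    "<http://dbpedia.org/resource/Garden_Grove,_California> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://ko.dbpedia.org/resource/가든그로브> ."
)

# Lossy: what a forced ASCII encoding does to the Korean local name.
print("가든그로브".encode("ascii", errors="replace"))

# Lossless: the escaped triple contains ASCII characters only.
print(escape_non_ascii(triple))
```

The first print shows the kind of replacement that produces the question marks; the second keeps the information in a form that does not depend on the platform's default encoding.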
There are presumably issues with BNodes as well, as BNodes get
constructed from IRIs. But I haven't tested that: DBpedia is the
only dataset with IRIs I've come across, and they are skimpy with
blank nodes.
Cheers,
Andreas.
[1] http://dbpedia.org/resource/Garden_Grove,_California