Problem while importing n-quads to Fuseki

m.rosz...@uw.edu.pl

Jul 14, 2016, 6:23:45 AM
to Web Data Commons
Dear all,
I tried to import some data from the Class-Specific Subsets of the Schema.org Data (November 2015 Corpus) into Fuseki and I got stuck. I managed to import a rather small file, http://schema.org/Library (17 MB). But what I really want to do is run some queries against the http://schema.org/Book dataset (3 GB), especially looking for hosts. Each time, the Fuseki loader after a while returns an error: Illegal unicode escape sequence value: \" (022) at a particular line, and then the server responds with 500 Server Error. I understand that there is an inappropriate character somewhere. It's quite a large file and I have no idea how to fix it; maybe I should look for another n-quads parser or change the triple store.
I appreciate any suggestions,

All the best,
Marcin Roszkowski

Robert Meusel

Jul 14, 2016, 6:39:41 AM
to Web Data Commons
Hi Marcin,

Unfortunately I do not have much knowledge about Fuseki and what kind of importer is used to load quads into the store. But maybe there is an option to either ignore invalid quads (which would be the best thing) or at least print the line of the error. I have not imported the data so far into any triple store.
In case you are only looking for the hosts which provide the information, you can also simply write a script which extracts the fourth element of each quad and deduplicates the values, and you have what you need.
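For example, something along these lines (untested, and the file name is only an example; it assumes well-formed N-Quads, where the graph URI is always the second-to-last whitespace-separated token on each line):

```
$ gunzip -c schemaOrgBook.gz | awk '{ print $(NF-1) }' | sort -u > hosts.txt
```

If you only need the host names rather than the full graph URIs, you can additionally cut the URIs down to their host part, e.g. with `cut -d/ -f3`, and deduplicate again.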

You can have a look at the code we used to compute the statistics (https://subversion.assembla.com/svn/commondata/StructuredDataProfiler/tags/0/0/1/)

Hope this helps,
Robert

Emir Muñoz

Jul 14, 2016, 1:41:44 PM
to web-data...@googlegroups.com
Hi Marcin,

Indeed, there are some small syntax issues with escaping double quotes and the like in RDF data in general. The process of dealing with them may end up being mostly manual. I can see two possible processes, one that I usually follow and one that I haven't tried.

Process 1) --- mostly manual:
First, you should use a parser to point you to the errors in the original RDF file. If you are a Java user, I recommend the any23 [1] rover command. That will show you the errors in the data. Second, you should open the file in a text editor (or use the `sed` command line tool) and fix the errors manually.
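For the specific error you report (a `\u` escape that is not followed by four hex digits), one quick-and-dirty option is to simply drop the offending lines before loading. A sketch, assuming GNU grep with Perl-regex support (`-P`):

```
$ grep -vP '\\u(?![0-9a-fA-F]{4})' schemaOrgBook.nq > schemaOrgBook.clean.nq
```

This loses the bad quads instead of repairing them, and it only handles `\u` escapes (not the longer `\U` form), so treat it as a workaround rather than a fix.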

Process 2) -- I haven't tried it:
You could try LOD Laundromat [2] Basket, which claims to take your dirty data and clean it for you. Again, I haven't tried this, but I trust the guys behind the project ;)

Now, once you have finished cleaning the data (using either process) and the any23 parser doesn't show you any errors, you can load your data into Fuseki "without problems". Again, that is not as easy as users would like, so you have a couple of options.

You can use any23 rover to convert your NQuads into NTriples and load them from the Fuseki interface. (This is not recommended for large files.) Or, you can use the Apache Jena command `tdbloader2` [3], which builds indices on top of your data and supports NQuads. It basically reads all the quads in the input file and builds several permutation indices (SPOG, POSG, OSPG, etc.) to speed up your queries.
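For example (the `--loc` directory is just a placeholder; tdbloader2 will create it and write the indices there):

```
$ tdbloader2 --loc /path/to/tdb-books schemaOrgBook.nq
```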

With the TDB generated, you can start Fuseki with a configuration file [4] specifying the location of your TDB folder.
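A minimal sketch of such a configuration (the service name "books" and the paths are only examples, and the exact assembler vocabulary can differ between Fuseki versions, so check the documentation [4] for yours):

```
@prefix fuseki: <http://jena.apache.org/fuseki#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb:    <http://jena.hpl.hp.com/2008/tdb#> .

# SPARQL service exposed at /books, with a query endpoint at /books/query
<#service> rdf:type fuseki:Service ;
    fuseki:name         "books" ;
    fuseki:serviceQuery "query" ;
    fuseki:dataset      <#dataset> .

# The TDB database directory produced by tdbloader2
<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "/path/to/tdb-books" .
```

Then start the server with something like `fuseki-server --config=config.ttl`.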

That should do the trick :)

m.rosz...@uw.edu.pl

Jul 15, 2016, 2:44:27 AM
to Web Data Commons
Emir, Robert,
thank you so much. I'm not a Java user, so the script Robert mentioned that extracts the last element of each quad is out of reach for me. I'll try any23 and look for errors. I've also sent this file to LOD Laundromat.
Anyway,
thank you guys.
Marcin

m.rosz...@uw.edu.pl

Jul 15, 2016, 5:15:20 AM
to Web Data Commons
Robert,
thanks a lot. For my small research, extracting only the fourth element would be enough. Since I'm not a Java user, this will be quite a challenge for me. Anyway, I'll try.
All the best,
Marcin

Marcin Roszkowski

Jul 25, 2016, 3:16:08 PM
to web-data...@googlegroups.com
Emir,
I'm struggling and not giving up. I sent my file to LOD Laundromat over 10 days ago and still nothing. I was finally able to run any23 with the rover option against a sample file with n-quads. Can you suggest a rover command that will point out the errors in my data? I must say that the any23 documentation is not helping at all.
All the best,
Marcin
--
dr Marcin Roszkowski
Zakład Systemów Informacyjnych
Instytut Informacji Naukowej i Studiów Bibliologicznych
Uniwersytet Warszawski

Emir Muñoz

Jul 25, 2016, 5:49:47 PM
to web-data...@googlegroups.com
Hi Marcin,

Congratulations on not giving up!
I know it is not as easy as one would like sometimes. That is why resources like this are very welcome.

To help you out: if you have already configured any23, you should be able to run the following command:

```
any23 rover -f input_file.nq -o output_file.nt
```

This command will try to convert your input NQuads file to an NTriples file.

You can find extra information on this page [1], or test any23's online validation and conversion [2] if your file is small. How big are your files?

Feel free to drop me a line if you need extra help.

Cheers,
Emir


Marcin Roszkowski

Jul 27, 2016, 2:18:25 AM
to web-data...@googlegroups.com
Emir,
thanks. I'm trying to make use of the Class-Specific Subsets of the Schema.org Data contained in the November 2015 Corpus - http://schema.org/Book. The file I'm trying to process is quite heavy, over 14 GB (http://data.dws.informatik.uni-mannheim.de/structureddata/2015-11/quads/classspecific/schemaOrgBook.gz).
For the time being I would like info about the hosts - the fourth element of each quad - but it would be great to have a valid dataset to query. Anyway, any23 did the conversion to ntriples but didn't point out any errors, though there are some for sure. My Jena Fuseki loader always reports the same type of error: "illegal unicode escape sequence value". So I've been using sed to fix or delete the specific line it reports. Because the file is large and this is an iterative process, it is very time consuming: it takes sed over 20 minutes to process the file.
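(For reference, each round is essentially deleting one reported line, with a made-up line number here:

```
$ sed -i '8123456d' schemaOrgBook.nq
```

and since sed's `-i` rewrites the whole file, every round pays the full cost of the 14 GB.)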

Regards,
Marcin

Emir Muñoz

Jul 28, 2016, 5:31:17 AM
to web-data...@googlegroups.com
Hi Marcin,

I understand your scalability problems now. What about splitting the file into several parts, e.g., using the split command [1]?
That will make your job faster, since you will be dealing with smaller files. Also, the splitting won't change anything for loading the data into the triplestore.
And I highly recommend loading the data with TDB (as I mentioned before), which will make the answers to your queries a bit faster.

Hope that helps you.

Robin Keskisärkkä

Sep 19, 2016, 9:31:52 AM
to Web Data Commons
Hi!
In reading the replies below and seeing that you're working on a fairly sizeable file, I too would go the route of splitting the file into smaller pieces. From a terminal, run the following command to split the uncompressed text file into pieces of 500 MB (or smaller as needed):
$ gunzip -c schemaOrgBook.gz | split -b 500m - books_

This will create a bunch of files with names like "books_aa, books_ab, ...". To add the .nq file ending, run this command:
$ for old in books_*; do mv "$old" "$old.nq"; done

Now you'll have a bunch of smaller files that you can process with Fuseki. Once you stumble across a file that contains an error, simply open it in a good editor like Sublime and fix it manually, or use e.g. sed.

Cheers,
Robin

Robin Keskisärkkä

Sep 20, 2016, 11:38:23 AM
to Web Data Commons

Edit: I realised that it's probably safer to split by lines instead, so that no quad gets cut in half at a file boundary:
$ gunzip -c schemaOrgBook.gz | split -l 1000000 - books_

Cheers,
Robin

Marcin Roszkowski

Sep 21, 2016, 6:08:57 AM
to web-data...@googlegroups.com
Robin,
thanks a lot! This is exactly what I have been doing for a couple of days: I'm splitting the file into smaller ones and checking for inconsistencies. I managed to clean up the bad encoding, and this seems to be the best solution for now. I think I have to switch to a more flexible triple store, since querying these files in Fuseki takes a lot of time.
All the best,
Marcin

--
dr Marcin Roszkowski
Katedra Informatologii,
Wydział Dziennikarstwa, Informacji i Bibliologii
Uniwersytet Warszawski