Problem uploading N-Quads to Fuseki


Lauren K.

Jan 30, 2012, 4:15:39 PM
to bio2rdf
Hello,

I am trying to upload ncbi.gene2pubmed.nq (located here:
http://download.bio2rdf.org/data/ncbi/) to a Fuseki server, but Fuseki
does not accept .nq files. Do you have the data available in RDF/XML,
OWL, N3, or N-Triples format?

Thanks,
Lauren

Peter Ansell

Feb 1, 2012, 3:43:14 PM
to bio...@googlegroups.com
The .nq files are NQuads files, which contain an extra item in each
line to define the graph. You could strip the last part of each NQuad
to make an NTriples file, but then all of the triples would be
inserted into a single graph, which may be what you desire anyway.

Marc-Alexandre would have to correct me if I am wrong, but I think the
current design for the NCBI files is to have separate graphs for each
record so that any extra resources for a record are located in the
graph, even if they do not have the record URI as their subject. That
process necessitates that we use a Quads format instead of a Triples
format so that the RDF generation process can specify which graph is
to be used for each triple, by writing it as a quad.

I am a little surprised that Fuseki can't parse NQuads files yet given
it is part of Jena, but I haven't used it myself so I haven't run into
it before.

Cheers,

Peter


Manuel Salvadores

Feb 1, 2012, 12:18:34 PM
to bio...@googlegroups.com
Hello,

I recommend that you use rapper, a tool that is distributed with the
Raptor RDF parser.
It converts between RDF formats and is very fast.

http://librdf.org/raptor/

These are all the formats supported:

ntriples       N-Triples (default)
turtle         Turtle Terse RDF Triple Language
rdfxml-xmp     RDF/XML (XMP Profile)
rdfxml-abbrev  RDF/XML (Abbreviated)
rdfxml         RDF/XML
rss-1.0        RSS 1.0
atom           Atom 1.0
dot            GraphViz DOT format
json-triples   RDF/JSON Triples
json           RDF/JSON Resource-Centric
html           HTML Table
nquads         N-Quads


Manuel

PS: you might need a Linux/Unix box to use it.

Lauren K.

Feb 2, 2012, 12:03:27 PM
to bio2rdf
Hi Peter,

Thank you for your answer - I would indeed like to insert all the
triples into a single graph so that I can query across the original
PubMed graphs. Do you know of a good tool for stripping the graph
resource from each quad to convert them back to triples? I have been
investigating NxParser as an option but it does not look like it would
be a very straightforward process.

-Lauren


Peter Ansell

Feb 16, 2012, 1:57:06 AM
to bio...@googlegroups.com
Hi Lauren,

Just getting back to this as I realised it wasn't a closed issue.

You could use the NQuads parser from Sesametools and the NTriples
writer from Sesame to generate a quick little program to convert
NQuads to NTriples.

It would stream from any InputStream or Reader to any OutputStream or
Writer, so the size of the file should not be an issue if you are
streaming from disk back onto disk.

The Maven dependencies and repositories needed for this, at their
current versions, would be at least the following:

<dependency>
    <groupId>org.openrdf.sesame</groupId>
    <artifactId>sesame-rio-ntriples</artifactId>
    <version>2.6.3</version>
</dependency>
<dependency>
    <groupId>net.fortytwo.sesametools</groupId>
    <artifactId>nquads</artifactId>
    <version>1.6</version>
</dependency>

<repository>
    <id>aduna-repo</id>
    <name>Aduna repository</name>
    <url>http://repo.aduna-software.org/maven2/releases</url>
</repository>
<repository>
    <id>fortytwo</id>
    <name>fortytwo.net Maven repository</name>
    <url>http://fortytwo.net/maven2</url>
</repository>

I don't have the time to code the program right at the moment, but it
would be quite short.

Cheers,

Peter

Marc-Alexandre Nolin

Feb 16, 2012, 9:44:39 AM
to bio...@googlegroups.com
Oops, I missed this discussion. Sorry.

The quads structure, when you can use it, allows you to write a simple
query to retrieve all the triples that belong to a specific PubMed record.

Example:

select *
from <http://bio2rdf.org/pubmed_record:XXXXXXX>
where
{?s ?p ?o .}

To retrieve the same thing without the graph, you would need something
much more complicated, with optional paths and a depth that I actually
don't know, since I don't need to know it with N-Quads :)

You said: "Thank you for your answer - I would indeed like to insert all the
triples into a single graph so that I can query across the original
PubMed graphs"

Having an N-Quads structure does not prevent you from querying across
the "original" PubMed graphs. It's completely transparent for users
if you run SPARQL queries without forcing a specific graph.

Since N-Quads has all the information on each line and uses no shortcuts
to simplify the structure, unlike N3 and Turtle, you can process it quite
easily with Perl or sed/awk. I don't really know the latter, but for
Perl, I can suggest a simple one-liner:

zcat YOUR_COMPRESSED_NQUAD_FILE.nq.gz \
  | perl -ne 'print "$1\t$2\t$3 .\n" if /^(\S+)\s+(\S+)\s+(\S+)\s+\S+\s+\.$/' \
  | gzip > YOUR_COMPRESSED_NTRIPLE_FILE.nt.gz

I think another one would be necessary to catch the lines where the
literal contains spaces.
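
For the lines the one-liner above cannot match (literals containing spaces), one option is to scan each line term by term instead of splitting on whitespace. The following is only a rough sketch using plain Java with no RDF library. The class and method names (NQuadLineStripper, toTriple, convert) are made up for illustration, and it assumes one statement per line with blank node labels and language tags followed by whitespace. It does not validate IRIs or escape sequences the way a real parser would.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, not part of any library mentioned in this thread.
class NQuadLineStripper {

    /** Returns the index just past the term starting at 'start', or -1 if malformed. */
    static int endOfTerm(String line, int start) {
        if (start >= line.length()) return -1;
        char c = line.charAt(start);
        if (c == '<') {                      // IRI: runs to the next '>'
            int close = line.indexOf('>', start);
            return close < 0 ? -1 : close + 1;
        }
        if (c == '"') {                      // literal: honour backslash escapes
            int i = start + 1;
            while (i < line.length() && line.charAt(i) != '"') {
                i += (line.charAt(i) == '\\') ? 2 : 1;
            }
            if (i >= line.length()) return -1;
            i++;                             // step past the closing quote
            if (line.startsWith("^^<", i)) { // datatype IRI
                int close = line.indexOf('>', i);
                return close < 0 ? -1 : close + 1;
            }
            if (i < line.length() && line.charAt(i) == '@') { // language tag
                while (i < line.length() && !Character.isWhitespace(line.charAt(i))) i++;
            }
            return i;
        }
        if (c == '_') {                      // blank node label, whitespace-delimited
            int i = start;
            while (i < line.length() && !Character.isWhitespace(line.charAt(i))) i++;
            return i;
        }
        return -1;
    }

    static int skipSpaces(String line, int i) {
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) i++;
        return i;
    }

    /** Drops the graph term of one line; returns null for blank, comment or malformed lines. */
    static String toTriple(String line) {
        int pos = skipSpaces(line, 0);
        if (pos >= line.length() || line.charAt(pos) == '#') return null;
        String[] terms = new String[3];
        for (int k = 0; k < 3; k++) {
            int end = endOfTerm(line, pos);
            if (end < 0) return null;
            terms[k] = line.substring(pos, end);
            pos = skipSpaces(line, end);
        }
        // Optional fourth term (the graph), then the terminating '.'
        if (pos < line.length() && line.charAt(pos) != '.') {
            int end = endOfTerm(line, pos);
            if (end < 0) return null;
            pos = skipSpaces(line, end);
        }
        if (pos >= line.length() || line.charAt(pos) != '.') return null;
        return terms[0] + " " + terms[1] + " " + terms[2] + " .";
    }

    /** Streams line by line, so memory use does not grow with file size. */
    static long convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        long written = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String triple = toTriple(line);
            if (triple != null) {
                writer.write(triple);
                writer.write('\n');
                written++;
            }
        }
        writer.flush();
        return written;
    }
}
```

Malformed lines are silently skipped here, which is a design choice for the sketch. Counting or logging them would be safer before loading the result into a store.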

Bye !!

Marc-Alexandre


Peter Ansell

Feb 19, 2012, 8:54:57 PM
to bio...@googlegroups.com
Hi Lauren,

The following function should do what you want using the Java OpenRDF
(Sesame/Rio) API:

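import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import net.fortytwo.sesametools.nquads.NQuadsFormat;
import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFHandler;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParseException;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;

public void convertNQuadsToNTriples(InputStream nquads, OutputStream ntriples)
        throws RDFParseException, RDFHandlerException, IOException
{
    // The N-Triples writer only writes subject, predicate and object,
    // so the graph of each quad is dropped on output.
    RDFHandler handler = Rio.createWriter(RDFFormat.NTRIPLES, ntriples);

    // The N-Quads format and parser come from the Sesametools nquads library.
    RDFParser parser = Rio.createParser(NQuadsFormat.NQUADS);

    // Each parsed statement is passed straight to the writer.
    parser.setRDFHandler(handler);

    // The base URI is a placeholder value.
    parser.parse(nquads, "http://base.uri.not.defined/");
}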

It requires the nquads library from Sesametools as noted in the last
email, along with the sesame-rio-ntriples library from Sesame.
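
Since the Bio2RDF dumps are often handled as .nq.gz files, a function like the one above can be fed gzip streams directly instead of decompressing to disk first. Here is a minimal sketch of such a wrapper using only the JDK. GzipConversion, StreamConverter and convertGzipFile are hypothetical names introduced for illustration, and the lambda in the usage comment assumes Java 8 or later.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical wrapper, not part of Sesame or Sesametools.
class GzipConversion {

    /** Any stream-to-stream converter, e.g. a method like convertNQuadsToNTriples. */
    interface StreamConverter {
        void convert(InputStream in, OutputStream out) throws Exception;
    }

    /** Decompresses 'source' on the fly, converts it, and writes gzip output to 'target'. */
    static void convertGzipFile(Path source, Path target, StreamConverter converter)
            throws Exception {
        // try-with-resources closes the output first, which also finishes the gzip trailer.
        try (InputStream in = new GZIPInputStream(
                     new BufferedInputStream(Files.newInputStream(source)));
             OutputStream out = new GZIPOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(target)))) {
            converter.convert(in, out);
        }
    }

    // Usage sketch (file names are placeholders):
    //   GzipConversion.convertGzipFile(
    //       Paths.get("input.nq.gz"), Paths.get("output.nt.gz"),
    //       (in, out) -> convertNQuadsToNTriples(in, out));
}
```

Only the streams are held open at any time, so the file size should not matter as long as the converter itself streams.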

It should stream well unless there are bugs in the parser or writer,
as Bio2RDF does not use BlankNodes, which are difficult and slow for
parsers and writers. There are some settings you can set on the parser
(including the following three) which may help improve performance if
there are no errors and there are no BlankNodes.

parser.setStopAtFirstError(false);
parser.setPreserveBNodeIDs(false);
parser.setVerifyData(false);

Cheers,

Peter
