Marc-Alexandre would have to correct me if I am wrong, but I think the
current design for the NCBI files is to have separate graphs for each
record so that any extra resources for a record are located in the
graph, even if they do not have the record URI as their subject. That
process necessitates that we use a Quads format instead of a Triples
format so that the RDF generation process can specify which graph is
to be used for each triple, by writing it as a quad.
I am a little surprised that Fuseki can't parse NQuads files yet given
it is part of Jena, but I haven't used it myself so I haven't run into
it before.
Cheers,
Peter
> --
> You received this message because you are subscribed to the Google Groups "bio2rdf" group.
> To post to this group, send email to bio...@googlegroups.com.
> To unsubscribe from this group, send email to bio2rdf+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/bio2rdf?hl=en.
>
I recommend you use rapper, a tool that is distributed with the
Raptor RDF parser.
This tool converts between RDF formats and is very fast.
These are all the formats supported:
ntriples        N-Triples (default)
turtle          Turtle Terse RDF Triple Language
rdfxml-xmp      RDF/XML (XMP Profile)
rdfxml-abbrev   RDF/XML (Abbreviated)
rdfxml          RDF/XML
rss-1.0         RSS 1.0
atom            Atom 1.0
dot             GraphViz DOT format
json-triples    RDF/JSON Triples
json            RDF/JSON Resource-Centric
html            HTML Table
nquads          N-Quads
Manuel
PS: you might need a Linux/Unix box to use it.
Just getting back to this as I realised it wasn't a closed issue.
You could use the NQuads parser from Sesametools and the NTriples
writer from Sesame to generate a quick little program to convert
NQuads to NTriples.
It would stream from any InputStream or Reader to any OutputStream or
Writer, so the size of the file should not be an issue if you are
streaming from disk back onto disk.
The current versions for Maven dependencies and Maven repositories to
do this would be at least the following:
<dependency>
  <groupId>org.openrdf.sesame</groupId>
  <artifactId>sesame-rio-ntriples</artifactId>
  <version>2.6.3</version>
</dependency>
<dependency>
  <groupId>net.fortytwo.sesametools</groupId>
  <artifactId>nquads</artifactId>
  <version>1.6</version>
</dependency>

<repository>
  <id>aduna-repo</id>
  <name>Aduna repository</name>
  <url>http://repo.aduna-software.org/maven2/releases</url>
</repository>
<repository>
  <id>fortytwo</id>
  <name>fortytwo.net Maven repository</name>
  <url>http://fortytwo.net/maven2</url>
</repository>
I don't have the time to code the program right at the moment, but it
would be quite short.
Cheers,
Peter
The quads structure, when you can use it, allows you to use a simple
query to retrieve all the triples that belong to a specific PubMed record.
Example:
select *
from <http://bio2rdf.org/pubmed_record:XXXXXXX>
where
{ ?s ?p ?o . }
To retrieve the same thing without the graph, you would need something
much more complicated, with optional paths and a depth that I actually
don't know, since I don't need to know it with the N-Quads :)
You said: "Thank you for your answer - I would indeed like to insert all the
triples into a single graph so that I can query across the original
PubMed graphs"
Having an N-Quads structure does not prevent you from querying across
the "original" PubMed graphs. It is completely transparent for users
if you run a SPARQL query without forcing a specific graph.
Since N-Quads has all the information on each line and uses no shortcuts
to simplify the structure, unlike N3 and Turtle, you can manage it quite
easily with Perl or Sed/Awk. I don't really know the latter, but for
Perl, I can suggest a simple one-liner:
zcat YOUR_COMPRESSED_NQUAD_FILE.nq.gz \
  | perl -nle 'print "$1\t$2\t$3 ." if /^(\S+)\s+(\S+)\s+(\S+)\s+\S+\s+\.$/;' \
  | gzip > YOUR_COMPRESSED_NTRIPLE_FILE.nt.gz
I think another one would be necessary to catch the lines where the
literal has spaces in it.
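For the lines where the literal contains spaces, one option is to match
each term explicitly instead of splitting on whitespace. The following is
only a minimal sketch in plain Java, not the Sesame-based approach from
this thread: the class name NQuadsGraphStripper and its methods are
hypothetical, the term patterns are simplified, and it assumes the input
is well-formed N-Quads. Lines that do not match (comments, blank lines,
malformed statements) are skipped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: drops the graph term from each N-Quads line.
public class NQuadsGraphStripper {
    // Simplified term patterns (assumption: well-formed input).
    private static final String IRI = "<[^>]*>";
    private static final String BNODE = "_:\\S+";
    // Quoted string with backslash escapes, then an optional
    // language tag or datatype IRI.
    private static final String LITERAL =
        "\"(?:[^\"\\\\]|\\\\.)*\"(?:@[A-Za-z0-9-]+|\\^\\^" + IRI + ")?";

    private static final Pattern STATEMENT = Pattern.compile(
        "\\s*(" + IRI + "|" + BNODE + ")"                      // subject
        + "\\s+(" + IRI + ")"                                  // predicate
        + "\\s+(" + IRI + "|" + BNODE + "|" + LITERAL + ")"    // object
        + "(?:\\s+(?:" + IRI + "|" + BNODE + "))?"             // graph (dropped)
        + "\\s*\\.\\s*");

    /** Returns the statement as an N-Triples line, or null if the line does not match. */
    public static String toNTriple(String line) {
        Matcher m = STATEMENT.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + " " + m.group(2) + " " + m.group(3) + " .";
    }

    /** Streams line by line; returns the number of triples written. */
    public static long convert(BufferedReader in, Writer out) throws IOException {
        long written = 0;
        String line;
        while ((line = in.readLine()) != null) {
            String triple = toNTriple(line);
            if (triple != null) {
                out.write(triple);
                out.write('\n');
                written++;
            }
        }
        out.flush();
        return written;
    }

    // Usage: java NQuadsGraphStripper input.nq.gz output.nt.gz
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                 new GZIPInputStream(new FileInputStream(args[0])), StandardCharsets.UTF_8));
             Writer out = new BufferedWriter(new OutputStreamWriter(
                 new GZIPOutputStream(new FileOutputStream(args[1])), StandardCharsets.UTF_8))) {
            System.err.println(convert(in, out) + " triples written");
        }
    }
}
```

Because it reads and writes one line at a time through the gzip streams,
the file size should not matter, in the same way as the zcat/gzip pipeline
above. A real N-Quads parser is still the safer choice if the data may
contain anything unusual.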
Bye !!
Marc-Alexandre
2012/2/16 Peter Ansell <ansell...@gmail.com>:
The following function should do what you want using the Java OpenRDF
(Sesame/Rio) API:
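import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import net.fortytwo.sesametools.nquads.NQuadsFormat;

import org.openrdf.rio.RDFFormat;
import org.openrdf.rio.RDFHandler;
import org.openrdf.rio.RDFHandlerException;
import org.openrdf.rio.RDFParseException;
import org.openrdf.rio.RDFParser;
import org.openrdf.rio.Rio;

// Streams N-Quads from the input to N-Triples on the output;
// the N-Triples writer does not write the graph of each statement.
public void convertNQuadsToNTriples(InputStream nquads, OutputStream ntriples)
        throws RDFParseException, RDFHandlerException, IOException
{
    // The writer handles each statement as soon as the parser reads it.
    RDFHandler handler = Rio.createWriter(RDFFormat.NTRIPLES, ntriples);
    RDFParser parser = Rio.createParser(NQuadsFormat.NQUADS);
    parser.setRDFHandler(handler);
    // N-Quads only contains absolute URIs, so this base URI is a placeholder.
    parser.parse(nquads, "http://base.uri.not.defined/");
}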
It requires the nquads library from Sesametools as noted in the last
email, along with the sesame-rio-ntriples library from Sesame.
It should stream well unless there are bugs in the parser or writer,
as Bio2RDF does not use BlankNodes, which are difficult and slow for
parsers and writers. There are some settings you can set on the parser
(including the following three) which may help improve performance if
there are no errors and there are no BlankNodes.
parser.setStopAtFirstError(false);
parser.setPreserveBNodeIDs(false);
parser.setVerifyData(false);
Cheers,
Peter