Using FALDO for basic genome features

Chris Mungall

Apr 27, 2014, 2:31:36 PM

to Michel Dumontier, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

At the Network of Biothings hackathon yesterday I started a translation
from ClinVar to triples using FALDO to represent the position of the
variant.

This is not unusual in that the source data here stores position as
tuples

(GenomeBuild, Chromosome, Begin, End)

If FALDO is to take off, others will be producing triples from similar
tuples.

We are a little bit silent on how to generate the value of the
reference predicate. The manuscript states "FALDO makes very few
assumptions about the representation of the reference sequence". I think
avoiding overspecification and allowing flexibility is good, but we
should give more guidance here.

For example, is it up to my infrastructure to perform a lookup on
(GenomeBuild, Chromosome) -> ReferenceURI?

For example (GRCh37, Chr8) -->
http://www.ebi.ac.uk/genomes/CM000670.html
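
A minimal sketch of what such a lookup could look like, assuming a hand-maintained table. The table and the `reference_uri` name are hypothetical, and the only entry is the example pair above; nothing here is prescribed by FALDO.

```python
# Hypothetical lookup table from (GenomeBuild, Chromosome) to a reference URI.
# Only the pair mentioned above is filled in; a real table would need one
# entry per chromosome per build, taken from a source the producer trusts.
REFERENCE_URIS = {
    ("GRCh37", "Chr8"): "http://www.ebi.ac.uk/genomes/CM000670.html",
}


def reference_uri(build: str, chromosome: str) -> str:
    """Return the reference URI for a (build, chromosome) pair."""
    try:
        return REFERENCE_URIS[(build, chromosome)]
    except KeyError:
        # Fail loudly rather than guess a URI for an unknown reference.
        raise KeyError(f"no reference URI known for {build} {chromosome}") from None


print(reference_uri("GRCh37", "Chr8"))
```

Every producer would have to maintain a table like this, which is the extra burden in question.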

Let's say this is not a pain to do. Is it then up to the consumer of my
triples to do some kind of reverse lookup when they want to expose the
build and chromosome in their system?

Is it considered good practice for me to also produce triples such as:

<http://www.ebi.ac.uk/genomes/CM000670> hasChromosome :Chr8 .
<http://www.ebi.ac.uk/genomes/CM000670> hasBuild :GRCh37 .

This way consumers can easily get at what they want. But FALDO is silent
on this.

It is tempting to produce the data as JSON with a simple object for
producing the location:

{
"build" : …,
"chromosome" : …,
"begin" : …,
"end" : ..
}

And include a JSON-LD context for mapping this to an RDF model that is
not FALDO but has a defined translation to FALDO. This avoids creating
all the additional URIs, and makes it easier for consumers of the data
to get what they need.

Jerven Bolleman

Apr 27, 2014, 6:40:40 PM

to Chris Mungall, Michel Dumontier, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Hi Chris,

On 27 Apr 2014, at 20:31, Chris Mungall <cjmu...@lbl.gov> wrote:

> At the Network of Biothings hackathon yesterday I started a translation from ClinVar to triples using FALDO to represent the position of the variant.
>
> This is not unusual in that the source data here stores position as tuples
>
> (GenomeBuild, Chromosome, Begin, End)
>
> If FALDO is to take off, others will be producing triples from similar tuples.
>
> We are a little bit silent on how to generate the value of the reference predicate. The manuscript states "FALDO makes very few
> assumptions about the representation of the reference sequence". I think avoiding overspecification and allowing flexibility is good, but we should give more guidance here.
>
> For example, is it up to my infrastructure to perform a lookup on (GenomeBuild, Chromosome) -> ReferenceURI?
>
> For example (GRCh37, Chr8) --> http://www.ebi.ac.uk/genomes/CM000670.html

I think we need a "Cool URIs for biological records" document. But at the moment, yes, reference URIs are not cool, which means you do need
to do lookups :(

>
> Let's say this is not a pain to do. Is it then up to the consumer of my triples to do some kind of reverse lookup when they want to expose the build and chromosome in their system?
>
> Is it considered good practice for me to also produce triples such as:
>
> <http://www.ebi.ac.uk/genomes/CM000670> hasChromosome :Chr8 .
> <http://www.ebi.ac.uk/genomes/CM000670> hasBuild :GRCh37 .

I think that is biologist talk ;) not semantics.
I would go more for something like this:
> <http://www.ebi.ac.uk/genomes/CM000670> representsChromosome :Chr8 .
> <http://www.ebi.ac.uk/genomes/CM000670> assembly build:GRCh37 .
In the end it's down to a practical decision; we can't model databases we don't control.
And before you know it FALDO ends up doing Ensembl/ENA/DDBJ/RefSeq in RDF.

Secondly, the reference sequence is one of the 6 contig sequences, so it would be

<http://www.ebi.ac.uk/ena/data/view/GL000062> a ena:Contig .
<http://www.ebi.ac.uk/ena/data/view/GL000062> partOf <http://www.ebi.ac.uk/genomes/CM000670> .
_:1 a faldo:ExactPosition ;
faldo:position 1;
faldo:reference <http://www.ebi.ac.uk/ena/data/view/GL000062> .

>
> This way consumers can easily get at what they want. But FALDO is silent on this.

It's because this is already covered in basic URI design… We can put something on the wiki.
FALDO is limited in scope (maybe too limited).

>
> It is tempting to produce the data as JSON with a simple object for producing the location:
>
> {
> "build" : …,
> "chromosome" : …,
> "begin" : …,
> "end" : ..
> }

It's late here and I can't think of anything better than this for now.
{
"@context": {
"faldo" : "http://biohackathon.org/faldo#",
"build" : {"@id":"faldo:reference", "@type":"@id"},
"begin" : {"@id":"faldo:begin"},
"end" : {"@id":"faldo:end"},
"pos" : {"@id":"faldo:position"}
},
"@type" : "faldo:Region",
"build" : "GRCh37" ,
"chromosome" :"Chr09",
"begin" : {"pos" : 1},
"end" : {"pos" : 2}
}

You can play more with it here.

http://json-ld.org/playground/#startTab=tab-nquads&json-ld=%7B%22%40context%22%3A%7B%22faldo%22%3A%22http%3A%2F%2Fbiohackathon.org%2Ffaldo%23%22%2C%22build%22%3A%7B%22%40id%22%3A%22faldo%3Areference%22%2C%22%40type%22%3A%22%40id%22%7D%2C%22begin%22%3A%7B%22%40id%22%3A%22faldo%3Abegin%22%7D%2C%22end%22%3A%7B%22%40id%22%3A%22faldo%3Aend%22%7D%2C%22pos%22%3A%7B%22%40id%22%3A%22faldo%3Aposition%22%7D%7D%2C%22%40type%22%3A%22faldo%3ARegion%22%2C%22build%22%3A%22GRCh37%22%2C%22chromosome%22%3A%22Chr09%22%2C%22begin%22%3A%7B%22pos%22%3A1%7D%2C%22end%22%3A%7B%22pos%22%3A2%7D%7D

>

> And include a JSON-LD context for mapping this to an RDF model that is not FALDO but has a defined translation to FALDO. This avoids creating all the additional URIs, and makes it easier for consumers of the data to get what they need.

I agree, we avoid all the references, and we have a section about that for DDBJ, which applies here as well.

Jerven Bolleman

Apr 28, 2014, 3:09:02 AM

to Chris Mungall, faldo, Michel Dumontier, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Hi Chris,

Waking up with fresh ideas I can get you this JSON

{
"@context": {
"@base" : "http://example/9606/GRch37/Chr8/",
"faldo": "http://biohackathon.org/faldo#",
"GRch37": "http://example/9606/GRch37/",

"begin": {
"@id" : "faldo:begin",

"@type" : "@id"

},
"end": {
"@id": "faldo:end",

"@type" : "@id"
},
"chromosome": {

"@id": "faldo:reference",
"@type" : "@id"
}
},

"@id" : "1to2",
"@type" : "faldo:Region",
"chromosome": "GRch37:Chr08",
"begin": "1",
"end": "2"
}

<http://example/9606/GRch37/Chr8/1to2> <http://biohackathon.org/faldo#begin> <http://example/9606/GRch37/Chr8/1> .
<http://example/9606/GRch37/Chr8/1to2> <http://biohackathon.org/faldo#end> <http://example/9606/GRch37/Chr8/2> .
<http://example/9606/GRch37/Chr8/1to2> <http://biohackathon.org/faldo#reference> <http://example/9606/GRch37/Chr08> .
<http://example/9606/GRch37/Chr8/1to2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://biohackathon.org/faldo#Region> .

This is written in the hope that your JSON-LD service is driven by nice URIs, and that in RDF we do not need to be complete in every document,
as long as the generated URIs are dereferenceable…
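
A URI scheme like the one above could also be generated straight from the source tuple. The following is only a sketch under assumptions: `http://example` is the placeholder host from the example, the `region_ntriples` name is hypothetical, and one chromosome label is used throughout (the example above mixes Chr8 and Chr08).

```python
FALDO = "http://biohackathon.org/faldo#"  # namespace as written in this thread
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example"  # placeholder host, not a real service


def region_ntriples(taxon, build, chromosome, begin, end):
    """Return N-Triples lines for one region, following the URI scheme above."""
    chrom = f"{BASE}/{taxon}/{build}/{chromosome}"
    region = f"{chrom}/{begin}to{end}"
    triples = [
        (region, FALDO + "begin", f"{chrom}/{begin}"),
        (region, FALDO + "end", f"{chrom}/{end}"),
        (region, FALDO + "reference", chrom),
        (region, RDF_TYPE, FALDO + "Region"),
    ]
    # Every subject, predicate and object here is a URI, so all are bracketed.
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]


for line in region_ntriples(9606, "GRCh37", "Chr8", 1, 2):
    print(line)
```

Because build and chromosome are readable from the URI path, a pure JSON consumer can recover them without a reverse lookup, provided the scheme is documented and stable.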

Now back to bashing OWL into doing what I want it to do ;)

Regards,
Jerven

Michel Dumontier

Apr 28, 2014, 11:38:09 AM

to Jerven Bolleman, Chris Mungall, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

On Mon, Apr 28, 2014 at 12:09 AM, Jerven Bolleman <m...@jerven.eu> wrote:

> Hi Chris,
>
> Waking up with fresh ideas I can get you this JSON
>
> {
> "@context": {
> "@base" : "http://example/9606/GRch37/Chr8/",
> "faldo": "http://biohackathon.org/faldo#",
> "GRch37": "http://example/9606/GRch37/",

you mean "reference": here, right?

> "begin": {
> "@id" : "faldo:begin",
> "@type" : "@id"
> },
> "end": {
> "@id": "faldo:end",
> "@type" : "@id"
> },
> "chromosome": {
> "@id": "faldo:reference",
> "@type" : "@id"

why wouldn't this attribute just be associated with the reference?

m.

Jerven Bolleman

Apr 28, 2014, 11:43:17 AM

to Michel Dumontier, Chris Mungall, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Hi Michel,

Could you expand on your comment? I am not following it.
I wanted to make JSON-LD that was very close in structure to the JSON Chris sees being created by projects.

Cheers,
Jerven

Michel Dumontier

Apr 28, 2014, 12:56:00 PM

to Jerven Bolleman, Chris Mungall, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Ah,

shouldn't the reference contain the build and chromosome information (when available)?

m.

Michel Dumontier

Apr 28, 2014, 1:48:16 PM

to Jerven Bolleman, Chris Mungall, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Jerven replied directly to me

"Yes, it should, on the entity of the genome build. However I don't know an ontology or schema for these. And it was not on the original json of Chris. So I put this information in the chromosome uri, as if it was a cool uri. Where pure json users can get at it."

I understand this. Chris, can you chime in on producing a pattern where

:position
    :start :startpos
    :end :endpos
    :reference :ref

:ref
    :build "X"
    :chr "Y"
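
As a rough illustration of that pattern in plain JSON terms, here is a sketch that builds one location record from the source tuple. The key names simply mirror the pattern above, and the `location_record` function is hypothetical; no agreed vocabulary for build and chromosome is implied.

```python
import json


def location_record(build, chromosome, start, end):
    """Build one location with the reference as its own nested node."""
    # The reference carries build and chromosome itself, so a consumer
    # does not need a reverse lookup to recover them.
    return {
        "start": start,
        "end": end,
        "reference": {"build": build, "chr": chromosome},
    }


print(json.dumps(location_record("GRCh37", "Chr8", 1, 2), indent=2))
```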

Michel Dumontier

Associate Professor of Medicine (Biomedical Informatics), Stanford University

Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group

http://dumontierlab.com

Chris Mungall

Apr 28, 2014, 2:15:00 PM

to Jerven Bolleman, Michel Dumontier, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

On 27 Apr 2014, at 15:40, Jerven Bolleman wrote:

> Hi Chris,
>
>
> On 27 Apr 2014, at 20:31, Chris Mungall <cjmu...@lbl.gov> wrote:
>
>> At the Network of Biothings hackathon yesterday I started a
>> translation from ClinVar to triples using FALDO to represent the
>> position of the variant.
>>
>> This is not unusual in that the source data here stores position as
>> tuples
>>
>> (GenomeBuild, Chromosome, Begin, End)
>>
>> If FALDO is to take off, others will be producing triples from
>> similar tuples.
>>
>> We are a little bit silent on how to generate the value of the
>> reference predicate. The manuscript states "FALDO makes very few
>> assumptions about the representation of the reference sequence". I
>> think avoiding overspecification and allowing flexibility is good,
>> but we should give more guidance here.
>>
>> For example, is it up to my infrastructure to perform a lookup on
>> (GenomeBuild, Chromosome) -> ReferenceURI?
>>
>> For example (GRCh37, Chr8) -->
>> http://www.ebi.ac.uk/genomes/CM000670.html
>
> I think we need a "Cool URIs for biological records" document. But at
> the moment, yes, reference URIs are not cool, which means you do need
> to do lookups :(

I just grabbed that URL as a quick example, maybe there are better ones.
Even assuming we have cool URIs, there is still an extra burden on me
when producing the JSON-LD/ttl to do some kind of lookup somewhere to
get the URI from the (Chr,Build) tuple. Alternatively, I could perhaps
use a bNode for the reference - the tuple serves to uniquely identify it
(provided I have a standard vocabulary to express the Chr and Build on
the bNode).
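
A sketch of what the bNode variant might emit, as Turtle generated from the tuple. The `ex:` vocabulary for build and chromosome is a made-up placeholder for the standard vocabulary that would be needed, the `region_turtle` name is hypothetical, and the labels are assumed to contain no quote characters.

```python
def region_turtle(build, chromosome, begin, end):
    """Return Turtle for one region whose reference is a blank node."""
    return "\n".join([
        "@prefix faldo: <http://biohackathon.org/faldo#> .",
        "@prefix ex: <http://example/vocab#> .  # placeholder vocabulary",
        "",
        # One shared blank node stands in for the reference sequence.
        f'_:ref ex:build "{build}" ; ex:chromosome "{chromosome}" .',
        "",
        "_:region a faldo:Region ;",
        f"    faldo:begin [ a faldo:ExactPosition ; faldo:position {int(begin)} ; faldo:reference _:ref ] ;",
        f"    faldo:end [ a faldo:ExactPosition ; faldo:position {int(end)} ; faldo:reference _:ref ] .",
    ])


print(region_turtle("GRCh37", "Chr8", 1, 2))
```

The blank node is only identified by its build and chromosome values, so consumers merging data from several producers would still have to match references on those values.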

Jerven Bolleman

Apr 28, 2014, 3:27:33 PM

to Chris Mungall, faldo, Michel Dumontier, faldo, Joachim Baran, Toshiaki Katayama, Robert Buels, Robert Hoehndorf, Raoul Bonnal, Takatomo Fujisawa, Peter Cock, Francesco Strozzi

Hi Chris, Michel,

Yes, you could use a bnode, but then you push the problem to your users, as they now have to figure out which GRCh37 chromosome 8 is actually meant…
You could of course do something like this:

{

"@context": {
"@base" : "http://example/9606/GRch37/Chr8/",
"faldo": "http://biohackathon.org/faldo#",

"rdfs" : "http://www.w3.org/2000/01/rdf-schema#",
"organism" : "http://purl.uniprot.org/taxonomy/",

"begin": {
"@id" : "faldo:begin",

"@type" : "@id"

},
"end": {
"@id": "faldo:end",

"@type" : "@id"
},
"chromosome": {

"@id": "faldo:reference",
"@type" : "@id"
},

"label": {
"@id": "rdfs:label"
}
},
"@graph" : [{
"@id" : "1to2",
"chromosome": "Chr08",

"begin": "1",
"end": "2"

},{
"@id" : "3to4",
"chromosome": "Chr08",
"begin": "3",
"end": "4"
},{
"@id" : "Chr08",
"@type" : "SO:0000340" ,
"build" : "GRch37" ,
"label" : "Chromosome 8",
"species" : "organism:9606"
}]
}

Here the Chr08 node has some properties that allow one to identify the chromosome reference, i.e. it has some
owl:hasKey things...

But then we end up modelling genome assemblies; I am not sure whether that should still be in the paper.

Regards,
Jerven
