FALDO questions


Matthew Brush

Jan 4, 2014, 3:40:51 PM
to fa...@googlegroups.com

Hi all. My name is Matthew Brush and I am working with Melissa Haendel and Chris Mungall on the Monarch Initiative. We are building a genotype model [1,2] for this work that will be interoperable with the SO, and will also use FALDO to describe the positions of variant features that are linked to organismal and cellular phenotypes. I have been looking over the FALDO documentation and manuscript and have a couple of questions.


1. The first concerns the faldo:location property. The main conceptual hurdle I need help with is understanding the semantics and utility of this relation, particularly why it is needed when linking an SO feature such as a gene or gene variant to its position. For example, in the <_:geneCheY> gene case in the GitHub README file [3], why can a faldo:Position not be linked directly to the <_:geneCheY> feature? Instead, the example first links <_:geneCheY> to the <_:example> faldo:Region using the faldo:location property, and then links this <_:example> Region to its start and end faldo:Positions.
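To make the chain in question concrete, here is a rough Python sketch of the feature -> Region -> Position structure, using plain tuples rather than an RDF library. This is only an illustration: the `_:begin`, `_:end`, and `_:refSeq` node names and the coordinates 100 and 489 are made up for the example and are not taken from the README.

```python
# Hypothetical triples mirroring the README pattern:
# feature --faldo:location--> Region --faldo:begin/end--> Position.
# Node names other than _:geneCheY and _:example, and all coordinates,
# are invented for illustration.
TRIPLES = [
    ("_:geneCheY", "faldo:location", "_:example"),
    ("_:example", "rdf:type", "faldo:Region"),
    ("_:example", "faldo:begin", "_:begin"),
    ("_:example", "faldo:end", "_:end"),
    ("_:begin", "rdf:type", "faldo:ExactPosition"),
    ("_:begin", "faldo:position", 100),
    ("_:begin", "faldo:reference", "_:refSeq"),
    ("_:end", "rdf:type", "faldo:ExactPosition"),
    ("_:end", "faldo:position", 489),
    ("_:end", "faldo:reference", "_:refSeq"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object of the triples matching (subject, predicate, ?)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def resolve_locations(feature, triples=TRIPLES):
    """Follow faldo:location, then faldo:begin / faldo:end, for one feature.

    Returns a list of (reference, begin, end) tuples, one per Region.
    The reference is read from the begin Position only (a simplification).
    """
    results = []
    for region in objects(feature, "faldo:location", triples):
        begin = objects(region, "faldo:begin", triples)[0]
        end = objects(region, "faldo:end", triples)[0]
        results.append((
            objects(begin, "faldo:reference", triples)[0],
            objects(begin, "faldo:position", triples)[0],
            objects(end, "faldo:position", triples)[0],
        ))
    return results


print(resolve_locations("_:geneCheY"))
```

Note that the feature itself never carries a coordinate in this sketch; every number hangs off a Position that also names its reference sequence.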

    

One explanation I can imagine for needing a location property here is that FALDO only allows Positions to be assigned to a faldo:Region. This notion seems to be supported by the statement in the manuscript that "every faldo:Position refers to the sequence it is on". So my assumption here is that an so:feature such as the <_:geneCheY> gene does not qualify as a faldo:Region, and therefore cannot be directly linked to a faldo:Position. Is this correct? The faldo:location property is thus needed to define a faldo:Region instance that aligns with the <_:geneCheY> feature in question and is part of the reference sequence, and that can therefore be assigned begin and end positions using a one-based offset from the start of the reference. Am I on the right track here?

 

If my assumptions above are correct, why can the <_:geneCheY> gene not be an instance of a faldo:Region? I should clarify that in the SO, instances of so:features such as the <_:geneCheY> gene are considered to be extents of genomic sequence whose identity depends on their sequence and their position. This fact is being clarified and formalized in some refactoring work we are doing with SO developers. Given this, it sounds to me like an so:feature such as <_:geneCheY> could possibly qualify as a faldo:Region. If this is not the case, can you explain why? Is it because faldo:Regions are considered to be part of the reference sequence?


2. A second question concerns how 'variants' are positioned using the FALDO model, as this is a key use case for the Monarch project. All of the examples given in the FALDO documentation and the manuscript seem to describe cases where the feature of interest aligns perfectly with a reference. How are FALDO positions assigned to 'variants' whose sequence does not perfectly match that of the reference being used? For example, consider a mouse Shh gene variant that contains a large internal deletion spanning exons 2 and 3. Is the location of this gene variant based on the best possible alignment with a reference, such that in this case the begin and end positions would end up corresponding to those of the wild-type Shh gene in a reference genome (even though the test feature will in effect be much shorter than this reference)?

 

This approach of describing a variant's position in terms of where its beginning and end align with a reference seems like it would be consistent with other efforts. For example, the position of the four-base deletion variant 'rs71186062' is documented in Ensembl [4] in terms of the reference location it spans, namely Chromosome 7:155594771-155594774. Of course, this would leave much open as to how exactly the alignment process/algorithm was implemented, which may itself have pros and cons in different scenarios.
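As a small check on the arithmetic, the Ensembl-style location string can be parsed and its span computed. This is a minimal sketch that assumes both coordinates are inclusive; `parse_region` and `span_length` are hypothetical helper names, not part of any Ensembl or FALDO tooling.

```python
def parse_region(text):
    """Split a 'chromosome:begin-end' string into (chromosome, begin, end)."""
    chromosome, span = text.split(":")
    begin, end = (int(part) for part in span.split("-"))
    return chromosome, begin, end


def span_length(begin, end):
    """Number of reference bases covered, assuming inclusive coordinates."""
    return end - begin + 1


chromosome, begin, end = parse_region("7:155594771-155594774")
print(chromosome, begin, end, span_length(begin, end))
```

Under the inclusive assumption the span covers four reference bases, which matches the four-base deletion described above.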


Thanks all. Hopefully my thoughts and questions here are clear! Best,

Matt


[1] http://www.slideshare.net/mhb120/brush-icbo-2013
[2] http://monarch-ontology.googlecode.com/svn/trunk/docs/Brush_et_al_2013%28submitted%29.pdf
[3] https://github.com/JervenBolleman/FALDO/blob/master/README.md
[4] http://uswest.ensembl.org/Homo_sapiens/Variation/Mappings?db=core;g=ENSG00000164690;r=7:155592680-155604967;vdb=variation;v=rs71186062;vf=15535257;source=dbSNP

               

               


Jerven Bolleman

Jan 6, 2014, 5:23:28 AM
to Matthew Brush, fa...@googlegroups.com
Hi Matthew,

On point 1, why a gene is not a region: the normal definition of a gene
is a unit of inheritance (this is how I learned it, at least). A gene
is encoded as a set of nucleotides on a genome.
However, that set of nucleotides is a "real world" object that can be
represented in many databases. So the equivalent gene feature might be
present on, let's say, an Ensembl chromosome and in a DDBJ record. If the
gene is the region, then the equivalent gene in Ensembl and DDBJ cannot
be inferred to be the same (Ensembl will count from the chromosome
start, DDBJ from the record start, giving two incompatible starts for the
gene as region). So in FALDO we make a distinction between features
and their locations, as the same feature is likely to have multiple
locations in different databases (including different versions), i.e.
Gene != Annotation of gene region.
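To make the coordinate mismatch concrete, here is a minimal Python sketch of one feature carrying two locations, each counted on its own reference. Everything in it is hypothetical: the identifiers, the coordinates, and the assumption that the record is a contiguous, same-strand slice of the chromosome whose first base sits at chromosome position 1,000,001.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    reference: str  # the sequence the positions are counted on
    begin: int      # 1-based, inclusive
    end: int        # 1-based, inclusive


# One gene (feature), two locations on two different references.
# All identifiers and numbers are invented for illustration.
GENE_LOCATIONS = {
    "gene:example": [
        Location("ensembl:chromosome", 1_000_101, 1_000_490),
        Location("ddbj:record", 101, 490),
    ]
}


def to_chromosome(loc, record_start_on_chromosome):
    """Shift record coordinates onto the chromosome.

    record_start_on_chromosome is the 1-based chromosome position of the
    record's first base (an assumed, externally known offset).
    """
    shift = record_start_on_chromosome - 1
    return Location("ensembl:chromosome", loc.begin + shift, loc.end + shift)


ensembl_loc, ddbj_loc = GENE_LOCATIONS["gene:example"]
print(ensembl_loc == ddbj_loc)                            # raw coordinates differ
print(to_chromosome(ddbj_loc, 1_000_001) == ensembl_loc)  # same stretch once shifted
```

If the gene were itself the region, the two coordinate pairs would describe two different things; keeping the feature separate lets both locations point at the same gene.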

Point 2 is a more difficult case, and here I assumed the UniProt way, i.e.
you encode the variation as a replace operation:
take the bit between here and there and paste in this instead.
E.g.:
annotation:VSP_008711 rdf:type up:Alternative_Sequence_Annotation ;
    rdfs:comment "In isoform 2." ;
    up:substitution "VWLPRPYSARGAA" ;
    up:range range:22861407670514989tt113tt125 .

range:22861407670514989tt113tt125 rdf:type faldo:Region ;
    faldo:begin position:22861407670514989tt113 ;
    faldo:end position:22861407670514989tt125 .

position:22861407670514989tt113 rdf:type faldo:Position ,
        faldo:ExactPosition ;
    faldo:position 113 ;
    faldo:reference isoform:Q8TCD5-1 .

position:22861407670514989tt125 rdf:type faldo:Position ,
        faldo:ExactPosition ;
    faldo:position 125 ;
    faldo:reference isoform:Q8TCD5-1 .

Here we quickly get into the semantics of what we annotate, rather than
where the thing we annotate is, and I would like to keep that out of FALDO.
I would actually love to see a method that infers the existence of a
sequence based on such a variant annotation.
One could do this as a SPIN rule or maybe in SWRL. OWL 2 would probably
not work due to the need for substring and concat.
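Outside of a rule language, the substring-and-concat step can be sketched in a few lines of Python. This is only an illustration of the replace operation, not a SPIN or SWRL rule; it assumes begin and end are 1-based and inclusive, and `apply_replacement` and the toy sequence are made-up names and data.

```python
def apply_replacement(reference_seq, begin, end, substitution):
    """Infer a variant sequence from a replace-style annotation.

    begin and end are assumed to be 1-based, inclusive positions on
    reference_seq. The stretch begin..end is cut out and substitution is
    pasted in its place; an empty substitution models a plain deletion.
    """
    if not (1 <= begin <= end <= len(reference_seq)):
        raise ValueError("begin/end must lie within the reference sequence")
    prefix = reference_seq[:begin - 1]  # everything before the region
    suffix = reference_seq[end:]        # everything after the region
    return prefix + substitution + suffix


# Toy data, not a real isoform sequence.
reference = "ABCDEFGHIJ"
print(apply_replacement(reference, 4, 6, "xy"))  # replace DEF with xy
print(apply_replacement(reference, 4, 6, ""))    # delete DEF
```

The begin and end positions stay expressed on the reference, so the inferred sequence can be shorter or longer than the region it replaces.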

Hope this makes the thinking a bit clearer.

Regards,
Jerven



--
Jerven Bolleman
m...@jerven.eu