Hi all. My name is Matthew Brush and I am working with Melissa Haendel and Chris Mungall on the Monarch Initiative. We are building a genotype model [1,2] for this work that will be interoperable with the SO, and that will also use FALDO to describe the positions of variant features that are linked to organismal and cellular phenotypes. I have been looking over the FALDO documentation and manuscript and have a couple of questions.
1. The first concerns the faldo:location property. The main conceptual hurdle I need help with is understanding the semantics and utility of this relation - particularly why it is needed when linking an SO feature such as a gene or gene variant to its position. For example, with the <_:geneCheY> gene case in the github readme file [3], why cannot a faldo:Position be linked directly to the <_:geneCheY> feature? Instead it first links <_:geneCheY> to the <_:example> faldo:Region using the faldo:location property, and then links this <_:example> Region to its start and end faldo:Positions.
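To make the pattern I am asking about concrete, here is a minimal sketch of it as I understand it, written with plain Python tuples rather than an RDF library. The node names <_:geneCheY> and <_:example> come from the readme; the position node names, the reference node name, the `so:gene` shorthand, and the coordinates 100 and 490 are assumptions made up purely for illustration.

```python
# Triples as (subject, predicate, object) tuples; no RDF library is used.
# "_:begin", "_:end", "_:refseq", "so:gene" and the numbers 100/490 are
# hypothetical placeholders, not values taken from the FALDO readme.
TRIPLES = [
    ("_:geneCheY", "rdf:type", "so:gene"),
    ("_:geneCheY", "faldo:location", "_:example"),   # feature -> region
    ("_:example", "rdf:type", "faldo:Region"),
    ("_:example", "faldo:begin", "_:begin"),         # region -> positions
    ("_:example", "faldo:end", "_:end"),
    ("_:begin", "rdf:type", "faldo:ExactPosition"),
    ("_:begin", "faldo:position", 100),
    ("_:begin", "faldo:reference", "_:refseq"),      # position -> its sequence
    ("_:end", "rdf:type", "faldo:ExactPosition"),
    ("_:end", "faldo:position", 490),
    ("_:end", "faldo:reference", "_:refseq"),
]


def objects(subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def feature_span(feature):
    """Follow feature -> region -> begin/end positions, two hops in total."""
    region = objects(feature, "faldo:location")[0]
    begin = objects(region, "faldo:begin")[0]
    end = objects(region, "faldo:end")[0]
    reference = objects(begin, "faldo:reference")[0]
    return (
        reference,
        objects(begin, "faldo:position")[0],
        objects(end, "faldo:position")[0],
    )


print(feature_span("_:geneCheY"))
```

The point of the sketch is only that the feature never touches a Position directly: the Region node sits in between, and that indirection is what I would like to understand.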
One explanation I can imagine for needing a location property here is that FALDO only allows Positions to be assigned to a faldo:Region. This notion seems to be supported by the statement in the manuscript that "every faldo:Position refers to the sequence it is on". So my assumption here is that an so:feature such as the <_:geneCheY> gene does not qualify as a faldo:Region, and therefore cannot be directly linked to a faldo:Position . . . is this correct? The faldo:location property is thus needed to define a faldo:Region instance that aligns with the <_:geneCheY> feature in question and is part of the reference sequence . . . and can thus be assigned begin and end positions using a one-based offset from the start of the reference. Am I on the right track here?
If my assumptions above are correct, why can the <_:geneCheY> gene not itself be an instance of a faldo:Region? I should clarify that in the SO, instances of so:features such as the <_:geneCheY> gene are considered to be extents of genomic sequence whose identity is dependent on their sequence and their position. This is being clarified and formalized in some refactoring work we are doing with SO developers. Given this, it sounds to me like an so:feature such as <_:geneCheY> could qualify as a faldo:Region. If that is not the case, can you explain why? Is it because faldo:Regions are considered to be part of the reference sequence?
This approach of describing a variant's position in terms of where its beginning and end align with a reference seems like it would be consistent with other efforts. For example, the position of the four-base deletion variant 'rs71186062' is documented in Ensembl [4] in terms of the reference location it spans - namely Chromosome 7:155594771-155594774. Of course, this would leave much open as to how exactly the alignment process/algorithm was implemented, which may itself have pros and cons in different scenarios.
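As a quick check on the coordinate arithmetic in the Ensembl example, here is a small sketch that parses the location string and computes the span length, assuming one-based, inclusive start and end coordinates. The helper names `parse_region` and `span_length` are my own and not part of any Ensembl or FALDO tooling.

```python
def parse_region(text):
    """Parse 'chrom:start-end' into (chrom, start, end).

    Assumes one-based coordinates where both start and end are inclusive.
    """
    chrom, span = text.rsplit(":", 1)
    start, end = (int(part) for part in span.split("-"))
    return chrom, start, end


def span_length(start, end):
    """Number of bases covered by an inclusive, one-based interval."""
    return end - start + 1


chrom, start, end = parse_region("7:155594771-155594774")
print(chrom, start, end, span_length(start, end))
```

Under that assumption the interval covers four bases, which matches the four-base deletion described above.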