Dear Luis,
May I ask how I should interpret the file "GMGC10.relationships.txt", and whether I can use it to map the original ORFs to unigene IDs?
The GMGC website says:
""" This table shows the structure of the clusters at 95%. Note that this table requires ca. 300 GB to store and contains 8,533,537,889 rows.
This table contains triplets of the form [ORF1] [REL] [ORF2]. [ORF2] is always an ORF in the GMGCv1,
while [ORF1] is one of the original ORFs. [REL] is one of
=: the ORFs are identical
C: ORF1 is contained in ORF2 as a substring
R: ORF1 can be represented by ORF2 (i.e., ORF1 is 95% identical to a substring of ORF2). """
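To make the triplet layout concrete, here is a minimal Python sketch of how such rows could be read one at a time, which matters for a table of this size. It assumes the three fields are separated by whitespace (the exact delimiter in the real file is not stated above), and the names `Relationship` and `parse_relationships` are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class Relationship(NamedTuple):
    """One [ORF1] [REL] [ORF2] triplet (hypothetical helper type)."""
    orf1: str
    rel: str
    orf2: str


# The three relationship codes listed in the quoted description.
VALID_RELS = {"=", "C", "R"}


def parse_relationships(lines: Iterable[str]) -> Iterator[Relationship]:
    """Lazily yield triplets so the whole table never sits in memory.

    Assumption: fields are whitespace-separated (tabs or spaces).
    """
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 or fields[1] not in VALID_RELS:
            raise ValueError(f"unexpected row: {line!r}")
        yield Relationship(*fields)


# Tiny in-memory sample; for the real file one would pass an open file handle.
sample = [
    "SAMEA3664728_49210 R Fr12_174014655",
    "SAMEA3664728_49210 R Fr12_4011728",
]
records = list(parse_relationships(sample))
```

Because the function accepts any iterable of lines, the same code would work on `open("GMGC10.relationships.txt")` without loading the roughly 300 GB table at once.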
My initial understanding is that ORF2 corresponds to the original unigene names described in the file "GMGC10.meta.tsv", since I found that all ORF2 IDs can be mapped to the original unigene names listed there.
If this is the case, could I map the original ORFs (ORF1) to unigenes via the file "GMGC10.relationships.txt"?
However, I am confused because some ORF1s have an "R" relationship with multiple ORF2s. In this case, which ORF2 should I choose as the unigene of ORF1?
For example, the file "GMGC10.relationships.txt" contains these records:
SAMEA3664728_49210 R Fr12_174014655
SAMEA3664728_49210 R Fr12_4011728
Fr12_174014655 R Fr12_4011728
Fr12_4011728 R Fr12_174014655
I guess this means that the original ORF SAMEA3664728_49210 can be represented by both unigenes, Fr12_174014655 (GMGC10.015_033_689.PITA) and Fr12_4011728 (GMGC10.033_806_784.PITA)?
In this case, why are Fr12_174014655 and Fr12_4011728 not clustered together into a single unigene group?
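For reference, this is roughly how I collected the multiple targets for one ORF. It is only a sketch: the function name `targets_by_orf` is my own, the rows are the four records quoted above, and it does not decide which ORF2 is the "right" unigene; it only lists the candidates.

```python
from collections import defaultdict

# The four records quoted above, as (ORF1, REL, ORF2) tuples.
rows = [
    ("SAMEA3664728_49210", "R", "Fr12_174014655"),
    ("SAMEA3664728_49210", "R", "Fr12_4011728"),
    ("Fr12_174014655", "R", "Fr12_4011728"),
    ("Fr12_4011728", "R", "Fr12_174014655"),
]


def targets_by_orf(rows, wanted):
    """Collect every (REL, ORF2) pair seen for each ORF1 in `wanted`."""
    hits = defaultdict(set)
    for orf1, rel, orf2 in rows:
        if orf1 in wanted:
            hits[orf1].add((rel, orf2))
    return dict(hits)


hits = targets_by_orf(rows, {"SAMEA3664728_49210"})

# ORFs with more than one candidate target are the ambiguous cases.
ambiguous = {orf for orf, targets in hits.items() if len(targets) > 1}
```

Restricting the lookup to a set of ORFs of interest keeps memory use small even when streaming over the full table.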
Best regards,
Tao
So, if I understand correctly:
1. The "R" relationship means "A is considered representable by B if there is a sequence A′ such that A′ is a substring of B and the edit distance from A to A′ is ≤ 5% of the length of A."
2. The "=" relationship means "When the lengths are identical (or similar), this definition corresponds to the species-level 95% nucleotide identity criterion."
3. The "C" relationship means "When A is a fragment of B (even with minor changes), however, then only B is kept."
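To check my reading of the "R" definition, I wrote a small Python sketch of it. It uses a textbook approximate-substring dynamic program (edit distance where the match may start and end anywhere in B); this is my own illustration, not the procedure actually used to build GMGC, and the function names and the `max_fraction` parameter are hypothetical. It runs in quadratic time, so it is only meant for toy sequences.

```python
def min_edit_distance_to_substring(a: str, b: str) -> int:
    """Smallest edit distance between `a` and any substring of `b`.

    The first row is all zeros, so a match may start anywhere in `b`;
    taking the minimum of the last row lets it end anywhere in `b`.
    """
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)  # cur[0]: a[:i] against an empty substring
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # character of a left unmatched
                cur[j - 1] + 1,      # extra character taken from b
            )
        prev = cur
    return min(prev)


def is_representable(a: str, b: str, max_fraction: float = 0.05) -> bool:
    """True if some substring A' of b has edit distance <= 5% of len(a)."""
    return min_edit_distance_to_substring(a, b) <= max_fraction * len(a)


# A 20-base query with one mismatch inside a longer sequence: 1 <= 0.05 * 20.
query = "ACGTACGTACGTACGTACGT"
longer = "TTT" + "ACGTACGTACGTACGTACGA" + "GGG"
one_mismatch_ok = is_representable(query, longer)

# A query with nothing in common with the target is not representable.
unrelated_ok = is_representable("AAAAAAAAAA", "CCCCCCCCCC")
```

Under this reading the relation is not symmetric: a short A can be representable by a longer B while B is not representable by A.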