Mapping from original orfs to Unigene

fang talon

Nov 30, 2023, 7:21:33 AM

to gmgc-users

Dear Luis,

May I ask how I should understand the file "GMGC10.relationships.txt", and whether I can use it to map original ORFs to unigene IDs?
The GMGC website says:
""" This table shows the structure of the clusters at 95%. Note that this table requires ca. 300 GB to store and contains 8,533,537,889 rows.
This table contains triplets of the form [ORF1] [REL] [ORF2]. [ORF2] is always an ORF in the GMGCv1,
while [ORF1] is one of the original ORFs. [REL] is one of
=: the ORFs are identical
C: ORF1 is contained in ORF2 as a substring
R: ORF1 can be represented by ORF2 (i.e., ORF1 is 95% identical to a substring of ORF2). """

My initial understanding is that ORF2 is actually the original unigene name as described in the file "GMGC10.meta.tsv",
as I found that all the ORF2 IDs can be mapped to the original unigene names in "GMGC10.meta.tsv".
If this is the case, could I map original ORFs (ORF1) to unigenes via the file "GMGC10.relationships.txt"?
However, I am confused that some ORF1s have an "R" relationship with multiple ORF2s. In this case, which ORF2 should I choose as the unigene of ORF1?

For example, the file "GMGC10.relationships.txt" contains these records:
SAMEA3664728_49210 R Fr12_174014655
SAMEA3664728_49210 R Fr12_4011728
Fr12_174014655 R Fr12_4011728
Fr12_4011728 R Fr12_174014655
I guess this means that the original ORF SAMEA3664728_49210 can be represented by both unigene Fr12_174014655 (GMGC10.015_033_689.PITA) and unigene Fr12_4011728 (GMGC10.033_806_784.PITA)?
In this case, why are Fr12_174014655 and Fr12_4011728 not clustered together into the same unigene group?

Best regards,
Tao

Luis Pedro Coelho

Dec 1, 2023, 3:13:20 AM

to fang talon, gmgc-users

Hi Tao,

Thanks for reaching out!

> If this is the case, could I map original ORFs (ORF1) to unigenes via the file "GMGC10.relationships.txt"?
> However, I am confused that some ORF1s have an "R" relationship with multiple ORF2s. In this case, which ORF2 should I choose as the unigene of ORF1?

Yes, this happens.

Basically, it means that there is more than one possible representative for the original ORF. Because of how sequence similarity works, A may be similar to both B and C without B and C being similar to each other. The case where A is just a fragment is a very obvious example, but more subtle ones are possible too.

When we need to pick a single representative, we break ties by taking the lowest possible one (in alphabetical order). The decision is arbitrary anyway.
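
As an illustration of the tie-breaking rule just described, here is a minimal sketch in Python. It assumes that each line of "GMGC10.relationships.txt" holds three whitespace-separated fields (ORF1, REL, ORF2) and that "alphabetical order" means plain string comparison of the ORF2 names as they appear in the file; neither assumption is confirmed in this thread. The function name `pick_representatives` and the `wanted` parameter are hypothetical. Because the full table has billions of rows, the sketch streams the input and lets the caller restrict it to the ORFs of interest.

```python
import io


def pick_representatives(lines, wanted=None):
    """Map each ORF1 to the alphabetically lowest ORF2 it is related to.

    lines  : iterable of text lines, assumed to be "ORF1 REL ORF2"
    wanted : optional set of ORF1 names to keep (hypothetical convenience,
             so that the whole table need not be held in memory)
    """
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        orf1, _rel, orf2 = fields
        if wanted is not None and orf1 not in wanted:
            continue
        # Keep the lowest ORF2 seen so far (plain string comparison).
        if orf1 not in best or orf2 < best[orf1]:
            best[orf1] = orf2
    return best


# The two records from the question, used as a tiny in-memory "file".
sample = io.StringIO(
    "SAMEA3664728_49210 R Fr12_174014655\n"
    "SAMEA3664728_49210 R Fr12_4011728\n"
)
reps = pick_representatives(sample, wanted={"SAMEA3664728_49210"})
print(reps)  # {'SAMEA3664728_49210': 'Fr12_174014655'}
```

With a real file, the same function could be called on an open file handle instead of the `io.StringIO` object. Whether "=" or "C" records should be preferred over "R" records when several exist is not stated in the thread, so the sketch ignores the relationship type.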

> For example, the file "GMGC10.relationships.txt" contains these records:
> SAMEA3664728_49210 R Fr12_174014655
> SAMEA3664728_49210 R Fr12_4011728
> Fr12_174014655 R Fr12_4011728
> Fr12_4011728 R Fr12_174014655
> I guess this means that the original ORF SAMEA3664728_49210 can be represented by both unigene Fr12_174014655 (GMGC10.015_033_689.PITA) and unigene Fr12_4011728 (GMGC10.033_806_784.PITA)?
> In this case, why are Fr12_174014655 and Fr12_4011728 not clustered together into the same unigene group?

Because Fr12_174014655 and Fr12_4011728 may not themselves be similar enough.
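
The fragment case can be made concrete with a toy example. The sequences below are invented for illustration and are not taken from the catalogue: A is a short fragment contained in both B and C, yet B and C agree at only half of their positions, so neither would represent the other at a 95% threshold.

```python
def identity(x, y):
    """Fraction of matching positions between two equal-length strings."""
    assert len(x) == len(y)
    return sum(cx == cy for cx, cy in zip(x, y)) / len(x)


a = "ACGTACGTAC"        # short fragment
b = a + "TTTTTTTTTT"    # shares the fragment, then diverges
c = a + "GGGGGGGGGG"    # shares the fragment, then diverges differently

print(a in b, a in c)   # True True
print(identity(b, c))   # 0.5
```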

HTH, Luis

fang talon

Dec 1, 2023, 3:40:17 AM

to gmgc-users

Dear Luis,

Thanks so much for your help! You really cleared up my question. I used to think the 'R' relationship meant a 95% similarity between two ORFs. But now I get it's more about a 95% match to a part of ORF2, just like you described on the website 😅. Super helpful, thanks again! 😊

In the paper, the section "Non-redundant gene catalogue construction" says:
"Step 3: the matches resulting from the previous step are filtered (in
nucleotide space) so that only ‘representable’ relationships are kept.
Namely, A is considered representable by B if there is a sequence A′
such that A′ is a substring of B and the edit distance from A to A′ is
≤5% of the length of A. When the lengths are identical (or similar), this
definition corresponds to the species-level 95% nucleotide identity
criterion (Extended Data Fig. 2a). When A is a fragment of B (even with
minor changes), however, then only B is kept. The result of this step is a
graph where each vertex is an input gene sequence and directed edges
correspond to representable relationships."

So if I understand correctly,
1. "R" relationship means "A is considered representable by B if there is a sequence A′
such that A′ is a substring of B and the edit distance from A to A′ is
≤5% of the length of A."
2. "=" relationship means "When the lengths are identical (or similar), this
definition corresponds to the species-level 95% nucleotide identity
criterion"
3. "C" relationship means "When A is a fragment of B (even with
minor changes), however, then only B is kept."

Many thanks,
Tao

Luis Pedro Coelho

Dec 5, 2023, 6:06:46 AM

to fang talon, gmgc-users

> So if I understand correctly,
> 1. "R" relationship means "A is considered representable by B if there is a sequence A′
> such that A′ is a substring of B and the edit distance from A to A′ is
> ≤5% of the length of A."

Yes.

> 2. "=" relationship means "When the lengths are identical (or similar), this
> definition corresponds to the species-level 95% nucleotide identity
> criterion"

No, "=" means the two sequences are strictly equal.

> 3. "C" relationship means "When A is a fragment of B (even with
> minor changes), however, then only B is kept."

No, "C" means A is a strict fragment of B: in Python terms, A == B[start:post] for some coordinates start and post.
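
The three relationship types, as clarified above, can be sketched in Python as follows. This is only an illustration of the definitions quoted in this thread, not the code used to build the catalogue. The function names are hypothetical, the order in which the three tests are tried is an assumption, and the quadratic dynamic programme (Sellers' approximate substring matching) would be far too slow for billions of ORFs. The 5% threshold is taken from the paper excerpt quoted earlier.

```python
def min_substring_edit_distance(a, b):
    """Smallest edit distance between a and any substring of b.

    Sellers' variant of the edit-distance recurrence: starting and ending
    anywhere in b is free, so only the aligned part of b is charged.
    """
    prev = [0] * (len(b) + 1)  # empty prefix of a matches anywhere at cost 0
    for i, ca in enumerate(a, start=1):
        cur = [i]  # a[:i] against an empty substring costs i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return min(prev)  # the match may end anywhere in b


def relation(a, b, max_pct=5):
    """Return "=", "C", "R", or None for sequences a (ORF1) and b (ORF2)."""
    if a == b:
        return "="  # strictly equal
    if a in b:
        return "C"  # a == b[start:post] for some start, post
    # Integer arithmetic avoids floating-point rounding at the threshold.
    if 100 * min_substring_edit_distance(a, b) <= max_pct * len(a):
        return "R"  # within 5% edit distance of some substring of b
    return None


# Invented toy sequences for illustration.
core = "ACGTTGCAAGCTAGGCTAAC"      # 20 nt
b = "GG" + core + "CC"
one_sub = "ACGTTGCAAGTTAGGCTAAC"   # core with one substitution (1/20 = 5%)

print(relation(b, b))         # =
print(relation(core, b))      # C
print(relation(one_sub, b))   # R
print(relation("T" * 20, b))  # None
```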

HTH, Luis
