understanding genes and isoforms in a "corseted" assembly

M. Olalla Lorenzo-Carballa

May 23, 2016, 9:31:14 AM

to corset-project

Hello Nadia and Alicia

I have used Corset to cluster my Trinity assembly, and I would now like to create a gene-to-transcript map for the clustered assembly, similar to the one Trinity produces, so that I can run DE analyses at both the gene and isoform levels on this "corseted" assembly using the scripts in the Trinity package.

The easiest way for me would be to create the gene-to-transcript map from the clusters.txt file. My question is therefore what one would call a gene, and what an isoform, when talking about Corset clusters and super-clusters.

Take the example below. This is what I understand from the wiki: "The cluster naming is of the form Cluster-X.Y. The X is the super-cluster ID. Any transcript which shares even a single read with another transcript will have the same super-cluster ID. The Y indicates the cluster number within the super-cluster (ie. those which resulted from the hierarchical clustering and expression testing)."

TRINITY_DN2726_c0_g2_i1    Cluster-0.0
TRINITY_DN2726_c0_g1_i1    Cluster-0.0
TRINITY_DN18976_c12_g1_i3    Cluster-1.0
TRINITY_DN18976_c12_g1_i1    Cluster-1.1
TRINITY_DN18976_c12_g1_i5    Cluster-1.1
TRINITY_DN18976_c12_g1_i2    Cluster-1.2
TRINITY_DN18976_c12_g1_i6    Cluster-1.3
TRINITY_DN18976_c12_g1_i4    Cluster-1.3
TRINITY_DN18976_c12_g1_i7    Cluster-1.3

My understanding is that a super-cluster would be a gene. Super-cluster 0 has only one isoform here (both Trinity transcripts are clustered together in Cluster-0.0), whereas super-cluster 1 has four isoforms (Cluster-1.0, -1.1, -1.2 and -1.3).

Is this correct? If not, any input would be greatly appreciated.

Thanks in advance

Olalla

Nadia Davidson

May 23, 2016, 9:37:22 PM

to corset-project

Hi Olalla,

For genes you should use the full cluster ID. So for example Cluster 1.1 could be considered a gene, and Cluster 1.2 a separate gene. The "super-cluster" level IDs shouldn't really be interpreted biologically (apart from that they may be genes that share some sequence).

Clustering contigs into isoforms would be very ambiguous, I think, and to the best of my knowledge this is not usually done. The best alternative would be to specify each contig as a separate isoform. So in your example above, gene "Cluster-1.3" has three isoforms: TRINITY_DN18976_c12_g1_i6, TRINITY_DN18976_c12_g1_i4 and TRINITY_DN18976_c12_g1_i7.

Of course, this is not ideal because these three contigs don't necessarily represent different isoforms (for example, they could be a single isoform that failed to be fully assembled). We've been working on a better solution to this problem and hope to release some software for it later in the year, so keep an eye out if you're interested.
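The mapping described above (full cluster ID as the gene, each contig as its own isoform) can be sketched in a few lines of Python. This is only an illustration, not part of Corset or Trinity: the function name `clusters_to_gene_trans_map` is made up for this example, and it assumes that clusters.txt holds one whitespace-separated "transcript cluster" pair per line and that the Trinity-style map lists the gene first and the transcript second.

```python
def clusters_to_gene_trans_map(clusters_lines):
    """Convert Corset clusters.txt rows (transcript, cluster) into
    (gene, transcript) pairs, using the full cluster ID as the gene."""
    rows = []
    for line in clusters_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        transcript_id, cluster_id = fields[0], fields[1]
        rows.append((cluster_id, transcript_id))
    return rows


# Rows taken from the example earlier in this thread.
example = [
    "TRINITY_DN18976_c12_g1_i6\tCluster-1.3",
    "TRINITY_DN18976_c12_g1_i4\tCluster-1.3",
    "TRINITY_DN18976_c12_g1_i7\tCluster-1.3",
]

for gene, transcript in clusters_to_gene_trans_map(example):
    # One tab-separated line per contig: gene first, then transcript.
    print(f"{gene}\t{transcript}")
```

To use it on a real file, the lines could be read with `open("clusters.txt")` and the printed output redirected to a map file; the file names are up to you.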

Cheers,

Nadia.

M. Olalla Lorenzo-Carballa

May 24, 2016, 2:11:40 AM

to Nadia Davidson, corset-project

Hello Nadia

Thanks a lot for your reply. It is really helpful, and I will most definitely check for any updates and/or new software on this matter :)

Kind regards

Olalla

Patrick Pereira

Nov 19, 2020, 4:35:33 PM

to corset-project

Hello everyone

I have the following situation:

I ran Corset on my Trinity assembly and found a lot of differentially expressed Corset clusters. However, in some cases I get the same BLAST hit for different Corset clusters.

e.g.

Corset_Cluster_ID Trinity_ID Swissprot_Blast_Hit

Cluster-37754.1 TRINITY_DN3306_c0_g1_i4 A0A2I0TFV8_LIMLA_Nedd4-binding_protein_2-like 2_OS=Limosa_lapponica_baueri_OX=1758121_GN=llap_17010_PE=4 SV=1

Cluster-37754.2 TRINITY_DN3306_c0_g1_i2 A0A2I0TFV8_LIMLA_Nedd4-binding_protein_2-like 2_OS=Limosa_lapponica_baueri_OX=1758121_GN=llap_17010_PE=4 SV=1

What is the meaning of the Corset IDs? Do the suffixes (e.g. .1 and .2) describe isoforms in these cases, or is each cluster treated as a separate gene even when the IDs are almost the same, as above?

Cheers,

Patrick

Nadia Davidson

Nov 19, 2020, 5:01:19 PM

to corset-project

Hi Patrick,

The .1, .2 etc. refer to the gene, not the isoform. There is a bit more explanation of this here: https://github.com/Oshlack/Corset/wiki/InstallingRunningUsage#output

It is not too unusual to see the same gene matching multiple clusters once you annotate them with blast and there can be a few reasons for this:

1. Assembly isn't perfect and a gene can be assembled in a fragmented way meaning that corset has difficulty clustering contigs from the same gene together. This is the most common reason in my experience.

2. There may be differential transcript usage in this gene. Corset, by default, uses expression patterns across experimental groups as part of the clustering and will separate clusters where the conditions are acting differently.

3. Cluster-37754.1 and Cluster-37754.2 could be two different paralogs of the gene you get the blast hit for.

You should be able to get an idea of which of these it is if you take the transcripts from both clusters and align them against each other using blast online. If only a small amount of sequence is shared by the clusters, it is likely reason 1 above. If most of the sequence is shared between the clusters, then reason 2, and if they share sequence but with a high level of mismatch, then reason 3.
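As a rough illustration of that rule of thumb, the decision could be written down as a small Python function. Everything specific here is an assumption made for the sketch: the function name, the two inputs (fraction of the shorter transcript covered by the alignment, and percent identity within the aligned region), and especially the cut-offs of 0.5 and 98.0, which are arbitrary placeholders and not values recommended in this thread or by Corset.

```python
def likely_reason(shared_fraction, percent_identity,
                  min_shared=0.5, min_identity=98.0):
    """Map a pairwise alignment summary onto the three reasons listed above.

    shared_fraction  -- fraction (0-1) of the shorter transcript that aligns
    percent_identity -- identity (0-100) within the aligned region
    min_shared, min_identity -- hypothetical cut-offs; tune for your data
    """
    if shared_fraction == 0:
        # Nothing aligns: consistent with separate fragments of one gene.
        return "1: fragmented assembly"
    if percent_identity < min_identity:
        # Sequence is shared but with many mismatches.
        return "3: paralogs"
    if shared_fraction < min_shared:
        # Only a small, near-identical overlap.
        return "1: fragmented assembly"
    # Most of the sequence is shared and near-identical.
    return "2: differential transcript usage"


# Made-up alignment summaries, for illustration only.
for shared, identity in [(0.1, 99.5), (0.9, 99.5), (0.9, 90.0)]:
    print(shared, identity, "->", likely_reason(shared, identity))
```

A function like this is only a first pass; borderline cases still need a look at the actual alignment.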

Hope this sort of makes sense and I'm happy to answer any other questions you have.

Cheers,

Nadia.

Patrick Pereira

Nov 20, 2020, 10:17:17 AM

to corset-project

Thank you very much for your reply,

In the cases you described as differential transcript usage, can I make a manual correction directly in the clusters.txt file by putting the transcripts into the same cluster?

e.g

change the above example to:

Corset_Cluster_ID Trinity_ID Swissprot_Blast_Hit

Cluster-37754.1 TRINITY_DN3306_c0_g1_i4 A0A2I0TFV8_LIMLA_Nedd4-binding_protein_2-like 2_OS=Limosa_lapponica_baueri_OX=1758121_GN=llap_17010_PE=4 SV=1

Cluster-37754.1 TRINITY_DN3306_c0_g1_i2 A0A2I0TFV8_LIMLA_Nedd4-binding_protein_2-like 2_OS=Limosa_lapponica_baueri_OX=1758121_GN=llap_17010_PE=4 SV=1

Cheers,

Patrick

Nadia Davidson

Nov 22, 2020, 11:09:21 PM

to corset-project

Hi Patrick,

You could do this if you only want to use the cluster file, but not if you want to use the count file, because the counts for Cluster-37754.1 will only include reads from the first transcript. Another option is to rerun Corset with the option "-D 9999999999999" or "-g 1,1,1,1", i.e. with all samples in the same group. This switches off the separation of transcripts that have distinct patterns of expression across conditions (such as differential transcript usage).
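A minimal sketch of how the "-g" variant of that command could be assembled, assuming the group list takes one label per input BAM file, in the same order as the files on the command line. The helper name `corset_single_group_args` and the BAM file names are made up for this example; the script only prints the command and does not run Corset.

```python
import shlex


def corset_single_group_args(bam_files):
    """Build a corset command line that puts every sample in group 1."""
    groups = ",".join("1" for _ in bam_files)  # e.g. "1,1,1,1" for four files
    return ["corset", "-g", groups, *bam_files]


# Hypothetical BAM files, one per sample.
bams = ["sampleA.bam", "sampleB.bam", "sampleC.bam", "sampleD.bam"]
print(shlex.join(corset_single_group_args(bams)))
```

The printed line can be pasted into a shell, or the argument list passed to `subprocess.run` once the file names match your own alignments.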