Corset clusters = Trinity contigs

Jake W

Aug 19, 2016, 11:11:20 AM

to corset-project

Dear Nadia (or others),
I ran Corset and obtained approximately the same number of clusters as Trinity contigs. In short, there was no reduction in the data, and when I scan the clusters.txt file I only see clusters with super cluster numbers and no subclusters (eg Cluster. I am concerned that maybe I missed a crucial parameter of the alignment. Any advice?

The experimental design is as follows:
-Trinity assembly from 16 paired-end time-series samples (1/timepoint) yielded 215857 Trinity 'genes' and 330722 transcripts.
-I mapped single-end data back to this assembly for triplicates per timepoint (3/timepoint = 48 samples) using bowtie:
for file in *.fastq;
do
        bowtie -n 2 -e 99999999 -m 200 \
        --phred33-quals -S -p 4 \
        Trinity.fasta \
        "$file" > "$file.sam.2" 2>"$file.log.2"
        samtools view -S -b "$file.sam.2" > "$file.bam.2";
done

Here's the samfile:

head 1-01.sam
@HD     VN:1.0 SO:unsorted
@SQ     SN:TRINITY_DN7_c0_g1_i1 LN:345
@SQ     SN:TRINITY_DN7_c0_g2_i1 LN:240
@SQ     SN:TRINITY_DN8_c0_g1_i1 LN:236
tail 1-01.sam
NS500449:235:HL5JJBGXX:4:23612:19920:20394      0       TRINITY_DN112585_c0_g3_i1       667     255     63M     *       0       0       GAACACTCTAATTTTTTCAAAGTAAACGTCGCAAGTCCTCCGCACACTCAGCTAAGAGCACAC        E/EE6EEEEEEEEEAEEEEEEEEAEAEEEEE/EEEEEEEEEEEAEEEE<EEEAEEEAEEE6E6 XA:i:0 MD:Z:63 NM:i:0

And here's the Corset call:
for FILE in `ls *.bam` ; do
   ./corset -r true-stop $FILE &
done
wait
/home/jwarner/Corset/Corset_code/corset-1.05-linux64/./corset -g 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7,8,8,8,\
    9,9,9,10,10,10,11,11,1,12,12,12,13,13,13,14,14,14,15,15,15,16,16,16 \
    -n R01A,R01B,R01C,R02A,R02B,R02C,R03A,R03B,R03C,\
    R04A,R04B,R04C,R05A,R05B,R05C,R06A,R06B,R06C,\
    R07A,R07B,R07C,R08A,R08B,R08C,R09A,R09B,R09C,\
    R10A,R10B,R10C,R11A,R11B,R11C,R12A,R12B,R12C,\
    R13A,R13B,R13C,R14A,R14B,R14C,R15A,R15B,R15C,\
    R16A,R16B,R16C \
    -i corset R*.corset-reads
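
With 48 entries in each of the -g and -n lists, a quick consistency check can catch slips before Corset is run. The following is a minimal sketch only; `check_groups` is a hypothetical helper name, not part of Corset, and it simply counts the comma-separated entries and the number of samples per group label.

```shell
# Hypothetical helper (not part of Corset): summarise a -g list against a -n list.
# Prints the number of entries in each list, then one "group<TAB>size" line per group.
check_groups() {
    g_list=$1
    n_list=$2
    n_groups=$(printf '%s\n' "$g_list" | tr ',' '\n' | grep -c .)
    n_names=$(printf '%s\n' "$n_list" | tr ',' '\n' | grep -c .)
    echo "groups=$n_groups names=$n_names"
    # Count how many samples carry each group label.
    printf '%s\n' "$g_list" | tr ',' '\n' | sort -n | uniq -c | awk '{print $2 "\t" $1}'
}

# Example with a short, made-up design of three groups in triplicate:
check_groups "1,1,1,2,2,2,3,3,3" "A1,A2,A3,B1,B2,B3,C1,C2,C3"
```

Run on the -g list above, this would report four samples in group 1 and two in group 11, so the `11,11,1` entry is worth double-checking.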

head clusters.txt
TRINITY_DN91229_c0_g1_i1        Cluster-0.0
TRINITY_DN93817_c0_g1_i1        Cluster-1.0
TRINITY_DN73581_c0_g1_i1        Cluster-2.0
TRINITY_DN83669_c0_g1_i1        Cluster-3.0

Thanks in advance,
Jacob

Nadia Davidson

Aug 21, 2016, 8:21:04 AM

to corset-project

Hi Jacob,

I can't see anything obviously wrong with the commands you've run. The most important thing is to allow multi-mapping and it looks like you've done that (-m 200). Depending on the dataset, it is possible to have assemblies that don't contain a lot of redundancy and so don't benefit as much from clustering. An example of this was the Trinity assembly of some yeast data in the Corset paper. To get an idea of what Corset is doing, would you be able to reply with the following statistics:

- The number of clusters reported by Corset (you can get this by running "wc -l clusters.txt")

- The number of contigs that pass Corset filtering (you can get this from "wc -l counts.txt")

Perhaps also paste a "head" of one of the .corset-reads files.
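
A minimal sketch of pulling these numbers straight from clusters.txt, assuming it keeps the two-column layout shown in the head output above (contig name, then cluster ID, separated by whitespace). `cluster_summary` is a hypothetical helper name, not a Corset command.

```shell
# Hypothetical helper (not part of Corset): summarise a clusters.txt-style file.
# Assumes column 1 is the contig name and column 2 is the cluster ID.
cluster_summary() {
    awk '
        NF >= 2 { contigs++; size[$2]++ }      # one line per contig
        END {
            for (id in size) {
                clusters++                     # distinct cluster IDs
                if (size[id] > 1) multi++      # clusters holding more than one contig
            }
            printf "contigs=%d clusters=%d multi_contig_clusters=%d\n", contigs, clusters, multi
        }
    ' "$1"
}

# Usage: cluster_summary clusters.txt
```

If `multi_contig_clusters` comes out at or near zero, essentially no contigs were merged, which matches the one-contig-per-cluster pattern described in the original post.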

Hopefully we can work this out.

Cheers,

Nadia.

Hadley Horch

Mar 2, 2018, 11:07:32 AM

to corset-project

Hi Nadia and Jacob,

This thread is a bit old, but we are finding a fairly similar thing. Trinity predicted 374,383 genes, and Corset gives us 326,190 lines in the clusters.txt file and 326,180 lines in the counts.txt file. My groups look like:

-g 1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7

In scrolling through the counts file, I note a lot of heterogeneity within my experimental groups. For example, even just looking at expression across three control animals, I see big differences--two controls will have 0 or low counts for many clusters, while the third sample will have very high counts, and this can be true across dozens or even hundreds of clusters. I work with a lab-bred cricket species. It has been cultured in lab for decades, gone through various bottlenecks, but there has been no attempt to truly inbreed and get a genetically identical line.

I re-ran corset without any grouping (so each sample was considered an independent experimental group), and the results were the same. Since I have 3 sets of age-matched controls, I next grouped all the controls together (even though they are age-matched to their experimental group and technically a little different). I did this on the assumption that more samples within groups might help with the ability to cluster? So the grouping looked like:

-g 1,1,1,2,2,2,1,1,1,3,3,3,1,1,1,4,4,4,5,5,5

Instead, I got less clustering:

wc -l clusters.txt = 379276

wc -l counts.txt = 379263

I next want to play with filtering (-l and -x), and am curious if you have any advice on values to use here. Also, if you have any other ideas for improving the amount of clustering, I'd love some advice. Perhaps there is just too much variability in the transcripts expressed in these animals to allow more clustering than what I've gotten so far?

Thanks so much for your help,

Hadley

Nadia Davidson

Mar 2, 2018, 11:29:16 PM

to corset-project

Hi Hadley,

If all your samples are the same species, the transcripts should cluster together nicely even if there's a bit of variability in what's expressed. So I don't think the grouping could explain what you see. My guess is that the issue is a more technical one, related to the way the reads were mapped. How were your reads mapped to the assembly?

Often this sort of thing happens if the reads weren't allowed to multi-map (usually this requires using an alignment parameter like --all).
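
One way to check whether multi-mapped alignments actually made it into the alignment files is to count how many reads have more than one alignment record. The sketch below assumes single-end reads (so each read name identifies one read) and an aligner that writes one SAM record per reported alignment; `multimap_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: count reads with one vs. several alignment records.
# Reads SAM text on stdin; assumes single-end data.
multimap_summary() {
    awk -F '\t' '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }      # FLAG bit 0x4 set: read is unmapped
        { hits[$1]++ }                     # one record per reported alignment
        END {
            for (r in hits) { if (hits[r] > 1) multi++; else single++ }
            printf "single=%d multi=%d\n", single, multi
        }
    '
}

# Usage with a BAM file (the file name is a placeholder):
#   samtools view -h sample.bam | multimap_summary
```

If `multi` is zero, every mapped read has a single reported alignment, and Corset has no shared reads with which to link contigs.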