transabyss


vittoria roncalli

Apr 4, 2012, 7:00:26 PM
to Shaun Jackman, abyss...@googlegroups.com
Hi Shaun,
I am assembling 6 RNA libraries (from different stages), paired-end sequenced with Illumina, using ABySS, and I plan to use Trans-ABySS to merge the assemblies I get with different k-mer sizes.
When downloading Trans-ABySS, I saw that it uses some external software (such as Bowtie) for different kinds of analysis. Since I do not have a reference genome, and the only analysis I plan to do with Trans-ABySS is merging assemblies, do I have to download all of this external software, or can I choose among the packages?
Another question: for an assembly to be considered good, what percentage of paired-end reads should it have? Is there a threshold below which I should consider my assembly bad?
Thanks in advance

Shaun Jackman

Apr 5, 2012, 8:02:35 PM
to vittoria roncalli, abyss...@googlegroups.com
Hi Vittoria,

For questions related to transcriptome assembly, contact the Trans-ABySS mailing list.

http://groups.google.com/group/trans-abyss
trans...@googlegroups.com

I typically see ~90% of the reads map back to the assembly.

Cheers,
Shaun

vittoria roncalli

Apr 10, 2012, 10:08:48 PM
to Shaun Jackman, abyss...@googlegroups.com
Hi Shaun.
I got the first assembly, in which I pooled all 6 of my libraries. I used k=40.
These are the results.

Mapped 382497306 of 401836653 reads (95.2%)
Mapped 382484223 of 401836653 reads uniquely (95.2%)
Read 401836653 alignments
Mateless     8035577  3.92%
Unaligned    4899383  2.39%
Singleton    9452447  4.61%
FR          11440936  5.58%
RF               993  0.000485%
FF            103781  0.0506%
Different  171002998  83.4%
Total      204936115
FR Stats mean: 153.4 median: 139 sd: 56.06 n: 11439720 min: 19 max: 782
       _▃▆██▇▅▃▁▁____
Mate orientation FR: 11440936 (100%) RF: 993 (0.00868%)
The library 7410-7415-k40-3.hist is oriented forward-reverse (FR).
Stats mean: 153.4 median: 139 sd: 56.06 n: 11439720 min: 19 max: 782
       _▃▆██▇▅▃▁▁____
Minimum and maximum distance are set to -39 and 782 bp.
Overlap -v  -k40 -g 7410-7415-k40-4.adj -o 7410-7415-k40-4.fa 7410-7415-k40-3.fa 7410-7415-k40-3.adj 7410-7415-k40-3.di:


Correct me if I am wrong, but what this tells me is that only about 6% of my paired-end reads are FR and 83.4% are Different.
Why do you think such a high percentage of my reads are not properly paired?
The coverage with all 6 libraries should be good (around 60-73x).
I was planning to do another assembly with k=50.
Do you think I should change other parameters as well?
Thanks
Vittoria


From: Shaun Jackman <sjac...@bcgsc.ca>
To: vittoria roncalli <vittoria...@yahoo.it>
Cc: "abyss...@googlegroups.com" <abyss...@googlegroups.com>
Sent: Thursday, April 5, 2012 14:02
Subject: Re: transabyss

Shaun Jackman

Apr 11, 2012, 1:05:13 PM
to vittoria roncalli, abyss...@googlegroups.com
Hi Vittoria,

These numbers look fine. They indicate that 6% of the read pairs map to the same contig with FR orientation, and 83% of the read pairs map to different contigs. It’s these latter reads that are used for building contigs and scaffolds.
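
As a quick check of how those percentages relate to the raw counts, here is a minimal Python sketch that recomputes them from the k=40 numbers posted above. The dictionary name and layout are illustrative assumptions, not part of ABySS; only the counts come from the log.

```python
# Pair-category counts copied from the k=40 log posted above.
# The variable names are illustrative; they are not ABySS identifiers.
pair_counts = {
    "Mateless": 8035577,
    "Unaligned": 4899383,
    "Singleton": 9452447,
    "FR": 11440936,
    "RF": 993,
    "FF": 103781,
    "Different": 171002998,
}

# The seven categories add up to the "Total" line of the log.
total = sum(pair_counts.values())

# Each percentage is simply the category count divided by that total.
percent = {name: 100 * count / total for name, count in pair_counts.items()}

for name, count in pair_counts.items():
    print(f"{name:<10}{count:>11}  {percent[name]:.3g}%")
print(f"{'Total':<10}{total:>11}")
```

Read this way, the FR and Different lines are shares of the same total, so a small FR share does not by itself mean that few reads were used.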

I would first try different values of k before changing other parameters.

Cheers,
Shaun

Shaun Jackman

Apr 12, 2012, 12:57:26 PM
to vittoria roncalli, abyss...@googlegroups.com
Hi Vittoria,

The following stats count pairs of reads (except for Mateless).

Mateless: reads that do not have a mate based on their read IDs
Unaligned: both reads of the pair do not align
Singleton: one read of the pair aligns and the other does not
FR: both reads align to the same sequence with forward-reverse orientation
RF: both reads align to the same sequence with reverse-forward orientation
FF: both reads align to the same sequence with forward-forward orientation
Different: the two reads align to different sequences

Mateless is ideally 0 (unless you know you have reads without mates). For unaligned and singleton, the smaller the better.
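
To make the definitions above concrete, here is a minimal Python sketch of how one read pair could be assigned to a category. This is an illustration only, not ABySS's implementation: the `Alignment` type and `classify_pair` function are hypothetical, both same-strand cases are lumped into FF as a simplifying assumption, and Mateless reads are left out because they are identified from read IDs before any pairing happens.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record for one read."""
    contig: str    # name of the sequence the read aligned to
    pos: int       # leftmost aligned position on that sequence
    reverse: bool  # True if the read aligned to the reverse strand


def classify_pair(a: Optional[Alignment], b: Optional[Alignment]) -> str:
    """Assign one read pair to a category; None means the read did not align."""
    if a is None and b is None:
        return "Unaligned"
    if a is None or b is None:
        return "Singleton"
    if a.contig != b.contig:
        return "Different"
    if a.reverse == b.reverse:
        # Assumption: both same-strand cases are reported as FF.
        return "FF"
    # Exactly one read is on the forward strand; compare positions.
    fwd, rev = (a, b) if not a.reverse else (b, a)
    return "FR" if fwd.pos <= rev.pos else "RF"


print(classify_pair(Alignment("c1", 10, False), Alignment("c1", 150, True)))
print(classify_pair(Alignment("c1", 10, False), Alignment("c2", 40, True)))
```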

Cheers,
Shaun

On 2012-04-11, at 17:28 , vittoria roncalli wrote:

> Hi Shaun, thanks for helping me.
> Could you give me an idea of what exactly the output names mean?
> What do Mateless and Different mean? Is it good that a percentage is low?
>
>
> From: Shaun Jackman <sjac...@bcgsc.ca>
> To: "vittoria...@yahoo.it" <vittoria...@yahoo.it>
> Sent: Wednesday, April 11, 2012 10:08
> Subject: Re: low % paired end assembly
>
> Hi Vittoria,
>
> The percentage of mapped reads is typically around 90%. Yours is 95%:
>
> > Mapped 382497306 of 401836653 reads (95.2%)
>
> Cheers,
> Shaun
>
> On 2012-04-11, at 10:55 , vittoria...@yahoo.it wrote:
>
> > Hi Shaun,
> > Sorry for my ignorance, but from your answer I understand that the percentage that should be around 90% is that of reads mapped to the assembly?
> > Is the FR percentage not the one I should look at?
> >
> >
> > Sent from my iPhone

vittoria roncalli

May 24, 2012, 4:18:58 PM
to Shaun Jackman, "abyss-users@googlegroups.com"
Hi,
I am trying to look at the stats of some assemblies I did, but I cannot figure out where to find the following information:

# of contigs
# contigs > 100 bp
Max contig length
Total (Mb)

The only thing I have is the STATS list. I am confused: if "n" is the number of contigs, and it is "14.45e6", does this mean that the assembly is not good (too many contigs)?
Could you help me figure out where I can get this information?


STATS

n        n:200  n:N50  min  N80  N50  N20   max   sum
14.45e6  73232  22757  200  237  336  625   9735  25.23e6  7410-7415-k35-unitigs.fa
14.41e6  67141  18425  200  255  416  839   9735  26.61e6  7410-7415-k35-contigs.fa
14.41e6  61538  14776  200  262  475  1099  9735  26.59e6  7410-7415-k35-scaffolds.fa




Thanks in advance


Vittoria 





From: Shaun Jackman <sjac...@bcgsc.ca>
To: vittoria roncalli <vittoria...@yahoo.it>
Sent: Friday, April 13, 2012 11:47
Subject: Re: low % paired end assembly

Hi Vittoria,

For questions related to transcriptome assembly, contact the Trans-ABySS mailing list.
http://groups.google.com/group/trans-abyss
trans...@googlegroups.com

For transcriptome assemblies, the more assemblies the better. I’d suggest every other value of k between 50 and 96.
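
For illustration, "every other value of k between 50 and 96" can be enumerated with a short Python sketch like the one below. The `abyss-pe` parameters `k`, `name` and `in` are its usual ones, but the read file names and the assembly name prefix here are hypothetical placeholders, and the commands are only printed, not run.

```python
# Every other k from 50 to 96 inclusive: 50, 52, ..., 96.
k_values = list(range(50, 97, 2))

# Hypothetical input files and name prefix; substitute your own.
reads = "reads_1.fq reads_2.fq"
commands = [
    f"abyss-pe k={k} name=assembly-k{k} in='{reads}'"
    for k in k_values
]

print(len(k_values), "assemblies")
for command in commands:
    print(command)
```

Whether the largest k values are usable depends on how your ABySS build was configured, so treat the upper end of the range as something to verify locally.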

The alignment stats refer to the reads aligned to the assembly.

Cheers,
Shaun


On 2012-04-13, at 14:41 , vittoria roncalli wrote:

> Hi Shaun,
> I got the results from 3 different assemblies using k= 40,50,60.
> As you can see, the percentage of mapped reads decreases with longer k-mers, while the unaligned and singleton percentages increase.
> Moreover, the N50 is very low compared to other papers. Is that normal? I thought that, with 100 bp reads, the best k-mer would be at least 50 bp long.
> What do you think? Should I try a k-mer between 50 and 60, or should I directly merge the assemblies with Trans-ABySS?
> Regarding the last email you sent me with all the definitions (which I really appreciated): when you say alignment, do you mean the alignment to the first series of contigs (SET) generated during the first step of ABySS?
> Thanks in advance
> I really appreciate your help
> Vittoria
>
>
> 6 Illumina libraries, k=40

> Mapped 382497306 of 401836653 reads (95.2%)
> Mapped 382484223 of 401836653 reads uniquely (95.2%)
> Read 401836653 alignments
> Mateless    8035577  3.92%
> Unaligned    4899383  2.39%
> Singleton    9452447  4.61%
> FR          11440936  5.58%
> RF              993  0.000485%
> FF            103781  0.0506%
> Different  171002998  83.4%
> Total      204936115
> FR Stats mean: 153.4 median: 139 sd: 56.06 n: 11439720 min: 19 max: 782
>        _▃▆██▇▅▃▁▁____
> Mate orientation FR: 11440936 (100%) RF: 993 (0.00868%)
> The library 7410-7415-k40-3.hist is oriented forward-reverse (FR).
> Stats mean: 153.4 median: 139 sd: 56.06 n: 11439720 min: 19 max: 782
>        _▃▆██▇▅▃▁▁____
> Minimum and maximum distance are set to -39 and 782 bp.
> Overlap -v  -k40 -g 7410-7415-k40-4.adj -o 7410-7415-k40-4.fa 7410-7415-k40-3.fa 7410-7415-k40-3.adj 7410-7415-k40-3.di:
>
> STATS
> n        n:200  n:N50  min  N80  N50  N20   max   sum
> 12.89e6  90536  28532  200  234  328  597   9735  30.57e6  7410-7415-k40-unitigs.fa
> 12.84e6  82837  22874  200  251  406  817   9735  32.28e6  7410-7415-k40-contigs.fa
> 12.84e6  76788  18741  200  257  454  1050  9850  32.26e6  7410-7415-k40-scaffolds.fa
>
> k=50
> Mapped 369922722 of 401836653 reads (92.1%)
> Mapped 369916919 of 401836653 reads uniquely (92.1%)

> Read 401836653 alignments
> Mateless    8035577  3.92%
> Unaligned    8077266  3.94%
> Singleton  15638233  7.63%
> FR          11646160  5.68%
> RF              1381  0.000674%
> FF            79213  0.0387%
> Different  161458285  78.8%
> Total      204936115
> FR Stats mean: 159.2 median: 142 sd: 60.3 n: 11644940 min: 18 max: 808
>        _▃▆██▇▅▃▂▁▁▁____
> Mate orientation FR: 11646160 (100%) RF: 1381 (0.0119%)
> The library 7410-7415-k50-3.hist is oriented forward-reverse (FR).
> Stats mean: 159.2 median: 142 sd: 60.3 n: 11644940 min: 18 max: 808
>        _▃▆██▇▅▃▂▁▁▁____
> Minimum and maximum distance are set to -49 and 808 bp.
> Overlap -v  -k50 -g 7410-7415-k50-4.adj -o 7410-7415-k50-4.fa 7410-7415-k50-3.fa 7410-7415-k50-3.adj 7410-7415-k50-3.dist
>
> STATS
> n        n:200   n:N50  min  N80  N50  N20  max    sum
> 9905128  140135  45198  200  230  314  569  9735   45.99e6  7410-7415-k50-unitigs.fa
> 9834147  128262  35938  200  245  388  788  9970   48.51e6  7410-7415-k50-contigs.fa
> 9825742  119857  29876  200  249  428  989  10096  48.44e6  7410-7415-k50-scaffolds.fa
> 7410-7415-k50-stats (END)
>
> k=60
> Mapped 348418095 of 401836653 reads (86.7%)
> Mapped 348413692 of 401836653 reads uniquely (86.7%)

> Read 401836653 alignments
> Mateless    8035577  3.92%
> Unaligned  13234855  6.46%
> Singleton  26784300  13.1%
> FR          11476178  5.6%
> RF              1329  0.000648%
> FF            53166  0.0259%
> Different  145350710  70.9%
> Total      204936115
> FR Stats mean: 165.6 median: 148 sd: 63.54 n: 11474992 min: 18 max: 817
>        _▃▆██▇▆▅▄▃▂▁▁▁____
> Mate orientation FR: 11476178 (100%) RF: 1329 (0.0116%)
> The library 7410-7415-k50-3.hist is oriented forward-reverse (FR).
> Stats mean: 165.6 median: 148 sd: 63.54 n: 11474992 min: 18 max: 817
>        _▃▆██▇▆▅▄▃▂▁▁▁____
> Minimum and maximum distance are set to -59 and 817 bp.
> Overlap -v  -k60 -g 7410-7415-k50-4.adj -o 7410-7415-k50-4.fa 7410-7415-k50-3.fa 7410-7415-k50-3.adj 7410-7415-k50-3.dist
> 269+    7200234-
> STATS
>
> n        n:200   n:N50  min  N80  N50  N20  max   sum
> 7311433  201718  66477  200  226  302  536  9735  64.2e6   7410-7415-k50-unitigs.fa
> 7207780  184550  53021  200  239  371  742  9735  67.65e6  7410-7415-k50-contigs.fa
> 7194874  171644  43460  200  243  409  950  9774  67.53e6  7410-7415-k50-scaffolds.fa

>
> From: Shaun Jackman <sjac...@bcgsc.ca>
> To: vittoria roncalli <vittoria...@yahoo.it>
> Cc: abyss...@googlegroups.com
> Sent: Thursday, April 12, 2012 6:57

Shaun Jackman

May 25, 2012, 4:04:57 PM
to vittoria roncalli, "abyss-users@googlegroups.com"
Hi Vittoria,

The number of contigs at least 200 bp is listed in the column `n:200`. Yes, there are a lot of small contigs, but you can filter them out. The maximum contig length is listed in the column `max`. The total of contigs at least 200 bp is listed in the column `sum`.
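
To show how these columns relate to one another, here is a minimal Python sketch that computes them from a list of contig lengths. It is an approximation written for illustration, not the code ABySS uses: the function name `assembly_stats` is hypothetical, and it assumes that N80, N50 and N20 are computed only over contigs that pass the 200 bp cutoff.

```python
def assembly_stats(lengths, min_len=200):
    """Summarise contig lengths in the style of the STATS table (a sketch)."""
    lengths = list(lengths)
    # Only contigs at least min_len long count towards the size statistics.
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)

    def nxx(fraction):
        # Length of the contig at which the running sum of lengths, taken
        # from longest to shortest, first reaches the given share of the total.
        running = 0
        for length in kept:
            running += length
            if running >= fraction * total:
                return length
        return 0

    n50 = nxx(0.5)
    return {
        "n": len(lengths),                                # all contigs
        f"n:{min_len}": len(kept),                        # contigs >= min_len
        "n:N50": sum(1 for n in kept if n >= n50) if kept else 0,
        "min": kept[-1] if kept else 0,
        "N80": nxx(0.8),
        "N50": n50,
        "N20": nxx(0.2),
        "max": kept[0] if kept else 0,
        "sum": total,                                     # bases, not contigs
    }


print(assembly_stats([100, 200, 300, 400, 500]))
```

In this toy input, `n` is 5 but only 4 contigs reach 200 bp, and `sum` is the number of bases in those 4 contigs.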

Cheers,
Shaun

Shaun Jackman

May 25, 2012, 4:32:40 PM
to vittoria roncalli, abyss...@googlegroups.com
Hi Vittoria,

Yes, it’s reasonably normal for a de Bruijn graph assembly of transcriptome data. The small contigs may be genomic contamination. You can filter them out with the attached script:
faclean -l200
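
The contents of the attached script are not shown in this thread. As a rough illustration of the same idea (dropping sequences shorter than 200 bp from a FASTA file), here is a minimal Python sketch; the function names are hypothetical and this is not the `faclean` script itself.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=200):
    """Keep only the records whose sequence is at least min_len bases long."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq


# Tiny in-memory example with a cutoff of 5 bases instead of 200.
example = [">a", "ACGT", ">b", "ACG", "TTT"]
for header, seq in filter_by_length(read_fasta(example), min_len=5):
    print(f">{header}\n{seq}")
```

For a real assembly, the same functions could be applied to an open file handle instead of the in-memory list.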

Cheers,
Shaun

faclean

Vittoria Roncalli

May 25, 2012, 8:56:30 PM
to ABySS
Hi Shaun, sorry to bother you again about the same topic.
I was reading through your answer, but I got confused about the definition of "sum".
If n:200 is the number of contigs of at least 200 bp, I do not understand what sum means.
I just want to determine the number of contigs to get an idea of how well the assembly worked.
Thanks, and sorry to bother you again.
> On 2012-05-25, at 13:16 , vittoria roncalli wrote:
>
> > Hi Shaun, thanks for the information.
> > How can I remove them?
> > Is it normal that there are so many small contigs?
> > Thanks
> > Vittoria

Matthew MacManes

May 26, 2012, 1:25:56 PM
to ABySS
sum is the total number of bases contained in contigs of at least 200 bp in length.
Matt
_______________________________________
Matthew MacManes, PhD
Postdoctoral Scholar – Fellow
University of California, Berkeley
California Institute for Quantitative Biosciences
Personal Website: http://macmanes.com/