Abyss output

642 views
Skip to first unread message

bioinfo

unread,
Mar 23, 2011, 12:59:48 AM3/23/11
to ABySS
Hi group,

I have been trying to use ABySS for quite sometimes now. I just have
single end reads and the
./ABYSS input.bam -k $k -o tmp$k runs just fine. However, checking
the assembly files, there is a serious decline in assembly size with
every single increase in k value. For example:

17127582 for k=20
15420323 for k=21
13999661 for k=22
12816157 for k=23
11926295 for k=24
10244851 for k=25
8946369 for k=26
8034562 for k=27
7251599 for k=28
6627152 for k=29
6297644 for k=30
5426612 for k=31
4686761 for k=32
3991264 for k=33
3469106 for k=34
3199867 for k=35
2630821 for k=36
2149180 for k=37
1752301 for k=38
1460721 for k=39
1290036 for k=40

The total number of reads in this library is about 16 million and
about 58% of the sequences map with genomic DNA. My reads length is 50
bases and it is single end and the genome size is about 80 MB.

My question is:

1. What will be my best bet in combining these assemblies? Can someone
run abyss-pe if they just have the sorted bam files from bowtie
alignment for single-end reads?
2. When I run abyss-pe on my bam files as input using the following
shell script, it throws error saying
make: *** empty variable name. Stop.

for i in {25..49};

do
mkdir k$i;
cd k$i;
abyss-pe -k = $i in='/data/bowtieOutput/PS-1_F3V1.bam.sorted.bam' -o
tmp E=0 n
=10 v=-v;
cd ..;

done


Any help in this regard and merging the abyss assemblies will be
greatly appreciated.

Thanks

Shaun Jackman

unread,
Mar 23, 2011, 12:29:23 PM3/23/11
to bioinfo, ABySS
Hi,

You can use abyss-pe with single-end reads like so:
abyss-pe name=assembly k=$k n=10 v=-v se=input.bam

Your coverage is 10x, which is low:
$ calc 16e6*50/80e6
10
If you consider that only 58% of your reads map, it's even lower (5.8x).
Assembly requires at least 20x coverage for reasonable results. 50x or
more is better.

With low-coverage high-quality data, you could try an overlap assembler
(rather than a de Bruijn assembler) such as String Graph Assembler
(SGA):
https://github.com/jts/sga

You'll want to try adjusting the minimum overlap parameter.

Cheers,
Shaun

Shaun Jackman

unread,
Mar 23, 2011, 12:31:20 PM3/23/11
to bioinfo, ABySS
> 1. What will be my best bet in combining these assemblies? Can someone
> run abyss-pe if they just have the sorted bam files from bowtie
> alignment for single-end reads?

Most likely, the smaller assemblies are subsets of the larger
assemblies, and there will be nothing (or very little) to be gained by
merging these assemblies.

If you do try an overlap assembler as I suggested in my previous email,
you could try assembling the reads along with the ABySS contigs. I'd be
curious to hear of your experience if you try this approach.

Cheers,
Shaun

On Tue, 2011-03-22 at 21:59 -0700, bioinfo wrote:

Shaun Jackman

unread,
Mar 23, 2011, 4:22:38 PM3/23/11
to Sucheta, ABySS
Hi Sucheta (cc'ed to abyss-users)

Ah, I didn't realize that you had RNAseq data. We have a separate
software package, Trans-ABySS (which uses ABySS) for that. It merges the
assemblies using BLAT.
http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss

and the discussion group:
https://groups.google.com/forum/#!forum/trans-abyss

Those N50 sound very low though. Did you calculate those numbers using
abyss-fac (included with ABySS)?

Cheers,
Shaun

On Wed, 2011-03-23 at 12:35 -0700, tsuc...@gmail.com wrote:
> Thanks Shaun for the reply!!
> Actually the smaller assemblies have N50 size about double that of N50
> of larger assemblies. For example, for k=20, my N50 is just 24 whereas
> for k approaching 35, the N50 goes upto 45. Now since N50s are not
> real metrics for judging assemblies in RNAseq data, I am not sure
> which k value is appropriate. At the same time, the larger assemblies
> have very short contigs.
>
> Actually our covergae is not 10X as pointed by you in your earlier
> email. The genome size is 80 MB, but the total genes may be around
> 8-10 MBs. Since this is RNAseq data, my assumption is our coverage is
> way over 50X. I will certainly try and use an overlap assembler and
> see how it works. Do you know of any good assemblers that does this
> type of job specifically?
>
> Thanks
>
> Sucheta

Genome Analysis

unread,
Mar 23, 2011, 4:32:34 PM3/23/11
to Shaun Jackman, ABySS
Thanks Shaun for the response!! I have tried using trans-abyss but somehow can't get past the naming conventions of the directories. For example, trans-abyss wants me to have abyss assemblies with each k value inside k$k directory, where there will be the accessory files such as .adj ... along with fasta contig file . However, my runs with abyss only produces a contigs.fa(since I run abyss with simplest commands) file not anything else and trans-abyss does not run with that! So, I am stuck in a circular problem here. I would like to use abyss-pe (If that produces accessory files) and try running trans-abyss OR if there is any other recommendation that will also do.

Regarding N50, I calculated those with my own script.

Many thanks for your time.

Sucheta
--

Shaun Jackman

unread,
Mar 23, 2011, 6:13:47 PM3/23/11
to Genome Analysis, ABySS
Hi Sucheta,

You can use abyss-pe for assembly of single-end data like so:
abyss-pe name=${name} k=${k} E=0 n=10 v=-v se=input.bam

Can you report the results of running abyss-fac on your contigs?

You're best off contacting the Trans-ABySS team on their discussion
forum for Trans-ABySS-specific issues.
https://groups.google.com/forum/#!forum/trans-abyss

Cheers,
Shaun

tsuc...@gmail.com

unread,
Mar 24, 2011, 1:18:04 PM3/24/11
to Shaun Jackman, Genome Analysis, ABySS
Hi Shaun,

The abyss-pe commandline suggested by you runs fine with my bam files AND clearly the results are distinctly different from the simple run of ./ABYSS <input.bam> -k $k -o out.fa. May be you can explain little bit on this as well as on the nomenclature of the fasta header...

Here is a summary of running abyss-fac on the assembly:

k=25
n n:100 n:N50 min median mean N50 max sum
562211 16029 5075 100 137 170 168 1864 2731389 k25/assembly-1.fa
k=26
n n:100 n:N50 min median mean N50 max sum
467074 15122 4824 100 136 168 165 1864 2548725 k26/assembly-1.fa
k=27
n n:100 n:N50 min median mean N50 max sum
399836 14319 4615 100 135 166 163 1864 2386927 k27/assembly-1.fa
k=28
n n:100 n:N50 min median mean N50 max sum
345212 13539 4395 100 134 164 160 1864 2231192 k28/assembly-1.fa
k=29
n n:100 n:N50 min median mean N50 max sum
304060 12584 4113 100 133 163 158 1864 2052623 k29/assembly-1.fa
k=30
n n:100 n:N50 min median mean N50 max sum
282721 11682 3854 100 131 161 155 1864 1881141 k30/assembly-
k=31
n n:100 n:N50 min median mean N50 max sum
234415 10772 3579 100 130 159 153 1864 1718610 k31/assembly-
k=32
n n:100 n:N50 min median mean N50 max sum
195054 9748 3256 100 129 157 151 1864 1540179 k32/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k33/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
159198 8927 3021 100 127 155 148 1864 1387642 k33/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k34/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
134332 8060 2751 100 126 153 146 1806 1238877 k34/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k35/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
123204 7042 2428 100 126 152 144 1806 1072462 k35/assembly-
1.fa
n n:100 n:N50 min median mean N50 max sum
98421 6081 2100 100 125 151 143 1411 921234 k36/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k37/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
78223 5158 1791 100 125 150 142 1411 775682 k37/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k38/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
62940 4172 1461 100 124 149 141 1411 622127 k38/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k39/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
52339 3393 1205 100 123 147 139 1180 500025 k39/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k40/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
47165 2671 959 100 123 146 138 908 390832 k40/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k41/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
38913 1954 706 100 123 145 136 908 283945 k41/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k42/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
29374 1487 553 100 123 142 135 796 211744 k42/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k43/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
23673 1057 400 100 123 140 132 504 148357 k43/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k44/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
17068 720 283 100 120 134 127 642 97086 k44/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k45/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
14240 387 158 100 115 128 121 351 49589 k45/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k46/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
9505 172 71 100 114 126 119 267 21704 k46/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k47/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
4784 43 17 100 112 130 121 273 5601 k47/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k48/assembly-1.fa
n n:100 n:N50 min median mean N50 max sum
1948 26 11 100 120 132 125 305 3451 k48/assembly-
1.fa
sutripa@dwarrowdelf:~/abyss-1.2.5> ./bin/abyss-fac k49/assembly-1.fa
warning: `k49/assembly-1.fa' is empty

Now, looks like my best bet is to use k=25(>?!). What is the content of *bubbles* file in the k directories?

Many thanks

Shaun Jackman

unread,
Mar 24, 2011, 1:50:04 PM3/24/11
to tsuc...@gmail.com, ABySS
Hi Sucheta,

You can pass multiple files to abyss-fac:
abyss-fac k??/assembly-1.fa

The FASTA header output by ABySS is
>ID LENGTH COVERAGE

Coverage is the total k-mer coverage, which is not read coverage.

The ${name}-bubbles.fa file contains variant sequences. You can align
these sequences back to your contigs.

With RNAseq data, small values of k better assemble less-expressed
transcripts, and larger values of k better assemble more-expressed
transcripts. For this reason, rather than picking a single value of k, I
recommend merging the assemblies. Trans-ABySS uses BLAT to merge the
assemblies.

Cheers,
Shaun

bioinfo

unread,
Apr 6, 2011, 9:42:00 PM4/6/11
to ABySS
Hi Shaun,

Towards the end of the assembly-3.fa file, I get some values like this
in the header. I am wondering what these values mean?

>577961 1033 101526 438597+,162407+,459622+
>577962 324 50538 457168-,409760-,516232-
>577963 79 721 510854-,335250-,528513-

Thanks

Sucheta

Rod Docking

unread,
Apr 7, 2011, 12:54:50 PM4/7/11
to bioinfo, ABySS
Hi Sucheta:

    These are contigs that are made up of multiple smaller contigs.  The first one reads:

    "Contig ID 577961 is 1033bp long with coverage 101,526, and is made up of contigs 438597 in plus orientation, 162407 in plus orientation, and 459622 in plus orientation".  You should be able to find those contigs in one of the earlier output files (i.e. assembly-1.fa).

Cheers,
Rod
Reply all
Reply to author
Forward
0 new messages