ABYSS single-end assembly result


ZH

Jul 10, 2013, 12:50:09 AM
to abyss...@googlegroups.com
Hello everyone, I ran an ABySS single-end assembly and got a result, but I cannot understand what the contents of the FASTA file mean.
I used the command "ABYSS -k25 reads.fa -o contigs.fa".
The resulting contigs.fa looks like this:

>0 38 86
GTACGTACGTACGTTACGTTACGTTACGTACGTACGTA
>1 33 55
GTACGTACGTACGTTACGTTACGTACGTACGTA
>2 44 171
TCCAAGTAAAGTATAAAGGAAAAAACTGATATGCTGCCTTGATC
>3 35 83
AACCTACGGGTTTTTTAAGATTTTTCAATACCCAT
>4 35 150
AACCTACGGGTTTTTTAAGATTTTCAATACCCATG
>5 25 151
CAGGGGCCTTGTGCAGTAAACCCCC
>6 51 115
TTTTATCTTGTTTCAATTTTATTTATTATCCCTGATCCGGAAGTAACCTTT

I searched a lot on the internet but found nothing.

Thanks so much for everyone's help.

Ka Ming Nip

Jul 10, 2013, 12:37:13 PM
to abyss...@googlegroups.com
Hello,

Thanks for using ABySS.

Wikipedia has a good description of the FASTA format:
http://en.wikipedia.org/wiki/FASTA_format

You should generate an assembly with `abyss-pe' instead of `ABYSS'.

You may also consider using a larger k-mer.

Regards,
Ka Ming

ZH

Jul 17, 2013, 4:03:30 PM
to abyss...@googlegroups.com
Thanks for replying.
I used ABYSS because the reads are single-end, so I cannot use abyss-pe, which is for paired-end reads.
Also, the result is in FASTA format, but in the name part after ">" the output has some numbers, and those numbers indicate something. It's not a FASTA format problem.

Ben Vandervalk

Jul 17, 2013, 5:54:27 PM
to ZH, abyss...@googlegroups.com
Hi ZH,

In the output FASTA file from ABYSS, the numbers in the FASTA header are:

<CONTIG_ID> <KMER_COVERAGE> <SEQUENCE_LENGTH>

- Ben


--
You received this message because you are subscribed to the Google Groups "ABySS" group.
To unsubscribe from this group and stop receiving emails from it, send an email to abyss-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

ZH

Jul 17, 2013, 6:09:21 PM
to abyss...@googlegroups.com, ZH
Hi, but in my result, the second number seems to be the length of its sequence.

Ben Vandervalk

Jul 17, 2013, 6:11:03 PM
to ZH, abyss...@googlegroups.com
Oops, my mistake.  It's

<CONTIG_ID> <SEQ_LEN> <KMER_COVERAGE>

- Ben

ZH

Jul 17, 2013, 6:13:25 PM
to abyss...@googlegroups.com, ZH
Thank you again. Could you please explain what <KMER_COVERAGE> means? Wasn't the k-mer value set before running ABySS? Also, why is <KMER_COVERAGE> larger than the length of the sequence?

-ZH

Ben Vandervalk

Jul 17, 2013, 6:58:00 PM
to ZH, abyss...@googlegroups.com
Hi ZH,

Kmer coverage is an approximation of read coverage based on kmers (sequences of length k).  It is the "number of read kmers per contig kmer". 

The formula for kmer coverage is:

Ck = sum (multiplicity(kmer_i))  / (L - k + 1)

where

Ck = kmer coverage
multiplicity(kmer_i) = number of times that kmer occurs in the reads
L = length of contig
k = kmer size

and the sum runs from i = 1 to i = (L - k + 1)

I made another error in my previous reply -- the third number is actually not KMER_COVERAGE, but the numerator of the formula above, i.e. sum(multiplicity(kmer_i))

Kmer coverage is a useful approximation for read coverage because it can be computed without aligning the reads to the assembly.
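
As a rough illustration of the formula, here is a minimal Python sketch that recovers Ck from a header line, assuming the header has the form "<CONTIG_ID> <SEQ_LEN> <sum(multiplicity(kmer_i))>" and that k is the value passed to -k. The function name kmer_coverage is made up for this example and is not part of ABySS.

```python
def kmer_coverage(header, k):
    """Return (contig_id, length, Ck) for one FASTA header line.

    Assumes the header fields are: contig ID, sequence length L,
    and the sum of k-mer multiplicities over the contig.
    """
    contig_id, length, kmer_sum = header.lstrip(">").split()[:3]
    length, kmer_sum = int(length), int(kmer_sum)
    n_kmers = length - k + 1  # number of k-mers in a contig of length L
    if n_kmers < 1:
        raise ValueError("contig is shorter than k")
    return contig_id, length, kmer_sum / n_kmers


# Headers taken from the contigs.fa excerpt earlier in this thread (k = 25).
for header in [">0 38 86", ">2 44 171", ">5 25 151"]:
    print(kmer_coverage(header, k=25))
```

Contig 5 is exactly 25 bases long, so it contains a single 25-mer and Ck equals the third number (151). That is also why the third number can be larger than the sequence length: it is a sum of counts, not a length.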

- Ben

ZH

Jul 22, 2013, 3:48:41 PM
to abyss...@googlegroups.com, ZH
Sorry, I forgot to thank you for the useful message.
By the way, among the result files I got there are several .fa files named bubbles.fa, indel.fa and unitigs.fa. The unitigs.fa file is the final result, right? The reads I used for the assembly are hundreds or even thousands of bases long; actually they are not reads but partial genome sequences of a species. So how can I get the whole assembled sequence from the result? In the result, the sequences are in separate pieces.

ZH

Tony Raymond

Jul 24, 2013, 8:21:45 PM
to ZH, abyss...@googlegroups.com
Hi ZH,

Sorry if this was answered already, but the final assembly you are looking for is the unitigs.fa file. The bubbles.fa and indel.fa contain variant sequences, which were removed to make the overall assembly more contiguous.

Cheers,
Tony

Zhaoming Gao

Sep 12, 2013, 11:57:28 PM
to abyss...@googlegroups.com, ZH
Hi Vandervalk,

I am also looking for this information. Could you tell me how to get the read coverage of each contig from the "KMER_COVERAGE" or "sum(multiplicity(kmer_i))" value in the FASTA file? So the third number is sum(multiplicity(kmer_i)), not KMER_COVERAGE, is that right? By the way, I am using ABySS 1.3.6.

Another question: I got three FASTA files. What is the difference between unitigs.fa and contigs.fa?

3141540 58729   21171   500     580     754     1109    13442   45.3e6  M4_abyss-unitigs.fa
3136055 58479   20425   500     584     772     1169    24997   46.41e6 M4_abyss-contigs.fa
3135319 58417   20309   500     585     775     1178    28946   46.57e6 M4_abyss-scaffolds.fa


Best regards,

Zhaoming GAO




Tony Raymond

Sep 19, 2013, 2:02:02 PM
to Zhaoming Gao, abyss...@googlegroups.com, ZH
Hi,

Sorry for the delayed response! You are correct that the third number is the sum(multiplicity(kmer_i)).
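
On getting from k-mer coverage to read coverage: one commonly used approximation, assuming all reads have the same length l and ignoring sequencing errors, is C = Ck * l / (l - k + 1), since a read of length l contributes l - k + 1 kmers. The sketch below only illustrates that relation; base_coverage is a hypothetical helper, not an ABySS function, and the read length must be supplied by you.

```python
def base_coverage(kmer_cov, read_len, k):
    """Approximate read (base) coverage from k-mer coverage.

    Assumption: every read has length read_len, so each read
    contributes (read_len - k + 1) k-mers.
    """
    if read_len < k:
        raise ValueError("reads shorter than k contribute no k-mers")
    return kmer_cov * read_len / (read_len - k + 1)


# Illustrative values only: Ck = 7.6 with 100 bp reads and k = 25.
print(base_coverage(7.6, read_len=100, k=25))
```

For reads of very uneven length this is only a rough guide, and aligning the reads back to the assembly would give a more direct estimate.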

As for the output files, I hope the attached (super simplified) flowchart will help with the understanding.

Cheers,
Tony

PastedGraphic-9.pdf