ABYSS single-end assembly result

ZH

Jul 10, 2013, 12:50:09 AM

to abyss...@googlegroups.com

Hello everyone, I got a result from an ABYSS single-end assembly, but I cannot understand what the contents of the FASTA file mean.

I used the command "ABYSS -k25 reads.fa -o contigs.fa".

The resulting contigs.fa looks like this:

>0 38 86
GTACGTACGTACGTTACGTTACGTTACGTACGTACGTA
>1 33 55
GTACGTACGTACGTTACGTTACGTACGTACGTA
>2 44 171
TCCAAGTAAAGTATAAAGGAAAAAACTGATATGCTGCCTTGATC
>3 35 83
AACCTACGGGTTTTTTAAGATTTTTCAATACCCAT
>4 35 150
AACCTACGGGTTTTTTAAGATTTTCAATACCCATG
>5 25 151
CAGGGGCCTTGTGCAGTAAACCCCC
>6 51 115
TTTTATCTTGTTTCAATTTTATTTATTATCCCTGATCCGGAAGTAACCTTT

I searched online extensively but found nothing.

Thanks very much for any help.

Ka Ming Nip

Jul 10, 2013, 12:37:13 PM

to abyss...@googlegroups.com

Hello,

Thanks for using ABySS.

Wikipedia has a good description of the FASTA format:
http://en.wikipedia.org/wiki/FASTA_format

You should generate an assembly with `abyss-pe' instead of `ABYSS'.

You may also consider using a larger k-mer.

Regards,
Ka Ming

ZH

Jul 17, 2013, 4:03:30 PM

to abyss...@googlegroups.com

Thanks for replying.

I used ABYSS because my reads are single-end; abyss-pe is for paired-end reads, so I cannot use it.

Also, the result is in FASTA format, but the header lines after ">" contain numbers, and those numbers must mean something. My question is not about the FASTA format itself.

Ben Vandervalk

Jul 17, 2013, 5:54:27 PM

to ZH, abyss...@googlegroups.com

Hi ZH,

In the output FASTA file from ABYSS, the numbers in the FASTA header are:

<CONTIG_ID> <KMER_COVERAGE> <SEQUENCE_LENGTH>

- Ben


ZH

Jul 17, 2013, 6:09:21 PM

to abyss...@googlegroups.com, ZH

Hi, but in my results the second number appears to be the length of the sequence.

Ben Vandervalk

Jul 17, 2013, 6:11:03 PM

to ZH, abyss...@googlegroups.com

Oops, my mistake. It's

<CONTIG_ID> <SEQ_LEN> <KMER_COVERAGE>

- Ben

ZH

Jul 17, 2013, 6:13:25 PM

to abyss...@googlegroups.com, ZH

Thank you again. Could you please explain what <KMER_COVERAGE> means? Wasn't the k-mer value set before running ABYSS? Also, why is <KMER_COVERAGE> larger than the length of the sequence?

-ZH

Ben Vandervalk

Jul 17, 2013, 6:58:00 PM

to ZH, abyss...@googlegroups.com

Hi ZH,

Kmer coverage is an approximation of read coverage based on kmers (sequences of length k). It is the "number of read kmers per contig kmer".

The formula for kmer coverage is:

Ck = sum (multiplicity(kmer_i)) / (L - k + 1)

where

Ck = kmer coverage
multiplicity(kmer_i) = number of times that kmer occurs in the reads
L = length of contig
k = kmer size

and the sum runs from i = 1 to i = (L - k + 1), i.e. over every kmer in the contig.
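
As a rough illustration of the formula above, here is a minimal Python sketch that counts k-mers in a set of reads and evaluates Ck for one contig. The function names are hypothetical, and the sketch only counts k-mers on the forward strand; it is not ABySS's own implementation, which may handle reverse complements and other details differently.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_coverage(contig, reads, k):
    """Return (sum of multiplicities, Ck) for one contig.

    Assumption: multiplicity(kmer) is the number of times the k-mer
    occurs across all reads, forward strand only.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    n_kmers = len(contig) - k + 1  # the denominator, L - k + 1
    if n_kmers <= 0:
        raise ValueError("contig is shorter than k")

    # The numerator: sum of multiplicity(kmer_i) over the contig's k-mers.
    total = sum(counts[km] for km in kmers(contig, k))
    return total, total / n_kmers


# Toy data, not taken from a real assembly.
total, ck = kmer_coverage("ACGTAC", ["ACGTA", "CGTAC", "ACGTAC"], k=3)
print(total, ck)
```

With this toy input the contig has 4 k-mers whose multiplicities add up to 10, so Ck is 10 / 4 = 2.5.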

I made another error in my previous reply -- the third number is actually not KMER_COVERAGE, but the numerator of the formula above, i.e. sum(multiplicity(kmer_i)).

Kmer coverage is a useful approximation for read coverage because it can be computed without aligning the reads to the assembly.
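
Given that the header fields are <CONTIG_ID> <SEQ_LEN> <sum(multiplicity(kmer_i))>, the kmer coverage of each contig can be recovered from the header alone. The following is a small sketch under that assumption; `coverage_from_header` is a hypothetical helper name, and `k` must be the same value that was passed to the assembler (25 in the original command).

```python
def coverage_from_header(header, k):
    """Compute kmer coverage from a header like '>0 38 86'.

    Assumes the three fields are: contig ID, sequence length,
    and the sum of kmer multiplicities.
    """
    fields = header.lstrip(">").split()
    contig_id = fields[0]
    length = int(fields[1])
    kmer_sum = int(fields[2])

    n_kmers = length - k + 1  # number of kmers in the contig
    if n_kmers <= 0:
        raise ValueError("contig is shorter than k")
    return contig_id, kmer_sum / n_kmers


# Headers taken from the contigs.fa shown at the top of the thread.
for header in (">0 38 86", ">5 25 151"):
    contig_id, ck = coverage_from_header(header, k=25)
    print(contig_id, round(ck, 2))
```

For contig 5 the length equals k, so the contig contains a single kmer and the coverage is simply 151. This is also why the third number can be larger than the sequence length: it is a count of read kmers, not a length.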

- Ben

ZH

Jul 22, 2013, 3:48:41 PM

to abyss...@googlegroups.com, ZH

Sorry, I forgot to thank you for the useful message.

By the way, among the result files I got there are several .fa files named bubbles.fa, indel.fa and unitigs.fa. The unitigs.fa file is the final result, right? The sequences I used for the assembly are hundreds or even thousands of bases long; they are actually not reads, but partial genome sequences of a species. How can I get the whole assembled sequence from the result? The sequences in the output are still in separate pieces.

ZH

Tony Raymond

Jul 24, 2013, 8:21:45 PM

to ZH, abyss...@googlegroups.com

Hi ZH,

Sorry if this was answered already, but the final assembly you are looking for is the unitigs.fa file. The bubbles.fa and indel.fa contain variant sequences, which were removed to make the overall assembly more contiguous.

Cheers,
Tony

Zhaoming Gao

Sep 12, 2013, 11:57:28 PM

to abyss...@googlegroups.com, ZH

Hi Vandervalk,

I am also looking for this information. Could you tell me how to get the read coverage of each contig from "KMER_COVERAGE" or "sum(multiplicity(kmer_i))" in the FASTA file? So the third number is sum(multiplicity(kmer_i)), not KMER_COVERAGE, is that right? By the way, I am using ABySS 1.3.6.

Another question: I got three FASTA files. What is the difference between unitigs.fa and contigs.fa?

3141540  58729  21171  500  580  754  1109  13442  45.3e6   M4_abyss-unitigs.fa
3136055  58479  20425  500  584  772  1169  24997  46.41e6  M4_abyss-contigs.fa
3135319  58417  20309  500  585  775  1178  28946  46.57e6  M4_abyss-scaffolds.fa

Best regards,

Zhaoming GAO

Tony Raymond

Sep 19, 2013, 2:02:02 PM

to Zhaoming Gao, abyss...@googlegroups.com, ZH

Hi,

Sorry for the delayed response! You are correct that the third number is the sum(multiplicity(kmer_i)).

As for the output files, I hope the attached (super simplified) flowchart will help with the understanding.

Cheers,

Tony

PastedGraphic-9.pdf
