coverage histogram

chrisbala

Jan 8, 2017, 11:16:58 AM

to ABySS

Hi All,

I've been fighting a bit with an attempt at a genome assembly.

Based on reading this forum, I suspect there is an issue with the raw data, but I just want to make sure I've understood everything. I've attached a FastQC report, which looks OK to me. There are some over-represented k-mers, but I don't think that is the problem.

I seem to have an excess of rare k-mers (which seemingly indicates sequencing errors), but I am not sure why. I've pasted the first few lines of coverage.hist below.

These data have been quality-trimmed using trim_galore. One relevant post I found suggests using Quake for error correction. Is that still my next step? Or is it possible these data are fundamentally flawed?

Thanks!

Chris

1	1182909997
2	84699927
3	9033507
4	5000923
5	3572263
6	3223489
7	3322965
8	3843986
9	4777951
10	6183795
11	8121580
12	10580154
13	13495444

MH-S1_S4_L005_R1_001_val_1_fastqc.html

Ben Vandervalk

Jan 9, 2017, 1:20:22 PM

to chrisbala, ABySS

Hi Chris,

It is quite normal to have a high percentage (e.g. 50%) of the coverage histogram consist of singleton k-mers due to sequencing errors. (In fact, I would suspect something was wrong if that wasn't the case.)
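As a rough way to check this on your own data, the sketch below summarises a `coverage.hist` file, assuming it is a two-column, whitespace-separated table of k-mer multiplicity and number of k-mers (as in the lines pasted above). The function name `hist_summary` is made up for illustration; it is not part of ABySS. It reports the share of k-mers seen exactly once and the first local minimum of the histogram, which is a common rule-of-thumb boundary between error k-mers and genomic k-mers.

```shell
# Summarise a coverage histogram (columns: multiplicity, number of k-mers).
# hist_summary is a hypothetical helper, not an ABySS program.
hist_summary() {
    awk '
        {
            cov[NR] = $1; n[NR] = $2; total += $2
            if ($1 == 1) single = $2          # count of singleton k-mers
        }
        END {
            if (NR == 0 || total == 0) exit 1
            printf "singleton fraction: %.3f\n", single / total
            # first row whose count is below the previous row and not above the next
            for (i = 2; i < NR; i++)
                if (n[i] < n[i-1] && n[i] <= n[i+1]) {
                    printf "first minimum at coverage: %d\n", cov[i]
                    break
                }
        }
    ' "$@"
}
```

Usage would be `hist_summary coverage.hist`. Run on a truncated histogram (such as only the first few rows), the singleton fraction is overstated, because the main coverage peak is missing from the total.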

In the past, we have often used raw reads for assembly (because ABySS has its own algorithms for dealing with sequencing errors), although we do find that error correction with BFC improves assembly quality slightly.

Can I ask: What makes you think your assembly is of poor quality? Do you have a low N50 or a low reconstruction? (You can calculate such metrics with the `abyss-fac` program on the *-8.fa file, if you are not already aware.)

Can you provide the `abyss-pe` command that you used?

- Ben


Christopher Balakrishnan

Jan 9, 2017, 3:49:45 PM

to Ben Vandervalk, ABySS

Thanks Ben,

I’ve been involved in some genome projects, but I'm stumbling through do-it-yourself assembly :)

Our N50 is just low, hence my concerns. With ALLPATHS, using both MP and PE data, this assembly had a contig N50 of ~12 kb. In another concurrent assembly with similar data (from a different facility and a different species) we had an N50 of 100 kb. We are working with birds, which are usually not too hard to assemble. The strategy was one lane of MP and one lane of PE per species.

I was reading the ALLPATHS documentation, and our data looked pretty similar to what they call “poor data quality”, where you get a major correction (loss of rare k-mers) during the ALLPATHS error-correction step. Anyway, I couldn’t think of other things, besides contamination, that would lead to such a poor assembly. I haven’t ruled out contamination yet.

Now that it seems the k-mer profile is OK, maybe I am overlooking obvious things. Maybe the sequencing depth for the worse assembly is just a bit lower, and falls below some threshold above which contiguity stats jump sharply.

ABySS command (I haven’t added the MP data yet):

abyss-pe k=32 np=8 name=AGPHv2 lib='pe4 pe9' pe4='MH-S1_S4_L005_R1_001_val_1.fq.gz MH-S1_S4_L005_R2_001_val_2.fq.gz' pe9='MH-S1_S9_L006_R1_001_val_1.fq.gz MH-S1_S9_L006_R2_001_val_2.fq.gz' &

Ben Vandervalk

Jan 9, 2017, 6:49:34 PM

to Christopher Balakrishnan, ABySS

Hi Chris,

Your assembly command looks fine.

If you have not tried it already, you should try running assemblies with a range of k-mer sizes and compare the resulting N50s. The k-mer size is the most important parameter and can have a large effect on assembly quality. For read lengths up to 126 bp (as shown in your FastQC report), something like k=80 or k=96 would be a more typical value. Maybe even higher.
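A k-mer sweep along these lines could be sketched as follows, reusing the library definitions from the command earlier in the thread. The specific k values are only illustrative, and `DRY_RUN` is a made-up convenience variable, not an ABySS option. Each assembly runs in its own subdirectory via `-C`, so the read paths are given relative to that subdirectory (hence the `../` prefix, which assumes the reads sit in the current directory). Larger k values may also require an ABySS build configured with a high enough maximum k.

```shell
#!/bin/sh
# Dry run by default: print each abyss-pe command instead of executing it.
# Set DRY_RUN=0 to actually launch the assemblies (DRY_RUN is a hypothetical helper).
DRY_RUN=${DRY_RUN:-1}

for k in 64 80 96 112; do            # illustrative values, all below the 126 bp read length
    mkdir -p "k$k"
    # Build the command; -C makes abyss-pe work inside the k-specific directory.
    set -- abyss-pe -C "k$k" k="$k" np=8 name=AGPHv2 lib='pe4 pe9' \
        pe4='../MH-S1_S4_L005_R1_001_val_1.fq.gz ../MH-S1_S4_L005_R2_001_val_2.fq.gz' \
        pe9='../MH-S1_S9_L006_R1_001_val_1.fq.gz ../MH-S1_S9_L006_R2_001_val_2.fq.gz'
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
done

# Once the real assemblies have finished, compare contiguity across k.
if [ "$DRY_RUN" != 1 ]; then
    abyss-fac k*/AGPHv2-8.fa
fi
```

The loop runs the assemblies one after another; with enough cores and memory they could instead be launched in parallel, but that is left out here to keep the sketch simple.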

- Ben

Christopher Balakrishnan

Jan 10, 2017, 8:44:36 AM

to Ben Vandervalk, ABySS

Thanks, will do! (increase kmer)
