abyss-pe error: duplicate read ID

Alejandro Sanchez

Apr 8, 2011, 5:23:51 AM
to ABySS
Hi everyone,

I'm working with a HiSeq run and the latest version of ABySS (1.2.6),
and it is crashing with this error:

Reading `nippo_52-1.fa'...
Finding overlaps of exactly k-1 bp...
V=3970578 E=4887678 E/V=1.23097
Degree: ▂█▅
01234
0: 16% 1: 49% 2-4: 35% 5+: 0.00013% max: 5
Finding overlaps of fewer than k-1 bp...
V=3970578 E=4927497 E/V=1.2
Degree: ▂█▅
01234
0: 16% 1: 49% 2-4: 35% 5+: 0.007% max: 61
Bubbles: 52691 Popped: 47245 Too long: 0 Too many: 1 Dissimilar: 5449
The minimum coverage of single-end contigs is 1.34746.
The minimum coverage of merged contigs is 3.74419.
Consider increasing the coverage threshold parameter, c, to 3.74419.
Reading from standard input...
Reading target `nippo_52-3.fa'...
Read 369284666 bases, 1828614 contigs, 1828614 scaffolds from
`nippo_52-3.fa'. Expecting 276025352 k-mer.
Reading target `nippo_52-3.fa'...
Read 100000 contigs. Hash load: 13122012 / 1073741824 = 0.0122208 using
627 MB.
Read 200000 contigs. Hash load: 26243276 / 1073741824 = 0.024441 using
1.24 GB.
Read 300000 contigs. Hash load: 39132132 / 1073741824 = 0.0364446 using
1.87 GB.
Read 400000 contigs. Hash load: 52019007 / 1073741824 = 0.0484465 using
2.46 GB.
Read 500000 contigs. Hash load: 65062681 / 1073741824 = 0.0605943 using
3.05 GB.
Read 600000 contigs. Hash load: 77949832 / 1073741824 = 0.0725964 using
3.63 GB.
Read 700000 contigs. Hash load: 91125950 / 1073741824 = 0.0848677 using
4.19 GB.
Read 800000 contigs. Hash load: 104137968 / 1073741824 = 0.096986 using
4.75 GB.
Read 900000 contigs. Hash load: 117216551 / 1073741824 = 0.109166 using
5.3 GB.
Read 1000000 contigs. Hash load: 130183408 / 1073741824 = 0.121243 using
5.86 GB.
Read 1100000 contigs. Hash load: 143063707 / 1073741824 = 0.133238 using
6.39 GB.
Read 1200000 contigs. Hash load: 156129839 / 1073741824 = 0.145407 using
6.93 GB.
Read 1300000 contigs. Hash load: 169002481 / 1073741824 = 0.157396 using
7.47 GB.
Read 1400000 contigs. Hash load: 182058947 / 1073741824 = 0.169556 using
8.03 GB.
Read 1500000 contigs. Hash load: 195152207 / 1073741824 = 0.18175 using
8.58 GB.
Read 1600000 contigs. Hash load: 208166192 / 1073741824 = 0.19387 using
9.12 GB.
Read 1700000 contigs. Hash load: 221105979 / 1073741824 = 0.205921 using
9.67 GB.
Read 1800000 contigs. Hash load: 250061182 / 1073741824 = 0.232888 using
10.9 GB.
Read 1828614 contigs. Hash load: 276024987 / 1073741824 = 0.257068 using
12 GB.
Found 365 (0.000132234%) duplicate k-mer.
Reading
`/lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/5142_1_1.fastq'...
Reading
`/lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/5142_1_2.fastq'...
Reading
`/lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/5982_1_1.fastq'...
Reading
`/lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/5982_1_2.fastq'...
Read 3 alignments. Hash load: 3 / 5 = 0.6 using 0 B.
Read 6 alignments. Hash load: 6 / 11 = 0.545455 using 0 B.
Read 12 alignments. Hash load: 12 / 23 = 0.521739 using 0 B.
Read 24 alignments. Hash load: 24 / 47 = 0.510638 using 0 B.
Read 48 alignments. Hash load: 48 / 97 = 0.494845 using 0 B.
Read 98 alignments. Hash load: 98 / 199 = 0.492462 using 0 B.
Read 200 alignments. Hash load: 200 / 409 = 0.488998 using 0 B.
Read 410 alignments. Hash load: 410 / 823 = 0.498177 using 0 B.
Read 890 alignments. Hash load: 824 / 1741 = 0.473291 using 135 kB.
Read 2474 alignments. Hash load: 1742 / 3739 = 0.4659 using 401 kB.
error: duplicate read ID `HS18_5982:1:1101:1106:1969/1'
warning: the seed-length should be at least twice k: k=52, s=100
nippo_52-3.hist: No such file or directory
make: *** [nippo_52-3.dist] Error 1
make: *** Deleting file `nippo_52-3.dist'
farm2-head2[as9]71: more nippo_52-3.hist
nippo_52-3.hist: No such file or directory

The commands issued were:

AdjList -v -k52 -m30 nippo_52-1.fa >nippo_52-1.adj
PopBubbles -v -k52 -p0.9 -g nippo_52-3.adj nippo_52-1.fa nippo_52-1.adj \
>nippo_52-1.path
MergeContigs -k52 -o nippo_52-3.fa nippo_52-1.fa nippo_52-1.adj \
nippo_52-1.path
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
nippo_52-1.path nippo_52-1.fa >nippo_52-indel.fa
KAligner -v -i -j8 -k52 \
/lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/*.fast* \
nippo_52-3.fa \
|ParseAligns -v -k52 -h nippo_52-3.hist \
|sort -snk3 -k4 \
|gzip >nippo_52-3.sam.gz
gunzip -c nippo_52-3.sam.gz \
|DistanceEst -v -j8 -k52 -s100 -n10 -o nippo_52-3.dist nippo_52-3.hist


I searched the ABySS list but only found reports of duplicate IDs
when converting to ACE files... Any ideas are welcome.

Cheers.

--
Alejandro Sanchez-Flores
Team133 Parasite Genomics
Wellcome Trust Sanger Institute
Cambridge, UK.

Alejandro Sanchez

Apr 8, 2011, 5:44:28 AM
to Alejandro Sanchez, ABySS
Sorry, I forgot to mention that I checked the ID of the read that is
reported as duplicated, but it is not... I searched my fastq
files and it appears only once.

Cheers.

Rod Docking

Apr 8, 2011, 12:40:46 PM
to Alejandro Sanchez, Alejandro Sanchez, ABySS
Hi Alejandro:

    We might need to wait for Shaun's return early next week to fully debug this, but for now, for these error messages:


> Found 365 (0.000132234%) duplicate k-mer.
> error: duplicate read ID `HS18_5982:1:1101:1106:1969/1'
> warning: the seed-length should be at least twice k: k=52, s=100

    I'd suggest:

    (1) Trying to increase the seed length as the warning suggests, to at least 2k = 104 (though I don't think this will fix the duplicate read ID error)
    (2) Removing the offending read pair from your fastq files and re-running.  This will let you see if it's really just a single pair that ABySS is stumbling on or a more general issue with your reads.  When I see this error, it's usually because I've named both sets of reads with '/1' suffixes.

Regards,
Rod
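
The '/1' suffix mix-up Rod describes can be checked with a quick tally of read-ID suffixes per file. This is only a minimal sketch: the function name `fastq_suffix_counts` is made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines) whose IDs end in `/1` or `/2`, as in the error message above.

```shell
# Tally the trailing "/1" or "/2" suffix of every read ID in one FASTQ file.
# Assumption: each record is exactly four lines, so headers sit on lines 1, 5, 9, ...
# (matching on a leading "@" would be unsafe, since quality lines may start with "@").
fastq_suffix_counts() {
    awk 'NR % 4 == 1 {
        id = $1    # first whitespace-delimited token of the header line
        suffix = (id ~ /\/[12]$/) ? substr(id, length(id) - 1) : "none"
        count[suffix]++
    }
    END { for (s in count) print s, count[s] }' "$1" | LC_ALL=C sort
}
```

Run on the file that should hold the second reads of each pair, a result made up only of `/1` counts would point to the naming problem rather than to a single bad read.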

Shaun Jackman

Apr 11, 2011, 2:53:17 PM
to Alejandro Sanchez, ABySS
Hi Alejandro,

Please report the output of...
grep HS18_5982:1:1101:1106:1969 /lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/{5142,5982}_1_{1,2}.fastq
head /lustre/scratch103/sanger/as9/ABYSS_results/NIPPO/genomic/reads/mouse_deriv/{5142,5982}_1_{1,2}.fastq

Cheers,
Shaun

Alejandro Sanchez

Apr 11, 2011, 5:11:37 PM
to Shaun Jackman, Alejandro Sanchez, ABySS
Hi Shaun,

Mystery solved... It was indeed duplicated in the reads /2 file. The
HiSeq runs here at Sanger are now stored as BAM files and there was a
problem with the pipeline that generates the fastq reads...

Problem fixed now... it's the kind of thing where you sometimes think,
"That can't be wrong..."

Cheers.
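
A duplicate like this one can be found by listing every read ID that occurs more than once, rather than searching for a single ID. The following is a rough sketch under the same four-line-record assumption; `fastq_dup_ids` is a hypothetical helper name, and it keeps every ID in memory, which may be too much for a full HiSeq lane.

```shell
# Print each read ID that appears more than once across the given FASTQ files.
# Assumption: four-line records, so FNR % 4 == 1 selects the header lines.
fastq_dup_ids() {
    awk 'FNR % 4 == 1 {
        id = substr($1, 2)               # drop the leading "@"
        if (++seen[id] == 2) print id    # report on the second sighting only
    }' "$@"
}
```

Passing both files of a pair at once also catches the case where the same ID, suffix included, appears in both files.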

Subing Cao

May 14, 2014, 4:58:30 PM
to abyss...@googlegroups.com, Shaun Jackman, Alejandro Sanchez
Hi Alejandro,
How did you solve the problem? I have the same problem after converting BAM to fastq files. How do you remove the duplicated reads?

Thanks

Tomás Carrasco

Jan 4, 2017, 4:17:27 PM
to ABySS, sjac...@bcgsc.ca, a...@sanger.ac.uk
I also have the same problem. Could anyone please tell me how they fixed it? Thanks

Shaun Jackman

Jan 4, 2017, 4:18:58 PM
to ABySS, sjac...@bcgsc.ca, a...@sanger.ac.uk
Hi, Subing. Use `samtools fastq` to convert a BAM file to FASTQ.

Cheers,
Shaun
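
As a hedged illustration of the `samtools fastq` route, the sketch below follows the paired-end example in the samtools manual: it collates reads by name first, then writes the first and second reads to separate files. The function names and argument order are made up for illustration, and the `collate -u -O` form assumes a reasonably recent samtools. The second function is an assumed fallback, not something suggested in the thread, for FASTQ files that were already converted: it keeps only the first record for each read ID, again assuming four-line records.

```shell
# Convert a BAM file to paired FASTQ files (usage: bam_to_fastq in.bam out_1.fastq out_2.fastq).
# Unpaired and singleton reads are discarded via /dev/null, as in the samtools manual example.
bam_to_fastq() {
    samtools collate -u -O "$1" |
        samtools fastq -1 "$2" -2 "$3" -0 /dev/null -s /dev/null
}

# Keep only the first record for each read ID in one FASTQ file; writes to stdout.
# Assumption: four-line records. All IDs are held in memory.
fastq_dedup() {
    awk '{
        if (NR % 4 == 1) keep = !seen[$1]++   # decide once per record, on its header line
        if (keep) print
    }' "$1"
}
```

Deduplicating one file of a pair on its own only helps when the extra copies are confined to that file, as they were here; regenerating the FASTQ files from the BAM is the safer fix.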