New Illumina pipeline (1.9)

1,426 views
Skip to first unread message

mgalardini

unread,
Jul 26, 2011, 9:02:35 AM7/26/11
to ABySS
Hi everybody,

in case of a very high coverage in an illumina run (i.e. 500X), do you
know any Abyss pitfalls in the bacterial assembly process? For istance
the very same assembly with 500X coverage produced 10 times the
contigs of an 100X coverage assembly; do you have any suggestion to
overcome this problem (i.e. random sampling of reads before the genome
assembly, ...)?

Moreover as you may know, illumina recently released a new pipeline
for sequencing data management (1.9), comprising a new reads name
format, an extented quality values range (from 32 to 64 symbols). Do
you have any future release in preparation to handle these new
features of illumina reads?

Thanks in advance,
Marco Galardini

Shaun Jackman

unread,
Jul 26, 2011, 1:42:59 PM7/26/11
to mgalardini, ABySS
Hi Marco,

I've performed assemblies with ~2000x coverage. ABySS should have no
particular trouble with high coverage.

Please report the exact command line that you are using. Can you report
the abyss-fac stats for your assemblies? What value of k are you using?
Have you tried other values? High coverage data sets require higher
values of k.

I haven't yet seen the Illumina 1.9 data format. Is it FASTQ in standard
Phred+33 format? If so, it should give no problem. ABySS considers every
quality character from ASCII 33 (!) to ASCII 104 (h) as valid.

Cheers,
Shaun

Kranthi Varala

unread,
Aug 8, 2011, 5:46:11 PM8/8/11
to ABySS
FYI, the new pipeline outputs standard FASTQ now, i.e., Phred+33
format and ABySS reads in those values with no problem.
I've encountered a problem with the sequence headers though. The new
pipeline changed the representation of mate pairs.

Instead of ending with /1 and /2 the header now has the format:

@HWI-ST330:133:B027FACXX:4:1101:1380:2010 1:N:0:CTTGTA
and
@HWI-ST330:133:B027FACXX:4:1101:1380:2010 2:N:0:CTTGTA

Attempting a paired-end assembly with this data throws an error at the
KAligner | ParseAligns stage with following error:

error: read ID `HWI-ST330:133:B027FACXX:4:1101:1765:2157' must end in
one of
1 and 2 or A and B or F and R or F3 and R3 or forward and
reverse

Of course, this is very easy to fix by editing the headers in the
fastq files but if this change is permanent it might be worth
updating ABySS to correctly parse the new headers.

Kranthi

Shaun Jackman

unread,
Aug 9, 2011, 4:53:17 PM8/9/11
to Kranthi Varala, ABySS
Hi Kranthi,

I'm glad that the new pipeline has finally converged on the standard
FASTQ format. It's rather annoying though that they've changed the
format of the read ID to omit the /1 and /2 suffix. Is there possibly an
option of the pipeline to include the /1 and /2 suffixes? Many tools,
including ABySS and BWA, require these suffixes. ABySS considers a
duplicate read ID to be an error.

Cheers,
Shaun

Dana Price

unread,
Aug 9, 2011, 4:57:23 PM8/9/11
to Shaun Jackman, Kranthi Varala, ABySS
Just to clarify- the new quality score offsets are only used in fastq files that have been generated by CASAVA 1.8 or later.  If you use the raw qseq files (as I do), they are still in the old phred+64.

Dana
--
Dana Price
Laboratory Researcher in Bioinformatics
Rutgers, The State University
Bhattacharya Lab
http://dblab.rutgers.edu
d.p...@rutgers.edu

Alejandro Sanchez

unread,
Aug 9, 2011, 5:02:46 PM8/9/11
to Shaun Jackman, Kranthi Varala, ABySS
Depends on who is processing the data for the new Illumina pipeline.
From the example that Kranthi put, I reckon that the run was multiplex
and the last part is the tag. So is quite easy just to put something
like a # symbol and the number of the tag to the read name and at then
finish it with the classic /1 or /2.

Anyway, as with fasta format, I think fastq is not that standard as we
could wish...

Cheers.

Shaun Jackman

unread,
Aug 9, 2011, 5:19:49 PM8/9/11
to Dana Price, Kranthi Varala, ABySS
Yes, for qseq files ABySS assumes Phred+64 quality, whereas for FASTQ
files it assumes Phred+33 quality.

Cheers,
Shaun

Kranthi Varala

unread,
Aug 9, 2011, 5:27:08 PM8/9/11
to ABySS
Hi Shaun,
I did not see any such option in the manual. I think
the reason they
changed the convention is because of this change (copied from the
CASAVA manual) :

"Be aware that the CASAVA 1.8 FASTQ files contain all reads, both the
reads
that passed filtering, as well as the reads that did not pass
filtering. If you use
third-party software that cannot handle this, we recommend that clean
up
the FASTQ files first using the <is filtered> field described above."

I wrote this simple script to remove the filtered reads and fix the
suffix at the same time.
Use it if you need to, and modify as required. Also, your local
sequencing center might
already filter the reads for you in which case you dont need to worry
about them.

#!/usr/bin/perl

use Getopt::Long;
use strict;

my ($pair1,$pair2);

GetOptions ( 'PE1=s' => \$pair1,
'PE2=s' => \$pair2);

if (!defined $pair1 || !defined $pair2) {
print "Please define both pair files.\n";
&usage();
exit(1);
}

open P1, $pair1 or die "Failed to open $pair1\n";
open P2, $pair2 or die "Failed to open $pair2\n";

$pair1 =~ s/\.fastq$//;
$pair2 =~ s/\.fastq$//;
open P1O, ">$pair1.filt.fastq";
open P2O, ">$pair2.filt.fastq";
my (@l1,@l2);
my ($f1,$f2);
$f1=$f2=0;
my $temp = <P1>;
$temp = <P1>;
seek P1,0,0;
chomp $temp;
my $readLen = length($temp);
while($l1[0] = <P1>){
$l1[1] = <P1>;
$l1[2] = <P1>;
$l1[3] = <P1>;
$l2[0] = <P2>;
$l2[1] = <P2>;
$l2[2] = <P2>;
$l2[3] = <P2>;
my @tmp = split /:/, $l1[0];
if($tmp[7] eq 'N'){$f1 = 1;}
@tmp = split /:/, $l2[0];
if($tmp[7] eq 'N'){$f2 = 1;}
$l1[0] =~ s/\s([12]):N:0:([ATCGN]+)/:$2\/$1/;
$l2[0] =~ s/\s([12]):N:0:([ATCGN]+)/:$2\/$1/;
if($f1==1 && $f2==1){
print P1O join "", @l1;
print P2O join "", @l2;
}
# else {print $l1[0],$l2[0];}
$f1=$f2=0;
}
close P1;
close P2;
close P1O;
close P2O;

sub usage(){
print "perl removeFiltered.pl -PE1=<paired end 1> -PE2=<paired
end 2>\n";
}
# Creates new fastq files in the same location as original files and
calles them XXXX.filt.fastq
#############################################################################

If the reads were filtered and only the suffix needs to be fixed then
use this script on
the individual fastq files:

#!/usr/bin/perl
$suf = ":N:0:[ATCGN]+";
open FL, $ARGV[0];
while(<FL>){
s/\s([12]):N:0:([ATCGN]+)/:$2\/$1/;
print;
}
close FL;
#remeber to redirect the output to a new file
#######################################################

Kranthi

Kranthi Varala

unread,
Aug 9, 2011, 5:34:03 PM8/9/11
to ABySS
Good point. That was an example for a multiplexed run using an index.
If
no multiplexing was done the last part of the read id would be using.
I should correct the pattern match in the script I just posted to

s/\s([12]):N:0:([ATCGN]*)/:$2\/$1/;
so that it works even when there is no mulitplex index.
Yay, under a minute to find the first bug. Must be a record. :)

Kranthi

On Aug 9, 4:02 pm, Alejandro Sanchez <arrob...@gmail.com> wrote:
> Depends on who is processing the data for the new Illumina pipeline.
> From the example that Kranthi put, I reckon that the run was multiplex
> and the last part is the tag. So is quite easy just to put something
> like a # symbol and the number of the tag to the read name and at then
> finish it with the classic /1 or /2.
>
> Anyway, as with fasta format, I think fastq is not that standard as we
> could wish...
>
> Cheers.
>

Marco Galardini

unread,
Aug 23, 2011, 6:32:52 AM8/23/11
to abyss...@googlegroups.com
Hi everybody,

i've been struggling on this issue quite a while now; as suggested i've added the /1 /2 tags at the end of the reads and filtered them; however i still got some errors while running abyss-pe.

here's an example:
abyss-pe k=35 in="../17_R1_a.fastq ../17_R2_a.fastq" n=10 name=17
ABYSS -k25 -q3  --coverage-hist=coverage.hist -s 14-bubbles.fa  -o 14-1.fa ../17_R1_a.fastq ../17_R2_a.fastq 
ABySS 1.2.7
ABYSS -k25 -q3 --coverage-hist=coverage.hist -s 14-bubbles.fa -o 14-1.fa ../17_R1_a.fastq ../17_R2_a.fastq
Reading `../17_R1_a.fastq'
warning: discarded 46616 reads shorter than 25 bases
Reading `../17_R2_a.fastq'
warning: discarded 271567 reads shorter than 25 bases
Loaded 52984920 k-mer
Minimum k-mer coverage is 11
Using a coverage threshold of 8...
The median k-mer coverage is 64
The reconstruction is 7187818
The k-mer coverage threshold is 8
Setting parameter e (erode) to 8
Setting parameter E (erodeStrand) to 1
Setting parameter c (coverage) to 8
Generating adjacency
Generated 107957676 edges
Eroding tips
Eroded 25611226 tips
Eroded 0 tips
Trimming short branches: 1
Trimmed 13 k-mer in 13 branches
Trimming short branches: 2
Trimmed 10 k-mer in 5 branches
Trimming short branches: 4
Trimmed 28 k-mer in 10 branches
Trimming short branches: 8
Trimmed 99 k-mer in 16 branches
Trimming short branches: 16
Trimmed 189 k-mer in 18 branches
Trimming short branches: 25
Trimmed 214 k-mer in 11 branches
Trimming short branches: 25
Trimmed 73 branches in 6 rounds
Marked 3508774 edges of 1680299 ambiguous vertices.
Removing low-coverage contigs (mean k-mer coverage < 8)
Found 27370460 k-mer in 2160840 contigs before removing low-coverage contigs
Removed 20172909 k-mer in 893323 low-coverage contigs
Split 1786625 ambiguous branches
Eroding tips
Eroded 27552 tips
Eroded 0 tips
Trimming short branches: 1
Trimmed 155 k-mer in 155 branches
Trimming short branches: 2
Trimmed 272 k-mer in 202 branches
Trimming short branches: 4
Trimmed 658 k-mer in 261 branches
Trimming short branches: 8
Trimmed 1837 k-mer in 359 branches
Trimming short branches: 16
Trimmed 3879 k-mer in 376 branches
Trimming short branches: 25
Trimmed 3160 k-mer in 182 branches
Trimming short branches: 25
Trimmed 1 k-mer in 1 branches
Trimming short branches: 25
Trimmed 1536 branches in 7 rounds
Popping bubbles
Removed 2118 bubbles
Removed 2118 bubbles
Marked 16761 edges of 7960 ambiguous vertices.
2705 unassembled k-mer in circular contigs
Assembled 7100808 k-mer in 10730 contigs
Removed 45822202 k-mer.
The signal-to-noise ratio (SNR) is -8.05998 dB.
AdjList  -k25 -m30 14-1.fa >14-1.adj
PopBubbles  -j2 -k25 -p0.9  -g 14-3.adj 14-1.fa 14-1.adj >14-1.path
MergeContigs -k25 -o 14-3.fa 14-1.fa 14-1.adj 14-1.path
The minimum coverage of single-end contigs is 8.
The minimum coverage of merged contigs is 12.6522.
Consider increasing the coverage threshold parameter, c, to 12.6522.
awk '!/^>/ {x[">" $1]=1; next} {getline s} $1 in x {print $0 "\n" s}' \
14-1.path 14-1.fa >14-indel.fa
KAligner  -i  -j2 -k25 ../17_R1_a.fastq ../17_R2_a.fastq 14-3.fa \
|ParseAligns  -k25 -h 14-3.hist \
|sort -snk3 -k4 \
|gzip >14-3.sam.gz
Mateless   11933308  100%
Unaligned         0
Singleton         0
FR                0
RF                0
FF                0
Different         0
Multimap          0
Split             0
Total      11933308
gunzip -c 14-3.sam.gz \
|DistanceEst  -j2 -k25 -s100 -n10 -o 14-3.dist 14-3.hist
error: the histogram `14-3.hist' is empty
make: *** [14-3.dist] Error 1
make: *** Deleting file `14-3.dist'

Below a read example:
@HISEQ1:85:D0C0FABXX:3:1101:1410:2166#1:N:0:CGATGT/1
ATTGCCCAGCAGAATCTCCAGAAATTCGAACTGCGGGACGTGACGACCGAGAACGACAAGGCACACCGTCAGCGGCGTCGACAGCACCAATCCGACCGGCC
+
@@@FBFB?BFDDF@GHGIIIFHIIECGGIGHGEEG<FAF<;C@FCDEHF8AA@>/38=@BBBBBCBBB+9@BCCBBBBBBBB39BC<?#############

Other really similar reads with the old pipeline perform really well.
Thanks in advance for any suggestion you may have

Shaun Jackman

unread,
Aug 23, 2011, 1:03:00 PM8/23/11
to Marco Galardini, abyss...@googlegroups.com
Hi Marco,

Please give an example of a read pair.

Cheers,
Shaun

Marco Galardini

unread,
Aug 27, 2011, 9:25:44 AM8/27/11
to Shaun Jackman, abyss...@googlegroups.com
Hi,

there you go: sorry for the late reply!
What i did with the pairs was to replace spaces with '#' as well as adding the /1 /2 string at the end...

@HISEQ1:85:D0C0FABXX:3:1101:1410:2166#1:N:0:CGATGT/1
ATTGCCCAGCAGAATCTCCAGAAATTCGAACTGCGGGACGTGACGACCGAGAACGACAAGGCACACCGTCAGCGGCGTCGACAGCACCAATCCGACCGGCC
+
@@@FBFB?BFDDF@GHGIIIFHIIECGGIGHGEEG<FAF<;C@FCDEHF8AA@>/38=@BBBBBCBBB+9@BCCBBBBBBBB39BC<?#############

@HISEQ1:85:D0C0FABXX:3:1101:1410:2166#2:N:0:CGATGT/2
GTTCATTGTCCTTGAACTCTTGAGCAACAATGTCGTCGAGCCTTGGCTTTACGGTTCCCGCACCGGACTGTCGCCGCTTTCGATCATCGTCGCGGCGAACT
+
@;?DDDA?ADCFDH?3EAIHEF@EGG>GH9CFCG@GGG6?@GGEBG>GGI<FHHF<CHE@E<BBBC/8>9@##############################

Shaun Jackman

unread,
Aug 29, 2011, 1:54:20 PM8/29/11
to Marco Galardini, abyss...@googlegroups.com
Hi Marco,

The problem is that the read IDs have to be identical except for the
suffix /1 and /2. Your read IDs also differ after the sharp sign (#1 and
#2). I'd suggest changing your read IDs to:
@HISEQ1:85:D0C0FABXX:3:1101:1410:2166/1 1:N:0:CGATGT
@HISEQ1:85:D0C0FABXX:3:1101:1410:2166/2 2:N:0:CGATGT

You can of course delete the comment (everything after the space) if
it's easier for you. ABySS ignores the comment.

Cheers,
Shaun

Marco Galardini

unread,
Aug 30, 2011, 7:30:27 AM8/30/11
to Shaun Jackman, abyss...@googlegroups.com
Hi Shaun,

following your suggestions i managed to use abyss on these new reads.
Thanks a lot!
Reply all
Reply to author
Forward
0 new messages