Too short after clip: Question

Sven

Jan 17, 2012, 6:00:14 PM
to EA Utils
Hi,

I am the first one :-)

I am not sure whether this is a problem or not...

Maybe you can comment on that.

I have a (bad) Illumina dataset, in this case one lane of ChIP-seq.

I'd like to use fastq-mcf to

a) clip quality
b) clip primer
c) basic statistics

As a first attempt I put three sequences (FASTA formatted) in the
adapter file,

TruSeq_Universal_Adapter
P5_APr
P7_APr

all in 5'->3' direction.

Running fastq-mcf resulted in:

fastq-mcf -l 10 -P 33 -o MySample.fq_Clipped2 illuminaAdaptors.fasta MySample.fastq
Scale used: 2.2
Phred: 33
Trim 'start': 1 from MySample.fastq
Threshold used: 251 out of 100000
Adapter TruSeq_Universal_Adapter (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT): counted 54623 at the 'end' of 'MySample.fastq', clip set to 1
Adapter P5_Amplification_Primer (AATGATACGGCGACCACCGAG): counted 54623 at the 'end' of 'MySample.fastq', clip set to 1
Files: 1
Total reads: 17536194
Too short after clip: 11405490
Clipped 'end' reads: Count: 3449370, Mean: 1.80, Sd: 0.89
Trimmed 14052427 reads by an average of 13.69 bases on quality < 7


Then I added the reverse complement of the "reverse" primer P7 to the
adapter file (P7_APr_RevComp) and got:

fastq-mcf -l 10 -P 33 -o MySample.fq_Clipped3 illuminaAdaptors.fasta MySample.fastq
Scale used: 2.2
Phred: 33
Trim 'start': 1 from MySample.fastq
Threshold used: 251 out of 100000
Adapter TruSeq_Universal_Adapter (AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT): counted 54623 at the 'end' of 'MySample.fastq', clip set to 1
Adapter P5_APr (AATGATACGGCGACCACCGAG): counted 54623 at the 'end' of 'MySample.fastq', clip set to 1
Adapter P7_APr_RevComp (TCGTATGCCGTCTTCTGCTTG): counted 10582 at the 'end' of 'MySample.fastq', clip set to 2
Files: 1
Total reads: 17536194
Too short after clip: 7623003
Clipped 'end' reads: Count: 8224158, Mean: 13.85, Sd: 5.62
Trimmed 14052427 reads by an average of 13.69 bases on quality < 7
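
For readers who want to reproduce the second adapter file, here is a minimal sketch of how a reverse complement can be computed. The helper name `reverse_complement` is hypothetical and is not part of fastq-mcf; the only sequence used is the P7_APr_RevComp entry from the log above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T/N, either case)."""
    # Map each base to its complement, then reverse the whole string.
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


# Sequence reported for P7_APr_RevComp in the fastq-mcf output above.
p7_revcomp = "TCGTATGCCGTCTTCTGCTTG"

# Applying the operation again recovers the primer in its original orientation.
p7_forward = reverse_complement(p7_revcomp)
print(p7_forward)

# Reverse-complementing twice must give back the starting sequence.
assert reverse_complement(p7_forward) == p7_revcomp
```
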


The final stats are (unexpectedly) different, especially:

Too short after clip: 11405490

vs.

Too short after clip: 7623003

How should I provide the adapter sequences? Always 5'->3', or "as
read" by the software?
Can you comment on the statistics? Is there something I have missed?

Thanks,
Sven

Erik Aronesty

Feb 3, 2012, 9:42:38 PM
to ea-u...@googlegroups.com
It looks like one adapter was a subset of another, causing it to be "100%" matched... and therefore not "completely clipped".

Fastq-mcf doesn't check for subsets.
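
Since fastq-mcf does not perform this check, it can be done on the adapter file beforehand. The following is only a sketch: the function name `find_contained` is hypothetical, the check is a plain exact-substring test, and the sequences are the three reported in the log output earlier in this thread.

```python
def find_contained(adapters):
    """Return (short_name, long_name) pairs where one adapter sequence
    occurs verbatim inside another one."""
    hits = []
    for short_name, short_seq in adapters.items():
        for long_name, long_seq in adapters.items():
            # Skip comparing an entry with itself; report exact containment only.
            if short_name != long_name and short_seq in long_seq:
                hits.append((short_name, long_name))
    return hits


# Adapter names and sequences as printed by fastq-mcf in the original post.
adapters = {
    "TruSeq_Universal_Adapter": "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "P5_APr": "AATGATACGGCGACCACCGAG",
    "P7_APr_RevComp": "TCGTATGCCGTCTTCTGCTTG",
}

for short_name, long_name in find_contained(adapters):
    print(f"{short_name} is contained in {long_name}")
```

With these inputs the check flags P5_APr, which is identical to the first 21 bases of TruSeq_Universal_Adapter. That matches the identical count (54623) reported for both adapters in the log.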

Sven Klages

Feb 8, 2012, 3:58:33 AM
to ea-u...@googlegroups.com
2012/2/4 Erik Aronesty <earo...@gmail.com>

> It looks like one adapter was a subset of another, causing it to be "100%" matched... and therefore not "completely clipped".
>
> Fastq-mcf doesn't check for subsets.

Hmm, that does not explain why I get far more "Too short after clip" reads when I use just two adapters instead of three.

And how should I provide the adapter sequences? Always 5'->3', or "as read" by the software?

thanks,
Sven

Erik Aronesty

Feb 9, 2012, 11:18:17 AM
to ea-u...@googlegroups.com
The software always reads sequences one way. Most of the time, if it's an adapter, this is OK.

It has no notion of 5' or 3'.

If the adapter sequence is ATGTGC, and the clip is set to 4 'end', and it sees:

TTTTTTTTTTTTTATGT

It will clip off those 4.

If the clip is set to 4 'begin' and it sees:

GTGCTTTTTTTTTTTTTT

It will clip off those 4.

In the case of other, non-adapter contamination (not likely to be at one end or the other), this utility is definitely not the right tool.
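
The 'end' and 'begin' examples above can be sketched in a few lines. This is not fastq-mcf's implementation: it uses exact matching only, the function names `clip_end` and `clip_begin` are hypothetical, and treating the clip setting as a minimum overlap length (`min_len`) is an assumption read off the two examples.

```python
def clip_end(read, adapter, min_len):
    """Remove the longest adapter prefix (at least min_len bases)
    that appears at the end of the read."""
    for n in range(min(len(adapter), len(read)), max(min_len, 1) - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    return read


def clip_begin(read, adapter, min_len):
    """Remove the longest adapter suffix (at least min_len bases)
    that appears at the beginning of the read."""
    for n in range(min(len(adapter), len(read)), max(min_len, 1) - 1, -1):
        if read.startswith(adapter[-n:]):
            return read[n:]
    return read


adapter = "ATGTGC"

# 'end' case: the read finishes with ATGT, the first 4 bases of the adapter.
print(clip_end("TTTTTTTTTTTTTATGT", adapter, 4))

# 'begin' case: the read starts with GTGC, the last 4 bases of the adapter.
print(clip_begin("GTGCTTTTTTTTTTTTTT", adapter, 4))
```

In both cases only the T run is left. A read with fewer than `min_len` matching bases at the relevant end is returned unchanged.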