Re: [solexaqa-users] Effect of DynamicTrim on paired end insert size

Murray Cox

Aug 3, 2012, 5:50:36 PM
to solexaq...@googlegroups.com
Hi Dave,

This is a really good question.

As far as I know, this hasn't been tested explicitly. However, you could probably make some predictions.

If I remember my Illumina chemistry right, paired-end reads are sequenced inward from both ends of the DNA fragment. Because base call quality declines steeply toward the end of each read, the good-quality bases should mostly sit at the two ends of the fragment, with the poorer-quality bases toward the middle. Trimming those internal bases should therefore increase the mean insert size, taking insert size to mean the gap between the two reads.

Bases at the beginning of a read can also be poor, and trimming these would make the DNA fragment appear shorter (the outer distance spanned by the pair shrinks), although it would not reduce the insert size. However, I suspect that this effect is less important than the extremely poor quality of bases at the ends of reads.
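The geometry behind these two predictions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and parameters are made up for the example, "insert size" is taken to mean the unsequenced gap between the two reads, and the fragment is assumed to be at least twice the read length (a negative gap would mean the reads overlap).

```python
def pair_geometry(fragment_len, read_len, trim5=(0, 0), trim3=(0, 0)):
    """Return (inner_gap, outer_span) for one read pair after trimming.

    Coordinates run along the fragment from 0 to fragment_len.
    Read 1 starts at the left end and reads rightwards; read 2 starts
    at the right end and reads leftwards.
    trim5 / trim3 are the bases removed from the start / end of
    (read 1, read 2). All names here are hypothetical.
    """
    # Read 1 covers [r1_start, r1_end) after trimming.
    r1_start = trim5[0]
    r1_end = read_len - trim3[0]
    # Read 2 covers [r2_start, r2_end); its 5' end is the fragment's right end.
    r2_end = fragment_len - trim5[1]
    r2_start = fragment_len - read_len + trim3[1]
    inner_gap = r2_start - r1_end    # unsequenced bases between the reads
    outer_span = r2_end - r1_start   # distance between the outermost kept bases
    return inner_gap, outer_span


print(pair_geometry(300, 100))                  # untrimmed: (100, 300)
print(pair_geometry(300, 100, trim3=(20, 10)))  # 3' trimming: (130, 300)
print(pair_geometry(300, 100, trim5=(5, 5)))    # 5' trimming: (100, 290)
```

Trimming the ends of the reads widens the gap but leaves the outer span alone, while trimming the starts does the opposite.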

Given that base call quality varies widely between reads, I would also expect some increase in the variance of insert sizes.

As you say, all this should be testable. If you have reads from a known reference, mappers (like bwa) will typically give you an estimate of the insert size mean and variance. It would be interesting to see what happens to untrimmed and trimmed data in practice.
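If you would rather compute the numbers yourself than rely on the mapper's log, a rough sketch along these lines should work on SAM output. It assumes standard SAM columns (FLAG in column 2, TLEN in column 9) and uses only properly paired reads; the function name and the example records are invented for illustration. Note that TLEN, as I understand the SAM specification, spans the outermost mapped bases, so it measures the outer distance rather than the gap between the reads.

```python
import statistics


def insert_size_stats(sam_lines):
    """Mean and standard deviation of template lengths from SAM text lines."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        # Keep properly paired reads (flag bit 0x2) and count each pair
        # once by using only the mate that carries the positive TLEN.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    if len(sizes) < 2:
        raise ValueError("need at least two properly paired reads")
    return statistics.mean(sizes), statistics.stdev(sizes)


# Made-up records for two read pairs, purely to show the call.
example = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\tACGT\tIIII",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\tACGT\tIIII",
    "r2\t99\tchr1\t500\t60\t50M\t=\t750\t300\tACGT\tIIII",
    "r2\t147\tchr1\t750\t60\t50M\t=\t500\t-300\tACGT\tIIII",
]
mean, sd = insert_size_stats(example)
print(mean, sd)
```

Running the same function over the untrimmed and the trimmed alignments would give the two means and standard deviations to compare.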

In general, I'm not sure how much of an effect this would have in any practical setting. For the majority of reads, only a few tens of bases are trimmed at most. Given that the original DNA fragment size is quite variable anyway (say, on the order of 100 bases), I suspect that the effects of trimming (on the order of tens of bases) will just tend to add noise within the more dominant variation in insert sizes that existed when the DNA fragments were originally made.
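As a back-of-the-envelope check on that last point: if the trimming effect is independent of the original fragment length, the two sources of variation add as variances rather than as standard deviations. The sketch below assumes a fragment standard deviation of 100 bases (the figure above) and a trimming standard deviation of 20 bases, which is just an illustrative stand-in for "tens of bases".

```python
import math


def combined_sd(fragment_sd, trim_sd):
    """Standard deviation of the sum of two independent sources of variation."""
    return math.sqrt(fragment_sd ** 2 + trim_sd ** 2)


# 100 and 20 are assumed, illustrative values, not measurements.
print(round(combined_sd(100, 20), 1))  # prints 102.0
```

Under these assumptions, trimming would raise the spread from 100 to about 102 bases, which is why I would expect it to be hard to see in practice.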

Best
-Murray


> Hi Murray
>
> I'm interested in your DynamicTrim algorithm because of its conservative nature but I'm wondering about the effect on insert sizes. If the longest contiguous sequence with quality scores greater than a threshold often did not include the sequences from the first few cycles, am I right in thinking this would decrease the mean insert size?
>
> Presumably this would be a rare occurrence as qualities are usually high at the start of the read? Insert sizes could be reassessed anyway after trimming and then mapping paired reads back to contigs assembled without insert size information? I'm trying to assess if the advantages of paired end insert libraries would be diminished. Retaining the higher quality parts of the reads could however compensate.
>
> cheers
>
> DaveW
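
For reference, the trimming rule described in the quoted message (keep the longest contiguous run of bases whose quality exceeds a threshold) can be sketched as below. This is an illustration of that description only, not the actual DynamicTrim code; the function name, the tie-breaking rule (first longest run wins) and the example scores are assumptions.

```python
def longest_good_segment(quals, threshold):
    """Return (start, end) of the longest run with quality > threshold.

    The indices are half-open, so quals[start:end] is the kept segment.
    If no base passes, (0, 0) is returned and nothing is kept.
    """
    best_start, best_len = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q > threshold:
            if run_start is None:
                run_start = i             # a new passing run begins here
            run_len = i - run_start + 1
            if run_len > best_len:        # strictly longer, so ties keep the first run
                best_start, best_len = run_start, run_len
        else:
            run_start = None              # run broken by a low-quality base
    return best_start, best_start + best_len


quals = [10, 30, 32, 35, 12, 33, 31]      # made-up Phred scores
start, end = longest_good_segment(quals, 20)
print(start, end)                         # prints 1 4
print("trimmed from start:", start, "trimmed from end:", len(quals) - end)
```

In this example one base is lost from the start of the read and three from the end, which is the case Dave is asking about.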

MPC

Aug 6, 2012, 5:20:53 PM
to solexaq...@googlegroups.com
Yes, exactly. I must say, I'm now rather curious to see some actual numbers on this.

Best
-Murray
