Linking non-overlapping paired end reads


Shauna Baillie

Dec 22, 2015, 10:41:57 AM
to pear-users
Hello,
I am trying to link non-overlapping paired-end reads using PEAR v0.9.8. The target amplicon length is 316 bp and the read length is 151 bp. Attached are my PrinSeq results for one Illumina MiSeq library; high quality, ~2 million reads. However, when I attempt to link forward and reverse reads, presumably using positional data, the majority of assembled reads are much shorter than 2 x 151 bp, even though there should be zero overlap. I disabled the statistical test and have experimented with different settings; when I set the minimum assembled fragment length to 300 bp, I get zero assembled fragments.

I am not a programmer and am new to using these data. Is there a simple script that I can use to link forward and reverse reads, and ideally place an 'N' or some other symbol at the join?



Joshua Herr

Dec 22, 2015, 1:20:21 PM
to Shauna Baillie, pear-users
So we received a couple of messages from you at the PEAR users list-serve, and I'll respond because I don't think this is a PEAR issue.  Sounds like you are a little confused.

> I am trying to link non-overlapping paired end reads using Pear v0.9.8. The target amplicon length is 316bp and the read length is 151bp.

First off, if you don't have overlapping paired end reads, why are you trying to merge them on the basis of overlapping regions (using PEAR)?  If they don't overlap, how do you intend to merge the reads? Why would you want to do this?

What exactly are you trying to do?  You didn't give us the most important detail: what type of data do you have? (amplicon, I presume?) and where does this data come from? (sample and marker region?)

Overlapping issues aside, if you have reads that are 150 bp in length and you set a minimum fragment length to 300 bp, anything you would merge by concatenation would be excluded, so there would logically be no results.  Even if you are trying to have overlapping regions, if your target length is 316 bp and you only sequence 151 bp, there won't be any sequence to overlap.  You would need 250 bp reads to merge overlapping regions for a 316 bp target.

The PEAR users forum is for issues with PEAR, not really general bioinformatics confusion, so you should check out forums such as biostars. There are numerous people on this list who are active there.  If you let us know exactly what you want to do, we can help you.

Warm regards and good luck to you.  Please feel free to email me directly or use the biostars forum if you have questions.

~ Josh 




Shauna Baillie

Dec 25, 2015, 7:14:22 AM
to Joshua Herr, pear-users
Hi Joshua,
Thanks for responding. I don't think I am confused, but that is a possibility considering we don't always know our blind spots. I did mention that I was using amplicons.

The article below states that non-overlapping paired end reads can be linked using PEAR. That is why I tried the program.
Zhang et al. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2014 Mar 1;30(5):614-20. doi: 10.1093/bioinformatics/btt593. Epub 2013 Oct 18.

Fig. 1. Three possible scenarios for paired-end read lengths and target DNA fragment lengths. (A) Short overlap between the paired-end reads; (B) no overlap between the paired-end reads; (C) single-end read length is larger than the target DNA fragment length.

Although PEAR does not perform as well with non-overlapping reads as it does with overlapping reads (see Table 1), I thought it might be worth a try.

The total length of the target amplicon (plus MID tags and primers) is 316 bp and the paired-end read length is 151 bp. The two reads together cover 2 x 151 = 302 bp, so I would only be missing 14 bp in the middle. This is a pilot project to assess deep sequencing of amplicons; I can get a good idea of genetic diversity without the 14 bp in the middle.

So, in short, I am trying to merge forward and reverse reads that do not overlap.
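
For reference, the kind of join being asked about can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not something PEAR or PANDAseq provides: it assumes untrimmed 151 bp reads from an exactly 316 bp fragment (hence a default gap of 14), Phred+33 quality encoding, and that the two FASTQ files list the pairs in the same order. The function names and the file names in the usage note are hypothetical.

```python
import itertools

# Complement table for A, C, G, T and N; any other character passes through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_pair(fwd_seq, fwd_qual, rev_seq, rev_qual, gap=14):
    """Join a read pair as: forward read + `gap` Ns + reverse-complemented reverse read.

    The gap size is an assumption (316 - 2 * 151 = 14); no overlap is searched for.
    Padding positions get quality '!' (Phred 0 in Phred+33 encoding).
    """
    seq = fwd_seq + "N" * gap + reverse_complement(rev_seq)
    qual = fwd_qual + "!" * gap + rev_qual[::-1]
    return seq, qual


def read_fastq(path):
    """Yield (header, sequence, quality) from a plain 4-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            record = [line.rstrip("\n") for line in itertools.islice(handle, 4)]
            if len(record) < 4:
                return
            yield record[0], record[1], record[3]


def join_files(fwd_path, rev_path, out_path, gap=14):
    """Write one joined record per read pair, keeping the forward read's header."""
    pairs = zip(read_fastq(fwd_path), read_fastq(rev_path))
    with open(out_path, "w") as out:
        for (name, fwd_seq, fwd_qual), (_, rev_seq, rev_qual) in pairs:
            seq, qual = join_pair(fwd_seq, fwd_qual, rev_seq, rev_qual, gap)
            out.write(f"{name}\n{seq}\n+\n{qual}\n")
```

It could be called as `join_files("sample_R1.fastq", "sample_R2.fastq", "sample_joined.fastq")`, where the file names are placeholders. If reads have been quality-trimmed, the true gap differs from read pair to read pair, so a fixed run of 14 Ns marks the join but no longer reflects the real missing length.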

Perhaps I will try PANDAseq; as you can see in Table 1 of Zhang et al. 2014, PANDAseq seems to have had better success than PEAR at merging non-overlapping paired-end reads.

All the best,
Shauna
-- 
Dr. Shauna Baillie, Adjunct Professor and Postdoctoral Fellow
Department of Biology – Dalhousie University 
1355 Oxford Street, Halifax, Nova Scotia, Canada, B3H 4J1
Email:s.m.b...@gmail.com
Cc email: Shauna....@Dal.Ca
Web: http://www.bentzenlab.ca/people/ 

Joshua Herr

Dec 25, 2015, 7:14:22 AM
to Shauna Baillie, pear-users
Sorry if I was a little gruff; it's just hard being on the other end when we don't have all the details we need to point you in the right direction.

Assuming you have random (or non-random) amplicons for a genetic study of a non-model organism -- why do you need to merge the reads?  What is your downstream application of the reads -- merged or otherwise?  Map to a public or internal reference?  Can this be accomplished with just the forward and reverse reads?  What is the next step of the data analysis algorithm?

I'm a fan of both PEAR and PANDAseq, but I'm not clear on why you would want to merge your reads if they don't overlap, and as a result I don't know if these are the right tools.

If this is pertinent to the PEAR listserve, please use it.  Otherwise, feel free to email me and I can try my best to help you.

Cheers ~ Josh

