general questions about join_paired_ends.py

397 views
Skip to first unread message

Vivi

unread,
Apr 17, 2014, 10:52:14 AM4/17/14
to qiime...@googlegroups.com


Hi everybody, 

I am very new to the environment and I am desperately try to find a program/script to merge paired end sequences from Illumina. I am working with double digest, paired end data. For every individual I have 2 files containing the first and the paired read respectively, each of which contain reads 86 bp long (they have been already filtered for quality and barcodes+restriction site have been trimmed away).

I have been trying to use join_paired_ends.py and it seems to work fine but I don't fully understand how it handles several situations, and I don't seem to be able to find an explanation anywhere in the web. It could be great if some of you with more experience on this than me could answer some of my questions :) If you are aware of any group/page where these issues are discussed it could be great even just to have a link to :)

For example:

           What happens if reads do not overlap at all? 

No output with unmatched paires is created and I don't seem to have any read in the output that is simply the sum of the first and paired original read. The length range in my output is min 86 bp and maximum 164 bp (so I don't have any that is 86*2=172 bp).

           What happens to reads that don't have a mate? Are they discarded or kept as 86 bp long reads?

Again, I don't get any output with singletons and I can't find any explanation on what happens online. Are the 86 bp reads in my output unpaired reads or the result of two reads that overlap completely?

           What happens if an overlap has too many differences (mismatches) between the two reads or if the reads have different bases in the overlap but same quality score?

           Does join_paired_ends.py requires reads in the same order in the first and paired read's files?


I will be extremely thankful to any help provided.

Many thanks already from now,

Best regards and Happy Easter

Vivi

 

 

Kyle Bittinger

unread,
Apr 17, 2014, 2:41:51 PM4/17/14
to qiime...@googlegroups.com
Forwarding to the maintainer of this script...


--

---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mike R

unread,
Apr 17, 2014, 4:04:33 PM4/17/14
to qiime...@googlegroups.com
Hi Vivi! I'll answer inline...


           What happens if reads do not overlap at all? 

No output with unmatched paires is created and I don't seem to have any read in the output that is simply the sum of the first and paired original read. The length range in my output is min 86 bp and maximum 164 bp (so I don't have any that is 86*2=172 bp).


Actually, `join_paired_ends.py` will output the forward and reverse reads that have not been merged into their own separate files, see the documentation here. Remember, that pairs MUST overlap by a minimum number of bases in order to be joined. So, you should always expect the combined read length to be less than simply concatenating the reads end-to-end. If you do not provide a value for `--min_overlap`, then default values will be used for the `--pe_join_method` chosen. The defaults for fastq-join and SeqPrep is 6 bp and 15 bp respectively.

 

           What happens to reads that don't have a mate? Are they discarded or kept as 86 bp long reads?

The program may fail with an error if there are different numbers of reads in the two files. 

 

Again, I don't get any output with singletons and I can't find any explanation on what happens online. Are the 86 bp reads in my output unpaired reads or the result of two reads that overlap completely?


Not sure what you mean by singletons in this context. The reads probably completely overlap. What gene are you sequencing? Are the lengths of your sequencing products what you expect? Check the documentation linked to above, but generally you should have 4 output files (if using the `-b` option):

1) Assembled pairs
2) unassembled read 1 data
3) unassembled read 2 data
4) updated barcodes / index file (if using -b)

 

           What happens if an overlap has too many differences (mismatches) between the two reads or if the reads have different bases in the overlap but same quality score?

Generally speaking if there are to many mismatches / gaps, those reads will not be merged. I'd suggest reading the documentation on the fastq-join and SeqPrep programs (links provided on the script help page I linked to above), but every paired-end joining tool makes its own assumptions on how to deal with mis-matches.

 

           Does join_paired_ends.py requires reads in the same order in the first and paired read's files?

Yes, the reads MUST be in the same order for both files in order for most paired-end joining programs to work properly. In general, it is better to join the pairs first and then do sequence processing / quality control later. If you perform quality-control on your reads separately you may discard a paired read in one file but not the other, causing the two files to become out-of-sync. You also defeat the purpose of having paired-end data if you are performing quality control on the separate reads before joining. One of the advantages of joining the reads is to increase the confidence in the quality scores in areas of overlap, especially with the poor quality regions of each individual read. However, as with mismatch handling, how the quality scores are updated varies between paired-end joining programs.
 
-Hope this helps! :-)
-Mike

Vivi

unread,
Apr 18, 2014, 6:29:34 AM4/18/14
to qiime...@googlegroups.com
Dear Mike, 
 
This was extremely helpful, Thank you so much!

Vivi

unread,
Apr 23, 2014, 4:08:48 AM4/23/14
to qiime...@googlegroups.com
Dear Mike, 

I have one very last question about the script... :)
As i am working on 300 individuals I would like to give a template name for the outputs so that I can save the outputs from several individuals in the same folder with their unique names. 

According to the documentation for FastqJoin if in the command the output is defined as -o <read.%.fqI should obtain the three files (join un1 and un2) with the name I specified. 
You can supply 3 -o arguments, for un1, un2, join files, or one
argument as a file name template.  The suffix '
un1, un2, or join' is
appended to the file, or they replace a %-character if present.

However when I supply an output file names join_paired_ends.py creates a folder with whatever name I gave it... 
Do you know if there is anyway of defining the name of the output files and not of the folder they are contained i? 

Thank you very much already now for your great help! :)
Best regards
Vivi

Mike R

unread,
Apr 23, 2014, 9:41:12 AM4/23/14
to qiime...@googlegroups.com
We decided to not allow the use of '%' to avoid unforeseen issues with the application controller. For the moment, the script does not allow the file name to be specified. 

-Mike

Vivi

unread,
Apr 23, 2014, 9:47:09 AM4/23/14
to qiime...@googlegroups.com
Ok, is very good to know anyway :)
Many thanks once again! 
Reply all
Reply to author
Forward
0 new messages