How to fix reverse complements in R1 and R2 files that have barcodes and primers


anya brown

Aug 13, 2015, 10:21:03 PM
to Qiime Forum
Hi all,

I am currently at the STAMPS workshop, and a few of us are struggling to get Illumina MiSeq sequence data from the MrDNA lab into a usable format. We recently contacted the company and found out that:
R1 and R2 data are generated with random ligation rather than long primer concatamer PCR with adapters and barcodes integrated. Thus, the R1 and R2 files do have reads in both directions.

We are trying to run extract_barcodes.py, but the R1 file contains reads that are reverse complements of the barcoded forward reads (with primers). The R2 file contains reverse reads, but some of its reads are the reverse complement of the forward read (without barcodes and with primers).
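
For illustration, here is a minimal Python sketch of how one could check which orientation a given read is in by searching for the primers. It is not part of QIIME. The primer sequences are the commonly published 515F and 806R sequences, used here only as placeholders (an assumption); substitute the primers from your own mapping file.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Placeholder primers (assumption): replace with your own sequences.
FWD_PRIMER = "GTGCCAGCMGCCGCGGTAA"
REV_PRIMER = "GGACTACHVGGGTWTCTAAT"


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))


def classify_read(seq, fwd_primer, rev_primer):
    """Report which primer a read contains: forward, reverse, or unknown."""
    if primer_regex(fwd_primer).search(seq):
        return "forward"
    if primer_regex(rev_primer).search(seq):
        return "reverse"
    return "unknown"
```

Running classify_read over a few hundred reads from each of R1 and R2 gives a quick estimate of how mixed the orientations are in each file.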

We have discussed:
1. running extract_barcodes.py on just the R1 file with the reorientation flag (-a),
2. running extract_barcodes.py a second time on the R2 file, with a new mapping file that has the reverse primer in the barcode column, and
3. joining the reads with join_paired_ends.py,
and then proceeding with the steps in the Illumina tutorial.

Has anyone encountered this problem, or worked with paired-end Illumina data from MRDNA? If so, what did you do?

We've stumped our TAs. They are pretty cool though. 

Thanks

Tony Walters

Aug 17, 2015, 7:56:51 PM
to qiime...@googlegroups.com
Hello Anya,

Perhaps this approach would work. It would also avoid the duplicate reads that could arise from the approach you posted, although those may not be a problem, since they shouldn't have valid barcodes that get past the split_libraries_fastq.py step:

1. extract_barcodes.py with --attempt_read_reorientation and --input_type barcode_paired_end --bc1_len X --bc2_len 0, where X is the length of the barcodes. Pass in the separate R1/R2 reads with -f and -r.
2. join_paired_ends.py with the extracted read 1 and 2 from step 1, and the barcodes from step 1 passed as -b.
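
To make the barcode step concrete: with --bc1_len X --bc2_len 0, the barcode comes only from the first X bases of read 1. The sketch below is a simplified Python illustration of that slicing on a single FASTQ record. It is not QIIME's own code, and the function name split_barcode is hypothetical.

```python
def split_barcode(header, seq, qual, bc_len):
    """Split one FASTQ record into a barcode record and a trimmed read.

    Simplified illustration (assumption): the barcode is the first
    bc_len bases of the read, and the quality string is cut the same way.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    barcode = (header, seq[:bc_len], qual[:bc_len])
    read = (header, seq[bc_len:], qual[bc_len:])
    return barcode, read


# Example with an 8-base barcode followed by the start of a primer.
barcode, read = split_barcode("@read1", "ACGTACGTGTGCCAGC", "IIIIIIIIHHHHHHHH", 8)
print(barcode)
print(read)
```

If a read has no barcode at its start, the same slice removes the first 8 bases of whatever is there, which is worth keeping in mind when inspecting the barcodes file.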


Theresa

Aug 24, 2015, 11:43:10 AM
to Qiime Forum
Hi Anya,
   Have you tried any of this yet?
Theresa

Theresa

Aug 28, 2015, 6:21:08 PM
to Qiime Forum
Hi Tony and Anya,

Tony, I tried out your suggestions, starting with extract_barcodes.py and then joining the pairs (all info below). I checked the barcodes file, and in some cases it looks like there may not have been a barcode, so the script stripped off the first 8 bp of the primer instead. Additionally, when I tried joining the pairs, a significant fraction remained unjoined, and those that did join are about 400 bp long for a region that should be about 300 bp (515F/805R).

extract_barcodes.py -f /home/qiime/Desktop/Alex_16s_qiime/R1.fastq -r /home/qiime/Desktop/Alex_16s_qiime/R2.fastq -c barcode_paired_end -m /home/qiime/Desktop/Alex_16s_qiime/fasta-qual-mapping-files_from_MRDNA/validate_mapping_file_output/16s_mapping_corrected.txt --attempt_read_reorientation --bc1_len 8 --bc2_len 0 -o /home/qiime/Desktop/Alex_16s_qiime/trial_2/processed_seqs_removedbarcodes

Traceback (most recent call last):
  File "/usr/local/bin/extract_barcodes.py", line 175, in <module>
    main()
  File "/usr/local/bin/extract_barcodes.py", line 171, in main
    opts.attempt_read_reorientation, disable_header_match)
  File "/usr/local/lib/python2.7/dist-packages/qiime/extract_barcodes.py", line 79, in extract_barcodes
    forward_primers, reverse_primers = get_primers(header, mapping_data)
  File "/usr/local/lib/python2.7/dist-packages/qiime/extract_barcodes.py", line 520, in get_primers
    raise IndexError(("Mapping file is missing ReversePrimer field."))
IndexError: Mapping file is missing ReversePrimer field.

I made a new mapping file that includes a ReversePrimer column and validated it with validate_mapping_file.py.

Then I retried extract_barcodes.py with the new mapping file:
extract_barcodes.py -f /home/qiime/Desktop/Alex_16s_qiime/R1.fastq -r /home/qiime/Desktop/Alex_16s_qiime/R2.fastq -c barcode_paired_end -m /home/qiime/Desktop/Alex_16s_qiime/trial_2/validate_mapping_file_output/new_mappingfile_corrected.txt --attempt_read_reorientation --bc1_len 8 --bc2_len 0 -o /home/qiime/Desktop/Alex_16s_qiime/trial_2/processed_seqs_removedbarcodes

count_seqs.py -i fastq -o ./count_seqs.txt
barcodes file: 4335142
reads 1: 281.1002 +/- 14.7983  4335142
reads 2: 289.3880 +/- 14.8808  4335142


Then tried joining the pairs

join_paired_ends.py -f /home/qiime/Desktop/Alex_16s_qiime/trial_2/processed_seqs_removedbarcodes/reads1.fastq -r /home/qiime/Desktop/Alex_16s_qiime/trial_2/processed_seqs_removedbarcodes/reads2.fastq -b /home/qiime/Desktop/Alex_16s_qiime/trial_2/processed_seqs_removedbarcodes/barcodes.fastq -o /home/qiime/Desktop/Alex_16s_qiime/trial_2/fastq-join_joined

Checked all the resulting files with count_seqs.py:
216338  : /home/qiime/Desktop/Alex_16s_qiime/trial_2/fastq-join_joined/fastqjoin.join.fastq (Sequence lengths (mean +/- std): 400.4612 +/- 59.1968)
4118804  : /home/qiime/Desktop/Alex_16s_qiime/trial_2/fastq-join_joined/fastqjoin.un1.fastq (Sequence lengths (mean +/- std): 280.5051 +/- 14.9417)
4118804  : /home/qiime/Desktop/Alex_16s_qiime/trial_2/fastq-join_joined/fastqjoin.un2.fastq (Sequence lengths (mean +/- std): 288.7916 +/- 15.0263)
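
As a rough sanity check on these numbers, the overlap implied by a joined length can be estimated as read 1 length + read 2 length - joined length. The Python sketch below uses the mean lengths reported above; the 300 bp target is the expected amplicon length mentioned earlier, and the result is only approximate because the means for the joined subset may differ from the means for all reads.

```python
def implied_overlap(read1_len, read2_len, joined_len):
    """Bases shared by two reads if they stitch into joined_len bases."""
    return read1_len + read2_len - joined_len


# Mean read lengths reported by count_seqs.py after barcode removal.
mean_r1, mean_r2 = 281.1, 289.4

# Overlap implied by the observed joined reads of about 400 bp.
observed_overlap = implied_overlap(mean_r1, mean_r2, 400.5)

# Overlap expected if the amplicon is about 300 bp (assumed target length).
expected_overlap = implied_overlap(mean_r1, mean_r2, 300.0)

# Fraction of read pairs that were stitched in this run.
joined, unjoined = 216338, 4118804
joined_fraction = joined / (joined + unjoined)

print(observed_overlap, expected_overlap, joined_fraction)
```

With a 300 bp amplicon the two reads should overlap over nearly their whole length, so joined reads that overlap by much less than that are probably not the expected product.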

Not sure where to go from here....
Thanks again,
Theresa

Tony Walters

Aug 28, 2015, 6:36:14 PM
to qiime...@googlegroups.com
So we've fixed the issue of getting the barcodes out (and orienting the reads so we get them off the correct side). Now the issue is the yield for join_paired_ends.py. For that, I can only suggest tweaking the parameters of that script to see if you can increase the number of successfully joined reads. If it can't be stitched with any of the stitching methods/parameters, then it may be better to just use one of the reads (e.g. read 1 and the barcodes) for input with split_libraries_fastq.py.
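
One way to organize that parameter tweaking is to generate a handful of join_paired_ends.py command lines and compare their yields. The Python sketch below only builds the command strings and does not run them. It assumes the -m (join method), -j (minimum overlap), and -p (percent maximum difference) options as remembered from QIIME 1.9, so confirm them with join_paired_ends.py -h. The output directory names and parameter values are hypothetical.

```python
import shlex


def join_command(fwd, rev, barcodes, out_dir, method="fastq-join",
                 min_overlap=None, perc_max_diff=None):
    """Build a join_paired_ends.py command line as a string.

    Flag names are assumptions based on QIIME 1.9; verify with -h.
    """
    cmd = ["join_paired_ends.py", "-f", fwd, "-r", rev,
           "-b", barcodes, "-o", out_dir, "-m", method]
    if min_overlap is not None:
        cmd += ["-j", str(min_overlap)]
    if perc_max_diff is not None:
        cmd += ["-p", str(perc_max_diff)]
    return shlex.join(cmd)


# Hypothetical variants to compare; adjust values and paths to your data.
inputs = ("reads1.fastq", "reads2.fastq", "barcodes.fastq")
commands = [
    join_command(*inputs, "joined_fastqjoin_relaxed",
                 min_overlap=10, perc_max_diff=15),
    join_command(*inputs, "joined_seqprep", method="SeqPrep"),
]

for command in commands:
    print(command)
```

Running count_seqs.py on each output directory afterwards shows which variant stitches the most pairs at a plausible length.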


Theresa

Aug 31, 2015, 12:42:51 PM
to Qiime Forum
Tony,
    The sequencing center states the following:
"This results in the amplicons being found in both 5’-3’ as usual.. and 3’-5’ orientation in the r1 and r2 files.  To process the r1 and r2 files >300bp: join the reads together. Look for barcodes at the 5’ end, also find reverse compliment barcodes at the 3’ end of the joined reads.   Reverse compliment the sequences containing barcodes at the 3’ end.  To process the r1 and r2 files <300bp: just use the forward reads from the r1 and r2 files."

I have data sets that represent both cases (some >300 bp and some <300 bp). I am very confused by this, because I thought R1 was supposed to hold the forward reads and R2 the reverse reads, but here they are saying that forward and reverse reads are found in both R1 and R2 (I have checked, and this is the case). Does this mean I can't technically use the flags -f and -r for R1 and R2, respectively? For the >300 bp data, does this mean I should attempt read reorientation and use --rev_comp_bc1, then split libraries, and take the primers off after that? For the <300 bp data, how would I even go about pulling the forward reads out of each file and combining them into one file so I can extract the barcodes and then split libraries?
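
For reference, the sequencing center's instructions for joined reads can be sketched in a few lines of Python: look for a known barcode at the 5' end, and if none is found, reverse complement the read and look again (a reverse-complemented barcode at the 3' end becomes a barcode at the 5' end). This is only an illustration of the idea, not a QIIME script. It assumes exact barcode matches and barcodes of equal length, and the name orient_by_barcode is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def orient_by_barcode(seq, barcodes):
    """Return (barcode, oriented_seq), or (None, seq) if nothing matches.

    Assumptions: all barcodes have the same length and must match exactly.
    """
    bc_len = len(next(iter(barcodes)))
    if seq[:bc_len] in barcodes:
        return seq[:bc_len], seq
    flipped = reverse_complement(seq)
    if flipped[:bc_len] in barcodes:
        return flipped[:bc_len], flipped
    return None, seq


# Example with one made-up barcode.
barcodes = {"ACGTACGT"}
forward_read = "ACGTACGT" + "GGGGCCCCTT"
print(orient_by_barcode(forward_read, barcodes))
print(orient_by_barcode(reverse_complement(forward_read), barcodes))
```

Short barcodes can match by chance, so reads where both ends look like a barcode would need extra care.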

thanks!

Tony Walters

Aug 31, 2015, 1:08:50 PM
to qiime...@googlegroups.com
Theresa,

There are a bunch of questions here spanning extract_barcodes.py, join_paired_ends.py, and split_libraries_fastq.py.

The nomenclature of R1 and R2 may be confusing, but R1 was the first read on the sequencer, and R2 was the second read, no matter what the orientation of the particular read (relative to the SSU primers being used) was.

In the output of extract_barcodes.py, the reads (at least most of them, wherever it could find the primer sequences) *should* be oriented, so the forward reads correspond to R1 and the reverse reads to R2. You don't have to reverse complement the barcodes, because the script attempts to orient each read first, which already takes care of a barcode that was in reverse orientation.

We haven't addressed removing the primers from the reads. Leaving them in has only a minor impact on the results, since they are SSU sequences after all, but the standard approach is to remove them. I think that should be handled after you either get better yields from the stitching process or choose to go with the R1 data alone.

So a modified approach to the original one:


1. extract_barcodes.py with --attempt_read_reorientation and --input_type barcode_paired_end --bc1_len X --bc2_len 0, where X is the length of the barcodes. Pass in the separate R1/R2 reads with -f and -r.
(This is complete. What we have after this step are the barcodes file and oriented R1/R2 files.)


2. join_paired_ends.py with the extracted read 1 and 2 from step 1, and the barcodes from step 1 passed as -b.
(This ran, but gave low yields. Try different parameters with join_paired_ends.py, or a different method with -m, to see if you can increase yields. If you can, go with 3a. If you can't, go with option 3b below.)

Step 3 uses extract_barcodes.py to remove the primers, by specifying the primer lengths as the barcode lengths.
3a. Use the stitched output and barcodes fastq output from step 2 with another call to extract_barcodes.py. In this case, the parameters will be --input_type barcode_paired_stitched --bc1_len X --bc2_len Y, with X as the length of the forward primer and Y as the length of the reverse primer. The output reads file (not the barcodes) will be used with step 4. The barcodes output from step 2 should be used with step 4 for the barcodes fastq input.

3b. Use the R1 fastq reads file from step 1 as input to extract_barcodes.py. The parameters will be --input_type barcode_single_end --bc1_len X, where X is the length of the forward primer. The output reads file from this step will be used as the reads input to step 4, and the barcodes output from step 1 will be used as the barcode fastq input for step 4.

4. Demultiplexing with split_libraries_fastq.py. Using the input reads (-i) and barcodes (-b) as described in whichever step 3 you followed, demultiplex the data.
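
To show what step 3a is meant to achieve, here is a simplified Python sketch of trimming a stitched read at both ends: the forward primer length comes off the 5' end and the reverse primer length off the 3' end, which is how barcode_paired_stitched with --bc1_len X --bc2_len Y is understood to behave here (an assumption, not QIIME's own code). The function name trim_stitched and the primer lengths of 19 and 20 are placeholders.

```python
def trim_stitched(seq, qual, fwd_len, rev_len):
    """Remove fwd_len bases from the 5' end and rev_len from the 3' end.

    Simplified illustration (assumption) of primer removal on a
    stitched read; quality scores are trimmed the same way.
    """
    end = len(seq) - rev_len
    if fwd_len < 0 or rev_len < 0 or end < fwd_len:
        raise ValueError("read is shorter than the combined primer lengths")
    return seq[fwd_len:end], qual[fwd_len:end]


# Example: placeholder primer lengths of 19 (forward) and 20 (reverse).
stitched = "A" * 19 + "CCGGTT" + "T" * 20
quality = "I" * len(stitched)
trimmed_seq, trimmed_qual = trim_stitched(stitched, quality, 19, 20)
print(trimmed_seq, trimmed_qual)
```

For step 3b the same idea applies with rev_len set to 0, since only the forward primer at the start of R1 is removed.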


Theresa

Sep 1, 2015, 2:14:22 PM
to Qiime Forum
Thanks for your patience, Tony!

I tried the SeqPrep method, and it joined 90% of my sequences. The joined reads were around 280 bp instead of >400 bp. Now for split libraries....