PE MiSeq reads won't pair after trimming

ashleyb

unread,

Feb 6, 2015, 5:21:18 PM2/6/15

to qiime...@googlegroups.com

Hello,
I am having some issues pairing my MiSeq reads after trimming with cutadapt. My scheme uses custom sequencing primers that mirror the Caporaso 16S design, but use ITS specific primers (BITSF/B58S3R). Due to the variable size of the ITS region, I have many reads in which my amplicon was shorter than the 250 bp Illumina read leaving me with linker/pad/adapter/junk to trim from the 3' ends. I did this using Cutadapt, which seems to have worked fine. I checked the sequence files to ensure the adapters/etc were removed in the correct directions and nothing looks weird to me in the fastq files.

The issue is that when I use these trimmed fastq files in join_paired_ends.py, most of my reads are not pairing. After investigation, I have narrowed the issue down to reads in the original fastq files that were untrimmed, 250bp (no adapter, assumed target >250bp) paired ok. Reads containing primer/adapter and subsequently trimmed will not pair (which is most of the dataset). The strange thing is that I am able to assemble those trimmed reads that will not pair in QIIME using the assembler in the Geneious software, so I know they overlap. I've blasted some of the trimmed reads to verify they are matching ITS. I was wondering if anyone else has run into this issue? I hope I've just missed something simple! Do the read lengths for R1 and R2 need to be the same? Although I've worked a bunch with 16S, this was our first test sequence run trying out this ITS design...

Here is the QIIME script:
MacQIIME Cultures-Lab:ITS_AH $ join_paired_ends.py -f R1_ITS_trimmed.fastq -r R2_ITS_trimmed.fastq -o TrimmedJoined_j20_p5 -j 20 -p 5

Example of a read pair that contained no adapter seq (untrimmed by cutadapt). These will stitch together in QIIME:
@HWI-M01323:210:000000000-D0DYK:1:1101:15218:1347 1:N:0:
NTTAATATTTTAAAATTTCCAGTTACGAAAATTCTTGTTTTTGACAAAAATTTAATGAATAGATAAAATTGTTTGTGTTTGTTAACCTCTGTGCCCCGATTGCTCGAATGCCCAAAGAAAAAGTTGCAAAGATATGAAAACTCCACAGTGTGTTGTATTGAAACGGTTTTAATTGTCCTATAACAAAAGCACAGAAATCTCTCACCGTTTGGAATAGCAAGAAAGAAACTTACAAGCCTAGCAAGACCGC
+
#>>1>>FFFFBFGG1FFGGFGFHHHHGGHE0GHDHGGHHHHGEFHHBAGEGHHB2DGFFHGHBGF111DGFGHHHHBEGGEHHHFGGHGFH22BDCF/EEEEFFF>/0FE1BFB/0B0E0B0?GG1FDEFB0BBB22BGFFGFGFGCCG1F1GHHFFHHFF1@DFFGGGGGFFGFHHDHEH1FFF1><CAGFHHEGFGBD0DHHBCGHGHCE/CCG:0;GCGHFBHHGF0C9FGBFEEGFE/:0;CFC-;
@HWI-M01323:210:000000000-D0DYK:1:1101:15218:1347 3:N:0:
TTAAAGAAATTTAATAATTTTGAAAATGGATTTTTTTTTTTGTTTTGGCAAGAGCATGAGAGCTTTTACTGGGCAAGAAGACAAGAGATGGAGAGTCCAGCCGGGCCTGCGCTTAAGTGCGCGGTCTTGCTAGGCTTGTAAGTTTCTTTCTTGCTATTCCAAACGGTGAGAGATTTCTGTGCTTTTGTTATAGGACAATTAAAACCGTTTCAATACAACACACTTTGGAGTTTTCATATCTTTGCAACTT
+
>1A111DFFFFFG3FFBG33EG1D11D1BEGCBHHEEGGGGEGFC0CE11B/E001@DBC@B?FGHHH1GF1/>0?>000100000B0B1B001EE11DGFEEE//>/?G//@CC@GDGB/@/A/<AFH111>FC10-000==0<=DGF0DGFDFEHB<00;C....;CF0:.000CBFGB;9F0:C;/;;;900B9.0;0;B0;.;ACF///:///9/;A-;///9//9BB/BBBBBF9FBF;/9//;F

Paired (from fastqjoin.join.fastq).:
@HWI-M01323:210:000000000-D0DYK:1:1101:15218:1347 1:N:0:
NTTAATATTTTAAAATTTCCAGTTACGAAAATTCTTGTTTTTGACAAAAATTTAATGAATAGATAAAATTGTTTGTGTTTGTTAACCTCTGTGCCCCGATTGCTCGAATGCCCAAAGAAAAAGTTGCAAAGATATGAAAACTCCACAGTGTGTTGTATTGAAACGGTTTTAATTGTCCTATAACAAAAGCACAGAAATCTCTCACCGTTTGGAATAGCAAGAAAGAAACTTACAAGCCTAGCAAGACCGCGCACTTAAGCGCAGGCCCGGCTGGACTCTCCATCTCTTGTCTTCTTGCCCAGTAAAAGCTCTCATGCTCTTGCCAAAACAAAAAAAAAAATCCATTTTCAAAATTATTAAATTTCTTTAA
+

Example of a read pair that had adapter seq trimmed by cutadapt. These will not stitch in QIIME but assemble in Geneious:
Untrimmed:
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 1:N:0:
NTGATTTATTTATGGTTTTACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAATGATCCTTCCGCAGGTGGCTGACTGACTGGAAACCACCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATAACTTACCTACCTCTCAACTATTA
+
#>>AAAFFFFFFGG4BFGGGCGHHH4FGDGFGHGHHHHHHHBGHHGAGGGHBFDE2GGHHFF3AEC>>EG?AE@FHHFHDGG@@>E<><C<<CGHHHHEFC?D<-FFGCGFHGFFHHGHBDHHDDGEDGGGGCDAED/FHHHG00:CEEAEFFB00CFGFBDAB?@.BEGFFFEFFB9BFEFFFFFA?EFFA/9BB.;9BBB/;99.ABF/:F//.99B?@@-;-./:9////////.//:;//;//9;B
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 3:N:0:
TTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCACTTGTTGCCTCGGCGGATCAGCCCGCTCCCGGTAAAACGGGACGGCCCGCCAGAGGACCCCTAAACTCTGTTTCTATATGTAACTTCTGAGTCAAACCATAAATAAATCAAAACTTTCAGCAACGGATCTCACAATTACCATAGTGTAGATATCGGTGGTCGACGTATCATTCACAAAAAAAAAAAACACCCTACTATAACTCTCGTAAC
+
ABBBAA3@ABFFGFEGGGGFGCFGGGEFFHGDGGFG5BFHFFHCBEGEEGHGEGFG0EEFHFHGGGGCG?E?>EFHHHGGGDE/</>@C/BCCFEC0/<ADEEHHGHGHF?D11FGFGHFGHFGHHG111=>0=0D<.<C00DDDHBGFGH00CGFCC0C0CCGGGDE-BFBBBB0B0;CBFGG0;;009;000.9:AFA/9;;-9.BAB;B/;/;;FFEFF-;---9.9....://///;///9.....

Trimmed:
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 1:N:0:
NTGATTTATTTATGGTTTTACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAATGATCCTTCCGCAGGT
+
#>>AAAFFFFFFGG4BFGGGCGHHH4FGDGFGHGHHHHHHHBGHHGAGGGHBFDE2GGHHFF3AEC>>EG?AE@FHHFHDGG@@>E<><C<<CGHHHHEFC?D<-FFGCGFHGFFHHGHBDHHDDGEDGGGGCDAED/FHHHG00:CEEAEFFB00CFGFBDAB?@.
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 3:N:0:
TTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCACTTGTTGCCTCGGCGGATCAGCCCGCTCCCGGTAAAACGGGACGGCCCGCCAGAGGACCCCTAAACTCTGTTTCTATATGTAACTTCTGAGTCAAACCATAAATAAATCAAAACTTTCAGCAACGGATCTC
+
ABBBAA3@ABFFGFEGGGGFGCFGGGEFFHGDGGFG5BFHFFHCBEGEEGHGEGFG0EEFHFHGGGGCG?E?>EFHHHGGGDE/</>@C/BCCFEC0/<ADEEHHGHGHF?D11FGFGHFGHFGHHG111=>0=0D<.<C00DDDHBGFGH00CGFCC0C0CCGGGDE-BF

Thanks for any suggestions!!

Ashley

AssembledReads.docx

Andrew Krohn

unread,

Feb 6, 2015, 6:26:20 PM2/6/15

to qiime...@googlegroups.com

Hi Ashley,

I do quite a bit of sequencing with ITS2 fungal primers on 2x250 and 2x300 kits and I also use geneious a bit for some other stuff. I use a different primer set, but the idea is similar. According to Nick's paper, your primers will amplify products for ascomycetes (183.6 +/- 46.8) and basidiomycetes (219.8 +/- 56.9). If you are only interested in ascomycetes, read joining is fine, but you will bias your results against basidios or anything else that produces a longer ITS1 by joining and discarding unjoined reads. For this reason, I would advise against joining reads for this project if you are interested in anything besides ascomycetes. I have found my ITS2 data to behave similarly, and taxa plots of joined versus unjoined data confirm the bias.

Since you are using the Caporaso design, I will assume you used your locus specific primers (plus some extra sequence to crank up the Tm) for sequencing and that your trimming is specific to 3' primer contamination. Unless doing 515-806 on a 2x150 run, I think primer trimming should be standard. I'm not familiar with cutadapt, but it sounds as though it did something at least. I typically use fastq-mcf from ea-utils. Do you use the joined paired reads script much with 16S data? I find that 2x150 reads join readily while 2x250 or 2x300 reads simply do not. I think the reason is that the quality for most Illumina runs seems to start tanking around cycle 500, but with a marked decrease in quality after cycle 350 or so. This means that read 2 for longer reads is never so good. You can sometimes compensate for this in joined paired reads (fastq-join mode) by allowing a decent mismatch (30% seems to be OK, despite Erik Aronesty's suggestion not to exceed 10%). Fastq-join seems to do a fine job of finding the best match.

One other thing I have found is that a surprising number of read pairs simply do not match for 2x250/2x300 runs. I haven't explored genome sequencing runs enough to be certain it is happening there, but I have some assemblies that suggest that quite a bit of discordance is present. This is likely due to mixed clustering during sequencing where two neighboring clusters are close enough together that signal can "bleed" into the other, and if one is producing more signal, it may overwhelm the signal from the target cluster. I found that phix is finding its way into my data this way and now take measures to remove it before I start analysis (http://enggen-nau.blogspot.com/2015/01/bash-scripts-for-qiime-work-akutils.html). This is another argument against use of read 2. I have checked some dual-indexed data and it has far lower rates of phix infiltration which suggests it will also suffer much less from barcode bleed. I think our lab will move to a dual indexed design very soon. I confirmed this in some of my own data by blasting read pairs and finding they came back as different results. I note both of the reads you supply that don't align come back as probably fusarium something, but if you try to assemble the reads by eye you can see they do not match (try to match up the strings of polyA/T). I'd guess your samples come from a soil somewhere (possibly agricultural) where fusarium might be common. Conversely, if you look at the pair that do match, it is easy to see that the tail of the second read aligns perfectly starting at the third line of read 1. Conclusion, the pair that is not joining simply does not match!

Geneious assembler is iterative and a little "generous" in my opinion. For genome assemblies, I have seen it leave in adapter contamination which it then uses to scaffold together a contig that would otherwise not have assembled. I like it for genome work due to the automated kmer optimization, but I wouldn't trust it with my amplicons.

Hopefully this helps.

ashleyb

unread,

Feb 9, 2015, 12:58:48 PM2/9/15

to qiime...@googlegroups.com

Thanks for the comments Andrew, but I still think I have an issue with my dataset. I think the amplicon is small enough that there is plenty of overlap to stitch from a 2x250 run. I don't have issues stitching reads from 16S, as you suggested. I am typically pairing >85% of the reads from 515-806 with min 200 bp overlap and 5% error on a 2x250 run. In any case, here I have a total of 644,008 ITS reads from a MiSeq Nano run and I am only able to pair 6,853 of those reads. From those that I've double checked, the pattern seems to be that the reads I was able to stitch were untrimmed (those that were 250 bp) and those that were trimmed with cutadapt do not stitch for some reason, even though I can see they should. The trimmed reads posted above do overlap (re-attached so you can see the sequences) -- the overhang on the 3' ends is the primer sequences that I purposely left so it was easier for me to verify the read orientations. Thanks for any other suggestions?

Ashley

AssembledReads2.docx

Andrew Krohn

unread,

Feb 9, 2015, 1:43:02 PM2/9/15

to qiime...@googlegroups.com

Your geneious assembly does look very nice. But your individual read lengths look strange to me. I attached a file of how your read lengths should appear for fungal data once joined (something like that, remember we have different primers, but with approximately the same number of length groups, corresponding to certain lineages). I think the length graph you sent is for single reads only, but notice the high number of reads that were trimmed to almost nothing (<20). Even the peak around 110 is troublesome. I'm wondering if cutadapt is doing quality filtering as well as adapter trimming? You need to ensure your trimmer is only removing adapters and leave in the low quality bases for now. During joining, the higher quality base will be chosen, and this can actually improve the quality of your dataset (some call this base rescue). I know others like and use trimmomatic for this, but I don't seem to get along with that program. I use fastq-mcf which does seem to work well. If you have ea-utils installed, try something like this to remove primers, and then do your joining right afterward:

fastq-mcf -0 -t 0.0001 primer.fas read1.fq read2.fa -o read1.trimmed.fq -o read2.trimmed.fq

The -0 tells mcf to set all defaults to zero (prevents quality trimming), and the -t 0.0001 makes mcf more sensitive to the presence of low-abundance primers sequences (common with degenerate primers).

You need to provide all possible versions of your primers (reverse complemented) in fasta format as the primer file.

Your primers:

BITS

ACCTGCGGARGGATCA

B58S3

GAGATCCRTTGYTRAAAGTT

Interpreted for degeneracy using code from here: https://www.biostars.org/p/6219/

BITS:

ACCTGCGGAAGGATCA

ACCTGCGGAGGGATCA

B58S3:

GAGATCCATTGCTAAAAGTT

GAGATCCATTGCTGAAAGTT

GAGATCCATTGTTAAAAGTT

GAGATCCATTGTTGAAAGTT

GAGATCCGTTGCTAAAAGTT

GAGATCCGTTGCTGAAAGTT

GAGATCCGTTGTTAAAAGTT

GAGATCCGTTGTTGAAAGTT

Reverse complement them by your favorite way, or here is my goto web app for this:

http://arep.med.harvard.edu/labgc/adnan/projects/Utilities/revcomp.html

fungal amplicon lengths.png

ashleyb

unread,

Feb 9, 2015, 3:07:14 PM2/9/15

to qiime...@googlegroups.com

The short reads are not caused by error in the trimming, but rather is likely primer dimer. There was a high abundance of a small peak present in our Bioanalyzer electropherogram as well. In the future I think we will consider adding a gel purification prior to sequencing the amplicon pool. These were low biomass samples (food origin), which likely didn't help the situation. As far as I can tell the trimming has worked fine and it does not do any quality trimming. Still stumped as to what's going on....

Ashley

Andrew Krohn

unread,

Feb 9, 2015, 3:48:43 PM2/9/15

to qiime...@googlegroups.com

Ooooo I think I have been dealing with something like this lately as well. Primer artifacts in amplicon preps are bad news. I have tried removing them with gel cuts as well as pippin, and neither does the job. I believe it is because the artifacts have high homology to the locus of interest, so single stranded bits can birds nest in with the rest of your prepped amplicons and be essentially impossible to remove. The best way I have found to control them is to bump up your annealing temp and keep your PCR cycles low. For 16S, 15 initial cycles is usually plenty with locus-specific priemrs, while for fungi, I often need 20-25 cycles, or even 30 on the initial PCR for low biomass or for taxon-specific primer sets. As an aside, I don't waste time on the bioanalyzer anymore. If I think I have a clean pool (based on gel results of individual samples on the last result and a subsequent bead cleanup, regardless of purity), I go straight to qPCR. Then I ALWAYS run a gel of the qPCR result as if you have anything with P5/P7 (clusterable) sequences if will be very visible in your qPCR result. Your qPCR gel should be clean with target products only, or else your quant will be invalid. At any rate, bead cleanups would probably help your preps immensely as it will effectively size-select away anything smaller than ~150bp.

Learn to make affordable bead cleanup solution here: http://enggen-nau.blogspot.com/2013/03/bead-cleanups.html

ashleyb

unread,

Feb 11, 2015, 9:18:52 AM2/11/15

to qiime...@googlegroups.com

Thanks for the suggestion Andrew. I still haven't solved my original read pairing issue, and I'm wondering what about the trimming could be causing the issue? Thanks!
Ashley

Andrew Krohn

unread,

Feb 12, 2015, 3:30:13 PM2/12/15

to qiime...@googlegroups.com

Hi Ashley,

I think I know your problem, and it is likely a setting in cutadapt (again, I am not familiar with it). See below from your previous post. I have highlighted the primer sites green. Your primers are being retained, and they should be removed. You don't want to include this synthetic sequence in your analysis, but the real issue for you here is this is the reason your reads are assembling. Find whatever setting in cutadapt tells it to remove the primer/adapter in addition to the sequence past there and you should get your reads to pair. Pretty sure fastq-join does something like sliding the sequences along each other until a good match is found, but many of these programs won't slide past each other to find the right match which would be necessary with your left-in primers residing at the 3' ends.

Example of a read pair that had adapter seq trimmed by cutadapt. These will not stitch in QIIME but assemble in Geneious:
Untrimmed:
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 1:N:0:
NTGATTTATTTATGGTTTTACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAATGATCCTTCCGCAGGTGGCTGACTGACTGGAAACCACCACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAATAACTTACCTACCTCTCAACTATTA

@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 3:N:0:
TTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCACTTGTTGCCTCGGCGGATCAGCCCGCTCCCGGTAAAACGGGACGGCCCGCCAGAGGACCCCTAAACTCTGTTTCTATATGTAACTTCTGAGTCAAACCATAAATAAATCAAAACTTTCAGCAACGGATCTCACAATTACCATAGTGTAGATATCGGTGGTCGACGTATCATTCACAAAAAAAAAAAACACCCTACTATAACTCTCGTAAC

Trimmed:
@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 1:N:0:
NTGATTTATTTATGGTTTTACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAATGATCCTTCCGCAGGT

@HWI-M01323:210:000000000-D0DYK:1:1101:15411:1338 3:N:0:
TTACCGAGTTTACAACTCCCAAACCCCTGTGAACATACCACTTGTTGCCTCGGCGGATCAGCCCGCTCCCGGTAAAACGGGACGGCCCGCCAGAGGACCCCTAAACTCTGTTTCTATATGTAACTTCTGAGTCAAACCATAAATAAATCAAAACTTTCAGCAACGGATCTC

Aligned after removing terminal primer sequences (reddish sequence was reverse-complemented):

NTGATTTATTTATGGTTTTACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAA

TTGATTTATTTATGGTTTGACTCAGAAGTTACATATAGAAACAGAGTTTAGGGGTCCTCTGGCGGGCCGTCCCGTTTTACCGGGAGCGGGCTGATCCGCCGAGGCAACAAGTGGTATGTTCACAGGGGTTTGGGAGTTGTAAACTCGGTAA

ashleyb

unread,

Feb 13, 2015, 12:39:09 PM2/13/15

to qiime...@googlegroups.com

Hi Andrew,
Thank you!!!!!! You are correct. I trimmed the primer off and they pair now. Wouldn't have guessed that would be an issue, and I was pretty frustrated that I couldn't figure it out...haha. Thanks again!

Ashley

Reply all

Reply to author

Forward