Reads still contain primer sequences after Cutadapt

Jérémie Poitras

Sep 26, 2025, 9:46:22 PM

to Microbiome Helper

Hi all,

For 4 of my 146 samples, 124, 147, 233, and 337 reads still begin with the forward primer sequence after the Cutadapt step. About 45 other samples have 1-3 reads beginning with the forward primer sequence. No sample has any reads beginning with the reverse primer sequence. Cutadapt also discarded 3.2% of the read pairs because they did not contain the primer sequences.
Here is my code for the Cutadapt step:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux.qza \
--p-cores 4 \
--p-front-f CAGCCGCGGTAATTCCAGCT \
--p-front-r GAACCCAAACACTTTGGTTTCC \
--p-no-indels \
--p-error-rate 0.1 \
--p-discard-untrimmed \
--o-trimmed-sequences demux-trimmed.qza \
--verbose

Here is the verbose summary output for the Cutadapt primers trim:

=== Summary ===

Total read pairs processed: 151,559
  Read 1 with adapter: 150,710 (99.4%)
  Read 2 with adapter: 147,490 (97.3%)

== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs discarded as untrimmed: 4,890 (3.2%)
Pairs written (passing filters): 146,669 (96.8%)

Total basepairs processed: 91,238,518 bp
  Read 1: 45,619,259 bp
  Read 2: 45,619,259 bp
Quality-trimmed: 0 bp (0.0%)
  Read 1: 0 bp
  Read 2: 0 bp
Total written (filtered): 82,135,270 bp (90.0%)
  Read 1: 41,214,208 bp
  Read 2: 40,921,062 bp

Then, to check whether any reads still start with my primers:

qiime tools export \
--input-path demux-trimmed.qza \
--output-path trimmed_fastq

cd trimmed_fastq

for f in *R1*.fastq.gz; do
  cnt=$(gzcat "$f" | awk 'NR % 4 == 2' | grep -c '^CAGCCGCGGTAATTCCAGCT')
  echo "$f: $cnt"
done

Here is the output of that last command, showing only the 4 files with high numbers of reads starting with the forward primer sequence:

SAM020_H2_E_S117_L001_R1_001.fastq.gz: 337
SAM020_H1_B_S66_L001_R1_001.fastq.gz: 147
SAM027_H2_B_S19_L001_R1_001.fastq.gz: 233
SAM185_H2_E_S45_L001_R1_001.fastq.gz: 124

When I run the same command on my reverse reads, I get 0 for every sample, so the problem only affects my forward reads. My forward primer is WANDA (CAGCCGCGGTAATTCCAGCT).

I am hesitant to move on to the next step (denoising with DADA2) while some reads still contain the primer sequence, but I can't seem to find a way to remove these specific reads. Is there such a thing? Or should I not care? After all, even 337 reads is not a lot considering that each sample has tens of thousands of reads, right?
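One way to put those counts in context is to express them as a share of each file's reads. The following is only a minimal sketch: the function name `primer_start_fraction` is made up for illustration, and it assumes `gzip -dc` is available (it does the same job as `gzcat`).

```shell
# Print, tab-separated: reads starting with the primer, total reads, percent.
# Hypothetical helper; expects a gzipped FASTQ with 4 lines per record.
primer_start_fraction() {
  local fastq="$1" primer="$2"
  gzip -dc "$fastq" | awk -v p="$primer" '
    NR % 4 == 2 {                      # sequence lines only
      total++
      if (index($0, p) == 1) hits++    # primer found at position 1
    }
    END {
      pct = (total > 0) ? 100 * hits / total : 0
      printf "%d\t%d\t%.2f\n", hits, total, pct
    }'
}

# Example use inside trimmed_fastq:
#   for f in *R1*.fastq.gz; do
#     printf '%s\t' "$f"; primer_start_fraction "$f" CAGCCGCGGTAATTCCAGCT
#   done
```

A file where the percentage is high would point to a problem with that sample rather than with the trimming settings as a whole.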

Your input and help would be greatly appreciated as I am a bioinformatics beginner!

Thanks,

Jérémie Poitras

Andre Comeau

Oct 2, 2025, 1:07:44 PM

to Microbiome Helper

Jérémie,

I was surprised to see Cutadapt fail to recognize the primer, especially since the sequences you found match the primer exactly (I thought the smoking gun might be the error rate you chose, but the missed sequences are 100% identical). You are correct that so few reads, compared with the 99%+ of other reads that number in the tens of thousands, will probably not affect things too badly, but it would still be nice to get to the bottom of why this small amount of error is bleeding into the results...

I looked into it a bit and I think I found the answer: a few of those samples, and hence files, apparently did not work very well in the PCRs. If you use FastQC to check those raw files before trimming, you'll see they are mostly primer-dimer sequences. They contain only about 100 bp of sequence (the ~20 bp of the F primer, then the ~20 bp of the R primer, then the ~60 bp of adapter after it on the end) before the read goes all G, which is the equivalent of no signal on the modern Illumina machines.

It is possible that, after you trimmed the F primer from them, a small minority had a second copy of the F primer right after the first occurrence (so FprimerFprimerNNN...). I'm not sure why you are not seeing the same phenomenon with the R2 reads, since they have the same short "dimer profile", unless you made a mistake with the reverse-complement version of the primer when grepping...?
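Both ideas can be checked from the command line. This is a rough sketch rather than a tested recipe: `revcomp` and `count_tandem_primer` are names invented here, and `revcomp` handles only plain A/C/G/T (no IUPAC ambiguity codes). The first helper makes it easy to grep for a primer in either orientation; the second counts reads in a raw (untrimmed) file that start with two back-to-back copies of a primer.

```shell
# Reverse-complement a DNA sequence (A/C/G/T only; hypothetical helper).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Count reads in a gzipped FASTQ that begin with primer immediately
# followed by a second copy of the same primer (hypothetical helper).
count_tandem_primer() {
  local fastq="$1" primer="$2"
  gzip -dc "$fastq" | awk -v p="$primer$primer" '
    NR % 4 == 2 && index($0, p) == 1 { n++ }
    END { print n + 0 }'               # n + 0 prints 0 when nothing matched
}

# Example use (the raw file location is an assumption):
#   revcomp GAACCCAAACACTTTGGTTTCC
#   count_tandem_primer SAM020_H2_E_S117_L001_R1_001.fastq.gz CAGCCGCGGTAATTCCAGCT
```

If the tandem count in a raw file is close to the number of primer-starting reads left after trimming, that would be consistent with the double-primer explanation.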

At any rate, you'll see that the trimmed files from those few samples now contain very short reads (<100 bp) and can effectively be ignored, since those reads are all going to get removed by the downstream QC/parameter choices that require a minimum length to pass into the ASV process.
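To see how much of a trimmed file falls under that length, something like the following could be used. It is a sketch only: `short_read_summary` is a hypothetical name, and the 100 bp default is taken from the rough figure above rather than from any pipeline setting.

```shell
# Print, tab-separated: total reads, reads shorter than min_len, percent short.
# Hypothetical helper; min_len defaults to 100 (an assumed cutoff).
short_read_summary() {
  local fastq="$1" min_len="${2:-100}"
  gzip -dc "$fastq" | awk -v m="$min_len" '
    NR % 4 == 2 {                      # sequence lines only
      total++
      if (length($0) < m) short++
    }
    END {
      pct = (total > 0) ? 100 * short / total : 0
      printf "%d\t%d\t%.1f\n", total, short, pct
    }'
}

# Example use:
#   short_read_summary SAM020_H2_E_S117_L001_R1_001.fastq.gz 100
```

Samples dominated by primer dimers should show most of their reads in the short column.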

ANDRÉ M. COMEAU, PhD
Manager • Integrated Microbiome Resource (IMR)
T: 902.494.2684 | E: andre....@dal.ca



Jérémie Poitras

Oct 10, 2025, 2:43:15 PM

to Microbiome Helper

Hi André,

That makes sense. I verified that I did not make any orientation mistake when writing the primer sequences, or any syntax mistake in general. I also think it is probably just failed PCRs where primer dimers remain. I won't worry about it then.
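For anyone who does want those reads gone before DADA2, one possible approach is to let Cutadapt remove a repeated primer and drop very short reads in the same pass. This is a sketch under assumptions, not something tested in this thread: it assumes the installed q2-cutadapt version exposes `--p-times` and `--p-minimum-length` (check `qiime cutadapt trim-paired --help`), the wrapper name `retrim_allowing_repeats` is made up, and the 100 bp default cutoff is only a guess based on the dimer length mentioned above.

```shell
# Hypothetical wrapper around the original trimming command.
# $1 = input .qza, $2 = output .qza, $3 = minimum read length (assumed default 100)
retrim_allowing_repeats() {
  local in_qza="$1" out_qza="$2" min_len="${3:-100}"
  qiime cutadapt trim-paired \
    --i-demultiplexed-sequences "$in_qza" \
    --p-cores 4 \
    --p-front-f CAGCCGCGGTAATTCCAGCT \
    --p-front-r GAACCCAAACACTTTGGTTTCC \
    --p-no-indels \
    --p-error-rate 0.1 \
    --p-times 2 \
    --p-minimum-length "$min_len" \
    --p-discard-untrimmed \
    --o-trimmed-sequences "$out_qza" \
    --verbose
}

# Example use (output file name is an assumption):
#   retrim_allowing_repeats demux.qza demux-trimmed-times2.qza 100
```

Re-running the primer-count loop on the new output would show whether the leftover forward-primer reads have disappeared.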