Hi all,
For 4 out of my 146 samples, respectively 124, 147, 233, and 337 sequences were found still beginning with the forward primer sequence after the Cutadapt step. About 45 other samples have 1-3 such reads each. No sample has any reads beginning with the reverse primer sequence. Overall, 3.2% of the read pairs were discarded because they didn't contain the primer sequences.
Here is my code for the Cutadapt step:
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-cores 4 \
  --p-front-f CAGCCGCGGTAATTCCAGCT \
  --p-front-r GAACCCAAACACTTTGGTTTCC \
  --p-no-indels \
  --p-error-rate 0.1 \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-trimmed.qza \
  --verbose
Here is the verbose summary output for the Cutadapt primers trim:
=== Summary ===
Total read pairs processed: 151,559
Read 1 with adapter: 150,710 (99.4%)
Read 2 with adapter: 147,490 (97.3%)
== Read fate breakdown ==
Pairs that were too short: 0 (0.0%)
Pairs discarded as untrimmed: 4,890 (3.2%)
Pairs written (passing filters): 146,669 (96.8%)
Total basepairs processed: 91,238,518 bp
  Read 1: 45,619,259 bp
  Read 2: 45,619,259 bp
Quality-trimmed: 0 bp (0.0%)
  Read 1: 0 bp
  Read 2: 0 bp
Total written (filtered): 82,135,270 bp (90.0%)
  Read 1: 41,214,208 bp
  Read 2: 40,921,062 bp
Then to verify the presence of my primers at the start of my reads:
qiime tools export \
--input-path demux-trimmed.qza \
--output-path trimmed_fastq
cd trimmed_fastq
for f in *R1*.fastq.gz; do
  cnt=$(gzcat "$f" | awk 'NR % 4 == 2' | grep -c '^CAGCCGCGGTAATTCCAGCT')
  echo "$f: $cnt"
done
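For reference, a variant of the same loop counts the primer anywhere in the read rather than only at the start (no `^` anchor on the grep pattern), in case internal primer hits matter too. I used `gzip -dc` here, which works the same as `gzcat` but is available on both Linux and macOS:

```shell
# Count reads containing the WANDA forward primer anywhere in the
# sequence line (FASTQ line 2 of each 4-line record), per R1 file.
for f in *R1*.fastq.gz; do
  cnt=$(gzip -dc "$f" | awk 'NR % 4 == 2' | grep -c 'CAGCCGCGGTAATTCCAGCT')
  echo "$f: $cnt"
done
```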
Here is the output of that last command, showing only the 4 samples with a high number of reads still starting with the forward primer sequence:
SAM020_H2_E_S117_L001_R1_001.fastq.gz: 337
SAM020_H1_B_S66_L001_R1_001.fastq.gz: 147
SAM027_H2_B_S19_L001_R1_001.fastq.gz: 233
SAM185_H2_E_S45_L001_R1_001.fastq.gz: 124
When I run the same command on my reverse reads, I get 0 for every sample, so the problem only affects my forward reads. My forward primer is WANDA (CAGCCGCGGTAATTCCAGCT).
I am hesitant to move on to the next step (denoising with DADA2) given that some reads still contain the primer sequence, but I can't seem to find a way to remove these specific reads. Is there such a thing? Should I not care? After all, even 337 reads is not a lot considering that each sample has tens of thousands of reads, right?
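The only workaround I could think of (untested, and it steps outside of QIIME 2) would be to filter the exported FASTQ files directly with awk, dropping any read whose sequence line starts with the primer. The file names below are placeholders, and the mate (R2) file would still need to be subset to the same read IDs to keep the pairs in sync, so this is only a sketch:

```shell
# Sketch: drop reads whose sequence begins with the WANDA forward primer.
# "reads.fastq" / "filtered.fastq" are placeholder names, not my real files.
PRIMER='CAGCCGCGGTAATTCCAGCT'
awk -v p="$PRIMER" '
  # FASTQ records are 4 lines: header, sequence, "+", qualities.
  NR % 4 == 1 { h = $0 }
  NR % 4 == 2 { s = $0 }
  NR % 4 == 3 { x = $0 }
  # Keep the record only if the primer is NOT at position 1 of the sequence.
  NR % 4 == 0 { if (index(s, p) != 1) print h "\n" s "\n" x "\n" $0 }
' reads.fastq > filtered.fastq
```

But I don't know whether this is advisable, or whether there is a proper QIIME 2 way to do it.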
Your input and help would be greatly appreciated as I am a bioinformatics beginner!
Thanks,
Jérémie Poitras