Reads Ending with PolyG tails and contaminated with Illumina Universal Adapter

Abosede Olarewaju

Oct 23, 2023, 6:33:28 AM
to Stacks
Dear All,

I am new to using the Stacks pipeline.

I have been able to run process_radtags on my sequenced files, and the proportion of reads retained varied from 22% to 95%. When I viewed the FastQC reports before running process_radtags, I noticed that some of the reads were contaminated with Illumina universal adapters and that some overrepresented sequences had no hit.

I am not sure which sequences to use as the Illumina universal adapter, but when I included the adapters in the process_radtags command for one of the sequence files, the proportion of retained reads dropped from 58.3% to 5.6%. Is there a particular way to prevent reads from being contaminated with Illumina universal adapters? I will soon prepare the RADseq library for my whole dataset.

Below are the adapters I included in my process_radtags command:

Read 1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCA 

Read 2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT


Additionally, my sequenced files contain reads of length 101 and 251. The files with 251 bp reads have runs of G's of varying length at the end of the sequences (I think these are polyG tails). So I decided to trim these reads to 150 bp, though at 150 bp some of the reads still contain the G runs. Is this trimming okay? Is it advisable to proceed to the next steps? Or do I need to remove the polyG tails with another tool such as fastp or Trimmomatic before proceeding with process_radtags?

I would like to get feedback to know how to proceed to the next step.

Thank you all for your contributions.
Abosede

Catchen, Julian

Oct 24, 2023, 5:46:23 PM
to stacks...@googlegroups.com

Hi Abosede,

 

Adaptor sequence typically appears in reads because of the size selection used during library preparation and/or the choice of enzymes when doing double-digest RAD. If your sequenced read lengths (e.g. 101bp or 251bp) are longer than the DNA insert that was captured between your P1 and P2 adaptors, the Illumina machine will sequence through the insert and into the adaptor. If you have a very frequent cutter as a second enzyme for ddRAD, then you will end up with lots of very short DNA inserts in between your two enzyme cuts. You can exclude some of these with size selection.
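
To make the read-through idea concrete, here is a minimal Python sketch (not part of Stacks) that looks for the start of the adaptor inside a read. The position where the adaptor begins is roughly the length of the sequence that came before it. The adaptor string is the Read 1 sequence quoted in the question; the function name `find_adapter_start` and the `min_overlap` and `max_mismatches` parameters are hypothetical choices for illustration, and real trimming tools use more careful alignment.

```python
# Read 1 adaptor sequence as quoted in the question above.
ADAPTER_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def find_adapter_start(read, adapter, min_overlap=8, max_mismatches=1):
    """Return the 0-based index where `adapter` begins in `read`, or None.

    The adaptor may be cut off by the end of the read, so at each candidate
    position only the overlapping part is compared. `min_overlap` and
    `max_mismatches` are illustrative thresholds, not Stacks defaults.
    """
    n = len(read)
    for start in range(n - min_overlap + 1):
        overlap = min(len(adapter), n - start)
        window = read[start:start + overlap]
        mismatches = sum(1 for a, b in zip(window, adapter[:overlap]) if a != b)
        if mismatches <= max_mismatches:
            return start
    return None


if __name__ == "__main__":
    insert = "ACGT" * 10                # a made-up 40 bp insert
    read = insert + ADAPTER_R1          # the read runs into the adaptor
    print(find_adapter_start(read, ADAPTER_R1))
```

If most reads in a sample return a small index, the inserts are short relative to the read length, which is consistent with the large drop in retained reads once adaptor filtering is switched on.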

 

You can trim reads to prevent them from being thrown out for containing adaptor sequence. You will want to process your 101bp and 251bp reads separately as well. I would recommend sticking to a single length in your main sequencing runs, as reads of different lengths are hard to combine informatically.
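
As a rough illustration of trimming to a single length, the Python sketch below strips a trailing run of G's from each read, truncates what is left to a fixed length, and drops reads that end up shorter than that length. Everything here is an assumption made for the example: the function names, the `min_run` threshold, and the in-memory `(name, seq, qual)` tuples. It only removes exact runs of G, so a dedicated tool such as fastp (mentioned in the question) is likely to be more robust on real data.

```python
def trim_polyg_tail(seq, qual, min_run=10):
    """Remove a trailing run of G's if it is at least `min_run` bases long.

    `min_run` is an illustrative threshold; shorter runs are left alone
    because they may be genuine sequence.
    """
    stripped = seq.rstrip("G")
    run = len(seq) - len(stripped)
    if run >= min_run:
        return stripped, qual[:len(stripped)]
    return seq, qual


def to_uniform_length(records, length, min_run=10):
    """Trim polyG tails, then keep only reads that reach `length` bases,
    truncated to exactly `length`. `records` holds (name, seq, qual) tuples."""
    kept = []
    for name, seq, qual in records:
        seq, qual = trim_polyg_tail(seq, qual, min_run)
        if len(seq) >= length:
            kept.append((name, seq[:length], qual[:length]))
    return kept


if __name__ == "__main__":
    records = [
        ("read1", "ACGT" * 5, "I" * 20),               # no polyG tail
        ("read2", "ACGTACGT" + "G" * 12, "I" * 20),    # 8 bp followed by polyG
    ]
    for name, seq, qual in to_uniform_length(records, length=10):
        print(name, seq, qual)
```

Running the 101bp and 251bp files through a step like this separately, each with its own target length, keeps every batch at one read length.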

 

Best,

 

julian
