trimming or not trimming?


Luana Sousa

Sep 27, 2022, 2:41:52 PM
to Stacks
Hey Julian and everyone!

I'm working with Illumina HiSeq 2500 single-end, double-digest (PstI-MseI) raw sequences from DArT.

I'm unsure whether to trim my data. The data were sent to me already demultiplexed, so I just used process_radtags to remove the barcodes and low-quality reads:

process_radtags -p ./alti_RAW/ -o ./process_radtags/ -b barcodes_alti -e pstI -r -c -q --disable_rad_check

Total Sequences    186032269
Ambiguous Barcodes    0
Low Quality    33663
Ambiguous RAD-Tag    0
Retained Reads    185998606

This way, my reads went from 84 to 69 bp. (I used --disable_rad_check because with the check enabled I lost almost all reads; I guess not all the reads had the cut site. I also tried specifying both enzymes, but that did not work either.)

I analyzed the data in FastQC and found adapter content. Here is an example:
[attachment: adaptador_content.png]
Among the overrepresented sequences, the top hit is the adapter sequence:
[attachment: overrepresent.png]

I used cutadapt to remove the adapter:
cutadapt -a TTACAGATCGGAAGAG -m 1 -o ${SAMPLE}trimmed ${SAMPLE};

I also removed the 1 to 4 bp that FastQC showed were bad. Logically, after this my reads were of variable length (1 to 66 bp, mean of 52).
 
I have a reference genome (my data align to it at about 80%), and I used these trimmed reads for mapping with bowtie2 (since my reads are short) and then ran ref_map. The resulting VCF had low coverage (~10x) and a lot of missing data (30-50%).
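For reference, the mapping step was along these lines (the index name, directories, and thread count below are placeholders, not my exact commands):

# build the bowtie2 index once, then align each sample and sort to BAM
bowtie2-build reference.fa ref_index
for s in $(cut -f1 popmap.txt); do
    bowtie2 -x ref_index -U ${s}.fq.gz --very-sensitive -p 8 | \
        samtools view -b - | samtools sort -o aligned/${s}.bam -
done
ref_map.pl --samples ./aligned --popmap popmap.txt -o ./refmap_out -T 8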
 
My idea was to use the integrated method. For the de novo step I needed reads of uniform length, so I ran process_radtags with -t 60, and I lost 42.8% of the reads (because of the adapter trimming, I guess).

Total Sequences    185994423
Barcode Not Found    0    0.0%
Low Quality    79680041    42.8%
RAD Cutsite Not Found    0    0.0%
Retained Reads    106314382    57.2%

After all this, I read Rivera-Colón & Catchen 2021 and other comments here. Julian suggests that it is better not to trim in some cases. In some posts here, he says, "or to just uniformly truncate all reads to a reasonable length that gets rid of most of the adaptor". I don't know whether that applies to my case, because if I trim I lose lots of reads. Maybe I reached a conclusion just by writing this post, but I need some confirmation.
 
Julian, in my case, is it better not to trim? And for the ref_map approach, would it also be better not to trim before mapping?

Thank you very much!
Luana Soares

Luana Sousa

Oct 3, 2022, 1:23:13 PM
to Stacks
I am trying to do the optimization runs (following Rivera-Colón & Catchen 2021) with the untrimmed reads, and it is running very, very slowly. It took almost 70 hours to finish the first run. When I used the trimmed reads, where I had lost half of the sequences, I ran all 104 individuals in just one day. (The files have between 1.5M and 2.1M reads each.)

I ran this script:

work=~/Documentos
for M in 1 2 3 4 5 6 7 8 9 10 11 12; do
    out=denovo.M${M}
    echo $out
    mkdir -p $out
    denovo_map.pl --samples $work/opt_samples --popmap $work/popmap.txt \
        -o ./$out -M $M -n $M -T 8 --min-samples-per-pop 0.8
done
 
Aside from that, I have 24 cores and 62.9 GB of RAM, which are fully used by this analysis. Is this normal, or did I do something wrong? And back to my initial question: is it better to trim and have fewer reads to process?
[attachment: denovo_map.log]

Catchen, Julian

Oct 3, 2022, 4:57:14 PM
to stacks...@googlegroups.com

Hi Luana,

 

For trimming, I would recommend either trimming uniformly, or not trimming at all (and therefore discarding reads with adaptor sequence). Trimming uniformly helps if you have adaptor sequence but it is mostly concentrated at the 3' ends of the reads. However, if you have adaptor sequence throughout the reads, you probably need to discard those reads instead. If you trim variably per read, you will create a region of low coverage at the trim point for every RAD locus that contains adaptor sequence, producing low-quality genotype calls in those regions.
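For example, uniform truncation is just the -t option added to your existing process_radtags command. A sketch (the truncation length and output directory here are only illustrative):

process_radtags -p ./alti_RAW/ -o ./truncated/ -b barcodes_alti -e pstI -r -c -q -t 60 --disable_rad_check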

 

I can't tell from your message how many of your reads actually have adaptor sequence in them. It is fine to use cutadapt, but process_radtags can also discard reads containing adaptor (rather than trim them), and it may give you a better idea of how many reads you are losing per sample.
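A sketch of that (the adaptor shown is the generic Illumina adaptor start, not necessarily yours, and the option spelling, dashes vs. underscores, varies a bit between Stacks versions):

process_radtags -p ./alti_RAW/ -o ./adapter_filtered/ -b barcodes_alti -e pstI -r -c -q --disable_rad_check --adapter_1 AGATCGGAAGAGC --adapter_mm 2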

 

I don’t understand what you are saying here:

> I also removed the 1 to 4 bp that FastQC showed were bad. Logically, after this my reads were of variable length (1 to 66 bp, mean of 52).

 

This is also hard to interpret:

 

> My idea was to use the integrated method. For the de novo step I needed reads of uniform length, so I ran process_radtags with -t 60, and I lost 42.8% of the reads (because of the adapter trimming, I guess).

 

Did you specify the adaptors to process_radtags, or just trim the data? If you are filtering on adaptor, I would not also trim, since reads with adaptor will be discarded anyway. It is hard to reconcile your first process_radtags run, with 33,663 reads lost to quality, with your second run, with 79,680,041 reads lost to quality. If you run process_radtags with truncation set to 60, and a bunch of the reads are already shorter than 60 bp, they will be discarded, which may be why you have 79M reads dropped. Again, I would run process_radtags on the raw data, with the adaptors provided, and see what you get.

 

julian

 


Luana Sousa

Oct 6, 2022, 12:48:07 PM
to Stacks

Thank you very much, Julian.

I used cutadapt to trim the adapters, which left the reads with variable lengths. I set process_radtags to truncate at 60 bp because that was the length of most of the reads. The dropped reads were in fact all the reads that had been trimmed by cutadapt.

Following your advice, I ran process_radtags on the raw data with the adapter sequence provided:


Total Sequences    186032269
Reads containing adapter sequence    76055954    40.9%
Barcode Not Found    0    0.0%
Low Quality    33663    0.0%
RAD Cutsite Not Found    0    0.0%
Retained Reads    109942652    59.1%

If it is preferable to remove the reads with adapter, I will proceed with these data. The denovo_map optimization tests are calling about 30k r80 SNPs.

Thank you again!

Audrey Salinger

Sep 17, 2024, 3:07:37 PM
to Stacks
Hello!

I have a small fraction of sequences with adapter read-through, and I am looking for some further theoretical clarification on the recommendation here to trim uniformly or not at all (and instead discard whole reads). I can conceptually understand that by trimming a read at the 3' end where adapter read-through occurs, you create a region of reduced depth after the trim point. What I am having trouble understanding is why removing the whole read does not also reduce coverage, and thus the quality of genotype calls. Can you help me understand why entirely removing whole reads with adapter read-through is better for downstream genotype calling?

Many thanks,
Audrey

Angel Rivera-Colón

Sep 18, 2024, 1:32:16 PM
to Stacks
Hi Audrey,

Yes, discarding the reads with adapter also decreases the final coverage. Also, you are correct that having higher coverage at a given site is better for genotype calling (although this effect is not linear).

I think the idea here is to have uniform coverage across the locus, mainly for the purpose of de novo assembly. Discarding reads reduces the coverage, but does so uniformly across the whole sequence. If only a small proportion of reads contain adapter, the reduction in coverage from discarding is likely not sufficient to cause major issues when genotyping. Trimming uniformly has a similar effect: you reduce the total sequence length, but the resulting sequence has uniform coverage. You don't get any data past, e.g., 100 bp, but any sites before that should be called accurately. In contrast, variable (per-read) trimming results in sequences of different lengths and non-uniform coverage. This can cause issues when assembling loci de novo, in addition to reducing the accuracy of genotyping at the 3' end.

Ultimately, it is a tradeoff. If just discarding reads results in too much data being lost, and consequently low coverage, then trimming uniformly is likely better. In contrast, maybe only a small proportion of reads have adapter, and discarding those results in a very small reduction in coverage. In that case, discarding the reads is better. Which one to use depends on the properties of the data.

Thanks,
Angel

Audrey Salinger

Sep 19, 2024, 1:51:05 PM
to Stacks
Hi Angel,

Thank you for this clarification!

Cheers,
Audrey