hapstats.tsv: All HWE values zero, Haplotype Diversity Greater Than One

97 views
Skip to first unread message

Matthew Penney

unread,
Dec 8, 2021, 12:58:01 PM12/8/21
to Stacks
Hello,

I've run 83 RAD RAD libraries sequenced on an Illumina NovaSeq 6000 through Genome Quebec and trimmed to 75bp using Trimmomatic through a denovo_map.pl pipeline. They were also put through clone_filter and checked with fastqc. After staring with 100 libraries, any with fewer than 1M reads were discarded. The code for each step (minus fastqc) is as follows (I inserted SAMPLE where the sample number would be):

>clone_filter -1 ./raw/*SAMPLE_R1.fastq.gz -2 ./raw/*SAMPLE_R2.fastq.gz -i gzfastq -o ./Decloned --oligo_len_1 12 --retain_oligo GANNNNGTTGCA --inline_null

>java -Xms256m -Xmx4g -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE ./Decloned/*SAMPLE_R1.1.fq.gz ./Decloned/* SAMPLE _R2.2.fq.gz ./Trimmed/ SAMPLE _R1.fq.gz ./Trimmed/ SAMPLE _R1un.trimmed.fastq.gz ./Trimmed/ SAMPLE _R2.fq.gz ./Trimmed/ SAMPLE _R2un.trimmed.fastq.gz ILLUMINACLIP:./TrimAdapters/TRIM_Adapters2.fa:2:30:15 SLIDINGWINDOW:5:20 HEADCROP:20 MINLEN:75

>process_radtags -1 ./Trimmed/SAMPLE_R1.fq.gz -2 ./Trimmed/SAMPLE_R2.fq.gz -o ./Processed --clean --quality -t 75 -s 20 --disable_rad_check

>denovo_map.pl -M 2 -T 8 -o ./results --popmap ./popmap/popmapcukes3 --samples ./Processed --rm-pcr-duplicates --paired -X "ustacks: m = 10" -X "populations: -r 0.70 -R 0.70 --hwe --genepop --structure --phylip"

Looking over my hapstats.tsv file, I noticed that all my HWE values and p-values are zero as well as the fact that I have Haplotype Diversity values greater than one (the highest was 23). From what I understand, Haplotype Diversity ranges from 0 to 1. Any idea what might be happening here? I can provide log files if that would help.


Cheers,

Matthew Penney
MSc Candidate, Acadia University
Stewart Lab

Catchen, Julian

unread,
Dec 8, 2021, 4:33:28 PM12/8/21
to stacks...@googlegroups.com

Hi Matthew,

 

A few thoughts:

 

  1. It is not clear why you are using clone_filter, since these are paired-end data (what kind of RAD libraries are these?) you can let gstacks in the denovo_map.pl pipeline handle filtering PCR duplicates. Of course, if you have degenerate oligos marking your independent molecules, then using clone_filter makes sense. However, this command line option does not make sense: “--retain_oligo GANNNNGTTGCA”, as --retain_oligo is a flag and does not accept an argument. I don’t know what effect that would have, but not what you would like.
  2. There is no need to run trimmomatic at all, let process_radtags handle that for you.
  3. Why are you trimming to 75bp? Why not keep the entire length of the reads?
  4. You specified the “--rm-pcr-duplicates” flag to denovo_map.pl anyway, when in principle, you accomplished this with clone_filter. I suggest using one method or the other, not both.
  5. You specified “-X "ustacks: m = 10"” flag to denovo_map.pl and this is not formed properly. Moreover, it rarely, if ever would make sense to set -m to 10. Please see our optimization paper by Josie Paris, the stacks 2 paper, or our latest methods chapter on Biorxiv for a discussion.

 

Finally, once all of that is complete, you want to focus on coverage for your individual samples, and the amount of missing data. Only after all this looks good should you be focused on population genetic results (e.g. HWE).

 

julian

Matthew Penney

unread,
Dec 8, 2021, 6:43:25 PM12/8/21
to Stacks
Hi Dr. Catchen,

These are double-digested libraries (Msp1 and Pst1) with a forward adapter including a 4bp degenerate sequence and a reverse Y-adapter. The individual samples were identified using combinatorial barcodes, which were added via the PCR primers. Since clone_filter didn't seem to be doing much to remove PCR Duplicates based on FastQC results maybe it's better to keep the --rm-pcr-duplicates flag.

The reason I was using Trimmomatic is that Genome Quebec does not remove adapter sequence from the sequencing reads and my reads had adapter at the 5' end (the degenerate bases and cut site) and, unfortunately, adapter appearing at the 3' end according to FastQC. The intention was to use Trimmomatic to cut out adapters and any additional low-quality sequence that was showing up in the FastQC results. Low-quality sequence at the ends was also the reason for truncating the reads. It's possible that I could keep more, though. 75bp just seemed common in the literature.

That "m = 10" mistake was me jumping back and forth between Linux and R. I corrected that awhile ago but pulled the code from the wrong script. My bad. I'll take a look at the paper you mentioned.

One issue I do have is that the number of reads differs widely between some samples. I've considered taking the average and SD for total reads and then excluding anything that falls out of range. If you have any suggestions for dealing with this I'd appreciate any advice.


Cheers,
Matt Penney

Catchen, Julian

unread,
Dec 13, 2021, 5:09:25 PM12/13/21
to stacks...@googlegroups.com

Hi Matt,

 

You cannot use the --rm-pcr-duplicates flag for ddRAD data, it will just cause all data to be thrown away (since identifying PCR duplicates involves finding pairs of reads with the exact same starting coordinates, and ddRAD reads at each locus all have the same starting coordinates.

 

The clone_filter program will remove the degenerate bases if you allow it to, we don’t typically remove the cutsite, but leave it to be used in the denovo assembly process. You can give the 3’ adapter sequence to process_radtags and allow it to remove reads that contain it. If you have adapter frequently, it may be better to trim all the reads uniformly, but I would try to avoid that in your first attempts.

 

julian

 

From: stacks...@googlegroups.com <stacks...@googlegroups.com> on behalf of Matthew Penney <matthew....@gmail.com>
Date: Wednesday, December 8, 2021 at 5:43 PM
To: Stacks <stacks...@googlegroups.com>
Subject: Re: [stacks] hapstats.tsv: All HWE values zero, Haplotype Diversity Greater Than One

Hi Dr. Catchen,

 

These are double-digested libraries (Msp1 and Pst1) with a forward adapter including a 4bp degenerate sequence and a reverse Y-adapter. The individual samples were identified using combinatorial barcodes, which were added via the PCR primers. Since clone_filter didn't seem to be doing much to remove PCR Duplicates based on FastQC results maybe it's better to keep the --rm-pcr-duplicates flag.

 

The reason I was using Trimmomatic is that Genome Quebec does not remove adapter sequence from the sequencing reads and my reads had adapter at the 5' end (the degenerate bases and cut site) and, unfortunately, adapter appearing at the 3' end according to FastQC. The intention was to use Trimmomatic to cut out adapters and any additional low-quality sequence that was showing up in the FastQC results. Low-quality sequence at the ends was also the reason for truncating the reads. It's possible that I could keep more, though. 75bp just seemed common in the literature.

 

That "m = 10" mistake was me jumping back and forth between Linux and R. I corrected that awhile ago but pulled the code from the wrong script. My bad. I'll take a look at the paper you mentioned.

 

One issue I do have is that the number of reads differs widely between some samples. I've considered taking the average and SD for total reads and then excluding anything that falls out of range. If you have any suggestions for dealing with this I'd appreciate any advice.

 

 

Cheers,

Matt Penney

On Wednesday, December 8, 2021 at 5:33:28 PM UTC-4 jcatchen wrote:

Hi Matthew,

 

A few thoughts:

 

1.      It is not clear why you are using clone_filter, since these are paired-end data (what kind of RAD libraries are these?) you can let gstacks in the denovo_map.pl pipeline handle filtering PCR duplicates. Of course, if you have degenerate oligos marking your independent molecules, then using clone_filter makes sense. However, this command line option does not make sense: “--retain_oligo GANNNNGTTGCA”, as --retain_oligo is a flag and does not accept an argument. I don’t know what effect that would have, but not what you would like.

2.      There is no need to run trimmomatic at all, let process_radtags handle that for you.

3.      Why are you trimming to 75bp? Why not keep the entire length of the reads?

4.      You specified the “--rm-pcr-duplicates” flag to denovo_map.pl anyway, when in principle, you accomplished this with clone_filter. I suggest using one method or the other, not both.

5.      You specified “-X "ustacks: m = 10"” flag to denovo_map.pl and this is not formed properly. Moreover, it rarely, if ever would make sense to set -m to 10. Please see our optimization paper by Josie Paris, the stacks 2 paper, or our latest methods chapter on Biorxiv for a discussion.

 

Finally, once all of that is complete, you want to focus on coverage for your individual samples, and the amount of missing data. Only after all this looks good should you be focused on population genetic results (e.g. HWE).

 

julian

Matthew Penney

unread,
Dec 13, 2021, 7:20:25 PM12/13/21
to Stacks
Huh. Good to know. I'll make the edits to that code.

Unfortunately, I do have an issue with adapter sequence at the 3' ends of my reads. I think I selected too far down during the size selection step (we used AMPure beads for that). I've managed to get Trimmomatic (just bear with me for a second) to remove all of the adapter sequence and am currently trying to optimize the parameters. I can look into doing this with process_radtags, as well, since it would mean one less step in processing.

Thanks for the help so far. ^_^
Reply all
Reply to author
Forward
0 new messages