high missing data (vcftools --missing-indv) but high effective coverage (> 20X)

Ingrid Bunholi

unread,

Jan 9, 2025, 11:56:49 AM1/9/25

to Stacks

Hello stacks community,

I've been trying to troubleshoot what could be happening with my samples for the last few months and have not come to a solution. I have data from ddRADseq from 221 individuals from a fish species (no ref genome available) that I aim to investigate relatedness patterns within and among groups. After running populations with only a whitelist filtration (keep only loci with 1-4 SNPs), I ended up with two missing data groups: one with >70% and other with less than 30% (see graph attached) - My missing data has a bimodal distribution. I have checked for batch effect but did not see a pattern as you can see in the graph. There is some correlation with coverage depth but there are some individuals with coverage >20X with 80% of missing data.

Missing data gets worse when using the whitelist filtration and gets better when I apply the -r filtering.

Does anyone have seen anything like that? Is there any particular filtration I could try to maximize the number of individuals on my dataset?

Thank you!

Cheers,

Ingrid Bunholi

Catchen, Julian

unread,

Jan 9, 2025, 5:48:04 PM1/9/25

to stacks...@googlegroups.com

Hi Ingrid,

Often library contamination can cause the issues you are referring to. You might take some of the data from the high coverage (but also high missing data) and BLAST some of the raw reads to the NCBI NR database to see if something comes up you don’t expect. You could also take a look at the ustacks outputs for some of your samples – ustacks will throw out very high coverage stacks assuming they are repetitive DNA of some kind. But these can also be contamination. You could see if your low coverage (but high missing data) started out as high coverage, but had lots of data discarded, which would be another hint towards contamination and would be detailed in the ustacks output (captured in the denovo_map.log file).

Julian

Ps – stacks will directly give you pre-filtering and post-filtering missing data measures in the populations.log.distribs file.

Ingrid Bunholi

unread,

Jan 14, 2025, 10:06:56 AM1/14/25

to stacks...@googlegroups.com

Julien, I really appreciate your response to my question!

I checked ustacks output and the low coverage (but high missing data) individuals started with even lower coverage, so I did not see a sign of contamination based on that for any of those samples. See an example below:

Loaded 744114 reads; formed:

84552 stacks representing 443517 primary reads (59.6%)

264423 secondary stacks representing 300597 secondary reads (40.4%)

Stack coverage: mean=5.25; stdev=6.64; max=875; n_reads=443517(59.6%)

Removing repetitive stacks: cov > 26 (mean+3*stdev)...

Blacklisted 1492 stacks.

Coverage after repeat removal: mean=4.98; stdev=2.34; max=26; n_reads=413328(55.5%)

...

Final number of stacks built: 70392

Final coverage: mean=7.04; stdev=4.46; max=247; n_reads=491440(66.0%)

However, I noticed a lot of stacks were blacklisted (avg >1000 in each sample) and I wanted to blast these stacks on NCBI. Is there a way to extract these sequences from .tags.tsv file?

Ingrid Bunholi, M.S. (she/her)

PhD Student | Marine Science

The University of Texas at Austin

Marine Science Institute - UTMSI

750 Channel View Drive
Port Aransas, Texas 78373

--
Stacks website: http://catchenlab.life.illinois.edu/stacks/
---
You received this message because you are subscribed to the Google Groups "Stacks" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stacks-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/stacks-users/PH7PR11MB6722CE5305824CC1567B0DDCA7132%40PH7PR11MB6722.namprd11.prod.outlook.com.

Catchen, Julian

unread,

Jan 14, 2025, 11:27:10 AM1/14/25

to stacks...@googlegroups.com

Hi Ingrid,

You can extract the blacklisted stacks using some UNIX:

zcat SAMPLEFILE.tags.tsv.gz | awk -F "\t" '$7==1'

If you want to run BLAST, you could convert them into a FASTA file:

zcat SAMPLEFILE.tags.tsv.gz | awk -F "\t" '$7==1' | cut -f 1,5 | sed -E 's/^([0-9]+)\t([ACGTN]+)/>\1\n\2/' > blisted_stacks.fa

You might also consider capturing the reads that did not form stacks (~34% of reads in the example you provided) by adding the -R flag to one of your ustacks runs, and you could then BLAST a handful of those reads.