Allele coverage not reaching 100%


Ira Herniter

Sep 1, 2025, 10:52:03 AM
to Core Hunter Users
I'm trying to determine how many individuals I need to reach 95% allele coverage, so I've been using the "Coverage" option in the CoreHunter GUI.

My dataset consists of 76 individuals, and I've tested a range of core sizes, but for some reason the allele coverage seems to max out at 75%, even when I include all but one of my genotypes. I've repeatedly run the analysis with a designated core size of 75/76, and I keep getting the same score. My results are as follows:

#taxa CV
10 0.7058
15 0.7209
20 0.7292
25 0.7352
30 0.7398
35 0.7431
40 0.7458
45 0.7480
50 0.7497
55 0.7512
60 0.7524
65 0.7532
70 0.7537
75 0.7541

Thanks,
Ira

Herman De Beukelaer

Sep 3, 2025, 3:21:01 AM
to Core Hunter Users
Hi Ira,

Thanks for reaching out. Would it be possible to share your data, anonymized if needed? What data format are you using? My first guess would be that there are some loci in your dataset for which some alleles are not present in any of your accessions, which would mean you cannot get 100% allele coverage for these loci regardless of what selection you make.
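To illustrate why absent alleles cap the score, here is a minimal Python sketch. It assumes coverage is the number of alleles present in the selection divided by the total number of alleles declared in the dataset, and it assumes two alleles per marker with genotype calls 0/1/2 and `None` for missing values. The function name and data layout are hypothetical and are not part of Core Hunter.

```python
def max_allele_coverage(genotypes):
    """Upper bound on allele coverage when ALL accessions are selected.

    genotypes: list of rows (one per accession); each row holds one call per
    marker: 0, 1, 2, or None for a missing value (assumed encoding).
    """
    n_markers = len(genotypes[0])
    present = 0
    for j in range(n_markers):
        # Non-missing calls observed for marker j across all accessions.
        calls = {row[j] for row in genotypes if row[j] is not None}
        # A call of 0 or 1 shows the first allele; 1 or 2 shows the second.
        has_first = bool(calls & {0, 1})
        has_second = bool(calls & {1, 2})
        present += int(has_first) + int(has_second)
    return present / (2 * n_markers)


# Marker 1 is 0 for every accession (one allele never observed);
# marker 2 shows both alleles.
example = [[0, 0], [0, 2]]
ceiling = max_allele_coverage(example)
```

Under these assumptions, no subset can score higher than the value computed for the full set, because a subset can only contain alleles that occur somewhere in the data.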

Kind regards
Herman


Ira Herniter

Sep 10, 2025, 12:38:45 PM
to Core Hunter Users
Hi Herman,

I generated my import file with plink and then removed the missing data in Excel so that it could be imported into the CoreHunter GUI.
I've attached the data file in question. There is a significant amount of missing data, so how would I go about properly scaling it? I'd like to get a read on the allelic diversity without including missing data in that calculation.

Below is a chart of the # of missing calls by marker. It doesn't include the 1195 markers with 1 missing data point or the 7591 with no missing data, as that would swamp everything else.

Thanks,
Ira

Picture1.png

gen_300_riparia.csv

Herman De Beukelaer

Sep 11, 2025, 3:25:10 AM
to Core Hunter Users
Hi Ira,

Are you using the biparental (SNP) format or the frequency format? The biparental format expects values 0/1/2, while the frequency format expects multiple columns per marker (one per allele) with values in the range [0,1] that sum to one across all alleles of the same marker. Your file format seems to be somewhere in between, so I am not sure what format you intended to use. I also noticed there are some unexpected additional header lines (rows 2-6) that should also be removed.
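As a rough way to tell which of the two formats a file matches, the two rules above can be sketched in Python. The helper names are hypothetical, and treating an empty cell as a missing value is an assumption; the docs remain the authority on the exact file layout.

```python
def is_biparental(values):
    """True if every non-missing cell is exactly 0, 1 or 2.

    values: iterable of cell strings for one marker column; an empty
    string is treated as a missing value (assumption).
    """
    allowed = {"0", "1", "2"}
    return all(v.strip() in allowed for v in values if v.strip() != "")


def is_valid_frequency(freqs, tol=1e-6):
    """True if the allele frequencies of ONE marker in ONE accession
    all lie in [0, 1] and sum to one (within a small tolerance)."""
    in_range = all(0.0 <= f <= 1.0 for f in freqs)
    return in_range and abs(sum(freqs) - 1.0) <= tol


biparental_ok = is_biparental(["0", "1", "2", ""])
frequency_ok = is_valid_frequency([0.25, 0.75])
```

A column that fails both checks is likely in the "somewhere in between" state described above and needs to be reformatted before import.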

Please check the docs to make sure you have formatted your file correctly: https://www.corehunter.org/data.

Getting back to my previous reply: there are many (>5k) markers that were called as 0 for all accessions. When using the frequency format, this means these markers weren't detected in any of the accessions. If using biparental format, this means all accessions were homozygous for the same allele for all of these markers. In both cases, it means allele coverage cannot reach 100% regardless of the selected subset. Such markers should preferably be removed, as they don't contribute to the diversity of the selected core set, especially if your specific goal is to search for a subset with 100% allele coverage.
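Removing such markers before import could look roughly like the Python sketch below. It assumes biparental 0/1/2 calls already loaded into lists, with `None` for missing values; the function name and data layout are hypothetical and not part of Core Hunter. A marker is dropped when only one allele is ever observed (all calls 0, or all calls 2) or when it has no calls at all.

```python
def drop_single_allele_markers(markers, genotypes):
    """Remove markers for which only one allele is observed.

    markers:   list of marker names (column order).
    genotypes: list of rows, one per accession, with calls 0/1/2 or None.
    Returns (kept_marker_names, filtered_rows).
    """
    keep = []
    for j in range(len(markers)):
        calls = {row[j] for row in genotypes if row[j] is not None}
        # Both alleles are present if any call is heterozygous (1)
        # or if both homozygous classes (0 and 2) occur.
        both_alleles = 1 in calls or {0, 2} <= calls
        if both_alleles:
            keep.append(j)
    kept_names = [markers[j] for j in keep]
    filtered = [[row[j] for j in keep] for row in genotypes]
    return kept_names, filtered


names, rows = drop_single_allele_markers(
    ["m1", "m2", "m3"],
    [[0, 0, 2], [0, 1, 2], [0, None, 2]],
)
```

In this toy input, `m1` (all 0) and `m3` (all 2) are removed and only `m2` remains.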

Missing values should not be a problem; they are handled internally by Core Hunter.