SNP FILTERING


Nigus Belay

Dec 8, 2025, 2:43:00 AM
to dartR
Hello, I am trying to filter a DArTseq SNP dataset. The SNPs were mapped to two reference genomes: one assembled at scaffold level and one at chromosome level (with SNPs assigned to the A and B subgenomes). On the chromosome-level reference, some SNPs are not assigned to the A or B subgenome; instead they appear as chr_blank, some with a position of 0 and others with a non-zero position. Is it possible to filter using the scaffold-level and chromosome-level references simultaneously? How should I handle the chr_blank SNPs for downstream analysis?

Thanks

Nigus

Jose Luis Mijangos

Dec 8, 2025, 4:55:18 PM
to dartR

Hi Nigus,

In most situations where a reference genome is required, such as LD decay, sliding-window summaries, selection scans, or any analysis that relies on physical distance, it only makes sense to use one reference genome. The validity of these analyses depends on having a coherent genomic context, so mixing scaffold-level and chromosome-level coordinates usually introduces inconsistencies. However, if you can tell me more about your specific downstream application, we can give more tailored suggestions.

In general, I recommend using the reference genome with the highest number of SNPs successfully mapped and then filtering out SNPs that were not placed onto chromosomes. Below is example code showing how to:

- Assign chromosome and position information,

- Plot SNP density per chromosome, and

- Remove SNPs that were not mapped (i.e., unmapped or chr_blank).

For the SNP density plot, you’ll need the development version of dartR.base. Before installing it, remember to clean your R environment: Session → Clear Workspace and then Session → Restart R.

Hope this helps, and I’m happy to assist further if needed.

Cheers,
Luis

# Install the development version of dartR.base
devtools::install_github("green-striped-gecko/dartR.base@dev")
library(dartRverse)
# Example dataset
t1 <- platypus.gl
# ---- Assign chromosome information ----
# In this dataset, chromosome info is stored here:
t1@chromosome <- as.factor(t1$other$loc.metrics$Chrom_Platypus_Chrom_NCBIv1)
# ---- Assign chromosome positions ----
# Position information is stored here:
t1@position <- as.integer(t1$other$loc.metrics$ChromPos_Platypus_Chrom_NCBIv1)
# ---- Plot SNP density per chromosome ----
gl.plot.snp.density(
  t1,
  bin.size  = 1e6,   # 1 Mb bins
  min.snps  = 50,
  min.length = 1e6
)
# ---- Remove SNPs not mapped (pos = 0 or NA) ----
t2 <- gl.filter.locmetric(
  t1,
  metric = "ChromPos_Platypus_Chrom_NCBIv1",
  lower = 1,
  upper = max(t1@position, na.rm = TRUE),
  keep = "within"
)
# Number of loci after filtering
nLoc(t2)
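
A position-only filter like the one above keeps any locus with a non-zero position, so chr_blank SNPs that carry a position other than 0 would pass it. The sketch below, in base R on a toy table, flags a locus as placed only when it has both a named chromosome and a positive position. The column names `Chrom` and `Pos`, the locus names, and the `chr_blank` label are assumptions for illustration and should be matched to the actual locus metrics.

```r
# Toy locus metrics table; column names "Chrom" and "Pos" are assumed,
# not taken from any particular DArT report.
loc_metrics <- data.frame(
  locus = paste0("snp", 1:6),
  Chrom = c("chr1A", "chr_blank", "chr2B", "chr_blank", "chr1A", NA),
  Pos   = c(1200L, 0L, 53000L, 8800L, 0L, NA),
  stringsAsFactors = FALSE
)

# A locus counts as placed only if it has a named chromosome
# AND a position greater than 0 (NA in either field means unplaced).
is_placed <- !is.na(loc_metrics$Chrom) &
  loc_metrics$Chrom != "chr_blank" &
  !is.na(loc_metrics$Pos) &
  loc_metrics$Pos > 0

placed_loci   <- loc_metrics$locus[is_placed]
unplaced_loci <- loc_metrics$locus[!is_placed]

print(placed_loci)
print(unplaced_loci)
```

With a genlight object, the resulting vector of unplaced locus names could then be passed to `gl.drop.loc()` through its `loc.list` argument, provided the names match `locNames()` of that object.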

Nigus Belay

Mar 16, 2026, 4:08:26 AM
to da...@googlegroups.com
Hi Luis,
Thanks for the support. I want to run the following analyses: genetic diversity measures, analysis of molecular variance (AMOVA), Nei’s genetic distance, cluster analysis, principal coordinate analysis (PCoA), population genetic structure, and linkage disequilibrium (LD) analysis. I have filtered the DArT SNP data on call rate, MAF, reproducibility, monomorphic loci and heterozygosity, and I get an error from gl.filter.excess.het. The filtering steps are listed below, in order, for your comments.

> t2 <- gl.filter.locmetric(
+   nig_gl,
+   metric = "Pos",
+   lower = 1,
+   upper = max(nig_gl@position, na.rm = TRUE),
+   keep = "within"
+ )
> # Number of loci after filtering
> nLoc(t2)
[1] 10557
> nInd(t2)
[1] 274
> gl1 <- gl.filter.callrate(t2,
+                             method = "loc",
+                             threshold = 0.80,
+                             recalc = TRUE)
Completed: gl.filter.callrate
> nLoc(gl1)
[1] 9510
> gl2 <- gl.filter.maf(gl1, threshold = 0.05)

Completed: gl.filter.maf
>   nLoc(gl2)
[1] 5577
>   nInd(gl2)
[1] 274
> gl3 <- gl.filter.callrate(gl2,
+                             method = "ind",
+                             threshold = 0.70,
+                             recalc = TRUE)
Completed: gl.filter.callrate
>   nLoc(gl3)
[1] 5577
>   nInd(gl3)
[1] 260
> gl4 <- gl.filter.reproducibility(gl3, threshold = 0.95)
Completed: gl.filter.reproducibility
>   nLoc(gl4)
[1] 4452

> gl5 <- gl.filter.monomorphs(gl4, v=0)
>   nLoc(gl5)
[1] 4452
> gl6 <- gl.filter.heterozygosity(x = gl5,
+                                   t.upper = 0.15,
+                                   t.lower = 0,
+                                   verbose = 3)
Starting gl.filter.heterozygosity
  Processing genlight object with SNP data
  Retaining individuals with heterozygosity in the range 0 to 0.15
  Minimum individual heterozygosity 0.0363
  Maximum individual heterozygosity 0.4819

Completed: gl.filter.heterozygosity
>   nLoc(gl6)
[1] 4452
>   nInd(gl6)
[1] 189
> excess_report <- gl.filter.excess.het(x = gl6,
+                                         Yates = FALSE, verbose = 3)
Starting gl.filter.excess.het
  Processing genlight object with SNP data
  Calculating loci statistics from observed data  Testing loci for high heterozygosity...
Error in `$<-.data.frame`(`*tmp*`, "En0", value = NA) :
  replacement has 1 row, data has 0


Best Regards,


Nigus



Jose Luis Mijangos

Mar 17, 2026, 6:24:05 PM
to dartR
Hi Nigus,

Thank you for reporting that bug and sending a subset of your dataset; that makes debugging much easier. I have fixed the bug. Please install the development version of dartR.base as shown below and try again.

# Clean your environment:
#   RStudio > Session > Clear Workspace
# Restart the R session:
#   RStudio > Session > Restart R
# Install the development version of dartR.base
devtools::install_github("green-striped-gecko/dartR.base@dev")

Cheers,
Luis 

Nigus Belay

Mar 29, 2026, 12:51:34 PM
to da...@googlegroups.com
Hi Luis,

Thanks, it works. gl.filter.excess.het() uses a chi-square test to detect excess heterozygosity, which assumes a randomly mating population in Hardy-Weinberg equilibrium. I am working on a highly selfing species, so I need another option, such as an observed heterozygosity (Ho) threshold of 10%, to filter loci by heterozygosity. What do you suggest?

Best regards,

Nigus

Jose Luis Mijangos

Mar 29, 2026, 6:31:49 PM
to dartR
Hi Nigus,

Below is one way to remove loci whose observed heterozygosity is at or above a threshold.

Cheers,
Luis 

library(dartRverse)
# test dataset
t1 <- platypus.gl
# calculate heterozygosity
h1 <- utils.basic.stats(t1)
# get observed heterozygosity by locus
OH <- h1$perloc$Ho
# threshold for observed heterozygosity
OH_threshold <- 0.1
# drop loci with observed heterozygosity greater than or equal to the OH threshold
t2 <- gl.drop.loc(t1,
                  loc.list = locNames(t1)[which(OH >= OH_threshold)])
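
To make the threshold rule concrete, here is a minimal base R sketch of per-locus observed heterozygosity on a toy genotype matrix. It assumes genotypes coded 0 (homozygous reference), 1 (heterozygous) and 2 (homozygous alternate), with NA for missing calls, and it pools all individuals into one group. The matrix, the locus names and the pooled calculation are illustrative assumptions; the Ho values reported by `utils.basic.stats()` may differ if they are computed per population first.

```r
# Toy genotype matrix: rows are individuals, columns are loci.
# Assumed coding: 0 = hom. reference, 1 = heterozygous, 2 = hom. alternate, NA = missing.
geno <- matrix(
  c(0, 0, 1,
    0, 1, 1,
    2, 0, NA,
    0, 0, 1,
    2, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("ind", 1:5), c("locA", "locB", "locC"))
)

# Observed heterozygosity per locus: proportion of non-missing calls that are heterozygous.
ho <- colMeans(geno == 1, na.rm = TRUE)

# Same rule as in the thread: drop loci with Ho greater than or equal to the threshold.
ho_threshold <- 0.1
drop_loci <- names(ho)[which(ho >= ho_threshold)]
keep_loci <- setdiff(colnames(geno), drop_loci)

geno_filtered <- geno[, keep_loci, drop = FALSE]

print(ho)
print(drop_loci)
print(keep_loci)
```

Using `which()` around the comparison means loci whose Ho is NA (for example, loci with no called genotypes) are not dropped by this rule and would need separate handling.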
