RevAvg/Reproducibility filter

Deo Angelo Macahig

unread,

Jul 23, 2020, 8:51:28 AM7/23/20

to dartR

Hello! Are the gi.filter.RepAvg and gi.filter.reproducibility the same filters? Also, can someone explain to me the concept if this filter? Thank you.

Deo

Arthur Georges

unread,

Jul 23, 2020, 7:38:34 PM7/23/20

to dartR

Hi Deo,

The script gl.filter.repavg is a legacy script to maintain backwards compatability. When we added in silicoDArT capability, which uses a DArT attribute 'reproducibility' we changed the name of the script. SilicoDArT data does not have a RepAvg attribute. So they are the same filters.

As for what the filter does, DArT run 30% of the samples you provide them as technical replicates. Ideally, you would get the same result the second time as the first time, so this is a measure of reliability of the marker. DArT use this attribute in their filtering before they give you the final set of SNPs, but this filter allows you to do your own additonal filtering. RepAvg in the SNP dataset is the reproducability averaged over the two alleles; reproducibility in the SilicoDArT dataset (presence absence of a fragment) is a single value.

Typically, only SNPs with high reproducibility will be passed to you by DArT, but you have the option to filter further after examining the plots produced by gl.report.reproducibility. The recommendation is not to over-filter. So the temptation might be to filter to achieve a reproducibility of 1 for all loci, but this may cause a bias that is best avoided and you can lose a lot of useful loci. A value of 0.995 might be better.

An exception might be if you are looking for pedigree incompatible loci and want to minimize the error rate at the expense of the number of loci. Then you might filter on read depth bringing that up to 8-12, and on reproducibility setting the threshold to 1. But for most applications, reproducibility > 0.995 is sufficiently conservative.

Another moot point. We are really dealing here with repeatability (you might find reviewers will pick up on that), but trying to call the script that caused a barrage of protest.

Cheers, A

Deo Angelo Macahig

unread,

Jul 29, 2020, 8:11:41 AM7/29/20

to dartR

Hi Arthur,

Apologies for the very late response. Many thanks for the explanation. I've been trying to optimize the number and order of my filters for the DArTseq data in R. Again, thank you.

Deo

Ana Filipa Sobral

unread,

Aug 11, 2020, 6:13:16 PM8/11/20

to dartR

Hi Arthur,

Thank you for your help with all questions I have posted here.

How could I get a reproducibility value for each individual?

I am looking to do this so I can I evaluate pairs of replicates in order to understand genotype error rate and % of identical genotypes for all common markers?

Hope this makes sense.

Cheers,

Ana

Arthur Georges

unread,

Aug 11, 2020, 7:21:22 PM8/11/20

to da...@googlegroups.com

Hi Ana,

Reproducibility is a locus metric (repAvg for SNP data), so the average individual repAvg will be the same for each individual. You probably want something like "how many loci differ between replicates of an individual" before and after applying your filter set.

To do that, you compare individual A1 with A2,

tmp <- gl.keep.ind(gl, ind.list=c("A1","A2"))

tmp <- as.matrix(gl.keep.ind(testset.gl, ind.list= c("A1","A2") ))
tmp <- tmp[tmp[1,]!=tmp[2,]]
tmp <- tmp>=0
sum(tmp,na.rm=TRUE)

errorrate <- sum(tmp,na.rm=TRUE)*100/nloc(gl) # To include local NAs

Bit clunky, but try that?

A

--
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/578e5695-8c50-4b66-9717-619e89d2f464o%40googlegroups.com.

--

---------------------------------------------------------------------------------------------------------------------------
D-Prof. Arthur Georges
Institute for Applied Ecology, University of Canberra ACT 2601 Australia
Deliveries: Bld 3 Stores, Kirinari Street, Bruce ACT 2617
Tel : +61 (0)2 6201 5786 Fax: +61 (0)2 6201 5305 Mobile: +61 (0)418 866741
WWW: http://georges.biomatix.org ORCID ID: http://orcid.org/0000-0003-2428-0361

Map: https://www.canberra.edu.au/maps

Alice Atikessé

unread,

May 3, 2022, 1:13:17 PM5/3/22

to dartR

Hello Arthur,

The question "how many loci differ between replicates of an individual" is exactly the one I am asking myself at the moment.

So I tried your code but with a bit of a modification (changed "testset.gl" by "gl" on the second row) :

tmp <- gl.keep.ind(gl, ind.list=c("21-G74-1","21-G74-2"))
tmp <- as.matrix(gl.keep.ind(gl, ind.list= c("21-G74-1","21-G74-2") ))

tmp <- tmp[tmp[1,]!=tmp[2,]]
tmp <- tmp>=0
sum(tmp,na.rm=TRUE)
errorrate <- sum(tmp,na.rm=TRUE)*100/nloc(gl) # To include local NAs

Everything works well until the last row, where my R consol tells me that the nloc function is impossible to find...

Do you have any idea why and how I could make it work?

Thank you!

Alice Atikessé from Canada!

Arthur Georges

unread,

May 3, 2022, 1:49:41 PM5/3/22

to da...@googlegroups.com

Oops. nLoc(gl)

A

To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/a50df197-a3fa-4c42-be9f-be72292c5072n%40googlegroups.com.

Brianna Coulter

unread,

Dec 21, 2022, 7:04:25 PM12/21/22

to dartR

Hi Arthur,

I'm trying to do the same - get a reproducibility value for technical intra- and inter-plate replicates. I'm also interested in obtaining the individual loci that had errors to exclude loci from my analyses where there was loci with >1 mismatch between pairs of replicates.

I've written some code, modifying yours but I seem to be getting different error rates and cannot figure out why. I'm just wondering if you might be able to pickup on the differences or suggest how to obtain the loci with errors for each pair? For reproducibility sake I've used the bandicoot.gl below (obviously not technical replicates but just using for sake of code).

errorrate <- tibble(`Sample ID1` = c("bc1", "bc2", "bc3", "bc4"),
`Sample ID2` = c("bc5", "bc6", "bc7", "bc8"),
`Error rate (%)` = as.double(c(rep(1, 4))))

errorrate$`Loci list` <- vector("list", 4)

for(i in 1:nrow(errorrate)){
tmp <- as.matrix(gl.keep.ind(bandicoot.gl, ind.list= c(errorrate[[i, 1]],errorrate[[i, 2]])))
tmp <- left_join(as_tibble(setNames(rownames_to_column(as.data.frame(tmp[1,])), c("Loci", "Sample1"))),
as_tibble(setNames(rownames_to_column(as.data.frame(tmp[2,])), c("Loci", "Sample2"))),
by = "Loci") %>%
mutate(Error = Sample1 != Sample2)
errorrate[[i,3]] <- nrow(tmp[tmp$Error == "TRUE" & !is.na(tmp$Error),])/nLoc(bandicoot.gl) * 100
errorrate[[i,4]][[1]] <- tmp[tmp$Error == "TRUE" & !is.na(tmp$Error),][[1]]
}

Any help would be appreciated!

Kind regards,

Brianna.

Jose Luis Mijangos

unread,

Dec 21, 2022, 11:25:04 PM12/21/22

to dartR

Hi Brianna,

Can you try the following function.

Cheers,

Luis

########################################################################
# testing percentage of correct genotypes between two datasets #########
########################################################################

genotypes_2_datasets <- function(x,y){

cat("Testing percentage of correct genotypes between two datasets\n")

x <- gl.keep.ind(x, ind.list = indNames(y),verbose = 0)
x <- gl.filter.callrate(x, threshold = 1,verbose=0)
x <- gl.filter.monomorphs(x,verbose = 0)

y <- gl.filter.callrate(y, threshold = 1,verbose = 0)

t2 <- which(x$loc.names %in% y$loc.names)
t3 <- which(y$loc.names %in% x$loc.names)

x <- gl.keep.loc(x, loc.list = locNames(x)[t2],verbose = 0)
x <- x[, order(x$loc.names)]
x <- x[order(indNames(x)), ]

y <- gl.keep.loc(y, loc.list = locNames(y)[t3],verbose = 0)
y <- y[, order(y$loc.names)]
y <- y[order(indNames(y)), ]

s1 <- as.matrix(x)
s2 <- as.matrix(y)

s1[s1 == 2] <- 0
s2[s2 == 2] <- 0

# percentage of incorrect called genotypes between DArT and new dataset
genotypes <- nInd(y) * nLoc(y)
correct_genotypes <- sum(colSums(s1 == s2))

incorrect_percentage <- (correct_genotypes / genotypes) * 100

incorrect_percentage <- round(incorrect_percentage,2)
cat(incorrect_percentage,"percentage of", genotypes,"genotypes were the same in dataset 1 and dataset 2\n\n")
}

library(dartR)
t1 <- gl.keep.ind(platypus.gl,ind.list = indNames(platypus.gl)[1:10])
t2 <- gl.keep.ind(platypus.gl,ind.list = indNames(platypus.gl)[1:10])

genotypes_2_datasets(t1,t2)

Reply all

Reply to author

Forward