pairwise individual distances & genomic distances + difference with relatedness estimates

205 views

Skip to first unread message

Gabriella Scatà

unread,

Jun 27, 2023, 1:58:52 AM6/27/23

to dartR

Hi everyone,

this might be a bit of a long post as it related pairwise genetic distances between individuals and their use to detect duplicated or mixed (cross-contaminated) samples as and differentiate them from actually related individuals.

1) My first question regards the output of the function "dartR::gl.dist.ind".

As far as I understood, it computes pairwise individual distances based on allele counts of either the reference or alternate allele for each shared locus between 2 individuals.

I tried to run gl.dist.ind with the distance method "Manhattan" and I get very different results from those obtained with the "radiator::detect_duplicate_genomes" function by using the same distance method (Manhattan).

The radiator:detect_duplicate_genomes() function computes the Manhattan distance between individuals by implementing the function amap::Dist:

- with dartR::gl.dist.ind() --> i obtain distance min = 0.08, max = 0.26

- with radiator::detect_duplicate_genomes() --> I obtain distance min 0.31, max=1

I think the difference in values might be due to the fact that the distance reported in radiator::detect_duplicate_genomes() is a "relative distance" = for each individual, it's the distance divided by the maximum distance observed (i.e. 0.08/0.26 = 0.307).

However, I don't understand why when I use the option "scale = TRUE" in dartR::gl.dist.ind (scale=TRUE --> distances are scaled to fall in the range [0,1]), I get exactly the same min and max values (0.08, 0.26).
Shouldn't I get as maximum 1 in this case? Or different values anyways?

2) My second question regards how to use this measure, pairwise distances between individuals, to detect duplicate or mixed samples.
I am assuming that for a duplicate sample, you would expect a high similarity (>98%?) thus a very small genetic distance. However, how do you differentiate mixed samples from related samples? Mixed samples would have high heterozygosity and high similarity, but i would expect the same for related samples, no?

Is there a function in dartR that computes genome similarity (=the proportion of the shared genotypes averaged across shared markers between each pairwise comparison) (similar to the radiator detect_duplicate_genomes (genome=TRUE))?

Is there a function in dartR that can check for allele ratio (ref/alt) (coverage imbalance between alternate & ref allele) for all loci in each individual?

Thank you for any clarification and information you can provide!
Best,

Gabriella

Berry, Olly (NCMI, IOMRC Crawley)

unread,

Jun 27, 2023, 4:14:39 AM6/27/23

to da...@googlegroups.com

Hi Gabriella,

Not commenting on most of your post, but I find plotting observed heterozygosity per individual to be a good way to identify genotypes that likely result from cross-contamination. As you say, cross-contaminated samples will stand out as having unusually high heterozygosity. Splitting it by population/sample site can also spread the data out a bit for viewing and also identify whether certain sampling events introduced cross contamination more than others. Its not a precise science in the way I use this approach, but a useful filter.

Cheers,

Olly

--
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/deee70ce-2a3d-461c-bd65-5d515bb06d82n%40googlegroups.com.

Peter Unmack

unread,

Jun 27, 2023, 6:40:02 AM6/27/23

to da...@googlegroups.com

Just to follow on from Olly said. For every dataset I always check
heterozygosity as there are often a couple of samples that have higher
heterozygosity than they should due to contamination. Sometimes I have
to check by population to find them. They really should be removed
prior to any analyses. I use this bit of code

het <- rowMeans(as.matrix(gl)==1, na.rm=T)
write.csv (het, file="het.csv")

then I open it in excel, I usually add in some metadata so I can use the
filters in excel to show different subsets of the data. I put a simple
bar graph on a separate sheet. It is difficult to say when a certain
heterozygosity value is too high and it depends on how many individuals
from a population you have for comparison too.

To find identical individuals you can export the data to a fasta file
and import it to a program like mega and generate a neighbour joining
tree. It is pretty usually to find two samples which are identical, or
extremely close to identical, so they tend to be pretty obvious if they
are the same.

Cheers
Peter Unmack

On 27/06/2023 6:14 pm, 'Berry, Olly (NCMI, IOMRC Crawley)' via dartR wrote:
> Hi Gabriella,
>
> Not commenting on most of your post, but I find plotting observed
> heterozygosity per individual to be a good way to identify genotypes
> that likely result from cross-contamination. As you say,
> cross-contaminated samples will stand out as having unusually high
> heterozygosity. Splitting it by population/sample site can also spread
> the data out a bit for viewing and also identify whether certain
> sampling events introduced cross contamination more than others. Its not
> a precise science in the way I use this approach, but a useful filter.
>
> Cheers,
>
> Olly
>

> *From:*da...@googlegroups.com <da...@googlegroups.com> *On Behalf Of
> *Gabriella Scatà
> *Sent:* Tuesday, 27 June 2023 1:59 PM
> *To:* dartR <da...@googlegroups.com>
> *Subject:* [dartR] pairwise individual distances & genomic distances +

> difference with relatedness estimates
>
> Hi everyone,
>
> this might be a bit of a long post as it related pairwise genetic
> distances between individuals and their use to detect duplicated or
> mixed (cross-contaminated) samples as and differentiate them from
> actually related individuals.
>
> 1) My first question regards the output of the function

> "/dartR::gl.dist.ind/".

>
> As far as I understood, it computes pairwise individual distances based
> on allele counts of either the reference or alternate allele for each
> shared locus between 2 individuals.
>
> I tried to run gl.dist.ind with the distance method "Manhattan" and I
> get very different results from those obtained with the

> "/radiator::detect_duplicate_genomes/" function by using the same

> distance method (Manhattan).
>
> The radiator:detect_duplicate_genomes() function computes the Manhattan
> distance between individuals by implementing the function amap::Dist:
>
> - with dartR::gl.dist.ind() --> i obtain distance min = 0.08, max = 0.26
>
> - with radiator::detect_duplicate_genomes() --> I obtain distance min
> 0.31, max=1
>
> I think the difference in values might be due to the fact that the
> distance reported in radiator::detect_duplicate_genomes() is a "relative
> distance" = for each individual, it's the distance divided by the
> maximum distance observed (i.e. 0.08/0.26 = 0.307).
>
> However, I don't understand why when I use the option "scale = TRUE" in

> /dartR::gl.dist.ind/ (scale=TRUE --> distances are scaled to fall in the

> range [0,1]), I get exactly the same min and max values (0.08, 0.26).
> Shouldn't I get as maximum 1 in this case? Or different values anyways?
>
> 2) My second question regards how to use this measure, pairwise
> distances between individuals, to detect duplicate or mixed samples.
> I am assuming that for a duplicate sample, you would expect a high
> similarity (>98%?) thus a very small genetic distance. However, how do
> you differentiate mixed samples from related samples? Mixed samples
> would have high heterozygosity and high similarity, but i would expect
> the same for related samples, no?
>
> Is there a function in dartR that computes genome similarity (=the
> proportion of the shared genotypes averaged across shared markers
> between each pairwise comparison) (similar to the radiator
> detect_duplicate_genomes (genome=TRUE))?
>
> Is there a function in dartR that can check for allele ratio (ref/alt)
> (coverage imbalance between alternate & ref allele) for all loci in each
> individual?
>
> Thank you for any clarification and information you can provide!
> Best,
>
> Gabriella
>
> --
> You received this message because you are subscribed to the Google
> Groups "dartR" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dartr+un...@googlegroups.com

> <mailto:dartr+un...@googlegroups.com>.

> To view this discussion on the web visit

> https://groups.google.com/d/msgid/dartr/deee70ce-2a3d-461c-bd65-5d515bb06d82n%40googlegroups.com <https://groups.google.com/d/msgid/dartr/deee70ce-2a3d-461c-bd65-5d515bb06d82n%40googlegroups.com?utm_medium=email&utm_source=footer>.

>
> --
> You received this message because you are subscribed to the Google
> Groups "dartR" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dartr+un...@googlegroups.com

> <mailto:dartr+un...@googlegroups.com>.

> To view this discussion on the web visit

> https://groups.google.com/d/msgid/dartr/MEAPR01MB2502965C74AB39B2F34DC62BFF27A%40MEAPR01MB2502.ausprd01.prod.outlook.com <https://groups.google.com/d/msgid/dartr/MEAPR01MB2502965C74AB39B2F34DC62BFF27A%40MEAPR01MB2502.ausprd01.prod.outlook.com?utm_medium=email&utm_source=footer>.

Gabriella Scatà

unread,

Jun 28, 2023, 5:06:57 AM6/28/23

to dartR

Dear Peter and Olly,

thank you for your reply and advice.

In a previous post, Luis has suggested me a few approaches to test relatedness both within dartR and outside. I still have to test those methods.

I am just realising that there are a wide variety of genetic similarities, relatedness and kinship estimators, and would like some advice on which are recommended for which cases, although i know this can be very variable depending on the situation.

So far I discarded both the R packages "related" and "Demerelate", as they both have issues.

I have managed to obtain a "kinship" estimate through the SNPRelate package which uses Maximum-Likelihood Estimation (MLE) of the coancestry coefficient (Choi et al., 2009; Milligan, 2003).

However, I am still a bit confused on how you would be able to discriminate highly related individuals from cross-contaminated samples.

For example, in my data I have 5-6 individuals in 2 out of 7 populations (sampling locations) which have extremely high individual heterozygosity (0.4-0.5), but the kinship coefficient (calculated with the SNPRelate MLE estimator) is 0.25-0.34 between these individual pairs.
However, the average heterozygosity of these populations is higher that that of the other populations. So, I am wondering whether they are actually cross-contaminated samples or whether there is actually something biological going on in those 2 populations which also have higher genetic variability across their individuals (about 20) (as detected through PCA).
When I looked at pairwise genetic distances between individuals (based on Manhattan distance estimated through allele counts of the alternate allele - through the radiator::detect_duplicate_genomes() function), those 6 individuals (3 pairs) turn out as outliers in the Manhattan plot and they have the lowest relative pairwise distances (0.3, 0.5).
Would this be considered cross-contamination?
I just don't want to miss something biologically relevant by assuming is cross-contamination...

Regarding which measure of relatedness/kinship to use, so far I have read only the Milligan (2003) and the Wang (2014, 2017) papers.

Milligan (2003) compares the estimates of the Coancestry coefficient (tetha) obtained through 5 non-likelihood methods (including Queller & Goodnight) and the MLE method used by Milligan.

The MLE has lower standard error (SE) across a range of sampling conditions according to simulations (i.e. multiallelic or biallelic loci, a range of n loci 5-30 used, 3 allele frequencies distributions including balanced frequencies across alleles & cases when one allele is dominant, and various degrees of actual relatedness). However, when one allele is dominant over the others, especially for biallelic markers, and actual relatedness is close to the biological boundaries (i.e. tetha ~ 0 --> unrelated individuals/1st cousins OR tetha ~0.5 --> highly related individuals), the bias of the MLE estimate is considerable. However, the Root Mean Square Error of the MLE estimate, which integrates both SE and bias measures of performance, remains low even in these conditions. Therefore, Milligan suggests that the bias is of little biological consequence....

Wang (2014), confirms that the MLE estimate of tetha (since this estimate is constrained within the biologically meaningful boundaries of tetha (0-0.5) ), are upwardly biased in some cases.

So, I am not sure whether this method is ok to be used for the biallelic SNPs and whether I can trust the values obtained, although it seems that with a high enough number of loci used, the estimate of relatedness can be more accurate.

I have seen this method used with SNPs data (i.e. Sadanandan, et al., 2020), but I am still unsure whether it is reliable or not, especially when using the biallelic SNPs. I guess I would need to test whether my alleles frequencies distributions are balanced, to know whether the estimates obtained are accurate enough.

Wang (2017) develops a new Wang estimator corrected for small sample sizes populations (<10) and shows that it performs better than all other estimators tested under those conditions. Again not sure if it performs well with structured populations. The new estimator is iomplemented in Coancestry, so I i wll try that.

I am still new to this analysis, as I understood relatedness = 2 tetha (that is 2*the coefficient of coancestry), but I know there are tons of different methods for calculating relatedness, and I am just starting to explore and read about some of them...so some advice or sharing of your experience using different approaches would be highly valuable!

If you have any suggestions on review papers that can give me a better idea of which relatedness estimators are available and the pros & cons of each/which situations to use them in, that would be great! Even just some review papers on the difference between relatedness & kinship/coancestry coefficient would be great. I only read Milligan (2003), Weir (2006) and Wang (2014, 2017) so far, but I am still quite confused.

So far, I have also been recommended the following approaches (inlcuding the ones Luis-dartR recommended):

- dartR function gl.grm.network

- software Coancestry (in the process of testing it)

- NGSrelateV2

- I was told that relatedness estimates can be inflated if the individuals sampled have population structure. I was recommended an R package that account for population structure like PCrelate (can use PCair to characterize pop structure using PCA) https://rdrr.io/bioc/GENESIS/man/pcrelate.html

- software Colony

- sequoia R package (Huisman, Jisca. "Pedigree reconstruction from SNP data: parentage assignment, sibship clustering and beyond." Molecular ecology resources 17.5 (2017): 1009-1024.)
- popkin R package (Ochoa, A., & Storey, J. D. (2021). Estimating FST and kinship for arbitrary population structures. PLoS genetics, 17(1), e1009241.)

Considering the range of options and estimation approaches, some shared knowledge on experience with these approaches & programs would be highly useful!
I am hoping to test Coancestry and a few of the appraches listed above, and I can share whatever outcome I obtain and how it compares between methods...it might just take some time.

In the meantime, if anyone has any advice/experience with this type of analysis and any of these approaches, it would be great to hear from you!

Thanks!
Gabriella

Reply all

Reply to author

Forward

0 new messages