Just to follow on from what Olly said. For every dataset I always check
heterozygosity as there are often a couple of samples that have higher
heterozygosity than they should due to contamination. Sometimes I have
to check by population to find them. They really should be removed
prior to any analyses. I use this bit of code:
# proportion of heterozygous calls (genotype == 1) per individual, ignoring missing data
het <- rowMeans(as.matrix(gl) == 1, na.rm = TRUE)
write.csv(het, file = "het.csv")
Then I open it in Excel. I usually add in some metadata so I can use the
filters in Excel to show different subsets of the data, and I put a simple
bar graph on a separate sheet. It is difficult to say when a certain
heterozygosity value is too high; it also depends on how many individuals
from a population you have for comparison.
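
The same by-population check can also be sketched directly in R rather than
in Excel. This is only a minimal sketch: it assumes genotypes coded 0/1/2 (as
as.matrix(gl) returns for a genlight object) and one population label per
individual (for example from pop(gl)). The toy matrix, the labels and the
cutoff of 1.5 times the population median are made up for illustration and
are not a recommended threshold.

```r
# Toy genotype matrix: rows = individuals, columns = loci,
# coded 0 = hom ref, 1 = het, 2 = hom alt, NA = missing.
# With real data this would be as.matrix(gl).
geno <- matrix(
  c(0, 1, 2, 0, 0, 2,
    0, 0, 2, 1, 0, 2,
    1, 1, 1, 1, 0, 1,   # suspiciously heterozygous
    2, 0, 0, 0, 1, NA,
    2, 0, 1, 0, 0, 0,
    2, 1, 0, 0, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("ind", 1:6), paste0("loc", 1:6))
)
# Hypothetical population labels; with real data this would be pop(gl).
popn <- factor(c("A", "A", "A", "B", "B", "B"))

# Observed heterozygosity per individual:
# proportion of non-missing loci scored as 1.
het <- rowMeans(geno == 1, na.rm = TRUE)

# Median heterozygosity of each individual's own population.
pop_median <- ave(het, popn, FUN = median)

# Flag individuals well above their population median.
# The factor 1.5 is an arbitrary illustration, not a rule.
flagged <- het > 1.5 * pop_median

res <- data.frame(ind = rownames(geno), pop = popn, het = het, flagged = flagged)
print(res[order(res$pop, -res$het), ])
```

Comparing each individual to its own population keeps a naturally more
diverse population from hiding or falsely flagging samples, but with only a
few individuals per population the median is not very reliable, so the
flagged samples still need a look by eye.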
To find identical individuals you can export the data to a FASTA file,
import it into a program like MEGA and generate a neighbour-joining
tree. It is pretty unusual to find two samples which are identical, or
extremely close to identical, so they tend to be pretty obvious if they
are the same.
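
If you would rather have a table than a tree, near-identical pairs can also
be listed in R. This is a rough sketch of a different technique from the
MEGA tree above: it simply counts the proportion of loci with different
genotype calls between each pair of individuals, using only loci scored in
both. The toy matrix and the 0.05 cutoff are made up for illustration; with
real data geno would be as.matrix(gl), and the double loop will be slow for
large numbers of individuals.

```r
# Toy genotype matrix (0/1/2, NA = missing); with real data use as.matrix(gl).
geno <- matrix(
  c(0, 1, 2, 0,  2, 1,
    0, 1, 2, 0,  2, 1,   # same genotypes as ind1
    0, 1, 2, NA, 2, 0,   # one mismatch against ind1 among 5 shared loci
    2, 0, 0, 1,  0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), NULL)
)

n <- nrow(geno)
mismatch <- matrix(NA_real_, n, n,
                   dimnames = list(rownames(geno), rownames(geno)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    # Proportion of loci with different genotype calls,
    # using only loci scored in both individuals.
    mismatch[i, j] <- mean(geno[i, ] != geno[j, ], na.rm = TRUE)
  }
}

# List each pair once (upper triangle) if it falls below the cutoff.
# The cutoff 0.05 is arbitrary and only for illustration.
close_pairs <- which(mismatch < 0.05 & upper.tri(mismatch), arr.ind = TRUE)
dupes <- data.frame(
  ind1 = rownames(mismatch)[close_pairs[, "row"]],
  ind2 = colnames(mismatch)[close_pairs[, "col"]],
  mismatch = mismatch[close_pairs]
)
print(dupes)
```

Real duplicates rarely come out at exactly zero because of genotyping error
and missing data, so it is worth looking at the whole distribution of
pairwise values before settling on a cutoff.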
Cheers
Peter Unmack
On 27/06/2023 6:14 pm, 'Berry, Olly (NCMI, IOMRC Crawley)' via dartR wrote:
> Hi Gabriella,
>
> Not commenting on most of your post, but I find plotting observed
> heterozygosity per individual to be a good way to identify genotypes
> that likely result from cross-contamination. As you say,
> cross-contaminated samples will stand out as having unusually high
> heterozygosity. Splitting it by population/sample site can also spread
> the data out a bit for viewing and also identify whether certain
> sampling events introduced cross contamination more than others. It's not
> a precise science in the way I use this approach, but a useful filter.
>
> Cheers,
>
> Olly
>
> *From:* da...@googlegroups.com <da...@googlegroups.com> *On Behalf Of* Gabriella Scatà
> *Sent:* Tuesday, 27 June 2023 1:59 PM
> *To:* dartR <da...@googlegroups.com>
> *Subject:* [dartR] pairwise individual distances & genomic distances +
> difference with relatedness estimates
>
> Hi everyone,
>
> this might be a bit of a long post, as it relates to pairwise genetic
> distances between individuals and their use to detect duplicated or
> mixed (cross-contaminated) samples and to differentiate them from
> actually related individuals.
>
> 1) My first question regards the output of the function
> "/dartR::gl.dist.ind/".
>
> As far as I understood, it computes pairwise individual distances based
> on allele counts of either the reference or alternate allele for each
> shared locus between 2 individuals.
>
> I tried to run gl.dist.ind with the distance method "Manhattan" and I
> get very different results from those obtained with the
> "/radiator::detect_duplicate_genomes/" function by using the same
> distance method (Manhattan).
>
> The radiator::detect_duplicate_genomes() function computes the Manhattan
> distance between individuals by implementing the function amap::Dist:
>
> - with dartR::gl.dist.ind() --> I obtain distance min = 0.08, max = 0.26
>
> - with radiator::detect_duplicate_genomes() --> I obtain distance min =
> 0.31, max = 1
>
> I think the difference in values might be due to the fact that the
> distance reported in radiator::detect_duplicate_genomes() is a "relative
> distance" = for each individual, it's the distance divided by the
> maximum distance observed (i.e. 0.08/0.26 = 0.307).
>
> However, I don't understand why when I use the option "scale = TRUE" in
> /dartR::gl.dist.ind/ (scale=TRUE --> distances are scaled to fall in the
> range [0,1]), I get exactly the same min and max values (0.08, 0.26).
> Shouldn't I get as maximum 1 in this case? Or different values anyways?
>
> 2) My second question regards how to use this measure, pairwise
> distances between individuals, to detect duplicate or mixed samples.
> I am assuming that for a duplicate sample, you would expect a high
> similarity (>98%?) thus a very small genetic distance. However, how do
> you differentiate mixed samples from related samples? Mixed samples
> would have high heterozygosity and high similarity, but I would expect
> the same for related samples, no?
>
> Is there a function in dartR that computes genome similarity (=the
> proportion of the shared genotypes averaged across shared markers
> between each pairwise comparison) (similar to the radiator
> detect_duplicate_genomes (genome=TRUE))?
>
> Is there a function in dartR that can check for allele ratio (ref/alt)
> (coverage imbalance between alternate & ref allele) for all loci in each
> individual?
>
> Thank you for any clarification and information you can provide!
> Best,
>
> Gabriella
>
> --
> You received this message because you are subscribed to the Google
> Groups "dartR" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to dartr+un...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/dartr/deee70ce-2a3d-461c-bd65-5d515bb06d82n%40googlegroups.com.