Bruvo Polyploidy and Missing data

metz...@uw.edu

unread,

Sep 23, 2014, 11:03:01 AM9/23/14

to po...@googlegroups.com

I have a microsatellite dataset with individuals with different ploidy levels (2-4). Since Bruvo's distance is designed for this type of comparison I would like to use this method. Does anyone know exactly how poppr's version of Bruvo treats missing?

Specifically, I have to recode my data as tetraploid with missing values (a heterozygous diploid locus with alleles of 120 and 124 becomes 120 124 0 0, for example). Does poppr's Bruvo treat those zeros as missing in the sense that it knows it is a diploid and implements Bruvo's method appropriately, or does it treat them as missing values in a tetraploid organism and imputes the missing values, giving the wrong distance?

I have calculated Bruvo's distances by hand for all my loci and they are almost the same as poppr says, but slightly different for comparisons between individuals of different ploidy (I am not up for generating bootstrap values by hand, however). I would greatly appreciate any help with this.

--Michael

Zhian Kamvar

unread,

Sep 23, 2014, 12:59:00 PM9/23/14

to po...@googlegroups.com

Hello Michael,

Thank you for asking this question. Essentially, poppr will treat all comparisons at the highest observed ploidy level not taking into account missing data. This means that two diploid individuals, while they are in a tetraploid framework, will be treated as diploids. Much of the information is found in the algorithms and equations manual.

A note of caution that I will restate here, however: partial heterozygotes are only treated as the observed alleles. This means: tetraploids showing two alleles will be treated as diploids and tetraploids showing three alleles will be treated as triploids.

-----

Regarding the slightly different values you see for comparisons between individuals of different ploidy, I have an idea of why you might be seeing a discrepancy. This might be due to how the imputations are implemented. Poppr calculates the number of imputations necessary as n multichoose k, (where n is the number of observed alleles and k is the number of missing) as opposed to n^k. This will give all possible unordered combinations of alleles equal weight in the imputations. For an example, let's say you are comparing a diploid genotype of 120 and 124 against a tetraploid genotype: poppr will impute the genotype as:

[120, 124, 120, 120]

[120, 124, 120, 124]

[120, 124, 124, 124]

Where the blue numbers are the ones imputed.

I have seen other methods include the genotype of [120, 124, 124, 120], but this is erroneous because it assumes that the order of the alleles is important for this analysis. Since the distance is an average over the shortest path, both [120, 124, 124, 120] and [120, 124, 120, 124] will give the same answer. Including both in the imputations will bias the distance estimation toward the heterozygote genotype.

I hope that helps,

Zhian

metz...@uw.edu

unread,

Sep 23, 2014, 2:03:32 PM9/23/14

to po...@googlegroups.com

Zhian--

Thank you so much for your quick and helpful response. It is good to know that the poppr treats missing this way. Based on your response, poppr is implementing the Genome Addition Model described in the paper and in the algorithm pdf you sent. The calculation I was doing by hand/excel was the simpler Infinite Alleles Model (which explains why my equal ploidy comparisons were the same but my diploid/tetraploid calculations were similar, but not exactly the same), and either the Genome Addition Model or the Infinite Alleles Model would work for my purposes.

One other quick question--I had been thinking of trying to use the Bruvo distance with the "band sharing" model, which treats all different alleles as unique (a distance of 0 or 1) rather than using the 2^-|x| calculation (fragment analysis is great for determining different sized alleles, but the systemic biases in the sizing can lead to some errors in the absolute size differences of alleles). I believe I can do this using poppr by telling it that the step distance is an arbitrarily low number (0.0001). This should treat any difference, even a single base, as a huge number of steps, and give a distance of 1. Do you see any reason why this would not work? I assume putting zero into that calculation would blow it up with infinite numbers, and I think 0.0001 should essentially do the trick.

Thanks again,

--Michael

Zhian Kamvar

unread,

Sep 23, 2014, 3:30:03 PM9/23/14

to po...@googlegroups.com

Michael,

The description I gave was indeed for the GAM, but poppr implements all models (including infinite). The choice of the model is selected by modifying the add and loss flags in the bruvo.* functions.

Your idea for turning Bruvo's distance into a band-sharing distance is brilliant, in my opinion. To be safe, make sure to set the value to something where any differences will be >50 (i.e. as.character(1 - 2^-abs(1:51))). When applied to populations of known ploidy, this essentially turns into Provesti's distance. This distance is equivalent to the fraction of differences between genotypes and scales from zero to one (as the number of different alleles is divided by the number of loci multiplied by the ploidy). This trick with Bruvo's distance circumvents the issue of using Provesti's distance on mixed ploidy populations where the maximum distance is not known.

Cheers,

Zhian

metz...@uw.edu

unread,

Sep 23, 2014, 7:08:36 PM9/23/14

to po...@googlegroups.com

Zhian--

That is great. Regarding model choice, I see the add and loss arguments in the manual, but I don't see an infinite. Is that the default or do you set it to infinite by putting both add and loss to FALSE?

Also, please let me know if there is a publication I can cite.

Thanks for all your help!

--Michael

Zhian Kamvar

unread,

Sep 23, 2014, 7:22:32 PM9/23/14

to po...@googlegroups.com

Michael,

Setting both add = FALSE and loss = FALSE will render the infinite sites model. By default, the average of the GAM and GLM is utilized. I will make this distinction more explicit in the next version.

The publication you can cite is here and can be obtained by typing citation("poppr") into your R console.

Zhian

metz...@uw.edu

unread,

Sep 23, 2014, 7:29:50 PM9/23/14

to po...@googlegroups.com

Zhiang---

Thanks for all your help! I'll make sure to cite the paper.

--Michael

metz...@uw.edu

unread,

Sep 26, 2014, 7:06:26 PM9/26/14

to po...@googlegroups.com

Zhiang--

I have a new fun question about the calculation of Bruvo distances: Using an infinite alleles model, I can do pieces of the analysis by hand and a few of the numbers are still a bit odd. For example, if I do an infinite alleles bruvo matrix distance (which I think is coded with the flags: add=FALSE, loss=FALSE) on the samples pasted below with two individuals and two microsatellite alleles (one sample diploid and one triploid) it doesn't seem to recognize the shared allele of 190 in the Mar2 locus.

Using band sharing (replen = c(0.00001,0.00001)) the distance given is 1.0, which is not correct (should be 0.83333). I don't think this is a problem with my band sharing trick--with c(2,2) the distance is 0.9739126, which is still too high.

If I add a new size instead of the final 0 (like 2, for instance), I get 0.83333, the expected value. Also, if I replace the 210 with 0, so that both individuals are diploid I get 0.75, as expected. I don't know if this a problem with the way I am formatting things, or if there is something strange going on with the calculations across different ploidy levels.

One possible explanation is that the (add=FALSE, loss=FALSE) is not doing the Infinite alleles model or doing something odd. Interestingly if I type (add=TRUE, loss=FALSE) then I get 0.83333, which is the correct answer with the gene addition model (in this example, both models should give 0.83333). If I replace 203 with 214, then GAM (add=TRUE, loss=FALSE) gives 0.66667 (correct), but Infinite alleles (add=FALSE, loss=FALSE) gives 0.83333 (incorrect).

As I said, most of the numbers are exactly as expected (and the diploid numbers are exactly as I would expect), so overall this doesn't have a huge effect on the analyses, but I would like to figure it out.

Sorry, for the long rambling attempt at explanation here and I would greatly appreciate help. It is likely that there is something I am not understanding.

In addition to this problem most or all mixed ploidy calculations are slightly lower than expected when I do my full analysis with 10 loci, but I haven't figured out exactly how that happens yet. It doesn't seem to happen when I use subsets of two loci--only when I use all 10. If I figure it out I will update and let you know to disregard my rambling.

Thanks.

-Michael

Ind		Mar1			Mar2
A11-H	1	242	236	228	214	190	210
A10-T	1	226	211	0	203	190	0

metz...@uw.edu

unread,

Sep 26, 2014, 7:41:45 PM9/26/14

to po...@googlegroups.com

I wasn't able to figure out the main question above, but I did solve my mystery of the lower values. I have one locus which has been completely lost in many samples, so this is presented as 0 0 0 0 and I had been considering any comparison between it and another allele (diploid or lost) as having a distance of 1. In contrast, poppr will calculate the correct distance between two diploids, but will completely drop the locus from analysis if it is 0 0 0 0 (treating the 0 0 0 0 individuals as though the bruvo distance is calculated for 9 loci, not 10 loci). That explains why the poppr calculated numbers for individuals with that lost locus were ((expected*10)-1)/9. I think I could put "1" or some other number at the lost locus to force it to use the locus in the calculation and treat the distance as 1 with diploids and 0 with other individuals with that lost allele, but the current calculation may be a more conservative one.

I think that is more information that anyone needed, but I am happy to figure that part out.

Zhian Kamvar

unread,

Sep 26, 2014, 9:48:41 PM9/26/14

to po...@googlegroups.com

Hi Michael,

I have found the problem! Thank you for providing the example data. I reproduced your example and I have found a bug here. Essentially, I had accidentally inserted a silly logical error when I was trying to squash bigger and badder bugs for the major release. I am going to push the patch to the github development site momentarily. You can then install it with the command devtools::install_github("grunwaldlab/poppr", ref = "devel")

Regarding your second point, I agree that the current approach is a more conservative one as your approach is assuming that the null loci are biologically relevant. Your proposed solution of recoding the alleles for those samples sounds like a good solution if the null loci are indeed relevant. I will make this more clear in the documentation before I release the patch to CRAN.

Thank you for diving into the details of this. I don't know how long this bug would have persisted otherwise.

Cheers,

Zhian

metz...@uw.edu

unread,

Sep 27, 2014, 1:46:13 AM9/27/14

to po...@googlegroups.com

Zhian--

Excellent! It fixes it for me. Thank you so much for finding and fixing that so quickly! And I think that the conservative approach here is probably the best, to avoid biasing the results with my assumptions.

Best,
--Michael

Reply all

Reply to author

Forward