Missing data and comparing samples of differing ploidy

metz...@uw.edu

Jan 5, 2015, 2:02:38 AM

to structure...@googlegroups.com

I am comparing samples of different ploidy levels (diploid and tetraploid) and trying to determine the best way to deal with the lower-ploidy organisms. Does anyone know specifically how missing values are treated in STRUCTURE?

If I treat the dataset as tetraploid, the diploid organisms have two "missing" alleles, and I am concerned that this will affect the analysis. I have generated a "pseudodiploid" dataset by duplicating the data from diploid individuals so that there are no missing values, and the groupings look different: there is more discrete structure within the diploid individuals in the pseudodiploid results. I suspect that comparisons between diploids are affected by the presence of the missing values (reducing observed clustering), but I am also worried that duplicating the values could artificially increase the clustering by overrepresenting the data.

The manual appears to say that in a tetraploid dataset, coding an individual as A B C (MISSING) indicates that the individual is triploid at the locus in question. However, this appears in the discussion of RECESSIVEALLELES, and it is unclear whether it assumes RECESSIVEALLELES=0 or 1, or whether the NOTAMBIGUOUS code is set to MISSING or to some other value.

If anyone has experience with this type of analysis, I would appreciate any advice.

Thanks.

Michael

Andrea Schreier

Jan 5, 2015, 11:48:30 AM

to structure...@googlegroups.com

Hi, Michael. I have used Structure to analyze data from polyploid
fishes and in one species I study, microsatellite markers are detected
in either four copies or eight copies (in some ways similar to your
two ploidy level system). Because of this and because I can't score
dosage of my alleles, I treat each microsatellite allele as a
present/absent dominant locus rather than treating each microsatellite
as a codominant polyploid locus. This approach is not as powerful as
an analysis of codominant data but it should help you get around your
problem. If you are interested in getting more details, email me off
the list (amdr...@ucdavis.edu) and I can send you some papers to look
at for examples.
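
As a rough illustration of the present/absent recoding described above, here is a minimal Python sketch. The function name, the input layout (a dict of individuals, each mapping a locus to the set of alleles observed) and the use of -9 for a failed locus are assumptions made for the example, not part of STRUCTURE or of any published pipeline. Each (locus, allele) pair becomes one 0/1 column, so allele dosage never needs to be scored.

```python
MISSING = -9  # assumed missing-data code, matching the -9 used elsewhere in this thread


def presence_absence(individuals):
    """Recode codominant allele calls as one 0/1 column per (locus, allele).

    individuals: {name: {locus: set of observed alleles, or None if the
    locus could not be scored}} -- a hypothetical input layout.
    """
    # One column for every allele seen at every locus, in a stable order.
    columns = sorted({(locus, allele)
                      for loci in individuals.values()
                      for locus, alleles in loci.items()
                      for allele in (alleles or ())})
    table = {}
    for name, loci in individuals.items():
        row = []
        for locus, allele in columns:
            observed = loci.get(locus)
            if observed is None:
                row.append(MISSING)  # locus not scored for this individual
            else:
                row.append(1 if allele in observed else 0)
        table[name] = row
    return columns, table


example = {
    "fish1": {"L1": {240, 244}},
    "fish2": {"L1": {244, 248, 252}},
    "fish3": {"L1": None},
}
columns, table = presence_absence(example)
```

How the resulting columns are then declared to STRUCTURE (ploidy, dominant-marker settings) depends on the run configuration and is left out of this sketch.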

Good luck!

Andrea

metz...@uw.edu

Jan 7, 2015, 11:37:54 PM

to structure...@googlegroups.com

I have done more reading and testing of STRUCTURE with different ploidy levels in case anyone is interested in this type of analysis.

I took a small (n=18) diploid dataset with 10 microsatellites and very minimal population structure, and ran STRUCTURE on 5 different variations:
A: Diploid (e.g. 240 244)
B: Diploid treated as tetraploid with two missing values (e.g. 240 244 -9 -9)
C: Pseudotetraploid: diploid values duplicated and treated as tetraploid (e.g. 240 244 240 244)
D: Ambiguous tetraploid: the same dataset as C, with the recessive-alleles line set to the missing value and the NOTAMBIGUOUS code set to a value that does not occur in the data, so that (240 244 240 244) could mean (240 244 240 244), (240 240 240 244) or (240 244 244 244).
E: Comparison of diploid and pseudotetraploid: a combination of datasets B and C, treated as tetraploid.
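
For anyone wanting to reproduce variations B and C, the two recodings can be sketched in Python as below. The function names and the per-locus tuple layout are hypothetical, and the row writer assumes the multi-row input layout (one row per allele copy, label first) with no extra columns; check it against your own input-file settings before using it.

```python
MISSING = -9  # missing-data code used in the examples above


def pad_to_tetraploid(genotype, missing=MISSING):
    """Variation B: keep the two observed alleles, pad with missing codes."""
    return [tuple(locus) + (missing, missing) for locus in genotype]


def duplicate_to_tetraploid(genotype):
    """Variation C: repeat the two observed alleles to fill four slots."""
    return [tuple(locus) * 2 for locus in genotype]


def structure_rows(label, genotype):
    """Lay out one individual as one text row per allele copy.

    Assumes every locus tuple has the same length (the ploidy).
    """
    ploidy = len(genotype[0])
    return [" ".join([label] + [str(locus[i]) for locus in genotype])
            for i in range(ploidy)]


# Two loci for one diploid individual (made-up allele sizes).
diploid = [(240, 244), (180, 180)]
padded = pad_to_tetraploid(diploid)
duplicated = duplicate_to_tetraploid(diploid)
rows_b = structure_rows("ind1", padded)
rows_c = structure_rows("ind1", duplicated)
```

Dataset E would then simply be the rows from both recodings written to one file, with distinct labels for the two copies of each individual.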

(A JPEG of the major K=3 output, visualized with CLUMPAK, is attached.)

A and B were nearly identical (with minimal structure observed), while C and D showed much more divergent population structure and were very similar to each other, though not identical. This suggests that padding with missing values and treating the data as a higher ploidy does not affect the analysis when the recessive alleles option is not used (as the manual suggests). Additionally, in E the clusters looked like those in C and D, as expected, and the two representations of each individual (i.e. 240 244 -9 -9 and 240 244 240 244) were assigned to the same cluster.

This means that for codominant data such as microsatellites, treating the dataset as the higher ploidy, with missing values for the lower-ploidy individuals (and without using recessive alleles), should correctly cluster the individuals without artificially clustering the higher-ploidy individuals together. This seems like a good option if you want to look at the population clustering of individuals with different ploidy without treating ploidy level itself as a marker of difference, although it does seem to overrepresent the data in the low-ploidy individuals.

If anyone has any other thoughts I would be very interested.

Thanks.

Ploidy Test.jpg
