Evanno's delta K method and underestimation of K

ja

Jul 20, 2011, 4:53:22 AM
to structure-software
I've read in several papers (e.g. Waples & Gaggiotti, Mol Ecol 2006,
15, 1419-1439) and seen it discussed on this forum that Evanno's delta
K method identifies the highest level of population structure, and K
may therefore be underestimated when there is hierarchical structure.
I believe I may be experiencing this issue with my own dataset, and am
looking for guidance and/or suggested references on how to deal with
this.

For my data, there is a fairly linear increase in ln Pr(X|K) which
"more-or-less plateaus" around K=5-7, yet the Evanno method
definitively identifies K=2 as the optimal solution. When I look at
the population assignments in the K=5-7 range, they seem biologically
and geographically sensible. Moreover, most individuals are
definitively assigned to one cluster or another, which also suggests
that K is not being overestimated. I recently tried to publish these
results, and justified my selection of K=6 based on these qualitative
criteria. However, the reviewers have required that I use the delta K
method, which I feel substantially underestimates K. I'm not sure how
to respond.
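
To make the contrast above concrete, here is a minimal sketch of how delta K is usually computed from replicate runs, assuming the common reading of Evanno's statistic as |L(K+1) - 2L(K) + L(K-1)| / sd[L(K)], where L(K) is the mean ln Pr(X|K) over replicates. The function name, the input layout, and the numbers are made up for illustration and are not from my dataset.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K per K from replicate ln Pr(X|K) values.

    lnp_by_k maps each K to a list of ln Pr(X|K) values, one per replicate
    run (at least two replicates per K, so a standard deviation exists).
    Only K values with both neighbours K-1 and K+1 get a score.
    """
    scores = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in lnp_by_k or k + 1 not in lnp_by_k:
            continue  # the second-order difference needs both neighbours
        second_diff = abs(
            mean(lnp_by_k[k + 1]) - 2 * mean(lnp_by_k[k]) + mean(lnp_by_k[k - 1])
        )
        sd = stdev(lnp_by_k[k])
        scores[k] = second_diff / sd if sd > 0 else float("inf")
    return scores


# Made-up values: ln Pr(X|K) keeps rising, but the biggest jump is from K=1 to K=2.
example = {
    1: [-1000, -1002],
    2: [-800, -801],
    3: [-750, -752],
    4: [-700, -704],
}
scores = delta_k(example)
best_k = max(scores, key=scores.get)  # 2 for these made-up numbers
```

Note that the statistic is undefined at the smallest and largest K tested, so it can never select K=1, and it rewards the sharpest change in slope rather than the highest likelihood.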

I have seen studies that run the delta K method, split the dataset
based on the optimal K, and then repeat until the optimal K=1 for all
subgroups (e.g. Coulon et al, Mol Ecol 2008, 17, 1685-1701). However,
I have not seen any detailed reviews of this approach or how broadly
applicable it might be to different datasets. Moreover, it's a real
pain to do on large datasets (it seems to me that if this is a sound
and generally applicable method, a savvy coder could make a program
that automates this process and get a quick pub out of it, while
helping the pop gen community). I am particularly uncertain about
this iterative delta K/data-splitting approach because my data has
patterns of isolation by distance, so I'm not sure how appropriate an
analysis intended for hierarchical structure might be (although I
suppose it's possible to have patterns of both hierarchical structure
and isolation by distance in the same data).
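
The bookkeeping for the iterative approach could be sketched roughly as below. This is only an outline under stated assumptions: `cluster_fn` is a hypothetical callback standing in for "run the analysis on this subset, pick K, and return the resulting groups", and `min_size` is an arbitrary cutoff I made up. Because delta K cannot return K=1, the callback would need some other rule for deciding that a subset should not be split further.

```python
def split_recursively(individuals, cluster_fn, min_size=10):
    """Split a sample repeatedly until no subset divides further.

    individuals: list of individual IDs.
    cluster_fn: hypothetical callback; takes a list of IDs and returns a
        list of sub-lists (one per cluster). Returning a single group
        means "no further structure" for that subset.
    min_size: subsets smaller than this are not analysed further.
    Returns the list of terminal groups.
    """
    if len(individuals) < min_size:
        return [individuals]
    groups = [g for g in cluster_fn(individuals) if g]  # drop empty clusters
    if len(groups) <= 1:
        return [individuals]
    leaves = []
    for group in groups:
        leaves.extend(split_recursively(group, cluster_fn, min_size))
    return leaves


# Toy callback for illustration only: splits IDs below 50 from the rest.
def toy_cluster_fn(ids):
    return [[i for i in ids if i < 50], [i for i in ids if i >= 50]]


terminal_groups = split_recursively(list(range(100)), toy_cluster_fn)
```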

I'd appreciate any thoughts or advice that anyone can offer on these
topics.

Jonathan Pritchard

Jul 20, 2011, 10:09:03 AM
to structure...@googlegroups.com

Hi JP:

Estimation of K is difficult in mixture models in general, and in
Structure in particular. The approach that Structure takes (i.e.,
computing Pr[K|X]) is theoretically justified, although the method that
Structure uses to compute this is approximate. In my experience I have
found that this method works well for simulated data with distinct
populations.

A more serious problem may be that in reality K is often not a very well
defined quantity. For example, human population structure across Europe
seems to fit an isolation-by-distance model pretty well, so if we had a
set of samples uniformly spread across Europe there would not be a
natural "correct" value of K. I think that this type of problem is
quite common.

Given these issues, I suggested in the Structure manual that people take
a relatively informal approach to evaluating K (in addition to the
Pr[K|X] criterion). I would consider higher values of K to be
justified if (i) the population assignments make biological sense--for
example if all K clusters include different proportions of individuals
from each sampling location, and (ii) all K clusters include at least
some individuals who are strongly assigned to that cluster. You
indicate below that your data meet both criteria for values of K of 5-7
which (although I have not seen your data) seems like an excellent
reason to report those results. In this case it seems likely that the
Evanno criterion is being overly conservative.
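
Criterion (ii) is easy to check mechanically. A minimal sketch, assuming the membership coefficients are available as one row per individual and one column per cluster; the function name and the 0.8 cutoff for "strongly assigned" are arbitrary illustrative choices, not values from the Structure manual.

```python
def strong_members_per_cluster(q_matrix, threshold=0.8):
    """Count, for each cluster, the individuals assigned at or above threshold.

    q_matrix: list of rows, one per individual; each row holds that
        individual's membership coefficients across the K clusters.
    threshold: illustrative cutoff for a "strong" assignment.
    """
    counts = [0] * len(q_matrix[0])
    for row in q_matrix:
        for cluster, q in enumerate(row):
            if q >= threshold:
                counts[cluster] += 1
    return counts


# Made-up coefficients for three individuals at K=3.
q = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.40, 0.30, 0.30],
]
counts = strong_members_per_cluster(q)
every_cluster_supported = all(c > 0 for c in counts)
```

A cluster with a count of zero has no strongly assigned individuals, which under criterion (ii) would argue against that value of K.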

Incidentally, this issue that K is often not very well-defined (for
example due to isolation by distance or hierarchical structure) is one
major reason why I have generally preferred to plot results for a series
of values of K (see e.g. Rosenberg et al 2002), and in your case it would
seem natural to go up to ~7.

Jonathan

