fastStructure: scripts for running replicates, averaging, and plotting (in lieu of distruct)

Mikhail Matz

Jul 15, 2014, 3:55:03 PM
to structure...@googlegroups.com
Hello - I have cobbled together a few simple scripts that might be useful; see the _walkthrough.txt file for details.
The idea is to run 100 replicates of fastStructure, select the 25 with the best likelihood, average their assignment probabilities, and plot the result using ggplot2.
These should work on a Mac or Linux/Unix.
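
For readers who want the gist without unpacking the archive, the select-and-average step might look roughly like the Python sketch below. This is not the code from the attached Perl and R scripts; the function name `top_k_mean`, the in-memory `(likelihood, Q matrix)` input, and the toy values are all assumptions made for illustration, and the sketch assumes cluster labels are already consistent across runs.

```python
def top_k_mean(runs, k):
    """Average the Q matrices of the k runs with the highest likelihood.

    runs: list of (likelihood, q_matrix) pairs, where q_matrix is a list of
    rows (one per individual) with one column per cluster.
    Assumes cluster labels already match across runs.
    """
    if not 0 < k <= len(runs):
        raise ValueError("k must be between 1 and the number of runs")
    # Sort by likelihood, best (largest) first, and keep the top k.
    best = sorted(runs, key=lambda r: r[0], reverse=True)[:k]
    n_ind = len(best[0][1])
    n_clu = len(best[0][1][0])
    mean_q = [[0.0] * n_clu for _ in range(n_ind)]
    for _, q in best:
        for i in range(n_ind):
            for j in range(n_clu):
                mean_q[i][j] += q[i][j] / k
    return mean_q


# Toy example: three runs, two individuals, K=2 (hypothetical values).
runs = [
    (-0.80, [[0.9, 0.1], [0.2, 0.8]]),
    (-0.95, [[0.5, 0.5], [0.5, 0.5]]),
    (-0.82, [[0.7, 0.3], [0.4, 0.6]]),
]
mean_q = top_k_mean(runs, k=2)  # averages the -0.80 and -0.82 runs
```

With 100 replicates, `k=25` would correspond to the setup described above.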

cheers

Mikhail

fastStructureRepsPlots.tgz

Vikram Chhatre

Jul 15, 2014, 5:02:17 PM
to structure-software, Anil Raj
Hi Mikhail,

Excellent work!  This should be rather useful.

I have one question.  Given that fastSTRUCTURE automatically performs iterations, is it necessary to replicate the runs as your script is set up to do?  I am also copying Anil here who may have more input for us.

Thanks again
Vikram




--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to structure-softw...@googlegroups.com.
To post to this group, send email to structure...@googlegroups.com.
Visit this group at http://groups.google.com/group/structure-software.
For more options, visit https://groups.google.com/d/optout.

Mikhail Matz

Jul 15, 2014, 5:37:54 PM
to structure...@googlegroups.com, raj...@stanford.edu
Hi Vikram and Raj -

Ha, a good question. I was following the suggestion in the preprint of the fastStructure paper for cases with weak structure:

"(page 20) When population structure is difficult to resolve, imposing a logistic prior and estimating its parameters using the data is likely to increase the power to detect weak structure. However, estimation of the hierarchical prior parameters by maximizing the approximate marginal likelihood also makes the model susceptible to overfitting by encouraging a small set of samples to be randomly, and often confidently, assigned to unnecessary components of the model. To correct for this, when using the logistic prior, we suggest estimating the variational parameters with multiple random restarts and using the mean of the parameters corresponding to the top 5 values of LLBO. In order to ensure consistent population labels when computing the mean, we permuted the labels for each set of variational parameter estimates to find the permutation with the lowest pairwise Jensen-Shannon divergence between admixture proportions among pairs of restarts."

As I said in another post ("poor chain mixing"), I seem to be having trouble with "overfitting", as described above, for a dataset with weak structure. Running 100 reps and selecting the top 25 for plotting helps in this particular case, but it still feels like a "seat of the pants" solution. I would greatly appreciate your opinions on that.
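
The label-matching step mentioned in the quoted passage can be illustrated with a small Python sketch. It is a simplification, not the paper's procedure: it aligns one run to a single reference run rather than minimizing divergence over all pairs of restarts, and it tries every permutation, which is only practical for small K. The names `js_divergence` and `align_labels` and the toy matrices are hypothetical.

```python
from itertools import permutations
from math import log


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def align_labels(reference, q):
    """Reorder the columns of q to best match reference.

    Tries every column permutation and keeps the one with the lowest
    total per-individual JS divergence. Cost grows as K!.
    """
    k = len(reference[0])
    best_perm, best_score = None, float("inf")
    for perm in permutations(range(k)):
        score = sum(
            js_divergence(ref_row, [row[j] for j in perm])
            for ref_row, row in zip(reference, q)
        )
        if score < best_score:
            best_perm, best_score = perm, score
    return [[row[j] for j in best_perm] for row in q], best_perm


# Toy example: the second run has its two cluster labels swapped.
reference = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
swapped = [[0.15, 0.85], [0.75, 0.25], [0.35, 0.65]]
aligned, perm = align_labels(reference, swapped)
```

Runs would be aligned this way before averaging, so that "cluster 1" refers to the same group in every replicate.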


Rebecca

Jul 16, 2014, 12:39:22 AM
to structure...@googlegroups.com, raj...@stanford.edu
This looks really useful - thanks! I'm interested to see what others' thoughts are on the overfitting issue too.

On a loosely-related note, PGDSpider has just (today, I think) been updated to produce a fastSTRUCTURE-formatted variant of the STRUCTURE file it was already outputting. I think this would get around a couple of the steps in your script? Even if not, it might be useful info for the wider STRUCTURE-using community.

Mikhail Matz

Jul 16, 2014, 10:55:50 AM
to structure...@googlegroups.com
Great news about PGDSpider, Rebecca!

Now we can skip the silly header-removing, extra-columns-inserting lines in the walkthrough. The Perl and R scripts would still be applicable if anyone wishes to run and average replicates.

Also, the function ggplotStructure (part of the fastStructurePlotting.R source file) might be useful on its own to make quick plots out of individual .meanQ files.
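
As a rough illustration of the data preparation such a plot needs, the Python sketch below parses Q-matrix text and flattens it into long-format records suitable for a stacked bar chart. It does not reproduce ggplotStructure, whose internals are not shown in this thread; it assumes the .meanQ content is whitespace-delimited with one row per individual and one column per cluster, and the names `parse_meanq` and `to_long` are hypothetical.

```python
def parse_meanq(text):
    """Parse whitespace-delimited Q-matrix text into a list of float rows."""
    rows = [
        [float(v) for v in line.split()]
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]
    if len({len(r) for r in rows}) > 1:
        raise ValueError("rows have differing numbers of columns")
    return rows


def to_long(rows):
    """Flatten to (individual, cluster, proportion) records with 1-based indices."""
    return [
        (i, j, p)
        for i, row in enumerate(rows, 1)
        for j, p in enumerate(row, 1)
    ]


# Toy input standing in for the contents of a .meanQ file (hypothetical values).
sample = """0.950000  0.050000
0.100000  0.900000
"""
records = to_long(parse_meanq(sample))
```

Each record corresponds to one coloured segment of one individual's bar.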

M