Computational speed

Shalanee

May 1, 2012, 12:59:43 AM

to structure...@googlegroups.com

Dear all,

I have run STRUCTURE with 11,675 SNPs for 1,440 animals at K=2. I used the front-end version on Linux with default values, a burn-in period of 20,000 and an MCMC length of 20,000. I used the admixture model with prior population information. It took four and a half days to get the results. Could you please suggest how I can reduce the computation time?

Also, how can I see the results while the program is running? If I could save the results partway through the run (after 5,000, 10,000, 15,000 and 20,000 steps), it would be much easier to compare the results of different runs than to wait until the run finishes.

Thanks

Shalanee

Vikram Chhatre

May 1, 2012, 9:06:39 AM

to structure...@googlegroups.com

Hi Shalanee,

Computational time depends on several factors, including processor
speed, available memory, the number of processors available and the
run conditions you have set. As a general rule, more markers require
more computational time. You may not need to use all the markers
available to you. You could try several exploratory analyses on
smaller subsets with different numbers of markers and compare the
results.
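
One way to build such smaller marker sets is to thin the genotype columns at random. The sketch below is only an illustration, assuming a one-row-per-individual layout with two leading columns (label, population) and two genotype columns per SNP; `subsample_loci` and its parameters are hypothetical names, and the sample rows are made up.

```python
import random


def subsample_loci(rows, n_keep, n_lead_cols=2, cols_per_locus=2, seed=0):
    """Keep a random subset of loci from whitespace-split genotype rows.

    rows: list of lists of strings, one list per individual.
    n_lead_cols: non-genotype columns at the start (assumed: label, pop).
    cols_per_locus: genotype columns per locus (assumed: 2, diploid on one row).
    """
    n_geno = len(rows[0]) - n_lead_cols
    if n_geno % cols_per_locus != 0:
        raise ValueError("genotype columns are not a multiple of cols_per_locus")
    n_loci = n_geno // cols_per_locus
    rng = random.Random(seed)
    # Sort the sampled indices so the original marker order is preserved.
    keep = sorted(rng.sample(range(n_loci), n_keep))
    thinned = []
    for row in rows:
        kept = row[:n_lead_cols]
        for locus in keep:
            start = n_lead_cols + locus * cols_per_locus
            kept.extend(row[start:start + cols_per_locus])
        thinned.append(kept)
    return thinned, keep


# Made-up rows with three SNPs each, thinned down to two SNPs.
rows = [line.split() for line in [
    "ID1 1 1 2 1 2 2 2",
    "ID5 -9 1 1 2 1 1 2",
]]
thinned, kept_loci = subsample_loci(rows, n_keep=2)
for r in thinned:
    print(" ".join(r))
```

After thinning, the NUMLOCI setting has to be changed to match the new number of markers.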

The number of BURNIN and MCMC steps also affects the computational
time. You need enough burn-in steps to allow the parameters (alpha
etc.) to converge. In the front end, time series plots of these
parameters can be viewed to confirm convergence. In the back-end
version, you will need to produce these time series plots manually
from the runtime output data.

As for looking at results while runs are underway, I do not think
final results are printed until a run is complete. However, the
runtime screen output can be saved to a log file when using the
back-end version. I do not know exactly where to look for such
information when using the front end, so I will let someone else
point you in the right direction.

Unfortunately, there is no quick answer to your question. You may
have to play with run conditions and the size of your dataset to
figure this out.
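
For planning such experiments, a back-of-the-envelope estimate can help. The sketch below assumes run time grows roughly in proportion to individuals x loci x K x total sweeps (burn-in plus MCMC). That proportionality is an assumption for illustration, not a measured property of the program, and `estimated_days` is a hypothetical helper. The reference figures are the ones reported in the first post.

```python
def estimated_days(ref_days, ref, new):
    """Scale a reference run time linearly in individuals, loci, K and sweeps.

    Linear scaling is an assumption made for this rough estimate only.
    """
    def work(cfg):
        return cfg["inds"] * cfg["loci"] * cfg["k"] * cfg["sweeps"]
    return ref_days * work(new) / work(ref)


# Reference run from the first post: 4.5 days, 20000 burn-in + 20000 MCMC.
reference = {"inds": 1440, "loci": 11675, "k": 2, "sweeps": 40000}
# Same run conditions, but thinned to 4000 SNPs.
smaller = {"inds": 1440, "loci": 4000, "k": 2, "sweeps": 40000}
print(round(estimated_days(4.5, reference, smaller), 2), "days (rough estimate)")
```

A short timed pilot run on the actual machine is a better guide than any such extrapolation.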

Note: You may very well find that a burn-in of 20000 steps is
sufficient for parameter convergence. It is still a good idea to run
several independent replicates (start with 3) for every K you are
testing, and compare the results between these replicates.
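
With the back-end version, independent replicates can be launched side by side on a multi-core machine, each with its screen output saved to its own log file. The following is a minimal sketch, not a tested recipe: the binary path, file names and seed scheme are placeholders, and the flags (`-m`, `-e`, `-K`, `-o`, `-D`) are taken from my reading of the command-line section of the Structure manual, so please check them against your version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(k_values, n_reps, binary="./structure",
                   mainparams="mainparams", extraparams="extraparams",
                   base_seed=1000):
    """Return (tag, argv) pairs, one per replicate of each K.

    All paths and the seed scheme are placeholders for illustration.
    """
    jobs = []
    for k in k_values:
        for rep in range(1, n_reps + 1):
            tag = f"K{k}_rep{rep}"
            cmd = [binary, "-m", mainparams, "-e", extraparams,
                   "-K", str(k), "-o", f"results_{tag}",
                   "-D", str(base_seed + 100 * k + rep)]  # distinct seed per run
            jobs.append((tag, cmd))
    return jobs


def run_job(job):
    """Run one replicate, writing its screen output to <tag>.log."""
    tag, cmd = job
    with open(f"{tag}.log", "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return tag, result.returncode


def run_all(jobs, max_workers=3):
    """Run the replicates concurrently, one external process per worker."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, jobs))


jobs = build_commands(k_values=[2], n_reps=3)
for tag, cmd in jobs:
    print(tag, " ".join(cmd))
# run_all(jobs)  # uncomment once the binary and parameter files are in place
```

Because each log file grows while its run is in progress, it can be inspected at any time without waiting for the run to finish.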

Hope that helps.

Vikram

Shalanee Weerasinghe

May 1, 2012, 9:06:12 PM

to structure...@googlegroups.com

Thanks Vikram,

I started with 400K SNPs and then selected 11K. I will try 4K as
well. At the same time, I will run several replicates and compare them.

Shalanee

Shalanee Weerasinghe

May 2, 2012, 9:26:35 PM

to structure...@googlegroups.com

Hi Vikram,

I ran 201 animals from two prior populations plus an admixed group,
with BURNIN 20000 and MCMC 20000 and otherwise default values. I
looked at the runtime plots while the program was running. The Fst and
alpha plots still showed small fluctuations when the program finished.
Does that mean it has not converged? Why do we get two Fst plots for
this data set? In the bar plot, my pure populations show some parts in
common. Does that mean they share common alleles, or that the program
couldn't identify population-specific SNPs? I have attached my final
plots for your attention.

Thanks

Shalanee

[Attachments: bar plot.jpg, data Alpha.jpg, data FSt.jpg, data liklihood.jpg, data Ln P(D).jpg]

Vikram Chhatre

May 2, 2012, 10:32:16 PM

to structure...@googlegroups.com

Shalanee,

I recently had a discussion with Dr. Pritchard about verifying
convergence. I will quote an excerpt below:

---------------------
"The MCMC framework converges to a *distribution* of values, so that
even at equilibrium we expect the actual values to bounce around quite
a bit. What would be worrying would be a strong trend in the value of
one of the parameters: ie the parameter is trending towards higher or
lower numbers through the course of the run, or makes a big jump at
some point.

I find it's often helpful to run the algorithm several times at the
same input parameter values, and if the parameter estimates are all
fairly similar across independent runs then that's quite encouraging.

In my experience, Structure tends to converge to a mode fairly
quickly, so very long runs are generally not necessary. The main
concern would be that it may in some cases converge to different modes
in different runs, and you could diagnose that by comparing the
parameter estimates from different runs and finding that the mean
estimates fall into two or more groups."
---------------------
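
The two checks described in the quote (no strong trend within a run, similar estimates across independent runs) can be sketched roughly as follows. This is an informal illustration rather than a formal convergence test: the function names, the tolerance and all the numbers are invented, and the parameter traces would have to be extracted from the runtime output first.

```python
from statistics import mean, stdev


def drift_score(trace):
    """Second-half mean minus first-half mean, in units of the trace's SD.

    Values near 0 suggest the parameter is bouncing around a stable level;
    large absolute values suggest a trend over the course of the run.
    """
    half = len(trace) // 2
    spread = stdev(trace)
    if spread == 0:
        return 0.0
    return (mean(trace[half:]) - mean(trace[:half])) / spread


def runs_agree(run_means, tolerance):
    """True if the per-run mean estimates all lie within `tolerance` of each other."""
    return max(run_means) - min(run_means) <= tolerance


# Invented alpha traces: one fluctuating around a level, one trending upward.
stable = [0.10, 0.12, 0.09, 0.11, 0.10, 0.12, 0.09, 0.11]
trending = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24]
print("drift (stable):", drift_score(stable))
print("drift (trending):", drift_score(trending))

# Invented mean alpha estimates from three independent runs.
print(runs_agree([0.105, 0.110, 0.102], tolerance=0.02))
print(runs_agree([0.105, 0.110, 0.310], tolerance=0.02))
```

A sensible tolerance depends on the parameter and the data set, so the value used here is only a placeholder.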

Based on this, I will go ahead and say that your parameters seem to
have converged. From the barplot, it appears that about half of your
individuals have full membership in one of the two clusters. The
rest of the individuals have been probabilistically assigned equally
to both clusters. I assume that you can judge whether this makes
biological sense from what you know about these individuals.
If you are interested in how each SNP affects these membership
assignments, you should look at the results file.

You mention 'prior populations' and using 'program defaults'. If by
the former you mean 'POPINFO=1', that is not a program default (afaik).
Did you specifically set the popinfo flag?

Not sure if all your questions were answered, but hope this was helpful.

V

Shalanee Weerasinghe

May 2, 2012, 11:29:17 PM

to structure...@googlegroups.com

Hello Vikram,

Thanks, that's a great description. I have arranged my input file as
Label, Population and SNPs: pop=1 indicates the first pure population,
pop=2 the second pure population, and -9 the admixed animals. Example:
Label  pop  SNPs for allele 1 and 2
ID1    1    1 2 1 2
ID2    1    2 2 2 1
ID3    2    1 2 1 2
ID4    2    1 1 2 2
ID5    -9   1 1 2 1
ID6    -9   2 1 2 1
ID6 -9 2 1 2 1

I selected the 'putative population origin for each individual' box, not
the popinfo flag. Now I think I have to tick the popinfo flag as well. Am I correct?
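
For readers who want to check a file laid out this way before a long run, here is a small sketch that groups individuals by the population column. It assumes the layout from the example above (label, pop, then genotype columns) and treats -9 in the pop column as "no prior population", as in this post; `read_individuals` is a hypothetical helper, not part of Structure.

```python
def read_individuals(lines, missing_pop="-9"):
    """Group (label, genotypes) pairs by the value in the pop column.

    Rows whose pop column equals `missing_pop` go under "unassigned".
    """
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        label, pop, genotypes = fields[0], fields[1], fields[2:]
        key = "unassigned" if pop == missing_pop else pop
        groups.setdefault(key, []).append((label, genotypes))
    return groups


# The six example rows from the post above.
example = [
    "ID1 1 1 2 1 2",
    "ID2 1 2 2 2 1",
    "ID3 2 1 2 1 2",
    "ID4 2 1 1 2 2",
    "ID5 -9 1 1 2 1",
    "ID6 -9 2 1 2 1",
]
groups = read_individuals(example)
for pop, members in sorted(groups.items()):
    print(pop, [label for label, _ in members])
```

Counting the animals per group this way is a quick check that the population column was filled in as intended.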

Vikram Chhatre

May 2, 2012, 11:41:18 PM

to structure...@googlegroups.com

Rather than trying to explain in my own words, let me refer you to
Section 3 of the Structure manual, titled 'Modeling Decisions for the
User'. You should look in particular at model #4, which deals with
the LOCPRIOR a priori information setting. The section also refers to
the paper by Hubisz et al. (2009).

If some doubts then remain about which parameter settings you should
use, someone here should be able to help you.

All the best
V
