What is the advantage of using Clumpp for drawing 'Barplots'

Struggler

Oct 8, 2012, 11:50:57 AM

to structure-software

Dear All,
In one of the forum messages it was posted that one can replace the
individual Q matrix obtained from 'Structure' with the one obtained from
'Clumpp' to draw the bar plots.

I did the same and see no difference in my barplots whether I used
direct Structure output or replaced it by Clumpp output.

Am I missing something here? Is there a particular reason to use
Clumpp?

I was initially trying to process Clumpp output to 'Distruct' but did
not succeed as it needed two files and I only got one output file from
Clumpp for using in Distruct.

I would really appreciate any help/suggestion on this aspect.

Regards,
S

Vikram Chhatre

Oct 8, 2012, 11:55:44 AM

to structure...@googlegroups.com

S -

The barplots made from data processed through CLUMPP are statistically
correct because CLUMPP performs permutations using various algorithms
(of your choice) to match the independent iterations for the chosen
optimal value of K, as closely to each other as possible (Phew! long
sentence).
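
To make the matching step concrete, here is a minimal sketch of the idea, not CLUMPP's actual code. It assumes each run is a plain list of rows of membership coefficients, uses a simple squared-difference similarity rather than CLUMPP's G or G' statistic, and aligns every run to the first one by trying all column permutations. The function names are made up for illustration.

```python
from itertools import permutations


def similarity(a, b):
    """Negative squared distance between two Q matrices (higher = more alike)."""
    return -sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def permute_columns(q, perm):
    """Reorder the cluster columns of a Q matrix according to perm."""
    return [[row[j] for j in perm] for row in q]


def align_to_reference(reference, q):
    """Try every column permutation of q and keep the one closest to reference."""
    k = len(reference[0])
    best = max(
        permutations(range(k)),
        key=lambda p: similarity(reference, permute_columns(q, p)),
    )
    return permute_columns(q, best)


def average_runs(runs):
    """Align all runs to the first one, then average them cell by cell."""
    reference = runs[0]
    aligned = [reference] + [align_to_reference(reference, q) for q in runs[1:]]
    n = len(aligned)
    return [
        [sum(run[i][j] for run in aligned) / n for j in range(len(reference[0]))]
        for i in range(len(reference))
    ]


# Two runs that found the same clustering but with the cluster labels swapped.
run1 = [[0.9, 0.1], [0.2, 0.8]]
run2 = [[0.1, 0.9], [0.8, 0.2]]

# Averaging without alignment washes the signal out to 0.5 everywhere.
naive = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(run1, run2)]
averaged = average_runs([run1, run2])
```

If the replicate runs already happen to use the same cluster labels, the aligned average will look almost the same as any single run, which may be why a barplot does not visibly change after this step.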

Please read up on the documentation manual and the accompanying paper
to get a better understanding.
http://www.stanford.edu/group/rosenberglab/clumpp.html

If you can tell us exactly how you processed your data with clumpp and
post relevant example files here, someone should be able to help you.

All the best
V


Vikram Chhatre

Oct 8, 2012, 12:02:29 PM

to structure...@googlegroups.com

I meant to say 'statistically *more* correct'. Sorry about the oversight.

V

Struggler

Oct 8, 2012, 3:43:12 PM

to structure-software

Hi Vikram,
Thanks for your prompt response. I used the 'INDFILE' produced from
Structure as input to Clumpp. I then pasted the Clumpp output back
into the Structure output file for the same K, keeping everything else
in the file the same, reopened the modified file in Structure, and
obtained the barplots.

The Clumpp run finished in less than 27 seconds. I have no prior
experience of running it, so I am not sure whether this is a typical
run time for everybody else.

The parameters (only the main and modified ones) used in Clumpp are as
follows:
# --------------- Main parameters ---------------------------------------------

DATATYPE 0                               # The type of data to be read in.
                                         # 0 = individual data in the file
                                         # specified by INDFILE, 1 = population
                                         # data in the file specified by
                                         # POPFILE.

INDFILE tsb_tetra_k5.indfile             # The name of the individual datafile.
                                         # Required if DATATYPE = 0.

POPFILE                                  # The name of the population datafile.
                                         # Required if DATATYPE = 1.

OUTFILE tsb_tetra_k5_pairmat1.outfile    # The average cluster membership
                                         # coefficients across the permuted runs
                                         # are printed here.

MISCFILE tsb_tetra_k5_pairmat1.miscfile  # The parameters used and a summary of
                                         # the results are printed here.

K 5                                      # Number of clusters.

C 353                                    # Number of individuals or populations.

R 2                                      # Number of runs.

M 1                                      # Method to be used (1 = FullSearch,
                                         # 2 = Greedy, 3 = LargeKGreedy).

W 0                                      # Weight by the number of individuals
                                         # in each population as specified in
                                         # the datafile (1 if yes, 0 if no).

S 1                                      # Pairwise matrix similarity statistic
                                         # to be used. 1 = G, 2 = G'.

======================================================
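
A quick way to sanity-check a parameter file like the one above is to read the keywords back and compare them with the data file. The sketch below is only an illustration: it assumes each setting sits on one line as `KEYWORD value # comment`, and that the indfile holds one row per individual per run, so it should contain C x R data rows. `parse_paramfile` is a hypothetical helper, not part of CLUMPP.

```python
def parse_paramfile(text):
    """Return {KEYWORD: value} from 'KEYWORD value # comment' lines."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank or comment-only line
        parts = line.split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


example = """
DATATYPE 0                    # individual data
INDFILE tsb_tetra_k5.indfile
POPFILE                       # unused when DATATYPE = 0
K 5
C 353
R 2
M 1
"""

params = parse_paramfile(example)

# Assumption: one row per individual per run in the indfile.
expected_rows = int(params["C"]) * int(params["R"])
```

Comparing `expected_rows` with the number of non-blank rows in the indfile catches a mismatched C or R before the run starts.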
From your explanation it looks like the only reason for Clumpp
processing is to get statistically more correct results, and if one only
needs a coarse view, the Structure plots are also fine. Is that a
correct inference?

Regards,
S


Julie Hebert

Oct 9, 2012, 11:36:03 AM

to structure...@googlegroups.com

In my experience, CLUMPP runs very quickly.

I prefer to use my output from CLUMPP in distruct. Distruct allows better visualization of the data, with many more options for creating the histograms.

Julie

Vikram Chhatre

Oct 9, 2012, 11:39:31 AM

to structure...@googlegroups.com

I will second that. Using the Greedy method, CLUMPP will finish in
seconds. I have only tested datasets comprising 4000 x 500n. If
you use the LargeKGreedy method, it is very time-consuming.
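
As a rough guide to why run time depends so much on the method and the number of runs: as I understand the CLUMPP paper, FullSearch evaluates every combination of column permutations, which is (K!)^(R-1) alignments for K clusters and R runs. Treat that formula as my reading rather than something confirmed in this thread. The helper name below is made up.

```python
from math import factorial


def full_search_alignments(k, r):
    """Number of alignments an exhaustive search would test: (K!)^(R-1)."""
    return factorial(k) ** (r - 1)


# K = 5 clusters with R = 2 runs is a tiny search.
small = full_search_alignments(5, 2)

# The same K with R = 10 runs is far too large to enumerate.
large = full_search_alignments(5, 10)
```

With K = 5 and R = 2 there are only 120 alignments to test, so an exhaustive search finishing in seconds is not surprising. With ten runs the count explodes, which is where the Greedy methods become necessary.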

As Julie mentioned, DISTRUCT allows for much better and highly
customizable visualization of barplots.

Struggler: I haven't looked at your pasted files yet.

V


Struggler

Oct 9, 2012, 4:33:49 PM

to structure-software

Many thanks Julie and Vikram for your feedback.

I was also aiming to run Distruct after Clumpp but was not successful
as it (Distruct) asks for two input files - one for population and one
for individuals and Clumpp only produced the output file for
individuals. Could somebody provide detailed steps on how to run
Distruct from Clumpp results when one is analysing a single
association panel with lots of individuals rather than many
populations?

Regarding Clumpp, I ran it using option M 1 (FullSearch) only,
instead of the M 2 (Greedy) or M 3 (LargeKGreedy) options. Is that
also acceptable?

Regards,
S

Vikram Chhatre

Oct 9, 2012, 4:42:12 PM

to structure...@googlegroups.com

HARVESTER produces indfiles and popfiles for each K you have tested.
Once you choose the optimal number based on lnPD and deltaK, fish out
the popfile and indfile for that specific K.

Run each of those two files through CLUMPP *separately* to perform
permutations. In the end you will have two output files, one for
individuals, one for populations. You can rename the resulting output
files as .indivq and .popq and process them with DISTRUCT to prepare
plots.
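
The renaming step can be scripted. This is a small sketch that copies, rather than renames, the two CLUMPP output files to a shared base name with the .indivq and .popq extensions, so the originals stay untouched. The function name and the example file names are placeholders, and the two CLUMPP runs themselves are assumed to have finished already.

```python
import shutil
from pathlib import Path


def stage_for_distruct(ind_outfile, pop_outfile, basename, dest_dir="."):
    """Copy the two CLUMPP outputs to <basename>.indivq and <basename>.popq."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    indivq = dest / f"{basename}.indivq"
    popq = dest / f"{basename}.popq"
    shutil.copyfile(ind_outfile, indivq)  # output of the individual-level run
    shutil.copyfile(pop_outfile, popq)    # output of the population-level run
    return indivq, popq


# Example call (placeholder file names):
# stage_for_distruct("k5_ind.outfile", "k5_pop.outfile", "k5", "distruct_input")
```

Using one base name for both files keeps the pair together when several values of K are plotted.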

Let us know how it went.

V

VictorM

Oct 10, 2012, 5:22:18 PM

to structure...@googlegroups.com

Struggler:

You need to run CLUMPP twice, once for populations and once for individuals. Then you can use the two output files in distruct.

Vikram Chhatre

Nov 2, 2012, 3:54:10 PM

to structure...@googlegroups.com

To answer your first email:

Structure Harvester produces both indfiles and popfiles, which need to be processed separately through CLUMPP to obtain output files. These output files should then be named .indivq and .popq and used with Distruct.

Are you suggesting that you are unable to obtain both indfiles and the popfiles for respective K values from Structure Harvester?

V

On Fri, Nov 2, 2012 at 2:43 PM, mountainmanjared <mountain...@gmail.com> wrote:

Hello,

I am having similar issues to those reported by Struggler. I ran 10 iterations in Structure for K = 1-10 and found K = 5 to be optimal using structureHarvester and deltaK. I then took the k5.indfile into Clumpp to produce the outfile to use in Distruct. As Struggler mentioned, no POPQ output file is produced, and Distruct needs one. To complicate things, the newest version of Structure (v2.3.4) prints "Overall proportion of membership of the sample in each of the 5 clusters" in its summary, followed by 1 row of 5 columns, while the older version of Structure (used to generate Distruct's sample files) prints "Proportion of membership of each pre-defined population in each of the 5 clusters", which is what the POPQ file asks for.

I don't fully understand what this table shows, or how you get different number of populations and clusters; to me, populations and clusters have the same meaning. I tried to make a dummy file with different proportions for the number of populations, and it didn't seem to change the visualization of my results.

So I guess my question is: does the POPQ file matter, and what does it do for Distruct? It seems to be the way to draw the vertical lines between populations, but I don't know what these numbers actually mean, or how you can get different numbers of populations and clusters.

Thanks in advance,
Jared
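
On the POPQ question above, one plausible reading, offered as an assumption rather than something confirmed in this thread: the per-population table is the average of the individual membership coefficients within each predefined sampling population. On that reading, "populations" are the sampling groups you supplied, "clusters" are the K groups Structure inferred, and the two counts need not match. The sketch below computes such averages from toy data. The function name is made up and this is not Distruct's or Structure's own code.

```python
def population_q(individual_q, populations):
    """Average individual Q rows within each predefined population.

    Returns {population: (mean_q_row, number_of_individuals)}.
    """
    sums = {}
    counts = {}
    for q, pop in zip(individual_q, populations):
        if pop not in sums:
            sums[pop] = [0.0] * len(q)
            counts[pop] = 0
        for j, value in enumerate(q):
            sums[pop][j] += value
        counts[pop] += 1
    return {
        pop: ([s / counts[pop] for s in sums[pop]], counts[pop])
        for pop in sums
    }


# Toy data: 3 individuals, 2 predefined populations, K = 3 clusters.
individual_q = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
populations = [1, 1, 2]

pop_table = population_q(individual_q, populations)
```

Here two sampling populations are summarised over three clusters. For a single panel with no predefined groups, the same averaging reduces to a single row, which matches the "1 row of 5 columns" summary described above.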


Yohannes Besufekad

Nov 3, 2012, 2:31:10 AM

to structure...@googlegroups.com

Hi all,

I opened the Python script directory on my Windows machine; however, running the structureHarvester command gives a SyntaxError. What should I do?

Vikram Chhatre

Nov 3, 2012, 10:40:12 AM

to structure...@googlegroups.com

Dent Earl (author of Harvester) suggested earlier that users get in touch with him to troubleshoot any problems.

---------------------------------------

Vikram Chhatre

Graduate Program in Genetics

Texas A&M University

cryptic...@gmail.com

