[structure-group] How to make CLUMPP run

SOFIA

May 19, 2010, 12:05:48 PM
to structure-software
Hi all,

I spent three days trying to work out how to run CLUMPP on my
Structure outputs. With help from the CLUMPP author I finally got it
working. Here are my notes; I hope they will be helpful to you:

1. On Windows Vista: download version 1.1.2. I then used the CLUMPP
example files (the ones unzipped into the same directory as
CLUMPP.exe) as templates for my own data. The file extensions and this
way of running a program may seem strange at first, but I used Excel
to edit "arabid.indfile" and a Word text editor to modify the
"paramfile". Do not use example\arabid\arabid.paramfile, because the
file should be named just "paramfile", without any extension.
For the moment you do not need any of the other files (out, misc,
permu, etc.) unless you change the M1 option.
2. Edit "paramfile". If you do not have population data, just change
DATATYPE to 0. Leave the rest as it is. Then set your K, C and R
values. In my example these are K: 6, C: 103 and R: 2.

3. In "arabid.indfile", replace everything with your Structure output.
However, first you need to edit it in Excel as follows:
1 1 (x) 1 : 0.985 0.003 0.002 0.005 0.003 0.002
2 2 (x) 1 : 0.96 0.015 0.002 0.009 0.005 0.009
3 3 (x) 1 : 0.662 0.003 0.003 0.319 0.003 0.01
4 4 (x) 1 : 0.993 0.001 0.002 0.001 0.001 0.002
...
103 103 (x) 1 : ..............
(empty row)
1 1 (x) 1 : 0.003 0.003 0.004 0.003 0.986 0.002
...
103 103 (x) 1 : ..............

Columns 1 and 2 are individual IDs. For column 2, the CLUMPP manual
says "indnr", which means ID number. So leave columns 1 and 2 as in
the original Structure output labels. Then replace the (% missing
value) column with (x) in every row: I do not understand why, but
CLUMPP requires it, so do it. Then add a column for popnr (population
number ID); in this example it is just 1, because I am working with
103 individuals from the same population. Where the first Structure
run ends, leave an empty row, then paste the following Structure run,
and so on. In my example I have 103 individuals, 6 clusters and 2
runs.

4. If you have population data, set DATATYPE to 1, name the input
file arabid.popfile, and put an integer population number label in
column 4.

5. If there are errors in your files, the program will not tell you;
it just closes without any information. To get some clues, follow
the author's advice:

"the best way to see the (often informative) error messages is to run
CLUMPP from a command window. The manual explains how to do this for
Windows. Basically, in the Start menu, click "Run" and type "cmd" in
the small window that appears. This will start a command window.
Navigate to the folder containing CLUMPP and your parameter/input
files using the cd command. Once you are in the correct folder, type
at the command line prompt "CLUMPP" followed by the name of the
parameter file and hit return. This will run the program or generate
an error message"

The catch is that you need to know how to navigate in DOS. So
remember: type "cd xxxxx" to change folders; the "dir" command lists
the contents of the current folder.
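
The indfile layout described in step 3 can also be produced by a short
script instead of by hand in Excel. The following is only a rough
Python sketch under stated assumptions: the function name
format_indfile and its arguments are made up for illustration, the
membership coefficients are assumed to have been read from the
Structure output already, and the column layout (index, ID, "(x)",
popnr, colon, coefficients, with an empty row between runs) is simply
the one shown in this post, not something checked against the manual.

```python
def format_indfile(runs, popnr=1):
    """Build indfile text in the layout shown in step 3 of this post.

    runs  : list of Structure runs; each run is a list of rows, and each
            row is the list of K membership coefficients of one individual.
    popnr : population number written in column 4 (1 in the post's example).
    """
    if not runs or not runs[0]:
        raise ValueError("need at least one run with at least one individual")
    n_ind = len(runs[0])   # C in the paramfile
    k = len(runs[0][0])    # K in the paramfile
    blocks = []
    for run in runs:
        # Every run must describe the same C individuals and K clusters.
        if len(run) != n_ind or any(len(q) != k for q in run):
            raise ValueError("every run needs the same C individuals and K clusters")
        lines = []
        for i, q in enumerate(run, start=1):
            coeffs = " ".join(f"{v:.3f}" for v in q)
            # Columns 1 and 2: individual index/ID; column 3: "(x)" placeholder.
            lines.append(f"{i} {i} (x) {popnr} : {coeffs}")
        blocks.append("\n".join(lines))
    # Runs are separated by one empty row, as described in step 3.
    return "\n\n".join(blocks) + "\n"


# First individual of each of the two runs quoted in the post (K = 6).
example_runs = [
    [[0.985, 0.003, 0.002, 0.005, 0.003, 0.002]],
    [[0.003, 0.003, 0.004, 0.003, 0.986, 0.002]],
]
print(format_indfile(example_runs), end="")
```

With real data, each run would hold C rows (103 in the example above),
and the returned text would be saved under the indfile name given in
the paramfile.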

Sorry about my English... Hope it helps.

Sofia.



Thirsa-K

May 20, 2010, 3:21:16 AM
to structure-software
Hi,

Just a small additional tip for making CLUMPP input files: use
Structure Harvester, which can be found at http://taylor0.biology.ucla.edu/struct_harvest/
Make a zip file of your Structure results folder, import it into SH,
let SH run and, among other results, you will also get a CLUMPP input
file for each K of your Structure run.
This saves a lot of time compared with editing by hand in Excel.

Cheers,
Thirsa

Laura Vogel

May 20, 2010, 12:25:36 PM
to structure-software
Hi all,

Quick question related to Structure Harvester and CLUMPP. I want to
run CLUMPP on the BioHPC at Cornell and I'm wondering whether to
combine the CLUMPP indfiles that I got from Harvester for each K into
one big megafile. In other words, I have 20 replicates for each of 10
putative K's: should I run all 200 replicates at once, or run each K
(with R = 20) separately?

Thanks very much!
Laura

Thirsa-K

May 21, 2010, 8:19:42 AM
to structure-software
Hi Laura,

First of all, a small piece of advice to significantly reduce the run
time needed: the manual advises using the LargeKGreedy method for
K>15, but running e.g. K=8 with the Greedy method takes almost a whole
weekend (at least it did when I tried it). So I would use LargeKGreedy
for any K>7.

Unfortunately, as far as I know, it is not possible to run CLUMPP on
an indfile in which data for more than one K are combined, since you
have to state the assumed K in your paramfile. But it is very easy to
write a batch file for CLUMPP in which you specify the run details for
each K; with one click of the mouse, all your CLUMPP runs for all K's
are then run one after the other without you having to do anything.

For example, suppose that in your paramfile (called "myparamfile") you
have specified the name of your indfile as "indfile-K2", K as 2, the
name of the outfile as "outfile-K2", the method to be used as Greedy
(option 2) and the name of the permfiles as "mypermfile" (and of
course PRINT_PERMUTED_DATA set to option 2, because those are the
files you will need for Distruct).
In that case the batch file might look something like this:

CLUMPP myparamfile
rename mypermfile_1 mypermfile-K2_1
rename mypermfile_2 mypermfile-K2_2
etc up to _20
CLUMPP myparamfile -i indfile-K3 -o outfile-K3 -k 3
rename mypermfile_1 mypermfile-K3_1
rename mypermfile_2 mypermfile-K3_2
etc up to _20

same for K=4, 5 and 6

CLUMPP myparamfile -i indfile-K7 -o outfile-K7 -k 7 -m 3
rename mypermfile_1 mypermfile-K7_1
rename mypermfile_2 mypermfile-K7_2
etc up to _20

same for higher K's

So, the first line tells the computer to run CLUMPP with the settings
specified in the file myparamfile. The next 20 lines tell the computer
to rename the files mypermfile_1 through mypermfile_20 to
mypermfile-K2_1 through mypermfile-K2_20 (I didn't feel like writing
all 20 lines here, hence the remark "etc up to _20").
If you don't rename the permfiles, they will be overwritten by the
permfiles for the next K because, unfortunately, there is no
command-line option to change the name of your permfile (or at least
not that I know of).
All the other lines tell the computer to run CLUMPP with the settings
specified in myparamfile, except for the indfile, the outfile, the
value of K and (from K=7 onwards) the method to be used, and again to
rename the permfiles to mypermfile-K*_1 through _20.

If you write this in a text editor and save the file as e.g.
"mybatchfile.bat" in the CLUMPP folder, simply double-clicking on
mybatchfile.bat should be enough to get all the runs done.
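
Since the batch file is very repetitive, its lines can also be
generated rather than typed. Below is a rough Python sketch of that
idea, not a tested recipe: the function name clumpp_batch_lines and
its arguments are invented for illustration, and the file names and
the -i, -o, -k and -m options are just the ones used in the batch file
above. Unlike the batch file above, it writes the options explicitly
for every K, including the first one, instead of relying on the
paramfile defaults for K=2.

```python
def clumpp_batch_lines(ks, r, paramfile="myparamfile",
                       permfile="mypermfile", large_k_from=7):
    """Return the lines of a Windows batch file that runs CLUMPP once per K.

    ks           : iterable of K values, e.g. range(2, 12)
    r            : number of Structure runs per K (number of permfiles)
    large_k_from : first K for which "-m 3" is added, as in the post
    """
    lines = []
    for k in ks:
        cmd = f"CLUMPP {paramfile} -i indfile-K{k} -o outfile-K{k} -k {k}"
        if k >= large_k_from:
            # Override the method set in the paramfile, as the post does with -m 3.
            cmd += " -m 3"
        lines.append(cmd)
        # Rename the permfiles so that the next K does not overwrite them.
        for i in range(1, r + 1):
            lines.append(f"rename {permfile}_{i} {permfile}-K{k}_{i}")
    return lines


# Small preview with 2 runs per K; a real file would use r=20.
print("\n".join(clumpp_batch_lines([2, 7], r=2)))
```

Saving the printed lines as "mybatchfile.bat" in the CLUMPP folder
would give a batch file of the kind described above, with all 20
rename lines written out for each K when r=20.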

Laura Vogel

May 23, 2010, 3:14:25 PM
to structure-software
Dear Thirsa,

Thank you so much for your help, it was very, very much appreciated.
Unfortunately, I foolishly did not heed your advice about LargeKGreedy
because I was running CLUMPP on the BioHPC and thought it could handle
the jobs without a problem. Well, I found out the hard way that you
couldn't be more right about having to use LargeKGreedy for any K>7,
because the HPC timed out on me. Oh well, I guess I'll be using
LargeKGreedy for those last 3 jobs anyway.

Also, thanks so much for your help with writing the batch file. I just
can't tell you how much I appreciate you sharing your time and
expertise. It worked perfectly and was very helpful.

Best regards!
Laura

Laura Vogel

May 24, 2010, 12:07:34 PM
to structure-software
Hi everyone,

Please note that Thirsa is absolutely right about using LargeKGreedy
for K>7. I was running R=20 and C=99, not a particularly large run on
the BioHPC, and K=8 with the Greedy algorithm would have taken 4.6
days, while K=9 with Greedy would have taken 46 days!

Laura