script for sampling subset of SNPs for structure or genepop files


Jason Munshi-South

Jun 4, 2013, 4:27:20 PM
to stacks...@googlegroups.com
Hi STACKS community,

I have used the --write-single-snp flag when using 'populations' to produce an input file for STRUCTURE.  However, I still end up with tens of thousands of SNPs.

Has anyone in this group (including the STACKS authors!) written a script yet for randomly choosing subsets of SNPs for STRUCTURE or other software that doesn't really require tens or hundreds of thousands of markers?  It would be greatly appreciated if you could share so others don't have to reinvent the wheel!

Best,
Jason

Julian Catchen

Jun 4, 2013, 4:38:47 PM
to stacks...@googlegroups.com
Hi Jason,

No script required; just use a little shell magic:

(This is a single command; the trailing pipes let it run across several lines, or you can type it all on one line.)

grep -v "^#" batch_1.sumstats.tsv |
cut -f 2 |
sort |
uniq |
shuf |
head -n 1000 |
sort -n > whitelist.tsv

This command does the following at each step:

1) grep pulls out all the lines in the sumstats file, minus the commented header
lines. The sumstats file contains all the polymorphic loci in the analysis.
2) cut extracts the second column, which contains the locus IDs.
3) sort orders those IDs.
4) uniq reduces them to a unique list of IDs (removes duplicate entries).
5) shuf randomly shuffles those lines.
6) head takes the first 1000 of the shuffled lines.
7) sort -n orders them numerically again, and the output is redirected into a file.

So, this will pull out all the polymorphic catalog IDs, shuffle them, and capture
the first 1000 random IDs into a file. You then run populations again and pass
this file to it with the whitelist (-W) flag. populations will then process only
these 1000 random loci.

If you repeat this command a few times and compare the outputs, say:

head -n 25 whitelist_1000-1.tsv whitelist_1000-2.tsv

you should see different sets of IDs in the files.

If you want more than 1000 loci, just change the number given to head (1000-5000 loci
seem to work well with STRUCTURE; it can't handle huge numbers of loci).
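
If you would rather do the same thing from Python, a rough equivalent of the pipeline above (assuming the usual tab-separated sumstats file, with comment lines starting with '#' and the locus ID in the second column) is:

import random

# collect the unique catalog locus IDs from the sumstats file
ids = set()
with open("batch_1.sumstats.tsv") as sumstats:
    for line in sumstats:
        if line.startswith("#") or not line.strip():
            continue                       # skip commented header lines and blanks
        ids.add(int(line.split("\t")[1]))  # column 2 holds the locus ID

# randomly pick 1000 of them and write a numerically sorted whitelist
subset = random.sample(sorted(ids), 1000)
with open("whitelist.tsv", "w") as whitelist:
    for locus in sorted(subset):
        whitelist.write("%d\n" % locus)

Either way you end up with the same kind of whitelist.tsv to hand to populations with -W.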

Good luck,

julian

Jason Munshi-South

Jun 4, 2013, 4:49:36 PM
to stacks...@googlegroups.com, jcat...@uoregon.edu
Perfect!  I did manage to run STRUCTURE with over 30k SNPs but not for very many steps (and above a few thousand SNPs it's not clear one is getting much benefit).  This approach will likely be useful for others...thank you for posting!

Emerson, Kevin

Jun 5, 2013, 3:18:03 PM
to stacks...@googlegroups.com
Hello all,

Julian had a very nice, succinct solution to the problem.  I, as some might expect, have a much more convoluted solution, though it works just as well!

I have a Python script that samples a Stacks-generated STRUCTURE file and returns a new file with a random sample of the loci.  The script was written as part of a larger pipeline and writes the loci it keeps to a log file, so you can track which loci you used if you need to.

Usage: sampleStructureFile.py [options] filename

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n NUMLOCI, --numLoci=NUMLOCI
                        number of loci to sample
  -i FILENAME, --inFile=FILENAME
                        input structure file
  -o OUTFILENAME, --outFile=OUTFILENAME
                        output file destination

The file is attached.  Hope it helps.
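
In case it is useful to see the general idea without opening the attachment, a minimal sketch of this kind of script (the attached sampleStructureFile.py may differ in its details) looks roughly like the following, assuming the Stacks comment line has already been removed so that the first line of the file is the row of locus IDs:

#!/usr/bin/env python
# Sketch only: keep the first two columns (sample and population) plus a
# random subset of the locus columns from a Stacks STRUCTURE file.
import random
import sys

def sample_structure(in_file, out_file, num_loci):
    with open(in_file) as fh:
        rows = [line.rstrip("\n").split("\t") for line in fh if line.strip()]
    header = rows[0]                    # assumed layout: two blank fields, then one locus ID per column
    locus_cols = range(2, len(header))  # indices of the genotype columns
    # note: if num_loci exceeds the number of locus columns, random.sample
    # raises "ValueError: sample larger than population"
    keep = sorted(random.sample(locus_cols, num_loci))
    with open(out_file, "w") as out:
        for row in rows:
            out.write("\t".join([row[0], row[1]] + [row[c] for c in keep]) + "\n")
    return [header[c] for c in keep]    # the locus IDs kept, e.g. for a log file

if __name__ == "__main__":
    kept = sample_structure(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    sys.stderr.write("kept %d loci\n" % len(kept))

The attached script additionally provides the -n/-i/-o options shown above and logs the loci it keeps.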

Cheers,
Kevin




--
------------------
Kevin J Emerson, PhD
Assistant Professor of Biology
Biology Department
St. Mary's College of Maryland
18952 E. Fisher Rd
St. Mary's City, MD 20686-3001

Office: 240 - 895 - 2123, Shaefer Hall 231

---------------------

Attachment: sampleStructureFile.py

chris blair

Oct 9, 2013, 1:33:23 PM
to stacks...@googlegroups.com
Hi Kevin, 

Thanks for the script. I am trying to use it to reduce the Structure file to 1000 loci. However, I am getting an error that states: ValueError: sample larger than population. I can't seem to figure out what this is referring to.

Chris

chris blair

Oct 9, 2013, 1:43:00 PM
to stacks...@googlegroups.com
Do I need to edit something in random.py before I run the script?

Emerson, Kevin

Oct 9, 2013, 3:57:21 PM
to stacks...@googlegroups.com
Chris,

I didn't have to do anything to random.py to run the script on my end, but that is clearly where the error is coming from. I used Python 2.7.3 to run this script.  Are you using the same version?

If you want to send along your structure file I can try running it quickly to see if I can help troubleshoot.
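
For what it's worth, that particular ValueError is what Python's random.sample raises whenever it is asked for more items than the sequence it is sampling from contains, e.g.:

>>> import random
>>> random.sample(range(10), 1000)
ValueError: sample larger than population

so it usually means the script found fewer loci in the input file than the number requested with -n.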

Kevin



Emerson, Kevin

Oct 9, 2013, 8:51:50 PM
to stacks...@googlegroups.com
The script I shared earlier does not ignore comment lines.  To use it, just remove the comment line from the structure output from Stacks:

$ grep -v ^# batch_1.structure.tsv > structureFile.tsv

$ sampleStructureFile.py -n 1000 -i structureFile.tsv -o out.tsv

That should work for you.  