script for sampling subset of SNPs for structure or genepop files


Jason Munshi-South

Jun 4, 2013, 4:27:20 PM
to stacks...@googlegroups.com
Hi STACKS community,

I have used the --write-single-snp flag when using 'populations' to produce an input file for STRUCTURE.  However, I still end up with tens of thousands of SNPs.

Has anyone in this group (including the STACKS authors!) written a script yet for randomly choosing subsets of SNPs for STRUCTURE or other software that doesn't really require tens or hundreds of thousands of markers?  It would be greatly appreciated if you could share so others don't have to reinvent the wheel!

Best,
Jason

Julian Catchen

Jun 4, 2013, 4:38:47 PM
to stacks...@googlegroups.com
Hi Jason,

No script required; just use a little shell magic:

(This is a single command; the trailing pipes let it run across several lines, or you can type it all on one line.)

grep -v "^#" batch_1.sumstats.tsv |
cut -f 2 |
sort |
uniq |
shuf |
head -n 1000 |
sort -n > whitelist.tsv

This command does the following at each step:

1) grep pulls out all the lines in the sumstats file, minus the commented header
lines. The sumstats file contains all the polymorphic loci in the analysis.
2) cut extracts the second column, which contains the locus IDs.
3) sort orders those IDs.
4) uniq reduces them to a unique list of IDs (removes duplicate entries).
5) shuf randomly shuffles those lines.
6) head takes the first 1000 of the shuffled lines.
7) sort -n orders them numerically again, and the output is redirected into a file.

So, this will pull out all the polymorphic catalog IDs, shuffle them, and capture
the first 1000 random IDs into a file. You then run populations again and pass
this file to it with the whitelist (-W) flag. populations will then process only
these 1000 random loci.

If you repeat this command a few times and compare the outputs, say:

head -n 25 whitelist_1000-1.tsv whitelist_1000-2.tsv

you should see different sets of IDs in the files.

If you want more than 1000 loci, just change the number given to head (1000-5000 loci
seem to work well with STRUCTURE; it can't handle huge numbers of loci).
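
If you would rather do the same thing from Python, a rough equivalent of the pipeline above (assuming the usual tab-separated sumstats file, with comment lines starting with '#' and the locus ID in the second column) is:

import random

# collect the unique catalog locus IDs from the sumstats file
ids = set()
with open("batch_1.sumstats.tsv") as sumstats:
    for line in sumstats:
        if line.startswith("#") or not line.strip():
            continue                       # skip commented header lines and blanks
        ids.add(int(line.split("\t")[1]))  # column 2 holds the locus ID

# randomly pick 1000 of them and write a numerically sorted whitelist
subset = random.sample(sorted(ids), 1000)
with open("whitelist.tsv", "w") as whitelist:
    for locus in sorted(subset):
        whitelist.write("%d\n" % locus)

Either way you end up with the same kind of whitelist.tsv to hand to populations with -W.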

Good luck,

julian

Jason Munshi-South

Jun 4, 2013, 4:49:36 PM
to stacks...@googlegroups.com, jcat...@uoregon.edu
Perfect!  I did manage to run STRUCTURE with over 30k SNPs but not for very many steps (and above a few thousand SNPs it's not clear one is getting much benefit).  This approach will likely be useful for others...thank you for posting!

Emerson, Kevin

Jun 5, 2013, 3:18:03 PM
to stacks...@googlegroups.com
Hello all,

Julian had a very nice, succinct solution to the problem.  I, as some might expect, have a much more convoluted solution, though it works just as well!

I have a Python script that samples a Stacks-generated STRUCTURE file and returns a new file with a random sample of the loci.  The script was written as part of a larger pipeline and writes the loci it keeps to a log file, so you can track which loci you used if you need to.

Usage: sampleStructureFile.py [options] filename

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n NUMLOCI, --numLoci=NUMLOCI
                        number of loci to sample
  -i FILENAME, --inFile=FILENAME
                        input structure file
  -o OUTFILENAME, --outFile=OUTFILENAME
                        output file destination

The file is attached.  Hope it helps.
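
In case it is useful to see the general idea without opening the attachment, a minimal sketch of this kind of script (the attached sampleStructureFile.py may differ in its details) looks roughly like the following, assuming the Stacks comment line has already been removed so that the first line of the file is the row of locus IDs:

#!/usr/bin/env python
# Sketch only: keep the first two columns (sample and population) plus a
# random subset of the locus columns from a Stacks STRUCTURE file.
import random
import sys

def sample_structure(in_file, out_file, num_loci):
    with open(in_file) as fh:
        rows = [line.rstrip("\n").split("\t") for line in fh if line.strip()]
    header = rows[0]                    # assumed layout: two blank fields, then one locus ID per column
    locus_cols = range(2, len(header))  # indices of the genotype columns
    # note: if num_loci exceeds the number of locus columns, random.sample
    # raises "ValueError: sample larger than population"
    keep = sorted(random.sample(locus_cols, num_loci))
    with open(out_file, "w") as out:
        for row in rows:
            out.write("\t".join([row[0], row[1]] + [row[c] for c in keep]) + "\n")
    return [header[c] for c in keep]    # the locus IDs kept, e.g. for a log file

if __name__ == "__main__":
    kept = sample_structure(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    sys.stderr.write("kept %d loci\n" % len(kept))

The attached script additionally provides the -n/-i/-o options shown above and logs the loci it keeps.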

Cheers,
Kevin




--
------------------
Kevin J Emerson, PhD
Assistant Professor of Biology
Biology Department
St. Mary's College of Maryland
18952 E. Fisher Rd
St. Mary's City, MD 20686-3001

Office: 240 - 895 - 2123, Shaefer Hall 231

---------------------

Attachment: sampleStructureFile.py

chris blair

Oct 9, 2013, 1:33:23 PM
to stacks...@googlegroups.com
Hi Kevin, 

Thanks for the script. I am trying to use it to reduce the Structure file to 1000 loci. However, I am getting an error that states: ValueError: sample larger than population. I can't seem to figure out what this is referring to.

Chris

chris blair

Oct 9, 2013, 1:43:00 PM
to stacks...@googlegroups.com
Do I need to edit something in random.py before I run the script?

Emerson, Kevin

Oct 9, 2013, 3:57:21 PM
to stacks...@googlegroups.com
Chris,

I didn't have to do anything to random.py to run the script on my end, but that is clearly where the error is coming from. I used Python 2.7.3 to run this script.  Are you using the same version?

If you want to send along your structure file I can try running it quickly to see if I can help troubleshoot.
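
For what it's worth, that particular ValueError is what Python's random.sample raises whenever it is asked for more items than the sequence it is sampling from contains, e.g.:

>>> import random
>>> random.sample(range(10), 1000)
ValueError: sample larger than population

so it usually means the script found fewer loci in the input file than the number requested with -n.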

Kevin



Emerson, Kevin

Oct 9, 2013, 8:51:50 PM
to stacks...@googlegroups.com
The script I shared earlier does not ignore comment lines.  To use it, just remove the comment line from the structure output from Stacks:

$ grep -v ^# batch_1.structure.tsv > structureFile.tsv

$ sampleStructureFile.py -n 1000 -i structureFile.tsv -o out.tsv

That should work for you.  