Hi Jason,
No script required, just use some magic shell:
(Consider this command all on one line)
grep -v "^#" batch_1.sumstats.tsv |
cut -f 2 |
sort |
uniq |
shuf |
head -n 1000 |
sort -n > whitelist.tsv
This command does the following at each step:
1) grep keeps every line in the sumstats file except the commented header
lines. The sumstats file contains all the polymorphic loci in the analysis.
2) cut extracts the second column, which contains the locus IDs
3) sort orders those IDs
4) uniq reduces them to a unique list of IDs (removes duplicate entries)
5) shuf randomly shuffles those lines
6) head takes the first 1000 of the randomly shuffled lines
7) sort -n sorts them numerically again, and the redirection captures them into a file.
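To see the steps in action, here is a self-contained demo on a tiny made-up sumstats file (the file name and contents are invented for illustration; a real sumstats file has many more columns):

```shell
# Fake sumstats file: one commented header line, then rows whose
# second column is the locus ID (IDs 10-13, with 10 duplicated).
printf '# header line\n' > demo.sumstats.tsv
printf '1\t%s\tother\n' 10 10 11 12 13 >> demo.sumstats.tsv

# The same pipeline, keeping 3 random loci instead of 1000.
grep -v "^#" demo.sumstats.tsv |
cut -f 2 |
sort |
uniq |
shuf |
head -n 3 |
sort -n > whitelist.tsv

wc -l < whitelist.tsv   # 3 -- one locus ID per line
```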
So, this will pull out all the polymorphic catalog IDs, shuffle them, and capture
the first 1000 random IDs into a file. You then run populations again, passing
this file via the whitelist (-W) flag. populations will then process only
these 1000 random loci.
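For example, a hypothetical invocation might look like this (the paths are placeholders, and you should check the flags against the manual for your Stacks version):

```shell
# -P: directory of Stacks output files, -M: population map,
# -W: the whitelist generated above, --structure: write STRUCTURE output.
populations -P ./stacks_output -M ./popmap.tsv -W ./whitelist.tsv --structure
```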
If you repeat this command a few times (saving the output under a different
name each time) and compare the outputs, say:
head -n 25 whitelist_1000-1.tsv whitelist_1000-2.tsv
you should see different sets of IDs in the files.
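One way to produce those differently named files is a simple loop. This sketch fabricates its own input file so it runs standalone; on real data, replace demo.sumstats.tsv with your sumstats file and 10 with 1000:

```shell
# Throwaway sumstats stand-in: header line plus 50 loci, IDs in column 2.
printf '# header\n' > demo.sumstats.tsv
seq 1 50 | awk '{print "1\t" $1 "\tother"}' >> demo.sumstats.tsv

# Draw two independent random whitelists (10 loci each here).
for i in 1 2; do
  grep -v "^#" demo.sumstats.tsv | cut -f 2 | sort | uniq |
    shuf | head -n 10 | sort -n > "whitelist_1000-$i.tsv"
done

head -n 25 whitelist_1000-1.tsv whitelist_1000-2.tsv
```

Because shuf reseeds on every run, the two files will usually contain different locus IDs.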
If you want more than 1000 loci, just change the number passed to head -n
(1000-5000 loci seems to work well with STRUCTURE, but it can't handle huge
numbers of loci).
Good luck,
julian