Structure and seeds; beware the eighth Mersenne prime

Owen Holland

Oct 1, 2024, 8:26:34 AM

to structure-software

Hi,

I originally started writing this post to ask for help, but I managed to rubber-duck a solution after writing out my entire problem 😅 On the off chance that this happens to someone else, I have explained the solution below.

I'm using the command-line version of Structure to investigate patterns of population structure (surprise! 😲). The dataset is quite large (~23,000 SNP loci), and on my institution's high-performance cluster I run Structure as a Slurm job array to parallelise my runs. I want to stay away from FastStructure when I can run regular Structure relatively quickly despite the number of loci. I have used structure_threader before, but parallelising the runs with an array is more scalable and better for job management. My script reads a file (config.txt) to look up K, the repetition number for that K, and a seed, all based on the array task number. I have avoided the 'RANDOMIZE' option in extraparams because the "random" seed is generated from the clock (I think), which leads to identical seeds when several runs start at the same time. Instead, I used shuf to generate "random" numbers of up to 10 digits and wrote them into config.txt. This avoids the problems with clock-based randomisation, and I also get to keep a record of my seeds.

The config.txt file format is as follows:

$ head -15 config.txt
TaskID K rep seed
1 1 1 3114633652
2 1 2 2007009304
3 1 3 2053864978
4 1 4 1400045285
5 1 5 1832013447
6 1 6 6391659688
7 1 7 1460128096
8 1 8 3924660921
9 1 9 7133210386
10 1 10 3153633214
11 2 1 7747553815
12 2 2 5345363849
13 2 3 5283459171
14 2 4 785974012
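
For anyone setting up something similar, here is a minimal sketch of how an array task could pull its own row out of a file in this format. The helper name lookup_task is hypothetical, and the commented-out Structure call is only an illustration (I believe -K and -D set K and the seed on the command line, but check the Structure documentation for your version).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "K rep seed" for a given TaskID.
# Assumes a whitespace-separated file with the columns TaskID K rep seed.
lookup_task() {
  local task_id="$1" config="$2"
  # The header line never matches because "TaskID" is not equal to a number.
  awk -v id="$task_id" '$1 == id {print $2, $3, $4}' "$config"
}

# Inside a Slurm array job this could be used roughly as follows:
#   read -r K REP SEED < <(lookup_task "$SLURM_ARRAY_TASK_ID" config.txt)
#   structure -K "$K" -D "$SEED" -o "K${K}_rep${REP}"   # illustrative only
```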

The script runs without errors and generates the seed.txt file (not appended to a previous seed.txt file, I promise), but when I look at the seeds I can see that they duplicate every now and then:

$ head seed.txt
2147483647
2053864978
1400045285
2147483647
1832013447
1460128096
2007009304
2147483647
2147483647
2147483647

Duplicates are a problem for estimating K using Evanno's method, as the standard deviation between repeat runs is too low when so many of the runs are identical. When I sort this list and remove duplicates, it reveals that I only have 28 unique seeds (as opposed to the 100 I generated). If I search for seeds common to config.txt and seed.txt, I can see that all but one of the unique seeds come from my config.txt file:

$ awk '{print $4}' config.txt > configSeeds.txt

$ grep -f configSeeds.txt seed.txt | sort | uniq | wc -l
27
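
A quick way to see which seeds repeat, and how often, is a small tally like the sketch below. The function name duplicate_seeds is made up for illustration, and it assumes a file with one seed per line, like seed.txt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every seed that occurs more than once,
# as "count seed", most frequent first.
duplicate_seeds() {
  awk '{n[$1]++} END {for (s in n) if (n[s] > 1) print n[s], s}' "$1" |
    sort -k1,1nr
}

# Example: duplicate_seeds seed.txt
```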

The rebel seed? It is the highly repeated 2147483647. So why did my runs use 27 of my seeds plus 73 instances of 2147483647? Because that number (2^31 - 1, the eighth Mersenne prime) "is the largest value that a signed 32-bit integer field can hold." https://en.wikipedia.org/wiki/2,147,483,647. So basically that is the biggest seed Structure accepts, presumably because it stores the seed as a signed 32-bit integer.

Sorting seed.txt shows that whenever Structure was given a seed larger than 2147483647, it recorded 2147483647 instead:

$ sort -n seed.txt | tail -80 | head -15
1650723441
1727727319
1742692310
1832013447
1964940753
2007009304
2053864978
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
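
A pre-flight check along these lines would have caught the problem before any jobs were submitted. This is only a sketch: oversized_seeds is a hypothetical name, and it assumes the seed sits in the fourth column of config.txt below a single header line.

```shell
#!/usr/bin/env bash
MAX_SEED=2147483647   # 2^31 - 1, the largest signed 32-bit integer

# Hypothetical helper: print "TaskID seed" for every row whose seed
# exceeds MAX_SEED. No output means every seed is within range.
oversized_seeds() {
  awk -v max="$MAX_SEED" 'NR > 1 && $4 + 0 > max {print $1, $4}' "$1"
}

# Example: oversized_seeds config.txt
```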

Sorry about the long post. I just figured if this highly specific but simple problem can cause me so much grief then I probably am not the first or the last, and hopefully this helps! Now I'll be off to re-run all my analyses with seeds < 2147483647 😅
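
For the re-run, one possible way to build a config.txt whose seeds are unique and stay within range is sketched below. It relies on GNU shuf, whose -i option samples a range without replacement. The values of N_K and N_REPS are assumptions chosen to give 100 tasks, so adjust them to your own design.

```shell
#!/usr/bin/env bash
set -euo pipefail

MAX_SEED=2147483647   # 2^31 - 1, the largest seed that is not clamped
N_K=10                # assumption: test K = 1..10
N_REPS=10             # assumption: 10 repetitions per K

n_tasks=$((N_K * N_REPS))

# shuf -i draws without replacement, so no seed appears twice.
mapfile -t seeds < <(shuf -i 1-"$MAX_SEED" -n "$n_tasks")

{
  echo "TaskID K rep seed"
  task=0
  for ((k = 1; k <= N_K; k++)); do
    for ((rep = 1; rep <= N_REPS; rep++)); do
      echo "$((task + 1)) $k $rep ${seeds[task]}"
      task=$((task + 1))
    done
  done
} > config.txt
```

The file keeps the same four-column layout as before, so the rest of the submission script should not need to change.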
