Structure and seeds; beware the eighth Mersenne prime

Owen Holland

Oct 1, 2024, 8:26:34 AM
to structure-software
Hi,
I originally started writing this post to ask for help, but I managed to rubber-duck a solution after writing out my entire problem 😅 On the off chance that this happens to someone else, I have explained the solution below.

I'm using the command-line version of Structure to investigate patterns of population structure (surprise! 😲). The dataset is quite large (~23,000 SNP loci), so I run Structure on my institution's high-performance cluster and parallelise the runs with Slurm's job array feature. I want to stay away from FastStructure when I can run regular Structure relatively quickly despite the number of loci. I have used structure_threader before, but parallelising the runs with an array is more scalable and better for job management. My script reads a file (config.txt) to look up K, the repetition number for that K, and a seed, based on the array task number. I have avoided the 'RANDOMIZE' option in extraparams because the "random" seed is generated from the clock (I think), which leads to a number of identical seeds when the runs start. Instead, I used shuf to generate "random" numbers of up to 10 digits and wrote them into config.txt. This avoids the problems with seeding from the system clock, and it also gives me a record of my seeds.

The config.txt file format is as follows:

$ head -15 config.txt
TaskID  K       rep             seed
1       1       1               3114633652
2       1       2               2007009304
3       1       3               2053864978
4       1       4               1400045285
5       1       5               1832013447
6       1       6               6391659688
7       1       7               1460128096
8       1       8               3924660921
9       1       9               7133210386
10      1       10              3153633214
11      2       1               7747553815
12      2       2               5345363849
13      2       3               5283459171
14      2       4               785974012
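
For anyone setting up something similar, here is a minimal sketch of the per-task lookup, assuming the layout above (one header line, then TaskID, K, rep and seed as whitespace-separated columns). The function name task_params is made up for illustration and is not taken from my actual script; SLURM_ARRAY_TASK_ID is the variable Slurm sets for each array task.

```shell
# Print "K rep seed" for one task ID from a config file laid out as above.
# task_params is a hypothetical helper name; $1 is the config file, $2 the task ID.
task_params() {
    awk -v id="$2" 'NR > 1 && $1 == id { print $2, $3, $4 }' "$1"
}

# Possible use inside a bash array job script:
#   read -r K rep seed < <(task_params config.txt "$SLURM_ARRAY_TASK_ID")
```

The seed is printed as the original text of the field, so awk does not reformat large values on the way through.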

The script works without errors and generates the seed.txt file (not appended to a previous seed.txt file, I promise), but when I look at the seeds I can see that some of them are duplicated:

$ head seed.txt
2147483647
2053864978
1400045285
2147483647
1832013447
1460128096
2007009304
2147483647
2147483647
2147483647

Duplicates are a problem for estimating K using Evanno's method, as the standard deviation between repeat runs is too low when there are so many identical runs. When I sort this list and remove duplicates, I am left with only 28 unique seeds (as opposed to the 100 I generated). If I search for seeds common to config.txt and seed.txt, I can see that all but one of the seeds come from my config.txt file:

$ awk '{print $4}' config.txt > configSeeds.txt
$ grep -f configSeeds.txt seed.txt | sort | uniq | wc -l
27
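
A quicker way to spot this kind of problem is to tally how often each seed occurs. This is a rough sketch, assuming a file with one seed per line like seed.txt above; the function names are made up for illustration.

```shell
# Tally how often each seed occurs, most frequent first.
# $1 is a file with one seed per line.
seed_counts() {
    sort -n "$1" | uniq -c | sort -k1,1nr
}

# Count the distinct seeds in the same kind of file.
unique_seeds() {
    sort -u "$1" | wc -l
}

# Possible use:
#   seed_counts seed.txt | head -3
#   unique_seeds seed.txt
```

Any seed with a count above 1 at the top of the tally is worth a closer look.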

The rebel seed? It is the highly repeated 2147483647. So why did Structure use only 27 of my seeds and record 2147483647 for the other 73 runs? Because that number (2^31 - 1, the eighth Mersenne prime of the title) "is the largest value that a signed 32-bit integer field can hold" (https://en.wikipedia.org/wiki/2,147,483,647). So basically that is the biggest seed allowed.

Sorting seed.txt shows that whenever Structure was given a seed larger than 2147483647, it recorded 2147483647 instead:

$ sort -n seed.txt | tail -80 | head -15
1650723441
1727727319
1742692310
1832013447
1964940753
2007009304
2053864978
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
2147483647
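
To check a config file before submitting anything, something like the following should list the rows whose seed is too large. It is only a sketch and assumes the config.txt layout shown earlier (one header line, seed in the fourth column); oversized_seeds is a made-up name.

```shell
# Largest value a signed 32-bit integer can hold.
INT32_MAX=2147483647

# Print every config row (header skipped) whose seed exceeds INT32_MAX.
# $1 is the config file; the seed is assumed to be in column 4.
oversized_seeds() {
    awk -v max="$INT32_MAX" 'NR > 1 && ($4 + 0) > (max + 0)' "$1"
}

# Possible use:
#   oversized_seeds config.txt | wc -l
```

On the 14 config.txt rows shown earlier in this post, this flags 8 of them.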

Sorry about the long post. I just figured if this highly specific but simple problem can cause me so much grief then I probably am not the first or the last, and hopefully this helps! Now I'll be off to re-run all my analyses with seeds < 2147483647 😅
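
P.S. For the re-run, a sketch of how the config file could be regenerated with seeds that stay under the limit. It assumes GNU shuf and the same four-column layout as above; make_config and MAX_SEED are names made up for illustration, and the 10 values of K by 10 repetitions in the usage line is only an example.

```shell
# One below the signed 32-bit limit, so every seed is < 2147483647.
MAX_SEED=2147483646

# Write a config table to stdout: TaskID, K, rep, seed.
# $1 = number of K values (K runs from 1 to $1), $2 = repetitions per K.
make_config() {
    # shuf without -r samples without replacement, so the seeds are unique.
    shuf -i "1-${MAX_SEED}" -n "$(($1 * $2))" |
    awk -v reps="$2" '
        BEGIN { OFS = "\t"; print "TaskID", "K", "rep", "seed" }
        { print NR, int((NR - 1) / reps) + 1, (NR - 1) % reps + 1, $1 }'
}

# Possible use:
#   make_config 10 10 > config.txt
```

Sampling without replacement also rules out duplicate seeds in the config file itself.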