Hi,
I originally started writing this post to ask for help, but I managed to rubber duck a solution after writing out my entire problem 😅 On the off chance that this happens to someone else, I have explained the solution below.
I'm using the command-line version of Structure to investigate patterns of population structure (surprise! 😲). The dataset is quite large (~23,000 SNP loci), and I run Structure on my institution's high-performance cluster as a Slurm job array so that the runs are parallelised. I want to stay away from FastStructure when I can run regular Structure relatively quickly despite the number of loci, and I have used structure_threader before, but parallelising the runs using an array is more scalable and better for job management. I have a file (config.txt) that my script reads, based on the array task number, to get K, the repetition number for K, and a seed. I have avoided the 'RANDOMIZE' option in extraparams because the "random" seed is generated from the system clock (I think), which leads to a number of identical seeds when the runs start. Instead, I used shuf to generate "random" numbers of up to 10 digits and wrote them into the config.txt file. This eliminates issues with randomisation using the system clock, and I can also keep a record of my seeds.
The config.txt file format is as follows:
The script works without errors and generates the seed.txt file (not appended to a previous seed.txt file, I promise), but when I look at the seeds I can see that they duplicate every now and then:
Duplicates are a problem for estimating K using Evanno's method, as the standard deviation between repeat runs is too low when there are so many identical runs. When I sort and remove duplicates from this list, it reveals that I only have 28 unique seeds (as opposed to the 100 I generated). If I search for common seeds between the config.txt file and the seed.txt file, I can see that all but one of the unique seeds come from my config.txt file:
$ awk '{print $4}' config.txt > configSeeds.txt
$ grep -f configSeeds.txt seed.txt | sort | uniq | wc -l
27
The rebel seed? It is the highly repeated 2147483647. So why has it used 27 of my seeds and 73 instances of 2147483647? Because that "is the largest value that a signed 32-bit integer field can hold" (https://en.wikipedia.org/wiki/2,147,483,647). So basically that is the biggest seed allowed, and since my shuf numbers could be up to 10 digits long, most of them were larger than that.
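If you want to check a config file for this before submitting jobs, something like the sketch below should flag the offending lines. It assumes the seed is the fourth whitespace-separated column (as in the awk command above); the function name find_oversized_seeds is just made up for illustration.

```shell
# Largest value a signed 32-bit integer can hold.
MAX_SEED=2147483647

# Print every line whose seed (assumed to be column 4) is larger than MAX_SEED.
# Hypothetical helper; change $4 if your seed lives in a different column.
find_oversized_seeds() {
    awk -v max="$MAX_SEED" '$4 + 0 > max + 0' "$1"
}

# Usage: find_oversized_seeds config.txt
```

Any line it prints has a seed that would end up replaced by 2147483647.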
Sorting seed.txt shows that once Structure encountered any seed larger than 2147483647, it would print that value instead:
Sorry about the long post. I just figured if this highly specific but simple problem can cause me so much grief then I probably am not the first or the last, and hopefully this helps! Now I'll be off to re-run all my analyses with seeds < 2147483647 😅
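For anyone regenerating seeds the same way, here is a rough sketch of drawing them with an explicit upper bound instead of by digit count. It assumes GNU shuf; the function name make_seeds and the output file name are only placeholders. I have capped the range at 2147483646 so that a real seed can never be mistaken for the 2147483647 that shows up when a seed is too large.

```shell
# One below the signed 32-bit maximum, so no real seed equals 2147483647.
SEED_LIMIT=2147483646

# Print N distinct seeds between 1 and SEED_LIMIT, one per line.
# Without -r, shuf samples without replacement, so no seed is repeated.
make_seeds() {
    shuf -i 1-"$SEED_LIMIT" -n "$1"
}

# Usage: make_seeds 100 > newSeeds.txt
```

The seeds still need to be pasted into config.txt however your own setup does it.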