Impact of missing data and differing results for the same dataset


bcn27...@gmail.com

Apr 25, 2023, 9:34:59 AM
to structure-software
Hi all!

I've been working with ddRAD data and I've encountered something quite intriguing to say the least. I've been processing my data using STACKS (Catchen Lab) and said program has an option to output your SNP data in *.structure format.

In that output, missing data is coded as 0. In parallel, just to be extra cautious, I used the same program to generate a *.vcf file from the same loci, processed it with vcftools and PLINK to prune out linked SNPs, and then converted the *.vcf to *.structure using PGDSpider, which codes missing data as -9.

Just out of curiosity I decided to run the *.structure output from STACKS twice:
A) leaving 0 as the missing data value
B) copying the same *.structure file and changing every 0 to -9 as the missing data value

After running STRUCTURE, A and B yielded different results, even though the data were essentially the same. I am clueless as to why this is happening.

What am I missing? Do you have any tips?

Thank you!

Vikram Chhatre

Apr 25, 2023, 9:37:49 AM
to structure...@googlegroups.com
Are you using the front end or the command-line version of STR? Either way, did you set the notation for missing data within STR? By default, it's -9. If you leave that setting unchanged but use 0 in your data, STR will interpret your file incorrectly.

--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to structure-softw...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/structure-software/a126477e-51c2-4df6-aa29-925376a3b0c1n%40googlegroups.com.

bcn27...@gmail.com

Apr 25, 2023, 9:45:35 AM
to structure-software
Hi, Vikram!

I'm using the front end, version 2.3.4, and in both cases I set the missing data notation properly: for the unmodified *.structure output from STACKS I told STR to treat 0 as missing data, and for the modified *.structure file (a copy of the same file with every 0 changed to -9) I told it to treat -9 as missing data.
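Whether file B really differs from file A only in the missing data code can be checked mechanically rather than by eye. The following is a minimal sketch, not a feature of STRUCTURE or STACKS: the function name `unexpected_differences` and the two tiny example files are made up for illustration, and it assumes both files are whitespace-delimited with the same number of lines.

```python
def unexpected_differences(lines_a, lines_b, code_a="0", code_b="-9"):
    """Compare two files token by token.

    Returns (line_no, column_no, token_a, token_b) for every mismatch that is
    NOT simply code_a in file A versus code_b in file B. A line_no or
    column_no of 0 marks a mismatch in line count or field count.
    """
    problems = []
    if len(lines_a) != len(lines_b):
        problems.append((0, 0, f"{len(lines_a)} lines", f"{len(lines_b)} lines"))
    for line_no, (a, b) in enumerate(zip(lines_a, lines_b), start=1):
        tokens_a, tokens_b = a.split(), b.split()
        if len(tokens_a) != len(tokens_b):
            problems.append(
                (line_no, 0, f"{len(tokens_a)} fields", f"{len(tokens_b)} fields")
            )
            continue
        for col_no, (ta, tb) in enumerate(zip(tokens_a, tokens_b), start=1):
            # Identical tokens and the expected recoding are both fine.
            if ta == tb or (ta == code_a and tb == code_b):
                continue
            problems.append((line_no, col_no, ta, tb))
    return problems


# Made-up rows: sample label, population ID, then two genotype columns.
file_a = ["ind_01\t10\t1\t0", "ind_02\t20\t0\t4"]
file_b = ["ind_-91\t1-9\t1\t-9", "ind_02\t20\t-9\t4"]

for problem in unexpected_differences(file_a, file_b):
    print(problem)
```

An empty result means the two files differ only in the missing data code. In the made-up example, the first row is flagged twice because its label and population ID were altered along with the genotypes.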

Vikram Chhatre

Apr 25, 2023, 9:48:44 AM
to structure...@googlegroups.com
I forgot to mention that individual runs will never be identical, given the stochastic nature of the MCMC process. The only way to repeat a run precisely is to choose the same starting SEED NUMBER.

Just to make sure, you did not change "0" to "-9" manually, correct? If you, for example, did it using a text editor, that process can introduce errors unless you are very cautious (and sometimes, even then).
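To illustrate one way a text-editor edit can go wrong: a plain find-and-replace of the character 0 also rewrites any 0 that occurs inside sample labels, population IDs, or locus names. Below is a minimal sketch of a safer, token-wise recoding. It is an assumption-laden illustration, not part of STRUCTURE, STACKS, or PGDSpider: the function `recode_missing` and its parameters `n_header_lines` and `n_meta_cols` are hypothetical, and the layout (one header line, then label, population ID, genotypes) must be adjusted to match the real file.

```python
def recode_missing(lines, old="0", new="-9", n_header_lines=1, n_meta_cols=2):
    """Replace the missing-data code in genotype columns only.

    lines          : file content as a list of strings without newlines
    n_header_lines : leading lines (comments, locus names) copied unchanged
    n_meta_cols    : leading non-genotype columns per row (label, population)
    """
    out = []
    for i, line in enumerate(lines):
        if i < n_header_lines or not line.strip():
            out.append(line)  # headers and blank lines stay as they are
            continue
        fields = line.split()
        meta, genotypes = fields[:n_meta_cols], fields[n_meta_cols:]
        # Only a whole token equal to `old` counts as missing data.
        genotypes = [new if g == old else g for g in genotypes]
        out.append("\t".join(meta + genotypes))
    return out


# Made-up example: locus names, then two rows for one diploid individual.
example = [
    "\tlocus_10\tlocus_20",
    "ind_01\t10\t1\t0",
    "ind_01\t10\t3\t0",
]

naive = [line.replace("0", "-9") for line in example]  # character-level replace
safe = recode_missing(example)                         # token-level replace

print(naive[1])
print(safe[1])
```

In the naive version the label ind_01 becomes ind_-91 and population 10 becomes 1-9, so STRUCTURE would read different labels and population assignments. The token-wise version changes only the genotype that was coded 0.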

Josh Banta

Apr 25, 2023, 9:48:56 AM
to structure...@googlegroups.com
Hello,

How different are your results when using 0 versus -9 as the missing data? Of course every STRUCTURE run will be a bit different, but I'm assuming it's a dramatic difference?

Best,
Josh Banta

bcn27...@gmail.com

Apr 25, 2023, 10:20:09 AM
to structure-software
I did expect some degree of variance between runs, but they are quite divergent in my particular case. I DID modify the *.structure file using a text editor, Notepad++. What would you advise me to do in order to change the missing data notation in the *.structure file?

Also, thank you. You are helping me big time

Vikram Chhatre

Apr 25, 2023, 10:22:51 AM
to structure...@googlegroups.com
I suggest exporting a VCF from STACKS, then passing it through PGDSpider, where you can set the options for missing data when converting to STR format.




bcn27...@gmail.com

Apr 25, 2023, 10:23:10 AM
to structure-software
It is quite different, indeed.

One run points clearly to K = 4 with no shred of doubt and the other one does the same but for K = 3. I added a screenshot of the differing results using the very same data.

Cheers,
Captura.PNG

bamber...@gmail.com

Apr 27, 2023, 10:03:51 AM
to structure-software
Hello, 

I also use the STRUCTURE output of Stacks, but I modify the input file with PGDSpider and PLINK for Admixture. Using -9 or 0 for missing data never influenced my results, and I would not expect it to matter for the STRUCTURE program either. More likely, mistakes happen when preparing the input files.

It was already mentioned above that the results may vary due to the MCMC process.
To check the results I would use e.g. 10 independent Admixture/Structure runs, which gives an idea about the variation between the runs. 

Having a quick look, the results for K = 3 and K = 4 look similar to me: in A), HEP separates from the red cluster that contains LEP+HEP and MCI in B).
In A), HEP then forms a separate cluster with the admixed CA+CT. From the annotations, it is not clear to which cluster the samples CA+CT, MAR S.S. and MG+CT are assigned in B).

I assume you know best how the samples relate to each other (samples from small/large scale population/subspecies/species level, expected gene flow/hybridization/introgression).
There might be a (biological) reason for the varying cluster assignments. Have you checked the amount of missing data for each individual and made sure that there is not much sampling bias?
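The per-individual amount of missing data can be tallied directly from the STRUCTURE-format file. Here is a minimal sketch under stated assumptions: the function `missing_fraction_per_individual` and its parameters are hypothetical names, the file is assumed to have one header line followed by rows of label, population ID, and genotypes, and rows that share a label (two rows per diploid individual) are pooled. The example rows are made up.

```python
from collections import defaultdict


def missing_fraction_per_individual(lines, missing="-9",
                                    n_header_lines=1, n_meta_cols=2):
    """Return {label: fraction of genotype entries equal to `missing`}."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n_missing, n_total]
    for i, line in enumerate(lines):
        if i < n_header_lines or not line.strip():
            continue  # skip header and blank lines
        fields = line.split()
        label, genotypes = fields[0], fields[n_meta_cols:]
        counts[label][0] += sum(g == missing for g in genotypes)
        counts[label][1] += len(genotypes)
    return {label: m / t for label, (m, t) in counts.items() if t}


# Made-up example: four loci, two diploid individuals, -9 as missing data.
rows = [
    "\tloc_a\tloc_b\tloc_c\tloc_d",
    "ind_01\t1\t1\t-9\t3\t4",
    "ind_01\t1\t2\t-9\t3\t4",
    "ind_02\t1\t-9\t-9\t-9\t4",
    "ind_02\t1\t-9\t-9\t-9\t2",
]

for label, fraction in missing_fraction_per_individual(rows).items():
    print(f"{label}\t{fraction:.2f}")
```

For a real file, the lines could be loaded with `open(path).read().splitlines()`, with `missing`, `n_header_lines`, and `n_meta_cols` set to match that file. Individuals with much higher fractions than the rest are candidates for removal before rerunning.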

Best wishes,

Dani Dols

Apr 27, 2023, 10:49:33 AM
to structure...@googlegroups.com
Hi! 

Thank you for your thorough answer. In theory, sampling bias is kept to a minimum, and all populations used in these analyses have more or less the same number of individuals, although in some cases that number may be a bit low. We have, however, noted long-branch attraction issues between LEP and HEP (they are two different species), which could be a reasonable explanation for such results.

I've run tests where I retained only the first SNP of each locus of interest, and this time the straight STACKS *.structure output (with the zeros declared as missing data) yielded the same results as the whole vcf-PGDSpider path. The difference is that this time I did not manually modify the missing data notation with a text editor.




