Convergence, MCMC reps, and the impact in results

517 views
Skip to first unread message

bcn27...@gmail.com

unread,
May 3, 2023, 11:30:09 AM5/3/23
to structure-software
Hi all!

As of late I've been doing quite a lot of STRUCTURE analysis using the front end software (version 2.3.4) and Structure_threader. The topic of convergence amongst iterations for the same K and between MCMCs is always a hot one. I've read the manual and in practice one should check in Data plot > Alpha the alpha values throughout the MCMC reps and check whether or not the alpha values stabilize.

I've been running multiple analyses on the exact same dataset using different sets of burn-in/MCMC reps. Namely, 1000 burn-in/10,000 MCMC reps; 5000 burn-in/50,000 MCMC reps; and 10,000 burn-in/100,000 MCMC reps. For a given set and range of K's, each K was iterated 10 times. I checked the Alpha plots for each set and to me they all look like they have stabilized (indicating convergence), and the bar plots between the iterations of the given best K of a particular set do recover the same results. However, the results for the best K differ between sets of burn-in/MCMC reps.

I've added the alpha plots in case someone here can see something I don't. Likewise, I've added the results for each of the tested sets.
So I have basically two questions:

A) What would it be a customary number of burn-in/MCMC reps to strive for, if any, when the alpha plots look like the ones I've provided?

B) When using Structure_threader how can one assess whether or not the alpha value has stabilized? I've checked the manual through and through and there are no comments on it. I've figured that perhaps by checking the mean value of alpha between iterations of the same K (str_K#_rep#_f) convergence could be assessed.

Thank you so much!


Cheers
differing_results.pdf
K4_1000b-in_10k_MCMC.jpg
K4_10kb-in_100k_MCMC.jpg
K3_5000b-in_50k_MCMC.jpg

Francisco Pina Martins

unread,
May 7, 2023, 7:09:12 AM5/7/23
to structure-software
Dear bcn,

As the author of Structure_threader, allow me to answer question "B" first:
If you run the program with the switch --log 1 Structure_threader will generate a logfile for each run it is wrapping. In this logfile you can find the alpha values of each iteration. You are correct as this is not explicit in the manual I will update it to include this important piece of information!

As for "A", looking at your plots I would say all of them look equally good (I'm assuming the blue part of the 0 line represents the burnin). The comprative plot is indeed intriguing. May I ask what K=4 looks like for the 5kb/50k MCMC looks like? It might also just be my imagination, but it looks to me like the plot from the 100k MCMC seems to have a lower value of log(alpha). Can you please confirm this suspicion by calculating the averages (or even better, using a man-whittney u-test).

Best,

Francisco

bcn27...@gmail.com

unread,
May 8, 2023, 9:43:33 AM5/8/23
to structure-software

Dear Francisco,

As a new user to Structure_threader let me start by expressing my gratitude. It is a breeze of fresh air and it makes the overall user experience far less painful. So thank you.
Regarding your request, I've prepared a renewed version of the pdf summarizing some of my results. The alpha plots for K4 and 5000 b-in/50,000 MCMC reps looks stabilized and the results of the bar plot look exactly like those of the 10,000 b-in/100,000 MCMC reps run (see pdf document).

However, I initially ruled them out given the results of the Evanno test, which I attached too. As for the man-whittney u-test, I used the average alpha values for each of the 10 iterations per bestK and run (a.k.a, Na = 10 for the K3_5kb-in/50kMCMC run, and Nb = 10 for the K4_10kb-in/100kMCMC run), and the results indicate that the null hypothesis cannot be rejected. However, upon checking the average values of alpha I did notice that the ones for the 100k MCMC run did vary more than the ones for the 50k MCMC run.

Looking forward to hear your input! And thank you so much

Cheers
average_alpha.PNG
differing_results_V2.pdf
Evanno_test_5kb-in_50kMCMC.PNG
K4_5kb-in_50kMCMC_alpha_plot.jpg

Francisco Pina Martins

unread,
May 9, 2023, 12:13:05 PM5/9/23
to structure-software
Dear bcn,

I'm happy Structure_threader is being useful (also, I've improved the manual thanks your feedback)!
Now we seem to be reaching somewhere. Your comparative plots show that for K=4 the 100k and 50k runs are pretty much equivalent. This is consistent with the U-test results. I have no idea why Evanno's test is assigning different values of K for them, though.
That leaves only the 10k run as the "different" one from a results standpoint.
Can you please do the U-test with the 10k vs the 50k run? Rejecting the null hypothesis here would strongly suggest that the alpha values are different between runs. If that is the case, the difference between the runs could be explained by an unlikely, but ultimately possible coincidence. But if someone has a better explanation, I'm ready to hear it. =-)


Best,

Francisco

bcn27...@gmail.com

unread,
May 9, 2023, 12:44:25 PM5/9/23
to structure-software
Dear Francisco,

the U-test fails as well to reject the null hypothesis when comparing the bestK of the 10k run (K4) to the bestK for the 50k run (K3). Additionally, I took the liberty to compare the results for the bestK at 50k run (K3) and K4 for the 50k run, and the null hypothesis cannot be rejected. Likewise, the U-test between 10k bestK and 100k bestK yielded the same results. But even though both Evanno's test establish K4 as the bestK, the delimited groups are similar but not quite the same.

It has been noted that there might be long-branch attraction between LEP and HEP, but I do not know if this piece of information gives you any clue at all as to what might be happening here.

Under such circumstances, perhaps the wise course of action would be to settle for the 100k run (or more) and be done with it?

Thank you for your invaluable input!

Cheers

Francisco Pina Martins

unread,
May 9, 2023, 5:48:25 PM5/9/23
to structure-software
Dear bcn,

>Under such circumstances, perhaps the wise course of action would be to settle for the 100k run (or more) and be done with it?

Yes, this is likely what I would do in your place. =-)
Happy to help.

Best regards,

Francisco

bcn27...@gmail.com

unread,
May 10, 2023, 5:00:52 AM5/10/23
to structure-software
Dear Francisco,

Thank you for your interest and help. On a completely unrelated note, but regarding Structure_threader , is there an option to sort the bar plot based on
affiliation (as in the STRUCTURE front end's 'Sort by Q')? I know there are multiple possibilities with the popfiles but I believe that this one is not included. I managed to sort my samples by Q
in the .html plot, in plot.ly, but for reasons unknown, it is impossible to sign up into the website while loading the plots in order to try to download their modified versions.
Of course, one could always screenshot the figure, but the resolution is poor as a consequence.

Thank you! And have a nice day

Cheers

Francisco Pina Martins

unread,
May 11, 2023, 8:20:33 AM5/11/23
to structure-software
Dear bcn,

There are two features I would really like to see implemented in Structure_threader, but never had the chance to: Q plot sorting, and CLUMP.
I keep telling myself I will do it "next summer", but I've been doing it since 2017. =-)
So, it may, or may not happen, but I really wish I had the time to do it.
For now you have to resort to the "trick" you already tried, which is by no means ideal.

Best,

Francisco

bcn27...@gmail.com

unread,
Jul 4, 2023, 4:16:45 AM7/4/23
to structure-software
Hi there, Francisco!

I managed to reorganize the structure plots by modifying the *.svg in Illustrator, which is tedious but works best than the approach I mentioned last time.

But I had a question regarding Structure_threader that's been bugging me. In some of my analyses, the best K and the plots do not match. For instance, the Evanno test might indicate that the best K is 6 but the pertinent plot will show only 4 colours as in K = 4. For that same run, when checking the results from K1-8, each different K iteration will match the number of colours in the plot up until K4 (i.e., K1 has only 1 colour; K2 has all samples divided into 2 colours; in K3, 3; and K4, 4). But from K4 to K8 all plots have only 4 colours displayed.

I've attached an example of the situation I described.

Is there an explanation for this event?


Thank you!
evanno.txt
str_K6_rep9_f.html
str_K6_rep9_f.svg

Francisco Pina Martins

unread,
Jul 4, 2023, 7:22:34 AM7/4/23
to structure-software
Hi bcn,

I had never seen this behaviour before. But to make sure this isn't a product of chance, please take a look at the files generated in the results directory.
STRUCTURE requires replicates for the evanno test to run. These replicates will vary somewhat in their result (but these variations are rarely drastic). So, what I would recommend, is to take a look at the "str_K6_rep*" files, and find the "Inferred ancestry of individuals:" section. Then for each replicate file, make sure there are always 2 clusters with 0.000 in all lines. If this does not happen for all replicates, (some replicates have no columns with all values 0.000), then you might want to draw your plots using CLUMPP, which will get you a plot "averaged" from all replicates.

Best,

Francisco

Dani Dols

unread,
Jul 4, 2023, 9:33:09 AM7/4/23
to structure...@googlegroups.com
Hi again,

I've checked and all str_K6_rep* have 2 columns with 0.000 values, although not necessarily the same columns throughout replicates.

Cheers,

--
You received this message because you are subscribed to the Google Groups "structure-software" group.
To unsubscribe from this group and stop receiving emails from it, send an email to structure-softw...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/structure-software/11744d0e-5b4d-4dc2-91dd-1e6d23879f09n%40googlegroups.com.


--
Daniel Dols Serrate
PhD. Student

Dpt de Genètica, Microbiologia i Estadística

Avd. Diagonal, 643

08028 Barcelona

Telf: +34 647 130 643

Francisco Pina Martins

unread,
Jul 4, 2023, 9:42:10 AM7/4/23
to structure-software
Hi Daniel,

In that case you shouldn't have to worry about huge differences between replicates. Are there any differences between K4 and K6? Or is the pattern essentially the same?
Keep in mind that the evanno method is just another ad hoc method (as satted by the authors) for estimating the best value of K, and not the end-all solution for the problem. To be on the safe side, you might want to consider repeating the test, using only K1:4, since STRUCTURE does not seem to return any hint of more than 4 clusters.

Best,

Francisco

Dani Dols

unread,
Jul 4, 2023, 10:01:25 AM7/4/23
to structure...@googlegroups.com
Hi Francisco,

the pattern is essentially the same between K4 and K6, as you may see in the attached files. But I wanted to understand whether or not this sort of result was a feature of Structure_threader or not, or why it behaved like this, for I have never encountered such type of result using the front-end version of Structure. I know Evanno's test is not words written in stone, but it certainly complicates things a bit to reason why one would ignore Evanno's results to choose as optimal another K and not make it look like one only wants to see things that there are not in one's data. K4 makes sense, and hierarchical structures should be able to highlight further clustering within the data, but it would have been great to see what K6 delimited.

Either way, I cannot stress enough how grateful I am for your quick response.

Cheers!


str_K4_rep9_f.svg
str_K6_rep9_f.svg

Francisco Pina Martins

unread,
Jul 4, 2023, 10:09:27 AM7/4/23
to structure-software
Hi again Daniel,

You may also like to try running either fastStructure and/or ALStructure wrapped in Structure_threader to check if any of them return a different pattern for higher values of K. ALStructure should take ~20min on the first run (to install all dependencies), but then each run should take only a few seconds. fastStructure should take a few minutes to a few hours to run. However, I should state that I would personally have more trust on the STRUCTURE results than on those from either fastStructure or ALStructure, but if these alternatives reveal a biologically interesting pattern, it might be worth further investigation on the STRUCTURE parameters.
If you do run these alternatives, please share the results, as you have certainly poked my curiosity! =-)

Best,

Francisco

Dani Dols

unread,
Jul 10, 2023, 9:28:51 AM7/10/23
to structure...@googlegroups.com
Hi Francisco,

it also pricked my curiosity, so I would like to try fastStructure but the format required is not quite the same as for Structure. Correct me if I'm wrong, but I seem to understand that fastStructure needs two rows per individual, no headers, and 5 columns between the samples' column and the actual genotypes. Is this correct? Do these 'bogus' columns need to have any value or attributes?

Which brings me to my next question: is there an already existing way of converting a *.structure file to the fastStructure format?

Cheers!

Francisco Pina Martins

unread,
Jul 11, 2023, 4:38:20 AM7/11/23
to structure-software
Hello again Daniel,

Yes, the fastStructure format is based on STRUCTURE's, but is a bit more strict. Your assessment on the format is correct. Depending on what your STRUCTURE input file looks like, it can be wither fairly easy to convert between them, of quite hard. If your STRUCTURE file already has 2 rows per individual it should be fairly simple to delete the header and add 5 columns (which don't have to have any special value or attributes, I usually just fill them with the string "extracol"). If you have 2 columns per individual, though, things might get trickier.
Note that fastStructure also accepts input in PLINK format. I had some luck with PGDSpider for converting file formats. Although it still works, it seems "abandoned" since 2018 and requires java 8 to run (just install it in a dedicated conda environment and you should be fine). If you find a better alternative, please do share. =-)

Best,

Francisco

Dani Dols

unread,
Jul 13, 2023, 7:32:23 AM7/13/23
to structure...@googlegroups.com
Hi again, Francisco!

I managed to write a small script to modify a structure file to a faststructure readable format. However, I've encountered problems when trying to install fastStructure using the helper scripts. Basically, I keep running into the same error:
-- Check for working Fortran compiler: /usr/bin/f95  -- broken
CMake Error at /opt/cmake-3.15.7-Linux-x86_64/share/cmake-3.15/Modules/CMakeTestFortranCompiler.cmake:45 (message):
  The Fortran compiler

    "/usr/bin/f95"

And I've been doing some research but this is way out of my league and there does not seem to be (or maybe I could not find) a clear answer to this problematic. I believe I installed Structure_threader using pip3 install structure_threader --user, but there is no pre-compiled fastStructure binary in ~/.local/bin/. Any tips you can give me?

Much appreciated!

Best

p.s.: please, find attached the script I mentioned in case is of any interest to you.


struc2fastStruc_v2.py

Josh Banta

unread,
Jul 13, 2023, 1:19:33 PM7/13/23
to structure...@googlegroups.com
Dear Dani,

I have a tutorial for installing and running FastStructure under Ubuntu Linux. You can emulate Linux on a Windows machine, an Intel Mac, or an Apple ARM Mac (although it will run slower on an Apple ARM Mac). I actually already have a preconfigured virtual machine that you can use that has Faststructure installed, if interested.

Here is the relevant link:

How to Install FastStructure under Ubuntu Linux:

The show notes contain information on setting up a virtual machine on your computer.
For Windows machines, follow the windows instructions and install VirtualBox
For Macs:
If an Intel Mac, follow the Windows instructions and install VirtualBox (and if you want to use my virtual machine, download the version I have prepared for Windows).
If an Apple ARM Mac, follow the Mac instructions and install the program UTM instead.

I hope this helps.
Best wishes,
Josh


Francisco Pina Martins

unread,
Jul 14, 2023, 5:48:57 AM7/14/23
to structure-software
Hi again Daniel,
Good job with the script. =-)

If you installed Structure_threader inside a conda environment the fastStructure binary should be under "~/miniconda3/envs/[env_name]/bin/fastStructure". Also note that only GNU/Linux and MacOSX binaries are provided by Structure_threader.
If you can't find the binary in any of those paths, you can also search for it with this command:

find ~/ -name fastStructure

Other alternatives include  Josh's tutorials, or following the fastStructure build script from *Structure_threader* (https://gitlab.com/StuntsPT/Structure_threader/-/blob/master/helper_scripts/install_faststructure.sh), although this will require an installation of python2. Last time I did it, I had to use a docker image of an old Ubuntu release (14.04 IIRC).

Hope this helps,

Francisco

PS - I really have to make a conda package for *Structure_threader*...

Dani Dols

unread,
Jul 14, 2023, 7:10:55 AM7/14/23
to structure...@googlegroups.com
Thank you to both of you,

I'll be sure to check out your tutorial, Josh.

On another note, I found fastStructure installed in my base environment, which is where Structure_threader was installed and tried the following command

structure_threader run -K 5 -R 10 -i $infile -o $output_path -t 16 -fs $HOME/anaconda3/bin/fastStructure --ind individuals.txt --log 1

I believe the -R flag is unnecessary for fastStructure, but I was recycling the structure's command. Either way, it does seem to run. This means that I may be able to try looking into that matter with my differing K results.
Once again, thank you both for lending me a hand.


Cheers!

Reply all
Reply to author
Forward
0 new messages