Expansion/migration script and random seeding questions


Cameron Grey

Nov 4, 2025, 3:51:01 PM
to dadi-user

Hi Ryan,


I've been using dadi to model the initial expansion and gene flow of a recent invasion, and I've started on three populations. I have SNP data for the 3 populations and generation times for how long they have (probably) been there. I wanted a model that could examine likely initial expansion events as well as migration post founding of each population. I have scripts for each scenario, and have been comparing AIC values calculated from output LL. 


Specifically - I first tested different expansion scenarios with the nuPOP parameters and found the best-fit expansion scenario. From there, I have added migration on top of the expansion with the m_POP parameters in my scripts.


1. Question about migration - I have included a script that contains both the expansion scenario and one of the migration scenarios I am testing on top of the expansion results. I was wondering whether this is a reasonable/informative approach to expansion and migration. Because the data are from one time point, how does dadi differentiate initial expansion events from migration that occurs later on, given that these shared allele frequency changes can be quite subtle for recent events? I'm curious because I just got a couple of results suggesting migration in the same direction as my likely expansion scenario (I tested just expansion first).


2. Question about seeding - I realized that I was missing a random seeding for the different replicates of the model, and I am thinking of adding in np.random.seed() to the run_opt() portion. Any advice on this approach? I appreciate it.


Thank you for all of your help and for making such an interesting model!


-Cameron


scenario2_3popGOM_migBAtoBZ.py

Ryan Gutenkunst

Nov 7, 2025, 11:02:56 AM
to dadi...@googlegroups.com
Hello Cameron,

On Nov 4, 2025, at 1:47 PM, Cameron Grey <camero...@gmail.com> wrote:

Hi Ryan,

I've been using dadi to model the initial expansion and gene flow of a recent invasion, and I've started on three populations. I have SNP data for the 3 populations and generation times for how long they have (probably) been there. I wanted a model that could examine likely initial expansion events as well as migration post founding of each population. I have scripts for each scenario, and have been comparing AIC values calculated from output LL.

Do be careful with AIC. If your SNPs are linked, then dadi is really computing a composite likelihood, and the AIC will be anti-conservative (it will favor the more complex model too strongly).
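For reference, AIC is computed from the maximized log-likelihood and the number of free parameters as 2k - 2 ln L. The sketch below only illustrates that arithmetic; the log-likelihoods and parameter counts are made-up placeholders, not results from any real fit.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*ln(L)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical values for illustration only (not from any real dadi run).
ll_expansion, k_expansion = -1250.0, 4   # expansion-only model
ll_migration, k_migration = -1247.5, 5   # expansion + one migration rate

# Negative delta means the more complex model has the lower (better) AIC.
delta_aic = aic(ll_migration, k_migration) - aic(ll_expansion, k_expansion)
print(delta_aic)
```

With linked SNPs the log-likelihood difference is inflated, so a small AIC advantage for the migration model like the one in this toy example should not be taken as strong support on its own.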

Specifically - I first tested different expansion scenarios with the nuPOP parameters and found the best-fit expansion scenario. From there, I have added migration on top of the expansion with the m_POP parameters in my scripts.

1. Question about migration - I have included a script that contains both the expansion scenario and one of the migration scenarios I am testing on top of the expansion results. I was wondering whether this is a reasonable/informative approach to expansion and migration. Because the data are from one time point, how does dadi differentiate initial expansion events from migration that occurs later on, given that these shared allele frequency changes can be quite subtle for recent events? I'm curious because I just got a couple of results suggesting migration in the same direction as my likely expansion scenario (I tested just expansion first).

Be careful with migration directions. The m12 parameter in dadi is migration into population 1 from population 2, which isn’t obvious.

You may not have power to differentiate directional migration from other scenarios. As you said, it’s a subtle signal.
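One way to guard against mixing up directions is to spell out each migration parameter in words before interpreting a fit. The helper below is a hypothetical convenience function (it is not part of dadi) that applies the convention stated above, where mij is migration into population i from population j. The labels "A", "B", "C" are placeholders.

```python
def describe_migration(name, pop_labels):
    """Translate a name like 'm12' into a plain-language direction.

    Convention: m<i><j> is migration INTO population i FROM population j,
    with 1-based population indices (single digits only).
    """
    into_idx, from_idx = int(name[1]), int(name[2])
    source = pop_labels[from_idx - 1]
    sink = pop_labels[into_idx - 1]
    return f"{name}: from {source} into {sink}"


labels = ["A", "B", "C"]  # placeholder labels, in the model's population order
for param in ("m12", "m21", "m32"):
    print(describe_migration(param, labels))
```

Checking these descriptions against the intended scenario (for example, a script meant to model migration from one named population to another) makes it easier to catch a swapped index.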

2. Question about seeding - I realized that I was missing a random seeding for the different replicates of the model, and I am thinking of adding in np.random.seed() to the run_opt() portion. Any advice on this approach? I appreciate it.

Dadi does the same seeding approach internally, so there’s no need to add it. One caveat is that the default np.random.seed(), which dadi also uses, is based on the system clock time. If you’re starting a number of jobs simultaneously on a cluster, they can inadvertently get the same seed. dadi-cli works around this by also using additional information about the machine to seed.
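If you do want explicit, distinct seeds for simultaneous cluster jobs, one option is to mix the clock time with machine and process information. The sketch below is an assumption about how that could be done; it is not dadi-cli's actual implementation, and `make_seed` and `job_index` are hypothetical names.

```python
import hashlib
import os
import socket
import time


def make_seed(job_index=0):
    """Derive a 32-bit seed from clock time plus host, process, and job info."""
    parts = (time.time_ns(), socket.gethostname(), os.getpid(), job_index)
    digest = hashlib.sha256(repr(parts).encode()).digest()
    # Keep 4 bytes so the value fits the 0 .. 2**32 - 1 range np.random.seed accepts.
    return int.from_bytes(digest[:4], "big")


seed = make_seed(job_index=3)  # e.g. pass the cluster array-task index here
print(f"Using seed {seed}")
# Assuming NumPy is imported as np in the analysis script:
# np.random.seed(seed)
```

Printing or logging the seed alongside each replicate's output makes a given optimization run reproducible later.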

I encourage you to also explore dadi-cli; it makes most basic dadi analyses much easier.

Best,
Ryan



Cameron Grey

Jan 28, 2026, 2:46:59 PM
to dadi-user

Hi Ryan,

I appreciate your feedback on the earlier discussion. We’ve reevaluated our approach and are starting with relatively simple expansion-only demographic models to infer source populations and directionality among four populations.

I wanted to get your opinion on our SNP filtering strategy. My understanding is that any filtering that preferentially removes rare alleles can bias the SFS, so I’m avoiding MAF and HWE filtering. We’re working with ddRAD-seq SNPs (and later WGS SNPs) processed with STACKS, and I plan to randomly retain one SNP per RAD locus to ensure independence.

For missing data, I’m considering allowing ~30–50% missingness per SNP, while removing individuals with very high missing data (>40–50%). Does that seem like a reasonable balance for dadi analyses?

I also wanted to ask about paralog filtering. I’ve seen this suggested as a potential solution in some dadi discussions, and I was wondering whether this is something you typically recommend, and if so, whether it’s best handled via depth/heterozygosity-based filters rather than HWE.

Any guidance you’re willing to share would be much appreciated. Thanks again for taking the time to respond on these threads.


-Cameron

Ryan Gutenkunst

Jan 29, 2026, 5:09:28 PM
to dadi-user
Hello Cameron,

Retaining one SNP per RAD locus is a typical approach that I find sensible.
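A minimal sketch of that thinning step, assuming the SNPs are available as (locus ID, SNP ID) pairs; the function name and input layout are hypothetical and not tied to any particular STACKS output format.

```python
import random


def thin_one_snp_per_locus(snps, seed=None):
    """Randomly keep one SNP per locus.

    snps: iterable of (locus_id, snp_id) pairs.
    Returns a dict mapping each locus_id to the single retained snp_id.
    """
    rng = random.Random(seed)
    by_locus = {}
    for locus_id, snp_id in snps:
        by_locus.setdefault(locus_id, []).append(snp_id)
    return {locus: rng.choice(ids) for locus, ids in by_locus.items()}


# Made-up example input: three loci with 2, 1, and 3 SNPs.
example = [("L1", "L1_12"), ("L1", "L1_40"), ("L2", "L2_7"),
           ("L3", "L3_3"), ("L3", "L3_18"), ("L3", "L3_55")]
kept = thin_one_snp_per_locus(example, seed=1)
print(kept)
```

Fixing the seed keeps the retained SNP set identical across reruns, so the same SFS is rebuilt each time.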

The missing-data thresholds you describe seem like a reasonable balance for dadi analysis.
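As a rough sketch of that kind of filter, the code below drops high-missingness individuals first and then high-missingness SNPs. The data layout (a dict of per-individual call lists with None for missing), the function names, and the order of the two steps are assumptions for illustration; the 0.5 defaults sit at the upper end of the ranges discussed above.

```python
def missing_fraction(calls):
    """Fraction of entries in calls that are missing (None)."""
    return sum(c is None for c in calls) / len(calls)


def filter_missing(genotypes, max_ind_missing=0.5, max_snp_missing=0.5):
    """Drop individuals, then SNPs, whose missingness exceeds the thresholds.

    genotypes: dict mapping individual -> list of calls, one per SNP.
    Returns (filtered genotypes, indices of retained SNPs).
    """
    kept_inds = {ind: calls for ind, calls in genotypes.items()
                 if missing_fraction(calls) <= max_ind_missing}
    if not kept_inds:
        return {}, []
    n_snps = len(next(iter(kept_inds.values())))
    kept_snps = [j for j in range(n_snps)
                 if missing_fraction([c[j] for c in kept_inds.values()])
                 <= max_snp_missing]
    filtered = {ind: [calls[j] for j in kept_snps]
                for ind, calls in kept_inds.items()}
    return filtered, kept_snps


# Made-up genotype calls (0/1/2 = alternate allele count, None = missing).
genotypes = {
    "ind1": [0, 1, 2, None],
    "ind2": [0, None, 1, None],
    "ind3": [None, None, None, 1],
}
filtered, kept_snps = filter_missing(genotypes)
print(sorted(filtered), kept_snps)
```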

Paralog issues tend to show up as an anomalous “spike” in the data at exactly 50% allele frequency. If you see that, then move forward with filtering. Often they can be eliminated by a very conservative HWE filter, because they are so extreme (perfect heterozygosity), without affecting other calls. I would find that more reliable than pure depth.
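A quick way to look for such a spike is to compare the 50% frequency bin with its neighbours. This is only a sketch under stated assumptions: the input is a plain list of counts from a one-population unfolded SFS with an even number of sampled chromosomes, the function name is hypothetical, and the counts are invented.

```python
def midpoint_spike_ratio(sfs_counts):
    """Ratio of the 50%-frequency bin to the mean of its two neighbours.

    sfs_counts[i] is the number of SNPs with allele count i, for i = 0..n,
    where n (the number of sampled chromosomes) must be even.
    """
    n = len(sfs_counts) - 1
    if n < 4 or n % 2 != 0:
        raise ValueError("need an even sample size of at least 4 chromosomes")
    mid = n // 2
    neighbours = (sfs_counts[mid - 1] + sfs_counts[mid + 1]) / 2
    if neighbours == 0:
        raise ValueError("neighbouring bins are empty; ratio is undefined")
    return sfs_counts[mid] / neighbours


# Invented counts for n = 8 chromosomes, with an inflated bin at count 4.
counts = [0, 500, 250, 170, 300, 100, 80, 70, 0]
ratio = midpoint_spike_ratio(counts)
print(round(ratio, 2))
```

A ratio well above 1 in an otherwise smoothly declining spectrum is the kind of anomaly worth following up with a filter; what counts as "well above" is a judgment call for the data set at hand.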

Best,
Ryan
