Hi everyone,
I'm seeking advice on dataset subsampling and MCMC convergence for a single segment of an endemic arbovirus. I am trying to reconstruct its spatiotemporal history but have hit a bottleneck.
Dataset & Subsampling:
Original: ~2,500–3,000 full-length sequences.
Cleaning: Excluded all intra-segment recombinants and inter-segment reassortants.
Subsampling: Removed identical sequences clustered by year/location/host, then applied spatiotemporal stratified subsampling (capping the number of sequences per province/year cell).
Final size: ~500 representative sequences.
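For clarity, the capping step works roughly like this (a minimal sketch with made-up sequence IDs and metadata; in the real pipeline the province/year come from the sequence headers):

```python
import random
from collections import defaultdict

random.seed(42)

# Hypothetical metadata: (seq_id, province, year)
records = [
    ("A1", "north", 2010), ("A2", "north", 2010), ("A3", "north", 2010),
    ("B1", "south", 2011), ("B2", "south", 2011),
    ("C1", "east", 2015),
]

CAP = 2  # max sequences retained per province/year cell

# Group records into province/year cells
cells = defaultdict(list)
for rec in records:
    cells[(rec[1], rec[2])].append(rec)

# Randomly downsample any cell that exceeds the cap
subsample = []
for members in cells.values():
    if len(members) > CAP:
        members = random.sample(members, CAP)
    subsample.extend(members)

print(len(subsample))  # prints 5: north/2010 is capped from 3 to 2
```

The real pipeline also does the identity-based deduplication first, but the stratified cap above is the step I suspect of flattening the temporal structure.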
The Issues:
Weak Temporal Signal: A root-to-tip regression on the 500-taxon dataset in TempEst is very weak (R² ≈ 0.02–0.05).
MCMC Non-convergence: I ran a baseline BEAST analysis (no discrete spatial traits yet) with a Skygrid demographic model. To compensate for the weak temporal signal, I set an informative clock-rate prior (Normal: mean = 1.0E-4, stdev = 5.0E-6). Even so, after 300 million generations the MCMC has not converged (ESS < 100 for the posterior and most parameters).
Hardware & Performance:
The run is exceedingly slow. I am using a Windows 11 mini-PC (AMD Ryzen 7 PRO 6850H, 8 cores, 32 GB RAM, integrated AMD Radeon Graphics) with the BEAGLE library via OpenCL.
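For reference, these are the invocations I have been using or considering (flags as I understand them from the BEAST documentation; corrections welcome):

```shell
# List the BEAGLE resources BEAST detects (CPU vs. the integrated GPU)
beast -beagle_info

# What I'm considering: force the vectorized CPU (SSE) code path and split
# the likelihood across instances, instead of OpenCL on the integrated GPU
beast -beagle_SSE -beagle_instances 2 analysis.xml
```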
My Questions:
Subsampling: Could my spatiotemporal stratified subsampling have inadvertently flattened the temporal structure? Are there more robust subsampling strategies for large endemic datasets to preserve a sufficient temporal signal?
Convergence: Is the weak temporal signal actively preventing the Skygrid model from converging despite the informative prior? Would pruning specific outliers, reducing Skygrid grid points, or fixing the substitution rate entirely be a better approach here?
Performance: Do you have any suggestions for modifying the priors/operators, or specific BEAGLE command-line flags to run a 500-sequence Skygrid model more efficiently on an integrated AMD GPU setup?
Thanks in advance for your time and insights!