Simulating independent SNPs in FastSimcoal2 for ABCtoolbox analysis

845 views
Skip to first unread message

upadhya...@gmail.com

unread,
Jul 18, 2018, 9:11:32 AM7/18/18
to fastsimcoal
Hi all,


I have (observed) summary statistics from the dataset of about ~2500 SNPs that I have selected after filtering them from ~700,000 SNPs based on LD (--indep-pairwise 300 10 0.0001).In other words, the summary statistics are generated from 2500 SNPs that are independent. Later on, I used "fsc26" to generate 2500 independent SNPs. Before generating SNPs, I read some insightful discussion from this link: https://groups.google.com/forum/#!topic/fastsimcoal/_N53lrQ4MGo

Following is the syntax that I used to simulate 2500 SNPs:

//Number of independent loci [chromosome]
2500 0
//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
DNA 1 0.00000000 0.000000012 0.00

Now my question is, is this correct way of simulating SNPs data? For the information, following are the historical events that I have included in the tpl file:

2 historical event

2879 1 0 1 2800 0 1 #population 1 merge with population 0
21623 0 2 1 453092 0 2 #population 0 merge with population 2

Any guidance on simulating independent SNPs will be highly appreciated.

Regards,
Maulik

Nicolas

unread,
Jul 19, 2018, 2:27:33 AM7/19/18
to fastsimcoal
Hi Maulik,

As for the historical events, there might be an error with the new size parameter (fifth parameter) as you made these demes grow quite intensively:
2879 1 0 1 2800 0 1 # 2800 here means deme 0 size will increase by a factor of 2800. Before this event (backward in time), lets say deme 0 had a size of x. After this event, deme 0, where deme 1 coalesced into, will have a size of 2800x. Which is quite huge.
The "new size" parameters works as a percentage of the current size of the deme. If your deme 0 had a size of 1000 before the event, and you want it to be 2800 after the event, 2.8 would be the correct parameter. If you have parameters that needs to be sampled with the .est file, you can easily use the complex parameters to deal with this.
The same apply for the second event, which has an even bigger growth backward in time, or a big forward decline.

As for SNPs, I believe your parameters are correct, but you cannot trust me on this as I haven't worked with these yet.

Best regards,
Nico

upadhya...@gmail.com

unread,
Jul 19, 2018, 2:55:32 AM7/19/18
to fastsimcoal
Hi Nicholas,

First of all thank you very much for the prompt and elaborate response. And I totally got your point. I should have avoided taking "pow10" factor for the "new size" after the coalescent event. I have changed it now and made it relative to the sink population.

Regarding SNPs, I will still wait for someone to confirm the parameters that I have used for simulation.

Thank you again,

Regards,
Maulik

Laurent Excoffier

unread,
Oct 30, 2018, 8:23:55 AM10/30/18
to fastsimcoal
Hi,

regarding SNP generation, I would do that like that

//Number of independent loci [chromosome]
2500 0
//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
DNA 1 0.00000000 0.000000012 0.00

as such you probably never have 2500 polymorphic positions

What you want to simulate are short DNA sequences where you have at most 1 polymorphic position. therefore you should generate many more than 2500  short sequences since most of them will have o polymorphic sites.
In the end you should use the option -s 2500 to limit the output to 2500 polymorphic positions

I'd rather  propose to use something like

//Number of independent loci [chromosome]
100000 0

//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
DNA 100 0.00000000 0.000000012 0.00 

which would simulate 100,000 regions of 100 bp. As such it may be an overkill, and you would probably need to see how many polymorphic sites you generate and reduce the number of segments to generate to have 2500 SNPs.

best
laurent

amandine...@gmail.com

unread,
Jan 6, 2020, 6:16:41 PM1/6/20
to fastsimcoal
Hi all (english version below),

Je réponds au commentaire car je ne suis pas sûre de comprendre comment  savoir si fastsimcoal simule bien le bon nombre de SNPs.

En effet, nous voudrions simuler des 285882 SNPs indépendants avec fastsimcoal. A l'époque, on m'avait conseillé de faire cette ligne de commande dans le .par (j'uilise fastsimcoal + ABCtoolBox),  i.e.  simuler les SNPs comme des loci indépendants. Cependant les simulations sont ultra-longues (5 simulations en 20 min!).

//Number of independent chromosome
285882 0
//Number of contiguous linkage blocks
1
//Per Block: Data type, No. of loci, Recombination rate to the right-side locus, plus optional parameters
SNP 1 1 0.02

Je me penchais donc vers l'option discutée ci dessous, en multipliant par 100 le nombre de SNPs et en mettant -s285882 :

//Number of independent loci [chromosome]
10000000 0

//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
DNA 100 0.00000000 0.000000012 0.00

Cependant comment peut-on être sûre que le nombre de SNP simulé est toujours le même pour chaque simulation, à savoir 285882? Est-il possible que le nombre de SNP simulé soit inférieur à 285882?

Merci beaucoup par avance pour votre aide!
Bonne soirée,
Amandine


#####
Hi all,

I reply to this comment as I am not sure how we can check whether fastsimcoal is simulating the right number of SNP.

Indeed, we would like to simulate 285882 independant SNP with fastsimcoal. Some time ago, I was using the following command line (but on a much lower number of SNPs), in the .par (I use fastsimcoal + ABCtoolBox). However each simulation takes very long time (5 sim in 20 min!).

//Number of independent chromosome
285882 0
//Number of contiguous linkage blocks
1
//Per Block: Data type, No. of loci, Recombination rate to the right-side locus, plus optional parameters
SNP 1 1 0.02

I would therefore would like to test the option proposed above, by updating the number of simulated loci (by 100) and defining  -s285882 as such :

//Number of independent loci [chromosome]
10000000 0

//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate and mut rate + optional parameters
DNA 100 0.00000000 0.000000012 0.00

However, how can we be sure that the number os simulated SNP is always the same among simulations, i.e. 285882?
Is there any another mean to speed up the simulations?

Thanks a lot by advance for your help!
my best,
Amandine



Reply all
Reply to author
Forward
0 new messages