Hi Eyal and Maj,
the approach to calculate the p-value should be in some papers (at least in Nielsen et al. 2005, and some papers that we had in Stephan's group in Munich). But you are right, that it hasn't been documented in a clear way as a stand-alone paper/manual. I'll describe it here briefly and I'll put it within the next days in the website (
pop-gen.eu).
i. You need to infer/know the demographic model of your population.
This means that either you study a population that somebody else has published a demographic model for this population, or you should infer it. There are several approaches to infer demographic models from genomic data, none is perfect. You can use:
a) Approximate Bayesian Computation ABC to either choose between alternative models, or to infer parameters of a specific model (e.g. bottleneck, expansion etc). Typically, in ABC you choose some loci (just some random fragments in the genome), let's say about 200 in random locations or in distances large enough between them to be considered independent. Then, you calculate summary statistics of these loci, then you simulate 200 independent loci again with various parameter values of the demographic model ( you can use msABC for that; I acknowledge though that we should update a bit the code). Then you can use Csillery's R package (abc) to infer demography. This pipeline is described more or less here:
http://www.bio.lmu.de/~pavlidis/home/?Software:msABC
c) If you have population subdivision of two populations you can use IM, IMa model, but I have no experience with that (actually I had tried it 5-6 yrs ago and I had to wait 2-3 months for run to finish, so I killed it).
Anyway, I'll try to put more detailed description in the web-site; the conclusion is that somehow you should know the demographic model of your population.
ii) perform > 1000 neutral simulations wth the inferred/known demographic model (well, 1000 is of course arbitrary, but I'd use something like that)
This can be a bit tricky. You should use the same gridsize, and the corresponding mutation rate and recombination rate. For example if you simulate a whole chromosome, then theta = 4N mu and rho = 4 N r will be some large values probably. I'd suggest using Hudson's ms if the recombination rate for the simulated region is not very large. In Hudson's ms theta and rho refer to the whole simulated region (NOT to a single bp). If you have however large values of recombination rate (this can happen if you simulate a whole chromosome), then ms will take years. Thus, use MaCS from G. Chen (
http://www-hsc.usc.edu/~garykche/).
iii) calculate OmegaPlus for the simulations.
Just be careful to use the same grid-size as in the real data
iv) calculate how many OmegaPlus_MAX are larger than the OmegaPlus_MAX of the real data. If only 5 per 1000 are larger then p-value is 5/1000. If none is larger then p-value < 1/1000.
I hope that these instructions are more or less clear. Don't hesitate to contact me if you have further questions
all the best
pavlos