aborted runs and problems with multiSNP utility

catherine

unread,

Feb 2, 2011, 3:19:00 PM2/2/11

to BIMBAM HELP

Hello,
I am a new BIMBAM user and am trying to use the program to test for
phenotypic associations within a candidate gene region and to evaluate
evidence for multiple causal mutations within that gene. I have ~800
SNPs genotyped in 90 individuals. I have been able to run single SNP
analyses and multi-SNP analyses up to L=3, but anything greater than
L=3, the program chokes and I get the following error message:

terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
-bimbam: single snp analysis of typed snps:
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>] 100%
/n/home11/clinnen/.lsbatch/1296597950.24723995: line 8: 17988
Aborted
(core dumped)

Here is an example of my usage: ./bimbam_lin -g agouti_geno.txt -p
pc1new.txt -pos agouti_ma
p.txt -o 13decVCF_L4.out -l 4. So far, I have not tried imputation. I
have tried running this on my Mac and on a Linux cluster, and have
tried specifying that I need a high-memory node on the cluster, but it
still chokes. I have run analyses with L>3 using the -m flag, but for
these, BFs for the different SNP models (1, 2, 3, etc) were not
computed, so I cannot compare models using the BF criterion. I guess
one option would be to create new input files that just have the top
100 (give or take) SNPs from the single-snp analysis, but is there
another way around this?

A second problem I am having is that when I try to use the multiSNP
utility (R code), I am unable to open the large multisnp files, even
for L=2 and L=3. If I try files for which I've used the -m flag, I can
get them read into R ok, but when I try to plot the results, I get the
following error:
> plot.multiSNP(mSNP)
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ

Any advice on how to deal on these problems would be greatly
appreciated!
Thank you!
Catherine

Yongtao Guan

unread,

Feb 2, 2011, 3:56:13 PM2/2/11

to catherine, BIMBAM HELP

Catherine,

choose(800, 4) is about 17 billion.

By design, bimbam doesn't compute BF for each configuration on the fly

-- in other words, it has to store all the configurations. That's why it will crash.

However, normally, you don't need to compute all the choose(800,4) configurations,

because 99.99...% of those models will have nil posterior probabilities.

Anyway, I do apologize for the inconvenience.

The -m option should work. You may want to try the latest version 1.0 if you haven't done so.

http://bcm.edu/cnrc/mcmcmc/bimbam/

If you want to compare different models with less restriction on number of SNPs,

I would like to recommend our new software piMASS,

http://bcm.edu/cnrc/mcmcmc/pimass/

The paper that describes the method implemented in piMASS can be found at

http://quartus.uchicago.edu/~yguan/pdfs/bvsr.pdf

Thank you.

Grant

Subject: aborted runs and problems with multiSNP utility
From: catherine <cli...@gmail.com>
To: BIMBAM HELP <bimba...@googlegroups.com>
Content-Type: text/plain; charset=ISO-8859-1

--
Yongtao Guan, PhD
Assistant Professor
Baylor College of Medicine
Tel: 713-798-0362
http://bcm.edu/cnrc/mcmcmc

Matthew Stephens

unread,

Feb 3, 2011, 2:30:06 PM2/3/11

to bimba...@googlegroups.com

Hi Catherine,
just to add to the comments:
in most cases it would only make sense to even
want to look at L>3 if you have seen interesting results
for L=3 (eg substantially stronger evidence for 3-SNP models than for
2-SNP models). If you are in that situation then it is
already good news that you gained something from the multi-SNP
analyses! In that case I would second Grant's suggestion to look at
piMASS, which uses MCMC to explore a larger set of models.

The plot.multiSNP code is not well designed to work with
so many SNPs. I hope we can get something more robust available
soon, but we don't have anything for now I'm afraid (you are the
first to report a problem! If others report problems this will become
a higher priority!)

Best wishes,

Matthew

> --
> You received this message because you are subscribed to the Google Groups "BIMBAM HELP" group.
> To post to this group, send email to bimba...@googlegroups.com.
> To unsubscribe from this group, send email to bimbam_help...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/bimbam_help?hl=en.
>
>

catherine

unread,

Feb 10, 2011, 10:33:58 PM2/10/11

to BIMBAM HELP

Hi Matthew and Grant,

Thanks so much for the advice. The BFs for L=2 and L=3 are essentially
the same. But I was hoping to see whether I could distinguish between
models in which there are a small # of causal variants and models in
which there are many causal variants (most of small effect). It sounds
like piMASS is the way to go for this, and following your suggestion,
I read your paper and have been playing around with the program. I
have a few questions regarding this, if you get the chance.

(1) What are the defaults for the hyperparameters -h and -p and do you
recommend trying alternative settings for -hmax/min -pmax/min? If so,
and given that I am working with a densely genotyped candidate region
(800 snps over 180kb) rather than a set of genome-wide markers, which
settings would you recommend? [some additional background: this
"candidate region" explains roughly 80-90% of phenotypic variation
among laboratory strains; I am now utilizing a variable natural
population to identify causal variants within this region].

(2) I'm still getting to know the program and the output, but do you
have any recommendations for particularly effective ways to summarize
the results? So far I have been looking at Manhattan plots for single-
SNP BFs and PIPs (from "snp" and "mcmc" files, respectively). To
summarize evidence for total # of SNPs, I plotted BF as a function of
SNP # (from the sampled states in the "path" file) and saw a peak
around 25 snps [log(BF)=46]. In contrast, sum of PIPs for all SNPs
("E" in your paper, I think) in this region was ~76. Are there better
ways to summarize the evidence for snp #? Also, which data did you use
to generate Fig 6B (posterior distributions of PVE)?

Thanks!!
Catherine

Yongtao Guan

unread,

Feb 10, 2011, 11:14:55 PM2/10/11

to bimba...@googlegroups.com

On Thu, Feb 10, 2011 at 9:33 PM, catherine <cli...@gmail.com> wrote:

Hi Matthew and Grant,

Thanks so much for the advice. The BFs for L=2 and L=3 are essentially
the same. But I was hoping to see whether I could distinguish between
models in which there are a small # of causal variants and models in
which there are many causal variants (most of small effect). It sounds
like piMASS is the way to go for this, and following your suggestion,
I read your paper and have been playing around with the program. I
have a few questions regarding this, if you get the chance.

(1) What are the defaults for the hyperparameters -h and -p and do you
recommend trying alternative settings for -hmax/min -pmax/min? If so,
and given that I am working with a densely genotyped candidate region
(800 snps over 180kb) rather than a set of genome-wide markers, which
settings would you recommend? [some additional background: this
"candidate region" explains roughly 80-90% of phenotypic variation
among laboratory strains; I am now utilizing a variable natural
population to identify causal variants within this region].

Catherine,

The default prior for h is Unif(0,1). The default prior for log(p) is Unif(log(1/n), log(400/n)), where n is the total number of SNPs (800 for your data).

You may try different priors for -pmax, say 50, 100, 200, etc.

If your PVE is 80-90%, it might be the case that you have a few SNPs with large effects sizes. You might consider to regress those SNPs out and use the residue to run piMASS.

(2) I'm still getting to know the program and the output, but do you
have any recommendations for particularly effective ways to summarize
the results? So far I have been looking at Manhattan plots for single-
SNP BFs and PIPs (from "snp" and "mcmc" files, respectively). To
summarize evidence for total # of SNPs, I plotted BF as a function of
SNP # (from the sampled states in the "path" file) and saw a peak
around 25 snps [log(BF)=46]. In contrast, sum of PIPs for all SNPs
("E" in your paper, I think) in this region was ~76. Are there better
ways to summarize the evidence for snp #? Also, which data did you use
to generate Fig 6B (posterior distributions of PVE)?

The ".path.txt" and ".mcmc.txt" files contain most of what you need for posterior inference. The way we analyse the CRP data in the paper can be a good start point on how to summarize the results. In addition to the Manhattan plot, you may report the posterior estimates of PVE (the hh column in the ".path.txt" file, which generates Fig 6B); posterior estimates of number of SNPs, or equivalently, posterior of $pi$ (p column in path file). The posterior mean of SNP# is probably more meaningful than the peak of the plot SNP# against logBF.

Hope this helps.

Grant

catherine

unread,

Feb 11, 2011, 9:40:05 AM2/11/11

to BIMBAM HELP

This helps a lot--thanks!!
Catherine

On Feb 10, 11:14 pm, Yongtao Guan <ytg...@gmail.com> wrote:

Matthew Stephens

unread,

Feb 11, 2011, 9:26:53 AM2/11/11

to bimba...@googlegroups.com

> (1) What are the defaults for the hyperparameters -h and -p and do you
> recommend trying alternative settings for -hmax/min -pmax/min? If so,
> and given that I am working with a densely genotyped candidate region
> (800 snps over 180kb) rather than a set of genome-wide markers, which
> settings would you recommend? [some additional background: this
> "candidate region" explains roughly 80-90% of phenotypic variation
> among laboratory strains; I am now utilizing a variable natural
> population to identify causal variants within this region].

I would definitely say for the candidate region you need to consider
changing the defaults,
since the defaults were set up for genome-wide analysis.
*In general* for a candidate gene, you would expect the total h to be
much smaller
than one (eg in human complex diseases, most genes will have genetic
variants explaining no more than 5%
of the variance at the very most, and in fact many will probably
explain less than 1%).
And most genes probably have only a few causal variants (so pmax = 200
would seem inappropriate).
For an arbitrary candidate gene in a human complex disease I would
have said try hmax = 0.05 and pmax = 10
to begin with. But it sounds from your description that you might be
in a somewhat
non-standard situation? Without knowing more it is difficult to make
suggestions.
[If you want to discuss such details further off-line, via direct
email, rather than bimbam_help then feel free]

Matthew

Reply all

Reply to author

Forward