Adjust model to decrease sampling time (large dataset)


Troy Wixson

May 1, 2023, 11:23:24 AM
to nimble-users
Good Morning, 

I am working on a model which classifies genes into one of three groups (deleterious, null, or beneficial). This is essentially a variable selection task, except that if a variable is included, its sign categorizes the gene as either deleterious or beneficial. We have two different datasets, one modeled with negative binomial regression and the other with logistic regression, and the two data types share information through the group labels. The model has performed well in simulation studies (with much smaller datasets), but when I compiled it and tried sampling from our real data it was far too slow: it takes around 2 hours to compile and produces around 7 samples an hour.

I'm sure there are ways to adjust the nimble model code, etc. so that it samples more quickly but I'm not sure how to think about this. Any advice would be appreciated! 

For reference, I have 11,400 individuals in the GWAS dataset, 370 individuals in the RNA-seq dataset, and would like to use around 1,680 genes. This gives the X_GWAS data matrix 11,400 rows and 1,680 columns, and the response in the other branch (Y_RNA) is a vector of length 306,xxx.

Here's the model:

three_groups_code <- nimbleCode({
 
  # RNA-seq - (patient x gene) level
  for(i in 1:(num_individuals_RNA * num_genes)){
    Y_RNA[i]       ~ dnegbin(size = r_nb[gene_number_index[i]], prob = p_nb[i])
    p_nb[i]        <- 1 / (1 + gene_effect[i] * dispersion[gene_number_index[i]])
    gene_effect[i] <- exp((covariates_RNA[individual_index[i], 1:10] %*% beta_cov_RNA[1:10])[1,1] +
                           log_ind_lib_size[individual_index[i]] +       # known offset
                           log_gene_length[gene_number_index[i]] +       # known offset
                           gene_effect_baseline[gene_number_index[i]] +
                           log_fc[gene_number_index[i]] * PD_indicator[individual_index[i]])
  }
 
  # GWAS - patient level  
  for(l in 1:num_individuals_GWAS){
    Y_GWAS[l] ~ dbern(p_bern[l])
    p_bern[l] <- expit((covariates_GWAS[l, 1:3] %*% beta_cov_GWAS[1:3])[1,1] +
                       (X_GWAS[l, 1:num_genes] %*% beta_GWAS[1:num_genes])[1,1])
  }
 
  # both RNA-seq and GWAS - gene level
  for(j in 1:num_genes){
    # RNA-seq
      # gene-wise dispersion
    dispersion[j]  ~ T(dnorm(0, 10^(-3)), 0, )
    r_nb[j]        <- 1/dispersion[j]
      # gene effect
    gene_effect_baseline[j]  ~ dnorm(0,10^(-2))
      # group effect size
    delta_deleterious_RNA[j] ~ dhalfpiMOM(t = 0.1, r = 2)  # or t = t_deleterious_RNA
    delta_beneficial_RNA[j]  ~ dhalfpiMOM(t = 0.1, r = 2)  # or t = t_beneficial_RNA
   
    # GWAS
      # group effect size
    delta_deleterious_GWAS[j] ~ dhalfpiMOM(t = 0.1, r = 2)  # or t = t_deleterious_GWAS
    delta_beneficial_GWAS[j]  ~ dhalfpiMOM(t = 0.2, r = 2)  # or t = t_beneficial_GWAS
   
    # calculate the group-wise effect
    group[j, 1:3] ~ dmulti(size = 1, prob = lambda[1:3])
    log_fc[j]    <- group[j,1]*(-1)*delta_beneficial_RNA[j] +
                      group[j,3]*delta_deleterious_RNA[j]
    beta_GWAS[j] <- group[j,1]*(-1)*delta_beneficial_GWAS[j] +
                      group[j,3]*delta_deleterious_GWAS[j]
  }
 
  # hyperparameters
  lambda[1:3]   ~ ddirch(alpha = alpha[1:3])
 
  # RNA-seq hyper-parameters
  beta_cov_RNA[1:10]   ~ dmnorm(mean_RNA[1:10], prec_RNA[1:10, 1:10])
  # t_deleterious_RNA    ~ dgamma(shape = 3, rate = 5)
  # t_beneficial_RNA     ~ dgamma(shape = 3, rate = 5)
 
  # GWAS hyper-parameters
  beta_cov_GWAS[1:3]  ~ dmnorm(mean_GWAS[1:3], prec_GWAS[1:3, 1:3])
  # t_deleterious_GWAS  ~ dgamma(shape = 3, rate = 0.5)
  # t_beneficial_GWAS   ~ dgamma(shape = 3, rate = 0.5)
 
  # impute missing sex for third sex category
  for(k in 1:length(missing_sex_RNA)){
    covariates_RNA[missing_sex_RNA[k], 10] ~ dbern(0.5) # sex is the 10th variable
  }
})



Thanks for your time!

-Troy Wixson 

Chris Paciorek

May 12, 2023, 1:36:19 PM
to Troy Wixson, nimble-users
Hi Troy,

Just to make sure I follow -- you're only able to run 7 iterations of the MCMC in an hour?

As far as ideas, the slow compilation probably relates in part to having 306,000 Y_RNA nodes. One idea would be to try to vectorize that first "RNA-seq" for loop. You'd have to write a user-defined distribution that represents the negative binomial joint density for all the Y_RNA values at once (i.e., moving the for loop out of the nimble model code and into a user-defined distribution). Then you'd have to see if you could write the p_nb and gene_effect lines as vectorized statements (if necessary by writing user-defined functions).
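A bare-bones sketch of what such a distribution could look like (dnegbin_vec, rnegbin_vec, and r_nb_long below are placeholder names, not nimble built-ins):

dnegbin_vec <- nimbleFunction(
  run = function(x = double(1), size = double(1), prob = double(1),
                 log = integer(0, default = 0)) {
    returnType(double(0))
    # one joint log-density for the entire Y_RNA vector
    logProb <- sum(dnbinom(x, size = size, prob = prob, log = TRUE))
    if(log) return(logProb) else return(exp(logProb))
  })

rnegbin_vec <- nimbleFunction(
  run = function(n = integer(0), size = double(1), prob = double(1)) {
    returnType(double(1))
    # only n = 1 is meaningful for a vector-valued node
    ans <- numeric(length(size))
    for(i in 1:length(size))
      ans[i] <- rnbinom(1, size = size[i], prob = prob[i])
    return(ans)
  })

The loop over i in the model would then collapse to a single declaration along the lines of Y_RNA[1:N] ~ dnegbin_vec(size = r_nb_long[1:N], prob = p_nb[1:N]), with N = num_individuals_RNA * num_genes and r_nb_long the gene-level sizes expanded out to observation length.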

Our training from January has comments in the third and fourth modules about advantages and disadvantages of vectorizing model statements.

-chris


Troy Wixson

May 24, 2023, 7:06:57 PM
to nimble-users
Thanks for the response to such a vague question!

You are correct; it is also using ~60 GB of RAM and is very slow.

I have found a few ways to improve memory usage and speed up the MCMC iterations. First, I put X_GWAS into constants instead of data, as I do not need to change the values in that matrix. Second, I vectorized Y_GWAS with a user-defined vectorized Bernoulli distribution. This had an incredible effect on the MCMC iteration rate (more than a 100x improvement on datasets with 100 genes). I tried vectorizing Y_RNA, as suggested, but on smaller datasets with 100 genes it takes around 10x longer to get the same number of samples with both Y_GWAS and Y_RNA vectorized than with only Y_GWAS vectorized. The Y_RNA vector contains a count for each gene for each individual; when I instead vectorize by gene (i.e. turn the vector into a matrix and run a for-loop over genes) there is a slight increase in the MCMC iteration rate. I'm not sure how well these observations will scale, but I'm hopeful.
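For concreteness, the vectorized Bernoulli is along these lines (simplified sketch; dbern_vec is just the name I use, and since a Bernoulli is a binomial with size 1, dbinom covers the density):

dbern_vec <- nimbleFunction(
  run = function(x = double(1), prob = double(1),
                 log = integer(0, default = 0)) {
    returnType(double(0))
    # joint log-density of all Y_GWAS observations at once
    logProb <- sum(dbinom(x, size = 1, prob = prob, log = TRUE))
    if(log) return(logProb) else return(exp(logProb))
  })

with the model declaration becoming Y_GWAS[1:num_individuals_GWAS] ~ dbern_vec(prob = p_bern[1:num_individuals_GWAS]). The by-gene variant for Y_RNA is the same idea applied row by row to a (gene x individual) matrix inside the for-loop over genes.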

I'm working on implementing reversible jump MCMC (RJMCMC) by modifying the RJ code according to Sally's suggestion in this post (https://groups.google.com/g/nimble-users/c/O4A9mxW_vhU). I'm thinking this will be helpful, as there is a lot of wasted computation for all of the null genes. This seems to work correctly on toy models and smaller datasets and has improved the sampling rate a little. I am testing it on the larger dataset now.
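For reference, the stock interface I'm modifying is nimble's configureRJ(); an unmodified setup for a simple indicator-based model looks roughly like this (gene_included is a hypothetical 0/1 indicator variable, which my model replaces with the group labels):

conf <- configureMCMC(model)
configureRJ(conf,
            targetNodes    = 'beta_GWAS',     # coefficients toggled in/out of the model
            indicatorNodes = 'gene_included', # hypothetical 0/1 inclusion indicators
            control = list(mean = 0, scale = 0.2))
mcmc <- buildMCMC(conf)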

The other thing I tried was to pull some of the computations back into R with nimbleRcall like Perry suggested in this post (https://groups.google.com/g/nimble-users/c/8vaTHQoJz2M/m/kPKsOpMZAgAJ) but I was unsuccessful. Is this something I should spend more time trying to figure out?

Thanks for all of your time and for building this great tool and community,
-Troy 

Chris Paciorek

May 26, 2023, 8:17:27 PM
to Troy Wixson, nimble-users
Hi Troy,

A few further thoughts here. I suspect that vectorizing Y_RNA doesn't help because, when you sample the gene_effect_baseline elements, you end up computing the likelihood for all observations, even those that do not involve the gene being considered; the same applies to parameters that only affect a single individual. I can think roughly of some hacks that would compute only the necessary likelihood terms, but that would get pretty involved.

One thing that might speed computation, though perhaps only a bit, is that for

    log_ind_lib_size[individual_index[i]] +    # known offset
    log_gene_length[gene_number_index[i]] +    # known offset

you could create vectors of length `num_individuals_RNA * num_genes` that amount to pre-computing the result of the indexing, since none of the info is unknown/varying. That would avoid some computation at the expense of longer vectors. You could potentially do something similar with covariates_RNA, extracting out the first 9 columns as a matrix (and leaving the 10th column as a separate vector), as those 9 columns look like they are fixed and known.
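Concretely, that's one line of R before building the model (offset_long is a made-up name), with the result passed in via constants:

offset_long <- log_ind_lib_size[individual_index] +
               log_gene_length[gene_number_index]

The gene_effect line would then use offset_long[i] in place of the two indexed offset terms.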

In response to your question about `nimbleRcall`: the main purpose of that, as Perry discussed in that post, is to keep large constant objects out of the model. I may not have looked closely enough, but I'm not seeing such objects in your model. But if you did expand out covariates_RNA, it might get big enough that Perry's trick could be worthwhile.
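For the archive, the shape of Perry's trick is to keep the big matrix only in the R session and call out to it (sketch; the object and function names here are made up):

X_big <- covariates_expanded   # hypothetical large matrix, lives only in R
linpredR <- function(beta) as.numeric(X_big %*% beta)

nimLinpred <- nimbleRcall(
  prototype = function(beta = double(1)) {},
  returnType = double(1),
  Rfun = 'linpredR')

# inside nimbleCode:  linpred[1:N] <- nimLinpred(beta[1:p])

Each evaluation drops back into R, so it is slower per call; it only pays off when keeping the object out of the compiled model saves enough memory and compile time.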

Lots to consider here...

-chris

