Good Morning,
I am working on a model which classifies genes as being in one of three groups (deleterious, null, or beneficial). This is essentially a variable selection task except if the variable is included, the sign of the variable categorizes it as either deleterious or beneficial. We have two different data sets one of which uses negative binomial regression and the other uses logistic regression, and the two data types share information using the group labels. This model has performed well in simulation studies (with much smaller datasets) but when I compiled it and tried sampling from our real data it takes much too long. It takes around 2 hours to compile and takes around 7 samples an hour.
I'm sure there are ways to adjust the nimble model code, etc. so that it samples more quickly but I'm not sure how to think about this. Any advice would be appreciated!
For reference, I have 11,400 individuals in the GWAS datset, 370 individuals in the RNA_seq dataset, and would like to use around 1,680 genes. This gives the X_GWAS data matrix 11,400 rows and 1,680 columns and the response in the other branch (Y_RNA) is a length 306,xxx vector.
Here's the model:
three_groups_code <- nimbleCode({
# RNA-seq - (patient x gene) level
for(i in 1:(num_individuals_RNA * num_genes)){
Y_RNA[i] ~ dnegbin(size = r_nb[gene_number_index[i]], prob = p_nb[i])
p_nb[i] <- 1 / (1 + gene_effect[i] * dispersion[gene_number_index[i]])
gene_effect[i] <- exp((covariates_RNA[individual_index[i],] %*% beta_cov_RNA[1:10])[1,1] +
log_ind_lib_size[individual_index[i]] + #known offset
log_gene_length[gene_number_index[i]] +
#known offset
gene_effect_baseline[gene_number_index[i]] +
log_fc[gene_number_index[i]] * PD_indicator[individual_index[i]])
}
# GWAS - patient level
for(l in 1:num_individuals_GWAS){
Y_GWAS[l] ~ dbern(p_bern[l])
p_bern[l] <- expit((covariates_GWAS[l,] %*% beta_cov_GWAS[1:3])[1,1] +
(X_GWAS[l,] %*% beta_GWAS[1:num_genes])[1,1])
}
# both RNA-seq and GWAS - gene level
for(j in 1:num_genes){
# RNA-seq
# gene-wise dispersion
dispersion[j] ~ T(dnorm(0, 10^(-3)), 0, )
r_nb[j] <- 1/dispersion[j]
# gene effect
gene_effect_baseline[j] ~ dnorm(0,10^(-2))
# group effect size
delta_deleterious_RNA[j] ~ dhalfpiMOM(t=0.1, r=2)#t_deleterious_RNA, r=2)
delta_beneficial_RNA[j] ~ dhalfpiMOM(t=0.1, r=2)#t_beneficial_RNA, r=2)
# GWAS
# group effect size
delta_deleterious_GWAS[j] ~ dhalfpiMOM(t=0.1, r=2)#t_deleterious_GWAS, r=2)
delta_beneficial_GWAS[j] ~ dhalfpiMOM(t=0.2, r=2)#t_beneficial_GWAS, r=2)
# calculate the group-wise effect
group[j,] ~ dmulti(1, prob = lambda[1:3])
log_fc[j] <- group[j,1]*(-1)*delta_beneficial_RNA[j] +
group[j,3]*delta_deleterious_RNA[j]
beta_GWAS[j] <- group[j,1]*(-1)*delta_beneficial_GWAS[j] +
group[j,3]*delta_deleterious_GWAS[j]
}
# hyperparameters
lambda[1:3] ~ ddirch(alpha=alpha[])
# RNA-seq hyper-parameters
beta_cov_RNA[1:10] ~ dmnorm(mean_RNA[1:10], prec_RNA[1:10, 1:10])
# t_deleterious_RNA ~ dgamma(shape = 3, rate = 5)
# t_beneficial_RNA ~ dgamma(shape = 3, rate = 5)
# GWAS hyper-parameters
beta_cov_GWAS[1:3] ~ dmnorm(mean_GWAS[1:3], prec_GWAS[1:3, 1:3])
# t_deleterious_GWAS ~ dgamma(shape = 3, rate = 0.5)
# t_beneficial_GWAS ~ dgamma(shape = 3, rate = 0.5)
# impute missing sex for third sex category
for(k in 1:length(missing_sex_RNA)){
covariates_RNA[missing_sex_RNA[k], 10] ~ dbern(0.5) # sex is the 10th variable
}
})
Thanks for your time!
-Troy Wixson