I’m currently a medical student at Stanford using your dnenrich software to test for enrichment of gene sets in the de novo mutations found in a large cohort of multiplex families with autism. We conducted a few preliminary runs, which worked very nicely, but as we begin our final analyses we have a few more questions and have been unable to reach Menachem Fromer, who is the contact person on the website. I was wondering if you might be able to either put me in touch with him or answer the questions yourself:
First, we have been unable to figure out from the online documentation how to incorporate individual sequencing depth into the gene size matrix file. Is there anywhere you could refer us for more detailed instructions on incorporating individual (or trio/quad-specific) sequencing depth into the gene size matrix file?
How you calculate the effective gene sizes for each child’s de novo mutations is a question I’ll get to briefly at the end of the email.
But, assuming you’ve calculated that, you’d simply insert that information in correspondingly-named rows in the “Gene size matrix” as described here:
For example, the row starting with:
"individual1 *:*"
means that the row gives the effective gene sizes for individual1, without subsetting to any particular base context or mutation type.
“individual1 *:NS”
gives the effective gene sizes for individual1, without subsetting to any particular base context, but only for NS mutations.
In particular, see here for the way we named the entries in the second column:
Second, we are a bit unclear on how to calculate the number of sites where each type of mutation can take place in each gene. As per the documentation, it appears that *:* should indicate the number of sites in each gene where any type of mutation can take place. However, for the first gene in the provided gene size matrix file, there are 1518 *:* mutations, but the sum of the LoF, NS, etc. mutations of type * (i.e., mutations designated *:___, with the underscore standing for LoF, NS, etc.) is 8516. If *:* indicates any base change with any functional significance, shouldn’t it be the sum of the LoF, NS, etc. mutations with type “*”?
You are correct about the semantics of these values.
And I agree that it’s a bit confusing that they do not sum to the *:* value, but the reason is that once we take base context into account, the space of possibilities becomes:
(Number of point mutations at each site) * (Number of sites in the exome) = 3 * N
instead of just N.
i.e., there is a multiplicative factor of 3 here.
It’s important to note that for dnenrich the absolute magnitudes of the effective gene sizes do not actually matter, only their relative proportions of total “exome”.
[This is the case because when we find an observed de novo mutation from a particular class, we throw it back down onto the exome at random, with multinomial probabilities defined by the relative proportion of the exome that each gene occupies.]
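To illustrate why only the relative proportions matter, here is a minimal Python sketch. It is not dnenrich's actual code; the gene names, sizes, and function names are made up for the example. It normalizes effective gene sizes into multinomial probabilities and shows that multiplying every size by 3 leaves those probabilities unchanged.

```python
import random

def gene_probabilities(sizes):
    """Turn effective gene sizes into multinomial probabilities."""
    total = sum(sizes.values())
    return {gene: size / total for gene, size in sizes.items()}

def place_mutations(sizes, n_mutations, rng):
    """Drop n_mutations onto genes at random, weighted by effective gene size."""
    genes = list(sizes)
    weights = [sizes[g] for g in genes]
    return rng.choices(genes, weights=weights, k=n_mutations)

# Hypothetical effective gene sizes (illustrative values only).
sizes = {"GENE_A": 1518.0, "GENE_B": 3036.0, "GENE_C": 506.0}
# The same sizes with the multiplicative factor of 3 applied.
scaled = {gene: 3 * size for gene, size in sizes.items()}

p_original = gene_probabilities(sizes)
p_scaled = gene_probabilities(scaled)

# One random placement of 5 mutations, seeded for reproducibility.
placed = place_mutations(sizes, 5, random.Random(0))
```

Since `p_original` and `p_scaled` agree, a permutation that uses either set of sizes places mutations with the same probabilities.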
Moreover, the LoF and NS categories are NOT INDEPENDENT here (as I’ve defined them). Specifically, the NS category includes ALL LoF mutations plus missense mutations and a (relatively) small number of other categories of mutations (e.g., stop-lost). On the other hand, as we’ve defined the categories here, the silent and NS categories should be mutually exclusive and should add up to 3 times the total gene span.
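As a toy illustration of these category relationships, the sketch below enumerates all three possible point mutations at every base of a short, made-up coding sequence and classifies each using the standard genetic code. It is a simplification under stated assumptions, not the PLINK/seq annotation we actually used: LoF here means nonsense mutations only (no splice-site or frameshift variants), and the sequence and function name are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def count_mutation_classes(cds):
    """Count possible point mutations per class over an in-frame coding sequence."""
    counts = {"silent": 0, "NS": 0, "LoF": 0}
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        ref_aa = CODON_TABLE[codon]
        for pos in range(3):
            for alt in BASES:
                if alt == codon[pos]:
                    continue
                mutant = codon[:pos] + alt + codon[pos + 1:]
                alt_aa = CODON_TABLE[mutant]
                if alt_aa == ref_aa:
                    counts["silent"] += 1
                else:
                    counts["NS"] += 1  # missense, nonsense, stop-lost
                    if alt_aa == "*":
                        counts["LoF"] += 1  # nonsense: a subset of NS
    return counts

toy_cds = "ATGAAATGGTAA"  # hypothetical: Met-Lys-Trp-Stop
counts = count_mutation_classes(toy_cds)
```

In this toy example, `counts["silent"] + counts["NS"]` equals 3 times the sequence length, while `counts["LoF"]` is already counted inside `counts["NS"]`, which is why summing LoF and NS overshoots.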
We were also wondering what database (gencode, etc.) and information you used to calculate these effective gene sizes / mutation sites. Is there any documentation you could refer us to that would explain how we might be able to calculate our own gene matrix file if we so choose?
We used a version of Refseq downloaded from the UCSC Genome Browser in April 2013.
This corresponds to the PLINK/seq locdb you can download here:
This is because we used PLINK/seq to annotate all variants exome-wide, as well as the de novo mutations.
For more details, see:
Unfortunately, I do not believe we have the bandwidth in the foreseeable future to provide scripts for generating the matrix.
You should be aware that it can take a lot of computational resources to create these matrices (annotating each base in the exome, then possibly counting coverage for each sequenced individual, and then cross-tabulating that into a single matrix).
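For orientation only, the cross-tabulation step could be sketched in Python roughly as follows. This is one plausible reading of the procedure, not our actual pipeline: the input data structures, the function name, and the `min_depth` threshold of 10 are all assumptions made for the example. Only the row labels such as "individual1 *:NS" follow the naming described earlier in this email.

```python
def effective_gene_sizes(site_counts, depth, min_depth=10):
    """Cross-tabulate per-base annotations with per-individual coverage.

    site_counts: {gene: {position: {mutation_class: n_possible_mutations}}}
    depth:       {individual: {position: read_depth}}
    Returns {(individual, "*:<class>"): {gene: effective_size}}.
    """
    rows = {}
    for individual, coverage in depth.items():
        for gene, sites in site_counts.items():
            for position, classes in sites.items():
                # Skip bases not adequately covered in this individual
                # (hypothetical threshold).
                if coverage.get(position, 0) < min_depth:
                    continue
                for mutation_class, n in classes.items():
                    row = rows.setdefault((individual, "*:" + mutation_class), {})
                    row[gene] = row.get(gene, 0) + n
    return rows

# Toy inputs (all values invented for illustration).
site_counts = {
    "GENE_A": {100: {"NS": 3}, 101: {"NS": 2, "silent": 1}},
    "GENE_B": {200: {"NS": 3}},
}
depth = {
    "individual1": {100: 30, 101: 5, 200: 12},
    "individual2": {100: 30, 101: 30},
}
rows = effective_gene_sizes(site_counts, depth)
```

Each key in `rows` corresponds to one row of the matrix; a gene missing from a row simply had no adequately covered bases in that individual.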
Thank you so much for your help,
Chloe O'Connell