Gene size matrix clarification

chloe.p....@gmail.com

Sep 8, 2015, 7:42:32 PM

to dnenrich-users

Hi,

I really enjoyed your paper on synaptic network de novo mutations in schizophrenia. I am thinking of trying to adapt the dnenrich software for another dataset, and had a question regarding the gene size matrix input file. I looked at the documentation online, but am still having trouble understanding what you mean by the "relative sizes of each gene for the corresponding base composition and functional impact of mutation."

If the type of mutation is the 2nd column in this file, why does each gene (column) change size in the example file based on the type of mutation (i.e. as you move down rows within a column, corresponding to different mutation types, why does the size change for a given gene)?

Is there anywhere I could go to get more information about the gene size matrix input file and how to populate it for a given dataset?

Thanks so much,

Chloe O'Connell

Menachem Fromer

Sep 9, 2015, 9:45:24 AM

to chloe.p....@gmail.com, dnenrich-users

Hi,

To populate this matrix in its most basic form, you need only a single non-header row labeled *:* (denoting all base contexts and functional annotations) that gives the coding size of each gene.

What is currently in the matrix is a subdivided form of the above: instead of counting the total number of genomic bases a gene encompasses, we count only those bases that, for example, could undergo a CTG to CAG mutation (looking at one genomic base before and after the mutated base) that results in a missense annotation in that gene.

This is useful because it tells us the relative size of each gene with respect to the properties of a particular de novo mutation that was observed, that is, the number of genomic sites where such a mutation might happen, broken down by gene.
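As a rough illustration of the subdivided counting described above, here is a minimal Python sketch. It is not the procedure used to build the dnenrich matrix; it assumes a toy in-frame coding sequence on a single exon, the standard genetic code, and only three annotation classes (silent, missense, nonsense). The function names and the (context, annotation) key layout are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(ref_codon, alt_codon):
    """Crude annotation: silent, nonsense, or missense (everything else)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"  # simplification: stop-loss changes also land here


def count_sites(cds):
    """Count possible point mutations per (trinucleotide change, annotation).

    Assumes `cds` is an in-frame coding sequence (length divisible by 3).
    The first and last bases are skipped because this toy example has no
    flanking sequence to supply their neighbouring base.
    """
    counts = Counter()
    for i in range(1, len(cds) - 1):
        start = 3 * (i // 3)              # start of the codon containing base i
        ref_codon = cds[start:start + 3]
        offset = i - start                # position of base i within its codon
        for alt in BASES:
            if alt == cds[i]:
                continue
            alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
            context = f"{cds[i - 1:i + 2]}>{cds[i - 1]}{alt}{cds[i + 1]}"
            counts[(context, classify(ref_codon, alt_codon))] += 1
    return counts


toy_gene = "ATGCTGTAA"  # Met-Leu-Stop, made up for illustration
sizes = count_sites(toy_gene)
print(sizes[("CTG>CAG", "missense")])  # sites where CTG -> CAG gives a missense change
```

Summing such counts over all contexts and annotations for a gene recovers (three times) the number of bases considered, which corresponds to the basic *:* view of gene size.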


Menachem Fromer

Dec 25, 2015, 4:27:59 PM

to Chloe O'Connell, dnenrich-users

Hi Chloe,

If used, the effective gene size for an individual should reflect the number of possible genomic sites at which coverage was sufficient for a de novo mutation to be detected (if base context is used, multiply this by 3 to account for the 3 possible non-reference point mutations at each site).

For example, in our study we counted the number of sites in each gene at which the mother, father, and child each had coverage of at least 10x (i.e., 10 reads), since that is what we required when we used PlinkSeq to call de novo mutations (thus no de novos would be detected at a site where any of the 3 individuals had fewer than 10 reads).

See here for more info on how to run PlinkSeq’s de novo mutation detection command:

https://groups.google.com/forum/#!msg/plinkseq-users/RoHz_3gjdv4/uniWI3lWG5MJ

Note that the parameter names have changed slightly since that post; you can always find the latest parameters and explanations by running the following command (I’ve also added the output here for convenience):

pseq help denovo

denovo : filter for de-novo mutations and transmitted variants (SNPs and indels)
---------------------------------------------------------
  --allowDoubleAltDeNovos { flag } Include de novos that consist of two alternate alleles in child
  --minChildDP { int } Minimum child depth
  --minChildPL { float } Minimum child PL (genotype likelihood) for non-called genotype
  --minHet_AB_alt { float } Minimum AB-ALT (% of reads with ALT allele) for heterozygous individual
  --minHet_AB_ref { float } Minimum AB-REF (% of reads with REF allele) for heterozygous individual
  --minHomAlt_AB_alt { float } Minimum AB-ALT (% of reads with ALT allele) for homozygous alternate individual
  --minHomRef_AB_ref { float } Minimum AB-REF (% of reads with REF allele) for homozygous reference individual
  --minMQ { float } Minimum MQ (read mapping quality)
  --minParDP { int } Minimum parental depth
  --minParPL { float } Minimum parental PL (genotype likelihood) for non-called genotype
  --printTransmission { flag } Parent variants will be printed with transmission status

Specifically, in our work, we used:

--minChildDP 10

--minParDP 10
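The coverage-based counting described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in the study: the per-site depth dictionaries, the function name, and the `alts_per_site` argument are all hypothetical, and in practice the depths would come from your own coverage files.

```python
def effective_gene_size(gene_sites, child_dp, mother_dp, father_dp,
                        min_dp=10, alts_per_site=1):
    """Count sites in a gene where all three trio members reach `min_dp` reads.

    gene_sites    : iterable of site identifiers (e.g. genomic positions)
    *_dp          : dicts mapping site -> read depth (hypothetical inputs);
                    sites missing from a dict are treated as depth 0
    alts_per_site : use 3 when sizes are broken down by base context, to
                    count the 3 possible non-reference point mutations
    """
    covered = sum(
        1
        for site in gene_sites
        if child_dp.get(site, 0) >= min_dp
        and mother_dp.get(site, 0) >= min_dp
        and father_dp.get(site, 0) >= min_dp
    )
    return covered * alts_per_site


# Made-up depths for four sites in one gene.
sites = [100, 101, 102, 103]
child = {100: 12, 101: 30, 102: 9, 103: 15}
mother = {100: 10, 101: 25, 102: 40, 103: 8}
father = {100: 11, 101: 10, 102: 20, 103: 22}

print(effective_gene_size(sites, child, mother, father))
print(effective_gene_size(sites, child, mother, father, alts_per_site=3))
```

Here the default `min_dp=10` mirrors the --minChildDP 10 and --minParDP 10 thresholds above; a site is excluded as soon as any one member of the trio falls below it.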

Best,

Menachem

On Dec 8, 2015, at 3:43 PM, Chloe O'Connell <chloe.p....@gmail.com> wrote:

Hi Dr. Fromer,

You mentioned in earlier correspondence that it is possible to include individual-specific coverage information in the gene matrix file. You mention in the documentation:

The per-trio gene sizes should be used if one has calculated the effective gene sizes after requiring that there be sufficient sequencing coverage in all 3 members of the trio at the corresponding bases in the respective genes.

How are the per-trio effective gene sizes calculated? Is there a way to adapt it for multiplex families, and/or incorporate per-individual effective gene size rather than per-trio effective gene size? And how do those get incorporated into the gene size matrix? Any more information on this would be much appreciated - I've read the documentation but have been unable to figure out exactly how we can incorporate this information into our runs of dnenrich.

Best,

Chloe O'Connell

On Wed, Sep 9, 2015 at 4:21 PM, Menachem Fromer <fro...@broadinstitute.org> wrote:

On Sep 9, 2015, at 12:34 PM, Chloe O'Connell <chloe.p....@gmail.com> wrote:

Thank you very much for your response. So it sounds like the gene size matrix file is something that can be re-used from study to study, if it just has information about the number of genomic sites where each type of mutation can happen in any given gene.

Indeed, that is the point.

I guess where my confusion lies, then, is in the group number. In the example file (https://psychgen.u.hpc.mssm.edu/dnenrich/started.shtml#gene_size_matrix), individual 1 has 45 sites in gene A where a *:LoF (any type with an impact of "loss of function") mutation can occur, whereas individual 2 has 49 such sites. How are these numbers calculated per individual? Am I wrong in saying that this file can be re-used between studies?

The matrix that we have online is generic and not specific to any particular sequenced individual(s). Dnenrich allows the user to incorporate individual-specific information since for any particular sample the coverage at specific genomic sites will be different. If you don't have that info readily available (and I agree it's not trivial to generate and collate all such statistics), then you can simply use the generic matrix we provide by downloading it.

In the work for our paper, we found that for post-QC individuals, the individual-specific coverage info did not drastically affect any of the significance results for pathway enrichment in practice.

Best, and thank you again for all your help,

Chloe O'Connell
