Hi, thanks for directing me to the code; that's helpful. It does seem to confirm what I guessed those files are: the reference genome, such as would be downloadable from
Homo sapiens genome assembly GRCh37 - NCBI - NLM (nih.gov). You can see in the % STEP2 section (starting at line 530) that these files are used to look up the bases to the left (leftpos) and right (leftpos+2) of a single nucleotide variant (its "trinucleotide context"). These are then used in % STEP4 (starting at line 563) to assign a category to each variant.
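To make the lookup concrete, here is a minimal Python sketch of what I understand that step to do. It is not the module's own code: the function name is made up for illustration, and it assumes the chromosome sequence is already in memory as a plain string with 1-based variant positions.

```python
def trinucleotide_context(sequence: str, pos: int) -> str:
    """Return the base at 1-based position `pos` with its left and right neighbours.

    In the module's terms (as I read it), leftpos = pos - 1 and the right
    neighbour sits at leftpos + 2 = pos + 1.
    """
    if pos < 2 or pos > len(sequence) - 1:
        raise ValueError("position has no neighbour on one side")
    # 1-based pos -> 0-based slice [pos - 2, pos + 1) covers left, variant, right.
    return sequence[pos - 2 : pos + 1].upper()


print(trinucleotide_context("ACGTAC", 3))  # prints CGT
```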
Presumably, when I select hg19 on the MutSigCV module page (the instructions there read: "Genome build to use for automatic category and effect discovery. This is necessary only if you are using a MAF file without the columns "categ" and "effect"."), it tells MutSigCV to look up the chr_files in the chr_files_directory associated with build hg19 (GRCh37). It appears from lines 375-378 that these files are named chr1.txt, chr2.txt, etc. Do you perhaps have files like that stored in another folder somewhere, or can a standard download, such as the one at the NCBI link above, be used as a replacement?
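If a standard download can be used, something like the following Python sketch might produce files with those names. It rests on assumptions I have not verified: that each chr file is just the raw sequence with no header line and no line breaks, and that the FASTA headers are already UCSC-style names such as ">chr1". An NCBI download uses accession-style headers, so the names would need to be mapped first. The function name `split_fasta` is hypothetical.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write one <name>.txt per FASTA record, containing only the sequence.

    Assumption: the target format is raw sequence, no header, no newlines.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # New record: close the previous file and open the next.
                    if out is not None:
                        out.close()
                    fields = line[1:].split()
                    name = fields[0] if fields else "unnamed"
                    path = out_dir / f"{name}.txt"
                    out = open(path, "w")
                    written.append(path)
                elif out is not None:
                    # Drop the line break so the sequence is contiguous.
                    out.write(line.strip())
    finally:
        if out is not None:
            out.close()
    return written
```

Calling `split_fasta("hg19.fa", "chr_files")` (both paths are placeholders) would then write chr_files/chr1.txt, chr_files/chr2.txt, and so on, one per record in the FASTA file.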
Assigning the categories described on the module description page you linked to is part of the power of this method (for reasons described in Lawrence et al., 2013,
https://www.nature.com/articles/nature12213#Sec3). However, because the module can't find the chr_files, it defaults to two categories: "missense" and "null+indel". These categories appear to be very loosely defined and are essentially supersets of the "effect" I have already designated (using the mutation table and mutation type dictionary), when they are actually supposed to be independent of "effect". I'm concerned that defaulting to these two categories weakens the relevance of the results.
The preprocessing section of the MutSigCV module (which includes all the lines of code we've identified so far) looks up the trinucleotide context from the reference genome in order to assign single nucleotide variants to the defined categories, but some of the variants that can be passed in are insertions or deletions. I see there is also a section of preprocessing to identify indels--at least, I think this is what lines 583-597 are doing. What's odd to me is that I don't see an indel category listed in the "full coverage" table supplied by MutSigCV (previously found at shared_data/example_files/MutSigCV_1.3/exome_full192.coverage.txt). In addition, the documentation indicates that 'categ' in the mutation file is supposed to match the categories in the coverage table. However, when the module defaults to the two categories "missense" and "null+indel" in the mutation table while still using the "full coverage" table, whose categories are based on trinucleotide context (e.g., A(A->C)A, A(A->C)C, etc.), the categories in the mutation table and the coverage table don't match, and I would expect it to throw an error. But it doesn't. I don't quite understand why, and I wonder if there is a bug there.
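The kind of consistency check I would have expected can be sketched in a few lines of Python. This is only an illustration, not anything the module does: the helper names are hypothetical, and it assumes both tables are tab-delimited with a header row and a column literally named "categ".

```python
import csv


def column_values(path, column="categ"):
    """Collect the distinct values of one column from a tab-delimited table."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"{path} has no column named {column!r}")
        return {row[column] for row in reader}


def unmatched_categories(mutation_path, coverage_path):
    """Return categories used in the mutation table but absent from the coverage table."""
    return column_values(mutation_path) - column_values(coverage_path)
```

With the two default categories in the mutation table and the trinucleotide-based coverage table, I would expect `unmatched_categories` to return both "missense" and "null+indel", which is the mismatch I'm describing.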
This should all be independent of the "effect" assignment based on the mutation type dictionary. From the documentation, as long as the mutation types in the mutation table match those in the dictionary, the "effect" from the mutation type dictionary should map to the mutation table without issue. I used the same types of "effect" as in the standard example (silent, nonsilent, noncoding), and just used a custom dictionary to map them. The dictionary and mutation table are tab-delimited files, so commas in the "Variant_Classification" column shouldn't cause an issue.
If you don't think you have or can easily get reference genome files to replace the "chr_files", I'll send an email as you suggest and see if they can give some direction.
Thanks!