regenerating mod file for 100way phastcons

73 views
Skip to first unread message

Song, Janet

unread,
Feb 22, 2022, 11:36:09 AM2/22/22
to gen...@soe.ucsc.edu, Doan, Ryan

Hi,


We are having trouble regenerating the 100way phastcons mod file (https://hgdownload.cse.ucsc.edu/goldenPath/hg38/phastCons100way/hg38.phastCons100way.mod) and are hoping to get some advice on how to do so. 


We are following the methods section from https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=cons100way and are running the following commands using the chr22 maf file available on UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/hg38/multiz100way/maf/chr22.maf.gz):


msa_view $maf_file --in-format MAF --4d --features $gff3_file > $ss_codon_file
msa_view $ss_codon_file --in-format SS --out-format SS --tuple-size 1 > $ss_sites_file
phyloFit --EM --precision MED --msa-format SS --subst-mod REV --tree $nh_file $ss_sites_file --out-root $mod_file


Our input gff3 file is from RefSeq Select and only contains protein-coding genes. We have also tried using gff3 from Gencode filtered for protein-coding genes but gotten similar results (that are not similar to the downloadable hg38.phastCons100way.mod).


Are we using the wrong gff3 file? We noticed that the methods mention filtering for "single-coverage long transcripts" ... what exactly does that mean? Is it important to compile the mod files across all chromosomes (we are only doing chr22 right now for testing)? Is there something else that we may be missing?


Thanks!


Best,

Janet

Brian Lee

unread,
Feb 23, 2022, 7:13:29 PM2/23/22
to Song, Janet, gen...@soe.ucsc.edu, Doan, Ryan
Dear Janet,

Thank you for using the UCSC Genome Browser and your question about regenerating the 100way phastcons mod file.

The steps taken to create our files are documented in what we call our makeDb/doc, and you can find detailed notes here: https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/hg/makeDb/doc/hg38/multiz100way.txt

Many of the first steps are putting the proper names of the species into the input information. Here is the line when phyloFit is used to create the tree model (output is phyloFit.mod)

   phyloFit --EM --precision MED --msa-format FASTA --subst-mod REV --tree tree-commas.nh 4d.all.mfa

It is renamed to all.mod and then eventually linked to the file that becomes hg38.phastCons100way.mod on hgdownload referenced. It stands out to me that this uses --msa-format FASTA versus the SS in your shared steps, earlier in the multiz100way.txt makeDb/doc there are steps about building the 4d.all.mfa with msa_view:

   msa_view --aggregate "`cat ../species.list`" mfa/*.mfa | sed s/"> "/">"  > 4d.all.mfa

I hope this is helpful, and thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please send new questions to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum to help others find answers to similar questions. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu, which is a private internal list to our support team.

All the best,

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/1645543398782.54408%40childrens.harvard.edu.
Reply all
Reply to author
Forward
0 new messages