phastCons model background frequencies and parameters


lenis vasilis

Sep 2, 2014, 11:07:54 AM
to gen...@soe.ucsc.edu
Hello Hiram,

I'm trying to predict the conserved elements among 7 ruminant genomes (sheep, goat, Tibetan antelope, yak, giraffe and Père David's deer), using cow (bosTau7) as the reference genome.
I have noticed that the background frequencies that you are using for the phylogenetic tree among 46 species are:

0.295000 0.205000 0.205000 0.295000

Should I stick to these numbers, or do I need to change them because of the small number of species?

I would also like to ask about the phastCons parameters.
Based on the parameters that you use for the 46 species, I'm using:

--rho 0.3
--expected-length 45
--target-coverage 0.3

Do you think these parameters are suitable for these species, or should I change them?
Could you suggest different values?
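
For readers following along, here is a minimal sketch of how the three parameters above might be passed to phastCons. The file names (ruminants.maf, nonconserved.mod, most-conserved.bed, scores.wig) are hypothetical placeholders and do not come from this thread. The script only prints the command unless phastCons and both input files are present.

```shell
#!/bin/sh
# Sketch only: all file names are hypothetical placeholders.
MAF=ruminants.maf        # multiple alignment of the ruminant genomes
MOD=nonconserved.mod     # neutral model estimated by phyloFit

# Parameter values quoted in the message above.
RHO=0.3
EXPECTED_LENGTH=45
TARGET_COVERAGE=0.3

# Build the command as a string (safe here because no name contains spaces).
CMD="phastCons --rho $RHO --expected-length $EXPECTED_LENGTH --target-coverage $TARGET_COVERAGE --msa-format MAF --most-conserved most-conserved.bed $MAF $MOD"
echo "$CMD > scores.wig"

# Run only when the tool and both inputs actually exist.
if command -v phastCons >/dev/null 2>&1 && [ -f "$MAF" ] && [ -f "$MOD" ]; then
    $CMD > scores.wig
fi
```

Conservation scores go to standard output (redirected to scores.wig here) and the predicted conserved elements are written to the file given by --most-conserved.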

Thank you very much in advance,

Vasilis.

Steve Heitner

Sep 3, 2014, 12:01:10 PM
to lenis vasilis, gen...@soe.ucsc.edu

Hello, Vasilis.

The background frequencies need to be calculated with the “4d” procedure using “phyloFit” with gene sets for each species.  You can find a clear example of this in our hg38 make doc which you can view in our source tree at http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/makeDb/doc/hg38.txt.  Look for the section titled “Phylogenetic tree from 7-way”.
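
As a rough illustration of the "4d" procedure, the following sketch follows the general pattern of extracting fourfold-degenerate sites with msa_view and fitting a model with phyloFit. It is not copied from the hg38 make doc. The input names (ruminants.maf, bosTau7.genes.gff, ruminants.nh) and the output names are hypothetical, and the gene annotation is assumed to be in the coordinate frame of the reference sequence. Nothing is executed unless both tools and all inputs are present.

```shell
#!/bin/sh
# Sketch only: all file names are hypothetical placeholders.
MAF=ruminants.maf          # multiple alignment
GENES=bosTau7.genes.gff    # CDS annotation for the reference species
TREE_FILE=ruminants.nh     # Newick tree topology for the species

# Step 1: pull out codons that contain fourfold-degenerate (4d) sites.
STEP1="msa_view $MAF --in-format MAF --4d --features $GENES"
# Step 2: reduce the codons to the 4d sites themselves (third positions).
STEP2="msa_view 4d-codons.ss --in-format SS --out-format SS --tuple-size 1"
# Step 3: fit a neutral model to the 4d sites.
STEP3="phyloFit --tree $TREE_FILE --msa-format SS --out-root nonconserved-4d 4d-sites.ss"

echo "$STEP1 > 4d-codons.ss"
echo "$STEP2 > 4d-sites.ss"
echo "$STEP3"

# Run only when the tools and all inputs actually exist.
if command -v msa_view >/dev/null 2>&1 && command -v phyloFit >/dev/null 2>&1 \
   && [ -f "$MAF" ] && [ -f "$GENES" ] && [ -f "$TREE_FILE" ]; then
    $STEP1 > 4d-codons.ss
    $STEP2 > 4d-sites.ss
    $STEP3
fi
```

With --out-root nonconserved-4d, phyloFit writes the fitted model to nonconserved-4d.mod, which includes the estimated background frequencies.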

The rho, expected-length and target-coverage phastCons parameters you referenced can be used as-is.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

--

lenis vasilis

Sep 8, 2014, 1:13:25 PM
to gen...@soe.ucsc.edu
Hello Steve,

Thank you very much for your answer.
As I see in the link that you sent me, you use the genes of the reference genome to generate the ss files with the 4d technique.
To do that I used the CDS files.
After that I use phyloFit to build the model.

The fourfold-degenerate sites sometimes have an unusual base composition, so the model estimated by phyloFit turns out to have a weird background distribution of A,C,G,T. 

So after running phyloFit, I manually change the background distribution like this:

modFreqs phyloFit.mod 0.295 0.205 0.205 0.295 > new.mod
Is that ok?
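
Before overriding the frequencies, it can help to confirm that the four values sum to 1 and to see what GC content they imply. The sketch below does that check with awk and then repeats the modFreqs call from this message. It assumes the values are ordered A, C, G, T and that phyloFit.mod exists in the current directory.

```shell
#!/bin/sh
# Sketch only: assumes the frequencies are ordered A C G T.
LC_ALL=C; export LC_ALL            # force "." as the decimal separator in awk
FREQS="0.295 0.205 0.205 0.295"

# The four background frequencies should sum to 1.
SUM=$(echo "$FREQS" | awk '{ printf "%.3f", $1 + $2 + $3 + $4 }')
# Implied GC content is freq(C) + freq(G).
GC=$(echo "$FREQS" | awk '{ printf "%.3f", $2 + $3 }')
echo "sum=$SUM GC=$GC"             # prints: sum=1.000 GC=0.410

# Apply the override only if the check passes and the tool and model exist.
if [ "$SUM" = "1.000" ] && command -v modFreqs >/dev/null 2>&1 && [ -f phyloFit.mod ]; then
    modFreqs phyloFit.mod $FREQS > new.mod
fi
```

These values correspond to a GC content of 41%, so the override is equivalent to imposing that GC content on the model.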
It seems you are suggesting that I use the genes of each genome to estimate the frequencies.
The problem is that I don't know how to find the genes of these genomes, since most of them are newly sequenced, and even if I had them, I would not know how to proceed.

I'm really sorry if my questions are a little bit childish, but I'm quite new to this field.

Thank you very much,
Vasilis.

Steve Heitner

Sep 8, 2014, 7:48:34 PM
to lenis vasilis, gen...@soe.ucsc.edu

Hello, Vasilis.

This is mostly correct.  You should only need the gene set for your reference species.  You should not need gene sets for all species.  For future reference, if you do need a gene set, you can always use Genscan (http://genes.mit.edu/GENSCAN.html) or Augustus (http://bioinf.uni-greifswald.de/augustus/) to obtain rough gene predictions.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users.  If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.



---
Steve Heitner
UCSC Genome Bioinformatics Group

--

lenis vasilis

Sep 9, 2014, 1:19:04 PM
to st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hello Steve,

Thank you very much for your help.
Have a nice day,

Vasilis.
