[Genome] Three questions on PhastCons (or net/chain)


Adam Siepel

Oct 10, 2005, 4:17:53 PM
Hi Zhang --

> Dear Sir/Madam,
>
> I read the paper on phastCons published in Genome Research. It is a
> really wonderful tool. However, because of my poor statistical
> background, I have several questions about the PhastCons score.
>
> Firstly, how can one judge whether a nucleotide is conserved, given
> its score? In other words, is there a threshold for PhastCons, such
> as 0.1 or 0.8? Furthermore, how can one judge whether a gene is
> conserved? Which describes a gene best: the minimum, maximum, or
> average score? Taking human as an example, genes with high scores
> across their full length are probably conserved among vertebrates.
> But for genes with moderate or low scores, how can we know whether
> they exist in mouse, rat, or more distant species? Are the net/chain
> tracks useful in such a case? If so, how can they be used to judge
> whether orthologs of human genes exist in mouse, rat, or other
> species?
>
> Secondly, how should two nucleotides be compared? A nucleotide with a
> score greater than 0.9 is obviously more conserved than one with a
> score less than 0.1. But it is not so simple in other cases, such as
> 0.02 versus 0.04, or 0.8 versus 0.9. Which measure is better: the
> difference, the ratio, or the log-odds ratio? Similarly, how should
> two genes (two series of scores) be compared?
>
> Finally, many genomic regions have no score, including some regions
> not masked by RepeatMasker. I wonder whether these regions represent
> species-specific features, that is, whether none of the other
> genomes could be aligned to them.

Many of these are questions without straightforward answers, but I'll
do what I can to address them.

There is of course no universal threshold for what is conserved and
what is not conserved. You may have to experiment with different
thresholds to find something appropriate for your purposes. It can be
useful to look at the scores of known functional elements (e.g.,
protein-coding genes or RNA genes) as a "yardstick", to help decide on
a threshold. The "most conserved" track is our attempt to provide a
one-size-fits-all classification of all sequences as conserved or
nonconserved. Note, however, that it is not obtained by applying a
threshold to the base-by-base scores, but by using the Viterbi rather
than the forward/backward algorithm with the HMM (see paper).

The nets and chains are useful for deciding whether particular
sequences exist in other species, and align "syntenically", i.e., as
part of chains of aligned sequences with consistent order and
orientation. Chains of aligned sequences are more likely to be
orthologous than are sequences in isolated "nonsyntenic" alignments.
You're right that the nets and chains are probably the best way to tell
if a gene exists in other species. The phastCons scores will tell you
how conserved it is. It could exist in other species, of course,
without being highly conserved.

The base-by-base scores are probabilities that each base belongs to a
conserved element. They are only as good as the statistical model we
have used, the alignment, the sequence data, etc. As you say, it is
difficult to say based on these scores that a base with a score of 0.04
is really "more conserved" than a base of 0.02. Note that the scores
reflect not only each base itself, but the influence of neighboring
bases. So a column with all 'A's will get a higher score when
surrounded by other conserved columns than when it occurs in the middle
of a poorly conserved region.

Note also that the log odds scores are for whole conserved elements,
while the posterior probabilities are for individual bases. If you can
work with discrete "most conserved" predictions, then I would recommend
using the log odds scores. For other regions, e.g., as defined by
genes, your best bet is probably to take averages of the base-by-base
scores.

When you see no score, it is simply because there is no alignment
information in that region. There are various possible reasons for
this: recent transposon insertions, sequencing gaps, highly diverged
sequences, etc.
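
To illustrate the point above about neighboring bases, here is a toy
two-state HMM (conserved / nonconserved) with a forward/backward pass,
written in Python. It is only a sketch under stated assumptions: the
function name posterior_conserved, the transition probabilities, and
the per-column likelihoods are all made up for illustration and are
not phastCons parameters or output. In phastCons the column
likelihoods come from phylogenetic models; here they are simply given
as numbers.

```python
def posterior_conserved(lik_c, lik_n, stay_c=0.9, stay_n=0.9, prior_c=0.5):
    """Forward/backward posterior P(state = conserved) for each column.

    lik_c[i], lik_n[i]: likelihood of column i under the conserved and
    nonconserved states (hypothetical numbers, not phastCons output).
    """
    n = len(lik_c)
    # trans[a][b] = P(next state b | current state a); 0 = conserved, 1 = nonconserved
    trans = [[stay_c, 1.0 - stay_c], [1.0 - stay_n, stay_n]]
    emit = list(zip(lik_c, lik_n))
    init = [prior_c, 1.0 - prior_c]

    # Forward pass, rescaled at every column to avoid numerical underflow.
    fwd = []
    prev = None
    for i in range(n):
        if i == 0:
            cur = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            cur = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        total = sum(cur)
        cur = [x / total for x in cur]
        fwd.append(cur)
        prev = cur

    # Backward pass, rescaled the same way.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        total = sum(cur)
        bwd[i] = [x / total for x in cur]

    # Combine and normalize per column; the per-column scaling cancels out.
    post = []
    for f, b in zip(fwd, bwd):
        z = f[0] * b[0] + f[1] * b[1]
        post.append(f[0] * b[0] / z)
    return post


# The middle column has identical likelihoods in both runs; only its
# neighbors differ.
in_conserved = posterior_conserved([0.8, 0.8, 0.6, 0.8, 0.8],
                                   [0.2, 0.2, 0.4, 0.2, 0.2])
in_diverged = posterior_conserved([0.2, 0.2, 0.6, 0.2, 0.2],
                                  [0.8, 0.8, 0.4, 0.8, 0.8])
print(in_conserved[2], in_diverged[2])
```

The same column gets a higher posterior in the first run than in the
second, purely because of its context. This is also why a small
difference between two base-by-base scores is hard to interpret on its
own.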

Hope this helps,
Adam Siepel



>
> Many thanks in advance.
>
> Best regards,
> -----------------------------------------------------------------------
> Zhang Yong
> Ph.D Candidate
> College of LifeScience, Peking University
> Beijing,PR China,100871
> E-mail: pkuzhangy at bj1860.net

zhangy

Oct 10, 2005, 10:27:56 PM
Dear Dr. Adam Siepel,

A million thanks for your help; I understand most of it. However, I am still a little confused about how to compare two genes or regulatory elements in terms of average PhastCons score. As I discussed in my previous letter, which would be appropriate: the difference, the ratio, or the log-odds ratio? Or is there a better measure you can recommend?
Additionally, you mentioned that the chains and nets are of great use for judging whether several genes fall in the same syntenic chain. I found that you also implemented such a filter in your Genome Research paper. I wonder how you did this in batch mode at the time.
Best regards,

======= 2005-10-11 you wrote: =======

>Hi Zhang --
>
>> Dear Sir/Madam,
>>
>>
>> Secondly, how should two nucleotides be compared? A nucleotide with a
>> score greater than 0.9 is obviously more conserved than one with a
>> score less than 0.1. But it is not so simple in other cases, such as
>> 0.02 versus 0.04, or 0.8 versus 0.9. Which measure is better: the
>> difference, the ratio, or the log-odds ratio? Similarly, how should
>> two genes (two series of scores) be compared?
>>
= = = = = = = = = = = = = = = = = = = =



------------------------------------------------------------------------------------------------------------------------
Zhang Yong
Ph.D Candidate
College of Life Science,Peking University
E-mail: pkuzhangy at bj1860.net

Adam Siepel

Oct 11, 2005, 12:54:17 AM

On Oct 10, 2005, at 7:27 PM, zhangy wrote:

> Dear Dr. Adam Siepel,
>
> A million thanks for your help; I understand most of it. However, I
> am still a little confused about how to compare two genes or
> regulatory elements in terms of average PhastCons score. As I
> discussed in my previous letter, which would be appropriate: the
> difference, the ratio, or the log-odds ratio? Or is there a better
> measure you can recommend?

I would just average the base-by-base scores (posterior probabilities)
within the features of interest. The resulting score could be
interpreted as a posterior expected fraction conserved.
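
In other words, for a feature of L scored bases with posterior
probabilities p_1, ..., p_L, the summary is (p_1 + ... + p_L) / L. A
minimal Python sketch of that averaging follows. It assumes the
per-base scores have already been loaded into a dict keyed by position,
which is a hypothetical layout chosen for illustration, and the
function name mean_phastcons is likewise made up. Bases with no score
(no alignment) are skipped rather than counted as zero; that is a
choice you may want to make differently for your own analysis.

```python
def mean_phastcons(scores, start, end):
    """Average per-base scores over the half-open interval [start, end).

    scores: dict mapping position -> posterior probability; positions
    without alignment data are simply absent (an assumed layout).
    Returns (mean, n_scored); mean is None if no base has a score.
    """
    values = [scores[pos] for pos in range(start, end) if pos in scores]
    if not values:
        return None, 0
    return sum(values) / len(values), len(values)


scores = {100: 1.0, 101: 0.5, 103: 0.0}  # position 102 has no alignment
mean, n = mean_phastcons(scores, 100, 104)
print(mean, n)  # prints: 0.5 3
```

Returning the number of scored bases alongside the mean makes it easy
to set aside features that are mostly unaligned before comparing
averages between genes.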

> Additionally, you mentioned that the chains and nets are of great use
> for judging whether several genes fall in the same syntenic chain. I
> found that you also implemented such a filter in your Genome Research
> paper. I wonder how you did this in batch mode at the time.

Our filter was a little more involved. We selected a subset of chains
that met certain length and score thresholds as a "syntenic net" (there
is a program in Jim Kent's toolkit called "netFilter" that will do
this), then discarded predicted conserved elements that overlapped
these chains by less than 90%. I think you'd need to download the
chains and nets in bulk as well as the kent source code in order to
reproduce this filter. You might be able to achieve something similar
using the Table Browser, but I'm not good enough with it to tell you
exactly how. Someone else may be able to guide you on this...
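
The overlap step of such a filter could be sketched roughly as follows
in Python. This is not the code used for the paper. It assumes the
syntenic chains have already been selected (for example with
netFilter) and reduced to half-open (start, end) intervals on a single
chromosome; the function names and the interval representation are
illustrative assumptions.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(element, merged):
    """Fraction of the element's bases that fall inside the merged intervals."""
    start, end = element
    covered = sum(max(0, min(end, m_end) - max(start, m_start))
                  for m_start, m_end in merged)
    return covered / (end - start)


def filter_elements(elements, chains, min_fraction=0.9):
    """Keep elements whose coverage by the chains is at least min_fraction."""
    merged = merge_intervals(chains)
    return [e for e in elements if covered_fraction(e, merged) >= min_fraction]


chains = [(0, 100), (90, 200), (500, 600)]
elements = [(50, 150), (195, 205), (480, 600)]
kept = filter_elements(elements, chains)
print(kept)  # prints: [(50, 150)]
```

Merging the chain intervals first keeps bases covered by two chains
from being counted twice. For genome-scale input you would want
something faster than this linear scan, such as a sorted sweep or an
interval tree.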

Adam