Hi Zhang --
> Dear Sir/Madam,
>
> I read the paper on phastCons published in Genome Research. It is a
> really wonderful tool. However, because of my poor statistically
> background, I have several questions on PhastCons score.
>
> Firstly, how to judge a nucleotide is conserved or not given a score?
> In other words, is there such a threshold for PhastCons, 0.1, 0.8 or
> something like that? Furthermore, how to judge a gene is conserved or
> not? Minimum score, maximum score or average score, which one is
> better to describe a gene? Taking human as one example, for those
> genes with high score across full length, we know they might be
> conserved among vertebrate organisms. But for genes with moderate or
> lower scores, how could we know whether the genes exist in mouse, rat
> or more remote species? Is net/chain track useful in such a case? If
> yes, how to make use of net/chain tracks to judge whether ortholog of
> human genes do exist in mouse, rat or some other species?
>
> Secondly, how to compare two nucleotides? A nucleotide with a score
> greater than 0.9 is obviously more conserved than a nucleotide with a
> score less than 0.1. But it is not so simple in some other cases, such
> as 0.02 and 0.04, 0.8 and 0.9, etc. Which measurement is better,
> minus, ratio or log-odds ratio? Similarly, how to compare two genes
> (comparison of two series of scores)?
>
> Finally, there is no score for many genomic regions, which includes
> some regions not masked by repeatmasker. I wonder whether these
> regions mean one kind of species-specific feature, namely, none of
> subject genomes could align with them.
Many of these are questions without straightforward answers, but I'll
do what I can to address them.

There is of course no universal threshold for what is conserved and
what is not conserved. You may have to experiment with different
thresholds to find something appropriate for your purposes. It can be
useful to look at the scores of known functional elements (e.g.,
protein-coding genes or RNA genes) as a "yardstick", to help decide on
a threshold. The "most conserved" track is our attempt to provide a
one-size-fits-all classification of all sequences as conserved or
nonconserved. Note, however, that it is not obtained by applying a
threshold to the base-by-base scores, but by using the Viterbi rather
than the forward/backward algorithm with the HMM (see paper).

The nets and chains are useful for deciding whether particular
sequences exist in other species, and align "syntenically", i.e., as
part of chains of aligned sequences with consistent order and
orientation. Chains of aligned sequences are more likely to be
orthologous than are sequences in isolated "nonsyntenic" alignments.
You're right that the nets and chains are probably the best way to tell
if a gene exists in other species. The phastCons scores will tell you
how conserved it is. It could exist in other species, of course,
without being highly conserved.

The base-by-base scores are probabilities that each base belongs to a
conserved element. They are only as good as the statistical model we
have used, the alignment, the sequence data, etc. As you say, it is
difficult to say based on these scores that a base with a score of 0.04
is really "more conserved" than a base with a score of 0.02. Note that
the scores reflect not only each base itself, but the influence of
neighboring bases. So a column with all 'A's will get a higher score
when surrounded by other conserved columns than when it occurs in the
middle of a poorly conserved region.

Note also that the log odds scores are for whole conserved elements,
while the posterior probabilities are for individual bases. If you can
work with discrete "most conserved" predictions, then I would recommend
using the log odds scores. For other regions, e.g., as defined by
genes, your best bet is probably to take averages of the base-by-base
scores.

When you see no score, it is simply because there is no alignment
information in that region. There are various possible reasons for
this: recent transposon insertions, sequencing gaps, highly diverged
sequences, etc.
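
As an illustration of the averaging suggested above, here is a minimal
sketch in Python. It is not part of phastCons or the browser; the
function name region_mean_score, the dictionary-of-positions input, and
the toy numbers are all assumptions made for the example. Bases with no
score are skipped rather than counted as zero, since a missing score
means "no alignment information" and not "unconserved", and the
fraction of scored bases is returned alongside the mean so that a
sparsely covered region can be recognized as such.

```python
from typing import Dict, Optional, Tuple


def region_mean_score(
    scores: Dict[int, float], start: int, end: int
) -> Tuple[Optional[float], float]:
    """Average per-base scores over the half-open interval [start, end).

    `scores` maps a genomic position to its base-by-base score; positions
    with no alignment information are simply absent from the mapping.
    Returns (mean over the scored bases, fraction of bases that had a
    score). The mean is None when no base in the region has a score.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    # Collect only the positions that actually carry a score.
    values = [scores[pos] for pos in range(start, end) if pos in scores]
    coverage = len(values) / (end - start)
    mean = sum(values) / len(values) if values else None
    return mean, coverage


# Toy data (hypothetical): four scored bases in a six-base region.
toy_scores = {100: 0.9, 101: 0.8, 102: 1.0, 105: 0.1}
mean, coverage = region_mean_score(toy_scores, 100, 106)
print(mean, coverage)
```

Running the same function over known functional elements (for example,
annotated coding exons) gives a distribution of averages that can serve
as the "yardstick" when choosing a threshold for other regions.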
Hope this helps,
Adam Siepel
>
> Many many thanks ahead.
>
> Best regards,
> -----------------------------------------------------------------------
> -------------------------------------------------
> Zhang Yong
> Ph.D Candidate
> College of LifeScience, Peking University
> Beijing,PR China,100871
> E-mail: pkuzhangy at bj1860.net