[Genome] -score in chrN_chainHg17-


-Christophe-

Aug 6, 2004, 6:34:34 AM
to gen...@soe.ucsc.edu
Hi,
I'm wondering how to interpret the "score" field in the table
"chrN_chainHg17" of the annotated MySQL database (for the dog genome).
The field is described as: double score; "score of chain". Above which
value (1000000?) is this score significant? Thanks in advance for your
help and your time.
christophe

Angie Hinrichs

Aug 6, 2004, 12:49:16 PM
to -Christophe-, gen...@soe.ucsc.edu
Hi Christophe,

The chain scores are computed in a way very similar to blastz
alignment scores, except that long gaps are not penalized as harshly.
Both blastz and the axtChain program use this substitution matrix by
default, and this is what we used for human-dog:

         A     C     G     T
   A    91  -114   -31  -123
   C  -114   100  -125   -31
   G   -31  -125   100  -114
   T  -123   -31  -114    91
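
To illustrate how the matrix is applied, here is a minimal Python sketch
that sums the per-column scores over an ungapped block. The name
`ungapped_score` and the example sequences are made up for illustration
and are not part of blastz or axtChain; ambiguous bases such as N are
not handled.

```python
# Substitution matrix quoted above (rows and columns in A, C, G, T order).
BASES = "ACGT"
MATRIX = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]


def ungapped_score(target, query):
    """Sum matrix scores over an ungapped block (hypothetical helper)."""
    if len(target) != len(query):
        raise ValueError("an ungapped block needs equal-length sequences")
    # BASES.index raises ValueError for anything other than A, C, G, T.
    return sum(
        MATRIX[BASES.index(t)][BASES.index(q)]
        for t, q in zip(target.upper(), query.upper())
    )


if __name__ == "__main__":
    print(ungapped_score("ACGT", "ACGT"))  # four matches: 91 + 100 + 100 + 91
    print(ungapped_score("ACGT", "ACGA"))  # last column is a T/A mismatch
```

Under this matrix a transition (A/G or C/T) costs 31, while a
transversion costs between 114 and 125.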

The big difference is in how gaps are penalized. Instead of a typical
"gap open + gap extend" penalty, axtChain uses a piecewise linear
function defined by these points:

position     1    2    3    11   111  2111  12111  32111   72111  152111  252111
qGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
tGap       350  425  450   600   900  2900  22900  57900  117900  217900  317900
bothGap    750  825  850  1000  1300  3300  23300  58300  118300  218300  318300

Think of position as the X axis, in bp, and *Gap as the Y axis, in
penalty points (same units as the substitution matrix scores). Straight
lines connect successive points.

So if there's a gap of 1 base in target or query, the penalty is 350.
If there's a gap of 1 base in both target and query, 750.
If there's a gap of 8 bases in both, then we interpolate between the
penalties for 3 bases and 11 bases:

850 + ( (1000 - 850) * (8 - 3) / (11 - 3) ) = 943.75
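
The same interpolation can be sketched in Python. This is only an
illustration of the piecewise linear lookup described above, not the
axtChain source: the names `gap_penalty` and `GAP_TABLE` are
hypothetical, and since nothing here says what happens for gaps longer
than the last table point (252111), the sketch simply refuses them.

```python
from bisect import bisect_right

# Control points copied from the table above.
POSITIONS = [1, 2, 3, 11, 111, 2111, 12111, 32111, 72111, 152111, 252111]
GAP_TABLE = {
    "q": [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900],
    "t": [350, 425, 450, 600, 900, 2900, 22900, 57900, 117900, 217900, 317900],
    "both": [750, 825, 850, 1000, 1300, 3300, 23300, 58300, 118300, 218300,
             318300],
}


def gap_penalty(size, kind):
    """Penalty for a gap of `size` bp; kind is "q", "t" or "both"."""
    penalties = GAP_TABLE[kind]
    if size < POSITIONS[0] or size > POSITIONS[-1]:
        # Assumption: behaviour outside the table is not covered here.
        raise ValueError("gap size outside the range of the table")
    # Index of the last control point whose position is <= size.
    i = bisect_right(POSITIONS, size) - 1
    if POSITIONS[i] == size:
        return float(penalties[i])
    x0, x1 = POSITIONS[i], POSITIONS[i + 1]
    y0, y1 = penalties[i], penalties[i + 1]
    # Straight line between the two neighbouring control points.
    return y0 + (y1 - y0) * (size - x0) / (x1 - x0)


if __name__ == "__main__":
    print(gap_penalty(1, "t"))     # single-sided gap of 1 base
    print(gap_penalty(1, "both"))  # double-sided gap of 1 base
    print(gap_penalty(8, "both"))  # the worked example: 943.75
```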

Why do this? So that penalties for very long gaps (and double-sided
gaps, which are not allowed in blastz and most aligners) are not
prohibitive, and very long chains can be constructed from shorter
blastz alignments.

Picking a score threshold for chains is a tricky business... scores
vary hugely with length as well as conservation. This scoring scheme
allows us to recognize long chains in syntenic regions, but it also
retains almost anything from blastz. That's why we also have the
"net" tracks -- to keep the best chains and ignore most of the
"fluff". So I can't suggest a threshold based on chain score, but
hopefully it will be helpful to know how the scores are generated.

Angie