[Genome] netFilter to produce syntenic regions

98 views
Skip to first unread message

Danielle Lemay

unread,
Feb 6, 2010, 6:21:28 PM2/6/10
to gen...@soe.ucsc.edu
Hi,

This question is regarding the netFilter program.

Given a pair of mammalian genomes (e.g. mouse and opossum), I'd like
to produce a list of all syntenic regions. This is presumably a
combination of the level 1 net with the lower level nets that are
syntenic.

It has been suggested in the archives that "netFilter -syn" can be
used to do this and elsewhere in the archives there's an indication
that it has been "tuned" to human-mouse. What does that mean? Will
it still work for other mammal pairs?

Also, is it possible to specify the permissible gap size? For
example, let's say there are two level 1 nets that are separated by a
100bp or less and I want to join them, is that possible with the
program?

Thanks,
Danielle

########################################
Danielle G. Lemay, PhD
Postdoctoral Scholar, Bruce German's Lab
Food Science and Technology Department
University of California at Davis
(530) 297 7688

Angie Hinrichs

unread,
Feb 8, 2010, 7:25:43 PM2/8/10
to Danielle Lemay, gen...@soe.ucsc.edu
Hi Danielle,

> It has been suggested in the archives that "netFilter -syn" can be
> used to do this and elsewhere in the archives there's an indication
> that it has been "tuned" to human-mouse. What does that mean? Will
> it still work for other mammal pairs?

Yes, that is the aim of "netFilter -syn". I will describe how it
works in as much detail as I can, and hopefully that will be enough
information for you to decide whether it will suit your purposes.

"netFilter -syn" uses these parameters described in the usage (run
just "netFilter" with no arguments to see the full usage):

-syn - do filtering based on synteny (tuned for human/mouse).
-minTopScore=N - Minimum score for top level alignments. default 300000
-minSynScore=N - Min syntenic block score (def=200,000).
Default covers 27,000 bases including 9,000
aligning--a very stringent requirement.
-minSynSize=N - Min syntenic block size (def=20,000). -
-minSynAli=N - Min syntenic alignment size(def=10,000). -
-maxFar=N - Max distance to allow synteny (def=200,000).

The -syn filter has multiple stages:

First, a net passes the -syn filter if its score is at least
minSynScore, its size on the reference genome is at least minSynSize,
and its alignment size (number of bases in gapless aligned blocks in
chain) is at least minSynAli.

If those criteria are not satisfied, but the net has been classified
as "top" by the netSyntenic program, and its score is at least
minTopScore, it passes.

If the net has been classified as "nonSyn" by netSyntenic, it fails.

If the net's distance from the position predicted by its parent net
is greater than maxFar, it fails.

Finally, if the net has not met any of the above criteria, it passes.
The source code has this comment:
/* For all others, assume syntenic. This keeps tandem dupes, small inversion,
* and translocations. */

The source tree has a description of the .net format that may provide
insight into some of these parameters ("ali" is compared to minSynAli,
"qFar" is compared to maxFar): kent/src/hg/mouseStuff/netFormat.doc

Source code for the chain* and net* programs is also in mouseStuff,
and ultimately that is where to look for answers about the detailed
workings of the programs. (Or, as you have done, you can ask the
genome list for a reading of the entrails of the code :)

I don't know the process by which the parameters were tuned -- perhaps
Jim Kent can comment.

My hunch is that it will probably do a decent job with other pairs of
mammalian genomes; I suggest running netFilter on a downloaded .net
file, using netToBed to make a custom track, and then visually
comparing the custom track to the net track in a handful of regions --
do the retained and excluded parts of the net seem reasonable? If
not, is there a netFilter parameter that could be changed to get the
sensitivity / specificity that you're looking for? I guess that would
be a tuning process, by subjectively evaluating the output of a very
complex quantitative system.


> Also, is it possible to specify the permissible gap size? For
> example, let's say there are two level 1 nets that are separated by a
> 100bp or less and I want to join them, is that possible with the
> program?

netFilter can gloss over gaps within a net, but not between top-level
nets. But hopefully there wouldn't be cases of adjacent top-level
nets that should have been joined -- that should have happened at the
chaining step. The command line arg that you want is -minGap=100:

-minGap=N - restrict to those with gap size (tSize) >= minSize

-- that should read "restrict gaps to those with size (tSize) >=
minGap".

Hope that helps,

Angie

----- "Danielle Lemay" <dgl...@ucdavis.edu> wrote:
> _______________________________________________
> Genome maillist - Gen...@lists.soe.ucsc.edu
> https://lists.soe.ucsc.edu/mailman/listinfo/genome

Danielle Lemay

unread,
Feb 9, 2010, 2:28:26 PM2/9/10
to Angie Hinrichs, gen...@soe.ucsc.edu
Dear Angie,

Thank you for the thorough explanation. The default settings are much
too stringent for my purposes. Is it possible for you to explain to
me how the two scores are calculated, TopScore and SynScore, so that a
user can make intelligent parameter choices?

Thanks,
Danielle
--
########################################
Danielle G. Lemay, PhD
Postdoctoral scholar, German Lab

Angie Hinrichs

unread,
Feb 11, 2010, 1:55:58 PM2/11/10
to Danielle Lemay, gen...@soe.ucsc.edu
Hi Danielle,

In a nutshell, the chain scoring scheme is somewhat complicated, but ballpark estimates of scores can be made from expected size of aligned blocks, percent identity, gap size and gap frequency.

For most species pairs, we use this substitution matrix to score gapless aligned blocks:

A C G T
A 91 -90 -25 -100
C -90 100 -100 -25
G -25 -100 100 -90
T -100 -25 -90 91

The matrix is included in the details page for each chain track. It gives more precision in scoring matches/mismatches than the Smith-Waterman +1/-1. I believe that was derived a long time ago (2001?) by Webb Miller and colleagues from observed match and mismatch frequencies in manually checked multi-species alignments of the HoxD region, which was sequenced long before entire genome sequences became available. We use more stringent matrices for closely related mammals e.g. mouse-rat or human-chimp. But as a rule of thumb, expect scores of aligned blocks from this scheme to be a couple orders of magnitude larger than Smith-Waterman scores because substitution scores are normalized to max=100.

Gaps in chains are penalized with a piecewise linear function that penalizes gap openings the most, with less harsh penalties as gaps are larger (we expect large gaps in cross-species alignments due to insertions, rearrangements etc.). For mammal-mammal, we use this matrix (which also appears on chain details pages):

tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 11360,000
8300 218300 318300

tableSize 11 means that there are 11 points in the piecewise function. smallSize 111 is an implementation detail (boundary for precomputing penalties). position corresponds to gap size boundaries. Query gap penalties are in qGap, target gap penalties in tGap, and gaps in both target and query (meaning there is unalignable sequence in both) are in bothGap. Given a gap size, if it corresponds to one of the values in position, it gets the corresponding penalty. If it is between two positions, its score is linearly interpolated between the two positions' penalties. Then the penalty is subtracted from the chain score.

The penalties can be divided by 100 to get a (low) estimate of how many matching bases are required to make up for the gap penalty (beyond the number of matching bases required to make up for mismatching bases).

Another approach to making sense of chain scores is to look at cahin score histograms, e.g. for all chains or for chains of a particular length and gap size extracted using chainFilter.

Jim said that the score thresholds were set to exclude most of (Robert Baertsch's) pseudogenes. So I suppose there were several trials of trying a chain score picked from a histogram of chains meeting some desired size and gap content (netFilter usage: "-minSynScore... def=200,000... Default covers 27,000 bases including 9,000 aligning--a very stringent requirement."), and then intersecting the resulting filtered nets with the pseudogene track. (Also, code comment:b*sc

Angie Hinrichs

unread,
Feb 11, 2010, 1:59:03 PM2/11/10
to Danielle Lemay, genome
Hi Danielle,

Sorry that message was sent before it was complete. Here is the complete reply:

Angie Hinrichs

unread,
Feb 11, 2010, 2:07:43 PM2/11/10
to Danielle Lemay, genome
Hi Danielle,

Sorry that message was sent before it was complete (twice, ugh). *Here* is the complete reply:

In a nutshell, the chain scoring scheme is somewhat complicated, but ballpark estimates of scores can be made from expected size of aligned blocks, percent identity, gap size and gap frequency.

For most species pairs, we use this substitution matrix to score gapless aligned blocks:

A C G T
A 91 -90 -25 -100
C -90 100 -100 -25
G -25 -100 100 -90
T -100 -25 -90 91

The matrix is included in the details page for each chain track. It gives more precision in scoring matches/mismatches than the Smith-Waterman +1/-1. I believe that was derived a long time ago (2001?) by Webb Miller and colleagues from observed match and mismatch frequencies in manually checked multi-species alignments of the HoxD region, which was sequenced long before entire genome sequences became available. We use more stringent matrices for closely related mammals e.g. mouse-rat or human-chimp. But as a rule of thumb, expect scores of aligned blocks from this scheme to be a couple orders of magnitude larger than Smith-Waterman scores because substitution scores are normalized to max=100.

Gaps in chains are penalized with a piecewise linear function that penalizes gap openings the most, with less harsh penalties as gaps are larger (we expect large gaps in cross-species alignments due to insertions, rearrangements etc.). For mammal-mammal, we use this matrix (which also appears on chain details pages):

tableSize 11
smallSize 111
position 1 2 3 11 111 2111 12111 32111 72111 152111 252111
qGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
tGap 350 425 450 600 900 2900 22900 57900 117900 217900 317900
bothGap 750 825 850 1000 1300 3300 23300 58300 11360,000
8300 218300 318300

tableSize 11 means that there are 11 points in the piecewise function. smallSize 111 is an implementation detail (boundary for precomputing penalties). position corresponds to gap size boundaries. Query gap penalties are in qGap, target gap penalties in tGap, and gaps in both target and query (meaning there is unalignable sequence in both) are in bothGap. Given a gap size, if it corresponds to one of the values in position, it gets the corresponding penalty. If it is between two positions, its score is linearly interpolated between the two positions' penalties. Then the penalty is subtracted from the chain score.

The penalties can be divided by 100 to get a (low) estimate of how many matching bases are required to make up for the gap penalty (beyond the number of matching bases required to make up for mismatching bases).

Another approach to making sense of chain scores is to look at cahin score histograms, e.g. for all chains or for chains of a particular length and gap size extracted using chainFilter.

Jim said that the score thresholds were set to exclude most of (Robert Baertsch's) pseudogenes. So I suppose there were several trials of trying a chain score picked from a histogram of chains meeting some desired size and gap content, and then intersecting the resulting filtered nets with the pseudogene track. (netFilter usage: "-minSynScore... def=200,000... Default covers 27,000 bases including 9,000 aligning--a very stringent requirement." Also, code comment: "On average in the human/mouse net a score of 200,000 will cover 27000 bases including 9000 aligning bases - more than all but the heartiest of processed pseudogenes.")

Sorry there's no simple formula -- there is a complex formula that depends heavily on assumptions about %identity. I think that's why Jim ended up calibrating the thresholds by trial-and-error, using observed chain scores as initial expectations and matches of genes vs. pseudogenes as criteria.
Reply all
Reply to author
Forward
0 new messages