How to understand the weight values in percolator.weights.txt


Robot mantou

Dec 8, 2020, 2:38:25 AM
to crux-users
Dear Cruxers,

I'm writing to ask how to interpret the weight values in percolator.weights.txt.

For instance, I ran the following commands with Crux:
  1. crux tide-index --overwrite T --max-length 30 --min-length 7 --mods-spec C+57.02146,1M+15.994915 --mod-precision 4 --max-mods 3 --missed-cleavages 1 x.fasta index/
  2. crux tide-search --max-precursor-charge 4 --mod-precision 4 --num-threads 4 --overwrite T --top-match 1 --concat T --compute-sp T x.raw index/ 
  3. crux percolator --output-weights T --overwrite T crux-output/tide-search.txt
Then, I got the percolator.weights.txt like this:


q1: I know that Percolator uses three-fold cross-validation, so the table has three parts. Why does each part have two rows?

q2: The values of 'dM' and 'absdM' in the red box reach into the dozens, even hundreds. If 'absdM' means the absolute mass error, how can they be so high?

q3: Given the table, how should I calculate the importance of each feature? I would expect an importance value to lie in the range of zero to one.

Thanks for your wonderful work on Crux. I have learned a lot from it.

Best,

A user of Crux

William Fondrie

Dec 8, 2020, 5:14:17 PM
to crux-users
q1: I know that Percolator uses three-fold cross-validation, so the table has three parts. Why does each part have two rows?

One thing Percolator does is normalize the features to have a mean of zero and variance of one. Each part in the weights file contains weights on both the normalized feature scale (the first row) and the original feature scale (the second row).
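The relationship between the two rows can be sketched as follows. This is a minimal illustration of ordinary z-score standardization, not Percolator's actual code; the function names and all numbers are hypothetical. If a feature is normalized as (x - mean) / sd, a weight w on the normalized scale corresponds to w / sd on the original scale, with the intercept adjusted to compensate.

```python
def raw_weights(norm_weights, norm_bias, means, sds):
    """Convert weights on standardized features to the original feature scale.

    Assumes each feature was standardized as (x - mean) / sd.
    """
    raw = [w / s for w, s in zip(norm_weights, sds)]
    bias = norm_bias - sum(w * m / s for w, m, s in zip(norm_weights, means, sds))
    return raw, bias


def score(weights, bias, features):
    """Linear score: weighted sum of the features plus an intercept."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical values: one feature with a tiny spread, one with a large spread.
means, sds = [0.02, 50.0], [0.01, 20.0]
norm_w, norm_b = [-0.5, 1.0], 0.1
raw_w, raw_b = raw_weights(norm_w, norm_b, means, sds)

x = [0.03, 70.0]  # one hypothetical PSM on the original scale
z = [(xi - m) / s for xi, m, s in zip(x, means, sds)]

# Both parameterizations give the same score for the same PSM.
assert abs(score(norm_w, norm_b, z) - score(raw_w, raw_b, x)) < 1e-9
print(raw_w, raw_b)
```

Note how the feature with the small standard deviation ends up with a raw weight one hundred times larger in magnitude than its normalized weight, even though its normalized weight is the smaller of the two.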

q2: The values of 'dM' and 'absdM' in the red box reach into the dozens, even hundreds. If 'absdM' means the absolute mass error, how can they be so high?

The weights in these rows are on the original feature scale. I would not be surprised if they were larger than the weights of the other features, because the mass error is often very small. However, in this case it looks like the 'absdM' feature is receiving more weight than I would typically expect (-8, -9, -9 on the normalized scale). Can you provide more details about your experiment (for example, the type of instrument that was used)? I suspect that the default 'mz-bin-width' of 0.02 for the tide-search command may be too small.

q3: Given the table, how should I calculate the importance of each feature? I would expect an importance value to lie in the range of zero to one.

You can think of the normalized weights as a crude, relative measure of importance. While the normalized weights do not have any units associated with them, they indicate how much each feature contributes to Percolator's score. For example, the large, negative weight for the 'absdM' feature indicates that PSMs are heavily penalized as the absolute difference between the observed and theoretical mass increases. However, it's worth noting that these weights are confounded by correlated features. When two features are highly correlated, either one can often substitute for the other. This can result in highly variable weights that may not reflect the actual importance of either feature.
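If a number between zero and one is wanted, one crude option is to average the absolute normalized weights over the three cross-validation splits and divide by their sum. The sketch below assumes the normalized rows have already been read into lists, with any intercept column left out; the function name is hypothetical, and apart from the 'absdM' values quoted above, the feature names and weights are made up for illustration. It inherits the caveat about correlated features.

```python
def relative_importance(feature_names, fold_weights):
    """Average |normalized weight| across folds, scaled so the values sum to 1."""
    n_folds = len(fold_weights)
    mean_abs = [
        sum(abs(fold[i]) for fold in fold_weights) / n_folds
        for i in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: value / total for name, value in zip(feature_names, mean_abs)}


# Normalized weights for three folds. Only the absdM column (-8, -9, -9)
# comes from this thread; the other columns are hypothetical.
names = ["absdM", "XCorr", "PepLen"]
folds = [
    [-8.0, 1.5, 0.5],
    [-9.0, 2.0, 1.0],
    [-9.0, 1.0, 0.0],
]

importance = relative_importance(names, folds)
for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Taking the absolute value discards the sign, so this ranks features only by how strongly they move the score, not by the direction.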

Best,
Will

Robot mantou

Dec 8, 2020, 8:18:09 PM
to crux-users

Dear Will,

Thanks for your detailed answers. 

For q2, the file is var_D150908_DDA_R01_S209-120min-22W-BC-P1_T0.raw, acquired on a QE HF, from PXD005573.

Best,

Song

Robot mantou

Dec 10, 2020, 8:09:29 PM
to crux-users
Dear Will,

After I added the '--auto-mz-bin-width warn' parameter to tide-search, the weight of absdM fell to -0.14 and the number of identified PSMs increased from 22074 to 25612. It looks normal now. Should I add this parameter for every QE HF raw file? Any comments?

Best,

Song

On Wednesday, December 9, 2020 at 6:14:17 AM UTC+8, <wfon...@uw.edu> wrote:

William Fondrie

Dec 11, 2020, 12:35:19 PM
to crux-users
Hi Song,

I'm glad to hear that using param-medic seems to have fixed things.

I forgot to ask, but what version of crux are you using (you can get this by running "crux version")? One thing that has changed since version 3.2 is that we've updated the default "mz-bin-width" to be 0.02 (for high resolution data) instead of 1.0005079 (for low res data).
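To make the difference between the two defaults concrete, here is a simplified sketch of fixed-width m/z binning. It is an assumption for illustration only: the real tide-search binning also applies an offset, and the function name and peak values are hypothetical. The point is just that two peaks 0.3 m/z apart share a bin at the low-resolution width but are separated at the high-resolution width.

```python
import math


def bin_index(mz, bin_width):
    """Simplified binning: index of the fixed-width bin an m/z value falls into."""
    return math.floor(mz / bin_width)


peak_a, peak_b = 500.31, 500.61  # hypothetical fragment m/z values

low_res_width = 1.0005079  # older default, intended for low-resolution data
high_res_width = 0.02      # newer default, intended for high-resolution data

same_bin_low = bin_index(peak_a, low_res_width) == bin_index(peak_b, low_res_width)
same_bin_high = bin_index(peak_a, high_res_width) == bin_index(peak_b, high_res_width)
print(same_bin_low, same_bin_high)
```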

Best,
Will