StPeter nanogram calculation error

Farshad AbdollahNia

Mar 20, 2024, 6:50:48 PM

to spctools-discuss

Dear TPP developers and community,

I wanted to point out a potential error in how StPeter estimates protein mass (nanograms) in the proteome sample. As described in the paper, the program normalizes the spectral index, dSI, by the protein length, L, and the total spectral index from the sample, Sum(dSI), that is, dSI_N = dSI / (L * Sum(dSI)).

This is correct for estimating the relative copy number, or mole fraction (the fraction of the total number of protein molecules), of each protein. However, for nanograms, or mass fraction (the fraction of the total proteome mass), the normalization by L should be omitted.

I hope this makes sense. The mass abundance of each protein is proportional to both its length and its copy number; therefore, normalization by length should not be performed for mass abundance estimation.
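To make the distinction concrete, here is a minimal sketch, assuming that the spectral index of a protein scales with its mass in the sample and that length is an acceptable proxy for molecular weight. The function name and the two-protein input are hypothetical and are not taken from StPeter.

```python
def mole_and_mass_fractions(dsi, lengths):
    """Return (mole_fractions, mass_fractions) for parallel lists of
    spectral indices (dSI) and protein lengths (L).

    Assumption: dSI is proportional to protein mass, so dSI / L is
    proportional to copy number.
    """
    if len(dsi) != len(lengths):
        raise ValueError("dsi and lengths must have the same length")

    # Mass fraction: no length normalization.
    total_dsi = sum(dsi)
    mass = [s / total_dsi for s in dsi]

    # Mole fraction: normalize by length first, then by the new total.
    per_length = [s / n for s, n in zip(dsi, lengths)]
    total_per_length = sum(per_length)
    mole = [x / total_per_length for x in per_length]

    return mole, mass


# Hypothetical example: two proteins with equal spectral index,
# one four times longer than the other.
dsi = [100.0, 100.0]
lengths = [100, 400]
mole, mass = mole_and_mass_fractions(dsi, lengths)
print(mole)  # short protein dominates by copy number
print(mass)  # both proteins contribute equal mass
```

With equal spectral indices, the two proteins have equal mass fractions, while the shorter protein accounts for most of the molecules.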

Unfortunately, as the StPeter paper says (and as I have verified in the output), for calculating the nanograms "each protein SI_N is divided by the sum of all proteins’ SI_N and multiplied by the protein load in nanograms". This is effectively using mole fraction in place of mass fraction, which is incorrect.

The authors (and other users) may not have noticed this error because it is inconsequential for tracking changes between different samples/conditions. However, it would be significant for consistency with other mass quantitation methods.

To check the consistency, when StPeter's SI_N output is correctly used to estimate mass fractions, i.e. dSI_N * L / Sum(dSI_N * L) is calculated instead of the above formula, the result is highly correlated with that of spectral counting, as expected, and as you can see in an example below:

The method of mass fraction estimation using spectral counting is already established in the literature, for example in this paper see the "Absolute protein quantitation" section: "The absolute abundance of a protein was calculated by dividing the total number of spectra of all peptides for that protein by the total number of 14N spectra in the sample." No normalization by protein length is done, because length has to be included in the mass abundance of a protein.

The paper also verifies the consistency of this method with 15N-labeled relative quantitation (see their supplementary figure S9). I have also verified the agreement in my own relative quantitation experiments.

I would be interested in learning your thoughts on this. For obtaining protein mass abundances (or mass fractions), StPeter's "SIn" output (which is log2[dSI_N]) is currently usable in the way described above, but the "ng" output needs to be corrected in the source code. Optionally, the current "ng" calculation can also be re-labeled as "copy numbers" given a total load of copy numbers (instead of total nanograms) provided by the user, but that would probably be of less interest than nanograms.
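The conversion described above can be sketched as follows. This assumes the "SIn" values are log2(dSI_N), as stated, and that protein lengths are available separately; the function name and the example values are hypothetical, and reading the actual StPeter output file is not shown.

```python
def mass_fractions_from_sin(log2_sin, lengths):
    """Estimate mass fractions from log2(dSI_N) values and protein lengths.

    Implements dSI_N * L / Sum(dSI_N * L): multiplying by L undoes the
    length normalization, and the sum rescales the result to add up to 1.
    """
    if len(log2_sin) != len(lengths):
        raise ValueError("log2_sin and lengths must have the same length")

    # Undo the log2 transform, then multiply by length.
    weighted = [2.0 ** v * n for v, n in zip(log2_sin, lengths)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical example: the longer protein has a 4x lower dSI_N
# but is 4x longer, so the two mass fractions come out equal.
fractions = mass_fractions_from_sin([-10.0, -12.0], [100, 400])
print(fractions)
```

Multiplying each fraction by the total load would then give a mass estimate in the unit of that load.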

Please let me know what you think.

Thank you,

Farshad

mhoo...@systemsbiology.org

Mar 20, 2024, 9:04:19 PM

to spctools-discuss

Hi Farshad,

Thanks for the detailed analysis, but I want to clarify what the function you are referring to does and assure everyone it is behaving as intended. This is independent of how one might prefer to quantify their proteins, but you bring up great discussion points. So let's dig in!

First, the full quote for the nanogram protein mass estimate calculation you are referring to is, "For statistical analysis, we converted SIN and dSIN values for each protein to nanogram estimations using the RPQ method [ref: https://pubmed.ncbi.nlm.nih.gov/20010810/]. In brief, each protein SIN is divided by the sum of all proteins’ SIN and multiplied by the protein load in nanograms." There was some key information there that was lost without seeing the full quote (like having your words taken out of context...). Namely, the nanogram estimation method, known as RPQ, is defined in a previous publication. StPeter's job is to replicate that function as originally described, which we believe it does, and that includes the units of the result. So I'd like to reiterate that the StPeter nanogram estimates are not computed incorrectly, but instead computed as defined in the publication, and the prior publication from which it was derived.

Second, it is good to understand what the RPQ means in terms of actual value or accuracy. In the RPQ equation, the sum of all the nanogram values should equal the total protein load onto the mass spectrometer. But StPeter (or any quantification method) can only quantify what was identified from the sample. There are perhaps thousands or more molecules in a sample that are never identified during a run, and thus were not included in the sum total of proteins quantified. The RPQ results are best described as a rescaling of the SIN to a range that resembles nanogram amounts, and are undoubtedly overestimates of the actual quantities, perhaps even if you've managed to quantify every protein in your sample.
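The rescaling behavior can be illustrated with a minimal sketch of RPQ as quoted above (each SIN divided by the sum of all SIN, multiplied by the load). The function name and numbers are hypothetical, and this is not the StPeter source code.

```python
def rpq_nanograms(si_n, load_ng):
    """Rescale SI_N values so that they sum to the protein load (ng)."""
    total = sum(si_n)
    return [load_ng * v / total for v in si_n]


# Hypothetical sample with two identified proteins and a 1000 ng load.
both = rpq_nanograms([0.004, 0.001], 1000.0)
print(both, sum(both))

# If the second protein had not been identified, the first one would
# be assigned the entire load, although its true amount is unchanged.
only_first = rpq_nanograms([0.004], 1000.0)
print(only_first)
```

Because the output always sums to the load, any protein missing from the identified set inflates the values reported for the rest.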

Third, nothing in StPeter performs absolute protein quantification. It is all relative to the sample. That is, not necessarily a proteome, and changing sample preparation in any way can influence the quantities regardless of the sample load or whatever value you may choose to use in RPQ.

Hopefully that was clear, but the take home point is the nanogram estimates are computed using the published RPQ method, which StPeter has correctly replicated from its publication. The results are not necessarily precise nanogram estimates, but relative abundances scaled to fall within the total [and arbitrary] number of nanograms you wish to see. For most people who are uncomfortable with the log2(SIN) scale, this is the alternative they use, maybe even unwisely.

Whew, sorry that was so long. On to the discussion points you raise:

1. Yes, I agree completely that SIN/dSIN tries to quantify based on molecules, not mass. This is an important distinction, and I'm happy you pointed it out to everyone who is reading.

2. Regarding my thoughts, I prefer to quantify using log(SIN) or log(dSIN) and not RPQ, as illustrated in several examples in the StPeter publication, and believe use of RPQ should be done with caution and calibrated appropriately (e.g., with known quantities spiked into the sample, for starters).

3. StPeter isn't one quantification algorithm. It is one program with a collection of quantification algorithms. It is possible to use spectral indexes, or spectral counts, or distributed spectral counts, etc. So even if RPQ is offered, there is no obligation to use it. Instead, use what is appropriate for your research.

4. I agree, especially after this lengthy response ;) , that we could perhaps update StPeter to better clarify what RPQ is and is not. Maybe you have additional suggestions? I am thinking at the very least to describe it as "nanogram-scale" quantity estimates, but finding a concise way to also express that converting molecule counts to mass estimates might undermine the analysis. I'm not sure if labeling RPQ as copy numbers is accurate either, as actually estimating total copy numbers of any given complex mixture is bound to be exceptionally inaccurate.

Cheers,

Mike

Farshad AbdollahNia

Mar 21, 2024, 5:38:46 PM

to spctools...@googlegroups.com

Hi Mike,

Thank you for your detailed explanation. I understand that the "RPQ method" has been faithfully adapted from the Griffin et al. paper, and I apologize for the incomplete quote (no intention to take words out of context). I fully agree that StPeter computes "nanograms" as defined in the Griffin et al. publication, but unfortunately the computed quantity is not nanograms and the error is rooted in the publication by Griffin et al.

I am also glad that you "agree completely that SIN/dSIN tries to quantify based on molecules, not mass." This is the critical point, because nanogram is a unit of mass, not moles or copy numbers. Calling this quantity nanograms is therefore misleading and incorrect, especially when actual mass abundances (regardless of measurement errors) are often required and used by the users and can be computed by StPeter as well.

We can look deeper into Griffin et al.'s work, but before that, I should emphasize that the concern here is not the systematic errors due to incomplete identification of the proteins (your second point), and that it is not accurate to call the discussed quantities "absolute" (your third point). The "absolute" terminology used by Hui et al. is to distinguish it from the "relative" term which they use to refer to their isotope-labeling method. They clarify this by calling it "mass fraction", i.e. the fraction of the observed (whether partial or complete) protein mass in the sample (whether whole proteome or not). The concern, instead, is that a mass quantity should be called a mass quantity and carry mass units, and a molar (or copy number) quantity should be called a number (dimensionless) quantity and be represented with appropriate units. These are fundamentally and physically different quantities and the distinction between them is beyond measurement incompleteness or the nature of the sample.

Unfortunately, Griffin et al. have overlooked this distinction. In their method development, they seek agreement between four replicate measurements (the four diamond plots in each panel in the screenshot below, from their Figure 1):

They introduce the quantity SI_GI which is the true mass fraction quantity and "successfully normalized different samples" (panel i in the figure):

Then, in an attempt to make "further enhancement", they argue that "large proteins can contribute more peptides than smaller ones, thus their abundance may be overestimated". However, the term "abundance" here is ambiguous: mass abundance or molar abundance? Mass abundance is not overestimated, but molar abundance is (for larger proteins). Regardless, they proceed to normalize by protein length and introduce SI_N:

They observe that SI_N is also a valid normalization method (panel k), but without any justification claim that "SI_N was superior to SI_GI". Indeed, the two figures clearly show the same degree of success in making the 4 replicates consistent. The error bars (circle or diamond widths) in panel i seem a bit larger because of the scaling, but there is nothing superior about one quantity to the other (I don't see any t-test comparison between the two measures); they are both valid but physically different quantities. SI_GI is mass fraction, whereas SI_N is mole fraction.

Perhaps because of the purported "superiority", they then use SI_N to calculate nanogram abundances:

This is simply wrong. Nanogram (and µg for Q) is a unit of mass, so the mass fraction SI_GI should have been used instead. Alternatively, they could have reported this as molar or copy number abundance, with Q being total µmole of protein loaded (less commonly used). The mass and molar abundances would have very different values, and they would be equally consistent between replicates.
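For reference, the relationships discussed in this message can be written out as below. This is a reconstruction from the verbal descriptions in this thread, not a copy of the equations in Griffin et al.; the index notation and the symbol m_i for the mass-based estimate are my own, and Q denotes the total protein load.

```latex
\begin{align}
SI_{GI,i} &= \frac{SI_i}{\sum_j SI_j}
  && \text{(mass fraction)} \\
SI_{N,i} &= \frac{SI_{GI,i}}{L_i}
  && \text{(length-normalized)} \\
\mathrm{RPQ}_i &= Q \, \frac{SI_{N,i}}{\sum_j SI_{N,j}}
  = Q \, \frac{SI_i / L_i}{\sum_j SI_j / L_j}
  && \text{(mole fraction} \times Q\text{)} \\
m_i &= Q \, SI_{GI,i}
  = Q \, \frac{SI_{N,i} \, L_i}{\sum_j SI_{N,j} \, L_j}
  && \text{(mass fraction} \times Q\text{)}
\end{align}
```

The last line uses the fact that SI_N,i * L_i = SI_GI,i and that the SI_GI values sum to 1, so the mass-based estimate can be recovered from the SI_N values alone.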

Sorry for my lengthy email as well :) I really appreciate that you are open to updating StPeter to better clarify what RPQ is and is not. My suggestion would be to report both "mass frac" and "mole frac" quantities and simply call them that. Also leave multiplying by the total load to the user (if they need to), i.e. no input of the total ng or nmoles is required. Just report the fraction-of-total-observed quantities that add up to 1. This makes the program usage simpler and the output more clear, in my opinion. It also avoids the assumption that the total observed protein is known (whether in grams or moles), because as you said, the observed/identified protein is not exactly the total loaded protein.

I hope these make sense, but please let me know if you disagree.

Thanks!

Farshad
