Hi Mike,
Thank you for your detailed explanation. I understand that the "RPQ method" has been faithfully adapted from the Griffin et al. paper, and I apologize for the incomplete quote (I had no intention of taking words out of context). I fully agree that StPeter computes "nanograms" as defined in the Griffin et al. publication. Unfortunately, the computed quantity is not nanograms, and the error is rooted in the publication by Griffin et al.
I am also glad that you "agree completely that SIN/dSIN tries to quantify based on molecules, not mass." This is the critical point: the nanogram is a unit of mass, not of moles or copy numbers, so calling this quantity nanograms is misleading and incorrect. This matters especially because actual mass abundances (regardless of measurement errors) are often what users need, and StPeter can compute them as well.
We can look deeper into Griffin et al.'s work, but before that, I should emphasize that the concern here is neither the systematic error due to incomplete identification of the proteins (your second point) nor whether it is accurate to call the discussed quantities "absolute" (your third point). Hui et al. use the "absolute" terminology to distinguish it from the "relative" term, which they use to refer to their isotope-labeling method. They clarify this by calling it "mass fraction", i.e. the fraction of the observed (whether partial or complete) protein mass in the sample (whether whole proteome or not). The concern, instead, is that a mass quantity should be called a mass quantity and carry mass units, and a molar (or copy number) quantity should be called a molar or number (dimensionless) quantity and carry the appropriate units. These are fundamentally and physically different quantities, and the distinction between them goes beyond measurement incompleteness or the nature of the sample.
Unfortunately, Griffin et al. have overlooked this distinction. In their method development, they seek agreement between four replicate measurements (the four diamond plots in each panel of the screenshot below, taken from their Figure 1):
They introduce the quantity SIGI, which is the true mass fraction and which "successfully normalized different samples" (panel i in the figure):
Then, in an attempt to make "further enhancement", they argue that "large proteins can contribute more peptides than smaller ones, thus their abundance may be overestimated". However, the term "abundance" here is ambiguous: mass abundance or molar abundance? For larger proteins, the mass abundance is not overestimated, but the molar abundance is. Regardless, they proceed to normalize by protein length and introduce SIN:
They observe that SIN is also a valid normalization method (panel k), but claim, without any justification, that "SIN was superior to SIGI". In fact, the two panels show the same degree of success in making the four replicates consistent. The error bars (circle or diamond widths) in panel i seem a bit larger because of the scaling, but there is nothing superior about one quantity over the other (I don't see any t-test comparison between the two measures). They are both valid but physically different quantities: SIGI is a mass fraction, whereas SIN is a mole fraction.
Perhaps because of the purported "superiority", they then use SIN to calculate nanogram abundances:
This is simply wrong. The nanogram (like the µg used for Q) is a unit of mass, so the mass fraction SIGI should have been used instead. Alternatively, they could have reported this as a molar or copy number abundance, with Q being the total µmol of protein loaded (less commonly used). The mass and molar abundances would have very different values, and they would be equally consistent between replicates.
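To make the difference concrete, here is a toy sketch in Python. It is not StPeter's code and not the exact Griffin et al. formula: the function name, protein names, spectral index values, lengths, and total load are all made up, and I am assuming that the summed spectral index is proportional to observed mass, that protein length stands in for molecular weight, and that each quantity is normalized to add up to 1.

```python
# Toy illustration only: names and numbers are hypothetical, not real data.
def fractions(proteins):
    """proteins maps name -> (spectral_index, length_in_residues)."""
    total_si = sum(si for si, _ in proteins.values())
    total_si_per_len = sum(si / length for si, length in proteins.values())
    # SIGI-like quantity: share of the total spectral index (mass fraction).
    mass_frac = {name: si / total_si for name, (si, _) in proteins.items()}
    # SIN-like quantity: length-normalized share (mole fraction, assuming
    # molecular weight is proportional to length).
    mole_frac = {
        name: (si / length) / total_si_per_len
        for name, (si, length) in proteins.items()
    }
    return mass_frac, mole_frac


proteins = {"A": (600.0, 1000), "B": (400.0, 100)}  # made-up values
mass_frac, mole_frac = fractions(proteins)

total_load_ng = 1000.0  # hypothetical total load
for name in proteins:
    from_mass = round(mass_frac[name] * total_load_ng, 1)
    from_mole = round(mole_frac[name] * total_load_ng, 1)
    print(name, from_mass, from_mole)
```

With these made-up numbers, the large protein A is 600.0 ng by mass fraction, but multiplying its mole fraction by the same load gives 130.4, which is not a mass at all. Both columns add up to the total load, which is why either one can look equally consistent between replicates.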
Sorry for my lengthy email as well :) I really appreciate that you are open to updating StPeter to better clarify what RPQ is and is not. My suggestion would be to report both "mass frac" and "mole frac" quantities and simply call them that. I would also leave multiplying by the total load to the user (if they need it), i.e. no input of the total ng or nmoles would be required. Just report the fraction-of-total-observed quantities, which add up to 1. In my opinion, this makes the program simpler to use and the output clearer. It also avoids the assumption that the total observed protein is known (whether in grams or moles), because, as you said, the observed/identified protein is not exactly the total loaded protein.
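As a rough sketch of what the user-side step could look like under this suggestion (the helper name, the fraction values, and the loads below are hypothetical, and nothing here reflects StPeter's actual output format):

```python
# Hypothetical user-side helper: scale reported fractions by a known total load.
def scale_by_load(fractions, total_load):
    """Multiply fractions that add up to 1 by a total load in the user's own units."""
    if abs(sum(fractions.values()) - 1.0) > 1e-6:
        raise ValueError("fractions must add up to 1")
    return {name: frac * total_load for name, frac in fractions.items()}


# Made-up fractions, as they might be read from a report.
mass_frac = {"A": 0.6, "B": 0.4}
mole_frac = {"A": 0.25, "B": 0.75}

mass_ng = scale_by_load(mass_frac, 500.0)    # user knows the total load in ng
amount_nmol = scale_by_load(mole_frac, 2.0)  # user knows the total load in nmol
```

The units then follow from whichever total the user supplies: a mass fraction times a mass load gives a mass, and a mole fraction times a molar load gives a molar amount.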
I hope this makes sense, but please let me know if you disagree.
Thanks!
Farshad