Hi Jonathan,
Apologies for the delayed response. I have been traveling most of the month and wanted to give this question the attention it deserves.
censoredInt
First, on the "censoredInt" parameter. This parameter tells MSstats which missing values are censored (i.e. missing for reasons of low abundance). It is normally "NA", meaning that any value that shows NA will be treated as censored. However, some tools (mainly Skyline) indicate a censored value with a very low number, either zero or close to zero. In these cases "censoredInt" needs to be set to "0". This will treat these small values as censored, while any NA values will be treated as missing at random (MAR) and will not be imputed. Setting "censoredInt" to NULL treats all missing values as MAR and will essentially disable imputation. The NULL option is redundant and should probably be removed.
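To make the three settings concrete, here is a small Python sketch of the decision rule as I described it above. This is not MSstats code: the function name `classify_value` and the `low_cutoff` argument are made up for illustration, and the exact cutoff MSstats uses for "close to zero" is not something this sketch tries to reproduce.

```python
import math

def classify_value(intensity, censored_int="NA", low_cutoff=0.0):
    """Label one intensity as 'observed', 'censored', or 'MAR'.

    censored_int mirrors the three settings discussed above: "NA", "0", or None (NULL).
    low_cutoff is a hypothetical stand-in for "zero or close to zero".
    """
    is_na = intensity is None or math.isnan(intensity)
    if is_na:
        # NA is only treated as censored when censored_int is "NA";
        # with "0" or None (NULL) an NA counts as missing at random.
        return "censored" if censored_int == "NA" else "MAR"
    if censored_int == "0" and intensity <= low_cutoff:
        # Skyline-style output: a very small number marks a censored value.
        return "censored"
    return "observed"

print(classify_value(float("nan"), "NA"))   # NA under the usual setting
print(classify_value(0.0, "0"))             # zero under the Skyline-style setting
print(classify_value(float("nan"), "0"))    # NA under the Skyline-style setting
print(classify_value(float("nan"), None))   # NA with censoredInt = NULL
```

Only values labelled "censored" would be candidates for imputation, which is why the NULL setting effectively turns imputation off.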
Imputation effect on quantification
Next, on quantified values going up and down with imputation. I ran the same test as you on a DDA experiment processed with FragPipe and looked at quantification on the "Group" level. Note that I saw different results than you did, with more quantifications decreasing with imputation than increasing, but this isn't necessarily an issue (more on this further down).
- 6,175 quantifications decrease with imputation enabled
- 1,354 quantifications increased with imputation enabled
- 2,432 did not change (because no imputation was performed)
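For a sense of proportion, the counts above can be turned into percentages with a few lines of Python. Only the three counts come from my test; the variable names are just for illustration.

```python
# Counts reported above for the FragPipe DDA test (Group-level quantification).
counts = {"decreased": 6175, "increased": 1354, "unchanged": 2432}

total = sum(counts.values())
changed_total = counts["decreased"] + counts["increased"]

# Share of each outcome among all quantifications.
share_of_all = {label: 100 * n / total for label, n in counts.items()}

# Share that decreased among the quantifications that changed at all.
share_decreased_when_changed = 100 * counts["decreased"] / changed_total

for label, pct in share_of_all.items():
    print(f"{label}: {pct:.1f}% of all quantifications")
print(f"decreased among those that changed: {share_decreased_when_changed:.1f}%")
```

That works out to roughly 62% decreasing, 14% increasing, and 24% unchanged, and about 82% of the quantifications that changed went down.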
Generally, I would expect the quantification to drop when we impute, because the missing values are close to the limit of detection; however, there will always be some cases where it increases. I want to go through a couple of examples so that we can empirically see what the imputation is doing.
Imputation makes quantification go down
Here is an example protein where quant goes down.
Here we can see that the summarized protein-level results (dark red) are much higher without imputation (runs 4 and 5 in particular). Without imputation they follow the few high intensity features that are observed, which naturally makes the summarized value higher. With imputation (bright red points) the summarized results are lower because most of the imputed points are lower. This is what I would generally expect to happen because the missing points are missing due to low abundance and we impute them at these low values.
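The same effect can be reproduced with made-up numbers. The sketch below uses a plain median over the features of a single run as a simplified stand-in for the real summarization, and all intensities are invented for illustration (they are not taken from the protein in the plot).

```python
from statistics import median

# Toy log2 intensities for one run (invented values).
observed = [27.0, 26.5]          # the few high-intensity features that were seen
imputed = [21.0, 20.5, 22.0]     # hypothetical imputed values near the detection limit

# Without imputation the summary follows only the observed high features.
summary_without = median(observed)

# With imputation the low imputed values outnumber the observed ones.
summary_with = median(observed + imputed)

print("without imputation:", summary_without)
print("with imputation:   ", summary_with)
```

With only two observed features the summary sits between them, but once three low imputed values are added the median moves down to one of the imputed points.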
Imputation makes quantification go up
This is a more complex situation, and there are a number of reasons this could happen. I will run through a few examples below.
Protein 1
This protein is an interesting case because there are only two features, but the higher intensity feature is missing a lot of values even though it is further from the limit of detection. We impute the values for the higher intensity feature as lower than what was observed (see runs 1 and 2 vs run 3). However, the summarized results are strongly affected because the imputed values are higher than the other observed feature. This makes the final summarized (quantified) value higher in runs 1, 2, and 4. Empirically, I think these imputed values generally make sense, and the summarized values are much more consistent after imputation. In my opinion this case is not a problem.
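Here is a toy version of that two-feature situation, again using a per-run median as a simplified stand-in for the real summarization. The feature names and all numbers are invented; only the pattern (high feature missing, imputed below its observed level but above the low feature) follows the description above.

```python
from statistics import median

# Toy log2 intensities: None marks a missing value (invented numbers).
runs = {
    "run1": {"high_feature": None, "low_feature": 20.0},
    "run3": {"high_feature": 26.0, "low_feature": 20.5},
}
imputed_high = 24.0  # hypothetical imputed value: below 26.0, above the low feature

def summarize(features, fill=None):
    """Median over features; missing values are dropped unless a fill value is given."""
    values = [fill if v is None else v for v in features.values()]
    return median(v for v in values if v is not None)

run1_without = summarize(runs["run1"])                  # only the low feature is left
run1_with = summarize(runs["run1"], fill=imputed_high)  # imputed high feature pulls it up
run3_summary = summarize(runs["run3"])                  # fully observed, unaffected

print(run1_without, run1_with, run3_summary)
```

After imputation the run 1 summary is both higher and closer to the fully observed run 3, which is the "more consistent" behaviour I mean.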
Protein 2
This one might be harder to see. I did quantification on the condition level, grouping the first 3 and last 3 runs separately. Here run 2 in particular goes up after imputation, although you'll notice that we actually don't even impute in that run. In this case the imputed values in other runs affected the summarization of Run 2. We use Tukey's Median Polish for summarization, which is similar to median summarization but is also impacted by the values in other runs. Due to the summarization on the imputed data, we see a higher quantification even though there was no imputation in this run. As with the last protein I do not see this as a problematic imputation.
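To show how a run with no imputed values can still move, here is a minimal pure-Python sketch of Tukey's median polish on a features-by-runs matrix, returning overall effect plus run effect as the run-level summary. It is a textbook version written for illustration, not the MSstats implementation, and the intensities are invented.

```python
import math
from statistics import median

def _med(values):
    """Median of the non-missing (non-NaN) values."""
    return median(v for v in values if not math.isnan(v))

def median_polish_run_summary(matrix, max_iter=10, tol=1e-8):
    """Median polish on rows = features, columns = runs; returns overall + run effect."""
    resid = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep out row (feature) medians.
        row_med = [_med(row) for row in resid]
        for i in range(n_rows):
            resid[i] = [v - row_med[i] for v in resid[i]]
            row_eff[i] += row_med[i]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep out column (run) medians.
        col_med = [_med(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for i in range(n_rows):
            resid[i] = [v - col_med[j] for j, v in enumerate(resid[i])]
        col_eff = [c + m for c, m in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return [overall + c for c in col_eff]

NA = math.nan
# Toy data: the second feature is missing in runs 1 and 3, run 2 is fully observed.
without_imputation = [[20, 22, 20], [NA, 23, NA], [18, 18, 18]]
with_imputation = [[20, 22, 20], [22, 23, 22], [18, 18, 18]]  # hypothetical imputed 22s

summary_without = median_polish_run_summary(without_imputation)
summary_with = median_polish_run_summary(with_imputation)
print("run 2 without imputation:", summary_without[1])
print("run 2 with imputation:   ", summary_with[1])
```

In this toy case the imputed values change the second feature's row median, which changes that feature's residual in run 2 and therefore the run 2 summary, even though nothing in run 2 itself was imputed.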
Protein 3
I wanted to highlight this protein because, while the previous two showed reasonable imputations, this one is problematic. Here we only have two features, and in condition 2 (run 4) in particular we only have one observation. In condition 2 the summarized value is higher due to imputation, but it is very unstable. In run 2 we also impute, but that value looks really off and is much lower than everything else in condition 1. In this case I would be wary of trusting the imputation, and the quantification increasing is an indication of an issue.
Final recommendations
I hope I was able to give you a good overview of why you might see quantification values increase even though we impute with the assumption that values are missing for reasons of low abundance. Going back to your data, it is odd that you see such a high percentage increasing after imputation. I would recommend spot checking some of these with profile plots and just seeing if everything looks correct, or if you get weird results like Protein 3 above.
In particular, I would recommend avoiding imputation when your proteins have very low feature counts (1-3), although this is not always necessary, as seen in Protein 1 above.
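If it helps with the spot checking, a quick way to list the proteins worth a closer look is to count distinct features per protein and flag those with three or fewer. The sketch below is plain Python over (protein, feature) pairs; the function name, the input layout, and the example identifiers are all hypothetical and would need adapting to however your feature-level table is stored.

```python
from collections import defaultdict

def low_feature_proteins(rows, max_features=3):
    """Return proteins whose number of distinct features is at most max_features."""
    features = defaultdict(set)
    for protein, feature in rows:
        features[protein].add(feature)
    return sorted(p for p, feats in features.items() if len(feats) <= max_features)

# Invented example: one (protein, feature) pair per measured value.
rows = [
    ("ProtA", "PEPTIDEA_2"), ("ProtA", "PEPTIDEB_2"),
    ("ProtB", "PEPTIDEC_2"), ("ProtB", "PEPTIDED_2"),
    ("ProtB", "PEPTIDEE_3"), ("ProtB", "PEPTIDEF_2"),
    ("ProtC", "PEPTIDEG_2"), ("ProtC", "PEPTIDEG_2"),  # same feature in two runs
]

print(low_feature_proteins(rows))
```

The flagged proteins are candidates for a profile plot check rather than automatic exclusion, since a low feature count does not always mean the imputation is bad.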
I have attached the code I used to generate these plots if it will help. I know the included profile plots in MSstats do not show imputed values currently.
Let me know if you have any questions!
Devon