How to interpret lower protein quantitation when imputation is turned off?

Jonathan Bui

May 29, 2023, 3:36:19 PM

to MSstats

Hello again to the MSstats team.

I am trying to understand how MSstats' imputation model affects protein quantitation. I found Devon's explanation for how imputation works during the summarization step, and I was trying to understand what happens when I turn imputation on/off. I wonder if you could clarify why some of the quantitation in the data increases / decreases based on imputation.

Imputation is recommended to be turned on most of the time, and so the default function call is:

MSstats::dataProcess(
  msstats_data,
  censoredInt = "NA",
  MBimpute = TRUE
)

I am not sure I understand 100% how "censoredInt" changes the treatment of values in imputation, but I think the option means that only "NA" values are treated as candidates for possible imputation. "MBimpute" of course means that data is imputed.

Following the MSstats user manual section 4.1.4, I can disable imputation by changing two options:

MSstats::dataProcess(
  msstats_data,
  censoredInt = NULL,
  MBimpute = FALSE
)

where "censoredInt = NULL" tells MSstats that data is missing at random (and therefore not suitable for imputation by the accelerated failure time model), which disables imputation.

I compared the protein-level quantitation with imputation enabled / disabled, and I observed:

- About half of my protein-level quantitations are affected.
- Of those affected, about 1/4 increase in quantitation with imputation off and censoredInt = NULL.
- The other 3/4 of those affected decrease in quantitation.
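
A comparison like this could be sketched roughly as follows. This is only a minimal sketch and not MSstats code: it assumes each `dataProcess` result exposes a `ProteinLevelData` table with `Protein`, `RUN`, and `LogIntensities` columns (as I understand the MSstats v4 output), and `compare_quant` plus the toy tables are hypothetical names made up for illustration.

```r
# Hypothetical helper: classify how each protein-by-run summary changes
# when imputation is switched off. Column names are assumptions (see above).
compare_quant <- function(with_imp, without_imp, tol = 1e-8) {
  keep <- c("Protein", "RUN", "LogIntensities")
  merged <- merge(with_imp[, keep], without_imp[, keep],
                  by = c("Protein", "RUN"),
                  suffixes = c("_imp", "_noimp"))
  delta <- merged$LogIntensities_noimp - merged$LogIntensities_imp
  merged$change <- ifelse(
    abs(delta) < tol, "unchanged",
    ifelse(delta > 0, "higher_without_imputation", "lower_without_imputation")
  )
  merged
}

# Toy stand-ins for the ProteinLevelData tables of the two dataProcess runs
with_imp <- data.frame(Protein = c("P1", "P1", "P2"), RUN = c(1, 2, 1),
                       LogIntensities = c(10, 12, 8))
without_imp <- data.frame(Protein = c("P1", "P1", "P2"), RUN = c(1, 2, 1),
                          LogIntensities = c(11, 12, 7))

result <- compare_quant(with_imp, without_imp)
table(result$change)  # counts per category
```

With real data, the two toy tables would be replaced by the `ProteinLevelData` element of each `dataProcess` result, assuming that element name matches your MSstats version.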

Could you help explain to me what has changed in MSstats treatment of the data to cause the increase / decrease in protein-level quantitation?

I can possibly explain why some proteins would have increased quantitation with imputation turned off. When MSstats imputes missing values near the limit of quantification, these low values "drag down" the summarized value of the protein in that sample. Is my understanding on this point correct?

However, I can't explain why many proteins have actually decreased in quantitation when imputation is turned off. My best guess is that, when "MBimpute=TRUE," the imputed fragments are added to the overall intensity of that protein, so disabling imputation removes these fragments from the equation. However, this does not make sense to me based on how Devon has explained imputation in the previous post.

I would appreciate any insights you could give on this topic.

Many thanks,

Jonathan

Devon Kohler

Jun 27, 2023, 5:31:07 PM

to MSstats

Hi Jonathan,

Apologies for the delayed response. I have been traveling most of the month and wanted to give this question the attention it requires.

censoredInt

First, on the "censoredInt" parameter. This parameter tells MSstats which missing values are censored (i.e. missing because of low abundance). The default is "NA", meaning that any NA value is treated as censored. However, some tools (mainly Skyline) report a censored value as a very low number, either zero or close to zero. In these cases "censoredInt" needs to be set to "0". This treats the small values as censored, while any NA values are treated as missing at random (MAR) and are not imputed. Setting "censoredInt" to NULL treats all missing values as MAR and essentially disables imputation. This option is redundant and should probably be removed.
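
To make the three settings concrete, here is a rough sketch of the classification described above. It is not the MSstats implementation: `classify_missing` and the `tiny` cutoff for "close to zero" are hypothetical, and how near-zero values are handled under "NA" or NULL is simply left untouched here.

```r
# Hypothetical illustration of how censoredInt decides which values are
# candidates for imputation ("censored") versus missing at random ("MAR").
classify_missing <- function(intensity, censoredInt = "NA", tiny = 1) {
  status <- rep("observed", length(intensity))
  is_na <- is.na(intensity)
  is_tiny <- !is_na & intensity <= tiny  # assumed cutoff, not from MSstats

  if (is.null(censoredInt)) {
    status[is_na] <- "MAR"        # nothing is treated as censored
  } else if (censoredInt == "NA") {
    status[is_na] <- "censored"   # NA values are imputation candidates
  } else if (censoredInt == "0") {
    status[is_tiny] <- "censored" # near-zero values are candidates
    status[is_na] <- "MAR"        # NA values are not imputed
  }
  status
}

x <- c(1500, NA, 0, 820)
classify_missing(x, "NA")
classify_missing(x, "0")
classify_missing(x, NULL)
```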

Imputation effect on quantification

Next on quantified values going up and down with imputation. I ran the same test as you on a DDA experiment processed with FragPipe and looked at quantification on the "Group" level. Note I did see different results than you, with more quantifications decreasing with imputation than increasing, but this isn't necessarily an issue (will talk about it further down).

- 6,175 quantifications decreased with imputation enabled
- 1,354 quantifications increased with imputation enabled
- 2,432 did not change (because no imputation was performed)

Generally, I would expect the quantification to drop when we impute, because the missing values are close to the limit of detection. However, there will always be some cases where it increases. I want to go through a couple of examples so that we can empirically see what the imputation is doing.

Imputation makes quantification go down

Here is an example protein where quant goes down.

Here we can see that the summarized protein-level results (dark red) are much higher without imputation (runs 4 and 5 in particular). Without imputation they follow the few high intensity features that are observed, which naturally makes the summarized value higher. With imputation (bright red points) the summarized results are lower because most of the imputed points are lower. This is what I would generally expect to happen because the missing points are missing due to low abundance and we impute them at these low values.
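
The same effect can be reproduced on a toy example. This is a sketch with made-up numbers, not the protein from the plot: the imputed values are hypothetical, and the run summary is taken as the overall effect plus the run effect from `stats::medpolish`, which is my assumption about how the summarization is combined.

```r
# Toy log2 intensities (rows = features, columns = runs).
# In run1 only the most intense feature A was observed.
observed <- rbind(
  A = c(20, 20, 20),
  B = c(NA, 15, 15),
  C = c(NA, 10, 10)
)
colnames(observed) <- c("run1", "run2", "run3")

# Hypothetical imputed values, below each feature's observed intensities
imputed <- observed
imputed["B", "run1"] <- 13
imputed["C", "run1"] <- 8

# Run-level summary: overall effect + column (run) effect
run_summary <- function(x) {
  fit <- stats::medpolish(x, na.rm = TRUE, trace.iter = FALSE)
  fit$overall + fit$col
}

down_before <- run_summary(observed)[["run1"]]  # without imputation
down_after <- run_summary(imputed)[["run1"]]    # with imputation
c(before = down_before, after = down_after)
```

In this toy case the run1 summary drops from 15 to 13 once the low imputed values are included, while run2 and run3 stay at 15.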

Imputation makes quantification go up

This is a more complex situation, and there are a number of reasons this could happen. I will run through a couple examples below.

Protein 1

This protein is an interesting case because there are only two features, but the higher intensity feature is missing a lot of values even though it is further from the limit of detection. We impute the values for the higher intensity feature as lower than what was observed (see runs 1 and 2 vs run 3). However, the summarized results are strongly affected because the imputed values are higher than the other observed feature. This makes the final summarized (quantified) value higher in runs 1, 2, and 4. Empirically, I think these imputed values generally make sense, and the summarized values are much more consistent after imputation. In my opinion this case is not a problem.

Protein 2

This one might be harder to see. I did quantification on the condition level, grouping the first 3 and last 3 runs separately. Here run 2 in particular goes up after imputation, although you'll notice that we actually don't even impute in that run. In this case the imputed values in other runs affected the summarization of Run 2. We use Tukey's Median Polish for summarization, which is similar to median summarization but is also impacted by the values in other runs. Due to the summarization on the imputed data, we see a higher quantification even though there was no imputation in this run. As with the last protein I do not see this as a problematic imputation.
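
Here is a small toy example of that cross-run effect. It is a sketch with made-up numbers (not the data behind Protein 2): the imputed value of 18 is hypothetical, and the run summary is assumed to be the overall effect plus the run effect from `stats::medpolish`.

```r
# Toy log2 intensities (rows = features, columns = runs).
# Feature A is observed only in run2; run2 itself has no missing values.
observed <- rbind(
  A = c(NA, 20, NA, NA, NA),
  B = c(15, 12, 15, 15, 15),
  C = c(10, 13, 10, 10, 10)
)
colnames(observed) <- paste0("run", 1:5)

# Hypothetical imputation: fill feature A's missing runs with 18
imputed <- observed
imputed["A", is.na(imputed["A", ])] <- 18

# Run-level summary: overall effect + column (run) effect
tmp_summary <- function(x) {
  fit <- stats::medpolish(x, na.rm = TRUE, trace.iter = FALSE)
  fit$overall + fit$col
}

summary_observed <- tmp_summary(observed)
summary_imputed <- tmp_summary(imputed)
rbind(observed = summary_observed, imputed = summary_imputed)
```

In this toy case the run2 summary moves from 15 to 17 even though nothing was imputed in run2. Imputing feature A in the other runs lowers its row effect from 20 to 18, which raises its residual in run2 and shifts the run2 median.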

Protein 3

I wanted to highlight this protein, because while the previous two showed reasonable imputations, this one is problematic. Here we only have two features and in particular in condition 2 (run 4) we only have one observation. In condition 2 the summarized value is higher due to imputation, but it is very unstable. In run 2 we also impute but that value looks really off and is much lower than everything else in condition 1. In this case I would be wary of trusting the imputation and the quantification increasing is an indication of an issue.

Final recommendations

I hope I was able to give you a good overview of why you might see quantification values increase even though we impute with the assumption that values are missing for reasons of low abundance. Going back to your data, it is odd that you see such a high percentage increasing after imputation. I would recommend spot checking some of these with profile plots and just seeing if everything looks correct, or if you get weird results like Protein 3 above.

In particular, I would recommend avoiding imputation when your proteins have very low feature counts (1-3), although this is not always necessary, as seen in Protein 1 above.
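
One way to find such proteins is to count distinct features per protein. The sketch below assumes a feature-level table with `PROTEIN` and `FEATURE` columns (as I understand the `FeatureLevelData` output of `dataProcess`); `low_feature_proteins` and the toy table are hypothetical.

```r
# Hypothetical helper: list proteins with at most `max_features` distinct features
low_feature_proteins <- function(feature_data, max_features = 3) {
  pairs <- unique(feature_data[, c("PROTEIN", "FEATURE")])
  counts <- table(as.character(pairs$PROTEIN))
  names(counts)[as.vector(counts) <= max_features]
}

# Toy feature-level table: P1 has 2 features, P2 has 5, each seen in 2 runs
toy <- data.frame(
  PROTEIN = c(rep("P1", 4), rep("P2", 10)),
  FEATURE = c("a", "b", "a", "b", rep(paste0("f", 1:5), 2)),
  RUN = c(1, 1, 2, 2, rep(1, 5), rep(2, 5))
)

flagged <- low_feature_proteins(toy)
flagged
```

The flagged proteins are the ones worth spot checking with profile plots before trusting their imputed summaries.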

I have attached the code I used to generate these plots if it will help. I know the included profile plots in MSstats do not show imputed values currently.

Let me know if you have any questions!

Devon

profile_plot_code.R