Biomolecular condensates help cells organise their content in space and time. Cells harbour a variety of condensate types with diverse composition and many are likely yet to be discovered. Here, we develop a methodology to predict the composition of biomolecular condensates. We first analyse available proteomics data of cellular condensates and find that the biophysical features that determine protein localisation into condensates differ from known drivers of homotypic phase separation processes, with charge mediated protein-RNA and hydrophobicity mediated protein-protein interactions playing a key role in the former process. We then develop a machine learning model that links protein sequence to its propensity to localise into heteromolecular condensates. We apply the model across the proteome and find many of the top-ranked targets outside the original training data to localise into condensates as confirmed by orthogonal immunohistochemical staining imaging. Finally, we segment the condensation-prone proteome into condensate types based on an overlap with biomolecular interaction profiles to generate a Protein Condensate Atlas. Several condensate clusters within the Atlas closely match the composition of experimentally characterised condensates or regions within them, suggesting that the Atlas can be valuable for identifying additional components within known condensate systems and discovering previously uncharacterised condensates.
For decades, membrane-bound organelles have been recognised as the key mechanism by which eukaryotic cells achieve compartmentalisation. This compartmentalisation allows cells to carry out multiple biological processes simultaneously by creating distinct biochemical environments for each. Biomolecular condensates have been proposed to offer cells an additional layer of spatial organisation that is more dynamic than what membrane-bound organelles can provide1,2,3,4,5. To date, numerous biomolecular condensate systems have been identified with several found to regulate key cellular functions, including gene expression, stress response and signal transduction6,7,8. Because of their broad functional roles, condensates have also become promising targets for drug discovery9,10,11. This has sparked significant interest in understanding their composition and the factors that affect it.
The number of different condensate types in cells is large with many systems likely yet to be discovered. Experimental techniques used for condensate characterisation either yield sensitive information about a handful of candidate targets (top left) or permit hypothesis-free characterisation without the requirement for pre-defined probes but offer limited resolution in determining protein co-localisation into the same condensate type (bottom right). Here, we combined predictive machine learning models with experimental data from protein interaction and biomolecular condensation studies (top right) to make proteome-wide predictions on the composition of heteromolecular condensates. Image drawn with the aid of BioRender.com.
To gain insight into which proteins are present in biomolecular condensates, we analysed the mass spectrometry data from lysate-reconstituted NPM1-condensates that was recently gathered by Freibaum et al. (Supplementary Dataset 1)17. NPM1 is an RNA-binding protein that is known to play a central role in the formation of the nucleolus. Purified NPM1 can induce condensation from cell lysate and forms biomolecular condensates that recapitulate the nucleolus. As the dataset only included information on proteins that were enriched into condensates, we combined this information with a mass spectrometry-based proteomics study that had identified a total of 7273 proteins in this same cell line (U2OS; Supplementary Dataset 2)27. We found 1008 of these proteins to overlap with the proteins that had been detected in condensates (Fig. 2a, green).
We compared the predictive power of each approach by estimating the areas under the receiver operating characteristic curve (auROC; Fig. 4b) and the precision-recall curve (auPRC; Fig. 4c) on an independent held-out test set. The model relying solely on the DeePhase score as its input performed notably worse than all other approaches. As also suggested by the earlier analysis (Figs. 2, 3), this was likely because homotypic phase separation propensity is not the only factor that determines protein localisation into condensates. The remaining four strategies performed comparably well and reached auROC and auPRC values as high as 0.78 and 0.57, respectively. We note that model performance did not improve substantially when an explicit feature characterising the RNA-binding character of the sequences was included. One advantage of building models without this feature is the ability to make predictions for every sequence, regardless of whether an RNA-binding annotation is available.
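As an illustration of the two metrics used in this comparison, the sketch below computes auROC through the pairwise ranking identity and auPRC as average precision. It is a minimal, standard-library example on made-up labels and scores, not the evaluation code used in the study. The function names and the toy data are assumptions, and ties between scores are not treated specially in the precision-recall estimate.

```python
def au_roc(labels, scores):
    """auROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def au_prc(labels, scores):
    """Average precision, a step-wise estimate of the area under the PR curve."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Walk down the ranking from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    total_precision = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total_precision += true_positives / rank
    return total_precision / n_pos


# Toy test set (hypothetical): 1 = found in condensates, 0 = not found.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(f"auROC = {au_roc(labels, scores):.3f}")
print(f"auPRC = {au_prc(labels, scores):.3f}")
```

Because the positive class is a minority of the detected proteome (1008 of 7273 proteins), the auPRC of a random ranking would sit near the positive fraction rather than near 0.5, which is why the two metrics are worth reporting side by side.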
Finally, we deployed the model to assess the probability of each protein in the human proteome to localise into the NPM1-condensates. We observed that the predicted scores were high for the majority of the proteins within the COND+ dataset (Fig. 4, blue distribution) but also for several proteins that had not been seen to localise into the NPM1 reconstituted condensates by Freibaum et al. (green distribution). To verify whether any of these proteins are true positives and localise into condensates, we turned to the Human Protein Atlas database, which has performed immunohistostaining for a large number of proteins across the human proteome using U2OS as one of its model cell lines37. We visually inspected the images of the 10 highest-scoring and 10 lowest-scoring proteins after excluding proteins present in the training data. We observed the former set to be enriched in proteins that localise into condensates (Fig. 4d top row; four of the top 10 targets showed clear condensates and a few more showed less well-defined condensates, as summarised in Supplementary Dataset 7) relative to the latter set (bottom row). The images further suggested that these proteins localised into the nucleus and could be part of the nucleolus, of which NPM1 is a key component38. Taken together, these results clearly demonstrate the ability of the model to extend beyond the sequence space covered in the training set.
Motivated by this observation, we set out to explore if it is possible to partition our model-predicted condensation-prone proteome into individual condensate systems by utilising previously characterised biomolecular interaction profiles to identify which proteins co-localise into the same condensate system. The development of a capability to predict the composition of heteromolecular condensates would be a qualitative advancement over approaches published to date, which have focused on predicting homotypic phase separation propensity or protein localisation into condensates22,23. To create such a Protein Condensate Atlas, we first defined the predicted condensate localising proteome as proteins with a high intrinsic phase separation propensity (DeePhase score above 0.75) or a predicted partitioning score above 0.25 (Fig. 4a, green line; the optimal threshold was determined using the Youden J-statistic). We then integrated information on biomolecular interactions from the StringDB database and consensus clustered the interaction profiles of the human proteome (Methods; Fig. 5b). This process yielded a total of 133 clusters. For 62 of these, our models predicted that at least half of the proteins would localise into condensates, suggesting that they may correspond to condensate systems. The composition of the predicted condensate clusters, which we refer to as our predicted Protein Condensate Atlas, can be found in Supplementary Dataset 8. We note that our algorithm predicts each protein to be located in only a single cluster.
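The selection steps described above can be sketched in a few lines. The example below is an assumption-laden illustration rather than the pipeline used in the study: it shows how a cut-off can be chosen with the Youden J-statistic, how the two stated thresholds (DeePhase score above 0.75, partitioning score above 0.25) combine, and how clusters with at least half of their members predicted to localise into condensates can be flagged. All function names, protein identifiers, cluster identifiers and scores are hypothetical.

```python
def youden_threshold(labels, scores):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    A protein is called positive when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


DEEPHASE_CUTOFF = 0.75   # threshold quoted in the text
PARTITION_CUTOFF = 0.25  # threshold quoted in the text


def is_condensate_prone(deephase_score, partition_score):
    """High homotypic propensity OR high predicted partitioning score."""
    return deephase_score > DEEPHASE_CUTOFF or partition_score > PARTITION_CUTOFF


def flag_condensate_clusters(clusters, protein_scores, min_fraction=0.5):
    """Return ids of clusters where at least min_fraction of members are prone."""
    flagged = []
    for cluster_id, members in clusters.items():
        if not members:
            continue
        n_prone = sum(1 for p in members if is_condensate_prone(*protein_scores[p]))
        if n_prone / len(members) >= min_fraction:
            flagged.append(cluster_id)
    return flagged


# Hypothetical validation data for the threshold search.
toy_labels = [0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.5, 0.7]
threshold, j_stat = youden_threshold(toy_labels, toy_scores)

# Hypothetical proteins: id -> (DeePhase score, partitioning score).
protein_scores = {
    "P1": (0.80, 0.10), "P2": (0.20, 0.30), "P3": (0.10, 0.05),
    "P4": (0.30, 0.20), "P5": (0.50, 0.10), "P6": (0.90, 0.60),
}
# Each protein belongs to exactly one cluster, as in the Atlas.
clusters = {"c1": ["P1", "P2", "P3"], "c2": ["P4", "P5"], "c3": ["P6"]}
condensate_clusters = flag_condensate_clusters(clusters, protein_scores)
print(threshold, round(j_stat, 2), condensate_clusters)
```

In this toy setting the cluster assignment is given as input; in the study it comes from consensus clustering of StringDB interaction profiles, which is not reproduced here.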
Next, we set out to examine how biomolecular interactions and condensation propensity, which are the two key inputs into the model, correlate with the predictions of the Atlas. To this end, we first quantified the number of confirmed interaction partners of the proteins within and outside the condensate clusters, using the data reported in StringDB. We observed that proteins that were part of the condensate clusters (Fig. 5c, green) tended to have a larger number of interaction partners than proteins not predicted to localise into condensates (red). This trend aligns with our finding when we characterised the composition of NPM1-condensates (Fig. 3c).
We additionally analysed interactions within the predicted condensate clusters. Specifically, we used data from StringDB to count the number of interactions that each protein within a predicted condensate cluster can form with other proteins in the same cluster. The interactions within an exemplary condensate system are shown in Supplementary Fig. S3a, with the distributions of the interaction counts highlighted in Supplementary Fig. S3b. For this particular cluster, we found no significant difference in the number of interactions between condensate localising proteins with and without a high predicted homotypic phase separation propensity (Supplementary Fig. S3b, orange and green, respectively). However, when we performed the analysis globally across all the predicted condensate clusters, we observed that proteins without a high homotypic phase separation propensity tended to have a higher number of interaction partners (Fig. 5d; the y-axis values are normalised for condensate cluster size to allow comparison between clusters of different sizes) than proteins that had a high phase separation propensity. This observation further highlights the key role that heteromolecular interactions play in protein recruitment into condensates (Fig. 3e) and its distinction from homotypic phase separation processes.
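The within-cluster counting described here can be illustrated with a short sketch. It counts, for each protein, the distinct partners it has inside its own cluster and divides by the number of possible partners (cluster size minus one). That normalisation is an assumption made for this example; the text states only that values are normalised for cluster size. The function name, protein identifiers and interaction pairs are hypothetical and do not come from StringDB.

```python
def intra_cluster_degree(members, interactions, normalise=True):
    """Count distinct interaction partners each protein has within its cluster.

    members:      iterable of protein ids in one cluster
    interactions: iterable of (protein_a, protein_b) pairs, in any order
    normalise:    divide by (cluster size - 1); an assumed normalisation
    """
    members = set(members)
    counts = {protein: 0 for protein in members}
    seen_edges = set()
    for a, b in interactions:
        # Skip self-interactions and pairs that leave the cluster.
        if a == b or a not in members or b not in members:
            continue
        edge = frozenset((a, b))
        if edge in seen_edges:  # (A, B) and (B, A) are the same interaction
            continue
        seen_edges.add(edge)
        counts[a] += 1
        counts[b] += 1
    if normalise and len(members) > 1:
        possible = len(members) - 1
        return {protein: n / possible for protein, n in counts.items()}
    return counts


# Hypothetical cluster and interaction list.
cluster = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "A"), ("A", "C"), ("A", "X"), ("D", "D")]
raw = intra_cluster_degree(cluster, pairs, normalise=False)
scaled = intra_cluster_degree(cluster, pairs)
print(raw, scaled)
```

Averaging the scaled values separately for proteins with and without a high homotypic phase separation propensity would give the kind of group comparison described for Fig. 5d.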