On-the-fly calculation of arbitrary compounds

Kevin Shebek

unread,

Jun 1, 2020, 12:25:34 PM6/1/20

to eQuilibrator Users

Hello,

My name is Kevin Shebek, a student in the Broadbelt and Tyo labs at Northwestern University.

I am interested in calculating the Gibbs free energy of formation (to calculate reaction free energies) for arbitrary compounds at specified pH. I understand that the eQuilibrator API is only able to return values for pre-computed compounds, which I assume is due to the fact that pKa calculators are proprietary. Recently I chatted with Chris Henry and he mentioned Elad helped to update the newest modelSEED with compounds that aren't found in the API. He suggested I reach out to see if there is any way to calculate arbitrary compounds on the fly.

I have actually started to implement this sort of approach using the groups obtained by Jankowski, but if I can use the component contribution results I would prefer that.

If I am able to obtain the correct licenses for the calculators, is there a way to use eQuilibrator on-the-fly? Or will I have to instead specify specific compounds and then add the results to the compound cache for subsequent runs? Ideally I would like to be able to simply input a SMILES with the pH and ionic strength and get out a value.

Thanks,

Kevin

Elad Noor

unread,

Jun 1, 2020, 1:39:57 PM6/1/20

to eQuilibrator Users

Hi Kevin,

I'm afraid that what you are asking for is not currently supported by equilibrator-api. The problem is, indeed, the fact that we rely on ChemAxon (a proprietary software) for the pKa calculation (and also for getting a SMILES with a specific charge, but that's a minor point). I never bothered adding the option you asked for because we designed equilibrator-api to be fully open source and as easy to install as possible.

Since you already said you want to implement something yourself, I can guide you on how to do it if you want. Maybe we could even add it as an "extension" package to equilibrator.

First, you'll need to get a license for ChemAxon's Marvin suite (free for academics).

Then, if your compound is not already in the cache (e.g. a new SMILES), then you'll need the code here in order to generate the Compound object for it (instead of using the database).

When you have that, you can use it as an input to a Reaction object and then use the standard functions to get the ΔG' estimates.

I'll be happy to answer any further questions you'll have. Best,

Elad

Kevin Shebek

unread,

Jun 1, 2020, 3:07:16 PM6/1/20

to eQuilibrator Users

Hi Elad,

That helps a lot, thanks. I'll take a look at it and see if I can put something together.

Thank you for your help,

Kevin

Kevin Shebek

unread,

Jun 4, 2020, 2:38:16 PM6/4/20

to eQuilibrator Users

Hi Elad,

I have a question regarding results I obtain from the quilt database vs. values I calculate myself using equilibrator functions on a SMILES input.

I am now able to take an arbitrary SMILES to generate a compound object with the data calculated by the cxcalc program. One issue I ran into was that on macOS cxcalc does not support inchi input and your chemaxon.py only populates the input to cxcalc from the inchi column of the dataframe. I am not sure if this is something you are worried about, but it was easy enough to modify.

With this step done, I'm now ready to check to see if the results I obtain match those from the compound cache. The first thing I am looking at is to simple see if I can calculate the standard Gibbs free energy of formation by decomposing my compound to get the group vectors and then using the GibbsEnergyPredictor.standard_dgf. I am not able to get the numbers to match between these, however they are close. Some results of GibbsEnergyPredictor.standard_dgf are given below. The SMILES from the ccache was used as the input to my approach, all group vector decompositions match and the same predictor is used throughout.

It seems like the numbers match well, but never completely. I can't seem to figure out where the difference is coming from. Do you have any ideas why these values may be different?

Standard Gibbs Free Energy of Formation

ATP (kegg:C00002)

ccache: -2811.6 +/- 1.5

SMILES: -2811.6 +/- 3.3

Fructose bisphosphate (kegg:C00354)

ccache: -2598.9 +/- 1.1

SMILES: -2599.4 +/- 2.0

Ethanol (kegg:C00469)

ccache: -179.1 +/- 1.4

SMILES: -178.9 +/- 1.0

Propanol (kegg:C05979)

ccache: -175.9 +/- 2.9

SMILES: -171.7 +/- 1.1

Thank you,

Kevin

Elad Noor

unread,

Jun 5, 2020, 4:01:41 AM6/5/20

to eQuilibrator Users

I'm very happy to hear about your progress. I didn't know about the MacOS problem, but honestly we never tried to test anything on MacOS. How did you solve the problem? I would be happy to adjust the code to be multiplatform if it is not too much work. More importantly, did you encounter any issues on MacOS in the "standard way" of using equilibrator-api (i.e. just for prediction of compounds existing in the database)?

Regarding your question, I think that all the examples you provided are exactly the ones where the predictions should be different. Some of the compounds have direct data about their formation energy and "bypass" the group contribution step. ATP (kegg:C00002) is a good example for that. Therefore, when you use the group vector to calculate the dGf0 of ATP, you should get a slightly different value compared to the standard prediction. If you only considered compounds that have no direct data (i.e. they are not in the NIST or Alberty's databases), the two numbers should match exactly. Here is a code snippet for getting the IDs of these compounds (i.e. the ones you should avoid):

from equilibrator_api import ComponentContribution

cc = ComponentContribution()

cc.predictor.params.train_G.index

Kevin Shebek

unread,

Jun 5, 2020, 9:30:37 AM6/5/20

to eQuilibrator Users

I didn't do a very sophisticated fix for this early work, I just sent in the SMILES instead of the inchi to reduce the modifications done to any of your code. Then in subsequent code I just have to remember that my "inchi" column is really the SMILES. I will have access to SMILES from my analysis, so a long term solution for me would be to have the option to put in either inchi or smiles. Currently you have it hardcoded to use inchi (at least in the chemaxon.py found in equilibrator_assets).

A potential fix could perhaps allow you to input either inchi or smiles in your data frame, check the OS and the data frame to decide which column to use, and then run cxcalc. This way you can throw whatever data frame at it. This issue isn't a priority for me right now given I have a crude workaround and if no one has brought it up then I imagine there is no rush to add support for it. I was going to write code for this at some point in the future anyway so if you would like I can do that and bring it up at a later date to talk about it.

I tried the equilibrator_api and it works. The code in the readme works for the reaction "kegg:C00002 + kegg:C00001 = kegg:C00008 + kegg:C00009" However, I also tried the code in https://gitlab.com/equilibrator/equilibrator-api/-/blob/develop/scripts/equilibrator_cmd.ipynb which seems to be out of date and throws the error "modeule 'equilibrator_api' has no attribute 'parse_reaction_formula' for the first cell. Nevertheless, the readme example gave me the correct results.

What you said about the predictions makes a lot of sense, I've stared at that Alberty table long enough that I should have thought of that! In my final implementation I will check to see if the compound exists already in the database to avoid this issue altogether.

Elad Noor

unread,

Jun 5, 2020, 12:18:19 PM6/5/20

to eQuilibrator Users

Thanks for elaborating on the problem. There is a reason why we prefer to use the InChIs - they are unique identifiers. I am not sure why ChemAxon on MacOS doesn't support them, but they are essential to our pipeline (e.g. all compound descriptors in MetaNetX are given as InChIs).

I agree that if we provide users the option to add their own compounds (like you requested), then we could support both InChI and SMILES, and I'll try to follow your suggestion if that happens. And, yes, please bring it up again when you are ready.

Regarding the equilibrator_cmd scripts, thanks for pointing out the bug. I fixed it now.

Cheers,

Elad

Kevin Shebek

unread,

Jun 15, 2020, 3:29:25 PM6/15/20

to eQuilibrator Users

Hi Elad,

It looks like my arbitrary SMILES approach matches the compound cache to calculate ∆Go for all of compounds not used in the training set--I have checked about 15. There are some exceptions, for example cis,cis-Muconate, but I am looking into that.

What I wanted to ask you for help with is dealing with the calculation of microspecies. With the addition of pH and ionic strength to calculate ∆G'o there are some discrepancies between my SMILES approach and the corresponding kegg ID in the compound cache approach. I believe the difference is due to different sets of microspecies between the compound objects obtained from each method. A concrete example of this is in microspecies_example notebook in the equilibrator_assets. In your notebook database results as well as the calculated results match. However, when I run this code on my computer it appears as if the sqlite database has many more microscpecies than what is calculated (picture attached at bottom). I am not sure what is causing this discrepancy--perhaps it is something to do with chemaxon on macOS again?

What I have checked so far:

1. I input num_acidic = 20, num_basic = 20, mid_ph = 7 into chemaxon.get_dissociation_constants and use min_pH = 0 and max_pH = 14 for the subsequent filtering

2. When I calculate dissociation constants for a SMILES and compare them with the corresponding compound cache value they seem to agree. They agree in the notebook example I linked you.

Do you have an idea why compounds would have more microspecies in the database than I am calculating? For most of the compounds I checked this does not have a huge impact on the final ∆G'o, but for something where the database has more microspecies than I calculate this will make a big difference. For something like 3'-Phosphoadenylylsulfate at pH=5 and I=0 the difference is 20 kJ/mol between my method and the component cache.

Thank you,

Kevin

Elad Noor

unread,

Jun 16, 2020, 9:24:59 AM6/16/20

to eQuilibrator Users

Hi Kevin,

There is a simple explanation for this. In the latest version of the quilt compound DB, we added Magnesium species for some of the compounds.

Specifically, kegg:C00053 (3'-Phosphoadenylylsulfate) is one of them. The data doesn't come from ChemAxon, but from another table which we took from Du et al. (10.1016/j.bpj.2018.04.030).

(you can add "microspecies.number_magnesiums" to the output in your example to see that). You can also see that the charge doesn't match the number of protons...

If you don't have dissociation constants for Mg2+ for your compounds, that's perfectly fine. Their impact is usually quite small.

More precisely, these micro-species should have some impact on the formation energy only when the pMg parameter is low (less than 5). If you set it to 14, you shouldn't see a difference anymore.

kevinsh...@u.northwestern.edu

unread,

Jun 16, 2020, 9:41:55 AM6/16/20

to eQuilibrator Users

Hi Elad,

I realized that yesterday after I posted when I saw the charge doesn't match the protons and the issues I was having corresponded exactly to the compounds in the magnesium_pkds.csv. Thank you for the clarification, it makes a lot of sense.

Another question: I don't want to recalculate compound information if it already exists in the quilt compound DB to avoid cxcalc calls if I can. Is there a good way to check to see if a compound exists in the DB based on an input SMILES? What I was thinking of doing was to generate an uncharged, canonical SMILES for each of the compounds in the DB and pairing this with the internal ID of each compound to make a dictionary where the keys are SMILES and values are the ID. With this I will check to see if an input SMILES (uncharged, canonical) is in the keys and if it is I can use the internal ID to look up the compound, otherwise I will calculate the information. If you have any better ideas I would be grateful to hear them.

Thank you,

Kevin

Elad Noor

unread,

Jun 16, 2020, 9:49:00 AM6/16/20

to eQuilibrator Users

Hi,

The best solution would be to use InChIKeys.

We already have the field Compound.inchi_key indexed so the search will be very fast.

You should be able to convert SMILES to InChIKey using openbabel (and also RDkit, I think) and easily search the table using it.

kevinsh...@u.northwestern.edu

unread,

Jun 16, 2020, 1:54:30 PM6/16/20

to eQuilibrator Users

Hi Elad,

Are you talking about using the search_compound_by_inchi_key method in the CompoundCache class? What I did was convert a SMILES into the inchi key, remove the last two characters from the key (hyphen and a character), and then put that into the search_compound_by_inchi_key. That returns a list of matches and seems to work fine. The only question I have is that sometimes it returns multiple database entries.

For example using the exact inchi key for glyoxylate:

from equilibrator_cache import create_compound_cache_from_quilt

ccache = create_compound_cache_from_quilt()

inchi_key = ccache.get_compound_by_internal_id(66).inchi_key

results = ccache.search_compound_by_inchi_key(inchi_key)

this returns 4 compounds with different IDs, but the same inchi keys: [Compound(id=66, inchi_key=HHLFWLYXYJOTON-UHFFFAOYSA-M), Compound(id=692385, inchi_key=HHLFWLYXYJOTON-UHFFFAOYSA-M), Compound(id=692448, inchi_key=HHLFWLYXYJOTON-UHFFFAOYSA-M), Compound(id=692919, inchi_key=HHLFWLYXYJOTON-UHFFFAOYSA-M)]

Is it safe to just take the smallest ID if there are multiple compounds?

Kevin

Elad Noor

unread,

Jun 16, 2020, 5:16:57 PM6/16/20

to eQuilibrator Users

Yes, we do have a few duplicates that were copied over from the MetaNetX database.

Most of them are IDs that are marked "deprecated" but we still have them for backward compatibility. Unfortunately, the information regarding which one is deprecated is currently not stored.

When we update the compound database next time (to MetaNetX version 4) most of these issues should be solved,

but for now I think your best shot is indeed to pick the compounds with the lowest internal ID.

kevinsh...@u.northwestern.edu

unread,

Jun 17, 2020, 9:11:22 AM6/17/20

to eQuilibrator Users

Thank you for the information--I will just take the lowest ID.

I now have code that is in a workable form for me to insert a SMILES, check to see if that exists in the DB already or calculate properties via cxcalc, and then return the ∆Go'(pH, I). I plan to make the code into a nicer form to easily allow my coworkers to use it. If you still wanted to turn this into an "extension" for eQuilibrator I think it would be good to get your input into the design so it matches your overall design philosophy for eQuilibrator.

I can't work on this for the next week or so, but if you are interested an "extension" and being involved in the design please let me know so I can message you closer to when I am going to start working on this again so I can get your input.

Thank you again for all of your help!

Kevin

Elad Noor

unread,

Jun 17, 2020, 12:22:14 PM6/17/20

to eQuilibrator Users

Yes, that would probably be a very useful extension for many people. I'll be very happy to include it in the main package, or highlight it as an extension package, whatever you prefer.
Thank you for making the effort and willingness to share your results!

kevinsh...@u.northwestern.edu

unread,

Jun 29, 2020, 9:08:42 AM6/29/20

to eQuilibrator Users

Hi Elad,

Is there a best way to send the code of what I have done so far for you to look at? My current set-up took pieces of the equilibrator-assets folder with added files from component-contribution/scripts/support (specifically molecule.py and group_decompose.py). I have an ipynb as well that shows examples of the code. It is probably easiest for me to share the folder itself. Please, let me know what is most convenient for you!

Kevin

Elad Noor

unread,

Jun 29, 2020, 9:53:40 AM6/29/20

to eQuilibrator Users

Hi Kevin,

Indeed, the easiest would be if you could either attach the files here, or send it by email to my ETH address.

Perhaps it would be better if you forked the component-contribution project on GitLab and add your files there, but it's up to you.

Elad

Elad Noor

unread,

Jul 29, 2020, 12:32:09 PM7/29/20

to eQuilibrator Users

Hi Kevin,

Inspired by your request and the scripts you've sent me, I rearranged some files between the repositories, and added a script that handles new compounds:

https://gitlab.com/equilibrator/equilibrator-assets/-/blob/master/src/equilibrator_assets/generate_compound.py

I've also added an example Jupyter notebook to demonstrate its use (similar to the one you sent me):

https://gitlab.com/equilibrator/equilibrator-assets/-/blob/master/notebooks/generate_compound.ipynb

Using it still requires checking out the equilibrator-assets project and installling it in a conda virtualenv (and also installing openbabel). However, it could be useful for others who want to get formation energies for uncached compounds.

Cheers,

Elad

Elad Noor

unread,

Sep 3, 2020, 9:45:18 AM9/3/20

to eQuilibrator Users

Hi Kevin,

Thank you for making the pull request. I think you made a very elegant solution and it worked pretty easily for me.

I made just a few minor changes and merged it now into the master branch.

I'll try soon to polish it a bit more, and write some clear instructions for how people can use it.