Hi Elad,
It looks like my arbitrary-SMILES approach matches the compound cache when calculating ∆G° for compounds not used in the training set; I have checked about 15 so far. There are some exceptions, for example cis,cis-Muconate, but I am looking into those.
What I wanted to ask for your help with is the calculation of microspecies. Once pH and ionic strength are included to calculate ∆G'°, there are some discrepancies between my SMILES approach and the compound cache approach using the corresponding KEGG ID. I believe the difference comes from the two methods producing compound objects with different sets of microspecies. A concrete example is the microspecies_example notebook in equilibrator_assets. In your notebook, the database results and the calculated results match. However, when I run the same code on my computer, the sqlite database appears to contain many more microspecies than are calculated (picture attached at bottom). I am not sure what is causing this discrepancy; perhaps it has something to do with ChemAxon on macOS again?
What I have checked so far:
1. I pass num_acidic = 20, num_basic = 20, mid_ph = 7 to chemaxon.get_dissociation_constants and use min_pH = 0 and max_pH = 14 for the subsequent filtering.
2. When I calculate dissociation constants for a SMILES and compare them with the corresponding compound cache values, they seem to agree. They do agree in the notebook example I mentioned above.
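To make the comparison in the two points above concrete, here is a minimal sketch of the sanity check I have in mind. It assumes that each pKa kept after filtering separates two adjacent protonation states, so n retained pKas should give n + 1 microspecies. The function names, the exclusive bounds, and the example pKa values are all hypothetical and are not taken from equilibrator_assets.

```python
def pkas_in_range(pkas, min_ph=0.0, max_ph=14.0):
    """Keep only the pKas strictly inside (min_ph, max_ph), sorted ascending.

    Whether the real filter uses inclusive or exclusive bounds is an assumption.
    """
    return sorted(p for p in pkas if min_ph < p < max_ph)


def expected_microspecies(pkas, min_ph=0.0, max_ph=14.0):
    """Number of protonation states implied by the retained pKas (n + 1)."""
    return len(pkas_in_range(pkas, min_ph, max_ph)) + 1


# Hypothetical pKa list, only for illustration.
example_pkas = [-1.2, 0.9, 3.8, 6.5, 12.0, 15.3]
print(pkas_in_range(example_pkas))
print(expected_microspecies(example_pkas))
```

If the number of microspecies stored in the database for a compound is larger than this count, the extra entries cannot come from the same filtered pKa list, which is the situation I seem to be seeing.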
Do you have an idea why compounds would have more microspecies in the database than I am calculating? For most of the compounds I checked, this does not have a large impact on the final ∆G'°, but when the database has many more microspecies than I calculate, the difference can be big. For example, for 3'-Phosphoadenylylsulfate at pH = 5 and I = 0, the difference between my method and the compound cache is 20 kJ/mol.
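To illustrate why a missing microspecies can shift the result this much, here is a minimal sketch of the standard Legendre-transform sum over protonation states at I = 0. It is not the equilibrator code. It assumes both sets share the same fully deprotonated reference species, and the pKa values are hypothetical rather than those of any real compound.

```python
import math

R = 8.31446e-3  # gas constant, kJ/(mol K)
T = 298.15      # temperature, K
RT = R * T


def transformed_dg(pkas, ph):
    """Transformed energy (kJ/mol, I = 0) of a set of microspecies,
    relative to the fully deprotonated member of the set."""
    # Add protons in order of decreasing pKa; each added proton changes the
    # transformed energy by RT * ln(10) * (pH - pKa).
    energies = [0.0]
    for pka in sorted(pkas, reverse=True):
        energies.append(energies[-1] + RT * math.log(10) * (ph - pka))
    # Combine the microspecies as -RT * ln(sum(exp(-E / RT))), shifted by the
    # lowest energy for numerical stability.
    lowest = min(energies)
    total = sum(math.exp(-(e - lowest) / RT) for e in energies)
    return lowest - RT * math.log(total)


# Hypothetical pKa sets: the "database" has more microspecies than "calculated".
database_pkas = [12.0, 6.5, 3.8, 1.0]
calculated_pkas = [12.0]
ph = 5.0

gap = transformed_dg(calculated_pkas, ph) - transformed_dg(database_pkas, ph)
print(f"difference at pH {ph}: {gap:.1f} kJ/mol")
```

When the microspecies that dominates at the chosen pH is missing from the smaller set, the gap is several kJ/mol per missing pH unit, so a difference on the order of 20 kJ/mol seems plausible if a few protonation states are absent.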
Thank you,
Kevin