performance on tox21


Taro Kiritani

Aug 21, 2020, 6:39:39 AM
to Open Graph Benchmark

Hi,

I am comparing OGB and MoleculeNet. On tox21, you reported a ROC-AUC of 77.5% in your paper, while MoleculeNet reports a ROC-AUC of 85%:

http://moleculenet.ai/latest-results

Even with logistic regression, MoleculeNet reported 78%. Do you know why the results are so different?


The main differences, in my observation, are:

  1. OGB uses a scaffold split whereas MoleculeNet uses a random split.
  2. Atom features are embedded differently.
  3. The GNN networks are different.
  4. Maybe more differences.


I modified your example (https://github.com/snap-stanford/ogb/blob/master/examples/graphproppred/mol/main_pyg.py) to use a random split, but the ROC-AUC is still around 75%. I am now trying to embed node features the way MoleculeNet does. Where can I find the script that prepares the tox21 dataset in OGB? Thanks a lot for your help.


Taro Kiritani

Weihua Hu

Aug 21, 2020, 9:57:34 AM
to Open Graph Benchmark
Hi Taro,

Great questions!
The performance difference is due to the scaffold split, which is much harder than a random split. Regarding the random split, I remember getting around 86% on ogbg-moltox21 with GIN-virtual, so I suggest you double-check your random split implementation.

The pre-processing script for all the molecule datasets is available at https://github.com/snap-stanford/ogb/blob/master/examples/graphproppred/mol/smiles2graph.py
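One way to see why a scaffold split is harder: a split that ignores scaffolds tends to put molecules with the same scaffold on both sides of the train/test boundary, while a scaffold split keeps each scaffold group on one side. Below is a minimal sketch with made-up data; the scaffold labels and the helper `shared_scaffolds` are hypothetical and not part of OGB.

```python
def shared_scaffolds(train_idx, test_idx, scaffold_of):
    """Return the scaffold labels that occur in both train and test."""
    train_scaffolds = {scaffold_of[i] for i in train_idx}
    test_scaffolds = {scaffold_of[i] for i in test_idx}
    return train_scaffolds & test_scaffolds


# Hypothetical scaffold label for each of 8 toy molecules.
scaffold_of = ["A", "A", "A", "B", "B", "C", "C", "D"]

# A split that ignores scaffolds: every other molecule is held out.
mixed_train, mixed_test = [0, 2, 4, 6], [1, 3, 5, 7]

# A scaffold-aware split: whole groups stay on one side.
grouped_train, grouped_test = [0, 1, 2, 3, 4], [5, 6, 7]

print(sorted(shared_scaffolds(mixed_train, mixed_test, scaffold_of)))      # ['A', 'B', 'C']
print(sorted(shared_scaffolds(grouped_train, grouped_test, scaffold_of)))  # []
```

In the first split the test set contains scaffolds the model has already seen during training; in the second it does not, so the model has to generalize to unseen scaffolds.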

Hope this helps!

Best,
Weihua 

Taro Kiritani

Aug 23, 2020, 9:37:37 PM
to Open Graph Benchmark
Hi Weihua,

Thanks a lot for your quick response. It was indeed my fault, and I got > 80% roc-auc by properly splitting the dataset randomly.
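The thread does not say what went wrong in the first attempt. For comparison, a seeded random split over dataset indices might look like the sketch below; `random_split` is a hypothetical helper written for illustration, not OGB code, and the 80/10/10 fractions are an assumption.

```python
import random


def random_split(num_items, frac_train=0.8, frac_valid=0.1, seed=0):
    """Shuffle indices 0..num_items-1, then cut them into train/valid/test."""
    indices = list(range(num_items))
    # Shuffle before slicing; slicing the unshuffled order is not a random split.
    random.Random(seed).shuffle(indices)
    train_end = round(frac_train * num_items)
    valid_end = round((frac_train + frac_valid) * num_items)
    return indices[:train_end], indices[train_end:valid_end], indices[valid_end:]


train_idx, valid_idx, test_idx = random_split(100)
print(len(train_idx), len(valid_idx), len(test_idx))  # 80 10 10
```

Fixing the seed keeps the split reproducible across runs, which makes it easier to compare results between models.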

I am aware of smiles2graph.py, but mol2dataobj() in this module is not used anywhere in the repo. I would also like to see how the scaffold split is done. If you can share the code, that would be great.

Best,

Taro

--
You received this message because you are subscribed to a topic in the Google Groups "Open Graph Benchmark" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/open-graph-benchmark/0KML08gNVcM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to open-graph-bench...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/open-graph-benchmark/4f21f9d8-2f75-429f-bf0a-ca1c30c77630o%40googlegroups.com.

Weihua Hu

Aug 23, 2020, 9:51:16 PM
to Open Graph Benchmark
Hi Taro,

Great!

We applied smiles2graph.py to the list of SMILES strings stored in `mapping/` (this step is not part of the public repo).
For quick reference, I have pasted below the code we used to obtain our scaffold split, assuming a list of SMILES strings is given.

Best,
Weihua

```python
import numpy as np
from rdkit.Chem.Scaffolds import MurckoScaffold

def generate_scaffold(smiles, include_chirality=False):
    """
    Obtain the Bemis-Murcko scaffold from a SMILES string
    :param smiles: SMILES string of the molecule
    :param include_chirality: whether to keep chirality in the scaffold
    :return: SMILES string of the scaffold
    """
    scaffold = MurckoScaffold.MurckoScaffoldSmiles(
        smiles=smiles, includeChirality=include_chirality)
    return scaffold


# Test for generate_scaffold (uncomment to run):
# s = 'Cc1cc(Oc2nccc(CCC)c2)ccc1'
# scaffold = generate_scaffold(s)
# assert scaffold == 'c1ccc(Oc2ccccn2)cc1'

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1, frac_test=0.1):
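    """
    Split a dataset by Bemis-Murcko scaffolds. The split is deterministic.
    :param smiles_list: list of SMILES strings
    :param frac_train: target fraction of molecules in the training set
    :param frac_valid: target fraction of molecules in the validation set
    :param frac_test: target fraction of molecules in the test set
    :return: lists of train, valid, and test indices corresponding to the
    scaffold split
    """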
    np.testing.assert_almost_equal(frac_train + frac_valid + frac_test, 1.0)

    # create dict of the form {scaffold_i: [idx1, idx....]}
    all_scaffolds = {}
    for i, smiles in enumerate(smiles_list):
        scaffold = generate_scaffold(smiles, include_chirality=True)
        if scaffold not in all_scaffolds:
            all_scaffolds[scaffold] = [i]
        else:
            all_scaffolds[scaffold].append(i)

    # sort from largest to smallest sets
    all_scaffolds = {key: sorted(value) for key, value in all_scaffolds.items()}
    all_scaffold_sets = [
        scaffold_set for (scaffold, scaffold_set) in sorted(
            all_scaffolds.items(), key=lambda x: (len(x[1]), x[1][0]), reverse=True)
    ]

    # get train, valid test indices
    train_cutoff = frac_train * len(smiles_list)
    valid_cutoff = (frac_train + frac_valid) * len(smiles_list)
    train_idx, valid_idx, test_idx = [], [], []
    for scaffold_set in all_scaffold_sets:
        if len(train_idx) + len(scaffold_set) > train_cutoff:
            if len(train_idx) + len(valid_idx) + len(scaffold_set) > valid_cutoff:
                test_idx.extend(scaffold_set)
            else:
                valid_idx.extend(scaffold_set)
        else:
            train_idx.extend(scaffold_set)

    assert len(set(train_idx).intersection(set(valid_idx))) == 0
    assert len(set(train_idx).intersection(set(test_idx))) == 0
    assert len(set(test_idx).intersection(set(valid_idx))) == 0

    return train_idx, valid_idx, test_idx
```
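To follow the greedy assignment without installing RDKit, here is a toy sketch that replaces scaffold SMILES with hypothetical letter labels. `group_split` is an illustrative stand-in that mirrors the grouping, sorting, and assignment loop of the code above; it is not OGB code, and the 50/25/25 fractions are chosen only to keep the arithmetic easy to check.

```python
def group_split(labels, frac_train=0.8, frac_valid=0.1):
    """Greedy split by group, with labels standing in for scaffold SMILES."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)

    # Largest groups first; ties are ordered by first index (also reversed).
    ordered = sorted(groups.values(), key=lambda g: (len(g), g[0]), reverse=True)

    train_cutoff = frac_train * len(labels)
    valid_cutoff = (frac_train + frac_valid) * len(labels)
    train_idx, valid_idx, test_idx = [], [], []
    for group in ordered:
        if len(train_idx) + len(group) > train_cutoff:
            if len(train_idx) + len(valid_idx) + len(group) > valid_cutoff:
                test_idx.extend(group)
            else:
                valid_idx.extend(group)
        else:
            train_idx.extend(group)
    return train_idx, valid_idx, test_idx


# Twelve toy molecules: group sizes are A=5, B=3, C=2, D=1, E=1.
labels = ["A", "A", "B", "A", "C", "B", "A", "D", "B", "A", "C", "E"]
train_idx, valid_idx, test_idx = group_split(labels, frac_train=0.5, frac_valid=0.25)
print(train_idx)  # [0, 1, 3, 6, 9, 11]
print(valid_idx)  # [2, 5, 8]
print(test_idx)   # [4, 10, 7]
```

Note that the singleton group E still lands in the training set after larger groups have gone to validation and test: each group is placed in the first partition that has room for it, so the largest groups fill training first and the small, rare groups end up mostly in validation and test.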

Taro Kiritani

Aug 23, 2020, 9:56:05 PM
to Open Graph Benchmark
Hi Weihua,

This is wonderful. I will look into it. Many thanks for your help!

Taro
