9: Pre-evaluated CS50K with Active Learning#

Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole

Overview#

An AL study using precomputed Gnina scores.

import pandas as pd
import prody
from rdkit import Chem

import fegrow
from fegrow import ChemSpace

from fegrow.testing import core_5R83_path, smiles_5R83_path

# create the chemical space
cs = ChemSpace()
# we're not growing the scaffold, we're superimposing bigger molecules on it
cs.add_scaffold(Chem.SDMolSupplier(core_5R83_path)[0])
# we can ignore the protein as the values have been pre-computed
cs.add_protein(None)

/home/dresio/code/fegrow/fegrow/package.py:597: UserWarning: ANI uses TORCHAni which is not threadsafe, leading to random SEGFAULTS. Use a Dask cluster with processes as a work around (see the documentation for an example of this workaround) .
  warnings.warn("ANI uses TORCHAni which is not threadsafe, leading to random SEGFAULTS. "


Dask can be watched on http://192.168.178.20:8989/status


/home/dresio/code/fegrow/fegrow/package.py:801: UserWarning: The template does not have an attachement (Atoms with index 0, or in case of Smiles the * character. )
  warnings.warn("The template does not have an attachement (Atoms with index 0, "

# switch on the caching
# I set it here to 6GB of RAM
cs.set_dask_caching(6e9)

# load 50k Smiles
oracle = pd.read_csv(smiles_5R83_path)

# remove .score == 0, which was used to signal structures that were too big
oracle = oracle[oracle.cnnaffinity!=0]

# here we add Smiles which should already have been matched
# to the scaffold (rdkit Mol.HasSubstructureMatch)
smiles = oracle.Smiles.to_list()
cs.add_smiles(smiles)

Active Learning#

Warning! Please change the logger in order to see what is happening inside of ChemSpace.evaluate. There is too much info to output it into the screen .#

import logging
logging.basicConfig(encoding='utf-8', level=logging.DEBUG)

from fegrow.al import Model, Query

# This is the default configuration
# cs.model = Model.gaussian_process()
cs.model = Model.linear()
cs.query = Query.Greedy()

# we will use the preivously computed scores for this AL study
# we're going to look up the values instead
def oracle_look_up(scaffold, h, smiles, *args, **kwargs):
    # mol, data
    return None, {"score": oracle[oracle.Smiles == smiles].iloc[0].cnnaffinity}

# the first cycle will take more time
for cycle in range(20):
    # select 2 hundred
    selections = cs.active_learning(200)
    res = cs.evaluate(selections, full_evaluation=oracle_look_up)

    print(f"AL{cycle:2d}. "
      f"Mean: {res.score.mean():.2f}, "
      f"Max: {res.score.max():.2f}, "
      f">4.8: {sum(res.score > 4.8):3d}, "
      f">5.0: {sum(res.score > 5.0):3d}, "
      f">5.2: {sum(res.score > 5.2):3d}, "
      f">5.4: {sum(res.score > 5.4):3d}, "
      )

/home/dresio/code/fegrow/fegrow/package.py:1287: UserWarning: Selecting randomly the first samples to be studied (no score data yet). 
  warnings.warn("Selecting randomly the first samples to be studied (no score data yet). ")


AL 0. Mean: 4.50, Max: 5.50, >4.8:  46, >5.0:  23, >5.2:   7, >5.4:   1, 
AL 1. Mean: 5.17, Max: 6.11, >4.8: 187, >5.0: 151, >5.2:  90, >5.4:  33, 
AL 2. Mean: 5.16, Max: 5.73, >4.8: 177, >5.0: 146, >5.2:  90, >5.4:  36, 
AL 3. Mean: 4.93, Max: 5.73, >4.8: 132, >5.0:  85, >5.2:  42, >5.4:  20, 
AL 4. Mean: 4.95, Max: 6.16, >4.8: 130, >5.0:  95, >5.2:  54, >5.4:  19, 
AL 5. Mean: 4.93, Max: 5.89, >4.8: 128, >5.0:  75, >5.2:  37, >5.4:  21, 
AL 6. Mean: 4.85, Max: 5.69, >4.8: 114, >5.0:  75, >5.2:  38, >5.4:  14, 
AL 7. Mean: 4.76, Max: 5.59, >4.8: 101, >5.0:  60, >5.2:  20, >5.4:   2, 
AL 8. Mean: 4.77, Max: 5.77, >4.8: 100, >5.0:  57, >5.2:  30, >5.4:  11, 
AL 9. Mean: 4.67, Max: 5.65, >4.8:  76, >5.0:  39, >5.2:  16, >5.4:   7, 
AL10. Mean: 4.59, Max: 5.62, >4.8:  63, >5.0:  33, >5.2:  18, >5.4:   7, 
AL11. Mean: 4.60, Max: 6.06, >4.8:  63, >5.0:  36, >5.2:  10, >5.4:   2, 
AL12. Mean: 4.92, Max: 5.78, >4.8: 138, >5.0:  89, >5.2:  45, >5.4:  15, 
AL13. Mean: 5.03, Max: 5.88, >4.8: 155, >5.0: 110, >5.2:  61, >5.4:  26, 
AL14. Mean: 5.12, Max: 6.24, >4.8: 174, >5.0: 125, >5.2:  77, >5.4:  32, 
AL15. Mean: 5.10, Max: 6.20, >4.8: 165, >5.0: 126, >5.2:  78, >5.4:  38, 
AL16. Mean: 5.12, Max: 5.98, >4.8: 177, >5.0: 144, >5.2:  75, >5.4:  31, 
AL17. Mean: 5.10, Max: 5.96, >4.8: 169, >5.0: 130, >5.2:  71, >5.4:  25, 
AL18. Mean: 5.09, Max: 5.83, >4.8: 176, >5.0: 136, >5.2:  67, >5.4:  20, 
AL19. Mean: 5.08, Max: 6.02, >4.8: 173, >5.0: 129, >5.2:  64, >5.4:  22,