4: Introduce Chemspace#
Authors: Mateusz K Bieniek, Ben Cree, Rachael Pirie, Joshua T. Horton, Natalie J. Tatum, Daniel J. Cole
Overview#
Here we introduce the ChemSpace class which: - automates protocols and takes care of CPU/cluster processing with Dask - stores data in a dataframe - employs scikit for active learning
import prody
from rdkit import Chem
import fegrow
from fegrow import RGroups, Linkers, ChemSpace
# initialise
rgroups = RGroups()
linkers = Linkers()
<frozen importlib._bootstrap>:241: RuntimeWarning: to-Python converter for boost::shared_ptr<RDKit::FilterHierarchyMatcher> already registered; second conversion method ignored.
<frozen importlib._bootstrap>:241: RuntimeWarning: to-Python converter for boost::shared_ptr<RDKit::FilterCatalogEntry> already registered; second conversion method ignored.
MolGridWidget(grid_id='m2')
MolGridWidget(grid_id='m1')
Prepare the ligand template#
The provided core structure lig.pdb
has been extracted from a crystal structure of Mpro in complex with compound 4 from the Jorgensen study (PDB: 7L10), and a Cl atom has been removed to allow growth into the S3/S4 pocket. The template structure of the ligand is protonated with Open Babel:
!obabel sarscov2/lig.pdb -O sarscov2/coreh.sdf -p 7
1 molecule converted
Load the protonated ligand into FEgrow:
init_mol = Chem.SDMolSupplier('sarscov2/mini.sdf', removeHs=False)[0]
# get the FEgrow representation of the rdkit Mol
scaffold = fegrow.RMol(init_mol)
Show the 2D (with indices) representation of the core. This is used to select the desired growth vector.
scaffold.rep2D(idx=True, size=(500, 500))
Using the 2D drawing, select an index for the growth vector. Note that it is currently only possible to grow from hydrogen atom positions. In this case, we are selecting the hydrogen atom labelled H:8 to enable growth.
# specify the connecting point
scaffold.GetAtomWithIdx(8).SetAtomicNum(0)
# create the chemical space
cs = ChemSpace()
/home/dresio/code/fegrow/fegrow/package.py:595: UserWarning: ANI uses TORCHAni which is not threadsafe, leading to random SEGFAULTS. Use a Dask cluster with processes as a work around (see the documentation for an example of this workaround) .
warnings.warn("ANI uses TORCHAni which is not threadsafe, leading to random SEGFAULTS. "
Dask can be watched on http://192.168.178.20:8989/status
The R-Group lacks initial coordinates. Defaulting to Chem.rdDistGeom.EmbedMolecule.
[11:33:21] UFFTYPER: Unrecognized atom type: *_ (0)
[11:33:21] UFFTYPER: Unrecognized atom type: *_ (3)
The R-Group lacks initial coordinates. Defaulting to Chem.rdDistGeom.EmbedMolecule.
[11:33:21] UFFTYPER: Unrecognized atom type: *_ (0)
Generated 14 conformers.
Generated 6 conformers.
Generated 23 conformers.
Removed 3 conformers.
Removed 6 conformers.
Removed 15 conformers.
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/parmed/structure.py:1799: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
coords = np.asanyarray(value, dtype=np.float64)
using ani2x
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/torchani/aev.py:16: UserWarning: cuaev not installed
warnings.warn("cuaev not installed")
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/torchani/__init__.py:59: UserWarning: Dependency not satisfied, torchani.ase will not be available
warnings.warn("Dependency not satisfied, torchani.ase will not be available")
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/torchani/resources/
using ani2x
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/torchani/resources/
failed to equip `nnpops` with error: No module named 'NNPOps'
Optimising conformer: 0%| | 0/3 [00:00<?, ?it/s]
failed to equip `nnpops` with error: No module named 'NNPOps'
Optimising conformer: 33%|███████▋ | 1/3 [00:03<00:07, 3.55s/it]
Optimising conformer: 100%|███████████████████████| 3/3 [00:06<00:00, 2.29s/it]
[Aimising conformer: 12%|██▉ | 1/8 [00:05<00:35, 5.05s/it]
[Aimising conformer: 25%|█████▊ | 2/8 [00:06<00:16, 2.74s/it]
[Aimising conformer: 38%|████████▋ | 3/8 [00:06<00:09, 1.85s/it]
[Aimising conformer: 50%|███████████▌ | 4/8 [00:08<00:06, 1.54s/it]
[Aimising conformer: 62%|██████████████▍ | 5/8 [00:08<00:03, 1.33s/it]
[Aimising conformer: 75%|█████████████████▎ | 6/8 [00:10<00:02, 1.34s/it]
[Aimising conformer: 88%|████████████████████▏ | 7/8 [00:10<00:01, 1.07s/it]
using ani2x
Optimising conformer: 100%|███████████████████████| 8/8 [00:11<00:00, 1.50s/it]
/home/dresio/software/mambaforge/envs/fegrow-onechannel/lib/python3.11/site-packages/torchani/resources/
failed to equip `nnpops` with error: No module named 'NNPOps'
Optimising conformer: 100%|███████████████████████| 8/8 [00:05<00:00, 1.59it/s]
cs.add_scaffold(scaffold)
# initially it is empty
cs
Smiles | score | h | Training | Success | enamine_searched | enamine_id | 2D |
---|
Select RGroups for your template#
R-groups can be selected interactively or programmaticaly.
We have provided a set of common R-groups (see fegrow/data/rgroups/library
), which can be browsed and selected interactively below.
Molecules from the library can alternatively be selected by name, as demonstrated below.
Finally, user-defined R-groups may be provided as .mol
files. In this case, the hydrogen atom selected for attachment should be replaced by the element symbol R. See the directory manual_rgroups
for examples.
rgroups
# retrieve the interactively selected groups
interactive_rgroups = rgroups.get_selected()
# you can also directly access the built-in dataframe programmatically
R_group_ethanol = rgroups[rgroups.Name == '*CCO'].Mol.item()
# select the R-group using the index
R_group_cyclopropane = rgroups.Mol[69]
# use SMILES
R_group_methanol = Chem.AddHs(Chem.MolFromSmiles('*CO'))
# add your R-groups from files
R_group_propanol = Chem.MolFromMolFile('manual_rgroups/propan-1-ol-r.mol', removeHs=False)
Expand your chemical space by building on top off your scaffold.#
# Adding R-groups implies that the scaffold should be used.
# The previously scaffold will be attached automatically.
# or we can use the template merged with the linker
# in which case the attachement point is not needed (R* atom is used)
cs.add_rgroups(R_group_ethanol)
cs
Smiles | score | h | Training | Success | enamine_searched | enamine_id | 2D | |
---|---|---|---|---|---|---|---|---|
0 | [H]OC([H])([H])C([H])([H])c1c([H])nc([H])c([H]... | <NA> | 8 | False | NaN | False | NaN |
linkers
# get linkers programmatically from the library
rcr_linker = linkers[linkers.Name == 'R1CR2'].Mol.item()
rocr_linker = linkers.Mol[6], # use the linker table index directly, e.g. index 6 is "R2COR1"
# pick linkers from the grid
grid_linkers = linkers.get_selected()
# use Smiles
rcor_linker = Chem.AddHs(Chem.MolFromSmiles('[*:0]CO[*:1]'))
#
Add linkers to build more structures#
# Adding R-groups implies that the scaffold should be used.
# The previously scaffold will be attached automatically.
# join a linker with the rgroups
cs.add_rgroups(rcor_linker, [R_group_methanol, R_group_propanol])
cs
Smiles | score | h | Training | Success | enamine_searched | enamine_id | 2D | |
---|---|---|---|---|---|---|---|---|
0 | [H]OC([H])([H])C([H])([H])c1c([H])nc([H])c([H]... | <NA> | 8 | False | NaN | False | NaN | |
1 | [H]OC([H])([H])OC([H])([H])c1c([H])nc([H])c([H... | <NA> | 8 | False | NaN | False | NaN | |
2 | [H]c1nc([H])c(C([H])([H])OOC([H])([H])C([H])([... | <NA> | 8 | False | NaN | False | NaN |
The R-group library can also be viewed as a 2D grid, or individual molecules can be selected for 3D view (note that the conformation of the R-group has not yet been optimised):
mol = cs[0]
mol.rep2D()
cs[0].rep3D()
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
<py3Dmol.view at 0x785b64e3ce50>
Once the ligands have been generated, they can be assessed for various ADMET properties, including Lipinksi rule of 5 properties, the presence of unwanted substructures or problematic functional groups, and synthetic accessibility.
cs.toxicity()
[11:33:21] DEPRECATION WARNING: please use MorganGenerator
[11:33:21] DEPRECATION WARNING: please use MorganGenerator
[11:33:21] DEPRECATION WARNING: please use MorganGenerator
MW | HBA | HBD | LogP | Pass_Ro5 | has_pains | has_unwanted_subs | has_prob_fgs | synthetic_accessibility | Smiles | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 123.155 | 2 | 1 | 0.616 | True | False | False | False | 7.522 | [H]OC([H])([H])C([H])([H])c1c([H])nc([H])c([H]... |
1 | 139.154 | 3 | 1 | 0.548 | True | False | True | True | 7.369 | [H]OC([H])([H])OC([H])([H])c1c([H])nc([H])c([H... |
2 | 167.208 | 3 | 0 | 1.940 | True | False | True | True | 7.715 | [H]c1nc([H])c(C([H])([H])OOC([H])([H])C([H])([... |
For each ligand, a specified number of conformers (num_conf
) is generated by using the RDKit ETKDG algorithm. Conformers that are too similar to an existing structure are discarded. Empirically, we have found that num_conf=200
gives an exhaustive search, and num_conf=50
gives a reasonable, fast search, in most cases.
If required, a third argument can be added flexible=[0,1,...]
, which provides a list of additional atoms in the core that are allowed to be flexible. This is useful, for example, if growing from a methyl group and you would like the added R-group to freely rotate.
cs.generate_conformers(num_conf=50,
minimum_conf_rms=0.5,
# flexible=[3, 18, 20])
)
Prepare the protein#
The protein-ligand complex structure is downloaded, and PDBFixer is used to protonate the protein, and perform other simple repair:
# get the protein-ligand complex structure
!wget -nc https://files.rcsb.org/download/7L10.pdb
# load the complex with the ligand
sys = prody.parsePDB('7L10.pdb')
# remove any unwanted molecules
rec = sys.select('not (nucleic or hetatm or water)')
# save the processed protein
prody.writePDB('rec.pdb', rec)
# fix the receptor file (missing residues, protonation, etc)
fegrow.fix_receptor("rec.pdb", "rec_final.pdb")
# load back into prody
rec_final = prody.parsePDB("rec_final.pdb")
File ‘7L10.pdb’ already there; not retrieving.
@> 2609 atoms and 1 coordinate set(s) were parsed in 0.05s.
@> 4638 atoms and 1 coordinate set(s) were parsed in 0.03s.
View enumerated conformers in complex with protein:
cs[0].rep3D(prody=rec_final)
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
<py3Dmol.view at 0x785b42dc5110>
Any conformers that clash with the protein (any atom-atom distance less than 1 Angstrom), are removed.
cs.remove_clashing_confs(rec_final)
cs[0].rep3D(prody=rec_final)
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
<py3Dmol.view at 0x785b42d9a950>
Optimise conformers in context of protein#
The remaining conformers are optimised using hybrid machine learning / molecular mechanics (ML/MM), using the ANI2x neural nework potential for the ligand energetics (as long as it contains only the atoms H, C, N, O, F, S, Cl). Note that the Open Force Field Parsley force field is used for intermolecular interactions with the receptor.
sigma_scale_factor
: is used to scale the Lennard-Jones radii of the atoms.
relative_permittivity
: is used to scale the electrostatic interactions with the protein.
water_model
: can be used to set the force field for any water molecules present in the binding site.
# opt_mol, energies
energies = cs.optimise_in_receptor(
receptor_file="rec_final.pdb",
ligand_force_field="openff",
use_ani=True,
sigma_scale_factor=0.8,
relative_permittivity=4,
water_model = None,
platform_name = 'CPU', # or e.g. 'CUDA'
)
Any of the rmols that have no available conformers (due to unresolvable steric clashes with the protein) can be discarded using the .discard_missing()
function. This function also returns a list of the indices that were removed, which can be helpful when carrying out data analysis.
missing_ids = cs.discard_missing()
Optionally, display the final optimised conformers. Note that, unlike classical force fields, ANI allows bond breaking. You may occasionally see ligands with distorted structures and very long bonds, but in our experience these are rarely amongst the low energy structures and can be ignored.
Conformers are now sorted by energy, only retaining those within 5 kcal/mol of the lowest energy structure:
final_energies = cs.sort_conformers(energy_range=5)
RMol index 0
RMol index 1
RMol index 2
cs[0].rep3D()
3Dmol.js failed to load for some reason. Please check your browser console for error messages.
<py3Dmol.view at 0x785b1f9ed390>
Save all of the lowest energy conformers to files and print the sorted energies in kcal/mol (shifted so that the lowest energy conformer is zero).
cs.to_sdf("optimised_molecules.sdf")
cs[0].to_file("best_conformers0.pdb")
print(final_energies)
Energy
ID Conformer
NaN 0 0.000
1 0.014
2 2.231
0 0.000
1 0.985
2 2.810
3 2.931
0 0.000
1 0.129
2 0.455
3 2.351
4 2.497
5 2.676
6 3.266
The conformers are scored using the Gnina molecular docking program and convolutional neural network scoring function. [Note that this step is not supported on macOS]. If unavailable, the Gnina executable is downloaded during the first time it is used. The CNNscores may also be converted to predicted Kd (nM) (see column "Kd").
affinities = cs.gnina(receptor_file="rec_final.pdb")
affinities
CNNaffinity | Kd | ||
---|---|---|---|
ID | Conformer | ||
0 | 0 | 3.165 | 684258.1832942407 |
1 | 3.134 | 734243.3130889146 | |
2 | 3.064 | 862740.1309222322 | |
1 | 0 | 2.950 | 1121656.816578837 |
1 | 2.980 | 1047948.6439058614 | |
2 | 3.063 | 865764.9506603737 | |
3 | 2.944 | 1136920.2450490897 | |
2 | 0 | 3.222 | 599777.2657145987 |
1 | 3.198 | 634278.5139306292 | |
2 | 3.217 | 607351.3483774511 | |
3 | 3.271 | 535685.634451859 | |
4 | 3.285 | 519254.1784516523 | |
5 | 3.293 | 509764.98369178106 | |
6 | 3.335 | 462178.7780293703 |
Predicted binding affinities may be further refined using the structures output by FEgrow
, using your favourite free energy calculation engine. See our paper for an example using SOMD to calculate the relative binding free energies of 13 Mpro inhibitors.
# display units
affinities.Kd
ID Conformer
0 0 684258.1832942407
1 734243.3130889146
2 862740.1309222322
1 0 1121656.816578837
1 1047948.6439058614
2 865764.9506603737
3 1136920.2450490897
2 0 599777.2657145987
1 634278.5139306292
2 607351.3483774511
3 535685.634451859
4 519254.1784516523
5 509764.98369178106
6 462178.7780293703
Name: Kd, dtype: pint[nanomolar][Float64]