BORIS Portal

Bern Open Repository and Information System

Randomized SMILES strings improve the quality of molecular generative models

BORIS DOI
10.7892/boris.138507
Publisher DOI
10.1186/s13321-019-0393-0
Description
Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and represent the target chemical space more accurately. Specifically, a model trained with randomized SMILES was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL, and they illustrate again that training with randomized SMILES leads to models with a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least twice as many unique molecules, with the same distribution of properties, as one trained with canonical SMILES.
Date of Publication
2019
Publication Type
Article
Subject(s)
500 Science > 570 Life sciences; biology
500 Science > 540 Chemistry
Language(s)
en
Contributor(s)
Arus Pous, Josep
Departement für Chemie und Biochemie (DCB)
Johansson, Simon Viet
Prykhodko, Oleksii
Bjerrum, Esben Jannik
Tyrchan, Christian
Reymond, Jean-Louis
Departement für Chemie und Biochemie (DCB)
Chen, Hongming
Engkvist, Ola
Additional Credits
Departement für Chemie und Biochemie (DCB)
Series
Journal of cheminformatics
Publisher
Springer
ISSN
1758-2946
Access(Rights)
open.access
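
Illustrative sketch
The description above says the generated chemical space is evaluated for uniformity, closedness and completeness, but this record does not give the formulas. The Python sketch below is only a rough illustration under assumed definitions: completeness is taken as the fraction of the target set that appears in a sample, and closedness as the fraction of unique sampled molecules that fall inside the target set. The function names and the toy SMILES sets are hypothetical, and uniformity is omitted because a reasonable definition cannot be inferred from this record.

```python
from typing import Iterable, Set


def completeness(sampled: Iterable[str], target: Set[str]) -> float:
    """Assumed definition: share of the target set seen at least once in the sample."""
    if not target:
        raise ValueError("target set must not be empty")
    return len(set(sampled) & target) / len(target)


def closedness(sampled: Iterable[str], target: Set[str]) -> float:
    """Assumed definition: share of unique sampled molecules that lie in the target set."""
    unique = set(sampled)
    if not unique:
        raise ValueError("sample must not be empty")
    return len(unique & target) / len(unique)


# Toy data (illustrative only): one identifier string per molecule.
# In practice every sampled string would first be converted to a canonical
# form, so that different randomized SMILES of one molecule count once.
target = {"CCO", "CCN", "CCC", "CC=O"}
sampled = ["CCO", "CCO", "CCN", "CCCl", "CCC"]

print(completeness(sampled, target))  # 3 of the 4 target molecules were sampled
print(closedness(sampled, target))    # 3 of the 4 unique samples are in the target
```

Under these assumed definitions, a model that covers the target space well but also produces many molecules outside it would score high on completeness and low on closedness.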