Data-Efficient Graph Grammar Learning for Molecular Generation - 2023

Details

Title: Data-Efficient Graph Grammar Learning for Molecular Generation
Author(s): Neurosymbolic Programming for Science

Rough Notes

The typical pipeline for molecule generation is: training set \(\to\) representation (e.g. some string format) \(\to\) generative model \(\to\) novel generated molecules, which are then evaluated against domain-specific metrics.

Many deep learning approaches produce chemically invalid molecules.

These days, the representation and generative model are typically built on top of some deep learning approach; however, this requires a training dataset on the order of 10,000 samples. This is a problem, since molecular data is not always abundant.

E.g. consider generating a specific molecule type that must fulfil ingredient requirements, like monomer generation for polymers (#DOUBT Look into the monomer examples). An existing dataset for polyurethanes has only 20 samples.

Deep-learning-based molecule generation requires lots of data. The underlying assumption is that molecular structures contain the necessary chemical knowledge, so molecules are represented as latent vectors by passing them through some embedding function. The task then amounts to making the generated molecules close to the molecules in some dataset, and many of the molecules generated this way are chemically invalid. The latent vector is not an efficient representation: chemical space is vast and diverse, and even with large training datasets it is hard to learn very simple chemical constraints. Compare this to a hard-working PhD student, whose hand-designed molecules will always be chemically valid.
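
(#NOTE A minimal sketch of this latent-vector pipeline, just to fix ideas. decode and sample_latent stand in for a hypothetical trained VAE-style model; only the RDKit validity check is a real API.)

    # Sketch of the latent-vector generation loop criticized above.
    # `decode` and `sample_latent` are hypothetical placeholders for a trained
    # VAE-style model; RDKit's MolFromSmiles returns None on invalid strings.
    from rdkit import Chem

    def generate(decode, sample_latent, n=100):
        """Sample latent vectors, decode to SMILES, keep only valid molecules."""
        valid = []
        for _ in range(n):
            z = sample_latent()      # e.g. a draw from N(0, I)
            smiles = decode(z)       # the decoder is free to emit invalid strings
            if Chem.MolFromSmiles(smiles) is not None:
                valid.append(smiles)
        return valid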

Using a Formal Grammar:

  • Allows for data efficiency, since a small number of rules can generate a large number of samples.
  • Allows for expressiveness, since the production rules can encode the necessary constraints.

The grammar generates molecules and consists of non-terminal symbols, terminal symbols, and production rules that transform combinations of these symbols. Such grammars are also used in computer graphics and computer vision. Related grammar-based methods in chemistry include GrammarVAE, MHG, SD-VAE, SELFIES, and STONED. However, these grammars are manually designed, and it is hard to hand-design complicated constraints.
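
(#NOTE A toy sketch of terminals / non-terminals / production rules, using a string grammar for readability. The paper learns graph grammar rules over molecular (hyper)graphs, so this is only to fix the vocabulary.)

    import random

    # Toy grammar: the non-terminal "CHAIN" expands into carbon chains.
    rules = {
        "CHAIN": [["C", "CHAIN"],    # production 1: extend the chain by one carbon
                  ["C"]],            # production 2: terminate the chain
    }

    def derive(symbol="CHAIN"):
        """Expand non-terminals until only terminal symbols remain."""
        if symbol not in rules:      # terminal symbol, e.g. "C"
            return symbol
        production = random.choice(rules[symbol])
        return "".join(derive(s) for s in production)

    print(derive())                  # e.g. "CCC" - a SMILES-like alkane chain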

Ideally, we want to learn the grammar from a small set of training data (~ a few dozen molecules), and then use the grammar to produce new molecules. The grammar is built with a bottom-up construction, analogous to bottom-up program synthesis, where you have a set of input-output pairs and want to find a program that generates them.

The bottom-up construction here is determined by a sequence of collapsed edges, and each edge has a rank that tells when it is collapsed; the construction of a minimum spanning tree (MST) determines this order. The MST needs a parametrized edge weight function \(F_\theta(e)\), which in turn determines the constructed grammar. In short, \(F_\theta(\cdot)\) determines the rank of all the edges, which determines the sequence of production rules, which determines the final grammar (#DOUBT Need to read more on this construction). The maximization objective is a combination of the diversity and the validity of what the grammar induced by \(F_\theta(\cdot)\) generates.
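
(#NOTE A rough sketch of my reading of this step, with all names hypothetical: score each edge with \(F_\theta\), take the MST under those weights, and read the collapse order off the MST. The real construction operates on molecular (hyper)graphs and builds a production rule from each collapsed edge.)

    import networkx as nx
    import torch

    class EdgeScore(torch.nn.Module):
        """Hypothetical F_theta: maps per-edge features to a scalar weight."""
        def __init__(self, feat_dim):
            super().__init__()
            self.mlp = torch.nn.Sequential(
                torch.nn.Linear(feat_dim, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

        def forward(self, feat):                 # feat: (num_edges, feat_dim)
            return self.mlp(feat).squeeze(-1)    # one scalar score per edge

    def collapse_order(graph, edge_feats, f_theta):
        """Order edges by building an MST under the learned weights.

        edge_feats must be aligned with graph.edges(); in this sketch,
        lower-weight edges are collapsed first.
        """
        weights = f_theta(edge_feats)
        for (u, v), w in zip(graph.edges(), weights.tolist()):
            graph[u][v]["weight"] = w
        mst = nx.minimum_spanning_tree(graph, weight="weight")
        return sorted(mst.edges(data=True), key=lambda e: e[2]["weight"])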

Note that this objective is not differentiable with respect to \(\theta\), since the edge decisions are discrete. To handle this, a "probability" (#NOTE The quotes are from me) is defined on each edge by \(\sigma(F_\theta(e))\), and the maximization objective is converted to an expectation, allowing the use of the REINFORCE estimator. (#NOTE Look into other discrete gradient approximations).
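
(#NOTE A minimal sketch of the REINFORCE trick as I understand it: treat \(\sigma(F_\theta(e))\) as a per-edge Bernoulli parameter, sample discrete edge decisions, score the resulting grammar with the non-differentiable diversity + validity reward, and weight the log-probability gradient by that reward. build_grammar and reward_fn are placeholders.)

    import torch

    def reinforce_step(f_theta, edge_feats, build_grammar, reward_fn, optimizer):
        """One REINFORCE update; the reward itself is never differentiated."""
        probs = torch.sigmoid(f_theta(edge_feats))     # "probability" per edge
        dist = torch.distributions.Bernoulli(probs=probs)
        sample = dist.sample()                         # discrete edge decisions
        grammar = build_grammar(sample)                # placeholder: rules from the sample
        reward = reward_fn(grammar)                    # scalar: diversity + validity
        loss = -reward * dist.log_prob(sample).sum()   # REINFORCE estimator
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return reward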

Experiments work well even when using only 0.15% of the samples from a dataset of ~81,000.

Github repository link.
