An example of a Coulomb matrix for 2-hydroxy-2-methylpropanoic acid is shown in Fig. The CM is easily understandable, simple, and relatively small as a descriptor. However, it performs best with Laplacian kernels in the machine learning model (see Sect.

The many-body tensor representation (MBTR) extends the Coulomb matrix philosophy of encoding the internal coordinates of a molecule.

We will describe the MBTR only qualitatively here. Detailed equations can be found in the original publication (Huo and Rupp, 2017), our previous work (Himanen et al. Unlike the Coulomb matrix, the many-body tensor is continuous and it distinguishes between different types of internal coordinates.

At many-body level 1, the MBTR records the presence of all atomic species in a molecule by placing a Gaussian at the atomic number on an axis from 1 to the number of elements in the periodic table.

The weight of the Gaussian is equal to the number of times the species is present in the molecule. At many-body level 2, inverse distances between every pair of atoms (bonded and non-bonded) are recorded in the same fashion.
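The first two many-body levels can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the implementation used in this work: level 2 is not resolved by element pair, no distance-dependent weighting is applied, and the function names, grids, Gaussian widths and the approximate water geometry are all hypothetical choices made for illustration.

```python
import math

def gaussian_sum(centres, weights, grid, sigma):
    """Evaluate a weighted sum of normalised Gaussians on the points of `grid`."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [
        sum(w * norm * math.exp(-0.5 * ((x - c) / sigma) ** 2)
            for c, w in zip(centres, weights))
        for x in grid
    ]

def mbtr_level1(atomic_numbers, grid, sigma=0.1):
    """Level 1: one Gaussian per species at its atomic number, weighted by its count."""
    counts = {}
    for z in atomic_numbers:
        counts[z] = counts.get(z, 0) + 1
    return gaussian_sum(list(counts.keys()), list(counts.values()), grid, sigma)

def mbtr_level2(positions, grid, sigma=0.02):
    """Level 2 (simplified): one unit-weight Gaussian per atom pair at the inverse distance."""
    inverse_distances = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            inverse_distances.append(1.0 / math.dist(positions[i], positions[j]))
    return gaussian_sum(inverse_distances, [1.0] * len(inverse_distances), grid, sigma)

# Illustrative toy input: water with an approximate, non-optimised geometry (angstrom).
WATER_Z = [8, 1, 1]
WATER_XYZ = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]

species_axis = [i * 0.01 for i in range(1001)]    # atomic-number axis, 0 to 10
inverse_axis = [i * 0.001 for i in range(2001)]   # inverse-distance axis, 0 to 2 1/angstrom

level1 = mbtr_level1(WATER_Z, species_axis)
level2 = mbtr_level2(WATER_XYZ, inverse_axis)
```

In this sketch the hydrogen peak at atomic number 1 is twice as high as the oxygen peak at 8, reflecting the species counts, and both outputs are smooth functions of their inputs, which is the continuity property discussed in the text.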

Many-body level 3 adds angular information between any triple of atoms. Figure 4c shows selected MBTR elements for 2-hydroxy-2-methylpropanoic acid. The MBTR is a continuous descriptor, which is advantageous for machine learning. However, MBTR is by far the largest descriptor of the five we tested, and this can impose restrictions on memory and computational cost.

Furthermore, the MBTR is more difficult to interpret than the CM.

The Molecular ACCess System (MACCS) structural key is a dictionary-based descriptor (Durant et al. It is represented as a bit vector of Boolean values that encode answers to a set of predefined questions.

MACCS is the smallest of the five descriptors and is fast to use. Its accuracy critically depends on how well the 166 questions encapsulate the chemical detail of the molecules. It is likely to reach moderate accuracy with low computational cost and memory usage, and it could be beneficial for fast testing of a machine learning model.
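The idea of a dictionary-based key can be sketched with a toy example. The four questions below are hypothetical stand-ins and are not taken from the actual 166 MACCS keys; the molecule record (heavy atoms and bonds of 2-hydroxy-2-methylpropanoic acid, written by hand) and all names are assumptions made for illustration.

```python
from collections import Counter

# Heavy atoms and bonds (i, j, bond order) of 2-hydroxy-2-methylpropanoic acid,
# CC(C)(O)C(=O)O, written out by hand for this sketch.
ATOMS = ["C", "C", "C", "O", "C", "O", "O"]
BONDS = [(0, 1, 1), (1, 2, 1), (1, 3, 1), (1, 4, 1), (4, 5, 2), (4, 6, 1)]

def has_nitrogen(atoms, bonds):
    return Counter(atoms)["N"] > 0

def has_more_than_one_oxygen(atoms, bonds):
    return Counter(atoms)["O"] > 1

def has_carbonyl(atoms, bonds):
    return any(order == 2 and {atoms[i], atoms[j]} == {"C", "O"}
               for i, j, order in bonds)

def has_ring(atoms, bonds):
    # For a connected molecule, the number of independent rings is bonds - atoms + 1.
    return len(bonds) - len(atoms) + 1 > 0

# Hypothetical question dictionary; the order of the questions fixes the bit positions.
QUESTIONS = [has_nitrogen, has_more_than_one_oxygen, has_carbonyl, has_ring]

def structural_key(atoms, bonds):
    """Return one bit per predefined question (1 = yes, 0 = no)."""
    return [int(question(atoms, bonds)) for question in QUESTIONS]

key = structural_key(ATOMS, BONDS)  # [0, 1, 1, 0] for this molecule
```

The sketch makes the limitation noted above concrete: any chemical detail that none of the predefined questions asks about is invisible to the descriptor.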

The topological fingerprint (TopFP) first extracts all topological paths of certain lengths. The paths start from one atom in a molecule and travel along bonds until k bonds have been traversed, as illustrated in Fig.

The path depicted in the figure would be OCCO. The list of patterns produced is exhaustive: every pattern in the molecule, up to the path length limit, is generated.

Each pattern is then hashed into a set of bits, and this set is added (with a logical OR) to the fingerprint. The length of the bit vector, the minimum and maximum path lengths kmin and kmax, and the length of one hash can be optimized. Topology is an informative molecular feature.
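The path enumeration and hashing steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the fingerprint implementation used in this work: bond orders and hydrogens are ignored, SHA-256 stands in for whatever hash function a production toolkit uses, and the function names, the hand-written heavy-atom graph and the default parameter values are hypothetical.

```python
import hashlib

# Heavy-atom graph of 2-hydroxy-2-methylpropanoic acid, CC(C)(O)C(=O)O.
ATOMS = ["C", "C", "C", "O", "C", "O", "O"]
ADJACENCY = {0: [1], 1: [0, 2, 3, 4], 2: [1], 3: [1], 4: [1, 5, 6], 5: [4], 6: [4]}

def enumerate_paths(atoms, adjacency, k_min, k_max):
    """Collect the element patterns of all simple paths with k_min..k_max bonds."""
    patterns = set()

    def extend(path):
        n_bonds = len(path) - 1
        if n_bonds >= k_min:
            symbols = tuple(atoms[i] for i in path)
            # A path and its reverse describe the same pattern; keep one canonical form.
            patterns.add("".join(min(symbols, symbols[::-1])))
        if n_bonds == k_max:
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:
                extend(path + [neighbour])

    for start in range(len(atoms)):
        extend([start])
    return patterns

def pattern_bits(pattern, n_bits, bits_per_hash):
    """Map a pattern to at most `bits_per_hash` bit positions (at most 8 with SHA-256)."""
    digest = hashlib.sha256(pattern.encode()).digest()
    return {int.from_bytes(digest[4 * i:4 * i + 4], "big") % n_bits
            for i in range(bits_per_hash)}

def topological_fingerprint(atoms, adjacency, k_min=1, k_max=3,
                            n_bits=64, bits_per_hash=2):
    """OR the bits of every enumerated path pattern into one bit vector."""
    fingerprint = [0] * n_bits
    for pattern in enumerate_paths(atoms, adjacency, k_min, k_max):
        for bit in pattern_bits(pattern, n_bits, bits_per_hash):
            fingerprint[bit] |= 1
    return fingerprint

patterns = enumerate_paths(ATOMS, ADJACENCY, 1, 3)
fingerprint = topological_fingerprint(ATOMS, ADJACENCY)
```

With a maximum path length of three bonds, the enumerated patterns for this molecule include OCCO. Because different patterns can hash to the same positions, a short bit vector loses information, which is one reason the vector length and the number of bits per hash are treated as tunable parameters.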

We therefore expect TopFP to balance good accuracy with reasonable computational cost. However, the binary fingerprint is difficult to visualize and analyse for chemical insight.

The Morgan fingerprint is also a bit vector constructed by hashing the molecular structure. In contrast to the topological fingerprint, the Morgan fingerprint is hashed along circular or spherical paths around the central atom as illustrated in Fig.

Each substructure for a hash is constructed by first numbering the atoms in a molecule with unique integers by applying the Morgan algorithm. Each uniquely numbered atom then becomes a cluster centre, around which we iteratively increase a spherical radius to include the neighbouring bonded atoms (Rogers and Hahn, 2010). Each radius increment extends the neighbour list by another molecular bond.
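The iterative growth of circular neighbourhoods can be sketched as below. This is a simplified illustration and not the algorithm of Rogers and Hahn (2010) in full: hashed element symbols replace the Morgan numbering as initial atom identifiers, bond orders and hydrogens are ignored, duplicate substructures are not removed, and all names and parameter values are hypothetical.

```python
import hashlib

# Heavy-atom graph of 2-hydroxy-2-methylpropanoic acid, CC(C)(O)C(=O)O.
ATOMS = ["C", "C", "C", "O", "C", "O", "O"]
ADJACENCY = {0: [1], 1: [0, 2, 3, 4], 2: [1], 3: [1], 4: [1, 5, 6], 5: [4], 6: [4]}

def stable_hash(obj):
    """Deterministic 64-bit integer hash of a Python object's repr."""
    return int.from_bytes(hashlib.sha256(repr(obj).encode()).digest()[:8], "big")

def atom_environments(atoms, adjacency, max_radius):
    """Return, for each radius 0..max_radius, one identifier per atom-centred environment."""
    identifiers = [stable_hash(symbol) for symbol in atoms]  # radius 0: the atom itself
    history = [identifiers]
    for radius in range(1, max_radius + 1):
        # Each increment folds in the previous identifiers of the bonded neighbours,
        # so the environment grows by one bond per iteration.
        identifiers = [
            stable_hash((radius, identifiers[i],
                         tuple(sorted(identifiers[j] for j in adjacency[i]))))
            for i in range(len(atoms))
        ]
        history.append(identifiers)
    return history

def circular_fingerprint(atoms, adjacency, max_radius=2, n_bits=64):
    """Set one bit for every environment identifier found up to max_radius."""
    fingerprint = [0] * n_bits
    for identifiers in atom_environments(atoms, adjacency, max_radius):
        for identifier in identifiers:
            fingerprint[identifier % n_bits] = 1
    return fingerprint

environments = atom_environments(ATOMS, ADJACENCY, 2)
fingerprint = circular_fingerprint(ATOMS, ADJACENCY)
```

In this sketch the hydroxyl oxygen (index 3) and the carbonyl oxygen (index 5) share an identifier at radius 1, because each is an oxygen bonded to one carbon and bond orders are ignored, but they receive different identifiers at radius 2, once the neighbours of the attached carbons are included. The two methyl carbons remain equivalent at every radius.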

The length of the fingerprint and the maximum radius can be optimized. The Morgan fingerprint is quite similar to the TopFP in size and type of information encoded, so we expect similar performance. It also does not lend itself to easy chemical interpretation.



