Introduction

In recent years, the availability of large datasets, combined with improved algorithms and the exponential growth in computing power, has led to an unparalleled surge of interest in machine learning. Nowadays, machine learning algorithms are successfully employed for classification, regression, clustering, or dimensionality reduction tasks of large sets of especially high-dimensional input data.1 In fact, machine learning has proved to have superhuman abilities in numerous fields (such as playing go2 or driving cars autonomously74,75).

Monte Carlo cross-validation is similar to k-fold cross-validation in the sense that the training and test sets are chosen randomly. However, here the size of the training/test set is chosen independently of the number of folds. While this can be advantageous, it also means that a sample is not guaranteed to ever appear in the test/training set. Leave-one-cluster-out cross-validation73 was specifically developed for materials science and estimates the ability of the machine learning model to extrapolate to novel groups of materials that were not present in the training data. Depending on the target quantity, this allows for a more realistic evaluation and a better understanding of the limitations of the machine learning model. Leave-one-cluster-out cross-validation removes a cluster of materials and then considers the error for predictions of the materials belonging to the removed cluster. This is, for example, consistent with the finding in ref. 76 that models trained on superconductors with a specific superconducting mechanism do not have any predictive ability for superconductors with other mechanisms.
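Both splitting strategies map directly onto standard tooling. The following is a minimal sketch using scikit-learn, with a synthetic feature matrix, target, and cluster labels standing in for real materials data (all names and numbers are placeholders); LeaveOneGroupOut plays the role of leave-one-cluster-out once each material has been assigned to a cluster (e.g., by chemistry or structure prototype).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, LeaveOneGroupOut, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))          # placeholder features
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
groups = rng.integers(0, 5, size=200)  # hypothetical cluster/family label per material

model = RandomForestRegressor(n_estimators=100, random_state=0)

# Monte Carlo cross-validation: repeated random train/test splits of fixed size
mc_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
mc_scores = cross_val_score(model, X, y, cv=mc_cv, scoring="neg_mean_absolute_error")

# Leave-one-cluster-out: every material family is held out once
loco_scores = cross_val_score(model, X, y, groups=groups, cv=LeaveOneGroupOut(),
                              scoring="neg_mean_absolute_error")

print("Monte Carlo MAE:", -mc_scores.mean())
print("Leave-one-cluster-out MAE:", -loco_scores.mean())
```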

Before discussing various applications of machine learning in materials science, we will give an overview of the different descriptors, algorithms, and databases used in materials informatics.

Databases

Machine learning in materials science is mostly concerned with supervised learning. The success of such methods depends mainly on the amount and quality of the available data, and this turns out to be one of the major challenges in materials informatics.77 This is especially problematic for target properties that can only be determined experimentally in a costly fashion (such as the critical temperature of superconductors—see section “Prediction of material properties—superconductivity”). For this reason, databases such as the Materials Project,78 the Inorganic Crystal Structure Database,79 and others (Materials genome initiative, The NOMAD archive, Supercon, National Institute of Materials Science 2011)80,81,82,83,84,85,86,87,88,89,90,91,92 that contain information on numerous properties of known materials are essential for the success of materials informatics.

In order for these databases and for materials informatics to thrive, a FAIR treatment of data93 is absolutely required. A FAIR treatment encompasses four principles: findability, accessibility, interoperability, and reusability.94 In other words, researchers from different disciplines should be able to find and access data, as well as the corresponding metadata, in a commonly accepted format. This allows the application of the data for new purposes.

Traditionally, negative results are often discarded and left unpublished. However, as negative data are often just as important for machine learning algorithms as positive results,28,95 a cultural adjustment toward the publication of unsuccessful research is necessary. In some disciplines with a longer tradition of data-based research (like chemistry), such databases already exist.95 In a similar vein, data that emerge as a side product but are not essential for a publication are often left unpublished. This eventually results in a waste of resources, as other researchers are then required to repeat the work. In the end, every single discarded calculation will be sorely missed in future machine learning applications.

Features

A pivotal ingredient of a machine learning algorithm is the representation of the data in a suitable form. Features in materials science have to capture all the relevant information necessary to distinguish between different atomic or crystal environments.96 The process itself, denoted as feature extraction or engineering, might be as simple as determining atomic numbers, might involve complex transformations such as an expansion of radial distribution functions (RDFs) in a certain basis, or might require aggregations based on statistics (e.g., averaging over features or calculating their maximum value). How much processing is required depends strongly on the algorithm. For some methods, such as deep learning, the feature extraction can be considered part of the model.97 Naturally, the best choice of representation depends on the target quantity and on the variety of the space of occurrences. For completeness, we have to mention that the cost of feature extraction must remain far below the cost of evaluating the target quantity directly; otherwise, there is little to gain from the machine learning model.

Ideally, descriptors should be uncorrelated, as an abundant number of correlated features can hinder the efficiency and accuracy of the model. When this happens, further feature selection is necessary to circumvent the curse of dimensionality,98 simplify models, and improve their interpretability as well as training efficiency. For example, several elemental properties such as the period and group in the periodic table, ionization potential, and covalent radius, can be used as features to model formation energies or distances to the convex hull of stability. However, it was shown that, to obtain acceptable accuracies, often only the period and the group are required.99

Having described the general properties of descriptors, we will proceed with a listing of the most used features in materials science. Without a doubt, the most studied type of features in this field are the ones related to the fitting of potential energy surfaces. In principle, the nuclear charges and the atomic positions are sufficient features, as the Hamiltonian of a system is usually fully determined by these quantities. In practice, however, while Cartesian coordinates might provide an unambiguous description of the atomic positions, they do not make a suitable descriptor: the list of coordinates of a structure is ordered arbitrarily, and the number of coordinates varies with the number of atoms. The latter is a problem, as most machine learning models require a fixed number of features as an input. Therefore, to describe solids and large clusters, the number of interacting neighbors has to be allowed to vary without changing the dimensionality of the descriptor. In addition, many applications require that the features are continuous and differentiable with respect to the atomic positions.

A comprehensive study on features for atomic potential energy surfaces can be found in the review of Bartók et al.100 Important points mentioned in their work are: (i) the performance of the model and its ability to differentiate between different structures do not depend directly on the descriptors but on the similarity measure between them; (ii) the quality of the descriptors is related to their differentiability with respect to the movement of the atoms, the completeness of the representation, and the invariance with respect to the basic symmetries of physics (rotation, reflection, translation, and permutation of atoms of the same species). For clarification, a set of invariant descriptors qi that uniquely determines an atomic environment up to symmetries is defined as complete. An overcomplete set is then a set that includes more features than necessary.

Simple representations that show shortcomings as features are transformations of pairwise distances,101,102,103 Weyl matrices,104 and Z-matrices.105 Pairwise distances (and also reciprocal or exponential transformations of these) only work for a fixed number of atoms and are not unique under permutation of atoms. The constraint on the number of atoms is also present for polynomials of pairwise distances. Histograms of pairwise atomic distances are non-unique: if no information on the angles between the atoms is given, or if the ordering of the atoms is unknown, it might be possible to construct at least two different structures with the same features. Weyl matrices are defined by the inner products between the position vectors of neighboring atoms, forming an overcomplete set, while permutations of the atoms change the order of their rows and columns. Finally, Z-matrices or internal coordinate representations are not invariant under permutations of atoms.

In 2012, Rupp et al.106 introduced a representation for molecules based on the Coulomb repulsion between atoms I and J and a polynomial fit of atomic energies to the nuclear charge

$$M_{IJ} = \begin{cases} 0.5\,Z_I^{2.4} & \text{for } I = J \\[4pt] \dfrac{Z_I Z_J}{|\boldsymbol{R}_I - \boldsymbol{R}_J|} & \text{for } I \neq J. \end{cases}$$
(1)

The ordered eigenvalues (ε) of these “Coulomb matrices” are then used to measure the similarity between two molecules.

$$d(\varepsilon, \varepsilon') = \sqrt{\sum_i \left|\varepsilon_i - \varepsilon_i'\right|^2}.$$
(2)

Here, if the number of atoms is not the same in both systems, ε is extended by zeros. In this representation, symmetrically equivalent atoms contribute equally to the feature function, the diagonalized matrices are invariant with respect to permutations and rotations, and the distance d is continuous under small variations of charge or interatomic distances. Unfortunately, this representation is not complete and does not uniquely describe every system. The incompleteness derives from the fact that not all degrees of freedom are taken into account when comparing two systems. The non-uniqueness can be demonstrated using as an example acetylene (C2H2).107 In brief, distortions of this molecule can lead to several geometries that are described by the same Coulomb matrix.
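For concreteness, a minimal NumPy sketch of Eqs. (1) and (2) could look as follows; the charges and geometries below are hypothetical inputs, not data from the original work.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix of Eq. (1) for nuclear charges Z and positions R."""
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def eigenvalue_distance(M1, M2):
    """Distance of Eq. (2) from sorted, zero-padded eigenvalue spectra."""
    e1 = np.sort(np.linalg.eigvalsh(M1))[::-1]
    e2 = np.sort(np.linalg.eigvalsh(M2))[::-1]
    n = max(len(e1), len(e2))
    e1 = np.pad(e1, (0, n - len(e1)))
    e2 = np.pad(e2, (0, n - len(e2)))
    return np.sqrt(np.sum((e1 - e2) ** 2))

# Hypothetical example: H2 at two slightly different bond lengths (in angstrom)
Z = np.array([1.0, 1.0])
R1 = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]])
R2 = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.80]])
print(eigenvalue_distance(coulomb_matrix(Z, R1), coulomb_matrix(Z, R2)))
```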

Faber et al.108 presented three distinct ways to extend the Coulomb matrix representation to periodic systems. The first of these features consists of a matrix where each element represents the full Coulomb interaction between two atoms and all their infinite repetitions in the lattice. For example:

$${X_{ij}} = {\frac{1}{N}}{Z_{i}}{Z_{j}}\sum \limits_{k,l} \varphi (|{{\boldsymbol{R}}_{k}} - {{\boldsymbol{R}}_{l}}|),$$
(3)

where the sum over k (l) is taken over the atom i (j) in the unit cell and its N closest equivalent atoms. However, as this double sum has convergence issues, one has to resort to the Ewald trick: Xij is divided into a constant and two rapidly converging sums, one for the long-range interaction and another for the short-range interaction. Another extension by Faber et al. considers electrostatic interactions between the atoms in the unit cell and the atoms in the N closest unit cells. In addition, the long-range interaction is replaced by a rapidly decaying interaction. In their final extension, the Coulomb interaction in the usual matrix is replaced by a potential that is symmetric with respect to the lattice vectors.

In the same line of work, Schütt et al.109 extended the Coulomb matrix representation by combining it with the Bravais matrix. Unfortunately, this representation is plagued by a degeneracy problem that comes from the arbitrary choice of the coordinate system in which the Bravais matrix is written. Another representation proposed by Schütt et al. is the so called partial radial distribution function, which considers the density of atoms β in a shell of width dr and radius r centered around atom α (see Fig. 2):

$$g_{\alpha\beta}(r) = \frac{1}{N_\alpha V_r} \sum_i^{N_\alpha} \sum_j^{N_\beta} \theta(d_{\alpha_i \beta_j} - r)\, \theta(r + dr - d_{\alpha_i \beta_j}).$$
(4)
Fig. 2 Two crystal structure representations. (Left) A unit cell with the Bravais vectors (blue) and basis (pink) represented. (Right) Depiction of a shell of the discrete partial radial distribution function gαβ(r) with width dr. (Reprinted with permission from ref. 109. Copyright 2014 American Physical Society.)

Here Nα and Nβ are the numbers of atoms of types α and β, Vr is the volume of the shell, and \(d_{\alpha_i\beta_j}\) are the pairwise distances between atoms of the two types.
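A bare-bones sketch of Eq. (4), assuming a finite cluster (periodic images are deliberately ignored here for brevity), might read:

```python
import numpy as np

def partial_rdf(positions, types, alpha, beta, r, dr):
    """Discrete partial RDF of Eq. (4): density of type-beta atoms in a shell
    of radius r and width dr around type-alpha atoms (no periodic images)."""
    idx_a = np.where(types == alpha)[0]
    idx_b = np.where(types == beta)[0]
    shell_volume = 4.0 / 3.0 * np.pi * ((r + dr) ** 3 - r ** 3)
    count = 0
    for i in idx_a:
        for j in idx_b:
            if i == j:
                continue
            d = np.linalg.norm(positions[i] - positions[j])
            if r <= d < r + dr:
                count += 1
    return count / (len(idx_a) * shell_volume)
```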

Another form of representing the local structural environment was proposed by Behler and Parrinello.110 Their descriptors111 involve an invariant set of atom-centered radial

$$G_i^{\mathrm{r}}(\{ {\boldsymbol{R}}_i\} ) = \mathop {\sum}\limits_{j \ne i}^{{\mathrm{neighbors}}} {g^{\mathrm{r}}} (R_{ij}),$$
(5)

and angular symmetry functions

$$G_i^{\mathrm{a}}(\{ {\boldsymbol{R}}_i\} ) = \mathop {\sum}\limits_{j \ne i}^{{\mathrm{neighbors}}} {g^{\mathrm{a}}} (\theta _{ijk}){\mkern 1mu} ,$$
(6)

where θijk is the angle between Rj − Ri and Rk − Ri. While the radial functions \(G_i^{\mathrm{r}}\) contain information on the interaction between pairs of atoms within a certain radius, the angular functions \(G_i^{\mathrm{a}}\) contain additional information on the distribution of the bond angles θijk. Examples of atom-centered symmetry functions are

$$G_i^{\mathrm{r}} = \mathop {\sum}\limits_{j \ne i}^{{\mathrm{neighbors}}} {f_{\mathrm{c}}} (R_{ij}){\mathop{\rm{e}}\nolimits} ^{ - \eta (R_{ij} - R_{\mathrm{s}})^2}$$
(7)

and

$$G_i^{\mathrm{a}} = 2^{1-\zeta} \sum_{\substack{j,k \\ i \neq j \neq k}}^{\mathrm{neighbors}} (1 + \lambda \cos\theta_{ijk})^{\zeta}\, e^{-\eta (R_{ij}^2 + R_{ik}^2 + R_{jk}^2)}\, f_{\mathrm{c}}(R_{ij})\, f_{\mathrm{c}}(R_{ik})\, f_{\mathrm{c}}(R_{jk}).$$
(8)

Here fc is a cutoff function, leading to the neglect of interactions between atoms beyond a certain radius Rc. Furthermore, η controls the width of the Gaussians, Rs is a parameter that shifts the Gaussians, λ determines the positions of the extrema of the cosine, and ζ controls the angular resolution. The sum over neighbors enforces the permutation invariance of these symmetry functions. Usually, 20–100 symmetry functions are used per atom, constructed by varying the parameters above. Besides atom-centered symmetry functions, pair-centered variants also exist.112
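As an illustration, the radial and angular symmetry functions of Eqs. (7) and (8) can be sketched in a few lines of NumPy; the cosine cutoff used below is one common choice, and all parameter values are left to the user.

```python
import numpy as np

def f_cut(r, r_c):
    """Cosine cutoff that goes smoothly to zero at r_c."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g_radial(distances, eta, r_s, r_c):
    """Radial symmetry function of Eq. (7) for one central atom,
    given the distances R_ij to all its neighbors."""
    d = np.asarray(distances)
    return np.sum(f_cut(d, r_c) * np.exp(-eta * (d - r_s) ** 2))

def g_angular(r_ij, r_ik, r_jk, theta_ijk, eta, lam, zeta, r_c):
    """Angular symmetry function of Eq. (8) for one central atom,
    given arrays describing all (j, k) neighbor pairs."""
    terms = ((1.0 + lam * np.cos(theta_ijk)) ** zeta
             * np.exp(-eta * (np.asarray(r_ij) ** 2
                              + np.asarray(r_ik) ** 2
                              + np.asarray(r_jk) ** 2))
             * f_cut(np.asarray(r_ij), r_c)
             * f_cut(np.asarray(r_ik), r_c)
             * f_cut(np.asarray(r_jk), r_c))
    return 2.0 ** (1.0 - zeta) * np.sum(terms)
```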

A generalization of the atom-centered pairwise descriptor of Behler was proposed by Seko et al.113 It consists of simple basis functions constructed from the multinomial expansion of the product between a cutoff function (fc) and an analytical pairwise function (fn) (for example, Gaussian, cosine, Bessel, Neumann, polynomial, or Gaussian-type orbital functions)

$$b_{n,p}^{i,j} = \left[ {\mathop {\sum}\limits_k {f_n} (R_{jk}^i)\cdot f_{\mathrm{c}}(R_{jk}^i)} \right]^p,$$
(9)

where p is a positive integer, and \(R_{jk}^i\) indicates the distance between atoms j and k of structure i. The descriptor then uses the sum of these basis functions over all the atoms in the structure \(\left( {\mathop {\sum}\limits_j {b_{n,p}^{i,j}} } \right)\).

A similar type of descriptor is the angular Fourier series (AFS),100 which consists of a collection of orthogonal polynomials, like the Chebyshev polynomials Tl(cos θ) = cos(lθ), and radial functions

$${\mathrm{AFS}}_{nl} = \mathop {\sum}\limits_{i,j > i} {g_n} (r_i)g_n(r_j){\mathrm{cos}}(l\theta _{ij}).$$
(10)

These radial functions are expansions in terms of cubic or higher-order polynomials

$$g_n(r) = \mathop {\sum}\limits_\alpha {W_{n\alpha }} \phi _\alpha (r){\mkern 1mu} ,$$
(11)

where

$$\phi _\alpha (r) = (r_{\mathrm{c}} - r)^{\alpha + 2}/N_\alpha .$$
(12)

A different approach for atomic environment features was proposed by Bartók et al.100,114 and leads to the power spectrum and the bispectrum. The approach starts with the generation of an atomic neighbor density function

$$\rho ({\mathbf{r}}) = \delta ({\mathbf{r}}_0) + \mathop {\sum}\limits_i {\delta ({\mathbf{r}} - {\mathbf{r}}_i)} ,$$
(13)

which is projected onto the surface of a four-dimensional sphere with radius r0. As an example, Fig. 3 depicts the projection for 1 and 2 dimensions. Then the hyperspherical harmonic functions \(U_{m\prime m}^j\) can be used to represent any function ρ defined on the surface of a four-dimensional sphere115,116

$$\rho = \mathop {\sum}\limits_{j = 0}^\infty {\mathop {\sum}\limits_{m,m{\prime} = - j}^j {c_{m\prime m}^j} } U_{m{\prime}m}^j.$$
(14)
Fig. 3 Mapping of a flat space in one and two dimensions onto the surface of a sphere in one higher dimension. (Reprinted with permission from ref. 100. Copyright 2013 American Physical Society.)

Combining these with the rotation operator and the transformation of the expansion coefficients under rotation leads to the formula

$$P_j = \mathop {\sum}\limits_{m{\prime},m = - j}^j {c_{m{\prime}m}^{j \ast }} c_{m{\prime}m}^j$$
(15)

for the SO(4) power spectrum. On the other hand, the bispectrum is given by

$$B_{j j_1 j_2} = \sum_{m_1', m_1 = -j_1}^{j_1} c_{m_1' m_1}^{j_1} \sum_{m_2', m_2 = -j_2}^{j_2} c_{m_2' m_2}^{j_2} \sum_{m', m = -j}^{j} C_{m\, m_1\, m_2}^{j\, j_1\, j_2}\, C_{m'\, m_1'\, m_2'}^{j\, j_1\, j_2}\, c_{m'm}^{j\,\ast},$$
(16)

where \(C_{m\,m_1\,m_2}^{j\,j_1\,j_2}\) are the Clebsch–Gordan coefficients of SO(4). We note that the representations above are truncated, based on the band limit jmax in the expansion.

Finally, one of the most successful atomic environment features is the following similarity measurement

$$K(\rho ,\rho \prime ) = \left[ {\frac{{k(\rho ,\rho \prime )}}{{\sqrt {k(\rho ,\rho )k(\rho \prime ,\rho \prime )} }}} \right]^\zeta$$
(17)

also known as the smooth overlap of atomic positions (SOAP) kernel.100 Here ζ is a positive integer that enhances the sensitivity of the kernel to changes in the atomic positions and ρ is the atomic neighbor density function, which is constructed from a sum of Gaussians centered on each neighbor:

$$\rho ({\mathbf{r}}) = \mathop {\sum}\limits_i {{\mathop{\rm{e}}\nolimits} ^{ - \alpha |{\mathbf{r}} - {\mathbf{r}}_i|^2}} .$$
(18)

In practice, the function ρ is then expanded in terms of the spherical harmonics. In addition, k(ρ, ρ′) is a rotationally invariant kernel, defined as the overlap between an atomic environment and all rotated environments:

$$k(\rho ,\rho {\prime}) = {\int} d \hat R\;{\int} d {\mathbf{r}}\;\rho ({\mathbf{r}})\rho {\prime}(\hat R{\mathbf{r}}).$$
(19)

The normalization factor \(\sqrt {k(\rho ,\rho )k(\rho \prime ,\rho \prime )}\) ensures that the overlap of an environment with itself is one.

The SOAP kernel can be perceived as a three-dimensional generalization of the radial atom-centered symmetry functions and is capable of characterizing the entire atomic environment at once. It was shown to be equivalent to using the power or bispectrum descriptor with a dot-product covariance kernel and Gaussian neighbor densities.100
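Because of this equivalence, a minimal sketch of Eq. (17) only needs two (pre-computed) power spectrum vectors; the random vectors below are placeholders for descriptors obtained from a real SOAP implementation.

```python
import numpy as np

def soap_kernel(p1, p2, zeta=2):
    """Normalized dot-product (SOAP-like) kernel of Eq. (17) between two
    precomputed power spectrum vectors; equals 1 for identical environments."""
    k12 = np.dot(p1, p2)
    k11 = np.dot(p1, p1)
    k22 = np.dot(p2, p2)
    return (k12 / np.sqrt(k11 * k22)) ** zeta

rng = np.random.default_rng(0)
p1, p2 = rng.random(50), rng.random(50)   # placeholder power spectra
print(soap_kernel(p1, p2), soap_kernel(p1, p1))
```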

A problem with the above descriptors is that their number increases quadratically with the number of chemical species. Inspired by the Behler symmetry functions and the SOAP method, Artrith et al.117 devised a conceptually simple descriptor whose dimension is constant with respect to the number of species. This is achieved by defining the descriptor as the union of two sets of invariant coordinates, one that maps the atomic positions (or structure) and another the composition. Both of these mappings consist of the expansion coefficients of the RDFs

$${\mathrm{RDF}}_i(r) = \mathop {\sum}\limits_\alpha {c_\alpha ^{{\mathrm{RDF}}}} \phi _\alpha (r)\quad {\mathrm{for}}\quad 0 \le r \le R_{\mathrm{c}}$$
(20)

and angular distribution functions (ADF)

$${\mathrm{ADF}}_i(\theta ) = \mathop {\sum}\limits_\alpha {c_\alpha ^{{\mathrm{ADF}}}} \phi _\alpha (\theta )\quad {\mathrm{for}}\quad 0 \le r \le R_{\mathrm{c}}.$$
(21)

in a complete basis set ϕα (such as the Chebyshev polynomials). The expansion coefficients are given by

$$c_\alpha ^{{\mathrm{RDF}}} = \mathop {\sum}\limits_{R_j} {\phi _\alpha } (R_{ij})f_{\mathrm{c}}(R_{ij})w_{tj}$$
(22)

and

$$c_\alpha^{\mathrm{ADF}} = \sum_{R_j, R_k} \phi_\alpha(\theta_{ijk})\, f_{\mathrm{c}}(R_{ij})\, f_{\mathrm{c}}(R_{ik})\, w_{tj} w_{tk}.$$
(23)

Here fc is a cut-off function that limits the range of the interactions. The weights wtj and wtk take the value of one for the structure maps, while the weights for the compositional maps depend on the chemical species, according to the pseudo-spin convention of the Ising model. By limiting the descriptor to two- and three-body interactions, i.e., radial and angular contributions, this method maintains the simple analytic nature of the Behler–Parrinello approach. Furthermore, it allows for an efficient implementation and differentiation, while systematic refinement is assured by the expansion in a complete basis set.
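A possible sketch of the structural expansion coefficients of Eq. (22), using NumPy's Chebyshev polynomials as the basis ϕα and a cosine cutoff (both are assumptions on our part, not necessarily the exact choices of ref. 117), is:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

def chebyshev_rdf_coefficients(distances, weights, r_c, n_max):
    """Expansion coefficients of Eq. (22). Distances are mapped from [0, r_c]
    onto [-1, 1], the natural Chebyshev domain, and a cosine cutoff limits
    the interaction range; weights are 1 for the structure map."""
    d = np.asarray(distances)
    w = np.asarray(weights)
    f_c = np.where(d < r_c, 0.5 * (np.cos(np.pi * d / r_c) + 1.0), 0.0)
    x = 2.0 * d / r_c - 1.0
    return np.array([np.sum(Chebyshev.basis(a)(x) * f_c * w) for a in range(n_max)])
```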

Sanville et al.118 used a set of vectors, each of which describes a five-atom chain found in the system. This information includes distances between the five atoms, angles, torsion angles, and functions of the bond screening factors.119

The simplex representation of a molecular structure of Kuz’min et al.120,121 consists in representing a molecule as a system of different simplex descriptors, i.e., a system of different tetraatomic fragments. These descriptors become consecutively more detailed with the increase of the dimension of the molecule representation. The simplex descriptor at the one-dimensional (1D) level consists of the number of combinations of four atoms for a given composition. At the two-dimensional (2D) level, the topology is also taken into account, while at the 3D level, the descriptor is the number of simplexes of fixed composition, topology, chirality, and symmetry. The extension of this methodology to bulk materials was proposed by Isayev et al.122 and counts bounded and unbounded simplexes (see Fig. 4). While a bounded simplex characterizes only a single component of the mixture, unbounded simplexes can describe up to four components of the unit cell.

Fig. 4 Depiction of the generation of the simplex representation of molecular structure descriptors for materials. (Reprinted with permission from ref. 122. Further permissions should be directed to the ACS.)

Isayev et al.41 also adapted property-labeled material fragments123 to solids. The structure of the material is encoded in a graph that defines the connectivity within the material based on its Voronoi tessellation124,125 (see Fig. 5). Only strong bonding interactions are considered: two atoms are seen as connected only when they share a Voronoi face and the interatomic distance does not exceed the sum of the Cordero covalent bond lengths.126 In the graph, the nodes correspond to the chemical elements, which are identified through a plethora of elemental properties, like Mendeleev group, period, thermal conductivity, covalent radius, etc. The full graph is divided into subgraphs that correspond to the different fragments. In addition, information about the crystal structure (e.g., lattice constants) is added to the descriptor of the material, resulting in a feature vector of 2500 values in total. A characteristic of these graphs is their adjacency matrix, which consists of a square matrix of order n (the number of atoms) filled with zeros except for the entries aij = 1 that occur when atoms i and j are connected. Finally, for every property scheme q, the descriptors are calculated as

$$T = \mathop {\sum}\limits_{i,j} {\left| {q_i - q_j} \right|} M_{ij},$$
(24)

where the set of indices go over all pairs of atoms or over all pairs of bonded atoms, and Mij are the elements of the product between the adjacency matrix of the graph and the reciprocal square distance matrix.
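Interpreting the product of the adjacency and reciprocal square distance matrices element-wise (one possible reading of the definition above), Eq. (24) can be sketched as:

```python
import numpy as np

def plmf_feature(q, adjacency, distances):
    """One property-labeled-fragment feature in the spirit of Eq. (24):
    differences of an elemental property q weighted by connectivity and
    reciprocal square distances (element-wise interpretation)."""
    safe_d = np.where(distances > 0, distances, 1.0)   # avoid dividing by zero on the diagonal
    M = adjacency / safe_d ** 2                        # zero wherever atoms are not connected
    diff = np.abs(q[:, None] - q[None, :])
    return np.sum(diff * M)
```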

Fig. 5 Representation of the construction of property-labeled material fragments. The atomic neighbors of a crystal structure (a) are found via Voronoi tessellation (b). The full graph is constructed from the list of connections, labeled with a property (c) and decomposed into smaller subgraphs (d). (Reprinted with permission from ref. 41, licensed under the CC BY 4.0 [https://creativecommons.org/licenses/by/4.0/].)

A different descriptor, named the orbital-field matrix, was introduced by Pham et al.127 Orbital-field matrices consist of weighted products of one-hot vectors \(\left( {o_i^p} \right)\), resembling those from the field of natural language processing. These vectors are filled with zeros with the exception of the elements that represent the valence electronic configuration of the atom. As an example, for the sodium atom with electronic configuration [Ne]3s1, the one-hot vector is filled with zeros except for the first element, which is 1. The elements of the matrices are calculated from:

$$X_{ij}^p = \mathop {\sum}\limits_{k = 1}^{n_p} {o_i^p} \,o_j^k\,w_k(\theta _k^p,r_{pk}){\mkern 1mu} ,$$
(25)

where the weight \(w_k(\theta _k^p,r_{pk})\) represents the contribution of atom k to the coordination number of the center atom p and depends on the distance between the atoms and the solid angle \(\theta _k^p\) determined by the face of the Voronoi polyhedron between the atoms. To represent crystal structures, the orbital-field matrices are averaged over the number of atoms Np in the unit cell:

$$F_{ij} = \frac{1}{{N_p}}\mathop {\sum}\limits_p^{N_p} {X_{ij}^p} .$$
(26)

Another way to construct features based on graphs is the crystal graph convolutional neural network (CGCNN) framework, proposed by Xie et al.40 and shown schematically in Fig. 6. The atomic properties are represented by the nodes and encoded in the feature vectors vi. Instead of using continuous values, each continuous property is divided into ten categories, resulting in one-hot features. This is obviously not necessary for the discrete properties, which can be encoded as standard one-hot vectors without further transformations. The edges represent the bonding interactions and are constructed analogously to the property-labeled material fragments descriptor. Unlike most graphs, these crystal graphs allow for several edges between two nodes, due to periodicity. Therefore, the edges are encoded as one-hot feature vectors \(u_{(i,j)_k}\), where the index denotes the kth bond between atoms i and j. Crystal graphs do not form an optimal representation for predicting target properties by themselves; however, they can be improved by using convolution layers. After each convolution layer, the feature vectors gradually contain more information on the surrounding environment due to the concatenation of atom and bond feature vectors. The best convolution function of Xie et al. consisted of

$$\boldsymbol{v}_i^{(t+1)} = \boldsymbol{v}_i^{(t)} + \sum_{j,k} \sigma\!\left(\mathbf{z}_{(i,j)_k}^{(t)} \boldsymbol{W}_f^{(t)} + \boldsymbol{b}_f^{(t)}\right) \odot g\!\left(\mathbf{z}_{(i,j)_k}^{(t)} \boldsymbol{W}_s^{(t)} + \boldsymbol{b}_s^{(t)}\right),$$
(27)

where \({\boldsymbol{W}}_f^{(t)}\), \({\boldsymbol{W}}_s^{(t)}\), \({\boldsymbol{b}}_f^{(t)}\), and \({\boldsymbol{b}}_s^{(t)}\) represent the convolution weight matrix, the self-weight matrix, and the corresponding biases of the tth layer, respectively. In addition, ⊙ indicates element-wise multiplication, σ denotes the sigmoid function, and \(z_{(i,j)_k}^{(t)}\) is the concatenation of neighbor vectors:

$$z_{(i,j)_k}^{(t)} = {\boldsymbol{v}}_i^{(t)} \oplus {\boldsymbol{v}}_j^{(t)} \oplus \,{\boldsymbol{u}}_{(i,j)_k},$$
(28)

Here ⊕ denotes concatenation of vectors.
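A schematic NumPy version of the gated convolution of Eqs. (27) and (28), with hypothetical shapes and weight matrices supplied by the caller, could be:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cgcnn_update(v, u, neighbors, W_f, b_f, W_s, b_s, g=np.tanh):
    """One gated graph convolution in the spirit of Eqs. (27) and (28).
    v: (n_atoms, d_v) node features; u: dict (i, j, k) -> bond feature vector;
    neighbors: dict i -> list of (j, k) bonds of atom i."""
    v_new = v.copy()
    for i, bonds in neighbors.items():
        for j, k in bonds:
            z = np.concatenate([v[i], v[j], u[(i, j, k)]])   # Eq. (28)
            gate = sigmoid(z @ W_f + b_f)                     # filter part
            core = g(z @ W_s + b_s)                           # self part
            v_new[i] = v_new[i] + gate * core                 # Eq. (27)
    return v_new

# Tiny hypothetical example: 2 atoms, one bond, node dim 2, bond dim 1
rng = np.random.default_rng(0)
v = rng.normal(size=(2, 2))
u = {(0, 1, 0): np.array([0.5]), (1, 0, 0): np.array([0.5])}
neighbors = {0: [(1, 0)], 1: [(0, 0)]}
W_f, W_s = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
b_f, b_s = np.zeros(2), np.zeros(2)
print(cgcnn_update(v, u, neighbors, W_f, b_f, W_s, b_s))
```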

Fig. 6 Illustration of the crystal graph convolutional neural network. a Construction of the graph. b Structure of the convolutional neural network. (Reprinted with permission from ref. 40. Copyright 2018 American Physical Society.)

After R convolutions, a pooling layer reduces the spatial dimensions of the convolutional neural network. Using skip-layer connections,128 the pooling function operates not only on the last feature vector but on all feature vectors (obtained after each convolution).

The idea of applying graph neural networks129,130,131 to describe crystal structures stems from graph-based models for molecules, such as those proposed in refs. 131,132,133,134,135,136,137,138,139,140. Moreover, all these models can be reorganized into a single common framework, known as message passing neural networks141 (MPNNs). The latter can be defined as models operating on undirected graphs G with vertex features xv and edge features evw. In this context, the forward pass is divided into two phases: the message passing phase and the readout phase.

During the message passing phase, which lasts for T interaction steps, the hidden states hv at each node in the graph are updated based on the messages \(m_v^{t + 1}\):

$$h_v^{t + 1} = S_t\left( {h_v^t,m_v^{t + 1}} \right),$$
(29)

where St(⋅) is the vertex update function. The messages are modified at each interaction by an update function Mt(⋅), which depends on all pairs of nodes (and their edges) in the neighborhood of v in the graph G:

$$m_v^{t + 1} = \mathop {\sum}\limits_{w \in N(v)} {M_t} \left( {h_v^t,h_w^t,e_{vw}^t} \right){\mkern 1mu} ,$$
(30)

where N(v) denotes the neighbors of v.

The readout phase occurs after T interaction steps. In this phase, a readout function R(⋅) computes a feature vector for the entire graph:

$$\hat y = R\left( {\left\{ {h_v^T \in G} \right\}} \right).$$
(31)

Jørgensen et al.142 extended MPNNs with an edge update network, which enforces the dependence of the information exchanged between atoms on the previous edge state and on the hidden states of the sending and receiving atom:

$$e_{vw}^{t + 1} = E_t\left( {h_v^{t + 1},h_w^{t + 1},e_{vw}^t} \right),$$
(32)

where Et(⋅) is the edge update function. The CGCNNs discussed above are one example of MPNNs. The message corresponds to \(z_{(i,j)_k}^{(t)}\), defined in Eq. (28). Likewise, the hidden node update function corresponds to the convolution function of Eq. (27). In this case, we can clearly see that the hidden node update function depends on the message and on the hidden node state. The readout phase comes after R convolutions (or T interaction steps) and the readout function corresponds to the pooling-layer function of the CGCNNs.

Up to now, we have discussed very general features that describe both the crystal structure and the chemical composition. However, should constraints be applied to the material space, the features necessary to study such systems can be vastly simplified. As mentioned above, elemental properties alone can be used as features, e.g., when a training set is restricted to only one kind of crystal structure and stoichiometry.33,35,56,99,143 Consequently, the target property only depends on the chemical elements present in the composition. Another example can be found in ref. 144, where a polymer is represented by the number of building blocks (e.g., number of C6H4, CH2, etc.) or of pairs of blocks.

Crude estimations of properties can be an interesting supplement to standard features, as discussed in ref. 77. As the name implies, this approach consists of calculating a target property (for example, the experimental band gap) utilizing crude estimators [for example, the DFT band gap calculated with the Perdew–Burke–Ernzerhof (PBE) approximation145 to the exchange-correlation functional]. In principle, this approach can achieve successful results, as the machine learning algorithm no longer needs to predict the target property itself but rather an error or a difference between properties calculated with two well-defined methodologies.
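A minimal sketch of this idea, with entirely synthetic data standing in for a cheap (e.g., PBE-like) estimate and an expensive target, could be:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                   # placeholder features
gap_cheap = np.abs(X[:, 0])                      # hypothetical crude estimator
gap_target = 1.3 * gap_cheap + 0.4 + rng.normal(scale=0.05, size=500)  # hypothetical target

# Learn the correction instead of the target itself
features = np.column_stack([X, gap_cheap])       # the crude estimate becomes a feature
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=0.1)
model.fit(features, gap_target - gap_cheap)

gap_pred = gap_cheap + model.predict(features)   # prediction = crude estimate + correction
```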

Fischer et al.146 took another route and used as features a vector that completely denotes the possible ground states of an alloy:

$${\mathbf{X}} = (x_{c_1},x_{c_2},...,x_{c_n},x_{E_1},x_{E_2},...,x_{E_c}),$$
(33)

where \(x_{c_i}\) denotes the possible crystal structures present in the alloy at a given composition and \(x_{E_i}\) the elemental constituents of the system. In this way, the vector X = (fcc, fcc, Au, Ag) would represent the gold–silver system. Furthermore, the probability density p(X) denotes the probability that X is the set of ground states in a binary alloy. With these tools, one can find the most likely crystal structure for a given composition by sorting the probabilities and predict crystal structures by evaluating the conditional probability p(X|e), where e denotes the unknown variables.

Having presented so many types of descriptors, the question that now remains concerns the selection of the best features for the problem at hand. In section “Basic principles of machine learning—Algorithms”, we discuss some automatic feature selection algorithms, e.g., the least absolute shrinkage and selection operator (LASSO), the sure independence screening and sparsifying operator (SISSO), principal component analysis (PCA), or even decision trees. Yet these methods mainly work for linear models, and selecting a feature for, e.g., a neural network force field from the various features we described is not possible with any of these methods. A possible solution to this problem is to perform thorough benchmarks. Unfortunately, while there are many studies presenting their own distinct way to build features and applying them to some problem in materials science, fewer studies96,100,147 actually present quantitative comparisons between descriptors. Moreover, some of the above features require a considerable amount of time and effort to be implemented efficiently and are not readily and easily available.

In view of the present situation, we believe that the materials science community would benefit greatly from a library containing efficient implementations of the above-mentioned descriptors and an assembly of benchmark datasets to compare the features in a standardized manner. Recent work by Himanen et al.148 addresses part of the first problem by providing efficient implementations of common features. The library is, however, lacking the implementation of the derivatives. SchNetPack by Schütt et al.149 also provides an environment for training deep neural networks for potential energy surfaces and various material properties. Further useful tools and libraries can be found in refs. 150,151,152.

Algorithms

In this section, we briefly introduce and discuss the most prevalent algorithms used in materials science. We start with linear- and kernel-based regression and classification methods. We then introduce variable selection and extraction algorithms that are also largely based on linear methods. Concerning completely non-linear models, we discuss decision-tree-based methods, like random forests (RFs) and extremely randomized trees, as well as neural networks. For the latter, we start with simple fully connected feed-forward networks and convolutional networks and continue with more complex applications in the form of variational autoencoders (VAEs) and generative adversarial networks (GANs).

In ridge regression, a multi-dimensional least-squares linear-fit problem, including an L2-regularization term, is solved:

$$\mathop {{\min }}\limits_x |Ax - b|_2^2 + \lambda |x|_2^2.$$
(34)

The extra regularization term is included to favor specific solutions with smaller coefficients.

As complex regression problems can usually not be solved by a simple linear model, the so-called kernel trick is often applied to ridge regression.153 Instead of using the original descriptor x, the data are first transformed into a higher-dimensional feature space ϕ(x). In this space, the kernel k(x, y) is equal to the inner product 〈ϕ(x), ϕ(y)〉. In practice, only the kernel needs to be evaluated, avoiding an inefficient or even impossible explicit calculation of the features in the new space. Common kernels are, e.g.,154, the radial basis function kernel

$$k_G(x,y) = e^{ - \frac{{|x - y|^2}}{{2\sigma ^2}}},$$
(35)

or the polynomial kernel of degree d

$$k_P(x,y) = (x^Ty + c)^d.$$
(36)

Solving the minimization problem given by Eq. (34) in the new feature space results in a non-linear regression in the original feature space. This is usually referred to as kernel ridge regression (KRR). KRR is generally simple to use, as for a successful application of KRR only very few hyperparameters have to be adjusted. Consequently, KRR is often used in materials science.
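A typical KRR workflow with scikit-learn, using the RBF kernel of Eq. (35) and a small grid search for the few hyperparameters (the data and grid values are placeholders), might look like this:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

# lambda of Eq. (34) corresponds to alpha; sigma of Eq. (35) is encoded in gamma
search = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-3, 1e-2, 1e-1], "gamma": [0.01, 0.1, 1.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```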

Support vector machines155 (SVMs) search for the hyperplanes that divide a dataset into classes such that the margin around the hyperplane is maximized (see Fig. 7). The hyperplane is completely defined by the data points that lie the closest to the plane, i.e., the support vectors from which the algorithm derives its name.

Fig. 7 Classification border of a support vector machine with the support vectors shown as arrows

Analogously to ridge regression, the kernel trick can be used to arrive at non-linear SVMs.153 SVM regressors also create a linear model (non-linear in the kernel case) but use the so-called ε-insensitive loss function:

$$\mathrm{Loss} = \begin{cases} 0 & \text{if } \varepsilon > \left| y - f(x) \right| \\ \left| y - f(x) \right| - \varepsilon & \text{otherwise,} \end{cases}$$
(37)

where f(x) is the linear model and ε a hyperparameter. In this way, errors smaller than the threshold defined by ε are neglected.
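In scikit-learn, the ε-insensitive loss of Eq. (37) is available through SVR; the data and hyperparameter values below are placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

# epsilon is the width of the insensitive tube in Eq. (37); C controls regularization
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X, y)
print(svr.named_steps["svr"].support_vectors_.shape[0], "support vectors define the model")
```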

When comparing SVMs and KRR, no big performance differences are to be expected. Usually SVMs arrive at a sparser representation, which can be of advantage; however, their performance relies on a good setting of the hyperparameters. In most cases, SVMs will provide faster predictions and consume less memory, while KRR will take less time to fit for medium datasets. Nevertheless, owing to the generally low computational cost of both algorithms, these differences are seldom important for relatively small datasets. Unfortunately, neither method is feasible for large datasets as the size of the kernel matrix scales quadratically with the number of data points.

Gaussian process regression (GPR) relies on the assumption that the training data were generated by a Gaussian process and therefore consist of samples from a multivariate Gaussian distribution. The only other assumptions that enter the regression are the form of the covariance function k(x, x′) and the mean (which is often assumed to be zero). Based on the covariance matrix, whose elements represent the covariance between two features, the mean and the variance for every possible feature value can be predicted. The ability to estimate the variance is the main advantage of GPR, as the uncertainty of the prediction can be an essential ingredient of a materials design process (see section “Adaptive design process and active learning”). GPR also uses a kernel to define the covariance function. In contrast to KRR or SVMs, where the hyperparameters of the kernel have to be optimized with an external validation set, the hyperparameters in GPR can be optimized with gradient descent if the calculation of the covariance matrix and its inverse is computationally feasible. Although modern and fast implementations of Gaussian processes in materials science exist (e.g., COMBO156), their inherent scaling is quite limiting with respect to the data size and the descriptor dimension, as a naive training requires an inversion of the covariance matrix, which scales as \({\cal{O}}(N^3)\), and even the prediction scales as \({\cal{O}}(N^2)\) with the size of the dataset.157 Based on the principles of GPR, one can also build a classifier. First, GPR is used to qualitatively evaluate the classification probability. Then a sigmoid function is applied to the latent function, resulting in values in the interval [0, 1].
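A minimal GPR example with scikit-learn, using an RBF covariance plus a noise term (kernel choice and data are placeholders), shows how the predictive uncertainty is obtained:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(X, y)               # kernel hyperparameters optimized via the marginal likelihood

X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)   # predictive mean and uncertainty
print(mean, std)
```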

In the previous description of SVMs, KRR, and GPR, we assumed that a good feature choice is already known. However, as this choice can be quite challenging, methods for feature selection can be essential.

The LASSO158,159 attempts to improve regression performance through the creation of sparse models through variable selection. It is mostly used in combination with least-squares linear regression, in which case it results in the following minimization problem159:

$$\mathop{\min}\limits_{\beta, \beta_0} \sum_i (y_i - \beta_0 - \beta x_i)^2 \quad \text{subject to} \quad \sum_i |\beta_i| < t,$$
(38)

where yi are the outcomes, xi the features, and β the coefficients of the linear model that have to be determined. In contrast to ridge regression, where an L2-norm regularization term is used, LASSO aims at setting most coefficients exactly to zero. In order to actually find the model with the minimal number of non-zero components, one would have to use the so-called L0-norm of the coefficient vector instead of the L1-norm used in LASSO. (The L0-norm of a vector is equal to its number of non-zero elements.) However, this problem is non-convex and NP-hard and therefore infeasible from a computational perspective. Furthermore, it is proven160 that the L1-norm is a good approximation in many cases. The ability of LASSO to produce very sparse solutions makes it attractive for cases where a simple, maybe even simulatable, model (see section “Discussion and conclusions—Interpretability”) is needed. The minimization problem of Eq. (38), under the constraint of the L0-norm, and the theory around it are also known as compressed sensing.161
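In practice, the constrained problem of Eq. (38) is usually solved in its penalized form; a short scikit-learn sketch with synthetic data, where only three of fifty candidate features are relevant, illustrates the resulting sparsity:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # 50 candidate features
beta_true = np.zeros(50)
beta_true[:3] = [1.5, -2.0, 0.7]               # only three are actually relevant
y = X @ beta_true + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.05)                      # alpha plays the role of the constraint t
lasso.fit(X, y)
print("surviving features:", np.flatnonzero(lasso.coef_))
```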

Ghiringhelli et al. described an extended methodology for feature selection in materials science based on LASSO and compressed sensing.162 Starting with a number of primary features, the number of descriptors is exponentially increased by applying various algebraic/functional operators (such as the absolute value of differences, exponentiation, etc.) and constructing different combinations of the primary features. Necessarily, physical notions like the units of the primary features constrain the number of combinations. LASSO is then used to reduce the number of features to a point where a brute force combination approach to find the lowest error is possible. This approach is chosen in order to circumvent the problems pure LASSO faces when treating strongly correlated variables and to allow for non-linear models.

As LASSO is unfortunately still computationally infeasible for very high-dimensional feature spaces (>109), Ouyang et al. developed the SISSO,163 which combines sure independence screening,164 other sparsifying operators, and the feature space generation from ref. 162. Sure independence screening selects a subspace of features based on their correlation with the target variable and allows for extremely high-dimensional starting spaces. The selected subspace is then further reduced by applying the sparsifying operator (e.g., LASSO). Predicting the relative stability of octet binary materials as either rock-salt or zincblende was used as a benchmark. In this case, SISSO compared favorably with LASSO, orthogonal matching pursuit,165,166 genetic programming,167 and the previous algorithm from ref. 162. Bootstrapped-projected gradient descent168 is another variable selection method developed for materials science. Its first step consists in clustering the features in order to combat the problems other algorithms like LASSO face when encountering strongly correlated features. The features in every cluster are combined into a representative feature for each cluster. In the following, the sparse linear-fit problem is approximated with projected gradient descent169 for different levels of sparsity. This process is also repeated for various bootstrap samples in order to further reduce the noise. Finally, the intersection of the selected feature sets across the bootstrap samples is chosen as the final solution.

PCA170,171 extracts the orthogonal directions with the greatest variance from a dataset, which can be used for feature selection and extraction. This is achieved by diagonalizing the covariance matrix. Sorting the eigenvectors by their eigenvalues (i.e., by their variance) results in the first principal component, second principal component, and so on. The broad idea behind this scheme is that, in contrast to the original features, the principal components will be uncorrelated. Furthermore, one expects that a small number of principal components will explain most of the variance and therefore provide an accurate representation of the dataset. Naturally, the direct application of PCA should be considered feature extraction, rather than feature selection, as new descriptors in the form of the principal components are constructed. On the other hand, feature selection based on PCA can follow various strategies. For example, when selecting n features, one can take from each of the first n principal components the variable with the highest projection coefficient. A more in-depth discussion of such strategies can be found in ref. 171.
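A brief scikit-learn sketch of PCA-based extraction, together with the simple selection strategy mentioned above (the data are placeholders), could be:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))                  # placeholder feature matrix

pca = PCA(n_components=4)
scores = pca.fit_transform(StandardScaler().fit_transform(X))
print(pca.explained_variance_ratio_)            # variance captured per component

# Simple PCA-based selection: the original variable with the largest
# loading in each of the first n principal components
selected = np.argmax(np.abs(pca.components_), axis=1)
print("selected original features:", selected)
```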

The previous algorithms can be considered as linear models or linear models in a kernel space. An important family of non-linear machine learning algorithms is that of decision trees. In general terms, decision trees are graphs in tree form,172 where each node represents a logic condition aiming at dividing the input data into classes (see Fig. 8) or at assigning a value in the case of regressors. The optimal splitting conditions are determined by some metric, e.g., by minimizing the entropy after the split or by maximizing an information gain.173

Fig. 8 Schema of a classification tree deciding whether a material is stable

In order to avoid the tendency of simple decision trees to overfit, ensembles such as RFs174 or extremely randomized trees175 are used in practice. Instead of training a single decision tree, multiple decision trees with a slightly randomized training process are built independently from each other. This randomization can include, for example, using only a random subset of the whole training set to construct the tree, using a random subset of the features, or a random splitting point when considering an optimal split. The final regression or classification result is usually obtained as an average over the ensemble. In this way, additional noise is introduced into the fitting process and overfitting is avoided.

In general, decision tree ensemble methods are fast and simple to train as they are less reliant on good hyperparameter settings than most other methods. Furthermore, they are also feasible for large datasets. A further advantage is their ability to evaluate the relevance of features through a variable importance measure, allowing a selection of the most relevant features and some basic understanding of the model. Broadly speaking, these are based on the difference in performance of the decision tree ensemble by including and excluding the feature. This can be measured, e.g., through the impurity reduction of splits using the specific feature.176
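A short sketch with scikit-learn's extremely randomized trees and impurity-based importances (synthetic data, placeholder hyperparameters) reads:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = 2.0 * X[:, 0] - X[:, 3] ** 2 + rng.normal(scale=0.1, size=1000)

forest = ExtraTreesRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)
# Impurity-based variable importances; columns 0 and 3 should dominate here
print(np.round(forest.feature_importances_, 3))
```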

Extremely randomized trees are usually superior to RFs in higher variance cases as the randomization decreases the variance of the total model175 and demonstrate at least equal performances in other cases. This proved true for several applications in materials science where both methods were compared.99,177,178 However, as RFs are more widely known, they are still prevalent in materials science.

Boosting methods179 generally combine a number of weak predictors to create a strong model. In contrast to, e.g., RFs where multiple strong learners are trained independently and combined through simple averaging to reduce the variance of the ensemble model, the weak learners in boosting are not trained independently and are combined to decrease the bias in comparison to a single weak learner. Commonly used methods, especially in combination with decision tree methods, are gradient boosting180,181 and adaptive boosting.182,183 In materials science, they were applied to the prediction of bulk moduli184,185 and the prediction of distances to the convex hull, respectively.99,186

There is a wide variety of neural network structures, ranging from feed-forward neural networks and self-organizing maps187 to Boltzmann machines188 and recurrent neural networks.189 However, until now only feed-forward networks have found applications in materials science (even if some Boltzmann machines are used in other areas of theoretical physics190). As such, in the following we will omit “feed-forward” when referring to feed-forward neural networks. In brief, a neural network starts with an input layer, continues with a certain number of hidden layers, and ends with an output layer. The neurons of the nth layer, denoted as the vector xn, are connected to the previous layer through the activation function ϕ(x) and the weight matrix \(A_{ij}^{n - 1}\):

$$x_i^n = \phi \left( {\mathop {\sum}\limits_j {x_j^{n - 1}} A_{ij}^{n - 1}} \right).$$
(39)

The weight matrices are the parameters that have to be fitted during the learning process. Usually, they are trained with gradient-descent-style methods with respect to some loss function (usually the L2 loss with L1 regularization), with the gradients computed through a method known as back-propagation.

Inspired by biological neurons, sigmoidal functions were classically used as activation functions. However, as the gradient of the weight-matrix elements is calculated with the chain rule, deeper neural networks with sigmoidal activation functions quickly lead to a vanishing gradient,191 hampering the training process. Modern activation functions such as rectified linear units192,193

$$\phi(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
(40)

or exponential linear units194

$$\phi(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{otherwise} \end{cases}$$
(41)

alleviate this problem and allow for the development of deeper neural networks.
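Both activation functions are one-liners; a NumPy sketch of Eqs. (40) and (41) is:

```python
import numpy as np

def relu(x):
    """Rectified linear unit of Eq. (40)."""
    return np.where(x > 0, x, 0.0)

def elu(x, alpha=1.0):
    """Exponential linear unit of Eq. (41)."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-2.0, 2.0, 5)
print(relu(x), elu(x))
```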

The real success story of neural networks only started once convolutional neural networks were introduced to image recognition.55,195 Instead of solely relying on fully connected layers, two additional layer variants known as convolutional and pooling layers were introduced (see Fig. 9).

Fig. 9 Topology of a convolutional neural network starting with convolutional layers with multiple filters, followed by pooling and two fully connected layers

Convolutional layers consist of a set of trainable filters, which usually have a receptive field that considers a small segment of the total input. The filters are applied as discrete convolutions across the whole input, allowing the extraction of local features, where each filter will learn to activate when recognizing a different feature. Multiple filters in one layer add an additional dimension to the data. As the same weights/filters are used across the whole input data, the number of hidden neurons is drastically reduced in comparison to fully connected layers, thus allowing for far deeper networks. Pooling layers further reduce the dimensionality of the representation by combining subregions into a single output. Most common is the max pooling layer that selects the maximum from each region. Furthermore, pooling also allows the network to ignore small translations or distortions. The concept of convolutional networks can also be extended to graph representations in material science,139 in what can be considered MPNNs141. In general, neural networks with five or more layers are considered deep neural networks,55 although no precise definition of this term in relation to the network topology exists. The advantage of deep neural networks is not only their ability to learn representations with different abstraction levels but also to reuse them.97 Ideally, the invariance and differentiation ability of the representation should increase with increasing depth of the model.

Obviously, this saves resources that would otherwise be spent on feature engineering. However, some of these resources now have to be allocated to the development of the topology of the neural network. If we consider hard-coded layers (like pooling layers), one can once again understand them as feature extraction through human intervention. While some methods for the automatic development of neural network structures exist (e.g., the neuroevolution of augmenting topologies196), in practice the topologies of neural networks are still developed through trial and error. The extreme speedup in training time through graphics processing unit implementations, together with new methods that improve the training of deep neural networks, like dropout197 and batch normalization,226 has made increasingly deep networks practical.

Improved global optimization techniques, such as evolutionary algorithms,19,227,228,229,230,231,232,233 as well as the progress in energy evaluation methods, expanded the scope of application of “classical” crystal structure prediction methods to a wider range of molecules and solid forms.234 Nevertheless, these methods are still highly computationally expensive, as they require a substantial amount of energy and force evaluations. However, the search for new or better high-performance materials is not possible without searching through an enormous composition and structure space. As there are tremendous amounts of data involved, machine learning algorithms are some of the most promising candidates to take on this challenge.

Machine learning methods can tackle this problem from different directions. A first approach is to speed up the energy evaluation by replacing a first-principle method with machine learning models that are orders of magnitude faster (see section “Machine learning force fields”). However, the most prominent approach in inorganic solid-state physics is the so-called component prediction.61 Instead of scanning the structure space for one composition, one chooses a prototype structure and scans the composition space for the stable materials. In this context, thermodynamic stability is the essential concept. By this we mean compounds that do not decompose (even in infinite time) into different phases or compounds. Clearly, metastable compounds like diamond are also synthesizable and advances in chemistry have made them more accessible.235,236 Nevertheless, thermodynamically stable compounds are in general easier to produce and work with. The usual criterion for thermodynamic stability is based on the energetic distance to the convex hull, but in some cases the machine learning model will directly calculate the probability of a compound existing in a specific phase.

Component prediction

Clearly, the formation energy of a new compound is not sufficient to predict its stability. Ideally, one would always want to use the distance to the convex hull of thermodynamic stability. In contrast to the formation energy, the distance to the convex hull considers the difference in free energy with respect to all possible decomposition channels. De facto, this is not fully the case, because our knowledge of the convex hull is of course incomplete. Fortunately, as our knowledge of the convex hull continuously improves with the discovery of new stable materials, this problem becomes less important over time. Lastly, most first-principles energy calculations are done at zero temperature and zero pressure, neglecting kinetic effects on the stability.
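For a binary system, the distance to the convex hull can be computed with a few lines of NumPy, exploiting the fact that the optimal decomposition involves at most two phases; the formation energies below are purely hypothetical.

```python
import numpy as np

def energy_above_hull(x0, e0, comps, energies):
    """Distance to the convex hull for a binary A-B system.
    comps: fractions x of B for known phases (include the elements at x=0 and x=1);
    energies: their formation energies per atom (elements are 0 by definition).
    The hull energy at x0 is the lowest linear interpolation spanning x0."""
    e_hull = np.inf
    for i in range(len(comps)):
        for j in range(len(comps)):
            xi, xj = comps[i], comps[j]
            if xi <= x0 <= xj and xi != xj:
                f = (x0 - xi) / (xj - xi)
                e_hull = min(e_hull, (1 - f) * energies[i] + f * energies[j])
            elif xi == x0 == xj:
                e_hull = min(e_hull, energies[i])
    return e0 - e_hull

# Hypothetical formation energies (eV/atom) for an A-B system
comps    = np.array([0.0, 0.25, 0.5, 1.0])
energies = np.array([0.0, -0.30, -0.45, 0.0])
print(energy_above_hull(0.5, -0.40, comps, energies))   # ~0.05 eV/atom above the hull
```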

Faber et al.35 applied KRR to calculate formation energies of two million elpasolite (ABC2D6) crystals consisting of main group elements up to bismuth. Errors of around 0.1 eV/atom were reported for a training set of \(10^4\) compositions. Using energies and data from the Materials Project,78 phase diagrams were constructed and 90 new stoichiometries were predicted to lie on the convex hull.

Schmidt et al.99 first constructed a dataset of DFT calculations for approximately 250,000 cubic perovskites (with stoichiometry ABC3), using all elements up to bismuth and neglecting rare gases and lanthanides. After testing different machine learning methods, extremely randomized trees175 in combination with adaptive boosting183 proved the most successful, with a mean average error of 0.12 eV/atom. Curiously, the error in the prediction depends strongly on the chemical composition (see Fig. 12). Furthermore, an active learning approach based on pure exploitation was suggested (see section “Adaptive design process and active learning”).

Fig. 12 Mean average error (in meV/atom) for adaptive boosting used with extremely random trees averaged over all perovskites containing the element. The numbers in parentheses are the actual averaged error for each element. (Reprinted with permission from ref. 99. Copyright 2017 American Chemical Society.)

In ref. 186, the composition space of two ternary prototypes with stoichiometry AB2C2 (the tI10-CeAl2Ga2 and tP10-FeMo2B2 prototype structures) was explored for stable compounds using the approach developed in ref. 99. In total, 1893 new compounds were found on the convex hull, while saving around 75% of the computation time and with false-negative rates of 0% for the tP10 and only 9% for the tI10 prototype.

Ward et al.34 used standard RFs to predict formation energies based on features derived from Voronoi tessellations and atomic properties. For a training set of around 30,000 materials, these descriptors showed better performance than Coulomb matrices108 and partial RDFs109 (see section “Basic principles of machine learning—Features” for the different descriptors). Surprisingly, the structural information from the Voronoi tessellation did not improve the results for this training set. This is due to the fact that very few materials with the same composition, but different structure, are present in the dataset. Changing the training set to an impressive 400,000 materials from the open quantum materials database80 proved this point, as the error of the composition-only model was then 37% higher than that of the model including the structural information.

A recent study by Kim et al.237 used the same method for the discovery of quaternary Heusler compounds and identified 53 new stable structures. The model was trained for different datasets (complete open quantum materials database,80 only the quaternary Heusler compounds, etc.). For the prediction of Heusler compounds, it was found that the accuracy of the model also benefited from the inclusion of other prototypes in the training set. It has to be noted that studies with such large datasets are not feasible with kernel-based methods (e.g. KRR, SVMs) due to their unfavorable computational scaling.

Li et al.33 applied different regression and classification methods to a dataset of approximately 2150 A1−xA′xB1−yB′yO3 perovskites, materials that can be used as cathodes in high-temperature solid oxide fuel cells.238 Elemental properties were used as features for all methods. Extremely randomized trees proved to be the best classifiers (accuracy 0.93, F1-score 0.88), while KRR and extremely randomized trees had the best performance for regression, with mean absolute errors of <17 meV/atom. The errors in this work are difficult to compare to others, as the elemental composition space was very limited.

Another work treating the problem of oxide–perovskite stability is ref. 56. Using neural networks based only on the elemental electronegativities and ionic radii, Ye et al. achieved a mean absolute error of 30 meV/atom for the prediction of the formation energy of unmixed perovskites. Unfortunately, their dataset contained only 240 compounds for training, cross-validation, and testing. Ye et al.56 also achieved comparable errors for mixed perovskites, i.e., perovskites with two different elements on either the A- or the B-site. Mean absolute errors of 9 and 26 meV/atom were then obtained, respectively, for unmixed and mixed garnets with the composition C3A2D3O12. By reducing the mixing to the C-site and including additional structural descriptors, Ye et al. were able to once again decrease the latter error to merely 12 meV/atom. If one compares this study to, e.g., refs. 35,99, the errors seem extremely small. This is easily explained once we notice that ref. 56 only considers a total compound space of around 600 compounds, in comparison to around 250,000 compounds in ref. 99. In other words, the complexity of the problem differs by more than two orders of magnitude.

The CGCNNs (see section “Basic principles of machine learning—Features”) developed by ** methods to estimate the uncertainty. In ref. 375, Balachandran et al. compared different surrogate models and strategies on a set of M2AX compounds for the optimization of elastic properties. From a pure prediction perspective, SVRs with radial basis function kernels slightly outperformed Gaussian processes for training set sizes >120 materials. Different design strategies were then used in combination with the SVR. It turned out that efficient global optimization,387 as well as the knowledge gradient,388 showed the best results. Xue et al.384 obtained similar results concerning the choice of algorithms for the composition optimization of NiTi-based shape memory alloys. Starting with a set of 22 materials, Xue et al. successfully synthesized 14 materials (out of a total of 36 synthesized during 9 feedback loops) that were superior to the materials in the original dataset.

Balachandran et al.389 also applied SVRs in combination with efficient global optimization to the maximization of the band gap of A10(BO4)6X2 apatites. In this case, the performance of two feature sets, one containing the Shannon ionic radii and the other the Pauling electronegativity differences, was compared. Interestingly, the design based on the ionic radii performed better, finding the optimal material after 22 materials (13 in the initial training set and 9 chosen by the design algorithm) in comparison to 30 for the electronegativities, while having a far larger error in the machine learning model (0.54 eV compared to 0.19 eV for the electronegativities). The result is most likely due to the fact that, of the three atomic species considered for the B-site (P, V, As), P clearly provides higher band gaps than the other elements and has a different ionic radius, while the electronegativities of P and As are nearly the same.389 Using this information, the algorithm eliminated all compositions without P on the B-site. This example demonstrates that the algorithm with the highest predictive power will not necessarily lead to the best design results. A combination of the two feature sets led to even better results, finding the optimal composition after a single iteration; however, the mean absolute error of the model was still slightly worse (0.21 eV) than that of the purely electronegativity-based model.
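Efficient global optimization relies on the expected-improvement acquisition function to balance the surrogate's predictions against its uncertainty. The sketch below shows this acquisition for a maximization problem; the means, standard deviations, exploration parameter xi, and candidate values are placeholders, not data from the studies above.

```python
# Sketch of the expected-improvement acquisition used in efficient global
# optimization: rank candidates by how much they are expected to improve on
# the best material found so far (maximization convention).
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best, xi=0.01):
    sigma = np.maximum(sigma, 1e-12)          # avoid division by zero
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([1.2, 1.5, 0.9])      # surrogate predictions (e.g., band gap in eV)
sigma = np.array([0.1, 0.4, 0.05])  # surrogate uncertainties
y_best = 1.4                        # best value measured so far
next_candidate = int(np.argmax(expected_improvement(mu, sigma, y_best)))
```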

Ling et al.329 treated a high-dimensional (with respect to the descriptor space) materials design problem with the RF framework FUELS.390 By adding a bias term to the uncertainty, which accounts for noise and missing degrees of freedom, they expanded upon previous uncertainty estimates from refs. 391,392. Tested on four datasets (magnetocaloric, thermoelectric, superconductors, and thermoelectric) with comparatively large numbers of descriptors (54, 54, 56, and 22, respectively), FUELS compared favorably with the Bayesian framework COMBO and with random sampling, while being roughly an order of magnitude faster. In order to evaluate the various selection strategies and model algorithms, different metrics were used. In materials science, a commonly used metric is the number of experiments until the optimal material is found. While this metric has some merit, in most cases the opportunity cost (the distance of the current best from the overall best) or the number of experiments until the current best is within a specific distance (e.g., 1%) of the optimum is a better measure and is also used more often in the literature.393,394

Monte Carlo tree search395 is a second algorithm with superior scaling that has recently been introduced to materials science. The application is inspired by its success in go,2 where a combination of neural networks, reinforcement learning, and Monte Carlo tree search allowed for the first superhuman performance in this ancient strategy game. Dieb et al.396 implemented a materials design version in the form of the open source library MDTS. Using the test case of the optimal design of thermoelectric Si-Ge alloys, they demonstrated that, although Bayesian optimization has advantages for small problems due to its advanced prediction abilities, the design time of Monte Carlo tree search stays close to constant with increasing problem size (see Fig. 17). Furthermore, and in contrast to genetic algorithms, it does not require the determination of hyperparameters. Owing to the unfavorable scaling of Bayesian optimization, at some point the computational effort of the design becomes larger than the computational effort of the experiments, at which point Monte Carlo methods become superior. For the interface structure optimization in ref. 396, this is already the case for interfaces with >22 atoms. Further applications to the determination of grain boundary structures397 and the structure of boron-doped graphene398 also demonstrate the viability of the method for structure design problems. A more in-depth review of Bayesian optimization and Monte Carlo tree search in materials design can be found in ref. 399.

Fig. 17
figure 17

Design time of Bayesian optimization and Monte Carlo tree search for different numbers of atoms in the interface. (Reprinted with permission from ref. 396 licensed under the CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).)
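At each step of a Monte Carlo tree search, the next branch (e.g., which atom type to place at a given site) is typically chosen by an upper-confidence-bound rule such as UCT. The minimal sketch below is a generic illustration of that selection step, not the implementation used in MDTS; the node structure and reward scale are assumptions.

```python
# Minimal sketch of the UCT selection rule at the heart of Monte Carlo tree
# search: children are ranked by mean reward plus an exploration bonus.
import math

def uct_select(children, c=math.sqrt(2)):
    """children: list of dicts with 'visits' and 'total_reward'."""
    n_parent = sum(ch["visits"] for ch in children)
    def score(ch):
        if ch["visits"] == 0:
            return float("inf")                 # always try unvisited moves first
        mean = ch["total_reward"] / ch["visits"]
        return mean + c * math.sqrt(math.log(n_parent) / ch["visits"])
    return max(children, key=score)

children = [{"visits": 10, "total_reward": 6.0},
            {"visits": 3, "total_reward": 2.4},
            {"visits": 0, "total_reward": 0.0}]
best = uct_select(children)   # the unvisited child is selected first
```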

Sawada et al.400 also developed an algorithm for multicomponent design based on game tree search. Optimizing the composition in a seven-component Heusler compound, the algorithm proved to be around nine times faster than expected improvement or upper confidence bound401 strategies based on Gaussian processes.

Dehghannasiri et al.402 proposed an experimental design framework based on the mean objective cost of uncertainty. This is defined as the expected difference in cost between the material that minimizes the expected cost under the surrogate model and the truly optimal material.403 Applying the framework to the minimization of the dissipation energy of shape memory alloys demonstrated the superiority of the algorithm over pure exploitation and random selection.

So far, none of the discussed algorithms considered nested decision problems or cases where it is more efficient to carry out experiments in batches of similar experiments instead of one at a time. The latter is, for example, true for the photoactive device design considered by Wang et al.404 The size of thiol-gold nanoparticles and their density on the surface determine the efficiency of the device. While one can easily explore different densities of nanoparticles in a batch of experiments, it is difficult to change the size of the nanoparticles due to the cost of their synthesis. Therefore, it is more efficient to consider a nested problem, where the algorithm first chooses a size and then a batch of densities. Wang et al. extended the concept of the knowledge gradient388 to the case of nested decisions and batches of experiments. Applied to the previously described design problem, the new algorithm proved to be superior to all naive strategies (pure exploitation/exploration, or ε-greedy, which chooses between pure exploration and exploitation with probability ε) and also to the sequential knowledge gradient (batch size 1) if one considers the number of batches. If one instead considers the total number of experiments, the sequential knowledge gradient performed only slightly better.

Typical design problems often involve multiple objectives. For example, for the design of a shape memory alloy, one desires a specific finish temperature and thermal hysteresis, and possibly a high maximum transformation strain. Naturally, this requires more sophisticated measures of improvement (see ref. 405 for a review) than single-objective optimization methods. A typical choice is the expected hypervolume improvement,406 which quantifies the change in the hypervolume of the objective space dominated by the best known materials. Solomou et al.407 applied this metric to the optimization of shape memory alloys in combination with a Gaussian process model, once for two objectives (specific finish temperature and thermal hysteresis) and once for three objectives (adding the maximum transformation strain), and demonstrated that it is clearly superior to a random or purely exploitative strategy.
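For two objectives, the hypervolume in question is simply the area dominated by the current Pareto front relative to a reference point; the expected hypervolume improvement then scores a candidate by how much this area is expected to grow. The sketch below computes the 2D hypervolume for two maximized objectives; the points and reference are illustrative placeholders.

```python
# Sketch of the hypervolume indicator for two objectives (both maximized),
# which expected-hypervolume-improvement strategies try to increase.
import numpy as np

def hypervolume_2d(points, ref):
    """Area dominated by `points` relative to the reference point `ref`."""
    pts = np.asarray(points, dtype=float)
    # keep only points that dominate the reference point
    pts = pts[(pts[:, 0] > ref[0]) & (pts[:, 1] > ref[1])]
    if len(pts) == 0:
        return 0.0
    # sort by the first objective (descending), sweep and accumulate rectangles
    pts = pts[np.argsort(pts[:, 0])[::-1]]
    hv, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

pareto_front = [(2.0, 1.0), (1.5, 1.8), (1.0, 2.5)]   # e.g., (strain, -hysteresis)
print(hypervolume_2d(pareto_front, ref=(0.0, 0.0)))
```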

Talapatra et al.408 also combined the expected hypervolume improvement with Gaussian processes in order to simultaneously maximize the bulk modulus and minimize the shear modulus. Instead of using a single Gaussian process regressor, they developed a method called Bayesian model averaging, which combines different models. This approach can prove useful in cases where the available data are too limited to choose good features or hyperparameters.

Gopakumar et al.409 compared SVRs and Gaussian processes on multiple datasets: optimal thermal hysteresis and transition temperature for shape memory alloys, optimal bulk and Young's moduli for M2AX phases, and optimal piezoelectric modulus and band gap for piezoelectric materials. SVRs performed better as regressors and were consequently chosen as the surrogate model. Several optimal design strategies were used, specifically random, exploitation, exploration, centroid, and maximin. For the smallest dataset, maximin surprisingly performed only slightly better for large experimental budgets and worse than pure exploitation for small budgets. However, for the larger dataset of elastic moduli, both centroid and maximin proved to be clearly superior.

An additional popular choice of global optimization algorithms that can also be applied to adaptive design, especially to structure development, are genetic algorithms. Reviews of their application to materials design can be found in refs. 230,410.

It is difficult to compare the ability of the different optimal design algorithms and frameworks discussed in this section because no systematic study has ever been carried out. Nevertheless, it is quite clear that, given sufficient data, adaptive design algorithms produce superior results in comparison to naive strategies like pure exploration or exploitation, which are unfortunately still extremely common in materials science. Furthermore, several works demonstrated that experimental resources are used more efficiently if they are allocated to the suggestions of the design algorithm instead of a larger initial random training set. Machine learning models can be quite limited in their accuracy; however, the inclusion of knowledge of this uncertainty in the design process can alleviate these limitations. This allows for a feedback cycle between experimentalists and theoreticians, which increases trust and cooperation and reduces the number of expensive experiments.

Machine learning force fields

As previously discussed, first-principles calculations can accurately describe most systems, but at a high computational price. Usually this price is too high for molecular dynamics, Monte Carlo, global structural prediction, or other simulation techniques that require frequent evaluations of the energy and forces. Even DFT is limited to molecular dynamics runs of a few picoseconds and to simulations with hardly more than a few thousand atoms. For this reason, research on empirical potentials and on models of the potential energy surface never faded away.

In fact, most molecular dynamics simulations are still performed with classical force fields.411,412,413,414,415,416,417 As these potentials often scale linearly with the number of atoms, they are computationally inexpensive, and the loss in accuracy is accepted in exchange for the possibility to perform longer simulations or simulations with hundreds of thousands or even millions of atoms. Another approach is DFT-based tight binding.418,419,420 This quantum-mechanical technique scales with the cube of the number of electrons but has a much smaller prefactor than DFT. Certainly, calculations performed with this method are not as accurate as DFT calculations, but they are more reliable than classical force field calculations. In addition to the reduced precision, the construction of force fields and of tight-binding parameters is unfortunately not straightforward.

Neural networks were the first machine learning method used in the construction of potential energy surfaces. As early as 1992, Sumper et al.421 used a neural network to relate the vibrational spectra of a polyethylene molecule to its potential energy surface. Unfortunately, the large amount of input data and architecture optimization required rendered this approach too cumbersome and difficult to apply to other molecular systems. It was the work of Blank et al.422 in 1995 that really showed the potential, and marked the birth, of machine learning force fields. Their work on the surface diffusion of CO/Ni(111) relied on neural network potentials that mapped the structure of the system, mainly the position of the center of mass and the angle of the molecular axis relative to the surface normal, to its energy. The training set was obtained from electronic structure calculations and no further approximations were used. Their seminal study proved that neural networks could be used to make accurate and efficient predictions of the potential energy surface for systems with several degrees of freedom.

Since then, many machine learning potentials have been reported. As several reviews on these potentials can easily be found in the literature,112,423,424,425 here we discuss only the most prominent and recent approaches related to materials science.

One of the most successful applications of machine learning to the creation of a reliable representation of the potential energy surface is the Behler–Parrinello approach.110 Here the total energy of a system is represented as a sum of atomic contributions Ei. This became the standard for all later machine learning force fields, as it allows their application to very large systems. In the Behler–Parrinello approach, a feedforward multilayer perceptron is used to map each atom to its contribution to the energy. Every atom of a system is described by a set of symmetry functions, which serve as input to the neural network of that element; every element in the periodic table is characterized by a different network. As the neural network provides an energy, analytical differentiation with respect to the atomic positions or the strain delivers, respectively, forces and stresses. This approach was originally applied to bulk silicon, reproducing DFT energies up to an error of 5 meV/atom. Furthermore, molecular dynamics simulations using this potential were able to reproduce the RDF of a silicon melt at 3000 K. Many applications of this methodology to the field of materials science have appeared since then, for example, to carbon,426 sodium,427 zinc oxide,428 titanium dioxide,111 germanium telluride,429 copper,430 gold,431 and Al-Mg-Si alloys.432
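The structure of such a potential can be sketched as follows: each atom's neighborhood is condensed into a few symmetry-function values, an element-specific network maps them to an atomic energy, and the atomic energies are summed. The tiny untrained network, weights, and toy structure below are illustrative placeholders, not a usable potential.

```python
# Schematic of the Behler-Parrinello decomposition: per-atom symmetry functions
# -> element-specific network -> atomic energy -> sum over atoms.
import numpy as np

def cutoff(r, rc=6.0):
    """Smooth cutoff function used in Behler-Parrinello symmetry functions."""
    return np.where(r < rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def radial_symmetry(r_ij, eta=0.5, r_s=0.0, rc=6.0):
    """G2-type radial symmetry function for one atom given neighbor distances."""
    return np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, rc))

def atomic_energy(g, weights):
    """One-hidden-layer network mapping symmetry functions to an atomic energy."""
    w1, b1, w2, b2 = weights
    return float(np.tanh(g @ w1 + b1) @ w2 + b2)

# Placeholder: two chemical species, each with its own (untrained) network.
rng = np.random.default_rng(0)
nets = {s: (rng.normal(size=(2, 8)), np.zeros(8), rng.normal(size=8), 0.0)
        for s in ("A", "B")}

# Placeholder structure: per-atom species and neighbor distances (Angstrom).
atoms = [("A", np.array([2.1, 2.3, 3.9])), ("B", np.array([2.1, 2.8]))]
etas = (0.5, 4.0)                  # two radial symmetry functions per atom

E_total = 0.0
for species, r_ij in atoms:
    g = np.array([radial_symmetry(r_ij, eta=eta) for eta in etas])
    E_total += atomic_energy(g, nets[species])
```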

Since its publication in 2007, several improvements have been made to the Behler–Parrinello approach. In 2015, Ghasemi et al. proposed a charge equilibration technique via neural networks,433 where an environment-dependent atomic electronegativity is obtained from the neural networks and the total energy is computed from a charge equilibration method. This technique successfully reproduced several bulk properties of CaF2.434 In 2011, the cost function was expanded to include force terms.424,428 This extension was first proposed by Witkoskie et al.435 and later extended and generalized by Pukrittayakamee et al.436,437 These works show that the inclusion of the gradients in the training substantially improves the accuracy of the force fields, not only due to the increase of the size of the training set but also due to the additional restrictions in the training. Hajinazar et al.438 devised a strategy to train hierarchical multicomponent systems, starting with elemental substances and going up to binaries, ternaries, etc. They then applied this technique to the calculation of defect and formation energies of Cu, Pd, and Ag systems and were able to obtain an excellent reproduction of phonon dispersions. Another improvement concerns the replacement of the original Behler–Parrinello symmetry functions by descriptors that can be systematically improved. One such descriptor is given by Chebyshev polynomials,117 which also allow for the creation of potentials for materials with several chemical elements, owing to their constant complexity with respect to the number of species. Potentials constructed with this descriptor currently describe the largest number of chemical species among reported machine learning potentials, with 11 so far.
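Including gradients means that the training loss contains, besides the energy error, terms for the force (and possibly stress) errors. A minimal sketch of such a combined loss is given below; the relative weights are assumptions and would in practice be tuned for the system at hand.

```python
# Sketch of a combined loss for training a potential on energies, forces, and
# stresses simultaneously. The relative weights are illustrative assumptions.
import numpy as np

def potential_loss(E_pred, E_ref, F_pred, F_ref, S_pred, S_ref,
                   w_e=1.0, w_f=10.0, w_s=0.1):
    n_atoms = F_ref.shape[0]
    loss_e = ((E_pred - E_ref) / n_atoms) ** 2        # per-atom energy error
    loss_f = np.mean((F_pred - F_ref) ** 2)           # force components
    loss_s = np.mean((S_pred - S_ref) ** 2)           # stress/virial tensor
    return w_e * loss_e + w_f * loss_f + w_s * loss_s

# Example with placeholder arrays for a 4-atom structure.
rng = np.random.default_rng(0)
F_ref, F_pred = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
S_ref, S_pred = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
print(potential_loss(-12.3, -12.1, F_pred, F_ref, S_pred, S_ref))
```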

Artrith et al.439 proved the applicability of specialized neural network potentials in their study of amorphous Li–Si phases. They compared the results obtained with two different sampling methods. The first involved a delithiation algorithm, which coupled a genetic algorithm with a specialized potential trained with only 725 structures close to the crystalline LixSi1−x phase. The second method consisted of an extensive molecular dynamics heat-quench sampling and a more general potential. Figure 18 shows the accuracy of the latter neural network potential.

Fig. 18
figure 18

Phase diagram of 45,000 LixSi1−x structures depicting the formation energies predicted using the general neural network potential (green stars) and the density functional theory reference formation energies (black circles). (Reprinted from ref. 439, with the permission of AIP Publishing.)

We note that not only neural network-based methods are changing the field of materials science but also other machine learning methodologies. The spectral neighbor analysis potential440 of Thompson et al. consists of a linear fit that associates an atomic environment, represented by the four-dimensional bispectrum components, with the energies of solids and liquids. The first application of this potential to tantalum showed promising results, as it was able to correctly reproduce the relative energies of different phases. Furthermore, in the application of this potential to molybdenum by Chen et al.,441 PCA was used to examine the distribution of the features in descriptor space. This technique increases the efficiency of the fitting, as it ensures a good coverage of the feature space and reduces the number of structures in the training set. Their potential achieved good accuracies for energies and stresses (9 meV and 0.9 GPa, respectively). Although the accuracy in the forces was considerably worse (0.30 eV/Å), they also managed to correctly reproduce several mechanical properties, such as the bulk modulus, lattice constants, and phonon dispersions (see Fig. 19). Wood et al.442 proposed an improvement of the model that consists in the introduction of quadratic terms in the bispectrum components, and Li et al.443 introduced a two-step model-fitting workflow for multicomponent systems and applied it to the binary Ni–Mo alloy.

Fig. 19
figure 19

Comparison between the phonon dispersion curves obtained with density functional theory and the spectral neighbor analysis potential model for a 5 × 5 × 5 supercell of Mo. (Reprinted with permission from ref. 441. Copyright 2017 American Physical Society.)

Other linear models include the work of Seko et al.,113 who reproduced the potential energy surfaces of Na and Mg using KRR and LASSO combined with the multinomial expansion descriptor (see section “Basic principles of machine learning—Features”). Phonon dispersion and specific heat curves calculated with the LASSO technique for hcp-Mg were in good agreement with the DFT results. Using a similar methodology, Seko et al. applied elastic net regression,444,445 a generalization of the LASSO technique, to 10 other elemental metals446 (Ag, Al, Au, Ca, Cu, Ga, In, K, Li, and Zn). The resulting potentials yielded a good accuracy for energies, forces, and stresses, enabling the prediction of several physical properties, such as lattice constants and phonon spectra.

In a different approach, Li et al.447 devised a molecular dynamics scheme that relies on forces obtained either by Bayesian inference using GPR or by on-the-fly quantum mechanical calculations (tight binding, DFT, or other). Certain simulations in materials science involve steps where complex, recurring chemical bonding geometries are encountered. The principal idea behind this scheme is that an adaptive approach can handle the occurrence of unseen geometries while the recurring ones are trained for. This is achieved by the following predictor-corrector algorithm448,449: after n steps of the simulation with the force field, the latest configuration is selected for quantum mechanical treatment and the accuracy of the force field is tested. Should the accuracy fall below a certain threshold, the force field is refitted. This scheme might not be the most efficient for a single molecular dynamics run but excels when the simulations involve repeated cycles between two temperatures, for example. Applications to silicon,447 aluminum, and uranium450 (with linear regression) reveal accuracies for forces of <100 meV/Å. The phonon density of states and melting temperature of aluminum obtained with this scheme are also in good agreement with ab initio calculations. In the same spirit, Glielmo et al.451 employed vectorial Gaussian process452,453 regression to predict forces using vector two-body kernels of covariant nature. Their results for nickel, silicon, and iron indicate that the inclusion of symmetries results in more efficient learning and that it is not necessary to impose energy conservation to achieve force covariance. Additional improvements of this methodology include the replacement of the features by higher-order n-body-based kernels.454
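The predictor-corrector idea can be summarized in a few lines of Python: run the surrogate for n steps, periodically compare against a reference calculation, and refit when the error exceeds a threshold. Everything below (the stand-in force functions, the refitting routine, the threshold, and the simplified update without velocities) is an illustrative assumption, not the actual scheme of ref. 447.

```python
# Skeleton of an on-the-fly predictor-corrector loop for machine learning MD.
# All functions are illustrative stand-ins for the real QM and surrogate models.
import numpy as np

def qm_forces(positions):                      # stand-in for DFT/tight-binding forces
    return -0.1 * positions

def md_step(positions, model, dt=1e-3):
    return positions + dt * model(positions)   # schematic update (no velocities)

def refit(training_set):
    """Stand-in for refitting the surrogate on the accumulated QM data."""
    scale = np.mean([np.sum(f * p) / np.sum(p * p) for p, f in training_set])
    return lambda pos: scale * pos

positions = np.random.default_rng(0).normal(size=(8, 3))
training_set = [(positions.copy(), qm_forces(positions))]
model = refit(training_set)

for step in range(1000):
    positions = md_step(positions, model)
    if step % 50 == 0:                         # every n steps, test against QM
        f_ref = qm_forces(positions)
        error = np.max(np.abs(model(positions) - f_ref))
        if error > 0.1:                        # threshold in eV/Angstrom (assumed)
            training_set.append((positions.copy(), f_ref))
            model = refit(training_set)        # correct the force field on the fly
```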

Another family of highly successful machine learning potentials are the Gaussian approximation potentials (GAPs). First introduced by Bartók et al.,114 these potentials interpolate the atomic energy in bispectrum space using GPR. Tests for semiconductors and iron revealed a remarkable reproduction of the ab initio potential energy surface. Advances in this methodology include the replacement of the bispectrum descriptor by the SOAP descriptor and the training on not only energies but also forces and stresses,455 the generalization of the approach to solids456 by adding two- and three-body descriptors, and the possibility to compare structures with multiple chemical species.457 The materials studied in these works were tungsten, carbon, and silicon, respectively. The application of GAPs to bcc ferromagnetic iron by Dragoni et al.458 proves the accuracy of these potentials for both DFT energetics and thermodynamic properties. In particular, bulk point defects, phonons, the Bain path, and Γ surfaces459 are correctly reproduced. By combining single-point DFT calculations, GAPs, and random structure search,220,221 Deringer et al. demonstrated a procedure that simultaneously explores and fits a complex potential energy surface.460 They used 500 random structures to train a GAP model, which was then used to perform the conjugate gradient steps of the random search. The minimum structures were added to the training set after being recalculated with single-point DFT calculations. The resulting potential for boron was able to describe the energetics of multiple polymorphs, including α-B12 and β-B106.

The GAP methodology was also applied to graphene.461 The potential constructed by Rowe et al. was able to reproduce DFT phonon dispersion curves at 0 K. In addition, the potential predicted quantitatively the lattice parameter, phonon spectra at finite temperature, and the in-plane thermal expansion. Other works concerning GPR include its application to formaldehyde and comparison of the results with neural networks462 and the acceleration of geometry optimization for some molecules.463

Jacobsen et al. presented another structure optimization technique based on evolutionary algorithms and atomic potentials constructed using KRR.464 To represent the atomic environment, they used the fingerprint function proposed by Oganov and Valle.465 By using the atomic potentials to estimate the energy, they were able to reach a considerable speed-up of the search for the global minimum structure of SnO2(110)-(4 × 1).

In an unconventional way to construct atomic potentials, Han et al.466 presented a deep neural network that, for each atom in a structure, takes as input Nc functions of the distance between the atom and its neighbors, where Nc is the maximum number of neighbors considered. As a consequence, when an atom has fewer than Nc neighbors, some of the inputs of the neural network have to be set to zero. Furthermore, the potential might have transferability problems if ever used on a structure with smaller interatomic distances than the ones considered in the training set. Nevertheless, their potential showed good accuracy in energy predictions for copper and zirconium. Zhang et al.467 improved this methodology by generalizing the loss function to include forces and stresses.

DFT functionals

The application of machine learning techniques has also spread to the creation of exchange-correlation potentials and energy functionals. The first application emerged from the work of Tozer et al.468 in 1996, who devised a feed-forward neural network with one hidden layer to map the electronic density ρ(r) to the exchange-correlation potential vxc(r) at the same points. Technically, this exchange-correlation functional belongs to the family of local-density approximations. Tozer et al. trained the neural network on two different datasets, first on the data of a single water molecule and afterwards on several molecules (namely, Ne, HF, N2, H2O, and H2). Using 3768 data points calculated with a regular molecular numerical integration scheme,469 the method achieved an accuracy of 2–3% in the exchange-correlation energy of the water molecule. When applied in a self-consistent Kohn–Sham calculation, the potential led to eigenvalues and optimized geometries congruent with the local-density approximation. For the set of several molecules, on the other hand, Tozer et al. obtained an error of 7.6% using 1279 points. The points were obtained in the same manner as before but were constrained to avoid successive points with similar densities. This potential generated geometries close to the local-density approximation and good eigenvalues for molecules sufficiently represented in the training set. Meanwhile, and as expected, the neural network potential failed for molecules not sufficiently represented in the training set, like LiH and Li2.

In 2012, Snyder et al. tackled the problem of noninteracting spinless fermions confined to a 1D box.470 They employed KRR to construct a machine learning approximation for the kinetic energy functional of the density. This is the idea behind orbital-free DFT and an attempt to bypass the need to solve a Schrödinger-like equation. Kinetic energy and density pairs for up to four electrons were obtained using Numerov's method471 for several external potentials. These potentials were created using a linear combination of three Gaussian dips with random depths, widths, and centers. Furthermore, 1000 densities were used for the test set while M were used for the training set. For M = 200, chemical accuracy was achieved, as no error surpassed 1 kcal/mol. To obtain the correct behavior of the functional derivative of this energy, which is necessary for the self-consistent DFT procedure, PCA was used. Self-consistent calculations with this functional led to a range of similar densities instead of a unique density and to higher errors in the energy than when using the exact density. Nevertheless, the functional reached chemical accuracy.
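The basic workflow, a discretized density on a grid mapped by KRR to a scalar energy, can be sketched as follows. To keep the example self-contained, the "reference" energy is a gradient-based stand-in rather than the exact kinetic energies of ref. 470, and the densities are random placeholders.

```python
# Toy illustration of learning a density functional with KRR: densities on a 1D
# grid are mapped to a scalar energy. The reference energy used here is a
# von Weizsaecker-like stand-in, not the exact kinetic energy of ref. 470.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

grid = np.linspace(0.0, 1.0, 100)
dx = grid[1] - grid[0]
rng = np.random.default_rng(0)

def random_density(x):
    """Normalized density built from three random Gaussian bumps (placeholder)."""
    c, w, h = rng.uniform(0.2, 0.8, 3), rng.uniform(0.03, 0.1, 3), rng.uniform(1, 10, 3)
    n = sum(hi * np.exp(-(x - ci) ** 2 / (2 * wi**2)) for ci, wi, hi in zip(c, w, h))
    return n / (np.sum(n) * dx)

def stand_in_energy(n):
    """Gradient-based stand-in for the target functional."""
    dn = np.gradient(n, dx)
    return np.sum(dn**2 / (8 * np.maximum(n, 1e-8))) * dx

densities = np.array([random_density(grid) for _ in range(400)])
energies = np.array([stand_in_energy(n) for n in densities])

model = KernelRidge(kernel="rbf", alpha=1e-8, gamma=1e-3)
model.fit(densities[:300], energies[:300])
print("test MAE:", np.mean(np.abs(model.predict(densities[300:]) - energies[300:])))
```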

This methodology was later improved during the study of the bond breaking for a 1D model of a diatomic molecule, subjected to a soft Coulomb interaction.472 The training data consisted of Kohn–Sham energies and densities calculated with the local-density approximation for 1D H2, H2, Li2, Be2, and LiH with different nuclear separations. Choosing up to 20 densities for each molecule for the training set produced smaller errors in the kinetic energy functional than those due to the approximation to the exchange-correlation functional. This new functional was able to produce binding energy curves indistinguishable from the local-density approximation.

A different path was taken by Brockherde et al.,473 who, instead of solving the Kohn–Sham equations self-consistently as usual, used KRR to learn the Hohenberg–Kohn map between the potential v(r) and the density n(r). In the machine learning community, this approach is normally designated as transductive inference. The energy is then obtained from the density, also using KRR. When applied to the problem of noninteracting spinless fermions confined to a 1D box (the same problem as in ref. 470), this machine learning map reproduced the correct energy up to 0.042 kcal/mol (if calculated on a grid) or 0.017 kcal/mol (using other basis sets), for a training set of 200 samples. Comparison of this map with other machine learning maps that learn only the kinetic energy reveals that the Hohenberg–Kohn map approach is much more accurate. Furthermore, this map achieved similar results when applied to molecules, reaching accuracies of 0.0091 kcal/mol for water and 0.5 kcal/mol for benzene, ethane, and malonaldehyde. These values measure the difference to the PBE energy. The training sets consisted of 20 points for water and 2000 points for the other molecules. To generate the training sets for the larger molecules, molecular dynamics simulations using the general AMBER force field474 were used to yield a large set of geometries. These were subsequently sampled using the k-means approach to obtain 2000 representative structures that were only then evaluated using the PBE functional. In addition, the precision of the density prediction for benzene was compared with the results for the local-density approximation and PBE. Not only did the Hohenberg–Kohn map produce densities with errors smaller than the difference between the two functionals (when evaluated on a grid), but these errors were also smaller than the ones introduced by evaluating the PBE functional using a Fourier basis representation instead of evaluating it on the grid.

A distinct approach comes from Liu et al.,475 who applied a neural network to determine the value of the range-separation parameter μ of the long-range corrected Becke–Lee–Yang–Parr functional.476,477 They trained a neural network with one hidden layer on 368 thermochemical and kinetic energies. These values came from experimental data and from highly accurate quantum chemistry calculations. When compared with the original functional (μ = 0.47), the new functional improved the accuracy of heats of formation and atomization energies, while performing slightly worse in the calculation of ionization potentials, reaction barriers, and electron affinities.

Nagai et al.478 trained a neural network with two hidden layers (300 nodes) to produce the projection from the charge density onto the Hartree-exchange-correlation potential (vHxc). For that, they solved a simple model of two interacting spinless fermions under the effect of a 1D Gaussian potential, using exact diagonalization. The ground-state density was then used to calculate vHxc with an inverse Kohn–Sham method based on the Haydock–Foulkes variational principle.479,480 When applied in the Kohn–Sham self-consistent cycle, this potential reproduced the exact densities and total energies, provided that a suitable training set was chosen (see Fig. 20). The system studied by the authors admits as solutions either one bound and one unbound state or two bound states, depending on the Gaussian potential. Choosing training points surrounding the boundary between these two regimes leads to the most accurate results, with errors around 10^-3 a.u. everywhere except at the boundary (where they can almost reach 1 a.u.). On the other hand, choosing points in only one of the regions results in a poor description of the other region.

Fig. 20
figure 20

Transferability of the neural network vHxc. The bold frames indicate the training set and the lines show the boundary between solutions with (green) and without (pink) the Coulomb interaction. The errors are plotted as color maps. (Reprinted from ref. 478 with the permission of AIP Publishing.)

Discussion and conclusions

Interpretability

We already noted in the introduction that a major criticism of machine learning techniques is that their black-box algorithms do not provide us with new “physical laws” and that their inner workings remain outside our understanding.481 For example, Ghiringhelli et al. argue that “a trustful prediction of new promising materials, identification of anomalies, and scientific advancement are doubtful,” if the scientific connection between features and prediction is unknown.96 Johnson writes in the context of quantitative structure–activity relationships: “By not following through with careful, designed, hypothesis testing we have allowed scientific thinking to be co-opted by statistics and arbitrarily defined fitness functions.”378 The main concern is that models not based on physical principles might fail in completely unexpected cases (that are trivial for humans) while providing a very good result on average. Such cases can only be predicted and prevented if one understands the causality between the inputs and outputs of the model. Furthermore, especially in applications where a single failure is extremely expensive or potentially deadly (as in medicine), the lack of trust in black-box machine learning models stops their widespread use even when they provide a superior performance.273

As there are different concepts of interpretability, we will define its various facets according to Lipton et al.482 To start with, we can divide interpretability into transparency and post hoc explanations, which consist of additional information provided by or extracted from a model.

Transparency can once again be split into the concepts of simulatability, decomposability, and algorithmic transparency. Simulatability is a partially subjective notion and concerns the ability of humans to follow and retrace the calculations of the model. This is, e.g., the case for sparse linear models, such as the ones resulting from LASSO159 or SISSO,163 or for shallow decision tree models. Decomposability is closely related to the intelligibility of a model and describes whether its various parts (input, parameters, calculations) allow for an intuitive interpretation. Algorithmic transparency considers our understanding of the error surface (e.g., whether the training will converge to a unique solution). This is clearly not the case for modern neural networks, for example.

Post hoc interpretability considers the possibility to extract additional information from the model. Examples of this are variable importances from a decision tree model or attentive response maps, which highlight the regions of a picture that were particularly important for its classification by a convolutional neural network.

Starting from these concepts of interpretability, it is obvious that the notion of a complex model runs counter to the claim that it is simulatable by a human. Furthermore, models that are simulatable (e.g., low-dimensional linear models) and accurate often require unintuitive, highly processed features that reduce the decomposability483 (e.g., spectral neighbor analysis potentials) in order to reach a performance comparable to a more complex model. In contrast, a complex model like a deep convolutional neural network only requires relatively simple un-engineered features and relies on its own ability to extract descriptors of different abstraction levels. In this sense, there is a definite conflict between the complexity and accuracy of a model, on the one hand, and its simulatability and decomposability, on the other.

The simplest examples of simulatable models are techniques based on dimensionality reduction or feature selection algorithms, like SISSO.163 These are usually used in combination with linear fits and result in simple equations describing the problem. An example is the estimation of the probability that a material exists as a perovskite (ABX3), as given in ref. 143:

$$\tau = \frac{{r_{\mathrm{X}}}}{{r_{\mathrm{B}}}} - n_{\mathrm{A}}\left( {n_{\mathrm{A}} - \frac{{r_{\mathrm{A}}/r_{\mathrm{B}}}}{{{\mathrm{ln}}(r_{\mathrm{A}}/r_{\mathrm{B}})}}} \right),$$
(42)

where nA is the oxidation state of A and ri is the ionic radius of ion i. Another example is given by Kim et al.,38 who used LASSO, as well as RFs and KRR, to predict the dielectric breakdown field of elemental and binary insulators on the basis of eight features obtained from first-principles calculations (e.g., band gap, phonon cutoff frequency, etc.). In the end, all three methods determined the same two features as optimal and demonstrated nearly the same error. However, Kim et al. favored LASSO,159 because it provided a simple analytical formula, even if no further knowledge was gained from that formula. In any case, knowledge of the analytical formula, and therefore simulatability, seems to be far less important than knowing the most relevant physical variables. In general, we can even argue that simulatability is not relevant for materials science, as computational methods based on physical reasoning, like DFT or tight binding, are even further removed from simulatability than most machine learning models.
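Equation (42) is simple enough to be evaluated directly; the snippet below transcribes it, with illustrative ionic radii (in Å) and an oxidation state that are not taken from ref. 143 (which, to our recollection, associates values of τ below roughly 4.18 with perovskite formation).

```python
# Direct transcription of Eq. (42): the tolerance factor tau for perovskite
# formability. Radii (Angstrom) and oxidation state below are illustrative only.
import math

def tau(r_a, r_b, r_x, n_a):
    """tau from Eq. (42); smaller values favor the perovskite structure."""
    return r_x / r_b - n_a * (n_a - (r_a / r_b) / math.log(r_a / r_b))

print(tau(r_a=1.6, r_b=0.6, r_x=1.4, n_a=1))   # placeholder ionic radii
```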

A second class of methods that provide a variable importance measure (see section “Basic principles of machine learning—Features”) are RFs and other decision tree-based methods. Stanev et al. demonstrated the usefulness of such measures for post hoc interpretability in ref. 76, by recovering numerous known (e.g., the isotope effect) and some previously unknown rules and limits for the superconducting critical temperature. This was done by first reducing the number of features via a variable importance measure (Gini importance) and subsequently visualizing the correlation between the features and the critical temperature (see Fig. 21).

Fig. 21
figure 21

Superconducting critical temperature TC plotted versus the various features; panel a demonstrates the isotope effect and panels b–d show how the critical temperature is limited and influenced by various physical quantities of the materials. (Reprinted with permission from ref. 76 licensed under the CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).)
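Impurity-based (Gini) importances of the kind used in that analysis can be read directly off a fitted random forest; the sketch below uses placeholder data and feature names.

```python
# Sketch of extracting impurity-based (Gini) feature importances from a random
# forest; the data and feature names are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["avg_atomic_mass", "avg_valence", "avg_electronegativity"]
X = rng.random((500, len(feature_names)))
y = 5.0 * X[:, 2] + rng.normal(scale=0.1, size=500)   # placeholder target ("T_c")

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda t: -t[1]):
    print(f"{name}: {importance:.2f}")
```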

Pankajakshan et al.168 developed bootstrapped-projected gradient descent as a feature selection method specifically for materials science. The motivation came from consistency issues for correlated or linearly dependent variables (present, for example, in LASSO), which bootstrapped-projected gradient descent can alleviate through extra clustering and bootstrapping. In their work, Pankajakshan et al. used machine learning mostly to find and understand descriptors, in order to improve the d-band model of catalysts for CO2 reduction,484 instead of actually using the machine learning model for predictions. This can definitely be a reasonable approach in cases where datasets are too small and incomplete for any successful extrapolation. Notwithstanding, in most cases it is questionable whether a classical (in the sense of “non-machine learning”) model should be used directly when a machine learning model is superior, as in the case of the d-band model.485 Of course, it is a bonus when a classical model exists, as it can be used to check for consistency issues or as a crude estimate of a property. However, in our opinion, pragmatic applications of advanced materials design should always use the best available model.

While RFs and linear fits are considered more accessible from an interpretability point of view, deep neural networks are one of the prime examples of algorithms that are traditionally considered a black box. While their complex nature often results in superior performance in comparison to simpler algorithms, an unwanted consequence is the lack of simulatability and algorithmic transparency. As the lack of interpretability is one of the main challenges for a wider adoption of neural networks in industry and the experimental sciences, post hoc methods to visualize the response and understand the inner workings of neural networks have been developed during the past years. One example are attentive response maps for image recognition networks, which highlight regions of the picture according to their importance in the decision-making process. Kumar et al.271 demonstrated that, by combining the understanding gained from attentive response maps with domain knowledge and applying it to the design process of the neural network, one can not only achieve a better-informed decision-making process but also a higher performance. An improvement of the performance through the integration of domain knowledge is not completely surprising, but the result is nevertheless remarkable, as usually higher interpretability comes at the cost of lower performance.

Ziletti et al.268 introduced attentive response maps, as implemented in ref. 271, to materials science in order to visualize the ability of their convolutional neural networks to recognize crystal structures from diffraction patterns. The response maps of the different convolutional layers demonstrate that the neural networks recover the positions of the diffraction peaks and their orientation as features (see Fig. 22).

Fig. 22
figure 22

a Attentive response maps for the four most activated filters of the first, third, and last convolutional layers for simple cubic lattices. The brightness of the pixel represents the importance of the location for classification. b Sum of the last convolutional layer filters for all seven crystal classes showing that the network learned crystal templates automatically from the data. (Reprinted with permission from ref. 268 licensed under the CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).)

A second example, which demonstrates the ability of neural networks to convey additional post hoc information, is described in ref. 40. Xie et al. used a crystal graph convolutional neural network to learn the distance to the convex hull of perovskites ABX3. By using the output of the pooling layers instead of the fully connected layers as a predictor, the energy can be split into contributions from the different crystal sites (see Fig. 23). This allowed Xie et al. not only to confirm the importance of the radii of the A- and B-atoms but also to gain new insights that were then used for an efficient combinatorial search of perovskites. In ref. 486, Xie et al. followed up with the interpretation of the features extracted from the convolutional neural networks and demonstrated how similarity patterns emerge for different material groups and at different scales.

Fig. 23
figure 23

Contributions to the distance to the convex hull per element, A site (c) and B site (d). (Reprinted with permission from ref. 40. Copyright 2018 American Physical Society.)

Zhang et al.321 also highlighted the ability of convolutional neural networks to extract physically meaningful features out of un-engineered descriptors. They built a convolutional neural network (two convolutional layers, one fully connected layer) to calculate the topological winding number of 1D band insulators with chiral symmetry based on their Hamiltonian as input data

$$\left[ \begin{array}{ccc} h_x(0) & h_x\!\left(\frac{2\pi}{L}\right) & \cdots \\ h_y(0) & h_y\!\left(\frac{2\pi}{L}\right) & \cdots \end{array} \right]^{\mathrm{T}} = \left[ \begin{array}{ccc} \cos(\Phi) & \cos(\Phi + \Delta\Phi) & \cdots \\ \sin(\Phi) & \sin(\Phi + \Delta\Phi) & \cdots \end{array} \right]^{\mathrm{T}}.$$
(43)

From the theoretical equation for the winding number,487 one can derive that the second convolutional layer should produce an output linearly depending on ΔΦ with the exception of a jump at ΔΦ = π. We can see in Fig. 24 that this is exactly the case, and consequently, the convolutional neural network actually learned the discrete formula for the winding number. Sun et al.323 studied similar models of higher complexity with deep convolutional neural networks and were also able to demonstrate that their networks learned the known mathematical formulas for the winding and the Chern numbers.488

Fig. 24
figure 24

Output of the second layer as a function of δΦ and Φ. (Reprinted with permission from ref. 321. Copyright 2018 American Physical Society.)

Naturally, neural networks will never reach the algorithmic transparency of linear models. However, representative datasets, a good knowledge of the training process, and a comprehensive validation of the model can usually overcome this obstacle. Furthermore, if we consider the possibilities for post hoc explanations or the decomposability of neural networks, they are actually far more interpretable than their reputation might suggest.

To conclude this section, we would like to summarize a few points: (i) interpretability is not a single algorithmic property but a multifaceted concept (simulatability, decomposability, algorithmic transparency, post hoc knowledge extraction); (ii) the various facets have different priorities depending on the dataset and the research goal; (iii) simulatability is usually non-existent in materials science (e.g., in DFT or Monte Carlo simulations) regardless of whether one uses a machine learning or a classical algorithm, and it should therefore probably not be a point of concern in materials informatics; (iv) part of the progress of materials informatics has to include the increasing use of post hoc knowledge-extraction techniques, like attentive response maps, to improve the viability of, and the trust in, high-performing black-box models. Often this knowledge alleviates the fear that the model is operating on unphysical principles.268,321,323

Conclusions

Just like the industrial revolution, which consisted of the creation of machines that could perform mechanical tasks more efficiently than humans, machine learning is leading to machines that are progressively trained to identify patterns and to find relations between properties and features more efficiently than we can. In materials science, machine learning is mostly applied to classification and regression problems. In this context, we discussed a wide variety of quantitative structure–property relationships, which encompass a large number of properties essential for modern technology. It seems likely that further properties, should they be needed, can also be predicted with a similar level of accuracy.

If we consider the direction of future research, there will be a clear division between methodologies depending on the availability of data. For continuous properties that can realistically be calculated for ≥10^5 materials, we expect that universal models, and especially deep neural networks like Xie et al.'s crystal graph convolutional networks40 or Chen et al.'s MatErials Graph Networks,132 will be the future. They are able to predict a diverse set of properties, such as formation energies, band gaps, Fermi energies, bulk moduli, shear moduli, and Poisson ratios, for a wide material space (87 elements, 7 lattice systems, and 216 space groups in the case of ref. 40). At the same time, they reach an accuracy with respect to DFT calculations that is comparable to (or even smaller than) the DFT errors with respect to experiment. Such models have the potential to end the need for models trained for only a single structural prototype and/or property, which can in turn drastically reduce the amount of resources spent by individual researchers. Comparing with the state of the art of neural network architectures and training methods in fields like image recognition and natural language processing, we can also expect that the success of neural network models will only increase once modern topologies, training methods, and fast implementations reach a wider audience in materials science. To reach this goal, a closer interdisciplinary collaboration with computer scientists will be essential.

In other cases, characterized by a lack of data, several strategies are promising. First of all, one can take into consideration surrogate-based optimization (active learning), which allows researchers to optimize the results achieved with a limited experimental or computational budget. Surrogate-based optimization allows us to somewhat overlook the limited accuracy of the machine learning models while nevertheless arriving at satisfactory design results. As the use of such optimal design algorithms is still confined to relatively few studies with small datasets, much future work can be foreseen in this direction. A second strategy to overcome the limited data available in materials science is transfer learning. While it has already been applied with success in chemistry,489 wider applications in solid-state materials informatics are still missing. A last strategy to handle the small datasets that are so common in materials science was discussed by Zhang et al. in ref. 77. Crude estimation of properties basically allows us to shift the problem of predicting a property to the problem of predicting the error of the crude model with respect to the higher-fidelity training data. Up to now, this strategy was mostly used for the prediction of band gaps, as datasets of different fidelity are openly available (DFT, GW, or experimental). Moreover, the use of crude estimators allows researchers to benefit from decades of work and expertise that went into classical (non-machine learning) models. If the lower-fidelity data are not available for all materials, it is also possible to use a co-kriging approach that still profits from the crude estimators but does not require them for every prediction.292

Component prediction is a highly effective way to speed up the materials discovery process, and we expect high-throughput searches of all common crystal structure prototypes that have not yet been studied to appear in the coming years. While the prediction of the energy can also be considered a quantitative structure–property relationship, metastable materials and the incomplete knowledge of the theoretical convex hull have to be taken into account. Several studies demonstrated that better accuracy can be achieved with experimental training data. However, as experimental data are seldom available and expensive to generate, the number of prototypes for which studies analogous to ref. 143 are an option will quickly be exhausted. A second challenge is the lack of published data on failed experiments. In this case, a cultural shift toward the publication of all valid data, be it positive or negative, is required.

The direct prediction or generation of a crystal structure is still an extremely challenging problem. While several studies demonstrate how to differentiate between a small number of prototypes for a certain composition, the difficulty quickly rises with an increasing number of possible crystal structures. This is amplified by the fact that the majority of the available data belongs to only a small number of extensively researched prototypes. Recently, more complex modern neural network architectures (e.g., VAEs and GANs) were introduced to the problem, with some interesting results. Moreover, the use of machine learning-based optimization algorithms, like Bayesian optimization for global structure prediction, is also a direction that should be further explored.

Machine learning was successfully integrated with other numerical techniques, such as molecular dynamics and global structural prediction. Force fields built with neural networks enjoy an efficiency that parallels that of classical force fields and an accuracy comparable to the reference method (usually DFT in the solid state, although in chemistry some force fields have already achieved coupled-cluster accuracy489). Consequently, we expect them to completely replace classical force fields in the long term. Owing to their vastly superior numerical scaling, machine learning methods allow us to tackle challenging problems that go far beyond the limitations of current electronic structure methods and to investigate novel, emerging phenomena that stem from the complexity of the systems.

The majority of early machine learning applications to solid-state materials science employed straightforward and simple-to-use algorithms, like linear kernel models and decision trees. Now that these proofs of concept exist for a variety of applications, we expect that research will follow two different directions. The first will be the continuation of the present research, the development of more sophisticated machine learning methods, and their application in materials science. Here, one of the major problems is the lack of benchmarking datasets and standards. In chemistry, a number of such datasets already exist, such as the QM7 dataset,490,491 the QM8 dataset,491,492 the QM7b dataset,493,494 etc. These are absolutely essential to measure progress in features and algorithms. While we discussed countless machine learning studies in this review, definitive quantitative comparisons between the different works were mostly impossible, impeding the evaluation of progress and thereby progress itself. It has to be noted that there has been one recent competition for the prediction of formation energies and band gaps.495 In our opinion, this is a very important step in the right direction. Unfortunately, the dataset used in this competition was extremely small and specific, putting the generalizability of the results to larger and more diverse datasets into doubt.

The second direction regards the usability of machine learning models. In the electronic structure community, both the models (e.g., new approximations to the exchange-correlation functional of DFT) and the computer codes are developed by a relatively small group of experts and put at the disposal of the much larger community of materials scientists. Even though this is slowly starting to change, models from most publications are not publicly available. This results in most researchers spending resources on building their own models to solve very specific problems. We note that frameworks to disseminate models are now starting to emerge.496

In conclusion, we reviewed the latest applications of machine learning in the field of materials science. These applications have been mushrooming in the past couple of years, fueled by the unparalleled success that machine learning algorithms have found in several different fields of science and technology. It is our firm conviction that this collection of efficient statistical tools is indeed capable of considerably speeding up both fundamental and applied research. As such, machine learning methods are clearly more than a temporary fashion and will certainly shape materials science for the years to come.