<sup>4</sup> EPSRC Centre for Doctoral Training on Theory and Simulation of Materials, Department of Physics, Imperial College London, London, U.K. <br>
<sup>5</sup> Thomas Young Centre for Theory and Simulation of Materials, Department of Materials, Imperial College London, London, U.K. <br>
<span class="nomad--last-updated" data-version="v1.0.0">[Last updated: March 21, 2019]</span>
### A paradigm shift in solving materials science grand challenges by crowd-sourcing solutions through an open and global big-data competition
Innovative materials design is needed to tackle some of the most important health, environmental, energy, societal, and economic challenges. Improving the properties of materials that are intrinsically connected to the generation and utilization of energy is crucial if we are to mitigate environmental damage due to growing global energy demand. Transparent conductors are an important class of compounds that are both electrically conductive and have low absorption in the visible range, two properties that are typically in competition. A combination of both of these characteristics is key for the operation of a variety of technological devices such as photovoltaic cells, light-emitting diodes for flat-panel displays, transistors, sensors, touch screens, and lasers. However, only a small number of compounds are currently known to display both transparency and conductivity to a high enough degree to be used as transparent conducting materials.
To address the need for finding new materials with an ideal target functionality, the Novel Materials Discovery (NOMAD) Centre of Excellence organized a crowd-sourced data-analytics competition with Kaggle, one of the best-known online platforms for hosting big-data competitions. Kaggle has a community of over half a million users from around the world with backgrounds in computer science, statistics, biology, and medicine. The competition ran from December 18, 2017 to February 15, 2018 and involved nearly 900 participants. The goal of this competition was to develop or apply data-analytics models for the prediction of two target properties: the formation energy (an indicator of the stability of a material) and the bandgap energy (an indicator of the potential for transparency over the visible range), in order to facilitate the discovery of new transparent conductors and enable advances in (opto)electronic technologies. A total of 5,000 euros in prizes was awarded to the three participants with the best-performing models (i.e., the lowest average root mean square log error (RMSLE) of the formation and bandgap energies). The RMSLE is defined as:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\Big(\ln\left(\hat{y}_i + 1\right) - \ln\left(y_i + 1\right)\Big)^{2}},$$

where $N$ is the total number of observations, $\hat{y}_i$ is the predicted value, and $y_i$ is the reference value for either the formation or the bandgap energy.
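For concreteness, this metric can be computed with a few lines of Python (a minimal sketch; the arrays below are hypothetical placeholders, not competition data):

``` python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps values near zero well behaved
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# The competition score is the average RMSLE over the two targets
# (formation energy and bandgap energy); the values here are made up.
score = 0.5 * (rmsle([0.15, 0.20], [0.14, 0.22]) + rmsle([2.1, 3.4], [2.0, 3.6]))
print(score)
```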
The dataset consists of 3,000 materials, 2,400 of which made up the training set; the remaining 600 were used as the test set (i.e., only structures and input features were provided, with the target properties kept secret). Of that test set, 100 materials were used to determine the public leaderboard score so that participants could assess their model performance on the fly (but the exact values used in this assessment were kept secret). The top three winners of the competition were determined by the private leaderboard score, which was based on the remaining 500 test-set materials.
Because only 100 values were used to assess performance on the public leaderboard, participants had to ensure the predictive accuracy of their model for unseen data, even when it disagreed with the public leaderboard score. This is evident in the summary of the average RMSLE for all participants with scores below 0.25 in Figure 1, where a large shift in the values between the public leaderboard (100 compounds) and the private leaderboard (500 compounds) can be seen. The winning solution has an RMSLE of 0.0509, while the 2nd and 3rd place winners were closely stacked together with RMSLEs of 0.0521 and 0.0523, respectively. However, within the first bin there were a total of four participants with an RMSLE below 0.053 (i.e., 0.45% of participants).
# More about the dataset: Group-III transparent conductors
Group-III oxides provide promising candidates for wide-band-gap transparent conductors. A wide range of experimental band gap energies, from 3.6 to 7.5 eV, has been reported for alloys of In$_2$O$_3$/Ga$_2$O$_3$ or Ga$_2$O$_3$/Al$_2$O$_3$, which suggests that alloying of group-III oxides is a viable strategy for designing new wide-band-gap semiconductors. However, Al$_2$O$_3$, Ga$_2$O$_3$, and In$_2$O$_3$ all display very different ground-state structures, and it is therefore unclear which structure will be stable at a given composition. Within the dataset, there are six different lattice symmetries: $C2/m$, $Pna2_1$, $R\bar{3}c$, $P6_3/mmc$, $Ia\bar{3}$, and $Fd\bar{3}m$.
The 1st place winning solution was obtained using metal-oxygen coordination numbers derived from the number of bonds that fall within the sum of the experimental Shannon ionic radii (enlarged by 30-50%, depending on the crystal-structure type). These ionic bonds are then used to build a crystal graph, where each atom is a node and the corresponding edges between nodes are defined by the ionic bonds; in Figure 1 this is shown as the coordination numbers of each atom for a sequence of six atoms.
<center> Figure 1: Depiction of a crystal graph representation of In$_3$Ga$_1$O$_6$ showing the connections between each atom (node) that are defined by the ionic bonds. </center>
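The general idea of counting ionic bonds with radius-based cutoffs can be sketched with ASE as follows (an illustration only: the cutoff values and the file name `geometry.xyz` are placeholders and do not correspond to the settings used in the winning solution):

``` python
import numpy as np
from ase.io import read
from ase.neighborlist import neighbor_list

# Illustrative metal-oxygen cutoffs (Angstrom), meant to mimic sums of Shannon
# ionic radii enlarged by a structure-dependent factor; the numbers are placeholders.
cutoffs = {("Al", "O"): 2.4, ("Ga", "O"): 2.6, ("In", "O"): 2.9}

atoms = read("geometry.xyz")  # any structure file readable by ASE

# i, j list the indices of atom pairs that fall within the pair-specific cutoffs
i, j = neighbor_list("ij", atoms, cutoffs)

# The coordination number of an atom is the number of graph edges (ionic bonds) it forms
coordination = np.bincount(i, minlength=len(atoms))
for atom, cn in zip(atoms, coordination):
    print(atom.symbol, cn)
```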
### 2. SOAP-based descriptor
In the 3rd place winning solution, the smooth overlap of atomic positions (SOAP) kernel developed by Bartók et al.$^{20,21}$ is employed, which incorporates information on the local atomic environment through a rotationally integrated overlap of neighbor densities. The SOAP kernel describes the local environment of a given atom ($i$) through a sum of Gaussians centered on each of its atomic neighbors ($j$) at distances $r_{ij}$ within a specified cutoff radius:

$$\rho_i(\mathbf{r}) = \sum_{j} \exp\!\left(-\frac{\left|\mathbf{r} - \mathbf{r}_{ij}\right|^{2}}{2\,\sigma_{atom}^{2}}\right) f_{cut}(r_{ij}),$$
where $\sigma_{atom}$ is a smoothing parameter and the switching function $f_{cut}$ goes smoothly to zero beyond a specified radial value. This local atomic neighbor density can be expanded in terms of spherical harmonics and orthogonal radial functions, and the expansion coefficients are then combined to form the rotationally invariant power spectrum corresponding to the neighbor density of each atom.
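To make the construction of this smooth neighbor density more concrete, a simplified, purely radial sketch in NumPy is given below (the actual SOAP descriptor works with the full three-dimensional density and its spherical-harmonics expansion; complete implementations are available, e.g., in the DScribe and QUIP packages):

``` python
import numpy as np

def cutoff_function(r, r_cut=5.0, width=0.5):
    """Switching function that goes smoothly from 1 to 0 as r approaches r_cut."""
    switch = 0.5 * (np.cos(np.pi * (r - (r_cut - width)) / width) + 1.0)
    return np.where(r < r_cut - width, 1.0, np.where(r < r_cut, switch, 0.0))

def neighbor_density(r_grid, neighbor_distances, sigma_atom=0.5):
    """Radial sketch of the neighbor density: one Gaussian of width sigma_atom per
    neighbor distance r_ij, damped by the cutoff function."""
    rho = np.zeros_like(r_grid)
    for r_ij in neighbor_distances:
        rho += np.exp(-(r_grid - r_ij) ** 2 / (2.0 * sigma_atom ** 2)) * cutoff_function(r_ij)
    return rho

# Hypothetical neighbor distances (Angstrom) around one metal atom
r = np.linspace(0.0, 5.0, 500)
rho = neighbor_density(r, neighbor_distances=[1.9, 2.0, 2.1, 3.5])
```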
Kernel ridge regression (KRR) is a generalization of ridge regression, where a linear function is learned in the space induced by the respective kernel and the data. The squared loss is minimized with a squared norm L2 regularization term.
The model learned by KRR has the same form as that of support vector regression (SVR). Both KRR and SVR use L2 regularization, but KRR uses a squared-error loss function, whereas SVR uses an $\epsilon$-insensitive loss. The model learned by KRR is non-sparse, and therefore prediction with KRR is typically slower than with SVR.
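A minimal sketch of fitting such a model with scikit-learn is shown below (the feature matrix `X` and targets `y` are random placeholders standing in for a chosen representation and target property):

``` python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# X: (n_samples, n_features) representation vectors, y: target property;
# random placeholders are used here instead of real descriptors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 30)), rng.normal(size=200)

# Squared loss with an L2 penalty, learned in the space induced by an RBF kernel;
# the regularization strength alpha and kernel width gamma are cross-validated.
krr = GridSearchCV(
    KernelRidge(kernel="rbf"),
    param_grid={"alpha": [1e-3, 1e-2, 1e-1, 1.0], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,
)
krr.fit(X, y)
y_pred = krr.predict(X)
```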
### 2. Neural network $^2$
In the 3rd place winning solution, a multi-layer perceptron (MLP) is employed. An MLP is composed of more than one perceptron: typically it has an input layer, several hidden layers, and an output layer. The input of the network is transformed using a learnt non-linear transformation. As a supervised learning algorithm, the MLP learns a function $f(\cdot):R^{I} \rightarrow R^{O}$, where $I$ is the number of input dimensions and $O$ is the number of output dimensions. Several parameters are adjusted during the training of an MLP, including the weights of the neurons and the biases. Gradient-based optimisation algorithms, such as stochastic gradient descent, are employed to minimize the loss function. There are different choices of loss function, for example the root mean squared error (RMSE) or the cross entropy. One can also switch loss functions during training to obtain better gradients. Figure 2 shows a one-hidden-layer MLP with scalar output as an example.
<center> Figure 2: One-hidden-layer MLP $^2$. </center>
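A minimal MLP regression sketch using scikit-learn is shown below (the architecture and data are illustrative placeholders and do not reproduce the winning model):

``` python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: representation vectors, y: target property; random placeholders here.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 30)), rng.normal(size=200)

# One hidden layer with 64 neurons; the weights and biases are fitted with the
# Adam stochastic-gradient optimizer by minimizing a squared-error loss.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64,), activation="relu",
                 solver="adam", max_iter=2000, random_state=0),
)
mlp.fit(X, y)
y_pred = mlp.predict(X)
```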
%% Cell type:markdown id: tags:
<a id='make_predictions'></a>
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
%%HTML
<br><br><br>
<fontsize="6.5em"><b>Make predictions on formation energies and bandgaps</b></font>
<br><hr><br>
<fontsize="5em"><b>Winning representations combined with different regression methods</b></font>
<br><br>
<fontsize = "3.5em"> To understand the relative importance of the representation vs. regression model, one can examine the performance of each representation combined with different regression models.
The hyperparameters are optimized for each representation/regressor combination. </font>
<br>
<fontsize = "3.5em"color="009FC2"><br>Warning: the learning algorithm employed in this study (e.g. grid-search) can not guarantee deterministic results. The actual predictions can divergent from the published data.
</font>
<br><br><br>
<form>
<fontsize="4em">Select a representation and a regression method:</font>
<fontsize="3em">Number of (linear) hidden layers </font><inputtype="number"id="input_n_nn"value="2"min="1"max="20">
<buttontype="button"id="set_n_neurons"style="background-color:#f2f2f2;border:#555555;border-radius: 4px;font-size: 16px; width:250px; height:30px;"onclick="set_nn_n_neurons()">Set the number of neurons</button>
<font size="3em">Warning: it can be very time-consuming (5-20 min/point, depending on the method, model size, and number of threads employed) to compute a learning curve.
</font>
<p><fontsize="3em">Number of points in learning curve: <inputtype="number"id="N_learning_curve"value="4"min="1"max="25"style="display:none"></font></p>