\n ## Introduction

\n

"
},
"selectedType": "BeakerDisplay",
- "elapsedTime": 1,
+ "elapsedTime": 0,
"height": 209
},
"evaluatorReader": true,
@@ -291,7 +291,7 @@
"object": "\nIn this notebook we show an example of using data in the NOMAD Archive to train a machine learning classifier that can predict the existence of an electronic band gap from crystal structure only. We achieve roughly 80% prediction accuracy with the best classifier and 9894 samples. As input for the learning we use very compact structural descriptors that are invariant under rotation and translation and include only information about the atomic species and their positions.\n

\n\n ## Data

\n

\n\n"
},
"selectedType": "BeakerDisplay",
- "elapsedTime": 1,
+ "elapsedTime": 0,
"height": 702
},
"evaluatorReader": true,
@@ -361,13 +361,13 @@
"psubtype": "OutputContainer",
"items": [
"We use data from the NOMAD Archive to train the classifiers. We have selected calculations with the following criteria:

\n
- VASP calculations originating from the AFLOWLIB^{1} project \n
- Periodic crystal structures \n
- PBE exchange-correlation functional^{2} \n
- Projector-augmented wavefunction (PAW) potentials^{3,4} \n
- The data should conform to the AFLOWLIB Standard for High-Throughput Computing^{5}, which ensures reproducibility of the data and provides reasoning for any parameters set in the calculation, such as accuracy thresholds, calculation pathways, and mesh dimensions. \n
- The calculation must have density of states (\"dos_energies_normalized\" and \"dos_values\") available, because we detect and calculate the band gap from this information. The DOS energies in the Archive have been normalized so that 0 is at the top of the valence band. \n
- No more than 8 atoms in the simulation cell \n
- Elements that occur in fewer than 0.5% of the samples in the whole dataset are ignored. The occurrence of an atomic element is the percentage of samples with at least one atom of that species. \n
- To keep the same structure from entering the dataset twice, we allow only one sample for each chemical formula. \n
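
The cell-size, occurrence, and uniqueness criteria above can be sketched in plain Python. This is a minimal sketch, not the notebook's actual implementation: the function name `filter_samples`, the dictionary keys `formula` and `species`, and the order of the three steps are assumptions made for illustration. It also assumes that ignoring a rare element means dropping every sample that contains it, which the text does not state explicitly.

```python
from collections import Counter


def filter_samples(samples, max_atoms=8, min_occurrence=0.005):
    # Each sample is assumed to be a dict with hypothetical keys:
    #   'formula': chemical formula string
    #   'species': list of element symbols, one entry per atom in the cell

    # 1. Keep only cells with at most `max_atoms` atoms.
    small = [s for s in samples if len(s['species']) <= max_atoms]

    # 2. Occurrence of an element = fraction of samples that contain
    #    at least one atom of that element.
    counts = Counter(el for s in small for el in set(s['species']))
    n_samples = len(small)
    rare = {el for el, c in counts.items() if c / n_samples < min_occurrence}

    # Assumption: a sample containing any rare element is dropped.
    kept = [s for s in small if not rare & set(s['species'])]

    # 3. Allow only one sample per chemical formula (first one wins).
    seen = set()
    unique = []
    for s in kept:
        if s['formula'] not in seen:
            seen.add(s['formula'])
            unique.append(s)
    return unique
```

Keeping the first sample seen for each formula is an arbitrary choice in this sketch; the text does not say which duplicate is retained.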

The Archive data alone is not sufficient to verify some properties of the calculations. The following issues and restrictions have been identified and should be considered when analyzing the results:

\n
- Based on the Archive information we cannot determine whether a structure has been relaxed, so the dataset may also include unrelaxed samples. \n
- The Archive output does not include convergence information (\"single_configuration_calculation_converged\"), so we cannot determine whether a calculation has converged with respect to its convergence criteria (\"settings_scf\"). \n

From the calculations that match these criteria we keep all calculations with a band gap, which amounts to 3298 samples. The dataset contains many more calculations without a band gap, from which we randomly choose 6596 samples. The final dataset therefore consists of 9894 samples. The imbalance between the classes does not seem to significantly affect training as long as the samples are weighted during training.
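
The sampling and weighting described above can be sketched as follows. The text does not say which weighting scheme is used, so the inverse-frequency weights below (each class c weighted by N / (K * n_c), for N samples, K classes, and n_c samples in class c) are an assumption, as are the helper names `undersample` and `balanced_class_weights`.

```python
import random
from collections import Counter


def undersample(samples, n_keep, seed=0):
    # Randomly pick `n_keep` samples without replacement.
    # A fixed seed keeps the selection reproducible.
    rng = random.Random(seed)
    return rng.sample(samples, n_keep)


def balanced_class_weights(labels):
    # Inverse-frequency weights: N / (K * n_c) for each class c.
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


# Class sizes taken from the text: 3298 with a gap, 6596 without.
labels = [1] * 3298 + [0] * 6596
weights = balanced_class_weights(labels)
```

Under this assumed scheme the gapped class gets a weight of 1.5 and the gapless class a weight of 0.75, so both classes contribute equally to a weighted loss.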

\nThe band gap distribution is highly skewed towards materials with a low band gap (semiconductors), which poses a challenge for training the classifiers because many of the samples lie near the decision boundary.

\n \n Plot histogram of non-zero band gaps\n Plot element occurrence\n