Commit 46eb0d86 authored by Thomas Purcell

Update docs as per Luca's suggestions

parent 2e132412
......@@ -12,19 +12,12 @@ SISSO++
This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface.
Future work will expand the Python interface to include more postprocessing analysis tools.
Indices
=======
* :ref:`genindex`
* :ref:`search`
Table of Contents
^^^^^^^^^^^^^^^^^
.. toctree::
:maxdepth: 2
self
quick_start/QuickStart
tutorial/tutorial
cpp_api/cpp_api
......
.. _quick_start:
Quick Start Guide
Quick-Start Guide
=================
.. toctree::
:maxdepth: 2
......
......@@ -31,7 +31,7 @@ A list containing the set of all operators that will be used during the feature
#### `param_opset`
A list containing the set of all operators, for which the non-linear scale and bias terms will be optimized, that will be used during the feature creation step of SISSO. (If empty none of the available features)
A list containing the set of all operators, for which the non-linear scale and bias terms will be optimized, that will be used during the feature creation step of SISSO. (If empty none of the available features are used)
#### `calc_type`
......@@ -39,15 +39,15 @@ The type of calculation to run either regression, log regression, or classificat
#### `desc_dim`
The maximum dimension of the model to be created
The maximum dimension of the model to be created (no default value)
#### `n_sis_select`
The number of features that SIS selects over each iteration
The number of features that SIS selects over each iteration (no default value)
#### `max_rung`
The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
The maximum rung of the feature (height of the tallest possible binary expression tree - 1) (no default value)
#### `n_residual`
......
......@@ -2,17 +2,17 @@
This tutorial is based on the [Predicting energy differences between crystal structures: (Meta-)stability of octet-binary compounds](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb) tutorial created by Mohammad-Yasin Arif, Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler.
The goal of the tutorial is to teach a user how to use `SISSO++` to find and analyze quantitative models for materials properties.
In particular we will use SISSO to predict the crystal structure (rock salt or zincblende) of a series of octet binaries.
In particular we will use SISSO to predict the crystal structure (rock-salt or zinc-blende) of a series of octet binaries.
The tutorial is split into three parts: 1) how to use the executable to perform the calculations and the Python utilities to analyze the results, 2) how to use only Python to run, analyze, and demonstrate results, and 3) how to perform classification problems using SISSO.
## Outline
The following tutorials are available:
- [Combined Binary and Python](1_combined.md)
- [Python only](2_python.md)
- [Using the Command Line Interface](1_command_line.md)
- [Using the Python Interface](2_python.md)
- [Classification](3_classification.md)
All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock salt or zincblende phase.
All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock-salt or zinc-blende phase.
For all applications of SISSO, the data set has to be passed via a standard `csv` file, where the first row contains the feature and property labels and the first column contains the index label for each sample, for example:
```
Material, energy_diff (eV), rs_A (AA), rs_B (AA), E_HOMO_A (eV), E_HOMO_B (eV),....
......
......@@ -64,7 +64,7 @@ The standard output provides information about what step the calculation just fi
When all calculations are complete the code prints out a summary of the best 1D, 2D, ..., {desc_dim}D models with their training RMSE/Testing RMSE (Only training if there is no test set provided).
Two additional output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`.
These files represent a human readable (`SIS_summary.txt`) and computer readable (`selected_features.txt`) summary of the selected feature space from SIS.
Below are reconstructions of both files for this calculation
Below are reconstructions of both files for this calculation (To see the file click the triangle)
<details>
<summary>feature_space/SIS_summary.txt</summary>
......@@ -284,10 +284,12 @@ An example of these files is provided here:
```
</details>
## Determining the Ideal Model Complexity with Cross-Validation
While the training error always decreases with descriptor dimensionality for a given application, over-fitting can reduce the general applicability of the models outside of the training set.
In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation.
As an example we will discuss how to perform leave-out 10% using the command line
As an example we will discuss how to perform leave-out 10% using the command line.
To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use that as a test set by changing `"leave_out_frac": 0.0,` to `"leave_out_frac": 0.10,`.
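The same change can also be scripted. The following is a minimal sketch, assuming `sisso.json` is a plain JSON object with a top-level `leave_out_frac` key as described above; the helper name `set_leave_out_frac` is hypothetical and not part of `SISSO++`.

```python
import json
from pathlib import Path


def set_leave_out_frac(path, frac=0.10):
    """Rewrite the JSON input file at `path` with a new leave_out_frac.

    Hypothetical helper: all other keys in the file are preserved.
    """
    if not 0.0 <= frac < 1.0:
        raise ValueError("frac must be in [0, 1)")
    path = Path(path)
    settings = json.loads(path.read_text())
    settings["leave_out_frac"] = frac
    path.write_text(json.dumps(settings, indent=4) + "\n")
    return settings
```

Calling `set_leave_out_frac("sisso.json", 0.10)` before launching the calculation has the same effect as editing the file by hand.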
<details>
......@@ -415,7 +417,7 @@ As can be seen from the standard error measurements the results are now reasonab
<details>
<summary> Converged cross-validation results </summary>
![image](combined/cv/cv_100_error.png)
![image](command_line/cv/cv_100_error.png)
</details>
Because the validation errors for the three- and four-dimensional models are within each other's error bars and the standard error increases when going to the fourth dimension, we conclude that the three-dimensional model has the ideal complexity.
......@@ -431,11 +433,13 @@ To see the distributions for this system we run
<details>
<summary> Distribution of Errors </summary>
![image](./combined/error_cv.png)
![image](./command_line/error_cv.png)
</details>
One thing that stands out in the plot is the large error seen in a single point for both the one and two dimensional models.
By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zincblende structure in the data set.
By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zinc-blende structure in the data set.
As a note, for this setup there is a 0.22\% chance that one of the samples is never in the validation set, so if `max_error_ind != 21`, check whether that sample is in one of the validation sets.
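The 0.22% figure can be reproduced with a short estimate. The sketch below assumes 82 samples in the data set, 100 independent cross-validation splits, and that each split leaves out exactly 10% of the samples at random. None of these numbers are stated in the note itself, so treat them as assumptions; the function name is hypothetical.

```python
def expected_never_validated(n_samples=82, n_splits=100, leave_out_frac=0.10):
    """Expected number of samples that never land in a validation set.

    Each split is assumed to draw its validation set independently, so a
    given sample is skipped by every split with probability
    (1 - leave_out_frac) ** n_splits. For small values this expectation is
    also an upper bound (union bound) on the probability that at least one
    sample is never validated.
    """
    p_skip_all = (1.0 - leave_out_frac) ** n_splits
    return n_samples * p_skip_all


print(f"{100 * expected_never_validated():.2f}%")
```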
```python
>>> import numpy as np
>>> import pandas as pd
......@@ -585,12 +589,66 @@ To get the final models we will perform the same calculation we started off the
From here we can use `models/train_dim_3_model_0.dat` for all of the analysis.
In order to generate a machine learning plot for this model in matplotlib, run the following in python
```python
>>> from sissopp.postprocess.plot.parity_plot import plot_model_ml_plot_from_file
>>> plot_model_ml_plot_from_file("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show()
>>> from sissopp.postprocess.plot.parity_plot import plot_model_parity_plot
>>> plot_model_parity_plot("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show()
```
The result of which is shown below:
<details>
<summary> Final 3D model </summary>
![image](./combined/3d_model.png)
![image](./command_line/3d_model.png)
</details>
Additionally, you can output the model as a Matlab function or a LaTeX string using the following commands.
```python
>>> from sissopp.postprocess.load_models import load_model
>>> model = load_model("models/train_dim_3_model_0.dat")
>>> print(model.latex_str)
>>> model.write_matlab_fxn("matlab_fxn/model.m")
```
A copy of the generated Matlab function is below.
<details>
<summary> Final 3D model </summary>
```matlab
function P = model(X)
% Returns the value of E_{RS} - E_{ZB} = c0 + a0 * ((r_d_B / r_d_A) * (r_p_B * E_HOMO_A)) + a1 * ((IP_A^3) * (|r_sigma - r_s_B|)) + a2 * ((IP_A / r_p_A) / (r_p_B + r_p_A))
%
% X = [
% r_d_B,
% r_d_A,
% r_p_B,
% E_HOMO_A,
% IP_A,
% r_sigma,
% r_s_B,
% r_p_A,
% ]
if(size(X, 2) ~= 8)
error("ERROR: X must have a size of 8 in the second dimension.")
end
r_d_B = reshape(X(:, 1), 1, []);
r_d_A = reshape(X(:, 2), 1, []);
r_p_B = reshape(X(:, 3), 1, []);
E_HOMO_A = reshape(X(:, 4), 1, []);
IP_A = reshape(X(:, 5), 1, []);
r_sigma = reshape(X(:, 6), 1, []);
r_s_B = reshape(X(:, 7), 1, []);
r_p_A = reshape(X(:, 8), 1, []);
f0 = ((r_d_B ./ r_d_A) .* (r_p_B .* E_HOMO_A));
f1 = ((IP_A).^3 .* abs(r_sigma - r_s_B));
f2 = ((IP_A ./ r_p_A) ./ (r_p_B + r_p_A));
c0 = -1.3509197357e-01;
a0 = 2.8311062079e-02;
a1 = 3.7282871777e-04;
a2 = -2.3703222974e-01;
P = reshape(c0 + a0 * f0 + a1 * f1 + a2 * f2, [], 1);
end
```
</details>
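For readers without Matlab, the generated function above translates directly to Python. This is a hand-written port for illustration, not output of `SISSO++`; the coefficients and feature expressions are copied from the Matlab listing, and the function name `model` and the row-wise input layout are kept to mirror it.

```python
def model(X):
    """Evaluate E_RS - E_ZB for each row of X.

    Each row holds, in order:
    r_d_B, r_d_A, r_p_B, E_HOMO_A, IP_A, r_sigma, r_s_B, r_p_A
    """
    # Coefficients copied from the generated Matlab function.
    c0 = -1.3509197357e-01
    a0 = 2.8311062079e-02
    a1 = 3.7282871777e-04
    a2 = -2.3703222974e-01

    predictions = []
    for row in X:
        if len(row) != 8:
            raise ValueError("each row of X must contain 8 values")
        r_d_B, r_d_A, r_p_B, E_HOMO_A, IP_A, r_sigma, r_s_B, r_p_A = row
        # The three selected features of the 3D model.
        f0 = (r_d_B / r_d_A) * (r_p_B * E_HOMO_A)
        f1 = IP_A**3 * abs(r_sigma - r_s_B)
        f2 = (IP_A / r_p_A) / (r_p_B + r_p_A)
        predictions.append(c0 + a0 * f0 + a1 * f1 + a2 * f2)
    return predictions
```

The input is a list of rows rather than a matrix, so no array library is needed; the column order must match the comment block in the Matlab function.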
Performing Classification with SISSO++
---
Finally `SISSO++` can be used to solve classification problems as well as regression problems.
As an example of this we will adapt the previous example by replacing the property with the identifier of if the material favors the rock-salt or zincblende structure, and change the calculation type to be `classification`.
Finally, besides regression problems, `SISSO++` can be used to solve classification problems.
As an example of this we will adapt the previous example by replacing the property with the identifier of if the material favors the rock-salt or zinc-blende structure, and change the calculation type to be `classification`.
It is important to note that while this problem only has two classes, multi-class classification is also possible.
## The Data File
Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1.
Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1. While this example has only one task and two classes, the method works for an arbitrary number of classes and tasks.
<details>
<summary>Here is the full data_class.csv file for the calculation</summary>
......@@ -300,7 +301,7 @@ The estimated property vector in this case refers to the predicted class from SV
```
</details>
## Updating the SVM Model the Python Interface
## Updating the SVM Model Using `sklearn`
Because the classification algorithm is based on the overlap region of the convex hull, the `c` value for the SVM model is set at a fairly high value of 1000.0.
This prioritizes reducing the number of misclassified points, but makes the model more susceptible to overfitting.
To account for this, the Python interface can refit the linear SVM using the `svm` module of `sklearn`.
......
......@@ -20,9 +20,9 @@ plot_model_parity_plot: Wrapper to plot_model for a set of training and testing
import numpy as np
import toml
from sissopp.postprocess.check_cv_convergence import jackknife_cv_conv_est
from sissopp.postprocess.load_models import load_model
from sissopp.postprocess.plot.utils import setup_plot_ax
from sissopp.postprocess.plot.utils import setup_plot_ax, latexify
from sissopp import ModelClassifier
def plot_model_parity_plot(model, filename=None, fig_settings=None):
......@@ -52,9 +52,12 @@ def plot_model_parity_plot(model, filename=None, fig_settings=None):
fig_config, fig, ax = setup_plot_ax(fig_settings)
ax.set_xlabel(model.prop_label + " (" + model.prop_unit.latex_str + ")")
ax.set_xlabel(latexify(model.prop_label) + " (" + model.prop_unit.latex_str + ")")
ax.set_ylabel(
f"Estimated {model.prop_label}" + " (" + model.prop_unit.latex_str + ")"
f"Estimated {latexify(model.prop_label)}"
+ " ("
+ model.prop_unit.latex_str
+ ")"
)
if len(model.prop_test) > 0:
lims = [
......
......@@ -91,19 +91,26 @@ def adjust_box_widths(ax, fac):
def latexify(s):
"""Convert a string s into a latex string"""
power_split = s.split("^")
print(power_split)
if len(power_split) == 1:
return s
power_split[0] += "$"
for pp in range(1, len(power_split)):
unit_end = power_split[pp].split(" ")
unit_end[0] = "{" + unit_end[0] + "}$"
temp_s = s
else:
power_split[0] += "$"
for pp in range(1, len(power_split)):
unit_end = power_split[pp].split(" ")
unit_end[0] = "{" + unit_end[0] + "}$"
unit_end[-1] += "$"
power_split[pp] = " ".join(unit_end)
temp_s = "^".join(power_split)[:-1]
subscript_split = temp_s.split("_")
if len(subscript_split) == 1:
return temp_s
subscript_split[0] += "$"
for pp in range(1, len(subscript_split)):
unit_end = subscript_split[pp].split(" ")
unit_end[0] = "\\mathrm{" + unit_end[0] + "}$"
unit_end[-1] += "$"
power_split[pp] = " ".join(unit_end)
print("^".join(power_split)[:-1])
subscript_split[pp] = " ".join(unit_end)
return "^".join(power_split)[:-1]
return "_".join(subscript_split)[:-1]
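Because the hunk above interleaves removed and added lines without `+`/`-` markers, the post-change function is hard to read. The following is a best-effort reconstruction of how `latexify` plausibly reads after this commit, assuming the lines that mention `temp_s` and `subscript_split` are the added ones and the `print` calls are removed. It is not copied from the repository.

```python
def latexify(s):
    """Convert a string s into a latex string"""
    # First pass: wrap every exponent following "^" in math mode.
    power_split = s.split("^")
    if len(power_split) == 1:
        temp_s = s
    else:
        power_split[0] += "$"
        for pp in range(1, len(power_split)):
            unit_end = power_split[pp].split(" ")
            unit_end[0] = "{" + unit_end[0] + "}$"
            unit_end[-1] += "$"
            power_split[pp] = " ".join(unit_end)
        # Drop the trailing "$" added in the last loop iteration.
        temp_s = "^".join(power_split)[:-1]

    # Second pass: wrap every subscript following "_" in \mathrm{}.
    subscript_split = temp_s.split("_")
    if len(subscript_split) == 1:
        return temp_s

    subscript_split[0] += "$"
    for pp in range(1, len(subscript_split)):
        unit_end = subscript_split[pp].split(" ")
        unit_end[0] = "\\mathrm{" + unit_end[0] + "}$"
        unit_end[-1] += "$"
        subscript_split[pp] = " ".join(unit_end)

    return "_".join(subscript_split)[:-1]
```

Under this reading, strings without `^` or `_` pass through unchanged, which matches the early-return branches visible in the hunk.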
......@@ -121,6 +121,12 @@ void sisso::register_all()
"@DocString_str_utils_matlabify@"
);
def(
"latexify",
&str_utils::latexify,
(arg("str")),
"@DocString_str_utils_latexify@"
);
#ifdef PARAMETERIZE
sisso::feature_creation::node::registerAddParamNode();
sisso::feature_creation::node::registerSubParamNode();
......
......@@ -75,7 +75,7 @@ std::string str_utils::matlabify(const std::string str)
std::string copy_str = str;
std::replace(copy_str.begin(), copy_str.end(), ' ', '_');
std::vector<std::string> split_str = split_string_trim(str, "\\");
std::vector<std::string> split_str = split_string_trim(str, "\\}{");
for(auto& term_str : split_str)
{
std::string add_str = term_str;
......
......@@ -42,6 +42,7 @@ namespace str_utils
*/
std::vector<std::string> split_string_trim(const std::string str, const std::string split_tokens = ",;:");
// DocString: str_utils_latexify
/**
* @brief Convert a string into a latex string
*
......