A list containing the set of all operators that will be used during the feature creation step of SISSO.

#### `param_opset`

A list containing the set of all operators for which the non-linear scale and bias terms will be optimized during the feature creation step of SISSO. (If empty, none of the operators are parameterized.)

#### `calc_type`

...

...

The type of calculation to run: either regression, log regression, or classification.

#### `desc_dim`

The maximum dimension of the model to be created (no default value)

#### `n_sis_select`

The number of features that SIS selects over each iteration (no default value)

#### `max_rung`

The maximum rung of the feature, i.e., the height of the tallest possible binary expression tree minus 1 (no default value)
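As an illustration, these parameters appear as keys in the `sisso.json` input file used throughout the tutorials. The fragment below is a hypothetical sketch; the values and operator names shown are placeholders, not recommended defaults:

```json
{
    "calc_type": "regression",
    "desc_dim": 3,
    "n_sis_select": 100,
    "max_rung": 2,
    "opset": ["add", "sub", "mult", "div", "sq", "exp"],
    "param_opset": []
}
```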

This tutorial is based on the [Predicting energy differences between crystal structures: (Meta-)stability of octet-binary compounds](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb) tutorial created by Mohammad-Yasin Arif, Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler.

The goal of the tutorial is to teach a user how to use `SISSO++` to find and analyze quantitative models for materials properties.

In particular we will use SISSO to predict the crystal structure (rock-salt or zinc-blende) of a series of octet binaries.

The tutorial is split into three parts: 1) how to use the executable to perform the calculations and the Python utilities to analyze the results, 2) how to use only Python to run, analyze, and present the results, and 3) how to solve classification problems using SISSO.

## Outline

The following tutorials are available:


- [Using the Command Line Interface](1_command_line.md)

- [Using the Python Interface](2_python.md)

- [Classification](3_classification.md)

All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock-salt or zinc-blende phase.

For all applications of SISSO, a data set has to be passed via a standard `csv` file, where the first row contains the feature and property labels and the first column contains the index label for each sample, for example:
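A minimal sketch of such a file; the sample names, feature labels, and values below are purely illustrative placeholders:

```
sample,E_RS - E_ZB (eV),feat_A,feat_B
mat_1,-0.15,1.25,0.67
mat_2,0.08,2.10,0.31
```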

The standard output provides information about what step the calculation just finished.

When all calculations are complete, the code prints out a summary of the best 1D, 2D, ..., {desc_dim}D models with their training/testing RMSE (only the training RMSE if no test set is provided).

Additionally, two output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`.

These files represent a human readable (`SIS_summary.txt`) and computer readable (`selected_features.txt`) summary of the selected feature space from SIS.

Below are reconstructions of both files for this calculation (click the triangle to view each file).

<details>

<summary>feature_space/SIS_summary.txt</summary>

...

...

An example of these files is provided here:

```

</details>

## Determining the Ideal Model Complexity with Cross-Validation

While the training error always decreases with descriptor dimensionality for a given application, over-fitting can reduce the general applicability of the models outside of the training set.

In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation.

As an example we will discuss how to perform leave-out 10% using the command line.

To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use it as a test set, by changing `"leave_out_frac": 0.0,` to `"leave_out_frac": 0.10,`.
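For reference, the changed key sits in `sisso.json` alongside the other input parameters; only the `leave_out_frac` value below is taken from the text, and all surrounding keys are omitted:

```json
{
    "leave_out_frac": 0.10
}
```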

<details>

...

...

As can be seen from the standard error measurements, the results are now reasonable.

Because the validation errors for the three- and four-dimensional models are within each other's error bars and the standard error increases when going to the fourth dimension, we conclude that the three-dimensional model has the ideal complexity.

...

...

To see the distributions for this system we run:

<details>

<summary> Distribution of Errors </summary>

![image](./command_line/error_cv.png)

</details>

One thing that stands out in the plot is the large error seen in a single point for both the one and two dimensional models.

By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zinc-blende structure in the data set.

As a note, for this setup there is a 0.22% chance that one of the samples is never in the validation set, so if `max_error_ind != 21`, check whether that sample is in one of the validation sets.

```python

>>> import numpy as np

>>> import pandas as pd

...

...

To get the final models we will perform the same calculation we started off the tutorial with.

From here we can use `models/train_dim_3_model_0.dat` for all of the analysis.

In order to generate a machine learning plot for this model in matplotlib, run the following in Python:
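The plotting commands themselves are not reproduced in this excerpt; a minimal parity-plot sketch might look like the following, with placeholder arrays standing in for the measured and predicted values that would be parsed from `models/train_dim_3_model_0.dat`:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: substitute the measured and SISSO-predicted property
# values from models/train_dim_3_model_0.dat.
measured = np.array([-0.3, -0.1, 0.05, 0.2, 0.4])
predicted = np.array([-0.28, -0.12, 0.08, 0.18, 0.37])

fig, ax = plt.subplots()
ax.scatter(measured, predicted)
lims = [min(measured.min(), predicted.min()),
        max(measured.max(), predicted.max())]
ax.plot(lims, lims, "k--")  # parity line y = x
ax.set_xlabel("Measured E_RS - E_ZB (eV)")
ax.set_ylabel("Predicted E_RS - E_ZB (eV)")
fig.savefig("parity_plot.png")
```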

Finally, besides regression problems, `SISSO++` can be used to solve classification problems.

As an example of this, we will adapt the previous example by replacing the property with an identifier of whether the material favors the rock-salt or zinc-blende structure, and change the calculation type to `classification`.

It is important to note that while this problem only has two classes, multi-class classification is also possible.

## The Data File

Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1. While this example has only one task and two classes, the method works for an arbitrary number of classes and tasks.
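This conversion can be sketched with `pandas`; the column names follow the text, while the property values below are illustrative:

```python
import pandas as pd

# Illustrative property values standing in for the real data set
df = pd.DataFrame({"E_RS - E_ZB (eV)": [-0.25, 0.10, -0.01, 0.33]})

# Negative energy differences (rock-salt favored) map to class 0,
# positive ones (zinc-blende favored) map to class 1
df["Class"] = (df["E_RS - E_ZB (eV)"] > 0).astype(int)
print(df["Class"].tolist())  # -> [0, 1, 0, 1]
```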

<details>

<summary>Here is the full data_class.csv file for the calculation</summary>

...

...

The estimated property vector in this case refers to the predicted class from the SVM.

```

</details>

## Updating the SVM Model Using `sklearn`

Because the classification algorithm is based on the overlap region of the convex hulls, the `c` value for the SVM model is set at a fairly high value of 1000.0.

This prioritizes reducing the number of misclassified points, but makes the model more susceptible to over-fitting.

To account for this, the Python interface can refit the linear SVM using the `svm` module of `sklearn`.
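A minimal sketch of such a refit, assuming the descriptor values have been extracted into an array; the data below are synthetic placeholders, and the exact refitting helper of the `SISSO++` Python interface is not shown in this excerpt:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in for a 2D SISSO descriptor of a two-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, size=(40, 2)),
               rng.normal(1.0, 0.5, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# SISSO++ fixes C at 1000.0 to minimize the number of misclassified points;
# refitting with a smaller C regularizes the boundary against over-fitting
svm_refit = LinearSVC(C=1.0)
svm_refit.fit(X, y)
```

Lowering `C` trades a few more training misclassifications for a wider margin, which is usually the safer choice when the high-`C` fit looks over-confident.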