Commit 6385b732 authored by Thomas Purcell

Make changes to tutorials based off of Chris' comments

parent 6bbc8890
......@@ -30,14 +30,16 @@ Below is a minimal example of the data file used to learn a model for a material
```csv
material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 11.4, 2.526
Si, diamond, 40.86, 3.866
Ge, diamond, 47.38, 4.062
Sn, diamond, 76.12, 4.757
NaF, rock_salt, 25.72, 3.313
NaCl, rock_salt, 45.45, 4.006
NaBr, rock_salt, 54, 4.243
NaI, rock_salt, 68.35, 4.589
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76
Sn, diamond, 293.58, 6.65
Pb, diamond, 353.84, 7.07
LiF, rock_salt, 67.94, 4.08
NaF, rock_salt, 103.39, 4.69
KF, rock_salt, 159.00, 5.42
RbF, rock_salt, 189.01, 5.74
CsF, rock_salt, 228.33, 6.11
```
### `sisso.json`
......@@ -52,23 +54,23 @@ Here is a complete example of a `sisso.json` file where the property and task ke
"property_key": "Volume",
"task_key": "Structure_Type",
"opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
"param_opset": ["sq", "cb", "cbrt", "sqrt", "log", "exp"],
"param_opset": [],
"calc_type": "regression",
"desc_dim": 1,
"n_sis_select": 10,
"desc_dim": 2,
"n_sis_select": 5,
"max_rung": 2,
"n_residual": 5,
"n_models_store": 2,
"n_residual": 1,
"n_models_store": 1,
"n_rung_store": 1,
"n_rung_generate": 0,
"min_abs_feat_val": 1e-5,
"max_abs_feat_val": 1e8,
"leave_out_inds": [0, 4],
"leave_out_inds": [0, 5],
"leave_out_frac": 0.25,
"fix_intercept": false,
"max_feat_cross_correlation": 1.0,
"nlopt_seed": 13,
"global_param_opt": true,
"global_param_opt": false,
"reparam_residual": true
}
```
......@@ -205,40 +207,106 @@ mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json
which will give the following output for the simple problem defined above
```text
time input_parsing: 0.00622392 s
time to generate feat sapce: 1.78322 s
Projection time: 0.00070405 s
Time to get best features on rank : 9.89437e-05 s
Complete final combination/selection from all ranks: 0.000319004 s
Time for SIS: 0.00143313 s
Time for l0-norm: 0.00337982 s
Train RMSE: 0.00650172 AA^3; Test RMSE: 0.0326653 AA^3
c0 + a0 * ((lat_param-6.449216e-05)^3)
time input_parsing: 0.000721931 s
time to generate feat sapce: 0.00288105 s
Projection time: 0.00304198 s
Time to get best features on rank : 1.09673e-05 s
Complete final combination/selection from all ranks: 0.00282502 s
Time for SIS: 0.00595999 s
Time for l0-norm: 0.00260496 s
Projection time: 0.000118971 s
Time to get best features on rank : 1.38283e-05 s
Complete final combination/selection from all ranks: 0.00240111 s
Time for SIS: 0.00276804 s
Time for l0-norm: 0.000256062 s
Train RMSE: 0.293788 AA^3; Test RMSE: 0.186616 AA^3
c0 + a0 * (lat_param^3)
Train RMSE: 0.0936332 AA^3; Test RMSE: 15.8298 AA^3
c0 + a0 * ((lat_param^3)^2) + a1 * (sqrt(lat_param)^3)
```
## Analyzing the Results
Once the calculations are done, two sets of output files are generated.
A list of all selected features is stored in: `feature_space/selected_features.txt` and every model used as a residual for SIS is stored in `models/`.
Two files that summarize the results from SIS, one human readable and one computer readable, are stored in `feature_space/`, and every model used as a residual for SIS is stored in `models/`.
The human readable file describing the selected feature space is `feature_space/SIS_summary.txt`, which contains the projection score of each feature (the Pearson correlation to the target property or model residual).
```
# FEAT_ID Score Feature Expression
0 0.99997909235669924 (lat_param^3)
1 0.999036700010245471 ((lat_param^2)^2)
2 0.998534266139345261 (lat_param^2)
3 0.996929900301868899 (sqrt(lat_param)^3)
4 0.994755117666830335 lat_param
#-----------------------------------------------------------------------
5 0.0318376000648976157 ((lat_param^3)^3)
6 0.00846237838476477863 ((lat_param^3)^2)
7 0.00742498801557322716 cbrt(cbrt(lat_param))
8 0.00715447033658055554 cbrt(sqrt(lat_param))
9 0.00675695980092700429 sqrt(sqrt(lat_param))
#---------------------------------------------------------------------
```
The computer readable file is `feature_space/selected_features.txt` and contains the list of selected features, each represented by an alphanumeric code where the integers are the index of the feature in the primary feature space and the strings represent the operators.
The order of the terms in these expressions is the same as the order in which they would appear in postfix (reverse Polish) notation.
```
# FEAT_ID Feature Postfix Expression (RPN)
0 0|cb
1 0|sq|sq
2 0|sq
3 0|sqrt|cb
4 0
#-----------------------------------------------------------------------
5 0|cb|cb
6 0|cb|sq
7 0|cbrt|cbrt
8 0|sqrt|cbrt
9 0|sqrt|sqrt
#-----------------------------------------------------------------------
```
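As a rough illustration of how these codes map back to readable expressions, the sketch below evaluates a postfix code with a stack. It is not the library's own parser: the function name `postfix_to_infix`, the display templates (inferred from the expressions printed in `SIS_summary.txt`), and the operand order for binary operators are all assumptions made for this example.

```python
# Display templates inferred from the expressions shown in SIS_summary.txt.
UNARY = {
    "sq": "({0}^2)",
    "cb": "({0}^3)",
    "sqrt": "sqrt({0})",
    "cbrt": "cbrt({0})",
}
# Assumption: the left operand is pushed before the right operand.
BINARY = {
    "add": "({0} + {1})",
    "sub": "({0} - {1})",
    "mult": "({0} * {1})",
    "div": "({0} / {1})",
}


def postfix_to_infix(code, primary_features):
    """Turn a code such as "0|sqrt|cb" into a readable expression string."""
    stack = []
    for token in code.split("|"):
        if token.isdigit():
            # Integers index into the primary feature space.
            stack.append(primary_features[int(token)])
        elif token in UNARY:
            stack.append(UNARY[token].format(stack.pop()))
        elif token in BINARY:
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY[token].format(left, right))
        else:
            raise ValueError(f"Unknown token: {token!r}")
    if len(stack) != 1:
        raise ValueError(f"Malformed expression: {code!r}")
    return stack[0]


primary = ["lat_param"]
expressions = [postfix_to_infix(c, primary) for c in ("0|cb", "0|sq|sq", "0|sqrt|cb")]
```

Only the operators that appear in this example are covered; any other operator in `opset` would need its own template.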
The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE.
The model with the lowest RMSE is stored in the lowest numbered file.
For example `train_dim_1_model_0.dat` will have the best 2D model, `train_dim_1_model_1.dat` would have the second best, etc.
For example, `train_dim_2_model_0.dat` holds the best 2D model, `train_dim_2_model_1.dat` holds the second best, etc., whereas `train_dim_1_model_0.dat` holds the best 1D model.
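The naming scheme can be used to pick out the best model of each dimension from a directory listing. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the `test_dim_...` file name pattern is an assumption that mirrors the `train_dim_...` names shown above.

```python
import re

# Assumption: test files follow the same pattern as the train files.
MODEL_FILE_RE = re.compile(r"^(train|test)_dim_(\d+)_model_(\d+)\.dat$")


def parse_model_filename(name):
    """Return (kind, dimension, rank), or None if the name does not match."""
    match = MODEL_FILE_RE.match(name)
    if match is None:
        return None
    kind, dim, rank = match.groups()
    return kind, int(dim), int(rank)


def best_train_models(filenames):
    """Map each dimension to its rank-0 (lowest train RMSE) train file."""
    best = {}
    for name in filenames:
        parsed = parse_model_filename(name)
        if parsed is not None and parsed[0] == "train" and parsed[2] == 0:
            best[parsed[1]] = name
    return best


names = [
    "train_dim_1_model_0.dat",
    "train_dim_2_model_0.dat",
    "train_dim_2_model_1.dat",
    "test_dim_2_model_0.dat",
    "notes.txt",
]
best = best_train_models(names)
```

In practice `names` would come from something like `os.listdir("models")`.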
Each model file has a large header containing information about the selected features and the generated model:
```
# c0 + a0 * ((lat_param-6.449216e-05)^3)
# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
# Coefficients
# Task a0 c0
# diamond, 1.000735616997855e+00, -1.551085274074442e-01,
# rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
# Feature Rung, Units, and Expressions
# 0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
# Number of Samples Per Task
# Task , n_mats_train
# diamond, 4
# rock_salt, 4
```
The first section of the header summarizes the model by providing a string representation of the model, defines the property's label and unit, and summarizes the error of the model.
```
# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.0065017240994181; Max AE: 0.00917629346981386
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
```
Next, the linear coefficients (as shown in the first line) for each task are listed.
```
# Coefficients
# Task a0 c0
# diamond, 7.072338733511608e-01, -9.662083275737180e-03,
# rock_salt, 7.079543192237714e-01, -6.681973682629133e-02,
# diamond, 1.000735616997855e+00, -1.551085274074442e-01,
# rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
```
Then a description of each feature in the model is listed, including units and various expressions.
```
# Feature Rung, Units, and Expressions
# 0; 1; AA^3; 0|cb: 1.0000000000000e+00,-6.4492159105003e-05; ((lat_param-6.449216e-05)^3); $\left(\left(lat_{param}-6.449e-05\right)^3\right)$; (lat_param-6.449216e-05).^3; lat_param
# 0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
```
Finally, information about the number of samples in each task is given:
```
# Number of Samples Per Task
# Task , n_mats_train
# diamond, 3
# rock_salt, 3
# diamond, 4
# rock_salt, 4
```
After this header the following data is stored in the file:
......@@ -246,7 +314,7 @@ After this header the following data is stored in the file:
```
# Sample ID , Property Value , Property Value (EST) , Feature 0 Value
```
With this file the model can be perfectly reconstructed using the python binding.
With this data, one can plot and analyze the model, e.g., by using the python binding.
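For a quick look at a model file without the python binding, the header fields described above can also be read with plain string handling. This is only a sketch: `parse_header` is a hypothetical helper, and it assumes the header layout matches the example printed in this guide (here pasted in as a string, shortened to its first sections).

```python
import re

HEADER = """\
# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
# Coefficients
# Task a0 c0
# diamond, 1.000735616997855e+00, -1.551085274074442e-01,
# rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
# Feature Rung, Units, and Expressions
"""

ERROR_RE = re.compile(r"RMSE:\s*([^;\s]+);\s*Max AE:\s*([^;\s]+)")


def parse_header(text):
    """Extract the model string, error summary, and per-task coefficients."""
    lines = [line.lstrip("# ").rstrip() for line in text.splitlines()]
    summary = {"model": lines[0], "coefficients": {}}
    in_coefficients = False
    for line in lines[1:]:
        error = ERROR_RE.search(line)
        if error:
            summary["rmse"] = float(error.group(1))
            summary["max_ae"] = float(error.group(2))
        elif line.startswith("Coefficients"):
            in_coefficients = True
        elif line.startswith("Feature Rung"):
            in_coefficients = False
        elif in_coefficients and "," in line:
            # Rows look like "task, a0, c0,"; the column-name row has no comma.
            fields = [f.strip() for f in line.split(",") if f.strip()]
            summary["coefficients"][fields[0]] = [float(v) for v in fields[1:]]
    return summary


summary = parse_header(HEADER)
```

The coefficient lists follow the column order of the header (`a0`, then `c0`).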
## Using the Python Library
To see how the python interface can be used refer to the [tutorials](../tutorial/2_python.md).
......
......@@ -6,12 +6,17 @@ In particular we will use SISSO to predict the crystal structure (rock-salt or z
The tutorial is split into three parts: 1) how to use the executable to perform the calculations and the python utilities to analyze the results, 2) how to use only python to run, analyze, and demonstrate results, and 3) how to perform classification problems using SISSO.
## Outline
The following tutorials are available:
The tutorials are split between solving regression problems:
- [Using the Command Line Interface](1_command_line.md)
- [Using the Python Interface](2_python.md)
And classification:
- [Classification](3_classification.md)
For several large and independent calculations, the command-line interface will be best; however, the python interface provides a good way of performing initial tests and demonstrating the final results.
All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock-salt or zinc-blende phase.
For all applications of SISSO, a data set has to be passed via a standard `csv` file, where the first row contains the feature and property labels and the first column contains the index label for each sample, for example:
```
......@@ -27,7 +32,7 @@ The feature labels have an optional term in () that represents the units of the
If no unit is passed then the feature is assumed to be unitless.
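To make the label convention concrete, here is a small sketch that splits a column header into its label and optional unit. It illustrates the convention only and is not the parser used by the code; the function name `split_label_and_unit` is hypothetical, and a missing unit is represented by `None`.

```python
import re

# "label (unit)" where the "(unit)" term is optional.
LABEL_RE = re.compile(r"^\s*(?P<label>[^()]+?)\s*(?:\((?P<unit>[^()]*)\))?\s*$")


def split_label_and_unit(column_header):
    """Return (label, unit); unit is None when no "(...)" term is present."""
    match = LABEL_RE.match(column_header)
    if match is None:
        raise ValueError(f"Cannot parse column header: {column_header!r}")
    unit = match.group("unit")
    return match.group("label"), (unit.strip() if unit is not None else None)


header = "material, Structure_Type, Volume (AA^3), lat_param (AA)"
parsed = [split_label_and_unit(col) for col in header.split(",")]
```

Here `Volume` and `lat_param` carry units, while `material` and `Structure_Type` are treated as unitless.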
<details>
<summary>Here is the full data.csv file for the calculation</summary>
<summary>Here is the full data.csv file for the calculation. The features describe the nuclear charge (Z); ionization potential (IP); electron affinity (EA); HOMO and LUMO energies (E_HOMO and E_LUMO); and radii of the atomic s, p, and d orbitals (r_s, r_p, and r_d) of the cation (A) and anion (B) of the materials. Additionally, the radii of the sigma and pi orbitals of the dimer for each material (r_sigma and r_pi) are included.</summary>
```
# Material,E_RS - E_ZB (eV),Z_A (nuc_charge) ,Z_B (nuc_charge) ,period_A,period_B,IP_A (eV_IP) ,IP_B (eV_IP) ,EA_A (eV_IP),EA_B (eV_IP) ,E_HOMO_A (eV) ,E_HOMO_B (eV) ,E_LUMO_A (eV),E_LUMO_B (eV) , r_s_A (AA) , r_s_B (AA) , r_p_A (AA) , r_p_B (AA) , r_d_A (AA) , r_d_B (AA), r_sigma (AA) , r_pi (AA)
......
......@@ -20,6 +20,8 @@ As an example here is the `sisso.json` file we will initially use for this syste
"opset": ["add", "sub", "abs_diff", "mult", "div", "inv", "abs", "exp", "log", "sin", "cos", "sq", "cb", "six_pow", "sqrt", "cbrt", "neg_exp"]
}
```
Of these parameters, `n_sis_select`, `n_residual`, `max_rung`, and `desc_dim` are the hyperparameters that must be optimized for each calculation.
Additionally, `property_key` and `task_key` must both be column headers in the `data_file` (here we are only using one task, so `task_key` is not included).
With this input file and the provided `data.csv` file we are now able to perform SISSO with the following command
```
mpiexec -n 2 sisso++ sisso.json
......@@ -61,8 +63,14 @@ Train RMSE: 0.0588116 eV
c0 + a0 * ((E_LUMO_A / EA_A) / (r_p_B^6)) + a1 * ((|period_B - period_A|) / (r_pi * EA_B)) + a2 * ((EA_B - IP_A) * (|r_sigma - r_s_B|)) + a3 * ((E_HOMO_B / r_p_A) / (r_sigma + r_p_B))
```
The standard output provides information about what step the calculation just finished and how long it took to complete so you can see where a job failed or ran out of time.
When all calculations are complete the code prints out a summary of the best 1D, 2D, ..., {desc_dim}D models with their training RMSE/Testing RMSE (Only training if there is no test set provided).
Additionally, two additional output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`.
If this:
```
Train RMSE: 0.0588116 eV
c0 + a0 * ((E_LUMO_A / EA_A) / (r_p_B^6)) + a1 * ((|period_B - period_A|) / (r_pi * EA_B)) + a2 * ((EA_B - IP_A) * (|r_sigma - r_s_B|)) + a3 * ((E_HOMO_B / r_p_A) / (r_sigma + r_p_B))
```
is not shown at the bottom of standard output then the calculation did not complete successfully.
When all calculations are complete, the code prints out a summary of the best 1D, 2D, ..., {desc_dim}D models with their training and testing RMSE (only the training RMSE if no test set is provided, as in this case).
We also see that two additional output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`.
These files represent a human readable (`SIS_summary.txt`) and computer readable (`selected_features.txt`) summary of the selected feature space from SIS.
Below are reconstructions of both files for this calculation (To see the file click the triangle)
......@@ -115,6 +123,9 @@ Below are reconstructions of both files for this calculation (To see the file cl
39 0.253659279222423484 ((E_LUMO_A / r_p_B) * (E_LUMO_B * E_LUMO_A))
#-----------------------------------------------------------------------
</details>
This file contains the index of the selected feature space, a projection score, and a string representation of the feature.
For regression problems, the score represents the Pearson correlation between the feature and the target property (all features above the first dashed line) or the highest Pearson correlation between the feature and the residual of the best `n_residual` models of the previous dimension.
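The score for a single feature can be sketched as follows. This is a standard-library illustration of the Pearson correlation, not the code's own projection routine: it ignores multiple tasks, and taking the absolute value is an assumption. The numbers are the diamond-structure rows of the minimal data file from the quick-start guide, with `lat_param^3` as the candidate feature and the volume as the target.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


lat_param = [3.57, 5.47, 5.76, 6.65, 7.07]  # AA
volume = [45.64, 163.55, 191.39, 293.58, 353.84]  # AA^3

feature = [a ** 3 for a in lat_param]
score = abs(pearson(feature, volume))  # abs() is an assumption of this sketch
```

For features below the first dashed line, the second argument would be the residual of a lower-dimensional model instead of the property itself.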
<details>
<summary>feature_space/selected_features.txt</summary>
......@@ -165,6 +176,9 @@ Below are reconstructions of both files for this calculation (To see the file cl
#-----------------------------------------------------------------------
</details>
This file is a computer readable file used to reconstruct the selected feature space.
In this file each feature is displayed as an alphanumeric string where the integers represent an index of the primary feature space and the strings represent operations.
The order of the terms matches the order of terms if the equation is written in postfix (reverse Polish) notation.
In both files the change in rung is represented by the commented out dashed (--) line.
The `models/` directory is used to store the output files representing the models for each dimension:
......@@ -174,7 +188,9 @@ train_dim_1_model_0.dat train_dim_2_model_0.dat train_dim_3_model_0.dat train
```
Each of these files represents one of the {`n_models_store`} models stored for each dimension, and can be used to reconstruct the models within python.
The file has a header that provides metadata associated with the selected features, coefficients, modeled property, and the task sizes for the calculations.
After the header the value of the property, estimated property, and feature value for each sample is listed wit the same label used in `data.csv`.
The first six lines of the header are the most important because they define what the model is, what the error is, and the coefficients for each task.
After the header the value of the property, estimated property, and feature value for each sample is listed with the same label used in `data.csv`.
For a line by line description of the header refer to the [quick-start guide](../quick_start/code_ref.md).
An example of these files is provided here:
<details>
......@@ -276,16 +292,18 @@ An example of these files is provided here:
SZn , 2.758133256065780e-01, 2.143486874814196e-01, 2.332991532953503e+00, -2.400270322363168e+00
SeZn , 2.631368992806530e-01, 2.463580576975095e-01, 7.384497385908948e-01, -2.320488278555971e+00
TeZn , 2.450012951740060e-01, 1.776248032825628e-01, 2.763715059556858e+00, -2.304848319397327e+00
</details>
## Determining the Ideal Model Complexity with Cross-Validation
While the training error always decreases with descriptor dimensionality for a given application, over-fitting can reduce the general applicability of the models outside of the training set.
In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation.
Cross-validation
As an example we will discuss how to perform leave-out 10% using the command line.
To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use that as a test set by changing `"leave_out_frac": 0.0,` do `"leave_out_frac": 0.10,`.
To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use that as a test set by changing `"leave_out_frac": 0.0,` to `"leave_out_frac": 0.10,`, i.e., in this case SISSO will ignore 8 materials (10% of all data) during training.
In each run, these 8 materials are chosen randomly, so the SISSO runs will differ from one another.
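The effect of `leave_out_frac` can be pictured with the sketch below, which draws a random test set of the requested size. It only illustrates the idea: the function `leave_out_split`, the rounding rule, and the placeholder sample names are assumptions, as is the total of 82 samples (chosen so that 10% corresponds to the 8 materials mentioned above).

```python
import random


def leave_out_split(sample_ids, leave_out_frac, seed=None):
    """Randomly split sample_ids into (train, test) lists."""
    rng = random.Random(seed)
    # Assumption: the test-set size is the rounded fraction of all samples.
    n_test = int(round(leave_out_frac * len(sample_ids)))
    test_inds = set(rng.sample(range(len(sample_ids)), n_test))
    train = [s for i, s in enumerate(sample_ids) if i not in test_inds]
    test = [s for i, s in enumerate(sample_ids) if i in test_inds]
    return train, test


samples = [f"material_{i}" for i in range(82)]  # placeholder sample names
train, test = leave_out_split(samples, 0.10, seed=0)
```

Calling the function with different seeds gives different test sets, which is why repeated runs produce different models and test errors.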
<details>
<summary> updated sisso.json file</summary>
......