Commit dfe86884 authored by Thomas Purcell

Update code reference to match Chris' comments

Updated the format towards Chris' suggestions
parent bb78cca9
Running the code
Running the Code
---
### Input Files
## Input Files
To see a sample of the input files look in `~/sisso++/main directory/test/exec_test`
To see a sample of the input files look in `~/sisso++/main directory/tests/exec_test`.
In this directory there are multiple subdirectories for different types of calculations, but the `default/` directory would be the most common application.
To use the code two files are necessary: `sisso.json` and `data.csv`.
`data.csv` stores all of the data for the calculation in a `csv` file.
......@@ -13,132 +14,241 @@ The first column of the file are sample labels for all of the other rows, and is
The input parameters are stored in `sisso.json`, here is a list of all possible variables that can be set in `sisso.json`
#### `data_file`
### `data.csv`
The data file contains all relevant data and metadata to describe the individual features and samples.
The first row of the file contains the feature metadata and has the following format: `expression (Unit)` or `expression`.
If no `(Unit)` is included in the header, the feature is considered to be dimensionless.
For example, if one of the primary features is the lattice constant of a material, the header would be `lat_param (AA)`, but the number of species in the material would be `n_species` because it is a dimensionless number.
The first column provides the labels for each sample in the data file and is used to set the sample ids in the output files.
In the simplest case, this can be just a running index.
The data describing the property vector is defined in the column with an `expression` matching the `property_key` field in the `sisso.json` file, and will not be included in the feature space.
Additionally, an optional `Task` column whose header matches the `task_key` field in the `sisso.json` file can also be included in the data file.
This column maps each sample to a respective task with a label defined in the task column.
Below is a minimal example of the data file used to learn a model for a material's volume.
```csv
material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 11.4, 2.526
Si, diamond, 40.86, 3.866
Ge, diamond, 47.38, 4.062
Sn, diamond, 76.12, 4.757
NaF, rock_salt, 25.72, 3.313
NaCl, rock_salt, 45.45, 4.006
NaBr, rock_salt, 54, 4.243
NaI, rock_salt, 68.35, 4.589
```
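As an illustration of the header convention described above, the following Python sketch splits a column header into its expression and unit. The helper `parse_header` and its regular expression are hypothetical and are not part of SISSO++; the sketch only assumes the `expression (Unit)` or `expression` format.

```python
import re

# Hypothetical helper (not part of SISSO++): matches "expression (Unit)" or "expression".
HEADER_RE = re.compile(r"^\s*(?P<expr>[^()]+?)\s*(?:\((?P<unit>[^()]*)\))?\s*$")


def parse_header(column):
    """Return (expression, unit); unit is None for a dimensionless column."""
    match = HEADER_RE.match(column)
    if match is None:
        raise ValueError(f"Unrecognized column header: {column!r}")
    unit = match.group("unit")
    return match.group("expr"), (unit.strip() if unit else None)


# Header row taken from the example data file above.
header = "material, Structure_Type, Volume (AA^3), lat_param (AA)"
columns = [parse_header(col) for col in header.split(",")]
for expression, unit in columns:
    print(expression, unit)
```

Columns without a unit, such as the sample label and task columns, come back with `None` as the unit in this sketch.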
### `sisso.json`
All input parameters that cannot be extracted from the data file are defined in the `sisso.json` file.
Here is a complete example of a `sisso.json` file where the property and task keys match those in the data file example above.
```json
{
"data_file": "data.csv",
"property_key": "Volume",
"task_key": "Structure_Type",
"opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
"param_opset": ["sq", "cb", "cbrt", "sqrt", "log", "exp"],
"calc_type": "regression",
"desc_dim": 1,
"n_sis_select": 10,
"max_rung": 2,
"n_residual": 5,
"n_models_store": 2,
"n_rung_store": 1,
"n_rung_generate": 0,
"min_abs_feat_val": 1e-5,
"max_abs_feat_val": 1e8,
"leave_out_inds": [0, 4],
"leave_out_frac": 0.25,
"fix_intercept": false,
"max_feat_cross_correlation": 1.0,
"nlopt_seed": 13,
"global_param_opt": true,
"reparam_residual": true
}
```
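Before launching a calculation it can be convenient to check that an input file contains the fields that the descriptions below mark in red. The following Python sketch does this with the standard `json` module. The function `missing_required_keys` and the list `REQUIRED_KEYS` are hypothetical names introduced here, and the check is independent of whatever validation the code itself performs.

```python
import json

# Fields that the field descriptions mark in red (expected in every sisso.json).
REQUIRED_KEYS = [
    "data_file", "property_key", "task_key", "opset", "calc_type",
    "desc_dim", "n_sis_select", "max_rung", "n_residual",
]


def missing_required_keys(config):
    """Return the required keys that are absent from a parsed sisso.json dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


# A deliberately incomplete configuration, used only for illustration.
example = json.loads('{"data_file": "data.csv", "property_key": "Volume", "desc_dim": 1}')
print(missing_required_keys(example))
```

For a real input file, replace the `json.loads` call with `json.load` on the opened `sisso.json`.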
A description of all fields is listed below. Anything highlighted in red should be in all `sisso.json` files, while anything highlighted in blue should be defined depending on the circumstances.
#### <span style="color:red">data_file</span>
*Default: "data.csv"*
The name of the csv file where the data is stored.
The name of the csv file where the data is stored. (Default: "data.csv")
#### <span style="color:red">property_key</span>
#### `property_key`
*Default: "prop"*
The expression of the column where the property to be modeled is stored. (Default: "prop")
The expression of the column where the property to be modeled is stored.
#### `task_key`
#### <span style="color:red">task_key</span>
The expression of the column where the task identification is stored. (Default: "task")
*Default: "task"*
#### `opset`
The expression of the column where the task identification is stored.
#### <span style="color:red">opset</span>
A list containing the set of all operators that will be used during the feature creation step of SISSO. (If empty, all available operators are used.)
#### `param_opset`
#### <span style="color:blue">param_opset</span>
A list containing the set of all operators, with optimized non-linear scale and bias terms, that will be used during the feature creation step of SISSO. (If empty, none of these parameterized operators are used.)
#### `calc_type`
#### <span style="color:red">calc_type</span>
*Default: "regression"*
The type of calculation to run either regression, log regression, or classification (Default: regression)
The type of calculation to run: regression, log regression, or classification.
#### `desc_dim`
#### <span style="color:red">desc_dim</span>
The maximum dimension of the model to be created (no default value)
#### `n_sis_select`
#### <span style="color:red">n_sis_select</span>
The number of features that SIS selects over each iteration (no default value)
#### `max_rung`
#### <span style="color:red">max_rung</span>
The maximum rung of the feature (height of the tallest possible binary expression tree - 1) (no default value)
#### `n_residual`
#### <span style="color:red">n_residual</span>
*Default: 1*
Number of residuals used to select the next subset of materials in the iteration. (Affects SIS after the 1D model)
#### n_models_store
*Default: n_residual*
Number of residuals used to select the next subset of materials in the iteration. (Affects SIS after the 1D model) (Default: 1)
Number of models to output as file for each dimension
#### `n_models_store`
#### n_rung_store
Number of models to output as file for each dimension (Default: n_residual)
*Default: `max_rung` - 1*
#### `n_rung_store`
The number of rungs where all of the training/testing data of the materials are stored in memory.
The number of rungs where all of the training/testing data of the materials are stored in memory. (Default: `max_rung` - 1)
#### <span style="color:blue">n_rung_generate</span>
#### `n_rung_generate`
*Default: 0*
The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)
The number of rungs to generate on the fly during each SIS step. Must be 1 or 0.
#### `min_abs_feat_val`
#### min_abs_feat_val
Minimum absolute value allowed in the feature's training data (Default: 1e-50)
*Default: 1e-50*
#### `max_abs_feat_val`
Minimum absolute value allowed in the feature's training data
Maximum absolute value allowed in the feature's training data (Default: 1e50)
#### max_abs_feat_val
#### `leave_out_inds`
*Default: 1e50*
Maximum absolute value allowed in the feature's training data
#### leave_out_inds
The list of indexes from the data set to use as the test set. If empty and `leave_out_frac > 0` the selection will be random
#### `leave_out_frac`
#### <span style="color:blue">leave_out_frac</span>
*Default: 0.0*
Fraction (in decimal form) of the data to use as a test set. This is not used if `leave_out_inds` is set.
#### <span style="color:blue">fix_intercept</span>
*Default: false*
If true, set the bias term for regression models to 0.0. For classification problems this must be false.
Fraction (in decimal form) of the data to use as a test set (Default: 0.0 if `leave_out_inds` is empty, otherwise `len(leave_out_inds) / Number of rows in data file`)
#### <span style="color:blue">max_feat_cross_correlation</span>
#### `fix_intercept`
*Default: 1.0*
If true set the bias term for regression models to 0.0 (Default: false)
The maximum Pearson correlation allowed between selected features
This does not work for classification
#### nlopt_seed
#### `max_feat_cross_correlation`
*Default: 42*
The maximum Pearson correlation allowed between selected features (Default: 1.0)
The random seed used for seeding the pseudo-random number generator for NLopt
#### `nlopt_seed`
#### <span style="color:blue">global_param_opt</span>
The random seed used for seeding the pseudo-random number generator for NLopt (Default: 42)
*Default: false*
#### `global_param_opt`
If true then attempt to globally optimize the non-linear scale/bias terms for the operators in `param_opset`
If true then attempt to globally optimize the non-linear scale/bias terms for the operators in `param_opset` (Default: false)
#### <span style="color:blue">reparam_residual</span>
#### `reparam_residual`
*Default: false*
If true then reparameterize features based on the residuals (default false)
If true then reparameterize features based on the residuals
### Perform the Calculation
## Perform the Calculation
Once the input files are made the code can be run using the following command
```
mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json
```
### Analyzing the Results
which will give the following output for the simple problem defined above
```text
time input_parsing: 0.00622392 s
time to generate feat sapce: 1.78322 s
Projection time: 0.00070405 s
Time to get best features on rank : 9.89437e-05 s
Complete final combination/selection from all ranks: 0.000319004 s
Time for SIS: 0.00143313 s
Time for l0-norm: 0.00337982 s
Train RMSE: 0.00650172 AA^3; Test RMSE: 0.0326653 AA^3
c0 + a0 * ((lat_param-6.449216e-05)^3)
```
## Analyzing the Results
Once the calculations are done, two sets of output files are generated.
A list of all selected features is stored in: `feature_space/selected_features.txt` and every model used as a residual for SIS is stored in `models/`.
The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE.
The model with the lowest RMSE is stored in the lowest numbered file.
For example `train_dim_2_model_0.dat` will have the best 2D model, `train_dim_2_model_1.dat` would have the second best, etc.
For example `train_dim_1_model_0.dat` will have the best 1D model, `train_dim_1_model_1.dat` would have the second best, etc.
Each model file has a large header containing information about the features selected and model generated
```
# c0 + a0 * ((E_LUMO_A + E_HOMO_A) * cos(r_sigma)) + a1 * (cos(r_p_B) / (r_p_A^2))
# Property Label: $E_{RS} - E_{ZB}$; Unit of the Property: eV
# RMSE: 0.0825250613234847; Max AE: 0.310809109469674
# c0 + a0 * ((lat_param-6.449216e-05)^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.0065017240994181; Max AE: 0.00917629346981386
# Coefficients
# Task a0 a1 c0
# all , -3.923248458003308e-02, 1.367351808604120e+00, -2.453863254932724e-01,
# Task a0 c0
# diamond, 7.072338733511608e-01, -9.662083275737180e-03,
# rock_salt, 7.079543192237714e-01, -6.681973682629133e-02,
# Feature Rung, Units, and Expressions
# 0; 2; eV; 10|8|add|18|cos|mult; ((E_LUMO_A + E_HOMO_A) * cos(r_sigma)); $\left(\left(E_{LUMO, A} + E_{HOMO, A}\right) \left(\cos{ r_{sigma} }\right)\right)$; ((E_LUMO_A + E_HOMO_A) .* cos(r_sigma)); E_LUMO_A,E_HOMO_A,r_sigma
# 1; 2; Unitless; 15|cos|14|sq|div; (cos(r_p_B) / (r_p_A^2)); $\left(\frac{ \left(\cos{ r_{p, B} }\right) }{ \left(r_{p, A}^2\right) } \right)$; (cos(r_p_B) ./ (r_p_A).^2); r_p_B,r_p_A
# 0; 1; AA^3; 0|cb: 1.0000000000000e+00,-6.4492159105003e-05; ((lat_param-6.449216e-05)^3); $\left(\left(lat_{param}-6.449e-05\right)^3\right)$; (lat_param-6.449216e-05).^3; lat_param
# Number of Samples Per Task
# Task, n_mats_train
# all , 82
# Task , n_mats_train
# diamond, 3
# rock_salt, 3
```
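The header contains everything needed to evaluate the volume model by hand. As a minimal sketch, the Python below hard-codes the per-task coefficients and the shift of the parameterized cube operator from the header above and evaluates `c0 + a0 * ((lat_param-6.449216e-05)^3)`. The names `COEFFICIENTS`, `SHIFT`, and `predict_volume` are hypothetical and are not part of the Python bindings.

```python
# Coefficients copied from the header above: task -> (a0, c0).
COEFFICIENTS = {
    "diamond": (7.072338733511608e-01, -9.662083275737180e-03),
    "rock_salt": (7.079543192237714e-01, -6.681973682629133e-02),
}
SHIFT = 6.449216e-05  # bias term inside the parameterized cube operator


def predict_volume(task, lat_param):
    """Evaluate c0 + a0 * ((lat_param - SHIFT)^3) for one sample."""
    a0, c0 = COEFFICIENTS[task]
    return c0 + a0 * (lat_param - SHIFT) ** 3


# Si from the example data file: lat_param = 3.866 AA, reported volume 40.86 AA^3.
print(predict_volume("diamond", 3.866))
```

The estimate for Si should lie close to the tabulated 40.86 AA^3, in line with the train RMSE reported in the header.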
After this header the following data is stored in the file:
```
# Sample ID , Property Value , Property Value (EST) , Feature 0 Value , Feature 1 Value
# Sample ID , Property Value , Property Value (EST) , Feature 0 Value
```
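A rough sketch of reading these rows without the Python bindings is given below. It assumes that every header line starts with `#` and that the data rows are comma separated, as the column line above suggests. The function `read_model_rows` is a hypothetical helper, and the numeric values in the example row are placeholders rather than real output.

```python
def read_model_rows(lines):
    """Collect (sample_id, [numeric values]) from the non-comment lines."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = [field.strip() for field in stripped.split(",")]
        sample_id, values = fields[0], [float(v) for v in fields[1:] if v]
        rows.append((sample_id, values))
    return rows


# Illustrative lines only; the numeric values are placeholders, not real output.
example_lines = [
    "# Sample ID , Property Value , Property Value (EST) , Feature 0 Value",
    "Si , 40.86 , 40.85 , 57.78",
]
print(read_model_rows(example_lines))
```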
With this file the model can be perfectly reconstructed using the Python bindings.
### Using the Python Library
## Using the Python Library
To see how the Python interface can be used, refer to the [tutorials](../tutorial).
If you get an error about not being able to load MKL libraries, you may have to run `conda install numpy` to get proper linking.