Commit dd5b6a95 authored by Thomas Purcell

Update README.md

parent a0a13662
@@ -5,192 +5,4 @@ C++ Implementation of SISSO with python bindings
This package provides a C++ implementation of SISSO with built in Python bindings for an efficient python interface.
Future work will expand the Python interface to include more postprocessing analysis tools.
## Installation
The package uses a CMake build system and is compatible with all versions of the C++ standard from C++14 onward.
### Prerequisites
To install `sisso++`, the following packages are needed:
- CMake version 3.10 and up
- A C++ compiler (compatible with C++14 and later)
- BLAS/LAPACK (Architecture specific compilations like MKL or ACML are recommended)
- MPI
- Boost with the following libraries compiled (mpi, serialization, system, and filesystem)
To build the optional python bindings the following are also needed:
- Python 3 interpreter
- Boost with the python and numpy libraries compiled
### Install `sisso++`
`sisso++` is installed using a CMake build system, with some basic configuration files stored in `cmake/toolchains/`.
As an example, here is an `initial_config.cmake` file used to build `sisso++` and the Python bindings with the GNU compiler.
```
###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O2" CACHE STRING "")
#################
# Feature Flags #
#################
set(USE_PYTHON ON CACHE BOOL "")
set(EXTERNAL_BOOST OFF CACHE BOOL "")
```
Here the `-O2` flag sets the optimization level. Keeping it at `-O2` or `-O3` is recommended, but it can be changed to match compiler requirements.
When building Boost from source (`EXTERNAL_BOOST OFF`) the number of processes used when building Boost may be set using the
`BOOST_BUILD_N_PROCS` flag in CMake. For example, to build Boost using 4 processes, the following flag should be included in the
`initial_config.cmake` file:
```
set(BOOST_BUILD_N_PROCS 4 CACHE STRING "")
```
This flag has no effect when linking against an external Boost, i.e. `EXTERNAL_BOOST ON`.
To install `sisso++`, run the following commands (they assume the GNU compiler and MKL; for a different compiler or BLAS library, adjust the variables accordingly):
```
export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost
cd ~/sisso++/main directory
mkdir build/;
cd build/;
cmake -C initial_config.cmake ../
make install
```
Once all the commands have run, `sisso++` should be in the `~/sisso++/main directory/bin/` directory.
### Install `_sisso`
To install the Python bindings, first ensure that your Python path matches the one used to configure Boost, then repeat the commands above with `USE_PYTHON` set to `ON` in `initial_config.cmake`.
Once installed, the Python interface should be available via `import cpp_sisso`.
## Running the code
### Input files
To see a sample of the input files, look in `~/sisso++/main directory/test/exec_test`.
To use the code two files are necessary: `sisso.json` and `data.csv`.
`data.csv` stores all the data for the calculation in a `csv` file.
The first row in the file holds the feature metadata in the format `expression (Unit)`.
For example, if one of the primary features is the lattice constant of a material, its header would be `lat_param (AA)`.
The first column of the file contains the sample labels for the remaining rows and is not used in the calculation.
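As a minimal sketch of that layout, the snippet below writes a tiny table and splits each header entry into its expression and unit using only the Python standard library. The column names, units, and values are made up for illustration, and the regular expression is just one plausible way to read the `expression (Unit)` format, not the parser used by `sisso++`.

```python
import csv
import io
import re

# Hypothetical header entries in the "expression (Unit)" format described above.
HEADER = ["sample", "prop (eV)", "lat_param (AA)"]
# Made-up rows: a sample label followed by one value per column.
ROWS = [
    ["mat_1", 1.25, 4.05],
    ["mat_2", 0.80, 5.43],
]


def split_header(entry):
    """Split 'expression (Unit)' into (expression, unit); unit is '' if absent."""
    match = re.fullmatch(r"\s*(.+?)\s*(?:\((.*)\))?\s*", entry)
    return match.group(1), (match.group(2) or "").strip()


def write_data_csv(stream):
    """Write the header row and the data rows as comma-separated text."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(ROWS)


# An in-memory buffer stands in for data.csv so the sketch leaves no files behind.
buffer = io.StringIO()
write_data_csv(buffer)
first_line = buffer.getvalue().splitlines()[0]
columns = [split_header(entry) for entry in first_line.split(",")]
print(columns)
```

Replacing the buffer with `open("data.csv", "w", newline="")` would write the same content to disk.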
The input parameters are stored in `sisso.json`. The following variables can be stored in `sisso.json`:
#### `data_file`
The name of the csv file where the data is stored. (Default: "data.csv")
#### `property_key`
The expression of the column where the property to be modeled is stored. (Default: "prop")
#### `task_key`
The expression of the column where the task identification is stored. (Default: "Task")
#### `opset`
The set of operators used to combine the features during feature creation. (If empty, all available operators are used)
#### `calc_type`
The type of calculation to run: either regression or classification
#### `desc_dim`
The maximum dimension of the model to be created
#### `n_sis_select`
The number of features that SIS selects over each iteration
#### `max_rung`
The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
#### `n_residual`
Number of residuals used to select the next subset of materials in the iteration. (Affects SIS after the 1D model) (Default: 1)
#### `n_models_store`
Number of models to write to file for each dimension (Default: `n_residual`)
#### `n_rung_store`
The number of rungs where all of the training/testing data of the materials are stored in memory. (Default: `max_rung` - 1)
#### `n_rung_generate`
The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)
#### `min_abs_feat_val`
Minimum absolute value allowed in the feature's training data (Default: 1e-50)
#### `max_abs_feat_val`
Maximum absolute value allowed in the feature's training data (Default: 1e50)
#### `leave_out_inds`
The indices from the data set to use as the test set. If empty and `leave_out_frac > 0`, the selection will be random
#### `leave_out_frac`
Fraction (in decimal form) of the data to use as a test set (Default: 0.0 if `leave_out_inds` is empty, otherwise `len(leave_out_inds) / number of rows in the data file`)
#### `fix_intercept`
If true, set the intercept to 0.0 for all regression models (Default: false).
This does not work for classification.
#### `max_feat_cross_correlation`
The maximum Pearson correlation allowed between selected features (Default: 1.0)
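To tie the parameters together, here is a minimal sketch that assembles a `sisso.json` with Python's `json` module. Every value is an illustrative placeholder rather than a recommended setting. The spelling `"regression"` and the use of an empty list for `opset` are assumptions about the accepted values. The helper `default_leave_out_frac` is a hypothetical name that only mirrors the default described under `leave_out_frac`.

```python
import json

# All values are illustrative placeholders, not recommended settings.
settings = {
    "data_file": "data.csv",
    "property_key": "prop",
    "task_key": "Task",
    "opset": [],                # assumed: an empty list selects every operator
    "calc_type": "regression",  # assumed spelling of the option value
    "desc_dim": 2,
    "n_sis_select": 10,
    "max_rung": 2,
    "n_residual": 1,
    "leave_out_inds": [],
    "leave_out_frac": 0.0,
    "fix_intercept": False,
    "max_feat_cross_correlation": 1.0,
}


def default_leave_out_frac(leave_out_inds, n_rows):
    """Hypothetical helper mirroring the documented default for leave_out_frac."""
    if not leave_out_inds:
        return 0.0
    return len(leave_out_inds) / n_rows


# Serialize the settings; saving this text as sisso.json gives the input file.
text = json.dumps(settings, indent=4)
print(text)
```

Saving `text` as `sisso.json` next to `data.csv` would give a pair of input files in the shape described above. The parameters left out of the dictionary fall back to the defaults listed for them.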
### Perform the Calculation
Once the input files are made, the code can be run using the following command:
```
mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json
```
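When the run is driven from a script, the same command can be assembled as an argument list, which sidesteps shell quoting problems if the installation path contains spaces. The sketch below uses only the standard library. The path is a placeholder, and `build_command` is a hypothetical helper name, not part of `sisso++`.

```python
import shlex
import subprocess


def build_command(executable, n_procs=2, input_file="sisso.json"):
    """Return the argument list for the mpiexec call shown above."""
    return ["mpiexec", "-n", str(n_procs), executable, input_file]


# Placeholder path: substitute the location of your own sisso++ binary.
command = build_command("/path/to/sisso++/bin/sisso++")
print(shlex.join(command))

# To launch the calculation, uncomment the next line:
# subprocess.run(command, check=True)
```

With `check=True`, `subprocess.run` raises `CalledProcessError` if the program exits with a non-zero status, so a failed run is not silently ignored.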
### Analyzing the Results
Once the calculations are done, two sets of output files are generated.
A list of all selected features is stored in `feature_space/selected_features.txt`, and every model used as a residual for SIS is stored in `models`.
The model output files are split into train and test files, sorted by the dimensionality of the model and by the train RMSE. The model with the lowest RMSE is stored in the lowest-numbered file.
For example, `train_dim_3_model_0.dat` holds the best 3D model, `train_dim_3_model_1.dat` the second best, etc.
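The naming scheme makes it easy to pick out the best model per dimension from a directory listing. The following sketch infers the pattern from the example names above. It assumes that test files use the same scheme with a `test_` prefix, and both function names are hypothetical.

```python
import re

# Pattern inferred from the example names above; "test" files are assumed
# to follow the same scheme as "train" files.
MODEL_FILE = re.compile(r"(train|test)_dim_(\d+)_model_(\d+)\.dat")


def parse_model_filename(name):
    """Return (split, dimension, model index), or None if the name does not match."""
    match = MODEL_FILE.fullmatch(name)
    if match is None:
        return None
    split, dim, rank = match.groups()
    return split, int(dim), int(rank)


def best_train_models(names):
    """Map each dimension to the train file with the lowest model index."""
    best = {}
    for name in names:
        parsed = parse_model_filename(name)
        if parsed is None or parsed[0] != "train":
            continue
        _, dim, rank = parsed
        if dim not in best or rank < best[dim][0]:
            best[dim] = (rank, name)
    return {dim: name for dim, (rank, name) in sorted(best.items())}


names = [
    "train_dim_3_model_1.dat",
    "train_dim_3_model_0.dat",
    "train_dim_1_model_0.dat",
    "notes.txt",
]
print(best_train_models(names))
```

In practice `names` could come from `os.listdir("models")`, assuming the directory layout described above.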
Each model file has a large header containing information about the selected features and the generated model:
```
# c0 + a0 * [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])] + a1 * [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])] + a2 * [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# RMSE: 0.0779291679452223; Max AE: 0.290810937048465
# Coefficients
# Task; a0 a1 a2 c0
# 0, 7.174549961742731e+00, 8.687856036798111e-02, 2.468463139364077e-01, -3.995345676823570e-02,
# Feature Rung, Units, and Expressions
# 0, 2, 1 / eV, [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])]
# 1, 2, eV, [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])]
# 2, 2, 1 / AA^2 * eV, [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# Number of Samples Per Task
# Task; n_mats_train
# 0, 78
```
After this header the following data is stored in the file:
```
#Property Value Property Value (EST) Feature 0 Value Feature 1 Value Feature 2 Value
```
With this file, the model can be perfectly recreated using the Python bindings.
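Independently of the bindings, whose API is not described here, the header is plain text and can be read with a few lines of Python. The sketch below is only an illustration under the assumption that the header always follows the layout of the example above. It extracts the RMSE, the maximum absolute error, and the per-task coefficients, then evaluates the linear model `c0 + a0 * f0 + a1 * f1 + ...` for one sample. The function names and the feature values passed to `evaluate` are hypothetical.

```python
import re

# Header lines copied from the example above (model expression, feature
# expressions, and sample counts omitted).
HEADER = """\
# RMSE: 0.0779291679452223; Max AE: 0.290810937048465
# Coefficients
# Task; a0 a1 a2 c0
# 0, 7.174549961742731e+00, 8.687856036798111e-02, 2.468463139364077e-01, -3.995345676823570e-02,
# Feature Rung, Units, and Expressions
"""


def parse_header(text):
    """Return (rmse, max_ae, {task: {coefficient name: value}})."""
    lines = [line.lstrip("#").strip() for line in text.splitlines()]
    errors = re.search(r"RMSE:\s*(\S+);\s*Max AE:\s*(\S+)", text)
    start = lines.index("Coefficients")
    # The line after "Coefficients" names the columns: "Task; a0 a1 a2 c0".
    names = lines[start + 1].split(";")[1].split()
    coefs = {}
    for line in lines[start + 2:]:
        fields = [field.strip() for field in line.split(",") if field.strip()]
        try:
            values = [float(field) for field in fields[1:]]
        except ValueError:
            break  # reached the next header section
        if len(values) != len(names):
            break
        coefs[fields[0]] = dict(zip(names, values))
    return float(errors.group(1)), float(errors.group(2)), coefs


def evaluate(task_coefs, features):
    """Evaluate c0 + sum_i a_i * feature_i; c0 is taken as 0.0 if it is absent."""
    total = task_coefs.get("c0", 0.0)
    for i, value in enumerate(features):
        total += task_coefs[f"a{i}"] * value
    return total


rmse, max_ae, coefs = parse_header(HEADER)
print(rmse, max_ae, coefs["0"])
# Hypothetical feature values for a single sample.
print(evaluate(coefs["0"], [0.01, 0.5, 0.2]))
```

Applying `evaluate` to the feature columns of a data row should reproduce that row's estimated property value, provided the columns are ordered as in the header.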
### Using the Python Library
To see how the Python interface can be used, look at `examples/python_interface_demo.ipynb`.
If you get an error about not being able to load MKL libraries, you may have to run `conda install numpy` to get proper linking.
For a more detailed explanation, please visit our documentation at https://tpurcell.pages.mpcdf.de/cpp_sisso