@@ -5,192 +5,4 @@ C++ Implementation of SISSO with python bindings
This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface.
Future work will expand the Python interface to include more postprocessing analysis tools.
## Installation
The package uses a CMake build system and is compatible with C++14 and all later versions of the C++ standard.
### Prerequisites
To install `sisso++`, the following packages are needed:
- CMake version 3.10 and up
- A C++ compiler (compatible with C++14 and later)
- BLAS/LAPACK (architecture-specific implementations such as MKL or ACML are recommended)
- MPI
- Boost with the following libraries compiled (mpi, serialization, system, and filesystem)
To build the optional Python bindings, the following are also needed:
- Python 3 interpreter
- Boost with the python and numpy libraries compiled
### Install `sisso++`
`sisso++` is installed using a CMake build system, with some basic configuration files stored in `cmake/toolchains/`.
As an example, here is an `initial_config.cmake` file used to build `sisso++` and the Python bindings with the GNU compiler.
```
###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O2" CACHE STRING "")
#################
# Feature Flags #
#################
set(USE_PYTHON ON CACHE BOOL "")
set(EXTERNAL_BOOST OFF CACHE BOOL "")
```
Here the `-O2` flag sets the optimization level. Keeping it at `-O2` or `-O3` is recommended, but it can be changed to match compiler requirements.
When building Boost from source (`EXTERNAL_BOOST OFF`) the number of processes used when building Boost may be set using the
`BOOST_BUILD_N_PROCS` flag in CMake. For example, to build Boost using 4 processes, the following flag should be included in the
`initial_config.cmake` file:
```
set(BOOST_BUILD_N_PROCS 4 CACHE STRING "")
```
This flag will have no effect when linking against external boost, i.e. `EXTERNAL_BOOST ON`.
To install `sisso++`, run the following commands. They assume the GNU compiler and MKL are used; if you are using a different compiler or BLAS library, change the environment variables and flags accordingly.
```
export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost
cd ~/sisso++/main directory
mkdir build/;
cd build/;
cmake -C initial_config.cmake ../
make install
```
Once all the commands have run, `sisso++` should be in the `~/cpp_sisso/main directory/bin/` directory.
### Install `_sisso`
To install the Python bindings, first ensure your Python path matches the path used to configure `boost`, then repeat the same commands as above with `USE_PYTHON` set to `ON` in `initial_config.cmake`.
Once installed, you should have access to the Python interface via `import cpp_sisso`.
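A quick way to confirm that the interpreter can see the bindings, without actually importing them, is to ask Python whether the module can be located. This is a minimal sketch: the helper name `bindings_available` is hypothetical, and only the module name `cpp_sisso` comes from the text above.

```python
import importlib.util


def bindings_available(module_name="cpp_sisso"):
    """Return True if `module_name` can be located on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report the result for the module name given in this README.
if bindings_available():
    print("cpp_sisso found")
else:
    print("cpp_sisso not found; check the Python path and the install location")
```

If the module is not found, the most likely cause is that the interpreter in use differs from the one Boost was configured against.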
## Running the code
### Input files
To see a sample of the input files, look in `~/sisso++/main directory/test/exec_test`.
To use the code two files are necessary: `sisso.json` and `data.csv`.
`data.csv` stores all the data for the calculation in a `csv` file.
The first row in the file holds the feature metadata, with each column header in the format `expression (Unit)`.
For example, if one of the primary features used in the set is the lattice constant of a material, the header would be `lat_param (AA)`.
The first column of the file contains the sample labels for all of the other rows and is not used in the calculation.
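To make the layout concrete, the sketch below builds a tiny `data.csv`-style table in memory and splits each header cell into its expression and unit. The sample labels, the numeric values, the `prop (eV)` unit, and the `parse_header` helper are all hypothetical; only the `expression (Unit)` header format and the unused label column come from the description above. The parser assumes simple expressions without parentheses of their own.

```python
import csv
import io
import re

# Hypothetical two-sample data set; only the header layout follows the README.
rows = [
    ["sample", "prop (eV)", "lat_param (AA)"],
    ["sample_1", "1.0", "4.0"],
    ["sample_2", "2.0", "5.0"],
]

# Serialize the rows the same way they would appear in data.csv.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()

# "expression (Unit)" with the unit part optional.
HEADER_RE = re.compile(r"^\s*(?P<expr>.+?)\s*(?:\((?P<unit>[^()]*)\))?\s*$")


def parse_header(cell):
    """Split a header cell into (expression, unit); unit is None if absent."""
    match = HEADER_RE.match(cell)
    unit = match.group("unit")
    return match.group("expr"), (unit.strip() if unit is not None else None)


reader = csv.reader(io.StringIO(csv_text))
header = next(reader)
# Skip the first column: it only holds the sample labels.
columns = [parse_header(cell) for cell in header[1:]]
print(columns)
```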
The input parameters are stored in `sisso.json`. Here is a list of all possible variables that can be stored in `sisso.json`:
#### `data_file`
The name of the csv file where the data is stored. (Default: "data.csv")
#### `property_key`
The expression of the column where the property to be modeled is stored. (Default: "prop")
#### `task_key`
The expression of the column where the task identification is stored. (Default: "Task")
#### `opset`
The set of operators to use to combine the features during feature creation. (If empty, use all available operators)
#### `calc_type`
The type of calculation to run: either regression or classification.
#### `desc_dim`
The maximum dimension of the model to be created
#### `n_sis_select`
The number of features that SIS selects over each iteration
#### `max_rung`
The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
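One way to read this definition: if the height of an expression tree is counted in nodes, a primary feature is a single leaf (height 1, rung 0), and each operator applied on top adds at most one rung. The sketch below illustrates that reading only. The nested-tuple representation, the operator names `add` and `div`, and the feature name `radius` are hypothetical placeholders, not part of the package.

```python
def tree_height(node):
    """Height counted in nodes: a leaf (a primary feature name) has height 1."""
    if isinstance(node, str):
        return 1
    _operator, *children = node
    return 1 + max(tree_height(child) for child in children)


def rung(node):
    """Rung of a feature: height of its expression tree minus one."""
    return tree_height(node) - 1


primary = "lat_param"                   # a primary feature
rung_1 = ("add", "lat_param", "radius")  # one operator applied to primaries
rung_2 = ("div", rung_1, "radius")       # an operator applied to a rung-1 feature

print(rung(primary), rung(rung_1), rung(rung_2))
```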
#### `n_residual`
Number of residuals to use to select the next subset of materials in the iteration. (Affects SIS after the 1D model) (Default: 1)
#### `n_models_store`
Number of models to write to file for each dimension. (Default: `n_residual`)
#### `n_rung_store`
The number of rungs where all of the training/testing data of the materials are stored in memory. (Default: `max_rung` - 1)
#### `n_rung_generate`
The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)
#### `min_abs_feat_val`
Minimum absolute value allowed in the feature's training data (Default: 1e-50)
#### `max_abs_feat_val`
Maximum absolute value allowed in the feature's training data (Default: 1e50)
#### `leave_out_inds`
The indices from the data set to use as the test set. If empty and `leave_out_frac > 0`, the selection will be random.
#### `leave_out_frac`
Fraction (in decimal form) of the data to use as a test set. (Default: 0.0 if `leave_out_inds` is empty, otherwise `len(leave_out_inds) / number of rows in data file`)
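The interplay between `leave_out_inds` and `leave_out_frac` can be sketched as follows. This is an illustration of the defaults described above, not the package's own code: the function name `resolve_test_split`, the `seed` argument, the rounding of the test-set size, and the choice to let explicit indices take priority are all assumptions.

```python
import random


def resolve_test_split(n_rows, leave_out_inds=None, leave_out_frac=0.0, seed=0):
    """Return (test_indices, effective_fraction) following the stated defaults."""
    leave_out_inds = list(leave_out_inds or [])
    if leave_out_inds:
        # Explicit indices: the fraction defaults to len(inds) / number of rows.
        return sorted(leave_out_inds), len(leave_out_inds) / n_rows
    if leave_out_frac > 0.0:
        # No indices given: pick the test samples at random.
        n_test = round(leave_out_frac * n_rows)  # rounding rule is an assumption
        rng = random.Random(seed)
        return sorted(rng.sample(range(n_rows), n_test)), leave_out_frac
    # Neither given: no test set.
    return [], 0.0


print(resolve_test_split(10, leave_out_inds=[1, 4]))
print(resolve_test_split(10))
```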
#### `fix_intercept`
If true set the intercept to 0.0 for all Regression models (Default: false)
This does not work for classification
#### `max_feat_cross_correlation`
The maximum Pearson correlation allowed between selected features (Default: 1.0)
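Putting several of these keys together, a `sisso.json` file could be generated from Python as sketched below. The key names and the stated defaults come from the list above; every other value is illustrative only, and the exact spelling of the `calc_type` value is an assumption. `opset` is omitted because the operator names are not listed in this section.

```python
import json

# Illustrative settings; keys follow the list above, values are examples only.
config = {
    "data_file": "data.csv",
    "property_key": "prop",
    "task_key": "Task",
    "calc_type": "regression",  # assumed spelling of the regression option
    "desc_dim": 2,
    "n_sis_select": 10,
    "max_rung": 2,
    "n_residual": 1,
    "leave_out_frac": 0.0,
    "fix_intercept": False,
    "max_feat_cross_correlation": 1.0,
}

json_text = json.dumps(config, indent=4)
print(json_text)

# To write the file next to data.csv:
# with open("sisso.json", "w") as handle:
#     handle.write(json_text)
```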
### Perform the Calculation
Once the input files are made the code can be run using the following command
Once the calculations are done, two sets of output files are generated.
A list of all selected features is stored in `feature_space/selected_features.txt`, and every model used as a residual for SIS is stored in `models`.
The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE. The model with the lowest RMSE is stored in the lowest number file.
For example, `train_dim_3_model_0.dat` will have the best 3D model, `train_dim_3_model_1.dat` the second best, etc.
Each model file has a large header containing information about the features selected and the model generated.
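The naming scheme for the training files can be handled with a small helper when collecting results. This is a minimal sketch based only on the `train_dim_<dimension>_model_<rank>.dat` pattern shown above; the function names are hypothetical, and the test-file names are not covered because they are not spelled out here.

```python
import re

MODEL_FILE_RE = re.compile(r"^train_dim_(\d+)_model_(\d+)\.dat$")


def model_file_name(dim, rank):
    """File name of the model with the given dimension and RMSE rank (0 = best)."""
    return f"train_dim_{dim}_model_{rank}.dat"


def sort_model_files(names):
    """Sort training files by (dimension, rank), ignoring unrelated names."""
    keyed = []
    for name in names:
        match = MODEL_FILE_RE.match(name)
        if match:
            keyed.append((int(match.group(1)), int(match.group(2)), name))
    return [name for _, _, name in sorted(keyed)]


print(model_file_name(3, 0))
print(sort_model_files(["train_dim_3_model_10.dat", "train_dim_3_model_2.dat"]))
```

Sorting on the parsed integers rather than on the raw strings keeps `model_10` after `model_2`.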