C++ Implementation of SISSO
Overview
This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface. Future work will expand the Python interface to include more postprocessing analysis tools.
Installation
The package uses a CMake build system and is compatible with all versions of the C++ standard from C++14 onward.
Prerequisites
To install sisso++ the following packages are needed:
- CMake version 3.10 and up
- A C++ compiler (compatible with C++14 and later)
- BLAS/LAPACK (architecture-specific implementations such as MKL or ACML are recommended)
- MPI
- Boost with the following libraries compiled: mpi, serialization, system, and filesystem
To build the optional python bindings the following are also needed:
- Python 3 interpreter
- Boost with the python and numpy libraries compiled
Install sisso++
sisso++ is installed using a CMake build system, with some basic configuration files stored in cmake/toolchains/. As an example, here is an initial_config.cmake file used to build sisso++ and the Python bindings using the GNU compiler.
###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O2" CACHE STRING "")
#################
# Feature Flags #
#################
set(USE_PYTHON ON CACHE BOOL "")
set(EXTERNAL_BOOST OFF CACHE BOOL "")
Here the -O2 flag controls compiler optimizations; it is recommended to keep this at -O2 or -O3, but it can be changed to match compiler requirements.
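If a different compiler is used, only the basic flags need to change. For example, a configuration for Clang might replace the basic flags with something like the following sketch (compiler name and flags here are illustrative, not tested defaults):

```cmake
# Hypothetical basic flags for a Clang build; adjust to your toolchain.
set(CMAKE_CXX_COMPILER clang++ CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O3" CACHE STRING "")
```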
When building Boost from source (EXTERNAL_BOOST OFF), the number of processes used to build Boost may be set using the BOOST_BUILD_N_PROCS flag in CMake. For example, to build Boost using 4 processes, the following flag should be included in the initial_config.cmake file:
set(BOOST_BUILD_N_PROCS 4 CACHE STRING "")
This flag has no effect when linking against an external Boost installation, i.e., EXTERNAL_BOOST ON.
To install sisso++ run the following commands (this assumes the GNU compiler and MKL are used; if you are using a different compiler/BLAS library, change the flags accordingly):
export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost
cd ~/sisso++/main directory
mkdir build/;
cd build/;
cmake -C initial_config.cmake ../
make install
Once all the commands have run, the sisso++ executable should be in the ~/sisso++/main directory/bin/ directory.
Install _sisso
To install the Python bindings, first ensure your Python path matches the path used to configure Boost, and then repeat the same commands as above but set USE_PYTHON in initial_config.cmake to ON.
Once installed you should have access to the Python interface via import cpp_sisso.
Running the code
Input files
To see a sample of the input files, look in ~/sisso++/main directory/test/exec_test.
To use the code two files are necessary: sisso.json and data.csv.
data.csv stores all the data for the calculation in a csv file.
The first row in the file corresponds to the feature metadata, with the format expression (Unit).
For example, if one of the primary features used in the set is the lattice constant of a material, the header would be lat_param (AA).
The first column of the file contains sample labels for all of the other rows and is not used in the fit.
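For illustration only, a minimal data.csv with two primary features might look like the following (all column names and values here are hypothetical; the property and task columns must match property_key and task_key in sisso.json):

```
Sample,prop (eV),Task,lat_param (AA),E_HOMO (eV)
mat_1,1.20,task_a,4.05,-5.40
mat_2,0.85,task_a,3.61,-4.95
mat_3,0.42,task_b,5.43,-6.10
```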
The input parameters are stored in sisso.json; here is a list of all possible variables that can be stored in sisso.json:
data_file
The name of the csv file where the data is stored. (Default: "data.csv")
property_key
The expression of the column where the property to be modeled is stored. (Default: "prop")
task_key
The expression of the column where the task identification is stored. (Default: "Task")
opset
The set of operators to use to combine the features during feature creation. (If empty, all available operators are used)
calc_type
The type of calculation to run: either regression or classification
desc_dim
The maximum dimension of the model to be created
n_sis_select
The number of features that SIS selects over each iteration
max_rung
The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
n_residual
Number of residuals used to select the next subset of materials in the iteration. (Affects SIS after the 1D model) (Default: 1)
n_models_store
Number of models to output as file for each dimension (Default: n_residual)
n_rung_store
The number of rungs where all of the training/testing data of the materials are stored in memory. (Default: max_rung - 1)
n_rung_generate
The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)
min_abs_feat_val
Minimum absolute value allowed in the feature's training data (Default: 1e-50)
max_abs_feat_val
Maximum absolute value allowed in the feature's training data (Default: 1e50)
leave_out_inds
The indices from the data set to use as the test set. If empty and leave_out_frac > 0, the selection will be random
leave_out_frac
Fraction (in decimal form) of the data to use as a test set. (Default: 0.0 if leave_out_inds is empty, otherwise len(leave_out_inds) / number of rows in the data file)
fix_intercept
If true set the intercept to 0.0 for all Regression models (Default: false)
This does not work for classification
max_feat_cross_correlation
The maximum Pearson correlation allowed between selected features (Default: 1.0)
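Putting these variables together, a minimal sisso.json for a regression calculation might look like the following sketch (all values are illustrative; the exact operator strings accepted by opset should be checked against the code):

```json
{
    "data_file": "data.csv",
    "property_key": "prop",
    "task_key": "Task",
    "opset": ["add", "sub", "mult", "div", "sq", "inv"],
    "calc_type": "regression",
    "desc_dim": 3,
    "n_sis_select": 100,
    "max_rung": 2,
    "n_residual": 5,
    "leave_out_frac": 0.1,
    "fix_intercept": false
}
```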
Perform the Calculation
Once the input files are made the code can be run using the following command
mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json
Analyzing the Results
Once the calculations are done, two sets of output files are generated.
A list of all selected features is stored in feature_space/selected_features.txt, and every model used as a residual for SIS is stored in the models directory.
The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE. The model with the lowest RMSE is stored in the lowest number file.
For example, train_dim_3_model_0.dat will have the best 3D model, train_dim_3_model_1.dat would have the second best, etc.
Each model file has a large header containing information about the selected features and the generated model:
# c0 + a0 * [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])] + a1 * [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])] + a2 * [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# RMSE: 0.0779291679452223; Max AE: 0.290810937048465
# Coefficients
# Task; a0 a1 a2 c0
# 0, 7.174549961742731e+00, 8.687856036798111e-02, 2.468463139364077e-01, -3.995345676823570e-02,
# Feature Rung, Units, and Expressions
# 0, 2, 1 / eV, [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])]
# 1, 2, eV, [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])]
# 2, 2, 1 / AA^2 * eV, [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# Number of Samples Per Task
# Task; n_mats_train
# 0, 78
After this header the following data is stored in the file:
#Property Value Property Value (EST) Feature 0 Value Feature 1 Value Feature 2 Value
With this file the model can be perfectly recreated using the Python bindings.
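As an illustration of the header layout (a hypothetical parsing sketch, not the official API; use the cpp_sisso bindings to actually recreate models), the per-task coefficients can be read out of a model file header like this:

```python
def parse_coefficients(lines):
    """Return a dict mapping task id -> [a0, ..., c0] from a model file header.

    Assumes the "# Coefficients" section layout shown in the example above.
    """
    coefs = {}
    in_coefs = False
    for line in lines:
        if not in_coefs:
            # Skip everything until the coefficients section starts.
            if line.startswith("# Coefficients"):
                in_coefs = True
            continue
        if line.startswith("# Task"):
            continue  # column-label line ("# Task; a0 a1 a2 c0")
        fields = [f.strip() for f in line.lstrip("# ").split(",") if f.strip()]
        try:
            task = int(fields[0])
        except (ValueError, IndexError):
            break  # reached the next header section
        coefs[task] = [float(f) for f in fields[1:]]
    return coefs


# Abbreviated version of the header shown above.
header = """\
# c0 + a0 * f0 + a1 * f1 + a2 * f2
# RMSE: 0.0779291679452223; Max AE: 0.290810937048465
# Coefficients
# Task; a0 a1 a2 c0
# 0, 7.174549961742731e+00, 8.687856036798111e-02, 2.468463139364077e-01, -3.995345676823570e-02,
# Feature Rung, Units, and Expressions
"""

coefs = parse_coefficients(header.splitlines())
print(coefs[0])  # a0, a1, a2, c0 for task 0
```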
Using the Python Library
To see how the Python interface can be used, look at examples/python_interface_demo.ipynb.
If you get an error about not being able to load the MKL libraries, you may have to run conda install numpy to get proper linking.