C++ Implementation of SISSO with python bindings
Overview
This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient python interface. Future work will expand the python interface to include more postprocessing analysis tools.
For a more detailed explanation please visit our documentation here
Installation
The package uses a CMake build system and is compatible with all versions of the C++ standard library after C++14. You can access the code here
Prerequisites
To install SISSO++ the following packages are needed:
- CMake version 3.10 and up
- A C++ compiler (compatible with C++ 14 and later, e.g. gcc 5.0+ or icpc 17.0+)
- BLAS/LAPACK
- MPI
Additionally, the following packages needed by SISSO++ will be installed if they are not already installed or cannot be found in $PATH:
- Boost (mpi, serialization, system, filesystem, and python libraries)
- GTest
- Coin-Clp
- NLopt
- {fmt} (Used for the C++ 20 std::format library)
To build and use the optional python bindings, python 3 with the numpy, pandas, scipy, seaborn, scikit-learn, and toml packages is also needed. The python environment can be set up using anaconda with
conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml
Installing SISSO++
SISSO++ is installed using a CMake build system, with sample configuration files located in cmake/toolchains/.
For example, here is the initial_config.cmake file used to build SISSO++ and the python bindings with the GNU compiler.
###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_C_COMPILER gcc CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O3 -march=native" CACHE STRING "")
set(CMAKE_C_FLAGS "-O3 -march=native" CACHE STRING "")
#################
# Feature Flags #
#################
set(BUILD_PYTHON ON CACHE BOOL "")
set(BUILD_PARAMS ON CACHE BOOL "")
Because we want to build the python bindings in this example, and assuming there is no preexisting python environment, we first need to create and activate one. For this example we will use conda, but standard python installations or virtual environments are also possible.
conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml
conda activate sissopp_env
Note: if you are using a python environment with a local MKL installation, make sure the versions of all accessible MKL libraries are the same.
Now we can install SISSO++ using initial_config.cmake and the following commands (this assumes the GNU compiler and MKL are used; if you are using a different compiler/BLAS library, change the flags to the relevant directories):
export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost
cd ~/sissopp/
mkdir build/
cd build/
cmake -C initial_config.cmake ../
make
make install
Once all the commands have run, the sisso++ executable should be in the ~/sisso++/main directory/bin/ directory.
Installing the Python Bindings Without Administrative Privileges
To install the python bindings on a machine where you do not have write privileges to the default python install directory (typical on most HPC systems), you must set PYTHON_INSTDIR to a directory where you do have write access. This can be done by modifying the cmake command to:
cmake -C initial_config.cmake -DPYTHON_INSTDIR=/path/to/python/install/directory/ ../
A standard local python installation directory for pip and conda is $HOME/.local/lib/python3.X/site-packages/, where X is the minor version of python.
If you do set this variable, it is important that the directory is also inside your PYTHONPATH environment variable. This can be updated with
export PYTHONPATH=$PYTHONPATH:/path/to/python/install/directory/
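If you are unsure which directory to pass, the per-user site-packages path mentioned above can be queried directly from python; a small standard-library sketch:

```python
import site
import sysconfig

# Per-user install directory, typically $HOME/.local/lib/python3.X/site-packages/
print(site.getusersitepackages())

# Install directory of the active (e.g. conda) environment
print(sysconfig.get_paths()["purelib"])
```

Either path can be used as the value of PYTHON_INSTDIR, as long as the same path is also added to PYTHONPATH.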
If you are using anaconda, then this can be avoided by creating a new conda environment as detailed above.
You will need to set this variable and recompile the code (remove all build files first) if you see this error:
CMake Error at src/cmake_install.cmake:114 (file):
file cannot create directory:
${PYTHON_BASE_DIR}/lib/python3.X/site-packages/sissopp.
Maybe need administrative privileges.
Call Stack (most recent call first):
cmake_install.cmake:42 (include)
Install the Binary Without the Python Bindings
To install only the SISSO++ executable, repeat the same commands as above but set BUILD_PYTHON in initial_config.cmake to OFF.
Running the Code
Input Files
To see samples of the input files, look in ~/sisso++/main directory/tests/exec_test. In this directory there are multiple subdirectories for different types of calculations; the default/ directory is the most common application.
To use the code two files are necessary: sisso.json and data.csv. data.csv stores all of the data for the calculation in a csv file, and sisso.json stores the input parameters. Both files are described in detail below.
data.csv
The data file contains all relevant data and metadata to describe the individual features and samples.
The first row of the file corresponds to the feature metadata and has the format expression (Unit) or expression. When no (Unit) is included in the header, the feature is considered dimensionless. For example, if one of the primary features used in the set is the lattice constant of a material, the header would be lat_param (AA), while the number of species in the material would be n_species because it is a dimensionless number. The first column provides the labels for each sample in the data file and is used to set the sample ids in the output files. In the simplest case, this can be just a running index.
The property vector is defined in the column whose expression matches the property_key field in the sisso.json file; this column is not included in the feature space. Additionally, an optional task column, whose header matches the task_key field in the sisso.json file, can also be included in the data file; it maps each sample to its respective task. Below is a minimal example of the data file used to learn a model for a material's volume.
material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76
Sn, diamond, 293.58, 6.65
Pb, diamond, 353.84, 7.07
LiF, rock_salt, 67.94, 4.08
NaF, rock_salt, 103.39, 4.69
KF, rock_salt, 159.00, 5.42
RbF, rock_salt, 189.01, 5.74
CsF, rock_salt, 228.33, 6.11
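The header convention can also be handled programmatically. As a sketch (standard library only, `parse_header` is a hypothetical helper, not part of the package), the following splits an expression (Unit) header into its expression and unit, treating a missing unit as dimensionless:

```python
import re

def parse_header(column):
    """Split a column header of the form 'expression (Unit)' into its parts.

    A header without a trailing '(Unit)' is treated as dimensionless,
    following the data.csv convention described above.
    """
    match = re.fullmatch(r"\s*(.+?)\s*\(([^()]*)\)\s*", column)
    if match:
        return match.group(1), match.group(2)
    return column.strip(), None  # dimensionless feature

# Headers from the example data file above
print(parse_header("lat_param (AA)"))  # ('lat_param', 'AA')
print(parse_header("Volume (AA^3)"))   # ('Volume', 'AA^3')
print(parse_header("n_species"))       # ('n_species', None)
```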
sisso.json
All input parameters that cannot be extracted from the data file are defined in the sisso.json file. Here is a complete example of a sisso.json file where the property and task keys match those in the above data file example.
{
"data_file": "data.csv",
"property_key": "Volume",
"task_key": "Structure_Type",
"opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
"param_opset": [],
"calc_type": "regression",
"desc_dim": 2,
"n_sis_select": 5,
"max_rung": 2,
"max_leaves": 4,
"n_residual": 1,
"n_models_store": 1,
"n_rung_store": 1,
"n_rung_generate": 0,
"min_abs_feat_val": 1e-5,
"max_abs_feat_val": 1e8,
"leave_out_inds": [0, 5],
"leave_out_frac": 0.25,
"fix_intercept": false,
"max_feat_cross_correlation": 1.0,
"nlopt_seed": 13,
"global_param_opt": false,
"reparam_residual": true
}
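Since the input file is plain JSON, it can be generated and sanity-checked with the python standard library. A minimal sketch (using a reduced set of the keys from the full example above):

```python
import json

# A reduced parameter set; the key names follow the full example above
params = {
    "data_file": "data.csv",
    "property_key": "Volume",
    "task_key": "Structure_Type",
    "opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
    "calc_type": "regression",
    "desc_dim": 2,
    "n_sis_select": 5,
    "max_rung": 2,
}

text = json.dumps(params, indent=4)
# ... write `text` to sisso.json before running the calculation ...

# Round-trip check that the file content is valid JSON with the expected keys
loaded = json.loads(text)
assert loaded["desc_dim"] >= 1 and loaded["max_rung"] >= 0
```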
Performing the Calculation
Once the input files are made, the code can be run using the following command:
mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json
which will give the following output for the simple problem defined above
time input_parsing: 0.000721931 s
time to generate feat sapce: 0.00288105 s
Projection time: 0.00304198 s
Time to get best features on rank : 1.09673e-05 s
Complete final combination/selection from all ranks: 0.00282502 s
Time for SIS: 0.00595999 s
Time for l0-norm: 0.00260496 s
Projection time: 0.000118971 s
Time to get best features on rank : 1.38283e-05 s
Complete final combination/selection from all ranks: 0.00240111 s
Time for SIS: 0.00276804 s
Time for l0-norm: 0.000256062 s
Train RMSE: 0.293788 AA^3; Test RMSE: 0.186616 AA^3
c0 + a0 * (lat_param^3)
Train RMSE: 0.0936332 AA^3; Test RMSE: 15.8298 AA^3
c0 + a0 * ((lat_param^3)^2) + a1 * (sqrt(lat_param)^3)
Analyzing the Results
Once the calculations are done, two sets of output files are generated: files summarizing the results of SIS in computer- and human-readable form are stored in feature_space/, and every model used as a residual for SIS is stored in models/.
The human-readable file describing the selected feature space is feature_space/SIS_summary.txt, which contains the projection score (the Pearson correlation to the target property or model residual) of each selected feature.
# FEAT_ID Score Feature Expression
0 0.99997909235669924 (lat_param^3)
1 0.999036700010245471 ((lat_param^2)^2)
2 0.998534266139345261 (lat_param^2)
3 0.996929900301868899 (sqrt(lat_param)^3)
4 0.994755117666830335 lat_param
#-----------------------------------------------------------------------
5 0.0318376000648976157 ((lat_param^3)^3)
6 0.00846237838476477863 ((lat_param^3)^2)
7 0.00742498801557322716 cbrt(cbrt(lat_param))
8 0.00715447033658055554 cbrt(sqrt(lat_param))
9 0.00675695980092700429 sqrt(sqrt(lat_param))
#---------------------------------------------------------------------
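The leading score can be reproduced by hand from the example data. As a sketch (standard library only; note this computes a single correlation over all samples, whereas SISSO++ may evaluate the score per task), the Pearson correlation of lat_param^3 with the volume is:

```python
import math

# Values from the example data.csv above (order: C, Si, Ge, Sn, Pb, LiF, NaF, KF, RbF, CsF)
lat_param = [3.57, 5.47, 5.76, 6.65, 7.07, 4.08, 4.69, 5.42, 5.74, 6.11]
volume = [45.64, 163.55, 191.39, 293.58, 353.84,
          67.94, 103.39, 159.00, 189.01, 228.33]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Projection score of the top-ranked feature (lat_param^3)
score = pearson([a ** 3 for a in lat_param], volume)
print(score)  # very close to 1, as in the first line of SIS_summary.txt
```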
The computer-readable file is feature_space/selected_features.txt, which contains the list of selected features represented by an alphanumeric code: the integers are the indexes of the features in the primary feature space and the strings represent the operators. The order of the terms in each expression is the same as the order they would appear in postfix (reverse Polish) notation.
# FEAT_ID Feature Postfix Expression (RPN)
0 0|cb
1 0|sq|sq
2 0|sq
3 0|sqrt|cb
4 0
#-----------------------------------------------------------------------
5 0|cb|cb
6 0|cb|sq
7 0|cbrt|cbrt
8 0|sqrt|cbrt
9 0|sqrt|sqrt
#-----------------------------------------------------------------------
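These postfix codes can be decoded mechanically. The following is a sketch of an evaluator for the unary-operator codes shown above (`eval_postfix` is a hypothetical helper, not part of the package; binary operators such as add or div would pop two values from the stack instead of one):

```python
def eval_postfix(code, primary_values):
    """Evaluate a selected_features.txt postfix code such as '0|sqrt|cb'.

    Integer tokens index into the primary feature values; the remaining
    tokens are operators (only the unary operators appearing in the
    example above are implemented here).
    """
    ops = {
        "sq":   lambda x: x ** 2,
        "cb":   lambda x: x ** 3,
        "sqrt": lambda x: x ** 0.5,
        "cbrt": lambda x: x ** (1.0 / 3.0),
    }
    stack = []
    for token in code.split("|"):
        if token.isdigit():
            stack.append(primary_values[int(token)])
        else:
            stack.append(ops[token](stack.pop()))
    return stack[0]

# '0|cb' is lat_param^3 -- the top-ranked feature in SIS_summary.txt
print(eval_postfix("0|cb", [3.57]))  # 45.499..., i.e. 3.57^3 for diamond C
```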
The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE.
The model with the lowest RMSE is stored in the lowest numbered file.
For example, train_dim_2_model_0.dat contains the best 2D model, train_dim_2_model_1.dat the second best, etc., whereas train_dim_1_model_0.dat contains the best 1D model.
Each model file has a large header containing information about the features selected and the model generated:
# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
# Coefficients
# Task a0 c0
# diamond, 1.000735616997855e+00, -1.551085274074442e-01,
# rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
# Feature Rung, Units, and Expressions
# 0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
# Number of Samples Per Task
# Task , n_mats_train
# diamond, 4
# rock_salt, 4
The first section of the header gives a string representation of the model, defines the property's label and unit, and summarizes the model's error.
# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
Next, the linear coefficients (as shown in the first line) for each task are listed.
# Coefficients
# Task a0 c0
# diamond, 1.000735616997855e+00, -1.551085274074442e-01,
# rock_salt, 9.998140372873336e-01, 6.405707194855371e-02,
Then a description of each feature in the model is listed, including units and various expressions.
# Feature Rung, Units, and Expressions
# 0; 1; AA^3; 0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
Finally, information about the number of samples in each task is given:
# Number of Samples Per Task
# Task , n_mats_train
# diamond, 4
# rock_salt, 4
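The header contains everything needed to evaluate the model by hand. As a sketch, applying the diamond-task coefficients from the header above to carbon (lat_param = 3.57 AA):

```python
# Coefficients for the diamond task, copied from the model header above
a0 = 1.000735616997855e+00
c0 = -1.551085274074442e-01

lat_param = 3.57                 # primary feature value for C (AA)
feature = lat_param ** 3         # the model's feature, (lat_param^3)
volume_est = c0 + a0 * feature   # c0 + a0 * (lat_param^3)

print(volume_est)  # ~45.38 AA^3, vs. the reference value 45.64 AA^3
```

The residual of about 0.26 AA^3 is consistent with the RMSE (0.294 AA^3) and Max AE (0.561 AA^3) reported in the header.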
The header of the test data files contains the same information as the training file, with an additional line at the end listing all indexes included in the test set:
# Test Indexes: [ 0, 5 ]
These indexes can be used to reproduce the results by setting leave_out_inds to the values listed on this line.
After this header, both files store the data in the following format:
# Sample ID , Property Value , Property Value (EST) , Feature 0 Value
With this data, one can plot and analyze the model, e.g., by using the python bindings.
Using the Python Library
To see how the python interface can be used, refer to the tutorials. If you get an error about not being able to load MKL libraries, you may have to run conda install numpy to get proper linking.