
C++ Implementation of SISSO

Overview

This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface. Future work will expand the Python interface to include more postprocessing analysis tools.

Installation

The package uses a CMake build system and is compatible with C++14 and all later versions of the C++ standard.

Prerequisites

To install sisso++, the following packages are needed:

  • CMake version 3.10 or later
  • A C++ compiler (compatible with C++14 and later)
  • BLAS/LAPACK (architecture-specific implementations such as MKL or ACML are recommended)
  • MPI
  • Boost with the following libraries compiled: mpi, serialization, system, and filesystem

To build the optional python bindings the following are also needed:

  • Python 3 interpreter
  • Boost with the python and numpy libraries compiled

Install sisso++

sisso++ is installed using a CMake build system, with some basic configuration files stored in cmake/toolchains/. As an example, here is an initial_config.cmake file used to build sisso++ and the Python bindings using the GNU compiler.

###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O2" CACHE STRING "")

#################
# Feature Flags #
#################
set(USE_PYTHON ON CACHE BOOL "")
set(EXTERNAL_BOOST OFF CACHE BOOL "")

Here the -O2 flag enables compiler optimizations; it is recommended to keep -O2 or -O3, but it can be changed to match compiler requirements.

When building Boost from source (EXTERNAL_BOOST OFF), the number of processes used to build Boost may be set with the BOOST_BUILD_N_PROCS flag in CMake. For example, to build Boost using 4 processes, include the following line in the initial_config.cmake file:

set(BOOST_BUILD_N_PROCS 4 CACHE STRING "")

This flag has no effect when linking against an external Boost installation, i.e., EXTERNAL_BOOST ON.
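Conversely, to link against an existing Boost installation instead of building it from source, flip the feature flag in the toolchain file (the BOOST_ROOT environment variable must then point at that installation, as shown in the build commands below):

```cmake
# Link against a pre-built Boost found via BOOST_ROOT
set(EXTERNAL_BOOST ON CACHE BOOL "")
```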

To install sisso++ run the following commands (this assumes the GNU compiler and MKL are used; if you are using a different compiler/BLAS library, change the flags accordingly):

export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost

cd /path/to/sisso++
mkdir build/
cd build/

cmake -C initial_config.cmake ../
make install

Once all the commands have run, the sisso++ executable should be in the bin/ directory of the main sisso++ directory.

Install _sisso

To install the Python bindings, first ensure your Python path matches the path used to configure Boost, then repeat the same commands as above with USE_PYTHON set to ON in initial_config.cmake.

Once installed you should have access to the python interface via import cpp_sisso.

Running the code

Input files

To see a sample of the input files, look in test/exec_test in the main sisso++ directory.

To use the code, two files are necessary: sisso.json and data.csv. data.csv stores all the data for the calculation in CSV format. The first row of the file contains the feature metadata in the format expression (Unit). For example, if one of the primary features in the set is the lattice constant of a material, the header would be lat_param (AA). The first column of the file contains sample labels for each row and is not used in the calculation.
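For illustration, a data.csv might begin like the sketch below. The property and task column headers follow the defaults described for sisso.json; the feature names, units, and values are invented:

```csv
Sample,prop (eV),Task,lat_param (AA),E_HOMO (eV)
mat_001,0.52,0,4.05,-5.31
mat_002,1.17,0,3.62,-6.02
mat_003,0.88,0,3.97,-5.74
```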

The input parameters are stored in sisso.json; here is a list of all variables that can be stored in sisso.json:

data_file

The name of the csv file where the data is stored. (Default: "data.csv")

property_key

The expression of the column where the property to be modeled is stored. (Default: "prop")

task_key

The expression of the column where the task identification is stored. (Default: "Task")

opset

The set of operators used to combine the features during feature creation. (If empty, use all available operators)

calc_type

The type of calculation to run: either regression or classification

desc_dim

The maximum dimension of the model to be created

n_sis_select

The number of features that SIS selects over each iteration

max_rung

The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
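As a rough illustration of how rungs grow (a sketch, not the package's actual feature generator; the two-operator set here is invented for brevity):

```python
# Sketch: each rung applies the operator set once more to all features
# generated so far, so a rung-r feature corresponds to a deeper
# expression tree built from the primary (rung-0) features.
primary = ["a", "b"]

def next_rung(feats):
    """Combine existing features pairwise with a toy operator set."""
    new = []
    for f in feats:
        for g in feats:
            new.append(f"({f} + {g})")
            new.append(f"({f} / {g})")
    return new

rung1 = next_rung(primary)          # e.g. "(a + b)", "(a / b)", ...
rung2 = next_rung(primary + rung1)  # e.g. "((a + b) / b)", ...
print(len(rung1), "rung-1 candidates")
```

Even with two primary features and two operators, the candidate pool grows rapidly with each rung, which is why max_rung is the main knob controlling the size of the feature space.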

n_residual

Number of residuals used to select the next subset of materials in each iteration. (Affects SIS after the 1D model) (Default: 1)

n_models_store

Number of models to output to file for each dimension (Default: n_residual)

n_rung_store

The number of rungs where all of the training/testing data of the materials are stored in memory. (Default: max_rung - 1)

n_rung_generate

The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)

min_abs_feat_val

Minimum absolute value allowed in the feature's training data (Default: 1e-50)

max_abs_feat_val

Maximum absolute value allowed in the feature's training data (Default: 1e50)

leave_out_inds

The indices from the data set to use as the test set. If empty and leave_out_frac > 0, the selection will be random

leave_out_frac

Fraction (in decimal form) of the data to use as the test set (Default: 0.0 if leave_out_inds is empty, otherwise len(leave_out_inds) / number of rows in the data file)

fix_intercept

If true, set the intercept to 0.0 for all regression models (Default: false)

This option has no effect for classification

max_feat_cross_correlation

The maximum Pearson correlation allowed between selected features (Default: 1.0)
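Combining the options above, a sisso.json might look like the following sketch. The numeric values are arbitrary, and the operator names in opset are assumptions that should be checked against the sample inputs in test/exec_test:

```json
{
    "data_file": "data.csv",
    "property_key": "prop",
    "task_key": "Task",
    "calc_type": "regression",
    "desc_dim": 3,
    "n_sis_select": 100,
    "max_rung": 2,
    "n_residual": 5,
    "n_models_store": 1,
    "leave_out_frac": 0.1,
    "max_feat_cross_correlation": 0.95,
    "opset": ["add", "sub", "mult", "div", "abs_diff", "sq", "sqrt"]
}
```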

Perform the Calculation

Once the input files are made, the code can be run with the following command:

mpiexec -n 2 /path/to/sisso++/bin/sisso++ sisso.json

Analyzing the Results

Once the calculations are done, two sets of output files are generated. A list of all selected features is stored in feature_space/selected_features.txt, and every model used as a residual for SIS is stored in models/. The model output files are split into train/test files, sorted by the dimensionality of the model and then by the training RMSE. The model with the lowest RMSE is stored in the lowest-numbered file; for example, train_dim_3_model_0.dat will have the best 3D model, train_dim_3_model_1.dat the second best, etc. Each model file has a large header containing information about the features selected and the model generated:

# c0 + a0 * [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])] + a1 * [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])] + a2 * [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# RMSE: 0.0779291679452223; Max AE: 0.290810937048465
# Coefficients
# Task;    a0                      a1                      a2                      c0
# 0,       7.174549961742731e+00,  8.687856036798111e-02,  2.468463139364077e-01, -3.995345676823570e-02,
# Feature Rung, Units, and Expressions
# 0,  2, 1 / eV,                                           [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])]
# 1,  2, eV,                                               [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])]
# 2,  2, 1 / AA^2 * eV,                                    [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
# Number of Samples Per Task
# Task;   n_mats_train
# 0,      78

After this header the following data is stored in the file:

#Property Value          Property Value (EST)    Feature 0 Value         Feature 1 Value         Feature 2 Value

With this file the model can be perfectly recreated using the Python bindings.
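The reconstruction boils down to evaluating a linear model over the selected features. A minimal sketch of how the stored coefficients reproduce the estimated property (the coefficients below are the task-0 values from the example header; the feature values are made up, since they would come from the "Feature i Value" columns of the data section):

```python
def evaluate_model(c0, coefs, feature_values):
    """Linear SISSO model: prediction = c0 + sum_i a_i * f_i."""
    return c0 + sum(a * f for a, f in zip(coefs, feature_values))

# Intercept and coefficients for task 0 from the example header above
c0 = -3.995345676823570e-02
coefs = [7.174549961742731e+00, 8.687856036798111e-02, 2.468463139364077e-01]

# Placeholder feature values for one sample (not real data)
pred = evaluate_model(c0, coefs, [0.01, 1.2, 0.3])
print(pred)
```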

Using the Python Library

To see how the Python interface can be used, look at examples/python_interface_demo.ipynb. If you get an error about not being able to load the MKL libraries, you may have to run conda install numpy to get proper linking.