
C++ Implementation of SISSO with python bindings

Overview

This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface. Future work will expand the Python interface to include more postprocessing analysis tools.

For a more detailed explanation, please visit our documentation here

Installation

The package uses a CMake build system and is compatible with all versions of the C++ standard from C++14 onward. You can access the code here

Prerequisites

To install SISSO++ the following packages are needed:

  • CMake version 3.10 and up
  • A C++ compiler (compatible with C++14 and later, e.g., gcc 5.0+ or icpc 17.0+)
  • BLAS/LAPACK
  • MPI

Additionally, the following packages needed by SISSO++ will be installed if they are not installed already or cannot be found in $PATH:

To build and use the optional Python bindings, the following are also needed:

The setup of the Python environment can be done using Anaconda with

conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml

Installing SISSO++

SISSO++ is installed using a CMake build system, with sample configuration files located in cmake/toolchains/. For example, here is the initial_config.cmake file used to build SISSO++ and the Python bindings using the GNU compiler.

###############
# Basic Flags #
###############
set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
set(CMAKE_C_COMPILER gcc CACHE STRING "")
set(CMAKE_CXX_FLAGS "-O3 -march=native" CACHE STRING "")
set(CMAKE_C_FLAGS "-O3 -march=native" CACHE STRING "")

#################
# Feature Flags #
#################
set(BUILD_PYTHON ON CACHE BOOL "")
set(BUILD_PARAMS ON CACHE BOOL "")

Because we want to build the Python bindings in this example, and assuming there is no preexisting Python environment, we first need to create and activate one. Here we use conda, but standard Python installations or virtual environments are also possible.

conda create -n sissopp_env python=3.9 numpy pandas scipy seaborn scikit-learn toml
conda activate sissopp_env

Note: if you are using a Python environment with a local MKL installation, make sure the versions of all accessible MKL libraries are the same.

Now we can install SISSO++ using initial_config.cmake and the following commands. (This assumes the GNU compiler and MKL are used; if you are using a different compiler or BLAS library, change the flags to the relevant directories.)

export MKLROOT=/path/to/mkl/
export BOOST_ROOT=/path/to/boost

cd ~/sissopp/
mkdir build
cd build

cmake -C initial_config.cmake ../
make
make install

Once all the commands are run, the sisso++ binary should be in the bin/ directory of the main SISSO++ directory.

Installing the Python Bindings Without Administrative Privileges

To install the Python bindings on a machine where you do not have write privileges to the default Python install directory (typical on most HPC systems), you must set PYTHON_INSTDIR to a directory where you do have write access. This can be done by modifying the cmake command to:

cmake -C initial_config.cmake -DPYTHON_INSTDIR=/path/to/python/install/directory/ ../

A standard local Python installation directory for pip and conda is $HOME/.local/lib/python3.X/site-packages/, where X is the minor version of Python. If you do set this variable, make sure that directory is also in your PYTHONPATH environment variable. This can be updated with

export PYTHONPATH=$PYTHONPATH:/path/to/python/install/directory/
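If you are unsure which directory this is on your machine, Python's standard site module can report the user site-packages path:

```shell
# Print the per-user site-packages directory (pip's --user install target)
python3 -c "import site; print(site.getusersitepackages())"
```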

If you are using Anaconda, this can be avoided by creating a new conda environment as detailed above.

You will need to set this variable and recompile the code (removing all build files first) if you see this error:


CMake Error at src/cmake_install.cmake:114 (file):
  file cannot create directory:
  ${PYTHON_BASE_DIR}/lib/python3.X/site-packages/sissopp.
  Maybe need administrative privileges.
Call Stack (most recent call first):
  cmake_install.cmake:42 (include)

Install the Binary Without the Python Bindings

To install only the SISSO++ executable, repeat the same commands as above but set BUILD_PYTHON in initial_config.cmake to OFF.

Running the Code

Input Files

To see samples of the input files, look in tests/exec_test in the main SISSO++ directory. This directory contains multiple subdirectories for different types of calculations; the default/ directory is the most common application.

To use the code, two files are necessary: sisso.json and data.csv. data.csv stores all of the data for the calculation in CSV format. The first row of the file corresponds to the feature metadata with the format expression (Unit). For example, if one of the primary features in the set is the lattice constant of a material, the header would be lat_param (AA). The first column contains labels for each sample and is used to set the sample IDs in the output files.

The input parameters are stored in sisso.json; a list of all possible variables that can be set in sisso.json is provided here

data.csv

The data file contains all relevant data and metadata describing the individual features and samples. The first row of the file corresponds to the feature metadata and has the format expression (Unit) or expression. Where no (Unit) is included in the header, the feature is considered dimensionless. For example, if one of the primary features in the set is the lattice constant of a material, the header would be lat_param (AA), while the number of species in the material would be n_species because it is a dimensionless number.

The first column provides the labels for each sample in the data file and is used to set the sample IDs in the output files. In the simplest case, this can be just a running index. The property vector is defined in the column whose expression matches the property_key field in the sisso.json file; this column will not be included in the feature space. Additionally, an optional task column, whose header matches the task_key field in sisso.json, can also be included in the data file. This column maps each sample to its respective task via the label given in that column. Below is a minimal example of the data file used to learn a model for a material's volume.

material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76
Sn, diamond, 293.58, 6.65
Pb, diamond, 353.84, 7.07
LiF, rock_salt, 67.94, 4.08
NaF, rock_salt, 103.39, 4.69
KF, rock_salt, 159.00, 5.42
RbF, rock_salt, 189.01, 5.74
CsF, rock_salt, 228.33, 6.11
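As a sketch of how this header convention can be consumed (using pandas, which is included in the conda environment above; the helper name split_header is made up for illustration):

```python
import re
from io import StringIO

import pandas as pd

# First rows of the data example above
csv_text = """material, Structure_Type, Volume (AA^3), lat_param (AA)
C, diamond, 45.64, 3.57
Si, diamond, 163.55, 5.47
Ge, diamond, 191.39, 5.76"""

# First column (material) provides the sample labels
df = pd.read_csv(StringIO(csv_text), index_col=0, skipinitialspace=True)
df.columns = [c.strip() for c in df.columns]

def split_header(col):
    """Split 'expression (Unit)' into (expression, unit); no unit -> dimensionless."""
    m = re.fullmatch(r"(.+?)\s*\((.+)\)", col)
    return (m.group(1), m.group(2)) if m else (col, None)

for col in df.columns:
    print(split_header(col))
```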

sisso.json

All input parameters that cannot be extracted from the data file are defined in the sisso.json file.

Here is a complete example of a sisso.json file where the property and task keys match those in the above data file example.

{
    "data_file": "data.csv",
    "property_key": "Volume",
    "task_key": "Structure_Type",
    "opset": ["add", "sub", "mult", "div", "sq", "cb", "cbrt", "sqrt"],
    "param_opset": [],
    "calc_type": "regression",
    "desc_dim": 2,
    "n_sis_select": 5,
    "max_rung": 2,
    "max_leaves": 4,
    "n_residual": 1,
    "n_models_store": 1,
    "n_rung_store": 1,
    "n_rung_generate": 0,
    "min_abs_feat_val": 1e-5,
    "max_abs_feat_val": 1e8,
    "leave_out_inds": [0, 5],
    "leave_out_frac": 0.25,
    "fix_intercept": false,
    "max_feat_cross_correlation": 1.0,
    "nlopt_seed": 13,
    "global_param_opt": false,
    "reparam_residual": true
}
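As a quick sanity check before launching a run, sisso.json can be loaded with Python's standard json module (a minimal sketch; the required-key set below is taken from the example above, not from an official schema):

```python
import json

# A subset of the sisso.json example above
text = """{
    "data_file": "data.csv",
    "property_key": "Volume",
    "task_key": "Structure_Type",
    "desc_dim": 2,
    "max_rung": 2,
    "n_sis_select": 5
}"""

params = json.loads(text)

# Keys this sketch treats as required (taken from the example, not a schema)
required = {"data_file", "property_key", "desc_dim", "max_rung", "n_sis_select"}
missing = required - params.keys()
assert not missing, f"sisso.json is missing: {sorted(missing)}"
print(f"{params['desc_dim']}D descriptor, rung {params['max_rung']}, "
      f"{params['n_sis_select']} features per SIS step")
```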

Performing the Calculation

Once the input files are made, the code can be run using the following command

mpiexec -n 2 ~/sissopp/bin/sisso++ sisso.json

which will give the following output for the simple problem defined above

time input_parsing: 0.000721931 s
time to generate feat sapce: 0.00288105 s
Projection time: 0.00304198 s
Time to get best features on rank : 1.09673e-05 s
Complete final combination/selection from all ranks: 0.00282502 s
Time for SIS: 0.00595999 s
Time for l0-norm: 0.00260496 s
Projection time: 0.000118971 s
Time to get best features on rank : 1.38283e-05 s
Complete final combination/selection from all ranks: 0.00240111 s
Time for SIS: 0.00276804 s
Time for l0-norm: 0.000256062 s
Train RMSE: 0.293788 AA^3; Test RMSE: 0.186616 AA^3
c0 + a0 * (lat_param^3)

Train RMSE: 0.0936332 AA^3; Test RMSE: 15.8298 AA^3
c0 + a0 * ((lat_param^3)^2) + a1 * (sqrt(lat_param)^3)

Analyzing the Results

Once the calculations are done, two sets of output files are generated: files summarizing the results from SIS in computer- and human-readable forms are stored in feature_space/, and every model used as a residual for SIS is stored in models/. The human-readable file describing the selected feature space is feature_space/SIS_summary.txt, which contains the projection score of each feature (the Pearson correlation to the target property or model residual).

# FEAT_ID     Score                   Feature Expression
0             0.99997909235669924     (lat_param^3)
1             0.999036700010245471    ((lat_param^2)^2)
2             0.998534266139345261    (lat_param^2)
3             0.996929900301868899    (sqrt(lat_param)^3)
4             0.994755117666830335    lat_param
#-----------------------------------------------------------------------
5             0.0318376000648976157   ((lat_param^3)^3)
6             0.00846237838476477863  ((lat_param^3)^2)
7             0.00742498801557322716  cbrt(cbrt(lat_param))
8             0.00715447033658055554  cbrt(sqrt(lat_param))
9             0.00675695980092700429  sqrt(sqrt(lat_param))
#---------------------------------------------------------------------
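The projection score is the absolute Pearson correlation between a candidate feature and the target property (or the residual of the previous model). A minimal numpy sketch, using the diamond-structure rows of the data example above (the scores will not match the table exactly, since the actual run uses both tasks and excludes the leave-out samples):

```python
import numpy as np

# lat_param (AA) and Volume (AA^3) for the diamond-structure samples above
lat_param = np.array([3.57, 5.47, 5.76, 6.65, 7.07])
volume = np.array([45.64, 163.55, 191.39, 293.58, 353.84])

def projection_score(feature, target):
    """Absolute Pearson correlation, as reported in SIS_summary.txt."""
    return abs(np.corrcoef(feature, target)[0, 1])

print(projection_score(lat_param**3, volume))       # top-ranked feature (lat_param^3)
print(projection_score(np.sqrt(lat_param), volume))
```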

The computer-readable file is feature_space/selected_features.txt, which contains the list of selected features represented by an alphanumeric code: integers are indices into the primary feature space and strings represent the operators. The terms in each expression appear in the same order as they would in postfix (reverse Polish) notation.

# FEAT_ID     Feature Postfix Expression (RPN)
0             0|cb
1             0|sq|sq
2             0|sq
3             0|sqrt|cb
4             0
#-----------------------------------------------------------------------
5             0|cb|cb
6             0|cb|sq
7             0|cbrt|cbrt
8             0|sqrt|cbrt
9             0|sqrt|sqrt
#-----------------------------------------------------------------------
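These codes can be decoded by scanning left to right with a stack: an integer pushes a primary feature, and an operator string pops its argument and pushes the transformed result. A hypothetical decoder sketch handling only the unary operators seen above (this is not the library's own parser):

```python
import numpy as np

# Map operator codes (as they appear in selected_features.txt) to functions;
# a minimal, hypothetical subset covering only the unary operators above
OPS = {
    "cb": lambda x: x**3,
    "sq": lambda x: x**2,
    "sqrt": np.sqrt,
    "cbrt": np.cbrt,
}

def eval_postfix(code, primary_features):
    """Evaluate a '|'-separated postfix code such as '0|sqrt|cb'."""
    stack = []
    for tok in code.split("|"):
        if tok.isdigit():                       # index into the primary feature space
            stack.append(primary_features[int(tok)])
        else:                                   # unary operator: pop, transform, push
            stack.append(OPS[tok](stack.pop()))
    (result,) = stack                           # a valid code leaves exactly one value
    return result

lat_param = np.array([3.57, 5.47])
print(eval_postfix("0|sqrt|cb", [lat_param]))   # sqrt(lat_param)^3
```

Binary operators such as add or sub would pop two operands instead of one; they are omitted here because the example run selected only unary features.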

The model output files are split into train/test files sorted by the dimensionality of the model and by the train RMSE. The model with the lowest RMSE is stored in the lowest-numbered file. For example, train_dim_2_model_0.dat will have the best 2D model and train_dim_2_model_1.dat the second best, whereas train_dim_1_model_0.dat will have the best 1D model. Each model file has a large header containing information about the selected features and the generated model.

# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538
# Coefficients
# Task       a0                      c0
#  diamond,  1.000735616997855e+00, -1.551085274074442e-01,
#  rock_salt,  9.998140372873336e-01,  6.405707194855371e-02,
# Feature Rung, Units, and Expressions
# 0;  1; AA^3;                                             0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param
# Number of Samples Per Task
# Task    , n_mats_train
#  diamond, 4
#  rock_salt, 4

The first section of the header summarizes the model: it provides a string representation of the model, defines the property's label and unit, and reports the model's errors.

# c0 + a0 * (lat_param^3)
# Property Label: $Volume$; Unit of the Property: AA^3
# RMSE: 0.293787533962641; Max AE: 0.56084644346538

Next, the linear coefficients (as shown in the first line) for each task are listed.

# Coefficients
# Task       a0                      c0
#  diamond,  1.000735616997855e+00, -1.551085274074442e-01,
#  rock_salt,  9.998140372873336e-01,  6.405707194855371e-02,

Then a description of each feature in the model is listed, including units and various expressions.

# Feature Rung, Units, and Expressions
# 0;  1; AA^3;                                             0|cb; (lat_param^3); $\left(lat_{param}^3\right)$; (lat_param).^3; lat_param

Finally, information about the number of samples in each task is given:

# Number of Samples Per Task
# Task    , n_mats_train
#  diamond, 4
#  rock_salt, 4

The header of the test data files contains the same information as the training files, with an additional line at the end listing all indexes included in the test set:

# Test Indexes: [ 0, 5 ]

These indexes can be used to reproduce the results by setting leave_out_inds to the values listed on this line.

After this header, both files store the following data:

# Sample ID , Property Value        ,  Property Value (EST) ,  Feature 0 Value

With this data, one can plot and analyze the model, e.g., using the Python bindings.
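As an illustration, a sketch that re-evaluates the 1D model using the diamond-task coefficients from the header above and the diamond rows of data.csv (numpy only; the reported train RMSE spans both tasks and excludes the leave-out samples, so the RMSE computed here differs, though the maximum absolute error for these rows is close to the reported Max AE):

```python
import numpy as np

# Diamond-task coefficients copied from the model header above
a0, c0 = 1.000735616997855e+00, -1.551085274074442e-01

# Diamond rows of data.csv: lat_param (AA) and Volume (AA^3)
lat_param = np.array([3.57, 5.47, 5.76, 6.65, 7.07])
volume = np.array([45.64, 163.55, 191.39, 293.58, 353.84])

est = c0 + a0 * lat_param**3            # the model: c0 + a0 * (lat_param^3)
rmse = np.sqrt(np.mean((volume - est) ** 2))
max_ae = np.abs(volume - est).max()
print(round(rmse, 3), round(max_ae, 3))
```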

Using the Python Library

To see how the Python interface can be used, refer to the tutorials. If you get an error about not being able to load MKL libraries, you may have to run conda install numpy to get proper linking.