cpp_sisso


    C++ Implementation of SISSO with python bindings

    Overview

    This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient python interface. Future work will expand the python interface to include more postprocessing analysis tools.

    Installation

    The package uses a CMake build system and is compatible with the C++14 standard and later.

    Prerequisites

    To install sisso++ the following packages are needed:

    • CMake version 3.10 or later
    • A C++ compiler (compatible with C++14 and later)
    • BLAS/LAPACK (architecture-specific implementations such as MKL or ACML are recommended)
    • MPI
    • Boost with the following libraries compiled: mpi, serialization, system, and filesystem

    To build the optional python bindings the following are also needed:

    • Python 3 interpreter
    • Boost with the python and numpy libraries compiled

    Install sisso++

    sisso++ is installed using a CMake build system, with some basic configuration files stored in cmake/toolchains/. As an example, here is an initial_config.cmake file used to build sisso++ and the python bindings using the GNU compiler:

    ###############
    # Basic Flags #
    ###############
    set(CMAKE_CXX_COMPILER g++ CACHE STRING "")
    set(CMAKE_CXX_FLAGS "-O2" CACHE STRING "")
    
    #################
    # Feature Flags #
    #################
    set(USE_PYTHON ON CACHE BOOL "")
    set(EXTERNAL_BOOST OFF CACHE BOOL "")

    Here the -O2 flag controls optimization; it is recommended to keep it at -O2 or -O3, but it can be changed to match compiler requirements.

    When building Boost from source (EXTERNAL_BOOST OFF) the number of processes used when building Boost may be set using the BOOST_BUILD_N_PROCS flag in CMake. For example, to build Boost using 4 processes, the following flag should be included in the initial_config.cmake file:

    set(BOOST_BUILD_N_PROCS 4 CACHE STRING "")

    This flag has no effect when linking against an external Boost installation, i.e. EXTERNAL_BOOST ON.

    To install sisso++ run the following commands (this assumes the GNU compiler and MKL are used; if you are using a different compiler/BLAS library, change the flags accordingly):

    export MKLROOT=/path/to/mkl/
    export BOOST_ROOT=/path/to/boost
    
    cd ~/sisso++/main directory
    mkdir build/;
    cd build/;
    
    cmake -C initial_config.cmake ../
    make install

    Once all the commands have run, the sisso++ executable should be in the bin/ sub-directory of the main directory.

    Install _sisso

    To install the python bindings, first ensure your python path matches the path used to configure Boost, then repeat the same commands as above with USE_PYTHON set to ON in initial_config.cmake.

    Once installed you should have access to the python interface via import cpp_sisso.

    Running the code

    Input files

    To see a sample of the input files, look in ~/sisso++/main directory/test/exec_test.

    To use the code two files are necessary: sisso.json and data.csv. data.csv stores all the data for the calculation in CSV format. The first row of the file contains the feature metadata in the format expression (Unit). For example, if one of the primary features in the set is the lattice constant of a material, the corresponding header entry would be lat_param (AA). The first column of the file contains sample labels for the rows and is not used in the calculation.
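    As a minimal sketch of that layout, the file can be generated with Python's csv module (the feature names, units, and values below are invented placeholders, not part of the package):

    ```python
    import csv

    # First column: sample labels (not used in the calculation).
    # Every other column header has the form "expression (Unit)",
    # e.g. the lattice constant in Angstrom.
    header = ["material", "lat_param (AA)", "E_HOMO (eV)", "prop (eV)"]
    rows = [
        ["NaCl", 5.64, -5.9, 0.12],
        ["MgO", 4.21, -4.8, 0.34],
    ]

    with open("data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    ```
    
    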

    The input parameters are stored in sisso.json. Here is a list of all the variables that can be stored in sisso.json:

    data_file

    The name of the csv file where the data is stored. (Default: "data.csv")

    property_key

    The expression of the column where the property to be modeled is stored. (Default: "prop")

    task_key

    The expression of the column where the task identification is stored. (Default: "Task")

    opset

    The set of operators used to combine the features during feature creation. (If empty, all available operators are used)

    calc_type

    The type of calculation to run: either regression or classification

    desc_dim

    The maximum dimension of the model to be created

    n_sis_select

    The number of features that SIS selects over each iteration

    max_rung

    The maximum rung of the features (the height of the tallest possible binary expression tree, minus 1)

    n_residual

    The number of residuals used to select the next subset of materials in each iteration. (Affects SIS after the 1D model) (Default: 1)

    n_models_store

    The number of models to output as files for each dimension (Default: n_residual)

    n_rung_store

    The number of rungs for which all of the training/testing data of the materials are stored in memory. (Default: max_rung - 1)

    n_rung_generate

    The number of rungs to generate on the fly during each SIS step. Must be 1 or 0. (Default: 0)

    min_abs_feat_val

    Minimum absolute value allowed in the feature's training data (Default: 1e-50)

    max_abs_feat_val

    Maximum absolute value allowed in the feature's training data (Default: 1e50)

    leave_out_inds

    The indices from the data set to use as the test set. If empty and leave_out_frac > 0, the selection is random

    leave_out_frac

    The fraction (in decimal form) of the data to use as a test set. (Default: 0.0 if leave_out_inds is empty, otherwise len(leave_out_inds) / the number of rows in the data file)

    fix_intercept

    If true, set the intercept to 0.0 for all regression models. This does not apply to classification. (Default: false)

    max_feat_cross_correlation

    The maximum Pearson correlation allowed between selected features (Default: 1.0)
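    Putting the parameters above together, a minimal sisso.json can be written with Python's json module. The values below are illustrative, not recommended settings, and the operator names in opset are assumptions for the sketch:

    ```python
    import json

    # Keys follow the parameter list above; values are illustrative only.
    settings = {
        "data_file": "data.csv",
        "property_key": "prop",
        "calc_type": "regression",
        "desc_dim": 2,
        "n_sis_select": 50,
        "max_rung": 2,
        "n_residual": 1,
        "leave_out_frac": 0.1,
        "opset": ["add", "sub", "mult", "div"],  # operator names are assumed
        "max_feat_cross_correlation": 0.95,
    }

    with open("sisso.json", "w") as f:
        json.dump(settings, f, indent=4)
    ```
    
    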

    Perform the Calculation

    Once the input files are made, the code can be run using the following command:

    mpiexec -n 2 ~/sisso++/main directory/bin/sisso++ sisso.json

    Analyzing the Results

    Once the calculations are done, two sets of output files are generated. A list of all selected features is stored in feature_space/selected_features.txt, and every model used as a residual for SIS is stored in models/. The model output files are split into train/test files, sorted by the dimensionality of the model and by the training RMSE. The model with the lowest RMSE is stored in the lowest-numbered file: for example, train_dim_3_model_0.dat contains the best 3D model, train_dim_3_model_1.dat the second best, and so on. Each model file has a large header containing information about the selected features and the generated model:

    # c0 + a0 * [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])] + a1 * [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])] + a2 * [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
    # RMSE: 0.0779291679452223; Max AE: 0.290810937048465
    # Coefficients
    # Task;    a0                      a1                      a2                      c0
    # 0,       7.174549961742731e+00,  8.687856036798111e-02,  2.468463139364077e-01, -3.995345676823570e-02,
    # Feature Rung, Units, and Expressions
    # 0,  2, 1 / eV,                                           [(|r_p_B - (r_s_B)|) / ([(r_d_A) * (E_HOMO_B)])]
    # 1,  2, eV,                                               [(|r_p_B - (r_s_A)|) * ([(IP_A) / (r_s_A)])]
    # 2,  2, 1 / AA^2 * eV,                                    [(|E_HOMO_B - (EA_B)|) / ((r_p_A)^2)]
    # Number of Samples Per Task
    # Task;   n_mats_train
    # 0,      78

    After this header the following data is stored in the file:

    #Property Value          Property Value (EST)    Feature 0 Value         Feature 1 Value         Feature 2 Value

    With this file the model can be perfectly recreated using the python bindings.
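    As a sketch of that recreation done by hand rather than through the bindings, the linear model can be re-evaluated directly from the file text: read the coefficients from the header and apply them to the feature columns. This assumes a single task and the column layout shown above; parse_model is a helper name introduced here, not part of the package:

    ```python
    # Re-evaluate a model file by hand: parse the coefficient line and the
    # data rows, then compute c0 + a0*f0 + a1*f1 + ... for each sample.
    # Assumes one task, coefficients ordered a0..an followed by c0, and
    # data rows laid out as: property, estimated property, feature values.

    def parse_model(lines):
        coefs = None
        data = []
        for i, line in enumerate(lines):
            if line.startswith("# Coefficients"):
                # Skip the "# Task; a0 ..." label line; the line after it
                # holds the task index followed by the coefficient values.
                vals = lines[i + 2].lstrip("# ").split(",")
                coefs = [float(v) for v in vals[1:] if v.strip()]
            elif not line.startswith("#") and line.strip():
                data.append([float(v) for v in line.split()])
        *a, c0 = coefs
        preds = [c0 + sum(ai * fi for ai, fi in zip(a, row[2:]))
                 for row in data]
        stored_est = [row[1] for row in data]
        return preds, stored_est
    ```

    The recomputed predictions should match the stored "Property Value (EST)" column to within the printed precision.
    
    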

    Using the Python Library

    To see how the python interface can be used, look at examples/python_interface_demo.ipynb. If you get an error about not being able to load the MKL libraries, you may have to run conda install numpy to get proper linking.