abstract={The authors showcase the potential of symbolic regression as an analytic method for use in materials research. First, the authors briefly describe the current state-of-the-art method, genetic programming-based symbolic regression (GPSR), and recent advances in symbolic regression techniques. Next, the authors discuss industrial applications of symbolic regression and its potential applications in materials science. The authors then present two GPSR use-cases: formulating a transformation kinetics law and showing the learning scheme discovers the well-known Johnson-Mehl-Avrami-Kolmogorov form, and learning the Landau free energy functional form for the displacive tilt transition in perovskite LaNiO3. Finally, the authors propose that symbolic regression techniques should be considered by materials scientists as an alternative to other machine learning-based regression models for learning from data.},

archivePrefix={arXiv},

arxivId={1901.04136},

author={Wang, Yiqun and Wagner, Nicholas and Rondinelli, James M.},

doi={10.1557/mrc.2019.85},

eprint={1901.04136},

issn={21596867},

journal={MRS Commun.},

month={sep},

number={3},

pages={793--805},

publisher={Cambridge University Press},

title={{Symbolic regression in materials science}},

abstract={A modification to the mixed-integer nonlinear programming (MINLP) formulation for symbolic regression was proposed with the aim of identification of physical models from noisy experimental data. In the proposed formulation, a binary tree in which equations are represented as directed, acyclic graphs, is fully constructed for a pre-defined number of layers. The introduced modification results in the reduction in the number of required binary variables and removal of redundancy due to possible symmetry of the tree formulation. The formulation was tested using numerical models and was found to be more efficient than the previous literature example with respect to the numbers of predictor variables and training data points. The globally optimal search was extended to identify physical models and to cope with noise in the experimental data predictor variable. The methodology was proven to be successful in identifying the correct physical models describing the relationship between shear stress and shear rate for both Newtonian and non-Newtonian fluids, and simple kinetic laws of chemical reactions. Future work will focus on addressing the limitations of the present formulation and solver to enable extension of target problems to larger, more complex physical models.},

author={Neumann, Pascal and Cao, Liwei and Russo, Danilo and Vassiliadis, Vassilios S. and Lapkin, Alexei A.},

doi={10.1016/j.cej.2019.123412},

issn={13858947},

journal={Chem. Eng. J.},

keywords={Automated model construction,Chemical process development,Global optimization,Mixed-integer nonlinear programming (MINLP),Model identification,Symbolic regression},

month={may},

pages={123412},

publisher={Elsevier},

title={{A new formulation for symbolic regression to identify physico-chemical laws from experimental data}},

volume={387},

year={2020}

}

@article{Udrescu2020a,

abstract={A core challenge for both physics and artificial intelligence (AI) is symbolic regression: Finding a symbolic expression that matches data from an unknown function. Although this problem is likely to be NP-hard in principle, functions of practical interest often exhibit symmetries, separability, compositionality, and other simplifying properties. In this spirit, we develop a recursive multidimensional symbolic regression algorithm that combines neural network fitting with a suite of physics-inspired techniques. We apply it to 100 equations from the Feynman Lectures on Physics, and it discovers all of them, while previous publicly available software cracks only 71; for a more difficult physics-based test set, we improve the state-of-the-art success rate from 15 to 90{\%}.},

Describe SISSO for a non-specialist audience [what are primary features, operators, symbolic regression, etc.] and what are its use cases in science.

2:25

(b) Statement of need:

Add what makes this implementation superior to existing ones with clear examples. In my eyes these are

(i) performance/scalability

(ii) documentation and extendibility

(iii) user-friendliness (advanced features and scripts to perform recurring tasks)

2:26

A statement of need: Does the paper have a section titled ‘Statement of Need’ that clearly states what problems the software is designed to solve and who the target audience is?

(c) Features:

API documentation, tutorials, and quick-start guides are also important features

# Summary

Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?

The SISSO++ package is a C++ implementation of the sure-independence screening and sparsifying operator (SISSO) method with python bindings.

SISSO is a symbolic regression method that takes in a set of input primary features and iteratively applies a set of analytical unary and binary operators to build a large and exhaustive feature space [@Ouyang2019a; @Ouyang2017].

The goal of symbolic regression techniques is to find the mathematical expression that best describes a target property, given a set of primary features, i.e. the input features, and a set of analytic operations.

Because symbolic regression results in an interpretable equation, it is an increasingly popular method across scientific disciplines [@Wang2019a; @Neumann2020; @Udrescu2020a].

The SISSO++ package provides a modular library, executable, and python interface for applying SISSO to a wide range of problems.

SISSO++ implements the SISSO algorithm in a user-friendly, modular C++ library that is connected to both an executable and native python interface.

The first step of SISSO is to build all possible expressions up to a user-defined maximum complexity, from the initial set of primary features and analytic operations.

From here, an $\ell_0$-regularization is performed to find the best low-dimensional linear model of the features using the SISSO operator.
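The two steps described above can be illustrated with a minimal, pure-Python sketch. It is a toy under stated assumptions, not the SISSO++ implementation or its API: all function names (`build_feature_space`, `pearson`, `least_squares`, `sisso_sketch`) are hypothetical, the operator set is reduced to a handful of operations applied once, screening uses the absolute Pearson correlation, and the $\ell_0$ step is a brute-force search over all feature tuples of a fixed dimension. The residual-based screening that SISSO uses for higher dimensions is omitted.

```python
import itertools
import math


def build_feature_space(primary):
    """Apply a few unary and binary operators once to the primary features."""
    space = dict(primary)
    items = list(primary.items())
    for name, vals in items:
        space[f"({name})^2"] = [v * v for v in vals]
        if all(v >= 0 for v in vals):
            space[f"sqrt({name})"] = [math.sqrt(v) for v in vals]
    for (na, va), (nb, vb) in itertools.combinations(items, 2):
        space[f"({na} + {nb})"] = [a + b for a, b in zip(va, vb)]
        space[f"({na} * {nb})"] = [a * b for a, b in zip(va, vb)]
        if all(b != 0 for b in vb):
            space[f"({na} / {nb})"] = [a / b for a, b in zip(va, vb)]
    return space


def pearson(x, y):
    """Pearson correlation; returns 0.0 for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def least_squares(columns, y):
    """Fit y ~ c0 + sum_i c_i * columns[i]; return (coefficients, squared error)."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    # Augmented normal equations [X^T X | X^T y]
    aug = [
        [sum(row[j] * row[k] for row in design) for k in range(p)]
        + [sum(row[j] * t for row, t in zip(design, y))]
        for j in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            return None, math.inf  # collinear features, skip this tuple
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coefs = [aug[j][p] / aug[j][j] for j in range(p)]
    sse = sum(
        (t - sum(c * v for c, v in zip(coefs, row))) ** 2
        for row, t in zip(design, y)
    )
    return coefs, sse


def sisso_sketch(primary, target, dim=1, n_sis=10):
    """Screen the feature space, then search all `dim`-tuples exhaustively."""
    space = build_feature_space(primary)
    ranked = sorted(space, key=lambda k: abs(pearson(space[k], target)), reverse=True)
    screened = ranked[:n_sis]
    best = (math.inf, None, None)
    for combo in itertools.combinations(screened, dim):
        coefs, sse = least_squares([space[k] for k in combo], target)
        if sse < best[0]:
            best = (sse, combo, coefs)
    return best


# Synthetic data generated from target = 1 + 3 * x1 / x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
target = [1.0 + 3.0 * a / b for a, b in zip(x1, x2)]
best_sse, best_combo, best_coefs = sisso_sketch({"x1": x1, "x2": x2}, target, dim=1, n_sis=3)
print(best_combo, best_coefs, best_sse)
```

On this synthetic data the search selects the ratio feature, with an intercept and slope close to the generating values. The exhaustive loop is what makes the $\ell_0$ step expensive, which is why the screening step first reduces the candidate set.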

Specifically, SISSO++ applies this methodology for regression, log regression, and classification problems using separate loss functions.

The package uses standard input file formats to provide an accessible command line interface, and it also supports a python interface to the underlying C++ objects.

Using the python bindings and generated output files, it also provides basic postprocessing and visualization tools that are not only capable of reproducing the state of the final calculations, but also perform initial analyses.

Finally, we designed the code to be modular, facilitating future extensions to existing functionality.

Specifically, SISSO++ applies this methodology for regression, log regression, and classification problems.

Additionally, the library includes multiple python functions to facilitate post-processing, analyzing, and visualizing the resulting models.

Finally, we designed the code to be modular, allowing for future extensions to existing functionality.

# Statement of need

A statement of need: Does the paper have a section titled ‘Statement of Need’ that clearly states what problems the software is designed to solve and who the target audience is?

The SISSO++ package creates a high-performing and user-friendly implementation of SISSO.

A particular focus of the code is improving both the code and user interfaces to facilitate both the future extension and use of SISSO for solving symbolic regression problems.

Currently, SISSO++ is capable of solving both regression and classification problems and then analyzing the results with postprocessing tools in python.

The goal of the project is to allow for the broader use of SISSO to solve problems with symbolic regression.

The main goal of the SISSO++ package is to provide a user-friendly, easily extendable version of SISSO for use by the scientific community.

While existing packages provide a high-performing implementation of SISSO [@Ouyang], multiple external efforts have implemented python wrappers to create a more accessible interface [@Xu; @Waroquiers].

Additionally, SISSO++ addresses the need for postprocessing tools that facilitate the standard analysis tasks for the output of SISSO.

Another key feature of the library is the modular design that simplifies the process of extending the code for other applications.

Finally, the project's extensive documentation and tutorials provide a good access point for new users of the method.

SISSO++ will broaden the applicability of SISSO to a wider audience and set of applications.

# Features


The following features are implemented in SISSO++:

- A C++ library for using SISSO to find analytical models for a given problem


- Python bindings to be able to interface with the C++ objects in a python environment

- Postprocessing tools for visualizing models and analyzing results using matplotlib

...

...

@@ -57,8 +72,12 @@ The following features are implimented in SISSO++:

- Features with better-defined non-linearities, obtained by automatically optimizing the scale and bias terms of all operations using non-linear optimization

- Complete API defining all functions of the code

- Tutorials and Quick-Start Guides describing the basic functionality of the code to users
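
The scale-and-bias optimization listed among the features can be illustrated with a small, pure-Python sketch. It makes several assumptions that do not come from SISSO++: the feature is a single exponential `exp(alpha * x)`, only the scale `alpha` is optimized (a bias inside the exponential would be absorbed by the linear coefficient), and the optimizer is a coarse grid scan followed by a golden-section refinement. All function names are hypothetical, and SISSO++ itself may use a different optimizer and parameterization.

```python
import math


def linear_fit_sse(feature, y):
    """Closed-form fit y ~ intercept + slope * feature; also return the squared error."""
    n = len(y)
    mean_f = sum(feature) / n
    mean_y = sum(y) / n
    sff = sum((f - mean_f) ** 2 for f in feature)
    sfy = sum((f - mean_f) * (t - mean_y) for f, t in zip(feature, y))
    slope = sfy / sff
    intercept = mean_y - slope * mean_f
    sse = sum((t - intercept - slope * f) ** 2 for f, t in zip(feature, y))
    return intercept, slope, sse


def golden_section_min(func, lo, hi, tol=1e-10):
    """Minimize a function that is unimodal on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if func(c) < func(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)


def optimize_scale(x, y, lo, hi, n_grid=30):
    """Find the scale alpha in exp(alpha * x) that minimizes the linear-fit error."""

    def sse(alpha):
        return linear_fit_sse([math.exp(alpha * v) for v in x], y)[2]

    # Coarse scan to bracket the minimum, then refine inside the bracket
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    best = min(range(n_grid), key=lambda i: sse(grid[i]))
    left = grid[max(best - 1, 0)]
    right = grid[min(best + 1, n_grid - 1)]
    alpha = golden_section_min(sse, left, right)
    intercept, slope, _ = linear_fit_sse([math.exp(alpha * v) for v in x], y)
    return alpha, intercept, slope


# Synthetic data generated from y = 2 + 0.5 * exp(-1.3 * x)
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [2.0 + 0.5 * math.exp(-1.3 * v) for v in x]
alpha, intercept, slope = optimize_scale(x, y, lo=-3.0, hi=-0.1)
print(alpha, intercept, slope)
```

For each trial value of `alpha` the linear coefficients are obtained in closed form, so the non-linear search runs over a single parameter only.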

# Acknowledgements

The authors would like to thank Lucas Foppa, Jingkai Quan, Aakash Naik, and Luigi Sabilio for testing and providing valuable feedback. T.P. would like to thank the Alexander von Humboldt Foundation for their support through the Alexander von Humboldt Postdoctoral Fellowship Program. This project was supported by TEC1p (the European Research Council (ERC) Horizon 2020 research and innovation programme, grant agreement No. 740233), BigMax (the Max Planck Society’s Research Network on Big-Data-Driven Materials-Science), and the NOMAD pillar of the FAIR-DI e.V. association.

The authors would like to thank Markus Rampp and Meisam Tabriz for technical support. We would also like to thank Lucas Foppa, Jingkai Quan, Aakash Naik, and Luigi Sabilio for testing and providing valuable feedback. T.P. would like to thank the Alexander von Humboldt Foundation for their support through the Alexander von Humboldt Postdoctoral Fellowship Program. This project was supported by TEC1p (the European Research Council (ERC) Horizon 2020 research and innovation programme, grant agreement No. 740233), BigMax (the Max Planck Society’s Research Network on Big-Data-Driven Materials-Science), and the NOMAD pillar of the FAIR-DI e.V. association.