diff --git a/docs/index.rst b/docs/index.rst index bcbfd93baf9268ebd6179831da14a5b9d94bd988..bfab53a7ae3ee17b6a65e8bc01f43407dc875ae2 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,19 +12,12 @@ SISSO++ This package provides a C++ implementation of SISSO with built in Python bindings for an efficient python interface. Future work will expand the python interface to include more postporcessing analysis tools. -Indices -======= - -* :ref:`genindex` -* :ref:`search` - Table of Contents ^^^^^^^^^^^^^^^^^ .. toctree:: :maxdepth: 2 - self quick_start/QuickStart tutorial/tutorial cpp_api/cpp_api diff --git a/docs/quick_start/QuickStart.rst b/docs/quick_start/QuickStart.rst index 393a3952edb07f2b6d8fc363e2a3d50ed3636744..0eadf44141e79089e32488cdfb154692c5d09c11 100644 --- a/docs/quick_start/QuickStart.rst +++ b/docs/quick_start/QuickStart.rst @@ -1,6 +1,6 @@ .. _quick_start: -Quick Start Guide +Quick-Start Guide ================= .. toctree:: :maxdepth: 2 diff --git a/docs/quick_start/code_ref.md b/docs/quick_start/code_ref.md index cde3e05cbdf4ab8a0eb0b1f6f4f9bc298b2319ed..50b60f332bb1a9a2718e6da729f9c27f8856891c 100644 --- a/docs/quick_start/code_ref.md +++ b/docs/quick_start/code_ref.md @@ -31,7 +31,7 @@ A list containing the set of all operators that will be used during the feature #### `param_opset` -A list containing the set of all operators, for which the non-linear scale and bias terms will be optimized, that will be used during the feature creation step of SISSO. (If empty none of the available features) +A list containing the set of all operators, for which the non-linear scale and bias terms will be optimized, that will be used during the feature creation step of SISSO. (If empty none of the available features are used) #### `calc_type` @@ -39,15 +39,15 @@ The type of calculation to run either regression, log regression, or classificat #### `desc_dim` -The maximum dimension of the model to be created +The maximum dimension of the model to be created (no default value) #### `n_sis_select` -The number of features that SIS selects over each iteration +The number of features that SIS selects over each iteration (no default value) #### `max_rung` -The maximum rung of the feature (height of the tallest possible binary expression tree - 1) +The maximum rung of the feature (height of the tallest possible binary expression tree - 1) (no default value) #### `n_residual` diff --git a/docs/tutorial/0_intro.md b/docs/tutorial/0_intro.md index 852b98a8b3a890470f6da2a395b3b0e2a89f6ae7..ee76cd06d75fdb21b9751352b0707c1ba0e5c6d9 100644 --- a/docs/tutorial/0_intro.md +++ b/docs/tutorial/0_intro.md @@ -2,17 +2,17 @@ This tutorial is based on the [Predicting energy differences between crystal structures: (Meta-)stability of octet-binary compounds](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb) tutorial created by Mohammad-Yasin Arif, Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler. The goal of the tutorial is to teach a user how to use `SISSO++` to find and analyze quantitative models for materials properties. -In particular we will use SISSO to predict the crystal structure (rock salt or zincblende) of a series of octet binaries. +In particular we will use SISSO to predict the crystal structure (rock-salt or zinc-blende) of a series of octet binaries. The tutorial will be split into three parts: 1) explaining how to use the executable to perform the calculations and the python utilities to analyze the results and 2) How to use only python to run, analyze, and demonstrate results 3) How to perform classification problems using SISSO. ## Outline The following tutorials are available: -- [Combined Binary and Python](1_combined.md) -- [Python only](2_python.md) +- [Using the Command Line Interface](1_command_line.md) +- [Using the Python Interface](2_python.md) - [Classification](3_classification.md) -All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock salt or zincblende phase. +All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock-salt or zinc-blende phase. For all applications of SISSO a data set has to be passed via a standard `csv` file where the first row represents the feature and property label and the first column are the index-label for each sample for example ``` Material, energy_diff (eV), rs_A (AA), rs_B (AA), E_HOMO_A (eV), E_HOMO_B (eV),.... diff --git a/docs/tutorial/1_command_line.md b/docs/tutorial/1_command_line.md index c82de07775fba6145c5dd12cd7e466b1acccbc2e..65656a07c3dbd6f175cfa617f9e32f30f448bbbe 100644 --- a/docs/tutorial/1_command_line.md +++ b/docs/tutorial/1_command_line.md @@ -64,7 +64,7 @@ The standard output provides information about what step the calculation just fi When all calculations are complete the code prints out a summary of the best 1D, 2D, ..., {desc_dim}D models with their training RMSE/Testing RMSE (Only training if there is no test set provided). Additionally, two additional output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`. These files represent a human readable (`SIS_summary.txt`) and computer readable (`selected_features.txt`) summary of the selected feature space from SIS. -Below are reconstructions of both files for this calculation +Below are reconstructions of both files for this calculation (To see the file click the triangle) <details> <summary>feature_space/SIS_summary.txt</summary> @@ -284,10 +284,12 @@ An example of these files is provided here: ``` </details> + + ## Determining the Ideal Model Complexity with Cross-Validation While the training error always decreases with descriptor dimensionality for a given application, over-fitting can reduce the general applicability of the models outside of the training set. In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation. -As an example we will discuss how to perform leave-out 10% using the command line +As an example we will discuss how to perform leave-out 10% using the command line. To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use that as a test set by changing `"leave_out_frac": 0.0,` do `"leave_out_frac": 0.10,`. <details> @@ -415,7 +417,7 @@ As can be seen from the standard error measurements the results are now reasonab <details> <summary> Converged cross-validation results </summary> - + </details> Because the validation error for the three and four dimensional models are within each others error bars and the standard error increases when going to the fourth dimension, we conclude that the three-dimensional model has the ideal complexity. @@ -431,11 +433,13 @@ To see the distributions for this system we run <details> <summary> Distribution of Errors </summary> - + </details> + One thing that stands out in the plot is the large error seen in a single point for both the one and two dimensional models. -By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zincblende structure in the data set. +By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zinc-blende structure in the data set. As a note for this setup there is a 0.22\% chance that one of the samples is never in the validation set so if `max_error_ind != 21` check if that sample is in one of the validation sets. + ```python >>> import numpy as np >>> import pandas as pd @@ -585,12 +589,66 @@ To get the final models we will perform the same calculation we started off the From here we can use `models/train_dim_3_model_0.dat` for all of the analysis. In order to generate a machine learning plot for this model in matplotlib, run the following in python ```python ->>> from sissopp.postprocess.plot.parity_plot import plot_model_ml_plot_from_file ->>> plot_model_ml_plot_from_file("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show() +>>> from sissopp.postprocess.plot.parity_plot import plot_model_parity_plot +>>> plot_model_parity_plot("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show() ``` The result of which is shown below: <details> <summary> Final 3D model </summary> - + +</details> + +Additionally you can generate a output the model as a Matlab function or a LaTeX string using the following commands. +```python +>>> from sissopp.postprocess.load_models import load_model +>>> model = load_model("models/train_dim_3_model_0.dat") +>>> print(model.latex_str) + +>>> model.write_matlab_fxn("matlab_fxn/model.m") +``` + +A copy of the generated matlab function is below. +<details> +<summary> Final 3D model </summary> + + ```matlab + function P = model(X) + % Returns the value of E_{RS} - E_{ZB} = c0 + a0 * ((r_d_B / r_d_A) * (r_p_B * E_HOMO_A)) + a1 * ((IP_A^3) * (|r_sigma - r_s_B|)) + a2 * ((IP_A / r_p_A) / (r_p_B + r_p_A)) + % + % X = [ + % r_d_B, + % r_d_A, + % r_p_B, + % E_HOMO_A, + % IP_A, + % r_sigma, + % r_s_B, + % r_p_A, + % ] + + if(size(X, 2) ~= 8) + error("ERROR: X must have a size of 8 in the second dimension.") + end + r_d_B = reshape(X(:, 1), 1, []); + r_d_A = reshape(X(:, 2), 1, []); + r_p_B = reshape(X(:, 3), 1, []); + E_HOMO_A = reshape(X(:, 4), 1, []); + IP_A = reshape(X(:, 5), 1, []); + r_sigma = reshape(X(:, 6), 1, []); + r_s_B = reshape(X(:, 7), 1, []); + r_p_A = reshape(X(:, 8), 1, []); + + f0 = ((r_d_B ./ r_d_A) .* (r_p_B .* E_HOMO_A)); + f1 = ((IP_A).^3 .* abs(r_sigma - r_s_B)); + f2 = ((IP_A ./ r_p_A) ./ (r_p_B + r_p_A)); + + c0 = -1.3509197357e-01; + a0 = 2.8311062079e-02; + a1 = 3.7282871777e-04; + a2 = -2.3703222974e-01; + + P = reshape(c0 + a0 * f0 + a1 * f1 + a2 * f2, [], 1); + end + ``` </details> diff --git a/docs/tutorial/3_classification.md b/docs/tutorial/3_classification.md index 1ff724f57d7dfb3e5e7cc4710f0c4b31ebd8ce57..2c1a9650015b284d16448bcce049f100e77639ca 100644 --- a/docs/tutorial/3_classification.md +++ b/docs/tutorial/3_classification.md @@ -1,11 +1,12 @@ Performing Classification with SISSO++ --- -Finally `SISSO++` can be used to solve classification problems as well as regression problems. -As an example of this we will adapt the previous example by replacing the property with the identifier of if the material favors the rock-salt or zincblende structure, and change the calculation type to be `classification`. +inally, besides regression problems, `SISSO++` can be used to solve classification problems. +As an example of this we will adapt the previous example by replacing the property with the identifier of if the material favors the rock-salt or zinc-blende structure, and change the calculation type to be `classification`. It is important to note that while this problem only has two classes, multi-class classification is also possible. ## The Data File -Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1. +Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1. While this example has only one task and two classes, the method works for an arbitrary number of classes and tasks. + <details> <summary>Here is the full data_class.csv file for the calculation</summary> @@ -300,7 +301,7 @@ The estimated property vector in this case refers to the predicted class from SV ``` </details> -## Updating the SVM Model the Python Interface +## Updating the SVM Model Using `sklearn` Because the basis of the classification algorithm is based on the overlap region of the convex hull, the `c` value for the SVM model is set at a fairly high value of 1000.0. This will prioritize reducing the number of misclassified points, but does make the model more susceptible to being over fit. To account for this the python interface has the ability to refit the Linear SVM using the `svm` module of `sklearn`. diff --git a/docs/tutorial/command_line/cv/3d_model.png b/docs/tutorial/command_line/cv/3d_model.png new file mode 100644 index 0000000000000000000000000000000000000000..c9eb59412557ebad45d65d0e4b8ae438fdd3b4c1 Binary files /dev/null and b/docs/tutorial/command_line/cv/3d_model.png differ diff --git a/src/python/postprocess/plot/parity_plot.py b/src/python/postprocess/plot/parity_plot.py index d653c67c80654f27c2d86e88c56d301597af8e1f..714c1e7fbc70ab25bb1cd4242bc5c100cc8ac164 100644 --- a/src/python/postprocess/plot/parity_plot.py +++ b/src/python/postprocess/plot/parity_plot.py @@ -20,9 +20,9 @@ plot_model_parity_plot: Wrapper to plot_model for a set of training and testing import numpy as np import toml from sissopp.postprocess.check_cv_convergence import jackknife_cv_conv_est - from sissopp.postprocess.load_models import load_model -from sissopp.postprocess.plot.utils import setup_plot_ax +from sissopp.postprocess.plot.utils import setup_plot_ax, latexify +from sissopp import ModelClassifier def plot_model_parity_plot(model, filename=None, fig_settings=None): @@ -52,9 +52,12 @@ def plot_model_parity_plot(model, filename=None, fig_settings=None): fig_config, fig, ax = setup_plot_ax(fig_settings) - ax.set_xlabel(model.prop_label + " (" + model.prop_unit.latex_str + ")") + ax.set_xlabel(latexify(model.prop_label) + " (" + model.prop_unit.latex_str + ")") ax.set_ylabel( - f"Estimated {model.prop_label}" + " (" + model.prop_unit.latex_str + ")" + f"Estimated {latexify(model.prop_label)}" + + " (" + + model.prop_unit.latex_str + + ")" ) if len(model.prop_test) > 0: lims = [ diff --git a/src/python/postprocess/plot/utils.py b/src/python/postprocess/plot/utils.py index eafde92cc894973407ea3a6562b6a1fd18b0db79..6b768bcd8d0c6786b863ec8050e0d8c7228a5e26 100644 --- a/src/python/postprocess/plot/utils.py +++ b/src/python/postprocess/plot/utils.py @@ -91,19 +91,26 @@ def adjust_box_widths(ax, fac): def latexify(s): """Convert a string s into a latex string""" power_split = s.split("^") - - print(power_split) - if len(power_split) == 1: - return s - - power_split[0] += "$" - for pp in range(1, len(power_split)): - unit_end = power_split[pp].split(" ") - unit_end[0] = "{" + unit_end[0] + "}$" + temp_s = s + else: + power_split[0] += "$" + for pp in range(1, len(power_split)): + unit_end = power_split[pp].split(" ") + unit_end[0] = "{" + unit_end[0] + "}$" + unit_end[-1] += "$" + power_split[pp] = " ".join(unit_end) + temp_s = "^".join(power_split)[:-1] + + subscript_split = temp_s.split("_") + if len(subscript_split) == 1: + return temp_s + + subscript_split[0] += "$" + for pp in range(1, len(subscript_split)): + unit_end = subscript_split[pp].split(" ") + unit_end[0] = "\\mathrm{" + unit_end[0] + "}$" unit_end[-1] += "$" - power_split[pp] = " ".join(unit_end) - - print("^".join(power_split)[:-1]) + subscript_split[pp] = " ".join(unit_end) - return "^".join(power_split)[:-1] + return "_".join(subscript_split)[:-1] diff --git a/src/python/py_binding_cpp_def/bindings_docstring_keyed.cpp b/src/python/py_binding_cpp_def/bindings_docstring_keyed.cpp index 514b52092fea38011c6de145ff58fdee737409bd..d3c83e45cc238b49a1e51724e99a665fd9b8a820 100644 --- a/src/python/py_binding_cpp_def/bindings_docstring_keyed.cpp +++ b/src/python/py_binding_cpp_def/bindings_docstring_keyed.cpp @@ -121,6 +121,12 @@ void sisso::register_all() "@DocString_str_utils_matlabify@" ); + def( + "latexify", + &str_utils::latexify, + (arg("str")), + "@DocString_str_utils_latexify@" + ); #ifdef PARAMETERIZE sisso::feature_creation::node::registerAddParamNode(); sisso::feature_creation::node::registerSubParamNode(); diff --git a/src/utils/string_utils.cpp b/src/utils/string_utils.cpp index 8989dc2d76c2bc70df83ca6606c328f671255a39..0558e6409874d058b885bd682a894b29d62c859d 100644 --- a/src/utils/string_utils.cpp +++ b/src/utils/string_utils.cpp @@ -75,7 +75,7 @@ std::string str_utils::matlabify(const std::string str) std::string copy_str = str; std::replace(copy_str.begin(), copy_str.end(), ' ', '_'); - std::vector<std::string> split_str = split_string_trim(str, "\\"); + std::vector<std::string> split_str = split_string_trim(str, "\\}{"); for(auto& term_str : split_str) { std::string add_str = term_str; diff --git a/src/utils/string_utils.hpp b/src/utils/string_utils.hpp index d022a951fc320e34c5819ba67b4174d80e36563f..ce628972c4908ddbf08d640ffeb8443148e3141d 100644 --- a/src/utils/string_utils.hpp +++ b/src/utils/string_utils.hpp @@ -42,6 +42,7 @@ namespace str_utils */ std::vector<std::string> split_string_trim(const std::string str, const std::string split_tokens = ",;:"); + // DocString: str_utils_latexify /** * @brief Convert a string into a latex string *