Commit 101956ca authored by Thomas Purcell

Merge branch 'master' of gitlab.mpcdf.mpg.de:tpurcell/cpp_sisso into joss

parents d5682d75 e1301943
......@@ -351,7 +351,9 @@ build-gnu-gcov:
pages:
stage: doc_builds
script:
- source cpp_sisso_gnu_py_env/bin/activate
- source cpp_sisso_gnu_param_py_env/bin/activate
- export LD_LIBRARY_PATH=$HOME/intel/oneapi/intelpython/latest/lib/:$HOME/intel/oneapi/intelpython/latest/lib/python3.7:$LD_LIBRARY_PATH
- export PYTHONPATH=$HOME/intel/oneapi/intelpython/latest/lib/python3.7/site-packages/:cpp_sisso_gnu_param_py_env/lib/python3.7/site-packages/
- cd docs/
- make html
- mv _build/html/ ../public
......
Acknowledgements
---
# Acknowledgements
`SISSO++` would not be possible without the following packages:
......@@ -9,7 +8,7 @@ Acknowledgements
- [googletest](https://github.com/google/googletest)
- [NLopt](http://github.com/stevengj/nlopt)
# How to cite these packages:
## How to cite these packages:
Please make sure to give credit to the right people when using `SISSO++`:
For classification problems, cite:
......
......@@ -12,19 +12,12 @@ SISSO++
This package provides a C++ implementation of SISSO with built-in Python bindings for an efficient Python interface.
Future work will expand the Python interface to include more postprocessing analysis tools.
Indices
=======
* :ref:`genindex`
* :ref:`search`
Table of Contents
^^^^^^^^^^^^^^^^^
.. toctree::
:maxdepth: 2
self
quick_start/QuickStart
tutorial/tutorial
cpp_api/cpp_api
......
.. _quick_start:
Quick Start Guide
Quick-Start Guide
=================
.. toctree::
:maxdepth: 2
......
......@@ -31,7 +31,7 @@ A list containing the set of all operators that will be used during the feature
#### `param_opset`
A list containing the set of all operators, for which the non-linear scale and bias terms will be optimized, that will be used during the feature creation step of SISSO. (If empty none of the available features)
A list containing the set of all operators used during the feature creation step of SISSO for which the non-linear scale and bias terms will be optimized. (If empty, none of the available features are used)
#### `calc_type`
......@@ -39,15 +39,15 @@ The type of calculation to run either regression, log regression, or classificat
#### `desc_dim`
The maximum dimension of the model to be created
The maximum dimension of the model to be created (no default value)
#### `n_sis_select`
The number of features that SIS selects over each iteration
The number of features that SIS selects over each iteration (no default value)
#### `max_rung`
The maximum rung of the feature (height of the tallest possible binary expression tree - 1)
The maximum rung of the feature (height of the tallest possible binary expression tree - 1) (no default value)
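Since `desc_dim`, `n_sis_select`, and `max_rung` have no default values, every input file must set them explicitly. Below is a minimal sketch of generating such a file from Python; the numeric values are hypothetical and only illustrate the required structure.

```python
import json

# Minimal sisso.json sketch: desc_dim, n_sis_select, and max_rung have no
# defaults and must always be given (the values here are placeholders).
inputs = {
    "data_file": "data.csv",
    "property_key": "E_RS - E_ZB",
    "calc_type": "regression",
    "desc_dim": 3,
    "n_sis_select": 10,
    "max_rung": 2,
    "opset": ["add", "sub", "mult", "div"],
}
with open("sisso.json", "w") as fp:
    json.dump(inputs, fp, indent=4)
```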
#### `n_residual`
......
......@@ -2,17 +2,17 @@
This tutorial is based on the [Predicting energy differences between crystal structures: (Meta-)stability of octet-binary compounds](https://analytics-toolkit.nomad-coe.eu/public/user-redirect/notebooks/tutorials/descriptor_role.ipynb) tutorial created by Mohammad-Yasin Arif, Luigi Sbailò, Thomas A. R. Purcell, Luca M. Ghiringhelli, and Matthias Scheffler.
The goal of the tutorial is to teach a user how to use `SISSO++` to find and analyze quantitative models for materials properties.
In particular we will use SISSO to predict the crystal structure (rock salt or zincblende) of a series of octet binaries.
In particular we will use SISSO to predict the crystal structure (rock-salt or zinc-blende) of a series of octet binaries.
The tutorial is split into three parts: 1) how to use the executable to perform the calculations and the Python utilities to analyze the results, 2) how to use only Python to run, analyze, and present the results, and 3) how to solve classification problems using SISSO.
## Outline
The following tutorials are available:
- [Combined Binary and Python](1_combined.md)
- [Python only](2_python.md)
- [Using the Command Line Interface](1_command_line.md)
- [Using the Python Interface](2_python.md)
- [Classification](3_classification.md)
All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.10550) with the goal of predicting whether a material will crystallize in a rock salt or zincblende phase.
All tutorials use the octet binary dataset first described in [PRL-2015](http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.114.105503) with the goal of predicting whether a material will crystallize in a rock-salt or zinc-blende phase.
For all applications of SISSO, a data set has to be passed via a standard `csv` file, where the first row contains the feature and property labels and the first column contains the index label for each sample, for example:
```
Material, energy_diff (eV), rs_A (AA), rs_B (AA), E_HOMO_A (eV), E_HOMO_B (eV),....
......
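Such a file can be loaded directly with `pandas` for a quick sanity check. A minimal sketch, assuming the `data.csv` layout above:

```python
import pandas as pd

# The first row holds the feature/property labels and the first column
# holds the sample index (here the material name).
df = pd.read_csv("data.csv", index_col=0)

print(df.columns.tolist())  # feature and property labels
print(df.index.tolist())    # sample labels, e.g., "AgBr"
```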
......@@ -64,12 +64,11 @@ The standard output provides information about what step the calculation just fi
When all calculations are complete, the code prints a summary of the best 1D, 2D, ..., {desc_dim}D models with their training/testing RMSEs (only the training RMSE if no test set is provided).
Additionally, two output files are stored in `feature_space/`: `SIS_summary.txt` and `selected_features.txt`.
These files are a human-readable (`SIS_summary.txt`) and a computer-readable (`selected_features.txt`) summary of the feature space selected by SIS.
Below are reconstructions of both files for this calculation
Below are reconstructions of both files for this calculation (to see the file contents, click the triangle)
<details>
<summary>feature_space/SIS_summary.txt</summary>
```text
# FEAT_ID Score Feature Expression
0 0.920868624862486329 ((E_HOMO_B / r_p_A) / (r_sigma + r_p_B))
1 0.919657911026942054 ((|r_pi - r_s_A|) / (r_s_A^3))
......@@ -115,12 +114,10 @@ Below are reconstructions of both files for this calculation
38 0.262777418218664849 ((E_LUMO_A^6) / (r_p_B^3))
39 0.253659279222423484 ((E_LUMO_A / r_p_B) * (E_LUMO_B * E_LUMO_A))
#-----------------------------------------------------------------------
```
</details>
<details>
<summary>feature_space/selected_features.txt</summary>
```text
# FEAT_ID Feature Postfix Expression (RPN)
0 9|14|div|18|15|add|div
1 19|12|abd|12|cb|div
......@@ -166,7 +163,7 @@ Below are reconstructions of both files for this calculation
38 10|sp|15|cb|div
39 10|15|div|11|10|mult|mult
#-----------------------------------------------------------------------
```
</details>
In both files the change in rung is indicated by the commented-out dashed (--) line.
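The postfix strings in `selected_features.txt` can be decoded with a small stack-based evaluator. The sketch below is illustrative only: the integer tokens are assumed to index the primary features, and the operator abbreviations (`div`, `add`, `mult`, `abd` for the absolute difference, `cb` for the cube, `sp` for the sixth power) are inferred from the matching expressions in `SIS_summary.txt`.

```python
import numpy as np

# Operator map inferred from the expressions above (an assumption, not an
# exhaustive list of the operators SISSO++ supports).
OPS = {
    "add": lambda a, b: a + b,
    "mult": lambda a, b: a * b,
    "div": lambda a, b: a / b,
    "abd": lambda a, b: np.abs(a - b),  # absolute difference
    "cb": lambda a: a ** 3,             # cube
    "sp": lambda a: a ** 6,             # sixth power
}

def eval_postfix(expr, features):
    """Evaluate a postfix string such as "9|14|div|18|15|add|div"
    against a mapping {feature_index: numpy array}."""
    stack = []
    for token in expr.split("|"):
        if token.isdigit():
            stack.append(features[int(token)])
        else:
            op = OPS[token]
            n_args = op.__code__.co_argcount
            args = [stack.pop() for _ in range(n_args)][::-1]
            stack.append(op(*args))
    return stack.pop()
```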
......@@ -183,7 +180,6 @@ An example of these files is provided here:
<details>
<summary>models/train_dim_2_model_0.dat</summary>
```csv
# c0 + a0 * ((EA_B - IP_A) * (|r_sigma - r_s_B|)) + a1 * ((E_HOMO_B / r_p_A) / (r_sigma + r_p_B))
# Property Label: $E_{RS} - E_{ZB}$; Unit of the Property: eV
# RMSE: 0.0931540779192557; Max AE: 0.356632500670745
......@@ -281,19 +277,19 @@ An example of these files is provided here:
SeZn , 2.631368992806530e-01, 2.463580576975095e-01, 7.384497385908948e-01, -2.320488278555971e+00
TeZn , 2.450012951740060e-01, 1.776248032825628e-01, 2.763715059556858e+00, -2.304848319397327e+00
```
</details>
## Determining the Ideal Model Complexity with Cross-Validation
While the training error always decreases with descriptor dimensionality for a given application, overfitting can reduce the general applicability of the models outside of the training set.
In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation.
As an example we will discuss how to perform leave-out 10% using the command line
As an example we will discuss how to perform leave-out 10% using the command line.
To do this we have to modify the `sisso.json` file to automatically leave out a random sample of the training data and use that as a test set by changing `"leave_out_frac": 0.0,` to `"leave_out_frac": 0.10,`.
<details>
<summary> updated sisso.json file</summary>
```json
{
"data_file": "data.csv",
"property_key": "E_RS - E_ZB",
......@@ -309,7 +305,7 @@ To do this we have to modify the `sisso.json` file to automatically leave out a
"leave_out_inds": [],
"opset": ["add", "sub", "abs_diff", "mult", "div", "inv", "abs", "exp", "log", "sin", "cos", "sq", "cb", "six_pow", "sqrt", "cbrt", "neg_exp"]
}
```
</details>
Now let's make ten cross-validation directories in the working directory, copy `data.csv` and `sisso.json` into each of them, and run a separate calculation in each one.
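One way to script this, a sketch assuming the `sisso++` executable is on your `PATH` (in practice you may launch it through `mpiexec` instead):

```python
import shutil
import subprocess
from pathlib import Path

# Create ten cross-validation directories, each with its own copy of the
# inputs, and run an independent SISSO calculation inside each of them.
for cv in range(10):
    work_dir = Path(f"cv_{cv}")
    work_dir.mkdir(exist_ok=True)
    shutil.copy("data.csv", work_dir)
    shutil.copy("sisso.json", work_dir)
    subprocess.run(["sisso++", "sisso.json"], cwd=work_dir, check=True)
```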
......@@ -337,7 +333,6 @@ A full example of the testing set output file is reproduced below:
<details>
<summary>The test data file cv_0/models/test_dim_2_model_0.dat</summary>
```csv
# c0 + a0 * ((E_HOMO_B / r_p_A) / (r_sigma + r_p_B))
# Property Label: $E_{RS} - E_{ZB}$; Unit of the Property: eV
# RMSE: 0.212994478440008; Max AE: 0.442277221520276
......@@ -361,7 +356,6 @@ A full example of the testing set output file is reproduced below:
BrNa , -1.264287278827400e-01, -1.888626375989341e-01, -8.644123624074346e-01
CSi , 6.690237272359810e-01, 3.948280949265375e-01, -3.351692484156472e+00
```
</details>
## Analyzing the Results with Python
......@@ -386,7 +380,9 @@ To visualize these results we will also use `plot_validation_rmse` at the end, a
Here is an example of the `plot_validation_rmse` output:
<details>
<summary> Cross-Validation results </summary>
![image](command_line/cv/cv_10_error.png)
![image](./command_line/cv/cv_10_error.png)
</details>
These initial results, particularly the high standard error of the mean for the 1D and 3D models, indicate that more cross-validation samples are needed (note: your values will differ because the random samples will differ), so let's increase the total number of samples to 100 and redo the analysis.
......@@ -415,7 +411,8 @@ As can be seen from the standard error measurements the results are now reasonab
<details>
<summary> Converged cross-validation results </summary>
![image](combined/cv/cv_100_error.png)
![image](./command_line/cv/cv_100_error.png)
</details>
Because the validation errors for the three- and four-dimensional models are within each other's error bars and the standard error increases when going to the fourth dimension, we conclude that the three-dimensional model has the ideal complexity.
......@@ -431,11 +428,14 @@ To see the distributions for this system we run
<details>
<summary> Distribution of Errors </summary>
![image](./combined/error_cv.png)
![image](./command_line/cv/error_cv_dist.png)
</details>
One thing that stands out in the plot is the large error seen in a single point for both the one- and two-dimensional models.
By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zincblende structure in the data set.
By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zinc-blende structure in the data set.
As a note, for this setup there is a 0.22% chance that one of the samples is never in the validation set, so if `max_error_ind != 21`, check whether that sample appears in one of the validation sets.
```python
>>> import numpy as np
>>> import pandas as pd
......@@ -585,12 +585,66 @@ To get the final models we will perform the same calculation we started off the
From here we can use `models/train_dim_3_model_0.dat` for all of the analysis.
To generate a parity plot for this model in matplotlib, run the following in Python:
```python
>>> from sissopp.postprocess.plot.parity_plot import plot_model_ml_plot_from_file
>>> plot_model_ml_plot_from_file("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show()
>>> from sissopp.postprocess.plot.parity_plot import plot_model_parity_plot
>>> plot_model_parity_plot("models/train_dim_3_model_0.dat", filename="3d_model.pdf").show()
```
The result is shown below:
<details>
<summary> Final 3D model </summary>
![image](./combined/3d_model.png)
![image](./command_line/cv/3d_model.png)
</details>
Additionally, you can output the model as a MATLAB function or a LaTeX string using the following commands.
```python
>>> from sissopp.postprocess.load_models import load_model
>>> model = load_model("models/train_dim_3_model_0.dat")
>>> print(model.latex_str)
>>> model.write_matlab_fxn("matlab_fxn/model.m")
```
A copy of the generated MATLAB function is below.
<details>
<summary> Matlab function of the Final 3D model </summary>
function P = model(X)
% Returns the value of E_{RS} - E_{ZB} = c0 + a0 * ((r_d_B / r_d_A) * (r_p_B * E_HOMO_A)) + a1 * ((IP_A^3) * (|r_sigma - r_s_B|)) + a2 * ((IP_A / r_p_A) / (r_p_B + r_p_A))
%
% X = [
% r_d_B,
% r_d_A,
% r_p_B,
% E_HOMO_A,
% IP_A,
% r_sigma,
% r_s_B,
% r_p_A,
% ]
if(size(X, 2) ~= 8)
error("ERROR: X must have a size of 8 in the second dimension.")
end
r_d_B = reshape(X(:, 1), 1, []);
r_d_A = reshape(X(:, 2), 1, []);
r_p_B = reshape(X(:, 3), 1, []);
E_HOMO_A = reshape(X(:, 4), 1, []);
IP_A = reshape(X(:, 5), 1, []);
r_sigma = reshape(X(:, 6), 1, []);
r_s_B = reshape(X(:, 7), 1, []);
r_p_A = reshape(X(:, 8), 1, []);
f0 = ((r_d_B ./ r_d_A) .* (r_p_B .* E_HOMO_A));
f1 = ((IP_A).^3 .* abs(r_sigma - r_s_B));
f2 = ((IP_A ./ r_p_A) ./ (r_p_B + r_p_A));
c0 = -1.3509197357e-01;
a0 = 2.8311062079e-02;
a1 = 3.7282871777e-04;
a2 = -2.3703222974e-01;
P = reshape(c0 + a0 * f0 + a1 * f1 + a2 * f2, [], 1);
end
</details>
Performing Classification with SISSO++
---
Finally `SISSO++` can be used to solve classification problems as well as regression problems.
As an example of this we will adapt the previous example by replacing the property with the identifier of if the material favors the rock-salt or zincblende structure, and change the calculation type to be `classification`.
# Performing Classification with SISSO++
Finally, besides regression problems, `SISSO++` can also be used to solve classification problems.
As an example, we will adapt the previous example by replacing the property with an identifier of whether the material favors the rock-salt or zinc-blende structure, and change the calculation type to `classification`.
It is important to note that while this problem only has two classes, multi-class classification is also possible.
## The Data File
Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced with a `Class` column where any negative `E_RS - E_ZB (eV)` is replaced with 0 and any positive value replaced with 1.
Here is the updated data file: the property `E_RS - E_ZB (eV)` is replaced with a `Class` column, where any negative `E_RS - E_ZB (eV)` value maps to 0 and any positive value maps to 1. While this example has only one task and two classes, the method works for an arbitrary number of classes and tasks.
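The conversion itself is a one-liner in `pandas`. A hedged sketch, assuming the `data.csv` file from the regression tutorial with its property column labeled `E_RS - E_ZB (eV)`:

```python
import pandas as pd

# Replace the continuous property with class labels: 0 where E_RS - E_ZB is
# negative (rock-salt favored), 1 otherwise (zinc-blende favored).
df = pd.read_csv("data.csv", index_col=0)
df.insert(0, "Class", (df["E_RS - E_ZB (eV)"] >= 0).astype(int))
df = df.drop(columns=["E_RS - E_ZB (eV)"])
df.to_csv("data_class.csv")
```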
<details>
<summary>Here is the full data_class.csv file for the calculation</summary>
```
# Material,Class,Z_A (nuc_charge) ,Z_B (nuc_charge) ,period_A,period_B,IP_A (eV_IP) ,IP_B (eV_IP) ,EA_A (eV_IP),EA_B (eV_IP) ,E_HOMO_A (eV) ,E_HOMO_B (eV) ,E_LUMO_A (eV),E_LUMO_B (eV) ,r_s_A ,r_s_B ,r_p_A ,r_p_B ,r_d_A ,r_d_B,r_sigma ,r_pi
AgBr,0,47,35,5,4,-8.0580997467,-12.649600029,-1.66659998894,-3.73930001259,-4.71000003815,-8.00100040436,-0.479000002146,0.708000004292,1.32000005245,0.75,1.87999999523,0.879999995232,2.97000002861,1.87000000477,1.570000052448,0.689999938012
AgCl,0,47,17,5,3,-8.0580997467,-13.9018001556,-1.66659998894,-3.97079992294,-4.71000003815,-8.69999980927,-0.479000002146,0.574000000954,1.32000005245,0.680000007153,1.87999999523,0.759999990463,2.97000002861,1.66999995708,1.760000050064,0.63999992609
......@@ -93,7 +93,7 @@ Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced wit
SZn,1,30,16,4,3,-10.1354999542,-11.7951002121,1.08070003986,-2.84489989281,-6.21700000763,-7.10599994659,-1.19400000572,0.64200001955,1.10000002384,0.740000009537,1.54999995232,0.850000023842,2.25,2.36999988556,1.059999942781,0.559999942785
SeZn,1,30,34,4,4,-10.1354999542,-10.9460000992,1.08070003986,-2.75099992752,-6.21700000763,-6.65399980545,-1.19400000572,1.31599998474,1.10000002384,0.800000011921,1.54999995232,0.949999988079,2.25,2.18000006676,0.89999997616,0.599999904638
TeZn,1,30,52,4,5,-10.1354999542,-9.86670017242,1.08070003986,-2.66599988937,-6.21700000763,-6.10900020599,-1.19400000572,0.0989999994636,1.10000002384,0.939999997616,1.54999995232,1.13999998569,2.25,1.83000004292,0.569999992854,0.649999916554
```
</details>
## Running `SISSO++` for Classification Problems
......@@ -147,7 +147,6 @@ The two output files stored in `feature_space/` are also very similar, with the
<details>
<summary>feature_space/SIS_summary.txt</summary>
```text
# FEAT_ID Score Feature Expression
0 2.00218777423865069 (r_sigma + r_p_B)
1 2.0108802733799549 (r_pi - r_p_A)
......@@ -191,7 +190,8 @@ The two output files stored in `feature_space/` are also very similar, with the
38 -0.999999633027590651 (period_A * Z_B)
39 -0.999999625788926316 (Z_B / EA_A)
#-----------------------------------------------------------------------
```
</details>
Additionally, the model files change to better represent the classifier.
The largest changes are in the header, where the coefficients now represent the linear decision boundaries calculated using support-vector machines (SVM).
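In other words, for a `d`-dimensional descriptor the header coefficients now define a separating hyperplane rather than a regression fit; restated as an equation (for the binary case):

```latex
% Linear SVM decision boundary: a sample's class follows from the sign of
% the left-hand side evaluated at its feature values f_i.
c_0 + \sum_{i=0}^{d-1} a_i f_i = 0
```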
......@@ -199,7 +199,6 @@ The estimated property vector in this case refers to the predicted class from SV
<details>
<summary>models/train_dim_2_model_0.dat</summary>
```csv
# [(EA_B * Z_A), (r_sigma + r_p_B)]
# Property Label: $$Class$$; Unit of the Property: Unitless
# # Samples in Convex Hull Overlap Region: 0;# Samples SVM Misclassified: 0
......@@ -297,10 +296,9 @@ The estimated property vector in this case refers to the predicted class from SV
SeZn , 1.000000000000000e+00, 1.000000000000000e+00, -8.252999782560001e+01, 1.849999964239000e+00
TeZn , 1.000000000000000e+00, 1.000000000000000e+00, -7.997999668110000e+01, 1.709999978544000e+00
```
</details>
## Updating the SVM Model the Python Interface
## Updating the SVM Model Using `sklearn`
Because the classification algorithm is based on the overlap region of the convex hulls, the `c` value for the SVM model is set to a fairly high value of 1000.0.
This prioritizes reducing the number of misclassified points, but makes the model more susceptible to overfitting.
To account for this, the Python interface can refit the linear SVM using the `svm` module of `sklearn`, as sketched below.
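A minimal sketch of such a refit done directly with `scikit-learn`, assuming the selected features have already been evaluated into a matrix `X` with class labels `y` (the file names are hypothetical, and the actual `sissopp` helper may differ):

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.loadtxt("features.dat")  # hypothetical file: one column per feature
y = np.loadtxt("labels.dat")    # hypothetical file: class labels (0 or 1)

# Refit the linear decision boundary with a smaller regularization value;
# lowering C from 1000.0 trades a few misclassified points for a wider margin.
svm = LinearSVC(C=1.0)
svm.fit(X, y)
print(svm.coef_, svm.intercept_)
print("misclassified:", int((svm.predict(X) != y).sum()))
```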
......@@ -331,11 +329,14 @@ These changes are a result of different SVM libraries leading to slightly differ
<summary> `SISSO++` Classification </summary>
![image](./classification/sissopp.png)
</details>
<details>
<summary> sklearn SVM </summary>
![image](./classification/c_1000.png)
</details>
However, as we decrease the value of `c`, an increasing number of points are misclassified, suggesting the model was potentially overfitting the data.
......
......@@ -84,8 +84,16 @@ add_test(
COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 ${MPIEXEC_PREFLAGS} "${CMAKE_BINARY_DIR}/bin/sisso++" "${CMAKE_SOURCE_DIR}/tests/exec_test/log_reg/sisso.json" ${MPIEXEC_POSTFLAGS}
)
add_test(
NAME Train_Only
COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 ${MPIEXEC_PREFLAGS} "${CMAKE_BINARY_DIR}/bin/sisso++" "${CMAKE_SOURCE_DIR}/tests/exec_test/no_test_data/sisso.json" ${MPIEXEC_POSTFLAGS}
NAME Log_Regression_Max_Correlation_NE_One
COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 ${MPIEXEC_PREFLAGS} "${CMAKE_BINARY_DIR}/bin/sisso++" "${CMAKE_SOURCE_DIR}/tests/exec_test/log_reg_max_corr/sisso.json" ${MPIEXEC_POSTFLAGS}
)
add_test(
NAME Log_Regression_Generate_Project
COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 ${MPIEXEC_PREFLAGS} "${CMAKE_BINARY_DIR}/bin/sisso++" "${CMAKE_SOURCE_DIR}/tests/exec_test/log_reg_gen_proj/sisso.json" ${MPIEXEC_POSTFLAGS}
)
add_test(
NAME Log_Regression_Max_Correlation_NE_One_Generate_Project
COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 2 ${MPIEXEC_PREFLAGS} "${CMAKE_BINARY_DIR}/bin/sisso++" "${CMAKE_SOURCE_DIR}/tests/exec_test/log_reg_max_corr_gen_proj/sisso.json" ${MPIEXEC_POSTFLAGS}
)
if(BUILD_PARAMS)
add_test(
......
......@@ -32,30 +32,27 @@ ConvexHull1D::ConvexHull1D() :
_n_class(0)
{}
ConvexHull1D::ConvexHull1D(std::vector<int> sizes, const double* prop) :
_sorted_value(std::accumulate(sizes.begin(), sizes.end(), 0), 0.0),
ConvexHull1D::ConvexHull1D(std::vector<int> task_sizes, const double* prop) :
_sorted_value(std::accumulate(task_sizes.begin(), task_sizes.end(), 0), 0.0),
_cls_max(),
_cls_min(),
_sorted_prop_inds(std::accumulate(sizes.begin(), sizes.end(), 0), 0),
_sorted_prop_inds(std::accumulate(task_sizes.begin(), task_sizes.end(), 0), 0),
_cls_start(),
_cls_sz(),
_n_task(sizes.size()),
_n_task(task_sizes.size()),
_n_class(0)
{
initialize_prop(sizes, prop);
initialize_prop(task_sizes, prop);
}
void ConvexHull1D::initialize_prop(std::vector<int> sizes, const double* prop)
void ConvexHull1D::initialize_prop(std::vector<int> task_sizes, const double* prop)
{
// Set the number of tasks and samples
_n_task = sizes.size();
_n_task = task_sizes.size();
_task_scores.resize(_n_task, 0.0);
int n_samp = std::accumulate(sizes.begin(), sizes.end(), 0);
int n_samp = std::accumulate(task_sizes.begin(), task_sizes.end(), 0);
// Set number of classes
std::vector<double> unique_prop_vals;
for(int pp = 0; pp < n_samp; ++pp)
{
......@@ -74,6 +71,9 @@ void ConvexHull1D::initialize_prop(std::vector<int> sizes, const double* prop)
_cls_sz.resize(_n_class * _n_task, 0);
_cls_start.resize(_n_class * _n_task, 0);
std::fill_n(_cls_sz.data(), _cls_sz.size(), 0);
std::fill_n(_cls_start.data(), _cls_start.size(), 0);
// Set the values of the cls vectors and sorted inds
_sorted_value.resize(n_samp, 0.0);
_sorted_prop_inds.resize(n_samp, 0);
......@@ -87,15 +87,15 @@ void ConvexHull1D::initialize_prop(std::vector<int> sizes, const double* prop)
int start = 0;
for(int tt = 0; tt < sizes.size(); ++tt)
for(int tt = 0; tt < task_sizes.size(); ++tt)
{
util_funcs::argsort<double>(
_sorted_prop_inds.data() + start,
_sorted_prop_inds.data() + start + sizes[tt],
_sorted_prop_inds.data() + start + task_sizes[tt],
prop
);
for(int ind = start; ind < start + sizes[tt]; ++ind)
for(int ind = start; ind < start + task_sizes[tt]; ++ind)
{
++_cls_sz[tt * _n_class + cl_ind[prop[ind]]];
}
......@@ -143,7 +143,9 @@ double ConvexHull1D::overlap_1d(double* value, double width)
for(int c1 = 0; c1 < _n_class; ++c1)
{
if(_cls_sz[tt * _n_class + c1] == 0)
{
continue;
}
double min = _cls_min[tt * _n_class + c1];
double max = _cls_max[tt * _n_class + c1];
......
......@@ -51,10 +51,10 @@ public:
/**
* @brief Constructor
*
* @param sizes The size of each task
* @param task_sizes The size of each task
* @param prop The pointer to the property vector
*/
ConvexHull1D(std::vector<int> sizes, const double* prop);
ConvexHull1D(std::vector<int> task_sizes, const double* prop);
/**
* @brief Default constructor
......@@ -64,10 +64,10 @@ public:
/**
* @brief Initialize the projection objects
*
* @param sizes The size of each task
* @param task_sizes The size of each task
* @param prop The pointer to the property vector
*/
void initialize_prop(std::vector<int> sizes, const double* prop);
void initialize_prop(std::vector<int> task_sizes, const double* prop);
/**
* @brief Calculate the projection scores of a set of features to a vector via Pearson correlation
......
......@@ -206,7 +206,7 @@ void LPWrapper::copy_data(const int cls, const std::vector<double*> val_ptrs, co
{
throw std::logic_error("Size of the val_ptrs vector is larger than _n_dim");
}
if(val_ptrs.size() > _n_dim)
if(test_val_ptrs.size() > _n_dim)
{
throw std::logic_error("Size of the test_val_ptrs vector is larger than _n_dim");
}
......
......@@ -22,27 +22,8 @@
#include "classification/SVMWrapper.hpp"
SVMWrapper::SVMWrapper(const int n_class, const int n_dim, const int n_samp, const double* prop) :
_model(nullptr),
_y(prop, prop + n_samp),
_y_est(n_samp),
_x_space(n_samp * (n_dim + 1)),
_x(n_samp),
_coefs(n_class * (n_class - 1) / 2, std::vector<double>(n_dim, 0.0)),
_C(1000.0),
_intercept(n_class * (n_class - 1) / 2, 0.0),
_w_remap(n_dim, 1.0),
_b_remap(n_dim, 0.0),
_n_dim(n_dim),
_n_samp(n_samp),
_n_class(n_class)
{
setup_parameter_obj(_C);
setup_x_space();
_prob.l = _n_samp;
_prob.y = _y.data();
_prob.x = _x.data();
}
SVMWrapper(1000.0, n_class, n_dim, n_samp, prop)
{}
SVMWrapper::SVMWrapper(const int n_class, const int n_dim, const std::vector<double> prop) :
SVMWrapper(n_class, n_dim, prop.size(), prop.data())
......@@ -63,6 +44,14 @@ SVMWrapper::SVMWrapper(const double C, const int n_class, const int n_dim, const
_n_samp(n_samp),
_n_class(n_class)
{
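// Remap the raw property values onto contiguous class indices (0, 1, ...)
// as expected by libsvm, and keep the inverse map so that predictions can
// be translated back to the original labels after training.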
std::vector<double> unique_prop_vals = vector_utils::unique(_y);
std::map<double, double> rep_map;
for(int cc = 0; cc < _n_class; ++cc)
{
_map_prop_vals[static_cast<double>(cc)] = unique_prop_vals[cc];
rep_map[unique_prop_vals[cc]] = static_cast<double>(cc);
}
std::transform(_y.begin(), _y.end(), _y.begin(), [&](double val){return rep_map[val];});
setup_parameter_obj(_C);
setup_x_space();
......@@ -121,6 +110,11 @@ void SVMWrapper::setup_x_space()
void SVMWrapper::copy_data(const std::vector<int> inds, const int task)
{
if((task < 0) || (task >= node_value_arrs::TASK_SZ_TRAIN.size()))
{
throw std::logic_error("The requested task is invalid.");
}
if(inds.size() > _n_dim)
{
throw std::logic_error("Size of the inds vector is larger than _n_dim");
......@@ -210,7 +204,10 @@ void SVMWrapper::train(const bool remap_coefs)
std::transform(_x.begin(), _x.end(), _y_est.begin(), [this](svm_node* sn){return svm_predict(_model, sn);});
_n_misclassified = 0;
for(int ss = 0; ss < _n_samp; ++ss)
{
_n_misclassified += _y[ss] != _y_est[ss];
}
std::transform(_y_est.begin(), _y_est.end(), _y_est.begin(), [this](double val){return _map_prop_vals[val];});
}
void SVMWrapper::train(const std::vector<int> inds, const int task, const bool remap_coefs)
......@@ -251,6 +248,6 @@ std::vector<double> SVMWrapper::predict(const int n_samp_test, const std::vector
x_test[ss] = &x_space_test[ss * (val_ptrs.size() + 1)];
}
std::transform(x_test.begin(), x_test.end(), y_est_test.begin(), [this](svm_node* sn){return svm_predict(_model, sn);});
std::transform(y_est_test.begin(), y_est_test.end(), y_est_test.begin(), [this](double val){return _map_prop_vals[val];});
return y_est_test;
}
......@@ -22,11 +22,13 @@
#ifndef LIBSVM_WRAPPER
#define LIBSVM_WRAPPER
#include <map>
#include <algorithm>
#include <iostream>
#include "external/libsvm/svm.h"
#include "feature_creation/node/value_storage/nodes_value_containers.hpp"
#include "utils/vector_utils.hpp"
// DocString: cls_svm_wrapper
/**
......@@ -50,6 +52,7 @@ protected:
std::vector<double> _w_remap; //!< Prefactors to convert the data to/from the preferred SVM range (0 to 1)
std::vector<double> _b_remap; //!< Prefactors to map the minimum of the features to 0.0
std::map<double, double> _map_prop_vals; //!< Map of the property values to the values used for SVM
const double _C; //!< The C parameter for the SVM calculation