Commit 7274879b authored by Thomas Purcell

Merge branch 'master' into joss

parents 6385b732 57047dd3
@@ -309,13 +309,21 @@ Finally information about the number of samples in each task is given
# rock_salt, 4
```
The header of the test data files contains the same information as that of the training file, with an additional line at the end listing all indexes included in the test set:
```
# Test Indexes: [ 0, 5 ]
```
These indexes can be used to reproduce the results by setting `leave_out_inds` to those listed on this line.
After this header, the following data is stored in both files:
```
# Sample ID , Property Value , Property Value (EST) , Feature 0 Value
```
With this data, one can plot and analyze the model, e.g., by using the Python bindings.
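As a minimal sketch of such an analysis done directly with pandas and matplotlib rather than the bindings themselves (the file path is illustrative, and the data rows are assumed to be plain comma-separated values below the `#`-prefixed header):
```python
import matplotlib.pyplot as plt
import pandas as pd

# Load a training output file; header lines start with '#', data rows are assumed not to.
df = pd.read_csv(
    "models/train_dim_2_model_0.dat",
    sep=r"\s*,\s*",
    engine="python",
    comment="#",
    header=None,
)
y_ref = df[1].astype(float)  # Property Value
y_est = df[2].astype(float)  # Property Value (EST)

# Parity plot of the SISSO estimate against the reference property.
plt.scatter(y_ref, y_est)
lims = [min(y_ref.min(), y_est.min()), max(y_ref.max(), y_est.max())]
plt.plot(lims, lims, "k--")
plt.xlabel("Property value")
plt.ylabel("Property value (estimated)")
plt.savefig("parity_plot.png")
```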
## Using the Python Library
To see how the Python interface can be used, refer to the [tutorials](../tutorial/2_python.md).
If you get an error about not being able to load MKL libraries, you may have to run `conda install numpy` to get proper linking.
@@ -298,9 +298,11 @@ An example of these files is provided here:
## Determining the Ideal Model Complexity with Cross-Validation
While the training error always decreases with descriptor dimensionality for a given application, over-fitting can reduce the general applicability of the models outside of the training set.
In order to determine the optimal dimensionality of a model and optimize the hyperparameters associated with SISSO, we need to perform cross-validation.
The goal of cross-validation is to test how generalizable a given model is with respect to new data.
In practice, we perform cross-validation by randomly splitting the data set into separate train/test sets and evaluating the performance of the model on the test set.
As an example, we will discuss how to perform leave-out-10% cross-validation using the command line.
To do this we modify the `sisso.json` file to automatically leave out a random sample of the training data and use it as a test set by changing `"leave_out_frac": 0.0` to `"leave_out_frac": 0.10`,
i.e., in this case SISSO will ignore 8 materials (10% of all data) during training.
In each run, these 8 materials are chosen randomly, so each SISSO run will differ from the others.
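If you prefer to script this change, a minimal sketch using only the Python standard library (the key name is the one quoted above):
```python
import json

# Patch sisso.json so that 10% of the samples are randomly held out as a test set.
with open("sisso.json") as f:
    settings = json.load(f)

settings["leave_out_frac"] = 0.10

with open("sisso.json", "w") as f:
    json.dump(settings, f, indent=4)
```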
@@ -327,26 +329,23 @@ differ from one another.
</details>
Now let's make ten cross-validation directories in the working directory, copy `data.csv` and `sisso.json` into each of them, and run a separate calculation in each.
Note that the decision to begin with ten iterations is arbitrary and is not connected to the amount of data excluded from the training set.
```bash
for ii in `seq -f "%03g" 0 9`; do
    mkdir cv_$ii;
    cp sisso.json data.csv cv_$ii;
    cd cv_$ii;
    mpiexec -n 2 sisso++;
    cd ../;
done
```
Each of these directories has the same kind of output files as the non-cross-validation calculations, with the testing and training data defined in separate files in `cv_$ii/models/`:
```
ls cv_000/models/
test_dim_1_model_0.dat test_dim_3_model_0.dat train_dim_1_model_0.dat train_dim_3_model_0.dat
test_dim_2_model_0.dat test_dim_4_model_0.dat train_dim_2_model_0.dat train_dim_4_model_0.dat
```
The new files have the same information as the training data files, but with an additional line to show which samples were left out of the training set to allow for easy reproducibility:
```
# Test Indexes: [ 2, 19, 41, 42, 50, 59, 60, 69 ]
```
To rerun these exact calculations, change the `"leave_out_inds": [],` line in the `sisso.json` file to the indexes listed in that file, e.g., `"leave_out_inds": [ 2, 19, 41, 42, 50, 59, 60, 69 ]` for the run above.
A full example of the testing set output file is reproduced below:
<details>
<summary>The test data file cv_000/models/test_dim_2_model_0.dat</summary>
@@ -376,10 +375,13 @@ A full example of the testing set output file is reproduced below:
</details>
## Analyzing the Results with Python
*Note: to do this part of the tutorial, the Python bindings must also be built.*
Once all of the calculations are completed, the Python interface provides some useful post-processing tools to analyze the results.
The `jackknife_cv_conv_est` tool provides a way to check the convergence of the cross-validation results with respect to the number of calculations performed.
It uses [jackknife resampling](https://en.wikipedia.org/wiki/Jackknife_resampling) to calculate the mean and standard error of the validation RMSEs across all cross-validation runs.
This data can then be used to estimate the overall validation RMSE for a given problem/set of hyper-parameters and the standard error associated with the random sampling of the test indexes.
It is important to mention that the error bars are based on the standard error of the mean of the validation RMSE, which assumes the sampling error follows a normal distribution.
Because the data set may not represent a uniform sampling of materials space, the standard error of the mean may only be a rough estimate of the true sampling error.
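As a rough sketch of how this tool might be called (the import path and the returned values are assumptions, not taken from this page; see the Python tutorial for the authoritative interface):
```python
# Assumed import path for the post-processing helper described above.
from sissopp.postprocess.check_cv_convergence import jackknife_cv_conv_est

# Assumed interface: a glob pattern of the cross-validation directories goes in,
# the per-dimension mean validation RMSE and its jackknifed standard error come out.
mean_rmse, std_err = jackknife_cv_conv_est("cv_*")
print(mean_rmse)  # one entry per descriptor dimension (1D ... 4D)
print(std_err)
```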
@@ -403,13 +405,17 @@ Here is an example of the `plot_validation_rmse` output:
</details>
These initial results suggest that we need to run more cross-validation samples in order to get converged results (note: your values will differ, as the random samples will be different).
Using these results, we can only clearly state that there is a significant decrease in the validation error when going from a one-dimensional model to a two-dimensional one.
However, because of the large error bars, it is impossible to determine which of the two-, three-, or four-dimensional models is best.
To address this, let's increase the total number of samples to 100 and redo the analysis:
```bash
for ii in `seq -f "%03g" 10 99`; do
    mkdir cv_$ii;
    cp sisso.json data.csv cv_$ii;
    cd cv_$ii;
    mpiexec -n 2 sisso++;
    cd ../;
done
```
@@ -424,7 +430,10 @@ done
[0.0051855 0.00571521 0.00398963 0.00473639]
>>> plot_validation_rmse("cv*", "cv_100._error.png").show()
```
With the additional calculations, we now have relatively well-converged results.
The key criterion used in determining this is the size of the error bars relative to the mean values.
For this example, the estimated validation RMSE for each dimension up to the third lies outside the error bars of the other dimensions, meaning that we can confidently say that the three-dimensional model is better than both the one- and two-dimensional models.
Because the validation errors of the three- and four-dimensional models lie within each other's error bars, and the standard error increases when going to the fourth dimension, we conclude that the three-dimensional model has the ideal complexity.
<details>
<summary> Converged cross-validation results </summary>
@@ -433,7 +442,6 @@ As can be seen from the standard error measurements the results are now reasonab
</details>
## Visualizing the Cross Validation Error
The previous section illustrated how to plot the validation RMSE for each dimension of the model, but the RMSE does not give a complete picture of the model performance.
@@ -450,6 +458,7 @@ To see the distributions for this system we run
</details>
These plots show the histogram of the error for each dimension with the total area normalized to one.
One thing that stands out in the plot is the large error seen for a single point in both the one- and two-dimensional models.
By looking at the validation errors, we find that the point with the largest error is diamond for all model dimensions, which is by far the most stable zinc-blende structure in the data set.
As a note, for this setup there is a 0.22% chance that one of the samples is never in the validation set, so if `max_error_ind != 21`, check whether that sample is in one of the validation sets.
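If you want to perform this check by hand, a minimal pandas sketch (the column layout is assumed from the output header shown earlier, and the 2D model files are used as an example):
```python
import glob
import pandas as pd

# Gather every validation prediction for the 2D model across all CV runs.
frames = []
for fname in sorted(glob.glob("cv_*/models/test_dim_2_model_0.dat")):
    run = pd.read_csv(fname, sep=r"\s*,\s*", engine="python",
                      comment="#", header=None)
    frames.append(run[[0, 1, 2]])

errors = pd.concat(frames, ignore_index=True)
errors.columns = ["material", "property", "property_est"]
errors["abs_error"] = (errors["property"].astype(float)
                       - errors["property_est"].astype(float)).abs()

# The validation sample with the largest absolute error across all runs.
print(errors.loc[errors["abs_error"].idxmax()])
```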
@@ -469,7 +478,8 @@ Index(['C2', 'C2', 'C2', 'C2'], dtype='object', name='# Material')
## Optimizing the Hyperparameters of SISSO
As discussed in the previous example, `desc_dim` is one of the four hyperparameters used in `SISSO++`, with the others being `n_sis_select`, `max_rung`, and `n_residual`.
Of these, `n_sis_select` and `n_residual` need to be optimized together, while `desc_dim` and `max_rung` can be optimized independently.
Due to the factorial increase in both computational time and required memory associated with `max_rung`, only `desc_dim`, `n_sis_select`, and `n_residual` will be optimized in this exercise, but for production purposes `max_rung` will also have to be studied.
Additionally, the exercise will only use relatively small SIS subspace sizes and only go up to a 3D model in order to reduce the computational time.
The first step of this process is to set up nine directories, one for each combination of `n_residual` (1, 5, and 10) and `n_sis_select` (10, 50, and 100), and to modify the base `sisso.json` in each to match these new parameters (Note: the dimension of the final model will be determined in the same way as in the previous example).
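One way to script this setup is sketched below (the directory naming scheme is purely illustrative; the two keys are the ones being scanned, and the base `sisso.json` is the one shown next):
```python
import json
import os
import shutil

# Create one working directory per (n_sis_select, n_residual) combination,
# each holding data.csv and a sisso.json patched to that pair of values.
with open("sisso.json") as f:
    base_settings = json.load(f)

for n_sis in (10, 50, 100):
    for n_res in (1, 5, 10):
        workdir = f"ns_{n_sis}_nr_{n_res}"
        os.makedirs(workdir, exist_ok=True)
        shutil.copy("data.csv", workdir)
        settings = dict(base_settings, n_sis_select=n_sis, n_residual=n_res)
        with open(os.path.join(workdir, "sisso.json"), "w") as fout:
            json.dump(settings, fout, indent=4)
```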
Here is the new base `sisso.json` file:
@@ -570,16 +580,16 @@ ns: 10; nr: 1; [0.15680869 0.17389737 0.16029643] [0.00646652 0.04735888 0.044
ns: 10; nr: 5; [0.15625644 0.12419926 0.15115378] [0.00663913 0.00631875 0.04471696]
ns: 10; nr: 10; [0.15597268 0.12273297 0.10921321] [0.0051855 0.00571521 0.00398963]
ns: 50; nr: 1; [0.15192553 0.12373729 0.13507366] [0.00513279 0.00523007 0.01695709]
ns: 50; nr: 5; [0.15262692 0.12672067 0.11011062] [0.00491465 0.00522488 0.00407753]
ns: 50; nr: 10; [0.15119692 0.13040251 0.10993919] [0.00487215 0.00506964 0.00464264]
ns: 100; nr: 1; [0.15835654 0.13728706 0.12849654] [0.00557331 0.00606579 0.01114889]
ns: 100; nr: 5; [0.15502757 0.14002783 0.12102758] [0.00489507 0.00546934 0.00612467]
ns: 100; nr: 10; [0.14996602 0.13248817 0.1070521 ] [0.00495617 0.0051647 0.00432492]
```
These results indicate that, for the small SIS subspace sizes used here, the validation error is stable with respect to both the number of residuals and the SIS subspace size, given the similarity of the values across all nine settings.
However, it is important to note that this will not always be the case, particularly for larger values of `n_sis_select`.
The data also illustrate how the standard error of the mean is only an approximation to the true sampling error, as the validation errors of the 1D models have a broader-than-expected distribution of values.
For finding the best model overall, an `n_sis_select` of 100 and an `n_residual` of 10 will be used.
This choice was made because it has the lowest validation RMSE of 0.107, but all calculations that use 10 residuals have equivalent performance (at least for these small SIS subspace sizes).
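A sketch of how a summary like the table above could be collected, reusing the assumed `jackknife_cv_conv_est` interface from earlier and the illustrative `ns_*_nr_*` directory layout (each directory is assumed to contain its own set of `cv_*` run directories):
```python
from sissopp.postprocess.check_cv_convergence import jackknife_cv_conv_est  # assumed import path

# Print the per-dimension mean validation RMSE and standard error for every
# (n_sis_select, n_residual) combination.
for n_sis in (10, 50, 100):
    for n_res in (1, 5, 10):
        mean_rmse, std_err = jackknife_cv_conv_est(f"ns_{n_sis}_nr_{n_res}/cv_*")
        print(f"ns: {n_sis}; nr: {n_res};", mean_rmse, std_err)
```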
## Final Training Data
To get the final models, we will perform the same calculation that we started the tutorial with, but with the following `sisso.json` file based on the results from the previous steps:
# Performing Classification with SISSO++
Finally, besides regression problems, `SISSO++` can be used to solve classification problems.
While we have already shown that this problem can be solved with regression, `SISSO++` can also solve problems that cannot be treated as regression problems.
As an example, we will adapt the previous exercise by replacing the property with a label identifying whether the material favors the rock-salt or zinc-blende structure, and change the calculation type to `classification`.
It is important to note that while this problem only has two classes, multi-class classification is also possible.
@@ -99,6 +100,7 @@ Here is the updated data file, with the property `E_RS - E_ZB (eV)` replaced wit
## Running `SISSO++` for Classification Problems
For the settings file, the only difference between solving classification and regression problems is the `calc_type` key, which is now `classification` instead of `regression`; in addition, we reduce `max_rung` to 1.
Changing `max_rung` is not a necessary step; however, if rung 2 features are included here, then a one-dimensional descriptor will perfectly separate the classes.
Normally this would be a good thing, but because we also want to illustrate two-dimensional visualization tools, we will restrict ourselves to a single rung.
Additionally, to make it easier to visualize the model, we will restrict the calculation to two dimensions, but higher-dimensional models are also possible.
```json
{
@@ -298,6 +300,11 @@ The estimated property vector in this case refers to the predicted class from SVM
</details>
### Cross-Validation
While we won't do it here, cross-validation should also be performed for classification problems.
For those calculations, the number of misclassified points in the test set is the most important measure of the error.
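A minimal sketch of how that count could be extracted from the test-set files by hand (assuming they keep the same comma-separated layout as the regression output, with the true class in the second column and the SVM-predicted class in the third):
```python
import glob
import pandas as pd

# Count the misclassified validation samples in each cross-validation run.
for fname in sorted(glob.glob("cv_*/models/test_dim_2_model_0.dat")):
    run = pd.read_csv(fname, sep=r"\s*,\s*", engine="python",
                      comment="#", header=None)
    # Class labels are assumed to be stored as exact numeric IDs.
    n_wrong = int((run[1].astype(float).round()
                   != run[2].astype(float).round()).sum())
    print(fname, n_wrong, "misclassified")
```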
## Updating the SVM Model Using `sklearn`
Because the classification algorithm is based on the overlap region of the convex hulls, the `c` value for the SVM model is set to a fairly high value of 1000.0.
This prioritizes reducing the number of misclassified points, but it does make the model more susceptible to over-fitting.
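A minimal sketch of how the selected descriptor could be re-fit with a softer margin in `scikit-learn` (a linear SVM, the file path, and the column layout are all assumptions based on the output format discussed above; `C=1.0` is simply scikit-learn's default rather than a recommended value):
```python
import pandas as pd
from sklearn.svm import LinearSVC

# Load the training output of the 2D classification model.
# Assumed columns: sample ID, class label, predicted class, feature 0, feature 1.
df = pd.read_csv("models/train_dim_2_model_0.dat", sep=r"\s*,\s*",
                 engine="python", comment="#", header=None)
X = df[[3, 4]].astype(float).to_numpy()  # the two descriptor components
y = df[1].astype(float).to_numpy()       # the true class labels

# Re-fit with a much smaller C than the 1000.0 used internally by SISSO++,
# trading a few misclassified training points for a wider margin.
clf = LinearSVC(C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```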
@@ -210,7 +210,7 @@ def read_csv(
# Create Primary Feature Space
phi_0 = []
for feat_ind, col in enumerate(cols):
    data, label, unit = extract_col(df, col, False)
    phi_0.append(
        FeatureNode(