B. Regler, M. Scheffler, and L. M. Ghiringhelli: "TCMI: a non-parametric mutual-dependence estimator for multivariate continuous distributions"
</div>
This interactive notebook includes the original implementation of total cumulative mutual information (TCMI) to reproduce the main results presented in the publication.
TCMI is a measure of the relevance of mutual dependencies based on cumulative probability distributions. TCMI can be estimated directly from sample data and is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets with different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of the set of relevant features that are statistically related to the process or the property of a system, while taking into account the number of data samples as well as the cardinality of the feature subsets.
It is compared to [Cumulative mutual information (CMI)](https://dx.doi.org/10.1137/1.9781611972832.22), [Multivariate maximal correlation analysis (MAC)](http://proceedings.mlr.press/v32/nguyenc14.html), [Universal dependency analysis (UDS)](https://dx.doi.org/10.1137/1.9781611974348.89), and [Monte Carlo dependency estimation (MCDE)](https://dx.doi.org/10.1145/3335783.3335795).
This repository (notebook and code) is released under the [Apache License, Version 2.0](http://www.apache.org/licenses/). Please see the [LICENSE](LICENSE) file.
---
**Important notes:**
<ul style="color: #8b0000; font-style: italic;">
<li>All comparisons have been computed with the Java package <code>MCDE</code> written in Scala, which is not part of the repository. To use the most recent and maintained implementation, please visit <a href="https://github.com/edouardfouche/MCDE">https://github.com/edouardfouche/MCDE</a> and run all examples with 50,000 iterations.</li>
<li>For the sake of simplicity, all results have been cached. However, results can be recalculated after adjusting the respective test sections. Depending on the test, the calculation time ranges from minutes to days.</li>
%% Cell type:markdown id: tags:
## 1. Basic tests
%% Cell type:markdown id: tags:
This section studies some of the properties of total cumulative mutual information. In particular, we check that the
- score is a monotonic function in the order of the conditionals
- score attains its maximum and minimum theoretical values (linear and zero case)
- correction vanishes with increasing number of data samples
- adjusted version of the score is (almost) constant with respect to subset dimensionality and sample size
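
The third property (a correction that vanishes with growing sample size) can be illustrated with a minimal sketch. It does not use TCMI: the absolute Pearson correlation serves as a stand-in score, and the helper `chance_baseline` is a hypothetical name introduced only for this illustration. The sketch estimates the average score that two independent variables obtain by chance, which is the kind of baseline a chance correction is meant to remove.

```python
import numpy as np


def chance_baseline(size, n_repeats=50, seed=0):
    """Mean stand-in score (|Pearson r|, not TCMI) of two independent samples."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        x = rng.uniform(size=size)
        y = rng.uniform(size=size)  # drawn independently of x
        scores.append(abs(np.corrcoef(x, y)[0, 1]))
    return float(np.mean(scores))


sizes = [10, 50, 100, 500]
baselines = {size: chance_baseline(size) for size in sizes}
for size, value in baselines.items():
    print(f"n={size:4d}  mean score under independence: {value:.3f}")
```

Under these assumptions, the chance baseline shrinks as the sample size grows, so a correction term of this kind matters most for small data sets.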
%% Cell type:code id: tags:
``` python
# Test case 1
methods = ['cmi', 'mac', 'uds', 'mcde']
size = 200

# Test case 2
sizes = [10, 50, 100, 500]
n_repeats = 50
dimensions = 4
```
%% Cell type:markdown id: tags:
### 1.1. Monotonicity check and ranking of monotonic functions
%% Cell type:markdown id: tags:
**Test**: Monotonicity check of score<br/>
**Expected**: linear must be first, followed by step functions, zero must be last
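
As a rough illustration of the kind of inputs such a test can use, the sketch below builds a linear function, several step functions, and the zero function on a common grid. The helper `step_function`, the number of steps, and the grid on $[0, 1]$ are assumptions made for this example and are not taken from the notebook's own test code.

```python
import numpy as np


def step_function(x, n_steps):
    """Quantize x in [0, 1] into n_steps equally wide, non-decreasing plateaus."""
    return np.minimum(np.floor(x * n_steps), n_steps - 1) / n_steps


size = 200
x = np.linspace(0.0, 1.0, size)

# Targets ordered from most to least informative about x (assumed test set).
targets = {
    "linear": x.copy(),
    "step_8": step_function(x, 8),
    "step_4": step_function(x, 4),
    "step_2": step_function(x, 2),
    "zero": np.zeros_like(x),
}

for name, y in targets.items():
    non_decreasing = bool(np.all(np.diff(y) >= 0))
    print(f"{name:7s} distinct values: {len(np.unique(y)):3d}  non-decreasing: {non_decreasing}")
```

The fewer distinct values a target takes, the less of the variation in `x` it preserves, which is why the linear case is expected to rank first and the zero case last.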
In this section, we examine a simple feature selection task with a known distribution and nonlinear dependencies between features and the output variable. Essentially, we consider bivariate Gaussian distributions with different sample sizes, add noisy features, and test dependency estimators to find the optimal subset of features. Since the ground truth is known and the problem is two-dimensional, we expect all dependency estimators to select only two features.
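
A data set of this kind could be generated as sketched below. The exact construction used in the notebook is not shown in this section, so the target (an unnormalized standard bivariate Gaussian density evaluated at $(x, y)$), the number of noise features, and the helper name `make_dataset` are all assumptions chosen for illustration.

```python
import numpy as np


def make_dataset(size, n_noise=2, seed=0):
    """Return (features, target, names) for a toy feature selection task."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=size)
    y = rng.normal(size=size)
    # Assumed target: unnormalized standard bivariate Gaussian density at (x, y).
    target = np.exp(-0.5 * (x**2 + y**2))
    # Noise features are independent of both the relevant features and the target.
    noise = rng.normal(size=(size, n_noise))
    features = np.column_stack([x, y, noise])
    names = ["x", "y"] + [f"noise_{i}" for i in range(n_noise)]
    return features, target, names


features, target, names = make_dataset(size=500)
print(names, features.shape, target.shape)
```

In this sketch, only the columns `x` and `y` carry information about the target, so a dependency estimator that works as intended should rank the subset containing exactly these two columns highest.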
### 2.4. Statistical power analysis (95% confidence)
%% Cell type:markdown id: tags:
For the statistical power analysis, we note that all of the above dependency measures converge to the optimal feature subset $\{x,y\}$ for at least 500 data samples, so 500 is chosen as the sample size in the following.
**Test**: Statistical power analysis (95% confidence)<br/>
**Expected**: High statistical power as well as high contrast between the actual score and independence
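
A common way to estimate statistical power, sketched here under stated assumptions, is to take the fraction of scores on dependent data that exceed the 95th percentile of scores on independent data. The stand-in score (absolute Pearson correlation, not TCMI), the noisy linear relation, the number of repetitions, and the definition of contrast as a difference of mean scores are all illustrative choices and may differ from the notebook's own procedure.

```python
import numpy as np


def score(x, y):
    """Stand-in dependence score: absolute Pearson correlation (not TCMI)."""
    return abs(np.corrcoef(x, y)[0, 1])


def power_and_contrast(size=500, n_repeats=50, noise=0.5, confidence=0.95, seed=0):
    """Estimate power and contrast of `score` for a noisy linear relation."""
    rng = np.random.default_rng(seed)
    dependent, independent = [], []
    for _ in range(n_repeats):
        x = rng.uniform(size=size)
        # Dependent case: y is x plus Gaussian noise.
        dependent.append(score(x, x + noise * rng.normal(size=size)))
        # Independent case: y is drawn without reference to x.
        independent.append(score(x, rng.uniform(size=size)))
    threshold = np.quantile(independent, confidence)
    power = float(np.mean(np.array(dependent) > threshold))
    contrast = float(np.mean(dependent) - np.mean(independent))
    return power, contrast


power, contrast = power_and_contrast()
print(f"power: {power:.2f}  contrast: {contrast:.2f}")
```

Replacing `score` with any other dependency estimator leaves the rest of the procedure unchanged, which makes estimators comparable on the same footing.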
1. Friedman - https://sci2s.ugr.es/keel/dataset.php?cod=81 . It has been obtained from the LIACC repository. The original page where the data set can be found is: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
Octet-binary compound semiconductors are materials consisting of two elements formed by groups of I/VII, II/VI, III/V, or IV/IV elements leading to a full valence shell. They crystallize in rock salt (RS) or zinc blende (ZB) structures.
The data set is composed of 82 materials with two atomic species in the unit cell. The objective is to accurately predict the energy difference $\Delta E$ between RS and ZB structures based on 8 electro-chemical atomic properties for each atomic species $A/B$ (in total 16) such as atomic ionization potential $\text{IP}$, electron affinity $\text{EA}$, the energies of the highest-occupied and lowest-unoccupied Kohn-Sham levels, $\text{H}$ and $\text{L}$, and the expectation value of the radial probability densities of the valence $s$-, $p$-, and $d$-orbitals, $r_s$, $r_p$, and $r_d$, respectively.
L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler: Big Data of Materials Science: Critical Role of the Descriptor. Physical Review Letters <strong>114</strong>, 105503 (2015). DOI: <a href="https://dx.doi.org/10.1103/PhysRevLett.114.105503">10.1103/PhysRevLett.114.105503</a>