"This tutorial is dedicated to dimensionality reduction techniques (DDR). DDR is a group of methods that transform data from high-dimensional feature space to a low-dimensional feature space. In other words, these methods reduce the number of features either by manually selecting the best representative features (first class) or by combining and creating new representative features (second class). The ultimate goal of DDR techniques is to preserve the variation of the original dataset as much as possible while reducing the number of features. In this tutorial, we focus on the second class of DDR techniques and explain some of the well-known methods in detail and with practical examples. Part of this notebook is inspired by the book \"An Introduction to Statistical Learning: with Applications in R\" by Robert Tibshirani et al. We thank the authors of this book. \n",
"This tutorial is dedicated to dimension reduction techniques (DDR). DDR is a group of methods that transform data from high-dimensional feature space to a low-dimensional feature space. In other words, these methods reduce the number of features either by manually selecting the best representative features (first class) or by combining and creating new representative features (second class). The ultimate goal of DDR techniques is to preserve the variation of the original dataset as much as possible while reducing the number of features. In this tutorial, we focus on the second class of DDR techniques and explain some of the well-known methods in detail and with practical examples. Part of this notebook is inspired by the book \"An Introduction to Statistical Learning: with Applications in R\" by Robert Tibshirani et al. We thank the authors of this book. \n",
"James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert: <span style=\"font-style: italic;\">An Introduction to Statistical Learning: with Applications in R</span>, Springer New York, 2014, 1461471370, 9781461471370 <a href=\"https://link.springer.com/book/10.1007/978-1-4614-7138-7#authorsandaffiliationsbook\" target=\"_blank\">[PDF]</a> .\n",
"James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert: <span style=\"font-style: italic;\">An Introduction to Statistical Learning: with Applications in R</span>, Springer New York, 2014, 1461471370, 9781461471370 <a href=\"https://link.springer.com/book/10.1007/978-1-4614-7138-7#authorsandaffiliationsbook\" target=\"_blank\">[PDF]</a> .\n",
...
@@ -53,7 +53,7 @@
...
@@ -53,7 +53,7 @@
"</td>\n",
"</td>\n",
"</tr></table>\n",
"</tr></table>\n",
"\n",
"\n",
"In most realistic machine learning problems, we need to deal with a large number of features, i.e., a multidimensional dataset that makes it difficult and challenging for us to analyze, visualize and fit a machine learning model. In such problems, one instead of dealing with many features can reduce the number of features by applying a set of methods called data-dimensionality reduction (DDR) techniques. \n",
"In most realistic machine learning problems, we need to deal with a large number of features, i.e., a multidimensional dataset that makes it difficult and challenging for us to analyze, visualize and fit a machine learning model. In such problems, one instead of dealing with many features can reduce the number of features by applying a set of methods called data-dimension reduction (DDR) techniques. \n",
"\n",
"\n",
"In all DDR methods, first, a new set of transformed features are created. For instance, this can be done by a linear combination of the original features and creating a new set of features as follows:\n",
"In all DDR methods, first, a new set of transformed features are created. For instance, this can be done by a linear combination of the original features and creating a new set of features as follows:\n",
"\n",
"\n",
...
@@ -98,7 +98,7 @@
...
@@ -98,7 +98,7 @@
" \n",
" \n",
"</div> -->\n",
"</div> -->\n",
"\n",
"\n",
"Data dimensionality reduction methods are generally divided into two classes depending on how they reduce the number of features:\n",
"Data dimension reduction methods are generally divided into two classes depending on how they reduce the number of features:\n",
"\n",
"\n",
"1. Select the best representative features\n",
"1. Select the best representative features\n",
"\n",
"\n",
...
@@ -630,7 +630,7 @@
...
@@ -630,7 +630,7 @@
"cell_type": "markdown",
"cell_type": "markdown",
"metadata": {},
"metadata": {},
"source": [
"source": [
"UMAP is a novel and powerful data dimensionality reduction method (DDR) introduced in 2018 by Leland McInnes and his colleagues. UMAP not only reduce the dimension but also preserves the clusters and the relationship between them. Although this method is very much similar to t-SNE (they both contruct a highdimenstional graph and map it to a low dimensional graph) its mathematical foundation is different. UMAP has some advantages that makes it more practical than t-SNE. For instance, UMAP is much faster than t-SNE and it preserves the global structure of the data while t-SNE more focuses on local structures not the global structure. These advantages led UMAP to replace t-SNE in the scientific community.\n",
"UMAP is a novel and powerful data dimensionality reduction method (DDR) introduced [in 2018 by Leland McInnes and his colleagues](https://www.youtube.com/watch?v=nq6iPZVUxZU). UMAP not only reduces the dimension but also preserves the clusters and the relationship between them. Although this method has a similar purpose as t-SNE (they both contruct a high-dimenstional graph and map it to a low dimensional graph), its mathematical foundation is different. UMAP has some advantages that makes it more practical than t-SNE. For instance, UMAP is much faster than t-SNE and it preserves the global structure of the data while t-SNE focuses more on local structures rather than the global structure. These advantages are leading UMAP to steadly replace t-SNE in the scientific community.\n",
"Wang, Quan : <span style=\"font-style: italic;\"> UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction </span> arXiv, 2018, 1802.03426 <a href=\"\n",
"Wang, Quan : <span style=\"font-style: italic;\"> UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction </span> arXiv, 2018, 1802.03426 <a href=\"\n",
...
@@ -640,7 +640,7 @@
...
@@ -640,7 +640,7 @@
"\n",
"\n",
"## How does UMAP work?\n",
"## How does UMAP work?\n",
"\n",
"\n",
"UMAP algorithm works in two steps:\n",
"We explain the UMAP based on its [documentation](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html). UMAP algorithm works in two steps:\n",
"\n",
"\n",
"1. Contruct a high dimensional graph\n",
"1. Contruct a high dimensional graph\n",
"2. Find the low dimensional representation of that graph\n",
"2. Find the low dimensional representation of that graph\n",
...
@@ -650,7 +650,7 @@
...
@@ -650,7 +650,7 @@
"UMAP first approximates a manifold on which the data lie. Since we are mostly dealing with finite data, it is assumed that the data is uniformly distributed on the manifold. But in the real world, the data is not so distributed, so we need to assume that the manifold has a Riemannian metric, and then find a metric for which the data is approximately uniformly distributed. \n",
"UMAP first approximates a manifold on which the data lie. Since we are mostly dealing with finite data, it is assumed that the data is uniformly distributed on the manifold. But in the real world, the data is not so distributed, so we need to assume that the manifold has a Riemannian metric, and then find a metric for which the data is approximately uniformly distributed. \n",
"\n",
"\n",
"\n",
"\n",
"Then we make simple combinatorial building blocks, called \"simplicies\" out of data. k-simplex is created by taking the convex hull of k+1 independent points. In the following picture low dimensional simplices are shown. \n",
"Then we make simple combinatorial building blocks, called \"simplicies\" out of data. Each k-simplex is created by taking the convex hull of k+1 independent points. In the following picture low dimensional simplices are shown. \n",
"\n",
"\n",
"<table><tr>\n",
"<table><tr>\n",
"<td> \n",
"<td> \n",
...
@@ -662,11 +662,11 @@
...
@@ -662,11 +662,11 @@
"</td>\n",
"</td>\n",
"</tr></table>\n",
"</tr></table>\n",
"\n",
"\n",
"For our finite data, we can do this by creating a ball with a fixed radius for each data point. Since we assume that the data is uniformly distributed on the manifold, it is easy to choose an appropriate radius; otherwise, a small radius leads to many connected components and a large radius leads to some very high-dimensional simplices that we do not want to work with. As we mentioned earlier, we need to use Reimanian geometry for this assumption. This means that each point has a local metric space associated with it so that we can measure the distance in a meaningful way. In other word, we compute a local notion of distance for each point. In practice, this means that a unit ball around a point extends to the kth nearest neighbor of the point. k is the size of the sample we use to approximate the local notion of distance. Since each point has its own distance function, we can simply select ball of radius one with respect to this local distance function. \n",
"For our finite data, we can do this by creating a ball with a fixed radius for each data point. Since we assume that the data is uniformly distributed on the manifold, it is easy to choose an appropriate radius; otherwise, a small radius leads to many connected components and a large radius leads to some very high-dimensional simplices that we do not want to work with. As we mentioned earlier, we need to use Reimanian geometry for this assumption. This means that each point has a local metric space associated with it so that we can measure the distance in a meaningful way. Actaully we consider a patch for each point and we map it down to an standard euclidean space in a different scale. So there is a different notion of distance for each point depending on where they are located (in a denser and sparser area). In practice, this means that a unit ball around a point extends to the kth nearest neighbor of the point. k is the size of the sample we use to approximate the local notion of distance. Since each point has its own distance function, we can simply select ball of radius one with respect to this local distance function. \n",
"\n",
"\n",
"k is an important parameter ($n$_$neighbor$) in UMAP that determines how local we want to estimate the Riemannian metric. Thus, a small value of k gives us more details about the local structures, while a large value of k gives us more information about the global structures.\n",
"k is an important parameter ($n$_$neighbor$) in UMAP that determines how local we want to estimate the Riemannian metric. Thus, a small value of k gives us more details about the local structures, while a large value of k gives us more information about the global structures.\n",
"\n",
"\n",
"With the Riemannian metric we can even work in a fuzzy topology, which means that there is a certanity for each point to be located in a ball of a given radius (For each point in the neighborhood of a given point, this certainty is also called weights or similarity values). The further we move away from the center of each point, the smaller this certainty becomes. To avoid isolating the data points, especially in high dimensions, we activate fuzzy certainty beyond the first nearest neighbor. This ensures that each point on the manifold has at least one neighbor to which it is connected (certainty for the first neighbor is 100% and for other is between 0% to 100%). This is called local connectivity in UMAP and has a hyperparameter ($local$_$connectivity$) that determines the least number of connections. \n",
"With the Riemannian metric we can even work in a fuzzy topology, which means that there is a certanity for each point to be located in a ball of a given radius (fuzzy value or weights). The further we move away from the center of each point, the smaller this certainty becomes. To avoid isolating the data points, especially in high dimensions, we activate fuzzy certainty beyond the first nearest neighbor. This ensures that each point on the manifold has at least one neighbor to which it is connected (certainty for the first neighbor is 100% and for other is between 0% to 100%). This is called local connectivity in UMAP and has a hyperparameter ($local$_$connectivity$) that determines the least number of connections. \n",
This tutorial is dedicated to data dimensionality reduction (DDR) techniques. DDR is a group of methods that transform data from a high-dimensional feature space to a low-dimensional feature space. In other words, these methods reduce the number of features either by selecting the best representative features (first class) or by combining them and creating new representative features (second class). The ultimate goal of DDR techniques is to preserve as much of the variation of the original dataset as possible while reducing the number of features. In this tutorial, we focus on the second class of DDR techniques and explain some of the well-known methods in detail and with practical examples. Part of this notebook is inspired by the book "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. We thank the authors of this book.

<div class="alert alert-block alert-info">
James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert: <span style="font-style: italic;">An Introduction to Statistical Learning: with Applications in R</span>, Springer New York, 2014, ISBN 1461471370, 9781461471370 <a href="https://link.springer.com/book/10.1007/978-1-4614-7138-7#authorsandaffiliationsbook" target="_blank">[PDF]</a>.
</div>

# Introduction

When we talk about dimensions in machine learning, we refer to the number of features that we consider for our dataset. Typically, for each dataset, we have n data points and p features, X$_1$,X$_2$,...,X$_p$. If we have only two features (X$_1$,X$_2$), then we call it a two-dimensional dataset and it can be visualized simply in a two-dimensional feature space. Adding one more feature (X$_1$,X$_2$,X$_3$) makes a three-dimensional dataset that can be visualized in a three-dimensional space:
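
As a small illustration (not part of the original notebook), we can pick any three features of a dataset, here the scikit-learn wine data that is analyzed later in this tutorial, and look at them in such a three-dimensional feature space:

%% Cell type:code id: tags:

``` python
# Illustration only: plot three of the thirteen wine features in a
# three-dimensional feature space, colored by the wine class.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()
X, y, names = wine.data, wine.target, wine.feature_names

fig = plt.figure(figsize=(6, 5))
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=20)
ax.set_xlabel(names[0]); ax.set_ylabel(names[1]); ax.set_zlabel(names[2])
plt.show()
```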

In most realistic machine learning problems, we need to deal with a large number of features, i.e., a multidimensional dataset that is difficult and challenging to analyze, visualize and fit a machine learning model to. In such problems, instead of dealing with many features, one can reduce the number of features by applying a set of methods called data dimensionality reduction (DDR) techniques.

In all DDR methods, a new set of transformed features is first created. For instance, this can be done by a linear combination of the original features, creating a new set of features as follows:

\begin{equation}
Z_k= \sum\limits_{j=1}^{p} \phi_{jk} X_j
\end{equation}

where $Z_k$ ($k=1,2,...,M$) are the newly created features, $\phi_{jk}$ are constants, $X_j$ are the original features and p is the number of original features. After that, the machine learning model can be fit using these new features. For a linear regression model, we will have:

\begin{equation}
y_i = \theta_0 + \sum\limits_{k=1}^{M} \theta_{k} z_{ik} + \epsilon_i
\end{equation}

where $\theta_{k}$ are the new coefficients. This problem can be solved using the least squares approach by minimizing the residual sum of squares (RSS),

\begin{equation}
RSS=\sum\limits_{i=1}^{n}(y_i-\hat{y}_i)^2
\end{equation}

by properly choosing the coefficients ($\theta_0,\theta_1,...,\theta_M$). $\hat{y}_i$ is the predicted value of the target property. If $M<p$, then the number of features is decreased and we only need to estimate the M+1 coefficients ($\theta_0,\theta_1,...,\theta_M$) instead of the p+1 coefficients ($\beta_0,\beta_1,...,\beta_p$). In fact, we are reducing the dimension from p+1 to M+1, which not only makes the problem simpler but also leads to less computational time and the other advantages listed below:
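
To make this concrete, here is a minimal sketch (not taken from the original notebook) in which the coefficients $\phi_{jk}$ come from PCA, the new features $Z_k$ are computed for a synthetic regression dataset, and a linear model is then fit on them; the choice of M = 3 and the synthetic data are purely illustrative.

%% Cell type:code id: tags:

``` python
# Sketch: reduce p original features to M linear combinations Z = X @ Phi,
# then fit a linear model on the reduced features (here Phi comes from PCA).
# The synthetic regression data and M = 3 are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

M = 3
pca = PCA(n_components=M).fit(X_train)          # columns of Phi = pca.components_.T
Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)

reg = LinearRegression().fit(Z_train, y_train)  # estimates theta_0, ..., theta_M
print("R^2 on the test set with only", M, "features:", reg.score(Z_test, y_test))
```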
<div class="alert alert-block alert-info">
<div class="alert alert-block alert-info">
<b>Some of the DDR Advantages:</b>
<b>Some of the DDR Advantages:</b>
- Less computational time
- Less computational time
- Avoid overfitting
- Avoid overfitting
- Better visualization of data
- Better visualization of data
- Convert non-linear datasets into a linear and separable dataset
- Convert non-linear datasets into a linear and separable dataset
- Convert non-linear datasets into a linear and separable dataset
- Convert non-linear datasets into a linear and separable dataset
- Remove noise to increase the accuracy
- Remove noise to increase the accuracy
</div> -->
</div> -->

Data dimensionality reduction methods are generally divided into two classes depending on how they reduce the number of features:

1. Select the best representative features
2. Combine the features and create a new set of features

Depending on the type of transformation they use, these methods can also be grouped into:

1. Linear methods
2. Non-linear methods

Here, we focus only on the second class and, in particular, explain the following methods:

- Principal component analysis (PCA)
- Kernel PCA
- Multidimensional scaling (MDS)
- Uniform manifold approximation and projection (UMAP)

%% Cell type:markdown id: tags:

# Principal Component Analysis (PCA)

%% Cell type:markdown id: tags:

## Unsupervised Learning

PCA is an unsupervised learning method, so it is appropriate to first describe this learning concept. Supervised learning is very familiar to us: if you simply browse the Internet, you will find numerous problems and models related to this learning approach, in which we deal with labeled data points. The term "labeled data points" means that, besides the n data points and the p features measured on them, there is a response Y (the target property) that is also measured on the data points, and the goal is to predict the target property based on the p features. But what about problems where we do not have a specific target property (unlabeled data) but are still interested in the hidden patterns and relationships in the data?

Unsupervised learning is a tool that helps us deal with these problems. For example, a supermarket that wants to increase its sales can use unsupervised learning to identify certain buying patterns of its customers, e.g., customers who buy a wireless mouse are also more likely to buy batteries, or breakfast cereal is bought most often together with milk. Unlike supervised learning, which is a well-established concept in machine learning with many different and well-known methods, unsupervised learning is less well understood and more challenging. However, with the increasing interest in identifying patterns in data, this research area is gaining more attention.

%% Cell type:markdown id: tags:

## PCA for unsupervised learning

PCA is a DDR method for unsupervised learning that was first formulated by Karl Pearson more than a hundred years ago (in 1901).

In this method, the main task is to find a small set of principal components (PCs). PCs are dimensions that are created by a linear combination of the original features based on the variance of the data points along each dimension. In fact, the data points have the largest variance along the first PC. Assume that we have n data points with a set of p features ($X_1,X_2,...,X_p$). The first PC can be written as:

\begin{equation}
Z_1= \sum\limits_{j=1}^{p} \phi_{j1} X_j = \phi_{11} X_1 + \phi_{21} X_2 + ... + \phi_{p1} X_p
\end{equation}

where $\phi_{j1}$ are normalized elements ($\sum\limits_{j=1}^{p} \phi_{j1}^2 =1$) called the loadings of the first principal component. To calculate the first PC for an n$\times$p dataset, we combine the features as follows:

\begin{equation}
z_{i1}= \sum\limits_{j=1}^{p} \phi_{j1} x_{ij}, \qquad i=1,2,...,n
\end{equation}

where $z_{i1}$ are called the scores of the first PC. We are looking for the combination that has the largest variance, and therefore we need to solve the following optimization problem:

\begin{equation}
\underset{\phi_{11},...,\phi_{p1}}{\mathrm{maximize}} \;\; \frac{1}{n}\sum\limits_{i=1}^{n} \left( \sum\limits_{j=1}^{p} \phi_{j1} x_{ij} \right)^{2} \quad \mathrm{subject \; to} \quad \sum\limits_{j=1}^{p} \phi_{j1}^2 =1
\end{equation}

Note that we also assume that the values of the data points for each feature are centered so that the mean is zero, i.e., $\frac{1}{n} \sum\limits_{i=1}^{n}x_{ij}=0$; therefore the mean of $z_{i1}$ is also zero and, as a consequence, the above formula gives us the variance of $z_{i1}$. After solving the problem using linear algebra techniques, we obtain the first PC and the associated loadings ($\phi_{11},\phi_{21},...,\phi_{p1}$). The first loading vector ($\phi_{1}=(\phi_{11},\phi_{21},...,\phi_{p1})^T$) defines a direction in the feature space along which the data have the largest variance. In the same way, for the second PC, we need to obtain the loading vector $\phi_{2}$ and the corresponding scores with maximum variance, with the restriction that the linear combinations ($z_{i2}$) are not correlated with the first PC. In other words, $\phi_{2}$ should be orthogonal to $\phi_{1}$. The other PCs can be determined in the same way.
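
In practice, this optimization is solved by an eigendecomposition (or SVD) of the covariance matrix of the centered data: the loading vectors $\phi_k$ are its eigenvectors, ordered by decreasing eigenvalue. A small sketch of this (using random data purely for illustration), compared against scikit-learn's PCA:

%% Cell type:code id: tags:

``` python
# Sketch: the loading vectors phi_k are the eigenvectors of the covariance
# matrix of the centered data, ordered by decreasing eigenvalue (variance).
# Random data is used purely for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                  # center each feature

cov = (Xc.T @ Xc) / len(Xc)              # covariance matrix (p x p)
eigval, eigvec = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigval)[::-1]
phi = eigvec[:, order]                   # loading vectors phi_1, phi_2, ...
scores = Xc @ phi                        # z_ik: scores of the PCs

pca = PCA().fit(X)
print("Variance along PC1:", scores[:, 0].var().round(3))
print("Loadings match scikit-learn (up to sign):",
      np.allclose(np.abs(phi.T), np.abs(pca.components_)))
```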

%% Cell type:markdown id: tags:

Now we use PCA to analyze two real-world datasets. The first one is a dataset of 576 perovskite and non-perovskite materials with 30 features (dimensions) and 2 classes, and the second one is a scikit-learn built-in dataset of 178 wines with 13 features and 3 classes.

Since PCA is sensitive to the scale of the features and does not perform well when the ranges of the features are very different, for both of these datasets we first perform feature scaling before applying the PCA algorithm.

Another step before performing PCA is to determine how much information is attributed to each PC and how many PCs are appropriate. We can do this by creating a scree plot showing the proportion of variance explained by each of the PCs, as follows. Although even three PCs do not cover much of the variance for these two datasets (60% for perovskite and 67% for wine), here we use at most three PCs, since our goal is the visualization of the data.
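
Since the perovskite data file is not included with this text, the following sketch carries out the same two steps, standardization and inspection of the explained variance ratios, on the built-in wine dataset; the perovskite data would be treated identically.

%% Cell type:code id: tags:

``` python
# Sketch: feature scaling followed by PCA, and the proportion of variance
# explained by each PC (the numbers behind a scree plot). The wine dataset
# is used here; the perovskite data would be treated the same way.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

pca = PCA().fit(X_scaled)
ratios = pca.explained_variance_ratio_
print("Variance explained by each PC:", np.round(ratios, 3))
print("Cumulative (first 3 PCs):", np.round(np.cumsum(ratios)[:3], 3))
```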

As you can see, PCA divides the data into 5 parts. Since the materials in the perovskite dataset have five different anions ($X = O^{2-}, F^{-}, Cl^-, Br^-, I^-$), we can assign each part to one of these anion groups. This shows that PCA performs well in identifying patterns in this dataset. Now we can even check whether it can identify the perovskite and non-perovskite materials by adding the experimental labels.

%% Cell type:code id: tags:

``` python
# plot the perovskite data sets using 2 PCs with labels
```
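
A minimal sketch of such a labelled two-PC scatter plot, using the wine data as a stand-in because the perovskite file is not bundled here, could look like this:

%% Cell type:code id: tags:

``` python
# Sketch of a labelled 2-PC scatter plot, with the wine dataset standing in
# for the perovskite data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.figure(figsize=(5, 4))
for label in set(y):
    plt.scatter(Z[y == label, 0], Z[y == label, 1], s=15, label=f"class {label}")
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend()
plt.title("First two principal components with class labels")
plt.show()
```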

As can be seen, the PCA method does not work well for identifying perovskite and non-perovskite materials in two or even three dimensions. In fact, dimensionality reduction methods such as PCA are mostly used to preprocess the data and prepare it for training and further analysis with other machine learning methods. In the case of the perovskites, one can use the obtained PCs as new features for a support vector machine (SVM) to classify these materials. In the cell below, we use PCA + SVM to classify the perovskite materials.
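
A sketch of such a PCA + SVM pipeline is given below; because the perovskite file is not available here, the wine data serves as a stand-in, and the number of retained PCs (3) as well as the default RBF-kernel SVC are illustrative choices rather than tuned settings.

%% Cell type:code id: tags:

``` python
# Sketch: use the first PCs as input features for a support vector machine.
# Wine data as a stand-in for the perovskite set; n_components=3 and the
# default RBF-kernel SVC are illustrative choices.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), PCA(n_components=3), SVC())
scores = cross_val_score(model, X, y, cv=5)
print("5-fold CV accuracy:", scores.mean().round(3))
```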

%% Cell type:markdown id: tags:

# Kernel PCA

If a dataset cannot be represented well by linear subspaces, the standard PCA is not useful. For such datasets with more complex structures, we can use kernel PCA, which allows us to analyze the data and detect the hidden patterns.

The kernel PCA can be calculated as follows:

- first we construct the kernel matrix $\kappa(x_i,x_j)=\phi(x_i)^T \phi(x_j)$
- then we solve the following eigenvalue problem to obtain $a_k$:

\begin{equation}
Ka_k = \lambda_k N a_k
\end{equation}

- afterward we calculate the PCs $y_k(x)$:

\begin{equation}
y_k(x) = \sum_{i=1}^{N}a_{ki}\, \kappa(x,x_{i})
\end{equation}

where $k=1,2,...,m$, $a_k=[a_{k1},a_{k2},...,a_{kN}]^T$, $N$ is the number of data points, and $\phi(x_i)$ is a nonlinear transformation of $x_i$ into a (typically much higher-dimensional) feature space. Thanks to the kernel trick, $\phi(x_i)$ never has to be computed explicitly, and only the first $m$ kernel principal components are kept ($m \ll p$). The kernel can be any suitable function, but two commonly used kernels are the polynomial kernel and the Gaussian (RBF) kernel.

<div class="alert alert-block alert-info">
Wang, Quan: <span style="font-style: italic;">Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models</span>, arXiv, 2012, 1207.3538 <a href="https://arxiv.org/abs/1207.3538" target="_blank">[PDF]</a>.
</div>

We now run a kernel PCA on these two datasets and check its performance.
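
A sketch of how this can be done with scikit-learn's KernelPCA; the RBF kernel and the value of gamma are illustrative choices, and the wine data again stands in for both datasets.

%% Cell type:code id: tags:

``` python
# Sketch: kernel PCA with an RBF (Gaussian) kernel on the scaled wine data.
# kernel="rbf" and gamma=0.05 are illustrative choices, not tuned values.
from sklearn.datasets import load_wine
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05)
Z = kpca.fit_transform(X_scaled)
print("Embedded shape:", Z.shape)   # (178, 2): two non-linear components
```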

%% Cell type:markdown id: tags:

# Multidimensional Scaling (MDS)

Like PCA, multidimensional scaling (MDS) aims to reduce the number of features. In this method, we need the dissimilarities between the data points ($d_{ii'}$) rather than the data points themselves ($x_1,x_2,...,x_N$). The $d_{ii'}$ can be the distances (most often chosen as Euclidean distances) between data points.

MDS is divided into different types, such as:

- metric MDS
- non-metric MDS

Metric MDS works with actual (numerical) dissimilarities or similarities, while for non-metric MDS the dissimilarities or similarities are not actual values but ranks. For metric MDS, there are two different scaling approaches, least squares (or Kruskal-Shephard) scaling and classical scaling. In these approaches, we define a loss function, the stress function, which for least squares scaling is as follows:

\begin{equation}
S_M(z_1,...,z_N)=\sum\limits_{i \neq i'} \left( d_{ii'} - \lVert z_i - z_{i'} \rVert \right)^2
\end{equation}

Then we try to find the $z_i$ values such that $S_M$ is minimized, using a gradient descent algorithm. In classical scaling, the similarities $s_{ii'}$ are considered instead ($s_{ii'}=\langle x_i-\bar{x},x_{i'}-\bar{x} \rangle$, a centered inner product). The stress function then has the following form:

\begin{equation}
S_C(z_1,...,z_N)=\sum\limits_{i,i'} \left( s_{ii'} - \langle z_i-\bar{z},z_{i'}-\bar{z} \rangle \right)^2
\end{equation}

Even if we have Euclidean distances, we convert them to centered inner products. Classical MDS would be equivalent to principal component analysis (PCA) if the similarities are centered inner products.

In non-metric MDS, the stress function is defined as follows:

\begin{equation}
S_{NM}(z_1,...,z_N)=\frac{\sum\limits_{i \neq i'} \left[ \lVert z_i - z_{i'} \rVert - \theta(d_{ii'}) \right]^2}{\sum\limits_{i \neq i'} \lVert z_i - z_{i'} \rVert^2}
\end{equation}

Here $\theta$ is an arbitrary increasing function. Depending on which variable is kept fixed ($z_i$ or $\theta$), the stress function can be minimized using gradient descent or isotonic regression. In the following, non-metric MDS is used to analyze the previous datasets.
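
A sketch of non-metric MDS with scikit-learn on the scaled wine data; `metric=False` selects the non-metric variant, and the Euclidean dissimilarities are computed internally from the features.

%% Cell type:code id: tags:

``` python
# Sketch: non-metric MDS on the scaled wine data. metric=False selects the
# non-metric variant; dissimilarities are Euclidean distances by default.
from sklearn.datasets import load_wine
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

mds = MDS(n_components=2, metric=False, random_state=0)
Z = mds.fit_transform(X_scaled)
print("Final value of the stress function:", round(mds.stress_, 3))
```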

# Uniform Manifold Approximation and Projection (UMAP)

%% Cell type:markdown id: tags:

UMAP is a novel and powerful data dimensionality reduction (DDR) method introduced [in 2018 by Leland McInnes and his colleagues](https://www.youtube.com/watch?v=nq6iPZVUxZU). UMAP not only reduces the dimension but also preserves the clusters and the relationships between them. Although this method has a similar purpose as t-SNE (they both construct a high-dimensional graph and map it to a low-dimensional graph), its mathematical foundation is different. UMAP has some advantages that make it more practical than t-SNE. For instance, UMAP is much faster than t-SNE, and it preserves the global structure of the data, while t-SNE focuses more on local structures rather than the global structure. These advantages are leading UMAP to steadily replace t-SNE in the scientific community.

<div class="alert alert-block alert-info">
McInnes, Leland and Healy, John and Melville, James: <span style="font-style: italic;">UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction</span>, arXiv, 2018, 1802.03426 <a href="https://arxiv.org/abs/1802.03426" target="_blank">[PDF]</a>.
</div>

## How does UMAP work?

We explain UMAP based on its [documentation](https://umap-learn.readthedocs.io/en/latest/how_umap_works.html). The UMAP algorithm works in two steps:

1. Construct a high-dimensional graph
2. Find the low-dimensional representation of that graph

### 1. Construct a high-dimensional graph

UMAP first approximates a manifold on which the data lie. Since we are mostly dealing with finite data, it is assumed that the data is uniformly distributed on the manifold. In the real world, however, the data is usually not distributed this way, so we assume that the manifold has a Riemannian metric and then find a metric for which the data is approximately uniformly distributed.

Then we build simple combinatorial building blocks, called "simplices", out of the data. Each k-simplex is created by taking the convex hull of k+1 independent points: a 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, and a 3-simplex a tetrahedron. The following picture shows such low-dimensional simplices.

For our finite data, we can do this by creating a ball with a fixed radius around each data point. If the data really were uniformly distributed on the manifold, it would be easy to choose an appropriate radius; otherwise, a small radius leads to many disconnected components and a large radius leads to some very high-dimensional simplices that we do not want to work with. As mentioned earlier, this is where the Riemannian geometry assumption comes in. It means that each point has a local metric space associated with it in which we can measure distances in a meaningful way. Actually, we consider a patch around each point and map it down to a standard Euclidean space at a different scale, so there is a different notion of distance for each point depending on where it is located (in a denser or a sparser area). In practice, this means that a unit ball around a point extends to the kth nearest neighbor of that point, where k is the size of the sample we use to approximate the local notion of distance. Since each point has its own distance function, we can simply select a ball of radius one with respect to this local distance function.
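
The "unit ball that extends to the k-th nearest neighbor" can be made concrete with a nearest-neighbour query: the distance from each point to its k-th neighbour plays the role of its local unit radius. A small illustrative sketch (random 2D data and k = 5 are arbitrary choices, and this is not the actual UMAP implementation):

%% Cell type:code id: tags:

``` python
# Sketch: the radius of the "unit ball" around each point is the distance to
# its k-th nearest neighbour, so dense regions get small radii and sparse
# regions get large ones. Random 2D data and k = 5 are illustrative choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0th neighbour
distances, _ = nn.kneighbors(X)
local_radius = distances[:, k]                    # distance to the k-th neighbour

print("Smallest local radius:", local_radius.min().round(3))
print("Largest local radius:", local_radius.max().round(3))
```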

k is an important parameter ($n\_neighbors$) in UMAP that determines how locally we want to estimate the Riemannian metric. Thus, a small value of k gives us more details about the local structures, while a large value of k gives us more information about the global structure.

With the Riemannian metric we can even work with a fuzzy topology, which means that there is a certainty for each point to be located in a ball of a given radius (for each point in the neighborhood of a given point, this certainty is also called its weight or similarity value). The further we move away from the center of each point, the smaller this certainty becomes. To avoid isolating data points, especially in high dimensions, we apply the fuzzy certainty only beyond the first nearest neighbor. This ensures that each point on the manifold has at least one neighbor to which it is connected (the certainty for the first neighbor is 100% and for the others it is between 0% and 100%). This is called local connectivity in UMAP and comes with a hyperparameter ($local\_connectivity$) that determines the minimum number of such connections.
<em style="color: grey">A 2d dataset with fuzzy open balls of radius one with a locally varying metric and considering the local connectivity</em>
<em style="color: grey">A 2d dataset with fuzzy open balls of radius one with a locally varying metric and considering the local connectivity</em>
</p>
</p>
</td>
</td>
</tr></table>
</tr></table>

Since we have a local metric for each point, the distance from the first point to the second point, seen from the first point, can differ from the distance seen from the second point. As a consequence, there can be two directed edges between two points, each with a different weight. UMAP resolves this by keeping a single edge between the two points with the combined weight $a+b-a \cdot b$, where $a$ and $b$ are the weights of the edge as seen from the first and the second point, respectively.
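
In code, this combination is just a fuzzy set union applied element-wise to the directed weight matrix; a tiny NumPy sketch (with a made-up 3×3 weight matrix) illustrates it:

%% Cell type:code id: tags:

``` python
# Sketch: combine the two directed edge weights a and b into a single
# undirected weight a + b - a*b (fuzzy set union). W is a made-up example
# of a directed weight matrix with entries between 0 and 1.
import numpy as np

W = np.array([[0.0, 0.9, 0.1],
              [0.4, 0.0, 0.8],
              [0.0, 0.3, 0.0]])

W_sym = W + W.T - W * W.T     # a + b - a.b, applied element-wise
print(W_sym)
```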

Finally, we construct a simplicial complex out of the fuzzy open balls and end up with a single fuzzy simplicial complex. The justification for this procedure is provided by the Nerve Theorem, which states that we can recover the entire topology of a topological space if we form a simplicial complex from that space in a certain way. This fuzzy simplicial complex can be represented as a weighted graph.

### 2. Find the low-dimensional representation of the high-dimensional graph

Once the high-dimensional graph is constructed, UMAP maps it to a low-dimensional graph and optimizes it to obtain a good representation of the first graph.

The low-dimensional graph is constructed using the same procedure we used for the fuzzy topological structure of the data. This time, however, we already know the manifold, e.g., the two-dimensional space, and it is not necessary to derive it as we did before. This means that we do not need to vary the distances in this low-dimensional representation space; instead, we want to use the standard Euclidean distance with respect to the global coordinate system. When optimizing the low-dimensional representation, we should also specify the minimum distance between embedded points, otherwise some points might end up on top of each other. This is specified as a hyperparameter in UMAP ($min\_dist$).

UMAP uses a stochastic gradient descent algorithm to find the best low-dimensional representation by minimizing a cross-entropy cost function defined as follows:

\begin{equation}
CE = \sum\limits_{e \in E} \left[ \omega_h(e) \log \left( \frac{\omega_h(e)}{\omega_l(e)} \right) + \left( 1-\omega_h(e) \right) \log \left( \frac{1-\omega_h(e)}{1-\omega_l(e)} \right) \right]
\end{equation}

Here $E$ is the set of all possible 1-simplices (edges) and $\omega_h(e)$ and $\omega_l(e)$ are the weights of edge $e$ in the high- and low-dimensional space, respectively. Now we employ UMAP on the perovskite and wine datasets.
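
A sketch with the umap-learn package, using the scaled wine data as a stand-in for both datasets; n_neighbors=15 and min_dist=0.1 are illustrative values of the two hyperparameters discussed above.

%% Cell type:code id: tags:

``` python
# Sketch: UMAP embedding of the scaled wine data (stand-in for both datasets).
# n_neighbors=15 and min_dist=0.1 are illustrative values of the two
# hyperparameters discussed above (requires the umap-learn package).
import umap
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X_scaled)
print("Embedding shape:", embedding.shape)   # (178, 2)
```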