前言
好久没写博客了,之前一直在做RelexNet的项目,感觉上是做出了一些还不错的东西,也不知道能不能发一篇论文。这个学期选了anu Statistic Machine Learning这门以前的神课,当然感觉现在好像没有以前那种难度的感觉了,但结合这门课加上PRML这本书的学习,也从数学的角度更好的理解了一些机器学习的原理,不得不说Bishop是真的很爱用贝叶斯来解释一切。
Preface
It is a long time for me not writing the Blog. Recently, I just finished my last honours year of bachelor degree and the thesis. I am not sure if the thesis could be published to a conference, but I will trying to do this. In the last semester, I chose the course COMP4670 (Statistical Machine Learning). Although it may be not as difficult as before, I could also learn much knowledge from this course and PRML (Pattern Recognition and Machine Learning) book. From a mathematical point of view, I have a better understanding of some machine learning principles. I have to say that Bishop really likes to use Bayes to explain everything (lol). I would like to update about Kernel method first, the previous chapters can also be updated slowly.
Introduction
What is PCA used for?
Since we want to know about Kernel PCA, the first step is to understand about PCA. PCA is called principal component analysis, a method used to dimensionality reduction. Its purpose is to find the components that best represent the data. In the field of machine learning, the data we use is usually large and multidimensional. It is complete but often consumes a lot of computing resources for training model. What PCA doing is to reduce the dimensionality of the data while maintaining the main features of the data. Just like if we cannot afford Burberry’s trench coat, we could buy the Uniqlo’s, which only takes One-thirtieth price of Burberry’s. This really saves a lot. Another important role is that PCA can make the data linearly separable, so that machine learning models could fit and classify the data better.
An Example for PCA
Let’s take an example. For the iris dataset (from sklearn package), it has a total of 150 samples, 4 dimensions, and 3 classes. We can use PCA to reduce the data to 2-dimension space and seen in the figure that the decision boundaries are obvious and the model could roughly distinguish the data.
Limits of PCA
So PCA is mainly to eliminate the correlations between variables and assume that the correlations are linear. However, good results cannot be obtained for non-linear dependencies between data. In question2 of assignment2, as shown in the figure, we need to make the data composed of rings linearly separable. It is difficult for us to divide different rings with linear decision boundaries.
Solution
Basic function
So there are 3 solutions I will introduce. The first is basic function. Simply speaking, the basic function makes the data linearly separable by reconstructing the original data in high-dimensional space.
For example, like the first basic function shown in the formula 1, it forms a new 3-d feature space and in the figure we could find the linear decision boundary flats easily.
If the number of feature dimension of the basic function exceeds 3, we can use PCA to remap it back to a two-dimensional space, like the second basic function of formula 2.
However, the disadvantage of doing this is that calculating the covariance matrix will consume a lot of computing resources. Besides the inner product of some basic functions cannot be displayed explicitly.
Kernel PCA
At that time, I realized that since we only need the covariance matrix and eigenvectors of the basic functions, we can use a more effective way to obtain them. That is the kernel PCA method. First of all, because we want to use the Kernel method to implement Kernel PCA, we need to transform the problem from finding the eigenvalues and eigenvectors of the Covariance Matrix (formula 3) to the Kernel Matrix (formula 4).
And each element of K Matrix is the inner product between each basic vector. The basic function of formula 2 has been transformed into such a simple product form (formula 5), also called the complete polynomial kernel function.
After finding the eigenvectors and eigenvalues of K (formula 6), we need to transform it to the value of covariance matrix(formula 7), which only need to multiply the transpose of the basic vector on both sides of the eigen equation of K Matrix. Then the eigenvalues and eigenvectors of the covariance matrix can be obtained.
Tips
And there are some tips that may cause mistakes, which I have met before.
- The first is do not mess up Kernel matrix and Covariance matrix, they look similar, but are not equal.
- The second is that we need to make zero average to matrix K. But Don’t straightly make zero average to matrix K, because what we really want to do is zero averaging basic function. We need to use basic function to derive zero averaging of matrix K (formula 8).
Besides, we should normalize the eigenvectors. Remember, we need to normalize the eigenvector of Covariance Matrix not Kernel Matrix (formula 9).
Last but not least, do not forget to project the data. Even if we don’t get the basic function, we could use the formula to derive the solution by multiply Kernel matrix and eigenvectors of Kernel matrix.
Different Kernel Function
Different choices of kernel function will have different results. If we use the kernel function in the previous slides, which is called the complete polynomial kernel function, the ring dataset cannot be divided. Therefore, at the end of the question 2 of assignment 2, we use the RBF kernel function, which often cannot be expressed explicitly by the basic function. After adjusting the value of gamma in RBF kernel function, we can see that the ring data set can be well linearly separated.
Nerual Network Method
The neural network method was inspired by the logistic regression problem of assignment 1. In assignment 1, we can draw the decision boundary of the XOR problem by inputting the basic function, adjusting the number of hidden numbers and using the sigmoid activation function. I consider if the number of hidden neurons is 2 and the model could be fitted well, then in this two-dimensional space, the decision boundary should be linear. Therefore, I set up a neural network with the architecture.
I adjusted the input to new 3-d features, which may present the circle data well. Then Set two latent variables in hidden layer and add the sigmoid activation function. Finally, the model is optimized by predicting the label. The final result is shown in the figure, and we can clearly find the decision boundaries.