How does kernel work work

Question

all, I have been learning kernel method for a long time. But I am still not very sure how it works. In my opinion, it works as follows: say $f(x) = \sum_i\alpha_ik(x_i, x)$.

First we need to decide which kernel we should use. The common one is the RBF kernel.

Then we compute all $\alpha_i$. This can be computed using the training data. For this step, I will compute the kernel matrix based on the training dataset and then compute $\alpha_i$.

In the testing part, for each testing point $t$, we need to compute $k(x_i, t)$ where $x_i$ is the $i^{th}$ training data. Then using computed $\alpha_i$, we get $f(t)$ for the testing data $t$.`

If I were wrong, please give me some comments.

probably more suitable for http://stats.stackexchange.com/ (cross validated) — Memming, Jul 13 '15 at 21:14

Alt · Accepted Answer · 2015-07-14T21:25:46.153

Your overall procedure can be correct (depending on the perspective from which we look at the problem), but many details are missing.

First of all, you should understand why we use kernels: kernels implicitly project the data points onto a higher-dimensional space where either a (linear) hyperplane can separate the different classes, or we are provided with richer features for regression.

One good starting point to understand the kernels is the general class of kernel SVMs. There are many good tutorials that explain the details of kernel SVMs, and how they are derived from the dual optimization program of primal SVM formulation. (here is one: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf)

The main detail that you are missing is how we learn the $\alpha_i$'s. In fact when you generate the kernel matrix and form the optimization problem, you can use any quadratic solver to learn the $\alpha_i$'s. Currently most of state-of-the-art solvers use the SMO (Sequential Minimal Optimization) algorithm to learn $\alpha$'s.

The LibSVM library is a great tool that you can use for kernel SVMs (especially if you don't want to implement SMO yourself)

I hope it helps.

Joel · Answer 2 · 2015-07-13T21:55:14.337

Your interpretation is correct. From an approximation point of view, you are trying to approximate some unknown function from sampled points. These sampled points are used in order to "train" $\alpha_i$. This can be done through the determination of the solution of $$\left( K(x_i,x_j) \right)_{i=1,j=1}^M \left( \begin{array}{c} \alpha_1 \\ \vdots \\ \alpha_M\end{array}\right) = \left( \begin{array}{c} f(x_1)\\ \vdots \\ f(x_M)\end{array}\right),$$ which is guaranteed to have a unique solution, since the kernel matrix is positive definite. Once the $\alpha_i$'s are found, we have $$f(x_i) = \sum_{j=1}^M \alpha_j K(x_i,x_j)$$ for each $i$. In other words, the sum on the right produces a function that interpolates the data sampled from $f$.

Kernel functions such as the Gaussian RBF kernels come from what are known as universal reproducing kernel Hilbert spaces (RKHSs). Given a compact set, say $D=[-1,1]^N$, these sorts of RKHSs are dense in the space of continuous functions $C(D)$ (with respect to the supremum norm). Translated, this means that given any continuous function over $D$ and $\epsilon > 0$, for large enough $M$ there exist $x_i$ and $\alpha_i$ for which a $$\sup_{y \in D} \left| f(y) - \sum_{i=1}^M \alpha_i K(y,x_i) \right| < \epsilon,$$ and the search for $\alpha_i$ to approximate an unknown function is then well founded.

If $f$ is itself in the RKHS (and in many cases this holds), then the $\alpha_i$ found by determining the solution to the matrix problem above is sufficient. Since $f$ is unknown, it is assumed that the function is in the RKHS at the outset, and these methods are applied anyway.

Sam Bhatt · Answer 3 · 2015-07-31T10:51:31.083

0

I think an explanation around standard ridge regression might help understand whats going on.

we need to learn $f(x) = \sum_{i=1}^N x_i,w_i = \langle x,w\rangle=x^Tw$

and with a ridge penalty our minimisation problem becomes

$w= argmin_w \|y-Xw \|^2 + \lambda \|w\|^2 $

Which when you do the whole taking derivative of $w$ and setting to zero gives the solution

$w=(X^TX + \lambda I)^{-1} X^Ty$

Through some simple rearranging we can arrive at a form $w=\sum \alpha x$. Here's how

$w=(X^TX + \lambda )^{-1} X^Ty$

$wX^TX + w\lambda = X^Ty$

$w = \frac{1}{\lambda } ( X^Ty - wX^TX) = X^T \frac{1}{\lambda } (y-Xw) $

lets set $ \alpha= \frac{1}{\lambda } (y-Xw)$

and therefore $w = \sum_i \alpha_i x_i $

Now if we substitute this into the above original solution for $w$ and arrive what is known as the dual solution (as opposed to the original primal solution)

$\alpha = (XX^T + \lambda I)^{-1} y$

ok so if we go back to our regression basis $x^T w$ we can write this as

$x^T w = x^T (X^T\alpha) = \langle x, \sum_i \alpha_i x_i \rangle = \sum_i \alpha_i \langle x, x_i \rangle$

So now we have arrived at a form that involves only inner products. And so in a simplistic sense we can replace $\langle x, x_i \rangle$ with $k(x,x_i)$. Obviously there are many important details but this is just some intuition about whats going on.

So the steps now are

compute alpha

$\alpha = (K + \lambda I)^{-1} y$ where $K$ is your kernel/gram matrix

evaluate a new point by $\sum_i \alpha_i \langle x, x_i \rangle$

edited Jul 31 '15 at 10:51

answered Jul 30 '15 at 16:04

Sam Bhatt

332

Hey Sam, this is an unconstrained optimization problem right..why is there dual and primal coming up in this? – Sridhar Thiagarajan Mar 03 '16 at 04:37
No this is ridge regression. So there is a constraint $||w||^2 <C$, i.e $argmin \sum e^2$ s.t. $e=y-x^Tw$ AND $||w||^2 <C$ – Sam Bhatt Mar 04 '16 at 18:44
Please correct me if I'm wrong. For a constrained optimization problem with inequality constraints, we form the Lagrangian which has multipliers alpha and beta. So this can be solved using either primal or dual problem. Both solutions coincide during strong duality. In this example, you have solved the dual problem, which has enabled you to express the algorithm in terms of dot products right – Sridhar Thiagarajan Mar 04 '16 at 21:19
If the are constraints, why hasn't a Lagrangian been made.. Instead you take the gradient wrt w directly and obtain a closed form solution – Sridhar Thiagarajan Mar 04 '16 at 21:20
ahh right i understand. There is an optimisation equivalence between the Lagrangian form and penalty form I wrote. The actual reason behind this (which is what I think you are asking) is discussed here http://stats.stackexchange.com/questions/4371/lagrangian-relaxation-in-the-context-of-ridge-regression – Sam Bhatt Mar 07 '16 at 11:52

How does kernel work work

3 Answers3