Understanding the existence of a kernel function that transforms non linearly separable samples to separable samples in general

Question

I'm studying section $5.11$ on support vector machines from Duda and Hart's Pattern Classification. The authors write:

With an appropriate nonlinear mapping $\phi$ to a sufficiently high dimension, data from two categories can always be separated by a hyperplane (Problem 27)

Suppose we have $n$ points of dimension $d$ with $n_1, n_2$ points in classes $1,2$ respectively. There seem to be many posts that either show visualizations of kernels transforming non-linearly separable representations to separable ones, but very little information describing how these kernel functions are chosen for examples that are not hand-crafted, and not much proof of linear separability in the transformed space. Posts like these (https://stackoverflow.com/questions/3997454/help-me-understand-linear-separability-in-a-binary-svm).

I'm looking for a hint why such a kernel function always exists, assuming no restraints on the dimension of the transformed space.

Not sure I understand the question, but do you know the Gaussian kernel? It is universal and can hence approximate any decision boundary. The dimension of its feature space is infinite — Claudio Moneo, Jan 02 '23 at 21:27
Can you show that the embedding of the data given by a Gaussian kernel is linearly separable? My first rather simple attempts to do so, were not successful. — Hagen Knaf, Jan 03 '23 at 16:16
@ClaudioMoneo I'm assuming you mean $$K(x,y) = e^{\frac{-||x-y||^2}{2 \sigma^2}}$$ not sure what you mean by an infinite feature space? — IntegrateThis, Jan 04 '23 at 07:06

score 2 · Accepted Answer · answered Jan 03 '23 at 16:13

2

In the exercise it is not required that the non-linear map $\phi$ is determined by a kernel function. If $x_1,\ldots ,x_{n_1}\in\mathbb{R}^d$ are the samples of class $1$ and $y_1,\ldots ,y_{n_2}$ are those of class $2$ one could map the samples $x_i$ to $(x_i,0)\in\mathbb{R}^{d+1}$ and $y_i$ to $(y_i,1)$ and extend the resulting map somehow to the whole of $\mathbb{R}^d$, if one wishes so. The hyperplane $x_{d+1}=\frac{1}{2}$ separates the two classes.

answered Jan 03 '23 at 16:13

Hagen Knaf

8,932

I'm thinking about this a bit. So if I label the samples in class 1 with last component $-1$ rather than $0$, and then consider a normal vector $w = (0, \ldots, 1)$, then for all elements in class $1$, $w \cdot y_i = -1 < 0$ and for all elements in class $2$, $w \cdot x_i = 1 > 0$. Is this the idea? – IntegrateThis Jan 04 '23 at 07:15
Yes, that's a way to look at it. – Hagen Knaf Jan 04 '23 at 08:57

score 1 · Answer 2 · answered Jan 04 '23 at 08:56

Claudio Moneo is right: using the Gaussian kernel yields linear separability. One considers the map

$ \phi :X\rightarrow V,\; x\mapsto K(\cdot,x) $

where $K$ is the Gaussian kernel of bandwidth $1$ say, $X=\{x_1,\ldots ,x_{n_1},y_1,\ldots ,y_{n_2}\}$ and $V$ is the vector space generated by the functions $K(\cdot ,x)$. It is known, that these functions then form a basis of $V$. So we can define a linear functional $f:V\rightarrow\mathbb{R}$ by $f(\phi(x_i)):=-1$ for all $i$ and $f(\phi(y_j))=1$ for all $j$. The hyperplane $f^{-1}(0)$ then separates the classes.

Claudio Moneo · Answer 3 · 2023-01-04T12:35:09.283

I am going to give an answer based on the Gaussian kernel $$K(x,y) = \exp(- \gamma \|x-y \|^2)$$ First note that any PSD kernel is associated with a Hilbert space $H$ that it implicitly maps points to through its feature map $\phi$. This feature map satisfies $$K(x,y) = \langle \phi(x), \phi(y) \rangle$$ for all $x,y \in \mathbb{R}^d$. These Hilbert spaces are actually reproducing kernel Hilbert spaces (RKHS), on which there is a vast literature. You could take a look at wikipedia if you want.

The Gaussian kernel satisfies the nice property of being universal, which means that for any two compact, disjoint sets $A,B \subset \mathbb{R}^d$ there exists some $w \in H$ such that $$sgn(\langle w , \phi(x) \rangle) = 1_A(x) - 1_B(x)$$ Note that this implies that we have linearly separated $A$ and $B$ in the space $H$, which is what you wanted (just choose $A,B$ as subsets of your dataset corresponding to whatever labels your points have).

For the Gaussian kernel, one can show that $H$ is of infinite dimension. A feature map is given by $$\phi(x) = exp(- \gamma \|x- \cdot \|^2)$$ Note that we map a point to a function (a Gaussian centered at that very point, to be specific)!

So why is the Gaussian kernel universal? There are more elegant proofs on this (using e.g. Stone Weierstrass theorem), but if you trust me that for $n$ distinct points, the vectors $v_i = \phi(x_i)$ are linearly independent in the feature space $H$ (which makes sense because no Gaussian can be written as the linear combination of any weighted sum of Gaussians centered at other points) then there is a quick way to see this:

For $n$ linearly independent points $v_1,\dots,v_n$ we may simply define the map $$f(v_i) = 1_A(x_i) - 1_B(x_i)$$ and extend it to a linear functional on the linear span of the $n$ points $v_1,\dots,v_n$ in the feature space. By Riesz representation, there must hence exist $w \in H$ such that $$f(v_i) = \langle w, \phi(x_i) \rangle$$

Understanding the existence of a kernel function that transforms non linearly separable samples to separable samples in general

3 Answers3