Kernels MIT 15.097 Course Notes

Cynthia Rudin

Credits: Bartlett, Schölkopf and Smola, Cristianini and Shawe-Taylor

The kernel trick that I'm going to show you applies much more broadly than SVMs, but we'll use it for SVMs.

Warning: This lecture is technical. Try to get the basic idea even if you don't catch all the details.

Basic idea: If you can't separate positives from negatives in a low-dimensional space using a hyperplane, then map everything to a higher dimensional space where you can separate them.

The picture looks like this but it's going to be a little confusing at this point:

[Figure: data points (x's and o's) that are not linearly separable in the input space X become separable by a hyperplane after the map Φ sends them into the feature space F, where they appear as Φ(x) and Φ(o). Left panel: input space; right panel: feature space. Image by MIT OpenCourseWare.]

Say I want to predict whether a house on the real-estate market will sell today or not:

x = \left( x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}, \ldots \right),

where x^{(1)} is the house's list price, x^{(2)} its estimated worth, x^{(3)} the length of time on the market, x^{(4)} whether it is in a good area, and so on.

We might want to consider something more complicated than a linear model:

Example 1: \Phi : [x^{(1)}, x^{(2)}] \mapsto \Phi([x^{(1)}, x^{(2)}]) = [x^{(1)2}, x^{(2)2}, x^{(1)}x^{(2)}]

The 2d space gets mapped to a 3d space. We could have the inner product in the 3d space:

\Phi(x)^T \Phi(z) = x^{(1)2}z^{(1)2} + x^{(2)2}z^{(2)2} + x^{(1)}x^{(2)}z^{(1)}z^{(2)}.

Example 2: \Phi : [x^{(1)}, x^{(2)}, x^{(3)}] \mapsto \Phi([x^{(1)}, x^{(2)}, x^{(3)}])
= [x^{(1)2}, x^{(1)}x^{(2)}, x^{(1)}x^{(3)}, x^{(2)}x^{(1)}, x^{(2)2}, x^{(2)}x^{(3)}, x^{(3)}x^{(1)}, x^{(3)}x^{(2)}, x^{(3)2}],

and we can take inner products in the 9d space, similarly to the last example.


Rather than apply SVM to the original x's, apply it to the \Phi(x)'s instead.

The kernel trick is that if you have an algorithm (like SVM) where the examples appear only in inner products, you can freely replace the inner product with a different one. (And it works just as well as if you designed the SVM with some map \Phi(x) in mind in the first place, and took the inner product in \Phi's space instead.)

Remember the optimization problem for SVM?

\max_{\alpha} \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,k=1}^m \alpha_i \alpha_k y_i y_k \underbrace{x_i^T x_k}_{\text{inner product}}

\text{s.t.} \ \ 0 \le \alpha_i \le C, \ i = 1, \ldots, m, \quad \text{and} \quad \sum_{i=1}^m \alpha_i y_i = 0

You can replace this inner product with another one, without even knowing \Phi. In fact there can be many different feature maps that correspond to the same inner product.

In other words, we'll replace x^T z (i.e., \langle x, z \rangle_{\mathbb{R}^n}) with k(x_i, x_j), where k happens to be an inner product in some feature space: k(x, z) = \langle \Phi(x), \Phi(z) \rangle_{\mathcal{H}_k}. Note that k is also called the kernel (you'll see why later).
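To make the substitution concrete, here is a minimal numerical sketch (Python/NumPy; the toy data, the alpha values, and the helper names are made up for illustration, and nothing here actually solves the SVM). It only shows that the dual objective touches the examples through their pairwise inner products, so swapping in a kernel changes nothing structurally:

    import numpy as np

    # Toy data: m examples in R^2 with labels +/-1 (made up for illustration).
    X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])
    y = np.array([1.0, -1.0, -1.0, 1.0])

    def linear_kernel(x, z):
        return x @ z                      # the ordinary inner product x^T z

    def poly_kernel(x, z, c=1.0, d=2):
        return (x @ z + c) ** d           # k(x, z) = (x^T z + c)^d

    def dual_objective(alpha, X, y, kernel):
        # sum_i alpha_i - 1/2 sum_{i,k} alpha_i alpha_k y_i y_k k(x_i, x_k)
        m = len(y)
        K = np.array([[kernel(X[i], X[k]) for k in range(m)] for i in range(m)])
        return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

    alpha = np.array([0.1, 0.2, 0.1, 0.2])   # some feasible alphas (not optimized)
    print(dual_objective(alpha, X, y, linear_kernel))   # original SVM dual objective
    print(dual_objective(alpha, X, y, poly_kernel))     # same formula, kernelized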

Example 3: We could make k(x, z) the square of the usual inner product:

k(x, z) = \langle x, z \rangle_{\mathbb{R}^n}^2 = \left( \sum_{j=1}^n x^{(j)} z^{(j)} \right)^2 = \sum_{j=1}^n \sum_{k=1}^n x^{(j)} x^{(k)} z^{(j)} z^{(k)}.

But how do I know that the square of the inner product is itself an inner product in some feature space? We'll show a general way to check this later, but for now, let's see if we can find such a feature space.

Well, for n = 2, if we used \Phi([x^{(1)}, x^{(2)}]) = [x^{(1)2}, x^{(2)2}, x^{(1)}x^{(2)}, x^{(2)}x^{(1)}] to map into a 4d feature space, then the inner product would be:

\Phi(x)^T \Phi(z) = x^{(1)2}z^{(1)2} + x^{(2)2}z^{(2)2} + 2x^{(1)}x^{(2)}z^{(1)}z^{(2)} = \langle x, z \rangle_{\mathbb{R}^2}^2.


So we showed that k is an inner product for n = 2 because we found a feature space corresponding to it.

For n = 3 we can also find a feature space, namely the 9d feature space from Example 2 would give us the inner product k. That is,

\Phi(x) = (x^{(1)2}, x^{(1)}x^{(2)}, \ldots, x^{(3)2}) \quad \text{and} \quad \Phi(z) = (z^{(1)2}, z^{(1)}z^{(2)}, \ldots, z^{(3)2}),

so that

\langle \Phi(x), \Phi(z) \rangle_{\mathbb{R}^9} = \langle x, z \rangle_{\mathbb{R}^3}^2.

That's nice.
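Here is a quick numerical sanity check of that claim (a sketch in Python/NumPy; the particular vectors are arbitrary): the squared inner product agrees with the inner product of the explicit 4d features from the n = 2 case.

    import numpy as np

    def phi(x):
        # 4d feature map for n = 2: [x1^2, x2^2, x1*x2, x2*x1]
        return np.array([x[0]**2, x[1]**2, x[0]*x[1], x[1]*x[0]])

    x = np.array([1.0, 3.0])
    z = np.array([2.0, -1.0])

    k_direct  = (x @ z) ** 2        # k(x, z) = <x, z>^2, computed in R^2
    k_feature = phi(x) @ phi(z)     # <Phi(x), Phi(z)>, computed in R^4

    print(k_direct, k_feature)      # both equal (1*2 + 3*(-1))^2 = 1.0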

We can even add a constant inside the square, so that k is the inner product plus a constant, all squared.

Example 4:

k(x, z) = (x^T z + c)^2 = \left( \sum_{j=1}^n x^{(j)} z^{(j)} + c \right) \left( \sum_{\ell=1}^n x^{(\ell)} z^{(\ell)} + c \right)

= \sum_{j=1}^n \sum_{\ell=1}^n x^{(j)} x^{(\ell)} z^{(j)} z^{(\ell)} + 2c \sum_{j=1}^n x^{(j)} z^{(j)} + c^2

= \sum_{j,\ell=1}^n (x^{(j)} x^{(\ell)})(z^{(j)} z^{(\ell)}) + \sum_{j=1}^n (\sqrt{2c}\, x^{(j)})(\sqrt{2c}\, z^{(j)}) + c^2,

and in n = 3 dimensions, one possible feature map is:

\Phi(x) = [x^{(1)2}, x^{(1)}x^{(2)}, \ldots, x^{(3)2}, \sqrt{2c}\, x^{(1)}, \sqrt{2c}\, x^{(2)}, \sqrt{2c}\, x^{(3)}, c],

and c controls the relative weight of the linear and quadratic terms in the inner product.
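The same kind of numerical check works here (a sketch in Python/NumPy; n = 3, and the value of c and the vectors are arbitrary): the explicit 13-dimensional map above reproduces (x^T z + c)^2.

    import numpy as np

    c = 2.0

    def phi(x):
        # Feature map for n = 3 and k(x, z) = (x^T z + c)^2:
        # all products x_j x_l, then sqrt(2c) x_j, then the constant c.
        quad = np.outer(x, x).ravel()        # 9 quadratic terms
        lin = np.sqrt(2 * c) * x             # 3 linear terms
        return np.concatenate([quad, lin, [c]])

    x = np.array([1.0, -2.0, 0.5])
    z = np.array([0.0, 1.0, 3.0])

    print((x @ z + c) ** 2)    # kernel computed directly in R^3
    print(phi(x) @ phi(z))     # same value via the explicit 13d feature map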

Even more generally, if you wanted to, you could choose the kernel to be any higher power of the regular inner product.

Example 5: For any integer d \ge 2, we can take

k(x, z) = (x^T z + c)^d,

where the feature space \Phi(x) will be of dimension \binom{n+d}{d}, with terms for all monomials up to and including degree d. The decision boundary in the feature space (of course) is a hyperplane, whereas in the input space it's a polynomial of degree d. Now do you understand that figure at the beginning of the lecture?

Because these kernels give rise to polynomial decision boundaries, they are called polynomial kernels. They are very popular.

[Figure omitted. Courtesy of Dr. Hagen Knaf. Used with permission.]
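To get a feel for why computing k(x, z) directly matters, here is a small sketch (Python; the values of n and d are arbitrary) that just counts the feature-space dimension \binom{n+d}{d} from Example 5. The kernel itself is always an n-dimensional dot product plus a power, no matter how large that count gets:

    from math import comb

    # k(x, z) = (x^T z + c)^d has one feature per monomial of degree <= d,
    # i.e. a feature space of dimension C(n + d, d).
    for n in (2, 10, 100):
        for d in (2, 5, 10):
            print(f"n={n:3d}, d={d:2d}: feature-space dimension = {comb(n + d, d)}")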

Beyond these examples, it is possible to construct very complicated kernels, even ones that have infinite dimensional feature spaces, that provide a lot of modeling power:


[Figure omitted. © mlpy Developers. All rights reserved. This content is excluded from our Creative Commons license.]

SVMs with these kinds of fancy kernels are among the most powerful ML algorithms currently (a lot of people say the most powerful). But the solution to the optimization problem is still a simple linear combination, even if the feature space is very high dimensional.

How do I evaluate f(x) for a test example x then? Since w = \sum_i \alpha_i y_i \Phi(x_i), we get f(x) = \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b = \sum_i \alpha_i y_i k(x_i, x) + b, so we only ever need kernel evaluations, never \Phi itself. If we're going to replace x_i^T x_k everywhere with some function of x_i and x_k that is hopefully an inner product from some other space (a kernel), we need to ensure that it really is an inner product. More generally, we'd like to know how to construct functions that are guaranteed to be inner products in some space. We need to know some functional analysis to do that.
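Concretely, once we have the dual variables, scoring a test point needs only kernel evaluations. Here is a minimal sketch (Python/NumPy); the training points, labels, alpha's, and b below are placeholders for illustration, not the output of actually solving the SVM:

    import numpy as np

    def poly_kernel(x, z, c=1.0, d=2):
        return (x @ z + c) ** d

    # Placeholder training data, labels, dual variables, and intercept
    # (in practice alpha and b come from solving the dual problem).
    X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    y_train = np.array([1.0, 1.0, -1.0])
    alpha   = np.array([0.3, 0.3, 0.6])
    b       = 0.1

    def f(x_test):
        # f(x) = sum_i alpha_i y_i k(x_i, x) + b  -- Phi is never computed.
        return sum(a * yi * poly_kernel(xi, x_test)
                   for a, yi, xi in zip(alpha, y_train, X_train)) + b

    print(f(np.array([0.5, 0.5])))   # the sign of this gives the predicted label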


Roadmap

1. make some definitions (inner product, Hilbert space, kernel)
2. give some intuition by doing a calculation in a space with a finite number of states
3. design a general Hilbert space whose inner product is the kernel
4. show it has a reproducing property - now it's a Reproducing Kernel Hilbert space
5. create a totally different representation of the space, which is a more intuitive way to express the kernel (similar to the finite state one)
6. prove a cool representer theorem for SVM-like algorithms
7. show you some nice properties of kernels, and how you might construct them

Definitions

An inner product takes two elements of a vector space \mathcal{X} and outputs a number. An inner product could be a usual dot product: \langle u, v \rangle = u \cdot v = \sum_i u^{(i)} v^{(i)}, or it could be something fancier. An inner product \langle \cdot, \cdot \rangle must satisfy the following conditions:

1. Symmetry: \langle u, v \rangle = \langle v, u \rangle \quad \forall u, v \in \mathcal{X}

2. Bilinearity: \langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle \quad \forall u, v, w \in \mathcal{X}, \ \forall \alpha, \beta \in \mathbb{R}

3. Strict Positive Definiteness: \langle u, u \rangle \ge 0 \quad \forall u \in \mathcal{X}, \quad \text{and} \quad \langle u, u \rangle = 0 \iff u = 0.


An inner product space (or pre-Hilbert space) is a vector space together with an inner product.

A Hilbert space is a complete inner product space. (`Complete' means Cauchy sequences converge to elements of the space - there aren't any "holes" in the space.)

Examples of Hilbert spaces include:

• The vector space \mathbb{R}^n with \langle u, v \rangle_{\mathbb{R}^n} = u^T v, the vector dot product of u and v.

• The space \ell_2 of square summable sequences, with inner product \langle u, v \rangle_{\ell_2} = \sum_{i=1}^{\infty} u_i v_i.

• The space L_2(\mathcal{X}, \mu) of square integrable functions, that is, functions f such that \int f(x)^2 \, d\mu(x) < \infty, with inner product \langle f, g \rangle_{L_2(\mathcal{X}, \mu)} = \int f(x) g(x) \, d\mu(x).
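To make the last example concrete, here is a tiny numerical sketch (Python/NumPy; the functions f and g, the interval [0, 1], and the grid are arbitrary choices, with \mu taken to be Lebesgue measure) approximating an L_2 inner product:

    import numpy as np

    # Approximate <f, g>_{L2([0,1])} = integral of f(x) g(x) dx with a Riemann sum.
    xs = np.linspace(0.0, 1.0, 100_000, endpoint=False)
    dx = 1.0 / len(xs)
    f = xs          # f(x) = x
    g = xs ** 2     # g(x) = x^2

    print(np.sum(f * g) * dx)   # about 0.25, since integral_0^1 x^3 dx = 1/4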

Finite States

Say we have a finite input space {x_1, ..., x_m}. So there are only m possible states for the x_i's. (Think of the bag-of-words example, where there are 2^(#words) possible states.) I want to be able to take inner products between any two of them using my function k as the inner product. Inner products by definition are symmetric, so k(x_i, x_j) = k(x_j, x_i). In other words, in order for us to even consider k as a valid kernel function, the matrix

K = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{bmatrix}

needs to be symmetric, and this means we can diagonalize it, and the eigendecomposition takes this form:

K = V \Lambda V^T,

where V is an orthogonal matrix whose columns are the eigenvectors v_t, and \Lambda is a diagonal matrix with the eigenvalues \lambda_t on the diagonal.
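Here is a small sketch (Python/NumPy; the "states" and the choice of kernel are arbitrary) that builds the matrix K for a finite set of points, checks the required symmetry, and computes the eigendecomposition K = V \Lambda V^T:

    import numpy as np

    # A finite input "space": m = 4 states (toy vectors, just for illustration).
    states = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [-1.0, 0.0]])

    def k(x, z, c=1.0):
        return (x @ z + c) ** 2   # a polynomial kernel, as in Example 4

    m = len(states)
    K = np.array([[k(states[i], states[j]) for j in range(m)] for i in range(m)])

    print(np.allclose(K, K.T))          # symmetric, as required

    # Eigendecomposition K = V Lambda V^T (eigh is for symmetric matrices).
    lam, V = np.linalg.eigh(K)
    print(lam)                                        # eigenvalues lambda_t
    print(np.allclose(V @ np.diag(lam) @ V.T, K))     # reconstructs K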
