What is Mercer's theorem?

It is relatively easy for machine learning models to learn linear classification, but that is often not enough. Linear classifiers create linear decision boundaries that are sensitive to noise and fail to fit non-linearly separable datasets.

Example

Suppose we want to create a classifier to check whether a person is obese or not. The dataset contains the BMI (Body Mass Index) values of the subjects. An SVM decision boundary would be as shown in the figure below:

A linear decision boundary separating "Obese" from "Not obese"

This will work fine, but let's consider another example: cancer detection. The dataset may not be linearly separable, as shown in the figure below:

A non-linearly separable dataset of "Malignant" and "Benign" samples

A linear decision boundary cannot be created. This is where kernel functions play their role.

Kernel functions

Kernel functions transform a set of inputs to a higher dimension where the data is linearly separable. With the help of kernel functions, we can transform data into infinite dimensions.

For the problem of cancer detection, we can transform the data into three dimensions, where a hyperplane (a plane of dimension D−1, where D is the dimension of the dataset; for this 3-dimensional data, the hyperplane is 2-dimensional) can be our decision boundary as follows:

A hyperplane decision boundary
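To see the difference in practice, the sketch below (assuming scikit-learn is installed) trains two SVMs on concentric-circle data, which is not linearly separable. The dataset and hyperparameters are illustrative choices, not from the original article.

```python
# A linear kernel cannot separate concentric rings, while the RBF kernel
# can, because it implicitly maps the data to a higher dimension.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Training accuracy of a linear vs. an RBF kernel SVM.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")  # roughly chance level
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")     # close to 1.0
```

The linear SVM hovers near chance because no straight line separates the rings, while the RBF kernel separates them almost perfectly.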

There are several kernel functions available to use. Some of them are given in the table below:

  • Linear kernel: k(X, X') = X · X'

  • Radial Basis Function (RBF) kernel: k(X, X') = exp(−‖X − X'‖² / (2l²))

  • Polynomial kernel (degree d): k(X, X') = (X · X' + 1)^d
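As a quick sketch, the three kernels above can be written directly from their equations. The length-scale l and degree d defaults below are illustrative choices.

```python
import numpy as np

def linear_kernel(x, x_prime):
    # k(X, X') = X . X'
    return np.dot(x, x_prime)

def rbf_kernel(x, x_prime, l=1.0):
    # k(X, X') = exp(-||X - X'||^2 / (2 l^2))
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * l ** 2))

def polynomial_kernel(x, x_prime, d=2):
    # k(X, X') = (X . X' + 1)^d
    return (np.dot(x, x_prime) + 1) ** d

x = np.array([1.0, 2.0])
x_p = np.array([2.0, 0.0])
print(linear_kernel(x, x_p))      # 2.0
print(polynomial_kernel(x, x_p))  # (2 + 1)^2 = 9.0
```

Each function returns a scalar similarity between the two input vectors; an SVM evaluates such a function on every pair of training points.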

Kernels act like a feature map, implicitly mapping an input x into a higher-dimensional feature space and returning the inner product of the mapped inputs x and x'. Mercer's theorem helps us check whether a function is a valid kernel function or not.

Mercer's theorem

A function K is a kernel function if it satisfies two conditions:

  • Condition 1 - The function K must be symmetric.

  • Condition 2 - The function K must be positive semi-definite.

Condition 1

This is a pretty simple condition. The symmetry condition is given as:

K(x, x') = K(x', x)

Condition 2

For a given set of N unique data points x₁, x₂, ..., x_N, the function K is said to be positive semi-definite if:

∑ᵢ₌₁ᴺ ∑ⱼ₌₁ᴺ cᵢ cⱼ K(xᵢ, xⱼ) ≥ 0

Here, each cᵢ is any real-valued constant.

These conditions can be summarized in an alternative way using the Gram matrix, defined as:

Kᵢⱼ = K(xᵢ, xⱼ)

If this matrix is positive semi-definite, then the function K satisfies the second condition. A matrix K is positive semi-definite if, for any vector x:

xᵀ K x ≥ 0

Conclusion

Though the word "kernel" is used in several different contexts in machine learning, kernel selection holds great importance because it ultimately affects the performance of the model.


Copyright ©2025 Educative, Inc. All rights reserved