What is Mercer's theorem?

It is relatively easy for machine learning models to learn linear classification, but that is often not enough. Linear classifiers create linear decision boundaries that are sensitive to noise and fail to fit non-linearly separable datasets.

Example

Suppose we want to create a classifier to check whether a person is obese or not. The dataset contains the BMI (Body Mass Index) values of the subjects. An SVM decision boundary would be as shown in the figure below:

A linear decision boundary separating "Obese" from "Not obese"

This will work fine, but let's consider another example: cancer detection. The dataset may not be linearly separable, as shown in the figure below:

A non-linearly separable dataset of "Malignant" and "Benign" samples

A linear decision boundary cannot be created. This is where kernel functions play their role.

Kernel functions

Kernel functions transform a set of inputs to a higher dimension where the data is linearly separable. With the help of kernel functions, we can transform data into infinite dimensions.

For the problem of cancer detection, we can transform the data into three dimensions, where a hyperplane (a plane of dimension D−1, where D is the dimension of the dataset; for this 3-dimensional data, the hyperplane is 2-dimensional) can be our decision boundary as follows:

A hyperplane decision boundary
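To see the difference in practice, the sketch below (assuming scikit-learn is installed) trains two SVMs on concentric-circle data, which is not linearly separable. The dataset and hyperparameters are illustrative choices, not from the original article.

```python
# A linear kernel cannot separate concentric rings, while the RBF kernel
# can, because it implicitly maps the data to a higher dimension.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: not linearly separable in 2D.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Training accuracy of a linear vs. an RBF kernel SVM.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel accuracy: {linear_acc:.2f}")  # roughly chance level
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")     # close to 1.0
```

The linear SVM hovers near chance because no straight line separates the rings, while the RBF kernel separates them almost perfectly.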

There are several kernel functions available to use. Some of them are given in the table below:

  • Linear kernel: k(X, X') = X · X'

  • Radial Basis Function (RBF) kernel: k(X, X') = exp(−‖X − X'‖² / (2l²))

  • Polynomial kernel (degree d): k(X, X') = (X · X' + 1)^d
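As a quick sketch, the three kernels above can be written directly from their equations. The length-scale l and degree d defaults below are illustrative choices.

```python
import numpy as np

def linear_kernel(x, x_prime):
    # k(X, X') = X . X'
    return np.dot(x, x_prime)

def rbf_kernel(x, x_prime, l=1.0):
    # k(X, X') = exp(-||X - X'||^2 / (2 l^2))
    return np.exp(-np.sum((x - x_prime) ** 2) / (2 * l ** 2))

def polynomial_kernel(x, x_prime, d=2):
    # k(X, X') = (X . X' + 1)^d
    return (np.dot(x, x_prime) + 1) ** d

x = np.array([1.0, 2.0])
x_p = np.array([2.0, 0.0])
print(linear_kernel(x, x_p))      # 2.0
print(polynomial_kernel(x, x_p))  # (2 + 1)^2 = 9.0
```

Each function returns a scalar similarity between the two input vectors; an SVM evaluates such a function on every pair of training points.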

Kernels act like a feature map, implicitly mapping an input x into a higher-dimensional feature space and returning the inner product of the mapped inputs x and x'. Mercer's theorem helps us check whether a function is a valid kernel function or not.

Mercer's theorem

A function K is a kernel function if it satisfies two conditions:

  • Condition 1 - The function K must be symmetric.

  • Condition 2 - The function K must be positive semi-definite.

Condition 1

This is a pretty simple condition. The symmetry condition is given as:

K(x, x') = K(x', x)

Condition 2

For a given set of N unique data points x₁, x₂, ..., x_N, the function K is said to be positive semi-definite if:

∑ᵢ₌₁ᴺ ∑ⱼ₌₁ᴺ cᵢ cⱼ K(xᵢ, xⱼ) ≥ 0

Here, each cᵢ is any real-valued constant.

These conditions can be summarized in an alternative way using the Gram matrix, defined as:

Kᵢⱼ = K(xᵢ, xⱼ)

If this matrix is positive semi-definite, then the function K satisfies the second condition. A matrix K is positive semi-definite if, for any vector x:

xᵀ K x ≥ 0

Conclusion

Though the word "kernel" is used in several different contexts in machine learning, kernel selection holds great importance because it ultimately affects the performance of the model.


Copyright ©2025 Educative, Inc. All rights reserved