SUPPORT VECTOR MACHINES

Linearly Separable Data

Mathematical Intuition Behind Support Vector Machines (SVM)

Support Vector Machines (SVM) are supervised learning models used for classification and regression. SVMs are particularly well-suited for classification tasks and aim to find the optimal hyperplane that separates different classes in the feature space.

Key Concepts

  1. Hyperplane:
    • In an n-dimensional space, a hyperplane is an (n−1)-dimensional subspace that separates the space into two half-spaces.
    • For a 2D space, the hyperplane is a line; for a 3D space, it is a plane.
  2. Margin:
    • The margin is the distance between the hyperplane and the nearest data points from each class. These points are called support vectors.
    • SVM aims to maximize this margin, making the classifier more robust (see the sketch after this list).
  3. Support Vectors:
    • These are the data points that are closest to the hyperplane and influence its position and orientation.
    • They lie on the edge of the margin.
  4. Optimal Hyperplane:
    • The optimal hyperplane is the one that maximizes the margin between the two classes.
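
The sketch below makes these concepts concrete with NumPy. The weight vector, bias, and points are made-up values chosen purely for illustration; it evaluates which side of a 2D hyperplane each point falls on, its perpendicular distance to the hyperplane, and the margin width 2/∥w∥.

```python
import numpy as np

# Hypothetical separating line in 2D: w·x - b = 0
w = np.array([2.0, 1.0])   # normal vector to the hyperplane
b = 3.0                    # bias term

# A few made-up points
X = np.array([[3.0, 2.0],   # w·x - b = 5  -> positive half-space
              [0.0, 0.0],   # w·x - b = -3 -> negative half-space
              [1.0, 1.0]])  # w·x - b = 0  -> on the hyperplane

scores = X @ w - b                              # signed value of w·x - b per point
sides = np.sign(scores)                         # which half-space each point lies in
distances = np.abs(scores) / np.linalg.norm(w)  # perpendicular distance to the hyperplane

# Width of the margin if the closest points satisfy |w·x - b| = 1
margin_width = 2.0 / np.linalg.norm(w)
print(sides, distances, margin_width)
```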

Formulation

Given a dataset {(x_i, y_i)}, i = 1, …, m, where x_i ∈ ℝ^n are the feature vectors and y_i ∈ {−1, 1} are the class labels, the goal is to find the hyperplane that best separates the classes. The hyperplane can be represented as:

w·x_i − b = 0

where w is the normal vector to the hyperplane and b is the bias term. For any point x_i, the decision rule is:

w·x_i − b ≥ 1   if y_i = 1

w·x_i − b ≤ −1  if y_i = −1

This can be combined into one constraint:

y_i(w·x_i − b) ≥ 1
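
This combined constraint is easy to verify numerically. A minimal sketch, again with made-up w, b, and labeled points, checks whether every example satisfies y_i(w·x_i − b) ≥ 1, i.e. lies on or outside the margin on its correct side:

```python
import numpy as np

w = np.array([2.0, 1.0])      # hypothetical weight vector
b = 3.0                       # hypothetical bias

X = np.array([[3.0, 2.0],     # made-up training points
              [0.0, 0.0],
              [1.5, 1.0]])
y = np.array([1, -1, 1])      # corresponding labels

margins = y * (X @ w - b)     # y_i (w·x_i - b) for every point
print(margins)                # [5. 3. 1.] for this toy example
print(np.all(margins >= 1))   # True -> all constraints are satisfied
```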

Objective Function

The objective is to maximize the margin, which is equivalent to minimizing ∥w∥ (since the margin width is 2/∥w∥), subject to the constraint above. This leads to the following optimization problem:

min_{w,b}  (1/2)∥w∥^2

subject to  y_i(w·x_i − b) ≥ 1  for all i = 1, …, m
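
This quadratic program is rarely solved by hand. As a sketch of how it looks in practice, scikit-learn's SVC with a linear kernel (and a very large C to approximate the hard margin) can be fit on a tiny made-up dataset to recover w, b, the support vectors, and the margin width:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (made up for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [4.0, 2.5],       # class +1
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin problem min 1/2 ||w||^2
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # learned normal vector
b = -clf.intercept_[0]    # sklearn uses w·x + intercept, so b = -intercept
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width =", 2.0 / np.linalg.norm(w))
```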

Lagrangian Formulation

To solve this optimization problem, we use the method of Lagrange multipliers. The Lagrangian is formulated as:

L(w, b, α) = (1/2)∥w∥^2 − Σ_{i=1}^{m} α_i [y_i(w·x_i − b) − 1]

where α_i ≥ 0 are the Lagrange multipliers.
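
The Lagrangian is straightforward to evaluate numerically. The sketch below, with made-up w, b, α, and data, computes L(w, b, α) directly from the formula above:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])                   # hypothetical weight vector
b = 0.0                                    # hypothetical bias
alpha = np.array([0.1, 0.0, 0.05, 0.05])   # hypothetical multipliers

constraints = y * (X @ w - b) - 1          # y_i(w·x_i - b) - 1 for each point
L = 0.5 * (w @ w) - alpha @ constraints    # value of the Lagrangian
print("L(w, b, alpha) =", L)
```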

Dual Problem

The primal problem is converted to the dual by minimizing the Lagrangian with respect to w and b: setting ∂L/∂w = 0 gives w = Σ_{i=1}^{m} α_i y_i x_i, and setting ∂L/∂b = 0 gives Σ_{i=1}^{m} α_i y_i = 0. Substituting these back into L leaves a maximization over the multipliers alone:

max_α  Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j (x_i·x_j)

subject to  Σ_{i=1}^{m} α_i y_i = 0  and  α_i ≥ 0 for all i.
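
For a given set of multipliers, the dual objective and its constraints can be evaluated directly with NumPy. The α values below are made up rather than an actual solution; the point is only to show how the Gram matrix of inner products enters the objective:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.1, 0.0, 0.05, 0.05])   # hypothetical multipliers, not optimal

K = X @ X.T                                            # Gram matrix: K[i, j] = x_i·x_j
dual_obj = alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)
print("dual objective:", dual_obj)
print("sum_i alpha_i y_i =", alpha @ y)                # must equal 0
print("all alpha_i >= 0:", np.all(alpha >= 0))
```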

Solution

Once the α_i are found by solving the dual problem, the weight vector w and bias b can be computed as:

w = Σ_{i=1}^{m} α_i y_i x_i

b = w·x_k − y_k

for any support vector x_k (i.e., any point with α_k > 0).
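
Recovering w and b from the multipliers is then only a couple of lines. A minimal sketch, continuing with the hypothetical α, X, and y from the previous snippet:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.1, 0.0, 0.05, 0.05])   # hypothetical dual solution

w = (alpha * y) @ X                        # w = sum_i alpha_i y_i x_i
k = np.argmax(alpha > 0)                   # index of any support vector (alpha_k > 0)
b = w @ X[k] - y[k]                        # b = w·x_k - y_k
print("w =", w, "b =", b)
```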

Predictions

Given a new data point x, the decision function is:

f(x) = w·x − b

The predicted class label is:

ŷ = sign(f(x))
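
The decision rule translates directly into code. A minimal sketch, reusing the hypothetical w and b computed in the previous snippet, predicts labels for new points:

```python
import numpy as np

def predict(X_new, w, b):
    """Return sign(w·x - b) for each row of X_new."""
    return np.sign(X_new @ w - b)

w = np.array([0.35, 0.25])                     # w from the previous sketch
b = 0.2                                        # b from the previous sketch
X_new = np.array([[3.0, 1.0], [-2.0, -2.0]])   # made-up query points
print(predict(X_new, w, b))                    # [ 1. -1.]
```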

Soft Margin SVM

For non-linearly separable data, we introduce slack variables ξ_i to allow some misclassifications, leading to the soft margin SVM. The optimization problem becomes:

min_{w,b,ξ}  (1/2)∥w∥^2 + C Σ_{i=1}^{m} ξ_i

subject to  y_i(w·x_i − b) ≥ 1 − ξ_i  and  ξ_i ≥ 0  for all i,

where C > 0 controls the trade-off between maximizing the margin and penalizing violations.
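
In scikit-learn the soft margin is controlled by the C parameter of SVC: a small C tolerates more margin violations (wider margin), while a large C penalizes them heavily. A minimal sketch on a made-up, slightly overlapping dataset:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping made-up clusters: not perfectly separable
X = np.vstack([rng.normal(loc=1.5, scale=1.0, size=(20, 2)),
               rng.normal(loc=-1.5, scale=1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:>6}: margin width = {margin:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```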
