SUPPORT VECTOR MACHINES
Linearly Separable Data
Mathematical Intuition Behind Support Vector Machines (SVM)
Support Vector Machines (SVM) are supervised learning models used for classification and regression. SVMs are particularly well-suited for classification tasks and aim to find the optimal hyperplane that separates different classes in the feature space.
Key Concepts
- Hyperplane:
  - In an n-dimensional space, a hyperplane is an (n−1)-dimensional subspace that separates the space into two half-spaces.
  - For a 2D space, the hyperplane is a line; for a 3D space, it is a plane.
- Margin:
  - The margin is the distance between the hyperplane and the nearest data points from each class. These points are called support vectors.
  - SVM aims to maximize this margin, making the classifier more robust.
- Support Vectors:
  - These are the data points closest to the hyperplane; they determine its position and orientation.
  - They lie on the edge of the margin.
- Optimal Hyperplane:
  - The optimal hyperplane is the one that maximizes the margin between the two classes, as the short sketch below illustrates.
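To make these concepts concrete, here is a minimal sketch (assuming scikit-learn and NumPy are available) that fits a linear SVM on four toy points and reads off the support vectors, the normal vector w, and the margin width:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes in 2D.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# scikit-learn writes the hyperplane as w·x + intercept = 0, while this
# article uses w·x − b = 0, so b = −intercept.
w = clf.coef_[0]
b = -clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("w =", w, " b =", b, " margin width =", margin_width)
```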
Formulation
Given a dataset {(x_i, y_i)}_{i=1}^m, where x_i ∈ R^n are the feature vectors and y_i ∈ {−1, 1} are the class labels, the goal is to find the hyperplane that best separates the classes. The hyperplane can be represented as:
w · x − b = 0
where w is the normal vector to the hyperplane and b is the bias term. For any training point x_i, the decision rule is:
w · x_i − b ≥ 1 if y_i = 1
w · x_i − b ≤ −1 if y_i = −1
These two conditions can be combined into a single constraint:
y_i (w · x_i − b) ≥ 1
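As a quick numeric check, the combined constraint can be evaluated directly. The w and b below are illustrative, hand-picked values that happen to separate the toy points, not a trained model:

```python
import numpy as np

w = np.array([-1.0, 1.0])   # illustrative, hand-picked parameters
b = 0.0
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([1, 1, -1, -1])

margins = y * (X @ w - b)   # y_i (w·x_i − b), one value per point
print(margins)              # [1.  1.  1.5 2. ]; all ≥ 1, so (w, b) is feasible
```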
Objective Function
The objective is to maximize the margin, which is equivalent to minimizing ∥w∥ (since the margin width is 2/∥w∥), subject to the constraint above. This leads to the following optimization problem:
min_{w,b} (1/2) ∥w∥^2
subject to y_i (w · x_i − b) ≥ 1 for all i = 1, …, m
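This is a quadratic program. As a sketch, it can be handed to a generic constrained optimizer; the following uses SciPy's SLSQP (a tooling assumption; dedicated QP solvers are more typical in practice) on the toy points from above:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Optimization variables packed as theta = [w_1, w_2, b].
def objective(theta):
    w = theta[:2]
    return 0.5 * (w @ w)    # (1/2)∥w∥^2; b does not appear in the objective

# One inequality constraint per point: y_i (w·x_i − b) − 1 ≥ 0.
constraints = [
    {"type": "ineq", "fun": lambda theta, i=i: y[i] * (X[i] @ theta[:2] - theta[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b, " margin width =", 2.0 / np.linalg.norm(w))
```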
Lagrangian Formulation
To solve this optimization problem, we use the method of Lagrange multipliers. The Lagrangian is formulated as:
L(w, b, α) = (1/2) ∥w∥^2 − ∑_{i=1}^m α_i [y_i (w · x_i − b) − 1]
where α_i ≥ 0 are the Lagrange multipliers.
Dual Problem
The primal problem is converted to the dual by minimizing the Lagrangian with respect to w and b: setting ∂L/∂w = 0 gives w = ∑_{i=1}^m α_i y_i x_i, and setting ∂L/∂b = 0 gives ∑_{i=1}^m α_i y_i = 0. Substituting these back and maximizing over α yields the dual problem:
max_α ∑_{i=1}^m α_i − (1/2) ∑_{i=1}^m ∑_{j=1}^m α_i α_j y_i y_j (x_i · x_j)
subject to α_i ≥ 0 and ∑_{i=1}^m α_i y_i = 0
Solution
Once the α_i are found by solving the dual problem, the weight vector w and bias b can be computed as:
w = ∑_{i=1}^m α_i y_i x_i
b = w · x_k − y_k
for any support vector x_k (any point with α_k > 0).
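The dual can likewise be solved numerically, after which w and b are recovered with the formulas above. A sketch under the same assumptions (SciPy, the toy data from earlier):

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j (x_i·x_j)

def neg_dual(alpha):                        # minimize the negated dual objective
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(m),
    method="SLSQP",
    bounds=[(0, None)] * m,                                # α_i ≥ 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # ∑ α_i y_i = 0
)
alpha = res.x

w = (alpha * y) @ X          # w = ∑ α_i y_i x_i
k = np.argmax(alpha)         # any point with α_k > 0 is a support vector
b = X[k] @ w - y[k]          # b = w·x_k − y_k
print("alpha =", alpha, "\nw =", w, " b =", b)
```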
Predictions
Given a new data point x, the decision function is:
f(x) = w · x − b
The predicted class label is:
ŷ = sign(f(x))
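Applying the decision rule with the illustrative (w, b) from the constraint check earlier:

```python
import numpy as np

w, b = np.array([-1.0, 1.0]), 0.0   # illustrative parameters from above
x_new = np.array([1.5, 2.5])

f = w @ x_new - b                   # f(x) = w·x − b
print("f(x) =", f, " predicted label:", int(np.sign(f)))   # f(x) = 1.0, label +1
```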
For non-linearly separable data, we introduce slack variables ξ_i ≥ 0 that allow some points to violate the margin, leading to the soft-margin SVM. The optimization problem becomes:
min_{w,b,ξ} (1/2) ∥w∥^2 + C ∑_{i=1}^m ξ_i
subject to y_i (w · x_i − b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i
where the hyperparameter C > 0 controls the trade-off between a wide margin and the penalty for margin violations.
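The soft margin is what most libraries implement by default. A brief sketch (again assuming scikit-learn) of how C affects the fit on overlapping, non-separable clusters; smaller C tolerates more margin violations and tends to keep more support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)   # overlapping clusters: not separable

for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```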