Mastering SVM Formulation: The Ultimate Guide to Support Vector Machines

The Support Vector Machine (SVM) formulation is the mathematical backbone that turns a simple geometric intuition into a classification algorithm. At its core, the problem is to find a hyperplane that separates two sets of points in a feature space. The formulation begins by defining the hyperplane with the equation \( w^T x + b = 0 \), where \( w \) is the weight vector normal to the hyperplane and \( b \) is the bias term that shifts it away from the origin. A point \( x \) is assigned to a class according to the sign of \( w^T x + b \).
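As a minimal sketch of the hyperplane definition above, the following Python snippet evaluates \( w^T x + b \), the signed distance of a point to the hyperplane, and the resulting class label. The function names and the 2-D numbers are hypothetical, chosen only for illustration.

```python
import math

def decision_value(w, b, x):
    """Return w^T x + b for plain Python lists w and x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return decision_value(w, b, x) / norm_w

def predict(w, b, x):
    """Classify x as +1 or -1 by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical example: the line x1 + x2 - 1 = 0 in two dimensions.
w, b = [1.0, 1.0], -1.0

print(predict(w, b, [2.0, 2.0]))   # point on the positive side
print(predict(w, b, [0.0, 0.0]))   # point on the negative side
print(signed_distance(w, b, [2.0, 2.0]))
```

Dividing by \( ||w|| \) is what turns the raw score into a geometric distance, which is the quantity the margin is built from.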

The Optimization Objective

The goal is not just any separating hyperplane, but the one that maximizes the margin, defined as the width of the band between the two classes that contains no training points. The points closest to the hyperplane are the support vectors. With labels \( y_i \in \{-1, +1\} \), \( w \) and \( b \) can be rescaled so that the support vectors satisfy \( y_i (w^T x_i + b) = 1 \); each then lies at distance \( \frac{1}{||w||} \) from the hyperplane, and the margin equals \( \frac{2}{||w||} \). Consequently, maximizing the margin is equivalent to minimizing \( \frac{1}{2} ||w||^2 \) subject to \( y_i (w^T x_i + b) \geq 1 \) for every training point, which is a convex quadratic optimization problem. Placing the decision boundary as far as possible from the nearest points of each class tends to improve generalization to unseen data.
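To make the margin formula concrete, here is a small sketch that checks the constraints \( y_i (w^T x_i + b) \geq 1 \) on a toy dataset and computes the margin \( \frac{2}{||w||} \). The data and the candidate hyperplane are hypothetical, and the hyperplane is assumed to be already scaled so that the closest points have functional margin exactly 1.

```python
import math

# Hypothetical toy data: two points per class, labels in {+1, -1}.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Candidate hyperplane, assumed to be in canonical (rescaled) form.
w, b = [0.5, 0.5], -1.0

def functional_margin(w, b, x, label):
    """Return y_i * (w^T x_i + b); the constraint requires this to be >= 1."""
    return label * (sum(wi * xi for wi, xi in zip(w, x)) + b)

margins = [functional_margin(w, b, x, label) for x, label in zip(X, y)]

# Every training point must satisfy the hard-margin constraint.
feasible = all(m >= 1.0 for m in margins)

# Support vectors are the points where the constraint holds with equality.
support_vectors = [x for x, m in zip(X, margins) if abs(m - 1.0) < 1e-9]

norm_w = math.sqrt(sum(wi * wi for wi in w))
margin_width = 2.0 / norm_w              # geometric margin 2 / ||w||
objective = 0.5 * norm_w ** 2            # quantity the SVM minimizes

print(feasible, support_vectors, margin_width, objective)
```

In this example the margin width matches the distance between the two support vectors, (2, 2) and (0, 0), which is what one would expect when the hyperplane bisects the segment joining them.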

Handling Non-Separable Data

Real-world datasets are rarely perfectly linearly separable, so the rigid hard-margin formulation has to be relaxed. Slack variables \( \xi_i \geq 0 \) allow individual points to violate the margin or even be misclassified by loosening the constraints to \( y_i (w^T x_i + b) \geq 1 - \xi_i \). The optimization problem becomes minimizing \( \frac{1}{2} ||w||^2 + C \sum_{i=1}^{n} \xi_i \) subject to these constraints, where \( C > 0 \) is a regularization parameter. A large \( C \) penalizes violations heavily and pushes toward fitting the training data, while a small \( C \) tolerates more violations in exchange for a wider margin.
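The soft-margin objective can be sketched in a few lines. For a fixed hyperplane, the smallest feasible slack for each point is \( \xi_i = \max(0, 1 - y_i (w^T x_i + b)) \); the snippet below computes it and evaluates the objective for two values of \( C \). The dataset, the hyperplane, and the function names are hypothetical illustrations, not the output of a solver.

```python
def slack(w, b, x, label):
    """Smallest xi_i >= 0 that satisfies y_i (w^T x_i + b) >= 1 - xi_i."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - label * score)

def soft_margin_objective(w, b, X, y, C):
    """Evaluate 0.5 * ||w||^2 + C * sum_i xi_i for a fixed (w, b)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(slack(w, b, x, label) for x, label in zip(X, y))
    return regularizer + C * total_slack

# Hypothetical data: the last point is labeled -1 but sits on the +1 side.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.5, 1.5]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], -1.0

slacks = [slack(w, b, x, label) for x, label in zip(X, y)]
print(slacks)                                      # only the last point needs slack
print(soft_margin_objective(w, b, X, y, C=1.0))
print(soft_margin_objective(w, b, X, y, C=10.0))
```

A slack between 0 and 1 means the point is inside the margin but still on the correct side; a slack above 1, as for the last point here, means it is misclassified.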

The Kernel Trick and Feature Mapping

To handle datasets that are not linearly separable in the original input space, the formulation incorporates a feature mapping function \( \phi(x) \). This function projects the data into a higher-dimensional space where a linear separator may exist. The resulting optimization problem depends on the mapped points only through their dot products, \( \phi(x_i)^T \phi(x_j) \). The kernel trick exploits the fact that this high-dimensional dot product can often be computed directly in the original space using a kernel function \( K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \), avoiding the explicit and potentially infinite-dimensional calculation of \( \phi(x) \).
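A small numerical check can make the kernel trick tangible. For 2-D inputs, the homogeneous quadratic kernel \( K(x, y) = (x^T y)^2 \) corresponds to the explicit map \( \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) \). The sketch below, with hypothetical function names and inputs, computes the dot product both ways and compares the results.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def quadratic_kernel(x, y):
    """K(x, y) = (x^T y)^2, computed entirely in the original 2-D space."""
    return dot(x, y) ** 2

x, y = [1.0, 2.0], [3.0, 4.0]

explicit = dot(phi(x), phi(y))     # dot product in the 3-D feature space
implicit = quadratic_kernel(x, y)  # same value without ever building phi

print(explicit, implicit)
print(math.isclose(explicit, implicit))
```

The kernel needs one 2-D dot product and a square, whereas the explicit route builds two 3-D vectors first. The gap widens quickly with input dimension and polynomial degree, and for the RBF kernel the explicit feature space is infinite-dimensional.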

Kernel Name | Function Expression | Typical Use Case
Linear | \( K(x, y) = x^T y \) | Text classification, high-dimensional sparse data
Polynomial | \( K(x, y) = (\gamma x^T y + r)^d \) | Image recognition, problems with higher-order interactions
Radial Basis Function | \( K(x, y) = \exp(-\gamma ||x - y||^2) \) | Non-linear problems with no prior knowledge of structure
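The three kernels in the table can be written directly from their formulas. The following is a minimal sketch in plain Python; the function names and the default values for \( \gamma \), \( r \), and \( d \) are illustrative assumptions rather than recommended settings.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(x, y):
    """K(x, y) = x^T y"""
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=2):
    """K(x, y) = (gamma * x^T y + r)^d  (defaults are illustrative only)."""
    return (gamma * dot(x, y) + r) ** d

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)  (default gamma is illustrative)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gram_matrix(X, kernel):
    """Matrix of pairwise kernel values K(x_i, x_j) over a dataset X."""
    return [[kernel(a, b) for b in X] for a in X]

# Hypothetical inputs.
X = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
G = gram_matrix(X, rbf_kernel)

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0]))
for row in G:
    print(row)
```

The Gram matrix is symmetric, and for the RBF kernel its diagonal is all ones, since every point is at distance zero from itself.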

Duality and the Lagrangian

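Solving the primal optimization problem directly becomes impractical once the feature mapping is high-dimensional or infinite-dimensional, because \( w \) lives in the mapped space. The formulation is therefore transformed using Lagrange multipliers to construct the dual problem. The Lagrangian attaches a multiplier \( \alpha_i \) to each margin constraint and a further multiplier to each slack constraint \( \xi_i \geq 0 \); eliminating \( w \), \( b \), and \( \xi_i \) yields a dual objective that depends only on the multipliers \( \alpha_i \). The dual problem is to maximize \( \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \), subject to \( \sum_{i=1}^{n} \alpha_i y_i = 0 \) and \( 0 \leq \alpha_i \leq C \). In the hard-margin case without slack variables, the upper bound disappears and the constraint is simply \( \alpha_i \geq 0 \). The weight vector is recovered as \( w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) \), and only the support vectors have \( \alpha_i > 0 \).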
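The dual can be worked by hand on a two-point problem, which the sketch below verifies numerically. With \( x_1 = (1, 0) \), \( y_1 = +1 \) and \( x_2 = (-1, 0) \), \( y_2 = -1 \), the equality constraint forces \( \alpha_1 = \alpha_2 = a \), and with a linear kernel the dual objective reduces to \( 2a - 2a^2 \), which peaks at \( a = 0.5 \). The dataset and function names are hypothetical, and the hard-margin case is assumed (no upper bound \( C \)).

```python
def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y, kernel):
    """sum_i alpha_i - 0.5 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)"""
    n = len(X)
    quadratic = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quadratic

# Hypothetical two-point dataset, one point per class.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]

# Maximizer derived by hand for this dataset (not produced by a solver).
alpha = [0.5, 0.5]

# Check the equality constraint sum_i alpha_i y_i = 0.
constraint = sum(a * label for a, label in zip(alpha, y))

# Recover the primal solution: w = sum_i alpha_i y_i x_i (linear kernel only).
w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(2)]
# Any support vector gives b from y_i (w^T x_i + b) = 1.
b = y[0] - linear_kernel(w, X[0])

dual_value = dual_objective(alpha, X, y, linear_kernel)
primal_value = 0.5 * linear_kernel(w, w)

print(constraint, w, b)
print(dual_value, primal_value)
```

The dual and primal values coincide here, as expected for a convex problem of this kind, and the recovered margin \( \frac{2}{||w||} = 2 \) equals the distance between the two points.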

Written by Ethan Brooks