The Support Vector Machine (SVM) formulation is the mathematical backbone that turns a simple geometric intuition into a powerful classification algorithm. At its core, the problem is to find a hyperplane that separates two classes of points, labeled \( y_i \in \{-1, +1\} \), in a possibly high-dimensional space. The formulation begins by defining the hyperplane with the equation \( w^T x + b = 0 \), where \( w \) is the weight vector normal to the hyperplane and \( b \) is the bias term that shifts it away from the origin. A point \( x \) is assigned to a class according to the sign of \( w^T x + b \).
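As a minimal sketch of the hyperplane equation, the snippet below evaluates \( w^T x + b \) and classifies a point by its sign. The function names and the values of `w` and `b` are illustrative assumptions, not part of the formulation itself.

```python
def decision_value(w, x, b):
    """Return w^T x + b for plain Python sequences w and x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """Label a point +1 or -1 by the side of the hyperplane it falls on."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical hyperplane x1 - x2 = 0 in two dimensions.
w = [1.0, -1.0]
b = 0.0

print(classify(w, [2.0, 1.0], b))  # point below the line x1 = x2
print(classify(w, [1.0, 3.0], b))  # point above the line x1 = x2
```

Points that lie exactly on the hyperplane are assigned to the positive class here; that tie-breaking rule is an arbitrary choice made for the sketch.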
The Optimization Objective
The primary goal is not just any separating hyperplane, but the one that maximizes the margin, defined as the width of the band between the two classes that contains no training points. The points that lie on the edges of this band are the support vectors. If \( w \) and \( b \) are scaled so that every training point satisfies \( y_i (w^T x_i + b) \geq 1 \), with equality for the support vectors, each support vector lies at distance \( \frac{1}{||w||} \) from the hyperplane and the margin equals \( \frac{2}{||w||} \). Consequently, maximizing the margin is equivalent to minimizing \( \frac{1}{2} ||w||^2 \) subject to \( y_i (w^T x_i + b) \geq 1 \) for all \( i \), which is the fundamental quadratic optimization problem. Placing the decision boundary as far as possible from the nearest training points is what motivates the expectation that the solution will generalize well.
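The relationship between the constraints and the margin can be checked numerically. The sketch below uses a small hypothetical dataset and a hand-picked hyperplane (both are assumptions for illustration), confirms that every point satisfies \( y_i (w^T x_i + b) \geq 1 \), and computes the margin \( \frac{2}{||w||} \).

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margins(w, b, X, y):
    """Return y_i * (w^T x_i + b) for every training point."""
    return [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]


def margin_width(w):
    """Geometric margin 2 / ||w|| of a canonically scaled hyperplane."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical, linearly separable toy data.
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
y = [1, 1, -1, -1]

# Hand-picked hyperplane, scaled so the closest points give exactly 1.
w = [0.5, 0.5]
b = -1.0

margins = functional_margins(w, b, X, y)
support = [xi for xi, m in zip(X, margins) if abs(m - 1.0) < 1e-12]

print("constraints satisfied:", all(m >= 1.0 for m in margins))
print("support vectors:", support)
print("margin width:", margin_width(w))
```

In this example the points (2, 2) and (0, 0) meet the constraint with equality, and the margin equals the distance between them measured along \( w \).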
Handling Non-Separable Data
Real-world datasets are rarely perfectly linearly separable, which calls for a relaxation of the rigid formulation. Slack variables \( \xi_i \geq 0 \) allow individual points to violate the margin constraint or even be misclassified: the constraints become \( y_i (w^T x_i + b) \geq 1 - \xi_i \). A point with \( 0 < \xi_i \leq 1 \) lies inside the margin but on the correct side of the hyperplane, and a point with \( \xi_i > 1 \) is misclassified. The optimization problem is adjusted to minimize \( \frac{1}{2} ||w||^2 + C \sum_{i=1}^{n} \xi_i \), where \( C > 0 \) is a regularization parameter. This parameter acts as a tunable knob: a large \( C \) penalizes violations heavily and favors fitting the training data, while a small \( C \) tolerates more violations in exchange for a wider margin.
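To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective for a fixed hyperplane. It relies on the fact that, for given \( w \) and \( b \), the smallest slack satisfying the constraints is \( \xi_i = \max(0, 1 - y_i (w^T x_i + b)) \). The data, the hyperplane, and the values of `C` are illustrative assumptions.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def slacks(w, b, X, y):
    """Smallest feasible slack for each point: max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """Evaluate 0.5 * ||w||^2 + C * sum(xi_i) for a fixed hyperplane."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical data: the third point sits on the decision boundary,
# so it violates the margin and needs a slack of 1.
X = [[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]]
y = [1, -1, 1]
w = [0.5, 0.5]
b = -1.0

print("slacks:", slacks(w, b, X, y))
for C in (1.0, 10.0):
    print("C =", C, "objective =", soft_margin_objective(w, b, X, y, C))
```

The same violation costs ten times more when `C` is raised from 1 to 10, which is why a solver given a larger `C` is pushed toward hyperplanes with fewer or smaller violations.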
The Kernel Trick and Feature Mapping
To handle datasets that are not linearly separable in the original input space, the formulation incorporates a feature mapping function \( \phi(x) \). This function projects the data into a higher-dimensional space in which a linear separator may exist. In the dual form of the optimization problem, the mapped points appear only through dot products of the form \( \phi(x_i)^T \phi(x_j) \). The kernel trick exploits the fact that this high-dimensional dot product can often be computed directly in the original space using a kernel function \( K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \), avoiding the explicit and potentially infinite-dimensional calculation of \( \phi(x) \).
Kernel | Formula | Typical use
Radial Basis Function | \( K(x, y) = \exp(-\gamma ||x - y||^2) \) | Non-linear problems with no prior knowledge of structure
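The sketch below illustrates the kernel trick in plain Python. Because the feature map of the Radial Basis Function kernel is infinite-dimensional, the identity \( K(x, z) = \phi(x)^T \phi(z) \) is checked on a homogeneous quadratic kernel \( (x^T z)^2 \) instead, whose explicit map in two dimensions is \( \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) \). That kernel, the function names, and the sample points are assumptions chosen for illustration; the RBF function follows the formula in the table.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed entirely in the input space."""
    return dot(x, z) ** 2


def quadratic_feature_map(x):
    """Explicit map phi for the quadratic kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x = [1.0, 2.0]
z = [3.0, 4.0]

implicit = quadratic_kernel(x, z)
explicit = dot(quadratic_feature_map(x), quadratic_feature_map(z))
print("kernel in input space:   ", implicit)
print("dot product after phi:   ", explicit)
print("RBF similarity of x and z:", rbf_kernel(x, z, gamma=0.5))
```

Both routes give the same number up to floating-point rounding, but the kernel route never builds the mapped vectors, which is what makes kernels with very large or infinite feature spaces usable.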
Duality and the Lagrangian
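Solving the primal optimization problem directly can be impractical once the feature mapping is involved, because \( w \) lives in the mapped space, which may be very high-dimensional or even infinite-dimensional. The formulation is therefore transformed using Lagrange multipliers to construct the dual problem. The Lagrangian attaches a multiplier \( \alpha_i \geq 0 \) to each margin constraint and a further multiplier to each constraint \( \xi_i \geq 0 \). Setting its derivatives with respect to \( w \), \( b \), and \( \xi_i \) to zero gives \( w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i) \) and eliminates the primal variables, leaving a dual objective that depends only on the multipliers \( \alpha_i \). The dual problem is to maximize \( \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \), subject to \( \sum_{i=1}^{n} \alpha_i y_i = 0 \) and \( 0 \leq \alpha_i \leq C \). The upper bound \( C \) comes from the slack variables; in the hard-margin case without slack, the constraint reduces to \( \alpha_i \geq 0 \). Only the support vectors end up with \( \alpha_i > 0 \), so the resulting classifier \( \mathrm{sign}\left( \sum_{i} \alpha_i y_i K(x_i, x) + b \right) \) depends on them alone.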
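As a small worked check of the dual, the sketch below evaluates the dual objective and its constraints for a two-point problem with a linear kernel, assuming the soft-margin box constraint \( 0 \leq \alpha_i \leq C \). The data, the value of `C`, and the multipliers are illustrative assumptions; the multipliers were obtained by hand (with \( \alpha_1 = \alpha_2 = a \), the objective is \( 2a - 4a^2 \), maximized at \( a = 0.25 \)), not by a solver.

```python
def linear_kernel(x, z):
    """K(x, z) = x^T z."""
    return sum(xi * zi for xi, zi in zip(x, z))


def dual_objective(alpha, X, y, kernel):
    """sum(alpha_i) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum(alpha_i y_i) = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed


def decision_value(x, alpha, X, y, b, kernel):
    """sum_i alpha_i y_i K(x_i, x) + b, using only kernel evaluations."""
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b


# Hypothetical two-point problem; both points are support vectors.
X = [[2.0, 2.0], [0.0, 0.0]]
y = [1, -1]
C = 1.0
alpha = [0.25, 0.25]  # hand-derived maximizer for this toy problem

# Bias from a support vector s with 0 < alpha_s < C:
# b = y_s - sum_i alpha_i y_i K(x_i, x_s)
b = y[0] - decision_value(X[0], alpha, X, y, 0.0, linear_kernel)

print("feasible:", is_feasible(alpha, y, C))
print("dual objective:", dual_objective(alpha, X, y, linear_kernel))
print("bias:", b)
print("decision value at (3, 3):", decision_value([3.0, 3.0], alpha, X, y, b, linear_kernel))
```

The dual value of 0.25 matches the primal value \( \frac{1}{2} ||w||^2 \) for \( w = \sum_i \alpha_i y_i x_i = (0.5, 0.5) \), which is the agreement expected at the optimum of this convex problem.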