April 29, 2022

Support Vector Machines

Authors: Ryan Peters*
* Core Contributor; Correspondence to ryanirl@icloud.com;

The Support Vector Machine (SVM) is another supervised learning algorithm for binary classification (see my post on the Perceptron). Specifically, SVMs are margin maximizers that, when paired with a kernel (more on this some other time), become a very powerful tool.

Other models for supervised learning of a binary classifier, such as the Rosenblatt Perceptron, can guarantee convergence, but they offer no guarantee on how good the resulting hyperplane is. The key insight behind SVMs is that we would not only like our model to converge, but also to give us the "best possible" hyperplane. But how does one define the "best possible" hyperplane? SVMs define it as the hyperplane that maximizes the distance to the closest points of the two classes. We call this distance the margin, and SVMs try to maximize it.

The Margin

Given a dataset $X \in \mathbb{R}^{D \times N}$ with binary labels $y \in \{ -1, 1 \}$, weights $w \in \mathbb{R}^{D \times 1}$, and a bias $b \in \mathbb{R}$, we define the margin $\mathcal{M}$ as the distance from the hyperplane $\mathcal{H}$ (given by $w^T x + b = 0$) to the closest point $x_{i} \in X$.

To formalize the definition of the margin $\mathcal{M}$, let's first formalize the distance from the hyperplane $\mathcal{H}$ to some point $x_{i}$. Notice that if we project $x_{i}$ onto $\mathcal{H}$ we get another point, denoted $x_{p}$ (see Figure 1), such that the distance from $x_{i}$ to the hyperplane is simply the distance between these two points. More formally, the distance from $\mathcal{H}$ to $x_{i}$ is $||\vec{e}||_{2}$ where $\vec{e} = x_{i} - x_{p}$.

Next, because $\vec{e}$ realizes the minimum distance between $x_{i}$ and $\mathcal{H}$, it is orthogonal to $\mathcal{H}$ and therefore parallel to $w$. This means we can rewrite $\vec{e}$ as $\vec{e} = \alpha w$ for some unknown scalar $\alpha$. Additionally, since $x_{p}$ lies on the hyperplane, it must satisfy $w^T x_{p} + b = 0$. Using these facts, we can derive the following:

$$
\begin{gather*}
w^T x_{p} + b = 0 \\[5pt]
w^T (x_{i} - \vec{e}) + b = 0 \\[5pt]
w^T (x_{i} - \alpha w) + b = 0 \\[5pt]
w^T x_{i} - \alpha w^T w + b = 0 \\[5pt]
\alpha w^T w = w^T x_{i} + b \\[7pt]
\alpha = \frac{w^T x_{i} + b}{w^T w}
\end{gather*}
$$

Now we can use this definition of $\alpha$ to rewrite the length $||\vec{e}||_{2}$ as a function of $w$, $b$, and some $x_{i}$:

$$
\begin{aligned}
||\vec{e}||_{2} &= \sqrt{\vec{e}^T \vec{e}} \\[7pt]
&= \sqrt{(\alpha w)^T (\alpha w)} \\[7pt]
&= |\alpha| \sqrt{w^T w} \\[7pt]
&= \frac{|w^T x_{i} + b|}{w^T w} \sqrt{w^T w} \\[7pt]
&= \frac{|w^T x_{i} + b|}{\sqrt{w^T w}} \\[7pt]
&= \frac{|w^T x_{i} + b|}{||w||_{2}}
\end{aligned}
$$

Then finally, the margin $\mathcal{M}$ given some hyperplane $\mathcal{H}$ and dataset $X$ is given by:

$$
\mathcal{M} = \min_{x_{i}} \frac{|w^T x_{i} + b|}{||w||_{2}}
$$
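To make this formula concrete, here is a small NumPy sketch (my own illustration; the hyperplane and points are made up) that computes the distance of each point to a hyperplane and the resulting margin:

import numpy as np

# Hypothetical hyperplane and data, purely for illustration.
w = np.array([2.0, -1.0])   # weights defining the hyperplane
b = 0.5                     # bias
X = np.array([[1.0, 0.0],   # each row is one point x_i
              [0.0, 2.0],
              [3.0, 1.0]])

# Distance of each x_i to the hyperplane: |w^T x_i + b| / ||w||_2.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# The margin is the distance to the closest point.
margin = distances.min()
print(distances, margin)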

Primal Form Derivation:

SVMs are a type of "margin maximizer"; that is, we want to maximize $\mathcal{M}$. We can formalize this with the following optimization problem:

$$
\begin{aligned}
\max_{w, b} \min_{x_{i}} & \frac{|w^T x_{i} + b|}{||w||_{2}} \\[7pt]
\max_{w, b} \frac{1}{||w||_{2}} & \min_{x_{i}} |w^T x_{i} + b|
\end{aligned}
$$

Notice one last thing: with this definition of the margin alone, we could find some hyperplane infinitely far from our data that would still satisfy this optimization problem, with no guarantee that it classifies anything correctly. Therefore we must add an essential constraint: $y_{i}(w^Tx_{i} + b) \geq 0 \;\;\; \forall x_{i} \in X$, i.e. every point must lie on the correct side of the hyperplane. Using this, we once again rewrite the optimization problem as:

$$
\begin{aligned}
& \max_{w, b} \frac{1}{||w||_{2}} \min_{x_{i}} |w^T x_{i} + b| \\[10pt]
\text{subject to:} & \;\; y_{i}(w^Tx_{i} + b) \geq 0 \;\;\; \forall x_{i} \in X
\end{aligned}
$$

The above optimization problem is rather ugly; the maximization of a minimization with an inequality constraint does not look fun to optimize. Preferably, we would like to simplify it by removing the inner minimization. To do so, notice that the hyperplane is invariant to the scale of $w$ and $b$. That is, $cw$ and $cb$ for any $c \neq 0$ define the same hyperplane. Therefore, we can always rescale $w$ and $b$ such that $\min_{x_{i}} | w^Tx_{i} + b | = 1$. This constraint allows us to rewrite the above objective as:

$$
\begin{aligned}
& \max_{w, b} \frac{1}{||w||_{2}} \\[7pt]
\text{subject to:} \; & y_{i}(w^Tx_{i} + b) \geq 0 \;\;\; \forall x_{i} \in X \\[5pt]
& \min_{x_{i}} | w^Tx_{i} + b | = 1
\end{aligned}
$$

Then notice that if $\min_{x_{i}} | w^Tx_{i} + b | = 1$, it must be true that $| w^Tx_{i} + b | \geq 1 \;\; \forall x_{i} \in X$. Combined with the constraint $y_{i}(w^Tx_{i} + b) \geq 0$, which fixes the sign, this gives $y_{i}(w^Tx_{i} + b) \geq 1$, and we can simplify our constraints even further:

$$
\begin{aligned}
& \max_{w, b} \frac{1}{||w||_{2}} \\[7pt]
\text{subject to:} \; & y_{i}(w^Tx_{i} + b) \geq 1 \;\;\; \forall x_{i} \in X
\end{aligned}
$$

As Kilian Weinberger once said, "maximization is for losers." So notice that the maximization of the reciprocal of $f(x)$ is just the minimization of $f(x)$. That is, we can rewrite our formulation as:

$$
\begin{aligned}
& \min_{w, b} ||w||_{2} \\[7pt]
\text{subject to:} \; & y_{i}(w^Tx_{i} + b) \geq 1 \;\;\; \forall x_{i} \in X
\end{aligned}
$$

One last thing before we are done with the derivation of the SVM primal form: rather than minimizing $||w||_{2}$ directly, which is not differentiable at $w = 0$, we minimize $\frac{1}{2}||w||_{2}^2$. Squaring does not change the minimizer, and the factor of $\frac{1}{2}$ just makes the gradient cleaner. This gives the final SVM primal form:

$$
\begin{aligned}
& \min_{w, b} \frac{1}{2} ||w||_{2}^2 \\[7pt]
\text{subject to:} \; & y_{i}(w^Tx_{i} + b) \geq 1 \;\;\; \forall x_{i} \in X
\end{aligned}
$$

This is the primal form of the SVM problem that we aim to optimize over.
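As a quick numerical sanity check (my own addition, not part of the derivation), we can approximate a hard-margin linear SVM with scikit-learn by using a very large C and verify that the closest points achieve $y_{i}(w^Tx_{i} + b) \approx 1$, so that the margin is $1 / ||w||_{2}$:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data.
X, y = make_blobs(n_samples = 100, centers = 2, random_state = 1)
y = np.where(y == 0, -1, 1)

# A very large C approximates the hard-margin SVM.
clf = SVC(kernel = "linear", C = 1e6).fit(X, y)
w = clf.coef_.ravel()
b = clf.intercept_[0]

# The smallest value of y_i (w^T x_i + b) should be ~1, and the margin 1 / ||w||_2.
functional_margin = np.min(y * (X @ w + b))
geometric_margin = functional_margin / np.linalg.norm(w)
print(functional_margin, geometric_margin, 1.0 / np.linalg.norm(w))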

The Lagrangian and the Karush-Kuhn-Tucker (KKT) Conditions

Consider a constrained optimization problem of the form:

$$
\begin{aligned}
& \min_{w} f(w) \\[5pt]
\text{subject to:} \; & g_{i}(w) \leq 0 \;\; \forall i = 1, 2, ..., u \\[7pt]
& h_{j}(w) = 0 \;\; \forall j = 1, 2, ..., v
\end{aligned}
$$

Here $f(w)$ and the $g_{i}(w)$ are convex, and the $h_{j}(w)$ are affine. We call this the primal optimization problem, where the number of inequality constraints is $u$ and the number of equality constraints is $v$. Furthermore, we define the generalized Lagrangian of this problem to be:

$$
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i = 1}^{u} \alpha_{i} g_{i}(w) + \sum_{j = 1}^{v} \beta_{j} h_{j}(w)
$$

Here $\alpha$ and $\beta$ are the Lagrange multipliers. We would like to know when optimizing the generalized Lagrangian is equivalent to solving the primal problem. To see this, consider two quantities. The first:

$$
\mathcal{P}(w) = \max_{\alpha, \beta : \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
$$

Given this quantity we can consider two cases. The first case is when $w$ satisfies the primal constraints: then every $g_{i}(w) \leq 0$, so the maximum is attained at $\alpha_{i} = 0$, and every $h_{j}(w) = 0$, so the $\beta$ terms vanish, giving $\mathcal{P}(w) = f(w)$. In the other case, when $w$ violates some primal constraint, the corresponding $\alpha_{i}$ or $\beta_{j}$ can be driven to infinity, so $\mathcal{P}(w) = \infty$. To formalize this:

$$
\mathcal{P}(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\[7pt] \infty & \text{if } w \text{ does NOT satisfy the primal constraints} \end{cases}
$$

Thus $\mathcal{P}(w)$ takes the same values as the primal objective whenever $w$ satisfies the primal constraints. Therefore, we can write a statement equivalent to the primal problem as:

$$
\min_{w} \mathcal{P}(w) = \min_{w} \max_{\alpha, \beta : \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
$$

Now let's consider the second quantity:

$$
\mathcal{D}(\alpha, \beta) = \min_{w} \mathcal{L}(w, \alpha, \beta)
$$

Then we define the dual of this form as follows:

$$
\max_{\alpha, \beta : \alpha \geq 0} \mathcal{D}(\alpha, \beta) = \max_{\alpha, \beta : \alpha \geq 0} \min_{w} \mathcal{L}(w, \alpha, \beta)
$$

Now let's compare these two quantities. Let $\hat{P}$ be the optimal value of $\min_{w} \mathcal{P}(w)$ and $\hat{D}$ be the optimal value of $\max_{\alpha, \beta : \alpha \geq 0} \mathcal{D}(\alpha, \beta)$. Then, because $\min \max \geq \max \min$, it follows that:

$$
\begin{aligned}
\min_{w} \mathcal{P}(w) & \geq \max_{\alpha, \beta : \alpha \geq 0} \mathcal{D}(\alpha, \beta) \\[7pt]
\min_{w} \max_{\alpha, \beta : \alpha \geq 0} \mathcal{L}(w, \alpha, \beta) & \geq \max_{\alpha, \beta : \alpha \geq 0} \min_{w} \mathcal{L}(w, \alpha, \beta) \\[7pt]
\hat{P} & \geq \hat{D}
\end{aligned}
$$
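The $\min \max \geq \max \min$ inequality is easy to convince yourself of numerically. Here is a tiny NumPy illustration (my own, using an arbitrary made-up table of Lagrangian values) where rows play the role of $w$ and columns the role of $(\alpha, \beta)$:

import numpy as np

# An arbitrary table L[i, j] standing in for L(w_i, (alpha, beta)_j).
L = np.array([[0.0, 2.0],
              [3.0, 1.0]])

min_max = L.max(axis = 1).min()   # min over rows of (max over columns)
max_min = L.min(axis = 0).max()   # max over columns of (min over rows)

print(min_max, max_min)           # 2.0 >= 1.0, as the inequality suggests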

The inequality $\hat{P} \geq \hat{D}$ is called weak duality, and the difference $\hat{P} - \hat{D}$ is called the duality gap. Weak duality always holds, even for non-convex problems. What we would really like to know is when strong duality ($\hat{P} = \hat{D}$) holds, which happens exactly when the duality gap is 0. By Slater's theorem (which I will not prove here), strong duality holds when:

  1. $f(w)$ and the $g_i(w)$ are convex, and the $h_j(w)$ are affine.
  2. $\exists w$ such that $g_i(w) < 0 \;\; \forall i$ (strict feasibility).

For the hard-margin SVM there are no equality constraints $h_j(w)$. We know that $f(w) = \frac{1}{2}||w||_2^2$ is convex and that each $g_i(w)$ is affine and therefore convex. So Slater's conditions hold under the assumption that $\exists w$ such that $g_i(w) < 0$ for all $i$, which is true whenever some hyperplane strictly separates our data (as we will let $g_i(w) = -[y_i(w^Tx_i + b) - 1] \leq 0$).

Under these conditions we will have some optimal values $\hat{w}$, $\hat{\alpha}$, and $\hat{\beta}$ that additionally satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:

  1. $\frac{\partial}{\partial w_{i}} \mathcal{L}(w, \alpha, \beta) = 0 \; \forall i$ (stationarity with respect to each primal variable)
  2. $\frac{\partial}{\partial \beta_{j}} \mathcal{L}(w, \alpha, \beta) = 0 \; \forall j = 1, 2, ..., v$
  3. $\alpha_{i} \, g_{i}(w) = 0 \; \forall i = 1, 2, ..., u$ (complementary slackness)
  4. $g_{i}(w) \leq 0 \; \forall i = 1, 2, ..., u$
  5. $\alpha_i \geq 0 \; \forall i = 1, 2, ..., u$

Dual Form Derivation:

The SVM primal form is a constrained optimization problem and can be solved using Lagrange multipliers. Following the notes above, we let $g_i(w) = -[y_i(w^Tx_i + b) - 1] \leq 0$. Since we have no equality constraints, the Lagrangian of the SVM primal form is:

$$
\begin{aligned}
\mathcal{L}(w, b, \alpha) &= \frac{1}{2} || w ||_{2}^2 - \sum_{i = 1}^{n} \alpha_{i} [y_{i}(w^Tx_{i} + b) - 1] \\[7pt]
&= \frac{1}{2} w^T w - \sum_{i = 1}^{n} [\alpha_{i} y_{i} w^Tx_{i} + \alpha_{i} y_{i} b - \alpha_{i}] \\[7pt]
&= \frac{1}{2} w^T w - w^T \sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i} - b \sum_{i = 1}^{n}\alpha_{i} y_{i} + \sum_{i = 1}^{n}\alpha_{i}
\end{aligned}
$$

Considering KKT condition #1, we take the partial derivatives w.r.t. $w$ and $b$, which gives:

$$
\begin{aligned}
\nabla_{w}\mathcal{L}(w, b, \alpha) &= w - \sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i} = 0 \\[7pt]
&\Rightarrow w = \sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i}
\end{aligned}
$$

$$
\begin{aligned}
\nabla_{b}\mathcal{L}(w, b, \alpha) &= -\sum_{i = 1}^{n}\alpha_{i} y_{i} = 0 \\[7pt]
&\Rightarrow \sum_{i = 1}^{n}\alpha_{i} y_{i} = 0
\end{aligned}
$$

Looking back at our original Lagrangian and plugging in $w = \sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i}$ and $\sum_{i = 1}^{n}\alpha_{i} y_{i} = 0$ allows us to simplify:

$$
\begin{aligned}
\mathcal{L}(w, b, \alpha) &= \frac{1}{2} w^T w - w^T \underbrace{\sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i}}_{w} - b \underbrace{\sum_{i = 1}^{n}\alpha_{i} y_{i}}_{0} + \sum_{i = 1}^{n}\alpha_{i} \\[7pt]
&= \frac{1}{2} w^T w - w^T w + \sum_{i = 1}^{n}\alpha_{i} \\[7pt]
&= -\frac{1}{2}w^T w + \sum_{i = 1}^{n}\alpha_{i} \\[7pt]
&= -\frac{1}{2}\left(\sum_{i = 1}^n \alpha_{i} y_{i} x_{i}^T\right)\left(\sum_{j = 1}^n \alpha_{j} y_{j} x_{j}\right) + \sum_{i = 1}^{n}\alpha_{i} \\[7pt]
&= -\frac{1}{2}\sum_{i = 1}^n \sum_{j = 1}^n \alpha_{i} \alpha_{j} y_{i} y_{j} x_{i}^T x_{j} + \sum_{i = 1}^{n}\alpha_{i} \\[7pt]
&= \sum_{i = 1}^{n}\alpha_{i} - \frac{1}{2}\sum_{i = 1}^n \sum_{j = 1}^n \alpha_{i} \alpha_{j} y_{i} y_{j} \langle x_{i}, x_{j} \rangle
\end{aligned}
$$

Which leaves us with the final SVM Dual Form and the following optimization problem:

$$
\begin{aligned}
\max_{\alpha} \; & \sum_{i = 1}^{n}\alpha_{i} - \frac{1}{2}\sum_{i = 1}^n \sum_{j = 1}^n \alpha_{i} \alpha_{j} y_{i} y_{j} \langle x_{i}, x_{j} \rangle \\[7pt]
\text{subject to:} \; & \sum_{i = 1}^{n}\alpha_{i} y_{i} = 0 \\[5pt]
& \alpha_{i} \geq 0 \;\; \forall i = 1, 2, ..., n
\end{aligned}
$$

This dual form will be the main focus of the practical implementation, as we can optimize it with quadratic programming using convex optimization libraries in Python such as CVXOPT.

Deriving the Weights and Bias from the Dual Form

We already have a solution for the weights that satisfies KKT condition #1, namely $w = \sum_{i = 1}^{n} \alpha_{i} y_{i} x_{i}$. Taking into account KKT condition #3, notice that $\alpha_{i}(y_{i}(w^Tx_{i} + b) - 1) = 0$ with $\alpha_i \geq 0$. This leaves us with two cases: either $\alpha_{i} = 0$ and $x_i$ is not a support vector, or $y_{i}(w^Tx_{i} + b) = 1$ and $x_i$ is a support vector. We can ignore the case in which $\alpha_{i} = 0$ and deal solely with the support vector case. Simplifying it gives the following:

$$
\begin{aligned}
y_{i}(w^Tx_{i} + b) &= 1 \\
w^Tx_{i} + b &= \frac{1}{y_{i}} \\
b &= \frac{1}{y_{i}} - w^Tx_{i} \\
b &= y_{i} - w^Tx_{i}
\end{aligned}
$$

(The last step uses $1 / y_{i} = y_{i}$, since $y_{i} \in \{-1, 1\}$.)

The equation above gives the bias associated with any single support vector, i.e. a point with $\alpha_{i} > 0$. To find the bias of the hyperplane, we average the bias of one support vector from each class. Resulting in:

$$
\begin{aligned}
b_{\mathcal{H}} & = \frac{[y_{i} - w^Tx_{i}] + [y_{j} - w^Tx_{j}]}{2} \\[7pt]
& = -\frac{w^Tx_{i} + w^Tx_{j}}{2} \\[7pt]
\text{subject to:} \; & y_{i} = -1 \land \alpha_{i} > 0 \\[5pt]
& y_{j} = 1 \land \alpha_{j} > 0
\end{aligned}
$$

In most cases this will equal the average over all support vector biases, though not always. In fact, I have seen multiple sources claim that the bias equals the mean over all support vector biases (I believed this myself at one point). But if one margin boundary contains two support vectors while the other contains only one, averaging over all of them skews the bias toward the side with more support vectors. So be careful to average only two bias terms, one from a support vector on each side.
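As a preview of the implementation below, here is a minimal NumPy sketch (my own, with hypothetical variable names) of how the weights and bias could be recovered from a dual solution, assuming arrays alphas (shape (m,)), X (shape (m, n)), and y (shape (m,), values in {-1, +1}) are already available:

import numpy as np

def recover_hyperplane(alphas, X, y, tol = 1e-4):
    # Support vectors are the points with (numerically) non-zero multipliers.
    support = alphas > tol

    # KKT condition #1: w = sum_i alpha_i y_i x_i.
    w = ((alphas * y)[:, None] * X).sum(axis = 0)

    # One support vector from each class; the bias is the average of their biases.
    i = np.where(support & (y == -1))[0][0]
    j = np.where(support & (y == 1))[0][0]
    b = -(X[i] @ w + X[j] @ w) / 2.0

    return w, b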

Python Implementation and CVXOPT

CVXOPT is a Python library for convex optimization. Its quadratic programming solver handles convex problems of the following form:

$$
\begin{array}{ll}
& \min_{x} \frac{1}{2} x^{T} P x + q^{T} x \\
\text{subject to:} & G x \leq h \\
& A x = b
\end{array}
$$
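Before mapping the SVM dual onto this form, here is a tiny standalone example of the cvxopt.solvers.qp interface on a toy problem (unrelated to the SVM; the values are made up), just to show how $P$, $q$, $G$, and $h$ are passed in:

import numpy as np
import cvxopt

# Minimize x1^2 + x2^2 - 2 x1 - 5 x2 subject to x >= 0, written as Gx <= h.
P = cvxopt.matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))
q = cvxopt.matrix(np.array([[-2.0], [-5.0]]))
G = cvxopt.matrix(-np.eye(2))
h = cvxopt.matrix(np.zeros((2, 1)))

solution = cvxopt.solvers.qp(P, q, G, h)
print(np.array(solution["x"]))  # roughly [1.0, 2.5]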

We will use CVXOPT to build a Python implementation of the SVM. Much of the following code and design is modified from Aladdin Persson’s implementation (found here). His implementation of the SVM with CVXOPT was instrumental in my understanding, so a huge shout-out to him. Below I will show the core pieces of the code. The full code can be found on my GitHub.

svm.py
import matplotlib.pyplot as plt
import numpy as np
import cvxopt
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import make_circles, make_blobs
from sklearn.svm import SVC

We’re going to implement the SVM class in the traditional scikit-learn fashion taking the following abstract form:

svm.py
class SVM:
    def __init__(self):
        self.X = None
        self.y = None

    def fit(self, X, y):
        ...

    def predict(self, X):
        ...

To do this, let’s first implement the fit function. Though one problem does arise, we need to convert our dual form problem into something recognizable by CVXOPT. That is, we need to find a way to represent P,q,G,h,A,bP, q, G, h, A, b. To do this I followed this article, although I have also posted my hand-written notes on this conversion here. If we let x=αx=\alpha, then after some tedious algebra we can represent PP, qq, GG, hh, AA, and bb as the following:

$$
\begin{aligned}
P_{ij} := \; & y_{i} y_{j} \langle x_{i}, x_{j} \rangle \\[5pt]
q := \; & -\mathbf{1} \\[2pt]
G := \; & \begin{bmatrix} -I_{m} \\ I_{m} \end{bmatrix} \\[2pt]
h := \; & \begin{bmatrix} \mathbf{0} \\ C \mathbf{1} \end{bmatrix} \\[2pt]
A := \; & y^T \\[2pt]
b := \; & 0
\end{aligned}
$$

Here $I_{m}$ is the $m \times m$ identity matrix, $\mathbf{1}$ and $\mathbf{0}$ are vectors of ones and zeros, and $X$ has shape $(m, n)$; in other words, $m$ is the number of samples. The $\alpha_i \leq C$ block of $G$ and $h$ comes from the soft-margin variant of the dual (the code below uses $C = 1$); the hard-margin dual above needs only the $\alpha_i \geq 0$ block. Given this, our fit function becomes the following:

svm.py
class SVM:
    ...

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
        m, n = x.shape

        # CVXOPT requires this shape and data type.
        y = y.copy().reshape(-1, 1).astype(np.float64)

        self.X = x
        self.y = y

        # Compute the (linear) kernel, i.e. the matrix of inner products.
        self.kernel = np.dot(self.X, self.X.T)

        # See my notes above for how this was derived.
        P = cvxopt.matrix(np.outer(y, y) * self.kernel)
        q = cvxopt.matrix(-np.ones((m, 1)))
        G = cvxopt.matrix(np.vstack((-np.identity(m), np.identity(m))))
        h = cvxopt.matrix(np.vstack((np.zeros((m, 1)), np.ones((m, 1)))))
        A = cvxopt.matrix(y, (1, m), 'd')
        b = cvxopt.matrix(np.array([0]), (1, 1), 'd')

        # Configure the solver before calling it, then solve the QP.
        cvxopt.solvers.options["show_progress"] = True
        optimal = cvxopt.solvers.qp(P, q, G, h, A, b)

        self.alphas = np.array(optimal['x'])

Next, we need to write the predict function. For SVMs, this follows closely from the section above on deriving the weights and bias from the dual form, which we then use to make predictions.

svm.py
class SVM:
    ...

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Support vectors are the points with (numerically) non-zero alphas.
        non_zero_alphas = (self.alphas > 1e-4).flatten()

        support_x = self.X[non_zero_alphas]
        support_a = self.alphas[non_zero_alphas]
        support_y = self.y[non_zero_alphas]

        # w^T x for every support vector x, using w = sum_i alpha_i y_i x_i.
        support_wx = (support_a * support_y * np.dot(support_x, support_x.T)).sum(axis = 0)

        # One support vector from each class; the hyperplane bias is the
        # average of their two biases (see the derivation above).
        idx_0 = np.where(support_y == -1)[0][0]
        idx_1 = np.where(support_y == 1)[0][0]
        bias = -(support_wx[idx_0] + support_wx[idx_1]) / 2

        # Decision function f(x) = w^T x + b; the prediction is its sign.
        predictions = support_a * support_y * np.dot(support_x, X.T)
        predictions = predictions.sum(axis = 0) + bias

        return np.sign(predictions)

The full code, and it’s implementation can again be found on my GitHub. Now let’s test out the performance of our model on some fake data. We’ll also compare it against the scikit-learn implementation of the SVM.

svm.py
# Get the toy dataset.
X, y = make_blobs(n_samples = 200, centers = 2, random_state = 1)
y[y == 0] = -1
y_int_32 = y
y = y.reshape(-1, 1) * 1.0
# Our model.
model = SVM()
model.fit(X, y)
# sklearn implementation for reference.
sklearn_model = SVC(kernel = "linear")
sklearn_model.fit(X, y_int_32)
# Visualize performance.
fig, (ax0, ax1) = plt.subplots(1, 2, figsize = (8, 3))
plot_decision_regions(X, y_int_32, clf = model, legend = 2, ax = ax0)
plot_decision_regions(X, y_int_32, clf = sklearn_model, legend = 2, ax = ax1)
ax0.set_title("Custom")
ax1.set_title("scikit-learn")
plt.show()

Which gives:

We can see that our model gives a nice separation of the data, and performs the same as the scikit-learn implementation.

Conclusion

In this post, we introduced the Support Vector Machine (SVM), a model for supervised learning of a binary classifier. We explored the mathematical foundations of SVMs, starting from the concept of margin maximization, and then derived both the primal and dual optimization forms. We also introduced and discussed the Karush-Kuhn-Tucker (KKT) conditions, and their role in solving the SVM optimization problem. Finally, we showed a practical implementation of an SVM in Python, using the CVXOPT library, and compared our implementation with scikit-learn’s.

References

  1. Cortes, C., Vapnik, V.N. (1995). Support-Vector Networks. Link
  2. Vapnik, V.N. (1997). The Support Vector method. Link
  3. M. S. Andersen, J. Dahl, and L. Vandenberghe. (2012). CVXOPT: A Python package for convex optimization. Link
  4. Persson, A. (2020). SVM from Scratch - Machine Learning Python (Support Vector Machine). Link
  5. Xavier, S. B. (2018). Support Vector Machine: Python implementation using CVXOPT. Link
  6. Ng, A. CS229 Lecture Notes. Link
  7. Weinberger, K. (2018). Lecture 9: SVM. Link
  8. Gordon, G., Tibshirani, R. Karush-Kuhn-Tucker Conditions. Link

Figures

  1. Custom image.
  2. Example of the SVM and scikit-learn fit on a dataset. Link.