The Support Vector Machine (SVM) is
another form of supervised learning of a binary classifier (see my post on the
Perceptron). Specifically, SVMs are a type of margin maximizer that, along
with a kernel (more on this some other time), can become a very powerful tool.
Other models for supervised learning of a binary classifier, such as the Rosenblatt
Perceptron algorithm, can guarantee convergence, but there is no guarantee on
"how good" the hyperplane they find is.
The key insight that led to the derivation of SVMs was the idea that we would
not only like our model to converge, but would like our model to give us the
“best possible” hyperplane.
But how does one define the "best possible" hyperplane? SVMs
define it as the hyperplane that maximizes the distance between the two binary
classes. We will call this distance the margin. Therefore, SVMs try to
maximize this margin.
The Margin
Given a dataset $X \in \mathbb{R}^{D \times N}$ with binary labels $y \in \{-1, 1\}$, weights
$w \in \mathbb{R}^{D \times 1}$, and a bias $b \in \mathbb{R}$, we define the margin $M$
as the distance from the hyperplane $H$ (given by $w^T x + b = 0$) to the closest
point $x_i \in X$.
To formalize the definition of the margin $M$, let's first formalize the
distance from the hyperplane $H$ to some point $x_i$. First, notice
that if we project $x_i$ onto $H$ we get another point, denoted
$x_p$ (see Figure 1), such that the distance from $x_i$ to the hyperplane
is simply the distance between the two points. More formally, the distance from
$H$ to $x_i$ is $\|e\|_2$ where $e = x_i - x_p$.
Next, notice that because $e$ represents the minimum distance between
$x_i$ and $H$, it is orthogonal to $H$ and
parallel to $w$. Therefore we can rewrite $e$ as $e = \alpha w$ for
some unknown $\alpha$. Additionally, since we know $x_p$ lies on the
hyperplane, it must be true that $w^T x_p + b = 0$. Using this fact, we
can derive the following line of reasoning:
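Substituting $x_p = x_i - e = x_i - \alpha w$ into $w^T x_p + b = 0$ and solving for $\alpha$:

$$
\begin{aligned}
w^T (x_i - \alpha w) + b &= 0 \\
w^T x_i - \alpha \, w^T w + b &= 0 \\
\alpha &= \frac{w^T x_i + b}{w^T w}
\end{aligned}
$$

so the distance is

$$
\|e\|_2 = \|\alpha w\|_2 = |\alpha| \, \|w\|_2 = \frac{|w^T x_i + b|}{\|w\|_2},
$$

and the margin is $M = \min_{x_i \in X} \frac{|w^T x_i + b|}{\|w\|_2}$.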
Notice one last thing: with this definition of the margin, we could find
some hyperplane that is infinitely far from our data and still satisfies this
optimization problem. Therefore we must add an essential constraint:
$y_i(w^T x_i + b) \geq 0 \;\; \forall x_i \in X$. Using this, we
once again rewrite the optimization problem as:
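Combining the margin objective with the new correctness constraint, the problem reads:

$$
\max_{w,\, b} \; \min_{x_i \in X} \frac{|w^T x_i + b|}{\|w\|_2} \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 0 \;\; \forall x_i \in X
$$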
The above optimization problem is rather ugly: the maximization of a
minimization with an inequality constraint does not look fun to optimize.
Preferably, we would like to simplify this by removing the inner minimization.
To do so, notice that the hyperplane is scale invariant. That is, $cw$ for any
nonzero $c \in \mathbb{R}$ still defines the same hyperplane. Therefore, we can
always rescale $w$ and $b$ such
that $\min_{x_i} |w^T x_i + b| = 1$. This constraint allows us to simplify
the above objective as:
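With the rescaling in place, the inner minimization collapses to a constraint:

$$
\max_{w,\, b} \; \frac{1}{\|w\|_2} \quad \text{subject to} \quad \min_{x_i} |w^T x_i + b| = 1, \quad y_i(w^T x_i + b) \geq 0 \;\; \forall x_i \in X
$$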
Then notice that if $\min_{x_i} |w^T x_i + b| = 1$, it must be true that
$|w^T x_i + b| \geq 1 \;\; \forall x_i \in X$, and we can simplify our
constraint even further:
$$
\max_{w,\, b} \; \frac{1}{\|w\|_2} \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall x_i \in X
$$
As Kilian Weinberger once said, “maximization is for losers.” So notice that the
maximization of the reciprocal of f(x) is just the minimization of f(x).
That is, we can rewrite our formulation as:
$$
\min_{w,\, b} \; \|w\|_2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall x_i \in X
$$
One last thing before we are done with the derivation of the SVM primal form.
Rather than minimizing $\|w\|_2$ directly, we minimize $\frac{1}{2}\|w\|_2^2$,
which has the same minimizer but is differentiable everywhere and quadratic,
making it much nicer to optimize. So we write out the final SVM primal form as:
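In full, squaring the norm and adding the conventional factor of $\frac{1}{2}$:

$$
\min_{w,\, b} \; \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 \;\; \forall x_i \in X
$$

This is an instance of the general constrained optimization problem

$$
\min_{w} \; f(w) \quad \text{subject to} \quad g_i(w) \leq 0, \; i = 1, \dots, u, \qquad h_j(w) = 0, \; j = 1, \dots, v
$$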
Such that $f(w)$ and $g_i(w)$ are convex, and $h_j(w)$ is affine. We call this form the
primal optimization problem, where the number of inequality constraints is $u$ and the number
of equality constraints is $v$. Furthermore, we define the generalized Lagrangian of this form to be:
$$
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_{i=1}^{u} \alpha_i g_i(w) + \sum_{j=1}^{v} \beta_j h_j(w)
$$
Such that $\alpha$ and $\beta$ are the Lagrange multipliers. We would like to know when the generalized
Lagrangian is equal to the primal form. To do this, first consider two quantities. The first quantity:
$$
P(w) = \max_{\alpha,\, \beta \,:\, \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
$$
Given this quantity we can consider two cases. The first case is when $w$ takes
on a value that satisfies the primal constraints; in this case we can see that
$P(w) = f(w)$. In the other case, when $w$ violates the
primal constraints, we will have $P(w) = \infty$. To formalize this
we get:
$$
P(w) = \begin{cases} f(w) & \text{if } w \text{ satisfies the primal constraints} \\ \infty & \text{if } w \text{ does NOT satisfy the primal constraints} \end{cases}
$$
Thus this form gives the same values as the primal form when $w$ satisfies the
primal constraints. Therefore, we can write an equivalent statement to the
primal form as the following:
$$
\min_{w} P(w) = \min_{w} \max_{\alpha,\, \beta \,:\, \alpha \geq 0} \mathcal{L}(w, \alpha, \beta)
$$
Now let's consider the second quantity (note it is a function of $\alpha$ and $\beta$, since $w$ is minimized out):

$$
D(\alpha, \beta) = \min_{w} \mathcal{L}(w, \alpha, \beta)
$$
Then we define the dual of this form as follows:
$$
\max_{\alpha,\, \beta \,:\, \alpha \geq 0} D(\alpha, \beta) = \max_{\alpha,\, \beta \,:\, \alpha \geq 0} \min_{w} \mathcal{L}(w, \alpha, \beta)
$$
Now let's compare these two quantities. Let $\hat{P}$ be the optimal value of
$\min_w P(w)$ and $\hat{D}$ be the optimal value of
$\max_{\alpha, \beta : \alpha \geq 0} D(\alpha, \beta)$. Then, because
$\min \max \geq \max \min$, it follows that:

$$
\hat{P} = \min_{w} \max_{\alpha,\, \beta \,:\, \alpha \geq 0} \mathcal{L}(w, \alpha, \beta) \;\geq\; \max_{\alpha,\, \beta \,:\, \alpha \geq 0} \min_{w} \mathcal{L}(w, \alpha, \beta) = \hat{D}
$$
This inequality, $\hat{P} \geq \hat{D}$, is called weak duality, and the
difference $\hat{P} - \hat{D}$ is called the duality gap. This inequality
always holds, even under non-convex conditions. We would like to know under
what conditions strong duality ($\hat{P} = \hat{D}$) holds; this happens exactly
when the duality gap is 0. By Slater's theorem (which I will not prove here),
strong duality holds in the case that:
1. $f(w)$ and $g_i(w)$ are convex, and $h_j(w)$ is affine.
2. $\exists w$ such that $g_i(w) < 0$ for all $i$.
For the hard-margin SVM, there is no equality constraint $h_j(w)$. We know that
$f(w)$ is convex and that each $g_i(w)$ is affine and therefore convex. Therefore,
Slater's conditions hold under the assumption that $\exists w$ such that
$g_i(w) < 0$ for all $i$, which is true if there exists some hyperplane that strictly
separates our data (as we will let $g_i(w) = -[y_i(w^T x_i + b) - 1] \leq 0$).
Under these conditions we will have some optimal values $\hat{w}$,
$\hat{\alpha}$, and $\hat{\beta}$ that additionally satisfy the
Karush-Kuhn-Tucker (KKT) conditions, which are as follows:
1. $\frac{\partial}{\partial w_i} \mathcal{L}(w, \alpha, \beta) = 0 \;\; \forall i = 1, 2, \dots, n$
2. $\frac{\partial}{\partial \beta_j} \mathcal{L}(w, \alpha, \beta) = 0 \;\; \forall j = 1, 2, \dots, v$
3. $\alpha_i \, g_i(w) = 0 \;\; \forall i = 1, 2, \dots, u$
4. $g_i(w) \leq 0 \;\; \forall i = 1, 2, \dots, u$
5. $\alpha_i \geq 0 \;\; \forall i = 1, 2, \dots, u$
Dual Form Derivation
The SVM primal form is a constrained optimization problem and can be solved
using Lagrange multipliers. Taking into account the notes above, we let
$g_i(w) = -[y_i(w^T x_i + b) - 1] \leq 0$. Since we do not have any equality
constraints, the Lagrangian of the SVM primal form is the following:
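Writing $m$ for the number of training samples and substituting $f$ and $g_i$:

$$
\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]
$$

Setting the derivatives with respect to $w$ and $b$ to zero (KKT conditions #1 and #2) gives
$w = \sum_{i=1}^{m} \alpha_i y_i x_i$ and $\sum_{i=1}^{m} \alpha_i y_i = 0$. Substituting these
back into $\mathcal{L}$ eliminates $w$ and $b$ and yields the SVM dual form:

$$
\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{subject to} \quad \alpha_i \geq 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0
$$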
This SVM dual form will be the main focus of the practical implementation of
this algorithm, as we can optimize it using quadratic programming and
convex optimization libraries in Python such as CVXOPT.
Deriving the Weights and Bias from the Dual Form
We already have a known solution for the weights that satisfies KKT
condition #1, that being $w = \sum_{i=1}^{m} \alpha_i y_i x_i$.
Taking into account KKT condition #3, notice that
$\alpha_i \left( y_i(w^T x_i + b) - 1 \right) = 0$ where $\alpha_i \geq 0$. This leaves
us with two cases: either $\alpha_i = 0$ and $x_i$ is not a support vector,
or $y_i(w^T x_i + b) = 1$ and $x_i$ is a support vector. We can ignore the case
in which $\alpha_i = 0$ and deal solely with the support vector case.
Simplifying the support vector case gives the following:
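Since $y_i \in \{-1, 1\}$ implies $1/y_i = y_i$, solving $y_i(w^T x_i + b) = 1$ for $b$ gives:

$$
b = y_i - w^T x_i
$$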
The equation above gives the bias for some support vector where
$\alpha_i > 0$. To find the bias of the hyperplane, we can take the
average over the support vector biases. Resulting in:
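Taking one representative support vector from each margin (a point $x_+$ with $y_+ = 1$ and a point $x_-$ with $y_- = -1$), this average is:

$$
b = \frac{1}{2}\left[ (y_+ - w^T x_+) + (y_- - w^T x_-) \right] = -\frac{1}{2}\left( w^T x_+ + w^T x_- \right)
$$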
In most cases this will equal the average over all support vector biases,
though this will not always be true. In fact, I have seen multiple sources
(including myself at one point) claim that the bias is equal to the mean over
all support vector biases. But if you have two points on one of the margins,
taking the mean over all biases would skew your bias towards that margin. So
you must be careful to average only two bias terms, one from each margin.
Python Implementation and CVXOPT
CVXOPT is a Python library for convex optimization. Its quadratic programming
solver handles convex optimization problems of the following form:

$$
\min_{x} \; \frac{1}{2} x^T P x + q^T x \quad \text{subject to} \quad Gx \preceq h, \;\; Ax = b
$$
We will use CVXOPT to build a Python implementation of the SVM. Much of the
following code and design is modified from Aladdin Persson’s implementation (found
here).
His implementation of the SVM with CVXOPT was instrumental in my understanding,
so a huge shout-out to him. Below I will show the core pieces of the code. The
full code can be found on my
GitHub.
We’re going to implement the SVM class in the traditional scikit-learn fashion taking
the following abstract form:
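A minimal sketch of that abstract form (the method bodies are filled in over the next sections; the attribute names here are my own choices):

```python
class SVM:
    """Hard-margin SVM in the scikit-learn style: fit() then predict()."""

    def __init__(self):
        self.w = None  # learned weights, set by fit()
        self.b = None  # learned bias, set by fit()

    def fit(self, X, y):
        """Solve the dual with CVXOPT, then recover w and b."""
        raise NotImplementedError

    def predict(self, X):
        """Return sign(X w + b) for each row of X."""
        raise NotImplementedError
```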
To do this, let's first implement the fit function. One problem does arise,
though: we need to convert our dual form problem into something recognizable by
CVXOPT. That is, we need to find a way to represent $P, q, G, h, A, b$. To do
this I followed
this I followed
this article,
although I have also posted my hand-written notes on this conversion
here.
If we let $x = \alpha$, then after some tedious algebra we can represent $P$, $q$,
$G$, $h$, $A$, and $b$ as the following:
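This is the standard mapping (negating the dual objective, since CVXOPT minimizes while the dual maximizes), where $\odot$ denotes the elementwise product:

$$
P = (y y^T) \odot (X X^T), \qquad q = -\mathbf{1}_m, \qquad G = -I_m, \qquad h = \mathbf{0}_m, \qquad A = y^T, \qquad b = 0
$$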
Where $I_m$ is the identity matrix of size $m \times m$, and where $X$ has shape
$(m, n)$; in other words, $m$ is the number of samples. Given this,
our fit function becomes the following:
our fit function becomes the following:
Next, we need to write the predict function. For SVMs, this will follow very closely
with the above section on deriving the weights and bias from the dual form, and then
using them to make the prediction.
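A minimal sketch of that recovery and prediction step, given the dual variables (`recover_w_b` and `predict` are my own helper names):

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-5):
    """Recover (w, b) from the dual variables alpha."""
    w = (alpha * y) @ X           # KKT condition #1: w = sum_i alpha_i y_i x_i
    sv = alpha > tol              # support vectors have alpha_i > 0
    b = np.mean(y[sv] - X[sv] @ w)
    return w, b

def predict(X, w, b):
    """Classify each row of X by which side of the hyperplane it falls on."""
    return np.sign(X @ w + b)
```

For example, `alpha = [0.25, 0.25]` is the exact dual solution for the two-point dataset `X = [[1, 1], [-1, -1]]`, `y = [1, -1]`, and recovers `w = (0.5, 0.5)`, `b = 0`.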
The full code and its implementation can again be found on my
GitHub.
Now let’s test out the performance of our model on some fake data. We’ll
also compare it against the scikit-learn implementation of the SVM.
Which gives:
We can see that our model gives a nice separation of the data, and performs
the same as the scikit-learn implementation.
Conclusion
In this post, we introduced the Support Vector Machine (SVM), a model for
supervised learning of a binary classifier. We explored the mathematical
foundations of SVMs, starting from the concept of margin maximization, and then
derived both the primal and dual optimization forms. We also introduced and
discussed the Karush-Kuhn-Tucker (KKT) conditions, and their role in solving the
SVM optimization problem. Finally, we showed a practical implementation of an
SVM in Python, using the CVXOPT library, and compared our implementation with
scikit-learn’s.
References
Cortes, C., Vapnik, V.N. (1995). Support-Vector Networks. https://doi.org/10.1007/BF00994018
Vapnik, V.N. (1997). The Support Vector method. https://doi.org/10.1007/BFb0020166
Andersen, M. S., Dahl, J., Vandenberghe, L. (2012). CVXOPT: A Python package for convex optimization. https://cvxopt.org/
Persson, A. (2020). SVM from Scratch - Machine Learning Python (Support Vector Machine). https://www.youtube.com/watch?v=gBTtR0bs-1k&list=PLhhyoLH6IjfxpLWyOgBt1sBzIapdRKZmj&index=7
Xavier, S. B. (2018). Support Vector Machine: Python implementation using CVXOPT. https://xavierbourretsicotte.github.io/SVM_implementation.html
Ng, A. CS229 Lecture notes. https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf
Weinberger, K. (2018). Lecture 9: SVM. https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote09.html
Gordon, G., Tibshirani, R. Karush-Kuhn-Tucker conditions. https://www.cs.cmu.edu/~ggordon/10725-F12/slides/16-kkt.pdf