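Let’s continue and work towards posing a dual for our optimization problem. Specifically, we’ll focus on the constraints that we have. For each training example, we have
$$y^{(i)}(w^Tx^{(i)} + b) \geq 1$$
Shuffling terms around,
$$g_i(w, b) = 1 - y^{(i)}(w^Tx^{(i)} + b) \leq 0$$
Note that if the functional margin $y^{(i)}(w^Tx^{(i)} + b)$ is 1, then $g_i(w, b) = 0$. The KKT complementary slackness condition, $\alpha_i g_i(w, b) = 0$, then allows $\alpha_i > 0$ only at such points; wherever $g_i(w, b) < 0$, it forces $\alpha_i = 0$. Let’s recall the scaling condition we imposed on $w$ and $b$ (in the first post):
$$\min_i \; y^{(i)}(w^Tx^{(i)} + b) = 1$$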
What this means is that at the points closest to the optimal hyperplane, the functional margin is 1. This figure depicts it perfectly:
Don’t worry about the equation having $-b$: that’s just a consequence of this particular hyperplane having a negative intercept on the y-axis. Thus, in this example, only three of the $\alpha_i$s will be nonzero at the optimal solution. We call these three points support vectors. Sometimes, you’ll see the lines through them also being called the support vectors. The important thing is these are where the functional margin is equal to 1.
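To make the "functional margin equal to 1" idea concrete, here is a minimal sketch in plain Python. The four 2D points, and the values of `w` and `b`, are made up for illustration (the solution was worked out by hand for this toy set); they are not taken from the figure.

```python
# Hypothetical toy data: two positive and two negative points in 2D.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [1, 1, -1, -1]

# Assumed max-margin hyperplane for this toy set (derived by hand).
w = (0.5, 0.5)
b = -2.0


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def functional_margin(x_i, y_i):
    """y_i * (w^T x_i + b) for a single training example."""
    return y_i * (dot(w, x_i) + b)


margins = [functional_margin(x_i, y_i) for x_i, y_i in zip(X, y)]

# Support vectors are the points whose functional margin is exactly 1.
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

print(margins)  # [1.0, 2.0, 1.0, 1.5]
print(support)  # [0, 2]
```

Every margin is at least 1, so the constraints hold, and only the points at indices 0 and 2 sit exactly on the margin.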
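Let’s now frame the Lagrangian for our optimization problem:
$$\mathcal{L}(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y^{(i)}(w^Tx^{(i)} + b) - 1 \right]$$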
From the post on convex optimization, recall that the dual problem is $\max_{\alpha \geq 0} \min_{w, b} \mathcal{L}(w, b, \alpha)$. So let’s first minimize our Lagrangian. To do so, we compute the gradients of the Lagrangian with respect to $w$ and $b$. Why aren’t we doing this with respect to $\alpha$ as well? Try it out. You’ll get the condition that every point is a support vector, which is obviously not the case. So we’ll fix $\alpha$, and only compute the other two partial derivatives, and set them to 0.
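The result of this, which is below, is quite important. Let’s call this equation 1.
$$\nabla_w \mathcal{L}(w, b, \alpha) = w - \sum_i \alpha_i y^{(i)} x^{(i)} = 0 \implies w = \sum_i \alpha_i y^{(i)} x^{(i)}$$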
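Similarly, by finding the partial derivative with respect to $b$,
$$\frac{\partial \mathcal{L}}{\partial b} = -\sum_i \alpha_i y^{(i)} = 0 \implies \sum_i \alpha_i y^{(i)} = 0$$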
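Let’s call this equation 2. We now use equation 1 in our Lagrangian. This gives us
$$\begin{aligned}
\mathcal{L}(w, b, \alpha) &= \frac{1}{2} w^Tw - \sum_i \alpha_i \left[ y^{(i)} \left( \Big( \sum_j \alpha_j y^{(j)} x^{(j)} \Big)^T x^{(i)} + b \right) - 1 \right] \\
&= \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle - \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle - \underbrace{b \sum_i \alpha_i y^{(i)}}_{=\,0} + \sum_i \alpha_i \\
&= \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle
\end{aligned}$$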
Let’s look at this derivation.
- The first step is from the observation that $\|w\|^2 = w^Tw$. We also plugged in the value of $w$ in the second sum.
- The next step expands the sums, and uses equation 2, so the third sum is 0.
- Next, we shuffle terms around, and note that the first two terms really are the same.
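The derivation can also be checked numerically. The sketch below uses made-up toy data and an arbitrary (not optimal) $\alpha$ chosen only so that $\sum_i \alpha_i y^{(i)} = 0$. It builds $w$ from equation 1, then compares the Lagrangian against the simplified expression $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle$. The function names are my own.

```python
# Hypothetical toy data and an arbitrary alpha with sum(alpha_i * y_i) == 0.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [1, 1, -1, -1]
alpha = [0.5, 0.25, 0.5, 0.25]


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def w_from_alpha(alpha):
    """Equation 1: w = sum_i alpha_i * y_i * x_i."""
    dim = len(X[0])
    return tuple(
        sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X))
        for k in range(dim)
    )


def lagrangian(w, b, alpha):
    """0.5 * ||w||^2 - sum_i alpha_i * (y_i * (w^T x_i + b) - 1)."""
    penalty = sum(
        a * (yi * (dot(w, xi) + b) - 1) for a, yi, xi in zip(alpha, y, X)
    )
    return 0.5 * dot(w, w) - penalty


def dual_objective(alpha):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


w = w_from_alpha(alpha)
for b in (-2.0, 0.0, 0.7):
    # b drops out because equation 2 makes its coefficient zero.
    print(b, lagrangian(w, b, alpha), dual_objective(alpha))
```

The two values agree for every choice of `b`, which is exactly what equation 2 buys us: the term multiplying $b$ vanishes.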
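With this in place, let’s look at our final dual problem, which is the problem to maximize this function.
$$\begin{aligned}
\max_{\alpha} \quad & W(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \\
\text{s.t.} \quad & \alpha_i \geq 0 \quad \text{for all } i \\
& \sum_i \alpha_i y^{(i)} = 0
\end{aligned}$$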
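For intuition, the dual can be solved by hand in the smallest possible case: one positive and one negative point. This is a sketch under that assumption, not a general solver. The constraint $\sum_i \alpha_i y^{(i)} = 0$ forces both multipliers to share a value $a$, so the objective collapses to $W(a) = 2a - \frac{1}{2} a^2 \|x_+ - x_-\|^2$, which is maximized at $a = 2 / \|x_+ - x_-\|^2$. The points below are made up.

```python
# Hypothetical two-point problem: one positive and one negative example.
x_pos = (3.0, 3.0)
x_neg = (1.0, 1.0)


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


diff = tuple(p - n for p, n in zip(x_pos, x_neg))
sq_dist = dot(diff, diff)  # ||x_pos - x_neg||^2


def W(a):
    """Dual objective when both multipliers equal a (forced by equation 2)."""
    return 2 * a - 0.5 * a * a * sq_dist


# Setting dW/da = 2 - a * sq_dist to zero gives the maximizer.
a_star = 2.0 / sq_dist

# Equation 1 with alpha = (a_star, a_star) and labels (+1, -1).
w = tuple(a_star * d for d in diff)

print(a_star, w, W(a_star))
```

With more than two points there is no such shortcut, and the dual is usually handed to a quadratic programming routine.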
If we solve this problem for $\alpha$, we can plug it in equation 1, to get $w$. How do we get $b$? Look at our constraint:
$$y^{(i)}(w^Tx^{(i)} + b) \geq 1$$
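If $y^{(i)} = -1$, then dividing both sides by -1,
$$w^Tx^{(i)} + b \leq -1 \implies b \leq -1 - w^Tx^{(i)}$$
For the other case, $y^{(i)} = +1$, we divide both sides by +1:
$$w^Tx^{(i)} + b \geq 1 \implies b \geq 1 - w^Tx^{(i)}$$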
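To combine these, note that our initial, non-convex optimization problem was to maximize the functional margin (we had a denominator $\|w\|$, but that is always positive). To achieve this, we consider the maximum of $w^Tx^{(i)}$ in the first result above (over the negative examples), and the minimum of $w^Tx^{(i)}$ in the second (over the positive examples). In effect, we consider the equality in both cases. We then add up the results and divide by two. This yields us
$$b = -\frac{\max_{i:\, y^{(i)} = -1} w^Tx^{(i)} + \min_{i:\, y^{(i)} = +1} w^Tx^{(i)}}{2}$$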
Intuitively, this says find the smallest positive and the largest negative training examples, and place the line right in the middle of the two (because $b$ controls the position of the hyperplane, while $w$ controls the direction). The figure below explains this intuition (this is hand-drawn, so it’s probably not accurate).
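As a small sketch of this recipe: take the largest score $w^Tx^{(i)}$ among the negative examples and the smallest among the positive ones, then set $b$ to minus their average. The data and `w` below are made up, with `w` assumed to be already known from equation 1.

```python
# Hypothetical toy data; w is assumed to come from equation 1.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [1, 1, -1, -1]
w = (0.5, 0.5)


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Scores w^T x_i, split by class.
neg_scores = [dot(w, x_i) for x_i, y_i in zip(X, y) if y_i == -1]
pos_scores = [dot(w, x_i) for x_i, y_i in zip(X, y) if y_i == 1]

# Place the hyperplane midway between the closest points of each class.
b = -(max(neg_scores) + min(pos_scores)) / 2

print(b)  # -2.0
```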
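Note that to make predictions, we simply compute $w^Tx + b$, and predict 1 if this is positive; otherwise, we predict -1. Using equation 1, we need to compute
$$w^Tx + b = \Big( \sum_i \alpha_i y^{(i)} x^{(i)} \Big)^T x + b = \sum_i \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$$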
Thus, making predictions only requires finding the values of $\alpha_i$. Notice that these are all 0 except for the support vectors. So in reality, we have only a few computations to make, and the sum really is only over the support vectors, not the entire training set.
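Here is a minimal sketch of prediction in this form. It never builds $w$ explicitly; it only sums $\alpha_i y^{(i)} \langle x^{(i)}, x \rangle$ over the points with nonzero $\alpha_i$. The data, the `alpha` values, and `b` are made up for illustration, and `decision_value` and `predict` are names of my own.

```python
# Hypothetical toy data with an assumed dual solution (alpha) and intercept b.
X = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 1.0)]
y = [1, 1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
b = -2.0


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decision_value(x):
    """sum_i alpha_i * y_i * <x_i, x> + b, skipping points with alpha_i == 0."""
    return sum(
        a * y_i * dot(x_i, x)
        for a, y_i, x_i in zip(alpha, y, X)
        if a > 0  # only support vectors contribute
    ) + b


def predict(x):
    """Predict 1 if the decision value is positive, otherwise -1."""
    return 1 if decision_value(x) > 0 else -1


print(decision_value((5.0, 5.0)), predict((5.0, 5.0)))  # 3.0 1
print(decision_value((0.0, 0.0)), predict((0.0, 0.0)))  # -2.0 -1
```

Only two of the four training points are touched per prediction, and each is touched only through an inner product with `x`.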
The fact that we only need to compute inner products, essentially, will be useful when we deal with kernels, which help when the data is not linearly separable. When the data is very high-dimensional, for some feature spaces, computing inner products can be done efficiently.
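As a taste of why this matters, here is a small sketch using one standard example that I am assuming here, since it is not developed in this post: the quadratic feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$. The inner product $\langle \phi(x), \phi(z) \rangle$ equals $(x^Tz)^2$, so it can be computed without ever forming $\phi$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(x):
    """Explicit quadratic feature map for a 2D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """Same inner product, computed in the original 2D space."""
    return dot(x, z) ** 2


x = (1.0, 2.0)
z = (3.0, 4.0)

explicit = dot(phi(x), phi(z))    # inner product in the 3D feature space
implicit = quadratic_kernel(x, z)  # no feature map needed

print(explicit, implicit)
```

The two numbers agree up to floating-point rounding. For higher-degree maps the explicit feature space grows quickly, while the shortcut still costs one inner product in the original space.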