This post is based on Andrew Ng’s CS229 course at Stanford University.
We now briefly digress to talk about an algorithm used to find optima (remember: we used batch gradient descent to do this). This method is called Newton’s method, or the Newton-Raphson method, and is usually faster than batch gradient descent.
Suppose we have a function . That is, it takes in a real number as input, and gives some other real number as output. We’re interested in finding a value (a real number such that . We start with some initial guess, say . For the Newton-Raphson method, the subscript indicates which iteration we’re currently on. The 0th iteration means it’s our initial guess. The Newton-Raphson method performs the following update:
Let’s understand what this intuitively means. Consider the function shown below. The line is also plotted.
The minimum for this function occurs at around 1.3. Let’s pick an initial guess of 4.5. We draw a vertical line, . We take the point where this vertical line meets the function, and draw a tangent at that point. Then, we note the point where this tangent meets the horizontal, line that we drew earlier. This is illustrated in the next figure, numbered for your ease of understanding.
We continue doing the exact same procedure, now starting with , to obtain , and so on. is about 2.8, and is about 1.8.
So now we know how to use the Newton-Raphson method (or Newton’s method) to find a value for which . How do we use this to maximize or minimize a function? We note that at the maxima or minima of a function, . Thus, we can use Newton’s method in the same way as described above, using instead. We can use this to maximize our log-likelihood function, and thus our update rule is
We note that for logistic regression, is a vector and not a real number, so we modify the equation accordingly. Remember, from our post on linear regression, that we defined as the derivative of the cost function with respect to , where was a vector. We use the same for the numerator. But what about the denominator? We need something to represent the second derivative of the cost function, where is, again, a vector. Fortunately, this already exists, and is called the Hessian. Each entry of the Hessian is defined as
Since we need the double derivative in the denominator, and division is not defined for matrices, we use the inverse, to get the final update rule (here, is a vector).
Here, is of dimensions . While Newton’s method takes fewer steps to find the minimum, it does require finding and then inverting the Hessian, and then multiplying with another matrix. This can be very expensive for large values of . When we use the Newton-Raphson method instead of gradient descent in logistic regression, it’s called Fisher scoring.