# 2 – Linear Regression

The assumption so far is that you know, on a very abstract level, what machine learning is: you have some data, and you possibly want to “learn” from it to gather insights or to make predictions in the future. I also assume you have Python installed and can run the following test program:


```python
print("Hello, World!")
```



## Introduction

On to the topic at hand: linear regression. So what is regression? A regression algorithm is one that outputs a real number. Basically, you give it some data, it computes some function of that data, and spits out a real number. I’ll introduce the notation used by Andrew Ng in his Coursera course.

While some courses will explain simple linear regression first and multiple regression (don’t worry about these terms, we’ll cover them) later, I’ll do the opposite so you get a general idea quickly.

Say you have some training data that is labeled. By labeled, I mean that you know the output value for each training example. Here’s a concrete example. Suppose you have data about the prices of cars, perhaps in an Excel sheet or a CSV, where each row corresponds to one car. You could, say, have information about the weight, the age, and the horsepower, among other things, for each car. We call these pieces of information about the cars features. In this example, car weight, age, etc. are all features. Suppose you’ve collected data for 1000 cars: for each of these 1000 cars, you have values for all these features, as well as the price. Now you want to predict the price of a new, unseen car given its features.

## Notation

Here’s the notation we’ll use.

• A training example is a pair, $\left(x^{(i)}, y^{(i)} \right)$. The superscript $(i)$ indicates that this is the ith training example. In our cars example, this could be data about the ith car.
• In a training example, the output variable or target variable is denoted by $y^{(i)}$. In our cars example, this corresponds to the prices of the cars. For example, $y^{(5)}$ is the price of the 5th car.
• The input features are represented by $x^{(i)}$. This might seem confusing because there are many input features; don’t worry, we have notation for that too. It’s important to understand that $x^{(i)}$ is a vector: you can think of a vector as an ordered list of real numbers. So suppose the first column in our Excel sheet is the weight, and the second is the horsepower. Then the notation $x^{(i)}_1$ means the first feature (in this case, the weight) of the ith training example.
• The number of training examples is denoted by $m$. We denote the number of features by $n$.
• Finally, the hypothesis function is the function of the data that we talked about in the first paragraph. We denote this by $h(x)$.

You might take a while to let all that notation sink in, but we’ll be consistent in using this notation for a lot of algorithms. You’ll slowly get used to this.
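To make this notation concrete, here’s a tiny, made-up version of the cars dataset in Python. All the numbers are invented for illustration; the point is only how the notation maps onto arrays:

```python
import numpy as np

# Each row is one training example x^(i); the columns are features:
# weight (kg), age (years), horsepower. All values are made up.
x = np.array([[1200.0, 5.0, 90.0],
              [1500.0, 2.0, 120.0],
              [1100.0, 8.0, 75.0]])

# y^(i) is the price of the ith car (also made up).
y = np.array([4000.0, 9000.0, 2500.0])

m, n = x.shape    # m training examples, n features
print(m, n)       # 3 examples, 3 features

# x^(2)_1: the first feature (weight) of the 2nd example.
# Python indexing is 0-based, hence [1, 0].
x_2_1 = x[1, 0]
```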

## Hypothesis for linear regression

So what is linear regression? As the name suggests, in linear regression, our hypothesis function is linear in the input features. What this means is that our hypothesis function takes the form:

$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n$
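To see that the hypothesis is nothing more than a weighted sum plus an intercept, here’s a tiny sketch with made-up $\theta$ values:

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis: theta_0 + theta_1 * x_1 + ... + theta_n * x_n.
    theta has n + 1 entries (intercept first); x has n entries."""
    return theta[0] + np.dot(theta[1:], x)

theta = np.array([2.0, 10.0, 1.0])   # theta_0 = 2, theta_1 = 10, theta_2 = 1 (made up)
x = np.array([3.0, 4.0])             # one example with two features
print(h(theta, x))                   # 2 + 10*3 + 1*4 = 36.0
```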

Again, recall that each $x$ is a vector, not just a single number, and we’re using the notation above to represent each feature. Clearly, this function is linear in the input features. Why? Because we’re using each input feature (that is, $x_1$, $x_2$, etc.), but we’re not taking the square, the cube, or any power of these features. We’re using them as-is.

Note the extra terms in the function: the $\theta$ terms. A reasonable way to interpret them is as relative weights; again, this is only an intuitive understanding. So suppose that $\theta_1 = 10$, and that $\theta_2 = 1$. We’ll get to how we obtain these values later, but say we somehow got them. One way to interpret this is that feature $x_1$ has 10 times more effect on the output (the price, in our example) than feature $x_2$. We say these are only relative weights because factors like scale affect the $\theta$ values (for example, if the weight feature were in kilograms, the corresponding $\theta$ would be 1/1000 of what it would be if the weight were in metric tons. More on this later).

Notice now, from the above paragraph, that we have to obtain the values of each $\theta$. This is why we call these the parameters of the linear regression model: they are the part of the model (the hypothesis function) that we’re actually “learning”. There’s some calculus behind how we obtain, or “learn”, these values. Unfortunately, saying we’re using math to find a bunch of numbers isn’t as catchy a headline as saying machines are learning on their own.

## The cost function

So intuitively, we have the equation of some line (strictly, it’s only a line when there’s a single feature; with $n$ features it becomes a hyperplane in $n$ dimensions, but for simplicity we’ll call it a line). Obviously, no data will fit perfectly on the line. Even the best possible line (curved lines aren’t allowed) won’t, in most cases, pass through all the points. In other words, our line has some error, because it’s not perfect. The cost function gives a numerical value to how much error the line has.

How do we quantify this error? Easy: we find the error for each point in our dataset, and take their average (arithmetic mean). The error we choose for each point is the square of the vertical distance from the point to the line; that is, the difference between the predicted value $h(x^{(i)})$ and the actual value $y^{(i)}$, squared (squaring avoids the hassle of signs and square roots). Averaging these over all points gives us the equation for the cost function, which we denote by $J$.

$J(\theta) = \frac{1}{m} \sum_{i=1}^{m}\left( h(x^{(i)}) - y^{(i)} \right)^2$

For mathematical convenience later, we’ll make one minor change: we’ll add a 2 in the denominator. We’ll see why later; it’s really just for convenience of calculation.

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}\left( h(x^{(i)}) - y^{(i)} \right)^2$
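The cost function translates directly into a couple of lines of NumPy. A minimal sketch, with made-up predictions and targets:

```python
import numpy as np

def cost(predictions, y):
    """Least-squares cost J: the average squared error, halved."""
    m = y.shape[0]
    return np.sum((predictions - y) ** 2) / (2 * m)

# Made-up predicted values h(x^(i)) and true values y^(i):
predictions = np.array([3.0, 5.0])
y = np.array([1.0, 5.0])
print(cost(predictions, y))  # ((3-1)^2 + (5-5)^2) / (2*2) = 1.0
```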

Notice that $J$ is a function of $\theta$ (by now you’ve figured out that $\theta$ is also a vector, just like $x$). This is because we want to calculate the values of these $\theta$. For any training example, we already know the values of the $x$ in the hypothesis function (this is our data); so it’s the $\theta$ that we tweak to get the best fitting line.

Since this cost function represents the error of the line, we want it to be as small as possible. Hence, when we optimize the model, we’ll try to find the values of each $\theta$ so that $J(\theta)$ is as small as possible. Because of this, this cost function is called the least-squares cost function.

## Gradient descent

Note about math: This section does have some calculus, but you can skip the math and look at just the result, if you want. The rest of this section should be easy to understand.

Gradient descent is a general method that we use to find the optimal values for $\theta$. We start with an initial guess for $\theta$, and repeatedly change it to make $J(\theta)$ smaller. Here’s an intuitive picture of what we do in gradient descent.

Because $J(\theta)$ is quadratic, its graph is a parabola; with multiple parameters, it looks like a bowl. We start at our initial guess for $\theta$ (some texts call the parameters weights and represent them by $w$ instead of $\theta$). We then find the gradient, which, if you’re familiar with calculus, is obtained by finding the partial derivatives of $J$ with respect to each of the $\theta_j$s. After you find the gradient, you move opposite to its direction, which takes you downhill, towards the global minimum.

Picture many small steps slowly inching towards the global minimum: we perform gradient descent in multiple steps, or iterations. On each iteration, we take a small step towards the global minimum. Why does this step have to be small? Because if we jump too far, we’ll end up on the other side of the minimum! This isn’t a problem per se, but it does slow down the algorithm, since it keeps jumping around before converging to the optimal value.
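You can watch this behaviour on a one-dimensional toy example. The sketch below minimizes $J(\theta) = \theta^2$ (minimum at 0), which is not our regression cost but shows the same effect: a small learning rate inches towards the minimum, while a rate that is too large jumps across the minimum on every step.

```python
def gradient_descent_1d(theta, alpha, n_iter):
    """Minimize J(theta) = theta**2; its gradient is 2 * theta."""
    for _ in range(n_iter):
        theta = theta - alpha * 2 * theta
    return theta

print(gradient_descent_1d(1.0, 0.1, 50))  # small steps: converges close to 0
print(gradient_descent_1d(1.0, 0.9, 5))   # too large: overshoots, sign flips each step
```

With the small rate, each step multiplies $\theta$ by $0.8$, so it decays smoothly; with the large rate, each step multiplies it by $-0.8$, so it still shrinks but bounces from one side of the minimum to the other.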

With that understanding, it’s not too hard to formulate what one step of gradient descent looks like (the below is called the gradient descent update rule):

$\theta_j = \theta_j - \alpha \cdot \frac{\partial}{\partial \theta_j}J(\theta)$

Not too hard to figure out. At each step, we change each of the $\theta_j$s (simultaneously, for every $j$) by moving opposite (hence the negative sign) to the gradient at that point. Because we don’t want to take too large a step, we control how big these steps are using the parameter $\alpha$, which is called the learning rate. Typical starting values for $\alpha$ are 0.1 and 0.01, though the best value depends on the problem. Let’s now find the value of that partial derivative.

$$\begin{aligned} \frac{\partial}{\partial \theta_j}J(\theta) &= \frac{\partial}{\partial \theta_j} \left( \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 \right) \\ &= \frac{1}{2m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j} \left( h(x^{(i)}) - y^{(i)} \right)^2 \\ &= \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( h(x^{(i)}) - y^{(i)} \right) \\ &= \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( \theta_0 + \theta_1 x^{(i)}_1 + \theta_2 x^{(i)}_2 + \ldots + \theta_n x^{(i)}_n - y^{(i)} \right) \\ &= \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}_j \end{aligned}$$

Let’s go over these steps. We first take the constant outside the derivative. Then we use the chain rule, bringing the power 2 down and multiplying it (cancelling this 2 in the numerator is why we added the 2 in the denominator of the cost function). We then expand the hypothesis. For the last step, note that taking the partial derivative with respect to $\theta_j$ leaves only its coefficient, $x^{(i)}_j$.

Now that we have the partial derivative, we can plug this into our update rule.

$\theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j$

## Batch gradient descent and Python implementation

Note about math: This section uses very elementary linear algebra. If you’re comfortable with matrices, you should have no trouble in this section.

Look at what we did above: we ran the update rule for all the training examples at once. When we update the parameters this way, we call it batch gradient descent. Because, you know, we’re doing things in a batch. In the implementation, we don’t do this via a for loop; rather, we write $\theta$ and $y$ as vectors and $x$ as a matrix (one row per training example, one column per feature), and use built-in matrix manipulation functions to perform the update for all the training examples at once.

To do so, we need some new notation in terms of vectors and matrices. Suppose we take all the $\theta_j$s (which are real numbers) and stack them up into a vector. We call this vector $\Theta$.

$\Theta = \begin{bmatrix} \theta_0\\ \theta_1\\ \vdots \\ \theta_n \end{bmatrix}$

Notice that $\Theta$ is $(n + 1)$-dimensional. With this foundation set up, I suggest you go back up and look at the hypothesis function again. Done? Good, because you’ll need that. If you’re familiar with matrices, then you should, with the notation above, be able to show that the hypothesis function can be written as below. For your convenience, I’ve written the matrix dimensions as subscripts. Note that here, the lowercase $x$ is the vector representing all the features of one training example.

$h(x) = \Theta^T_{(1, n+1)}x_{(n+1, 1)}$

Note that like $\Theta$, $x^{(i)}$ also has dimensions $(n + 1, 1)$. Where did we get this extra feature? Remember, we have $n$ features, so it should have dimensions $(n, 1)$, right? Right, but for the sake of convenience, we add a feature $x_0$ whose value we set to 1. Multiply the two matrices out and convince yourself that only by setting $x_0 = 1$ do we get the same hypothesis function that we wrote above.

So $h(x)$ has dimensions $(1, 1)$. That is, it’s a real number. Which is right, because the hypothesis function takes all the input features of one training example and gives a real number. So with efficient matrix-processing library functions, one (batch) update of batch gradient descent can be written as:

$\Theta_{(n+1, 1)} = \Theta_{(n+1, 1)} - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h(x^{(i)})_{(1, 1)} - y^{(i)}_{(1, 1)} \right) x^{(i)}_{(n+1, 1)}$

I’ll write it out again without the clutter:

$\Theta = \Theta - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \Theta^T x^{(i)} - y^{(i)} \right) x^{(i)}$
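To convince yourself that this vectorized update really does the same thing as the per-parameter rule, here’s a sketch comparing the two on made-up data (the matrix `x` already includes the column of 1s for $x_0$):

```python
import numpy as np

def update_vectorized(theta, x, y, alpha):
    """One batch update: theta -= (alpha/m) * sum_i (h(x^(i)) - y^(i)) x^(i)."""
    m = x.shape[0]
    h = x @ theta                          # (m, 1) vector of predictions
    return theta - alpha / m * (x.T @ (h - y))

def update_looped(theta, x, y, alpha):
    """The same update, written one theta_j at a time (simultaneous update)."""
    m, n1 = x.shape
    h = x @ theta                          # computed once, from the old theta
    new_theta = theta.copy()
    for j in range(n1):
        new_theta[j] = theta[j] - alpha / m * np.sum((h - y).ravel() * x[:, j])
    return new_theta

x = np.array([[1.0, 2.0], [1.0, 3.0]])     # made-up data, bias column included
y = np.array([[5.0], [7.0]])
theta = np.array([[0.0], [0.0]])
print(np.allclose(update_vectorized(theta, x, y, 0.1),
                  update_looped(theta, x, y, 0.1)))  # True
```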

And thus, we end up with our batch gradient descent algorithm:

$\text{repeat until convergence \{} \\ \Theta = \Theta - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \Theta^T x^{(i)} - y^{(i)} \right) x^{(i)} \\ \text{\}}$

Let’s now talk code. How would we implement this in Python? Here’s a dataset from Andrew Ng’s course at Stanford University.

We’ll use NumPy arrays to represent the $x$ and $y$ variables.

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([[400], [330], [369], [232], [540]], dtype=np.float64)
x = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=np.float64)
```


Let’s not forget to add the 1s:

```python
x = np.concatenate((np.ones((5, 1), dtype=np.float64), x), axis=1)
```


And pick an initial guess for $\Theta$:


```python
theta = np.array([[40], [30], [50]], dtype=np.float64)
```



And now here’s how we implement batch gradient descent:


```python
def batch_gradient_descent(theta, x, y, alpha=0.01, n_iter=100, print_cost=100):
    """
    Implements batch gradient descent using vectors and NumPy.

    Arguments:
    theta     : (n + 1, 1)-dimension vector
    x         : (m, n + 1)-dimension matrix
    y         : (m, 1)-dimension vector
    alpha     : learning rate
    n_iter    : number of iterations to run
    print_cost: print the cost every this many iterations

    Returns:
    theta after n_iter iterations
    """
    m = y.shape[0]
    n = theta.shape[0] - 1

    plot_data = []

    # Check the input dimensions to make sure they match our expectations
    assert x.shape == (m, n + 1), 'invalid shape for x: ' + str(x.shape)
    assert y.shape == (m, 1), 'invalid shape for y: ' + str(y.shape)

    # Feature scaling: divide each column by its sum
    x_norm = np.sum(x, axis=0)
    x = x / x_norm

    # Perform the gradient descent update for the specified number of iterations
    for z in range(n_iter):
        h = np.dot(x, theta)
        theta = theta - alpha / m * np.sum((h - y) * x, axis=0, keepdims=True).T

        if z % print_cost == 0:
            cost = 1 / (2 * m) * np.sum((h - y) ** 2)
            plot_data.append([z, cost])
            print(cost)

    # Plot the cost function
    plot_x = [item[0] for item in plot_data]
    plot_y = [item[1] for item in plot_data]
    plt.plot(plot_x, plot_y, 'ro--')

    return theta
```



The expression for $h(x)$ seems different, but it’s really the same. The difference arises because of how we’ve chosen to take $x$ as input, where we flipped the dimensions to make it easy to write the matrices.

We first check all the dimensions. After that, we perform something called feature scaling, which we’ll talk about later. The actual update itself is just two lines of code. Note how we’re not explicitly iterating over the training examples: we operate on the vectors themselves and update everything at once. This technique is called vectorization, and it can greatly speed up computation; you should use it whenever possible. Finally, we plot the cost function as we iterate. Remember: we want this to go down. Here’s a sample run in a Jupyter notebook:

Note how the cost function seems to flatten out. It isn’t really flattening out; the scale of the y-axis makes it look flat. If this doesn’t convince you, you can look at the cost function value that it outputs every 10,000 iterations to assure yourself that it indeed is working.

Instead of updating as a batch, we could choose to run a loop, updating the parameters one training example at a time. You might do this if $m$ is large, say 1 million or 10 million. This is easy:

$\text{loop \{} \\ \quad \text{for } i = 1 \text{ to } m \text{ \{} \\ \qquad \theta_j = \theta_j - \frac{\alpha}{m} \left( h(x^{(i)}) - y^{(i)} \right) x^{(i)}_j \quad \text{for every } j \\ \quad \text{\}} \\ \text{\}}$
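Here’s a minimal sketch of this looped update (when examples are visited one at a time like this, it’s commonly called stochastic gradient descent). The data is made up so that $y = 1 + 2x_1$ exactly, and the bias column of 1s is already included in `x`:

```python
import numpy as np

def looped_gradient_descent(theta, x, y, alpha=0.01, n_loops=100):
    """Update theta one training example at a time, as in the rule above."""
    m = x.shape[0]
    for _ in range(n_loops):
        for i in range(m):
            h_i = (x[i] @ theta).item()   # prediction for example i
            theta = theta - (alpha / m) * (h_i - y[i, 0]) * x[i].reshape(-1, 1)
    return theta

# Made-up data satisfying y = 1 + 2*x_1, bias column included.
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
theta = looped_gradient_descent(np.zeros((2, 1)), x, y, alpha=0.3, n_loops=2000)
print(theta.ravel())   # approaches [1, 2]
```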

## Finding the optimal $\Theta$

This section outlines the math to find the optimal $\Theta$. It relies on basic linear algebra.

To maintain consistency with standard literature, we’ll change the way we’ve defined things just a tiny little bit. Remember that $X$ is the input data. In any dataset, the input data has $m$ rows and $n$ columns; since we added the extra $x_0$ terms, it has $n+1$ columns, so the $X$ matrix has dimensions $(m, n+1)$. Easy, right? That’s the only change in notation we’ll use (so far we used the flipped convention, where training examples were stacked side by side rather than on top of each other, so the dimensions were different). Let $Y$ represent the output variable (which we expect to have $m$ rows, so it has dimensions $(m, 1)$), and let $\Theta$ represent the parameters (with dimensions $(n+1, 1)$; that is, the parameters are stacked vertically). Then, ideally:

$Y_{(m, 1)} = X_{(m, n+1)} \Theta_{(n+1, 1)}$

Since we know the values of $X$ and $Y$ (this is our training data), you’d think the obvious solution would be $\Theta = X^{-1}Y$. However, there is a problem: $X$ is generally not a square matrix (it almost never is), so it is not invertible, and we can’t use this. The fix is easy: we pre-multiply both sides by $X^T$ to get:

$X^T_{(n+1, m)} Y_{(m, 1)} = X^T_{(n+1, m)} X_{(m, n+1)} \Theta_{(n+1, 1)}$

Now $X^T X$ is a square matrix, and as long as the features are linearly independent, it is invertible. It can be shown that the resulting $\Theta$ is exactly the one that minimizes the least-squares cost $J(\theta)$. Thus, the parameters are rather easily found by:

$\boxed{ \Theta = (X^T X)^{-1} X^T Y }$
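In NumPy, this closed-form solution (often called the normal equation) is a one-liner. A sketch on made-up data; note that in practice `np.linalg.solve` (or `np.linalg.lstsq`) is preferred over forming the inverse explicitly:

```python
import numpy as np

# Made-up data generated from y = 1 + 2*x_1, with the bias column included in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([[1.0], [3.0], [5.0]])

# Normal equation: Theta = (X^T X)^{-1} X^T Y, solved without an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta.ravel())   # approximately [1, 2]
```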

With this solution, we conclude our discussion of linear regression. When referring to linear regression, if $n = 1$, that is, there’s only a single feature, it’s called simple linear regression; otherwise, it’s called multiple linear regression. Our general equations with $n$ features cover both scenarios; hence I decided to discuss this general case rather than the two individual ones.