# 2.1 – Feature Scaling

I briefly mentioned feature scaling in the previous post on linear regression, and even included it in the code sample.

Feature scaling is basically just “scaling” your features. Obvious, right? What you want, ideally, is for all your features to be in a similar range. Say you have a feature $x_1$, the size of a house in square feet, which ranges from 0 to 2000, and a feature $x_2$, the number of bedrooms, which only goes up to about 5 (this example is from Andrew Ng’s Coursera course). You can easily see there’s a big difference in scale. This is a problem because gradient descent will now take longer to converge to the global minimum (remember that we use gradient descent to find the values of each $\theta_j$ that minimize the cost function $J(\Theta)$).

The technical reason gradient descent takes longer is that when the features are on very different scales, the contours of the cost function $J(\Theta)$ become long, skinny ellipses. Gradient descent then tends to zigzag across the narrow direction instead of heading straight for the minimum, so it needs more steps to get there.
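One rough way to see the "skinny contours" numerically is the condition number of $X^T X$: for linear regression with a squared-error cost, that matrix (up to a constant factor) determines the shape of the contours of $J(\Theta)$, and a larger condition number means more stretched ellipses. The sketch below uses made-up house data (the numbers and the helper `condition_number` are illustrative assumptions, not from the course) and compares the raw features with standardized ones.

```python
import numpy as np

# Hypothetical training set: house size (sq ft) and number of bedrooms.
size = np.array([850.0, 1200.0, 1500.0, 1800.0, 2000.0])
bedrooms = np.array([2.0, 3.0, 3.0, 4.0, 5.0])

def condition_number(*features):
    """Condition number of X^T X for a design matrix with an x_0 = 1 column."""
    X = np.column_stack([np.ones(len(features[0])), *features])
    return np.linalg.cond(X.T @ X)

# Raw features: very different scales, so very elongated contours.
raw = condition_number(size, bedrooms)

# Standardized features: subtract the mean, divide by the standard deviation.
scaled = condition_number(
    (size - size.mean()) / size.std(),
    (bedrooms - bedrooms.mean()) / bedrooms.std(),
)

print(f"raw: {raw:.3g}, scaled: {scaled:.3g}")
```

On this toy data the scaled condition number comes out several orders of magnitude smaller than the raw one, which is the numerical counterpart of the contours becoming rounder.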

How do we fix this problem of scale? There are two easy ways:

• You could simply divide each feature by its maximum. In this example, you’d divide the number of bedrooms in each training example by 5, and the area in each training example by 2000. This forces these two features to be between 0 and 1 (as long as the values are non-negative).
• You could subtract the mean of each feature from its values before doing the above division. This centers each feature on 0 and keeps it between -1 and 1.
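Both options are one-liners in NumPy. The following is a minimal sketch on a made-up feature matrix (the data values and variable names are assumptions for illustration); each column is one feature and each row is one training example.

```python
import numpy as np

# Hypothetical data: columns are house size (sq ft) and number of bedrooms.
X = np.array([
    [850.0, 2.0],
    [1200.0, 3.0],
    [1500.0, 3.0],
    [2000.0, 5.0],
])

col_max = X.max(axis=0)    # per-feature maximum
col_mean = X.mean(axis=0)  # per-feature mean

# Method 1: divide each feature by its maximum.
# For non-negative features the result lies in [0, 1].
X_max_scaled = X / col_max

# Method 2: subtract the mean first, then divide by the maximum.
# The result is centered on 0 and lies within [-1, 1].
X_mean_scaled = (X - col_mean) / col_max

print(X_max_scaled)
print(X_mean_scaled)
```

Broadcasting applies the per-column maximum and mean to every row, so no explicit loop over training examples is needed.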

Remember: do not perform feature scaling for $x_0 = 1$. It’s the constant feature that multiplies the intercept $\theta_0$ in the hypothesis function, so leave it alone. Dividing by the maximum wouldn’t change it anyway, but subtracting the mean would turn it into 0 and remove the intercept from the model.
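A simple way to respect this in code is to scale the real features first and only then add the $x_0$ column. This is a small sketch with made-up data (the array values and names such as `X_design` are assumptions), which also shows what happens to a column of ones if you do mean-subtract it.

```python
import numpy as np

# Hypothetical features: house size (sq ft) and number of bedrooms.
X = np.array([[850.0, 2.0], [1200.0, 3.0], [2000.0, 5.0]])

# Scale only the real features (mean subtraction, then division by the max).
X_scaled = (X - X.mean(axis=0)) / X.max(axis=0)

# Add x_0 = 1 afterwards, so the scaling never touches it.
X_design = np.column_stack([np.ones(len(X_scaled)), X_scaled])

# What would go wrong otherwise: mean-subtracting a column of ones gives zeros.
ones = np.ones(3)
ones_scaled = (ones - ones.mean()) / ones.max()

print(X_design)
print(ones_scaled)
```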

Feature scaling isn’t always required. For example, if you’re scaling to [0, 1], and some feature already lies in [2, 5], or even [1, 9], that’s probably okay because it’s not too far from [0, 1]. But if you have something in, say, [25, 100] or [0.01, 0.1], you might want to perform feature scaling.

On to the code. How do we perform feature scaling? The scikit-learn package provides a good way to do this. In a real environment, the real “test” data is the live data you get when your system is running in production, so you don’t know its maximum (or its mean) in advance. So how do you scale? You compute the scaling parameters from the training data only, and reuse them on everything else.

Say your training data is in the variable x_train, and the test data is in the variable x_test. You fit a StandardScaler object on the training data. Note that this is a slightly different scaling from the two above: it subtracts each feature’s mean and divides by its standard deviation, so the training data ends up with a mean of 0 and a variance of 1. Then you use this same object to scale your test data.


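```python
from sklearn.preprocessing import StandardScaler

# x_train and x_test are assumed to be 2-D arrays (samples x features)
scaler = StandardScaler()
scaler.fit(x_train)  # learn each feature's mean and standard deviation

# transform returns a new array, so keep the result
x_train_scaled = scaler.transform(x_train)  # scale the training data
x_test_scaled = scaler.transform(x_test)  # scale the test data with the same parameters
```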
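To make the fit-then-transform idea concrete, here is a rough NumPy sketch of the same computation done by hand. The data values are made up, and this is only meant to illustrate the idea, not to mirror scikit-learn’s actual implementation. The key point is that the mean and standard deviation come from the training data alone and are then applied unchanged to the test data.

```python
import numpy as np

# Hypothetical data: columns are house size (sq ft) and number of bedrooms.
x_train = np.array([[850.0, 2.0], [1200.0, 3.0], [1500.0, 3.0], [2000.0, 5.0]])
x_test = np.array([[1000.0, 2.0], [2400.0, 6.0]])

# "fit": learn per-feature statistics from the training data only.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print(x_train_scaled.mean(axis=0), x_train_scaled.std(axis=0))
print(x_test_scaled)
```

The scaled training data has mean 0 and variance 1 by construction. The scaled test data generally does not, and that’s fine: it was scaled with the training parameters, which is exactly what will happen with production data.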



The documentation for sklearn outlines this in greater detail.