09.08 Supervised Learning Exercises

Exercise ratings:

★☆☆ - You should be able to solve it based on Python knowledge plus the text.

★★☆ - You will need to do extra thinking and some extra reading/searching.

★★★ - The answer is difficult to find by a simple search and requires a considerable amount of extra work on your own (feel free to ignore these exercises if you're short on time).

The Boston dataset contains housing price statistics from the Boston, Massachusetts area in the $1970$s. The dataset is non-linear, since its features capture similarities within differently formed areas of the city. To make things harder, the dataset is full of outliers: data produced by measurement error, or missing measurements filled in with averages.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-talk'
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an earlier version
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)

1. Build a Linear Regression for the entire dataset (★★☆)

Make a pipeline with polynomial features and try polynomial degrees $2$, $3$ and $4$. The cross-validation score from a linear model should not be good.
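As a starting hint, here is a minimal sketch of how such a pipeline can be searched over polynomial degrees. It runs on synthetic stand-in data (the names `X_demo`, `y_demo`, `poly_model` and `poly_search` are assumptions for illustration, not part of the exercise), so the scores it prints say nothing about the Boston dataset.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption, NOT the Boston dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = X_demo[:, 0] ** 2 - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# make_pipeline names each step after its lowercased class name,
# so the degree parameter is addressed as 'polynomialfeatures__degree'
poly_model = make_pipeline(PolynomialFeatures(), LinearRegression())
param_grid = {'polynomialfeatures__degree': [2, 3, 4]}

# 5-fold cross-validation; the default score for regressors is R^2
poly_search = GridSearchCV(poly_model, param_grid, cv=5)
poly_search.fit(X_demo, y_demo)

print(poly_search.best_params_)
print(poly_search.best_score_)
```

The same pattern should carry over to the exercise by swapping the synthetic arrays for `boston.data` and `boston.target`.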

In [ ]:
 

2. Attempt a non-linear model: a Random Forest (★★☆)

Try with $50$, $100$ and $300$ trees.
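One possible way to compare forest sizes is a grid search over `n_estimators`. The sketch below again uses synthetic stand-in data; `X_demo`, `y_demo` and `forest_search` are hypothetical names, and the fixed `random_state` is only there to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, NOT the Boston dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = X_demo[:, 0] ** 2 - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Search over the number of trees with 5-fold cross-validation
forest = RandomForestRegressor(random_state=0)
param_grid = {'n_estimators': [50, 100, 300]}
forest_search = GridSearchCV(forest, param_grid, cv=5)
forest_search.fit(X_demo, y_demo)

# Mean cross-validated R^2 for each forest size
for n_trees, score in zip(forest_search.cv_results_['param_n_estimators'],
                          forest_search.cv_results_['mean_test_score']):
    print(n_trees, score)
```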

In [ ]:
 

3. Show the differences between the linear and non-linear model (★★★)

We now have two cross-validated models. Each of them estimates the house prices in a different way and produces different values. But how different are these values? Build a histogram of the differences between the two sets of predictions, and between each set of predictions and the targets.
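A rough sketch of the three histograms, assuming out-of-fold predictions obtained with `cross_val_predict` (one option among several; the exercise does not prescribe it). The data is synthetic and all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption, NOT the Boston dataset)
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = X_demo[:, 0] ** 2 - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

linear_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
forest_model = RandomForestRegressor(n_estimators=50, random_state=0)

# Out-of-fold predictions: each sample is predicted by a model
# that did not see it during training
linear_pred = cross_val_predict(linear_model, X_demo, y_demo, cv=5)
forest_pred = cross_val_predict(forest_model, X_demo, y_demo, cv=5)

diff_models = linear_pred - forest_pred   # model against model
diff_linear = linear_pred - y_demo        # linear model against targets
diff_forest = forest_pred - y_demo        # forest against targets

fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
titles = ['linear - forest', 'linear - target', 'forest - target']
for ax, diff, title in zip(axes, [diff_models, diff_linear, diff_forest], titles):
    ax.hist(diff, bins=30)
    ax.set_title(title)
    ax.set_xlabel('difference')
```

Sharing the vertical axis makes the spread of the three distributions directly comparable.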

In [ ]:
 

As an extra, build graphs showing the prediction differences for each feature. One way to do it is to plot the house price (target or prediction) on the horizontal axis and the feature on the vertical axis.

The feature "CHAS" is a binary feature, hence it cannot be compared in such a way. Feel free to ignore this feature. As an extra on top of the extra, find a way to compare the prediction differences on that feature.
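For a binary feature, one possible approach (an assumption, not the only answer) is to split the prediction differences into the two groups and compare summary statistics per group. The sketch below uses made-up arrays: `binary_feature` and `pred_diff` are hypothetical stand-ins for the "CHAS" column and for the differences computed in the exercise.

```python
import numpy as np

# Hypothetical stand-ins (NOT real data): a 0/1 column and prediction differences
rng = np.random.RandomState(0)
binary_feature = rng.randint(0, 2, size=200)
pred_diff = rng.normal(loc=binary_feature * 1.0, scale=0.5)

# Summarise the differences separately for each value of the binary feature
group_stats = {}
for value in (0, 1):
    group = pred_diff[binary_feature == value]
    group_stats[value] = {
        'count': group.size,
        'mean': group.mean(),
        'std': group.std(),
    }

for value, stats in group_stats.items():
    print(value, stats)
```

The same two groups could also be drawn as side-by-side box plots or overlaid histograms.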

In [ ]: