07.06 Model Evaluation Exercises

Exercise ratings:

★☆☆ - You should be able to solve it based on Python knowledge plus lecture contents.

★★☆ - You will need to do extra thinking and some extra reading/searching.

★★★ - The answer is difficult to find with a simple search and requires a considerable amount of extra work on your own (feel free to skip these exercises if you're short on time).

For the first couple of exercises we will use a subset of the World War II weather dataset. The full dataset originates from [USA Gov Works][usa], and was collected as part of the preparation for bombing raids in Europe.

We will use only the data from weather station $16405$, and only for the year $1945$; this data can be found in the file fe-station16405-1945.csv. Since Germany's surrender took effect on the $8th$ of May, only a quarter of our set was used to coordinate bombers, but the measurements continued after the surrender nevertheless. The weather was measured at discrete intervals; the column doy is the day of the year, from $1$ up to $365$. Note however that several measurements were often taken on the same day. Note also that there is no measurement for March the $31st$.

In the second part of the exercises we will replicate our Naive Bayes model but will inject stop words into the TF-IDF preprocessing. Stop words are common words that contribute little to identifying the topic of a sentence, even though they matter a great deal for its actual meaning. A list of English stop words is in the file fe-stopwords-en.txt.

First let's import all the things we will need for the exercises.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-v0_8-talk')  # this style was called 'seaborn-talk' before matplotlib 3.6
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report
station_file = 'fe-station16405-1945.csv'
stopwords_file = 'fe-stopwords-en.txt'

1. Get the weather data and plot its features against the day of the year (★☆☆)
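Before plotting, it can help to confirm the two quirks mentioned in the introduction: repeated measurements on one day and days without any measurement. The following is a minimal sketch on made-up rows, not on the real file. Only the column name doy comes from the text; the other column names and all the values are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up rows standing in for the CSV file; only the "doy" column name
# is taken from the exercise text, "rain" and "temp" are hypothetical.
csv_text = """doy,rain,temp
1,0.0,-2.0
1,0.5,-1.0
2,0.0,0.5
4,1.0,1.5
"""
frame = pd.read_csv(io.StringIO(csv_text))

# Count the rows per day to find days with several measurements
per_day = frame.groupby("doy").size()
repeated_days = per_day[per_day > 1].index.tolist()

# Compare against the full range of days to find days with no measurement
missing_days = sorted(set(range(1, 5)) - set(frame["doy"]))

print(repeated_days, missing_days)  # [1] [3]
```

With the real file one would pass station_file to pd.read_csv and use range(1, 366) for the full year. Plotting then amounts to one scatter plot per feature with doy on the horizontal axis.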

2. Use a plain linear regression to predict the temperature based on the other features (★★☆)

Do not use temperature values to predict other temperature values, that would be cheating. We can argue that temperature varies with the time of the year, and that rain and snow have an influence on temperature as well.
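One way to keep temperature values out of the inputs is to drop every temperature-like column, not only the one used as the target. This is a sketch under assumptions: the column names below (rain, snow, temp_min, temp_max, temp_mean) are hypothetical and the real file may name its columns differently.

```python
import pandas as pd

# Hypothetical columns and values; the real station file may differ
frame = pd.DataFrame({
    "doy": [1, 1, 2, 3],
    "rain": [0.0, 1.2, 0.0, 0.4],
    "snow": [0.0, 0.0, 2.0, 0.0],
    "temp_min": [-3.0, -2.5, -6.0, -1.0],
    "temp_max": [2.0, 3.0, -1.0, 4.0],
    "temp_mean": [-0.5, 0.2, -3.5, 1.5],
})

target = "temp_mean"
# All temperature columns are treated as leaky, including the target itself
leaky = [column for column in frame.columns if column.startswith("temp")]

X = frame.drop(columns=leaky)  # inputs: day of year, rain, snow
y = frame[target]              # the value to predict

print(list(X.columns))  # ['doy', 'rain', 'snow']
```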

Use cross validation with a KFold of n_splits=7 and shuffled data to evaluate the model.
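The mechanics of a shuffled 7-fold evaluation can be sketched as follows. The data here is synthetic (a seasonal curve plus noise), generated only so the example runs on its own; it is not the station data, and the random_state values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the station data (NOT the real dataset)
rng = np.random.default_rng(0)
doy = rng.integers(1, 366, size=210)
other = rng.normal(size=210)
X = np.column_stack([doy, other])
y = (10 - 8 * np.cos(2 * np.pi * doy / 365)
     + 0.5 * other + rng.normal(scale=1.0, size=210))

# shuffle=True mixes the rows before splitting them into folds;
# random_state makes the shuffling repeatable
kfold = KFold(n_splits=7, shuffle=True, random_state=0)

# For a regressor the default score is R^2, one value per fold
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)
print(scores)
print(scores.mean())
```

Shuffling matters when the rows are ordered by day: without it each fold would hold out one contiguous stretch of the year.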

3. Based on the graphs of the features, use a polynomial to predict the temperature (★★☆)

Remember that a polynomial model is just PolynomialFeatures placed in front of a linear regression in a pipeline. Keep the cross validation with n_splits=7.
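A polynomial regression built this way might look like the sketch below. The data is again synthetic, and degree=3 is an arbitrary choice for illustration; the exercise expects you to pick the degree from your own plots.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic seasonal curve (NOT the real dataset)
rng = np.random.default_rng(1)
x = rng.uniform(1, 365, size=210)
y = 10 - 8 * np.cos(2 * np.pi * x / 365) + rng.normal(scale=1.0, size=210)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix

kfold = KFold(n_splits=7, shuffle=True, random_state=1)

# A straight line versus a polynomial expansion fed into the same regression
linear = LinearRegression()
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())

linear_score = cross_val_score(linear, X, y, cv=kfold).mean()
poly_score = cross_val_score(poly, X, y, cv=kfold).mean()
print(linear_score, poly_score)
```

On this toy curve the straight line cannot follow the seasonal hump, so the polynomial pipeline scores clearly higher.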

4. Instead of a linear regression, use a Ridge regression for the polynomial (★★☆)

Keep the same splits in the cross validation.
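Keeping the same splits is easiest with a KFold that has a fixed integer random_state, since it then produces identical folds on every call. The sketch below checks this and swaps Ridge into the polynomial pipeline. The data is synthetic, alpha=1.0 is simply the default, and the StandardScaler step is an addition of this sketch (the Ridge penalty is sensitive to feature scale), not something the exercise asks for.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, KFold

# Synthetic seasonal curve (NOT the real dataset)
rng = np.random.default_rng(2)
x = rng.uniform(1, 365, size=210)
y = 10 - 8 * np.cos(2 * np.pi * x / 365) + rng.normal(scale=1.0, size=210)
X = x.reshape(-1, 1)

kfold = KFold(n_splits=7, shuffle=True, random_state=2)

# With an integer random_state, two calls to split() yield the same folds
first = [test for _, test in kfold.split(X)]
second = [test for _, test in kfold.split(X)]
same_splits = all(np.array_equal(a, b) for a, b in zip(first, second))

# Polynomial expansion, scaling (assumption of this sketch), then Ridge
ridge = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
scores = cross_val_score(ridge, X, y, cv=kfold)
print(same_splits, scores.mean())
```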

5. Attempt both Linear and Ridge regression but on many splits (★★☆)

When one has little data one needs to split many times in order for the mean cross validation score to be meaningful. Use n_splits=100 for both runs. Is Ridge regularization more consistent on more splits?
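One way to compare consistency is to look at the spread of the per-fold scores next to their mean. The sketch below does this on synthetic data. It uses scoring="neg_mean_squared_error" rather than the default R^2, which is a choice of this sketch: with 100 folds each test fold holds only a few rows, and R^2 computed on so few rows is very unstable. The StandardScaler step and the build helper are likewise assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, KFold

# Synthetic seasonal curve (NOT the real dataset)
rng = np.random.default_rng(3)
x = rng.uniform(1, 365, size=300)
y = 10 - 8 * np.cos(2 * np.pi * x / 365) + rng.normal(scale=1.0, size=300)
X = x.reshape(-1, 1)

# 100 folds over 300 rows leaves 3 rows in each test fold
kfold = KFold(n_splits=100, shuffle=True, random_state=3)


def build(regressor):
    """Hypothetical helper: the same preprocessing in front of any regressor."""
    return make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),
        StandardScaler(),
        regressor,
    )


results = {}
for name, regressor in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    scores = cross_val_score(
        build(regressor), X, y, cv=kfold, scoring="neg_mean_squared_error"
    )
    # Mean tells how good the model is, std tells how much it varies per fold
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

A smaller standard deviation across folds is one reasonable reading of "more consistent"; whether Ridge achieves it on the station data is for the exercise to find out.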

6. Load the stop words file into a list in Python (★☆☆)
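Reading a word list into Python can be sketched as below. The layout of fe-stopwords-en.txt is not described here, so this example assumes one word per line and demonstrates on a small temporary file with made-up contents. The words are lowercased because TfidfVectorizer lowercases documents by default.

```python
import os
import tempfile

# Made-up file contents; assumes one stop word per line
sample = "the\nand\n\nOf \nis\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords-sample.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(sample)

    # Strip whitespace, lowercase, and skip empty lines
    with open(path, encoding="utf-8") as handle:
        stop_words = [line.strip().lower() for line in handle if line.strip()]

print(stop_words)  # ['the', 'and', 'of', 'is']
```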

7. Repeat the Naive Bayes example using these stop words injected into the TF-IDF preprocessing (★★☆)

The TfidfVectorizer accepts stop words during its construction. Do you get a better separation of religion and politics? Use a classification_report to check if precision and recall improve for these classes.
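The way stop words enter the vectorizer can be sketched on a toy corpus. The six sentences, their two labels, and the short stop word list below are all made up for illustration; they stand in for the newsgroups data and the stop words file, and say nothing about the results on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Toy corpus (NOT the newsgroups data)
train_docs = [
    "the church and the faith of the people",
    "prayer in the church strengthens faith",
    "the senate and the vote on the law",
    "the vote in the senate changes the law",
]
train_labels = ["religion", "religion", "politics", "politics"]
test_docs = ["the faith of the church", "the law of the senate"]
test_labels = ["religion", "politics"]

# Hand-written stand-in for the list loaded from the stop words file
stop_words = ["the", "and", "of", "in", "on"]

# stop_words accepts a list; those tokens are dropped before TF-IDF weighting
model = make_pipeline(TfidfVectorizer(stop_words=stop_words), ComplementNB())
model.fit(train_docs, train_labels)
predicted = model.predict(test_docs)

# The fitted vocabulary no longer contains the stop words
vocabulary = set(model.named_steps["tfidfvectorizer"].vocabulary_)
print(sorted(vocabulary))
print(classification_report(test_labels, predicted))
```

To judge the effect on the real data, fit one pipeline with and one without stop_words and compare the per-class rows of the two reports.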