Exercises rating:
★☆☆ - You should be able to solve these based on Python knowledge plus lecture contents.
★★☆ - You will need to do extra thinking and some extra reading/searching.
★★★ - The answer is difficult to find by a simple search and requires a considerable amount of extra work on your own (feel free to ignore these exercises if you're short on time).
For the first couple of exercises we will use a subset of the World War II weather dataset. The full dataset originates from the [USA Gov Works][usa], and was collected as part of the preparation for bombing raids in Europe.
We will use only the data from weather station $16405$,
and only for the year $1945$;
this data can be found in the file `fe-station16405-1945.csv`.
Since Germany's surrender took effect on the $8th$ of May,
only about a quarter of our set was used to coordinate bombers,
but measurements continued after the surrender
declaration nevertheless.
The weather was measured at distinct intervals;
the column `doy` holds the day of the year, from $1$ up to $365$.
Note however that several measurements were often taken on
the same day.
Note also that there is no measurement for March the $31st$.
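As a quick sanity check, the day-of-year number of a calendar date can be computed with the standard library. The sketch below uses a small made-up list of `doy` values (an assumption, not the real station data) to show one way repeated and missing days could be spotted.

```python
from collections import Counter
from datetime import date

# 1945 is not a leap year, so March 31st is day 31 + 28 + 31 = 90.
march_31 = date(1945, 3, 31).timetuple().tm_yday

# Made-up doy values standing in for the real column:
# day 89 appears twice and day 90 is absent.
doy = [88, 89, 89, 91, 92]

counts = Counter(doy)
repeated_days = sorted(day for day, n in counts.items() if n > 1)
missing_days = sorted(set(range(min(doy), max(doy) + 1)) - set(doy))

print(march_31, repeated_days, missing_days)
```

With the real file, the same checks would run on the `doy` column of the loaded data instead of the hand-written list.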
In the second part of the exercises we will replicate
our Bayesian Model but will inject stop words into the TF-IDF
preprocessing.
Stop words are common words that add little value
when identifying the topic of a sentence, despite the
fact that these words add great value to the actual meaning
of the sentence.
A list of English stop words is in the file `fe-stopwords-en.txt`.
First let's import all the things we will need for the exercises.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report
station_file = 'fe-station16405-1945.csv'
stopwords_file = 'fe-stopwords-en.txt'
Do not use temperature values to predict other temperature values; that would be cheating. We can argue that temperature varies with the time of the year, and that rain and snow have an influence on temperature as well.
Use cross-validation with a `KFold` of `n_splits=7`
and shuffled data to evaluate the model.
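The evaluation described above can be sketched as follows. This is a minimal illustration on synthetic data (the features, coefficients and the `random_state` value are assumptions chosen for the example, not part of the exercise data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the station data: 70 samples, 2 features.
rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(70, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=70)

# shuffle=True mixes the rows before splitting; random_state makes the
# shuffle reproducible (42 is an arbitrary choice).
kf = KFold(n_splits=7, shuffle=True, random_state=42)

# For a regressor the default score is R^2, one value per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a rough idea of how much the score depends on the particular split.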
Remember that a polynomial model is just a `PolynomialFeatures` step placed before the linear model.
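To make the hint concrete, here is a small sketch of a polynomial regression built from `PolynomialFeatures` and `LinearRegression`. The data is a noise-free synthetic quadratic and `degree=2` is an illustrative choice, both assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data, not the weather data.
x = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 2.0 * x.ravel() ** 2 - x.ravel() + 0.5

# The pipeline first expands x into [1, x, x^2], then fits a linear model
# on the expanded columns.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

# Looking at the expansion on its own: one input column becomes three.
expanded = PolynomialFeatures(degree=2).fit_transform(x)
print(expanded.shape)  # (50, 3)
print(poly_model.score(x, y))
```

Because the pipeline behaves like any other estimator, it can be passed directly to `cross_val_score`.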
Keep the cross-validation with `n_splits=7`.
Keep the same splits in the cross validation.
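One way to keep the splits identical between runs is to reuse a single `KFold` object with a fixed integer `random_state`. The sketch below checks this on synthetic data and then scores two models on the same folds. The data, the seed and `alpha=1.0` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, not the weather data.
rng = np.random.RandomState(0)
X = rng.normal(size=(70, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=70)

kf = KFold(n_splits=7, shuffle=True, random_state=0)

# With an integer random_state, split() yields the same folds on every call.
folds_a = [test.tolist() for _, test in kf.split(X)]
folds_b = [test.tolist() for _, test in kf.split(X)]
same_folds = folds_a == folds_b

# Both models are therefore evaluated on exactly the same test folds.
plain_scores = cross_val_score(LinearRegression(), X, y, cv=kf)
ridge_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=kf)

print(same_folds, plain_scores.mean(), ridge_scores.mean())
```

Scoring on identical folds means any difference between the two score arrays comes from the models rather than from the split.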
When one has little data, one needs to split many times in order
for the mean cross-validation score to be meaningful.
Use `n_splits=100` for both runs.
Is Ridge regularization more consistent with more splits?
The TfidfVectorizer accepts stop words during its construction.
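As a rough illustration of the `stop_words` argument, the sketch below compares the vocabulary learned with and without a stop word list. The two sentences and the three-word list are made up for the example. In the exercise the list would come from `stopwords_file`, whose exact layout is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny hand-written stop word list (an assumption for this example).
stop_words = ['the', 'is', 'of']

docs = [
    'the church is a place of worship',
    'the senate is a house of congress',
]

plain = TfidfVectorizer().fit(docs)
filtered = TfidfVectorizer(stop_words=stop_words).fit(docs)

plain_vocab = set(plain.vocabulary_)
filtered_vocab = set(filtered.vocabulary_)

# Words that were removed by the stop word list. Note that the default
# token pattern already ignores one-letter tokens such as 'a'.
removed = sorted(plain_vocab - filtered_vocab)
print(removed)
```

The remaining vocabulary contains only the words that distinguish the two sentences, which is the effect the stop word list is meant to have on the classifier's input.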
Do you get a better separation of religion and politics?
Use a `classification_report`
to check if precision and recall improve
for these classes.
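For reference, a minimal sketch of reading per-class precision and recall from `classification_report`. The labels and predictions are made up and the class names `religion` and `politics` are simplified placeholders, not the actual newsgroup names.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions for two placeholder classes.
y_true = ['religion', 'religion', 'politics', 'politics', 'politics', 'religion']
y_pred = ['religion', 'politics', 'politics', 'politics', 'religion', 'religion']

# Human-readable table with precision, recall and F1 per class.
print(classification_report(y_true, y_pred))

# output_dict=True returns the same numbers as a nested dictionary,
# which makes it easy to compare two runs programmatically.
report = classification_report(y_true, y_pred, output_dict=True)
religion_recall = report['religion']['recall']
politics_precision = report['politics']['precision']
```

Producing one report with stop words and one without, then comparing the rows for the classes of interest, shows directly whether the two metrics moved.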