★☆☆ - You should be able to solve these based on Python knowledge plus the lecture contents.
★★☆ - You will need to do extra thinking and some extra reading/searching.
★★★ - The answer is difficult to find by a simple search and requires a considerable amount of extra work on your own (feel free to ignore these exercises if you're short on time).
For the first couple of exercises we will use a subset of the World War II weather dataset. The full dataset originates from [USA Gov Works][usa] and was collected as part of the preparation for bombing raids in Europe.
We will use only the data from weather station $16405$,
and only for the year $1945$,
data which can be found in the file
Since Germany surrendered on the $8th$ of May
only about a quarter of our set was used to coordinate bombers,
but measurements continued to be taken after the surrender.
The weather was measured at distinct intervals;
`doy` means the day of the year, from $1$ up to $365$.
Note however that several measurements were often taken on
the same day.
Note also that there is no measurement for March the $31st$.
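The two notes above (repeated days and a missing day) can be checked with a few lines of pandas. This is only a sketch on a made-up toy frame: the `doy` column name comes from the text, while the `value` column and all the numbers are placeholders, not the station data.

```python
import pandas as pd

# Hypothetical toy frame; only the `doy` column name comes from the text,
# `value` is a placeholder for whatever the station recorded.
toy = pd.DataFrame({
    "doy": [88, 88, 89, 91, 91, 91],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# How many measurements fall on each day of the year.
per_day = toy.groupby("doy").size()

# Days inside the covered range that have no measurement at all.
expected = range(toy["doy"].min(), toy["doy"].max() + 1)
missing = sorted(set(expected) - set(toy["doy"]))

# 1945 is not a leap year, so March 31st is day 31 + 28 + 31 = 90.
march_31 = pd.Timestamp("1945-03-31").dayofyear

print(per_day.to_dict(), missing, march_31)
```

The same two checks should carry over to the real file once it is loaded into a data frame, assuming the day column is indeed called `doy` there.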
In the second part of the exercises we will replicate
our Bayesian Model but will inject stop words into the TF-IDF
Stop words are common words that contribute little
to identifying the topic of a sentence, even though
they contribute a great deal to the actual meaning
of the sentence.
A list of English stop words is in the file
First let's import all the things we will need for the exercises.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score, KFold
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report

station_file = 'fe-station16405-1945.csv'
stopwords_file = 'fe-stopwords-en.txt'
Do not use temperature values to predict other temperature values; that would be cheating. We can argue that temperature varies with the time of the year, and that rain and snow have an influence on temperature as well.
Use cross-validation with a `KFold` of
`n_splits=7` and shuffled data to evaluate the model.
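As a rough illustration of that evaluation setup, the sketch below runs a 7-fold shuffled cross-validation on synthetic data. The feature, the target, and the `random_state` value are assumptions made up for the example; they are not the station data or the intended solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: one feature spread over the days of a year
# and a noisy linear target. None of this is the station data.
rng = np.random.default_rng(0)
X = rng.uniform(1, 365, size=(200, 1))
y = 0.05 * X[:, 0] + rng.normal(scale=1.0, size=200)

# shuffle=True mixes the rows before splitting; fixing random_state
# (an arbitrary value here) makes the splits reproducible.
kfold = KFold(n_splits=7, shuffle=True, random_state=42)

# For regressors the default score is R^2, one value per fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kfold)
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a first impression of how much the score depends on the particular split.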
Remember that a polynomial is just a
Keep the cross validation with
Keep the same splits in the cross validation.
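One way to keep the splits identical between two models is to reuse a single `KFold` object with a fixed `random_state`. The sketch below does this for a polynomial model with and without Ridge regularization; polynomial regression here is an ordinary linear model fitted on expanded features. The data, the degree, and the `alpha` value are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data, used only to have something to fit.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=(150, 1))
y = 1.0 - 2.0 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=150)

# With a fixed random_state the same KFold object yields the same
# folds every time it is used, so both models see identical splits.
kfold = KFold(n_splits=7, shuffle=True, random_state=42)

# PolynomialFeatures expands x into [1, x, x^2, x^3]; the regressor
# that follows is linear in those expanded features.
plain = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
ridge = make_pipeline(PolynomialFeatures(degree=3), Ridge(alpha=1.0))

plain_scores = cross_val_score(plain, x, y, cv=kfold)
ridge_scores = cross_val_score(ridge, x, y, cv=kfold)
print(plain_scores.mean(), ridge_scores.mean())
```

Because the folds match, any difference between the two score arrays comes from the models and not from the way the data was divided.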
When one has little data one needs to split many times in order
for the mean cross-validation score to be meaningful.
Use `n_splits=100` for both runs.
Is Ridge regularization more consistent on more splits?
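A possible way to look at consistency is to compare the spread of the per-fold scores of both models over the same 100 splits. The sketch below uses synthetic data and, as an assumption of this example, mean squared error instead of the default $R^2$, because $R^2$ can be very noisy on folds that hold only a handful of samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data with five features, two of which are irrelevant.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
coef = np.array([1.0, 0.5, 0.0, 0.0, -1.0])
y = X @ coef + rng.normal(scale=0.5, size=300)

# 100 folds over 300 rows leaves 3 test samples per fold.
kfold = KFold(n_splits=100, shuffle=True, random_state=42)

all_scores = {}
for name, model in [("linear", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    # scikit-learn reports the negated MSE so that higher is better.
    all_scores[name] = cross_val_score(
        model, X, y, cv=kfold, scoring="neg_mean_squared_error"
    )

for name, scores in all_scores.items():
    print(name, scores.mean(), scores.std())
```

A smaller standard deviation across folds would suggest a more consistent model, although on data this simple the two are likely to be close.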
The TfidfVectorizer accepts stop words during its construction.
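A minimal sketch of that constructor argument, using a made-up three-sentence corpus and a made-up two-word stop list (neither comes from the exercise files):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus and stop word list.
docs = [
    "the senate passed the bill",
    "the church held a service",
    "a bill on the service",
]
stop_words = ["the", "on"]

# Without stop words every token of two or more characters is kept;
# the default token pattern already ignores one-letter tokens like "a".
plain = TfidfVectorizer().fit(docs)

# With stop_words the listed tokens never enter the vocabulary.
filtered = TfidfVectorizer(stop_words=stop_words).fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
```

The `stop_words` parameter takes a plain Python list of strings, so a list read from a text file can be passed in the same way.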
Do you get a better separation of religion and politics?
Use `classification_report` to check whether precision and recall improve
for these classes.
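To compare individual classes between two runs it can help to get the report as a dictionary instead of a printed table. The sketch below uses made-up labels and predictions; the class names are placeholders and not the actual newsgroup names.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for six documents.
y_true = ["religion", "religion", "politics", "politics", "other", "other"]
y_pred = ["religion", "politics", "politics", "politics", "other", "religion"]

# output_dict=True returns nested dictionaries keyed by class label.
report = classification_report(y_true, y_pred, output_dict=True)

for label in ("religion", "politics"):
    print(label, report[label]["precision"], report[label]["recall"])
```

Building one such dictionary for the model without stop words and one for the model with them would allow a per-class comparison of precision and recall.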