06.05 SciKit Learn Exercises

Exercise ratings:

★☆☆ - You should be able to solve it based on Python knowledge plus the text.

★★☆ - You will need to do extra thinking and some extra reading/searching.

★★★ - The answer is difficult to find by a simple search and requires a considerable amount of extra work on your own (feel free to skip these exercises if you're short on time).

We will use the penguin dataset collected by the Palmer Station in Antarctica. The dataset contains data on $344$ penguins found on three islands of the Palmer Archipelago, at the end of the Antarctic Peninsula. There are $4$ numeric measurements for each penguin, and its species, its gender (the "sex" column) and its location (the "island" column) are given. The dataset has $3$ penguin species and $11$ rows with missing data.

The numerical measurements will allow us to build classifiers for the non-numerical columns. The KNeighborsClassifier works in a similar fashion to the classifier we have built ourselves: train it with its fit method and classify new samples with its predict method.
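As a reminder of the fit/predict pattern, here is a minimal sketch on a handful of made-up 2-D points (the toy arrays are an assumption for illustration, not penguin data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)  # stores the training points together with their labels

# Each new point gets the majority label of its 3 nearest training points.
predicted = knn.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(predicted)  # [0 1]
```

The first query point lies inside the first group and the second inside the other group, so the two predictions differ.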

In [ ]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
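# The four numeric measurement columns used as features
numeric = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
# Load the dataset and drop the rows with missing values
penguins = sns.load_dataset('penguins').dropna()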
penguins

1. Generate numeric labels for the "species", "island" and "sex" columns. (★☆☆)
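Hint: the LabelEncoder imported above turns text labels into integers. A minimal sketch on a made-up list of colours (the list is an assumption for illustration; a DataFrame column can be passed to fit_transform in the same way):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up text labels, standing in for a non-numerical column.
colours = ['red', 'green', 'blue', 'green']

encoder = LabelEncoder()
codes = encoder.fit_transform(colours)  # one integer per label

print(list(encoder.classes_))  # ['blue', 'green', 'red'] (sorted alphabetically)
print(codes.tolist())          # [2, 1, 0, 1]

# inverse_transform maps the integers back to the original text labels.
print(list(encoder.inverse_transform(codes)))
```

Using one encoder per column keeps the mapping back to the original names available through classes_.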

In [ ]:
 

2. Build a (kNN) classifier for the gender of a penguin. (★☆☆)

Split the dataset into a training set and a testing set, and check the classifier with accuracy_score against the testing set. The number of neighbors in sklearn's KNeighborsClassifier is specified with the n_neighbors parameter. Use n_neighbors=3.
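Hint: the split-train-score workflow could look roughly like the sketch below. It uses synthetic data from make_blobs as a stand-in (an assumption for illustration); the 25% test size and the fixed random_state are arbitrary choices.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not the penguin measurements).
X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Hold out 25% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Compare the predictions on unseen samples with their true labels.
accuracy = accuracy_score(y_test, knn.predict(X_test))
print(f"accuracy: {accuracy:.3f}")
```

Leaving out random_state gives a different split on every run, which is what the last exercise relies on.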

In [ ]:
 

3. Find a good value for n_neighbors in the penguin gender classifier. (★★☆)

Run the classifier training in a loop and print the accuracy score for different values of n_neighbors. Values up to n_neighbors=20 should be enough.
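Hint: one possible shape for such a loop, sketched on synthetic data from make_classification (the data, the 25% test size and the names scores and best_k are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with 4 numeric features (an assumption).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores[k] = accuracy_score(y_test, knn.predict(X_test))
    print(f"n_neighbors={k:2d}  accuracy={scores[k]:.3f}")

# The smallest k among those with the highest accuracy on this split.
best_k = max(scores, key=scores.get)
print("best n_neighbors on this split:", best_k)
```

The same train/test split is reused for every value of k, so the scores differ only because of n_neighbors.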

In [ ]:
 

4. Build a classifier for penguin species and find a good value for n_neighbors on this classifier. (★★☆)

In [ ]:
 

5. Build a classifier (and find a good value for n_neighbors) for the island the penguins come from. (★★☆)

In [ ]:
 

6. Describe which classifiers do better with small values and which with big values of n_neighbors. (★★☆)

Run the previous exercises several times in order to get different splits into training and test sets. The accuracy may vary depending on the split, but the good values for n_neighbors should be similar in each run.
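Instead of re-running the cells by hand, the repetition can also be scripted. The sketch below is one possible approach, not the only one: best_n_neighbors is a hypothetical helper, the data is synthetic, and a different random_state per run stands in for a fresh random split.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def best_n_neighbors(X, y, seed, max_k=20):
    """Hypothetical helper: best n_neighbors for the split given by seed."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    scores = {}
    for k in range(1, max_k + 1):
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = knn.score(X_test, y_test)  # mean accuracy on the test set
    return max(scores, key=scores.get)


# Synthetic stand-in data with 4 numeric features (an assumption).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# One "best" value per split; similar values across splits suggest a stable choice.
best_per_split = [best_n_neighbors(X, y, seed) for seed in range(5)]
print(Counter(best_per_split).most_common())
```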