Exercises rating:
★☆☆ - You should be able to based on Python knowledge plus the text.
★★☆ - You will need to do extra thinking and some extra reading/searching.
★★★ - The answer is difficult to find by a simple search, requires you to do a considerable amount of extra work by yourself (feel free to ignore these exercises if you're short on time).
We will use the penguin dataset collected by the Palmer Station in Antarctica. The set is composed of data on $344$ penguins found on three islands from the Palmer Archipelago at the end of the Antarctic peninsula. There are $4$ numeric measures done on each penguin, and its species, its gender and location is given. The dataset has $3$ penguin species and $11$ rows of missing data.
The numerical measures will allow us to build classifiers for the
non numerical columns.
The KNeighborsClassifier
work in a similar fashion to the classifier
we have built ourselves,
use its fit
and predict
methods.
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
numeric = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguins = sns.load_dataset('penguins').dropna()
penguins
Split the dataset into a training set and a testing set and check
the classifier with accuracy_score
against the testing set.
The number of neighbors in sklearn
's KNeighborsClassifier
is specified with the n_neighbors
parameters.
Use n_neighbors=3
.
n_neighbors
in the penguin gender classifier. (★★☆)¶Run the classifier training in a loop and print the accuracy score
for different values of n_neighbors=
.
Values up to n_neighbors=20
should be enough.
n_neighbors
on this classifier. (★★☆)¶
n_neighbors
) for the island from where the penguins comes. (★★☆)¶
n_neighbors
(★★☆)¶Run the previous exercises several times in order to get different splits of training and test sets.
The accuracy may vary dependent on the split but the values for n_neighbors
should be similar
in each run.