08.08 Unsupervised Learning Exercises

Exercise ratings:

★☆☆ - You should be able to solve it based on Python knowledge plus the text.

★★☆ - You will need to do extra thinking and some extra reading/searching.

★★★ - The answer is difficult to find by a simple search and requires a considerable amount of extra work on your own (feel free to ignore these exercises if you're short on time).

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# On matplotlib >= 3.6 this style is named 'seaborn-v0_8-talk'.
plt.style.use('seaborn-talk')
from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.datasets import load_digits  # loader for the digits dataset used below
from sklearn.metrics import v_measure_score
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

1. Attempt k-means on the digits dataset after PCA (★★☆)

Make a pipeline that joins PCA and k-means into a single model. Does the v-measure improve after linear preprocessing?
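As one possible starting point (a sketch, not the reference solution), the pattern could look like the following. It assumes the digits come from `sklearn.datasets.load_digits`, and the values `n_components=20`, `n_init=3` and `random_state=0` are arbitrary illustrative choices rather than recommended settings.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import v_measure_score
from sklearn.pipeline import make_pipeline

# Assumption: the digits dataset is the one shipped with sklearn.
X, y = load_digits(return_X_y=True)  # 1797 images, 64 pixel features each

# Baseline: k-means directly on the raw pixels.
baseline = MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0)
v_raw = v_measure_score(y, baseline.fit_predict(X))

# PCA followed by k-means, joined into a single estimator.
# n_components=20 is an arbitrary choice made for illustration.
model = make_pipeline(
    PCA(n_components=20, random_state=0),
    MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0),
)
v_pca = v_measure_score(y, model.fit_predict(X))

print(f"v-measure, raw pixels:    {v_raw:.3f}")
print(f"v-measure, PCA + k-means: {v_pca:.3f}")
```

Varying `n_components` and comparing the two printed scores is one way to answer the question.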

In [ ]:

2. Attempt k-means on the digits dataset after t-SNE (★★☆)

Now use t-SNE as the preprocessing step. Does the v-measure improve after non-linear preprocessing?

Note that sklearn's t-SNE implementation is incomplete: it has no plain transform method, so it cannot be applied beyond the data on which it was fit. This is not a problem for us, since we are only exploring the non-linearity of the digits dataset. Instead of using plain TSNE in your pipeline, use the class defined below (remember to execute this cell).

In [ ]:
class PipeTSNE(TSNE):
    def transform(self, X):
        # Not a true out-of-sample transform: the embedding is refitted on X.
        return self.fit_transform(X)
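A minimal sketch of how such a class might be dropped into a pipeline is shown here. It redefines `PipeTSNE` so that it runs on its own, assumes the digits come from `sklearn.datasets.load_digits`, and keeps only the first 500 samples purely to shorten the run; the subsample size and `random_state=0` are illustrative assumptions.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import v_measure_score
from sklearn.pipeline import make_pipeline


class PipeTSNE(TSNE):
    """TSNE with a transform method so that it can sit inside a pipeline."""

    def transform(self, X):
        # Not a true out-of-sample transform: the embedding is refitted on X.
        return self.fit_transform(X)


X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # assumption: subsample only to keep the run short

model = make_pipeline(
    PipeTSNE(n_components=2, random_state=0),
    MiniBatchKMeans(n_clusters=10, n_init=3, random_state=0),
)
labels = model.fit_predict(X)
print(f"v-measure, t-SNE + k-means: {v_measure_score(y, labels):.3f}")
```

A pipeline requires every intermediate step to expose `transform`, which is why plain `TSNE` is rejected while the subclass is accepted.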
In [ ]:

3. Attempt agglomerative clustering on the digits dataset after PCA (★★☆)

Use linkage='ward' for the time being.
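One detail worth keeping in mind: `AgglomerativeClustering` has no `predict` method, so the pipeline has to be driven through `fit_predict`. A rough sketch under the same assumptions as before (digits loaded with `load_digits`, and `n_components=20` chosen arbitrarily) might be:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import v_measure_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# Ward linkage merges the pair of clusters that least increases
# the total within-cluster variance.
model = make_pipeline(
    PCA(n_components=20, random_state=0),  # arbitrary illustrative value
    AgglomerativeClustering(n_clusters=10, linkage='ward'),
)

# The pipeline exposes fit_predict because its final step does.
labels = model.fit_predict(X)
print(f"v-measure, PCA + ward: {v_measure_score(y, labels):.3f}")
```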

In [ ]:

4. Attempt agglomerative clustering on the digits dataset after t-SNE (★★☆)

Remember to use the PipeTSNE defined above. Keep linkage='ward' in this exercise.

In [ ]:

5. Attempt single linkage on the digits dataset after t-SNE (★★☆)

Remember to use the PipeTSNE defined above. Now it is time to use linkage='single' in the agglomerative clustering. Does single linkage perform better on the non-linearly preprocessed dataset than it did on the raw digits data, as we saw earlier?
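To make the comparison concrete, a sketch along these lines could compute both scores side by side. The helper `single_linkage` is a hypothetical name introduced only for this example, `PipeTSNE` is redefined so the block is self-contained, and the 500-sample subset is an assumption made to keep the run short; results on the full dataset may differ.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import v_measure_score
from sklearn.pipeline import make_pipeline


class PipeTSNE(TSNE):
    def transform(self, X):
        # Refits the embedding on X; not a true out-of-sample transform.
        return self.fit_transform(X)


def single_linkage():
    # Hypothetical helper: a fresh single-linkage clusterer for each run.
    return AgglomerativeClustering(n_clusters=10, linkage='single')


X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # assumption: subsample only to keep the run short

# Single linkage on the raw pixels.
v_raw = v_measure_score(y, single_linkage().fit_predict(X))

# Single linkage on the two-dimensional t-SNE embedding.
model = make_pipeline(PipeTSNE(n_components=2, random_state=0), single_linkage())
v_tsne = v_measure_score(y, model.fit_predict(X))

print(f"v-measure, single linkage on raw data: {v_raw:.3f}")
print(f"v-measure, single linkage after t-SNE: {v_tsne:.3f}")
```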

In [ ]: