08.04 t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a manifold technique that uses a probability distribution to measure distances between points in a dataset (high probability means a short distance, low probability a long distance). It then attempts to preserve these distances during the transformation.

William Sealy Gosset

William Sealy Gosset was the Chief Brewer for Guinness. Since Guinness did not allow its scientists to publish under their own names, Gosset published under the pseudonym "Student". His discoveries in the realm of small-sample statistics led to the "Student's t distribution" and "Student's tests of significance". An interesting fact about Gosset's life is his friendship with both Ronald Fisher and Karl Pearson, the fathers of modern statistics, two men who deeply disliked each other.
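
To make the probabilistic idea from the introduction a little more concrete, below is a minimal sketch - not scikit-learn's implementation - of the two similarity kernels t-SNE is built on: a Gaussian kernel between points in the original space and a heavy-tailed Student's t kernel between points in the embedding. The tiny random dataset and the fixed sigma are assumptions for illustration only; the real algorithm tunes one sigma per point through the perplexity hyperparameter, and then moves the embedded points to make the two sets of probabilities agree.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 64))   # 5 points in a 64-dimensional space (illustrative)
Y = rng.normal(size=(5, 2))    # a candidate 2-dimensional embedding of the same points

def pairwise_sq_dists(A):
    diff = A[:, None, :] - A[None, :, :]
    return (diff ** 2).sum(axis=-1)

# original space: Gaussian affinities (sigma fixed here only for illustration)
sigma = 1.0
P = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
np.fill_diagonal(P, 0)
P /= P.sum()

# embedding: Student's t similarities (one degree of freedom, heavy tails)
Q = 1 / (1 + pairwise_sq_dists(Y))
np.fill_diagonal(Q, 0)
Q /= Q.sum()

# t-SNE adjusts the embedding to minimize the divergence between P and Q
kl_divergence = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
kl_divergence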

t-SNE is considerably different from a decomposition technique such as PCA:

  • Has no inverse transform (this is often the case with manifold techniques)

  • The number of components is always much smaller than the number of dimensions in the original data. Since it is primarily a visualization technique, most often you will only see 2 or 3 components

  • Since it is a stochastic technique, the result is considerably different depending on the initial (random) state, or even the ordering of the input data (see the sketch after this list)
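
As a quick illustration of the first and last points, the following sketch checks that a fitted TSNE object exposes no inverse_transform and that two runs with different random states produce different embeddings. The tiny random dataset and the explicit perplexity and init values are assumptions for illustration only.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_small = rng.normal(size=(50, 10))   # tiny random dataset, for illustration only

tsne = TSNE(n_components=2, perplexity=10)
embedding = tsne.fit_transform(X_small)

# unlike PCA, there is no way back to the original space
hasattr(tsne, 'inverse_transform')   # False

# different random states lead to different embeddings of the same data
emb_a = TSNE(n_components=2, perplexity=10, init='random', random_state=0).fit_transform(X_small)
emb_b = TSNE(n_components=2, perplexity=10, init='random', random_state=1).fit_transform(X_small)
np.allclose(emb_a, emb_b)            # False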

We import the usual stuff.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')

Let's see how t-SNE performs on the MNIST handwritten digits.

In [2]:
from sklearn.datasets import load_digits

digits = load_digits()
sample = digits.images[:20]
fig, axes = plt.subplots(4, 5, figsize=(16, 8))
for i, ax in enumerate(axes.flat):
    ax.imshow(sample[i], cmap='binary')
    ax.axis('off')

The t-SNE implementation in sklearn works in a similar way to any other preprocessor. We give it the hyperparameters - here just the number of components - and fit the model. We use fit_transform to fit the model and transform the data in a single step.

The following takes a while to run; t-SNE is computationally expensive.

In [3]:
from sklearn.manifold import TSNE

proj = TSNE(n_components=2).fit_transform(digits.data)

As we did with PCA, we plot the groups output by t-SNE and color them according to the known labels. We also scale the projection to the unit square, in the same way as before, to make plotting easier.

In [4]:
x_min, x_max = np.min(proj, axis=0), np.max(proj, axis=0)
proj_norm = (proj - x_min) / (x_max - x_min)

fig, ax = plt.subplots(figsize=(16, 14))
ax.axis('off')
for i in range(len(digits.target)):
    ax.text(proj_norm[i, 0], proj_norm[i, 1], str(digits.target[i]),
            color=plt.cm.plasma(digits.target[i] / 10),
            fontdict=dict(size=14, weight='bold'))

The separation between classes is pretty good. t-SNE can find non-linear relations, so it performs much better than PCA on a non-linear dataset. That said, t-SNE is not without its own flaws. For example, the algorithm is very sensitive to data ordering, which we can test by reordering our data.

The digits data is ordered as several repeated sequences from $0$ to $9$. We will change that ordering - first all zeros, then all ones, and so on - and run t-SNE on the reordered data.

In [5]:
X = np.vstack([digits.data[digits.target==i] for i in range(10)])
y = np.hstack([digits.target[digits.target==i] for i in range(10)])
proj_ordered = TSNE(n_components=2).fit_transform(X)

We plot the projection of the ordered digits in the same way as before, and then compare both plots.

In [6]:
x_min, x_max = np.min(proj_ordered, axis=0), np.max(proj_ordered, axis=0)
proj_norm_ordered = (proj_ordered - x_min) / (x_max - x_min)
projs = [proj_norm, proj_norm_ordered]
labels = [digits.target, y]
titles = ['input data order: 0,1,2,3,4,5,6,7,8,9,0,1,2,...', 'input data order: 0,0,..,1,1,...,2,2,...']

fig, ax = plt.subplots(1, 2, figsize=(24, 12))
for i in [0, 1]:
    ax[i].axis('off')
    ax[i].set_title(titles[i], pad=30)
    for j in range(len(digits.target)):
        ax[i].text(projs[i][j, 0], projs[i][j, 1], str(labels[i][j]),
                   color=plt.cm.plasma(labels[i][j] / 10),
                   fontdict=dict(size=14, weight='bold'))

The class separation is still good but the plot looks very different. That is the big disadvantage of t-SNE: the algorithm is not trivially reproducible. That said, t-SNE will certainly be a better preprocessing technique than PCA for this dataset.
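
If reproducibility matters, one common mitigation - sketched below, and not something we did above - is to fix random_state and use the deterministic PCA initialization, so that repeated runs on the same data produce the same projection. The subset of the digits used here is only to keep the runtime short, and this does not remove the sensitivity to the data ordering itself.

from sklearn.manifold import TSNE

kwargs = dict(n_components=2, init='pca', random_state=42)
run_a = TSNE(**kwargs).fit_transform(digits.data[:500])
run_b = TSNE(**kwargs).fit_transform(digits.data[:500])
np.allclose(run_a, run_b)   # expected: True, identical runs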

t-SNE is just one of many manifold techniques, each with its advantages and disadvantages. Understanding of manifolds is an area of active research.