We saw that we can divide machine learning models into four categories.
And although this grouping is quite outdated, we will follow it for the time being. We skimmed over classification and regression but, more importantly, we saw how we evaluate ML models and how we automate that evaluation. Now, one may be surprised to hear that we have already been through the majority of ways of evaluating an ML model.
In unsupervised learning there are no good ways to know whether a model works well or not. Instead, when faced with a problem that calls for unsupervised learning, we attempt to transform at least part of the problem into a supervised one and evaluate that with accuracy scores, F1 measures, and so on.
Often we do not have labels for the data on which we use unsupervised learning, and in such a case any form of model evaluation will be very difficult. Perhaps the easiest way to evaluate a model for a problem for which we have no labels at all, and cannot get any, is to find a similar problem for which we do have labels and train a model on that similar problem. Once we have a well evaluated model on the similar problem, we assume that it will work in a similar fashion on the original problem.
Most of the time with unsupervised learning we have, or can get, at least a handful of labels. We can then evaluate our unsupervised model on that handful of labels in a similar fashion as we would for a supervised one. This is sometimes called semi-supervised learning, and it is our first indication that the grouping of ML models above is quite incomplete.
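As a rough sketch of that idea - assuming scikit-learn is available, using the iris dataset as a stand-in for a problem where we only managed to label a handful of samples, and using a clustering model (a kind of model we will look at shortly) as the unsupervised model - we can fit on all the data and score against the few labels we do have.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

# stand-in dataset: in practice this is our unlabeled data
# plus a small subset for which we managed to obtain labels
X, y = load_iris(return_X_y=True)

# fit the unsupervised model on all the data, no labels are used here
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# pretend we only know the labels of a handful of samples
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=15, replace=False)

# compare the cluster assignments against the few labels we do have
print(adjusted_rand_score(y[labeled], model.labels_[labeled]))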
Dimensionality reduction will allow us to explore the data in fewer dimensions, whatever we happen to call those dimensions.
Here we come back to the concept of distance. We saw different distances in the form of L-norms, including L1 and L2, but many other measures can be made to work as a distance: sums of coordinates and polynomial functions on top of the data are distances we have already used; other distances include measures such as cosine similarity or probabilities. Do not worry about what exactly these are right now, we will explore some of these ways of measuring distance.
Independently of what we use as a distance, we say that two points (samples) are similar if they are close to one another, and dissimilar if they are far apart.
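As a small taste - using plain NumPy on two made-up points - here is a sketch of how a few of these measures can be computed; the exact definitions will come in due time.

import numpy as np

a = np.array([1., 2., 3.])
b = np.array([2., 4., 1.])

# L1 (manhattan) distance: sum of absolute coordinate differences
l1 = np.sum(np.abs(a - b))

# L2 (euclidean) distance: square root of the sum of squared differences
l2 = np.sqrt(np.sum((a - b)**2))

# cosine similarity: cosine of the angle between the two vectors,
# 1 means pointing the same way, 0 means orthogonal
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(l1, l2, cos)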
One way to evaluate dimensionality reduction is to perform the reduction down to a number of dimensions we can visualize (two or three) and then check by-the-eye that things close together are the ones we expect to be similar and things far apart are the ones we expect to be dissimilar.
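A minimal sketch of that by-the-eye check, assuming scikit-learn is available: we take the digits dataset (64 dimensions per sample), reduce it to 2 dimensions with PCA - used here only as a stand-in for whichever reduction technique we end up with - and colour the points by their known digit to see whether similar samples land close together.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)

# reduce 64 dimensions down to 2 so that we can plot them
X2 = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots(figsize=(10, 7))
# the labels are used only to colour the points for the visual check
points = ax.scatter(X2[:, 0], X2[:, 1], c=y, cmap='tab10', s=15)
fig.colorbar(points, ax=ax)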
When we have a handful of labels we can perform the dimensionality reduction and then train another model on top of the reduced data. The performance of the resulting model will be an indication of the performance of the dimensionality reduction technique.
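For example - again assuming scikit-learn, with the digits dataset and PCA plus a logistic regression as stand-ins - we can build a pipeline that first reduces the data and then classifies it; the cross validated score of the whole pipeline is an indirect measure of how much useful structure the reduction has preserved.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)

# reduce 64 dimensions down to 10, then classify on the reduced data
model = make_pipeline(PCA(n_components=10),
                      LogisticRegression(max_iter=1000))

print(cross_val_score(model, X, y, cv=5).mean())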
Clustering is similar to the models we have already seen when looking at supervised learning. Instead of changing the structure of the space where the data lives, as is done with dimensionality reduction, clustering attempts to identify the structure in the data. The identification of this structure is again performed by means of distance in one form or another.
One builds clusters, structures which then allow new (unseen) data to be classified into one of these clusters. We argue that samples that are similar to each other belong in a cluster together, and in a different cluster from samples that are dissimilar to them. Contrary to supervised learning, such a cluster has no meaning on its own, unless we give it one using a guesstimate, by-the-eye, evaluation. Without labels clusters are just groupings of data, yet seeing such groupings we may be able to guess labels for them.
If we have some labels then we can check what label is present for the majority of samples in a cluster and give the cluster a meaning. In this fashion we can evaluate a clustering model in a similar way to supervised learning models. We can then say that a good clustering would group together the samples according to their labels.
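A sketch of that majority-label evaluation, assuming scikit-learn and using the iris dataset - where we happen to know all the labels - as a stand-in:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# cluster without ever looking at the labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# give each cluster the meaning of the majority label inside it
meaning = {}
for c in np.unique(clusters):
    meaning[c] = np.bincount(y[clusters == c]).argmax()

predicted = np.array([meaning[c] for c in clusters])
print(accuracy_score(y, predicted))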
When one has no labels at all and cannot guesstimate them, there remains the possibility of evaluating the shape of the clusters. Clusters live in the space where the data lives. Evaluation techniques can geometrically check whether the clusters are round or dense, and the concepts of round and dense change slightly depending on how we measure distance. Still, geometric techniques for clustering evaluation are rather limited, and are an area of active research.
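One geometric measure in common use is the silhouette score, which compares, for every sample, the mean distance to the other members of its own cluster with the mean distance to the members of the nearest other cluster; values close to 1 suggest dense, well separated clusters. A minimal sketch, again assuming scikit-learn and the iris dataset:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# no labels needed, only the shape of the clusters is evaluated
print(silhouette_score(X, clusters))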
Let's import a handful of things that will help us with the next topic.
We will need a handful of mathematical concepts, although not necessarily equations, in order to advance from here. We will also have a quick look at a few mathematicians that allowed us to get where we are today.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
For a long time science assumed the linearity of the world. This is another way to say that science believed that every phenomenon can be distilled to its first principles, and that those first principles then combine in a linear fashion to produce the more complex phenomena. That is a lot of talk that does not actually tell us what this linearity is. It is often the case that the exact definition of linearity is slightly different depending on the discipline one looks at. Instead we will attempt to define where linearity fails and work backward from there.
On the assumption that one would be capable of dividing the world into all its first principles then, by definition, none of the first principles could be correlated with any of the others. Otherwise, if two first principles are indeed correlated, then one of them is dependent on the other and hence is not a first principle at all but a phenomenon caused by the other first principle.
Let us play with this concept and build two graphs: the capital letter B, and some white noise - a completely uniform random set of points - and make their histograms in each dimension.
B = np.array([
[.1, .9], [.20, .9],
[.23, .87],
[.1, .8], [.25, .8],
[.1, .7], [.25, .7],
[.23, .63],
[.1, .6], [.20, .6],
[.1, .5], [.20, .5],
[.1, .4], [.27, .4],
[.27, .37],
[.1, .3], [.30, .3],
[.27, .23],
[.1, .2], [.27, .2],
[.1, .1], [.20, .1],
])
noise = []
for i in range(70):
    # repeat the letter many times with a small random jitter on every point
    noise.append(B + np.random.random(B.shape)*.12)
B = np.vstack(noise)
white = np.random.random(B.shape)
fig, ax = plt.subplots(2, 5, figsize=(16, 9))
ax[1, 0].scatter(B[:, 0], B[:, 1], s=20, c='crimson')
ax[0, 0].hist(B[:, 0], color='crimson')
ax[0, 0].set_title('Letter B, corr %.06f' % np.corrcoef(B[:, 0], B[:, 1])[0, 1])
ax[1, 1].hist(B[:, 1], color='crimson', orientation='horizontal')
ax[1, 3].scatter(white[:, 0], white[:, 1], s=20, c='steelblue')
ax[0, 3].hist(white[:, 0], color='steelblue')
ax[0, 3].set_title('White Noise, corr %.06f' % np.corrcoef(white[:, 0], white[:, 1])[0, 1])
ax[1, 4].hist(white[:, 1], color='steelblue', orientation='horizontal')
for img in ax.flat:
    img.axis('off')
In both graphs we only have points in $2$ dimensions, and we evaluated the correlation between the $2$ dimensions in each case. Both the capital letter B and the white noise have pretty much no correlation between the dimensions in which they exist.
If we accept the approximation that the correlations are indeed exactly zero, then we can argue that we have separated the world where the letter B and the white noise live into its first principles: the positions on the x-axis and the positions on the y-axis. And since the positions and correlations are pretty much the same, the letter B and the white noise must be exactly the same thing!
Obviously they are not the same thing. We can see the letter B because our human brain is very good at finding non-linear patterns. When we draw a capital letter B we do not take notice of exactly where we place each part of the letter. Instead we make an effort to ensure that the letter B has one straight side and round holes in it. The exact position of the parts of B does not matter; what matters is the position of the parts relative to each other.
The letter B has a pattern we can identify with our human brain, a pattern which does not follow from the definition of a two-dimensional world built from positions of points with two coordinates. Finding such patterns, as with the letter B in the example, is a case of a non-linear problem.
How do we know that a problem is non-linear? Well, we do not. The only way to know whether there is more to be seen than what is possible by splitting the entire world where the problem lives into its first principles is to perform the splitting and attempt to solve the problem. If the problem cannot be solved by splitting it into first (uncorrelated) principles, then it is a non-linear problem. With experience one learns to identify non-linear problems. Notably, the world has many more non-linear problems than linear ones.
Linearity of problems, even when it is just an approximation, allows for several nice properties. Linear algebra and univariate statistics describe linear problems well, and offer several closed form solutions (solutions that can be evaluated straight away). Hence it is worth attempting a linear solution to a problem before assuming that it is non-linear.
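For example, fitting a straight line by least squares has a closed form solution, the normal equation $w = (X^T X)^{-1} X^T y$. A small sketch with NumPy, on made-up data, follows.

import numpy as np

rng = np.random.default_rng(0)

# a noisy linear relationship: y = 3x + 2 + noise
x = rng.uniform(0, 10, 100)
y = 3*x + 2 + rng.normal(0, 1, 100)

# design matrix with a column of ones for the intercept
X = np.c_[x, np.ones_like(x)]

# the least squares solution, equivalent to the normal equation above
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [3, 2]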
That being said, assuming that everything in the world around us can be explained by splitting it into first principles - and that linear techniques work on every problem - is akin to hammering a square peg into a round hole. In the case above one would simply assume that every letter B is just white noise and never find out what B may actually be.
Univariate statistics, where each random variable (or each feature, when thinking about data) is described by a distribution with a single peak, are not the only form of statistics. There are statistics which account for the possibility of non-linear patterns in the data. Univariate statistics are just more common, since they include the assumption of a Gaussian distribution of the data. And the Central Limit Theorem we mentioned earlier is also a form of univariate statistics, since it concerns a sum of several univariate distributions (distributions with a single peak). Wilcoxon statistics and Tsallis statistics are examples of statistics that account for non-linearities.
Whether the commonality of univariate statistics is due to the fact that linear problems are easier is up for debate. One can argue that the only reason linear problems are easier is that we have more tools to deal with them. Had mathematics developed in a different way than it did, perhaps today we would argue that non-linear problems are easy and linear problems are hard. That said, we currently have many more tools to deal with linear problems, hence for practical purposes we often argue that linear problems are easy to solve and non-linear ones are hard.