We saw that search engine research produced an elegant technique to encode words in plain text: Term Frequency by Inverse Document Frequency (TF-IDF). Each word in a sample is represented by the count of this word in the sample divided by the frequency of the same word across all samples. We end up with a vector where each word is a dimension. But think for a moment about what may happen if we encode entire articles in this fashion. We will quickly reach a vocabulary of thousands of words, and the resulting vector space will have thousands of dimensions.
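As a minimal sketch of what this encoding produces - the tiny corpus below is made up purely for illustration - we can vectorize a handful of sentences and see how quickly the number of word dimensions grows.
from sklearn.feature_extraction.text import TfidfVectorizer
# a made-up toy corpus, just to illustrate the shape of the encoding
toy_corpus = [
    'the cat sat on the mat',
    'the dog chased the cat',
    'rockets reach orbit in space',
]
toy_tfidf = TfidfVectorizer()
toy_vectors = toy_tfidf.fit_transform(toy_corpus)
# every distinct word becomes one dimension (one column) in the vector space
print(toy_vectors.shape, len(toy_tfidf.vocabulary_))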
Thousands of dimensions are not an uncommon occurrence in machine learning. More often than not we will work with datasets of very high dimension. In terms of representation such high dimensions are not much more than a rather big matrix - remember that we can represent several dimensions in a lower dimensional construct because we can melt the data.
We often argue that every column in such a matrix is a dimension. Very high dimensions have their problems and we will come back to discuss these. But for now let us attempt a model that will work on the highly dimensional data produced by TF-IDF.
We can try it out with samples from newsgroups. And since newsgroups are aggregated by topic we will try to classify the samples into topics. A newsgroup is what today one would call an email chain or mailing list. At the beginning of the Internet there was no World Wide Web (WWW) and communication happened mostly over email. To find people with similar interests one would send an email to a newsgroup and then receive emails directed to that newsgroup, in the hope that someone would answer. We also import a Naive Bayes model that works well with classes that have real valued features, such as TF-IDF.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import ComplementNB
The newsgroups dataset is a benchmarking dataset, and comes already divided into training and testing sets. What this means for us is that the dataset itself already provides the split into a training and a testing set. Since the dataset is text (emails) it is not as memory efficient as datasets encoded in NumPy arrays, and repeatedly copying and splitting it ourselves would likely cause memory issues.
We ask for specific topics, in this case $7$ rather unrelated newsgroup topics, and ask once for the training and once for the testing set. The dataset is downloaded from the web and then cached on the running machine, so it may take a moment to download.
newsgroups = [
'comp.graphics',
'misc.forsale',
'rec.autos',
'rec.sport.hockey',
'sci.space',
'soc.religion.christian',
'talk.politics.misc',
]
train = fetch_20newsgroups(categories=newsgroups, subset='train')
test = fetch_20newsgroups(categories=newsgroups, subset='test')
train.target_names
First of all let us have a look at what we are dealing with. These are emails; I find the $134$th email in the downloaded subset rather familiar no matter the time. People have had computer problems since computers were invented.
Now we have newsgroup topics (the classes) and emails in these newsgroups (the samples). We will attempt to build a classifier that, given some text, e.g. an email, is capable of telling us the topic. The model will be limited to the $7$ topics of our choice above but it is still a classification worth attempting.
print(train.data[133])
We will make a pipeline of a TF-IDF preprocessor and a Naive Bayes classifier. The Naive Bayes classifier is a very simple non-parametric technique that just attempts to build (hyper)spherical probabilistic generators around the center of each class. We said that parametric techniques are those that have the number of parameters defined by their hyperparameters. It is a common misconception that non-parametric techniques are techniques which have no parameters or hyperparameters.
Non-parametric techniques take the number of parameters the model optimizes from the data itself. In the case of a Naive Bayes classifier each distinct class in the data has a probabilistic generator assigned to it. In the many dimensions we get from TF-IDF each distinct class will have a Gaussian probability distribution (with its own center and variance) in every dimension. Remember that a Gaussian distribution is a construct in a single dimension, hence we need one per dimension.
What such a model may look like is shown below. Imagine that we have thousands of word dimensions - here we label only $5$ of these dimensions because drawing a thousand dimensions is rather hard. Each class has a center - marked with an $X$ - and a spread (variance) in each direction. Here we see $3$ of the $7$ classes.
What we see is the trained - fit - model. A new email - be it new data or unseen data from the test set - is vectorized by TF-IDF and placed within the vector space. The new vector then has a distance to the center of each class. In very simple terms one could say that the topic of the new email is assigned based on the closest class center. The reality is a bit more complicated since the class spreads can overlap, and a widely spread class carries less weight than a compact class. These are the mentioned (hyper)spherical probabilistic generators. The label of the new data point is defined by the highest probability across classes - it is similar to a distance but weighted by each class spread.
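To make this concrete, here is a rough NumPy sketch on made-up data of the Gaussian flavour of the idea (class priors are left out for brevity, and this is not what ComplementNB does internally): estimate a center and a spread per class in every dimension, then label a new point by the highest total log-probability.
import numpy as np
from scipy.stats import norm
rng = np.random.RandomState(0)
# made-up data: two classes, five dimensions
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 2, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
# "fit": one center and one spread per class, per dimension
centers = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
spreads = np.array([X[y == c].std(axis=0) for c in (0, 1)])
# "predict": sum the log-probabilities across dimensions, pick the best class
new_point = rng.normal(2.5, 1, 5)
log_probs = [norm.logpdf(new_point, centers[c], spreads[c]).sum() for c in (0, 1)]
print(np.argmax(log_probs))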
The model from sklearn that we use is ComplementNB, which accounts for a handful of issues that would make a plain Naive Bayes classifier perform poorly here. We give it one hyperparameter, norm=True, which performs some extra normalization of weights. The normalization makes sense here because we also normalize the vectors in TF-IDF. Hence we are rather confident in the value of this hyperparameter and need not attempt to find the best value (true or false) for it.
model = make_pipeline(TfidfVectorizer(), ComplementNB(norm=True))
model.fit(train.data, train.target)
labels = model.predict(test.data)
labels
Since we have many classes ($7$ different newsgroup topics) a single score may not be the best approach to understand how our model works. Instead we will build a classification report, which will give us a score for each class. From sklearn we can import a classification report which performs in a similar way to a scoring function. We can provide target_names to have string labels for the classes. We did see precision, recall and f1 score before but we have a new column in the report: support. Class support is just the number of samples from that class in the test set. A balanced set would have almost the same support across classes, whilst an unbalanced set is a set where the support varies wildly. Our newsgroup set is slightly unbalanced: most classes have a support of almost $400$ whilst one class has a support just above $300$.
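The support numbers can also be checked directly from the test targets; a quick count like the sketch below (using the test set we fetched above) shows the mild imbalance.
# count the test samples in each class - this is the support column of the report
pd.Series(test.target).map(dict(enumerate(test.target_names))).value_counts()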
This small support issue is one of the reasons for our choice of classifier. There are several choices of Naive Bayes classifiers in sklearn: GaussianNB has issues with very high dimensions, and MultinomialNB works badly on unbalanced sets and may have issues with non-discrete features (e.g. with floating point numbers, as we have). Our choice of ComplementNB is an educated guess based on the documentation of the distinct Naive Bayes classifiers.
from sklearn.metrics import classification_report
print(classification_report(test.target, labels, target_names=newsgroups))
The worst results are across religion and politics. No surprises there, these topics get intermingled in the real world too. Religion has a bad precision score and politics has a bad recall score, which suggests that several emails from the politics newsgroup are classified as religion.
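One way to check this suspicion on the predictions we already have is a confusion matrix, as in the sketch below; the off-diagonal cell from the politics row to the religion column should stand out.
from sklearn.metrics import confusion_matrix
# rows are the true topics, columns the predicted topics
mat = confusion_matrix(test.target, labels)
pd.DataFrame(mat, index=train.target_names, columns=train.target_names)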
That said, with a very simple classifier and some data encoding we have built a model that can tell us the topic of a sentence. We can see it in action with a small helper function.
def predict_chat(sentence):
    predicted = model.predict([sentence])
    return train.target_names[predicted[0]]
print('TUNING', predict_chat("I've added a new set of cyllinders, now I'm not even making 10 miles per galon"))
print('BALL', predict_chat('The ball never went even close to the goal'))
print('BUTTON', predict_chat("Dude, I'm telling you, there is no such button on my screen"))
print('WIFE', predict_chat('My wife went shopping in the morning, has not come back yet'))
print('APOLLO', predict_chat('No one ever landed on the moon, it was all a farse'))
Given that all this is doing is checking word frequency probabilities, this is a rather amazing result for such a simple algorithm.
And we can still see the problems with religion and politics in the predictions. This problem happens because these two topics use lots of stop words, i.e. words that are commonly used in sentence construction. For example, if we use only common words in a sentence we will get one of these topics.
predict_chat('the what where')
If we remove the stop words from the data representation we should get a better separation between religion and politics.
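A quick way to try this, as a sketch: TfidfVectorizer accepts stop_words='english', which drops a built-in list of common English words before counting, so we can rebuild the same pipeline with that option and compare the reports.
# the same pipeline, but with a built-in list of English stop words removed
model_nostop = make_pipeline(TfidfVectorizer(stop_words='english'),
                             ComplementNB(norm=True))
model_nostop.fit(train.data, train.target)
print(classification_report(test.target, model_nostop.predict(test.data),
                            target_names=newsgroups))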
As a closing point we also argue that since Naive Bayes is non-parametric and has few tunable hyperparameters, it is a very good technique for a classification baseline. When one attempts classification on an unknown problem it is wise to try several ML techniques. Naive Bayes is a good first attempt: it is a technique that will not perform optimally on many problems but will not perform too badly either. It is common practice to try Naive Bayes first, and then when training other techniques discard any ML algorithm that performs worse than Naive Bayes. In this way Naive Bayes is a baseline between reasonable models for a problem and models that go bonkers on that specific problem.
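In practice such a baseline comparison can be as short as the sketch below; LogisticRegression is used here only as an arbitrary candidate model, not a recommendation.
from sklearn.linear_model import LogisticRegression
# accuracy of the Naive Bayes baseline on the test set
baseline = model.score(test.data, test.target)
candidate = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
candidate.fit(train.data, train.target)
# a candidate is worth keeping only if it beats the baseline
print(candidate.score(test.data, test.target), 'vs baseline', baseline)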
In the thousands of dimensions we worked with above we managed to build a useful model. But a few assumptions were not mentioned. Distance measurement in high dimensions quickly becomes quite distinct from our common understanding of distance in the real world. In our classification above we avoided this issue because we normalized all vectors from TF-IDF to length $1$. Yet, in datasets with many dimensions this normalization may not be possible or may not be desirable. In such cases the use of different distance measures - be it Euclidean, Manhattan, the probabilistic weight as in the model above, or even something else - will require interpretation and insight in order to build understanding about the data.
This issue with distances is often called the curse of dimensionality, and is exemplified by the fact that two points that are close in many dimensions are still far apart overall if they are far apart in just a few other dimensions. That is, assuming the use of Euclidean distance, because the squared differences in those few dimensions dominate the sum.
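A tiny numerical illustration with made-up numbers: two points that agree almost everywhere but disagree strongly in just two dimensions still end up far apart in Euclidean distance.
# two points that are very close in 998 dimensions...
a = np.zeros(1000)
b = np.full(1000, 0.01)
# ...but far apart in the first two dimensions
b[:2] = 10.0
# the two large squared differences dominate the Euclidean distance
print(np.sqrt(((a - b) ** 2).sum()))          # distance using all dimensions
print(np.sqrt(((a[2:] - b[2:]) ** 2).sum()))  # distance ignoring those two dimensions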
This bit is a tad more mathematical, so you're free to skip it. But if you are curious about a more advanced bit of ML research, go ahead.
Another, less talked about, part of the curse of dimensionality is the difficulty of optimizing a solution when fitting to the data. Behind the scenes an optimizer is run to fit the model. We will eventually talk more about optimizers, but for now let us just say that the optimizer searches for critical points of the model error function that are minima (minimums) of the function. A function can have critical points that are maxima (maximum value), minima (minimum value) or saddle points (neither a minimum nor a maximum for the entire function but a minimum in some dimensions and a maximum in other dimensions). As the number of dimensions increases the number of minima and maxima increases. But a result from topology adds to this: it argues that
$$\text{number of minima} \;-\; \text{number of saddles} \;+\; \text{number of maxima} \;=\; \text{Euler number of the space}$$

(in more than two dimensions the different kinds of saddles are counted with alternating signs, but the idea carries over).
The Euler number of the plane (or of the projective plane, as it is called in topology) is $1$, and as we add more dimensions that number decreases (and becomes negative very fast). This means that as we have more dimensions we have more minima and maxima, but the number of saddles increases faster than the number of minima and maxima - because the Euler number of the space decreases. For the optimizer more and more saddles are points through which it may wander in search of a minimum, another piece of trouble in high dimensional spaces.
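A concrete instance of this counting, as an aside: the usual height function on a torus, a space with Euler number $0$, has one minimum, two saddles and one maximum, and indeed $1 - 2 + 1 = 0$; on a sphere, with Euler number $2$, a height function has one minimum, no saddles and one maximum, and $1 - 0 + 1 = 2$.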