Machine learning deals only with numeric data, yet not all data in the world is numeric. Two examples of non-numerical data that are meaningful for learning are categorical features (e.g. animal species) and plain text (e.g. product reviews). There are tricks that allow us to deal with non-numerical data, and these tricks are part of feature engineering.
For a start let's import a handful of things.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
Dealing with non-numerical data is only a part of feature engineering, although it is often the most common application called by this name. Feature engineering is not so much a collection of techniques as a generic name for the tasks performed on the input data before it reaches our model. These include:
The last example requires really huge amounts of data, hundreds of millions of samples.
There is a plethora of texts on feature engineering and selection, but perhaps the most important idea about building and selecting features in the current day and age is that no one has authoritative answers on feature engineering. There are well proven statistical techniques for feature selection, but these require knowledge of the complete correlations across all combinations of features, often an infeasible task. Moreover, these selection techniques are proven only for linear relations between data and outputs; it has been shown that such techniques lose information about non-linear relations within the data.
The bottom line of feature engineering is that, at present, no single technique or combination of techniques can substitute for a human being and their capacity to understand the data. Instead of relying on feature engineering techniques to build or select features, one uses these techniques and builds sample models in order to gain understanding about the data, and then uses that understanding to construct one's own pipeline feeding the data into the model. Despite advances, at the time of writing, feature engineering is more of a black art than engineering - and it is likely to stay that way for some time still.
Instead of building an authoritative method for extracting good features from data for ML use, we will describe some common techniques and what they mean for the data.
When we deal with the proper names of things or people we are most often dealing with categorical data. One such example is the list of British cities we saw before.
city = ['Liverpool', 'Manchester', 'Cardiff']
country = ['England', 'England', 'Wales']
population2001 = [435500, 405300, 305353]
df = pd.DataFrame({'city': city,
                   'country': country,
                   'population 2001': population2001,
                   })
df
One can force the city and country columns to have a categorical type. In pandas that assigns a numeric code to each value in the column, and these codes reference a set of categorical labels. This also means that the data is now completely numerical: we can take the category codes and assign them back to the columns. In pandas the codes are stored in cat.codes.
cat_df = df.copy()
cat_df.city = cat_df.city.astype('category')
cat_df.city = cat_df.city.cat.codes
cat_df.country = cat_df.country.astype('category')
cat_df.country = cat_df.country.cat.codes
cat_df
Yet, that is not enough. Numerical values have an order, therefore we can test for inequality. Based on the data above we can say that:
$$\texttt{England} < \texttt{Wales}$$

or

$$\texttt{Manchester} > \texttt{Liverpool}$$

Unfortunately, apart from their use in rugby or football jokes, these inequalities are rather useless. Moreover, they are likely to confuse an ML algorithm. Instead we need to encode the data into a form that disallows such inequalities; one such form is called one-hot-encoding. In one-hot-encoding each sample gains several features built from the categorical feature, but only one of these columns contains a one, and all other columns contain zeros.
In pandas get_dummies exists for this exact purpose: to build a one-hot-encoding from a categorical feature.
pd.get_dummies(df, prefix_sep='=')
And this is data that we can feed into an ML technique without worrying about confusing it.
That said, this representation can use huge amounts of memory if there is a large number of features. To alleviate the memory problem sklearn can perform one-hot-encoding on sparse matrices (from scipy); this way we only need to store the ones.
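As a minimal sketch of this (assuming a recent enough sklearn that accepts string columns directly; the selection of the city and country columns from df is our own choice), OneHotEncoder returns a scipy sparse matrix by default:

from sklearn.preprocessing import OneHotEncoder

# fit_transform returns a scipy sparse matrix by default,
# so only the positions of the ones are stored in memory
enc = OneHotEncoder()
enc.fit_transform(df[['city', 'country']])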
Plain, unstructured text data presents different challenges when transforming it into a numeric representation. For a start, we cannot just one-hot-encode words because they may appear more than once in each sample. We could encode the presence of words in each sample, but when distinguishing between samples certain words are certainly more important than others; e.g. we can safely assume that the word "the" will appear in almost every sample.
Let's try to encode some sentences in a reasonable fashion. And let us try to keep as much of the sentence meaning as we can.
sentences = [
    'quick brown fox',
    'brown dog',
    'gray fox',
    'gray dog',
    'fox',
]
sentences
For a start we may attempt to just count distinct words in every sentence. We need two passes: one to figure out all available words and one to count these words in every sentence.
For simplicity we use Python dictionaries to build columns and then transpose the data frame.
from functools import reduce
all_words = reduce(lambda x, y: x + ' ' + y, sentences).split()
word_count = dict((k, all_words.count(k)) for k in sorted(all_words))
print('FULL WORD COUNT:', word_count)
print()
d = {}
for s in sentences:
    d[s] = {w: s.count(w) for w in word_count}
pd.DataFrame(d).T
That is a viable encoding, and it has the name term frequency or just TF. But we can do better.
Let's notice that the word "quick" appears only in the first sentence, the word "fox" appears in the majority of sentences, and the remaining words appear in a similar fashion across the corpus - where corpus is just a fancy name for a dataset composed of text documents. From this analysis, if we see the word "quick" we can identify the sentence straight away, but if we see the word "fox" we need more information. This is the idea of more important words we alluded to earlier.
The word "quick" carries more decision value than the word "fox". And we can see that these decision values are the inverse of the total word counts, where "fox" is the most common word and "quick" the rarest. The idea that comes to mind is that if we adjust the values in our data frame by the inverses of the total word counts, we will carry the decision values into our encoding. This idea is called the inverse document frequency, or IDF for short.
d = {}
for s in sentences:
    d[s] = {w: s.count(w) * (1/word_count[w]) for w in word_count}
df = pd.DataFrame(d).T
df
This looks even better, but there are still some problems. We have placed our sentences as points in a $5$-dimensional space, where the dimensions are the amounts of brown, dog, fox, gray or quick in the sentence. This is often called a word vector space. The issue is that each point sits at a different distance from the origin, and if we measure distances between the points then long sentences will be far from short sentences simply due to the number of words in them. Making long sentences and short sentences far apart may be desirable in some circumstances, but in that case we could simply count all words and not even bother with machine learning or sentence meaning.
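We can see the problem directly by checking the Euclidean length of each row vector in the data frame above (a quick sanity check of our own, not part of the pipeline):

# length (Euclidean norm) of each sentence vector;
# the lengths differ, so raw distances between points would be
# influenced by sentence length as much as by meaning
np.sqrt((df**2).sum(axis=1))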
To reduce the sentence length problem we can simply normalize the point coordinates. This is often called vector normalization: understand the point coordinates as a vector and divide it by the vector length, so that the resulting vector has length $1$.
$$ v_n = \frac{v}{\Vert v \Vert} $$

Which simply means dividing by the square root of the sum of squares, i.e. by the Euclidean distance (from the origin) we have seen so many times now.
The div method on a data frame allows us to specify the axis of division; it just calls np.divide behind the scenes.
df.div(np.sqrt((df**2).sum(axis=1)), axis=0)
This technique is called TF-IDF (term frequency, inverse document frequency). It can hold a good deal of the meaning of a sentence whilst allowing for numerical processing in ML algorithms.
In sklearn TF-IDF can be performed with the TfidfVectorizer. It uses sparse matrices by default, which we can convert back to dense NumPy matrices for display with todense.
from sklearn.feature_extraction.text import TfidfVectorizer
tr = TfidfVectorizer()
sparse = tr.fit_transform(sentences)
sparse.todense()
If you paid attention then you noticed that only some of the numbers from TfidfVectorizer match our own attempt at TF-IDF. The TF-IDF technique was originally proposed in search engine research, where there are several ways to compute the TF and IDF values. By default sklearn uses a logarithmic IDF with smoothing.
In sklearn, for each document and each term, we have

$$ tfidf = tf \cdot idf $$

just like in our implementation, but $idf$ is defined as

$$ idf = 1 + \log \frac{1 + n}{1 + df} $$

Where the $1$ at the beginning is called the smoothing, $df$ is the count of documents containing the word - just like our full word count - and $n$ is the number of documents - sentences in our case. We can adapt our own implementation to perform TF-IDF in this way.
d = {}
n = len(sentences)
for s in sentences:
    d[s] = {w: s.count(w) * (1 + np.log((1 + n) / (1 + word_count[w]))) for w in word_count}
df = pd.DataFrame(d).T
df.div(np.sqrt((df**2).sum(axis=1)), axis=0)
And we match the numbers from sklearn's TfidfVectorizer.
In summary, we can argue that TF-IDF is a technique to transform sentences, or even bigger documents, into numbers whilst keeping some of the meaning of those documents. In the technique $tf$ is how often a term appears in the document, and $df$ is how many documents contain the term.
Note that our own implementation is rather primitive. We did not account for the problem of duplicate words within the same document, and we did not account for punctuation in documents. This is because our sentences were rather simple. The sklearn implementation accounts for duplicate words by making three passes instead of two. It also performs tokenization that is more complex than our use of split; tokenization is the way of separating strings - be it sentences or full documents - into individual tokens (words).
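As a small illustration of tokenization (the example sentence below is our own), we can peek at the tokenizer that TfidfVectorizer builds; its default pattern keeps only tokens of two or more word characters, which already strips punctuation:

# build_tokenizer returns the callable used to split documents into tokens
tokenize = TfidfVectorizer().build_tokenizer()
tokenize("The quick brown fox, isn't it?")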
Finally, there also exists the problem of stemming: "is not" and "isn't" have the same meaning despite the fact that they are written differently. Moreover, the endings of words may change without changing the concrete meaning of the text; for example, whether one uses "do", "done" or "doing" makes little difference for our vector space, yet each is written in a different way. Stemming performs several linguistic tricks to reduce words to their stems - their pure meanings, without the conjugation.
In sklearn we do not account for stemming, but other libraries, such as nltk, have well tested stemmers.
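A minimal sketch of stemming with nltk's Porter stemmer (assuming nltk is installed; the example words are our own):

from nltk.stem import PorterStemmer

# reduce each word towards its stem; how aggressively the endings
# are removed depends on the stemmer chosen
stemmer = PorterStemmer()
[stemmer.stem(w) for w in ['do', 'done', 'doing', 'foxes']]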