07.04 Text Features

Machine Learning deals only with numeric data but not all data in the world is numeric. Two examples of non-numerical data that is meaningful for learning are: categorical features (e.g. animal species) and plain text (e.g. product reviews). There are tricks that allow us to deal with non-numerical data, these tricks are part of feature engineering.

For a start let's import a handful of things.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Feature Engineering and Selection

Dealing with non-numerical data is only a part of feature engineering, although is often the most common application that is called by this name. Actually feature engineering is not a collection of techniques but a generic name to define tasks performed around input data to our model. These include:

  • Modifying existing features - e.g. scaling
  • Selecting only a subset of features - e.g. removing highly correlated features
  • Building new features from existing ones - e.g. squaring features to get only positive values
  • Encoding features in a different representation - e.g. one-hot-encoding
  • Learning new features from data - e.g. huge neural networks fed with lots of data

The last example requires really huge amounts of data, hundreds of millions of samples.

There are plethora of texts on feature engineering an selection but perhaps the most important idea about building and selecting features in the current day and age is that no one has authoritative answers on feature engineering. There are well proven statistical techniques for feature selection but these require the knowledge of complete correlations across all combinations of features, often an infeasible task. Moreover these selection techniques are proven for linear relations across data and outputs, it has been proven that such techniques lose information about non-linear relations within the data.

The bottom line of feature engineering is that in the present times there is no single or combination of techniques that can substitute a human being and his capacity of understanding the data. Instead of relying on feature engineering techniques to build or select features one uses these techniques and builds sample models in order to gain understanding about the data. Then one uses that information about the data to construct his own pipelines feeding the data into the model. Despite advances, at the time of writing, feature engineering is more of a black art rather than engineering - and it is likely to stay that way for some time still.

Instead of building an authoritative method for extraction of good features from data for ML use. We will describe some common techniques and describe their meaning over the data.

Categorical Data

When we deal with the proper names of things or people we are most often dealing with categorical data. One example of such would be the list of British Cities we saw before.

In [2]:
city = ['Liverpool', 'Manchester', 'Cardiff']
country = ['England', 'England', 'Wales']
population2001 = [435500, 405300, 305353]
df = pd.DataFrame({'city': city,
                   'country': country,
                   'population 2001': population2001,
city country population 2001
0 Liverpool England 435500
1 Manchester England 405300
2 Cardiff Wales 305353

One can force the city and country columns to have a categorical type. In pandas that is a way of assigning numbers to a column, these numbers then reference a set of categorical labels.

This also means that the data now is completely numerical. We can take the numbers of categories and assign them to the columns. In pandas these are stored in cat.codes.

In [3]:
cat_df = df.copy()
cat_df.city = cat_df.city.astype('category')
cat_df.city = cat_df.city.cat.codes
cat_df.country = cat_df.country.astype('category')
cat_df.country = cat_df.country.cat.codes
city country population 2001
0 1 0 435500
1 2 0 405300
2 0 1 305353

Yet, that is not enough. Numerical values have an order, therefore we can test for inequality. Based on the data above we can say that:

$$\texttt{England} < \texttt{Wales}$$


$$\texttt{Manchester} > \texttt{Liverpool}$$

Unfortunately, apart from their use in rugby of football jokes, these inequalities are rather useless. Moreover, these inequalities are likely to confuse an ML algorithm. Instead we need to encode the data into a form that would disallow such inequalities one such form is called one-hot-encoding. In one-hot-encoding each sample has several features built from the categorical feature but only one of the columns contain a one, all other columns contain zeros.

In pandas get_dummies exists for this exact purpose, to build a one-hot-encoding from a categorical feature.

In [4]:
pd.get_dummies(df, prefix_sep='=')
population 2001 city=Cardiff city=Liverpool city=Manchester country=England country=Wales
0 435500 0 1 0 1 0
1 405300 0 0 1 1 0
2 305353 1 0 0 0 1

And this is data that we can feed into an ML technique without worrying about confusing it. That said, this representation can use huge amounts of memory if there is a big number of features. To alleviate the memory problem sklearn can perform one-hot-encoding on sparse matrices (from scipy), this way we only need to store the ones.

Textual Data

Plain, unorganized, text data present different challenges to transform into a numeric representation. For a start we cannot just one-hot-encode words because they may appear more than once in each sample. We could encode the presence of words in each sample but when distinguishing between samples certain words are certainly more important than others, e.g. we can safely assume that the word "the" will appear in almost every sample.

Let's try to encode some sentences in a reasonable fashion. And let us try to keep as much of the sentence meaning as we can.

In [5]:
sentences = [
    'quick brown fox',
    'brown dog',
    'gray fox',
    'gray dog',
['quick brown fox', 'brown dog', 'gray fox', 'gray dog', 'fox']

Quick Brown Fox

The phrase "The quick brown fox jumps over the lazy dog" is used to show characters of fonts since the printing press times. The sentence contains all letters used in the English alphabet and hence is a good visual evaluation of the shape of every letter in a font. This is contrary to "Lorem ipsum" which has little meaning. Whether "Lorem ipsum" text has been intentionally or accidentally altered to read as nonsensical Latin is still up for discussion.

For a start we may attempt to just count distinct words in ever sentence. We need two passes, one to figure out all available words and one to count these words in every sentence.

For simplicity we use Python dictionaries to build columns and then transpose the data frame.

In [6]:
from functools import reduce

all_words = reduce(lambda x, y: x + ' ' + y, sentences).split()
word_count = dict((k, all_words.count(k)) for k in sorted(all_words))
print('FULL WORD COUNT:', word_count)

d = {}
for s in sentences:
    d[s] = {w: s.count(w) for w in word_count}
FULL WORD COUNT: {'brown': 2, 'dog': 2, 'fox': 3, 'gray': 2, 'quick': 1}

brown dog fox gray quick
quick brown fox 1 0 1 0 1
brown dog 1 1 0 0 0
gray fox 0 0 1 1 0
gray dog 0 1 0 1 0
fox 0 0 1 0 0

That is a viable encoding, and it has the name term frequency or just TF. But we can do better.

Let's notice that in the column "quick" appears only in the first sentence, the word "fox" appears in the majority of sentences, and the remaining words appear in a similar fashion across the corpus. Where corpus is a just a fancy name for a dataset composed of text documents. From our analysis, if we use the word "quick" we can identify the sentence straight away but if we use the word "fox" we need more information. This is the idea of more important words we alluded to earlier.

The word "quick" caries more decision value than the word "fox". And we can see that these decision values are the inverse of the total word counts, where "fox" is the most common word and "quick" the rarest. The idea that comes to mind is that if we adjust the values in our data frame by the inverses of the total word counts we will carry the decision values into our encoding. This idea is called the inverse document frequency or IDF for short.

In [7]:
d = {}
for s in sentences:
    d[s] = {w: s.count(w) * (1/word_count[w]) for w in word_count}
df = pd.DataFrame(d).T
brown dog fox gray quick
quick brown fox 0.5 0.0 0.333333 0.0 1.0
brown dog 0.5 0.5 0.000000 0.0 0.0
gray fox 0.0 0.0 0.333333 0.5 0.0
gray dog 0.0 0.5 0.000000 0.5 0.0
fox 0.0 0.0 0.333333 0.0 0.0

This looks even better but there are still some problems. We did place our sentences as points in $5$ dimensional space, where the dimensions are the values of the amount of brown, dog, fox, gray or quick in the sentence. This is often called a word vector space. The issue is that each point is at a different distance from the origin, and if we measure distances between the points then long sentences will be far from short sentences simply due to the number of words in the sentence. Making long sentences and short sentences far apart may be desirable in some circumstances but we could simply count all words in that case and not even bother with machine learning or sentence meaning.

To reduce the sentence length problem we can simply normalize the point coordinates. This is often said to be vector normalization, understand point coordinates as a vector and divide by vector length. The resulting vector has length $1$.

$$ v_n = \frac{v}{\Vert v \Vert} $$

Which simply means divide by the square root of the sum of squares, i.e. by the euclidean distance we have seen so many times now. The div method on a data frame allows us to specify the axis of division, it just calls np.divide behind the scenes.

In [8]:
df.div(np.sqrt((df**2).sum(axis=1)), axis=0)
brown dog fox gray quick
quick brown fox 0.428571 0.000000 0.285714 0.000000 0.857143
brown dog 0.707107 0.707107 0.000000 0.000000 0.000000
gray fox 0.000000 0.000000 0.554700 0.832050 0.000000
gray dog 0.000000 0.707107 0.000000 0.707107 0.000000
fox 0.000000 0.000000 1.000000 0.000000 0.000000

Word Representation


And this technique is called TF-IDF (term frequency inverse document frequency) and can hold a good deal of the meaning of a sentence whilst it allows for numerical processing in ML algorithms.

In sklearn TF-IDF can be performed with the TfidfVectorizer. It uses sparse matrices by default, which we can convert back to NumPy arrays for display with dense.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tr = TfidfVectorizer()
sparse = tr.fit_transform(sentences)
matrix([[0.55681615, 0.        , 0.4622077 , 0.        , 0.69015927],
        [0.70710678, 0.70710678, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.63871058, 0.76944707, 0.        ],
        [0.        , 0.70710678, 0.        , 0.70710678, 0.        ],
        [0.        , 0.        , 1.        , 0.        , 0.        ]])

If you paid attention then you noticed that only some of the numbers from TfidfVectorizer match our own attempt at TF-IDF. The TF-IDF technique was originally proposed in search engine research, where we have several way to achieve TF and IDF values. By default sklearn uses a logarithmic IDF with smoothing.

In sklearn for each document we have for each term.

$$ v = tf \cdot idf $$

Just like in our implementation. But $idf$ is defined as.

$$ idf = 1 + log \frac{1 + n}{1 + df} $$

Where the $1$ at the beginning is called the smoothing $df$ is the count of documents with the word - just like out full word count, and $n$ the number of documents - sentences in our case. We can adapt our own implementation to perform TF-IDF in this way.

In [10]:
d = {}
n = len(sentences)
for s in sentences:
    d[s] = {w: s.count(w) * (1 + np.log( (1 + n) / (1 + word_count[w]))) for w in word_count}
df = pd.DataFrame(d).T
df.div(np.sqrt((df**2).sum(axis=1)), axis=0)
brown dog fox gray quick
quick brown fox 0.556816 0.000000 0.462208 0.000000 0.690159
brown dog 0.707107 0.707107 0.000000 0.000000 0.000000
gray fox 0.000000 0.000000 0.638711 0.769447 0.000000
gray dog 0.000000 0.707107 0.000000 0.707107 0.000000
fox 0.000000 0.000000 1.000000 0.000000 0.000000

And we match the numbers from sklearn's TfidfVectorizer.

In summary we can argue the TF-IDF is a technique to transform sentences or even bigger documents into numbers and still keep some of the meaning of those documents. In the technique $tf$ is how often a term appears in the document, and $df$ is how many documents contain the term.

Note that our own implementation is rather primitive. We did not account for the problem of duplicate words within the same document. And we did not account for punctuation in documents. This is because our sentences were rather simple. The sklearn implementation accounts for duplicate words by making three passes instead of two. Then it performs tokenization more complex than our use of split; tokenization is the way of separating strings - be it sentences or full documents - into individual tokens (words).

Finally there also exists the problem of stemming: where "is not" and "isn't" have the same meaning despite the fact that are written differently. Moreover ends of words may change without changing the concrete meaning of the text, for example, whether one uses "do", "done" or "doing" has little difference for our vector space but is again written in different ways. Stemming performs several linguistic tricks to reduce words to their stems, pure meanings, without the conjugation. In sklearn we do not account for stemming but other libraries such as nltk have well tested stemmers.