10.04 Artificial Neural Networks

If we place several perceptrons (neurons) together we can slice the search space with several lines. This is called an Artificial Neural Network (ANN, or just Neural Network, NN, for short). It is not trivial to train these neurons, after all we do not know what the output of most neurons should be.

For a first let's just use sklearn's MLPClassifier (multi-layer-perceptron-classifier) as a model and see if we can make it work. We take all model evaluation common suspects.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix

Lieutenant Columbo


Glass Dataset

The forensics of glass composition can reveal the provenience of a piece of glass, yet different brands of glass are slightly different from each other. We will build a model which will classify glass based on its composition, and will use online learning so that we are ready to learn from new data at any time. If anyone asks us to be police detectives someday we now have a head start.

This is a dataset at UCI machine learning repository, we need to build our load_glass function. The classes are numbered from $1$ in the dataset, and class $3$ is missing, we fix both issues during the data load.

In [2]:
import os
import sys
import requests
from sklearn import datasets
from sklearn.utils import Bunch

def load_glass():
    glass_dir = 'uci_glass'
    data_dir = datasets.get_data_home()
    data_path = os.path.join(data_dir, glass_dir, 'glass.data')
    descr_path = os.path.join(data_dir, glass_dir, 'glass.names')
    glass_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data'
    glass_descr = 'https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.names'
    os.makedirs(os.path.join(data_dir, glass_dir), exist_ok=True)
        with open(descr_path, 'r') as f:
            descr = f.read()
    except IOError:
        print('Downloading file from', glass_descr, file=sys.stderr)
        r = requests.get(glass_descr)
        with open(descr_path, 'w') as f:
        descr = r.text
        data = pd.read_csv(data_path, delimiter=',', header=None).values
    except IOError:
        print('Downloading file from', glass_data, file=sys.stderr)
        r = requests.get(glass_data)
        with open(data_path, 'w') as f:
        data = pd.read_csv(data_path, delimiter=',', header=None).values
    target = data[:, 10].astype(np.int).copy()
    target[target > 3] -= 1  # fix non-existent classes
    target -= 1              # fix class numbering
    return Bunch(DESCR=descr,
                 data=data[:, :10].copy(),
                 feature_names=['ID', 'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe'],

glass = load_glass()
1. Title: Glass Identification Database

2. Sources:
    (a) Creator: B. German
        -- Central Research Establishment
           Home Office Forensic Science Service
           Aldermaston, Reading, Berkshire RG7 4PN
    (b) Donor: Vina Spiehler, Ph.D., DABFT
               Diagnostic Products Corporation
               (213) 776-0180 (ext 3014)
    (c) Date: September, 1987

3. Past Usage:
    -- Rule Induction in Forensic Science
       -- Ian W. Evett and Ernest J. Spiehler
       -- Central Research Establishment
          Home Office Forensic Science Service
          Aldermaston, Reading, Berkshire RG7 4PN
       -- Unknown technical note number (sorry, not listed here)
       -- General Results: nearest neighbor held its own with respect to the
             rule-based system

4. Relevant Information:n
      Vina conducted a comparison test of her rule-based system, BEAGLE, the
      nearest-neighbor algorithm, and discriminant analysis.  BEAGLE is 
      a product available through VRS Consulting, Inc.; 4676 Admiralty Way,
      Suite 206; Marina Del Ray, CA 90292 (213) 827-7890 and FAX: -3189.
      In determining whether the glass was a type of "float" glass or not,
      the following results were obtained (# incorrect answers):

             Type of Sample                            Beagle   NN    DA
             Windows that were float processed (87)     10      12    21
             Windows that were not:            (76)     19      16    22

      The study of classification of types of glass was motivated by 
      criminological investigation.  At the scene of the crime, the glass left
      can be used as evidence...if it is correctly identified!

5. Number of Instances: 214

6. Number of Attributes: 10 (including an Id#) plus the class attribute
   -- all attributes are continuously valued

7. Attribute Information:
   1. Id number: 1 to 214
   2. RI: refractive index
   3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as 
                  are attributes 4-10)
   4. Mg: Magnesium
   5. Al: Aluminum
   6. Si: Silicon
   7. K: Potassium
   8. Ca: Calcium
   9. Ba: Barium
  10. Fe: Iron
  11. Type of glass: (class attribute)
      -- 1 building_windows_float_processed
      -- 2 building_windows_non_float_processed
      -- 3 vehicle_windows_float_processed
      -- 4 vehicle_windows_non_float_processed (none in this database)
      -- 5 containers
      -- 6 tableware
      -- 7 headlamps

8. Missing Attribute Values: None

Summary Statistics:
Attribute:   Min     Max      Mean     SD      Correlation with class
 2. RI:       1.5112  1.5339   1.5184  0.0030  -0.1642
 3. Na:      10.73   17.38    13.4079  0.8166   0.5030
 4. Mg:       0       4.49     2.6845  1.4424  -0.7447
 5. Al:       0.29    3.5      1.4449  0.4993   0.5988
 6. Si:      69.81   75.41    72.6509  0.7745   0.1515
 7. K:        0       6.21     0.4971  0.6522  -0.0100
 8. Ca:       5.43   16.19     8.9570  1.4232   0.0007
 9. Ba:       0       3.15     0.1750  0.4972   0.5751
10. Fe:       0       0.51     0.0570  0.0974  -0.1879

9. Class Distribution: (out of 214 total instances)
    -- 163 Window glass (building windows and vehicle windows)
       -- 87 float processed  
          -- 70 building windows
          -- 17 vehicle windows
       -- 76 non-float processed
          -- 76 building windows
          -- 0 vehicle windows
    -- 51 Non-window glass
       -- 13 containers
       -- 9 tableware
       -- 29 headlamps

The support of the classes is quite different. This isn't an easy problem to model.

In [3]:
1    76
0    70
5    29
2    17
3    13
4     9
dtype: int64

Without knowing much about this model, let's try to use a neural network to classify this dataset. We know that the neural network is a good amount of interconnected perceptrons and that we perturb the perceptron weights based on errors in classification to achieve convergence.

This is a real dataset, we may as well try to do things properly and take a test set out. This may prove a tad difficult due to the big differences in support. One could we may do is to select a good subset of classes and split train and test sets from these. But we will attempt to use the full dataset, we pass stratify= to train_test_split which will this way attempt to maintain class support the same across the training and test sets.

In [4]:
xtrain, xtest, ytrain, ytest = train_test_split(glass.data, glass.target, test_size=0.2, stratify=glass.target)

We also know that sklearn provides us with a simple neural network class, no harm in trying it out.

By now we can guess what max_iter, alpha= and tol= arguments are. The solver=sgd is stochastic gradient descent, where the default is an optimizer with many other heuristics called adam. The hidden_layer_sizes=(20,) means that we have a neural network of $10$ input perceptrons/neurons in a layer, the number of features in or dataset; $20$ perceptrons/neurons in a hidden layer that do the bulk of the processing; and $6$ perceptrons/neurons in an output layer, one for each class. The activetion= is the activation function of the perceptrons to which we will return.

In [5]:
net = MLPClassifier(activation='relu', hidden_layer_sizes=(20,),
                    alpha=0.01, tol=0.001,
                    max_iter=500, solver='sgd')
param_dict = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0],
    'tol': [0.001, 0.01, 0.1],
grid = GridSearchCV(net, param_dict, cv=5)
grid.fit(xtrain, ytrain)
grid.best_estimator_, grid.best_score_
(MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
               beta_2=0.999, early_stopping=False, epsilon=1e-08,
               hidden_layer_sizes=(20,), learning_rate='constant',
               learning_rate_init=0.001, max_fun=15000, max_iter=500,
               momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
               power_t=0.5, random_state=None, shuffle=True, solver='sgd',
               tol=0.001, validation_fraction=0.1, verbose=False,

Some models may not converge but the cross-validation should root them out. And the best model should be a converged network.

The score isn't bad but a classification report may give a better view of how a multilabel classifier performs.

In [6]:
y_hat = grid.best_estimator_.predict(xtest)
print(classification_report(ytest, y_hat, target_names=glass.target_names, zero_division=0))
                             precision    recall  f1-score   support

    windows_float_processed       1.00      1.00      1.00        14
windows_non_float_processed       0.94      1.00      0.97        15
            vehicle_windows       0.67      0.67      0.67         3
                 containers       0.00      0.00      0.00         3
                  tableware       0.00      0.00      0.00         2
                  headlamps       0.67      1.00      0.80         6

                   accuracy                           0.86        43
                  macro avg       0.55      0.61      0.57        43
               weighted avg       0.79      0.86      0.82        43

When the support is low, we may have cases with zero precision or recall, which in turn will make a zero F1 score due to a division by zero. That is the reason we added zero_division=0 above to prevent zero division warnings.

And a confusion matrix may give further information.

In [7]:
mat = confusion_matrix(ytest, y_hat)
fig, ax = plt.subplots(figsize=(12, 12))
cax = ax.matshow(mat, cmap='summer')
ticks = np.arange(0, len(glass.target_names))
ax.set_xticklabels(glass.target_names, rotation=45, ha='right')
ax.set_yticklabels(glass.target_names, rotation=45, ha='right')
ax.set_ylabel('true label')
ax.set_xlabel('predicted label')

for i in range(mat.shape[0]):
    for j in range(mat.shape[1]):
        ax.text(j, i, mat[i, j], ha='center', va='center')

From the above we learned one of the most important details about neural networks: since weights are updated based on classification errors between expected and predicted classes, if we have classes with very little support the network will often fail at classifying them.

As we get collect more data the effect of classes with smaller support diminishes. Neural Nets work in similar fashion to us, humans, you cannot show a person 2 paintings by Titian and then 200 painting not by Titian and expect her to be an expert at identifying Titian's work. ANNs work best the more data you have and the more diverse the data is.

For the glass data we'll leave as an exercise trying out a Random Forest, that model is very good at digging out classes with small support (contrary to decision trees!). But for now let's understand more about ANNs.