10.03 Forest Cover Type

In sklearn we have SGDClassifier, which performs SGD to achieve online learning on top of linear SVMs, logistic regression, or a perceptron. The SGDRegressor performs a linear regression as online learning. Note that this means that we can only find solutions to problems that can be approximated linearly. For non-linear online learning we need neural networks (which we will see soon).
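
For instance, a minimal sketch (assuming the sklearn version used throughout this notebook, where the logistic loss is spelled 'log') of how the loss argument selects the underlying linear model:

from sklearn.linear_model import SGDClassifier, SGDRegressor

svm_like = SGDClassifier(loss='hinge')              # linear SVM fitted with SGD
logistic_like = SGDClassifier(loss='log')           # logistic regression fitted with SGD
perceptron_like = SGDClassifier(loss='perceptron')  # perceptron fitted with SGD
linear_like = SGDRegressor()                        # squared loss, i.e. linear regression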

We import the SGDClassifier and the common preprocessors and validation techniques.

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, f1_score

Forest Cover Type Dataset

For a change let's take on a dataset that is not present inside sklearn. The forest cover type dataset contains cartographic data about types of forests in the Roosevelt National Forest in Colorado. Since we are thinking of walking blindfolded around mountain ranges we may as well use a dataset of mountain forest coverage.

First let's define a couple of details about the features of the dataset. It has several continuous features and then several categorical ones. The categorical features are already one-hot-encoded for us. The wild= areas are quite hidden in the dataset description, so for easier access we add comments to the code below describing the features we will load.

In [2]:
continuous = [
    'Elevation',
    'Aspect',
    'Slope',
    'HHydro',
    'VHydro',
    'Road',
    'Shade_9am',
    'Shade_Noon',
    'Shade_3pm',
    'Fire',
]
categorical = [
    'wild=1',  # Rawah Wilderness Area
    'wild=2',  # Neota Wilderness Area
    'wild=3',  # Comanche Peak Wilderness Area
    'wild=4',  # Cache la Poudre Wilderness Area
    'soil=1','soil=2','soil=3','soil=4','soil=5','soil=6','soil=7','soil=8','soil=9','soil=10',
    'soil=11','soil=12','soil=13','soil=14','soil=15','soil=16','soil=17','soil=18','soil=19','soil=20',
    'soil=21','soil=22','soil=23','soil=24','soil=25','soil=26','soil=27','soil=28','soil=29','soil=30',
    'soil=31','soil=32','soil=33','soil=34','soil=35','soil=36','soil=37','soil=38','soil=39','soil=40',
]
columns = continuous + categorical + ['label']
target_names = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
                'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']

Ponderosa Pine Forest Cover

ol-forest-ponderosa-pine.svg

Based on the features we can then classify an area of forest cover into one of the seven labels. The set has more than half a million rows of data, making it a reasonably sized dataset.

To keep with the spirit of what we have been doing until now we will write a function to actually retrieve the dataset. We will duplicate the sklearn convention for dataset loading and construct our own load_cover_type function. The function not only allows for easy download of the dataset but also caches the downloaded data on the file system, so one does not need to download it again the next time. A couple of things worth mentioning: the dataset is taken from the University of California Irvine Machine Learning Repository and is kept within their archives g-zipped. The decompression code is a tad obscure but it works on several corner cases. Also, the labels in the dataset start from $1$; we adjust them to start from $0$ during the dataset load.

This code is pretty much what sklearn does every time we call one of its load_* procedures to get a dataset. The description of the forest cover dataset is very extensive, so we print only the descriptions of the main features to save space.

In [3]:
import os
import sys
import zlib
import requests
from sklearn import datasets
from sklearn.utils import Bunch


def load_cover_type():
    cov_dir = 'uci_cover_type'
    data_dir = datasets.get_data_home()
    data_path = os.path.join(data_dir, cov_dir, 'covtype.data')
    descr_path = os.path.join(data_dir, cov_dir, 'covtype.info')
    cov_data_gz = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
    cov_descr = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info'
    os.makedirs(os.path.join(data_dir, cov_dir), exist_ok=True)
    try:
        with open(descr_path, 'r') as f:
            descr = f.read()
    except IOError:
        print('Downloading file from', cov_descr, file=sys.stderr)
        r = requests.get(cov_descr)
        with open(descr_path, 'w') as f:
            f.write(r.text)
        descr = r.text
        r.close()
    try:
        data = pd.read_csv(data_path, delimiter=',', names=columns)
    except IOError:
        print('Downloading file from', cov_data_gz, file=sys.stderr)
        r = requests.get(cov_data_gz)
        cov_data = zlib.decompress(r.content, wbits=16+zlib.MAX_WBITS)  # 16+MAX_WBITS makes zlib accept the gzip header
        cov_data = cov_data.decode('utf8')
        with open(data_path, 'w') as f:
            f.write(cov_data)
        r.close()
        data = pd.read_csv(data_path, delimiter=',', names=columns)
    X = data[continuous + categorical].values
    y = data['label'].values - 1
    return Bunch(DESCR=descr,
                 data=X,
                 feature_names=columns[:-1],
                 feature_continuous=continuous,
                 feature_categorical=categorical,
                 target=y,
                 target_names=target_names)


covtype = load_cover_type()
print(covtype.DESCR[6923:8554])
print()
print(covtype.DESCR[12373:12713])
Name                                     Data Type    Measurement                       Description

Elevation                               quantitative    meters                       Elevation in meters
Aspect                                  quantitative    azimuth                      Aspect in degrees azimuth
Slope                                   quantitative    degrees                      Slope in degrees
Horizontal_Distance_To_Hydrology        quantitative    meters                       Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology          quantitative    meters                       Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways         quantitative    meters                       Horz Dist to nearest roadway
Hillshade_9am                           quantitative    0 to 255 index               Hillshade index at 9am, summer solstice
Hillshade_Noon                          quantitative    0 to 255 index               Hillshade index at noon, summer soltice
Hillshade_3pm                           quantitative    0 to 255 index               Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points      quantitative    meters                       Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)      qualitative     0 (absence) or 1 (presence)  Wilderness area designation
Soil_Type (40 binary columns)           qualitative     0 (absence) or 1 (presence)  Soil Type designation
Cover_Type (7 types)                    integer         1 to 7                       Forest Cover Type designation

Forest Cover Type Classes:	1 -- Spruce/Fir
                                2 -- Lodgepole Pine
                                3 -- Ponderosa Pine
                                4 -- Cottonwood/Willow
                                5 -- Aspen
                                6 -- Douglas-fir
                                7 -- Krummholz

A quick look at the dataset is always a good idea. One can see two of the three parts of the set: the continuous features and the one-hot-encoded categorical features. The forest cover type labels form the third part and are kept separately in the target attribute.

In the loading function we have also added a distinction between continuous and categorical features. This concept is often useful as one may need to scale continuous features but scaling one-hot-encoded features makes little sense.

In [4]:
df = pd.DataFrame(covtype.data, columns=covtype.feature_names)
df
Out[4]:
Elevation Aspect Slope HHydro VHydro Road Shade_9am Shade_Noon Shade_3pm Fire ... soil=31 soil=32 soil=33 soil=34 soil=35 soil=36 soil=37 soil=38 soil=39 soil=40
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 0 0
1 2590 56 2 212 -6 390 220 235 151 6225 ... 0 0 0 0 0 0 0 0 0 0
2 2804 139 9 268 65 3180 234 238 135 6121 ... 0 0 0 0 0 0 0 0 0 0
3 2785 155 18 242 118 3090 238 238 122 6211 ... 0 0 0 0 0 0 0 0 0 0
4 2595 45 2 153 -1 391 220 234 150 6172 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
581007 2396 153 20 85 17 108 240 237 118 837 ... 0 0 0 0 0 0 0 0 0 0
581008 2391 152 19 67 12 95 240 237 119 845 ... 0 0 0 0 0 0 0 0 0 0
581009 2386 159 17 60 7 90 236 241 130 854 ... 0 0 0 0 0 0 0 0 0 0
581010 2384 170 15 60 5 90 230 245 143 864 ... 0 0 0 0 0 0 0 0 0 0
581011 2383 165 13 60 4 67 231 244 141 875 ... 0 0 0 0 0 0 0 0 0 0

581012 rows × 54 columns

Cottonwood Forest Cover

ol-forest-cottonwood.svg

Half a million rows makes for a good-sized dataset. It perhaps does not require online learning on most machines, but on some it might.

That said, for presentation purposes it may take too long to run our code on the full dataset. Instead we will take two forest types: Aspen (label $4$) and Douglas-fir (label $5$), and use only the part of the dataset with these labels. Note how we change the labels to be $0$ and $1$ for a classification between only two forest types.

In [5]:
X = covtype.data
y = covtype.target
X = X[(y == 4) | (y == 5)].copy()
y = y[(y == 4) | (y == 5)].copy()
y[y == 4] = 0
y[y == 5] = 1
df = pd.DataFrame(X, columns=covtype.feature_names)
df
Out[5]:
Elevation Aspect Slope HHydro VHydro Road Shade_9am Shade_Noon Shade_3pm Fire ... soil=31 soil=32 soil=33 soil=34 soil=35 soil=36 soil=37 soil=38 soil=39 soil=40
0 2596 51 3 258 0 510 221 232 148 6279 ... 0 0 0 0 0 0 0 0 0 0
1 2590 56 2 212 -6 390 220 235 151 6225 ... 0 0 0 0 0 0 0 0 0 0
2 2595 45 2 153 -1 391 220 234 150 6172 ... 0 0 0 0 0 0 0 0 0 0
3 2606 45 7 270 5 633 222 225 138 6256 ... 0 0 0 0 0 0 0 0 0 0
4 2605 49 4 234 7 573 222 230 144 6228 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
26855 2596 36 22 210 -3 514 212 186 100 1200 ... 0 0 0 0 0 0 0 0 0 0
26856 2588 43 22 212 -2 511 218 185 91 1201 ... 0 0 0 0 0 0 0 0 0 0
26857 2579 48 21 218 0 510 222 190 92 1203 ... 0 0 0 0 0 0 0 0 0 0
26858 2571 55 21 228 -2 492 227 190 86 1206 ... 0 0 0 0 0 0 0 0 0 0
26859 2638 147 13 210 38 524 237 238 130 1176 ... 0 0 0 0 0 0 0 0 0 0

26860 rows × 54 columns

Willow Forest Cover

ol-forest-willow.svg

Looking at the data we can easily see that the continuous features have very distinct value ranges, and will therefore require scaling.

We have the columns that are continuous in an attribute of the loaded dataset. If we now scale only those columns and place them back together with the categorical columns we have a dataset we can work with.

In [6]:
sc = StandardScaler()
X_cont = sc.fit_transform(df[covtype.feature_continuous].values)
X_cat = df[covtype.feature_categorical].values
X = np.c_[X_cont, X_cat]
X.shape
Out[6]:
(26860, 54)
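
As an aside, the same selective scaling can be done inside a single transformer with sklearn's ColumnTransformer (available in sklearn 0.20 and later); a rough sketch, not used in the remainder of this notebook, would be:

from sklearn.compose import ColumnTransformer

scale_continuous = ColumnTransformer([
    # scale only the continuous columns, pass the one-hot-encoded ones through untouched
    ('continuous', StandardScaler(), covtype.feature_continuous),
], remainder='passthrough')
# column order in the output: scaled continuous first, then the passthrough remainder,
# i.e. the same order as np.c_[X_cont, X_cat] above
X_alt = scale_continuous.fit_transform(df)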

We have a real dataset, so we should treat it as a real problem. We take out a test set which we will not touch.

In [7]:
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)

And we train on the training set with cross-validation. The learning rate will be constant, with a value of eta0=0.001. Note also that SGD is an optimizer, not a model; the loss= argument chooses the model to optimize with SGD.

In [8]:
model = make_pipeline(
    PCA(n_components=10),
    SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
                  learning_rate='constant', eta0=0.001))
param_grid = {
    'sgdclassifier__alpha': [0.01, 0.1],
    'sgdclassifier__tol': [0.01, 0.1],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(xtrain, ytrain)
grid.best_score_, grid.best_estimator_
Out[8]:
(0.9438756118118056,
 Pipeline(memory=None,
          steps=[('pca',
                  PCA(copy=True, iterated_power='auto', n_components=20,
                      random_state=None, svd_solver='auto', tol=0.0,
                      whiten=False)),
                 ('sgdclassifier',
                  SGDClassifier(alpha=0.01, average=False, class_weight=None,
                                early_stopping=False, epsilon=0.1, eta0=0.001,
                                fit_intercept=True, l1_ratio=0.15,
                                learning_rate='constant', loss='log',
                                max_iter=500, n_iter_no_change=5, n_jobs=None,
                                penalty='l2', power_t=0.5, random_state=None,
                                shuffle=True, tol=0.1, validation_fraction=0.1,
                                verbose=0, warm_start=False))],
          verbose=False))

Aspen Forest Cover

ol-forest-aspen.svg

Finally we evaluate on the test set, and we get quite good scores for both classes.

In [9]:
names = ['Aspen', 'Douglas-fir']
yfit = grid.best_estimator_.predict(xtest)
print(classification_report(ytest, yfit, target_names=names))
              precision    recall  f1-score   support

       Aspen       0.92      0.92      0.92      1930
 Douglas-fir       0.95      0.96      0.96      3442

    accuracy                           0.94      5372
   macro avg       0.94      0.94      0.94      5372
weighted avg       0.94      0.94      0.94      5372

Douglas-Fir Forest Cover

ol-forest-douglas-fir.svg

Hey! That wasn't online learning.

Right, it was not. The full dataset may need online learning but our sample of two forest types fits fine in memory. Yet we can simulate a dataset that is too big by splitting it into two sets of data and training SGD in an online fashion. We will misuse the train_test_split function for this.

In [10]:
cov1, cov2, ycov1, ycov2 = train_test_split(X, y, test_size=0.9)

We will also argue that we are facing a real world problem in which we only have a tiny bit of data ($10\%$) right now. Whilst we wait for our rangers to collect more data for us we need to make do with what we already have and build a model. Later, when our forest rangers return with a lot more data, we will be able to use online learning to update the model.

In sklearn there are two ways of using online learning. One is to call a method named partial_fit, instead of fit, which updates the parameters instead of fitting completely new ones. The other is to pass warm_start=True, which forces fit to always behave like partial_fit. Both approaches only work on models that support online learning.
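
The cell below takes the warm_start=True route inside a pipeline. For completeness, a minimal sketch of the partial_fit route on a bare SGDClassifier (not run here; the classes= argument is required on the first call because a single batch may not contain every class) could look like:

sgd = SGDClassifier(loss='log', penalty='l2', alpha=0.01,
                    learning_rate='constant', eta0=0.001)
# first batch of data: must list every class that will ever appear
sgd.partial_fit(cov1, ycov1, classes=np.unique(y))
# later batch: updates the coefficients learned so far
sgd.partial_fit(cov2, ycov2)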

In [11]:
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(cov1, ycov1, test_size=0.2)
model = make_pipeline(
    PCA(n_components=10),
    SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
                  learning_rate='constant', eta0=0.001, warm_start=True))
model.fit(xtrain1, ytrain1)
yfit = model.predict(xtest1)
f1_score(ytest1, yfit)
Out[11]:
0.957787481804949

We know a tiny bit of the data and we can, more or less, classify that. Note that since we know we have rather similar scores for both classes on this dataset, we are fine using a single score here.

But if we try to classify the data we do not know about we may run into trouble.

In [12]:
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(cov2, ycov2, test_size=0.2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
Out[12]:
0.9489121676067687

Our model without the extra data provided later by the rangers works quite reasonably.

But we can train with the extra data and see if things improve.

In [13]:
model.fit(xtrain2, ytrain2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
Out[13]:
0.9572925060435133

Krummholz Forest Cover

ol-forest-krummholz.svg

Extra: Non-GD Optimisation

We saw SGD and said that it is the most often used optimization technique. But what are the others? One technique is simulated annealing, which works by slow cooling. Simulated annealing tries random neighbors at each iteration and keeps track of the point with the lowest value of the error/cost function. The search space for a new neighbor (i.e. the maximum distance from the lowest point found so far) reduces at each iteration. This is similar to SGD with a decreasing learning rate.
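
To make the idea concrete, here is a toy sketch of that shrinking-neighborhood search on a made-up one-dimensional cost function (full simulated annealing also accepts occasional uphill moves, which we omit here):

import numpy as np

def cost(w):
    # toy cost function with its minimum at w = 3
    return (w - 3) ** 2 + 2


def anneal(start, steps=1000, radius=10.0, cooling=0.99):
    rng = np.random.default_rng(42)
    best_w, best_cost = start, cost(start)
    for _ in range(steps):
        # try a random neighbor within the current search radius
        candidate = best_w + rng.uniform(-radius, radius)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best_w, best_cost = candidate, candidate_cost
        # cool down: shrink the neighborhood, akin to a decreasing learning rate
        radius *= cooling
    return best_w, best_cost


anneal(start=-40.0)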

But there are more techniques. Notably swarm intelligence provides us with several optimization algorithms:

  • particle swarm
  • bat swarm
  • cuckoo search

And genetic algorithms also work reasonably in an online learning scenario.