In sklearn
we have SGDClassifier
which will perform SGD
to achieve online learning on top of linear SVMs, logistic regression, or a perceptron.
The SGDRegressor
performs a linear regression as online learning.
Note that this means that we can only find solutions to problems
that can be approximated linearly.
For non-linear online learning we need neural networks (which we will see soon).
We import the SGDClassifier
and the common preprocessors and validation techniques.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, f1_score
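As a side note on the regression counterpart, here is a minimal sketch of SGDRegressor used as online linear regression. The synthetic data, batch size, and learning rate below are made up purely for illustration and are not part of the forest cover example that follows.
from sklearn.linear_model import SGDRegressor

# made-up linear data: y = 3*x + noise (illustration only)
rng = np.random.RandomState(42)
x_reg = rng.uniform(-1, 1, (1000, 1))
y_reg = 3*x_reg[:, 0] + rng.normal(scale=0.1, size=1000)

reg = SGDRegressor(learning_rate='constant', eta0=0.01)
# feed the data in small batches, updating the model with each batch
for i in range(0, 1000, 100):
    reg.partial_fit(x_reg[i:i+100], y_reg[i:i+100])
reg.coef_, reg.intercept_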
For a change let's take on a dataset that is not present inside sklearn.
The forest cover dataset contains cartographic data about types of forests
in the Roosevelt National Forest in Colorado.
Since we are thinking of walking blindfolded around mountain ranges
we may as well use a dataset of mountain forest coverage.
First let's define a couple of details about the features of the dataset.
It has several continuous features and then several categorical ones.
The categorical features are already one-hot-encoded for us.
The wild=
columns denote wilderness areas and are quite hidden in the dataset description;
for easier access we added comments to the code below describing the features we will load.
continuous = [
'Elevation',
'Aspect',
'Slope',
'HHydro',      # horizontal distance to nearest water (hydrology)
'VHydro',      # vertical distance to nearest water (hydrology)
'Road',        # horizontal distance to nearest roadway
'Shade_9am',   # hillshade index at 9am
'Shade_Noon',  # hillshade index at noon
'Shade_3pm',   # hillshade index at 3pm
'Fire',        # horizontal distance to nearest wildfire ignition point
]
categorical = [
'wild=1', # Rawah Wilderness Area
'wild=2', # Neota Wilderness Area
'wild=3', # Comanche Peak Wilderness Area
'wild=4', # Cache la Poudre Wilderness Area
'soil=1','soil=2','soil=3','soil=4','soil=5','soil=6','soil=7','soil=8','soil=9','soil=10',
'soil=11','soil=12','soil=13','soil=14','soil=15','soil=16','soil=17','soil=18','soil=19','soil=20',
'soil=21','soil=22','soil=23','soil=24','soil=25','soil=26','soil=27','soil=28','soil=29','soil=30',
'soil=31','soil=32','soil=33','soil=34','soil=35','soil=36','soil=37','soil=38','soil=39','soil=40',
]
columns = continuous + categorical + ['label']
target_names = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']
Based on the features we can then classify an area of forest cover into one of the seven labels. The set has more than half a million rows of data, making it a reasonably sized dataset.
To keep in the spirit of what we have been doing until now we will
write a function to actually retrieve the dataset.
We will duplicate the sklearn
convention for dataset loading and construct
our load_cover_type
function.
The function not only allows for easy download of the dataset
but also caches the downloaded data on the file system,
so one does not need to download it the next time.
A couple of things are worth mentioning:
the dataset is taken from the
University of California Irvine Machine Learning Repository
and is kept within their archives gzipped.
The decompression code is a tad obscure but it works on several corner cases.
Also, the labels in the dataset start from $1$;
we adjust them to start from $0$ during the dataset load.
This code is pretty much what sklearn
does every time we call one
of its load_*
procedures to get a dataset.
The description of the forest cover dataset is very extensive;
we print only the descriptions of the main features to save space.
import os
import sys
import zlib
import requests
from sklearn import datasets
from sklearn.utils import Bunch
def load_cover_type():
cov_dir = 'uci_cover_type'
data_dir = datasets.get_data_home()
data_path = os.path.join(data_dir, cov_dir, 'covtype.data')
descr_path = os.path.join(data_dir, cov_dir, 'covtype.info')
cov_data_gz = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
cov_descr = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info'
os.makedirs(os.path.join(data_dir, cov_dir), exist_ok=True)
try:
with open(descr_path, 'r') as f:
descr = f.read()
except IOError:
print('Downloading file from', cov_descr, file=sys.stderr)
r = requests.get(cov_descr)
with open(descr_path, 'w') as f:
f.write(r.text)
descr = r.text
r.close()
try:
data = pd.read_csv(data_path, delimiter=',', names=columns)
except IOError:
print('Downloading file from', cov_data_gz, file=sys.stderr)
r = requests.get(cov_data_gz)
cov_data = zlib.decompress(r.content, wbits=16+zlib.MAX_WBITS) # obscure but works
cov_data = cov_data.decode('utf8')
with open(data_path, 'w') as f:
f.write(cov_data)
r.close()
data = pd.read_csv(data_path, delimiter=',', names=columns)
X = data[continuous + categorical].values
y = data['label'].values - 1
return Bunch(DESCR=descr,
data=X,
feature_names=columns[:-1],
feature_continuous=continuous,
feature_categorical=categorical,
target=y,
target_names=target_names)
covtype = load_cover_type()
print(covtype.DESCR[6923:8554])
print()
print(covtype.DESCR[12373:12713])
A quick look at the dataset is always a good idea. One can see all three parts of the set: the continuous features, the one-hot-encoded categorical features, and (kept separately in the target attribute) the forest cover type labels.
In the loading function we have also added a distinction between continuous and categorical features. This concept is often useful as one may need to scale continuous features but scaling one-hot-encoded features makes little sense.
df = pd.DataFrame(covtype.data, columns=covtype.feature_names)
df
Half a million rows is a good dataset. It perhaps does not require online learning on most machines but on some it might.
That said, for presentation purposes it may take too long to run our code on the full dataset. Instead we will take two forest types: Aspen (label $4$) and Douglas-fir (label $5$), and use only the part of the dataset with these labels. Note how we change the labels to be $0$ and $1$ for a classification between only two forest types.
X = covtype.data
y = covtype.target
X = X[(y == 4) | (y == 5)].copy()
y = y[(y == 4) | (y == 5)].copy()
y[y == 4] = 0
y[y == 5] = 1
df = pd.DataFrame(X, columns=covtype.feature_names)
df
Looking at the data we can easily see that the continuous features have very distinct value ranges, and will therefore require scaling.
We have the columns that are continuous in an attribute of the loaded dataset. If we now scale only those columns and place them back together with the categorical columns we have a dataset we can work with.
sc = StandardScaler()
X_cont = sc.fit_transform(df[covtype.feature_continuous].values)
X_cat = df[covtype.feature_categorical].values
X = np.c_[X_cont, X_cat]
X.shape
We have a real dataset, so we should treat it as a real problem. We take out a test set which we will not touch.
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
And train on the training set with cross-validation.
The learning rate we will use will be constant, with a value of eta0=0.001.
Note also that SGD is an optimizer, not a model.
The loss=
argument chooses the model to optimize with SGD.
model = make_pipeline(
PCA(n_components=10),
SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
learning_rate='constant', eta0=0.001))
param_grid = {
'sgdclassifier__alpha': [0.01, 0.1],
'sgdclassifier__tol': [0.01, 0.1],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(xtrain, ytrain)
grid.best_score_, grid.best_estimator_
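As an aside, to see that SGD is the optimizer and loss= picks the model: swapping loss='log' for loss='hinge' in the very same pipeline makes SGD optimize a linear SVM instead of a logistic regression. A quick sketch, reusing the hand-picked hyperparameters from above rather than the grid search results:
# the same pipeline but with loss='hinge', i.e. SGD now optimizes a linear SVM
svm_model = make_pipeline(
    PCA(n_components=10),
    SGDClassifier(loss='hinge', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
                  learning_rate='constant', eta0=0.001))
svm_model.fit(xtrain, ytrain)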
Finally, we evaluate on the test set. And we have quite good scores for both classes.
names = ['Aspen', 'Douglas-fir']
yfit = grid.best_estimator_.predict(xtest)
print(classification_report(ytest, yfit, target_names=names))
Hey! That wasn't online learning.
Right, it was not.
The full dataset may need online learning
but our sample of two forest types fits fine in memory.
Yet we can simulate a dataset that is too big
by splitting it into two sets of data and training SGD in an online fashion.
We will misuse the train_test_split
function for this.
cov1, cov2, ycov1, ycov2 = train_test_split(X, y, test_size=0.9)
We will also argue that we are facing a real world problem in which we only have a tiny bit of data ($10\%$) right now. Whilst we wait for our rangers to collect more data for us we need to make do with what we already have and build a model. Later, when our forest rangers return with a lot more data, we will be able to use online learning to update the model.
In sklearn
there are two ways of using online learning.
One is to use a method called partial_fit, instead of fit,
which will update parameters instead of fitting completely new ones.
Another way to enable online learning is to pass warm_start=True,
which forces fit to always work like partial_fit.
Both methods only work on models that support online learning.
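For reference, a minimal sketch of the partial_fit route on a made-up toy problem (the make_classification data below is purely for illustration; note that a pipeline does not expose partial_fit, so we call the classifier directly):
from sklearn.datasets import make_classification

# made-up toy data standing in for two batches that arrive at different times
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SGDClassifier(loss='log', learning_rate='constant', eta0=0.001)
# all possible classes must be given up front,
# since a single batch may not contain every class
clf.partial_fit(X_toy[:1000], y_toy[:1000], classes=np.unique(y_toy))
# later, update the same model with the new batch instead of refitting from scratch
clf.partial_fit(X_toy[1000:], y_toy[1000:])
Below we take the other route, warm_start, on our forest data.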
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(cov1, ycov1, test_size=0.2)
model = make_pipeline(
PCA(n_components=10),
SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
learning_rate='constant', eta0=0.001, warm_start=True))
model.fit(xtrain1, ytrain1)
yfit = model.predict(xtest1)
f1_score(ytest1, yfit)
We know about a tiny bit of the data and we can, more-or-less, classify that. Note that since we know that we have rather similar scores for both classes on this dataset we are fine using a single score here.
But if we try to classify the data we do not know about we may run into trouble.
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(cov2, ycov2, test_size=0.2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
Our model without the extra data provided later by the rangers works quite reasonably.
But we can train with the extra data and see if things improve.
model.fit(xtrain2, ytrain2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
We saw SGD and said that it is the most often used optimization technique. But what are the others? One technique is simulated annealing, which works by slow cooling. The simulated annealing technique tries random neighbors at each iteration and keeps track of the point with the lowest value of the error/cost function. The search space for a new neighbor (i.e. the maximum distance from the lowest point found until now) reduces at each iteration. This is similar to SGD with a decreasing learning rate.
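As a rough illustration of that description, here is a sketch of such an annealing loop on a made-up one-dimensional cost function; the starting radius, cooling factor, and iteration count are arbitrary choices for the example.
def annealing_minimize(cost, x0, radius=1.0, cooling=0.99, n_iter=1000):
    rng = np.random.RandomState(0)
    best_x, best_cost = x0, cost(x0)
    for _ in range(n_iter):
        # try a random neighbour of the best point found so far
        candidate = best_x + rng.uniform(-radius, radius)
        if cost(candidate) < best_cost:
            best_x, best_cost = candidate, cost(candidate)
        # slow cooling: the maximum distance from the best point shrinks each iteration
        radius *= cooling
    return best_x

# minimize a made-up quadratic cost, true minimum at x = 3
annealing_minimize(lambda x: (x - 3)**2 + 2, x0=0.0)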
But there are more techniques. Notably, swarm intelligence provides us with several optimization algorithms, such as particle swarm optimization and ant colony optimization.
And genetic algorithms also work reasonably in an online learning scenario.