In sklearn
we have SGDClassifier
which will perform SGD
to achieve online learning on top of linear SVMs, logistic regression, or a perceptron.
The SGDRegressor
performs a linear regression as online learning.
Note that this means that we can only find solutions to problems
that can be approximated linearly.
For non-linear online learning we need neural networks (which we will see soon).
We import the SGDClassifier
and the common preprocessors and validation techniques.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, f1_score
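As a side note on the regression counterpart, here is a minimal sketch of SGDRegressor used as online linear regression. The synthetic data, batch size, and learning rate below are made up purely for illustration and are not part of the forest cover example that follows.
from sklearn.linear_model import SGDRegressor

# made-up linear data: y = 3*x + noise (illustration only)
rng = np.random.RandomState(42)
x_reg = rng.uniform(-1, 1, (1000, 1))
y_reg = 3*x_reg[:, 0] + rng.normal(scale=0.1, size=1000)

reg = SGDRegressor(learning_rate='constant', eta0=0.01)
# feed the data in small batches, updating the model with each batch
for i in range(0, 1000, 100):
    reg.partial_fit(x_reg[i:i+100], y_reg[i:i+100])
reg.coef_, reg.intercept_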
For a change let's take on a dataset that is not present inside sklearn.
The forest cover dataset contains cartographic data about types of forests
in the Roosevelt National Forest in Colorado.
Since we are thinking of walking blindfolded around mountain ranges
we may as well use a dataset of mountain forest coverage.
First let's define a couple of details about the features of the dataset.
It has several continuous features and then several categorical ones.
The categorical features are already one-hot-encoded for us.
The wild=
columns denote wilderness areas and are quite hidden in the dataset description;
for easier access we added comments to the code below describing the features we will load.
continuous = [
'Elevation',
'Aspect',
'Slope',
'HHydro',      # horizontal distance to nearest water (hydrology)
'VHydro',      # vertical distance to nearest water (hydrology)
'Road',        # horizontal distance to nearest roadway
'Shade_9am',   # hillshade index at 9am
'Shade_Noon',  # hillshade index at noon
'Shade_3pm',   # hillshade index at 3pm
'Fire',        # horizontal distance to nearest wildfire ignition point
]
categorical = [
'wild=1', # Rawah Wilderness Area
'wild=2', # Neota Wilderness Area
'wild=3', # Comanche Peak Wilderness Area
'wild=4', # Cache la Poudre Wilderness Area
'soil=1','soil=2','soil=3','soil=4','soil=5','soil=6','soil=7','soil=8','soil=9','soil=10',
'soil=11','soil=12','soil=13','soil=14','soil=15','soil=16','soil=17','soil=18','soil=19','soil=20',
'soil=21','soil=22','soil=23','soil=24','soil=25','soil=26','soil=27','soil=28','soil=29','soil=30',
'soil=31','soil=32','soil=33','soil=34','soil=35','soil=36','soil=37','soil=38','soil=39','soil=40',
]
columns = continuous + categorical + ['label']
target_names = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']
Based on the features we can then classify an area of forest cover into one of the seven labels. The set has more than half a million rows of data, making it a reasonably sized dataset.
To keep in the spirit of what we have been doing until now we will
write a function to actually retrieve the dataset.
We will duplicate the sklearn
convention for dataset loading and construct
our load_cover_type
function.
The function not only allows for easy download of the dataset
but also caches the downloaded data on the file system,
so one does not need to download it the next time.
A couple of things are worth mentioning:
the dataset is taken from the
University of California Irvine Machine Learning Repository
and is kept within their archives gzipped.
The decompression code is a tad obscure but it works on several corner cases.
Also, the labels in the dataset start from $1$;
we adjust them to start from $0$ during the dataset load.
This code is pretty much what sklearn
does every time we call one
of its load_*
procedures to get a dataset.
The description of the forest cover dataset is very extensive;
we print only the descriptions of the main features to save space.
import os
import sys
import zlib
import requests
from sklearn import datasets
from sklearn.utils import Bunch
def load_cover_type():
cov_dir = 'uci_cover_type'
data_dir = datasets.get_data_home()
data_path = os.path.join(data_dir, cov_dir, 'covtype.data')
descr_path = os.path.join(data_dir, cov_dir, 'covtype.info')
cov_data_gz = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'
cov_descr = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info'
os.makedirs(os.path.join(data_dir, cov_dir), exist_ok=True)
try:
with open(descr_path, 'r') as f:
descr = f.read()
except IOError:
print('Downloading file from', cov_descr, file=sys.stderr)
r = requests.get(cov_descr)
with open(descr_path, 'w') as f:
f.write(r.text)
descr = r.text
r.close()
try:
data = pd.read_csv(data_path, delimiter=',', names=columns)
except IOError:
print('Downloading file from', cov_data_gz, file=sys.stderr)
r = requests.get(cov_data_gz)
cov_data = zlib.decompress(r.content, wbits=16+zlib.MAX_WBITS) # obscure but works
cov_data = cov_data.decode('utf8')
with open(data_path, 'w') as f:
f.write(cov_data)
r.close()
data = pd.read_csv(data_path, delimiter=',', names=columns)
X = data[continuous + categorical].values
y = data['label'].values - 1
return Bunch(DESCR=descr,
data=X,
feature_names=columns[:-1],
feature_continuous=continuous,
feature_categorical=categorical,
target=y,
target_names=target_names)
covtype = load_cover_type()
print(covtype.DESCR[6923:8554])
print()
print(covtype.DESCR[12373:12713])
A quick look at the dataset is always a good idea. One can see all three parts of the set: the continuous features, the one-hot-encoded categorical features, and (kept separately in the target attribute) the forest cover type labels.
In the loading function we have also added a distinction between continuous and categorical features. This concept is often useful as one may need to scale continuous features but scaling one-hot-encoded features makes little sense.
df = pd.DataFrame(covtype.data, columns=covtype.feature_names)
df
Half a million rows is a good dataset. It perhaps does not require online learning on most machines but on some it might.
That said, for presentation purposes it may take too long to run our code on the full dataset. Instead we will take two forest types: Aspen (label $4$) and Douglas-fir (label $5$), and use only the part of the dataset with these labels. Note how we change the labels to be $0$ and $1$ for a classification between only two forest types.
X = covtype.data
y = covtype.target
X = X[(y == 4) | (y == 5)].copy()
y = y[(y == 4) | (y == 5)].copy()
y[y == 4] = 0
y[y == 5] = 1
df = pd.DataFrame(X, columns=covtype.feature_names)
df
Looking at the data we can easily see that the continuous features have very distinct value ranges, and will therefore require scaling.
We have the columns that are continuous in an attribute of the loaded dataset. If we now scale only those columns and place them back together with the categorical columns we have a dataset we can work with.
sc = StandardScaler()
X_cont = sc.fit_transform(df[covtype.feature_continuous].values)
X_cat = df[covtype.feature_categorical].values
X = np.c_[X_cont, X_cat]
X.shape
We have a real dataset, so we should treat it as a real problem. We take out a test set which we will not touch.
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
And train on the training set with cross-validation.
The learning rate we will use will be constant, with a value of eta0=0.001.
Note also that SGD is an optimizer, not a model.
The loss=
argument chooses the model to optimize with SGD.
model = make_pipeline(
PCA(n_components=10),
SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
learning_rate='constant', eta0=0.001))
param_grid = {
'sgdclassifier__alpha': [0.01, 0.1],
'sgdclassifier__tol': [0.01, 0.1],
}
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(xtrain, ytrain)
grid.best_score_, grid.best_estimator_
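As an aside, to see that SGD is the optimizer and loss= picks the model: swapping loss='log' for loss='hinge' in the very same pipeline makes SGD optimize a linear SVM instead of a logistic regression. A quick sketch, reusing the hand-picked hyperparameters from above rather than the grid search results:
# the same pipeline but with loss='hinge', i.e. SGD now optimizes a linear SVM
svm_model = make_pipeline(
    PCA(n_components=10),
    SGDClassifier(loss='hinge', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
                  learning_rate='constant', eta0=0.001))
svm_model.fit(xtrain, ytrain)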
Finally, we evaluate on the test set. And we have quite good scores for both classes.
names = ['Aspen', 'Douglas-fir']
yfit = grid.best_estimator_.predict(xtest)
print(classification_report(ytest, yfit, target_names=names))
Hey! That wasn't online learning.
Right, it was not.
The full dataset may need online learning
but our sample of two forest types fits fine in memory.
Yet we can simulate a dataset that is too big
by splitting it into two sets of data and training SGD in an online fashion.
We will misuse the train_test_split
function for this.
cov1, cov2, ycov1, ycov2 = train_test_split(X, y, test_size=0.9)
We will also argue that we are facing a real world problem in which we only have a tiny bit of data ($10\%$) right now. Whilst we wait for our rangers to collect more data for us we need to make do with what we already have and build a model. Later, when our forest rangers return with a lot more data, we will be able to use online learning to update the model.
In sklearn
there are two ways of using online learning.
One is to use a method called partial_fit, instead of fit,
which will update parameters instead of fitting completely new ones.
Another way to enable online learning is to pass warm_start=True,
which forces fit to always work like partial_fit.
Both methods only work on models that support online learning.
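For reference, a minimal sketch of the partial_fit route on a made-up toy problem (the make_classification data below is purely for illustration; note that a pipeline does not expose partial_fit, so we call the classifier directly):
from sklearn.datasets import make_classification

# made-up toy data standing in for two batches that arrive at different times
X_toy, y_toy = make_classification(n_samples=2000, n_features=20, random_state=0)
clf = SGDClassifier(loss='log', learning_rate='constant', eta0=0.001)
# all possible classes must be given up front,
# since a single batch may not contain every class
clf.partial_fit(X_toy[:1000], y_toy[:1000], classes=np.unique(y_toy))
# later, update the same model with the new batch instead of refitting from scratch
clf.partial_fit(X_toy[1000:], y_toy[1000:])
Below we take the other route, warm_start, on our forest data.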
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(cov1, ycov1, test_size=0.2)
model = make_pipeline(
PCA(n_components=10),
SGDClassifier(loss='log', penalty='l2', max_iter=500, alpha=0.01, tol=0.01,
learning_rate='constant', eta0=0.001, warm_start=True))
model.fit(xtrain1, ytrain1)
yfit = model.predict(xtest1)
f1_score(ytest1, yfit)
We know about a tiny bit of the data and we can, more-or-less, classify that. Note that since we know that we have rather similar scores for both classes on this dataset we are fine using a single score here.
But if we try to classify the data we do not know about we may run into trouble.
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(cov2, ycov2, test_size=0.2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
Our model without the extra data provided later by the rangers works quite reasonably.
But we can train with the extra data and see if things improve.
model.fit(xtrain2, ytrain2)
yfit = model.predict(xtest2)
f1_score(ytest2, yfit)
We saw SGD and said that it is the most often used optimization technique. But what are the others? One technique is simulated annealing, which works by slow cooling. The simulated annealing technique tries random neighbors at each iteration and keeps track of the point with the lowest value of the error/cost function. The search space for a new neighbor (i.e. the maximum distance from the lowest point found until now) reduces at each iteration. This is similar to SGD with a decreasing learning rate.
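As a rough illustration of that description, here is a sketch of such an annealing loop on a made-up one-dimensional cost function; the starting radius, cooling factor, and iteration count are arbitrary choices for the example.
def annealing_minimize(cost, x0, radius=1.0, cooling=0.99, n_iter=1000):
    rng = np.random.RandomState(0)
    best_x, best_cost = x0, cost(x0)
    for _ in range(n_iter):
        # try a random neighbour of the best point found so far
        candidate = best_x + rng.uniform(-radius, radius)
        if cost(candidate) < best_cost:
            best_x, best_cost = candidate, cost(candidate)
        # slow cooling: the maximum distance from the best point shrinks each iteration
        radius *= cooling
    return best_x

# minimize a made-up quadratic cost, true minimum at x = 3
annealing_minimize(lambda x: (x - 3)**2 + 2, x0=0.0)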
But there are more techniques. Notably, swarm intelligence provides us with several optimization algorithms, such as particle swarm optimization and ant colony optimization.
And genetic algorithms also work reasonably in an online learning scenario.