To avoid mistakes in analysis, a good deal of statistical knowledge is required.
We will review some statistics and learn a little about distributions in scipy. scipy is the mathematical library for Python on top of NumPy.
It was geared to be the one and only mathematical library for the sciences in Python, but it turned out that it would become too big. Libraries flensed from scipy often include sci at the beginning of their names, e.g. scikit-learn or scikit-image.
scipy is comprised of several submodules, including scipy.linalg, scipy.optimize, scipy.signal, scipy.interpolate and scipy.stats. Several of these routines are used in scikit-learn and scikit-image to produce decomposition and machine learning algorithms.
We will look at machine learning from a higher perspective but, for the statistics we need, we can use the scipy.stats module.
As a quick review let's see a handful of statistical measures (or simply statistics for short) we can perform on data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
Note: the $1/(N-d)$ factor becomes $1/N$ for population statistics and $1/(N-1)$ for bias-corrected sample statistics; in other words, the most common values for $d$ are $0$ for population statistics and $1$ (or, very rarely, more) for sample statistics. Bias correction is needed when operating over a sample instead of operating over the entire population. All the NumPy functions below (except the correlation functions, in which the factor cancels out) accept a ddof= (delta degrees of freedom) argument to perform a sample based calculation.
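As a minimal sketch of that factor at work (using a tiny toy array of my own, not part of the lesson data), the hand-written $1/(N-d)$ sum matches NumPy's variance for both common values of $d$:

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy data for illustration only
n = len(x)
for d in (0, 1):
    by_hand = ((x - x.mean())**2).sum() / (n - d)
    print(by_hand, x.var(ddof=d))  # the two values agree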
Let's have a look at why this bias correction makes sense. We will take a dataset of several points on one horizontal axis. The vertical position of the points is merely illustrative of the fact that in the real world we can measure only a fraction of all possible measurable conditions.
We will assume a normal distribution. The mean is at the center of the data points and a distance of $2$ standard deviations from the mean contains around $95\%$ of all points - within the dashed lines.
The blue points are our full population, there are no more points. Yet if we take a sample from this population we are more likely to sample points from regions where there are more blue points in the first place. The green points below are a sample from within the population of all blue points.
The mean changes very little, and changes in the value of the mean depend on where on the axis the dataset is positioned; to avoid changes to the mean one can center the data at zero, as we were doing above with $(x_i - \bar{x})$. Since changes in the mean value can be avoided in most cases, there is no need for a degrees of freedom adjustment to the mean.
The change in the value of the standard deviation is more interesting. The spread of sampled data will always be smaller than or equal to the spread of the population. This is exactly because we are more likely to sample blue points from the middle than from the extremities. The premise is that we are assuming a distribution that is reasonably normal, or to be more exact a unimodal distribution - a distribution with a single peak.
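As a quick illustration of this effect (a simulation of my own with an assumed normal population, not the points from the figure above), repeatedly drawing small samples and averaging their spread shows that the ddof=0 standard deviation underestimates the population spread, while ddof=1 gets noticeably closer.

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=(10_000, 10))  # 10k samples of 10 points each
print(samples.std(ddof=0, axis=1).mean())  # tends to underestimate the true spread of 1.0
print(samples.std(ddof=1, axis=1).mean())  # noticeably closer to 1.0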
Now let's create some arrays to play with. One array is a simple range and the others are noise perturbed.
arr = np.arange(30)
acv = np.arange(30) + np.random.rand(30) - 0.5
acr = np.arange(30) + np.random.rand(30)*5 - 2.5
arr, acv, acr
NumPy has the mean method but implementing it by hand is easy.
print(arr.mean())
print(arr.sum() / len(arr))
Standard deviation with zero degrees of freedom is the deviation of the data we have. With one degree of freedom it is an estimate of the spread of a population from which the data at hand may be a reasonable sample.
print(arr.std())
print(np.std(arr, ddof=1))
Note how the second value is bigger. In the first case we consider all the data as our population. In the second case all our data in arr is considered to be just a sample from some population with much more data. That population, which we do not know in full, must have a bigger spread of values.
Same story with the variance (since it is just the squared standard deviation).
print(arr.var())
print(arr.var(ddof=1))
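A quick check of that relation (a sketch of my own, reusing arr from above):

print(np.isclose(arr.var(), arr.std()**2))                  # var equals std squared
print(np.isclose(arr.var(ddof=1), np.std(arr, ddof=1)**2))  # and the same holds with ddof=1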
The covariance method (cov) produces the variance of each array on the diagonal and the actual covariance at the intersection between the arrays, i.e.
np.cov | arr | acv |
---|---|---|
arr | cov(arr, arr) | cov(arr, acv) |
acv | cov(acv, arr) | cov(acv, acv) |
And since $cov(a, a) = var(a)$ the diagonal of this matrix is just a sequence of variances.
print(np.cov([arr, acv], ddof=0))
print(np.cov([arr, acv], ddof=1))
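To confirm the layout described above (a small check of my own, reusing the arrays):

cov_matrix = np.cov([arr, acv], ddof=0)
print(np.diag(cov_matrix))   # the diagonal of the covariance matrix
print(arr.var(), acv.var())  # holds the same values as the plain variances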
Be careful to specify the degrees of freedom: the default ddof in NumPy's standard deviation and variance is ddof=0, but NumPy's covariance attempts to estimate ddof by default. If you do not specify a value for the ddof argument, NumPy's covariance will end up with ddof=1 in the majority of cases.
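A quick way to see that default at work (a sketch of my own): leaving ddof unset gives the ddof=1 result, while the bias=True flag recovers the ddof=0 one.

print(np.allclose(np.cov([arr, acv]), np.cov([arr, acv], ddof=1)))             # default normalisation
print(np.allclose(np.cov([arr, acv], bias=True), np.cov([arr, acv], ddof=0)))  # bias=True divides by N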
The correlation measure we saw above in the equation is actually just one way of measuring correlation. It is called Pearson's correlation coefficient, or just Pearson's $r$. NumPy's corrcoef produces a matrix similar to the covariance method. One important thing to note is that Pearson's correlation coefficient divides out the degrees of freedom in its equation. Therefore, if we are working with a sample, we cannot bias-correct it.
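A sketch of why (hand-coding the standard definition, an illustration of my own): whichever $d$ we pick, the $1/(N-d)$ factor appears in both the covariance on top and the standard deviations at the bottom, so it cancels and the result does not change.

def pearson_r(x, y, ddof=0):
    # covariance and standard deviations computed with the same ddof
    cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)
    return cov_xy / (np.std(x, ddof=ddof) * np.std(y, ddof=ddof))

print(pearson_r(arr, acv, ddof=0))
print(pearson_r(arr, acv, ddof=1))  # the same value, up to floating point rounding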
scipy.stats provides us with a pearsonr method, which reminds us of the fact that we cannot bias correct our correlation. It returns an extra $p$ value, which is an indication of how likely an uncorrelated dataset is to produce such a value of correlation. The $p$ value is just a rough estimate though, a throw against a beta distribution (do not worry if you do not know what that is). In general, correlation values close to $-1$ or $1$ will give a very small $p$, and correlation values around $0$ will give a high $p$. The $p$ value here is just a rough reminder for the experimenter.
print(np.corrcoef([arr, acv, acr]))
print(stats.pearsonr(arr, acv))
print(stats.pearsonr(acv, acr))
print(stats.pearsonr(arr, acr))
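Since pearsonr returns both numbers together, they can also be unpacked separately (a small usage sketch):

r, p = stats.pearsonr(arr, acr)  # correlation coefficient and its p value
print(r, p)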
All the time we need to remember that all these statistics assume a normal distribution of the data: a bell shaped distribution, or at least a distribution that has a single peak - a unimodal distribution of data values. A distribution that is multimodal - has several peaks in its graph - will render the majority of the statistics we saw here useless.
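A tiny example of that caveat (synthetic data of my own): a clearly bimodal dataset has a mean that sits where almost no data points live, so the mean and standard deviation describe it poorly.

rng = np.random.default_rng(1)
bimodal = np.concatenate([rng.normal(-5.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])  # two well separated bumps
print(bimodal.mean())  # close to 0, a region with almost no points
print(bimodal.std())   # much wider than the spread of either bump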
Next we will see a handful of ideas about distributions, and then a couple of pitfalls in the use of plain statistical values.