05.00 Stats

To avoid making mistakes in analysis a good deal of statistical knowledge is required. We will review some statistics and learn a little about distributions in scipy. scipy is the mathematical library for Python on top of NumPy. It was geared to be the one and only mathematical library for the sciences in Python but it turned out that it would become too big. Libraries flensed from scipy often include the sci at the beginning of their names, e.g. scikit-learn or scikit-image.

scipy is comprised of:

  • Numerical Integration
  • Function Optimization - used in machine learning routines
  • Interpolation - e.g. splines
  • Fast Fourier Transforms
  • General Signal Processing
  • Linear Algebra - including Matrix decomposition
  • Image Processing - as NumPy arrays, used by scikit-image
  • Sparse Matrices - and graphs
  • Statistics - which is what interests us right now
  • And a handful of extra things

Several of these routines are used in scikit-learn and scikit-image to produce decomposition and machine learning algorithms. We will look at machine learning from a higher perspective but, for the statistics we need, we can use the scipy.stats module.

As a quick review let's see a handful of statistical measures (or simply statistics for short) we can perform on data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

Mean

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Variance

$$\sigma^2 = \frac{1}{N - d} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

Standard Deviation

$$\sigma = \sqrt{\frac{1}{N - d} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Covariance

$$cov(X, Y) = \sigma_{xy} = \frac{1}{N - d} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$

Correlation

$$corr(X, Y) = r = \frac{cov(X, Y)}{\sigma_x \sigma_y} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

Note: $1/(N-d)$ becomes $1/N$ for population statistics and $1/(N-1)$ for bias-corrected (sample) statistics. In other words, the most common values for $d$ are $0$ for population statistics and $1$ (or, very rarely, more) for sample statistics. Bias correction is needed when operating over a sample instead of over the entire population. All NumPy functions used below accept a ddof= (delta degrees of freedom) argument to perform the sample based calculation; the exception is the correlation, where the $1/(N-d)$ factor cancels out and no correction applies.
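As a small sketch of that argument (the tiny array here is only illustrative), the ddof= value switches the denominator between $N$ and $N - 1$:

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative values only
dev = x - x.mean()
print(np.var(x), (dev**2).sum() / len(x))                 # population variance, 1/N
print(np.var(x, ddof=1), (dev**2).sum() / (len(x) - 1))   # sample variance, 1/(N - 1)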

Let's have a look at why this bias correction makes sense. We will take a dataset of several points along one horizontal axis. The vertical position of the points is merely illustrative of the fact that in the real world we can only ever measure a fraction of all possible conditions.

We will assume a normal distribution. The mean is at the center of the data points and a distance of $2$ standard deviations from the mean contains around $95\%$ of all points - within the dashed lines.
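We can check that $95\%$ figure against scipy.stats itself; this short sketch uses the norm distribution object from the stats import above:

# probability mass of a standard normal within 2 standard deviations of the mean
print(stats.norm.cdf(2) - stats.norm.cdf(-2))  # close to 0.95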

Population Statistics

da-std-full.svg

The blue points are our full population - there are no more points. Yet if we take a sample from this population we are more likely to sample points from regions where there are more blue points in the first place. The green points below are a sample from within the population of all blue points.

The mean changes very little - and changes in the value of the mean depend on where along the axis the dataset is positioned. To avoid changes to the mean one can center the data at zero, as we were doing above with $(x_i - \bar{x})$. Since changes in the mean value can be avoided in most cases, there is no need for a degrees of freedom adjustment to the mean.

The change in the value of the standard deviation is more interesting. The spread of sampled data will always be smaller than or equal to the spread of the population. This is exactly because we are most likely to sample blue points from the middle rather than from the extremities. The premise is that we are assuming a distribution that is reasonably normal, or, to be more exact, a unimodal distribution - a distribution with a single peak.

Sample Statistics

da-std-sample.svg
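One can simulate this shrinking of the spread; the following sketch draws a made-up normal population and repeatedly samples from it (all sizes are arbitrary assumptions for illustration):

population = np.random.normal(0, 1, 10000)    # a made-up population
samples = [np.random.choice(population, 30) for _ in range(1000)]
print(np.std(population))                     # spread of the full population
print(np.mean([np.std(s) for s in samples]))  # average spread of the samples, slightly smaller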

Now let's create some arrays to play with. One array is a simple range and the others are noise perturbed.

In [2]:
arr = np.arange(30)
acv = np.arange(30) + np.random.rand(30) - 0.5
acr = np.arange(30) + np.random.rand(30)*5 - 2.5
arr, acv, acr
Out[2]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]),
 array([ 0.42656744,  1.34760301,  1.76247459,  2.54394538,  4.25001083,
         5.37448213,  5.5951476 ,  6.71161456,  8.04594227,  8.836638  ,
        10.37401308, 10.89728014, 12.33849422, 12.76767618, 14.02655641,
        14.70839667, 15.52074703, 17.23753479, 18.44489014, 18.62833507,
        20.04839567, 21.47013894, 21.65815276, 22.97512403, 23.82185634,
        25.26785455, 26.0377071 , 27.36483584, 28.48860728, 28.93587003]),
 array([ 1.18661367,  1.16547567,  0.14580165,  4.99155547,  4.11994684,
         6.55103777,  6.12749731,  9.080279  ,  6.96948181, 10.05920244,
        11.70575631, 12.43509519,  9.62236534, 10.61161181, 14.64103517,
        15.07336027, 13.54752731, 15.90730184, 20.0211125 , 20.6451752 ,
        22.46041344, 19.71946508, 21.1359607 , 21.23087045, 23.24723358,
        25.72597371, 26.96254792, 28.7182825 , 27.11062461, 27.01276892]))

NumPy has the mean method but implementing it by hand is easy.

In [3]:
print(arr.mean())
print(arr.sum() / len(arr))
14.5
14.5

Standard deviation with zero degrees of freedom is the deviation of our data. With one degree of freedom it is an estimate of the spread of a population from which the data at hand may be a reasonable sample.

In [4]:
print(arr.std())
print(np.std(arr, ddof=1))
8.65544144839919
8.803408430829505

Note how the second value is bigger. In the first case we consider all the data to be our population. In the second case all our data in arr is considered to be just a sample from some population with much more data. That population with more data, which we do not know about, must have a bigger spread of values.
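As with the mean, implementing the standard deviation by hand is straightforward; this sketch reproduces both values above:

dev = arr - arr.mean()
print(np.sqrt((dev**2).sum() / len(arr)))        # ddof=0, as arr.std()
print(np.sqrt((dev**2).sum() / (len(arr) - 1)))  # ddof=1, as np.std(arr, ddof=1)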

Same story with the variance (since it is just the squared standard deviation).

In [5]:
print(arr.var())
print(arr.var(ddof=1))
74.91666666666667
77.5

The covariance function (cov) produces the variance of each array on the diagonal and the actual covariance at the intersection between the arrays, i.e.

np.cov    arr             acv
arr       cov(arr, arr)   cov(arr, acv)
acv       cov(acv, arr)   cov(acv, acv)

And since $cov(a, a) = var(a)$ the diagonal of this matrix is just a sequence of variances.

In [6]:
print(np.cov([arr, acv], ddof=0))
print(np.cov([arr, acv], ddof=1))
[[74.91666667 75.16545427]
 [75.16545427 75.50980833]]
[[77.5        77.75736649]
 [77.75736649 78.11359482]]

Be careful to specify the degrees of freedom: the default in NumPy's standard deviation and variance is ddof=0, but NumPy's covariance defaults to the sample normalization. If you do not specify a value for the ddof argument, np.cov will use ddof=1.
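A quick sketch of that default behavior; note that np.cov also accepts a bias= flag, which is another way of asking for the population normalization:

print(np.allclose(np.cov([arr, acv]), np.cov([arr, acv], ddof=1)))             # True, the default is ddof=1
print(np.allclose(np.cov([arr, acv], bias=True), np.cov([arr, acv], ddof=0)))  # True, bias=True means ddof=0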

The correlation measure we saw in the equation above is actually just one way of measuring correlation. It is called Pearson's correlation coefficient, or just Pearson's $r$. NumPy's corrcoef produces a matrix similar to the one from the covariance function.

One important thing to note is that in Pearson's correlation coefficient the $1/(N-d)$ factor cancels out between the numerator and the denominator. Therefore, even if we are working with a sample, there is no bias correction we can apply. scipy.stats provides us with a pearsonr function, which reminds us of the fact that we cannot bias-correct our correlation. It returns an extra $p$ value, which is an indication of how likely an uncorrelated dataset is to produce such a value of correlation. The $p$ value is only a vague estimate though, a throw into a beta distribution (do not worry if you do not know what that is). In general, correlation values close to $-1$ or $1$ will give a very small $p$, and correlation values around $0$ will give a high $p$. The $p$ value here is just a rough reminder for the experimenter.

In [7]:
print(np.corrcoef([arr, acv, acr]))
print(stats.pearsonr(arr, acv))
print(stats.pearsonr(acv, acr))
print(stats.pearsonr(arr, acr))
[[1.         0.99937247 0.98441101]
 [0.99937247 1.         0.98467649]
 [0.98441101 0.98467649 1.        ]]
(0.9993724656859068, 3.582409846684914e-42)
(0.9846764927276871, 8.778258288510629e-23)
(0.9844110053919739, 1.1146616476882797e-22)
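Following the correlation equation from above, we can rebuild one of these values by hand: the covariance divided by the product of the standard deviations. The sketch below reuses arr and acv; since the $1/(N-d)$ factors cancel, using ddof=0 throughout gives the same number.

r = np.cov([arr, acv], ddof=0)[0, 1] / (arr.std() * acv.std())
print(r)  # matches corrcoef and pearsonr above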

Bell Curve

da-bell.svg

We need to remember at all times that these statistics assume a normal distribution of the data - a bell shaped distribution, or at least a distribution with a single peak, i.e. a unimodal distribution of data values. A multimodal distribution - one with several peaks in its graph - will render the majority of the statistics we saw here useless.
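As a small illustration of the problem - with made-up numbers - consider a dataset with two well separated peaks:

bimodal = np.concatenate([np.random.normal(-5, 1, 500), np.random.normal(5, 1, 500)])
print(bimodal.mean())  # close to 0, a region containing almost no data points
print(bimodal.std())   # driven by the distance between the peaks, not the spread of either peak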

Next we will see a handful of ideas about distributions, and then a couple of pitfalls in the use of plain statistic values.