02.01 NumPy Operations

NumPy arrays behave like lists in some respects and like simple values in others. As with lists, you can retrieve parts of an array; as with simple values, you can perform vectorized operations on all (or some) of the values at once. We will now look at the operations on NumPy arrays that are most useful for working with data and that lead us further towards data science.

Bookshelf

np-bookshelf.svg

To keep each section self-contained we perform the required imports from previous sections at the top. If an import looks unfamiliar, go back and check the previous sections. For now we only know about NumPy.

In [1]:
import numpy as np

Indexing and Slicing

As with lists, indexing and slicing are done with square brackets. One dimensional indexing works pretty much the same as for a list. Let's create an array and check.

In [2]:
x = np.arange(9)
x
Out[2]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8])

For indexing purposes a one dimensional NumPy array can be treated much like a Python list.

In [3]:
x[3]
Out[3]:
3

Slicing on such an array also works like slicing a Python list. The arrays here are what NumPy calls one dimensional arrays, because a single index (one dimension) is enough to retrieve a specific value from the array, and a slice returns an array that is also one dimensional.

In [4]:
x[1:5]
Out[4]:
array([1, 2, 3, 4])

For more dimensions we add an extra index. The index is understood as a tuple of integers or slice objects. Behind the scenes this is just a cleverly designed Python __getitem__ method. At this point NumPy arrays start to appear as a little more than simple lists.

In [5]:
x = np.arange(18).reshape((3, 6))
x, x.shape
Out[5]:
(array([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]]),
 (3, 6))

We call the array above a two dimensional array: we need two indexes to retrieve a specific value from the array. We can keep adding dimensions well beyond what is practical, since NumPy's own limit of a few dozen dimensions is rarely reached. In most cases it is the human difficulty of reasoning about high dimensional arrays that limits the number of dimensions.

The shape = (3, 6) tells us that we have $2$ dimensions, one dimension with $3$ possible indexes to select from and another dimension with $6$ indexes to select from. As we saw, this is because we walk the memory containing the values in steps of $6$ (first dimension) or in steps of $1$ (second dimension). We often call two dimensional arrays matrices, although NumPy does have a separate matrix type. The difference between the NumPy matrix and the NumPy two dimensional array is how certain operations work on the objects, notably multiplication. That said, the need for the matrix data type is rare and it is often more trouble than it is worth. Stick to two dimensional arrays; from now on, when we say matrix in NumPy we will be referring to two dimensional arrays.
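
As a small sketch of how shape and memory steps relate, the strides attribute of an array reports the step for each dimension in bytes. The dtype is fixed to np.int64 here (an assumption made only so that the item size is 8 bytes on every platform), and the variable names are illustrative.

```python
import numpy as np

# Fix the dtype so the item size (8 bytes) is the same on every platform.
x = np.arange(18, dtype=np.int64).reshape((3, 6))

# strides are reported in bytes; dividing by the item size gives steps in elements.
steps = tuple(s // x.itemsize for s in x.strides)
print(x.strides)  # (48, 8)
print(steps)      # (6, 1)

# Walking the flat memory by hand gives the same value as two-index access.
flat = x.ravel()
row, col = 1, 1
by_hand = flat[row * steps[0] + col * steps[1]]
print(by_hand, x[row, col])  # 7 7
```

Moving one position along the first dimension skips a whole row of $6$ values, whilst moving one position along the second dimension moves to the neighbouring value.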

In [6]:
x[1, 1]
Out[6]:
7
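
To make the tuple-of-indexes idea concrete, the sketch below writes the same selection in three ways. It rebuilds the array so that it runs on its own; calling __getitem__ directly is done here only for illustration, not as something one would write in practice.

```python
import numpy as np

x = np.arange(18).reshape((3, 6))

# The comma inside the brackets builds a tuple, so these are the same call.
a = x[1, 1]
b = x[(1, 1)]
c = x.__getitem__((1, 1))

# A slice written with colons becomes a slice object inside that tuple.
d = x[1, 3:]
e = x[(1, slice(3, None))]

print(a, b, c)  # 7 7 7
print(d, e)     # [ 9 10 11] [ 9 10 11]
```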

Slicing can become complicated with several dimensions. Let's try to memorize some operations.

Note: remember that slicing in Python uses the [start:stop:step] syntax, and that any component left out takes a default value:

  • no start: start=0
  • no stop: stop=length of the dimension (the end)
  • no step: step=1

This also means that [:] means "take everything", since start = 0, stop = the length and step = 1. Note that stop = -1 is not the default: it would leave out the last element. Also remember that the start parameter is inclusive in Python, whilst the stop parameter is exclusive. With NumPy everything works exactly the same, but we can slice several dimensions at once.
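
A quick way to check the defaults is to ask a slice object to resolve itself against a given length. This is a minimal sketch using the built-in slice.indices method on a fresh one dimensional array; the variable names are illustrative.

```python
import numpy as np

x = np.arange(9)

# slice.indices(n) resolves the omitted parts for a sequence of length n.
resolved = slice(None, None, None).indices(len(x))
print(resolved)  # (0, 9, 1)

# Omitting everything is the same as spelling out the resolved values.
full = x[:]
explicit = x[0:len(x):1]
print((full == explicit).all())  # True

# An explicit stop of -1 is different: it drops the last element.
without_last = x[0:-1]
print(without_last)  # [0 1 2 3 4 5 6 7]
```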

Let's take the array we have built and slice it in different ways on both dimensions. The power of working with a NumPy array will show. We can select alternate columns and/or rows, and much more. Once we combine this selection with further operations we will have a powerful tool on our hands.

Slice 1: Select

np-slice-1-select.svg
In [7]:
x[1,3:]
Out[7]:
array([ 9, 10, 11])

Slice 2: All Values

np-slice-2-all-values.svg
In [8]:
x[0:2,:]
Out[8]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

Slice 3: Slice Both

np-slice-3-slice-both.svg
In [9]:
x[1:3,2:5]
Out[9]:
array([[ 8,  9, 10],
       [14, 15, 16]])

Slice 4: Step

np-slice-4-step.svg
In [10]:
x[:,::2]
Out[10]:
array([[ 0,  2,  4],
       [ 6,  8, 10],
       [12, 14, 16]])

Slice 5: Step Both

np-slice-5-step-both.svg
In [11]:
x[::2,1::3]
Out[11]:
array([[ 1,  4],
       [13, 16]])

Quirk, omitting :

One can use : (colon) to select an entire dimension, the same way as one uses it to select all elements in a Python list. Thanks to the tuple-of-slices syntax that NumPy uses one can omit the : from the last dimension. Yet NumPy is just a Python library and must respect the Python syntax. There are some quirks as to when : can be omitted.

The following works, note that our variable is still a two dimensional array:

In [12]:
x[1]
Out[12]:
array([ 6,  7,  8,  9, 10, 11])

and is equivalent to

In [13]:
x[1,]
Out[13]:
array([ 6,  7,  8,  9, 10, 11])

and equivalent to

In [14]:
x[1,:]
Out[14]:
array([ 6,  7,  8,  9, 10, 11])

But the following one will not work. One cannot omit the : of earlier dimensions because a lone comma is not allowed in Python syntax and the NumPy code never sees it.

In [15]:
x[,1]
  File "<ipython-input-15-cefd0d14ff16>", line 1
    x[,1]
      ^
SyntaxError: invalid syntax

The correct way is to use : in the first dimension

In [16]:
x[:,1]
Out[16]:
array([ 1,  7, 13])

It is good practice to always use : explicitly to show that you are taking the full dimension. Omitting it works in NumPy because the array can be understood as a list of lists, and x[1] takes the list at index $1$, i.e. a row. When we get to pandas a single index will mean a column, so do not get too used to the idea of a matrix as a list of lists.

Modifying slices

Like Python lists, NumPy arrays can be modified in place. Moreover, as with Python lists, one can assign several values at once into a NumPy array. And since NumPy arrays can have more than one dimension, one can use one of the multidimensional slices we saw to assign to a specific set of values. Let's rebuild our array and assign several values at once.

In [17]:
x = np.arange(18).reshape((3, 6))
x
Out[17]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

We already saw the multidimensional slicing, we select all rows until row index $2$ (exclusive) and every other column.

In [18]:
x[:2,::2]
Out[18]:
array([[ 0,  2,  4],
       [ 6,  8, 10]])

The selection is an array with shape $(2, 3)$. If we produce an array of zeros with the same shape we can assign it directly to the slice.

In [19]:
x[:2,::2] = np.zeros((2, 3))
x
Out[19]:
array([[ 0,  1,  0,  3,  0,  5],
       [ 0,  7,  0,  9,  0, 11],
       [12, 13, 14, 15, 16, 17]])

This already seems quite powerful given what we saw about the slices earlier.
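
The shape requirement can be checked with a small experiment. In this sketch (the fill value -1 and the variable names are arbitrary choices for illustration) an array of the selection's shape is accepted, whilst an array with the transposed shape is rejected with a ValueError.

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
target_shape = x[:2, ::2].shape
print(target_shape)  # (2, 3)

# Same shape as the selection: the assignment succeeds.
x[:2, ::2] = np.full(target_shape, -1)
print(x[0])  # [-1  1 -1  3 -1  5]

# A (3, 2) array cannot be fitted into a (2, 3) selection.
try:
    x[:2, ::2] = np.zeros((3, 2))
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print("rejected:", err)
```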

There is a very important detail here: NumPy arrays are views on data, and two separate views of the same data do not each hold their own copy. This is useful for processing large amounts of data without copying it over and over. Yet views may result in bugs that are very difficult to find. To help prevent such issues, the array that originally creates the data is considered to own it. If data is copied, the copy owns its data; if data is not copied, the view does not own it. Ownership does not prevent another view from changing the data, but it lets the programmer know whether they are working with a copy or with a view into data that may be modified by other parts of the code. One array flag, named owndata, tells you whether an array owns its data (True) or is a view (False). To get a new array from a view one can use the copy method. Here we have an array that owns its data and two arrays that are views into the data of the first:

In [20]:
z = np.arange(18)
x = z.reshape((3, 6))
y = x[:2,::2]
z.flags.owndata, x.flags.owndata, y.flags.owndata
Out[20]:
(True, False, False)

One can be very surprised when a change in one view affects the data in another, and bugs that are very difficult to find can result. That is the price we pay for fast and memory efficient slicing. For example, we assign to the full slice of y here, but the data in x changes as well:

In [21]:
y[:] = np.zeros((2, 3))
x
Out[21]:
array([[ 0,  1,  0,  3,  0,  5],
       [ 0,  7,  0,  9,  0, 11],
       [12, 13, 14, 15, 16, 17]])
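
When the shared data is not wanted, the copy method mentioned earlier gives an independent array. The sketch below contrasts a view with a copy; np.shares_memory is used only as a convenient check, and the variable names are illustrative.

```python
import numpy as np

x = np.arange(18).reshape((3, 6))

view = x[:2, ::2]                # shares memory with x
independent = x[:2, ::2].copy()  # new array with its own data

print(view.flags.owndata, independent.flags.owndata)  # False True
print(np.shares_memory(x, view), np.shares_memory(x, independent))  # True False

# Writing to the copy leaves x untouched.
independent[:] = 0
print(x[0])  # [0 1 2 3 4 5]
```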

Concatenating and splitting

Concatenation can be performed in several ways. The main procedure is np.concatenate, which accepts an axis= parameter. The axis can be very confusing since it means different things across the PyData group of libraries. For now remember that in NumPy axis means a dimension of the array. In other words, the axis is the index into the shape over which we want to perform an operation.

To concatenate, arrays must match on all dimensions apart from the axis used.

In [22]:
x = np.arange(18).reshape((3, 6))
y = np.arange(12).reshape((2, 6))
np.concatenate((x, y), axis=0)
Out[22]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

Above the $6$ matches, and along axis $0$ we concatenate $3$ and $2$ rows. This may be hard to take in; read it several times to get used to the wording.

Below the $3$ matches and we concatenate $6$ and $4$ columns across axis $1$.

In [23]:
x = np.arange(18).reshape((3, 6))
y = np.arange(12).reshape((3, 4))
np.concatenate((x, y), axis=1)
Out[23]:
array([[ 0,  1,  2,  3,  4,  5,  0,  1,  2,  3],
       [ 6,  7,  8,  9, 10, 11,  4,  5,  6,  7],
       [12, 13, 14, 15, 16, 17,  8,  9, 10, 11]])
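
Looking only at shapes can make the matching rule easier to see. This is a minimal sketch with arrays of zeros (the contents do not matter here, and the variable names are illustrative): the sizes along the chosen axis add up, every other size must be equal, and a mismatch raises a ValueError.

```python
import numpy as np

x = np.zeros((3, 6))
rows = np.zeros((2, 6))
cols = np.zeros((3, 4))

# Along axis 0 the 3 and 2 add up; the 6 must match.
print(np.concatenate((x, rows), axis=0).shape)  # (5, 6)

# Along axis 1 the 6 and 4 add up; the 3 must match.
print(np.concatenate((x, cols), axis=1).shape)  # (3, 10)

# Along axis 0 the other dimension (6 vs 4) does not match.
try:
    np.concatenate((x, cols), axis=0)
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)  # True
```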

There are also np.vstack and np.hstack, which for two dimensional arrays are equivalent to axis=0 and axis=1 respectively.

In [24]:
x = np.arange(18).reshape((3, 6))
y = np.arange(12).reshape((2, 6))
np.vstack((x, y))
Out[24]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])

Remember that calling axis=0 a vertical stack only makes sense with two dimensional arrays. The same goes for calling axis=1 a horizontal stack. With more than two dimensions one must be careful to match the axis to the shape of the array.

In [25]:
x = np.arange(18).reshape((3, 6))
y = np.arange(12).reshape((3, 4))
np.hstack((x, y))
Out[25]:
array([[ 0,  1,  2,  3,  4,  5,  0,  1,  2,  3],
       [ 6,  7,  8,  9, 10, 11,  4,  5,  6,  7],
       [12, 13, 14, 15, 16, 17,  8,  9, 10, 11]])
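
With three dimensions the words vertical and horizontal stop being helpful, and it is easier to reason about the shape directly. A small sketch with arrays of zeros, where the shapes are arbitrary choices for illustration:

```python
import numpy as np

a = np.zeros((2, 3, 4))
b = np.zeros((2, 3, 5))

# Only axis 2 differs (4 vs 5), so that is the only axis we can join along.
joined = np.concatenate((a, b), axis=2)
print(joined.shape)  # (2, 3, 9)

# hstack still joins along axis 1 here, so axes 0 and 2 must match.
c = np.zeros((2, 7, 4))
stacked = np.hstack((a, c))
print(stacked.shape)  # (2, 10, 4)
```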

np.split separates the array into pieces. Can you tell how?

In [26]:
np.split(np.arange(9), 3)
Out[26]:
[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8])]

The split separates the array into equal pieces. Above we split the array into $3$ equal sized pieces, whilst below into $2$ equal sized pieces.

In [27]:
np.split(np.arange(6), 2)
Out[27]:
[array([0, 1, 2]), array([3, 4, 5])]

Similar to np.concatenate, np.split accepts an axis= argument, and there are np.vsplit and np.hsplit.
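
As a sketch of the axis= argument on a two dimensional array, the example below splits the columns of our usual $(3, 6)$ array into three pieces, and also shows that a split which cannot be made equal is rejected. Variable names are illustrative.

```python
import numpy as np

x = np.arange(18).reshape((3, 6))

# Split the 6 columns (axis 1) into 3 pieces of 2 columns each.
pieces = np.split(x, 3, axis=1)
print([p.shape for p in pieces])  # [(3, 2), (3, 2), (3, 2)]

# np.hsplit performs the same split on a two dimensional array.
same = np.hsplit(x, 3)
print(all((p == s).all() for p, s in zip(pieces, same)))  # True

# 9 values cannot be divided into 2 equal pieces.
try:
    np.split(np.arange(9), 2)
    uneven_rejected = False
except ValueError:
    uneven_rejected = True
print(uneven_rejected)  # True
```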

Note: There are also np.dstack and np.dsplit, which are equivalent to axis=2. Yet, we will not be delving into three dimensional arrays too often.
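
For completeness, a brief sketch of the third axis in action. Two small two dimensional arrays (built here only for illustration) are stacked in depth and then split back apart.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))
b = a + 10

# dstack treats each two dimensional array as shape (2, 3, 1)
# and joins them along the third axis.
stacked = np.dstack((a, b))
print(stacked.shape)  # (2, 3, 2)

# dsplit undoes it, splitting along axis 2 into 2 equal pieces.
first, second = np.dsplit(stacked, 2)
print(first.shape)  # (2, 3, 1)
print((first[:, :, 0] == a).all())  # True
```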