02.03 Array Algebra

We have seen a good deal of the capabilities of NumPy, yet a big chunk still remains. We saw how to cleverly build and select from arrays, and how the axis and shape are used to perform operations across specific dimensions.

Now we jump into operations on the contents of the arrays. This comes in two parts: we operate on the values in the array - a rather easy operation with NumPy arrays - or we operate on the indexes into the array - an operation that can become mind-boggling but also the one that gives NumPy its full power. By indexing data in an array we form a new view, and we can keep creating more complex views. As long as we keep operating on views and indexes we do not need to copy or move the data in memory, which gives NumPy optimal computing speed and memory management. To start, let's import our library.

In [1]:
import numpy as np


After all the talk about executing several operations at once, we finally dive into the procedures that execute these operations. Universal functions, ufuncs for short, are optimized operations that perform arithmetic on NumPy arrays. Under the hood the operations are assembly-optimized loops, but from Python the operations appear to happen all at once.

To perform operations on NumPy arrays one adds, subtracts, multiplies, divides, or applies other operations, as if dealing with plain numbers. Behind the scenes NumPy does its magic and performs the operation across the array. Let's take an array:

In [2]:
x = np.arange(6).reshape((2, 3))
x
array([[0, 1, 2],
       [3, 4, 5]])

We can perform some algebra on the entire array and the operation happens over all its elements.

In [3]:
x + 7
array([[ 7,  8,  9],
       [10, 11, 12]])

Numeric type conversion is allowed but discouraged.

In [4]:
x / 3.14
array([[0.        , 0.31847134, 0.63694268],
       [0.95541401, 1.27388535, 1.59235669]])

One must remember that we are working with fixed memory locations. For example, attempting a number conversion with an in-place assignment will fail:

In [5]:
x /= 3.14
TypeError                                 Traceback (most recent call last)
<ipython-input-5-857ac9c9eb2b> in <module>
----> 1 x /= 3.14

TypeError: ufunc 'true_divide' output (typecode 'd') could not be coerced to provided output parameter (typecode 'l') according to the casting rule ''same_kind''

This is because one cannot change the type of an array by assigning to it: the converted type may need a different amount of memory.
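A quick sketch of the workaround: rebinding the name, instead of assigning in place, creates a brand new float array, so no in-place type change is needed.

```python
import numpy as np

x = np.arange(6).reshape((2, 3))
# Rebinding creates a new float array instead of writing into the
# integer array's memory; this works where `x /= 3.14` fails.
x = x / 3.14
print(x.dtype)
```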

As in Python, the type conversion is upwards, towards floating point numbers, even if the result could be converted back to integers.

In [6]:
x // 3.14
array([[0., 0., 0.],
       [0., 1., 1.]])

The full list of ufuncs is quite big but some are:

ufunc            operator  description
np.add           +         Add arguments element-wise.
np.subtract      -         Subtract arguments element-wise.
np.multiply      *         Multiply arguments element-wise.
np.divide        /         Return a true division of the inputs, element-wise.
np.floor_divide  //        Return the largest integer smaller or equal to the division of the inputs.
np.negative      -         Numerical negative, element-wise.
np.power         **        First array elements raised to powers from second array, element-wise.
np.mod           %         Return element-wise remainder of division.
np.sin                     Trigonometric sine, element-wise.
np.cos                     Cosine element-wise.
np.tan                     Compute tangent element-wise.
np.arcsin                  Inverse sine, element-wise.
np.arccos                  Inverse cosine, element-wise.
np.arctan                  Inverse tangent, element-wise.

Note that not all operations are linked to an operator. You can use them by directly invoking the ufunc on the array, e.g. np.sin(x).
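A small sketch of both calling styles: ufuncs without an operator are invoked directly on the array, and ufuncs with an operator can be invoked either way.

```python
import numpy as np

x = np.arange(4)
# np.sin has no operator, so it is always called directly.
print(np.sin(x))
# np.power is the same operation as the ** operator.
print(np.power(x, 2))
```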

And there are aggregation functions too, some of them:

aggregation function  nan-safe      description
np.sum                np.nansum     Compute sum of elements
np.prod               np.nanprod    Compute product of elements
np.mean               np.nanmean    Compute mean of elements
np.average                          Compute the weighted average
np.median             np.nanmedian  Compute median of elements
np.std                np.nanstd     Compute standard deviation
np.var                np.nanvar     Compute variance
np.min                np.nanmin     Find minimum value
np.max                np.nanmax     Find maximum value

What is different about aggregation functions is that they again accept the axis= parameter. Again, the tricky part is that the argument means the axis along which the aggregation will take place. This is best seen through examples:

In [7]:
x = np.arange(18).reshape((3, 6))
x
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

Aggregation Axis None

In [8]:
np.sum(x, axis=None)
153

Aggregation Axis Zero

In [9]:
np.sum(x, axis=0)
array([18, 21, 24, 27, 30, 33])

Aggregation Axis One

In [10]:
np.sum(x, axis=1)
array([15, 51, 87])

The same as with concatenation, if more dimensions are used then the concepts of rows and columns become fuzzy. With more dimensions one needs to remember that the number given as the axis is the index of that dimension in the shape of the array.
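A quick sketch of this in three dimensions: the axis number is simply the index into the shape tuple of the dimension collapsed by the aggregation.

```python
import numpy as np

x = np.arange(24).reshape((2, 3, 4))
# Summing over an axis removes that index from the shape.
print(np.sum(x, axis=0).shape)  # the first dimension vanishes
print(np.sum(x, axis=2).shape)  # the last dimension vanishes
```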

Fancy Indexing

Apart from algebraic arithmetic, index arithmetic is important in NumPy. We can use an array of indexes to index another array. This is a concept that does not seem overly useful at first sight, but it is one of the most powerful concepts in vectorized computing: one can work on the indexes of the data whilst the data itself is kept immobile in memory.

Let us look at an example with a simple array:

In [11]:
x = np.arange(18).reshape((3, 6))
x
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])

We will index the array with two other arrays. The result are the elements of the original array in positions $(1, 1)$, $(2, 1)$ and $(2, 3)$.

In [12]:
x[np.array([1, 2, 2]), np.array([1, 1, 3])]
array([ 7, 13, 15])

In the previous example we could have used Python lists but one can also use multidimensional arrays as indexes. For example:

In [13]:
x[np.array([[1, 2], [1, 2]]), np.array([[1, 1], [3, 1]])]
array([[ 7, 13],
       [ 9, 13]])

Above we got the value at index $(2, 1)$ twice. We can extend that concept, and use the : operator to ask for everything in one of the indexing positions. In this case we will ask for all columns.

In [14]:
x[np.array([1, 1]), :]
array([[ 6,  7,  8,  9, 10, 11],
       [ 6,  7,  8,  9, 10, 11]])

This way we selected the entire second row twice. Operating on indexes is a powerful tool.

Boolean Logic and Masks

Since we can use arrays to index arrays, we can create arrays that work as masks. With fancy indexing we used arrays of integers as indexes. A mask is a boolean-only array of the same size as the array being indexed, and we retrieve (index) only the entries for which the mask is true. But first, how do we generate a boolean array?

In [15]:
x = np.arange(18).reshape((3, 6))
x < 12
array([[ True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True],
       [False, False, False, False, False, False]])

This array can then be used as a mask that selects all values where a true is present.

In [16]:
x[x < 12]
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

The result's shape is one dimensional because we cannot tell in advance how many true values may be found.

On the other hand, we can use the mask on a specific index position to select parts of the array. Here we will select specific rows:

In [17]:
x[[True, False, True], :]
array([[ 0,  1,  2,  3,  4,  5],
       [12, 13, 14, 15, 16, 17]])

We can use the | (or) and the & (and) operators. Unfortunately, in plain Python these are bitwise operators and have very high precedence, therefore we need to add parentheses around the comparison expressions.

In [18]:
x[(x < 12) & (x > 3)]
array([ 4,  5,  6,  7,  8,  9, 10, 11])

One can retain the shape by masking columns or rows only. To build row and column based masks one can use np.all and np.any.

In [19]:
x[np.all((x < 12) | (x < -2), axis=1), :]
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
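The column counterpart keeps the mask on the other axis; a small sketch keeping only the columns selected by np.any:

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
# np.any over axis=0 builds a per-column boolean mask, here keeping
# the columns that contain at least one value above 13.
cols = x[:, np.any(x > 13, axis=0)]
print(cols)
```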

The boolean operators are a form of NumPy ufuncs.

Finally, since plain Python loops are slow compared to NumPy operations, NumPy uses boolean arrays as a substitute for if-else constructs inside loops. Given a boolean array, np.where chooses from two arrays based on the boolean values. For example:

In [20]:
np.where([True, False, True, True, True], [7]*5, [2]*5)
array([7, 2, 7, 7, 7])
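np.where also accepts a condition built from the array itself; a short sketch replacing small values with zero while keeping the rest:

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
# Where the condition holds take 0, elsewhere keep the original value.
clipped = np.where(x < 12, 0, x)
print(clipped)
```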

Sorting and Indexing Again

Contrary to most operations in NumPy, sorting an array is performed in place. A priori, the in-place nature of sorting appears to contradict the NumPy effort not to copy or move data already in memory. But we will soon see that there are ways of sorting without moving the data.

Also contrary to the aggregations we saw, axis=None cannot be used when sorting in place. The default axis for sorting is axis=-1. As in Python indexing, $-1$ as a NumPy axis means the highest possible value. For two dimensional arrays axis=-1 is equivalent to axis=1, which sorts across columns.

In [21]:
x = np.array([
    [7, 7, 3, 1],
    [2, 3, 1, 1],
    [2, 3, 1, 2],
])
x.sort()
x
array([[1, 3, 7, 7],
       [1, 1, 2, 3],
       [1, 2, 2, 3]])

Even if we cannot use axis=None we can easily simulate such sorting by first reshaping the array.

In [22]:
y = x.reshape(-1)
y.sort()
y
array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 7, 7])

And since sorting is in place, we can see that the original array has been modified. Note as well that this modification is good proof that in NumPy all values are stored sequentially in memory.

In [23]:
x
array([[1, 1, 1, 1],
       [2, 2, 2, 3],
       [3, 3, 7, 7]])
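We can verify the shared memory explicitly; a quick sketch on a fresh array using np.shares_memory:

```python
import numpy as np

x = np.array([[7, 3], [1, 2]])
y = x.reshape(-1)
# The reshape is a view over the same memory, which np.shares_memory confirms.
print(np.shares_memory(x, y))
y.sort()
print(x)  # the original array has been reordered through the view
```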

The sorting is all good, but if we have a big piece of data that we want to use for several purposes, and only one of those purposes requires sorting, we may not want to copy all the data just to sort the copy in place. Moreover, sorting a view may mess up the data in the original place in memory. To the rescue come the arg ufuncs of NumPy.

A way to obtain a sorted array without sorting in place is to use argsort. The argsort ufunc returns the indexes needed to sort the array but does not actually sort it: it performs the sorting work yet skips the final assignment of values into their sequential positions. Indexing the original array with the indexes returned by argsort yields a sorted array.

In [24]:
x = np.array([7, 1, 3, 2])
x.argsort()
array([1, 3, 2, 0])



In the same way as in-place sorting, argsort uses axis=-1 as the default sorting axis for multidimensional arrays. For example, here are the indexes needed, across columns, to return a sorted array:

In [25]:
x = np.array([
    [7, 7, 3, 1],
    [2, 3, 1, 1],
    [2, 3, 1, 2],
])
x.argsort()
array([[3, 2, 0, 1],
       [2, 3, 0, 1],
       [2, 0, 3, 1]])

To actually sort the multidimensional array we will use fancy indexing. But we require one more concept from NumPy: how to force dimensionality of arrays. In order to change a one dimensional row array into a two dimensional column array we use np.newaxis to index the array on the dimension we want to add.

In [26]:
np.arange(3)[:, np.newaxis]
array([[0],
       [1],
       [2]])

np.newaxis is just a fancy name for None, and you will often see None used in place of np.newaxis.

Nevertheless, we can now use our multidimensional argsort to sort the array without moving the data.

In [27]:
x[np.arange(3)[:, np.newaxis], x.argsort(axis=-1)]
array([[1, 3, 7, 7],
       [1, 1, 2, 3],
       [1, 2, 2, 3]])

And in order to sort across rows we invert the indexing order.

In [28]:
x[x.argsort(axis=0), np.arange(4)]
array([[2, 3, 1, 1],
       [2, 3, 1, 1],
       [7, 7, 3, 2]])

Combining fancy indexing, arg ufuncs and axis arguments results in incredibly powerful operations. Sorting is not the only operation that has arg ufuncs; several aggregations allow for such ufuncs, including argmin and argmax.
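A small sketch with argmax, combined with fancy indexing in the same fashion as argsort:

```python
import numpy as np

x = np.array([[7, 1, 3], [2, 8, 0]])
rows = np.arange(x.shape[0])
top = x.argmax(axis=1)  # column index of each row's maximum
print(top)
print(x[rows, top])     # fancy indexing retrieves the maxima themselves
```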

Sorting is an expensive operation, and often we only need a part of the data sorted. Another common sorting procedure is partition/argpartition, which sorts only part of the data. For example, to retrieve the two smallest elements from an array we partition it with $2$ as the argument.

In [29]:
x = np.array([6, 2, 3, 7, 1, 3, 2])
np.partition(x, 2)
array([1, 2, 2, 7, 6, 3, 3])
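And argpartition is the index-returning twin; a quick sketch retrieving the three smallest values through their indexes:

```python
import numpy as np

x = np.array([6, 2, 3, 7, 1, 3, 2])
idx = np.argpartition(x, 2)
# The first three indexes point at the three smallest values,
# although in no guaranteed order among themselves.
print(np.sort(x[idx[:3]]))
```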


Broadcasting

There is a little more to ufuncs and arithmetic with arrays than what we have talked about until now. The ufuncs, indexing, concatenation, axis and shape together abide by a NumPy concept called broadcasting.

When you add (or subtract, or multiply) a number against an array, the number is added to all elements; we know this happens because of ufuncs. We say that the number is broadcast against all elements.

Broadcast Simple

In [30]:
x = np.arange(18).reshape((3, 6))
x[0,:] + 42
array([42, 43, 44, 45, 46, 47])

But we are not limited to broadcasting single numbers. As long as arrays share compatible parts of their shapes they can be broadcast together.

If, for any reason, two arrays need to be operated on together, be it through fancy indexing or ufunc arithmetic, the arrays are broadcast together. Broadcasting is a concept: the full broadcast structure is never stored in memory, but the internal process NumPy performs appears as if the arrays were broadcast before being operated on. For all practical purposes one can assume that when a broadcast happens the array is fully broadcast in memory. Some examples follow:

Broadcast Axis

In [31]:
x[:,2:4] + np.arange(1, 3)
array([[ 3,  5],
       [ 9, 11],
       [15, 17]])

In order to create a column vector we need to use np.newaxis.

In [32]:
x[:,2:4] + np.arange(1, 4)[:, np.newaxis]
array([[ 3,  4],
       [10, 11],
       [17, 18]])

The row vector example is equivalent to the following. Part of the broadcasting procedure is to extend the shape with a new axis at the front of the array; for finer control one can do it explicitly.

In [33]:
x[:,2:4] + np.arange(1, 3)[np.newaxis, :]
array([[ 3,  5],
       [ 9, 11],
       [15, 17]])
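The general rule, in a minimal sketch: two axes broadcast when they match or when one of them is $1$ (or missing).

```python
import numpy as np

# Shape (3, 1) against shape (4,) broadcasts to shape (3, 4):
# the length-1 axis is stretched, the missing axis is prepended.
col = np.arange(3)[:, np.newaxis]
row = np.arange(4)
grid = col + row
print(grid.shape)
print(grid)
```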


The Alexandra Palace was the birthplace of television. The British Broadcasting Corporation (the BBC) built its mast there in 1936 and started transmitting the first television channel in the world in the same year. The tower was not the first television transmitter; transmitters for television existed as early as the 1910s, but fierce competition with radio marginalized television technology for decades. The tower in Alexandra Palace made the first public television broadcast. And, despite several interruptions, including World War II, the tower is used to broadcast television channels to this day.

np.c_ and np.r_

These are NumPy constructs that seem very complicated at first, but once you use them long enough they become second nature. It is important to notice that these two NumPy variables are function-like objects that are not called as functions; instead, we use indexing notation to get their functionality.

np.c_ concatenates as many arrays as given over the last axis. It forces all arrays to have at least $2$ dimensions, hence it makes it easy to add columns to arrays.

In [34]:
x = np.arange(12).reshape((3, 4))
np.c_[x, [1, 2, 3], [1, 1, 1], [0, 0, 0], x]
array([[ 0,  1,  2,  3,  1,  1,  0,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  2,  1,  0,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  3,  1,  0,  8,  9, 10, 11]])

np.r_ is what np.c_ uses behind the scenes. np.r_ is very powerful but can be quite hard to grasp. It may receive a string of three numbers which mean, in order: the axis to concatenate over, the minimum number of dimensions to force all arrays to, and the axis on which to place the existing dimension of the arrays given.

That already sounds quite confusing, but then come all the remaining indexes into np.r_. These indexes may be arrays but may also be slices. The slices generate new arrays: if the step of the slice is an integer then np.arange is used to generate the array; if the step is a complex number (j in Python) then np.linspace is used instead.
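A minimal sketch of the integer-step behavior, without the three-number string:

```python
import numpy as np

# An integer slice step expands through np.arange; plain arrays and
# scalars are concatenated alongside, all over the last axis.
r = np.r_[0:5, 10, np.array([20, 21])]
print(r)
```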

The most common use for np.r_ is to quickly generate data. We will have a look at one such use here: we generate each column of the following array as a $10$ point np.linspace.

In [35]:
n = 10j
np.r_['1,2,0', 1:7:n, 0:3:n, 10:12:n]
array([[ 1.        ,  0.        , 10.        ],
       [ 1.66666667,  0.33333333, 10.22222222],
       [ 2.33333333,  0.66666667, 10.44444444],
       [ 3.        ,  1.        , 10.66666667],
       [ 3.66666667,  1.33333333, 10.88888889],
       [ 4.33333333,  1.66666667, 11.11111111],
       [ 5.        ,  2.        , 11.33333333],
       [ 5.66666667,  2.33333333, 11.55555556],
       [ 6.33333333,  2.66666667, 11.77777778],
       [ 7.        ,  3.        , 12.        ]])

These constructs are quite complex alright. Yet one needs to be aware of them when reading NumPy code. Both np.c_ and np.r_ are very convenient once harnessed, and the only way to harness them is practice.