We have seen a good deal of the capabilities of NumPy, yet a big chunk still remains. We saw how to cleverly build and select from arrays, and how the axis and shape are used to perform operations across specific dimensions.
Now we jump into operations on the contents of the arrays. This comes in two parts: we can operate on the values in the array - a rather easy operation with NumPy arrays - or we can operate on the indexes into the array - an operation that can become mind-boggling but also the one that gives NumPy its full power. By indexing data in an array we form a new view, and we can keep creating more and more complex views. As long as we keep operating on views and indexes we do not need to copy or move the data in memory, which gives NumPy its computing speed and efficient memory management. For a start let's import our library.
import numpy as np
After all the talk about executing several operations at once we finally dive into the procedures that execute these operations. Universal functions, ufuncs for short, are optimized operations that perform arithmetic on NumPy arrays. Under the hood the operations are optimized loops in compiled code, but from Python the operations appear to happen all at once.
To perform operations on NumPy arrays one adds, subtracts, multiplies, or divides them as if dealing with plain numbers. Behind the scenes NumPy does its magic and performs the operation across the entire array. Let's take an array:
x = np.arange(6).reshape((2, 3))
x
We can perform some algebra on the entire array and the operation happens over all its elements.
x + 7
Numeric type conversion (e.g. from integers to floating point numbers) is allowed but discouraged.
x / 3.14
One must remember that we are working with fixed memory locations. For example, attempting a type-converting operation with in-place assignment will fail:
x /= 3.14
This is because one cannot change the type of an array by assigning to it: the converted type may need a different amount of memory to hold it.
As in Python, type conversion is upwards, towards floating point numbers, even if the result could be converted back to integers.
x // 3.14
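A small sketch of the workaround, assuming a converted copy is acceptable: astype allocates a new array of the desired type, after which in-place operations work.

```python
import numpy as np

x = np.arange(6).reshape((2, 3))
x += 7               # fine: the integer result fits the integer array
y = x.astype(float)  # explicit conversion allocates a new float array
y /= 3.14            # in-place division now works on the float array
```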
The full list of ufuncs is quite long; some of the common ones are:
ufunc | operator | description |
---|---|---|
np.add | + | Add arguments element-wise. |
np.subtract | - | Subtract arguments element-wise. |
np.multiply | * | Multiply arguments element-wise. |
np.divide | / | Returns a true division of the inputs, element-wise. |
np.floor_divide | // | Return the largest integer smaller or equal to the division of the inputs. |
np.negative | - | Numerical negative, element-wise. |
np.power | ** | First array elements raised to powers from second array, element-wise. |
np.mod | % | Return element-wise remainder of division. |
np.sin | | Trigonometric sine, element-wise. |
np.cos | | Cosine element-wise. |
np.tan | | Compute tangent element-wise. |
np.arcsin | | Inverse sine, element-wise. |
np.arccos | | Inverse cosine, element-wise. |
np.arctan | | Inverse tangent, element-wise. |
Note that not all operations are linked to an operator. Those can be used by directly invoking the ufunc on the array, e.g. np.sin(x).
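A minimal check of this equivalence between operators and ufuncs (the values used here are only for illustration):

```python
import numpy as np

x = np.arange(6).reshape((2, 3))
# The operator and the ufunc perform exactly the same operation.
assert np.array_equal(x + 7, np.add(x, 7))
assert np.array_equal(x * 2, np.multiply(x, 2))
# The trigonometric ufuncs have no operator, so they are invoked directly.
sines = np.sin(x)
assert sines.shape == x.shape
```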
And there are aggregation functions too; some of them are:
aggregation function | nan-safe | description |
---|---|---|
np.sum | np.nansum | Compute sum of elements |
np.prod | np.nanprod | Compute product of elements |
np.mean | np.nanmean | Compute mean of elements |
np.average | | Compute the weighted average |
np.median | np.nanmedian | Compute median of elements |
np.std | np.nanstd | Compute standard deviation |
np.var | np.nanvar | Compute variance |
np.min | np.nanmin | Find minimum value |
np.max | np.nanmax | Find maximum value |
What is different about aggregation functions is that they again accept the axis= parameter. And again, the tricky part is that the argument means the axis along which the aggregation will take place. This is best seen through examples:
x = np.arange(18).reshape((3, 6))
x
np.sum(x, axis=None)
np.sum(x, axis=0)
np.sum(x, axis=1)
As with concatenation, if more dimensions are used then the concepts of rows and columns become fuzzy. With more dimensions one needs to remember that the number given for the axis is the index of that dimension in the shape of the array.
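A sketch with a three dimensional array (the shape here is an arbitrary choice) shows how the axis number indexes into the shape:

```python
import numpy as np

x = np.arange(24).reshape((2, 3, 4))
# Aggregating over an axis collapses that entry of the shape.
assert np.sum(x, axis=0).shape == (3, 4)
assert np.sum(x, axis=1).shape == (2, 4)
assert np.sum(x, axis=2).shape == (2, 3)
```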
Apart from algebraic arithmetic, index arithmetic is important in NumPy. We can use an array of indexes to index another array. This is a concept that may not seem overly useful at first sight, but it is one of the most powerful concepts in vectorial computing: one can work on the indexes of the data whilst the data itself is kept immobile in memory.
Let us look at an example with a simple array:
x = np.arange(18).reshape((3, 6))
x
We will index the array with two other arrays. The result is made of the elements of the original array at positions $(1, 1)$, $(2, 1)$ and $(2, 3)$.
x[np.array([1, 2, 2]), np.array([1, 1, 3])]
In the previous example we could have used plain Python lists, but one can also use multidimensional arrays as indexes. For example:
x[np.array([[1, 2], [1, 2]]), np.array([[1, 1], [3, 1]])]
Above we got the value at index $(2, 1)$ twice.
We can extend that concept. We can use the : operator to ask for everything in one of the indexing positions. In this case we will ask for all columns.
x[np.array([1, 1]), :]
This way we selected the entire second row twice. Operating on indexes is a powerful tool.
Since we can use arrays to index arrays, we can create arrays that work as masks. With fancy indexing we used arrays of integers as indexes. A mask is a boolean-only array of the same size as the array being indexed, and we retrieve (index) only the entries for which the mask is true. But first, how do we generate a boolean array?
x = np.arange(18).reshape((3, 6))
x < 12
This array can then be used as a mask that selects all values where a true is present.
x[x < 12]
The result's shape is one dimensional because we cannot tell beforehand how many True values may be found.
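One handy consequence, sketched below: summing a boolean mask counts the True entries, which is exactly the length of the masked selection.

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
mask = x < 12
assert mask.sum() == 12        # number of True entries in the mask
assert x[mask].shape == (12,)  # the selection has that many elements
```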
On the other hand, we can use a mask in a specific index position to select parts of the array while retaining its dimensionality. Here we will select specific rows:
x[[True, False, True], :]
We can use the | (or) and the & (and) operators to combine masks. Unfortunately, in plain Python these are bitwise operators with very high precedence, therefore we need to add parentheses around the comparison expressions.
x[(x < 12) & (x > 3)]
One can retain the shape by masking columns or rows only. To build row and column based masks one can use np.all and np.any.
x[np.all((x < 12) | (x < -2), axis=1), :]
The boolean operators are a form of NumPy ufuncs.
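A sketch of np.any for column masks, mirroring the np.all row mask above:

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
# Keep a column if at least one of its values is above 13.
cols = x[:, np.any(x > 13, axis=0)]
assert cols.shape == (3, 4)  # only columns 2 to 5 contain values above 13
```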
Finally, since Python loops are slow compared to NumPy operations, NumPy has a use for boolean arrays as a substitute for if-else constructs inside loops. Given a boolean array, np.where chooses from two arrays based on the boolean values. For example:
np.where([True, False, True, True, True], [7]*5, [2]*5)
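np.where is more often given a condition on the array itself; a small sketch:

```python
import numpy as np

x = np.arange(18).reshape((3, 6))
# Keep values below 12 and replace the rest with zero; the shape is retained.
clipped = np.where(x < 12, x, 0)
assert clipped.shape == x.shape
assert clipped.max() == 11
```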
Contrary to most operations in NumPy, sorting an array is performed in place. A priori, the in-place nature of sorting appears to contradict the NumPy attempt at not copying or moving data already in memory. But we will soon see that there are ways of sorting without moving the data.
Also, contrary to the aggregations we saw, axis=None cannot be used. The default axis for sorting is axis=-1. As in Python indexing, $-1$ in the NumPy axis means the highest possible value. For two dimensional arrays axis=-1 is equivalent to axis=1, which is sorting across columns.
x = np.array([
[7, 7, 3, 1],
[2, 3, 1, 1],
[2, 3, 1, 2],
])
x.sort(axis=-1)
x
Even though we cannot use axis=None, we can easily simulate such sorting by first reshaping the array.
y = x.reshape(-1)
y.sort()
y
And since sorting is in place we can see that the original array has been modified. Note as well that this modification is a good proof that in NumPy all values are stored sequentially in memory.
x
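When a sorted copy is what we want rather than in-place sorting, the module-level np.sort returns a new array and leaves the original alone:

```python
import numpy as np

x = np.array([7, 1, 3, 2])
s = np.sort(x)  # a sorted copy
assert np.array_equal(s, [1, 2, 3, 7])
assert np.array_equal(x, [7, 1, 3, 2])  # the original is untouched
```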
Sorting in place is all good, but if we have a big piece of data that we want to use for several purposes, and only one of those purposes is sorting, we may not want to copy all the data just to sort the copy in place. Moreover, sorting on a view may scramble the data in the original place in memory. To the rescue come the arg ufuncs of NumPy.
Another way to obtain a sorted array, instead of sorting it in place, is to use argsort. argsort will return the indexes needed to sort the array but will not actually sort the array itself.
Have a look at the image below. The sorting happens when the arrows cross each other on the left hand side; the right hand side just assigns the arrows into the sequential positions. What argsort does is to perform the sorting but not the assignment: it returns the indexes with which indexing the original array will return a sorted array.
x = np.array([7, 1, 3, 2])
x.argsort()
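Indexing the array with its own argsort result gives the sorted sequence without having modified the original data:

```python
import numpy as np

x = np.array([7, 1, 3, 2])
idx = x.argsort()                            # indexes, not values
assert np.array_equal(idx, [1, 3, 2, 0])
assert np.array_equal(x[idx], [1, 2, 3, 7])  # fancy indexing sorts
assert np.array_equal(x, [7, 1, 3, 2])       # x itself never moved
```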
In the same way as in-place sorting, argsort uses axis=-1 as the default sorting axis for multidimensional arrays. For example, here we get the indexes needed across columns to return a sorted array in the same fashion.
x = np.array([
[7, 7, 3, 1],
[2, 3, 1, 1],
[2, 3, 1, 2],
])
x.argsort()
To actually sort the multidimensional array we will use fancy indexing. But we require one more concept from NumPy: how to force the dimensionality of arrays. In order to change a one dimensional row array into a two dimensional column array we use np.newaxis to index the array on the dimension we want to add.
np.arange(3)[:, np.newaxis]
np.newaxis is just a fancy name for None, and you will often see None used in place of np.newaxis.
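A quick check of that equivalence:

```python
import numpy as np

a = np.arange(3)
assert np.newaxis is None                # literally the same object
assert a[:, np.newaxis].shape == (3, 1)
assert a[:, None].shape == (3, 1)
```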
Nevertheless, we can now use our multidimensional argsort to sort the array without moving the data.
x[np.arange(3)[:, np.newaxis], x.argsort(axis=-1)]
And in order to sort across rows we invert the indexing order.
x[x.argsort(axis=0), np.arange(4)]
Combining fancy indexing, arg ufuncs, and axis arguments results in incredibly powerful operations.
Sorting is not the only operation that has arg ufuncs; several aggregations allow for such ufuncs, including argmin and argmax.
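A short sketch of argmin and argmax on the array used above; note that without an axis the returned index is into the flattened array.

```python
import numpy as np

x = np.array([
    [7, 7, 3, 1],
    [2, 3, 1, 1],
    [2, 3, 1, 2],
])
assert x.argmax() == 0  # the first 7 sits at flat position 0
assert x.argmin() == 3  # the first 1 sits at flat position 3
assert np.array_equal(x.argmax(axis=1), [0, 1, 1])  # per-row positions
```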
Sorting is an expensive operation, and often we only need a part of the data sorted. Another common sorting procedure is partition/argpartition, which sorts only part of the data. For example, to retrieve the two smallest elements from an array we partition it with $2$ as the argument.
x = np.array([6, 2, 3, 7, 1, 3, 2])
x[x.argpartition(2)]
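To actually extract the two smallest values one takes the first two positions of the partition; they are the smallest, although in no guaranteed order between themselves.

```python
import numpy as np

x = np.array([6, 2, 3, 7, 1, 3, 2])
idx = np.argpartition(x, 2)
# Everything before position 2 is smaller or equal to the element there.
assert sorted(x[idx[:2]]) == [1, 2]
```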
There is a little more to ufuncs and arithmetic with arrays than what we have talked about until now. Ufuncs, indexing, concatenation, axis and shape together abide by a concept called broadcasting in NumPy.
When you add (or subtract, or multiply) a number against an array, the number is applied to all elements; we know that this happens because of ufuncs. We say that the number is broadcast against all elements.
x = np.arange(18).reshape((3, 6))
x[0,:] + 42
But we are not limited to broadcasting single numbers only. As long as arrays share parts of their shape they can be broadcast together.
If, for any reason, two arrays need to be operated on together, be it fancy indexing or ufunc arithmetic, the arrays are broadcast together. Broadcasting is a concept: the full broadcast structure is not stored in memory, but the internal process NumPy performs appears as if the arrays were broadcast before being operated on. For all practical purposes one can assume that when a broadcast happens the array is fully broadcast in memory. Some examples follow:
x[:,2:4] + np.arange(1, 3)
In order to create a column vector we need to use np.newaxis.
x[:,2:4] + np.arange(1, 4)[:, np.newaxis]
The row vector example is equivalent to the following. Part of the broadcasting procedure is to extend the shape with an axis at the front of the array; for finer control one can do it explicitly.
x[:,2:4] + np.arange(1, 3)[np.newaxis, :]
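The broadcast shapes of the two examples can be sketched explicitly (the arrays here are illustrative):

```python
import numpy as np

a = np.arange(6).reshape((3, 2))
row = np.arange(1, 3)                 # shape (2,), treated as (1, 2)
col = np.arange(1, 4)[:, np.newaxis]  # shape (3, 1)
assert (a + row).shape == (3, 2)      # the row repeats down the 3 rows
assert (a + col).shape == (3, 2)      # the column repeats across the 2 columns
```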
np.c_ and np.r_ are NumPy constructs that seem very complicated at first, but once you use them long enough they become second nature. It is important to notice that these two NumPy variables are function-like objects that are not called as functions; instead we use indexing notation to get their functionality.
np.c_ concatenates as many arrays as given over the last axis. It forces all arrays to have at least $2$ dimensions, hence it makes it easy to add columns to arrays.
x = np.arange(12).reshape((3, 4))
np.c_[x, [1, 2, 3], [1, 1, 1], [0, 0, 0], x]
np.r_ is what np.c_ uses behind the scenes. np.r_ is very powerful but may be quite hard to grasp. It may receive a string of three numbers; in order these mean: the axis to concatenate over, the minimum number of dimensions to force all arrays to, and the axis in which to place the existing dimension of the arrays given. That already sounds quite confusing, but then come all the remaining indexes into np.r_.
These indexes may be arrays but may also be slices. The slices generate new arrays. If the step of the slice is an integer then np.arange is used to generate the array; if the step is a complex number (j in Python) then np.linspace is used instead.
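A quick check of both slice behaviors:

```python
import numpy as np

# A complex step asks for that many points, i.e. np.linspace.
assert np.allclose(np.r_[1:7:10j], np.linspace(1, 7, 10))
# An integer step is a plain np.arange.
assert np.array_equal(np.r_[0:6:2], np.arange(0, 6, 2))
```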
The most common use for np.r_ is to quickly generate data. We will have a look at one such use here: we generate each column of the following array as a $10$ point np.linspace.
n = 10j
np.r_['1,2,0', 1:7:n, 0:3:n, 10:12:n]
These constructs are quite complex indeed, yet one needs to be aware of them when reading NumPy code. Both np.c_ and np.r_ are very convenient once harnessed, and the only way to harness them is practice.