02.00 NumPy Basics

NumPy is an (or perhaps the) array oriented Python library meant for numerical computing.

But what actually is numerical computing? It is a set of techniques where numerical approximation is used, i.e. instead of attempting to find the exact result of an expression we build algorithms which will operate on approximations and find answers which are approximations themselves. Since numbers on digital computers have limited precision, calculating on approximations provides almost indistinguishable results. Moreover, on digital computers numbers are stored in a finite amount of memory, therefore to use a digital computer calculate anything that is not an approximation is only possible for trivial problems.

In numerical computing techniques we often need to perform the same operation over several numbers at once. This is done by computing on arrays of numbers, known as vectorial computing. A computer cannot perform an operation on several numbers at once (on a single CPU that is, but even a modern computer does not have many CPUs compared to the number of operations needed at once), instead vectorial computing is performed by creating a pointer to a position in the array and then we advance the pointer and execute the operation again. Vectorial computing is all about operations on several memory pointers.

Pointer

np-pointer.svg

Computing on arrays of numbers existed and has been optimized much before the conception of Numpy Yet, thanks to that, NumPy holds to a great deal of legacy of the numeric computing optimizations of the past. That said, if it ain't broke don't fix it. The legacy code that NumPy is built upon has been reliable for decades. If you run the same vectorial code (more on this in a moment) in NumPy, MATLAB or R, you will find that it takes pretty much the same time. This is because all these implementations actually use the same old and well tested code libraries.

NumPy really is a glue that binds the old vectorial computing optimizations with Python. And allows Python's quick prototyping dynamic syntax on top of that optimized vectorial computing.

Vectorial Computing

Is the idea that we can perform a computation on several values at once, conceptually that is. When a computer performs an operation it moves words (several bytes) into CPU registers and only then performs the operation. To figure out what operation needs to be performed (e.g. addition), the computer in turn perform other operations on memory addresses. For example, the CPU adds together memory offsets to figure out where in memory the operands for the addition are. But that does not mean that we can really perform things "at once"! Each operation is still performed in sequence. What makes vectorial computing fast is that the CPU does not need to figure out every time what operation to perform, all operations will be the same and the CPU can be told to just keep doing the same thing over and over on new numbers.

In other words, the general idea above is to minimize the overhead of adjunct processing during the computation of several operations "at once". We align the operands and then tell the CPU to always move just a single word (or double/triple word) before performing the next operation. The speedup results of performing the operations that way are astonishing.

Unfortunately, it turns out that Python is terrible at memory aligning values. This is mostly because Python's data types are sparse, i.e. python uses pointers and metadata to define the actual value of a type and its extra properties. Thanks to these properties Python is easy to use, easy to learn, and programmer friendly - think ducktyping. Enters NumPy array which attempts to bring the best of both worlds: easy of use and fast vectorial computing.

Memory Usage

np-memory-usage.svg

The heavy metadata of Python types is reduced in NumPy arrays. It has its trade-offs of course. For example, Python lists allows for lists of mixed types, NumPy arrays do not.

To be fair fixed array types do exist in plain Python but are not nearly as powerful, or as widely used, as NumPy arrays.

NumPy Arrays

Let's try this out, we import NumPy and create a bunch of arrays. The convention is that NumPy is imported into the current namespace as np, This is a strong convention and often documentation will omit the import.

In [1]:
import numpy as np

Creating an array from a list is easy. Yet, by doing so we are throwing away some of the power NumPy gives us.

In order to build the array from the list we need to build the entire lit first. A python list uses a lot more memory that a NumPy array with the same content. Since the list and the array would be on the same computer with the same memory, we limit the size of the array by the memory used by the list.

In [2]:
np.array([7, 6, 9, 11, 12])
Out[2]:
array([ 7,  6,  9, 11, 12])

Thanks to the memory limitations we have more efficient methods for building arrays. For example, np.zeros produces an array of a certain size filled with zeros.

There is also np.ones and np.empty, which, respectively, create an array filled with ones and an array filled with whatever is currently lurking in the assigned memory. The np.empty function requires a bit more explanation: it is the faster way to create an array because is does not actually fill the computer memory it uses, instead it just assign a piece of memory that happens to be free. That piece of memory may contain values from a previously executing program. In other words, we cannot know what an array created by np.empty may contain. If one wants to be sure of contents np.zero is slower but safer.

In [3]:
np.zeros(6)
Out[3]:
array([0., 0., 0., 0., 0., 0.])

Of course, one may need more complicated arrays than ones and zeros. In the same way that Python has a range function, NumPy has an array range: arange.

The call works in the same fashion as in plain Python:

  • an inclusive start value,
  • an exclusive end value,
  • and a value for the step.
In [4]:
np.arange(2, 17, 2)
Out[4]:
array([ 2,  4,  6,  8, 10, 12, 14, 16])

np.arange works well for integer numbers. Yet when one needs to work with floating point numbers, finding the correct value for the step may be very tricky. Floating point numbers are imprecise, and summing lots of imprecise numbers will increase the imprecision.

A classic example of floating point number imprecision is to attempt to find a range step for a list of $75$ numbers between $0$ and $100$. The step should be $1.3333...$ (a hundred divided by seventy five), a number that cannot be represented in floating point precision. This imprecision, when seen across all the $75$ numbers is quite significant.

In [5]:
sum([100/75]*75)
Out[5]:
99.99999999999991

A common need for arrays of floating point numbers is to execute a function (in the mathematical sense) over all values in an interval. Of course the this is impossible on a digital computer, an interval has an infinite number of numbers in it. But one can create enough values inside the interval and then execute the function on each of those values.

A perfect problem for the range family of functions. Yet if one of the sides of the interval lacks precision because of the range step one may end in trouble. NumPy comes to the rescue with the np.linspace function.

In [6]:
np.linspace(1, 3, 10)
Out[6]:
array([1.        , 1.22222222, 1.44444444, 1.66666667, 1.88888889,
       2.11111111, 2.33333333, 2.55555556, 2.77777778, 3.        ])

The linspace created $10$ values between $1$ and $3$, inclusive. The space was divided in a way that the $10$ numbers are evenly spaced, and we never needed to calculate the (impossible) step value between them.

Let's end with a slightly different example. Instead of giving a plain value to generate an array, we give a tuple of values.

In [7]:
np.ones((2, 3))
Out[7]:
array([[1., 1., 1.],
       [1., 1., 1.]])

That last one is more than a list, one can call it a matrix. NumPy arrays are allowed to have more than one dimension, where one dimension is what we often call a list, two dimensions a matrix, three dimensions a cube, and so on.

By generating an array with more dimensions we receive the total number of generated values equal to the product of all values in the given tuple. This leads us to the shape of and array, one of the array attributes.

Array Attributes

The single metadata for the array contains information about the array, including:

  • memory consumption estimates
  • data types (more on this later)
  • methods that can be executed over all elements of the array ("at once")
  • static metadata, where the shape of the array is the most important

The shape is important because without it an array would always have just on dimension. In memory, the NumPy array is simply a very long string of values one after another. The shape is a list of numbers that defines offsets at which one dimension starts and another ends. It will be easier if we see it, let's create some arrays of different shapes. We will also index the arrays, we will come back to indexing later with full detail.

In [8]:
x = np.arange(6)
x, x.shape, x[2]
Out[8]:
(array([0, 1, 2, 3, 4, 5]), (6,), 2)

1D Array

np-1d-array.svg

It is the type of array we looked at up to now, it is pretty similar to a list in plain Python. The shape says that after walking $6$ numbers we end up at the end of the array.

In this case the numbers are integers, and on most computers today it means that they occupy $8$ bytes each. This means that the entire array is $6$ times $8$ bytes of memory long. We will see data types soon, for now let's just say that every integer element is $8$ bytes in memory.

Enters the idea of reshaping an array, the idea that we can change the shape of an array without changing the array itself. We will look at an array of $18$ elements that can be seen either as a long array or a matrix $3$ by $6$.

In [9]:
x = np.arange(18).reshape((3, 6))
x, x.shape, x[1, 3]
Out[9]:
(array([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]]),
 (3, 6),
 9)

2D Array

np-2d-array.svg

Whichever way we look at it the array is $18$ times $8$ ($144$) bytes long. In other words the array is just sequential $144$ bytes of memory plus metadata.

If the shape of the array is $18$ then it has only one dimensions. But since we reshaped the array to $(3, 6)$ we can understand it as a matrix of $3$ rows and $6$ columns. That said, in memory, the array is still $144$ sequential bytes. Reshaping an array does not change how the array is represented in memory, it only changes how we access the elements of the array.

If we want to access element $(1, 3)$ of the array, the value $9$ in this case (remember that in Python arrays start indexing from zero), we need to: Walk one full row ($6$ elements), and then walk another $3$ elements. During indexing the new shape serves as a multiplier to access the elements of the matrix. The index $(1, 3)$ becomes $1 \cdot 6 + 3$ because we know from the shape that each row has $6$ elements.

The above is quite simple for a matrix but NumPy can deal with as many dimensions as one wishes. Let us see an example of dealing with more dimensions.

In [10]:
x = np.arange(36).reshape((2, 3, 6))
x, x.shape, x[1, 2, 2]
Out[10]:
(array([[[ 0,  1,  2,  3,  4,  5],
         [ 6,  7,  8,  9, 10, 11],
         [12, 13, 14, 15, 16, 17]],
 
        [[18, 19, 20, 21, 22, 23],
         [24, 25, 26, 27, 28, 29],
         [30, 31, 32, 33, 34, 35]]]),
 (2, 3, 6),
 32)

3D Array

np-3d-array.svg

An extra dimension is added to the left of the shape metadata. This way the right-hand side number in the shape continues to be the row, the second number from right is the column, and the next number is the quantity of repetition of the matrix.

We will try to reach the value $32$ in the cube, which is at index $(1, 2, 2)$. Had we not reshaped the array, the element would be at the index in memory $32$. Now, on our reshaped array we need to use the index $(1, 2, 2)$: first jump over to the matrix below, then to the last row, and finally to the element in the row. We do it by multiplying each part of the index by all values of the shape to the right of it. Note that the index will always have the same number of values as the shape of the array. Therefore we go $1 \cdot 3 \cdot 6 + 2 \cdot 6 + 2$; which turns to be $32$, the index used on the single dimension array.

The take away is that the NumPy array is just a long sequence of values and the dimensions are just clever tricks to make that sequence behave like a matrix, or a cube, or even more dimensions. No matter how you create the array, in memory it always has just one dimension and its shape is just a trick to reach the indexes in a more human readable way. And no matter how or how many times you reshape the array, the memory layout does not change. Also, since the array size never changes one can only reshape it to values of shape that contain the full array. In the case above we could reshape into a cube because $2 \cdot 3 \cdot 6 = 36$.

The concept of keeping the memory layout fixed and only change the metadata (shape in the cases we saw) becomes very important if the array is big. We can change attributes of the array without moving large amounts of data in memory. Moving values in memory is computationally cheap but when one needs to move millions of values it becomes slow enough to be perceptible,

Other useful metadata on the array are:

  • size - the flattened length of the array
  • ndim - number of dimensions, equivalent to len(shape)
  • itemsize - bytes occupied by each item in the array
  • nbytes - memory use, estimated, equivalent to itemsize * size
  • dtype - the data type of items in the array

Contrary to shape most of these cannot be changed. The dtype (data type) can be changed but it requires to copy over every element in the array. In order to avoid extra computation (again important if the arrays are big) one needs to create the array with the data type one expects to use it as.

Data Types

Not only numbers can be placed in an array, although you will want to use it with numbers most of the time. One can place any Python object in a NumPy array, creating an array of Python objects (and a dtype of object). Although a NumPy array of Python objects works almost like a Python list, in both easy of use and memory size, defeating the optimizations of NumPy.

Even if we use numbers alone, there are several ways to encode a number. If you remember the discussion about numerical computing, deciding the precision of numbers is often needed. Better precision or smaller memory footprint is a trade-off between better answers and speed. Some data types in NumPy follow:

Data type Description
bool_ Boolean (True or False) stored as a byte
int_ Default integer type (same as C long; normally either int64 or int32)
intc Identical to C int (normally int32 or int64)
intp Integer used for indexing (same as C ssize_t; normally either int32 or int64)
int8 Byte (-128 to 127)
int16 Integer (-32768 to 32767)
int32 Integer (-2147483648 to 2147483647)
int64 Integer (-9223372036854775808 to 9223372036854775807)
uint8 Unsigned integer (0 to 255)
uint16 Unsigned integer (0 to 65535)
uint32 Unsigned integer (0 to 4294967295)
uint64 Unsigned integer (0 to 18446744073709551615)
float_ Shorthand for float64.
float16 Half precision float: sign bit, 5 bits exponent, 10 bits mantissa
float32 Single precision float: sign bit, 8 bits exponent, 23 bits mantissa
float64 Double precision float: sign bit, 11 bits exponent, 52 bits mantissa
complex_ Shorthand for complex128.
complex64 Complex number, represented by two 32-bit floats (real and imaginary components)
complex128 Complex number, represented by two 64-bit floats (real and imaginary components)

We take the list from the NumPy documentation. When one creates an array one can specify the data type with dtype=. If one does not specify a data type NumPy will attempt to decide based on the computer architecture and function executed.

In [11]:
x = np.arange(18).reshape((3, 6))
x.dtype
Out[11]:
dtype('int64')

On the majority of personal computers today the default integer will be int64, and integer the size of the computer word on 64bit CPUs. This is $8$ bytes, each byte is formed of $8$ bits. Using the word size for numbers is another optimization: a CPU is best at processing values encoded in the size of its word.

That said one may need to save space and is willing to sacrifice precision. If one uses only small (up to $2^8$) positive integers as data, then one may opt for encoding as uint8 (unsigned - which means positive only - integer encoded over $1$ byte). Now one can place much bigger arrays in the same amount of memory.

In [12]:
x = np.arange(18, dtype=np.uint8).reshape((3, 6))
x, x.dtype
Out[12]:
(array([[ 0,  1,  2,  3,  4,  5],
        [ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17]], dtype=uint8),
 dtype('uint8'))

A smaller encoding size uses less memory and copying the values becomes faster too. Yet it does not necessarily mean faster computing. A CPU still uses full word operations, and needs to copy its entire integer size o operate on it.

And we also have sizes for floating point numbers, be it simple (real) or complex. A complex number is composed of two floating point numbers on a computer. Other functions, here np.linspace, accept dtype= in the same way as np.arange.

In [13]:
x = np.linspace(1, 5, 9, dtype=np.float16).reshape((3, 3))
x, x.dtype
Out[13]:
(array([[1. , 1.5, 2. ],
        [2.5, 3. , 3.5],
        [4. , 4.5, 5. ]], dtype=float16),
 dtype('float16'))

The CPU word limitations do not necessarily work for floating point numbers. Floating point numbers are computed on their own subsystem on modern CPUs and NumPy is compiled in a way to decide the best default floating point number size for the CPU.

In the majority of cases CPUs today use int64 and float64 as the defaults. But at the time of writing there are CPUs out there in the wild where both defaults do not match (one of them is $64$ and another is $32$), therefore do not be surprised if that happens.