01.03 PyData Primer

What follows is a summary of the Python features that we will use. This is by no means an extensive tutorial of the Python language; it is just a quick flyover of the basics that we will need in order to do some machine learning. Think of it as a refresher of what you already know about Python.

In general, the following is structured so that anyone who understands another programming language can follow the Python features we will need. We will make analogies to other programming languages you may know. If you struggle with this section, I will need to ask you to brush up on your programming.


Functions

Python was originally built as an object oriented language, yet it wanted to compete with Perl, which was heavily used for quick scripting. Python succeeded by making functions first class citizens, so that it does not depend on object oriented patterns for everything (note, though, that under the hood a Python function is an object).

A function starts at the def statement and ends when it executes an explicit return statement, reaches the end of its body without a return, or an exception propagates out of it. Unlike in many compiled programming languages, the return statement is not limited to a single value: one can return several values at once (they are packed into a tuple) or no value at all (the caller then receives None). The following are all valid function definitions:

In [1]:
def do_nothing():
    pass  # an empty body needs a placeholder statement

def do_nothing_as_well():
    return None

def with_args(cat, pig):
    return 'Cat %s, pig %s' % (cat, pig)

def return_tuple(cat, pig):
    return 'Cat %s' % cat, 'Pig %s' % pig

print(with_args('is hungry', 'escaped'))
print(return_tuple('is hungry', 'escaped'))
Cat is hungry, pig escaped
('Cat is hungry', 'Pig escaped')
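A short sketch of what the caller receives in each of these cases. The names min_max, no_return, low and high are made up for this illustration. Returning several values really returns a single tuple, which can be unpacked on assignment, and a function without a return evaluates to None.

```python
def min_max(values):
    # The comma builds a tuple; the caller receives one tuple object.
    return min(values), max(values)

def no_return():
    pass  # falls off the end of the body, so the call evaluates to None

result = min_max([3, 1, 2])
low, high = min_max([3, 1, 2])  # tuple unpacking on assignment

print(result)
print(low, high)
print(no_return() is None)
```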

Optional Arguments

You can provide optional (default) keyword arguments to functions. That is Python's way of giving different signatures or constructors to the same function or method. An optional argument is marked by an assignment (equal sign) inside the def statement, right after the argument name, followed by the value the argument defaults to. All non-defaulted arguments must come before the defaulted (optional) arguments. Examples:

In [2]:
def status(cat='is hungry'):
    return 'Cat %s' % cat

def neighbours_cat(neighbour, status='hungry'):
    return '%s cat is %s' % (neighbour, status)

print(status())
print(status('well fed'))
print(neighbours_cat("Upstair's"))
print(neighbours_cat("'round the corner's", 'well fed'))
Cat is hungry
Cat well fed
Upstair's cat is hungry
'round the corner's cat is well fed
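As a small sketch, a defaulted argument can be overridden either by position or by name, and Python rejects a definition that places a non-defaulted argument after a defaulted one. The function here is a variant of neighbours_cat written for this illustration, and the compile call is only a convenient way to show the syntax error without stopping the cell.

```python
def neighbours_cat(neighbour, status='hungry'):
    return '%s cat is %s' % (neighbour, status)

# positional, keyword and all-keyword calls bind the same arguments
a = neighbours_cat("Upstair's", 'well fed')
b = neighbours_cat("Upstair's", status='well fed')
c = neighbours_cat(status='well fed', neighbour="Upstair's")
print(a == b == c)

# a non-defaulted argument after a defaulted one is a syntax error
try:
    compile("def broken(status='hungry', neighbour): pass", '<example>', 'exec')
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)
```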

Function Arguments

Since Python is a dynamic language, it is possible to call the same function in several ways. A function call is performed by evaluating all arguments in the call and then matching the resulting positional and keyword arguments against the signature of the function. Conceptually, a function call is resolved as follows:

  1. From left to right all non-keyword arguments (positional arguments) are appended to a list
  2. All keyword arguments are placed inside a (keyword) dictionary
  3. The positional arguments fill the list of arguments of the function signature
  4. All non-filled keyword arguments in the signature are searched for in the keyword dictionary
  5. If the function has a *<arg> argument the remaining list of positional arguments is passed there
  6. If the function has a **<arg> argument the remaining keyword dictionary is passed there
  7. If the positional list and keyword dictionary are empty the function is called, otherwise an error is raised

By convention the argument for extra positional arguments is often called *args, and the argument for extra keyword arguments is called **kwargs or **kw. Yet that is not a very strong convention, and giving these variables more descriptive names is perfectly acceptable if it improves readability. For example, here we use non-conventional names to check where we can buy which brand of cat food:

In [3]:
def can_eat(cat, brand='felix'):
    print(cat, 'eats', brand, 'food')

def cat_food_brands(market, *brands):
    print('In', market, 'we found the following brands of cat food:')
    for brand in brands:
        print(' -', brand)

def deliver_cat_food(address, **quantity):
    print('Delivery to', address)
    for b, q in quantity.items():
        print(q, 'cans of', b)

can_eat('my cat', 'whiskas')
print('-' * 30)
can_eat('my cat', brand='wheats')
print('-' * 30)
cat_food_brands('Tesco', 'felix', 'whiskas', 'wheats')
print('-' * 30)
cat_food_brands("Sainsbury's", 'whiskas', 'sainsbury')
print('-' * 30)
deliver_cat_food('Northampton Square', whiskas=7, felix=3)
my cat eats whiskas food
------------------------------
my cat eats wheats food
------------------------------
In Tesco we found the following brands of cat food:
 - felix
 - whiskas
 - wheats
------------------------------
In Sainsbury's we found the following brands of cat food:
 - whiskas
 - sainsbury
------------------------------
Delivery to Northampton Square
7 cans of whiskas
3 cans of felix
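The resolution steps listed earlier can be observed directly with a function that simply returns what it received. The functions order and strict below are hypothetical, written only for this sketch. The first one accepts leftovers, the second one does not and therefore raises an error when arguments remain unmatched.

```python
def order(address, brand='felix', *extra, **options):
    # address and brand are filled from the positional arguments first;
    # leftovers go to extra (a tuple) and options (a dictionary)
    return address, brand, extra, options

print(order('Tesco'))
print(order('Tesco', 'whiskas', 'wheat', cans=3))

def strict(address, brand='felix'):
    return address, brand

try:
    strict('Tesco', 'whiskas', 'wheat')  # one positional argument too many
    leftover_rejected = False
except TypeError as error:
    leftover_rejected = True
    print('TypeError:', error)
```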

List Comprehensions

Despite its object oriented origin, Python did fall in love with functional patterns. The idea of functional program execution originated in LISP (LISt Processing) and is based on operations such as map and filter. Python supports map and filter as built-in functions, but it also comes with a syntax called list comprehension.

List comprehensions are often easier to read and shorter to write than their equivalents with map and filter. Also, a list comprehension runs as a tight loop that avoids calling a lambda for every element, which makes it faster than a hand-coded sequence of map and filter most of the time. Below are a couple of list comprehensions, with their Lisp-like counterparts in the code comments:

In [4]:
numbers = list(range(10))
print('numbers:', numbers)

odd = [x for x in numbers if x % 2 == 1]
# list(filter(lambda x: x % 2 == 1, numbers))
print('odd:', odd)

even_squared = [x*x for x in numbers if x % 2 == 0]
# list(map(lambda x: x*x, filter(lambda x: x % 2 == 0, numbers)))
print('even squared:', even_squared)
numbers: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
odd: [1, 3, 5, 7, 9]
even squared: [0, 4, 16, 36, 64]
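To check that the two styles really agree, here is a minimal sketch that runs both and compares the results. Note that in Python 3 map and filter return lazy iterators, so they need to be wrapped in list to obtain an actual list. The variable names are chosen only for this example.

```python
numbers = list(range(10))

odd_comp = [x for x in numbers if x % 2 == 1]
odd_func = list(filter(lambda x: x % 2 == 1, numbers))

even_sq_comp = [x * x for x in numbers if x % 2 == 0]
even_sq_func = list(map(lambda x: x * x,
                        filter(lambda x: x % 2 == 0, numbers)))

print(odd_comp == odd_func)
print(even_sq_comp == even_sq_func)

# map is lazy: nothing is computed until an element is requested
lazy = map(lambda x: x * x, numbers)
first_three = [next(lazy), next(lazy), next(lazy)]
print(first_three)
```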

Combining Comprehensions

A single list comprehension is powerful, but combining several of them unlocks the full power of the functional paradigm. An example is in order.

Let's try to distribute cat food across several households in a way that makes most cats happy. Note that we will ignore the special preferences of each cat, e.g. a cat that likes "whiskas special" will have to make do with plain whiskas food, since we do not want to spend too much money on the whims of cats. The code below uses the functional paradigm to distribute the cat food equally across the neighborhood cats. Note that iterating over a dictionary is the same as iterating over its keys (what its .keys() method returns); this is a feature of Python dictionaries that often confuses people coming from other languages.

The code here is complicated, and that is intended. A much easier way of solving this problem would be with several for loops. Yet, there is a good reason why we use list comprehensions here: soon we will see vectorized computing libraries (e.g. NumPy), and the way they operate is very similar to the code below. Therefore try to understand the code here, even if it takes a while. Some hints are:

  • The exact cat preferences are messy, first we filter them to have clean data;
  • We then figure out how many cats eat each type of food;
  • Finally we combine both constructs to divide the food across the cats.
In [5]:
from pprint import pprint

cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury']
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

preferences = dict(
    [(cat, [food for food in food_in_drawer if [x for x in cat_preferences[cat] if x.startswith(food)]])
        for cat in cat_preferences])
pprint(preferences)
print('-' * 30)
food_div =  dict(
    [(food, len([cat for cat in cat_preferences if food in preferences[cat]]))
        for food in food_in_drawer])
print('food division')
pprint(food_div)
print('-' * 30)
rations = dict(
    [(cat, dict([(food, food_in_drawer[food] // food_div[food])
                    for food in food_in_drawer if food in preferences[cat]]))
        for cat in cat_preferences])
pprint(rations, sort_dicts=False)  # keep insertion order (needs Python 3.8+)
{"'round the corner cat": ['felix', 'sainsbury'],
 'my cat': ['felix', 'whiskas', 'wheat'],
 "neighbour's cat": ['whiskas', 'wheat']}
------------------------------
food division
{'felix': 2, 'sainsbury': 1, 'wheat': 2, 'whiskas': 2}
------------------------------
{'my cat': {'felix': 3, 'whiskas': 5, 'wheat': 6},
 "neighbour's cat": {'whiskas': 5, 'wheat': 6},
 "'round the corner cat": {'felix': 3, 'sainsbury': 5}}

This was an exercise in relational algebra, which is often used in NumPy and Pandas. If you have worked with SQL databases, this was (hopefully) familiar to you to some extent: the idea of moving data around is similar to joining tables in an SQL database. Another thing one may notice from this example is my own fondness for cats.
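For comparison, here is a sketch of the same three steps (clean the preferences, count the cats per food, divide the food) written with plain for loops. It is only one possible way to write it, using the same data as the cat food example, and it arrives at the same rations.

```python
cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury'],
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# Step 1: clean the preferences, keeping only the foods we actually stock
preferences = {}
for cat, wishes in cat_preferences.items():
    preferences[cat] = []
    for food in food_in_drawer:
        for wish in wishes:
            if wish.startswith(food):
                preferences[cat].append(food)
                break  # one matching wish is enough for this food

# Step 2: count how many cats eat each food
food_div = {}
for food in food_in_drawer:
    food_div[food] = 0
    for cat in preferences:
        if food in preferences[cat]:
            food_div[food] += 1

# Step 3: divide each food equally among the cats that eat it
rations = {}
for cat, foods in preferences.items():
    rations[cat] = {}
    for food in foods:
        rations[cat][food] = food_in_drawer[food] // food_div[food]

print(rations)
```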



String Operations

In the code above we saw startswith, which is a string operation, i.e. an operation performed on string objects. Being able to handle strings is an important skill, independent of whether you are analyzing data, writing a web crawler or scripting your cat food delivery network. Let's have a look at some of these operations, specifically the ones that may be useful in data munging.

In [6]:
cat = 'Aubrey'
dog = 'Rose'
address = ' Northampton Square, Clerkenwell '  # note the spaces

print(', '.join([cat, dog]))
print('[' + address + ']')
print('[' + address.lstrip() + ']')
print('[' + address.rstrip() + ']')
print('[' + address.strip() + ']')
print(address.split())
print([x.strip(',') for x in address.split()])
Aubrey, Rose
[ Northampton Square, Clerkenwell ]
[Northampton Square, Clerkenwell ]
[ Northampton Square, Clerkenwell]
[Northampton Square, Clerkenwell]
['Northampton', 'Square,', 'Clerkenwell']
['Northampton', 'Square', 'Clerkenwell']

For anything more complex, regular expressions are the way to go. Yet, we will cover very little of regular expressions, as they are a huge topic in themselves. Whenever they are needed, we will show a sample of regular expression syntax in that place.
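Just to give a taste, here is a minimal sketch using the standard library re module on the same address string. The two patterns are examples chosen for this illustration, not a recommended way of parsing real addresses.

```python
import re

address = ' Northampton Square, Clerkenwell '

# split on any run of commas and/or whitespace
words = re.split(r'[,\s]+', address.strip())
print(words)

# find every word that starts with a capital letter
capitalized = re.findall(r'[A-Z][a-z]+', address)
print(capitalized)
```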

Data Types

Python is dynamically typed, i.e. the type of a value is only checked at run time, when an operation needs it. More specifically, Python is duck-typed, which means that as long as an object (data type, data structure or even function) abides by a certain protocol, it will work as the type intended for that protocol. In other words, as long as a data type behaves well enough as the intended data type for an operation, it will just work.

This also means that a function may receive completely different types of objects and act differently based on what it got. One example of such behavior can be outlined with:

In [7]:

def divide_food(food):
    """Divides the food among cats, can receive a dictionary or list of 2-tuples"""
    if not hasattr(food, 'keys'):
        food = dict(food)
    for f in food:
        food[f] //= 3
    return food

print(divide_food({'felix': 7, 'whiskas': 6}))
print(divide_food([('felix', 7), ('whiskas', 6)]))
{'felix': 2, 'whiskas': 2}
{'felix': 2, 'whiskas': 2}

Duck-typing, and protocol checking as in the function above, is heavily used throughout the Python data stack. Do not be surprised when we look at a function that works in a completely different manner when passed arguments of different types.
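As a sketch of a protocol at work, the hypothetical FoodShelf class below is not a dictionary at all, yet the built-in dict accepts it, because dict only asks for a keys method and square bracket access on its argument.

```python
class FoodShelf:
    """Hypothetical class: not a dict, but it has keys() and __getitem__."""

    def __init__(self):
        self._cans = [('felix', 7), ('whiskas', 6)]

    def keys(self):
        return [name for name, _ in self._cans]

    def __getitem__(self, name):
        for stored_name, count in self._cans:
            if stored_name == name:
                return count
        raise KeyError(name)

shelf = FoodShelf()
print(hasattr(shelf, 'keys'))  # the same check divide_food performs
as_dict = dict(shelf)          # dict() calls keys() and then shelf[key]
print(as_dict)
```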


Anonymous Functions

Since functions are first class citizens in Python, nothing stops us from having variables that hold references to functions. And since we have references to functions, nothing stops us from referencing a function to which we did not give a name - an anonymous function.

Anonymous functions - or lambda functions - are functions without a given name (in Python, their __name__ attribute is just the placeholder '<lambda>'). These are often used to pass simple functions around. A lambda function can only contain a single expression and has an implicit return: whatever the single expression evaluates to is returned to the caller, even though no return statement is visible.

In [8]:
def named_function(food):
    return 'Cat ate %s' % food

anon_function = lambda food: 'Cat ate %s' % food

print(named_function('felix'))
print(anon_function('felix'))
Cat ate felix
Cat ate felix
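A typical use of a lambda is as a throwaway argument to another function. The sketch below sorts cat food brands by the number of cans in stock and also shows the placeholder name a lambda carries. The dictionary and variable names are invented for this example.

```python
cans = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# the lambda is passed around without ever being given a name
by_count = sorted(cans, key=lambda brand: cans[brand])
print(by_count)

def named_square(x):
    return x * x

anon_square = lambda x: x * x

print(named_square.__name__)
print(anon_square.__name__)
```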


Objects

We will deal very little with the object oriented nature of Python, but we will need to know some bits about objects. An object is an encapsulation of state together with methods (functions) that operate on this state. In Python, object state and object methods live in different places in memory: the state lives on each instance while the methods live on the class, and the first argument to all normal methods of an object points to the actual state encapsulated by the current instance. By convention we use self as the name of the first argument of the object methods, and this is a very strong convention.

After an object is constructed, its __init__ method is invoked. It takes the self argument and then anything that we wish to store or use for building the instance. Optional arguments are accepted and encouraged within the definition of __init__; they accomplish what other languages do with multiple constructors.

A Python function is actually an object. The def statement simply defines an object that has a __call__ method, and this method is invoked when the object is called (by placing parentheses after it). Dictionaries and lists are just Python objects too; they define the __getitem__ method, which is invoked by square bracket access. In Python these dunder (double underscore) methods define the protocols of the basic objects.

What follows is an example of a multi-protocol object, with a __getitem__ similar to that of the multidimensional array object which we will see when we learn about NumPy. At first sight the NumPy-like objects seem very strange, but the example here hopefully clears up some confusion by showing that those objects are just Python. After we see NumPy I encourage you to come back here and look at this object again.

Note: do not worry if you do not understand what is happening below, we will not explicitly cover it. On the other hand, if you know Python well and are interested in what goes on behind the scenes in the data manipulation libraries, this object outlines it.
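The claim that calls and square brackets are just dunder methods can be verified in a few lines. This is only a sketch with made-up names; in everyday code one uses the normal syntax rather than calling the dunder methods by hand.

```python
def meow():
    return 'Meaow!'

# calling a function is the same as invoking its __call__ method
same_call = meow() == meow.__call__()
print(same_call)

foods = ['felix', 'whiskas']
cans = {'felix': 6}

# square bracket access is the same as invoking __getitem__
same_list_item = foods[1] == foods.__getitem__(1)
same_dict_item = cans['felix'] == cans.__getitem__('felix')
print(same_list_item, same_dict_item)

# a function is callable, a list is not
print(callable(meow), callable(foods))
```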

In [9]:
class Cat(object):

    def __init__(self, greeting='Meaow!', legs=4):
        self.greeting = greeting
        self.legs = legs
        self.fed = True

    def is_hungry(self):
        return not self.fed

    def feed(self):
        self.fed = True

    def __call__(self):
        # a fed cat meows when called; either way, calling it makes it hungry
        if self.fed:
            print(self.greeting)
        self.fed = False

    def __getitem__(self, key):
        """
        This one is pretty complicated - this is how NumPy and Pandas work under the hood.
        If you really want to go deep try figuring out what it does and how it does it.
        """
        if slice == type(key):
            return 'Do not slice me!'
        elif int == type(key):
            return min(abs(key), self.legs)
        else:
            return key

cat = Cat('Mieau!')
print('Hungry:', cat.is_hungry())
cat()  # is fed, will meaow
print('Hungry:', cat.is_hungry())
cat()  # is hungry, will not meaow
print('List slice:', cat[1:3:2])
print('List access:', cat[1])
print('Too many legs:', cat[7])
print('Dictionary access:', cat['are you my cat?'])
print('Arbitrary access:', cat[1:7:2, 'fur', 3])
Hungry: False
Mieau!
Hungry: True
List slice: Do not slice me!
List access: 1
Too many legs: 4
Dictionary access: are you my cat?
Arbitrary access: (slice(1, 7, 2), 'fur', 3)

Finally, if anything in the sections above - with the possible exception of the last section on objects - was too much for you, do have a look at one of the several extensive resources for learning more about Python. Knowing Python well can only benefit anyone wishing to work with data and/or machine learning.

The list of Python resources below is by no means comprehensive. That said, I find the resources below to be the best ones available at the time of writing.

Extra Resources