A summary of the Python features that we will use follows. This is by no means an extensive tutorial of the Python language; it is just a quick flyover of the features that we will need in order to deal with some machine learning. Think of it as a refresher of what you already know about Python.
In general, the following is structured so that anyone who understands another programming language can follow the Python features we will need. We will make analogies to other programming languages you may know. If you struggle with this section I'll need to ask you to brush up on your programming.
Python was originally built as an object oriented language, yet it wanted to compete with Perl, a language heavily used for quick scripting. Python succeeded by making functions first class citizens and by not depending on object oriented patterns for everything (note though that under the hood a Python function is an object).
A function starts after the def statement and ends when it executes an explicit return statement, finishes execution without reaching a return, or an exception is raised through the function. Unlike in many other programming languages, the return statement is not limited to a single value: one can return several values at once (Python packs them into a tuple) or no value at all (the caller then receives None).
The following are all valid function definitions:
def do_nothing():
    pass

def do_nothing_as_well():
    return None

def with_args(cat, pig):
    return 'Cat %s, pig %s' % (cat, pig)

def return_tuple(cat, pig):
    return 'Cat %s' % cat, 'Pig %s' % pig

print(do_nothing())
print(do_nothing_as_well())
print(with_args('is hungry', 'escaped'))
print(return_tuple('is hungry', 'escaped'))
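As a small follow-up sketch (the function name min_max is made up for illustration and is not part of the examples above), a function that returns several values really returns one tuple, which the caller can keep whole or unpack into separate names:

```python
def min_max(values):
    """Return the smallest and the largest element as a 2-tuple."""
    return min(values), max(values)

# keep the returned tuple as a single object
result = min_max([3, 1, 4, 1, 5])
print(type(result).__name__, result)

# or unpack it into two names in one assignment
lowest, highest = min_max([3, 1, 4, 1, 5])
print(lowest, highest)
```

Unpacking only works when the number of names on the left matches the number of returned values.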
You can provide optional (default) keyword arguments to functions. This is Python's way of giving different signatures or constructors to the same function or method, where other languages would use overloading.
Optional arguments are marked by an assignment (equal sign) inside the def statement, right after the name of the defaulted argument. The value after the equal sign is what the argument defaults to when the caller does not supply it.
All non-defaulted arguments must come before the defaulted/optional arguments.
Examples:
def status(cat='is hungry'):
    return 'Cat %s' % cat

def neighbours_cat(neighbour, status='hungry'):
    # the format string already contains "is", so the default does not repeat it
    return '%s cat is %s' % (neighbour, status)

print(status())
print(status('well fed'))
print(neighbours_cat("Upstair's"))
print(neighbours_cat("'round the corner's", 'well fed'))
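A further sketch of how one definition with defaults can stand in for several signatures. The function describe and its parameters are hypothetical names chosen for this illustration:

```python
def describe(name, mood='hungry', sound='meow'):
    """Build a short description; mood and sound are optional."""
    return '%s is %s and says %s' % (name, mood, sound)

print(describe('Aubrey'))                     # both defaults used
print(describe('Aubrey', 'sleepy'))           # mood given by position
print(describe('Aubrey', sound='purr'))       # skip mood, set sound by name
print(describe(sound='hiss', name='Aubrey'))  # keyword order does not matter
```

Passing an optional argument by name lets the caller skip the defaults that come before it.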
Since Python is a dynamic language, it is possible to call the same function in several ways. A function call is performed by evaluating all arguments in the call and then matching the resulting positional and keyword arguments against the signature of the function. A function call is resolved as follows:
- positional arguments are assigned, in order, to the parameters in the signature
- keyword arguments are assigned to the parameters with the same name
- if the signature has a *<arg> argument the remaining positional arguments are passed there (as a tuple)
- if the signature has a **<arg> argument the remaining keyword dictionary is passed there
By convention the argument for extra positional arguments is often called *args, and the argument for extra keyword arguments is called **kwargs or **kw. Yet that is not a very strong convention, and if better readability can be achieved by giving these variables better names that is accepted.
For example, here we use non-conventional names to check where
we can buy which brand of cat food:
def can_eat(cat, brand='felix'):
    print(cat, 'eats', brand, 'food')

def cat_food_brands(market, *brands):
    print('In', market, 'we found the following brands of cat food:')
    for brand in brands:
        print(brand)

def deliver_cat_food(address, **quantity):
    print('Delivery to', address)
    for b, q in quantity.items():
        print(q, 'cans of', b)

can_eat('my cat', 'whiskas')
print('-' * 30)
can_eat('my cat', brand='wheats')
print('-' * 30)
cat_food_brands('Tesco', 'felix', 'whiskas', 'wheats')
print('-' * 30)
cat_food_brands("Sainsbury's", 'whiskas', 'sainsbury')
print('-' * 30)
deliver_cat_food('Northampton Square', whiskas=7, felix=3)
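The two starred forms can also be combined in one signature, and the same stars can be used at the call site to unpack a sequence or a dictionary. A minimal sketch, where the function order and the variables wanted and amounts are made-up names for illustration:

```python
def order(shop, *brands, **quantities):
    """Return the arguments as received, to show where each one lands."""
    return shop, brands, quantities

# extra positional arguments end up in a tuple, extra keywords in a dictionary
print(order('Tesco', 'felix', 'whiskas', felix=3))

# at the call site, * unpacks a sequence and ** unpacks a dictionary
wanted = ['felix', 'whiskas']
amounts = {'felix': 3, 'whiskas': 7}
print(order('Tesco', *wanted, **amounts))
```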
Despite its object oriented origin, Python did fall in love with functional patterns. The idea of a functional execution of programs originated in LISP (LISt Processing), and is based on operations such as map and filter. Python supports the map and filter functions as built-ins, but it also comes with a syntax called list comprehension.
List comprehensions are often easier to read and shorter to write than their equivalents with map and filter. In the standard CPython interpreter they also tend to run faster than hand-coded sequences of map and filter with lambda functions, most of the time, because they avoid an extra function call per element.
Below are a couple of list comprehensions, with their LISP-like counterparts in the code comments:
numbers = list(range(10))
print('numbers:', numbers)
odd = [x for x in numbers if x % 2 == 1]
# list(filter(lambda x: x % 2 == 1, numbers))
print('odd:', odd)
even_squared = [x*x for x in numbers if x % 2 == 0]
# list(map(lambda x: x*x, filter(lambda x: x % 2 == 0, numbers)))
print('even squared:', even_squared)
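To check that the two styles really agree, the sketch below runs the comprehensions next to their map and filter counterparts. The variable names are chosen here for illustration. Note that in Python 3 map and filter return lazy iterators, so they are wrapped in list before comparing:

```python
numbers = list(range(10))

odd_comprehension = [x for x in numbers if x % 2 == 1]
odd_functional = list(filter(lambda x: x % 2 == 1, numbers))

squares_comprehension = [x * x for x in numbers if x % 2 == 0]
squares_functional = list(map(lambda x: x * x,
                              filter(lambda x: x % 2 == 0, numbers)))

# both comparisons hold: the two styles build the same lists
print(odd_comprehension == odd_functional)
print(squares_comprehension == squares_functional)
```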
A single list comprehension is powerful but a combination of them makes for the full power of the functional paradigm. An example is in order.
Let's try to distribute cat food across several households in a way that makes most cats happy. Note that we will ignore the special preferences of each cat, e.g. a cat that likes "whiskas special" will need to make do with plain whiskas food, since we do not want to spend too much money on the whims of cats.
The code below uses the functional paradigm to distribute the cat food equally across the neighborhood cats.
The code here is complicated, and that is intended. A much easier way of solving this problem would be with several for loops. Yet, there is a good reason why we use list comprehensions here. Soon we will see vectorial computing libraries (e.g. NumPy) and the way they operate is very similar to the code below. Therefore try understanding the code here, even if it takes a while. One hint: iterating over a dictionary is the same as iterating over the result of its .keys() method, a feature of Python dictionaries that often confuses people coming from other languages.
from pprint import pprint

cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury']
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# for each cat, the foods in the drawer that match one of its preferences
preferences = dict(
    [(cat, [food for food in food_in_drawer if [x for x in cat_preferences[cat] if x.startswith(food)]])
     for cat in cat_preferences])
print('preferences')
pprint(preferences)
print('-' * 30)
# for each food, the number of cats that will share it
food_div = dict(
    [(food, len([cat for cat in cat_preferences if food in preferences[cat]]))
     for food in food_in_drawer])
print('food division')
pprint(food_div)
print('-' * 30)
# for each cat, an equal (integer) share of every food it accepts
rations = dict(
    [(cat, dict([(food, food_in_drawer[food] // food_div[food])
                 for food in food_in_drawer if food in preferences[cat]]))
     for cat in cat_preferences])
print('rations')
pprint(rations)
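For comparison, the text notes that several for loops would solve this problem more easily. A sketch of such a loop version follows, assuming it should reproduce the same three steps on the same data. The names ending in _loop are made up here so they do not clash with the variables above:

```python
cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury']
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# Step 1: which foods in the drawer does each cat accept?
preferences_loop = {}
for cat, liked in cat_preferences.items():
    preferences_loop[cat] = []
    for food in food_in_drawer:
        for item in liked:
            if item.startswith(food):
                preferences_loop[cat].append(food)
                break  # one matching preference is enough

# Step 2: how many cats share each food?
food_div_loop = {}
for food in food_in_drawer:
    food_div_loop[food] = 0
    for cat in cat_preferences:
        if food in preferences_loop[cat]:
            food_div_loop[food] += 1

# Step 3: give each cat an equal integer share of every food it accepts
rations_loop = {}
for cat in cat_preferences:
    rations_loop[cat] = {}
    for food in preferences_loop[cat]:
        rations_loop[cat][food] = food_in_drawer[food] // food_div_loop[food]

for cat, ration in rations_loop.items():
    print(cat, ration)
```

The loops are longer but each step is explicit, which can help when reading the nested comprehensions.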
This was an exercise in relational algebra, which is often used in NumPy and Pandas. If you have worked with SQL databases this was (hopefully) familiar to you to some extent. The idea of moving data around was similar to joining tables in a SQL database. Another thing one may notice from this example is my own fondness for cats.
In the code above we saw startswith, which is a string operation, i.e. an operation performed on string objects. Being able to handle strings is an important skill regardless of whether you are analyzing data, writing a web crawler or scripting your cat food delivery network. Let's have a look at some of these operations, specifically the operations that may be useful in data munging.
cat = 'Aubrey'
dog = 'Rose'
address = ' Northampton Square, Clerkenwell ' # note the spaces
print(cat.startswith('A'))
print(cat.endswith('y'))
print(cat.lower())
print(cat.upper())
print(', '.join([cat, dog]))
print('[' + address + ']')
print('[' + address.lstrip() + ']')
print('[' + address.rstrip() + ']')
print('[' + address.strip() + ']')
print(address.split())
print([x.strip(',') for x in address.split()])
For anything more complex, regular expressions are the way to go. Yet, we will cover very little of regular expressions, as they are a huge topic in themselves. Whenever one is needed, we will show a sample of regular expression syntax in that place.
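As a first taste, here is a minimal sketch using the standard re module on an address string like the one above. The patterns are illustrative choices, not a recommended way of parsing addresses:

```python
import re

address = '  Northampton Square,   Clerkenwell  '

# split on commas, swallowing any whitespace around them
parts = re.split(r'\s*,\s*', address.strip())
print(parts)

# collapse every run of whitespace into a single space
tidy = re.sub(r'\s+', ' ', address).strip()
print(tidy)

# find all words that start with a capital letter
words = re.findall(r'[A-Z][a-z]+', address)
print(words)
```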
Python is dynamically typed, i.e. types belong to values rather than to variable names, and the type of a value is only checked when an operation needs it. More specifically Python is duck-typed, which means that as long as an object (data type, data structure or even function) abides by a certain protocol it will work as the type intended for that protocol. In other words, as long as a data type behaves well enough like the intended data type for an operation, it will just work.
This also means that a function may receive completely different types of objects and act differently based on what it got. One example of such behavior can be outlined with:
CAT_NUM = 3

def divide_food(food):
    """Divides the food among cats, can receive a dictionary or list of 2-tuples"""
    # protocol check: anything without a keys method is converted to a dictionary
    if not hasattr(food, 'keys'):
        food = dict(food)
    # note: a dictionary passed in by the caller is modified in place
    for f in food:
        food[f] //= CAT_NUM
    return food

print(divide_food({'felix': 7, 'whiskas': 6}))
print(divide_food([('felix', 7), ('whiskas', 6)]))
Duck-typing, and protocol checking as in the function above, is heavily used throughout the Python data stack. Do not be surprised when we look at a function that works in a completely different manner when passed arguments of different types.
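Another sketch of the same idea: the function below never asks for a specific type, it only checks whether the argument offers an items method. The function count_cans and the class Drawer are hypothetical names invented for this illustration:

```python
def count_cans(stock):
    """Sum the cans in stock: a mapping or an iterable of (brand, cans) pairs."""
    if hasattr(stock, 'items'):   # behaves like a dictionary
        pairs = stock.items()
    else:                         # assume an iterable of 2-tuples
        pairs = stock
    return sum(cans for _, cans in pairs)


class Drawer:
    """Not a dict, but it offers an items method, which is all count_cans needs."""
    def __init__(self, **cans):
        self._cans = cans

    def items(self):
        return self._cans.items()


print(count_cans({'felix': 7, 'whiskas': 6}))
print(count_cans([('felix', 7), ('whiskas', 6)]))
print(count_cans(Drawer(felix=7, whiskas=6)))
print(count_cans((brand, 1) for brand in ['felix', 'whiskas', 'wheat']))
```

The last call passes a generator, which works because the function only iterates over its argument.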
Since functions are first class citizens in Python, nothing stops us from having variables that hold references to functions. And since we have references to functions, nothing stops us from referencing a function to which we did not give a name - an anonymous function.
Anonymous functions - or lambda functions - are functions without a given name (in Python, without a meaningful __name__ attribute). These are often used to pass simple functions around. A lambda function can only contain a single expression and has an implicit return. Whatever the single expression in the lambda function evaluates to is returned to the caller, even though no return statement is visible.
def named_function(food):
    return 'Cat ate %s' % food

anon_function = lambda food: 'Cat ate %s' % food

print(named_function('felix'))
print(anon_function('felix'))
print(named_function.__name__)
print(anon_function.__name__)
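A common place where lambda functions are passed around is the key argument of the built-in sorted. The sketch below reuses a small food dictionary like the one from the earlier example, and the variable names are chosen for illustration:

```python
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# sort (brand, cans) pairs by the number of cans, most cans first
by_cans = sorted(food_in_drawer.items(), key=lambda pair: pair[1], reverse=True)
print(by_cans)

# sort brand names by their length; sorted is stable, so ties keep their order
by_length = sorted(food_in_drawer, key=lambda brand: len(brand))
print(by_length)
```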
We will deal very little with the object oriented nature of Python, but we will need to know some bits about objects. An object is an encapsulation of state together with methods (functions) that operate on this state. In Python, object state and object methods live in different places in memory: the methods are stored once on the class, and the first argument to all normal methods of an object points to the actual state encapsulated by the current instance. By convention we use self as the name of the first argument of object methods, and this is a very strong convention.
After an object is constructed its __init__ method is invoked. It takes the self argument and then anything that we wish to store or use when constructing an instance of our object. Optional arguments are accepted and encouraged within the definition of __init__; these optional arguments achieve what other languages accomplish with multiple constructors.
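To make the role of self and of optional __init__ arguments concrete, here is a minimal sketch. The class Bowl is a hypothetical example invented for this purpose:

```python
class Bowl:
    """A tiny class used only to show how self is passed around."""
    def __init__(self, cans=0):
        # the optional argument lets us build an empty or a pre-filled bowl
        self.cans = cans

    def add(self, cans):
        self.cans += cans
        return self.cans


bowl = Bowl()
print(bowl.add(2))         # the instance is passed as self automatically
print(Bowl.add(bowl, 3))   # the same kind of call with self passed by hand
print(bowl.__dict__)       # the state lives on the instance
print('add' in Bowl.__dict__, 'add' in bowl.__dict__)  # the method lives on the class

full_bowl = Bowl(cans=10)  # the second "constructor", via the optional argument
print(full_bowl.cans)
```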
A Python function is actually an object. The def statement simply defines an object which has a __call__ method; this method is invoked when the object is called (by placing brackets after it). Dictionaries and lists are just Python objects too; they define the __getitem__ method, which is invoked when an item is accessed with square brackets. In Python these dunder (double underscore) methods define the protocols of the basic objects.
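The dunder methods can also be called by hand, which shows what the brackets do behind the scenes. A short sketch, with names chosen only for illustration:

```python
def greet(name):
    return 'Hello %s' % name

print(greet('Aubrey'))
print(greet.__call__('Aubrey'))        # what the round brackets do

legs = ['front left', 'front right', 'back left', 'back right']
print(legs[1])
print(legs.__getitem__(1))             # what the square brackets do

food = {'felix': 6}
print(food['felix'], food.__getitem__('felix'))

# a slice such as legs[0:2] reaches __getitem__ as a slice object
print(legs.__getitem__(slice(0, 2)))
```

In everyday code one uses the brackets, not the dunder names; calling them directly is only useful for seeing the protocol at work.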
What follows is an example of a multi-protocol object, with a __getitem__ similar to that of the multidimensional array object which we will see when we learn about NumPy. At first sight NumPy-like objects seem very strange, but the example here hopefully clears up some confusion by showing that those objects are just Python. After we see NumPy I encourage you to come back here and look at this object again.
Note: do not worry if you do not understand what is happening below, we will not explicitly cover it. On the other hand, if you know Python well and are interested in what goes on behind the scenes in the data manipulation libraries, this object outlines it.
class Cat(object):
    def __init__(self, greeting='Meaow!', legs=4):
        self.greeting = greeting
        self.legs = legs
        self.fed = True

    def is_hungry(self):
        return not self.fed

    def feed(self):
        self.fed = True

    def __call__(self):
        # a fed cat greets once and then becomes hungry again
        if self.fed:
            print(self.greeting)
            self.fed = False

    def __getitem__(self, key):
        """
        This one is pretty complicated - this is how NumPy and Pandas work under the hood.
        If you really want to go deep try figuring out what it does and how it does it.
        """
        if isinstance(key, slice):
            return 'Do not slice me!'
        elif isinstance(key, int):
            return min(abs(key), self.legs)
        else:
            return key

cat = Cat('Mieau!')
print('Hungry:', cat.is_hungry())
cat()
print('Hungry:', cat.is_hungry())
cat()  # is hungry, will not greet
cat.feed()
cat()
print('List slice:', cat[1:3:2])
print('List access:', cat[1])
print('Too many legs:', cat[7])
print('Dictionary access:', cat['are you my cat?'])
print('Arbitrary access:', cat[1:7:2, 'fur', 3])
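The last call above shows that a comma inside the square brackets reaches __getitem__ as a tuple. A sketch of how a container could use that for two-dimensional indexing follows. The class Grid is hypothetical and much simpler than a real NumPy array; it is not how NumPy is implemented:

```python
class Grid:
    """A toy 2D container, only meant to show how a tuple key can be handled."""
    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, key):
        if isinstance(key, tuple):          # grid[r, c] arrives as the tuple (r, c)
            row_key, col_key = key
            rows = self.rows[row_key]
            if isinstance(row_key, slice):  # several rows: index each of them
                return [row[col_key] for row in rows]
            return rows[col_key]
        return self.rows[key]               # grid[r] or grid[a:b]


grid = Grid([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]])
print(grid[1])        # a whole row
print(grid[1, 2])     # a single element
print(grid[0:2, 1])   # one column from the first two rows
print(grid[:, 0:2])   # the first two columns of every row
```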
Finally, if anything in the sections above - except perhaps the last section about objects - was too much for you, do have a look at one of the several extensive resources for learning more about Python. Knowing Python well will only be of benefit to anyone wishing to do things with data and/or machine learning.
The list of Python resources below is by no means comprehensive. That said, I find the resources below to be the best available ones at the time of writing.