We will look at both data analytics and machine learning, so let's start by defining the difference between them. In data analytics we are interested in understanding the process that generated the data, whilst in machine learning we want to replicate the process that created the data without necessarily understanding it. The reality is, as always, somewhere in between: when doing data analysis one may perform machine learning, and vice versa.
In the middle ground we have the fancy title of the professional performing both things, data analytics and machine learning: the data scientist. Our final objective is to describe to a good extent what a data scientist works with and what tools he uses, and to get some experience with these tools. Often a data scientist is described as follows:
For a start, let's extend our understanding of the concepts we just outlined. Take a dataset that describes some phenomenon, e.g. the number of people (or chickens) crossing the street at a given point.
A typical session of data analysis is normally composed of the following steps (a minimal sketch in code follows the list):
Statistics - to check for consistency in the data.
Data manipulation - in order to remove inconsistencies and prepare for the next step.
Visualization - to attempt to understand what the data describes.
Rinse and Repeat
Extract knowledge - once a good visualization gives us a good understanding, we argue that we can extract knowledge from that understanding, knowledge which may produce a model of the original phenomenon.
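Such a session might look as follows in code. This is only a sketch: the file crossings.csv and its hour and people columns are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the (hypothetical) dataset of street crossings
df = pd.read_csv("crossings.csv")

# statistics - check for consistency in the data
print(df.describe())
print(df.isna().sum())

# data manipulation - remove the inconsistencies found above
df = df.dropna()
df = df[df["people"] >= 0]  # a negative count is a recording error

# visualization - attempt to understand what the data describes
df.groupby("hour")["people"].mean().plot(kind="bar")
plt.xlabel("hour of day")
plt.ylabel("mean number of people crossing")
plt.show()
```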
Machine learning development, on the other hand, would be composed of the following (again, a minimal sketch in code follows the list):
Statistics and Linear Algebra - to check for inconsistencies and to format the data so that it can be fed to certain algorithms.
Data manipulation - manually and also in an automatic fashion.
Program reproducible scenarios - before building the model we think about how to evaluate it, and set aside some of the data for the evaluation.
Build model - in an automated fashion. We do not necessarily understand the phenomenon that generated the data, but we build a model that reproduces it.
Validate model - to ensure that our model is not reproducing the behaviors of the data by pure chance.
Rinse and Repeat - which can be automated to some extent.
Reuse model for the same problem on a bigger scale - often we build the model in order to use it on a problem bigger than the sample that was collected in the dataset.
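In code, a minimal version of this loop could look like the sketch below. It uses scikit-learn on a made-up numeric dataset; the feature matrix X and labels y stand in for whatever data was actually collected.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# made-up data: 200 samples with 2 features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# reproducible scenario: withhold part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# build the model in an automated fashion
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# validate: make sure the model is not right by pure chance
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# reuse the model on new, bigger data (here just more random points)
new_data = rng.normal(size=(1000, 2))
predictions = model.predict(new_data)
```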
Data Science is kind of a combination of both of the above. The person performing data analytics and machine learning is often called a Data Scientist. Typically, one will first try to tackle a difficult problem by data analysis alone and then, if the problem cannot be solved that way, one attempts machine learning. Examples of problems that are too hard for pure data analysis:
Often, but not always, the difference between the use of plain analysis and machine learning is the scale of the problem. Yet, are there problems that can be solved by data analysis but cannot be solved by machine learning?
Bus watcher problem: Imagine that one needs to classify bus routes traveling through the M25 (the ring road around London) according to the company running each route, but all the data he can gather comes from observing a stretch of the road for 8 hours a day. He may be able to read bus destinations and transport company logos, but logos may be misleading, since bus companies often exchange routes between themselves.
The bus watcher would need several weeks of data to have a good chance of guessing routes whilst accounting for vehicle and route exchanges between the bus companies. Yet bus companies do not sit idle during this period: by the time the bus watcher has collected his data, the data is already out of date. In summary, the real world problem - the bus schedule - changes its pattern too fast to be captured by the data collected.
Problems where the phenomenon (the distribution of bus routes) changes faster than our ability to collect the data are not problems that one should attempt; attempting them is often called a bad experimental setup. One could try instead to capture the pattern of how the bus companies operate, but that is a very, very difficult problem: one can argue that we would be attempting to model the human behavior behind company contract changes. This new problem is not only overly difficult but also has nothing to do with the original problem of finding the bus schedules.
That said, these kinds of problems are often not taken on by a data scientist. Or at least, a clever data scientist recognizes when he is asked to work on a problem that changes too fast because of meddling people (or some other highly complex system). In theory, such a complex problem (human behavior in setting contracts) can be solved if one has enough data; yet the amount of data and computation needed would be overwhelming. Telling a solvable problem from an unsolvable one is an art.
What is a Data Scientist then? That is difficult to encompass, so let's have a look at some interactions of a data scientist (DS) with other, more commonly known job titles: a systems administrator (SA), a software developer (SD) and a project manager (PM). Some of these happened to yours truly, although all names have been modified and any link to real people is mere coincidence.
SA: Why do you need 45GB of server memory?
DS: The model needs 5M iterations to train, and I need to do it in parallel.
SA: But you're booting 7k python VMs for that, and forking the same process thousands of times without doing much work in each fork. This is incredibly inefficient.
DS: Sorry, but that is how the model library works. It is in alpha phase; it was just pieced together by a bunch of guys at Berkeley.
SA: Wait, and you are dumping an alpha phase library into production?!
DS: That is the only one that has a Convolutional NN model that works on our GPUs.
SD: We cannot use that code, it has globals, no encapsulation, not even an API.
DS: All we need is that something calls this every minute.
SD: No! That's an extra webserver on top of the one we have. It ain't even integrated with our single sign on.
DS: It does not need to be, it is just the solution for the ML part.
SD: But this will stay forever in the codebase, and people will forget what it does.
PM: So, do we have the solution for that problem?
DS: Yes! I finally got it validated for 87% accuracy.
PM: Good! So it is almost done! By when do you think you will get the other 13% done?
DS: No, no, no. It is done, it has an accuracy of 87%.
PM: But we need a solution, not 87% of a solution.
DS: That is not how machine learning works.
Hence a better image for a data scientist, rather than the one at the beginning, is:
Some history is in order: the evolution of data science cannot be separated from the evolution of the hardware and the tools that made it possible. Without computers and without optimized mathematical libraries, the analysis and use of data that is widespread today would not be possible. Soon we will start learning with NumPy and then build on other tools from there, yet there is a lot of history before NumPy came into existence.
NumPy (and friends) was developed in this century, but before that its predecessor was called Numeric. And even before the conception of NumPy, parts of it were developed as their own, stand-alone libraries; for example, during compilation NumPy still uses FORTRAN code. The continuity and development of NumPy had its ups and downs, in the same fashion as machine learning itself (or intelligent systems, or simply artificial intelligence, as it was called back in the day) had its ups and downs of enthusiasm throughout the last decades.
The software used for data science over the years is listed below in chronological order. Today, the data science portfolio of software that we will be looking at still uses the same concepts - and often the same code - as in these libraries.
We will also use data from several sources whilst we learn how to operate on data. Most data used by us can be found in one of the following data repositories; where we use data from other sources, we cite them in place.
The list below contains more repositories than the ones we take data from. Have a look at the repositories here in order to familiarize yourself with how data can be found out in the wild.
Or, a longer description of what we are going to see. This may prove useful if you are looking back for a specific topic.
Jupyter (and Python Review): We will explore Jupyter, a data science platform using the Python programming language. Along the way we will go through a quick review of Python, yet we will not learn how to program - it is expected that one already knows how to program in order to benefit from the material.
NumPy: The foundation of modern data science. We will see vectorial computing and the computer memory layout that allows for fast computations on several numbers (almost) at once.
Matplotlib: For plotting we will see the classic, if a bit old, graph drawing library. Matplotlib is clunky at times, but its well-proven design allows one to edit every single detail of a plot.
Pandas: Data manipulation is more than working with numbers, and pandas is the de facto standard for data manipulation in Python. We will explore the features of the library most commonly used in data science. Pandas is huge, and covering it fully would take more than a single book.
Statistics and Analytics: Next we will need to review a bit of mathematics and statistics. With that covered we can explore some data science in the real world.
SciKit Learn and Classification (KNN): SciKit Learn is a library that automates the boring parts of machine learning. Yet, one needs to understand the boring parts before safely automating them away. We look at how SciKit uses conventions and helper functions to perform its job. In this case we use a small K Nearest Neighbors (KNN) classification algorithm to demonstrate SciKit Learn's capabilities.
Regression and Feature Engineering: Regression was the first set of machine learning algorithms, in use long before it was called machine learning. We explore what additions have recently been made to regression algorithms. Also, we go back to the idea that not all data are numbers, and transform text data into numbers for use in machine learning.
Clustering and PCA: Unsupervised learning greets us here. Clustering is a form of finding knowledge in a dataset without the help of external descriptions (labels) for the data points, whilst Principal Component Analysis (PCA) is an example of a technique to better visualize the complexity of a dataset. We also quickly look at manifold techniques.
Decision Trees (and Random Forests) and SVMs: More powerful algorithms for classification and regression are explored. Both forests and Support Vector Machines (SVMs) are very powerful but have some limitations. We will explore the good and bad sides of these techniques.
Neural Networks: We define Online Learning and skim over the concepts of neural networks. Neural Networks (NNs) are a very big topic, with several books written on them alone. Our interest here is their comparison with the other techniques we saw, and a reasonable understanding of their training and limitations.
Extras and the future: We give resources for further exploration. We also attempt to discuss the state of the art of machine learning as of the time of writing, and perhaps speculate on what else may or may not become state of the art in the near future.
Or, what you need to install to run the examples. We will have a look at two ways of obtaining the software needed: the easy way, and the hard way for some extra challenge. Note that one is expected to know how to program in order to follow the examples. A short introduction to the Python syntax is given shortly, so one can pick up the syntax itself, yet that is not enough for someone without any programming background.
The easy way: In order to make the learning curve less steep we only use software that can be found in an easily installable data science distribution: anaconda. The anaconda website has plentiful documentation on how to install it. In summary, all you need to do is download the installer, execute it, follow the installation steps (click "next"), and use the new menu items (or the jupyter command line) to open jupyter lab.
Go on, make sure you can install the distribution, run jupyter lab, and open the initial page in a browser.
If you struggle with the installation, anaconda provides guides for every common operating system on their website. At the time of writing the guides were at the following links:
This is enough to follow the examples; an explanation of jupyter follows. You can now ignore the remainder of this section.
Yet, if you do not want to use a prepared distribution and are not afraid of the challenge of building one from scratch yourself, keep reading. The hard way takes a good deal of time and may require a lot of debugging; you have been warned.
The hard way requires creating an environment by yourself and populating it with all the software we will need. For a start you will need a recent Python interpreter, together with pip and virtualenv, installed on your machine.
Assuming common (x86) hardware, that is all you need: pip can download pre-compiled binary packages for common architectures. If you are building on less common hardware, you will also need a C compiler, Python compilation headers, a FORTRAN compiler, make, and LaTeX (for Matplotlib). Whichever hardware you may have, now use virtualenv to create a new environment, and there use pip to install:
numpy
pandas
scipy
matplotlib
seaborn
jupyter
scikit-learn
scikit-image
This should take a while.
But after all this, executing jupyter lab inside the virtual environment you created should work in pretty much the same way as it does in a pre-generated binary environment such as anaconda. Good luck!
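Whichever way you went, a quick sanity check is to run the following from inside jupyter (or a plain python shell). It only assumes that the packages listed above were installed; the exact version numbers printed do not matter, what matters is that every import succeeds.

```python
# check that the environment has the libraries we will use;
# if any import fails, that package still needs to be installed
# (jupyter itself is exercised simply by the lab starting up)
import numpy, pandas, scipy, matplotlib, seaborn, sklearn, skimage

for module in (numpy, pandas, scipy, matplotlib, seaborn, sklearn, skimage):
    print(module.__name__, module.__version__)
```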