Teaching Your Computer
As I have written in my last two articles (Machine Learning Everywhere and Preparing Data for Machine Learning), machine learning is influencing our lives in numerous ways. As a consumer, you've undoubtedly experienced machine learning, whether you know it or not—from recommendations for what products you should buy from various online stores, to the selection of postings that appear (and don't) on Facebook, to the maddening voice-recognition systems that airlines use, to the growing number of companies that offer to select clothing, food and wine for you based on your personal preferences.
Machine learning is everywhere, and although the theory and practice both can take some time to learn and internalize, the basics are fairly straightforward for people to learn.
The basic idea behind machine learning is that you build a model—a description of the ways the inputs and outputs are related. This model then allows you to ask the computer to analyze new data and to predict the outputs for new sets of inputs. This is essentially what machine learning is all about. In "supervised learning", the computer is trained to categorize data based on inputs that humans had previously categorized. In "unsupervised learning", you ask the computer to categorize data on your behalf.
In my last article, I started exploring a data set created by Scott Cole, a data scientist (and neuroscience PhD student) who measured burritos in a variety of California restaurants. I looked at the different categories of data that Cole and his fellow eater-researchers gathered and considered a few ways one could pare down the data set to something more manageable, as well as reasonable.
Here I describe how to take this smaller data set, consisting solely of the features that were deemed necessary, and use it to train the computer by creating a machine-learning model.
Machine-Learning Models
Let's say that the quality of a burrito is determined solely by its size. Thus, the larger the burrito, the better it is; the smaller the burrito, the worse it is. If you describe the size as a matrix X, and the resulting quality score as y, you can describe this mathematically as:
y = qX
where q is a factor describing the relationship between X and y.
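To make the formula concrete, here is a minimal sketch using NumPy. The sizes and the factor q are numbers I made up for illustration; they are not taken from Cole's data:

```python
import numpy as np

# Hypothetical burrito sizes, one feature per sample (values invented).
X = np.array([[18.0], [20.0], [22.5]])

# Hypothetical factor relating size to quality.
q = np.array([0.2])

# y = qX: each sample's quality is its size multiplied by q.
y = X @ q
print(y)
```

With more than one feature, X simply gains columns and q gains one factor per column.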
Of course, you know that burrito quality has to do with more than just the size. Indeed, in Cole's research, size was removed from the list of features, in part because not every data point contained size information.
Moreover, this example model will need to take several factors—not just one—into consideration, and may have to combine them in a sophisticated way in order to predict the output value accurately. Indeed, there are numerous algorithms that can be used to create models; determining which one is appropriate, and then tuning it in the right way, is part of the game.
The goal here, then, will be to combine the burrito data and an algorithm to create a model for burrito tastiness. The next step will be to see if the model can predict the tastiness of a burrito based on its inputs.
But, how do you create such a model?
In theory, you could create it from scratch, reading the appropriate statistical literature and implementing it all in code. But because I'm using Python, and because Python's scikit-learn has been tuned and improved over several years, there are a variety of model types to choose from that others already have created.
Before starting with the model building, however, let's get the data into the necessary format. As I mentioned in my last article and alluded to above, Python's machine-learning package (scikit-learn) expects two things when training a supervised-learning model: a set of sample inputs, traditionally placed in a two-dimensional matrix called X (yes, uppercase X), and a set of sample outputs, traditionally placed in a vector called y (lowercase). You can get there as follows, inside the Jupyter notebook:
%pylab inline
import pandas as pd # load pandas with an alias
from pandas import Series, DataFrame # load useful Pandas classes
df = pd.read_csv('burrito.csv') # read into a data frame
Once you have loaded the CSV file containing burrito data, you'll keep only those columns that contain the features of interest, as well as the output score:
burrito_data = df.iloc[:, 11:24].copy() # keep columns 11 through 23, selected by position
You'll then remove the columns that are highly correlated to one another and/or for which a great deal of data is missing. In this case, it means removing all of the features having to do with burrito size:
burrito_data.drop(['Circum', 'Volume', 'Length'], axis=1, inplace=True)
Let's also drop any of the samples (that is, rows) in which one or more values is NaN ("not a number"), which will throw off the values:
burrito_data.dropna(inplace=True, axis=0)
Once you've done this, the data frame is ready to be used in a model. Separate out the X and y values:
y = burrito_data['overall']
X = burrito_data.drop(['overall'], axis=1)
The goal is now to create a model that describes, as best as possible, the way the values in X lead to a value in y. In other words, if you look at X.iloc[0] (that is, the input values for the first burrito sample) and at y.iloc[0] (that is, the output value for the first burrito sample), it should be possible to understand how those inputs map to those outputs. Moreover, after training the computer with the data, the computer should be able to predict the overall score of a burrito, given those same inputs.
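If you don't have the burrito CSV file handy, the same preparation steps can be tried on a tiny stand-in data frame. This is only a sketch: the column names mimic the real ones, but every value is invented:

```python
import numpy as np
import pandas as pd

# A tiny, invented stand-in for the burrito data (not Cole's real values).
toy = pd.DataFrame({
    'Length':  [20.0, np.nan, 19.5, 21.0],
    'Meat':    [4.0, 3.5, np.nan, 2.0],
    'Salsa':   [3.0, 4.0, 4.5, 1.5],
    'overall': [3.8, 4.0, 4.2, 2.0],
})

# Drop the size-related column, then any row that still contains a NaN.
toy = toy.drop(['Length'], axis=1).dropna(axis=0)

# Split into the output vector y and the input matrix X.
y = toy['overall']
X = toy.drop(['overall'], axis=1)

print(X.shape, y.shape)
```

Only the row with a missing Meat rating is lost here, because the NaN in Length disappears along with that column before dropna runs.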
Creating a Model
Now that the data is in order, you can build a model. But which algorithm (sometimes known as a "classifier") should you use for the model? This is, in many ways, the big question in machine learning, and is often answerable only via a combination of experience and trial and error. The more machine-learning problems you work to solve, the more of a feel you'll get for the types of models you can try. However, there's always the chance that you'll be wrong, which is why it's often worth creating several different types of models, comparing them against one another for validity. I plan to talk more about validity testing in my next article; for now, it's important to understand how to build a model.
Different algorithms are meant for different kinds of machine-learning problems. In this case, the input data already has been ranked, meaning that you can use a supervised learning model. The output from the model is a numeric score that ranges from 0 to 5, which means that you'll have to use a numeric model, rather than a categorical one.
The difference is that a categorical model's outputs will (as the name implies) indicate into which of several categories, identified by integers, the input should be placed. For example, modern political parties hire data scientists who try to determine which way someone will vote based on input data. The result, namely a political party, is categorical.
In this case, however, you have numeric data. In this kind of model, you expect the output to vary along a numeric range. A pricing model, determining how much someone might be willing to pay for a particular item or how much to charge for an advertisement, will use this sort of model.
I should note that if you want, you can turn the numeric data into categorical data simply by rounding or truncating the floating-point y values, such that you get integer values. It is this sort of transformation that you'll likely need to consider (and try, and test) in a machine-learning project. And it's this myriad of choices and options that can make a data-science project so involved, calling on your experience and insights as well as brute-force tests of a variety of possible models.
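As a rough illustration of that transformation, here is a sketch that rounds and truncates a handful of invented scores with NumPy. The scores are made up; only the 0-to-5 scale comes from the data set:

```python
import numpy as np

# Invented overall scores on the 0-5 scale.
y = np.array([4.86, 1.96, 3.4, 2.49])

# Rounding gives the nearest integer category...
y_rounded = np.round(y).astype(int)

# ...while truncating always moves toward zero.
y_truncated = np.trunc(y).astype(int)

print(y_rounded, y_truncated)
```

The two choices disagree for scores such as 4.86, which is exactly the kind of decision you would need to test rather than assume.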
Let's assume you're going to keep the data as it is. You cannot use a purely categorical model, but rather will need to use one that incorporates the statistical concept of "regression", in which you attempt to determine how strongly each of your input factors correlates with the output. That is, assume that the ideal is something like the "y = qX" that you saw above; given that a single factor isn't enough, how much influence did meat quality have vs. uniformity vs. temperature? Each of those factors affected the overall quality in some way, but some of them had more influence than others.
One of the easiest to understand, and most popular, types of models uses the K Nearest Neighbors (KNN) algorithm. KNN basically says that you'll take a new piece of data and compare its features with those of existing, known, categorized data. The new data is then classified into the same category as the majority of its K closest neighbors, where K is a number that you must determine, often via trial and error.
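Here is a minimal sketch of that idea, using scikit-learn's KNeighborsClassifier on six invented samples. The points and their categories are made up for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented, already-categorized samples: two features each, category 0 or 1.
X_known = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_known = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X_known, y_known)

# Each new sample gets the majority category among its 3 nearest neighbors.
predictions = knn.predict([[2, 2], [7, 9]])
print(predictions)
```

The first new point sits among the category-0 samples and the second among the category-1 samples, so each takes on the category of the cluster it is closest to.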
However, KNN as described works only for categories; this example is dealing with a regression problem, which can't use it directly. Fortunately, Python's scikit-learn comes with a version of KNN that is designed to work with regression problems: the KNeighborsRegressor classifier.
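To see how the regression variant differs, consider this small sketch with invented one-feature samples. With its default uniform weights, KNeighborsRegressor predicts the mean of the outputs of the K nearest neighbors, rather than voting on a category:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented one-feature samples with numeric (not categorical) outputs.
X_known = np.array([[1.0], [2.0], [3.0], [10.0]])
y_known = np.array([2.0, 3.0, 4.0, 5.0])

knr = KNeighborsRegressor(n_neighbors=3)
knr.fit(X_known, y_known)

# The 3 nearest neighbors of 2.2 are 2.0, 3.0 and 1.0, so the prediction
# is the mean of their outputs: (3.0 + 4.0 + 2.0) / 3.
prediction = knr.predict([[2.2]])[0]
print(prediction)
```

The far-away sample at 10.0 plays no part in the prediction, because it isn't among the three closest neighbors.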
So, how do you use it? Here's the basic way in which all supervised learning happens in scikit-learn:
- Import the Python class that implements the classifier.
- Create a model (that is, an instance of the classifier).
- Train the model using the "fit" method.
- Feed data to the model and get a prediction.
Let's try this with the data. You already have an X and a y, which you can plug in to the standard sklearn pattern:
from sklearn.neighbors import KNeighborsRegressor # import classifier
KNR = KNeighborsRegressor() # create a model
KNR.fit(X, y) # train the model
Without the dropna above (in which I removed any rows containing one or more NaN values), you still would have "dirty" data, and sklearn would be unable to proceed. Some classifiers can handle NaN data, but as a general rule, you'll need to get rid of NaN values, whether to satisfy the classifier's rules or to ensure that your results are of high quality, or even (in some cases) valid.
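You can see this for yourself with a few invented samples. The sketch below assumes the default settings of KNeighborsRegressor, under which scikit-learn rejects NaN values when it validates the input:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented samples; the second row has a missing value.
X_dirty = np.array([[4.0, 3.0], [np.nan, 4.0], [2.0, 1.5]])
y_dirty = np.array([3.8, 4.0, 2.0])

try:
    KNeighborsRegressor(n_neighbors=1).fit(X_dirty, y_dirty)
    fit_failed = False
except ValueError as err:
    # scikit-learn rejects the NaN during input validation.
    fit_failed = True
    print(err)

# Keeping only the complete rows lets training proceed.
complete = ~np.isnan(X_dirty).any(axis=1)
model = KNeighborsRegressor(n_neighbors=1).fit(X_dirty[complete], y_dirty[complete])
```

The boolean mask does for a NumPy array what dropna did for the data frame earlier.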
With the trained model in place, you now can ask it: "If you have a burrito with really great ingredients, how highly will it rank?"
All you have to do is create a new, fake sample burrito with all high-quality ingredients:
great_ingredients = np.ones(X.iloc[0].count()) * 5
In the above line of code, I took the first sample from X (that is, X.iloc[0]) and counted how many items it contained. I then created a NumPy array of ones of that length and multiplied it by 5, so that it contained all 5s. I now can ask the model to predict the overall quality of such a burrito:
KNR.predict([great_ingredients])
I get back a result of:
array([ 4.86])
meaning that the burrito would indeed score high—not a 5, but high nonetheless. What if you create a burrito with absolutely awful ingredients? Let's find the predicted quality:
terrible_ingredients = np.zeros(X.iloc[0].count())
In the above line of code, I created a NumPy array containing zeros, the same length as X's list of features. If you now ask the model to predict the score of this burrito, you get:
array([ 1.96])
The good news is that you have now trained the computer to predict the quality of a burrito from a set of rated ingredients. The other good news is that you can determine which ingredients are more influential and which are less influential.
At the same time, there is a problem: how do you know that KNN regression is the best model you could use? And when I say "best", I ask whether it's the most accurate at predicting burrito quality. For example, maybe a different classifier will have a higher spread or will describe the burritos more accurately.
It's also possible that the classifier is a good one, but that one of its parameters—parameters that you can use to "tune" the model—wasn't set correctly. And I suspect that you indeed could do better, since the best burrito actually sampled got a score of 5, and the worst burrito had a score of 1.5. This means that the model is not a bad start, but that it doesn't quite handle the entire range that one would have expected.
One possible solution to this problem is to adjust the parameters that you hand the classifier when creating the model. In the case of any KNN-related model, one of the first parameters you can try to tune is n_neighbors. By default, it's set to 5, but what if you set it higher or lower?
A bit of Python code can establish this for you:
for k in range(1,10):
    print(k)
    KNR = KNeighborsRegressor(n_neighbors=k)
    KNR.fit(X, y)
    print("\tTerrible: {0}".format(KNR.predict([terrible_ingredients])))
    print("\tBest: {0}".format(KNR.predict([great_ingredients])))
After running the above code, it seems like the model that has the highest high and the lowest low is the one in which n_neighbors is equal to 1. It's not quite what I would have expected, but that's why it's important to try different models.
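One plausible explanation, which I haven't verified against the burrito data itself, is that a single neighbor passes its score through unchanged, whereas larger values of n_neighbors average several scores and pull the predictions toward the middle. This sketch shows the effect on invented samples:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented one-feature samples whose outputs span 1.5 to 5.
X_known = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_known = np.array([1.5, 2.0, 3.0, 3.5, 4.5, 5.0])

spreads = {}
for k in (1, 3, 5):
    knr = KNeighborsRegressor(n_neighbors=k).fit(X_known, y_known)
    # Predict for the two extreme inputs and record how far apart they land.
    low, high = knr.predict([[0.0], [5.0]])
    spreads[k] = high - low
    print(k, low, high)
```

A wide spread is thus not proof of a better model; with one neighbor, the model is merely echoing individual training samples.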
And yet, this way of checking to see which value of n_neighbors is the best is rather primitive and has lots of issues. In my next article, I plan to look into checking the models, using more sophisticated techniques than I used here.
Using Another Classifier
So far, I've described how you can create multiple models from a single classifier, but scikit-learn comes with numerous classifiers, and it's usually a good idea to try several.
So in this case, let's also try a simple regression model. Whereas KNN uses existing, known data points in order to decide what outputs to predict based on new inputs, regression uses good old statistical techniques. Thus, you can use it as follows:
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(X, y)
print("\tTerrible: {0}".format(LR.predict([terrible_ingredients])))
print("\tBest: {0}".format(LR.predict([great_ingredients])))
Once again, I want to stress that just because a model doesn't cover the entire spread of output values, from best to worst, doesn't mean you can discount it. And, a model that works with some data sets often will not work with other data sets.
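A linear regression model also lets you look at how much weight it gives each input, via its coef_ attribute. The following sketch uses invented ratings for two hypothetical features, with outputs I constructed so that the first feature matters five times as much as the second:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented ratings: columns are two hypothetical features, "Meat" and "Salsa".
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.0]])

# Invented outputs built so that Meat matters five times as much as Salsa.
y_toy = 1.0 + 0.5 * X_toy[:, 0] + 0.1 * X_toy[:, 1]

lr = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per input column; intercept_ is the constant term.
print(lr.coef_, lr.intercept_)
```

Because the toy outputs are perfectly linear, the model recovers the weights I built in. Real ratings are noisier, so the coefficients there are estimates rather than ground truth.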
But as you can see, scikit-learn makes it easy—almost trivially easy, in fact—to create and experiment with different models. You can, thus, try different classifiers, and types of classifiers, in order to create a model that describes your data.
Now that you've created several models, the big question is which one is the best? Which one not only describes the data, but also does so well? Which one will give the most predictive power moving forward, as you encounter an ever-growing number of burritos? What ingredients should a burrito-maker stress in order to maximize eater satisfaction, while minimizing costs?
In order to answer these questions, you'll need to have a way of testing your models. In my next article, I'll look at how to test your models, using a variety of techniques to check the validity of a model and even compare numerous classifier types against one another.
Resources
I used Python and the many parts of the SciPy stack (NumPy, SciPy, Pandas, matplotlib and scikit-learn) in this article. All are available from PyPI or from SciPy.org.
I recommend a number of resources for people interested in data science and machine learning. One long-standing weekly e-mail list is "KDnuggets". You also should consider the "Data Science Weekly" newsletter and "This Week in Data", describing the latest data sets available to the public.
I am a big fan of podcasts and particularly love "Partially Derivative". Other good ones are "Data Stories" and "Linear Digressions". I listen to all three on a regular basis and learn from them all.
If you're looking to get into data science and machine learning, I recommend Kevin Markham's "Data School" and Jason Brownlee's "Machine Learning Mastery", where he sells a number of short, dense, but high-quality ebooks on these subjects.