Unsupervised Learning
In my last few articles, I've looked into machine learning and how you can build a model that describes the world in some way. All of the examples I looked at were of "supervised learning", meaning that you loaded data that already had been categorized or classified in some way, and then created a model that "learned" the ways the inputs mapped to the outputs. With a good model, you then were able to predict the output for a new set of inputs.
Supervised learning is a very useful technique and is quite widespread. But, there is another set of techniques in machine learning known as unsupervised learning. These techniques, broadly speaking, ask the computer to find the hidden structure in the data—in other words, to "learn" what the meaning of the data is, what relationships it contains, which features are of importance, and which data records should be considered to be outliers or anomalies.
Unsupervised learning also can be used for what's known as "dimensionality reduction", in which the model functions as a preprocessing step, reducing the number of features in order to simplify the inputs that you'll hand to another model.
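As a minimal sketch of what such a preprocessing step might look like, here I assume principal component analysis (PCA), one of several techniques that could fill this role, and the iris data set that ships with scikit-learn. The variable names are only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Assumption: PCA stands in for "dimensionality reduction" here;
# other techniques could be used instead.
pca = PCA(n_components=2)

# Reduce the four original features to two derived ones
reduced = pca.fit_transform(iris.data)

print(reduced.shape)
# Fraction of the original variance kept by each new feature
print(pca.explained_variance_ratio_)
```

The reduced matrix then could be handed to another model in place of the original four columns.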
In other words, in supervised learning, you teach the computer about your data and hope that it understands the relationships and categorization well enough to categorize data it hasn't seen before successfully.
In unsupervised learning, by contrast, you're asking the computer to tell you something interesting about the data.
This month, I take an initial look at the world of unsupervised learning. Can a computer categorize data as well as a human? How can you use Python's scikit-learn to create such models?
Unsupervised Learning
There's a children's card game called Set that is a useful way to think about machine learning. Each card in the game contains a picture. The picture contains one, two or three shapes. There are several different shapes, and each shape has a color and a fill pattern. In the game, players are supposed to identify groups of three cards that share any one of those properties. Thus, you could create a group based on the color green, in which all cards are green in color (but may differ in the number of shapes, the shapes themselves and the fill patterns). You could create a group based on the number of shapes, in which every card has two shapes, but those shapes can be of any color, any shape and any fill pattern.
The idea behind the game is that players can create a variety of different groups and should take advantage of this in order to win the game.
I often think of unsupervised learning as asking the computer to play a game of Set. You give the computer a data set and ask it to divide that large bunch of data into separate categories. The model may choose any feature, or set of features, and that might (or might not) be a feature that humans would consider to be important. But, it will find those connections, or at least try to do so.
One of the most common machine-learning data sets for beginners is the "iris" dataset, containing 150 flowers, 50 from each of three types of irises. Several months ago, I showed how you could create a supervised model to identify irises. In other words, you could create and train a model that would categorize irises accurately based on their petal and sepal sizes.
Can unsupervised learning achieve the same goal? That is, can you create a model that will divide the flowers into three different groups, doing the same job (or close to it) that humans have done?
Another way of asking this question is whether the way in which biologists distinguish between varieties of flowers is supported by the underlying measurement data.
Let's load the iris data and then start to create an unsupervised model. Assuming that I'm working within the Jupyter notebook, I can execute the following:
%pylab inline
import pandas as pd
from pandas import DataFrame, Series
from sklearn.datasets import load_iris
iris = load_iris()
df = DataFrame(iris.data, columns=iris.feature_names)
df['response'] = iris.target
In other words, I've created a Pandas data frame containing five columns—the four features and also the response (that is, the classification). You won't be passing the classification to the model (although that might improve the model's ability to classify the flowers), but it's convenient to keep everything together in this way.
Creating a Model
Once you've loaded the data, it's time to create a model. You're looking to do what's known as "clustering", which means that the computer will divide the data set into categories or clusters.
So, now what? In supervised learning, you would create a new model from a classifier and then train it using scikit-learn's "fit" method. You then could give the trained model one or more data points and ask it to categorize those based on the model.
In unsupervised learning, it's a bit trickier—after all, you're asking the computer to do the categorization. If you don't have any pre-labeled categories, it's going to be hard to know whether your model is useful, accurate or both.
But before getting into the evaluation, let's build a model. Sklearn comes with a number of algorithms that handle clustering. One popular algorithm is known as "K-means". In K-means clustering, the idea is that the model puts each data point inside the cluster whose mean is the closest. Thus, if there are three clusters, the model computes three means, and each cluster contains the points that are closer to its mean than to either of the other two. The "inertia" is a measurement of how coherent the groups are; that is, how tightly the elements that have been grouped together sit around the mean of their cluster.
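To make the "closest mean" idea concrete, here is a rough NumPy sketch of the classic two-step loop (assign each point to the nearest center, then move each center to the mean of its points). This is a teaching sketch under my own assumptions, not scikit-learn's implementation; the function name lloyd_kmeans, the random starting centers and the fixed number of iterations are all hypothetical choices:

```python
import numpy as np
from sklearn.datasets import load_iris


def lloyd_kmeans(points, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: a bare-bones K-means loop for illustration."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen data points as the initial centers
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centers = points[start]

    for _ in range(n_iter):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(
            points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: move each center to the mean of its members
        for j in range(n_clusters):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)

    # Final assignment against the last set of centers
    distances = np.linalg.norm(
        points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers


labels, centers = lloyd_kmeans(load_iris().data, n_clusters=3)
print(np.bincount(labels, minlength=3))
```

A real implementation adds smarter initialization, multiple restarts and a convergence check, which is why you normally would reach for the library version.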
I should note that because K-means uses distances to calculate how to compose a group, you likely will want all of your features to be on the same scale. In the case of the flowers, all are within the same order of magnitude. But, you can imagine that if three measurements are on a scale of 1–10 and a fourth is on a scale of 1–1 million, the calculations might not work out as well. For this reason, it can be a good idea to use a scaler—several of which come with sklearn—to put all of your data onto the same scale. Such scaling is often important when creating models; it helps the calculations to identify two or more items as being close by.
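As a small sketch of what such scaling might look like, here I assume StandardScaler, one of the scalers that come with sklearn; the iris data doesn't strictly need it, so this is purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# StandardScaler rescales each column to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

print(np.round(X_scaled.mean(axis=0), 2))
print(np.round(X_scaled.std(axis=0), 2))
```

The scaled matrix can be passed to a clustering model in place of the raw measurements.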
So, using Python's scikit-learn, you can say:
from sklearn.cluster import KMeans
k = KMeans(n_clusters=3)
The above code indicates that you're going to use the K-means algorithm. You create a new model, indicating when you do so that you want three groups.
Now, right away you might be asking yourself how to know that there will be three categories, and the cop-out answer is that you guess. You can try different values for n_clusters and evaluate the model to see how well it does. But in many cases, you'll have to experiment a bit.
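One way to make that experimentation a bit more systematic is to score each candidate value. The sketch below assumes the silhouette score from sklearn.metrics as the yardstick (my choice for illustration; it is only one of several possible measures), with values closer to 1 suggesting tighter, better-separated clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

data = load_iris().data

scores = {}
for n in range(2, 7):
    # random_state and n_init are fixed only to make the run repeatable
    candidate = KMeans(n_clusters=n, n_init=10, random_state=0)
    labels = candidate.fit_predict(data)
    scores[n] = silhouette_score(data, labels)

for n, score in scores.items():
    print(n, round(score, 3))
```

Don't be surprised if the measure prefers a number of clusters other than the one humans chose; it knows nothing about species, only about distances.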
Let's now run K-means on the data. The X (that is, input matrix) is going to be the data frame, minus the "response" column. You can create that as follows:
X = df.drop('response', axis=1)
With supervised learning, the "fit" method is the process in which you teach the model to make associations between the input matrix X and the output vector y. In unsupervised learning, you're asking the model itself to make such divisions and to create an output vector. You do this with "fit":
k.fit(X)
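Once the model has been fitted, the divisions it came up with are stored on the model itself. Here is a self-contained sketch (it reloads the data and uses its own variable names, and the random_state and n_init values are assumptions I added for repeatability) showing two of the places to look:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(iris.data)

# One cluster number per flower, in the same order as the input rows
print(km.labels_[:10])

# One row per cluster, one column per feature
print(km.cluster_centers_.shape)
```

Note that the cluster numbers are arbitrary; cluster 0 has no particular connection to the species that the data set happens to call 0.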
Evaluating the Model
The first question you'll ask the model is: "How did it divide up the flowers?" You know that the irises should be divided into three different groups, each with 50 flowers. How did K-means do?
You can ask the model itself using a variety of attributes. These attributes end with an underscore (_), which is scikit-learn's convention for values that were learned from the data during fitting, and that therefore will change if the model is trained again.
And indeed, this is an important point to make. When you invoke the "fit" method, you are teaching the model from scratch. However, there are times when you have so much data, you cannot reasonably teach the model all at once. For such cases, you might want to try an algorithm that supports the "partial_fit" method, which allows you to grab inputs a little bit at a time, teaching the model iteratively. However, not all algorithms support partial_fit; a large number of data points might force your hand and reduce the number of algorithms from which you can choose.
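As a sketch of what iterative teaching can look like, here I assume MiniBatchKMeans, a relative of K-means in sklearn.cluster that does offer partial_fit. Splitting the tiny iris data set into five chunks is purely artificial and stands in for data that arrives in pieces:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import load_iris

data = load_iris().data

# Shuffle so that each chunk contains a mix of flowers
rng = np.random.default_rng(0)
shuffled = data[rng.permutation(len(data))]

mbk = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

# Feed the model one chunk at a time instead of all at once
for chunk in np.array_split(shuffled, 5):
    mbk.partial_fit(chunk)

print(mbk.cluster_centers_.shape)
print(np.bincount(mbk.predict(data), minlength=3))
```

The results from a model trained this way may differ from those of a model that saw all of the data at once.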
For this example, I'm using scikit-learn's standard KMeans class, which doesn't offer partial_fit, so you cannot teach this model incrementally. Let's ask the model for its measure of inertia:
k.inertia_
(Again, notice the trailing underscore.) The value that I get is 78.9. The inertia value isn't on a fixed scale; the general sense is that the lower the inertia score, the better, with zero being the best.
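If you're curious where that number comes from, scikit-learn defines inertia as the sum of squared distances from each point to the center of its own cluster. The following sketch recomputes it by hand; it reloads the data and fits its own model, and the fixed random_state is my addition, so the exact value may differ a bit from the one above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

data = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# For every flower, look up the center of the cluster it was assigned to
own_centers = km.cluster_centers_[km.labels_]

# Sum of squared distances between each flower and its own center
manual_inertia = ((data - own_centers) ** 2).sum()

print(manual_inertia)
print(km.inertia_)
```

Because the sum grows with the number of points and with the units of the features, inertia values are comparable only across models fitted to the same data.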
What if I were to divide the flowers into only two groups, or four groups? Using scikit-learn, I can do that pretty quickly and determine whether the computer thinks the manual classification (into three groups) was a good choice:
output = []
for i in range(2, 20):
    # Fit a fresh model for each candidate number of clusters
    model = KMeans(n_clusters=i)
    model.fit(X)
    output.append((i, model.inertia_))
kmeans = DataFrame(output, columns=['i', 'inertia'])
Now, it might seem ridiculous to group 150 flowers into up to 19 different groups! And indeed, the lowest inertia value that I get is when I set n_clusters=19, with the inertia rising as the number of groups goes down.
Perhaps this means that every flower is unique and cannot be categorized? Perhaps. But keep in mind that inertia nearly always shrinks as you add clusters, because every point ends up closer to some center, so the lowest value alone cannot tell you the right number of groups. It's also possible that our data isn't a great fit for K-means. Maybe it's the wrong shape. Maybe its values aren't varied enough. And indeed, when you look at the way in which the flowers were clustered for n_clusters=3, you see that the clustering was quite different from what people came up with. I can turn the automatically labeled flowers into a Pandas Series, and then count how many of each flower was found:
Series(k.labels_).value_counts()
I get:
2 62
1 50
0 38
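Counts alone don't say which flowers ended up where. Since the iris data happens to come with species labels, a cross-tabulation can show how each species was spread across the clusters. This is a self-contained sketch with its own variable names; the fixed random_state and n_init are assumptions I added so that the run is repeatable:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

# Translate the numeric targets into species names for readability
species = pd.Series(iris.target_names[iris.target], name='species')
clusters = pd.Series(km.labels_, name='cluster')

# Rows are the human labels; columns are the clusters the model found
table = pd.crosstab(species, clusters)
print(table)
```

A species whose row has a single nonzero entry was kept together by the model; a row spread across several columns was split up.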
Well, it could be worse—but it also could be much better. Perhaps you can and should try another algorithm and see if it's better able to group the flowers together.
I should note that this now falls under the category of "semi-supervised learning"—that is, trying to see whether an unsupervised technique can achieve the same results, or at least similar results, to a previously used supervised technique.
In such a case, you can evaluate your model using not just statistical tests, but also one of the techniques I described in my previous articles on supervised learning, namely train-test-split. You use unsupervised learning on a portion of the input data and then predict on the remaining part. Comparing the model's outputs with the expected outputs for that subset can help you evaluate and tune your model.
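Here is a rough sketch of that idea, assuming K-means as the model and the adjusted Rand index as the way to compare clusters with the expected outputs (both are my choices for illustration). Because cluster numbers are arbitrary, you need a measure like this one that ignores how the groups are numbered:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold back 30% of the flowers; the labels are used only for scoring
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0,
    stratify=iris.target)

# The model never sees y_train; it clusters the training inputs only
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Assign each held-back flower to the nearest learned cluster
test_clusters = km.predict(X_test)

# 1.0 means the grouping matches the species exactly
score = adjusted_rand_score(y_test, test_clusters)
print(round(score, 3))
```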
A Different Algorithm
But in this case, let's try using a different model to achieve a different result, simply to see how easily sklearn allows you to try different models. One common choice in unsupervised learning is Gaussian Mixture, known in previous versions of scikit-learn as GMM. Let's use it:
from sklearn.mixture import GaussianMixture
model = GaussianMixture(n_components=3)
model.fit(X)
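As with K-means, the fitted model stores what it learned in attributes with trailing underscores. The sketch below is self-contained, with its own variable names and a random_state that I added for repeatability, and it peeks at a few of those attributes:

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

data = load_iris().data
gm = GaussianMixture(n_components=3, random_state=0).fit(data)

# Did the fitting procedure converge?
print(gm.converged_)

# Estimated share of the data belonging to each component
print(gm.weights_.round(3))

# One mean vector per component, one column per feature
print(gm.means_.shape)
```

Unlike K-means, each component here is described by more than a center, which lets the model cope with clusters that are stretched in one direction.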
Now, let's have the model predict with the data used to train it, which will return a NumPy array with the categories:
model.predict(X)
How did that do? Let's pop this data into a Pandas Series object and then count the values:
Series(model.predict(X)).value_counts()
And sure enough, the results:
2 55
1 50
0 45
This is still imperfect (assuming that the human classification counts as "perfect"), but it's clearly better than the attempts with K-means. And because this is semi-supervised learning here, in which you have the original classifications, you can use some of sklearn's metrics to find how good (or bad) the model is:
from sklearn import metrics
labels_true = iris.target
labels_pred = model.predict(X)
Let's find out how well it did:
metrics.homogeneity_score(labels_true, labels_pred)
0.89832636726027748
metrics.completeness_score(labels_true, labels_pred)
0.90106489086402064
Hey, pretty good! Not perfect (that is, 1.0), but not bad at all. And if you compare this against the K-means model:
labels_pred = k.labels_
metrics.homogeneity_score(labels_true, labels_pred)
0.75148540219883375
metrics.completeness_score(labels_true, labels_pred)
0.76498615144898152
In other words, my intuition was right. The GaussianMixture model was better at clustering the flowers than the K-means model.
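Homogeneity and completeness aren't the only scores that sklearn offers for this kind of comparison. As a sketch, the following loop runs both models through two other metrics, the adjusted Rand index and the V-measure. It is self-contained, and the fixed random_state and n_init values are my additions, so the numbers may not match the ones above exactly:

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

iris = load_iris()

candidates = {
    'KMeans': KMeans(n_clusters=3, n_init=10, random_state=0),
    'GaussianMixture': GaussianMixture(n_components=3, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # fit_predict trains the model and returns one cluster per flower
    predicted = estimator.fit_predict(iris.data)
    results[name] = (
        metrics.adjusted_rand_score(iris.target, predicted),
        metrics.v_measure_score(iris.target, predicted),
    )
    print(name, results[name])
```

Both metrics top out at 1.0 for a perfect match and don't care how the clusters happen to be numbered.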
Conclusion
In many ways, unsupervised learning is the true magic and potential in the machine-learning world. By using computers to identify patterns and groups in your data, more quickly and accurately than you could do yourself, you can start to identify and predict all sorts of things. As with supervised learning though, unsupervised learning requires that you try a variety of models, compare them against one another and understand that each model has its own advantages, disadvantages and biases.
The world of data science in general, and machine learning in particular, continues to grow at an extremely rapid rate, with new ideas, techniques and tutorials available all of the time. The Resources section here describes several places where you can learn more and start your journey in this set of concepts and technologies.
Resources
I used Python and the many parts of the SciPy stack (NumPy, SciPy, Pandas, Matplotlib and scikit-learn) in this article. All are available from PyPI or from SciPy.org.
I recommend a number of resources for people interested in data science and machine learning.
One long-standing weekly email list is "KDNuggets". You also should consider the "Data Science Weekly" newsletter and "This Week in Data", describing the latest data sets available to the public.
I am a big fan of podcasts, and I particularly love "Partially Derivative". Other good ones are "Data Stories" and "Linear Digressions". I listen to all three on a regular basis and learn from them all.
If you're looking to get into data science and machine learning, I recommend Kevin Markham's "Data School" and Jason Brownlee's "Machine Learning Mastery", where he sells a number of short and dense, but high-quality ebooks on these subjects.