<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" xmlns:content="https://purl.org/rss/1.0/modules/content/" xmlns:foaf="https://xmlns.com/foaf/0.1/" xmlns:og="https://ogp.me/ns#" xmlns:rdfs="https://www.w3.org/2000/01/rdf-schema#" xmlns:schema="https://schema.org/" xmlns:sioc="https://rdfs.org/sioc/ns#" xmlns:sioct="https://rdfs.org/sioc/types#" xmlns:skos="https://www.w3.org/2004/02/skos/core#" xmlns:xsd="https://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/tag/machine-learning">
  <channel>
    <title>Machine Learning</title>
    <link>https://www.linuxjournal.com/tag/machine-learning</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Weekend Reading: Python</title>
  <link>https://www.linuxjournal.com/content/weekend-reading-using-python-science-and-machine-learning</link>
  <description>  &lt;div data-history-node-id="1339795" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/Python_2_0.png" width="800" height="471" alt="python logo" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/carlie-fairchild" lang="" about="https://www.linuxjournal.com/users/carlie-fairchild" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Carlie Fairchild&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;Python is easy to use, powerful, versatile and a &lt;em&gt;Linux Journal&lt;/em&gt; reader favorite. We've round up some of the most popular recent Python-related articles for your weekend reading.&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/introducing-pyinstaller"&gt;Introducing PyInstaller&lt;/a&gt; by Reuven M. Lerner: Want to distribute Python programs to your Python-less clients? PyInstaller is the answer.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/bytes-characters-and-python-2"&gt;Bytes, Characters and Python 2&lt;/a&gt; by Reuven M. Lerner: Moving from Python 2 to 3? Here's what you need to know about strings and their role in in your upgrade.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/introducing-python-37s-dataclasses"&gt;Introducing Python 3.7's Dataclasses&lt;/a&gt; by Reuven M. Lerner: Python 3.7's dataclasses reduce repetition in your class definitions.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/examining-data-using-pandas"&gt;Examining Data Using Pandas&lt;/a&gt; by Reuven M. Lerner: You don't need to be a data scientist to use Pandas for some basic analysis.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/multiprocessing-python"&gt;Multiprocessing in Python&lt;/a&gt; by Reuven M. Lerner: Python's "multiprocessing" module feels like threads, but actually launches processes.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/launching-external-processes-python"&gt;Launching External Processes in Python&lt;/a&gt; by Reuven M. Lerner: Think it's complex to connect your Python program to the UNIX shell? Think again!&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/thinking-concurrently"&gt;Thinking Concurrently: How Modern Network Applications Handle Multiple Connections&lt;/a&gt; by Reuven M. Lerner: exploring different types of multiprocessing and looks at the advantages and disadvantages of each.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/threading-python"&gt;Threading in Python&lt;/a&gt; by Reuven M. Lerner: threads can provide concurrency, even if they're not truly parallel.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/using-python-science"&gt;Using Python for Science&lt;/a&gt; by Joey Bernard: introducing Anaconda, a Python distribution for scientific research.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/visualizing-molecules-python"&gt;Visualizing Molecules with Python&lt;/a&gt; by Joey Bernard: introducing PyMOL, a Python package for studying chemical structures.&lt;/p&gt;
	&lt;/li&gt;
	&lt;li&gt;
	&lt;p&gt;&lt;a href="https://www.linuxjournal.com/content/novelty-and-outlier-detection"&gt;Novelty and Outlier Detection&lt;/a&gt; by Reuven M. Lerner: we look at a number of ways you can try to identify outliers using the tools and libraries that Python provides for working with data: NumPy, Pandas and scikit-learn.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/weekend-reading-using-python-science-and-machine-learning" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Sat, 10 Nov 2018 13:17:05 +0000</pubDate>
    <dc:creator>Carlie Fairchild</dc:creator>
    <guid isPermaLink="false">1339795 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Empowering Linux Developers for the New Wave of Innovation</title>
  <link>https://www.linuxjournal.com/content/empowering-linux-developers-new-wave-innovation</link>
  <description>  &lt;div data-history-node-id="1340007" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/snapcraft-logo.jpg" width="800" height="500" alt="snapcraft logo" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/evan-dandrea" lang="" about="https://www.linuxjournal.com/users/evan-dandrea" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Evan Dandrea&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;&lt;em&gt;New businesses with software at their core are being created every day. Developers are the lifeblood of so much of what is being built and of technological innovation, and they are ever more vital to operations across the entire business. So why wouldn't we empower them?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;
Machine learning and IoT in particular offer huge opportunities for developers, especially those facing the crowded markets of other platforms, to engage with a sizeable untapped audience.
&lt;/p&gt;

&lt;p&gt;
That Linux is open source makes it an amazing breeding ground for innovation. Developers aren’t constrained by closed ecosystems, which is why Linux has long been their operating system of choice. So by engaging with Linux, businesses can attract the best available developer skills.
&lt;/p&gt;

&lt;p&gt;
The Linux ecosystem has always strived for a high degree of quality. Historically it was the Linux community taking sole responsibility for packaging software, gating each application update with careful review to ensure it worked as advertised on each distribution of Linux. This proved difficult for all sides.
&lt;/p&gt;

&lt;p&gt;
Distributions needed broad access to the code so that open-source software could be offered through their app stores. User support requests and bugs were channelled through the Linux distributions, and the sheer volume of reporting made it difficult to feed information back to the appropriate software authors.
&lt;/p&gt;

&lt;p&gt;
As the number of applications and Linux distributions grew, it became increasingly clear this model would not scale much further. Software authors took matters into their own hands, often picking a single Linux distribution to support and skipping the app store entirely. Because of this, they lost app discoverability and gained the complexity of running duplicative infrastructure.
&lt;/p&gt;

&lt;p&gt;
This placed increased responsibility on developers at a time when the expectations of their role were already expanding. They are no longer just makers; they now bear the risk of breaking robotic arms with their code or bringing down MRI machines with a patch.
&lt;/p&gt;

&lt;p&gt;
As an industry, we acknowledge this problem—a bad update can always slip through, and software isn’t an exact science—but we then ask these developers to roll the dice. Do you risk compromise or self-inflicted harm?
&lt;/p&gt;

&lt;p&gt;
Meanwhile the surface area increases. The industry continues a steady march of automation, creating ever more software components to plug together and layer solutions on. Not only do developers face the update question for their own code, they also must trust all developers facing that same decision in all the code beneath their own.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/empowering-linux-developers-new-wave-innovation" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 13 Jul 2018 12:15:15 +0000</pubDate>
    <dc:creator>Evan Dandrea</dc:creator>
    <guid isPermaLink="false">1340007 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>ONNX: the Open Neural Network Exchange Format</title>
  <link>https://www.linuxjournal.com/content/onnx-open-neural-network-exchange-format</link>
  <description>  &lt;div data-history-node-id="1339771" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/onnx.jpg" width="800" height="599" alt="onnx logo" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/user/800928" lang="" about="https://www.linuxjournal.com/user/800928" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Braddock Gaskill&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;&lt;em&gt;
An open-source battle is being waged for the soul of artificial
intelligence. It is being fought by industry titans, universities and
communities of machine-learning researchers world-wide. This article
chronicles one small skirmish in that fight: a standardized file format
for neural networks. At stake is the open exchange of data among a
multitude of tools instead of competing monolithic frameworks.
&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;
The good news is that the battleground is Free and Open. None of the
big players are pushing closed-source solutions. Whether it is Keras and
TensorFlow backed by Google, Apache MXNet endorsed by Amazon, or Caffe2
and PyTorch supported by Facebook, all solutions are open-source software.
&lt;/p&gt;
&lt;p&gt;
Unfortunately, while these projects are &lt;em&gt;open&lt;/em&gt;, they are not
&lt;em&gt;interoperable&lt;/em&gt;. Each framework constitutes a complete stack that
until recently could not interface in any way with any other framework.
A new industry-backed standard, the Open Neural Network Exchange format,
could change that.
&lt;/p&gt;

&lt;p&gt;
Now, imagine a world where you can train a neural network in Keras,
run the trained model through the NNVM optimizing compiler and
deploy it to production on MXNet. And imagine that is just one of
countless combinations of interoperable deep learning tools, including
visualizations, performance profilers and optimizers. Researchers and
DevOps teams no longer need to compromise on a single toolchain that provides
a mediocre modeling environment and so-so deployment performance.
&lt;/p&gt;

&lt;p&gt;
What is required is a standardized format that can express any machine-learning model and store trained parameters and weights, readable and
writable by a suite of independently developed software.
&lt;/p&gt;

&lt;p&gt;
Enter the &lt;a href="https://onnx.ai"&gt;Open Neural Network Exchange
Format&lt;/a&gt; (ONNX).
&lt;/p&gt;

&lt;h3&gt;
The Vision&lt;/h3&gt;

&lt;p&gt;
To understand the pressing need for interoperability through a standard like
ONNX, we first must understand the outsized demands we place on
existing monolithic frameworks.
&lt;/p&gt;

&lt;p&gt;
A casual user of a deep learning framework may think of it as a language
for specifying a neural network. For example, I want 100 input neurons,
three fully connected layers each with 50 ReLU outputs, and a softmax on
the output. My framework of choice has a domain language to specify this
(like Caffe) or bindings to a language like Python with a clear API.
&lt;/p&gt;
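As a concrete, framework-agnostic sketch of the architecture just described, here is a plain-Python forward pass: 100 inputs, three fully connected ReLU layers of 50, and a softmax on the output. The weight ranges and random input are illustrative only; a real framework would also handle training and hardware acceleration.

```python
import math
import random

random.seed(0)

def relu(x):
    return max(0.0, x)

def softmax(xs):
    # subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation):
    # one fully connected layer: out_j = act(sum_i in_i * w[i][j] + b[j])
    outs = []
    for j in range(len(biases)):
        total = biases[j] + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outs.append(activation(total))
    return outs

def make_layer(n_in, n_out):
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_out)]
               for _ in range(n_in)]
    biases = [0.0] * n_out
    return weights, biases

# 100 input neurons -> three fully connected layers of 50 ReLU outputs
sizes = [100, 50, 50, 50]
layers = [make_layer(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

x = [random.random() for _ in range(100)]
for weights, biases in layers:
    x = dense(x, weights, biases, relu)
probs = softmax(x)
print(len(probs), round(sum(probs), 6))  # 50 outputs summing to 1
```

Everything a framework does beyond these few loops, such as compiling the same definition to CUDA or OpenCL, is exactly the complexity discussed next.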

&lt;p&gt;
However, the specification of the network architecture is only the tip of
the iceberg. Once a network structure is defined, the framework still
has a great deal of complex work to do to make it run on your CPU or
GPU cluster.
&lt;/p&gt;

&lt;p&gt;
Python, obviously, doesn't run on a GPU. To make your network definition
run on a GPU, it needs to be compiled into code for the CUDA (NVIDIA) or
OpenCL (AMD and Intel) APIs or processed in an efficient way if running
on a CPU. This compilation is complex, which is why most frameworks don't
support both NVIDIA and AMD GPU back ends.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/onnx-open-neural-network-exchange-format" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Wed, 25 Apr 2018 14:19:00 +0000</pubDate>
    <dc:creator>Braddock Gaskill</dc:creator>
    <guid isPermaLink="false">1339771 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Novelty and Outlier Detection</title>
  <link>https://www.linuxjournal.com/content/novelty-and-outlier-detection</link>
  <description>  &lt;div data-history-node-id="1339508" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_4.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked at a number of ways
machine learning can help make predictions. The basic idea is
that you create a model using existing data and then ask that model to
predict an outcome based on new data.
&lt;/p&gt;

&lt;p&gt;
So, it's not surprising that one of the most amazing ways machine
learning is being applied is in predicting the future. Just a few days
before writing this piece, it was announced that machine learning
models actually might be able to predict earthquakes—a goal that
has eluded scientists for many years and that has the potential to
save thousands, and maybe even millions, of lives.
&lt;/p&gt;

&lt;p&gt;
But as you've also seen, machine learning can be used to
"cluster" data—that is, to find patterns that humans either can't or won't see,
and to try to put the data into various "clusters", or machine-driven
categories. By asking the computer to divide data into distinct
groups, you gain the opportunity to find and make use of previously
undetected patterns.
&lt;/p&gt;

&lt;p&gt;
Just as clustering can be used to divide data into a number of
coherent groups, it also can be used to decide which data points
belong inside a group and which don't. In "novelty
detection", you
have a data set that contains only good data, and you're trying to
determine whether new observations fit within the existing data
set. In "outlier detection", the data may contain outliers,
which you
want to identify.
&lt;/p&gt;

&lt;p&gt;
Where could such detection be useful? Consider just a few
questions you could answer with such a system:
&lt;/p&gt;

&lt;ul&gt;&lt;li&gt;
&lt;p&gt;
Is there an unusual number of login attempts from a particular IP
address?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Are any customers buying more than the typical number of products
at a given hour?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which homes are consuming above-average amounts of water during a
drought?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Which judges convict an unusual number of defendants?
&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;
&lt;p&gt;
Should a patient's blood tests be considered normal, or are there
outliers that require further checks and examinations?
&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;
In all of those cases, you could set thresholds for minimum and maximum
values and then tell the computer to use those thresholds in
determining what's suspicious. But machine learning changes that
around, letting the computer figure out what is considered "normal"
and then identify the anomalies, which humans then
can investigate. This allows people to concentrate their energies on
understanding whether the outliers are indeed problematic, rather than
on identifying them in the first place.
&lt;/p&gt;
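As a minimal illustration of that "learn what's normal, then flag the anomalies" idea, here is a hand-rolled z-score check using only Python's standard library. The sensor readings and the cutoff of 2 standard deviations are invented for the example; the scikit-learn estimators discussed in the article are far more capable.

```python
import statistics

def find_outliers(values, z_cutoff=2.0):
    # Learn what "normal" looks like from the data itself,
    # then flag points that sit far from it.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > z_cutoff]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 55.0]
print(find_outliers(readings))  # flags the 55.0 reading
```

Note that a single extreme value inflates the standard deviation it is measured against, which is one reason real outlier detectors use more robust statistics.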

&lt;p&gt;
So in this article, I look at a number of ways you can try to
identify outliers using the tools and libraries that Python provides
for working with data: NumPy, Pandas and scikit-learn. Just which
technique and tools will be appropriate for your data depend on what
you're doing, but the basic theory and practice presented here should
at least provide you with some food for thought.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/novelty-and-outlier-detection" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 28 Sep 2017 12:31:03 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339508 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Classifying Text</title>
  <link>https://www.linuxjournal.com/content/classifying-text</link>
  <description>  &lt;div data-history-node-id="1339480" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_3.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked
at several ways one
can apply machine learning, both supervised and
unsupervised. This time, I want to bring your attention to a
surprisingly simple—but powerful and widespread—use of machine
learning, namely document classification.
&lt;/p&gt;

&lt;p&gt;
You almost certainly have seen this technique used in day-to-day
life. Actually, you might not have seen it in action, but you
certainly have benefited from it, in the form of an email spam filter.
You might remember that back in the earliest days of spam filters, you
needed to "train" your email program, so that it would know what your
real email looked like. Well, that was a machine-learning model in
action, being told what "good" documents looked like, as opposed to
"bad" documents. Of course, spam filters are far more sophisticated
than that nowadays, but as you'll see over the course of this
article, there are logical reasons why spammers include
innocent-seeming (and irrelevant to their business) words in the text
of their spam.
&lt;/p&gt;

&lt;p&gt;
Text classification is a problem many businesses and
organizations have to deal with. Whether it's classifying legal
documents, medical records or tweets, machine learning can help you
look through lots of text, separating it into different groups.
&lt;/p&gt;

&lt;p&gt;
Now, text classification requires a bit more sophistication than
working with purely numeric data. In particular, it requires that you
spend some time collecting and organizing data into a format that
a model can handle. Fortunately, Python's scikit-learn comes with a
number of tools that can get you there fairly easily.
&lt;/p&gt;
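For instance, turning raw text into the numeric form an estimator expects can be as simple as counting words. This is a toy version of the idea behind scikit-learn's CountVectorizer; the three documents here are invented:

```python
from collections import Counter

docs = ["cheap pills buy now",
        "meeting notes for the team",
        "buy cheap watches now"]

# build a fixed vocabulary from every word seen in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def vectorize(doc):
    # one count per vocabulary word, in a stable order
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

X = [vectorize(doc) for doc in docs]
print(vocab)
print(X[0])
```

Each document becomes a row of numbers of identical length, which is exactly the kind of input the estimators below can handle.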

&lt;h3&gt;
Organizing the Data&lt;/h3&gt;

&lt;p&gt;
Many cases of text classification are supervised learning
problems—that is, you'll train the model, give it inputs (for example,
text documents) and the "right" output for each input (for
example, categories). In scikit-learn, the general template for supervised
learning is:

&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
model = CLASS()            # instantiate one of scikit-learn's estimators
model.fit(X, y)            # train on inputs X and their known outputs y
model.predict(new_data_X)  # predict outputs for unseen inputs
&lt;/code&gt;
&lt;/pre&gt;
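To make that template concrete without committing to a particular estimator, here is a toy classifier that follows the same fit/predict shape. The class and the one-dimensional data are invented for illustration; scikit-learn's own estimators work on 2-D feature arrays.

```python
class NearestMeanClassifier:
    # a toy estimator following scikit-learn's fit/predict convention
    def fit(self, X, y):
        sums, counts = {}, {}
        for value, label in zip(X, y):
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
        # remember the mean feature value seen for each label
        self.means_ = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, new_X):
        # assign each new value the label whose mean is closest
        return [min(self.means_, key=lambda lb: abs(self.means_[lb] - v))
                for v in new_X]

model = NearestMeanClassifier()
model.fit([1.0, 1.2, 8.9, 9.1], ["low", "low", "high", "high"])
print(model.predict([1.1, 9.0]))  # ['low', 'high']
```

The point of the shared convention is that this three-line usage pattern stays the same no matter which estimator class you swap in.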


&lt;p&gt;
&lt;code&gt;CLASS&lt;/code&gt; is one of the 30 or so Python classes that come with
scikit-learn, each of which implements a different type of
"estimator"—a machine-learning algorithm. Some estimators work best with
supervised classification problems, some work with supervised
regression problems, and still others work with clustering (that is,
unsupervised classification) problems. You often will be able to
choose from among several different estimators, but the general format
remains the same.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/classifying-text" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 05 Sep 2017 14:35:37 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339480 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Unsupervised Learning</title>
  <link>https://www.linuxjournal.com/content/unsupervised-learning</link>
  <description>  &lt;div data-history-node-id="1339461" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_2.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked into machine learning and
how you can build a model that describes the world in some way. All of
the examples I looked at were of "supervised learning", meaning
that you loaded data that already had been categorized or classified in
some way, and then created a model that "learned" the ways
the inputs mapped to the outputs. With a good model, you then
were able to predict the output for a new set of inputs.
&lt;/p&gt;
&lt;p&gt;
Supervised learning is a very useful technique and is quite
widespread. But, there is another set of techniques in machine
learning known as &lt;em&gt;unsupervised learning&lt;/em&gt;. These techniques, broadly
speaking, ask the computer to find the hidden structure in the
data—in other words, to "learn" what the meaning of the data is, what
relationships it contains, which features are of importance, and which
data records should be considered to be outliers or anomalies.
&lt;/p&gt;

&lt;p&gt;
Unsupervised learning also can be used for what's known as
"dimensionality reduction", in which the model functions as a
preprocessing step, reducing the number of features in order to
simplify the inputs that you'll hand to another model.
&lt;/p&gt;

&lt;p&gt;
In other words, in supervised learning, you teach the computer about
your data and hope that it understands the relationships and
categorization well enough to categorize data it hasn't
seen before successfully.
&lt;/p&gt;

&lt;p&gt;
In unsupervised learning, by contrast, you're asking the computer to
tell you something interesting about the data.
&lt;/p&gt;

&lt;p&gt;
This month, I take an initial look at the world of unsupervised
learning. Can a computer categorize data as well as a human? How can
you use Python's scikit-learn to create such models?
&lt;/p&gt;
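As a small taste of what "finding structure on its own" means, here is a hand-rolled, one-dimensional version of k-means clustering; scikit-learn's KMeans does the real, multi-dimensional work, and the data points here are invented. Nothing tells the computer which group each point belongs to; it discovers the two clumps itself.

```python
def kmeans_1d(points, k, iterations=20):
    pts = sorted(points)
    # seed the centers with k evenly spaced points from the data
    centers = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iterations):
        # assign each point to its nearest center...
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # ...then move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], k=2)
print(sorted(round(c, 2) for c in centers))  # two cluster centers emerge
```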

&lt;h3&gt;
Unsupervised Learning&lt;/h3&gt;

&lt;p&gt;
There's a children's card game called &lt;em&gt;Set&lt;/em&gt; that is a useful way to
think about machine learning. Each card in the game contains a
picture. The picture contains one, two or three shapes. There are
several different shapes, and each shape has a color and a fill
pattern. In the game, players are supposed to identify three-card
groups using any one of those properties. Thus, you could
create a group based on the color green, in which all cards are green
(but contain different numbers of shapes, different shapes and different
fill patterns). You could create a group based on the number of shapes, in
which every card has two shapes, but those shapes can be of any
color, any kind and any fill pattern.
&lt;/p&gt;
                 
&lt;p&gt;
The idea behind the game is that players can create a variety of
different groups and should take advantage of this in order to win
the game.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/unsupervised-learning" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 10 Aug 2017 12:13:28 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339461 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Testing Models</title>
  <link>https://www.linuxjournal.com/content/testing-models</link>
  <description>  &lt;div data-history-node-id="1339428" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_1.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've been dipping into the
waters of "machine learning"—a powerful idea that has been moving
steadily into the mainstream of computing, and that has the potential
to change lives in numerous ways. The goal of machine learning is
to produce a "model"—a piece of software that can make predictions
with new data based on what it has learned from old data.
&lt;/p&gt;

&lt;p&gt;
One common type of problem that machine learning can help solve
is classification. Given some new data, how can you categorize it? For
example, if you're a credit-card company, and you have data about a
new purchase, does the purchase appear to be legitimate or fraudulent?
The degree to which you can categorize a purchase accurately depends
on the quality of your model. And, the quality of your model will
generally depend on not only the algorithm you choose, but also the
quantity and quality of data you use to "train" that model.
&lt;/p&gt;

&lt;p&gt;
Implied in the above statement is that given the same input data,
different algorithms can produce different results. For this reason,
it's not enough to choose a machine-learning algorithm. You also
must test the resulting model and compare its quality against other
models as well.
&lt;/p&gt;

&lt;p&gt;
So in this article, I explore the notion of testing models. I show how
Python's scikit-learn package, which you can use to build and
train models, also provides the ability to test them. I also describe how
scikit-learn provides tools to compare model
effectiveness.
&lt;/p&gt;
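The core idea can be sketched without any library at all: hold some data back, train on the rest, and score the model on the held-back portion. The split ratio, the toy purchase data and the trivial threshold "model" below are all invented for illustration; scikit-learn's own utilities do this properly.

```python
import random

def train_test_split(X, y, test_fraction=0.25, seed=42):
    # shuffle indices, then hold back a fraction for testing
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(X) * (1 - test_fraction))
    train, test = indices[:cut], indices[cut:]
    return ([X[i] for i in train], [X[i] for i in test],
            [y[i] for i in train], [y[i] for i in test])

def accuracy(predicted, actual):
    # fraction of held-back examples the model got right
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# toy data: purchases over 100 are labeled suspicious
X = [5, 12, 250, 8, 400, 30, 120, 7]
y = ["ok", "ok", "flag", "ok", "flag", "ok", "flag", "ok"]
X_train, X_test, y_train, y_test = train_test_split(X, y)

# a trivial stand-in for a trained model: flag anything over 100
predictions = ["flag" if v > 100 else "ok" for v in X_test]
print(accuracy(predictions, y_test))
```

The crucial point is that the score is computed only on data the model never saw during training; scoring on the training data is what lets overfitting hide.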

&lt;h3&gt;
Testing Models&lt;/h3&gt;

&lt;p&gt;
What does it even mean to "test" a model? After all, if you have built
a model based on available data, doesn't it make sense that the model
will work with future data?
&lt;/p&gt;

&lt;p&gt;
Perhaps, but you need to check, just to be sure. Perhaps the algorithm
isn't quite appropriate for the type of data you're examining, or
perhaps there wasn't enough data to train the model well. Or, perhaps
the data was flawed and, thus, didn't train the model effectively.
&lt;/p&gt;

&lt;p&gt;
But, one of the biggest problems with modeling is that of
"overfitting". Overfitting means that the model does a great job of
describing the training data, but that it is tied to the training
data so closely and specifically that it cannot be generalized
further.
&lt;/p&gt;

&lt;p&gt;
For example, let's assume that a credit-card company wants to model
fraud. You know that in a large number of cases, people use credit
cards to buy expensive electronics. An overfit model wouldn't just
give extra weight to someone buying expensive electronics in its
determination of fraud; it might look at the exact price, location
and type of electronics being bought. In other words, the model will
precisely describe what has happened in the past, limiting its ability
to generalize and predict the future.
&lt;/p&gt;
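
&lt;p&gt;
The effect is easy to demonstrate. In this sketch (again using
scikit-learn's iris data set as a stand-in), an unconstrained decision
tree memorizes its training data, scoring essentially perfectly on it,
while typically doing somewhat worse on data it has never seen:
&lt;/p&gt;

```python
# Sketch: an unconstrained decision tree overfits, describing the
# training data perfectly while generalizing less well.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # no depth limit
tree.fit(X_train, y_train)

print(tree.score(X_train, y_train))  # at or near 1.0: memorized
print(tree.score(X_test, y_test))    # usually lower on unseen data
```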

&lt;p&gt;
Imagine if you could read letters only in a font you had previously
learned, and you can further understand the limitations of
overfitting.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/testing-models" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 29 Jun 2017 11:20:14 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339428 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Teaching Your Computer</title>
  <link>https://www.linuxjournal.com/content/teaching-your-computer</link>
  <description>  &lt;div data-history-node-id="1339390" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_0.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
As I have written in my last two articles (&lt;a href="https://www.linuxjournal.com/content/machine-learning-everywhere"&gt;Machine Learning Everywhere&lt;/a&gt; and &lt;a href="https://www.linuxjournal.com/content/preparing-data-machine-learning"&gt;Preparing Data for Machine Learning&lt;/a&gt;), machine learning is
influencing our lives in numerous ways. As a consumer, you've
undoubtedly experienced machine learning, whether you know it or not—from recommendations
for what products you should buy from various
online stores, to the selection of postings that appear (and don't) on
Facebook, to the maddening voice-recognition systems that airlines
use, to the growing number of companies that offer to select clothing,
food and wine for you based on your personal preferences.
&lt;/p&gt;

&lt;p&gt;
Machine learning is everywhere, and although the theory and practice
both can take some time to learn and internalize, the basics are
fairly straightforward.

&lt;p&gt;
The basic idea behind machine learning is that you build a model—a
description of the ways the inputs and outputs are
related. This model then allows you to ask the computer to analyze new
data and to predict the outputs for new sets of inputs. This is
essentially what machine learning is all about. In "supervised
learning", the computer is trained to categorize data based on inputs
that humans have previously categorized. In "unsupervised learning",
you ask the computer to find the categories on your behalf, without
any pre-labeled examples.
&lt;/p&gt;
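
&lt;p&gt;
The distinction fits into a few lines of scikit-learn. In this sketch
(the two-dimensional points are made up for illustration), a
supervised classifier learns from category labels that we supply,
while an unsupervised clustering algorithm invents the categories
itself:
&lt;/p&gt;

```python
import numpy as np
from sklearn.cluster import KMeans                  # unsupervised
from sklearn.neighbors import KNeighborsClassifier  # supervised

# Two obvious groups of made-up 2-D points.
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]])

# Supervised: we supply the categories (0 and 1) ourselves.
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[2, 2], [9, 9]]))   # -> [0 1]

# Unsupervised: KMeans finds two groups without seeing any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```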

&lt;p&gt;
In my last article, I started exploring a data set created by Scott Cole, a
data scientist (and neuroscience PhD student) who measured burritos in
a variety of California restaurants. I looked at the different
categories of data that Cole and his fellow eater-researchers
gathered and considered a few ways one could pare down the
data set to something more manageable and reasonable.
&lt;/p&gt;

&lt;p&gt;
Here I describe how to take this smaller data set, consisting
solely of the features that were deemed necessary, and use
it to train the computer by creating a machine-learning model.
&lt;/p&gt;

&lt;h3&gt;
Machine-Learning Models&lt;/h3&gt;

&lt;p&gt;
Let's say that the quality of a burrito is determined solely by its
size. Thus, the larger the burrito, the better it is; the smaller the
burrito, the worse it is. If you describe the size as a matrix X, and
the resulting quality score as y, you can describe this
mathematically as:

&lt;/p&gt;&lt;pre&gt;
&lt;code&gt;
y = qX
&lt;/code&gt;
&lt;/pre&gt;


&lt;p&gt;
where q is a factor describing the relationship between X and y.
&lt;/p&gt;
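
&lt;p&gt;
Given a handful of size/score pairs, scikit-learn's LinearRegression
can recover q. The numbers below are made up for illustration, with q
set to 2:
&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = qX, with q = 2:
# burrito sizes (a one-column matrix X) and quality scores y.
X = np.array([[10], [15], [20], [25]])
y = np.array([20, 30, 40, 50])

model = LinearRegression().fit(X, y)
print(model.coef_[0])         # the recovered q (approximately 2.0)
print(model.predict([[30]]))  # predicted score for a new size
```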

&lt;p&gt;
Of course, you know that burrito quality has to do with more than just
the size. Indeed, in Cole's research, size was removed from the list
of features, in part because not every data point contained size
information.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/teaching-your-computer" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 16 May 2017 09:01:26 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339390 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Preparing Data for Machine Learning</title>
  <link>https://www.linuxjournal.com/content/preparing-data-machine-learning</link>
  <description>  &lt;div data-history-node-id="1339368" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
When I go to Amazon.com, the online store often recommends products
I should buy. I know I'm not alone in thinking that these
recommendations can be rather spooky—often they're for products
I've already bought elsewhere or that I was thinking of buying.
How does Amazon do it? For that matter, how do Facebook and LinkedIn
know to suggest that I connect with people whom I already know, but
with whom I haven't yet connected online?
&lt;/p&gt;

&lt;p&gt;
The answer, in short, is "data science", a relatively new field that
marries programming and statistics in order to make sense of the huge
quantity of data we're creating in the modern world. Within the world
of data science, machine learning uses software to create statistical
models to find correlations in our data. Such correlations can help
recommend products, predict highway traffic, personalize pricing,
display appropriate advertising or identify images.
&lt;/p&gt;

&lt;p&gt;
So in this article, I take a look at machine learning and some of the
amazing things it can do. I increasingly feel that machine
learning is sort of like the universe—already vast and expanding
all of the time. By this, I mean that even if you think you've missed
the boat on machine learning, it's never too late to start. Moreover,
everyone else is struggling to keep up with all of the technologies,
algorithms and applications of machine learning as well.
&lt;/p&gt;

&lt;p&gt;
For this article, I'm looking at a simple application of categorization and
"supervised learning", solving a problem that has vexed scientists and
researchers for many years: just what makes the perfect burrito?
Along the way, you'll hopefully start to understand some of the techniques
and ideas in the world of machine learning.
&lt;/p&gt;

&lt;h3&gt;
The Problem&lt;/h3&gt;

&lt;p&gt;
The problem, as stated above, is a relatively simple one to
understand: burritos are a popular food, particularly in southern
California. You can get burritos in many locations, typically with a
combination of meat, cheese and vegetables. Burritos' prices vary
widely, as do their sizes and quality. Scott Cole, a PhD student in
neuroscience, argued with his friends not only over where they could
get the best burritos, but which factors led to a burrito being better
or worse. Clearly, the best way to solve this problem was by
gathering data.
&lt;/p&gt;

&lt;p&gt;
Now, you can imagine a simple burrito-quality rating system, as used
by such services as Amazon: ask people to rate the burrito on a scale of
1–5.
Given enough ratings, that would indicate which burritos were
best and which were worst.
&lt;/p&gt;
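
&lt;p&gt;
Computationally, such a one-dimensional system is nothing more than an
average per burrito. A sketch, with made-up names and ratings:
&lt;/p&gt;

```python
# Sketch: rank burritos by mean rating. The names and scores
# here are made up for illustration.
from statistics import mean

ratings = {
    'California burrito': [5, 4, 5, 4],
    'Carnitas burrito': [3, 2, 4, 3],
    'Surf and turf burrito': [4, 5, 5, 5],
}

# Sort burritos by their mean rating, best first.
for name, scores in sorted(ratings.items(),
                           key=lambda item: mean(item[1]),
                           reverse=True):
    print(f'{name}: {mean(scores):.2f}')
```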

&lt;p&gt;
But Cole, being a good researcher, understood that a simple,
one-dimensional rating was probably not sufficient. A
multi-dimensional rating system would keep ratings closer together
(since they would be more focused), but it also would allow him to
understand which aspects of a burrito were most essential to its high
quality.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/preparing-data-machine-learning" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 25 Apr 2017 13:11:40 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339368 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Machine Learning Everywhere</title>
  <link>https://www.linuxjournal.com/content/machine-learning-everywhere</link>
  <description>  &lt;div data-history-node-id="1339295" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/data-978962_640.jpg" width="640" height="268" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
The field of statistics typically has had a bad reputation. It's seen
as difficult, boring and even a bit useless. Many of my friends had
to take statistics courses in graduate school, so that they could
analyze and report on their research. To many of them, the classes
were a form of nerdy, boring torture.
&lt;/p&gt;

&lt;p&gt;
Maybe it's just me, but after I took those courses, I felt like I was
seeing the world through new eyes. Suddenly, I could better understand
the world around me. Newspaper articles about the government and scientific
and corporate reports made more sense. I also could identify the
flaws in such reports more easily and criticize them from a position of
understanding.
&lt;/p&gt;

&lt;p&gt;
Much of the power of statistics lies in the creation of a
"model", or
a mathematical description of reality. A model is a caricature of
sorts, in that it doesn't represent all of reality, but rather just
those factors that you think will affect the thing you're trying to
understand. A model lets you say that given inputs A, B, C and D,
you can know, more or less, what the output will be.
&lt;/p&gt;

&lt;p&gt;
Sometimes, the goal of a statistical model is to predict a value—for example, given a certain size and neighborhood, you can predict
the price of a house. Or, given people's age, weight and where they
live, you can predict their likelihood of getting a certain disease.
&lt;/p&gt;
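
&lt;p&gt;
A value-predicting ("regression") model of the house-price sort might
be sketched like this with scikit-learn; every number here is made up
for illustration:
&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: [size in square meters, neighborhood code] -> price.
X = np.array([[50, 0], [80, 0], [60, 1], [100, 1]])
y = np.array([100_000, 160_000, 150_000, 250_000])

model = LinearRegression().fit(X, y)

# Predict the price of a 70-square-meter house in neighborhood 1.
print(model.predict([[70, 1]]))
```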

&lt;p&gt;
Often, the goal is to predict a category—for example, in an
upcoming election, for whom are people likely to vote? Taking into
account where they live, what level of education they've received,
their ethnic background and a few other factors, you can often
predict for whom people will vote before they know it themselves.
&lt;/p&gt;
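
&lt;p&gt;
A category-predicting ("classification") model looks almost the same
in scikit-learn. This sketch uses made-up voter data, with a region
code and years of education as the inputs and a party (0 or 1) as the
category to predict:
&lt;/p&gt;

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up voters: [region code, years of education] -> party (0 or 1).
X = np.array([[0, 12], [0, 14], [0, 16],
              [1, 12], [1, 14], [1, 16]])
y = np.array([0, 0, 1, 1, 1, 1])

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the party of a new voter from region 0 with 13 years
# of education.
print(clf.predict([[0, 13]]))
```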

&lt;p&gt;
During the past few years, there has been a huge amount of buzz around the
terms "big data", "data science" and
"machine learning". As these
buzzwords continue to gain acceptance, many statisticians are
wondering what the big deal is. And to be honest, their complaint
makes some sense, given that "machine learning" is, more or less, a
computerized version of the predictive models that statisticians have
been creating for decades.
&lt;/p&gt;

&lt;p&gt;
Now, why am I telling you this? Because I actually do believe that
machine learning is a game-changer for huge parts of our lives. Just
as my perspective was changed when I learned statistics, giving me
tools to understand the world better, many businesses are having their
perspectives changed, as they use machine learning to
understand themselves better. Everything from online shopping, to the items
you see in your social-network feeds, to the voice-recognition
algorithms in your phone, to the fraud detection used by your
credit-card company is being affected, boosted and (hopefully)
improved via machine learning.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/machine-learning-everywhere" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 23 Feb 2017 11:43:27 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339295 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
