Data Science

Shall We Study Amazon's Pricing Together?

Doc Searls — Tue, 02 Oct 2018 11:30:00 +0000

Is it possible to figure out how we're being profiled online?

This past July, I spent a quality week getting rained out in a series of brainstorms by alpha data geeks at the Pacific Northwest BI & Analytics Summit in Rogue River, Oregon. Among the many things I failed to understand fully there was how much, or how well, we could know about how the commercial sites and services of the online world deal with us, based on what they gather about us, on the fly or over time, as we interact with them.

The short answer was "not much". But none of the experts I talked to said "Don't bother trying." On the contrary, the consensus was that the sums of data gathered by most companies are (in the words of one expert) "spaghetti balls" that are hard, if not impossible, to unravel completely. More to my mission in life and work, they said it wouldn't hurt to have humans take some interest in the subject.

In fact, that was pretty much why I was invited there, as a Special Guest. My topic was "When customers are in full command of what companies do with their data—and data about them". As it says at that link, "The end of this story...is a new beginning for business, in a world where customers are fully in charge of their lives in the marketplace—both online and off: a world that was implicit in both the peer-to-peer design of the Internet and the nature of public markets in the pre-industrial world."

Obviously, this hasn't happened yet.

This became even more obvious during a break when I drove to our Airbnb nearby. By chance, my rental car radio was tuned to a program called From Scurvy to Surgery: The History Of Randomized Trials. It was an Innovation Hub interview with Andrew Leigh, Ph.D. (@ALeighMP), economist and member of the Australian Parliament, discussing his new book, Randomistas: How Radical Researchers Are Changing Our World (Yale University Press, 2018). At one point, Leigh reported that "One expert says, 'Every pixel on Amazon's home page has had to justify its existence through a randomized trial.'"

I thought, Wow. How much of my own experience of Amazon has been as a randomized test subject? And can I possibly be in anything even remotely close to full charge of my own life inside Amazon's vast silo?

Learning Data Science

Reuven M. Lerner — Tue, 24 Oct 2017 12:19:27 +0000

In my last few articles, I've written about data science and machine learning. In case my enthusiasm wasn't obvious from my writing, let me say it plainly: it has been a long time since I last encountered a technology that was so poised to revolutionize the world in which we live.

Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly every possible topic you can imagine, for free. You can analyze that data, publish it on a blog, and get reactions from governments and companies.

I remember learning in high school that the difference between freedom of speech and freedom of the press is that not everyone has a printing press. Not only has the internet provided everyone with the equivalent of a printing press, but it has given us the power to perform the sort of analysis that until recently was exclusively available to governments and wealthy corporations.

During the past year, I have increasingly heard that data science is the sexiest profession of the 21st century and the one that will be in greatest demand. Needless to say, those two things make for a very appealing combination! It's no surprise that I've seen a major uptick in the number of companies inviting me to teach on this subject.

The upshot is that you (yes, you, dear reader) should spend time in the coming weeks, months and years learning whatever you can about data science. This isn't because you will change jobs and become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at it, because you will be able to use the tools of data science to analyze past performance and make predictions based on it.

Back when I started to develop web applications, it was the norm to have a database team that created the tables and queries. Nowadays, although there certainly are places that have a full-time database staff, the assumption is that every developer has at least a passing familiarity with relational (or even NoSQL) databases and how to work with them. In the same way that developers who understand databases are more powerful than those who don't, people in the computer field who understand data science are more powerful than those who don't.

There is a bit of bad news on this front, though. If you thought that technological change in programming and the web moved at a breakneck pace, you haven't seen anything yet! The world of data science (the tools, the algorithms, the applications) is moving at an overwhelming speed. The good news is that everyone is struggling to keep up, which means if you find yourself overwhelmed, you're probably in very good company. Just be sure to keep moving ahead, aiming to increase your understanding of the theory, algorithms, techniques and software that data scientists use.

Unsupervised Learning

Reuven M. Lerner — Thu, 10 Aug 2017 12:13:28 +0000

In my last few articles, I've looked into machine learning and how you can build a model that describes the world in some way. All of the examples I looked at were of "supervised learning", meaning that you loaded data that already had been categorized or classified in some way, and then created a model that "learned" the ways the inputs mapped to the outputs. With a good model, you then were able to predict the output for a new set of inputs.
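As a rough illustration of that idea, here is a minimal sketch of supervised learning in plain Python. It is not taken from the earlier articles: the nearest-centroid technique, the function names `fit_centroids` and `predict`, and the toy measurements are all assumptions chosen to keep the example short. The model "learns" one average point per label from categorized data, then predicts the label for inputs it has not seen.

```python
from collections import defaultdict
from math import dist

def fit_centroids(inputs, labels):
    """Learn one centroid (mean point) per label from categorized data."""
    grouped = defaultdict(list)
    for point, label in zip(inputs, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }

def predict(centroids, point):
    """Return the label whose centroid is closest to the new point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

# Toy, made-up training data: (height_cm, weight_kg) with a known label.
inputs = [(30, 4), (35, 5), (32, 4.5), (60, 25), (65, 30), (62, 28)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

model = fit_centroids(inputs, labels)
print(predict(model, (33, 5)))   # a new, unseen input
print(predict(model, (63, 27)))  # another unseen input
```

In practice you would more likely reach for a library such as scikit-learn, but the shape is the same: fit on labeled data, then predict on new inputs.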

Supervised learning is a very useful technique and is quite widespread. But there is another set of techniques in machine learning known as unsupervised learning. These techniques, broadly speaking, ask the computer to find the hidden structure in the data; in other words, to "learn" what the data means, what relationships it contains, which features are important, and which data records should be considered outliers or anomalies.
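To make "finding hidden structure" a bit more concrete, here is a small sketch of k-means clustering, one common unsupervised technique, written in plain Python. The function name `kmeans`, the naive choice of the first k points as starting centroids, and the sample points are assumptions made for illustration only. Note that no labels are supplied anywhere: the algorithm only sees the points.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Group points into k clusters without using any labels."""
    centroids = list(points[:k])  # naive start (assumption): the first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, clusters

# Made-up, unlabeled data with two visually obvious groups.
points = [(1, 1), (9, 9), (1.5, 2), (8, 8), (1, 0.5), (9, 10)]

centroids, clusters = kmeans(points, k=2)
for members in clusters:
    print(members)
```

The starting centroids matter a great deal for real data; library implementations use more careful initialization than this sketch does.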

Unsupervised learning also can be used for what's known as "dimensionality reduction", in which the model functions as a preprocessing step, reducing the number of features in order to simplify the inputs that you'll hand to another model.
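One of the simplest ways to reduce the number of features is to drop columns that barely vary, since they tell a later model almost nothing. The sketch below uses that variance filter as a stand-in for more capable techniques; the function name `drop_low_variance`, the threshold and the sample rows are assumptions for illustration.

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=0.0):
    """Keep only the feature columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [tuple(row[i] for i in keep) for row in rows]
    return reduced, keep

# Made-up records with three features; the middle one never changes.
rows = [
    (5.1, 1, 0.2),
    (4.9, 1, 1.4),
    (6.3, 1, 2.5),
    (5.8, 1, 1.9),
]

reduced, kept = drop_low_variance(rows)
print(kept)     # indexes of the surviving features
print(reduced)  # same records, fewer features
```

The reduced rows can then be handed to another model in place of the originals.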

In other words, in supervised learning, you teach the computer about your data and hope that it understands the relationships and categorization well enough to successfully categorize data it hasn't seen before.

In unsupervised learning, by contrast, you're asking the computer to tell you something interesting about the data.

This month, I take an initial look at the world of unsupervised learning. Can a computer categorize data as well as a human? How can you use Python's scikit-learn to create such models?

Unsupervised Learning

There's a children's card game called Set that is a useful way to think about machine learning. Each card in the game contains a picture. The picture contains one, two or three shapes. There are several different shapes, and each shape has a color and a fill pattern. In the game, players are supposed to identify groups of three cards using any one of those properties. Thus, you could create a group based on the color green, in which all cards are green (but may differ in the number of shapes, the shape and the fill pattern). You could create a group based on the number of shapes, in which every card has two shapes, but those shapes can be of any color, any shape and any fill pattern.

The idea behind the game is that players can create a variety of different groups and should take advantage of this in order to win the game.
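The grouping described above can be sketched in a few lines of Python. This follows the simplified description given here rather than the game's official rules, and the `Card` class, the `group_by` helper and the shape and fill names are all hypothetical. The point is that the same cards fall into different groups depending on which property you choose to look at.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    number: int  # how many shapes appear on the card: 1, 2 or 3
    shape: str
    color: str
    fill: str

def group_by(cards, prop):
    """Bucket cards by the value of one property, ignoring the others."""
    groups = defaultdict(list)
    for card in cards:
        groups[getattr(card, prop)].append(card)
    return dict(groups)

# Hypothetical hand; the shape and fill names are made up.
cards = [
    Card(1, "oval", "green", "solid"),
    Card(2, "diamond", "green", "striped"),
    Card(2, "oval", "red", "empty"),
    Card(3, "squiggle", "green", "empty"),
    Card(2, "squiggle", "purple", "solid"),
]

by_color = group_by(cards, "color")
by_number = group_by(cards, "number")
print(len(by_color["green"]))  # cards that share the color green
print(len(by_number[2]))       # cards that each show two shapes
```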

iguaz.io

James Gray — Fri, 02 Sep 2016 15:30:00 +0000

An IT megatrend in progress involves the shift from legacy monolithic apps running on enterprise storage to systems of engagement that interact with users, collect real-time data from many sources and store it in elastic and shared data services. A self-described "disruptive" enterprise seeking to push this vision forward is iguaz.io, which recently announced a virtualized data services architecture for revolutionizing both private and public clouds.

In contrast to the current siloed approach, iguaz.io consolidates data into a high-volume, real-time data repository that virtualizes and presents it as streams, messages, files, objects or data records. All data types are stored consistently on different memory or storage tiers, and popular application frameworks (Hadoop, ELK, Spark or Docker containers) are accelerated.

In addition to a 10x to 100x improvement in time to insights at lower costs, iguaz.io says that its architecture provides best-in-class data security based on a real-time classification engine, a critical need for data sharing among users and business units.