statistics

Galit Shmueli et al.'s Data Mining for Business Analytics (Wiley)

James Gray — Fri, 03 Nov 2017 16:11:00 +0000

The updated 5th edition of Data Mining for Business Analytics, written by Galit Shmueli and collaborators and published by Wiley, is a standard guide to data mining and analytics that adds two new co-authors and a trove of new material vis-à-vis its predecessor. R is a free, open-source software environment for statistical computing and graphics that is steadily gaining popularity. Subtitled Concepts, Techniques, and Applications in R, the new edition continues to provide an applied approach to data-mining concepts and methods, using R as the canvas on which to illustrate them.

With the book, readers learn how to implement a variety of popular data-mining algorithms in R to tackle business problems and opportunities. Material covered in depth includes both statistical and machine-learning algorithms for prediction, classification, visualization, dimension reduction, recommender systems, clustering, text mining and network analysis.

The new 5th edition includes material drawn from business and government, a dozen case studies demonstrating applications of the data-mining techniques described, and exercises in each chapter that help readers gauge and expand their comprehension of the material. Data Mining for Business Analytics can serve as either a textbook or a reference for analysts, researchers and practitioners working with quantitative methods in myriad fields.

Learning Data Science

Reuven M. Lerner — Tue, 24 Oct 2017 12:19:27 +0000

In my last few articles, I've written about data science and machine learning. In case my enthusiasm wasn't obvious from my writing, let me say it plainly: it has been a long time since I last encountered a technology that was so poised to revolutionize the world in which we live.

Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly every possible topic you can imagine, for free. You can analyze that data, publish it on a blog, and get reactions from governments and companies.

I remember learning in high school that the difference between freedom of speech and freedom of the press is that not everyone has a printing press. Not only has the internet provided everyone with the equivalent of a printing press, but it has given us the power to perform the sort of analysis that until recently was exclusively available to governments and wealthy corporations.

During the past year, I have increasingly heard that data science is the sexiest profession of the 21st century and the one that will be in greatest demand. Needless to say, those two things make for a very appealing combination! It's no surprise that I've seen a major uptick in the number of companies inviting me to teach on this subject.

The upshot is that you (yes, you, dear reader) should spend time in the coming weeks, months and years learning whatever you can about data science. This isn't because you will change jobs and become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at it, because you will be able to use the tools of data science to analyze past performance and make predictions based on it.

Back when I started to develop web applications, it was the norm to have a database team that created the tables and queries. Nowadays, although there certainly are places that have a full-time database staff, the assumption is that every developer has at least a passing familiarity with relational (or even NoSQL) databases and how to work with them. In the same way that developers who understand databases are more powerful than those who don't, people in the computer field who understand data science are more powerful than those who don't.

There is a bit of bad news on this front, though. If you thought that technological change in programming and the web moved at a breakneck pace, you haven't seen anything yet! The world of data science (the tools, the algorithms, the applications) is moving at an overwhelming speed. The good news is that everyone is struggling to keep up, which means if you find yourself overwhelmed, you're probably in very good company. Just be sure to keep moving ahead, aiming to increase your understanding of the theory, algorithms, techniques and software that data scientists use.

Image Processing on Linux

Joey Bernard — Tue, 17 Oct 2017 13:30:00 +0000

I've covered several scientific packages in this space that generate nice graphical representations of your data and work, but I've not gone in the other direction much. So in this article, I cover a popular image processing package called ImageJ. Specifically, I am looking at Fiji, an instance of ImageJ bundled with a set of plugins that are useful for scientific image processing.

The name Fiji is a recursive acronym, much like GNU. It stands for "Fiji Is Just ImageJ". ImageJ is a useful tool for analyzing images in scientific research. For example, you may use it to classify tree types in a landscape from aerial photography. ImageJ can do that type of categorization. It's built with a plugin architecture, and a very extensive collection of plugins is available to extend its functionality.

The first step is to install ImageJ (or Fiji). Most distributions will have a package available for ImageJ. If you wish, you can install it that way and then install the individual plugins you need for your research. The other option is to install Fiji and get the most commonly used plugins at the same time. Unfortunately, most Linux distributions will not have a package available within their package repositories for Fiji. Luckily, however, an easy installation file is available from the main website. It's a simple zip file, containing a directory with all of the files required to run Fiji. When you first start it, you get only a small toolbar with a list of menu items (Figure 1).

Figure 1. You get a very minimal interface when you first start Fiji.

If you don't already have some images to use as you are learning to work with ImageJ, the Fiji installation includes several sample images. Click the File→Open Samples menu item for a dropdown list of sample images (Figure 2). These samples cover many of the potential tasks you might be interested in working on.

Figure 2. Several sample images are available that you can use as you learn how to work with ImageJ.

If you installed Fiji, rather than ImageJ alone, a large set of plugins already will be installed. The first one of note is the autoupdater plugin. This plugin checks the internet for updates to ImageJ, as well as the installed plugins, each time ImageJ is started.

Popcon - Are You In Or Out?

Michael Reed — Mon, 31 Jan 2011 14:00:00 +0000

Those of you who regularly install Debian may have noticed a prompt that asks you if you would like to install Popcon, the Debian Popularity Contest. Popcon gathers statistics about package usage and periodically submits them to Debian. The anonymous statistics gathered by the script are freely available on the Debian website, and the script can be invoked manually to give a clearer idea of package usage on your own system.

I must admit that I had always declined to take part in the survey. Some people will object on privacy grounds, but personally, I trust that Debian aren't going to do anything devious with the info. I had opted out because it sounded like another possible point of failure and I didn't actually know what the project did.

If you didn't select it when installing Debian, you can install Popcon at any time via the package manager, and this doesn't hamper the quality of the data. If you're installing it manually, bear in mind that its installation script prompts for user input, so make sure that you can view the text output of your package management system. The information that it actually gathers is the installation date and most recent access date of every package on your system. By default, Popcon gathers the information and submits it once a week using a cron job.

Once installed, you can invoke it manually by typing (as root):

popularity-contest

You'll receive a long list of all of the packages on your system arranged in order of most recently accessed. Here is a sample of the output when I ran it on my Debian Sid box.

1290877204 1290877209 iptables /usr/sbin/ip6tables-apply OLD
1290877204 1290877339 ed /usr/bin/red OLD
1290877204 1290877401 laptop-detect /usr/sbin/laptop-detect OLD
1290877204 1290877230 libnfsidmap2 /usr/lib/libnfsidmap/static.so OLD
1290877204 1290877414 libruby1.8 /usr/lib/ruby/1.8/net/ftp.rb OLD
1290877204 1290877455 google-gadgets-gst /usr/lib/google-gadgets/modules/gst-audio-framework.so OLD
1290877204 1290877246 tcpd /usr/sbin/tcpd OLD

The first two numbers are the access time and the creation time of the most recently accessed file within the package. The times are presented in Unix time format, that is, the number of seconds elapsed since midnight UTC on January 1, 1970. These are followed by the name of the package and the most recently accessed file in that package. The last piece of information is a tag that indicates whether the package is considered old (not accessed for more than a month). There are also tags to indicate that a package was recently installed or that it contains no runnable programs.
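To make that field layout concrete, here is a minimal Python sketch that splits one line of the output into its parts and converts the two timestamps into readable dates. The names PopconEntry and parse_popcon_line are made up for illustration and are not part of Popcon. The sketch assumes that fields are separated by whitespace, that file paths contain no spaces, and that a tag may or may not be wrapped in angle brackets.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class PopconEntry(NamedTuple):
    """One line of popularity-contest output (field names are illustrative)."""
    accessed: datetime   # first number on the line
    created: datetime    # second number on the line
    package: str
    path: str
    tag: Optional[str]   # e.g. "OLD"; None when the line carries no tag


def parse_popcon_line(line: str) -> PopconEntry:
    """Split a whitespace-separated line into its four or five fields."""
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    first, second, package, path = fields[:4]
    # Assumption: tolerate tags written either as OLD or as <OLD>.
    tag = fields[4].strip("<>") if len(fields) > 4 else None
    return PopconEntry(
        accessed=datetime.fromtimestamp(int(first), tz=timezone.utc),
        created=datetime.fromtimestamp(int(second), tz=timezone.utc),
        package=package,
        path=path,
        tag=tag,
    )


entry = parse_popcon_line(
    "1290877204 1290877209 iptables /usr/sbin/ip6tables-apply OLD"
)
print(entry.package, entry.accessed.isoformat(), entry.tag)
# prints: iptables 2010-11-27T17:00:04+00:00 OLD
```

Converting to UTC keeps the result independent of the time zone of the machine you run it on.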

Obviously, the output for a typical system is going to be vast. For this reason, if you're invoking it from the command line, either redirecting the output to a file or piping it to grep is the best approach. For example, you can save it to a file with

popularity-contest >popcon.txt
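Once the output is saved, a few lines of Python can stand in for grep and also summarize the list. The following is only a sketch: the helper names count_tags and filter_lines are hypothetical, and it assumes that every data line starts with a numeric timestamp, so any other line is skipped.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable, List


def count_tags(lines: Iterable[str]) -> Counter:
    """Count how many package lines carry each tag (for example OLD)."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        # Assumption: data lines begin with a numeric timestamp; skip the rest.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        tag = fields[4].strip("<>") if len(fields) > 4 else "(none)"
        counts[tag] += 1
    return counts


def filter_lines(lines: Iterable[str], needle: str) -> List[str]:
    """Return the lines containing needle, much like a plain grep."""
    return [line for line in lines if needle in line]


saved = Path("popcon.txt")  # the file created by the redirect
if saved.exists():
    saved_lines = saved.read_text().splitlines()
    print(count_tags(saved_lines))
    for match in filter_lines(saved_lines, "iptables"):
        print(match)
```

Reading the file once and reusing the list of lines avoids running the script a second time for each query.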