Examining Data Using Pandas

""

You don't need to be a data scientist to use Pandas for some basic analysis.

Traditionally, people who program in Python use the data types that come with the language, such as integers, strings, lists, tuples and dictionaries. Sure, you can create objects in Python, but those objects typically are built out of those fundamental data structures.

If you're a data scientist working with Pandas though, most of your time is spent with NumPy. NumPy might feel like a Python data structure, but it acts differently in many ways. That's not just because all of its operations work via vectors, but also because the underlying data is actually a C-style array. This makes NumPy extremely fast and efficient, consuming far less memory for a given array of numbers than traditional Python objects would do.

The thing is, NumPy is designed to be fast, but it's also a bit low level for some people. To get more functionality and a more flexible interface, many people use Pandas, a Python package that provides two basic wrappers around NumPy arrays: one-dimensional Series objects and two-dimensional Data Frame objects.

I often describe Pandas as "Excel within Python", in that you can perform all sorts of calculations as well as sort data, search through it and plot it.

For all of these reasons, it's no surprise that Pandas is a darling of the data science community. But here's the thing: you don't need to be a data scientist to enjoy Pandas. It has a lot of excellent functionality that's good for Python developers who otherwise would spend their time wrestling with lists, tuples and dictionaries.

So in this article, I describe some basic analysis that everyone can do with Pandas, regardless of whether you're a data scientist. If you ever work with CSV files (and you probably do), I definitely recommend thinking about using Pandas to open, read, analyze and even write to them. And although I don't cover it in this article, Pandas handles JSON and Excel very well too.

Creating Data Frames

Although it's possible to create a data frame from scratch using Python data structures or NumPy arrays, it's more common in my experience to do so from a file. Fortunately, Pandas can load data from a variety of file formats.

Before you can do anything with Pandas, you have to load it. In a Jupyter notebook, do:


%pylab inline
import pandas as pd

For example, Python comes with a csv module that knows how to handle files in CSV (comma-separated value) format. But, then you need to iterate over the file and do something with each of those lines/rows. I often find it easier to use Pandas to work with such files. For example, here's a CSV file:


a,b,c,d
e,f,g,h
"i,j",k,l,m
n,o.p,q

You can turn this into a data frame with:


df = pd.read_csv('mycsv.csv')

Now you can view the contents of the data frame in Jupyter with:


df

The thing is, there are all sorts of CSV files you probably don't even think of as CSV files. For example, consider the Linux /etc/passwd file. It's not really a CSV file, but if you think about it for a moment, you'll realize that the format is the same. Each record is on a single line, and the fields are separated with ":" characters. So in this case, you'll need to use the sep parameter to indicate the separator:


filename = '/etc/passwd'
df = pd.read_csv(filename, sep=':')
df.head()

The command df.head(), much like the UNIX head utility, shows the first few rows of the data frame. I often use this method to inspect the data and make sure it came through okay.

In this particular case, the data came through just fine, but the first line (the root user's record!) was interpreted as header rows. Fortunately, you can fix that with another parameter:


df = pd.read_csv(filename, sep=':', header=None)
df.head()

If you pass the header parameter, you're telling Pandas which row of the file should be considered the headers. But if you pass None as a value, you're telling Pandas that none of the rows is the header. That's fine, but then you're going to get integers (starting with 0) as your column names. I'd rather use real names, so to specify them, do:


df = pd.read_csv(filename, sep=':', header=None,
                names=['username', 'password', 'uid',
                       'gid', 'name', 'homedir', 'shell'])
df.head()

The thing is, do you really need the password column? Given that on modern computers, it'll always contain "x", I think that you can leave it out. So, you can say:


df = pd.read_csv(filename, sep=':', header=None,
                 usecols=[0,2,3,4,5,6],
                names=['username', 'uid', 'gid',
                       'name', 'homedir', 'shell'])
df.head()

The usecols parameter indicates which columns you want to read from the file.

Cars Towed in Chicago

Now, why would you want to use Pandas on your /etc/passwd file? You probably don't, but as you can imagine, if it works on that file, it'll work on other files too.

For example, many cities now are making data available to the general public. Chicago publishes much of its data online, so let's take a look and see, for example, what you can find out about what cars were towed in the last 90 days in Chicago.

You can go to the Chicago data portal at https://data.cityofchicago.org/browse. I clicked "towed vehicles"→"export"→"CSV" to get a CSV file. Then I took the downloaded file and imported it into Pandas with:


df = pd.read_csv('towed.csv')

Nah, I'm just kidding. You think that I have the time and energy to download the file and then load it? One of my favorite features of Pandas is the fact that most methods that work with files also can work with URLs. So, I can say:


url = 'https://data.cityofchicago.org/api/views/ygr5-vcbg/
↪rows.csv?accessType=DOWNLOAD'
df = pd.read_csv(url)

(Note that the URL might well have changed by the time you read this, or it might be keyed to my session information. If not, all the better, from my perspective!)

I then take a look at the data frame and discover that it has headers, that the separator is a comma and that it worked just fine. The only adjustment I'm going to make is to parse the "tow date" column as a date, so I can play with that a bit. I also indicate that because the date is in US format, the day comes first:


df = pd.read_csv(url, parse_dates=['Tow Date'], dayfirst=False)
df.head()

You'll notice that the first column now appears a bit differently, in year-month-day format. That shows there is a datestamp. You also can see that if you run the df.info method, which tells you about the data frame itself:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5560 entries, 0 to 5559
Data columns (total 10 columns):
Tow Date              5560 non-null datetime64[ns]
Make                  5537 non-null object
Style                 5538 non-null object
Model                 509 non-null object
Color                 5536 non-null object
Plate                 4811 non-null object
State                 5392 non-null object
Towed to Address      5560 non-null object
Tow Facility Phone    5559 non-null object
Inventory Number      5560 non-null int64
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 434.5+ KB

As you can see, the "Tow Date" column is now being stored as a datetime64, a 64-bit field containing a date and time.

So, now that you have this data, what can you do with it? One of my favorite methods is value_counts, which tells how many times a particular value appears in a column. So, you can say the following to find out how many vehicles were towed from each state:


df['State'].value_counts()

Now, that's going to be a long list, so it's probably better to limit it by using head on the series you get back from value_counts:


df['State'].value_counts().head(10)

Now you'll see which ten states have the most towed vehicles in Chicago in the last 90 days:


IL    4948
IN     148
TX      48
WI      48
MN      28
IA      27
MI      19
GA      14
FL      14
TN      13

Not surprisingly, most of the vehicles towed in Chicago were from Illinois, with the second-highest number from Indiana, which is nearby. Less expected (to me, at least) was the large number of towed vehicles from Texas, which is not exactly nearby.

To make use of the fact that the timestamp is now a true datetime object, you can find out the most common dates on which vehicles were towed:


2018-03-03    215
2018-03-04    195
2018-02-24    165
2018-03-25    148
2018-03-26    140
2018-03-24    135
2018-03-15    126
2018-03-21    122
2018-02-03    120
2018-03-22    117

As you can see, there's a great deal of variation. Maybe that variation is due to the day of the week? In order to find out, you can use the dt proxy object Pandas provides for datetime columns. You can use that to extract information from the datetime column and then analyze it. For example, you can say:


df['Tow Date'].dt.dayofweek

This retrieves the day of the week for each of the tow dates. Then you can use value_counts to summarize how many vehicles were towed on each day of the week:


df['Tow Date'].dt.dayofweek.value_counts()

The results are as follows:


5    1143
4     831
3     820
6     798
2     794
0     640
1     534

The above indicates that Chicago towed far more cars on Fridays than on other days. This might always be true, or it might just be true for this 90-day sample.

You also can check to see what brand of cars are towed most often or whether there's a correlation between the color and the chances of being towed.

What if you want to know, week by week, how many cars were towed? For that, you can take advantage of a great Pandas feature, in which you can set the data frame's index to be a timestamp column:


df.set_index('Tow Date', inplace=True)

Now, instead of accessing rows by an integer index, you'll use a timestamp. But the point is not to access things that way. Rather, it's a resample, similar to a GROUP BY query in SQL. Resampling can be done in a variety of ways; here's asking for it to be on a one-week basis:


df.resample('1W')

By itself, this does nothing. But if you then count the number of rows for each week, you get much more interesting output. For every column, you get the number of entries per week. The numbers differ, because many columns have missing (NaN) information, which isn't included in the count. That's fine; you can just use the Make column, which seems to be filled in for every towed vehicle:


df.resample('1W').count()['Make']

Now you should get a very nice report, showing how many cars were towed each week:


Tow Date
2017-12-31    103
2018-01-07    321
2018-01-14    209
2018-01-21    241
2018-01-28    250
2018-02-04    399
2018-02-11    248
2018-02-18    328
2018-02-25    587
2018-03-04    862
2018-03-11    495
2018-03-18    601
2018-03-25    754
2018-04-01    139
Freq: W-SUN, Name: Make, dtype: int64

Humans see patterns more easily with graphics than with tables of numbers, so you might want to plot this, also using Pandas:


df.resample('1W').count()['Make'].plot()

That makes a line plot showing that the number of towed vehicles shot up at the start of March. Why? You could look at temperatures and a calendar to start to guess, but the fact that you can download, inspect and graph this information so easily is a good starting point.

Conclusion

Pandas is more than merely a tool for data scientists to do numeric analysis. The fact that it can read (and write) formats such as CSV, Excel and JSON make it my primary tool for working with those formats. Add to that the fact that it can summarize data, including based on timestamps, and you probably can see why Pandas is so popular.

Resources

The home page for Pandas is here, and the Jupyter notebook, which I mentioned in this article, is here.

Finally, the list of "datetime components", which you can access via the dt object on timestamp columns in Pandas, is here.

Reuven Lerner teaches Python, data science and Git to companies around the world. You can subscribe to his free, weekly "better developers" e-mail list, and learn from his books and courses at https://lerner.co.il. Reuven lives with his wife and children in Modi'in, Israel.

Load Disqus comments