Cleaning Data in Python

30 minute read

Loading and viewing your data

We’re going to look at a subset of the Department of Buildings Job Application Filings dataset from the NYC Open Data portal. This dataset consists of job applications filed on January 22, 2017. Our first task is to load this dataset into a DataFrame and then inspect it using the .head() and .tail() methods. However, we’ll find out very quickly that the printed results don’t allow us to see everything we need, since there are too many columns. Therefore, we need to look at the data in another way. The .shape and .columns attributes let us see the shape of the DataFrame and obtain a list of its columns. From here, we can see which columns are relevant to the questions we’d like to ask of the data. To this end, a new DataFrame, df_subset, consisting only of these relevant columns, has been pre-loaded. This is the DataFrame we’ll work with in the rest of the chapter. Get acquainted with the dataset now by exploring it with pandas! This initial exploratory analysis is a crucial first step of data cleaning.

# Import pandas
import pandas as pd
# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')
# Print the head of df
print(df.head())
# Print the tail of df
print(df.tail())
# Print the shape of df
print(df.shape)
# Print the columns of df
print(df.columns)
# Print the head and tail of df_subset
print(df_subset.head())
print(df_subset.tail())

We identified some potentially unclean or missing data. Now, we’ll continue to diagnose our data with the very useful .info() method. The .info() method provides important information about a DataFrame, such as the number of rows, number of columns, number of non-missing values in each column, and the data type stored in each column. This is the kind of information that will allow us to confirm whether the ‘Initial Cost’ and ‘Total Est. Fee’ columns are numeric or strings. From the results, we’ll also be able to see whether or not all columns have complete data in them. Our task is to use the .info() method on both DataFrames and analyze the results.

# Print the info of df
print(df.info())
# Print the info of df_subset
print(df_subset.info())

Frequency counts for categorical data

.describe() only summarizes numeric columns by default. So how can we diagnose data issues when we have categorical data? One way is by using the .value_counts() method, which returns the frequency counts for each unique value in a column! This method also has an optional parameter called dropna, which is True by default, meaning that missing values are not counted. We want to set dropna to False so that, if there are missing values in a column, they are included in the frequency counts. Here we’re going to look at the ‘Borough’, ‘State’, and ‘Site Fill’ columns to make sure all the values in there are valid. When looking at the output, do a sanity check: Are all values in the ‘State’ column equal to NY, for example? Since the dataset consists of applications filed in NY, we would expect this to be the case.

# Print the value counts for 'Borough'
print(df['Borough'].value_counts(dropna=False))
# Print the value_counts for 'State'
print(df['State'].value_counts(dropna=False))
# Print the value counts for 'Site Fill'
print(df['Site Fill'].value_counts(dropna=False))

Visualizing single variables with histograms

Up until now, we’ve been looking at descriptive statistics of our data. One of the best ways to confirm what the numbers are telling us is to plot and visualize the data. We’ll start by visualizing single variables using a histogram for numeric values. The column we will work on in this exercise is ‘Existing Zoning Sqft’. The .plot() method allows us to create a plot of each column of a DataFrame. The kind parameter allows you to specify the type of plot to use - kind=’hist’, for example, plots a histogram.

Begin by computing summary statistics for the ‘Existing Zoning Sqft’ column using the .describe() method. You’ll notice that there are extremely large differences between the min and max values, and the plot will need to be adjusted accordingly. In such cases, it’s good to look at the plot on a log scale. The keyword arguments logx=True or logy=True can be passed in to .plot() depending on which axis you want to rescale.
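
A minimal sketch of that first step, assuming df is loaded as above:

# Summary statistics for the column of interest
print(df['Existing Zoning Sqft'].describe())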

Finally, note that matplotlib will scale the axes so that all of the data fits in the plot. That is, if we end up with large amounts of whitespace in the plot, it indicates counts or values that are too small to render at that scale.

# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Plot the histogram
df['Existing Zoning Sqft'].plot(kind='hist', rot=70, logx=True, logy=True)
# Display the histogram
plt.show()

Visualizing multiple variables with boxplots

Histograms are great ways of visualizing single variables. To visualize multiple variables, boxplots are useful, especially when one of the variables is categorical. Our job is to use a boxplot to compare the ‘initial_cost’ across the different values of the ‘Borough’ column. The pandas .boxplot() method is a quick way to do this, in which we have to specify the column and by parameters. Here, we want to visualize how ‘initial_cost’ varies by ‘Borough’. pandas and matplotlib.pyplot have been imported as pd and plt, respectively, and the DataFrame has been loaded as df.

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create the boxplot
df.boxplot(column='initial_cost', by='Borough', rot=90)
# Display the plot
plt.show()

Visualizing multiple variables with scatter plots

Boxplots are great when we have a numeric column that we want to compare across different categories. When we want to visualize two numeric columns, scatter plots are ideal. Here, our job is to make a scatter plot with ‘initial_cost’ on the x-axis and ‘total_est_fee’ on the y-axis. We can do this by using the DataFrame .plot() method with kind=’scatter’. You’ll notice right away that there are two major outliers shown in the first plot. Since these outliers dominate the plot, an additional DataFrame, df_subset, has been provided, in which some of the extreme values have been removed. After making a scatter plot using this, we’ll find some interesting patterns that would not have been seen by looking at summary statistics or single-variable plots.

# Import necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# Create and display the first scatter plot
df.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()
# Create and display the second scatter plot
df_subset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

Reshaping your data using melt

Melting data is the process of turning columns of our data into rows of data. Consider the airquality DataFrame: in its tidy form, the variables Ozone, Solar.R, Wind, and Temp each have their own column. If, however, we wanted these variables to be in rows instead, we could melt the DataFrame. In doing so, however, we would make the data untidy! This is important to keep in mind: Depending on how our data is represented, we will have to reshape it differently. Here we will practice melting a DataFrame using pd.melt(). There are two parameters we should be aware of: id_vars and value_vars. The id_vars represent the columns of the data we do not want to melt (i.e., keep in their current shape), while the value_vars represent the columns we do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted. Our job is to melt the airquality DataFrame’s Ozone, Solar.R, Wind, and Temp columns into rows.

# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
# Print the head of airquality_melt
print(airquality_melt.head())
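
If we only wanted to melt a subset of the columns, say Ozone and Temp, we could pass value_vars explicitly (a small sketch; the variable name below is just for illustration):

# Melt only Ozone and Temp, keeping Month and Day as identifier columns
airquality_partial_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Temp'])
print(airquality_partial_melt.head())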

When melting DataFrames, it would be better to have column names that are more meaningful than variable and value. The default names may work in certain situations, but it’s best to always have data that is self-explanatory. We can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter.

# Print the head of airquality
print(airquality.head())
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month','Day'], var_name='measurement', value_name='reading')
# Print the head of airquality_melt
print(airquality_melt.head())

Pivot data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before we melted it? We’ll now begin pivoting it back into that form using the .pivot_table() method! While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column. .pivot_table() has an index parameter which we can use to specify the columns that we don’t want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that we have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted).

# Print the head of airquality_melt
print(airquality_melt.head())
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month','Day'], columns='measurement', values='reading')
# Print the head of airquality_pivot
print(airquality_pivot.head())

After pivoting airquality_melt, we didn’t quite get back the original DataFrame. What we got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex). Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas. In essence, they allow us to group columns or rows by another variable - in this case, by ‘Month’ as well as ‘Day’. There’s a very simple method we can use to get back the original DataFrame from the pivoted DataFrame: .reset_index().

# Print the index of airquality_pivot
print(airquality_pivot.index)
# Reset the index of airquality_pivot: airquality_pivot
airquality_pivot_reset = airquality_pivot.reset_index()
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())

So far, we’ve used the .pivot_table() method when there are multiple index values we want to hold constant during a pivot. Let’s say our data collection method accidentally duplicated our dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame has been pre-loaded. Explore their shapes by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup. We’ll see that by using .pivot_table() and the aggfunc parameter, we can not only reshape our data, but also remove duplicates. Finally, we can then flatten the columns of the pivoted DataFrame using .reset_index(). NumPy and pandas have been imported as np and pd respectively.
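
For example, a quick way to confirm the duplication is to compare the two shapes first (a minimal sketch, assuming both DataFrames are loaded as described):

# Compare the shapes: airquality_dup should have twice as many rows
print(airquality_melt.shape)
print(airquality_dup.shape)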

# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month','Day'], columns='measurement', values='reading', aggfunc=np.mean)
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()
# Print the head of airquality_pivot
print(airquality_pivot.head())
# Print the head of airquality
print(airquality.head())

Splitting a column with .str

The dataset, which consists of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame called tb. We’re going to tidy the ‘m014’ column, which represents males aged 0-14 years. In order to parse this value, we need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since we can parse values by position, we can take advantage of pandas’ vectorized string slicing by using the str attribute of columns of type object. Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.
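
A one-line sketch of that first inspection step:

# Inspect the column names of tb and look for the combined gender/age columns
print(tb.columns)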

# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country','year'])
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head())

Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. We’ll learn how to deal with such cases, using a dataset consisting of Ebola cases and death counts by state and country. Print the columns of ebola using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore _ serves as a delimiter between the first part (cases or deaths), and the second part (country). This time, we cannot directly slice the variable by position. We now need to use Python’s built-in string method called .split(). By default, this method splits a string on whitespace. However, in this case we want it to split on an underscore. We can do this on Cases_Guinea, for example, using Cases_Guinea.split('_'), which returns the list ['Cases', 'Guinea']. The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. We can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part we want.
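
As a small illustration of the plain-string behaviour described above:

# Splitting a single string on the underscore delimiter
parts = 'Cases_Guinea'.split('_')
print(parts)  # ['Cases', 'Guinea']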

# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date','Day'], var_name='type_country', value_name='counts')
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt['type_country'].str.split('_')
# Create the 'type' column
ebola_melt['type'] = ebola_melt['str_split'].str.get(0)
# Create the 'country' column
ebola_melt['country'] = ebola_melt['str_split'].str.get(1)
# Print the head of ebola_melt
print(ebola_melt.head())

Combining rows of data

The dataset we’ll be working with here relates to NYC Uber data. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, we’ll be working with a very small portion of the actual data. Three DataFrames have been loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Our job is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months. Begin by exploring the structure of these three DataFrames using methods such as .head().

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1,uber2,uber3])
# Print the shape of row_concat
print(row_concat.shape)
# Print the head of row_concat
print(row_concat.head())

Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, we use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation. We’ll return to the Ebola dataset we worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient are contained in a single column. This column has been parsed into a new DataFrame, status_country, where there are separate columns for status and country. Explore the ebola_melt and status_country DataFrames. Our job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt,status_country],axis=1)
# Print the shape of ebola_tidy
print(ebola_tidy.shape)
# Print the head of ebola_tidy
print(ebola_tidy.head())

Finding files that match a pattern

We’re now going to practice using the glob module to find all csv files in the workspace. The glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern. For example, if the filenames follow the pattern part_ followed by a single-digit number and .csv, we can write the pattern as ‘part_?.csv’ (which would match part_1.csv, part_2.csv, part_3.csv, etc.). The ? wildcard represents any single character, and the * wildcard represents any number of characters.

# Import necessary modules
import pandas as pd
import glob
# Write the pattern: pattern
pattern = '*.csv'
# Save all file matches: csv_files
csv_files = glob.glob(pattern)
# Print the file names
print(csv_files)
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
# Print the head of csv2
print(csv2.head())
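
As a small sketch of the ? wildcard, assuming files named part_1.csv, part_2.csv, and so on exist in the working directory:

# Match csv files named 'part_' followed by exactly one character
part_files = glob.glob('part_?.csv')
print(part_files)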

Iterating and concatenating all matches

Now that we have a list of filenames to load, we can load all the files into a list of DataFrames that can then be concatenated. We’ll start with an empty list called frames. Our job is to use a for loop to iterate through each of the filenames, read each filename into a DataFrame, and then append it to the frames list. We can then concatenate this list of DataFrames using pd.concat(). Go for it!

# Create an empty list: frames
frames = []
# Iterate over csv_files
for csv in csv_files:
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    # Append df to frames
    frames.append(df)
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
# Print the shape of uber
print(uber.shape)
# Print the head of uber
print(uber.head())

1-to-1 data merge

Merging data allows us to combine disparate datasets into a single dataset to do more complex analysis. Here, we’ll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from a SQLite database from the Software Carpentry SQL lesson. Two DataFrames are available: site and visited. Explore them and take note of their structure and column names. Our task is to perform a 1-to-1 merge of these two DataFrames using the ‘name’ column of site and the ‘site’ column of visited.

# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
# Print o2o
print(o2o)

Many-to-1 data merge

In a many-to-one (or one-to-many) merge, one of the values will be duplicated and recycled in the output. That is, one of the keys in the merge is not unique. Here, the two DataFrames site and visited have been pre-loaded once again. Note that this time, visited has multiple entries for the site column. Confirm this by exploring it in the IPython Shell. The .merge() method call is the same as the 1-to-1 merge from the previous exercise, but the data and output will be different.
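
One quick way to confirm the duplicated keys (a sketch, assuming visited is loaded as described):

# Count how many rows in visited refer to each site
print(visited['site'].value_counts())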

# Merge the DataFrames: m2o
m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
# Print m2o
print(m2o)

Many-to-many data merge

The final merging scenario occurs when both DataFrames do not have unique keys for a merge. What happens here is that for each duplicated key, every pairwise combination will be created. Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help us develop our intuition for many-to-many merges. Here, we’ll work with the site and visited DataFrames from before, and a new survey DataFrame. Our task is to merge site and visited as we did in the earlier exercises. We will then merge this merged DataFrame with survey.

# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')
# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='ident', right_on='taken')
# Print the first 20 lines of m2m
print(m2m.head(20))

Converting data types

We’ll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage. The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc. Look at the output of tips.info() in the IPython Shell. We’ll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas’ way of storing arbitrary strings. Our job is to convert these two columns to type category and note the reduced memory usage.

# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')
# Print the info of tips
print(tips.info())

Working with numeric data

If we expect the data type of a column to be numeric (int or float), but instead it is of type object, this typically means that there is a non-numeric value in the column, which also signifies bad data. We can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, we can be sure that there is a bad value within the column. We can either use the techniques we learned previously to do some exploratory data analysis and find the bad value, or we can choose to ignore or coerce the value into a missing value, NaN. A modified version of the tips dataset has been pre-loaded into a DataFrame called tips. For instructional purposes, it has been pre-processed to introduce some ‘bad’ data for us to clean. Use the .info() method to explore this. We’ll note that the total_bill and tip columns, which should be numeric, are instead of type object. Our job is to fix this.
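
A minimal sketch of the difference, using a small hypothetical Series with one bad value:

# Hypothetical example: 'missing' cannot be parsed as a number
s = pd.Series(['10.5', '3.25', 'missing'])
# errors='coerce' converts the bad value to NaN instead of raising an error
print(pd.to_numeric(s, errors='coerce'))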

# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')
# Print the info of tips
print(tips.info())

String parsing with regular expressions

Regular expressions are powerful ways of defining patterns to match strings. This exercise will get us started with writing them. When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity. Our job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx. The regular expression module in Python is re. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it’s better to compile the pattern first using re.compile(), and then use the compiled pattern to match values.

# Import the regular expression module
import re
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))
# See if the pattern matches
result = prog.match('1123-456-7890')
print(bool(result))

Extracting numerical values from strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files. Say we have the following string: ‘the recipe calls for 6 strawberries and 2 bananas’. It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios. When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), we can use the re.findall() function. We pass in a pattern and a string to re.findall(), and it returns a list of the matches.

# Import the regular expression module
import re
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
# Print the matches
print(matches)

We will continue practicing our regular expression skills. For each provided string, our job is to write the appropriate pattern to match it.

# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)
# Write the third pattern
pattern3 = bool(re.match(pattern=r'\w*[A-Z]', string='Australia'))
print(pattern3)

Custom functions to clean data

We’ll now practice writing functions to clean data. The tips dataset has been pre-loaded into a DataFrame called tips. It has a ‘sex’ column that contains the values ‘Male’ or ‘Female’. Our job is to write a function that will recode ‘Male’ to 1, ‘Female’ to 0, and return np.nan for all entries of ‘sex’ that are neither ‘Male’ nor ‘Female’. Recoding variables like this is a common data cleaning task. Functions provide a mechanism for us to abstract away complex bits of code as well as reuse code. This makes our code more readable and less error prone. We can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, we will apply our function over the ‘sex’ column.

# Import numpy so that np.nan is available
import numpy as np
# Define recode_sex()
def recode_sex(sex_value):
    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    # Return 0 if sex_value is 'Female'    
    elif sex_value == 'Female':
        return 0
    # Return np.nan    
    else:
        return np.nan
# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)
# Print the first five rows of tips
print(tips.head())

Lambda functions

We’ll now be introduced to a powerful Python feature that will help us clean our data more effectively: lambda functions. Instead of using the def syntax that we used previously, lambda functions let us make simple, one-line functions. For example, here’s a function that squares a variable used in an .apply() method:

def my_square(x):
    return x ** 2

df.apply(my_square)

The equivalent code using a lambda function is:

df.apply(lambda x: x ** 2)

The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make our code concise and Pythonic. Our job is to clean the tips DataFrame’s ‘total_dollar’ column by removing the dollar sign. We’ll do this using two different methods: With the .replace() method, and with regular expressions. The regular expression module re has been pre-imported.

# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
# Print the head of tips
print(tips.head())

Dropping duplicate data

Duplicate data causes a variety of problems. From the point of view of performance, duplicates use up unnecessary amounts of memory and cause unneeded calculations to be performed when processing data. In addition, they can also bias any analysis results. A dataset consisting of the performance of songs on the Billboard charts has been pre-loaded into a DataFrame called billboard. Check out its columns in the IPython Shell. Our job in this exercise is to subset this DataFrame and then drop all duplicate rows.

# Create the new DataFrame: tracks
tracks = billboard[['year','artist','track','time']]
# Print info of tracks
print(tracks.info())
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()
# Print info of tracks
print(tracks_no_duplicates.info())

Filling missing data

Here, we’ll return to the airquality dataset. It has been pre-loaded into the DataFrame airquality, and it has missing values for us to practice filling in. Explore airquality in the IPython Shell to check out which columns have missing values. It’s rare to have a (real-world) dataset without any missing values, and it’s important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values. Also, understanding how much missing data we have, and thinking about where it comes from, is crucial to making unbiased interpretations of data.

# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality['Ozone'].mean()
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality['Ozone'].fillna(oz_mean)
# Print the info of airquality
print(airquality.info())

Testing your data with asserts

Here, we’ll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. The dataset has been pre-loaded into a DataFrame called ebola. We use the .all() method together with the .notnull() DataFrame method to check for missing values in a column. The .all() method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if we are using it on a DataFrame, like in this exercise, we need to chain another .all() method so that we return only one True or False value. When using these within an assert statement, nothing will be returned if the assert statement is true: This is how we can confirm that the data we are checking are valid. Note: You can use pd.notnull(df) as an alternative to df.notnull().

# Assert that there are no missing values
assert pd.notnull(ebola).all().all()
# Assert that all values are >= 0
assert (ebola >=0).all().all()

Exploratory analysis

Whenever we obtain a new dataset, our first task should always be to do some exploratory analysis to get a better understanding of the data and diagnose it for any potential issues. The Gapminder data for the 19th century has been loaded into a DataFrame called g1800s. Use pandas methods such as .head(), .info(), and .describe(), and DataFrame attributes like .columns and .shape to explore it.
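
A minimal sketch of this first pass, assuming g1800s is loaded:

# Quick structural overview of g1800s
print(g1800s.shape)
print(g1800s.columns)
print(g1800s.head())
print(g1800s.describe())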

Since 1800, life expectancy around the globe has been steadily going up. We would expect the Gapminder data to confirm this. Our job in this exercise is to create a scatter plot with life expectancy in ‘1800’ on the x-axis and life expectancy in ‘1899’ on the y-axis. Here, the goal is to visually check the data for insights as well as errors. When looking at the plot, pay attention to whether the scatter plot takes the form of a diagonal line, and which points fall below or above the diagonal line. This will inform how life expectancy in 1899 changed (or did not change) compared to 1800 for different countries. If points fall on a diagonal line, it means that life expectancy remained the same!

# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Create the scatter plot
g1800s.plot(kind='scatter', x='1800', y='1899')
# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')
# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)
# Display the plot
plt.show()

Thinking about the question at hand

Since we are given life expectancy level data by country and year, we could ask questions about how much the average life expectancy changes over each year. Before continuing, however, it’s important to make sure that the following assumptions about the data are true:

  • ‘Life expectancy’ is the first column (index 0) of the DataFrame.
  • The other columns contain either null or numeric values.
  • The numeric values are all greater than or equal to 0.
  • There is only one instance of each country.

We can write a function that we can apply over the entire DataFrame to verify some of these assumptions. Note that spending the time to write such a script will help us when working with other datasets as well.

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()[1:-1]
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0
# Check whether the first column is 'Life expectancy'
assert g1800s.columns[0] == 'Life expectancy'
# Check whether the values in the row are valid
assert g1800s.iloc[:, 1:].apply(check_null_or_valid, axis=1).all().all()
# Check that there is only one instance of each country
assert g1800s['Life expectancy'].value_counts()[0] == 1

Assembling your data

Here, three DataFrames have been pre-loaded: g1800s, g1900s, and g2000s. These contain the Gapminder life expectancy data for, respectively, the 19th century, the 20th century, and the 21st century. Our task in this exercise is to concatenate them into a single DataFrame called gapminder. This is a row-wise concatenation, similar to how we concatenated the monthly Uber datasets.

# Concatenate the DataFrames row-wise
gapminder = pd.concat([g1800s,g1900s,g2000s])
# Print the shape of gapminder
print(gapminder.shape)
# Print the head of gapminder
print(gapminder.head())

Reshaping your data

Now that we have all the data combined into a single DataFrame, the next step is to reshape it into a tidy data format. Currently, the gapminder DataFrame has a separate column for each year. What we want instead is a single column that contains the year, and a single column that represents the average life expectancy for each year and country. By having year in its own column, we can use it as a predictor variable in a later analysis. We can convert the DataFrame into the desired tidy format by melting it.

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder,id_vars='Life expectancy')
# Rename the columns
gapminder_melt.columns = ['country','year','life_expectancy']
# Print the head of gapminder_melt
print(gapminder_melt.head())

Checking the data types

Now that our data is in the proper shape, we need to ensure that the columns are of the proper data type. That is, we need to ensure that country is of type object, year is of type int64, and life_expectancy is of type float64. The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython Shell using the .info() method. Notice that the column ‘year’ is of type object. This is incorrect, so we’ll need to use the pd.to_numeric() function to convert it to a numeric data type. NumPy and pandas have been pre-imported as np and pd.

# Convert the year column to numeric
gapminder.year = pd.to_numeric(gapminder.year)
# Test if country is of type object
assert gapminder.country.dtypes == object
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Looking at country spellings

Having tidied our DataFrame and checked the data types, our next task in the data cleaning process is to look at the ‘country’ column to see if there are any special or invalid characters we may need to deal with. It is reasonable to assume that country names will contain:

  • The set of lower and upper case letters.
  • Whitespace between words.
  • Periods for any abbreviations.

To confirm that this is the case, we can leverage the power of regular expressions again. For common operations like this, pandas has a built-in string method, .str.contains(), which takes a regular expression pattern and applies it to a Series, returning True if there is a match and False otherwise. Since here we want to find the values that do not match, we have to invert the Boolean, which can be done using ~. This Boolean Series can then be used to get the Series of countries that have invalid names.

# Create the series of countries: countries
countries = gapminder.country
# Drop all the duplicates from countries
countries = countries.drop_duplicates()
# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'
# Create the Boolean vector: mask
mask = countries.str.contains(pattern)
# Invert the mask: mask_inverse
mask_inverse = ~mask
# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries[mask_inverse]
# Print invalid_countries
print(invalid_countries)

More data cleaning and processing

It’s now time to deal with the missing data. There are several strategies for this: We can drop them, fill them in using the mean of the column or row that the missing value is in (also known as imputation), or, if we are dealing with time series data, use a forward fill or backward fill, in which missing values are replaced with the nearest preceding (or following) known value in the column. See pandas Foundations for more on forward fill and backward fill.
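
A minimal sketch of forward fill and backward fill on a small hypothetical Series, for illustration only:

# Hypothetical Series with gaps
s = pd.Series([1.0, None, None, 4.0])
print(s.ffill())  # propagate the last known value forward
print(s.bfill())  # propagate the next known value backward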

In general, it is not the best idea to drop missing values, because in doing so we may end up throwing away useful information. In this data, the missing values refer to years where no estimate for life expectancy is available for a given country. We could fill in, or guess, what these life expectancies could be by looking at the average life expectancies for other countries in that year, for example. Whichever strategy we go with, it is important to carefully consider all options and understand how they will affect our data.

In this exercise, we’ll practice dropping missing values. Our job is to drop all the rows that have NaN in the life_expectancy column. Before doing so, it would be valuable to use assert statements to confirm that year and country do not have any missing values.

# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
# Drop the missing values
gapminder = gapminder.dropna()
# Print the shape of gapminder
print(gapminder.shape)

Wrapping up

Now that we have a clean and tidy dataset, we can do a bit of visualization and aggregation. In this exercise, we’ll begin by creating a histogram of the life_expectancy column. We should not get any values under 0, and we should see something reasonable on the higher end of the life_expectancy age range. Our next task is to investigate how average life expectancy changed over the years. To do this, we need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. We can achieve this using the .groupby() method. This .groupby() method is covered in greater depth in Manipulating DataFrames with pandas. Finally, we can save our tidy and summarized DataFrames to files using the .to_csv() method.

# Add first subplot
plt.subplot(2, 1, 1)
# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
# Print the head of gapminder_agg
print(gapminder_agg.head())
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
# Add second subplot
plt.subplot(2, 1, 2)
# Create a line plot of life expectancy per year
gapminder_agg.plot()
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
# Display the plots
plt.tight_layout()
plt.show()
# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')
