Manipulating DataFrames with pandas
Positional and labeled indexing
Given a pair of label-based indices, sometimes it’s necessary to find the corresponding positions. We will use the Pennsylvania election results again. The DataFrame is provided for us as election. Find x and y such that election.iloc[x, y] == election.loc['Bedford', 'winner']. That is, what is the row position of 'Bedford', and the column position of 'winner'? Remember that the first position in Python is 0, not 1! To answer this question, first explore the DataFrame using election.head() in the IPython Shell and inspect it.
# Assign the row position of election.loc['Bedford']: x
x = 4
# Assign the column position of election['winner']: y
y = election.columns.get_loc('winner')
# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])
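As an aside, if we prefer not to hard-code the row position, both positions can be looked up programmatically from the Index objects. A minimal sketch, assuming the same pre-loaded election DataFrame:
# Look up the row and column positions instead of counting by eye
x_alt = election.index.get_loc('Bedford')
y_alt = election.columns.get_loc('winner')
# Verify that the positional and label-based lookups agree
print(election.iloc[x_alt, y_alt] == election.loc['Bedford', 'winner'])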
Indexing and column rearrangement
There are circumstances in which it’s useful to modify the order of our DataFrame columns. We do that now by extracting three columns from the Pennsylvania election results DataFrame. Our job is to read the CSV file and set the index to 'county'. We’ll then assign a new DataFrame by selecting the list of columns ['winner', 'total', 'voters']. The CSV file is provided to us in the variable filename.
# Import pandas
import pandas as pd
# Read in filename and set the index: election
election = pd.read_csv(filename, index_col='county')
# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]
# Print the output of results.head()
print(results.head())
Slicing rows
The Pennsylvania US election results data set that we have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this exercise, we’re going to perform slicing on the county names of the election DataFrame.
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']
# Print the p_counties DataFrame
print(p_counties)
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]
# Print the p_counties_rev DataFrame
print(p_counties_rev)
Slicing columns
Similar to row slicing, columns can be sliced by label. In this exercise, our job is to slice column names from the Pennsylvania election results DataFrame using .loc[].
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:, :'Obama']
# Print the output of left_columns.head()
print(left_columns.head())
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:, 'Obama':'winner']
# Print the output of middle_columns.head()
print(middle_columns.head())
# Slice the columns from 'Romney' to the end: right_columns
right_columns = election.loc[:, 'Romney':]
# Print the output of right_columns.head()
print(right_columns.head())
Subselecting DataFrames with lists
We can use lists to select specific row and column labels with the .loc[] accessor. In this exercise, our job is to select the counties ['Philadelphia', 'Centre', 'Fulton'] and the columns ['winner', 'Obama', 'Romney'] from the election DataFrame, which has been pre-loaded with the index set to 'county'.
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']
# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']
# Create the new DataFrame: three_counties
three_counties = election.loc[rows,cols]
# Print the three_counties DataFrame
print(three_counties)
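A note on robustness: in newer pandas versions, passing .loc a list that contains labels missing from the index raises a KeyError. When missing labels should simply produce NaN-filled rows instead, .reindex() is an alternative. A minimal sketch using the same rows and cols lists:
# .reindex() returns NaN-filled entries for labels that are not present, rather than raising
three_counties_safe = election.reindex(index=rows, columns=cols)
print(three_counties_safe)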
Thresholding data
In this exercise, we have provided the Pennsylvania election results and included a column called ‘turnout’ that contains the percentage of voter turnout per county. Our job is to prepare a boolean array and use it to select all of the rows where voter turnout exceeded 70%.
# Create the boolean array: high_turnout
high_turnout = election.turnout > 70
# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]
# Print the high_turnout_df DataFrame
print(high_turnout_df)
Filtering columns using other columns
The election results DataFrame has a column labeled ‘margin’ which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than 1%, the results would be too-close-to-call. Our job is to use boolean selection to filter the rows where the margin was less than 1. We’ll then convert these rows of the ‘winner’ column to np.nan to indicate that these results are too close to declare a winner.
# Import numpy
import numpy as np
# Create the boolean array: too_close
too_close = election.margin < 1
# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close, 'winner'] = np.nan
# Print the output of election.info()
print(election.info())
Filtering using NaNs
In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The .dropna() method is used to perform this action. We’ll now practice using this method on a dataset obtained from Vanderbilt University, which consists of data from passengers on the Titanic. Exploring it in the IPython Shell shows that there are many NaNs. We will focus specifically on the ‘age’ and ‘cabin’ columns in this exercise. Our job is to use .dropna() to remove rows where either of these two columns contains missing data, and then rows where both of these columns contain missing data. We’ll also use the .shape attribute, which returns the number of rows and columns in a tuple from a DataFrame, or the number of rows from a Series, to see the effect of dropping missing values from a DataFrame.
# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]
# Print the shape of df
print(df.shape)
# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)
# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)
# Call .dropna() with thresh=1000 and axis='columns' and print the output of .info() from titanic
print(titanic.dropna(thresh=1000, axis='columns').info())
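To make the how and thresh arguments concrete, here is a minimal sketch on a small, made-up DataFrame (illustrative only, not part of the Titanic data):
# Toy DataFrame: column 'a' has one non-missing value, column 'b' has two
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [1, 2, np.nan]})
# how='any' drops every row containing at least one NaN
print(toy.dropna(how='any').shape)    # (1, 2)
# how='all' drops only rows in which every value is NaN
print(toy.dropna(how='all').shape)    # (2, 2)
# thresh keeps columns (axis='columns') with at least 2 non-missing values
print(toy.dropna(axis='columns', thresh=2).shape)    # (3, 1)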
Using apply() to transform a column
The .apply() method can be used on a pandas DataFrame to apply an arbitrary Python function to every element. In this exercise we’ll take daily weather data for Pittsburgh in 2013, obtained from Weather Underground. Our job is to use the .apply() method to convert the ‘Mean TemperatureF’ and ‘Mean Dew PointF’ columns of the weather DataFrame from degrees Fahrenheit to degrees Celsius.
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9 * (F - 32)
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)
# Rename the columns of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']
# Print the output of df_celsius.head()
print(df_celsius.head())
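Since the conversion is simple arithmetic, the same result can also be obtained without .apply() at all, because pandas broadcasts scalar operations over every element. A quick sketch for comparison:
# Vectorized conversion: subtraction and multiplication act element-wise on the whole DataFrame
df_celsius_vec = 5 / 9 * (weather[['Mean TemperatureF', 'Mean Dew PointF']] - 32)
df_celsius_vec.columns = ['Mean TemperatureC', 'Mean Dew PointC']
print(df_celsius_vec.head())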
Using .map() with a dictionary
The .map() method is used to transform values according to a Python dictionary look-up. In this exercise we’ll practice this method while returning to working with the election DataFrame. Our job is to use a dictionary to map the values ‘Obama’ and ‘Romney’ in the ‘winner’ column to the values ‘blue’ and ‘red’, and assign the output to the new column ‘color’.
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue','Romney':'red'}
# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)
# Print the output of election.head()
print(election.head())
Using vectorized functions
When performance is paramount, we should avoid using .apply() and .map(), because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using vectorized functions instead, we can loop over the data at the same speed as compiled code (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions, or UFuncs, in NumPy). We can even write our own vectorized functions, but for now we will focus on the ones distributed with NumPy and pandas. In this exercise we’re going to import the zscore function from scipy.stats and use it to compute the deviation in voter turnout in Pennsylvania from the mean, in fractions of the standard deviation. In statistics, the z-score is the number of standard deviations by which an observation is above the mean; if it is negative, the observation is below the mean.
# Import zscore from scipy.stats
from scipy.stats import zscore
# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election['turnout'])
# Print the type of turnout_zscore
print(type(turnout_zscore))
# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore']=turnout_zscore
# Print the output of election.head()
print(election.head())
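For intuition, the same z-score can be reproduced with plain vectorized arithmetic. Note that scipy’s zscore uses the population standard deviation, so ddof=0 is needed to match it exactly. A minimal sketch:
# Manual z-score: (value - mean) / population standard deviation
turnout = election['turnout']
manual_zscore = (turnout - turnout.mean()) / turnout.std(ddof=0)
# Should print True if the two computations agree
print(np.allclose(manual_zscore, election['turnout_zscore']))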
Changing index of a DataFrame
As we saw in the previous exercise, indexes are immutable objects. This means that if we want to modify the index of a DataFrame, we need to replace the whole index. We will do this now, using a list comprehension to create the new index. A list comprehension is a succinct way to generate a list in one line. For example, the following list comprehension generates a list that contains the cubes of all numbers from 0 to 9: cubes = [i**3 for i in range(10)]. This is equivalent to the following code:
cubes = []
for i in range(10):
    cubes.append(i**3)
Before getting started, print the sales DataFrame in the IPython Shell and verify that the index is given by month abbreviations containing lowercase characters.
# Create the list of new indexes: new_idx
new_idx = [str.upper(i) for i in sales.index]
# Assign new_idx to sales.index
sales.index = new_idx
# Print the sales DataFrame
print(sales)
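As an alternative sketch, the same uppercasing can be done without a list comprehension by passing a function to .rename(), which applies it to every index label and returns a new DataFrame rather than modifying sales in place:
# Apply str.upper to each index label
sales_upper = sales.rename(index=str.upper)
print(sales_upper)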
Changing index name labels
Notice that in the previous exercise, the index was not labeled with a name. In this exercise, we will set its name to ‘MONTHS’. Similarly, if all the columns are related in some way, we can provide a label for the set of columns.
# Assign the string 'MONTHS' to sales.index.name
sales.index.name = 'MONTHS'
# Print the sales DataFrame
print(sales)
# Assign the string 'PRODUCTS' to sales.columns.name
sales.columns.name= 'PRODUCTS'
# Print the sales dataframe again
print(sales)
Building an index, then a DataFrame
We can also build the DataFrame and index independently, and then put them together. If we take this route, we must be careful, as any mistake in generating the DataFrame or the index can cause the data and the index to be aligned incorrectly. In this exercise, the sales DataFrame has been provided for us without the month index. Our job is to build this index separately and then assign it to the sales DataFrame. Before getting started, print the sales DataFrame in the IPython Shell and note that it’s missing the month information.
# Generate the list of months: months
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
# Assign months to sales.index
sales.index=months
# Print the modified sales DataFrame
print(sales)
Extracting data with a MultiIndex
The sales DataFrame we have been working with has been extended to now include state information as well. In the IPython Shell, print the new sales DataFrame to inspect the data. Take note of the MultiIndex! Extracting elements from the outermost level of a MultiIndex is just like extracting from a single-level Index: we can use the .loc[] accessor.
# Print sales.loc[['CA', 'TX']]
print(sales.loc[['CA', 'TX']])
# Print sales['CA':'TX']
print(sales['CA':'TX'])
Setting & sorting a MultiIndex
In the previous exercise, the MultiIndex was created and sorted for us. Now, we’re going to do this ourselves! With a MultiIndex, we should always ensure the index is sorted. We can skip this only if we know the data is already sorted on the index fields.
# Set the index to be the columns ['state', 'month']: sales
sales = sales.set_index(['state', 'month'])
# Sort the MultiIndex: sales
sales = sales.sort_index()
# Print the sales DataFrame
print(sales)
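Because label slicing on a MultiIndex relies on the index being sorted, it can be worth checking sortedness explicitly before slicing. A minimal check on the sales DataFrame from above:
# True once sort_index() has been applied; slicing an unsorted MultiIndex
# can raise an UnsortedIndexError or fall back to slower lookups
print(sales.index.is_monotonic_increasing)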
Using .loc[] with nonunique indexes
It is always preferable to have a meaningful index that uniquely identifies each row. Even though pandas does not require unique index values in DataFrames, it works better if the index values are indeed unique. To see an example of this, we will index our sales data by ‘state’ in this exercise.
# Set the index to the column 'state': sales
sales = sales.set_index('state')
# Print the sales DataFrame
print(sales)
# Access the data from 'NY'
print(sales.loc['NY',:])
Indexing multiple levels of a MultiIndex
Looking up indexed data is fast and efficient. We have already seen that lookups based on the outermost level of a MultiIndex work just like lookups on DataFrames that have a single-level Index. Looking up data based on inner levels of a MultiIndex can be a bit trickier. In this exercise, we will use our sales DataFrame to do some increasingly complex lookups. The trickiest of these lookups is when we want to access some inner levels of the index. In this case, we need to use slice(None) in the slicing parameter for the outermost dimension(s) instead of the usual :, or use pd.IndexSlice. We can refer to the pandas documentation for more details. For example, the following code extracts rows from all Symbols for the dates Oct. 3rd through 4th inclusive:
stocks.loc[(slice(None), slice('2016-10-03', '2016-10-04')), :]
Pay particular attention to the tuple (slice(None), slice('2016-10-03', '2016-10-04')).
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY',1)]
# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA','TX'],2),:]
# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None), 2), :]
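The same inner-level lookup can also be written with pd.IndexSlice, which some find more readable than building tuples of slice objects. A sketch, assuming the sorted sales DataFrame from above:
# idx[:, 2] selects every state in month 2
idx = pd.IndexSlice
all_month2_alt = sales.loc[idx[:, 2], :]
# Should print True: both formulations select the same rows
print(all_month2_alt.equals(all_month2))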
Pivoting a single variable
Suppose we started a blog for a band, and we would like to log how many visitors we have had and how many signed up for our newsletter. To help design the tours later, we also track where the visitors are from. Inspect users in the IPython Shell and make a note of which variable we want to use to index the rows ('weekday'), which variable we want to use to index the columns ('city'), and which variable will populate the values in the cells ('visitors'). Try to visualize what the result should be. In this exercise, our job is to pivot users so that the focus is on 'visitors', with the columns indexed by 'city' and the rows indexed by 'weekday'.
# Pivot the users DataFrame: visitors_pivot
visitors_pivot = users.pivot(index='weekday',columns='city',values='visitors')
# Print the pivoted DataFrame
print(visitors_pivot)
Pivoting all variables
If we do not select any particular variables, all of them will be pivoted. In this case - with the users DataFrame - both ‘visitors’ and ‘signups’ will be pivoted, creating hierarchical column labels.
# Pivot users with signups indexed by weekday and city: signups_pivot
signups_pivot = users.pivot(index='weekday',columns='city',values='signups')
# Print signups_pivot
print(signups_pivot)
# Pivot users pivoted by both signups and visitors: pivot
pivot = users.pivot(index='weekday',columns='city')
# Print the pivoted DataFrame
print(pivot)
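A related design note: .pivot() requires each (index, columns) pair to be unique and raises an error on duplicates, whereas .pivot_table() aggregates them (its default aggfunc is 'mean'). With unique (weekday, city) pairs the two give the same layout, as this sketch suggests:
# Equivalent reshape via pivot_table; the mean of a single value is the value itself
pivot_alt = users.pivot_table(index='weekday', columns='city')
print(pivot_alt)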
Stacking & unstacking I
We are now going to practice stacking and unstacking DataFrames. Pay attention to the index, and notice that the index levels are [‘city’, ‘weekday’]. So ‘weekday’ - the second entry - has position 1. This position corresponds to the level parameter in .stack() and .unstack() calls. Alternatively, we can specify ‘weekday’ as the level instead of its position. Our job in this exercise is to unstack users by ‘weekday’. We will then use .stack() on the unstacked DataFrame to see if we get back the original layout of users.
# Unstack users by 'weekday': byweekday
byweekday = users.unstack(level='weekday')
# Print the byweekday DataFrame
print(byweekday)
# Stack byweekday by 'weekday' and print it
print(byweekday.stack(level='weekday'))
Stacking & unstacking II
We are now going to continue working with the users DataFrame. As always, first explore it in the IPython Shell to see the layout and note the index. Our job in this exercise is to unstack and then stack the ‘city’ level, as we did previously for ‘weekday’. Note that you won’t get the same DataFrame.
# Unstack users by 'city': bycity
bycity = users.unstack(level='city')
# Print the bycity DataFrame
print(bycity)
# Stack bycity by 'city' and print it
print(bycity.stack(level='city'))
Restoring the index order
Continuing from the previous exercise, we will now use .swaplevel(0, 1) to flip the index levels. Note they won’t be sorted. To sort them, we will have to follow up with a .sort_index(). We will then obtain the original DataFrame. Note that an unsorted index leads to slicing failures. To begin, print both users and bycity in the IPython Shell. The goal here is to convert bycity back to something that looks like users.
# Stack 'city' back into the index of bycity: newusers
newusers = bycity.stack(level='city')
# Swap the levels of the index of newusers: newusers
newusers = newusers.swaplevel(0,1)
# Print newusers and verify that the index is not sorted
print(newusers)
# Sort the index of newusers: newusers
newusers = newusers.sort_index()
# Print newusers and verify that the index is now sorted
print(newusers)
# Verify that the new DataFrame is equal to the original
print(newusers.equals(users))
Adding names for readability
We are now going to practice melting DataFrames. The pre-loaded DataFrame visitors_by_city_weekday is the users DataFrame from previous exercises pivoted with the rows indexed by ‘weekday’, columns indexed by ‘city’, and values populated with ‘visitors’; explore it in the IPython Shell. The goal of melting is to restore a pivoted DataFrame to its original form, or to change it from a wide shape to a long shape. We can explicitly specify the columns that should remain in the reshaped DataFrame with id_vars, and list which columns to convert into values with value_vars. If we don’t pass a name to the values in pd.melt(), we will lose the name of our variable. We can fix this with the value_name keyword argument. Our job in this exercise is to melt visitors_by_city_weekday to move the city names from the column labels to values in a single column called ‘city’. If we were to use just pd.melt(visitors_by_city_weekday), we would obtain the following result:
      city value
0  weekday   Mon
1  weekday   Sun
2   Austin   326
3   Austin   139
4   Dallas   456
5   Dallas   237
Therefore, we have to specify the id_vars keyword argument to ensure that ‘weekday’ is retained in the reshaped DataFrame, and the value_name keyword argument to rename the value column to ‘visitors’.
# Reset the index: visitors_by_city_weekday
visitors_by_city_weekday = visitors_by_city_weekday.reset_index()
# Print visitors_by_city_weekday
print(visitors_by_city_weekday)
# Melt visitors_by_city_weekday: visitors
visitors = pd.melt(visitors_by_city_weekday, id_vars=['weekday'], value_name='visitors')
# Print visitors
print(visitors)
Going from wide to long
We can move multiple columns into a single column (making the data long and skinny) by “melting” multiple columns. In this exercise, we will practice doing this.
# Melt users: skinny
skinny = pd.melt(users, id_vars=['weekday','city'], value_vars=['visitors','signups'], value_name='value')
# Print skinny
print(skinny)
Obtaining key-value pairs with melt()
Sometimes, all we need is some key-value pairs, and the context does not matter. If said context is in the index, we can easily obtain what we want. For example, in the users DataFrame, the visitors and signups columns lend themselves well to being represented as key-value pairs. So if we created a hierarchical index with ‘city’ and ‘weekday’ columns as the index, we can easily extract key-value pairs for the ‘visitors’ and ‘signups’ columns by melting users and specifying col_level=0.
# Set the new index: users_idx
users_idx = users.set_index(['city', 'weekday'])
# Print the users_idx DataFrame
print(users_idx)
# Obtain the key-value pairs: kv_pairs
kv_pairs = pd.melt(users_idx,col_level=0)
# Print the key-value pairs
print(kv_pairs)
Setting up a pivot table
Recall that a pivot table allows you to see all of your variables as a function of two other variables. In this exercise, we will use the .pivot_table() method to see how the users DataFrame entries appear when presented as functions of the ‘weekday’ and ‘city’ columns. That is, with the rows indexed by ‘weekday’ and the columns indexed by ‘city’.
# Create the DataFrame with the appropriate pivot table: by_city_day
by_city_day = users.pivot_table(index='weekday',columns='city')
# Print by_city_day
print(by_city_day)
We can also use aggregation functions within a pivot table by specifying the aggfunc parameter. In this exercise, we will practice using the ‘count’ and len aggregation functions - which produce the same result - on the users DataFrame.
# Use a pivot table to display the count of each column: count_by_weekday1
count_by_weekday1 = users.pivot_table(index='weekday',aggfunc='count')
# Print count_by_weekday
print(count_by_weekday1)
# Replace 'aggfunc='count'' with 'aggfunc=len': count_by_weekday2
count_by_weekday2 = users.pivot_table(index='weekday',aggfunc=len)
# Verify that the same result is obtained
print('==========================================')
print(count_by_weekday1.equals(count_by_weekday2))
Sometimes it’s useful to add totals in the margins of a pivot table. We can do this with the argument margins=True. In this exercise, we will practice using margins in a pivot table along with a new aggregation function: sum.
# Create the DataFrame with the appropriate pivot table: signups_and_visitors
signups_and_visitors = users.pivot_table(index='weekday',aggfunc=sum)
# Print signups_and_visitors
print(signups_and_visitors)
# Add in the margins: signups_and_visitors_total
signups_and_visitors_total = users.pivot_table(index='weekday',aggfunc=sum,margins=True)
# Print signups_and_visitors_total
print(signups_and_visitors_total)
Grouping by multiple columns
In this exercise, we will return to working with the Titanic dataset and use .groupby() to analyze the distribution of passengers who boarded the Titanic. The ‘pclass’ column identifies which class of ticket was purchased by the passenger and the ‘embarked’ column indicates at which of the three ports the passenger boarded the Titanic. ‘S’ stands for Southampton, England, ‘C’ for Cherbourg, France and ‘Q’ for Queenstown, Ireland. Our job is to first group by the ‘pclass’ column and count the number of rows in each class using the ‘survived’ column. We will then group by the ‘embarked’ and ‘pclass’ columns and count the number of passengers.
# Group titanic by 'pclass'
by_class = titanic.groupby('pclass')
# Aggregate 'survived' column of by_class by count
count_by_class = by_class['survived'].count()
# Print count_by_class
print(count_by_class)
# Group titanic by 'embarked' and 'pclass'
by_mult = titanic.groupby(['embarked','pclass'])
# Aggregate 'survived' column of by_mult by count
count_mult = by_mult['survived'].count()
# Print count_mult
print(count_mult)
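A point worth keeping in mind: .count() counts only the non-missing values in the selected column, while .size() counts every row in each group, so the two can differ when the column contains NaNs. A quick comparison using the grouped object from above:
# Rows per ('embarked', 'pclass') group, regardless of missing values
print(by_mult.size())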
Grouping by another series
In this exercise, we’ll use two data sets from Gapminder.org to investigate the average life expectancy (in years) at birth in 2010 for the 6 continental regions. To do this we’ll read the life expectancy data per country into one pandas DataFrame and the association between country and region into another. By setting the index of both DataFrames to the country name, we’ll then use the region information to group the countries in the life expectancy DataFrame and compute the mean value for 2010.
# Read life_fname into a DataFrame: life
life = pd.read_csv(life_fname, index_col='Country')
# Read regions_fname into a DataFrame: regions
regions = pd.read_csv(regions_fname, index_col='Country')
# Group life by regions['region']: life_by_region
life_by_region = life.groupby(regions['region'])
# Print the mean over the '2010' column of life_by_region
print(life_by_region['2010'].mean())
Computing multiple aggregates of multiple columns
The .agg() method can be used with a tuple or list of aggregations as input. When applying multiple aggregations on multiple columns, the aggregated DataFrame has a multi-level column index. In this exercise, we’re going to group passengers on the Titanic by ‘pclass’ and aggregate the ‘age’ and ‘fare’ columns by the functions ‘max’ and ‘median’. We’ll then use multi-level selection to find the oldest passenger per class and the median fare price per class.
# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')
# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]
# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max','median'])
# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])
# Print the median fare in each class
print(aggregated.loc[:,('fare','median')])
Aggregating on index levels/fields
If we have a DataFrame with a multi-level row index, the individual levels can be used to perform the groupby. This allows advanced aggregation techniques to be applied along one or more levels in the index and across one or more columns. In this exercise we’ll use the full Gapminder dataset, which contains yearly values of life expectancy, population, child mortality (per 1,000) and per capita gross domestic product (GDP) for every country in the world from 1964 to 2013. Our job is to create a DataFrame with a multi-level index on the columns ‘Year’, ‘region’ and ‘Country’. Next we’ll group the DataFrame by the ‘Year’ and ‘region’ levels. Finally, we’ll apply a dictionary aggregation to compute the total population, the spread of per capita GDP values and the average child mortality rate.
# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv('gapminder.csv',index_col=['Year','region','Country']).sort_index()
# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=(0,1))
# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()
# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}
# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)
# Print the last 6 entries of aggregated
print(aggregated.tail(6))
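On pandas 0.25 and later, the same aggregation can be written with named aggregation, which also lets us choose the output column names. A sketch, reusing by_year_region and spread from above:
# Named aggregation: output_name=(input_column, aggregation)
aggregated_named = by_year_region.agg(
    total_population=('population', 'sum'),
    gdp_spread=('gdp', spread),
    mean_child_mortality=('child_mortality', 'mean')
)
print(aggregated_named.tail(6))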
Grouping on a function of the index
Groupby operations can also be performed on transformations of the index values. In the case of a DateTimeIndex, we can extract portions of the datetime over which to group. In this exercise we'll read in a set of sample sales data from February 2015 and assign the 'Date' column as the index. Our job is to group the sales data by the day of the week and aggregate the sum of the 'Units' column. Is there a day of the week that is more popular for customers? To find out, we're going to use .strftime('%a') to transform the index datetime values to abbreviated days of the week.
# Read file: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))
# Create sum: units_sum
units_sum = by_day['Units'].sum()
# Print units_sum
print(units_sum)
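On recent pandas versions, the same grouping can be expressed with .day_name(), which returns full weekday names and avoids the strftime format string. A sketch, assuming the same sales DataFrame with its DatetimeIndex:
# Group by full weekday name instead of the '%a' abbreviation
by_day_full = sales.groupby(sales.index.day_name())
print(by_day_full['Units'].sum())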
Detecting outliers with Z-Scores
After grouping, the .transform() method applies a function to each group of data independently; here we will use it with the zscore function. The z-score is also useful for finding outliers: a z-score of +/- 3 is generally considered to be an outlier. In this example, we're going to normalize the Gapminder data in 2010 for life expectancy and fertility by the z-score per region. Using boolean indexing, we will then filter out countries that have high fertility rates and low life expectancy for their region.
# Import zscore
from scipy.stats import zscore
# Group gapminder_2010: standardized
standardized = gapminder_2010.groupby('region')[['life', 'fertility']].transform(zscore)
# Construct a Boolean Series to identify outliers: outliers
outliers = (standardized['life'] < -3) | (standardized['fertility'] > 3)
# Filter gapminder_2010 by the outliers: gm_outliers
gm_outliers = gapminder_2010.loc[outliers]
# Print gm_outliers
print(gm_outliers)
Filling missing data (imputation) by group
Many statistical and machine learning packages cannot determine the best action to take when missing data entries are encountered. Dealing with missing data is natural in pandas (both in using the default behavior and in defining a custom behavior). Previously we practiced using the .dropna() method to drop missing values. Now, we will practice imputing missing values. We can use .groupby() and .transform() to fill missing data appropriately for each group. Our job is to fill in missing ‘age’ values for passengers on the Titanic with the median age for their ‘sex’ and ‘pclass’ group. To do this, we’ll group by the ‘sex’ and ‘pclass’ columns and transform each group with a custom function that calls .fillna() to impute the median value. Notice in particular the NaNs in the ‘age’ column.
# Create a groupby object: by_sex_class
by_sex_class = titanic.groupby(['sex','pclass'])
# Write a function that imputes median
def impute_median(series):
    return series.fillna(series.median())
# Impute age and assign to titanic['age']
titanic['age'] = by_sex_class['age'].transform(impute_median)
# Print the output of titanic.tail(10)
print(titanic.tail(10))
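An equivalent formulation, for comparison, skips the custom function entirely: compute each group’s median with .transform('median') and use the aligned result to fill the gaps. A minimal sketch:
# Per-group medians broadcast back to the original row labels
group_medians = by_sex_class['age'].transform('median')
titanic['age'] = titanic['age'].fillna(group_medians)
print(titanic.tail(10))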
Other transformations with .apply
The .apply() method, when used on a groupby object, performs an arbitrary function on each of the groups. These functions can be aggregations, transformations or more complex workflows. The .apply() method will then combine the results in an intelligent way. In this exercise, we’re going to analyze economic disparity within regions of the world using the Gapminder data set for 2010. To do this we’ll define a function to compute the aggregate spread of per capita GDP in each region and the individual country’s z-score of the regional per capita GDP. We’ll then select three countries - the United States, the United Kingdom and China - to see a summary of the regional GDP and each country’s z-score against the regional mean. The 2010 Gapminder DataFrame is provided for us as gapminder_2010. The following function has been defined for our use:
def disparity(gr):
    # Compute the spread of gr['gdp']: s
    s = gr['gdp'].max() - gr['gdp'].min()
    # Compute the z-score of gr['gdp'] as (gr['gdp'] - gr['gdp'].mean()) / gr['gdp'].std(): z
    z = (gr['gdp'] - gr['gdp'].mean()) / gr['gdp'].std()
    # Return a DataFrame with the inputs {'z(gdp)': z, 'regional spread(gdp)': s}
    return pd.DataFrame({'z(gdp)': z, 'regional spread(gdp)': s})
# Group gapminder_2010 by 'region': regional
regional = gapminder_2010.groupby('region')
# Apply the disparity function on regional: reg_disp
reg_disp = regional.apply(disparity)
# Print the disparity of 'United States', 'United Kingdom', and 'China'
print(reg_disp.loc[['United States','United Kingdom','China']])
Grouping and filtering with .apply()
By using .apply(), we can write functions that filter rows within groups. The .apply() method will handle the iteration over individual groups and then re-combine them back into a Series or DataFrame. In this exercise we’ll take the Titanic data set and analyze survival rates on the ‘C’ deck, which contained the most passengers. To do this we’ll group the dataset by ‘sex’ and then use the .apply() method on a provided user-defined function which calculates the mean survival rate on the ‘C’ deck:
def c_deck_survival(gr):
    c_passengers = gr['cabin'].str.startswith('C').fillna(False)
    return gr.loc[c_passengers, 'survived'].mean()
# Create a groupby object using titanic over the 'sex' column: by_sex
by_sex = titanic.groupby('sex')
# Call by_sex.apply with the function c_deck_survival and print the result
c_surv_by_sex = by_sex.apply(c_deck_survival)
# Print the survival rates
print(c_surv_by_sex)
Grouping and filtering with .filter()
We can use groupby with the .filter() method to remove whole groups of rows from a DataFrame based on a boolean condition. In this exercise, we’ll take the February sales data and remove entries from companies that purchased less than or equal to 35 Units in the whole month. First, we’ll identify how many units each company bought for verification. Next we’ll use the .filter() method after grouping by ‘Company’ to remove all rows belonging to companies whose sum over the ‘Units’ column was less than or equal to 35. Finally, verify that the three companies whose total Units purchased were less than or equal to 35 have been filtered out from the DataFrame.
# Read the CSV file into a DataFrame: sales
sales = pd.read_csv('sales.csv', index_col='Date', parse_dates=True)
# Group sales by 'Company': by_company
by_company = sales.groupby('Company')
# Compute the sum of the 'Units' of by_company: by_com_sum
by_com_sum = by_company['Units'].sum()
print(by_com_sum)
# Filter 'Units' where the sum is > 35: by_com_filt
by_com_filt = by_company.filter(lambda g:g['Units'].sum() > 35)
print(by_com_filt)
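The same filtering can also be expressed with .transform() plus boolean indexing, which broadcasts each company’s total back onto the original rows. A sketch, reusing sales and by_company from above:
# Each row gets its company's total 'Units'; keep rows whose total exceeds 35
units_per_company = by_company['Units'].transform('sum')
by_com_filt_alt = sales[units_per_company > 35]
# Should print True: both approaches keep the same rows in the original order
print(by_com_filt_alt.equals(by_com_filt))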
Filtering and grouping with .map()
We have seen how to group by a column, or by multiple columns. Sometimes, we may instead want to group by a function or transformation of a column. The key here is that the Series is indexed the same way as the DataFrame. We can also mix and match column grouping with Series grouping. In this exercise our job is to investigate survival rates of passengers on the Titanic by ‘age’ and ‘pclass’. In particular, the goal is to find out what fraction of children under 10 survived in each ‘pclass’. We’ll do this by first creating a boolean array where True marks passengers under 10 years old and False marks passengers 10 or older. We’ll use .map() to change these values to strings. Finally, we’ll group by the under10 Series and the ‘pclass’ column and aggregate the ‘survived’ column. The ‘survived’ column has the value 1 if the passenger survived and 0 otherwise, so the mean of the ‘survived’ column is the fraction of passengers who lived.
# Create the Boolean Series: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})
# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)
# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10,'pclass'])['survived'].mean()
print(survived_mean_2)
Using .value_counts() for ranking
For this exercise, we will use the pandas Series method .value_counts() to determine the top 15 countries ranked by total number of medals. Notice that .value_counts() sorts by values by default. The result is returned as a Series of counts indexed by unique entries from the original Series with values (counts) ranked in descending order.
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
# Print top 15 countries ranked by medals
print(medal_counts.head(15))
Using .pivot_table() to count medals by type
Rather than ranking countries by total medals won and showing that list, we may want to see a bit more detail. We can use a pivot table to compute how many separate bronze, silver and gold medals each country won. That pivot table can then be used to repeat the previous computation to rank by total medals won. In this exercise, we will use .pivot_table() first to aggregate the total medals by type. Then, we can use .sum() along the columns of the pivot table to produce a new column. When the modified pivot table is sorted by the total medals column, we can display the results from the last exercise with a bit more detail.
# Construct the pivot table: counted
counted = medals.pivot_table(index='NOC', values='Athlete', columns='Medal', aggfunc='count')
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
# Sort counted by the 'totals' column
counted = counted.sort_values(by='totals', ascending=False)
# Print the top 15 rows of counted
print(counted.head(15))
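As a cross-check, the same medal-count table can be built with .groupby() followed by .unstack(); combinations with no medals appear as NaN, just as in the pivot table. A minimal sketch:
# Grouped count reshaped to wide form, one column per medal type
counted_alt = medals.groupby(['NOC', 'Medal'])['Athlete'].count().unstack(level='Medal')
print(counted_alt.head())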
Applying .drop_duplicates()
What could be the difference between the ‘Event_gender’ and ‘Gender’ columns? We should be able to evaluate our guess by looking at the unique values of the pairs (Event_gender, Gender) in the data. In particular, we should not see something like (Event_gender='M', Gender='Women'). However, we will see that, strangely enough, there is an observation with (Event_gender='W', Gender='Men'). The duplicates can be dropped using the .drop_duplicates() method, leaving behind the unique observations.
# Select columns: ev_gen
ev_gen = medals[['Event_gender','Gender']]
# Drop duplicate pairs: ev_gen_uniques
ev_gen_uniques = ev_gen.drop_duplicates()
# Print ev_gen_uniques
print(ev_gen_uniques)
Finding possible errors with .groupby()
We will now use .groupby() to continue our exploration. Our job is to group by ‘Event_gender’ and ‘Gender’ and count the rows. We will see that there is only one suspicious row: this is likely a data error.
# Group medals by the two columns: medals_by_gender
medals_by_gender = medals.groupby(['Event_gender','Gender'])
# Create a DataFrame with a group count: medal_count_by_gender
medal_count_by_gender = medals_by_gender.count()
# Print medal_count_by_gender
print(medal_count_by_gender)
We will now inspect the suspect record by locating the offending row. We will see that, according to the data, Joyce Chepchumba was a man who won a medal in a women’s event. That is a data error, as we can confirm with a web search.
# Create the Boolean Series: sus
sus = (medals.Event_gender == 'W') & (medals.Gender == 'Men')
# Create a DataFrame with the suspicious row: suspect
suspect = medals[sus]
# Print suspect
print(suspect)
Using .nunique() to rank by distinct sports
We may want to know which countries won medals in the most distinct sports. The .nunique() method is the principal aggregation here. Given a categorical Series S, S.nunique() returns the number of distinct categories.
# Group medals by 'NOC': country_grouped
country_grouped = medals.groupby('NOC')
# Compute the number of distinct sports in which each country won medals: Nsports
Nsports = country_grouped['Sport'].nunique()
# Sort the values of Nsports in descending order
Nsports = Nsports.sort_values(ascending=False)
# Print the top 15 rows of Nsports
print(Nsports.head(15))
Counting USA vs. USSR Cold War Olympic Sports
The Olympic competitions between 1952 and 1988 took place during the height of the Cold War between the United States of America (USA) & the Union of Soviet Socialist Republics (USSR). Our goal in this exercise is to aggregate the number of distinct sports in which the USA and the USSR won medals during the Cold War years. The construction is mostly the same as in the preceding exercise. There is an additional filtering stage beforehand in which we reduce the original DataFrame medals by extracting data from the Cold War period that applies only to the US or to the USSR. The relevant country codes in the DataFrame, which has been pre-loaded as medals, are ‘USA’ & ‘URS’.
# Extract all rows for which the 'Edition' is between 1952 & 1988: during_cold_war
during_cold_war = (medals['Edition'] >= 1952) & (medals['Edition'] <= 1988)
# Extract rows for which 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA', 'URS'])
# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]
# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')
# Create Nsports
Nsports = country_grouped['Sport'].nunique()
# Print Nsports
print(Nsports)
Counting USA vs. USSR Cold War Olympic Medals
For this exercise, we want to see which country, the USA or the USSR, won the most medals consistently over the Cold War period. There are several steps involved in carrying out this computation.
- You’ll need a pivot table with years (‘Edition’) on the index and countries (‘NOC’) on the columns. The entries will be the total number of medals each country won that year. If the country won no medals in a given edition, expect a NaN in that entry of the pivot table.
- You’ll need to slice the Cold War period and subset the ‘USA’ and ‘URS’ columns.
- You’ll need to make a Series from this slice of the pivot table that tells which country won the most medals in each edition, using .idxmax(axis='columns'). Where .max() returns the maximum value of a Series or 1D array, .idxmax() returns the index label of the maximizing element. The argument axis='columns' (or axis=1) is required because, by default, .idxmax() operates down each column (returning, for each column, the row label of its maximum) rather than across the columns of each row.
- The final Series contains either ‘USA’ or ‘URS’ according to which country won the most medals in each Olympic edition. You can use .value_counts() to count the number of occurrences of each.
# Create the pivot table: medals_won_by_country
medals_won_by_country = medals.pivot_table(index='Edition', columns='NOC', values='Athlete', aggfunc='count')
# Slice medals_won_by_country: cold_war_usa_urs_medals
cold_war_usa_urs_medals = medals_won_by_country.loc[1952:1988, ['USA','URS']]
# Create most_medals
most_medals = cold_war_usa_urs_medals.idxmax(axis='columns')
# Print most_medals.value_counts()
print(most_medals.value_counts())
Visualizing USA Medal Counts by Edition: Line Plot
Our job in this exercise is to visualize the medal counts by ‘Edition’ for the USA.
# Import matplotlib.pyplot for plotting
import matplotlib.pyplot as plt
# Create the DataFrame: usa
usa = medals[medals['NOC'] == 'USA']
# Group usa by ['Edition', 'Medal'] and aggregate over 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].agg('count')
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Plot the DataFrame usa_medals_by_year
usa_medals_by_year.plot()
plt.show()
As in the previous exercise, our job in this exercise is to visualize the medal counts by ‘Edition’ for the USA. This time, we will use an area plot to see the breakdown better. The usa DataFrame has been created and all reshaping from the previous exercise has been done.
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()
You may have noticed that the medals are ordered according to a lexicographic (dictionary) ordering: Bronze < Gold < Silver. However, we would prefer an ordering consistent with the Olympic rules: Bronze < Silver < Gold. We can achieve this using Categorical types. In this final exercise, after redefining the ‘Medal’ column of the DataFrame medals, we will repeat the area plot from the previous exercise to see the new ordering.
# Redefine 'Medal' as an ordered categorical
medals.Medal = pd.Categorical(values=medals.Medal,categories=['Bronze', 'Silver', 'Gold'],ordered=True)
# Create the DataFrame: usa
usa = medals[medals.NOC == 'USA']
# Group usa by 'Edition', 'Medal', and 'Athlete'
usa_medals_by_year = usa.groupby(['Edition', 'Medal'])['Athlete'].count()
# Reshape usa_medals_by_year by unstacking
usa_medals_by_year = usa_medals_by_year.unstack(level='Medal')
# Create an area plot of usa_medals_by_year
usa_medals_by_year.plot.area()
plt.show()