Intermediate Python

27 minute read

Line Plot

With matplotlib, we can create a bunch of different plots in Python. The most basic plot is the line plot. The years are loaded in our workspace as a list called year, and the corresponding populations as a list called pop.

# Print the last item from year and pop
print(year[-1])
print(pop[-1])
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Make a line plot: year on the x-axis, pop on the y-axis
plt.plot(year,pop)
plt.show()

Now that we’ve built your first line plot, let’s start working on the data that professor Hans Rosling used to build his beautiful bubble chart. It was collected in 2007. Two lists are available for you:

  • life_exp which contains the life expectancy for each country and
  • gdp_cap, which contains the GDP per capita (i.e. per person) for each country expressed in US Dollars.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population and we get the GDP per capita.

# Print the last item of gdp_cap and life_exp
print(gdp_cap[-1])
print(life_exp[-1])
# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
plt.plot(gdp_cap,life_exp)
# Display the plot
plt.show()

Scatter Plot

When we have a time scale along the horizontal axis, the line plot is your friend. But in many other cases, when we’re trying to assess if there’s a correlation between two variables, for example, the scatter plot is the better choice. Let’s continue with the gdp_cap versus life_exp plot, the GDP and life expectancy data for different countries in 2007. Maybe a scatter plot will be a better alternative?

# Change the line plot below to a scatter plot
plt.scatter(gdp_cap, life_exp)
# Put the x-axis on a logarithmic scale
plt.xscale('log')
# Show plot
plt.show()

We saw that that the higher GDP usually corresponds to a higher life expectancy. In other words, there is a positive correlation. Is there’s a relationship between population and life expectancy of a country? The list life_exp and pop is available, listing the corresponding populations for the countries in 2007. The populations are in millions of people.

# Import package
import matplotlib.pyplot as plt
# Build Scatter plot
plt.scatter(pop,life_exp)
# Show plot
plt.show()

Histogram

life_exp, the list containing data on the life expectancy for different countries in 2007. To see how life expectancy in different countries is distributed, let’s create a histogram of life_exp.

# Create histogram of life_exp data
plt.hist(life_exp)
# Display histogram
plt.show()

We didn’t specify the number of bins. By default, Python sets the number of bins to 10 in that case. The number of bins is pretty important. Too few bins will oversimplify reality and won’t show you the details. Too many bins will overcomplicate reality and won’t show the bigger picture.

# Build histogram with 5 bins
plt.hist(life_exp,bins=5)
# Show and clean up plot
plt.show()
plt.clf()
# Build histogram with 20 bins
plt.hist(life_exp,bins=20)
# Show and clean up again
plt.show()
plt.clf()

We saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison. Let’s do a similar comparison. life_exp contains life expectancy data for different countries in 2007. We also have access to a second list now, life_exp1950, containing similar data for 1950.

# Histogram of life_exp, 15 bins
plt.hist(life_exp,bins=15)
# Show and clear plot
plt.show()
plt.clf()
# Histogram of life_exp1950, 15 bins
plt.hist(life_exp1950,bins=15)
# Show and clear plot again
plt.show()
plt.clf()

Labels

It’s time to customize our own plot. This is the fun part, we will see your plot come to life! We’re going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis. As a first step, let’s add axis labels and a title to the plot. We can do this with the xlabel(), ylabel() and title() functions, available in matplotlib.pyplot.

# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')
# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'
# Add axis labels
plt.xlabel(xlab)
plt.ylabel(ylab)
# Add title
plt.title(title)
# After customizing, display the plot
plt.show()

Ticks

We can control the y-ticks by specifying two arguments:

plt.yticks([0,1,2], [“one”,”two”,”three”]). The ticks corresponding to the numbers 0, 1 and 2 will be replaced by one, two and three, respectively. Let’s do a similar thing for the x-axis of our world development chart, with the xticks() function. The tick values 1000, 10000 and 100000 should be replaced by 1k, 10k and 100k.

# Scatter plot
plt.scatter(gdp_cap, life_exp)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
# Definition of tick_val and tick_lab
tick_val = [1000,10000,100000]
tick_lab = ['1k','10k','100k']
# Adapt the ticks on the x-axis
plt.xticks(tick_val,tick_lab)
# After customizing, display the plot
plt.show()

Sizes

Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let’s change this. Wouldn’t it be nice if the size of the dots corresponds to the population?

# Import numpy as np
import numpy as np
# Store pop as a numpy array: np_pop
np_pop = np.array(pop)
# Double np_pop
np_pop = np_pop * 2
# Update: set s argument to np_pop
plt.scatter(gdp_cap, life_exp, s = np_pop)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])
# Display the plot
plt.show()

Colors

The next step is making the plot more colorful! To do this, a list col has been created for you. It’s a list with a color for each corresponding country, depending on the continent the country is part of. If we have another look at the script, under # Additional Customizations, we’ll see that there are two plt.text() functions now. They add the words “India” and “China” in the plot.

# Scatter plot
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8)
# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])
# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')
# Add grid() call
plt.grid(1)
# Show the plot
plt.show()

Motivation for Dictionaries

To see why dictionaries are useful, have a look at the two lists defined on the right. countries contains the names of some European countries. capitals lists the corresponding names of their capital.

# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
# Get index of 'germany': ind_ger
ind_ger = countries.index('germany')
# Use ind_ger to print out capital of Germany
print(capitals[ind_ger])

The countries and capitals lists are again available in the script. It’s our job to convert this data to a dictionary where the country names are the keys and the capitals are the corresponding values. As a refresher, here is a recipe for creating a dictionary:

my_dict = {
   "key1":"value1",
   "key2":"value2",
}

In this recipe, both the keys and the values are strings.

# Definition of countries and capital
countries = ['spain', 'france', 'germany', 'norway']
capitals = ['madrid', 'paris', 'berlin', 'oslo']
# From string in countries and capitals, create dictionary europe
europe= {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo'}
# Print europe
print(europe)

If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for France from europe you we use: europe[‘france’]. Here, ‘france’ is the key and ‘paris’ the value is returned.

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Print out the keys in europe
print(europe.keys())
# Print out value that belongs to key 'norway'
print(europe['norway'])

Dictionary Manipulation

If we know how to access a dictionary, we can also assign a new value to it. To add a new key-value pair to europe you can use something like this: europe[‘iceland’] = ‘reykjavik’

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Add italy to europe
europe['italy']='rome'
# Print out italy in europe
print('italy' in europe)
# Add poland to europe
europe['poland']='warsaw'
# Print europe
print(europe)

Somebody thought it would be funny to mess with our accurately generated dictionary. An adapted version of the europe dictionary is available in the script. Can we clean up? Do not do this by adapting the definition of europe, but by adding Python commands to the script to update and remove key:value pairs.

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw',
          'australia':'vienna' }
# Update capital of germany
europe['germany']='berlin'
# Remove australia
del(europe["australia"])
# Print europe
print(europe)

Dictionariception

Remember lists? They could contain anything, even other lists. Well, for dictionaries the same holds. Dictionaries can contain key:value pairs where the values are again dictionaries. Have a look at the script where another version of europe - the dictionary we’ve been working with all along - is coded. The keys are still the country names, but the values are dictionaries that contain more information than just the capital. It’s perfectly possible to chain square brackets to select elements. To fetch the population for Spain from europe.

# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }
# Print out the capital of France
print(europe['france']['capital'])
# Create sub-dictionary data
data={'capital':'rome','population':59.83}
# Add data to europe under key 'italy'
europe['italy']=data
# Print europe
print(europe)

Dictionary to DataFrame

Pandas is an open source library, providing high-performance, easy-to-use data structures and data analysis tools for Python. Sounds promising! The DataFrame is one of Pandas’ most important data structures. It’s basically a way to store tabular data where we can label the rows and the columns. One way to build a DataFrame is from a dictionary. We will be working with vehicle data from different countries. Each observation corresponds to a country and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on. Three lists are defined in the script:

  • names, containing the country names for which data is available.
  • dr, a list with booleans that tells whether people drive left or right in the corresponding country.
  • cpc, the number of motor vehicles per 1000 people in the corresponding country.

Each dictionary key is a column label and each value is a list which contains the column elements.

# Pre-defined lists
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
# Import pandas as pd
import pandas as pd
# Create dictionary my_dict with three key:value pairs: my_dict
my_dict = {'country':names,'drives_right':dr,'cars_per_cap':cpc}
# Build a DataFrame cars from my_dict: cars
cars = pd.DataFrame(my_dict)
# Print cars
print(cars)

We noticed that the row labels (i.e. the labels for the different observations) were automatically set to integers from 0 up to 6? To solve this a list row_labels has been created. We can use it to specify the row labels of the cars DataFrame. We do this by setting the index attribute of cars, that we can access as cars.index.

# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(dict)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index = row_labels
# Print cars again
print(cars)

CSV to DataFrame

Putting data in a dictionary and then building a DataFrame works, but it’s not very efficient. What if we’re dealing with millions of observations? In those cases, the data is typically available as files with a regular structure. One of those file types is the CSV file, which is short for “comma-separated values”. To import CSV data into Python as a Pandas DataFrame you can use read_csv(). Let’s explore this function with the same cars data. This time, however, the data is available in a CSV file, named cars.csv.

# Import pandas as pd
import pandas as pd
# Import the cars.csv data: cars
cars= pd.read_csv("cars.csv")
# Print out cars
print(cars)

Our read_csv() call to import the CSV data didn’t generate an error, but the output is not entirely what we wanted. The row labels were imported as another column without a name. Remember index_col, an argument of read_csv(), that we can use to specify which column in the CSV file should be used as a row label? Well, that’s exactly what we need here!

# Import pandas as pd
import pandas as pd
# Fix import by including index_col
cars = pd.read_csv('cars.csv',index_col=0)
# Print out cars
print(cars)

Square Brackets

We can index and select Pandas DataFrames in many different ways. The simplest, but not the most powerful way, is to use square brackets. The single bracket version gives a Pandas Series, the double bracket version gives a Pandas DataFrame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out country column as Pandas Series
print(cars["country"])
# Print out country column as Pandas DataFrame
print(cars[["country"]])
# Print out DataFrame with country and drives_right columns
print(cars[["country","drives_right"]])

Square brackets can do more than just selecting columns. We can also use them to get rows, or observations, from a DataFrame. Pay attention: We can only select rows using square brackets if you specify a slice, like 0:4. Also, we’re using the integer indexes of the rows here, not the row labels!

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out first 3 observations
print(cars[0:3])
# Print out fourth, fifth and sixth observation
print(cars[3:6])

loc and iloc

With loc and iloc we can do practically any data selection operation on DataFrames we can think of. loc is label-based, which means that we have to specify rows and columns based on their row and column labels. iloc is integer index based, so we have to specify rows and columns by their integer index like we did previously.

cars.loc['RU']
cars.iloc[4]

cars.loc[['RU']]
cars.iloc[[4]]

cars.loc[['RU', 'AUS']]
cars.iloc[[4, 1]]

As before, code is included that imports the cars data as a Pandas DataFrame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out observation for Japan
print(cars.loc["JAP",:])
# Print out observations for Australia and Egypt
print(cars.loc[["AUS","EG"],:])

loc and iloc also allow us to select both rows and columns from a DataFrame.

cars.loc['IN', 'cars_per_cap']
cars.iloc[3, 0]

cars.loc[['IN', 'RU'], 'cars_per_cap']
cars.iloc[[3, 4], 0]

cars.loc[['IN', 'RU'], ['cars_per_cap', 'country']]
cars.iloc[[3, 4], [0, 1]]
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out drives_right value of Morocco
print(cars.loc[["MOR"],["drives_right"]])
# Print sub-DataFrame
print(cars.loc[["RU","MOR"],["country","drives_right"]])

It’s also possible to select only columns with loc and iloc. In both cases, we simply put a slice going from beginning to end in front of the comma:

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out drives_right column as Series
print(cars.loc[:,"drives_right"])
# Print out drives_right column as DataFrame
print(cars.loc[:,["drives_right"]])
# Print out cars_per_cap and drives_right as DataFrame
print(cars.loc[:,["cars_per_cap","drives_right"]])

Equality

To check if two Python values, or variables, are equal we can use ==. To check for inequality, we need !=. As a refresher, have a look at the following examples that all result in True.

# Comparison of booleans
print(True == False)
# Comparison of integers
print((-5 * 15)!=75)
# Comparison of strings
print("pyscript" == "PyScript")
# Compare a boolean with an integer
print(True==1)

Greater and less than

We know about the less than and greater than signs, < and > in Python. We can combine them with an equals sign: <= and >=. Pay attention: <= is valid syntax, but =< is not. Remember that for string comparison, Python determines the relationship based on alphabetical order.

# Comparison of integers
x = -3 * 6
print(x>=-10 )
# Comparison of strings
y = "test"
print("test" <=y)
# Comparison of booleans
print(True > False)

Compare arrays

Out of the box, we can also use comparison operators with Numpy arrays. Remember areas, the list of area measurements for different rooms in our house from the previous course? This time there’s two Numpy arrays: my_house and your_house. They both contain the areas for the kitchen, living room, bedroom and bathroom in the same order, so we can compare them.

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than or equal to 18
print(my_house[my_house>=18])
# my_house less than your_house
print(my_house[my_house<your_house])

and, or, not

A boolean is either 1 or 0, True or False. With boolean operators such as and, or and not, you we combine these booleans to perform more advanced queries on our data. In the sample code, two variables are defined: my_kitchen and your_kitchen, representing areas.

# Define variables
my_kitchen = 18.0
your_kitchen = 14.0
# my_kitchen bigger than 10 and smaller than 18?
print(my_kitchen>10 and my_kitchen<18)
# my_kitchen smaller than 14 or bigger than 17?
print(my_kitchen<14 or my_kitchen>17)
# Double my_kitchen smaller than triple your_kitchen?
print(2*my_kitchen<3*your_kitchen)

Boolean operators with Numpy

Before, the operational operators like < and >= worked with Numpy arrays out of the box. Unfortunately, this is not true for the boolean operators and, or, and not. To use these operators with Numpy, we will need np.logical_and(), np.logical_or() and np.logical_not(). Here’s an example on the my_house and your_house arrays from before to give us an idea:

# Create arrays
import numpy as np
my_house = np.array([18.0, 20.0, 10.75, 9.50])
your_house = np.array([14.0, 24.0, 14.25, 9.0])
# my_house greater than 18.5 or smaller than 10
print(np.logical_or(my_house>18.5,my_house<10))
# Both my_house and your_house smaller than 11
print(np.logical_and(my_house<11,your_house<11))

if else

It’s time to take a closer look around in our house. Two variables are defined in the sample code: room, a string that tells you which room of the house we’re looking at, and area, the area of that room.

# Define variables
room = "kit"
area = 14.0
# if-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
    else :
        print("looking around elsewhere.")

# if-else construct for area
    if area > 15 :
        print("big place!")
    else:
        print("pretty small.")

elif

It’s also possible to have a look around in the bedroom. The sample code contains an elif part that checks if room equals “bed”. In that case, “looking around in the bedroom.” is printed out. It’s up to us now! Make a similar addition to the second control structure to further customize the messages for different values of area.

# Define variables
room = "bed"
area = 14.0
# if-elif-else construct for room
if room == "kit" :
    print("looking around in the kitchen.")
elif room == "bed":
    print("looking around in the bedroom.")
else :
    print("looking around elsewhere.")
# if-elif-else construct for area
if area > 15 :
    print("big place!")
elif area >10:
    print("medium size,nice!")
else :
    print("pretty small.")

Driving right

Remember that cars dataset, containing the cars per 1000 people (cars_per_cap) and whether people drive right (drives_right) for different countries (country)? Let’s start simple and try to find all observations in cars where drives_right is True. drives_right is a boolean column, so we’ll have to extract it as a Series and then use this boolean Series to select observations from cars.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Extract drives_right column as Series: sel
sel = cars[cars["drives_right"]]
# Print sel
print(sel)

Cars per capita

Let’s stick to the cars data some more. This time we want to find out which countries have a high cars per capita figure. In other words, in which countries do many people have a car, or maybe multiple cars. We’ll build up a boolean Series, that we can then use to subset the cars DataFrame to select certain observations. If we want to do this in a one-liner, that’s perfectly fine!

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars= cpc>500
car_maniac = cars[many_cars]
# Print car_maniac
print(car_maniac)

Remember about np.logical_and(), np.logical_or() and np.logical_not(), the Numpy variants of the and, or and not operators? We can also use them on Pandas Series to do more advanced filtering operations. Take this example that selects the observations that have a cars_per_cap between 10 and 80. Try out these lines of code step by step to see what’s happening.

cpc = cars['cars_per_cap']
between = np.logical_and(cpc > 10, cpc < 80)
medium = cars[between]
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Import numpy, you'll need this
import numpy as np
# Create medium: observations with cars_per_cap between 100 and 500
between = np.logical_and(cars["cars_per_cap"]>100,cars["cars_per_cap"]<500)
medium = cars[between]
# Print medium
print(medium)

while loop

We’re going to code a while loop that implements a very basic control system for an inverted pendulum. If there’s an offset from standing perfectly straight, the while loop will incrementally fix this offset.

# Initialize offset
offset = 8
# Code the while loop
while offset!=0:
    print("correcting...")
    offset = offset-1
    print(offset)

The while loop that corrects the offset is a good start, but what if offset is negative? The while loop will never stop running, because offset will be further decreased on every run. offset != 0 will never become False and the while loop continues forever. We will fix things by putting an if-else statement inside the while loop.

# Initialize offset
offset = -6
# Code the while loop
while offset != 0 :
    print("correcting...")
    if(offset>0):
                 offset = offset - 1
    else:
         offset=offset+1    
    print(offset)

As usual, we simply have to indent the code with 4 spaces to tell Python which code should be executed in the for loop.

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Code the for loop
for height in areas:
    print(height)

Indexes and Values

Using a for loop to iterate over a list only gives us access to every list element in each run, one after the other. If we also want to access the index information, we can use enumerate().

# areas list
areas = [11.25, 18.0, 20.0, 10.75, 9.50]
# Change for loop to use enumerate()
for index, area in enumerate(areas) :
    print("room "+str(index+1)+": "+str(area))

Loop over list of lists

Remember the house variable? It’s basically a list of lists, where each sublist contains the name and area of a room in our house. It’s up to us to build a for loop from scratch this time!

# house list of lists
house = [["hallway", 11.25],
         ["kitchen", 18.0],
         ["living room", 20.0],
         ["bedroom", 10.75],
         ["bathroom", 9.50]]
# Build a for loop from scratch
for x in house:
     print("the "+str(x[0])+" is "+str(x[1])+" sqm")

Loop over Dictionary

In Python 3, we need the items() method to loop over a dictionary. Remember the europe dictionary that contained the names of some European countries as key and their capitals as corresponding value?

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'bonn',
          'norway':'oslo', 'italy':'rome', 'poland':'warsaw', 'australia':'vienna' }
# Iterate over europe
for key,value in europe.items():
    print("the capital of "+key+" is "+ value)

Loop over Numpy array

If we’re dealing with a 1D Numpy array, looping over all elements can be simple. If we’re dealing with a 2D Numpy array, it’s more complicated. A 2D array is built up of multiple 1D arrays. Two Numpy arrays are available: np_height, a Numpy array containing the heights of Major League Baseball players, and np_baseball, a 2D Numpy array that contains both the heights (first column) and weights (second column) of those players.

# Import numpy as np
import numpy as np
# For loop over np_height
for x in np_height:
    print(str(x)+" inches")
# For loop over np_baseball
for x in np.nditer(np_baseball):
    print(str(x))

Loop over DataFrame

Iterating over a Pandas DataFrame is typically done with the iterrows() method. Used in a for loop, every observation is iterated over and on every iteration the row label and actual row contents. In this and the following exercises we will be working on the cars DataFrame. It contains information on the cars per capita and whether people drive right or left for seven countries in the world.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Iterate over rows of cars
for key,value in cars.iterrows():
    print(key)
    print(value)

The row data that’s generated by iterrows() on every run is a Pandas Series. This format is not very convenient to print out. Luckily, we can easily select variables from the Pandas Series using square brackets.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Adapt for loop
for lab,row in cars.iterrows():
    print(lab +": "+ str(row['cars_per_cap']))

Add Column

We can add new column in the cars data frame.

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Code for loop that adds COUNTRY column
for lab,row in cars.iterrows():
    cars.loc[lab, "COUNTRY"] = row["country"].upper()
# Print cars
print(cars)

Using iterrows() to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, we’re creating a new Pandas Series. If we want to add a column to a DataFrame by calling a function on another column, the iterrows() method in combination with a for loop is not the preferred way to go. Instead, we’ll want to use apply(). Compare the iterrows() version with the apply() version to get the same result in the brics DataFrame:

for lab, row in brics.iterrows() :
    brics.loc[lab, "name_length"] = len(row["country"])

brics["name_length"] = brics["country"].apply(len)

We can do a similar thing to call the upper() method on every name in the country column. However, upper() is a method, so we’ll need a slightly different approach:

# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)
# Use .apply(str.upper)
cars["COUNTRY"] = cars["country"].apply(str.upper)
print(cars)

Hacker statistics

Randomness has many uses in science, art, statistics, cryptography, gaming, gambling, and other fields. We’re going to use randomness to simulate a game. All the functionality is contained in the random package, a sub-package of numpy. We will be using two functions from this package:

  • seed(): sets the random seed, so that your results are the reproducible between simulations. As an argument, it takes an integer of your choosing. If you call the function, no output will be generated.
  • rand(): if you don’t specify any arguments, it generates a random float between zero and one.
# Import numpy as np
import numpy as np
# Set the seed
np.random.seed(123)
# Generate and print random float
print(np.random.rand())

Roll the dice

In the previous exercise, we used rand(), that generates a random float between 0 and 1. We can just as well use randint(), also a function of the random package, to generate integers randomly.

# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Use randint() to simulate a dice
print(np.random.randint(1,7))
# Use randint() again
print(np.random.randint(1,7))

In the Empire State Building bet, our next move depends on the number of eyes we throw with the dice. We can perfectly code this with an if-elif-else construct! The sample code assumes that we’re currently at step 50. Can we fill in the missing pieces to finish the script?

# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Starting step
step = 50
# Roll the dice
dice= np.random.randint(1,7)
# Finish the control construct
if dice <= 2 :
    step = step - 1
elif dice <=5:
    step =step+1
else:
    step = step + np.random.randint(1,7)
# Print out dice and step
print(dice)
print(step)

Before, we have already written Python code that determines the next step based on the previous step. Now it’s time to put this code inside a for loop so that we can simulate a random walk.

# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Initialize random_walk
random_walk = [0]
# Complete the ___
for x in range(100) :
    # Set step: last element in random_walk
    step = random_walk[-1]
    # Roll the dice
    dice = np.random.randint(1,7)
    # Determine next step
    if dice <= 2:
        step = step - 1
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    # append next_step to random_walk
    random_walk.append(step)
# Print random_walk
print(random_walk)

How low can you go?

Things are shaping up nicely! We already have code that calculates our location in the Empire State Building after 100 dice throws. However, there’s something we haven’t thought about - we can’t go below 0! A typical way to solve problems like this is by using max(). If you pass max() two arguments, the biggest one gets returned. For example, to make sure that a variable x never goes below 10 when we decrease it, we can use: x = max(10, x - 1)

# Import numpy and set seed
import numpy as np
np.random.seed(123)
# Initialize random_walk
random_walk = [0]
for x in range(100) :
    step = random_walk[-1]
    dice = np.random.randint(1,7)
    if dice <= 2:
        # Replace below: use max to make sure step can't go below 0
        step = max(step - 1,0)
    elif dice <= 5:
        step = step + 1
    else:
        step = step + np.random.randint(1,7)
    random_walk.append(step)
print(random_walk)

Visualize the walk

Let’s visualize this random walk! Remember how we could use matplotlib to build a line plot? The first list we pass is mapped onto the x axis and the second list is mapped onto the y axis. If we pass only one argument, Python will know what to do and will use the index of the list to map onto the x axis, and the values in the list onto the y axis.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Plot random_walk
plt.plot(random_walk)
# Show the plot
plt.show()

Simulate multiple walks

A single random walk is one thing, but that doesn’t tell you if we have a good chance at winning the bet. To get an idea about how big our chances are of reaching 60 steps, we can repeatedly simulate the random walk and collect the results.

# Initialization
import numpy as np
np.random.seed(123)
# Initialize all_walks
all_walks= []
# Simulate random walk 10 times
for i in range(10) :
    # Code from before
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    # Append random_walk to all_walks
    all_walks.append(random_walk)
# Print all_walks
print(all_walks)

Visualize all walks

all_walks is a list of lists: every sub-list represents a single random walk. If we convert this list of lists to a Numpy array, we can start making interesting plots!

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
for i in range(10) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        random_walk.append(step)
    all_walks.append(random_walk)
# Convert all_walks to Numpy array: np_aw
np_aw = np.array(all_walks)
# Plot np_aw and show
plt.plot(np_aw)
plt.show()
# Clear the figure
plt.clf()
# Transpose np_aw: np_aw_t
np_aw_t = np.transpose(np_aw)
# Plot np_aw_t and show
plt.plot(np_aw_t)
plt.show()

Implement clumsiness

With this neatly written code of ours, changing the number of times the random walk should be simulated is super-easy. We simply update the range() function in the top-level for loop. There’s still something we forgot! We’re a bit clumsy and we have a 0.1% chance of falling down. That calls for another random number generation. Basically, we can generate a random float between 0 and 1. If this value is less than or equal to 0.001, we should reset step to 0.

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
# Simulate random walk 250 times
for i in range(250) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        # Implement clumsiness
        if  np.random.rand()<=0.001:
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
plt.plot(np_aw_t)
plt.show()

Plot the distribution

All these fancy visualizations have put us on a sidetrack. We still have to solve the million-dollar problem: What are the odds that we’ll reach 60 steps high on the Empire State Building? Basically, we want to know about the end points of all the random walks you’ve simulated. These end points have a certain distribution that we can visualize with a histogram.

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(123)
all_walks = []
# Simulate random walk 1000 times
for i in range(500) :
    random_walk = [0]
    for x in range(100) :
        step = random_walk[-1]
        dice = np.random.randint(1,7)
        if dice <= 2:
            step = max(0, step - 1)
        elif dice <= 5:
            step = step + 1
        else:
            step = step + np.random.randint(1,7)
        if np.random.rand() <= 0.001 :
            step = 0
        random_walk.append(step)
    all_walks.append(random_walk)
# Create and plot np_aw_t
np_aw_t = np.transpose(np.array(all_walks))
# Select last row from np_aw_t: ends
ends = np_aw_t[-1]
# Plot histogram of ends, display plot
plt.hist(ends)
plt.show()

Datasets:

Tags:

Updated: