Data Science Toolbox (2)


Iterating over iterables

An iterable is an object that can return an iterator, while an iterator is an object that keeps state and produces the next value when you call next() on it. You will reinforce your knowledge about these by iterating over and printing from iterables and iterators.

# Create a list of strings: flash
flash = ['jay garrick', 'barry allen', 'wally west', 'bart allen']
# Print each list item in flash using a for loop
for person in flash:
    print(person)
# Create an iterator for flash: superspeed
superspeed = iter(flash)
# Print each item from the iterator
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
print(next(superspeed))
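
What happens when the iterator runs out? Calling next() once more raises a StopIteration exception. A minimal sketch, reusing the now-exhausted superspeed iterator from above:

# superspeed has yielded all four names, so one more next() raises StopIteration
try:
    print(next(superspeed))
except StopIteration:
    print('superspeed is exhausted!')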

Note that not all iterables are actual lists; they can also be strings (see the short sketch after the following code block). Now we will focus on the range() function. You can use range() in a for loop as if it were a list to be iterated over. Recall that range() doesn’t actually create the list; instead, it creates a range object with an iterator that produces the values until it reaches the limit. If range() created an actual list, calling it with a value of 10 ** 100 would not work, since a list that big would exceed a regular computer’s memory. The value 10 ** 100 is what’s called a googol: a 1 followed by a hundred zeros. That’s a huge number!

# Create an iterator for range(3): small_value
small_value = iter(range(3))
# Print the values in small_value
print(next(small_value))
print(next(small_value))
print(next(small_value))
# Loop over range(3) and print the values
for num in range(3):
  print(num)
# Create an iterator for range(10 ** 100): googol
googol = iter(range(10 ** 100))
# Print the first 5 values from googol
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
print(next(googol))
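
As noted above, strings are iterables too; a quick sketch:

# iter() works on strings as well: iterate over the characters of 'flash'
letters = iter('flash')
print(next(letters))  # f
print(next(letters))  # l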

Iterators as function arguments

We’ve been using the iter() function to get an iterator object, as well as the next() function to retrieve the values one by one from the iterator object. There are also functions that take iterators as arguments. For example, the list() and sum() functions return a list and the sum of elements, respectively.

# Create a range object: values
values = range(10, 21)
# Print the range object
print(values)
# Create a list of integers: values_list
values_list = list(values)
# Print values_list
print(values_list)
# Get the sum of values: values_sum
values_sum = sum(values)
# Print values_sum
print(values_sum)
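
list() and sum() are not the only built-ins that accept iterables; max(), min(), and sorted() do too. A quick sketch reusing the same values range object from above (a range can be iterated repeatedly):

# Other built-ins that take iterables
print(max(values))                    # 20
print(min(values))                    # 10
print(sorted(values, reverse=True))   # [20, 19, ..., 11, 10]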

Using enumerate

We’ve just gained several new ideas about iterators, and one of them is the enumerate() function. Recall that enumerate() returns an enumerate object that produces a sequence of tuples, each of which is an index-value pair.

# Create a list of strings: mutants
mutants = ['charles xavier',
            'bobby drake',
            'kurt wagner',
            'max eisenhardt',
            'kitty pride']
# Create a list of tuples: mutant_list
mutant_list = list(enumerate(mutants))
# Print the list of tuples
print(mutant_list)
# Unpack and print the tuple pairs
for index1, value1 in enumerate(mutants):
    print(index1, value1)
# Change the start index
for index2, value2 in enumerate(mutants, start=1):
    print(index2, value2)

Using zip

Another interesting function is zip(), which takes any number of iterables and returns a zip object, an iterator of tuples. If you want to see the values of a zip object, you can convert it into a list and then print it; printing the zip object itself only shows its type and memory address, not its values, unless you unpack it first. In this exercise, you will explore this for yourself, using two additional pre-loaded lists, aliases and powers, that line up element-wise with mutants.

# aliases and powers are pre-loaded in the original exercise; sample values shown here
aliases = ['prof x', 'iceman', 'nightcrawler', 'magneto', 'shadowcat']
powers = ['telepathy', 'thermokinesis', 'teleportation', 'magnetokinesis', 'intangibility']
# Create a list of tuples: mutant_data
mutant_data = list(zip(mutants, aliases, powers))
# Print the list of tuples
print(mutant_data)
# Create a zip object using the three lists: mutant_zip
mutant_zip = zip(mutants, aliases, powers)
# Print the zip object
print(mutant_zip)
# Unpack the zip object and print the tuple values
for value1, value2, value3 in mutant_zip:
    print(value1, value2, value3)
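
One detail worth remembering: zip() stops as soon as its shortest input is exhausted, silently dropping any leftover items. A minimal sketch:

# zip() stops at the shortest input iterable
short_zip = zip([1, 2, 3], ['a', 'b'])
print(list(short_zip))  # [(1, 'a'), (2, 'b')]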

Using * and zip to ‘unzip’

We know how to use zip() as well as how to print out values from a zip object. Excellent! Let’s play around with zip() a little more. There is no ‘unzip’ function that does the reverse of what zip() does, but we can reverse what has been zipped together by using zip() with a little help from the * operator, which unpacks an iterable such as a list or a tuple into the positional arguments of a function call.

# Create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)
# Print the tuples in z1 by unpacking with *
print(*z1)
# Re-create a zip object from mutants and powers: z1
z1 = zip(mutants, powers)
# 'Unzip' the tuples in z1 by unpacking with * and zip(): result1, result2
result1, result2 = zip(*z1)
# Check if the unpacked tuples match the originals
# (zip(*z1) yields tuples, so compare against tuple versions of the lists)
print(result1 == tuple(mutants))
print(result2 == tuple(powers))

Processing large amounts of Twitter data

Sometimes, the data we have to process is too large to fit in a computer’s memory all at once. This is a common problem faced by data scientists. One solution is to process the data source chunk by chunk instead of in a single go. We will process a large CSV file of Twitter data in the same way that we processed ‘tweets.csv’ in the Bringing it all together exercises of the prequel course, but this time working in chunks of 10 entries at a time.

# Import pandas to read the CSV file in chunks
import pandas as pd
# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv', chunksize=10):
    # Iterate over the column in DataFrame
    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1
# Print the populated dictionary
print(counts_dict)
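
As an aside, the if/else counting pattern above can be collapsed with dict.get(), which returns a default when a key is missing; a sketch of the equivalent loop:

# Equivalent counting logic using dict.get() with a default of 0
import pandas as pd
counts_dict = {}
for chunk in pd.read_csv('tweets.csv', chunksize=10):
    for entry in chunk['lang']:
        counts_dict[entry] = counts_dict.get(entry, 0) + 1
print(counts_dict)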

We now know how to deal with situations where we need to process a very large file, and that’s a very useful skill to have! It’s good to know how to process a file in smaller, more manageable chunks, but it can become very tedious to write and rewrite the same code for the same task each time, so let’s wrap the logic in a reusable function.

# Define count_entries()
def count_entries(csv_file,c_size,colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file,chunksize=c_size):
        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1
    # Return counts_dict
    return counts_dict
# Call count_entries(): result_counts
result_counts = count_entries('tweets.csv', 10, 'lang')
# Print result_counts
print(result_counts)

Writing list comprehensions

List comprehensions can be built over iterables. They can replace a for loop in a single line of code. Our job in this exercise is to write a list comprehension that produces a list of the squares of the numbers ranging from 0 to 9.

# Create list comprehension: squares
squares = [i * i for i in range(0, 10)]
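
For comparison, here is the equivalent for loop that the comprehension replaces:

# Equivalent for-loop version of the comprehension above
squares_loop = []
for i in range(0, 10):
    squares_loop.append(i * i)
print(squares_loop == squares)  # True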

Nested list comprehensions

Great! We have a good grasp of the basic syntax of list comprehensions. Let’s push our code-writing skills a little further: in this exercise, we will write a list comprehension within another list comprehension, also known as nested list comprehensions. It sounds a little tricky, so let’s step away from strings for a while. One of the ways lists can be used is to represent multi-dimensional objects such as matrices. Matrices can be represented as a list of lists in Python. For example, a 5 x 5 matrix with values 0 to 4 in each row can be written as:

matrix = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]

Our task is to recreate this matrix using nested list comprehensions. Recall that we can create one of the rows of the matrix with a single list comprehension. To create the list of lists, we simply supply a list comprehension as the output expression of the overall list comprehension:

[[output expression] for iterator variable in iterable]

Note that here, the output expression is itself a list comprehension.

# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(0,5)] for row in range(0,5)]
# Print the matrix
for row in matrix:
    print(row)
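
Going the other way, flattening the matrix back into one list, is also a comprehension, but with both for clauses written left to right at the end instead of nested in the output expression:

# Flatten the matrix back into a single list; note the order of the for clauses
flat = [col for row in matrix for col in row]
print(flat[:7])  # [0, 1, 2, 3, 4, 0, 1]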

Using conditionals in comprehensions (1)

We’ve been using list comprehensions to build lists of values, sometimes using operations to create these values. An interesting mechanism in list comprehensions is that we can also create lists with values that meet only a certain condition. One way of doing this is by using conditionals on iterator variables.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member for member in fellowship if len(member)>=7]
# Print the new list
print(new_fellowship)

Using conditionals in comprehensions (2)

We used an if conditional statement in the predicate expression part of a list comprehension to filter on the iterator variable. In this exercise, you will use an if-else statement on the output expression of the list comprehension instead.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create list comprehension: new_fellowship
new_fellowship = [member if len(member)>=7 else '' for member in fellowship]
# Print the new list
print(new_fellowship)

Dict comprehensions

Comprehensions aren’t relegated merely to the world of lists. There are many other objects you can build using comprehensions, such as dictionaries, which are pervasive in data science. We will create a dictionary using comprehension syntax in this exercise; in this case, the comprehension is called a dict comprehension. Recall that the main difference between a list comprehension and a dict comprehension is the use of curly braces {} instead of brackets []. Additionally, members of the dictionary are created using a colon, as in key: value.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Create dict comprehension: new_fellowship
new_fellowship = {member:len(member) for member in fellowship}
# Print the new dictionary
print(new_fellowship)
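
Dict comprehensions accept conditions with the same syntax as list comprehensions; a quick sketch filtering the same names:

# Dict comprehension with a condition on the iterator variable
long_names = {member: len(member) for member in fellowship if len(member) >= 7}
print(long_names)  # {'samwise': 7, 'aragorn': 7, 'legolas': 7, 'boromir': 7}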

Write your own generator expressions

Recall that generator expressions have essentially the same syntax as list comprehensions, except that they use parentheses () instead of brackets []; this should make things feel familiar! Furthermore, if you have ever iterated over a dictionary with .items() or used the range() function, you have already encountered and used generators without knowing it: Python creates them for you behind the scenes. Now, we will start simple by creating a generator object that produces numeric values.

# Create generator object: result
result = (num for num in range(31))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)
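
The payoff of a generator expression over a list comprehension is memory: the generator stores no values, only the recipe for producing them. A quick sketch comparing object sizes (exact byte counts vary by Python version):

import sys
# A list of a million squares stores every value; the generator stores none
squares_list = [i * i for i in range(10 ** 6)]
squares_gen = (i * i for i in range(10 ** 6))
print(sys.getsizeof(squares_list))  # on the order of megabytes
print(sys.getsizeof(squares_gen))   # on the order of a hundred bytes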

Great! At this point, we already know how to write a basic generator expression. Now we will push this idea a little further by adding to the output expression of a generator expression. Because generator expressions and list comprehensions are so alike in syntax, this should be a familiar task for you!

# Create a list of strings: lannister
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

# Iterate over and print the values in lengths
for value in lengths:
    print(value)

We’ve dealt mainly with writing generator expressions, which use comprehension syntax. Being able to use comprehension syntax for generator expressions made the work so much easier! Now, recall that there are generator functions as well. Generator functions are functions that, like generator expressions, yield a series of values instead of returning a single value. A generator function is defined like a regular function, but whenever it generates a value, it uses the keyword yield instead of return.

# Create a list of strings
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']
# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""
    # Yield the length of a string
    for person in input_list:
        yield len(person)
# Print the values generated by get_lengths()
for value in get_lengths(lannister):
    print(value)
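
Because a generator function only produces values on demand, it can even describe an infinite sequence; the caller simply stops asking. A minimal sketch, using itertools.islice to take a bounded slice:

from itertools import islice

def count_up(start=0):
    """Yield integers from start, forever."""
    while True:
        yield start
        start += 1

# islice lazily takes the first five values from the infinite generator
print(list(islice(count_up(), 5)))  # [0, 1, 2, 3, 4]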

Dictionaries for data science

The following lists are extracted from a larger dataset of world development indicators from the World Bank. For pedagogical purposes, we have pre-processed this dataset into the lists that we’ll be working with. The first list, feature_names, contains the header names of the dataset, and the second list, row_vals, contains the actual values of one row from the dataset, corresponding to each of the header names.

# feature_names and row_vals are pre-loaded in the original exercise; sample values shown here
feature_names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']
row_vals = ['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298']
# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)
# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)
# Print the dictionary
print(rs_dict)

Writing a function to help you

Suppose we need to repeat the process from the previous exercise for many, many rows of data. Rewriting the code again and again would be tedious, repetitive, and unmaintainable, so we will create a function to house it and make things more concise. This way, we only need to call the function and supply the appropriate lists to create our dictionaries!

# Define lists2dict()
def lists2dict(list1,list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""
    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)
    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)
    # Return the dictionary
    return rs_dict
# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names,row_vals)
# Print rs_fxn
print(rs_fxn)

Using a list comprehension

This time, we’re going to use the lists2dict() function we defined to turn a bunch of lists into a list of dictionaries with the help of a list comprehension.

# row_lists, a list of rows of values, is pre-loaded in the original exercise
# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names,sublist) for sublist in row_lists]
# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

Turning this all into a DataFrame

We’ve zipped lists together, created a function to house our code, and even used the function in a list comprehension to generate a list of dictionaries. That was a lot of work, and we did a great job! We will now use all of these to convert the list of dictionaries into a pandas DataFrame. We will see how convenient it is to generate a DataFrame from dictionaries with the DataFrame() function from the pandas package.

# Import the pandas package
import pandas as pd
# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]
# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)
# Print the head of the DataFrame
print(df.head())

Processing data in chunks

Sometimes, data sources can be so large in size that storing the entire dataset in memory becomes too resource-intensive. Here we will process the first 1000 rows of a file line by line, to create a dictionary of the counts of how many times each country appears in a column in the dataset.

# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Skip the column names
    file.readline()
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}
    # Process only the first 1000 rows
    for j in range(1000):
        # Split the current line into a list: line
        line = file.readline().split(',')
        # Get the value for the first column: first_col
        first_col = line[0]
        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)

We processed a file line by line for a given number of lines. What if, however, we want to do this for the entire file? In this case, it would be useful to use generators. Generators allow us to lazily evaluate data. This concept of lazy evaluation is useful when dealing with very large datasets because it lets us generate values efficiently, yielding only one piece of data at a time instead of the whole thing at once. We will define a generator function read_large_file() that produces a generator object yielding a single line from a file each time next() is called on it. Note that when we open a connection to a file, the resulting file object already behaves this way! So out in the wild, we won’t have to explicitly create generator objects in cases such as this. For pedagogical reasons, however, we will practice how to do it here with the read_large_file() function.

# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    # Loop indefinitely until the end of the file
    while True:
        # Read a line from the file: data
        data = file_object.readline()
        # Break if this is the end of the file
        if not data:
            break
        # Yield the line of data
        yield data
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)
    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

Great! We’ve just created a generator function that we can use to help process large files. Now let’s use it to process the World Bank dataset as we did previously: line by line, building a dictionary of counts of how many times each country appears in a column of the dataset. This time, however, we won’t stop at 1000 rows; we’ll process the entire dataset!

# Initialize an empty dictionary: counts_dict
counts_dict = {}
# Open a connection to the file
with open('world_dev_ind.csv') as file:
    # Iterate over the generator from read_large_file()
    for line in read_large_file(file):
        row = line.split(',')
        first_col = row[0]
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1
# Print the resulting dictionary
print(counts_dict)

Writing an iterator to load data in chunks

Another way to read in data too large to store in memory is to read it in as DataFrame chunks of a certain length, say, 100. For example, with the pandas package (imported as pd), we can do pd.read_csv(filename, chunksize=100). This creates a reader object that is an iterator, which means that we can call next() on it. We will read a file in small DataFrame chunks with read_csv(), using the World Bank Indicators data ‘ind_pop.csv’ to look at the urban population indicator for numerous countries and years.

# Import the pandas package
import pandas as pd
# Initialize reader object: df_reader
df_reader = pd.read_csv('ind_pop.csv',chunksize=10)
# Print two chunks
print(next(df_reader))
print(next(df_reader))

We used read_csv() to read in DataFrame chunks from a large dataset. Now we will read in a file using a bigger DataFrame chunk size and then process the data from the first chunk. To process the data, we will create another DataFrame composed of only the rows from a specific country. We will then zip together two of the columns from the new DataFrame, ‘Total Population’ and ‘Urban population (% of total)’. Finally, we will create a list of tuples from the zip object, where each tuple is composed of a value from each of the two columns mentioned.

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out the head of the DataFrame
print(df_urb_pop.head())
# Check out specific country: df_pop_ceb
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode']=='CEB']
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'], df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Print pops_list
print(pops_list)

We’re getting used to reading and processing data in chunks by now. Let’s push our skills a little further by adding a column to a DataFrame. Starting from the code of the previous exercise, we will use a list comprehension to create the values for a new column, ‘Total Urban Population’, from the list of tuples we generated earlier. Recall that the first and second elements of each tuple are, respectively, values from the columns ‘Total Population’ and ‘Urban population (% of total)’. The values in the new column are therefore the product of the first and second element in each tuple. Furthermore, because the second element is a percentage, we need to divide the entire result by 100, or alternatively, multiply it by 0.01. We will also plot the data from this new column to create a visualization of the urban population data.

# Import pandas and matplotlib for reading and plotting
import pandas as pd
import matplotlib.pyplot as plt
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Get the first DataFrame chunk: df_urb_pop
df_urb_pop = next(urb_pop_reader)
# Check out specific country: df_pop_ceb (.copy() avoids pandas' SettingWithCopyWarning
# when the new column is added below)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'].copy()
# Zip DataFrame columns of interest: pops
pops = zip(df_pop_ceb['Total Population'],
            df_pop_ceb['Urban population (% of total)'])
# Turn zip object into list: pops_list
pops_list = list(pops)
# Use list comprehension to create new DataFrame column 'Total Urban Population'
df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
# Plot urban population data
df_pop_ceb.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
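
As a design note, the zip-and-comprehension step above can be replaced by vectorized pandas arithmetic, which is shorter and typically faster; a sketch, assuming both columns are numeric:

# Vectorized alternative to the zip/list-comprehension approach above
df_pop_ceb['Total Urban Population'] = (
    df_pop_ceb['Total Population']
    * df_pop_ceb['Urban population (% of total)']
    * 0.01
).astype(int)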

We’ve only processed the data from the first DataFrame chunk. This time, we will aggregate the results over all the DataFrame chunks in the dataset. This basically means we will be processing the entire dataset now. This is neat because we’re going to be able to process the entire large dataset by just working on smaller pieces of it! We’re going to use the data from ‘ind_pop_data.csv’, available in our current directory.

# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)
# Initialize empty DataFrame: data
data = pd.DataFrame()
# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:
    # Check out specific country: df_pop_ceb (.copy() avoids SettingWithCopyWarning)
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'].copy()
    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
                df_pop_ceb['Urban population (% of total)'])
    # Turn zip object into list: pops_list
    pops_list = list(pops)
    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

    # Append DataFrame chunk to data (DataFrame.append was removed in pandas 2.0)
    data = pd.concat([data, df_pop_ceb])
# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()
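
Concatenating inside the loop copies the accumulated DataFrame on every iteration. A common alternative is to collect the processed chunks in a list and concatenate once at the end; a sketch of that pattern:

# Collect processed chunks in a list, then concatenate once
import pandas as pd
chunks = []
for df_urb_pop in pd.read_csv('ind_pop_data.csv', chunksize=1000):
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'].copy()
    df_pop_ceb['Total Urban Population'] = (
        df_pop_ceb['Total Population']
        * df_pop_ceb['Urban population (% of total)']
        * 0.01
    ).astype(int)
    chunks.append(df_pop_ceb)
data = pd.concat(chunks)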

This is the last leg. We’ve learned a lot about processing a large dataset in chunks. Now we will put all the code for processing the data into a single function so that we can reuse it without having to rewrite the same things all over again. We’re going to define the function plot_pop(), which takes two arguments: the filename of the file to be processed, and the country code of the rows we want to process. Because all of the code from the previous exercises is housed in plot_pop(), calling the function does all of the following:

  • Loading the file chunk by chunk,
  • Creating the new column of urban population values, and
  • Plotting the urban population data.

That’s a lot of work, but the function now makes it convenient to repeat the same process for whatever file and country code we want to process and visualize! After we are done, take a moment to look at the plots and reflect on the new skills we have acquired. The journey doesn’t end here! We can continue exploring with the pre-processed version of the dataset available on Kaggle.
# Define plot_pop()
def plot_pop(filename, country_code):
    """Process filename in chunks and plot the total urban
    population for the given country code."""
    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)
    # Initialize empty DataFrame: data
    data = pd.DataFrame()
    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country: df_pop_ceb (.copy() avoids SettingWithCopyWarning)
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code].copy()
        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                    df_pop_ceb['Urban population (% of total)'])
        # Turn zip object into list: pops_list
        pops_list = list(pops)
        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        # (multiply by 0.01 because the second tuple element is a percentage)
        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]
        # Append DataFrame chunk to data (DataFrame.append was removed in pandas 2.0)
        data = pd.concat([data, df_pop_ceb])
    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()
# Set the filename: fn
fn = 'ind_pop_data.csv'
# Call plot_pop for country code 'CEB'
plot_pop(fn,'CEB')
# Call plot_pop for country code 'ARB'
plot_pop(fn,'ARB')
