Data Science Toolbox (1)


Python and its library ecosystem offer a great many useful functions. However, as data scientists, we will constantly need to write our own functions to solve problems dictated by our data. The art of function writing is what we will learn in this first Python Data Science Toolbox course. We will write our own custom functions, complete with multiple parameters and multiple return values, along with default arguments and variable-length arguments. We will gain insight into scoping in Python, write lambda functions, and handle errors in our own functions. On top of this, we will wrap up each topic by using these skills to write functions that analyze Twitter DataFrames and generalize to broader data science contexts.

Write a Simple Function

Function bodies need to be indented by a consistent number of spaces and the choice of 4 is common.

# Define the function shout
def shout():
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = 'congratulations' + '!!!'

    # Print shout_word
    print(shout_word)

# Call shout
shout()

Single-parameter functions

You have successfully defined and called your own function! That’s pretty cool. We defined and called the function shout(), which printed out a string concatenated with ‘!!!’. We will now update shout() by adding a parameter so that it can accept and process any string argument passed to it. Also note that shout(word), the part of the header that specifies the function name and parameter(s), is known as the signature of the function. You may encounter this term in the wild!

# Define shout with the parameter, word
def shout(word):
    """Print a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word = word + '!!!'

    # Print shout_word
    print(shout_word)
# Call shout with the string 'congratulations'
shout('congratulations')

Functions that return single values

Now we try our hand at another modification to the shout() function so that it returns a single value instead of printing within the function. Recall that the return keyword lets us return values from functions. Returning values is generally more desirable than printing them because a function that only prints returns None: assigning the result of such a call (or of print() itself) to a variable gives a value of type NoneType, which cannot be reused later.

# Define shout with the parameter, word
def shout(word):
    """Return a string with three exclamation marks"""
    # Concatenate the strings: shout_word
    shout_word= word+'!!!'

    # Replace print with return
    return shout_word

# Pass 'congratulations' to shout: yell
yell=shout('congratulations')

# Print yell
print(yell)
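To make the difference between printing and returning concrete, here is a minimal sketch. The names shout_print and shout_return are hypothetical and exist only for this comparison.

```python
def shout_print(word):
    """Print the shouted word; implicitly returns None."""
    print(word + '!!!')


def shout_return(word):
    """Return the shouted word so the caller can reuse it."""
    return word + '!!!'


printed = shout_print('congratulations')    # prints, but hands back None
returned = shout_return('congratulations')  # hands back the string

print(type(printed))  # <class 'NoneType'>
print(returned)       # congratulations!!!
```

Only the returned value can be stored, concatenated further, or passed to another function.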

Functions with Multiple Parameters

We are now going to use what we have learned to modify the shout() function further. Here, we will modify shout() to accept two arguments.

# Define shout with parameters word1 and word2
def shout(word1, word2):
    """Concatenate strings with three exclamation marks"""
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + "!!!"

    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'

    # Concatenate shout1 with shout2: new_shout
    new_shout = shout1 + shout2

    # Return new_shout
    return new_shout

# Pass 'congratulations' and 'you' to shout(): yell
yell = shout('congratulations', 'you')

# Print yell
print(yell)

Tuples

A tuple is an ordered collection of Python objects separated by commas. Like a list, a tuple supports indexing, nesting, and repetition, but a tuple is immutable whereas a list is mutable. Here, we will practice how to construct, unpack, and access tuple elements.

nums = (2, 4, 6)
# Unpack nums into num1, num2, and num3
num1, num2, num3 = nums

# Construct even_nums
even_nums = (2, num2, num3)
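The immutability mentioned above can be checked directly. The sketch below tries to assign to a tuple element, which Python rejects with a TypeError, and then makes the same change on a list copy. The names error_message and nums_list are introduced here for illustration only.

```python
nums = (2, 4, 6)

# Tuples are immutable: item assignment raises TypeError
try:
    nums[0] = 3
except TypeError as err:
    error_message = str(err)
    print(error_message)

# A list holding the same values can be changed in place
nums_list = list(nums)
nums_list[0] = 3

print(nums)       # (2, 4, 6)
print(nums_list)  # [3, 4, 6]
```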

Functions that Return Multiple Values

In the previous exercise, we constructed tuples, assigned tuples to variables, and unpacked tuples. Here we will return multiple values from a function using tuples. Let’s now update our shout() function to return multiple values. Instead of returning just one string, we will return two strings with the string !!! concatenated to each. Note that the return statement return x, y has the same result as return (x, y): the former actually packs x and y into a tuple under the hood!

# Define shout_all with parameters word1 and word2
def shout_all(word1, word2):
    # Concatenate word1 with '!!!': shout1
    shout1 = word1 + '!!!'
    # Concatenate word2 with '!!!': shout2
    shout2 = word2 + '!!!'
    # Construct a tuple with shout1 and shout2: shout_words
    shout_words = (shout1, shout2)
    # Return shout_words
    return shout_words
# Pass 'congratulations' and 'you' to shout_all(): yell1, yell2
yell1, yell2 = shout_all('congratulations','you')
# Print yell1 and yell2
print(yell1)
print(yell2)

We’ve had our first taste of writing our own functions in the previous exercises. We’ve learned how to add parameters to function definitions, return a value or multiple values with tuples, and call the functions we’ve defined. Now we will bring these concepts together and apply them to a simple data science problem. We will load a dataset and develop functionality to extract simple insights from the data. The first goal is to recall how to load a dataset into a DataFrame. The dataset contains Twitter data, and we will iterate over the entries in a column to build a dictionary in which the keys are the names of languages and the values are the number of tweets in each language.

# Import pandas
import pandas as pd
# Import Twitter data as dataframe: df
df = pd.read_csv('tweets.csv')
# Initialize an empty dictionary: langs_count
langs_count = {}
# Extract column from DataFrame: col
col = df['lang']

# Iterate over lang column in df
for entry in col:
    # If the language is in langs_count, add 1
    if entry in langs_count.keys():
        langs_count[entry]+=1
    # Else add the language to langs_count, set the value to 1
    else:
        langs_count[entry]= 1

# Print the populated dictionary
print(langs_count)
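As a cross-check on the counting loop, the standard library's collections.Counter does the same bookkeeping. This is a sketch on a plain list rather than a DataFrame column, and the sample values in langs are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the values of df['lang']
langs = ['en', 'en', 'et', 'und', 'en', 'et']

# Manual loop, following the same logic as in the text
langs_count = {}
for entry in langs:
    if entry in langs_count:
        langs_count[entry] += 1
    else:
        langs_count[entry] = 1

# Counter builds the same mapping in one call
counter_result = dict(Counter(langs))

print(langs_count)                    # {'en': 3, 'et': 2, 'und': 1}
print(langs_count == counter_result)  # True
```

Writing the loop by hand is still the point of the exercise; Counter is only a convenient way to verify the result.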

Great job! We’ve now defined the functionality for iterating over entries in a column and building a dictionary whose keys are the names of languages and whose values are the number of tweets in each language. Now we will wrap that functionality in a function, return the resulting dictionary from within the function, and call the function with the appropriate arguments.

# Define count_entries()
def count_entries(df,col_name):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: langs_count
    langs_count = {}  
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over lang column in dataframe
    for entry in col:
        # If the language is in langs_count, add 1
        if entry in langs_count.keys():
            langs_count[entry]+=1
        # Else add the language to langs_count, set the value to 1
        else:
            langs_count[entry]=1
    # Return the langs_count dictionary
    return langs_count
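# Load the Twitter data (pandas is imported as pd above): tweets_df
tweets_df = pd.read_csv('tweets.csv')
# Call count_entries(): result
result = count_entries(tweets_df, 'lang')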
# Print the result
print(result)

The keyword global

We will use the keyword global within a function to alter the value of a variable defined in the global scope.

# Create a string: team
team = "teen titans"
# Define change_team()
def change_team():
    """Change the value of the global variable team."""
    # Use team in global scope
    global team
    # Change the value of team in global: team
    team = "justice league"
# Print team
print(team)
# Call change_team()
change_team()
# Print team
print(team)
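For contrast, here is a sketch of what happens without the global keyword: the assignment inside the function creates a new local variable and leaves the outer one untouched. The function name change_team_locally is hypothetical.

```python
team = "teen titans"


def change_team_locally():
    """Assign to team without `global`: this only creates a local variable."""
    team = "justice league"
    return team


local_value = change_team_locally()

print(local_value)  # justice league
print(team)         # teen titans (the outer variable is unchanged)
```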

Nested Functions

One reason to use nested functions is to avoid writing out the same computation repeatedly within a function. There’s nothing new about defining a nested function: we simply define it as we would a regular function with def and embed it inside another function!

In this exercise, inside a function three_shouts(), we will define a nested function inner() that concatenates a string object with !!!. three_shouts() then returns a tuple of three elements, each a string concatenated with !!! using inner().

# Define three_shouts
def three_shouts(word1, word2, word3):
    """Returns a tuple of strings
    concatenated with '!!!'."""

    # Define inner
    def inner(word):
        """Returns a string concatenated with '!!!'."""
        return word + '!!!'

    # Return a tuple of strings
    return (inner(word1),inner(word2), inner(word3))

# Call three_shouts() and print
print(three_shouts('a', 'b', 'c'))

Great job, we have just nested a function within another function. One other pretty cool reason for nesting functions is the idea of a closure. This means that the nested or inner function remembers the state of its enclosing scope when called. Thus, anything defined locally in the enclosing scope is available to the inner function even when the outer function has finished execution.

Let’s move forward then! Here we define the inner function inner_echo() and then call echo() a couple of times, each with a different argument, to see what the output will be.

# Define echo
def echo(n):
    """Return the inner_echo function."""

    # Define inner_echo
    def inner_echo(word1):
        """Concatenate n copies of word1."""
        echo_word = word1 * n
        return echo_word

    # Return inner_echo
    return inner_echo

# Call echo: twice
twice = echo(2)

# Call echo: thrice
thrice = echo(3)

# Call twice() and thrice() then print
print(twice('hello'), thrice('hello'))
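One way to see the closure at work is to inspect what each returned function has captured. The sketch below repeats a condensed echo() and reads the captured value through the function's `__closure__` attribute, which holds one cell per captured variable (here only n).

```python
def echo(n):
    """Return a function that repeats a word n times."""
    def inner_echo(word1):
        return word1 * n
    return inner_echo


twice = echo(2)
thrice = echo(3)

# Each returned function carries its own captured value of n
print(twice.__closure__[0].cell_contents)   # 2
print(thrice.__closure__[0].cell_contents)  # 3
print(twice('hi'), thrice('hi'))            # hihi hihihi
```

Both inner functions stay usable after echo() has returned because each one keeps its own reference to n.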

The keyword nonlocal and nested functions

Let’s work further on our mastery of scope! In this exercise, we will use the keyword nonlocal within a nested function to alter the value of a variable defined in the enclosing scope.

# Define echo_shout()
def echo_shout(word):
    """Change the value of a nonlocal variable"""
    # Concatenate word with itself: echo_word
    echo_word = word + word
    # Print echo_word
    print(echo_word)
    # Define inner function shout()
    def shout():
        """Alter a variable in the enclosing scope"""
        # Use echo_word in nonlocal scope
        nonlocal echo_word
        # Change echo_word to echo_word concatenated with '!!!'
        echo_word = echo_word + '!!!'
    # Call function shout()
    shout()
    # Print echo_word
    print(echo_word)
# Call function echo_shout() with argument 'hello'
echo_shout('hello')

Functions with one default argument

We’ve learned to define functions with more than one parameter and to call those functions by passing the required number of arguments. Functions can also be defined with default arguments. We will practice that skill in this exercise by writing a function that uses a default argument and then calling the function a couple of times.

# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
     exclamation marks at the end of the string."""

    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo

    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'

    # Return shout_word
    return shout_word

# Call shout_echo() with "Hey": no_echo
no_echo = shout_echo("Hey")

# Call shout_echo() with "Hey" and echo=5: with_echo
with_echo = shout_echo("Hey",echo=5)

# Print no_echo and with_echo
print(no_echo)
print(with_echo)

Functions with multiple default arguments

We’ve now defined a function that uses a default argument, but let’s not stop there! We will now try our hand at defining a function with more than one default argument and then calling this function in various ways. After defining the function, we will call it by supplying values to all the default arguments. Additionally, we will call the function without passing a value to one of the default arguments to see how that changes the output.

# Define shout_echo
def shout_echo(word1,echo=1, intense=False):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""

    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo

    # Capitalize echo_word if intense is True
    if intense is True:
        # Capitalize and concatenate '!!!': echo_word_new
        echo_word_new = echo_word.upper() + '!!!'
    else:
        # Concatenate '!!!' to echo_word: echo_word_new
        echo_word_new = echo_word + '!!!'

    # Return echo_word_new
    return echo_word_new

# Call shout_echo() with "Hey", echo=5 and intense=True: with_big_echo
with_big_echo = shout_echo("Hey",5,True)

# Call shout_echo() with "Hey" and intense=True: big_no_echo
big_no_echo =  shout_echo("Hey",intense=True)

# Print values
print(with_big_echo)
print(big_no_echo)

Functions with variable-length arguments (*args)

Flexible arguments enable you to pass a variable number of arguments to a function. Now we will practice defining a function that accepts a variable number of string arguments. The function we will define is gibberish() which can accept a variable number of string values. Its return value is a single string composed of all the string arguments concatenated together in the order they were passed to the function call. We will call the function with a single string argument and see how the output changes with another call using more than one string argument. Remember within the function definition, args is a tuple.

# Define gibberish
def gibberish(*args):
    """Concatenate strings in *args together."""
    # Initialize an empty string: hodgepodge
    hodgepodge=""
    # Concatenate the strings in args
    for word in args:
        hodgepodge += word
    # Return hodgepodge
    return hodgepodge
# Call gibberish() with one string: one_word
one_word = gibberish("luke")
# Call gibberish() with five strings: many_words
many_words = gibberish("luke", "leia", "han", "obi", "darth")
# Print one_word and many_words
print(one_word)
print(many_words)
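The claim that args is a tuple is easy to verify. The helper describe_args below is hypothetical; it reports the type and length of whatever positional arguments it collects, including when a list is spread into separate arguments with *.

```python
def describe_args(*args):
    """Return the type name and length of the collected positional arguments."""
    return type(args).__name__, len(args)


print(describe_args())                # ('tuple', 0)
print(describe_args('luke', 'leia'))  # ('tuple', 2)

# A list can be spread into separate positional arguments with *
names = ['han', 'obi', 'darth']
print(describe_args(*names))          # ('tuple', 3)
```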

Functions with variable-length keyword arguments (**kwargs)

Let’s push further on what we’ve learned about flexible arguments: we’ve used *args, and we’re now going to use **kwargs! What makes **kwargs different is that it allows us to pass a variable number of keyword arguments to a function. Remember that within the function definition, kwargs is a dictionary.

# Define report_status
def report_status(**kwargs):
    """Print out the status of a movie character."""
    print("\nBEGIN: REPORT\n")
    # Iterate over the key-value pairs of kwargs
    for key,value in kwargs.items():
        # Print out the keys and values, separated by a colon ':'
        print(key + ": " + value)
    print("\nEND REPORT")
# First call to report_status()
report_status(name="luke", affiliation="jedi",status="missing")
# Second call to report_status()
report_status(name="anakin", affiliation= "sith lord", status="deceased")
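Because kwargs is an ordinary dictionary, its values are not required to be strings, and concatenating with + would fail for a number. The sketch below uses a hypothetical helper, format_status, that builds the lines with f-strings so any value type can be formatted.

```python
def format_status(**kwargs):
    """Return one 'key: value' line per keyword argument."""
    # kwargs is a plain dict holding the keyword arguments in call order
    return [f"{key}: {value}" for key, value in kwargs.items()]


lines = format_status(name="luke", age=23)

print(lines)  # ['name: luke', 'age: 23']
```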

Bringing it all together (1)

In this exercise, we will generalize the Twitter language analysis that we did in the previous chapter. We will do that by including a default argument that takes a column name. For convenience, pandas has been imported as pd and the ‘tweets.csv’ file has been imported into the DataFrame tweets_df.

# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over the column in dataframe
    for entry in col:
        # If entry is in cols_count, add 1
        if entry in cols_count.keys():
            cols_count[entry] += 1
        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df,col_name='lang')
# Call count_entries(): result2
result2 = count_entries(tweets_df,col_name='source')
# Print result1 and result2
print(result1)
print(result2)

We’ve just generalized our Twitter language analysis to include a default argument for the column name. We’re now going to generalize this function one step further by allowing the user to pass it a flexible argument, that is, as many column names as the user would like!

# Define count_entries()
def count_entries(df,*args):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    #Initialize an empty dictionary: cols_count
    cols_count = {}
    # Iterate over column names in args
    for col_name in args:
        # Extract column from DataFrame: col
        col = df[col_name]
        # Iterate over the column in dataframe
        for entry in col:
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df,"lang")
# Call count_entries(): result2
result2 = count_entries(tweets_df,"lang","source")
# Print result1 and result2
print(result1)
print(result2)

Writing a lambda function

Some function definitions are simple enough that they can be converted to a lambda function. By doing this, we write fewer lines of code, which can come in handy when we’re writing and maintaining big programs. We will use what we know about lambda functions to convert a function that does a simple task into a lambda function. Take a look at this function definition:

def echo_word(word1, echo):
    """Concatenate echo copies of word1."""
    words = word1 * echo
    return words

The function echo_word takes 2 parameters: a string value, word1 and an integer value, echo. It returns a string that is a concatenation of echo copies of word1. Our task is to convert this simple function into a lambda function.

# Define echo_word as a lambda function: echo_word
echo_word = (lambda word1,echo: word1*echo)
# Call echo_word: result
result = echo_word('hey',5)
# Print result
print(result)

Map() and lambda functions

So far, we’ve used lambda functions to write short, simple functions as well as to redefine functions with simple functionality. The best use case for lambda functions, however, is when we want this simple functionality to be anonymously embedded within a larger expression. This means that the function is not bound to a name in the environment, unlike a function defined with def. To understand this idea better, we will use a lambda function in the context of the map() function. map() applies a function to every element of an object, such as a list. Here, we can use lambda functions to define the function that map() will use to process the object. For example:

nums = [2, 4, 6, 8, 10]
result = map(lambda a: a ** 2, nums)

We can see here that a lambda function, which raises a value a to the power of 2, is passed to map() alongside a list of numbers, nums. The map object that results from the call to map() is stored in result.

# Create a list of strings: spells
spells = ["protego", "accio", "expecto patronum", "legilimens"]
# Use map() to apply a lambda function over spells: shout_spells
shout_spells = map(lambda item:item+'!!!', spells)
# Convert shout_spells to a list: shout_spells_list
shout_spells_list= list(shout_spells)
# Print shout_spells_list
print(shout_spells_list)
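The reason for the list() conversion is that map() returns a lazy iterator, not a list. A short sketch of the consequence: the map object can be consumed only once. The names first_pass and second_pass are introduced here for illustration.

```python
nums = [2, 4, 6, 8, 10]
result = map(lambda a: a ** 2, nums)

# Printing the map object shows its repr, not the squared values
print(result)

first_pass = list(result)   # consumes the iterator
second_pass = list(result)  # nothing left to consume

print(first_pass)   # [4, 16, 36, 64, 100]
print(second_pass)  # []
```

If the values are needed more than once, store the list rather than the map object.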

Filter() and lambda functions

We used lambda functions to anonymously embed an operation within map(). We will practice this again by using a lambda function with filter(), which may be new to you! The function filter() offers a way to filter out elements from a list that don’t satisfy certain criteria.

# Create a list of strings: fellowship
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']
# Use filter() to apply a lambda function over fellowship: result
result = filter(lambda member: len(member)>6, fellowship)
# Convert result to a list: result_list
result_list=list(result)
# Print result_list
print(result_list)
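For comparison, the same selection can be written as a list comprehension. This sketch shows that both forms keep exactly the names longer than six characters; the variable names filtered and comprehended are introduced here.

```python
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# filter() keeps the members for which the lambda returns True
filtered = list(filter(lambda member: len(member) > 6, fellowship))

# The same selection written as a list comprehension
comprehended = [member for member in fellowship if len(member) > 6]

print(filtered)                  # ['samwise', 'aragorn', 'legolas', 'boromir']
print(filtered == comprehended)  # True
```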

Reduce() and lambda functions

We’re getting very good at using lambda functions! Here’s one more function to add to our repertoire of skills. The reduce() function is useful for performing some computation on a list and, unlike map() and filter(), returns a single value as a result. To use reduce(), we must import it from the functools module. Remember gibberish() from a few exercises back?

# Define gibberish
def gibberish(*args):
    """Concatenate strings in *args together."""
    hodgepodge = ''
    for word in args:
        hodgepodge += word
    return hodgepodge

gibberish() simply takes any number of string arguments and returns, as a single-value result, the concatenation of all of these strings. In this exercise, we will replicate this functionality by using reduce() and a lambda function that concatenates the strings in a list.

# Import reduce from functools
from functools import reduce
# Create a list of strings: stark
stark = ['robb', 'sansa', 'arya', 'eddard', 'jon']
# Use reduce() to apply a lambda function over stark: result
result = reduce(lambda item1,item2: item1+item2, stark)
# Print the result
print(result)
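One detail worth knowing is how reduce() behaves on an empty list. Without a starting value it raises a TypeError, while an optional third argument supplies an initial value. The sketch below shows both cases; the names joined and empty_joined are introduced here.

```python
from functools import reduce

stark = ['robb', 'sansa', 'arya', 'eddard', 'jon']

joined = reduce(lambda item1, item2: item1 + item2, stark)
print(joined)  # robbsansaaryaeddardjon

# With an empty list and no initial value, reduce() raises TypeError
try:
    reduce(lambda item1, item2: item1 + item2, [])
except TypeError:
    print('empty input needs an initial value')

# Passing '' as the third argument gives reduce() a starting value
empty_joined = reduce(lambda item1, item2: item1 + item2, [], '')
print(repr(empty_joined))  # ''
```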

Error handling with try-except

Good practice in writing our own functions also means anticipating the ways in which other people (or we ourselves, by accident) might misuse them. We saw that the len() function handles arguments such as strings, lists, and tuples, but not ints: it raises an appropriate error with an error message when it encounters an invalid argument. One way to give our own functions similar behavior is exception handling with a try-except block.

# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Initialize empty strings: echo_word, shout_words
    echo_word=""
    shout_words=""
    # Add exception handling with try-except
    try:
        # Concatenate echo copies of word1 using *: echo_word
        echo_word = word1*echo
        # Concatenate '!!!' to echo_word: shout_words
        shout_words = echo_word+'!!!'
    except TypeError:
        # Print error message
        print("word1 must be a string and echo must be an integer.")
    # Return shout_words
    return shout_words
# Call shout_echo
shout_echo("particle", echo="accelerator")

Error handling by raising an error

Another way to deal with invalid input is to raise an error ourselves with the raise statement. The call to shout_echo() below uses valid argument values. To see how the raise statement works, change the value of the echo argument to a negative number.

# Define shout_echo
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 and three
    exclamation marks at the end of the string."""
    # Raise an error with raise
    if echo<0:
        raise ValueError('echo must be greater than or equal to 0')
    # Concatenate echo copies of word1 using *: echo_word
    echo_word = word1 * echo
    # Concatenate '!!!' to echo_word: shout_word
    shout_word = echo_word + '!!!'
    # Return shout_word
    return shout_word

# Call shout_echo
shout_echo("particle", echo=5)
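To see the raise statement fire without stopping the program, the call can be wrapped in try-except at the call site. This is a sketch with a condensed shout_echo(); the error message wording is chosen here to match the `echo < 0` check.

```python
def shout_echo(word1, echo=1):
    """Concatenate echo copies of word1 followed by '!!!'."""
    if echo < 0:
        raise ValueError('echo must be greater than or equal to 0')
    return word1 * echo + '!!!'


# A negative echo triggers the ValueError, which we catch and inspect
try:
    shout_echo("particle", echo=-2)
except ValueError as err:
    message = str(err)
    print(message)  # echo must be greater than or equal to 0

# A valid call returns normally
valid = shout_echo("particle", echo=2)
print(valid)  # particleparticle!!!
```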

Bringing it all together (1)

This is awesome! We have now learned how to write anonymous functions using lambda, how to pass lambda functions as arguments to other functions such as map(), filter(), and reduce(), and how to raise errors and output custom error messages within our functions. We will now put these skills to good use by working with a Twitter dataset. Before practicing error handling, we will write a lambda function and use filter() to select retweets, that is, tweets that begin with the string ‘RT’.

# Select retweets from the Twitter dataframe: result
result = filter(lambda x: x[0:2]=='RT',tweets_df['text'])
# Create list from filter object result: res_list
res_list=list(result)
# Print all retweets in res_list
for tweet in res_list:
    print(tweet)

Sometimes we make mistakes when calling functions, even ones we wrote ourselves. We will improve on our previous work with the count_entries() function by adding a try-except block to it. This allows the function to provide a helpful message when the user calls count_entries() with a column name that isn’t in the DataFrame.

# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Add try block
    try:
        # Extract column from DataFrame: col
        col = df[col_name]
        # Iterate over the column in dataframe
        for entry in col:
            # If entry is in cols_count, add 1
            if entry in cols_count.keys():
                cols_count[entry] += 1
            # Else add the entry to cols_count, set the value to 1
            else:
                cols_count[entry] = 1
        # Return the cols_count dictionary
        return cols_count
    # Add except block
    except KeyError:
        print('The dataframe does not have a ' + col_name + ' column.')
# Call count_entries(): result1
result1 = count_entries(tweets_df, 'lang')
# Print result1
print(result1)
# Call count_entries(): result2
result2 = count_entries(tweets_df, 'lang1')

We built on our function count_entries() to add a try-except block. This was so that users would get helpful messages when calling our count_entries() function and providing a column name that isn’t in the DataFrame. Now we’ll instead raise a ValueError in the case that the user provides a column name that isn’t in the DataFrame.

# Define count_entries()
def count_entries(df, col_name='lang'):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    # Raise a ValueError if col_name is NOT in DataFrame
    if col_name not in df.columns:
        raise ValueError('The DataFrame does not have a ' + col_name + ' column.')
    # Initialize an empty dictionary: cols_count
    cols_count = {}
    # Extract column from DataFrame: col
    col = df[col_name]
    # Iterate over the column in DataFrame
    for entry in col:
        # If entry is in cols_count, add 1
        if entry in cols_count.keys():
            cols_count[entry] += 1
        # Else add the entry to cols_count, set the value to 1
        else:
            cols_count[entry] = 1
    # Return the cols_count dictionary
    return cols_count
# Call count_entries(): result1
result1 = count_entries(tweets_df, "lang")
# Print result1
print(result1)
