Importing Data in Python (1)

15 minute read

Importing entire text files

We’ll be working with the file moby_dick.txt. It is a text file that contains the opening sentences of Moby Dick, one of the great American novels! Here we’ll get experience opening a text file, printing its contents to the shell and, finally, closing it.

# Open a file: file
file = open('moby_dick.txt', mode='r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)

Importing text files line by line

For large files, we may not want to print all of their content to the shell: we may wish to print only the first few lines. Enter the readline() method, which allows us to do exactly that. When a file called file is open, we can print its first line by executing file.readline(); executing the same command again prints the second line, and so on. We can also bind the variable file using a context manager construct: within the with block, file is bound to open('moby_dick.txt'), and the file is closed automatically once the block ends. To print the first few lines to the shell, all the code we need to execute is:

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
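
If we want to process a large file without bringing it all into memory at once, we can also iterate over the file object directly. Here is a minimal sketch of that pattern, again using moby_dick.txt:

# Iterate over the file line by line without reading it all at once
with open('moby_dick.txt') as file:
    for line in file:
        print(line, end='')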

Using NumPy to import flat files

We are now going to load the MNIST digit recognition dataset using the numpy function loadtxt(). The first argument will be the filename. The second will be the delimiter which, in this case, is a comma. More information about the MNIST dataset can be found on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.

# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=",")
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

What if there are rows, such as a header, that we don’t want to import? What if our file has a delimiter other than a comma? What if we only wish to import particular columns? np.loadtxt() takes a number of arguments that we’ll find useful: delimiter changes the delimiter that loadtxt() expects (for example, ‘,’ for comma-delimited and ‘\t’ for tab-delimited files); skiprows specifies how many rows (not indices) to skip; and usecols takes a list of the indices of the columns we wish to keep. The file we’ll be importing, digits_header.txt, has a header and is tab-delimited.

# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter="\t", skiprows=1, usecols=[0,2])
# Print data
print(data)

The file seaslug.txt has a text header consisting of strings and is tab-delimited. The data consist of the percentage of sea slug larvae that had metamorphosed in a given time period. Due to the header, if we try to import it as-is using np.loadtxt(), Python will throw a ValueError telling us that it could not convert string to float. There are two ways to deal with this: first, we can set the data type argument dtype equal to str (for string); alternatively, we can skip the first row using the skiprows argument, as we have seen before.

# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter="\t", dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

Much of the time we will need to import datasets that have different datatypes in different columns; one column may contain strings and another floats, for example. np.loadtxt() will throw an error when faced with such a file. Another function, np.genfromtxt(), can handle these structures: if we pass dtype=None to it, it will figure out what type each column should be. There is also np.recfromcsv(), which behaves similarly to np.genfromtxt() except that its default dtype is None.
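
For instance, here is a minimal sketch of np.genfromtxt() applied to the mixed-type titanic.csv file used below; dtype=None asks NumPy to infer the type of each column:

# Import numpy
import numpy as np
# Load the mixed-type file, letting genfromtxt infer the column types
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
# Print the inferred column names and the first record
print(data.dtype.names)
print(data[0])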

# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file, delimiter=',', names=True, dtype=None)
# Print out first three entries of d
print(d[:3])

Using pandas to import flat files as DataFrames

We were able to import flat files containing columns with different datatypes as numpy arrays. However, the DataFrame object in pandas is a more appropriate structure in which to store such data and, thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().

# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
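
The read_table() function mentioned above works the same way but defaults to a tab delimiter; a minimal sketch, assuming a hypothetical tab-delimited file called titanic_tab.txt:

# Read a tab-delimited file into a DataFrame (hypothetical filename)
df_tab = pd.read_table('titanic_tab.txt')
print(df_tab.head())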

We were able to import flat files into a pandas DataFrame. As a bonus, it is then straightforward to retrieve the corresponding numpy array using the attribute values.

# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = np.array(data.values)
# Print the datatype of data_array to the shell
print(type(data_array))

The pandas package is also great at dealing with many of the issues we will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN. We’re now going to import a slightly corrupted copy of the Titanic dataset, titanic_corrupt.txt, which contains comments after the character ‘#’, is tab-delimited, and encodes missing values as the string ‘Nothing’.

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

Loading a pickled file

There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If we want our files to be human readable, we may want to save them as text files in a clever manner. JSONs, which we will see later, are appropriate for Python dictionaries. However, if we merely want to be able to import them into Python, we can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream. We’ll import the pickle package, open a previously pickled data structure from a file and load it.

# Import pickle package
import pickle
# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
# Print d
print(d)
# Print datatype of d
print(type(d))
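
For completeness, here is a minimal sketch of how a file like data.pkl could have been created in the first place, by serializing a dictionary with pickle.dump(); the dictionary contents here are made up purely for illustration:

# Build an example dictionary and serialize it to a pickle file
example = {'country': 'Norway', 'population': 5.4, 'capital': 'Oslo'}
with open('data.pkl', 'wb') as file:
    pickle.dump(example, file)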

Listing sheets in Excel files

Whether we like it or not, any working data scientist will need to deal with Excel spreadsheets at some point in time. We won’t always want to do so in Excel, however! Here, we’ll learn how to use pandas to import Excel spreadsheets and how to list the names of the sheets in any loaded .xlsx file. We can retrieve a list of the sheet names using the attribute spreadsheet.sheet_names. Specifically, we’ll be loading and checking out the spreadsheet ‘battledeath.xlsx’, modified from the Peace Research Institute Oslo’s (PRIO) dataset. This data contains age-adjusted mortality rates due to war in various countries over several years.

# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = 'battledeath.xlsx'
# Load spreadsheet: xl
xl = pd.ExcelFile(file)
# Print sheet names
print(xl.sheet_names)

We saw that the Excel file contains two sheets, ‘2002’ and ‘2004’. The next step is to import these. We’ll learn how to import any given sheet of your loaded .xlsx file as a DataFrame. We’ll be able to do so by specifying either the sheet’s name or its index.

# Load a sheet into a DataFrame by name: df1
df1 = xl.parse('2004')
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xl.parse(0)
# Print the head of the DataFrame df2
print(df2.head())

We will parse our spreadsheets and use additional arguments to skip rows, rename columns and select only particular columns. As before, we’ll use the method parse(). This time, however, we’ll add the additional arguments skiprows, names and parse_cols. These skip rows, name the columns and designate which columns to parse, respectively. All these arguments can be assigned to lists containing the specific row numbers, strings and column numbers, as appropriate.

# Parse the first sheet and rename the columns: df1
df1 = xl.parse(0, skiprows=[0], names=['Country','AAM due to War (2002)'])
# Print the head of the DataFrame df1
print(df1.head())
# Parse the first column of the second sheet and rename the column: df2
df2 = xl.parse(1, parse_cols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
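
Note that in more recent pandas versions the parse_cols argument has been renamed usecols, so the last call may need to be written as follows:

# Equivalent call for newer pandas versions (parse_cols renamed to usecols)
df2 = xl.parse(1, usecols=[0], skiprows=[0], names=['Country'])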

Importing SAS files

We’ll figure out how to import a SAS file as a DataFrame using SAS7BDAT and pandas. The data are adapted from the website of the undergraduate textbook Principles of Econometrics by Hill, Griffiths and Lim.

# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
# Print head of DataFrame
print(df_sas.head())
# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()

Importing Stata files

Here, we’ll gain expertise in importing Stata files as DataFrames using the pd.read_stata() function from pandas.

# Import pandas
import pandas as pd
# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

Using h5py to import HDF5 files

In this exercise, we’ll import an HDF5 file containing LIGO gravitational-wave data using the h5py library. We’ll also print out its datatype to confirm we have imported it correctly, and then study the structure of the file in order to see precisely what HDF groups it contains. The LIGO project provides the data along with plenty of documentation and tutorials, including a great one on signal processing with the data.

# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
    print(key)

We’ll extract some of the LIGO experiment’s actual data from the HDF5 file and visualize it. To do so, we’ll need to first explore the HDF5 group ‘strain’.

# Get the HDF5 group: group
group = data['strain']
# Check out keys of group
for key in group.keys():
    print(key)
# Set variable equal to time series data: strain
strain = data['strain']['Strain'].value
# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
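
Two small version notes: in h5py 3.0 and later the .value attribute has been removed, so the dataset is read by indexing with [()] instead, and since we opened the file manually (rather than with a context manager) it is good practice to close it when we are done:

# Equivalent read for newer h5py versions, then close the file
strain = data['strain']['Strain'][()]
data.close()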

Loading .mat files

We’ll figure out how to load a MATLAB file using scipy.io.loadmat() and discover what Python datatype it yields. The file ‘albeck_gene_expression.mat’ is in our working directory; it contains gene expression data from the Albeck Lab at UC Davis, where the data and some great documentation can be found.

# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype of mat
print(type(mat))

Here, we’ll discover what is in the MATLAB dictionary that we loaded.

# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

Creating a database engine

Here, we’re going to fire up our very first SQL engine. We’ll create an engine to connect to the SQLite database ‘Chinook.sqlite’, which is in our working directory. As an example of the syntax, an engine for a database called ‘Northwind.sqlite’ would be created as follows:

engine = create_engine('sqlite:///Northwind.sqlite')

Here, ‘sqlite:///Northwind.sqlite’ is called the connection string to the SQLite database Northwind.sqlite. A little bit of background on the Chinook database: the Chinook database contains information about a semi-fictional digital media store in which media data is real and customer, employee and sales data has been manually created. Why the name Chinook, you ask? According to their website,

The name of this sample database was based on the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian Prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a good name choice for a database that intends to be an alternative to Northwind.

# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

In this exercise, we’ll once again create an engine to connect to ‘Chinook.sqlite’. Before we can get any data out of the database, however, we’ll need to know what tables it contains! To this end, we’ll save the table names to a list using the method table_names() on the engine and then we will print the list.

# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)
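
A version note: engine.table_names() was removed in SQLAlchemy 2.0; on newer versions the same list can be obtained through the inspection API, roughly as follows:

# Equivalent for newer SQLAlchemy versions
from sqlalchemy import inspect
print(inspect(engine).get_table_names())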

The Hello World of SQL Queries!

Now, it’s time for liftoff! We’ll perform the Hello World of SQL queries, SELECT, in order to retrieve all columns of the table Album in the Chinook database.

# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine connection: con
con = engine.connect()
# Perform query: rs
rs = con.execute("SELECT * FROM Album")
# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()
# Print head of DataFrame df
print(df.head())
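
A related version note: on SQLAlchemy 2.0, raw SQL strings passed to con.execute() must be wrapped in text(); a minimal sketch of the same query in that style:

# Same query with an explicit text() construct (required on newer SQLAlchemy)
from sqlalchemy import text
with engine.connect() as con:
    rs = con.execute(text("SELECT * FROM Album"))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()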

Congratulations on executing your first SQL query! Now we’re going to figure out how to customize our query in order to select specified columns from a table, select a specified number of rows, and import column names from the database table.

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName,Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Print the head of the DataFrame df
print(df.head())

We can now execute a basic SQL query to select records from any table in our database and can also perform simple query customizations to select particular columns and numbers of rows. There are a couple more standard SQL query chops that will aid us in our journey to becoming an SQL ninja. Let’s say, for example that we want to get all records from the Customer table of the Chinook database for which the Country is ‘Canada’. We can do this very easily in SQL using a SELECT statement followed by a WHERE clause as follows:

SELECT * FROM Customer WHERE Country = 'Canada'

In fact, we can filter any SELECT statement by any condition using a WHERE clause. This is called filtering your records. Query away!

# Create engine: engine
engine = create_engine("sqlite:///Chinook.sqlite")
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= '6' ")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())

We can also order our SQL query results. For example, if we want to get all records from the Customer table of the Chinook database and order them in increasing order by the column SupportRepId, we could do so with the following query:

"SELECT * FROM Customer ORDER BY SupportRepId"

Get querying!

# Create engine: engine
engine = create_engine("sqlite:///Chinook.sqlite")
# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate ")
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns=rs.keys()
# Print head of DataFrame
print(df.head())

Pandas and The Hello World of SQL Queries!

Here, we’ll take advantage of the power of pandas to write the results of our SQL query to a DataFrame in one swift line of Python code! We’ll first import pandas and create the SQLite ‘Chinook.sqlite’ engine. Then we’ll query the database to select all records from the Album table.

# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine("sqlite:///Chinook.sqlite")
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
# Print head of DataFrame
print(df.head())
# Open engine in context manager
# Perform query and save results to DataFrame: df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()
# Confirm that both methods yield the same result: does df = df1 ?
print(df.equals(df1))

Here, we’ll become more familiar with the pandas function read_sql_query() by using it to execute a more complex query: a SELECT statement followed by both a WHERE clause AND an ORDER BY clause. We’ll build a DataFrame that contains the rows of the Employee table for which the EmployeeId is greater than or equal to 6 and we’ll order these entries by BirthDate.

# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine("sqlite:///Chinook.sqlite")
# Execute query and store records in DataFrame: df
df= pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >= '6' ORDER BY BirthDate",engine)
# Print head of DataFrame
print(df.head())

Here, we’ll perform our first INNER JOIN! We’ll be working with our favourite SQLite database, Chinook.sqlite. For each record in the Album table, we’ll extract the Title along with the Name of the Artist. The latter will come from the Artist table, so we will need to INNER JOIN these two tables on the ArtistID column of both.

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title,Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df= pd.DataFrame(rs.fetchall())
    df.columns=rs.keys()
# Print head of DataFrame df
print(df.head())

Congrats on performing your first INNER JOIN! We’ll now finish this part with one final exercise in which we perform an INNER JOIN and filter the result using a WHERE clause.

# Execute query and store records in DataFrame: df
df= pd.read_sql_query("SELECT * FROM PlaylistTrack  INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000",engine)
# Print head of DataFrame
print(df.head())
