Importing Data in Python (2)

10 minute read

Importing flat files from the web

We are about to import our first file from the web! The flat file you will import will be ‘winequality-red.csv’ from the University of California, Irvine’s Machine Learning Repository. The flat file contains tabular data on the physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with the wine quality rating.

# Import package
from urllib.request import urlretrieve
# Import pandas
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Save file locally
urlretrieve(url, 'winequality-red.csv')
# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

We have just imported a file from the web, saved it locally and loaded it into a DataFrame. If we just want to load a file from the web into a DataFrame without first saving it locally, we can do that easily using pandas. In particular, we can use the function pd.read_csv() with the URL as the first argument and the separator sep as the second argument.

# Import packages
import matplotlib.pyplot as plt
import pandas as pd
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')
# Print the head of the DataFrame
print(df.head())
# Plot first column of df
df.iloc[:, 0:1].hist()
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()

Importing non-flat files from the web

Congrats! We’ve just loaded a flat file from the web into a DataFrame without first saving it locally, using the pandas function pd.read_csv(). This function is super cool because it has close relatives that allow us to load all types of files, not only flat ones. Now we’ll use pd.read_excel() to import an Excel spreadsheet. Our job is to use pd.read_excel() to read in all of its sheets, print the sheet names and then print the head of the first sheet using its name, not its index. Note that when we pass sheet_name=None, the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.

# Import package
import pandas as pd
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheet_name=None)
# Print the sheetnames to the shell
print(xl.keys())
# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

Performing HTTP requests in Python using urllib

Now it’s time to perform some HTTP GET requests of our own. Here we will ping the DataCamp servers to perform a GET request to extract information from their teach page, “http://www.datacamp.com/teach/documentation”. We’ll eventually extract the HTML itself. Right now, however, we are going to package and send the request and then catch the response.

# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request: request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Print the datatype of response
print(type(response))
# Be polite and close the response!
response.close()

We have just packaged and sent a GET request to “http://www.datacamp.com/teach/documentation” and then caught the response. We saw that such a response is an http.client.HTTPResponse object. The question remains: what can we do with this response? Well, as it came from an HTML page, we could read it to extract the HTML; in fact, such an http.client.HTTPResponse object has an associated read() method. Here we’ll build on our previous work to extract the response and print the HTML.

# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Extract the response: html
html = response.read()
# Print the html
print(html)
# Be polite and close the response!
response.close()
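
Note that read() returns the HTML as bytes rather than a string. The sketch below is not part of the original exercise; it simply decodes those bytes, assuming the page is UTF-8 encoded, and prints just the first few hundred characters.

# Decode the bytes into a string (assuming the page is UTF-8 encoded)
html_str = html.decode('utf-8')
# Print only the first 500 characters to keep the output readable
print(html_str[:500])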

Now that we’ve got our heads and hands around making HTTP requests using the urllib package, we’re going to figure out how to do the same using the higher-level requests library. We’ll once again be pinging the DataCamp servers for their “http://www.datacamp.com/teach/documentation” page. Note that, unlike with urllib, we don’t have to close the connection when using requests!

# Import package
import requests
# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response: text
text = r.text
# Print the html
print(text)
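
Before trusting the body of a response, it is often worth a quick sanity check. The short sketch below is not part of the original exercise; it just uses the status_code and headers attributes that requests exposes on the response object.

# Check that the request succeeded and inspect the content type
print(r.status_code)               # e.g. 200 on success
print(r.headers['content-type'])   # e.g. 'text/html; charset=utf-8'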

Parsing HTML with BeautifulSoup

We’ll learn how to use the BeautifulSoup package to parse, prettify and extract information from HTML. We’ll scrape the data from the webpage of Guido van Rossum, Python’s very own Benevolent Dictator for Life. Here we’ll prettify the HTML and then extract the text and the hyperlinks. The URL of interest is url = ‘https://www.python.org/~guido/’.

# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()
# Print the response
print(pretty_soup)

As promised, in the following exercises, we’ll learn the basics of extracting information from HTML soup. Here we’ll figure out how to extract the text from the BDFL’s webpage, along with printing the webpage’s title.

# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url: url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the title of Guido's webpage: guido_title
guido_title = soup.title
# Print the title of Guido's webpage to the shell
print(guido_title)
# Get Guido's text: guido_text
guido_text = soup.get_text()
# Print Guido's text to the shell
print(guido_text)
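
Note that soup.title returns the whole <title> tag, markup included. If we only want the text inside it, BeautifulSoup's .string attribute does the trick; a tiny sketch:

# Get just the text inside the <title> tag
print(guido_title.string)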

Here we’ll figure out how to extract the URLs of the hyperlinks from the BDFL’s webpage. In the process, we’ll become close friends with the soup method find_all().

# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
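
Some of the href values on a page like this can be relative paths rather than full URLs. If we wanted absolute URLs instead, one option is urllib.parse.urljoin(); a minimal sketch, reusing url and a_tags from above:

# Import urljoin to resolve relative links against the page URL
from urllib.parse import urljoin
# Print absolute URLs, skipping anchors without an href attribute
for link in a_tags:
    href = link.get('href')
    if href:
        print(urljoin(url, href))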

Loading and exploring a JSON

Now that we know what a JSON is, we’ll load one into our Python environment and explore it ourselves. Here, we’ll load the JSON ‘a_movie.json’ into the variable json_data, which will be a dictionary. We’ll then explore the JSON contents by printing the key-value pairs of json_data to the shell.

# Import package
import json
# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
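
Some values in a JSON can themselves be lists or dictionaries, which makes the flat key-value printout above hard to read. As an optional extra, the standard library's json.dumps() can pretty-print the whole structure:

# Pretty-print the entire JSON with indentation
print(json.dumps(json_data, indent=2))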

API requests

Now it’s our turn to pull some movie data down from the Open Movie Database (OMDb) using their API. The movie we’ll query the API about is The Social Network. As an example, to query the API about the movie Hackers, the query string would be ‘http://www.omdbapi.com/?t=hackers’, which has a single argument, t=hackers. Note: OMDb has since changed its API: we now also have to specify an API key. This means we’ll have to add another argument to the URL: apikey=ff21610b.

# Import requests package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=the+social+network'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Print the text of the response
print(r.text)

Wow, congrats! We’ve just queried our first API programmatically in Python and printed the text of the response to the shell. However, as we know, our response is actually a JSON, so we can do one step better and decode the JSON. We can then print the key-value pairs of the resulting dictionary. That’s what we’re going to do now!

# Import package
import requests
# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'
# Package the request, send the request and catch the response: r
r= requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
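
Once the JSON is decoded, individual fields can be pulled out directly. A typical OMDb response includes keys such as 'Title' and 'Year'; the sketch below assumes the lookup succeeded and those keys are present.

# Access individual fields from the decoded JSON (assuming the lookup succeeded)
print(json_data['Title'])
print(json_data['Year'])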

We’re doing so well and having so much fun that we’re going to try one more API: the Wikipedia API. We’ll figure out how to find and extract information from the Wikipedia page for Pizza. What gets a bit wild here is that our query will return nested JSONs, that is, JSONs within JSONs, but Python can handle that because it will translate them into dictionaries within dictionaries.

# Import package
import requests
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
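
Hard-coding the page ID '24768' is brittle, since the key depends on which article we queried. A more robust sketch, assuming each page carries an 'extract' field (which it should, given prop=extracts in the query), iterates over whatever pages the response contains:

# Iterate over the pages in the response instead of hard-coding the page ID
for page in json_data['query']['pages'].values():
    print(page['extract'])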

API Authentication

The package tweepy is great at handling all the Twitter API OAuth authentication details for us. All we need to do is pass it our authentication credentials. Here, we use some mock authentication credentials (to replicate this at home, you would need to create a Twitter App). Our task is to pass these credentials to tweepy’s OAuth handler.

# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
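
With the handler configured, we could also construct an API object from it; this is just a sketch of tweepy's usual next step, and since these are mock credentials any real call would of course fail.

# Create an API object from the authenticated handler
api = tweepy.API(auth)
# With real credentials, api.verify_credentials() would confirm access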

Now that we have set up our authentication credentials, it is time to stream some tweets! We have already defined the tweet stream listener class, MyStreamListener (a rough sketch of what it might look like is given below). Our task is to create the Stream object and to filter tweets according to particular keywords.
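
For reference, here is a rough sketch of what such a listener might look like, assuming tweepy 3.x (matching the course code) and that we want to save the first 100 tweets to 'tweets.txt', one JSON object per line:

# Sketch of a stream listener (tweepy 3.x): write each tweet's JSON to a file
import json

class MyStreamListener(tweepy.StreamListener):
    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')

    def on_status(self, status):
        # Write the raw JSON of each incoming tweet on its own line
        self.file.write(json.dumps(status._json) + '\n')
        self.num_tweets += 1
        if self.num_tweets >= 100:
            self.file.close()
            return False  # returning False disconnects the stream
        return True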

# Initialize Stream listener
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])

Now that we’ve got our Twitter data sitting locally in a text file, it’s time to explore it!

# Import package
import json
# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'
# Initialize empty list to store tweets: tweets_data
tweets_data = list()
# Open connection to file
tweets_file = open(tweets_data_path, "r")
# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)
# Close connection to file
tweets_file.close()
# Print the keys of the first tweet dict
print(tweets_data[0].keys())

Now we have the Twitter data in a list of dictionaries, tweets_data, where each dictionary corresponds to a single tweet. Next, we’re going to extract the text and language of each tweet. The text in a tweet, t1, is stored as the value t1[‘text’]; similarly, the language is stored in t1[‘lang’]. Our task is to build a DataFrame in which each row is a tweet and the columns are ‘text’ and ‘lang’.

# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text','lang'])
# Print head of DataFrame
print(df.head())
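
As a quick check that is not part of the original exercise, pandas’ value_counts() shows how the collected tweets break down by language:

# Count how many tweets there are per language
print(df['lang'].value_counts())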

Now that we have our DataFrame of tweets set up, we’re going to do a bit of text analysis to count how many tweets contain the words ‘clinton’, ‘trump’, ‘sanders’ and ‘cruz’. We define the following function, word_in_text(), which tells us whether the first argument (a word) occurs within the second argument (a tweet).

import re

def word_in_text(word, text):
    # Case-insensitive check: does word occur anywhere in text?
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False
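
A quick usage check of word_in_text: because both arguments are lower-cased, the match is case-insensitive.

# The match is case-insensitive
print(word_in_text('clinton', 'Clinton holds a rally'))   # True
print(word_in_text('trump', 'No keyword here'))            # False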

We’re going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! The counter for each candidate has been initialized to 0.

# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

Now that we have the number of tweets in which each candidate was mentioned, we can plot a bar chart of this data. We’ll use the statistical data visualization library seaborn, which you may not have seen before, but I’ll guide you through it. We first import seaborn as sns, then construct a barplot of the data using sns.barplot(), passing it two arguments: a list of labels and a list containing the variables we wish to plot (clinton, trump and so on). Hopefully, we’ll see that Trump was unreasonably well represented!

# Import packages
import seaborn as sns
import matplotlib.pyplot as plt
# Set seaborn style
sns.set(color_codes=True)
# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']
# Plot bar chart of the mention counts
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
