Anaconda Web Scraping




Intro to Python and Web Scraping

I. Anaconda Crash Course: Using Anaconda (“Conda”) to Supplement Python

Python is a valuable scripting language for data analysis and management; however, managing a Python project environment can be nuanced and tricky. Anaconda is a platform built to complement Python by creating customizable and easily accessible environments in which you can run Python scripts.

For reference, the Anaconda homepage is found at https://www.anaconda.com/.

The following tutorial runs through the installation of Anaconda, and then introduces you to the concepts behind Anaconda that make it a nice and useful Python development environment.

Install Anaconda (aka Conda)

The Anaconda homepage contains the materials that you need to install Anaconda on your machine. You will primarily be using Anaconda through the command line, so you will have to get comfortable working on the command line.

II. Check Anaconda Version and Install

The first step is to open Terminal and check to see if you have Anaconda installed. If not, we will install it. To check the version, use the following commands.

A. Open Terminal

B. Check Version

The syntax to access Anaconda on the command line is simply ‘conda’. To check the version you have installed, use the following:
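Either of the following works; conda info also prints platform and environment details:

```bash
conda --version   # prints just the version, e.g. conda 4.3.30
conda info        # prints version, platform, and environment paths
```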

If you have it installed, you will see platform information, version details, and environment paths after you hit enter; if not, the terminal will not recognize the command.

C. Install Anaconda

To install ‘Conda’, navigate to the Anaconda downloads page at https://www.anaconda.com/download.

Here, pick your system (Mac/Windows) and the Python version. In our case, we are going to pick Mac and select version 3.6. Use the graphical installer; it provides a wizard that steps us through the installation process. Download the installer, double-click the package file, and follow the instructions. Just a heads up: the installation takes 5-10 minutes, since it's a big program, but it is straightforward.

If you run into problems installing, or you get an error that states that Anaconda cannot be installed in the default location, visit this page for short instructions on how to troubleshoot installation.

Anaconda is contained in one directory, so if you ever need to uninstall Anaconda, use Terminal to remove the entire Anaconda directory using rm -rf ~/anaconda.

We used Python 3, not Python 2. The guidelines on the site describe that we should use whichever version we intend to use most often, but ultimately it will not matter that much. Anaconda supports both, and if you ever want to use Python 2, you can create an environment that uses Python 2 and activate it. The main reason you would want to use Python 2 is that on many Linux distributions and Macs, Python 2.7 is still the default, and the Python ecosystem has amassed a significant amount of quality software over the years, some of which does not yet work on Python 3. Python 3, however, is designed to be more robust and fixes a lot of bugs in Python 2, so in the future, expect to see a continued migration to Python 3. We are set up with Python 3 as our default, but since we are using Anaconda, if we want to set up a Python 2 instance at some point, it will be easy to do!

III. Confirm the installation worked properly

Once we are finished with the installation, check to make sure it installed correctly by performing a version check.
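As before:

```bash
conda info
```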

If you see a version number (e.g. 4.x.x) along with platform and environment information, the installation worked. Now we can begin working with Conda.

Additional Reading and Resources

Anaconda 30-minute Test Drive

Conda Command Line Cheatsheet -

Mac Command Line Cheatsheet –

https://github.com/0nn0/terminal-mac-cheatsheet/wiki/Terminal-Cheatsheet-for-Mac-(-basics-)

Python Documentation -

Python is an increasingly popular high-level programming language. It emphasizes legibility over highly complex structure. Python innately provides simple data structures allowing for easy data manipulation. Python provides a simple approach to object-oriented programming, which in turn allows for intuitive programming, and has resulted in a large user community that has created numerous libraries that extend the basic capacities of the language.

Python is an 'interpreted' language. This means that every Python command that is executed is translated on the fly into lower-level instructions. Lower-level programming languages are very fast and powerful, but writing programs in them can be difficult.

There are two main versions of Python: Python 2 and Python 3. Python 3 is newer and removes bugs and idiosyncrasies of Python 2; however, Python 2 is still heavily used. While the fundamentals of the two versions are the same, there are some (mostly syntactical) differences between the two, and unfortunately Python 3 is NOT backwards compatible, so always double-check which version of Python is being used when you work with it. In this class, we are using Python 3.

There are a number of great Python tutorials available on the web; some can be found here:

There are also some excellent Python textbooks and cookbooks. A simple Amazon search will reveal many. In particular, we recommend and are using the following from the MIT Press:


Web Scraping With Anaconda

A Jupyter Notebook (formerly called the IPython Notebook) is a tool for interactively writing and executing Python code. It allows the programmer to easily write and test code by allowing snippets of code and their results to be displayed side-by-side. Each snippet of code is called a 'cell'. It promotes ease of documentation by allowing some cells to contain text, HTML code, images, or even LaTeX. Jupyter is particularly useful for big data computation, as it allows for parallel processing.

Let's switch over to a Jupyter Notebook for the rest of this tutorial. First, create an environment that is running Python 3 in terminal. We can name this anything, but name it python3 so we can remember easily what it is.
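A sketch of the command (pin whichever Python 3 version you prefer):

```bash
conda create -n python3 python=3.6
```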

Now we want to work in this environment. Activate python3 using the following.
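On a Mac, that is:

```bash
source activate python3
```

On newer versions of conda (4.4+), conda activate python3 works as well.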

Then, navigate to the folder containing the tutorial files you've downloaded.

Finally, launch a Jupyter notebook and open the 'Intro to Python and Web Scraping.ipynb' file.
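Assuming the tutorial files live in a folder on your Desktop (the path below is hypothetical):

```bash
cd ~/Desktop/intro-python-web-scraping
jupyter notebook
```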

3. Python Libraries

Python is a dynamically typed language. A language has dynamic typing when variable types are not predefined as in a compiled language; the type of a value is evaluated when the code is run, based on how you are attempting to use it.
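A quick illustration: the same name can hold values of different types, and the type is only checked when the code runs.

```python
x = 5           # x holds an int
x = "five"      # now x holds a str; no declaration needed
print(type(x))  # <class 'str'>
```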

Dynamic typing, along with a number of other language characteristics like readability and reusability, makes Python a very popular programming language with a large user community. However, Python on its own provides only a basic set of modules and functionality. In order to extend Python’s functionality, the active community has created a very large number of libraries. A library is a built-in or external module that can be imported into our current code to add functionality. Libraries usually take advantage of object-oriented programming, defining Python objects in additional scripts that can then be instantiated in our current code.

Loading libraries into our current context can be expensive; for that reason, Python requires us to explicitly load the libraries that we want to use:
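For example, loading the requests library we will use later:

```python
import requests
```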

We can be more specific and import only specific classes or functions of the library:
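For example:

```python
from bs4 import BeautifulSoup
```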

We can also change the name of the library when it gets imported:
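For example (the alias is arbitrary):

```python
import bs4 as bs
```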

4. Web Scraping using Beautiful Soup

Let's scrape some data using a fun library called Beautiful Soup. We'll create a CSV dataset from a table of 311-reported rodent incidents around Boston.

The website we are going to scrape is here.

Let's get started!

Importing Modules

First import modules. import requests imports the requests module, and import bs4 imports the Beautiful Soup library.
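In a notebook cell:

```python
import requests
import bs4
```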

Testing out Requests

Requests will allow us to load a webpage into Python so that we can parse and manipulate it. Test this by entering the following commands, hitting enter after each to run them.
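A sketch, with a placeholder URL standing in for the rodent-incidents page (this copy of the tutorial no longer carries the link):

```python
import requests

# Load the webpage into Python; the URL below is a placeholder
response = requests.get('http://duspviz.mit.edu/_assets/data/rodents.html')
print(response.text)  # the raw HTML source of the page
```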

This allowed us to access all of the content from the source code of the webpage with Python, and we can now parse it and extract data. It even printed to our console. Pretty cool!

Testing out Beautiful Soup

Our next big step is to test out Beautiful Soup. Let's talk about what this is...

What is Beautiful Soup?

Beautiful Soup is a Python library for parsing data out of HTML and XML files (aka webpages). It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. The major concept with Beautiful Soup is that it allows you to access elements of your page by following the CSS structures, such as grabbing all links, all headers, specific classes, or more. It is a powerful library. Once we grab elements, Python makes it easy to write the elements or relevant components of the elements into other files, such as a CSV, that can be stored in a database or opened in other software.

The sample webpage we are using contains data on 'rodent incidents' in the greater Boston area. Let's use this file to explore the tree, and extract some data.

iv. Make the Soup

First, we have to turn the website code into a Python object. We have already imported the Beautiful Soup library, so we can start calling some of the methods in the library. Replace print(response.text) with the following, which turns the text into a Python object named soup.
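```python
soup = bs4.BeautifulSoup(response.text, "html.parser")
```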

An important note: you need to specify the parser that Beautiful Soup uses to parse your text. This is done in the second argument of the BeautifulSoup function. The default is the built-in Python parser, which we can call using html.parser.

You can also use lxml or html5lib. This is nicely described in the documentation. For our purposes, using the default is fine.

Using the Beautiful Soup prettify() function, we can print the page to see the code printed in a readable and legible manner.
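```python
print(soup.prettify())
```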

At any point, if you need a reference, visit the Beautiful Soup documentation for the official descriptions of functions. Prettify is a handy one to see our document in a clean fashion.

Navigating the Data Structure

With our data from the webpage nicely laid out, Beautiful Soup allows us to now navigate the data structure. We called our Beautiful Soup object soup, so we can run the Beautiful Soup functions on this object. Let's explore some ways to do this, try entering some of the following into your terminal.
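A few typical navigation calls (exactly what they return depends on the page's HTML):

```python
soup.title          # the <title> element
soup.title.string   # just the text inside the title
soup.body.p         # the first <p> element in the body
soup.find_all('a')  # a list of every link on the page
```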

Working with Arrays

The easiest way to access elements and then either write them to file or manipulate them is to save them as objects themselves. Note that our data is organized into counties and several numbers. Let's save these to arrays (Python lists), which are the easiest way to work with the data.

The following gives us an array whose elements we can work with.
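A sketch, assuming the county names on the sample page carry a class named 'county' (the class name is an assumption, since this copy of the tutorial does not show the page's HTML):

```python
# Grab every element whose class is 'county' (class name assumed)
data = soup.find_all(attrs={'class': 'county'})
print(data[0])
print(data[1])
print(data[2])
```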

You never want to repeat code like this, so turn this into a loop:
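```python
for element in data:
    print(element)
```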

This array only gives us counties, though; let's get all of the data elements from all classes.
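Assuming the numeric values carry their own class (again, both class names are assumptions):

```python
data = soup.find_all(attrs={'class': ['county', 'data']})
```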

We have all of the data that was nested in these tags saved to a Python array. Access the elements of the array by using data[x], where x is the location in the array. In Python, arrays start at 0, so place 1 in a Python array is actually accessed with a 0, and place 8 with a 7.

Right now, we get the whole element with those commands. To get just the content, use the following.
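```python
print(data[0])         # the whole element, tags and all
print(data[0].string)  # just the text content inside the tag
```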

In the next part, we'll create a script file and run it!

Create and execute Script File (.py)

Open up starter_script.py (located in the scripts folder for the week) in Sublime Text; we are going to copy the code below into that file. The .py extension signifies that the file is a Python program. Using this method, you can make independently standing Python files that can be run. The starter script contains all the code below and is a sample you can use for scraping pages.

For the first lines in the file, let's import modules: import requests imports the requests module, and import bs4 imports the Beautiful Soup library. Then, based on what we did above, load the page, turn it into a parseable 'soup', and find the proper elements. Let's put a print at the end so we can see our work as it runs.
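A sketch of the whole script, carrying over the placeholder URL and assumed class names from above:

```python
import requests
import bs4

# Load the page (substitute the real URL of the page you are scraping)
response = requests.get('http://duspviz.mit.edu/_assets/data/rodents.html')

# Turn the page text into a parseable 'soup' object
soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Find the elements we want (class names assumed)
data = soup.find_all(attrs={'class': ['county', 'data']})

# Print so we can see our work as it runs
print(data)
```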

Run this by typing python starter_script.py in terminal. This will execute our program.

You should see an array with our data elements nested within tags. This is what we want!

Write data to a file using a simple loop

Python makes opening a file and writing to it very easy. Let's take this simple dataset and write it to a file that saves in our current working directory. An important note: whatever the working directory is when you start Python will be the root from which your files are read and to which they are written.

Python also has nice iteration features that allow us to iterate through arrays, lists, and files. In the following example, we manually create a comma-separated document with our data, using file-writing operations and a while loop.

In pseudo-code:

  1. Open up a file to write in and append data.

  2. Set up parameters for the while loop

  3. Write headers

  4. Run while loop that will write elements of the array to file

  5. When complete, close the file

Once done, open the file on your machine to see your data. Enter the following code into the starter_script.py file below our other code, and note what each line is doing.
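A sketch following the pseudo-code, assuming the data array alternates county name, then incident count (adjust the step size if each county has more values):

```python
# 1. Open a file to append data to (created if it doesn't exist)
f = open('county_data.txt', 'a')

# 2. Set up parameters for the while loop
i = 0

# 3. Write headers (the column names here are assumptions)
f.write('county, incidents\n')

# 4. Write elements of the array to the file, two at a time
while i < len(data):
    f.write(data[i].string + ', ' + data[i + 1].string + '\n')
    i += 2

# 5. When complete, close the file
f.close()
```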

Save and run the file. You'll see county_data.txt appear in your current working directory.

Open county_data.txt in Sublime Text. It's a CSV with the data from the page!

Further reading on File Reading and Writing and Iteration in Python can be found at the following links.

We created this CSV manually to illustrate some basic Python. Python also has a built-in csv module that is very useful and handy, especially when reading in a CSV, since it gives you access to the elements of each row. You can read more about the built-in csv module here.
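A minimal sketch of reading our file back with the built-in module:

```python
import csv

with open('county_data.txt') as f:
    for row in csv.reader(f):
        print(row)  # each row comes back as a list of strings
```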

Additional Reading and Resources

Conda Command Line Cheatsheet -

Mac Command Line Cheatsheet –

https://github.com/0nn0/terminal-mac-cheatsheet/wiki/Terminal-Cheatsheet-for-Mac-(-basics-)

Python Documentation -

Beautiful Soup Tutorials -

Credits

Prepared by Mike Foster & Luke Mich


Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101-web-scraping exercises in Python.

See the repo here: https://github.com/compjour/search-script-scrape

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:

The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library ("HTTP for Humans").


Here’s an overly verbose example:
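Something along these lines (the URL is a stand-in):

```python
import requests

url = 'http://www.example.com'
response = requests.get(url)
print(response.status_code)              # e.g. 200
print(response.headers['content-type'])  # e.g. 'text/html; charset=UTF-8'
print(response.text[:500])               # the first 500 characters of the body
```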

The requests library even does JSON parsing if you use it to fetch JSON files. Here’s an example with the Google Geocoding API:
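A sketch (the Geocoding API now requires a key; 'YOUR_KEY' is a placeholder):

```python
import requests

url = 'https://maps.googleapis.com/maps/api/geocode/json'
params = {'address': 'Stanford, CA', 'key': 'YOUR_KEY'}
data = requests.get(url, params=params).json()  # requests parses the JSON
print(data['results'][0]['geometry']['location'])  # {'lat': ..., 'lng': ...}
```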

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.

But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:
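For instance (the URL is a stand-in):

```python
import requests
from lxml import html

response = requests.get('http://www.example.com')
doc = html.fromstring(response.text)
links = doc.xpath('//a/@href')  # XPath: the href of every link on the page
print(links)
```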

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data for high school test scores. The use of os.path.basename is simply for saving each spreadsheet to your local hard drive:
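A sketch of the pattern (the listing URL and the .xls filter are stand-ins for the actual test-score page):

```python
import os
import requests
from lxml import html
from urllib.parse import urljoin

base_url = 'http://www.cde.ca.gov/ta/ac/'  # hypothetical listing page
doc = html.fromstring(requests.get(base_url).text)

for href in doc.xpath('//a/@href'):
    if href.endswith('.xls'):
        file_url = urljoin(base_url, href)  # resolve the relative link
        fname = os.path.basename(file_url)  # local filename for the download
        with open(fname, 'wb') as f:
            f.write(requests.get(file_url).content)
```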

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.

Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut from simple parsing of static HTML, to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what is typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particularly great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem in the most randomly bizarre style relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:

1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, '186,569 datasets found'. This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
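In the spirit of that solution (the selector is an assumption, and the page has since been redesigned):

```python
import requests
from lxml import html

doc = html.fromstring(requests.get('https://www.data.gov/').text)
# Grab the text of the element holding the count, e.g. '186,569 datasets found'
text = doc.xpath('//span[@class="count"]/text()')[0]
print(text.split(' ')[0])
```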

29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and the current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution).
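The date math itself is short once the date is parsed from the page (the date below is made up):

```python
from datetime import date

execution_date = date(2016, 3, 9)                 # parsed from the page
days_left = (execution_date - date.today()).days  # timedelta in whole days
print(days_left)
```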

3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and, in this case, locate ie.json), or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.
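A sketch, assuming ie.json still lives at its old path and keeps the same structure (both are assumptions):

```python
import requests

data = requests.get('https://analytics.usa.gov/data/live/ie.json').json()
print(data['totals']['ie_version']['6.0'])  # key names assumed
```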

6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
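The general shape of that wrangling (the URL and filenames are hypothetical):

```python
import csv
import io
import zipfile
import requests

resp = requests.get('http://example.com/2010-survey.zip')
z = zipfile.ZipFile(io.BytesIO(resp.content))  # unzip in memory
with z.open('2010-survey.csv') as f:
    rows = list(csv.DictReader(io.TextIOWrapper(f, encoding='utf-8')))
print(len(rows))  # rows are dicts, ready for filtering and math
```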


90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse the Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform multiple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.

HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but also for content data (e.g. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimal amount of HTML parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.