This content is from the fall 2016 version of this course. Please go here for the most recent version.
If you have not already done so, you will need to properly install an Anaconda distribution of Python, following the installation instructions from the first week.
I would also recommend installing a friendly text editor, such as Atom, for editing scripts. Once installed, you can start a new script by typing atom name_of_your_new_script in bash. You can edit an existing script by using atom name_of_script. SublimeText works similarly to Atom. Alternatively, you may use a native text editor such as Vim, but this has a steeper learning curve.
Installing New Python Packages
One way to install new packages not already included in Anaconda is using conda install <package>. While packages in Anaconda are curated, they are not always the most up-to-date versions. Furthermore, not all packages are available with conda install.
To resolve this issue, use the Python package manager pip, which should be installed by default. To begin, update pip.
On Mac or Linux
On Windows
In contrast to querying APIs with Python, web-scraping relies on targeting the observed structure of a website itself to download specified content. A good conceptual model for web-scraping is the following example:
Suppose you would like to collect all the speeches and remarks of President Obama during his presidency. You could begin by going to the White House Speeches and Remarks website, finding a speech, copying the text, pasting it in a text file, and saving it. You would then repeat this process for every speech: navigate to its URL, copy the text, and save it. Obviously, this would be onerous to do manually. Web-scraping offers a programmatic way to automate this process.
We will return to a simplified example of scraping presidential speeches later in the tutorial. Before we get deeper into this, let’s review the key ideas in web-scraping.
Essential Processes of Webscraping
There are two essential concepts in web-scraping:
(1) Finding URLs to Download
Finding URLs to download is often one of the most challenging parts of web-scraping and is highly dependent on the website layout, URL construction, and formatting. Your basic goal here is to generate a list or CSV of URLs to download. How do I know what the URLs should be?
Some sites have very clean URLs:
In such a case, the URLs themselves follow a formula. Generating the list of URLs is a simple matter of using string manipulation to create URLs for all possible datetimes that exist in the range you desire.
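As a minimal sketch of this kind of string manipulation, the snippet below builds one URL per day over a date range. The URL pattern and the name daily_urls are hypothetical, chosen only for illustration; a real site's format would need to be read off its actual URLs.

```python
from datetime import date, timedelta

# Hypothetical URL pattern; substitute the format used by the real site.
BASE_URL = "https://example.com/speeches/{year}/{month:02d}/{day:02d}"


def daily_urls(start, end):
    """Return one URL per calendar day from start to end, inclusive."""
    n_days = (end - start).days + 1
    days = (start + timedelta(days=i) for i in range(n_days))
    return [BASE_URL.format(year=d.year, month=d.month, day=d.day) for d in days]


urls = daily_urls(date(2016, 1, 30), date(2016, 2, 2))
for url in urls:
    print(url)
```

Using date arithmetic rather than nested loops over months and days avoids generating URLs for dates that do not exist, such as February 30.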
This is rarely the case. Often URLs will have no programmable format. You must instead collect them. Here the strategy is to begin at the base_URL for your desired site content. From here, there will be one or more links (which lead to one or more link combinations) leading to the final URL. The task is to write a series of loops that recursively go through each possible path to get to the final set of URLs. This is site dependent.
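One step of this collection process can be sketched as follows: given the HTML of an index page, pull out every link and resolve it against the base URL. To stay within the standard library, this sketch uses html.parser as a stand-in for BeautifulSoup (which can do the same job with find_all("a")); the sample HTML, the example.com URLs, and the LinkCollector class are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


# Made-up index page standing in for HTML downloaded from the base URL.
SAMPLE_HTML = """
<ul>
  <li><a href="/speeches/2016/first-speech">First speech</a></li>
  <li><a href="/speeches/2016/second-speech">Second speech</a></li>
  <li><a href="?page=2">Next page</a></li>
</ul>
"""

collector = LinkCollector("https://example.com/speeches/")
collector.feed(SAMPLE_HTML)
for link in collector.links:
    print(link)
```

In a real scraper, you would repeat this step on each collected link that leads to another index page, keeping a set of already-visited URLs so the loops terminate.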
Some sites have pager functions: Page 1 of X. Others have JavaScript that dynamically renders site content when the user scrolls to the end of the page. Pager functions or other file paths can be handled using BeautifulSoup and urllib. Dynamic pages are best handled using a package like Selenium.
These topics are beyond the current scope. Today, we will be emphasizing the most basic of web-scraping tasks: downloading content from a specific URL.
(2) Downloading URLs
Once you have a URL, downloading it can be achieved in various ways, depending on the format. If your file exists as a plain text file or PDF, you could simply download it from the command line using curl or wget.
Alternatively, for standard websites, you will often want to use an HTML parser such as BeautifulSoup.
Note that you could use wget on most websites. For standard websites, however, wget will download the raw HTML (the main web development structure). This is not very readable and is much longer than the desired content. Below, I demonstrate a simple implementation of BeautifulSoup to extract selected speech content.
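To illustrate the difference between raw HTML and the desired content, here is a minimal sketch, separate from the course notebook. It assumes the bs4 package is installed and parses a made-up page; the div with class "speech" is a hypothetical selector, and a real page would need its own, found by inspecting its HTML.

```python
from bs4 import BeautifulSoup

# Made-up page standing in for the raw HTML that wget would download.
SAMPLE_HTML = """
<html><head><title>Remarks</title><script>var x = 1;</script></head>
<body>
  <div id="nav"><a href="/">Home</a></div>
  <div class="speech">
    <p>Good afternoon, everybody.</p>
    <p>Thank you for being here.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Hypothetical selector: keep only the element that holds the speech text.
speech = soup.find("div", class_="speech")
paragraphs = [p.get_text(strip=True) for p in speech.find_all("p")]

text = "\n".join(paragraphs)
print(text)
```

Only the two paragraphs inside the selected element survive; the navigation link, the title, and the script are left behind.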
Before turning to this example, please note the following web-scraping libraries in Python:
Essential Tools for Webscraping
Advanced Webscraping Tools
Selenium (For Advanced Web-scraping)
The code for this example is available on Github. Please clone the following:
To launch Jupyter, go to your Shell and type:
jupyter notebook
This will launch your web browser and Jupyter from the location where you run the above command. It is recommended you do this immediately after the above git clone to open Jupyter in the correct location. You will have the option to open or navigate to the tutorial notebook, or to start a new one.
Open the folder Webscraping_and_APIs_in_Python and open the notebook Elementary_Web_Scraping.ipynb.
If you cannot launch the notebook, you can view the HTML version here.
This work is licensed under the CC BY-NC 4.0 Creative Commons License.