I’ve recently had to perform some web scraping from a site that required login.It wasn’t very straight forward as I expected so I’ve decided to write a tutorial for it.
Web scraping, web crawling, html scraping, and any other form of web data extraction can be complicated. Between obtaining the correct page source, to parsing the source correctly, rendering javascript, and obtaining data in a usable form, there’s a lot of work to be done. ParseHub helps you develop web scrapers to crawl single and various websites with the assistance for JavaScript, AJAX, cookies, sessions, and switches using their desktop application and deploy them to their cloud service.
For this tutorial we will scrape a list of projects from our bitbucket account.
The code from this tutorial can be found on my Github.
We will perform the following steps:
- Extract the details that we need for the login
- Perform login to the site
- Scrape the required data
For this tutorial, I’ve used the following packages (can be found in the requirements.txt):
Open the login page
Go to the following page “bitbucket.org/account/signin” .You will see the following page (perform logout in case you’re already logged in)
Sites For Web Scraping Jobs
Check the details that we need to extract in order to login
In this section we will build a dictionary that will hold our details for performing login:
- Right click on the “Username or email” field and select “inspect element”. We will use the value of the “name” attribue for this input which is “username”. “username” will be the key and our user name / email will be the value (on other sites this might be “email”, “user_name”, “login”, etc.).
- Right click on the “Password” field and select “inspect element”. In the script we will need to use the value of the “name” attribue for this input which is “password”. “password” will be the key in the dictionary and our password will be the value (on other sites this might be “user_password”, “login_password”, “pwd”, etc.).
- In the page source, search for a hidden input tag called “csrfmiddlewaretoken”. “csrfmiddlewaretoken” will be the key and value will be the hidden input value (on other sites this might be a hidden input with the name “csrf_token”, “authentication_token”, etc.). For example “Vy00PE3Ra6aISwKBrPn72SFml00IcUV8”.
We will end up with a dict that will look like this:
Keep in mind that this is the specific case for this site. While this login form is simple, other sites might require us to check the request log of the browser and find the relevant keys and values that we should use for the login step.
For this script we will only need to import the following:
First, we would like to create our session object. This object will allow us to persist the login session across all our requests.
Second, we would like to extract the csrf token from the web page, this token is used during login.For this example we are using lxml and xpath, we could have used regular expression or any other method that will extract this data.
** More about xpath and lxml can be found here.
Next, we would like to perform the login phase.In this phase, we send a POST request to the login url. We use the payload that we created in the previous step as the data.We also use a header for the request and add a referer key to it for the same url.
Now, that we were able to successfully login, we will perform the actual scraping from bitbucket dashboard page
In order to test this, let’s scrape the list of projects from the bitbucket dashboard page.Again, we will use xpath to find the target elements and print out the results. If everything went OK, the output should be the list of buckets / project that are in your bitbucket account.
You can also validate the requests results by checking the returned status code from each request.It won’t always let you know that the login phase was successful but it can be used as an indicator.
for example:
That’s it.
Full code sample can be found on Github.
- Python Web Scraping Tutorial
- Python Web Scraping Resources
- Selected Reading
In the previous chapter, we have seen scraping dynamic websites. In this chapter, let us understand scraping of websites that work on user based inputs, that is form based websites.
Introduction
These days WWW (World Wide Web) is moving towards social media as well as usergenerated contents. So the question arises how we can access such kind of information that is beyond login screen? For this we need to deal with forms and logins.
In previous chapters, we worked with HTTP GET method to request information but in this chapter we will work with HTTP POST method that pushes information to a web server for storage and analysis.
Interacting with Login forms
While working on Internet, you must have interacted with login forms many times. They may be very simple like including only a very few HTML fields, a submit button and an action page or they may be complicated and have some additional fields like email, leave a message along with captcha for security reasons.
In this section, we are going to deal with a simple submit form with the help of Python requests library.
First, we need to import requests library as follows −
Now, we need to provide the information for the fields of login form.
In next line of code, we need to provide the URL on which action of the form would happen.
After running the script, it will return the content of the page where action has happened.
Suppose if you want to submit any image with the form, then it is very easy with requests.post(). You can understand it with the help of following Python script −
Loading Cookies from the Web Server
A cookie, sometimes called web cookie or internet cookie, is a small piece of data sent from a website and our computer stores it in a file located inside our web browser.
In the context of dealings with login forms, cookies can be of two types. One, we dealt in the previous section, that allows us to submit information to a website and second which lets us to remain in a permanent “logged-in” state throughout our visit to the website. For the second kind of forms, websites use cookies to keep track of who is logged in and who is not.
What do cookies do?
Sites For Webscraping
These days most of the websites are using cookies for tracking. We can understand the working of cookies with the help of following steps −
Step 1 − First, the site will authenticate our login credentials and stores it in our browser’s cookie. This cookie generally contains a server-generated toke, time-out and tracking information.
Step 2 − Next, the website will use the cookie as a proof of authentication. This authentication is always shown whenever we visit the website.
Cookies are very problematic for web scrapers because if web scrapers do not keep track of the cookies, the submitted form is sent back and at the next page it seems that they never logged in. It is very easy to track the cookies with the help of Python requests library, as shown below −
In the above line of code, the URL would be the page which will act as the processor for the login form.
After running the above script, we will retrieve the cookies from the result of last request.
There is another issue with cookies that sometimes websites frequently modify cookies without warning. Such kind of situation can be dealt with requests.Session() as follows −
In the above line of code, the URL would be the page which will act as the processor for the login form.
Observe that you can easily understand the difference between script with session and without session.
Automating forms with Python
In this section we are going to deal with a Python module named Mechanize that will reduce our work and automate the process of filling up forms.
Mechanize module
Mechanize module provides us a high-level interface to interact with forms. Before starting using it we need to install it with the following command −
Note that it would work only in Python 2.x.
Example
In this example, we are going to automate the process of filling a login form having two fields namely email and password −
The above code is very easy to understand. First, we imported mechanize module. Then a Mechanize browser object has been created. Then, we navigated to the login URL and selected the form. After that, names and values are passed directly to the browser object.