
How to Scrape Data That Requires a Login

In our last tutorial, we looked into request headers and cookies and their role when you scrape data. So, what’s the next problem you could encounter when scraping? Sometimes, you might set your sights on scraping data you can access only after you log into an account. It could be your channel analytics, your user history, or any other type of information you need. In this case, first check if the company provides an API for the purpose. If it does, that should always be your first choice. If it doesn’t, you can still scrape the data yourself; after all, the browser has access to the same tools when it comes to a request as we do.

Important Disclaimer

Information that requires a login to access is generally not public. This means that distributing it or using it for commercial purposes without permission may be a legal violation. So, always make sure to check the legality of your actions first. With that out of the way, let’s walk through the steps to get past the login and scrape data.

Depending on the popularity and security measures of the website you are trying to access, signing in can be anywhere between ‘relatively easy’ and ‘very hard’. In most cases, though, it exhibits the following flow. First, when you press the ‘sign in’ button, you are redirected to a log-in page. This page contains a simple HTML form to prompt for ‘username’ (or ‘email’) and ‘password’. When filled out, a POST request, containing the form data, is sent to some URL. The server then processes the data and checks its validity. In case the credentials are correct, most of the time a couple of redirects are chained to finally lead us to some account homepage of sorts.

There are a couple of hidden details here, though. First, although the user is asked to fill out only the email and password, the form sends additional data to the server. This data often includes some “authenticity token” which signals that this login attempt is legitimate, and it may or may not be required for a successful login. The other detail is related to the cookies we mentioned last time. If we successfully sign into our account, client-side cookies are set. Those should be included in each subsequent request we submit. That way, the server knows that we are still logged in and can send us the correct page containing sensitive info.

The first piece of the puzzle is to find out where the POST request is sent to and the format of the data. You can either infer that information from the HTML or intercept the requests our browser submits. The majority of login forms are written using the HTML tag ‘form’. The URL of the request can be found in an attribute called ‘action’, whereas the parameter fields are contained in the ‘input’ tags. This is important because the hidden parameters will also be placed in input tags and thus can be obtained.

Another important piece of information is the name of the input field. As trivial as it may seem, we don’t have that knowledge a priori. For example, think about the username. What should that parameter be called? Well, it might be simply ‘userName’, or it could be called ‘email’, maybe ‘user’. There are many different options, so we should check the one employed by the developers through the ‘name’ attribute.

This information can also be obtained by intercepting the browser requests and inspecting them. We do that with the help of the Developer tools. Specifically, in the Chrome developer tools, there is a ‘Network’ tab that records all requests and responses. Thus, all we need to do is fill in our details and log in while the Network tab is open. However, bear in mind that the request could be buried in a list of many others, because of all the redirects and the subsequent fetching of all resources on the page. The original request should be there somewhere with all request and response headers visible, as well as the form data.

Now that we’ve got the URL and form details, we can make the POST request. The data can be sent as a regular dictionary. Don’t worry about the subsequent redirects - the requests library deals with that automatically, by default. But what if we want to then open another page while logged in? Well, we need to set our cookies in advance first. That means we have to take advantage of requests’ sessions.
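The HTML-inspection route described above can be sketched with Python’s standard library alone. The sample form below is invented for illustration; on a real site you would first fetch the login page (for example with requests.get) and feed its HTML to the parser instead.

```python
from html.parser import HTMLParser

# Invented sample login page; a real one would be downloaded first.
SAMPLE_PAGE = """
<form action="/sessions" method="post">
  <input type="hidden" name="authenticity_token" value="abc123">
  <input type="text" name="email">
  <input type="password" name="password">
</form>
"""

class LoginFormParser(HTMLParser):
    """Collect the form's 'action' URL and the 'name' of every input tag."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            # Hidden inputs (like the authenticity token) show up here too.
            self.input_names.append(attrs["name"])

parser = LoginFormParser()
parser.feed(SAMPLE_PAGE)
print(parser.action)       # /sessions
print(parser.input_names)  # ['authenticity_token', 'email', 'password']
```

Note that the hidden authenticity token appears in the same list as the visible fields, which is exactly why inspecting the input tags (rather than the rendered page) matters.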

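Putting the pieces together, the login itself can be sketched with a requests session. The URL and field names below (‘https://example.com/login’, ‘email’, ‘password’, ‘authenticity_token’) are placeholders, not a real site’s values: substitute the ‘action’ URL and input names you found in the form or in the Network tab.

```python
import requests

# A Session remembers cookies across requests, so anything the server
# sets at login is attached to every later request automatically.
session = requests.Session()

payload = {
    "email": "user@example.com",         # hypothetical field name
    "password": "hunter2",               # hypothetical field name
    "authenticity_token": "abc123",      # hidden value scraped from the form
}

# In a real run, uncomment these. Redirects after a successful login are
# followed by default, so the response is the final page in the chain.
# response = session.post("https://example.com/login", data=payload)
# profile = session.get("https://example.com/account")

# Cookies can also be set in advance, if you already have a valid one:
session.cookies.set("sessionid", "manually-obtained-value")
```

Using one Session object for both the POST and every subsequent GET is what keeps you “logged in” from the server’s point of view.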