Python Web Scraping Get Text



Introduction

We'll cover how to use Headless Chrome for web scraping Google Places. Google Places doesn't strictly require JavaScript, because Google serves a different response when JavaScript is disabled. But for better user emulation when browsing and scraping Google Places, a real browser is recommended.

Headless Chrome is essentially the Chrome browser running without a head (no graphical user interface). The benefit is that you can run a headless browser in a server environment that has no graphical interface attached and is normally accessed through shell access. Running headless can also be faster and lighter on system resources.

Controlling a browser


We need a way to control the browser with code. This can be done through the Chrome DevTools Protocol, or CDP. CDP is essentially a WebSocket server running in the browser that speaks a JSON-RPC-style protocol. Instead of working with CDP directly, we'll use a library called pyppeteer, a Python client for CDP that provides an easier-to-use abstraction. It's a port of the Node library puppeteer.


Setting up

As usual with any of my Python projects, I recommend working in a virtual environment, which lets us manage dependencies and versions separately for each application or project. Let's create a virtual environment in our home directory and install the dependencies we need.

Make sure you are running at least Python 3.6.1; Python 3.5 has reached end of support. The pyppeteer library will not work with Python 3.6.0 because the websockets library it depends on does not support that version.

Let's create the following folders and files.
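The exact tree isn't reproduced here, but judging from the files referenced throughout this guide, a layout along these lines works (the __init__.py file is my assumption to make core importable as a package):

```
google-places/
    __main__.py
core/
    __init__.py
    browser.py
    utils.py
```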

We created a __main__.py file, which lets us run the Google Places scraper with the following command (nothing should happen right now):

Launching a headless browser

We need to launch a Chrome browser. By default, pyppeteer will download and use its own bundled version of Chromium. It's also possible to use Chrome instead, as long as it is installed on your system. The library makes use of async/await for concurrency, so we import Python's asyncio package.

To launch Chrome instead of Chromium, pass the executablePath option to the launch function. Below, we launch the browser, navigate to Google, and take a screenshot. The screenshot is saved in the folder you run the scraper from.
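A minimal sketch of that with pyppeteer (the screenshot filename is an arbitrary choice):

```python
import asyncio

from pyppeteer import launch


async def main():
    # pass executablePath='/path/to/chrome' to use Chrome instead of Chromium
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto('https://www.google.com')
    await page.screenshot(path='screenshot.png')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
```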

Digging in

Let's create some functions in core/browser.py to simplify working with the browser and the page. We'll make use of what I believe is an awesome Python feature for simplifying resource management: the context manager. Specifically, we will use an async context manager.

An asynchronous context manager is a context manager that is able to suspend execution in its enter and exit methods (__aenter__ and __aexit__).

This feature in Python lets us write code that handles opening and closing a browser with a single async with line.

Let's add the PageSession async context manager in the file core/browser.py.
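The original listing isn't shown here, but a minimal sketch of such a context manager might look like the following (returning the session itself from __aenter__ and the networkidle2 wait are my assumptions):

```python
# core/browser.py
from pyppeteer import launch


class PageSession:
    def __init__(self, url):
        self.url = url
        self.browser = None
        self.page = None

    async def __aenter__(self):
        # launch the browser, open a tab and navigate, waiting for the network to settle
        self.browser = await launch(headless=True)
        self.page = await self.browser.newPage()
        await self.page.goto(self.url, waitUntil='networkidle2')
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.browser.close()
```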

In our google-places/__main__.py file, let's make use of our new PageSession and print the HTML content of the final rendered page, with the JavaScript executed.
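A sketch of how __main__.py might use it (the GOOGLE constant is one of the string literals we'll tidy up later):

```python
# google-places/__main__.py
import asyncio

from core.browser import PageSession

GOOGLE = 'https://www.google.com'


async def main():
    async with PageSession(GOOGLE) as session:
        html = await session.page.content()
        print(html)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
```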

Run the google-places module in your terminal with the same command we used earlier.

So now we can launch a browser, open a page (a tab in Chrome), navigate to a website, wait for the JavaScript to finish loading and executing, and then close the browser, all with the code above.

Next let's do the following:

  • We want to visit google.com
  • Enter a search query for pediatrician near 94118
  • Click on google places to see more results
  • Scrape results from the page
  • Save results to a CSV file

Navigating pages

We want to step through the following page navigations so we can pull the data we need.

Let's start by breaking up our code in google-places/__main__.py so we can first search and then navigate to Google Places. We also want to clean up some of the string literals, like the Google URL.

We use XPath to find the search bar, the search button and the "view all" button that gets us to Google Places. The flow is as follows (a pyppeteer sketch follows the list):

  1. Type in the search bar
  2. Click the search button
  3. Wait for the "view all" button to appear
  4. Click the "view all" button to take us to Google Places
  5. Wait for an element on the new page to appear
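A rough pyppeteer sketch of that flow is below. The XPath expressions are illustrative assumptions; Google's markup changes often, so inspect the live page and adjust them.

```python
# hypothetical selectors -- inspect the live page and adjust as needed
SEARCH_BOX_XPATH = '//input[@name="q"]'
SEARCH_BUTTON_XPATH = '//input[@name="btnK"]'
VIEW_ALL_XPATH = '//span[contains(text(), "More places")]'
RESULT_XPATH = '//div[@role="article"]'


async def search(page, query):
    # steps 1-3: type the query, click search, wait for the "view all" button
    box = await page.waitForXPath(SEARCH_BOX_XPATH)
    await box.type(query)
    button = await page.waitForXPath(SEARCH_BUTTON_XPATH)
    await button.click()
    await page.waitForXPath(VIEW_ALL_XPATH)


async def go_to_places(page):
    # steps 4-5: click "view all", then wait for an element on the new page
    view_all = await page.waitForXPath(VIEW_ALL_XPATH)
    await view_all.click()
    await page.waitForXPath(RESULT_XPATH)
```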

Scraping the data with Pyppeteer

At this point we should be on the Google Places page, and we can pull the data we want. The navigation flow we followed is important for emulating a real user.

Let's define the data we want to pull from the page.

  • Name
  • Location
  • Phone
  • Rating
  • Website Link

In core/browser.py let's add two methods to our PageSession to help us grab the text and an attribute (the website link for the doctor).

So we added get_text and get_link. These two methods evaluate JavaScript in the browser, the same way you would if you typed it into the Chrome console. You can see that they just use the DOM to grab the element's text or its href attribute.
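A minimal sketch of what those two methods might look like on PageSession, using page.xpath to locate the element and page.evaluate to read its text or href:

```python
# core/browser.py (additions to PageSession)
class PageSession:
    # ... __init__, __aenter__ and __aexit__ as before ...

    async def get_text(self, xpath):
        elements = await self.page.xpath(xpath)
        if not elements:
            return None
        return await self.page.evaluate('(el) => el.textContent', elements[0])

    async def get_link(self, xpath):
        elements = await self.page.xpath(xpath)
        if not elements:
            return None
        return await self.page.evaluate('(el) => el.href', elements[0])
```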

In google-places/__main__.py we will add a few functions that will grab the content that we care about from the page.

We make use of XPath to grab the elements. You can practice XPath in your Chrome browser by pressing F12, or by right-clicking and choosing Inspect, to open the developer console. Why do I use XPath? It's easier to specify complex selectors, because XPath has built-in functions for things like finding elements that contain some text or traversing the tree in various ways.

For the phone, rating and link fields we default to None and substitute 'N/A', because not all doctors have a phone number, a rating or a link listed. All of them seem to have a location and a name.

Because there are many doctors listed on the page, we want to find the parent element and loop over each match, then evaluate the XPath we defined above. To do this, let's add two more functions to tie it all together.

The entry point here is scrape_doctors which evaluates get_doctor_details on each container element.
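A sketch of get_doctor_details. The relative XPath expressions are placeholders (Google obfuscates its class names), and the small _first_text helper is my own shorthand rather than the article's exact code:

```python
# hypothetical relative XPaths evaluated inside one result container
NAME_XPATH = './/div[@role="heading"]'
LOCATION_XPATH = './/span[@class="address"]'
PHONE_XPATH = './/span[@class="phone"]'
RATING_XPATH = './/span[@class="rating"]'
WEBSITE_XPATH = './/a[starts-with(@href, "http")]'


async def _first_text(session, container, xpath, default='N/A'):
    # return the text of the first relative match, or a default when missing
    matches = await container.xpath(xpath)
    if not matches:
        return default
    return await session.page.evaluate('(el) => el.textContent', matches[0])


async def get_doctor_details(session, container):
    links = await container.xpath(WEBSITE_XPATH)
    website = 'N/A'
    if links:
        website = await session.page.evaluate('(el) => el.href', links[0])

    return {
        'name': await _first_text(session, container, NAME_XPATH),
        'location': await _first_text(session, container, LOCATION_XPATH),
        'phone': await _first_text(session, container, PHONE_XPATH),
        'rating': await _first_text(session, container, RATING_XPATH),
        'website': website,
    }
```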


In the code below, we loop over each container element that matched our XPath and call get_doctor_details on it. Because we don't use the await keyword, each call gives us back an awaitable (a coroutine object) rather than a result; the asyncio.gather call can then take the whole tasks list and evaluate everything in it.

This line allows us to wait for all async calls to finish concurrently.
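A sketch of scrape_doctors using that pattern (the container XPath is again a placeholder):

```python
import asyncio

CONTAINER_XPATH = '//div[@role="article"]'  # hypothetical: one element per doctor


async def scrape_doctors(session):
    containers = await session.page.xpath(CONTAINER_XPATH)

    # no await here: each call returns an awaitable that gather will schedule
    tasks = [get_doctor_details(session, container) for container in containers]

    # wait for all the detail lookups to finish concurrently
    return await asyncio.gather(*tasks)
```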

Let's put this together in our main function. First we search and crawl to the right page, then we scrape with scrape_doctors.
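Roughly, main ends up looking like this (search, go_to_places and the query string come from the sketches and steps above):

```python
async def main():
    async with PageSession(GOOGLE) as session:
        await search(session.page, 'pediatrician near 94118')
        await go_to_places(session.page)
        results = await scrape_doctors(session)
        print(results)
```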

Saving the output

In core/utils.py we'll add two functions to help us save our scraped output to a local CSV file.
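The helper names below are assumptions; this is a minimal sketch of the idea with csv.DictWriter, assuming scrape_doctors returns a list of dicts:

```python
# core/utils.py
import csv


def get_fieldnames(rows):
    # use the keys of the first row as the CSV header
    return list(rows[0].keys()) if rows else []


def save_csv(path, rows):
    # write the scraped dicts to a local CSV file
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=get_fieldnames(rows))
        writer.writeheader()
        writer.writerows(rows)
```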

Let's import it in google-places/__main__.py and save the output of scrape_doctors from our main function.

We should now have a file called pediatricians.csv which contains our output.

Wrapping up

From this guide we should have learned how to use a headless browser to crawl and scrape Google Places while emulating a real user. There's a lot more you can do with headless browsers, such as generating PDFs, taking screenshots and other automation tasks.

Hopefully this guide helped you get started executing JavaScript and scraping with a headless browser. Till next time!


This post will walk through how to use the requests_html package to scrape options data from a JavaScript-rendered webpage. requests_html serves as an alternative to Selenium and PhantomJS, and provides a clear syntax similar to the awesome requests package. The code we’ll walk through is packaged into functions in the options module in the yahoo_fin package, but this article will show how to write the code from scratch using requests_html so that you can use the same idea to scrape other JavaScript-rendered webpages.

Note:

requests_html requires Python 3.6+. If you don’t have requests_html installed, you can download it using pip:

Motivation

Let’s say we want to scrape options data for a particular stock. As an example, let’s look at Netflix (since it’s well known). If we go to Yahoo Finance’s options page for Netflix, we can see the option chain information for the earliest upcoming options expiration date.

On this webpage there’s a drop-down box allowing us to view data by other expiration dates. What if we want to get all the possible choices – i.e. all the possible expiration dates?

We can try using requests with BeautifulSoup, but that won’t work quite the way we want. To demonstrate, let’s try doing that to see what happens.
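For example, a first attempt with requests and BeautifulSoup might look like this (the Yahoo Finance NFLX options URL is assumed from the description above):

```python
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/NFLX/options'

resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'html.parser')

# the expiration dates live in <option> tags on the rendered page
option_tags = soup.find_all('option')
print(option_tags)
```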

Running the above code shows us that option_tags is an empty list. This is because there are no option tags in the HTML we scraped from the webpage above. However, if we look at the source via a web browser, we can see that there are, indeed, option tags.

Why the disconnect? The reason we see option tags when looking at the source code in a browser is that the browser executes JavaScript code that renders the HTML, i.e. it modifies the HTML of the page dynamically to allow a user to select one of the possible expiration dates. This means that if we just scrape the raw HTML, the JavaScript won’t be executed, and thus we won’t see the tags containing the expiration dates. This brings us to requests_html.

Using requests_html to render JavaScript

Now, let’s use requests_html to run the JavaScript code in order to render the HTML we’re looking for.

Similar to the requests package, we can use a session object to get the webpage we need. This gets stored in a response variable, resp. If you print out resp you should see <Response [200]>, which means the connection to the webpage was successful (otherwise you’ll get a different message).
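A minimal sketch, using the same assumed NFLX options URL:

```python
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get('https://finance.yahoo.com/quote/NFLX/options')
print(resp)  # <Response [200]> on success
```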

Running resp.html will give us an object that allows us to print out, search through, and perform several functions on the webpage’s HTML. To simulate running the JavaScript code, we use the render method on the resp.html object. Note how we don’t need to assign the rendered result to a variable; running the code below:
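```python
# render the page's JavaScript in place (downloads Chromium the first time it runs)
resp.html.render()
```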

stores the updated HTML as an attribute on resp.html. Specifically, we can access the rendered HTML like this:
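```python
rendered_html = resp.html.html
```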

So now resp.html.html contains the HTML we need, including the option tags. From here, we can parse out the expiration dates from these tags using the find method.
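Something along these lines:

```python
option_tags = resp.html.find('option')
expiration_dates = [tag.text for tag in option_tags]
print(expiration_dates)
```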

Similarly, if we wanted to search for other HTML tags we could just input whatever those are into the find method e.g. anchor (a), paragraph (p), header tags (h1, h2, h3, etc.) and so on.

Alternatively, we could also use BeautifulSoup on the rendered HTML (see below). However, the awesome point here is that we can create the connection to this webpage, render its JavaScript, and parse out the resultant HTML all in one package!
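The BeautifulSoup variant on the rendered HTML would look something like this:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.html.html, 'html.parser')
option_tags = soup.find_all('option')
expiration_dates = [tag.text for tag in option_tags]
```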

Lastly, we could scrape this particular webpage directly with yahoo_fin, which provides functions that wrap around requests_html specifically for Yahoo Finance’s website.
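If I recall the yahoo_fin API correctly, the expiration dates come from its options module (treat the function name as an assumption if your version differs):

```python
from yahoo_fin import options

expiration_dates = options.get_expiration_dates('NFLX')
```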

Scraping options data for each expiration date

Once we have the expiration dates, we could proceed with scraping the data associated with each date. In this particular case, the pattern of the URL for each expiration date’s data requires the date be converted to Unix timestamp format. This can be done using the pandas package.
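For example, converting one expiration date with pandas (the query-string pattern in the comment is an assumption about Yahoo’s URL scheme):

```python
import pandas as pd

date = '2024-01-19'  # an example expiration date
unix_ts = int(pd.Timestamp(date).timestamp())

# the per-expiration page is assumed to look like:
# https://finance.yahoo.com/quote/NFLX/options?date=<unix_ts>
print(unix_ts)
```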

Similarly, we could scrape this data using yahoo_fin. In this case, we just input the ticker symbol, NFLX and associated expiration date into either get_calls or get_puts to obtain the calls and puts data, respectively.


Note: here we don’t need to convert each date to a Unix timestamp as these functions will figure that out automatically from the input dates.
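A sketch using the two functions named above (the date shown is an arbitrary example, in one of the formats yahoo_fin accepts in my experience):

```python
from yahoo_fin import options

calls = options.get_calls('NFLX', '01/19/2024')
puts = options.get_puts('NFLX', '01/19/2024')
```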

That’s it for this post! To learn more about requests-html, check out my web scraping course on Udemy here!


To see the official documentation for requests_html, click here.

