Since so much work on a computer involves going on the Internet, it’d be great if your programs could get online. Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.
webbrowser
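The webbrowser module's open() function launches a new browser tab to a given URL. A minimal sketch (the URL is just an example):

```python
import webbrowser

# open() launches the URL in the user's default web browser and returns
# True if it was able to find and start a browser.
opened = webbrowser.open('https://inventwithpython.com/')
```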
This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible. For example, it’s tedious to copy a street address to the clipboard and bring up a map of it on Google Maps. You could take a few steps out of this task by writing a simple script to automatically launch the map in your browser using the contents of your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.
This is what your program does:
Gets a street address from the command line arguments or clipboard.
Opens the web browser to the Google Maps page for the address.
This means your code will need to do the following:
Read the command line arguments from sys.argv.
Read the clipboard contents.
Call the webbrowser.open() function to open the web browser.
Open a new file editor window and save it as mapIt.py.
The address is in the URL, but there’s a lot of additional text there as well. Websites often add extra data to URLs to help track visitors or customize sites. But if you try going to just the first part of the URL, up through the address, you’ll find that it still brings up the correct page. So your program can be set to open a web browser to 'https://www.google.com/maps/place/your_address_string' (where your_address_string is the address you want to map).
Compare this to mapping the address by hand:
Click the address text field.
Paste the address.
Press ENTER.
See how mapIt.py makes this task less tedious?
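Putting the steps together, mapIt.py can be sketched as follows. The URL-building helper is pulled out for clarity, and pyperclip is a third-party module you would install separately:

```python
#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import sys
import webbrowser

def build_maps_url(address):
    # Google Maps accepts the address appended to this base URL;
    # the browser will percent-encode any spaces.
    return 'https://www.google.com/maps/place/' + address

def main():
    if len(sys.argv) > 1:
        # Get the address from the command line arguments.
        address = ' '.join(sys.argv[1:])
    else:
        # Get the address from the clipboard (requires pyperclip).
        import pyperclip
        address = pyperclip.paste()
    webbrowser.open(build_maps_url(address))

# In the actual script you would run main() at the bottom:
# if __name__ == '__main__':
#     main()
```

You would then run it as, for example, `python mapIt.py 870 Valencia St, San Francisco, CA` or with the address already on the clipboard.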
To maintain the Unicode encoding of the text, you must open the file in write binary mode. (Unicode encodings are beyond the scope of this chapter; Ned Batchelder’s talk “Pragmatic Unicode” is a good introduction.)
To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.
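A sketch of that download-and-save pattern (the URL and filename in the usage note are just examples):

```python
import requests

def download_to_file(url, path):
    # Download the page; raise_for_status() stops the program with an
    # exception if the download failed.
    res = requests.get(url)
    res.raise_for_status()
    # Open the file in write-binary mode to preserve the Unicode encoding
    # of the text, then write it out in 100,000-byte chunks.
    with open(path, 'wb') as out_file:
        for chunk in res.iter_content(100000):
            out_file.write(chunk)

# Usage:
# download_to_file('https://automatetheboringstuff.com/files/rj.txt',
#                  'RomeoAndJuliet.txt')
```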
What if you’re interested in scraping the temperature information for that ZIP code? Right-click where it is on the page (or CONTROL-click on OS X) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page, with the temperature element highlighted.
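Once you know the element's class from the developer tools, Beautiful Soup can pull it out with a CSS selector. A sketch using stand-in HTML (the class name here is an assumption for illustration; use whatever the real page shows):

```python
import bs4

# Simplified stand-in for the weather page's markup.
html = '<div><p class="myforecast-current-lrg">59°F</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# select() returns a list of all elements matching the CSS selector.
temp_elem = soup.select('.myforecast-current-lrg')[0]
print(temp_elem.getText())  # prints the temperature text, 59°F
```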
The requests module can download this page, and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you’ll use the webbrowser module to open those links in browser tabs. Make your code look like the following:
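A sketch of that program, using a PyPI search as the example site. The search URL and the CSS selector for result links are assumptions you would adjust for whatever site you’re scraping:

```python
import webbrowser

import bs4
import requests

def open_search_results(query, max_results=5):
    # Download the search results page.
    res = requests.get('https://pypi.org/search/?q=' + query)
    res.raise_for_status()
    # Find each result link (this selector is an assumption for this site).
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    link_elems = soup.select('a.package-snippet')
    # Open the top results, each in a new browser tab.
    for elem in link_elems[:max_results]:
        webbrowser.open('https://pypi.org' + elem.get('href'))
```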
The front page at https://xkcd.com/ has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes. Here’s what your program does:
Loads the XKCD home page.
Saves the comic image on that page.
Follows the Previous Comic link.
Repeats until it reaches the first comic.
Make your code look like the following:
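A sketch of the downloader; the CSS selectors for the comic image and the Prev button reflect XKCD’s page structure and may need adjusting if the site changes. A page limit is added here so the sketch doesn’t walk through every comic:

```python
import os

import bs4
import requests

def download_xkcd(start_url='https://xkcd.com', limit=10):
    """Walk backward from start_url, saving each comic image to ./xkcd."""
    os.makedirs('xkcd', exist_ok=True)
    url = start_url
    for _ in range(limit):
        # Download the page.
        print('Downloading page %s...' % url)
        res = requests.get(url)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, 'html.parser')

        # Find the URL of the comic image and download it.
        comic_elem = soup.select('#comic img')
        if comic_elem:
            comic_url = 'https:' + comic_elem[0].get('src')
            res = requests.get(comic_url)
            res.raise_for_status()
            image_path = os.path.join('xkcd', os.path.basename(comic_url))
            with open(image_path, 'wb') as image_file:
                for chunk in res.iter_content(100000):
                    image_file.write(chunk)

        # Follow the Prev button; on the first comic it points to '#'.
        prev_link = soup.select('a[rel="prev"]')[0]
        if prev_link.get('href') == '#':
            break
        url = 'https://xkcd.com' + prev_link.get('href')
```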
Your browser should start up and open the page under your script’s control.
The find_element_* and find_elements_* methods are called on a WebDriver object that’s stored in the variable browser. The find_element_* methods return a single WebElement object, representing the first element on the page that matches the query, while the find_elements_* methods return a list of WebElement objects for every matching element.
The find_element_by_link_text() call returns a WebElement object for the <a> element with the text Read It Online, and calling click() on it simulates clicking that <a> element. It’s just like if you clicked the link yourself; the browser then follows that link. The selenium.webdriver.common.keys module contains Keys variables for keyboard keys that are impossible to type into a string value, such as Keys.ENTER, Keys.END, and Keys.HOME.