Book: Automate the Boring Stuff with Python: Practical Programming for Total Beginners

Since so much work on a computer involves going on the Internet, it’d be great if your programs could get online. Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.

  • webbrowser. The module’s open() function launches a browser at a given URL. This is about the only thing the webbrowser module can do. Even so, the open() function does make some interesting things possible. For example, it’s tedious to copy a street address to the clipboard and bring up a map of it on Google Maps. You could take a few steps out of this task by writing a simple script that automatically launches the map in your browser using the contents of your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.

    This is what your program does:

    • Gets a street address from the command line arguments or clipboard.

    • Opens the web browser to the Google Maps page for the address.

    This means your code will need to do the following:

    • Read the command line arguments from sys.argv.

    • Read the clipboard contents.

    • Call the webbrowser.open() function to open the web browser.

    Open a new file editor window and save it as mapIt.py.

    If you go to Google Maps in the browser and search for an address, the URL in the address bar contains that address.

    The address is in the URL, but there’s a lot of additional text there as well. Websites often add extra data to URLs to help track visitors or customize sites. But if you try going to just https://www.google.com/maps/place/ followed by the address, you’ll find that it still brings up the correct page. So your program can be set to open a web browser to 'https://www.google.com/maps/place/your_address_string' (where your_address_string is the address you want to map).
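The steps above can be sketched roughly as follows. This is a minimal sketch, not the book's own listing: the helper names (read_clipboard, build_maps_url, open_map) are made up for illustration, reading the clipboard assumes the third-party pyperclip package is installed, and percent-encoding the address with urllib.parse.quote is an extra precaution rather than something the text requires.

```python
import sys
import webbrowser
from urllib.parse import quote

MAPS_PREFIX = 'https://www.google.com/maps/place/'


def read_clipboard():
    # Assumption: the third-party pyperclip package is installed.
    # It is imported here so the rest of the sketch works without it.
    import pyperclip
    return pyperclip.paste()


def build_maps_url(argv, clipboard_reader=read_clipboard):
    """Return the Google Maps URL for the address in argv or on the clipboard."""
    if len(argv) > 1:
        # Command line arguments arrive as separate words; join them back up.
        address = ' '.join(argv[1:])
    else:
        # No arguments given, so fall back to the clipboard contents.
        address = clipboard_reader()
    # Percent-encode spaces and punctuation so the URL is well formed.
    return MAPS_PREFIX + quote(address.strip())


def open_map(argv):
    """Open the default web browser to the map for the address."""
    webbrowser.open(build_maps_url(argv))


# To use this as a script, call: open_map(sys.argv)
```

Keeping the URL construction in its own function means it can be checked without launching a browser or touching the clipboard.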

    Displaying a map without mapIt.py involves several manual steps, including these:

    • Click the address text field.

    • Paste the address.

    • Press ENTER.

    See how mapIt.py makes this task less tedious?

  • Pragmatic Unicode:

To write the web page to a file, you can use a for loop with the Response object’s iter_content() method.
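A minimal sketch of that loop might look like the following. The function names and the 100,000-byte chunk size are illustrative choices, and the download() helper assumes the third-party requests package is installed; save_response() itself only needs an object with an iter_content() method.

```python
def save_response(response, path, chunk_size=100000):
    """Write a downloaded response body to path, one chunk at a time."""
    bytes_written = 0
    # Binary mode preserves the downloaded bytes exactly as received.
    with open(path, 'wb') as out_file:
        for chunk in response.iter_content(chunk_size):
            # write() returns the number of bytes it wrote.
            bytes_written += out_file.write(chunk)
    return bytes_written


def download(url, path):
    """Download url and save it to path (assumes requests is installed)."""
    import requests  # third-party; imported here so save_response() works without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early if the download failed
    return save_response(response, path)
```

Writing in chunks keeps memory use low even for large downloads, since only one chunk is held at a time.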


  • The HTML source of a web page is the text your browser actually receives. The browser knows how to display, or render, the web page from this HTML.

    Pressing F12 brings up the browser’s developer tools. Pressing F12 again will make the developer tools disappear. In Chrome, you can also bring up the developer tools by selecting View▸Developer▸Developer Tools. In OS X, pressing ⌘-OPTION-I opens Chrome’s developer tools.

    Before writing any code, do a little research. If you visit the site and search for the 94105 ZIP code, the site will take you to a page showing the forecast for that area.

    What if you’re interested in scraping the temperature information for that ZIP code? Right-click where it is on the page (or CONTROL-click on OS X) and select Inspect Element from the context menu that appears. This will bring up the Developer Tools window, which shows you the HTML that produces this particular part of the web page, in this case the HTML of the temperature.

    Selectors are a large topic, but a short introduction to the most common CSS selector patterns is enough to get started.
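As a hedged illustration of common selector patterns, the sketch below runs a few of them through Beautiful Soup's select() method. The HTML snippet, its ids, and its class names are invented for this example and are not taken from any real forecast page; the code assumes the third-party bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Invented sample markup, standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
<div id="forecast">
  <p class="temp">59 degrees</p>
  <p class="note">Partly <span>cloudy</span></p>
</div>
<input name="zip" type="text">
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

by_tag = soup.select('p')                       # all <p> elements
by_id = soup.select('#forecast')                # the element with id="forecast"
by_class = soup.select('.temp')                 # elements with class="temp"
descendants = soup.select('div span')           # <span> elements anywhere inside a <div>
children = soup.select('div > p')               # <p> elements directly inside a <div>
with_attr = soup.select('input[name]')          # <input> elements that have a name attribute
with_value = soup.select('input[type="text"]')  # <input> elements whose type is "text"

# select() returns a list of Tag objects; get_text() gives the inner text.
temperature = by_class[0].get_text()
```

Each call returns a list, so an empty list (rather than an error) signals that nothing matched.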

    The requests module can download the search results page, and then you can use Beautiful Soup to find the search result links in the HTML. Finally, you’ll use the webbrowser module to open those links in browser tabs.

    Make your code look like the following:
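The book's own listing is not reproduced here. As a rough stand-in, the sketch below shows one way the three modules could fit together. The function names are hypothetical, and the search URL and CSS selector are left as parameters because the right values depend on the search site's current HTML; it assumes the third-party requests and bs4 packages are installed.

```python
import webbrowser
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_result_links(html, selector, limit=5):
    """Return up to limit href values from the elements matching selector."""
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = [tag.get('href') for tag in soup.select(selector)]
    # Skip matching elements that have no href attribute.
    return [href for href in hrefs if href][:limit]


def open_results(search_url, selector, limit=5):
    """Download a results page and open its top links in browser tabs."""
    import requests  # third-party; imported here so the parser works without it
    response = requests.get(search_url, timeout=30)
    response.raise_for_status()
    for link in extract_result_links(response.text, selector, limit):
        # Result links are often relative, so resolve them against the page URL.
        webbrowser.open(urljoin(search_url, link))
```

Separating the parsing step from the download makes it easy to try selectors against a saved copy of the page before opening any tabs.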

    The XKCD front page has a Prev button that guides the user back through prior comics. Downloading each comic by hand would take forever, but you can write a script to do this in a couple of minutes.

    Here’s what your program does:

    • Loads the XKCD home page.

    • Saves the comic image on that page.

    • Follows the Previous Comic link.

    • Repeats until it reaches the first comic.

    The script can stop when the Previous Comic link points to a URL indicating that there are no more previous pages.

    Make your code look like the following:
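The book's own listing is not reproduced here. The sketch below outlines the loop under several stated assumptions: that the comic image matches the selector '#comic img', that the Previous Comic link matches 'a[rel="prev"]', and that on the first comic this link's href ends with '#'. All three should be checked against the live page with the developer tools. The function name crawl_comics and the fetch_html parameter are made up for illustration, and the bs4 package is assumed to be installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def crawl_comics(start_url, fetch_html, max_pages=5):
    """Yield (page_url, image_url) pairs, following the previous-comic link.

    fetch_html is any function that takes a URL and returns that page's
    HTML as a string, for example a thin wrapper around requests.get().
    """
    url = start_url
    for _ in range(max_pages):
        soup = BeautifulSoup(fetch_html(url), 'html.parser')

        # Assumed selector for the comic image; image_url is None if absent.
        img = soup.select_one('#comic img')
        image_url = urljoin(url, img['src']) if img and img.get('src') else None
        yield url, image_url

        # Assumed selector for the Previous Comic link.
        prev_link = soup.select_one('a[rel="prev"]')
        if prev_link is None or not prev_link.get('href'):
            break
        if prev_link['href'].endswith('#'):
            break  # assumed marker for "no more previous pages"
        url = urljoin(url, prev_link['href'])
```

Saving each image is left out of the sketch; writing the chunks from iter_content() to a file, as described earlier in the chapter, would cover that step. The max_pages cap is a safety net so a wrong selector cannot turn the loop into an endless crawl.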


    The find_element_* and find_elements_* methods are called on a WebDriver object, here stored in the variable browser.

    The script opens the browser to a page, gets the WebElement object for the <a> element with the text Read It Online, and then simulates clicking that <a> element. It’s just as if you had clicked the link yourself; the browser then follows that link.

    The selenium.webdriver.common.keys module provides the commonly used Keys variables, such as those for the arrow keys.

    Write a program that opens the game in a browser and keeps sending up, right, down, and left keystrokes to automatically play the game.

