Python Web Scraping (Second Edition)

Katharine Jarmul, Richard Lawson

Last updated: 2021-07-09 19:43:08

Latest chapter: Summary

Cover Page

Title Page

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

Introduction to Web Scraping

When is web scraping useful?

Is web scraping legal?

Python 3

Background research

Checking robots.txt

Examining the Sitemap

Estimating the size of a website

Identifying the technology used by a website

Finding the owner of a website

Crawling your first website

Scraping versus crawling

Downloading a web page

Retrying downloads

Setting a user agent

Sitemap crawler

ID iteration crawler

Link crawlers

Advanced features

Parsing robots.txt

Supporting proxies

Throttling downloads

Avoiding spider traps

Final version

Using the requests library

Summary

Scraping the Data

Analyzing a web page

Three approaches to scrape a web page

Regular expressions

Beautiful Soup

Lxml

CSS selectors and your Browser Console

XPath Selectors

LXML and Family Trees

Comparing performance

Scraping results

Overview of Scraping

Adding a scrape callback to the link crawler

Summary

Caching Downloads

When to use caching?

Adding cache support to the link crawler

Disk Cache

Implementing DiskCache

Testing the cache

Saving disk space

Expiring stale data

Drawbacks of DiskCache

Key-value storage cache

What is key-value storage?

Installing Redis

Overview of Redis

Redis cache implementation

Compression

Testing the cache

Exploring requests-cache

Summary

Concurrent Downloading

One million web pages

Parsing the Alexa list

Sequential crawler

Threaded crawler

How threads and processes work

Implementing a multithreaded crawler

Multiprocessing crawler

Performance

Summary

Dynamic Content

An example dynamic web page

Reverse engineering a dynamic web page

Edge cases

Rendering a dynamic web page

PyQt or PySide

Debugging with Qt

Executing JavaScript

Website interaction with WebKit

Waiting for results

The Render class

Selenium

Selenium and Headless Browsers

Summary