更新时间:2021-07-09 19:43:08
coverpage
Title Page
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
Introduction to Web Scraping
When is web scraping useful?
Is web scraping legal?
Python 3
Background research
Checking robots.txt
Examining the Sitemap
Estimating the size of a website
Identifying the technology used by a website
Finding the owner of a website
Crawling your first website
Scraping versus crawling
Downloading a web page
Retrying downloads
Setting a user agent
Sitemap crawler
ID iteration crawler
Link crawlers
Advanced features
Parsing robots.txt
Supporting proxies
Throttling downloads
Avoiding spider traps
Final version
Using the requests library
Summary
Scraping the Data
Analyzing a web page
Three approaches to scrape a web page
Regular expressions
Beautiful Soup
Lxml
CSS selectors and your Browser Console
XPath Selectors
LXML and Family Trees
Comparing performance
Scraping results
Overview of Scraping
Adding a scrape callback to the link crawler
Caching Downloads
When to use caching?
Adding cache support to the link crawler
Disk Cache
Implementing DiskCache
Testing the cache
Saving disk space
Expiring stale data
Drawbacks of DiskCache
Key-value storage cache
What is key-value storage?
Installing Redis
Overview of Redis
Redis cache implementation
Compression
Exploring requests-cache
Concurrent Downloading
One million web pages
Parsing the Alexa list
Sequential crawler
Threaded crawler
How threads and processes work
Implementing a multithreaded crawler
Multiprocessing crawler
Performance
Dynamic Content
An example dynamic web page
Reverse engineering a dynamic web page
Edge cases
Rendering a dynamic web page
PyQt or PySide
Debugging with Qt
Executing JavaScript
Website interaction with WebKit
Waiting for results
The Render class
Selenium
Selenium and Headless Browsers