Python Web Scraping, 2/e
Tentative Chinese title: Python 網頁擷取,第二版

Katharine Jarmul, Richard Lawson

Product Description

Key Features

  • A hands-on guide to web scraping using Python with solutions to real-world problems
  • Create a number of different web scrapers in Python to extract information
  • Includes practical examples of using popular, well-maintained Python libraries for web scraping

Book Description

The internet contains the most useful set of data ever assembled, largely publicly accessible for free. However, this data is not easily reusable. It is embedded within the structure and style of websites and needs to be carefully extracted. Web scraping is becoming increasingly useful as a means to gather and make sense of the wealth of information available online.

This book is a guide to scraping data from websites with the latest features of Python 3.x. In the early chapters, you'll see how to extract data from static web pages. You'll learn to cache downloaded pages in databases and files to save time and reduce the load on servers. After covering the basics, you'll get hands-on practice building more sophisticated crawlers that use browsers and concurrent scrapers.
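To make the caching idea concrete, here is a minimal sketch of a file-based cache for downloaded HTML. It is not taken from the book: the names `DiskCache` and `cached_download` are hypothetical, and the `download` argument stands in for whatever function actually fetches a page.

```python
import hashlib
from pathlib import Path


class DiskCache:
    """Store downloaded HTML on disk, keyed by a hash of the URL (illustrative only)."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, url):
        # Hash the URL so that any URL maps to a safe, fixed-length filename.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url):
        # Return the cached HTML, or None if this URL has not been stored yet.
        path = self._path_for(url)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def set(self, url, html):
        self._path_for(url).write_text(html, encoding="utf-8")


def cached_download(url, cache, download):
    """Return HTML for url, calling download(url) only on a cache miss."""
    html = cache.get(url)
    if html is None:
        html = download(url)
        cache.set(url, html)
    return html
```

Passing the downloader in as a parameter keeps the cache independent of any particular HTTP library, so repeated runs of a crawler only hit the server for pages it has not seen before.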

You'll learn when and how to scrape data from a JavaScript-dependent website using PyQt and Selenium. You'll get a better understanding of how to submit forms on complex websites protected by CAPTCHA, and find out how to automate these actions with Python packages such as mechanize. You'll also learn how to create class-based scrapers with the Scrapy library and apply what you've learned to real websites.

By the end of the book, you will have explored testing websites with scrapers, remote scraping, best practices, working with images, and many other relevant topics.

What you will learn

  • Extract data from web pages with simple Python programming
  • Build a concurrent crawler to process web pages in parallel
  • Follow links to crawl a website
  • Extract features from the HTML
  • Cache downloaded HTML for reuse
  • Compare concurrent models to determine the fastest crawler
  • Find out how to parse JavaScript-dependent websites
  • Interact with forms and sessions
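As a rough illustration of the "follow links" and "extract features from the HTML" items above, the sketch below collects same-site links from a page using only the standard library. It is an assumption-laden example rather than the book's own code: `LinkExtractor` and `same_site_links` are hypothetical names, and a real crawler would also need downloading, rate limiting, and robots.txt handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def same_site_links(html, base_url):
    """Return unique links that stay on the same host as base_url, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    unique = []
    for link in parser.links:
        if urlparse(link).netloc == host and link not in unique:
            unique.append(link)
    return unique
```

A crawler could feed each downloaded page through `same_site_links` and add the results to its queue of URLs still to visit.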
