Web Scraping with Python

Richard Lawson


Product Description

Successfully scrape data from any website with the power of Python

About This Book

  • A hands-on guide to web scraping with real-life problems and solutions
  • Techniques to download and extract data from complex websites
  • Create a number of different web scrapers to extract information

Who This Book Is For

This book is aimed at developers who want to use web scraping for legitimate purposes. Prior programming experience with Python would be useful but is not essential. Anyone with general knowledge of programming languages should be able to pick up the book and understand the principles involved.

What You Will Learn

  • Extract data from web pages with simple Python programming
  • Build a threaded crawler to process web pages in parallel
  • Follow links to crawl a website
  • Cache downloads to reduce bandwidth use
  • Use multiple threads and processes to scrape faster
  • Learn how to parse JavaScript-dependent websites
  • Interact with forms and sessions
  • Solve CAPTCHAs on protected web pages
  • Discover how to track the state of a crawl
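
As a rough illustration of the link-following and caching items above, here is a minimal sketch using only the Python standard library. It is not taken from the book: the names `LinkParser`, `fetch`, and `crawl`, the `max_pages` limit, and the rule of staying under the seed URL's prefix are all assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page over the network (hypothetical default downloader)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, download=fetch, max_pages=10):
    """Breadth-first crawl from seed_url, staying under the seed's prefix.

    Returns a dict mapping each visited URL to its HTML. The dict doubles
    as an in-memory cache, so no URL is downloaded twice.
    """
    cache = {}
    queue = [seed_url]
    while queue and len(cache) < max_pages:
        url = queue.pop(0)
        if url in cache:
            continue  # already downloaded, reuse the cached copy
        html = download(url)
        cache[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(seed_url) and absolute not in cache:
                queue.append(absolute)
    return cache
```

Passing the downloader in as a parameter keeps the crawl logic separate from network access, so the same function can be tried against a dictionary of canned pages before it is pointed at a live site.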

In Detail

The Internet contains the most useful set of data ever assembled, much of it publicly accessible for free. However, this data is not easily reusable: it is embedded within the structure and style of websites and must be carefully extracted to be useful. Web scraping is becoming increasingly useful as a way to gather and make sense of the wealth of information available online. With a simple language like Python, you can extract information from complex websites using straightforward programming.

This book is the ultimate guide to using Python to scrape data from websites. The early chapters cover how to extract data from static web pages and how to use caching to manage the load on servers. After the basics, the book moves on to building a more sophisticated crawler with threads, along with more advanced topics. Learn step by step how to use Ajax URLs, employ the Firebug extension for monitoring, and scrape data indirectly. Discover further scraping details such as using a browser renderer, managing cookies, and submitting forms to extract data from complex websites protected by CAPTCHAs. The book wraps up by showing how to create high-level scrapers with the Scrapy library and how to apply what has been learned to real websites.
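
To give a sense of what downloading with threads can look like, the sketch below uses a thread pool from the standard library. It is only one possible approach and is not the book's implementation: `fetch`, `download_all`, and the `max_threads` parameter are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download the raw bytes of a page (hypothetical default downloader)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def download_all(urls, download=fetch, max_threads=4):
    """Download several URLs concurrently.

    Returns a dict mapping each URL to its body, or to the exception
    raised while downloading it, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Submit every download first so the threads can run in parallel.
        futures = {url: pool.submit(download, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as error:
                results[url] = error
    return results
```

Threads suit this job because each download spends most of its time waiting on the network. A real crawler would normally also pause between requests to the same server to avoid overloading it.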

Style and approach

This book is a hands-on guide with real-life examples and solutions, starting simple and becoming progressively more complex. Each chapter introduces a problem and then provides one or more possible solutions.

