Getting Structured Data from the Internet: Running Web Crawlers/Scrapers on a Big Data Production Scale
Tentative Chinese title: 從互聯網獲取結構化數據:在大數據生產規模上運行網路爬蟲/擷取工具

Patel, Jay M.

  • Publisher: Apress
  • Publication date: 2020-11-13
  • List price: $2,040
  • VIP price: $1,938 (5% off)
  • Language: English
  • Pages: 397
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1484265750
  • ISBN-13: 9781484265758
  • Related category: Big Data
  • Overseas import (requires separate checkout)

Product Description

Utilize web scraping at scale to quickly get unlimited amounts of free data available on the web into a structured format. This book teaches you to use Python scripts to crawl websites at scale, scrape data from HTML and JavaScript-enabled pages, and convert it into structured formats such as CSV, Excel, or JSON, or load it into a SQL database of your choice.
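
As a concrete taste of that workflow, here is a minimal sketch assuming the requests, beautifulsoup4, and lxml packages; the URL and CSS selectors are hypothetical placeholders, not examples from the book:

    import csv
    import json

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/articles", timeout=10)  # hypothetical URL
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "lxml")
    rows = []
    for item in soup.select("div.article"):  # hypothetical selector
        rows.append({
            "title": item.select_one("h2").get_text(strip=True),
            "link": item.select_one("a")["href"],
        })

    # Write the same records as both CSV and JSON.
    with open("articles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link"])
        writer.writeheader()
        writer.writerows(rows)

    with open("articles.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)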

This book goes beyond the basics of web scraping and covers advanced topics such as natural language processing (NLP) and text analytics, used to extract the names of people and places, email addresses, contact details, and more from a page at production scale, using distributed big data techniques on Amazon Web Services (AWS)-based cloud infrastructure. It also covers developing a robust data processing and ingestion pipeline on the Common Crawl corpus, a petabyte-scale web crawl data set publicly available on AWS's Registry of Open Data.
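
To make the Common Crawl workflow concrete, the sketch below looks up one URL in the public index and fetches its archived capture. It assumes the requests and warcio packages; the crawl ID and data endpoint shown are assumptions to verify against https://index.commoncrawl.org/:

    import io
    import json

    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Query the public Common Crawl index (the crawl ID is an assumption).
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2020-45-index"
    lines = requests.get(INDEX, params={"url": "example.com", "output": "json"},
                         timeout=30).text.strip().splitlines()
    record = json.loads(lines[0])

    # Fetch only this capture's byte range from the public data bucket.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    data = requests.get("https://data.commoncrawl.org/" + record["filename"],
                        headers={"Range": f"bytes={start}-{end}"},
                        timeout=30).content

    # Each fetched range is a self-contained gzipped WARC record.
    for warc in ArchiveIterator(io.BytesIO(data)):
        if warc.rec_type == "response":
            print(warc.content_stream().read()[:200])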

Getting Structured Data from the Internet also includes a step-by-step tutorial on deploying your own crawlers using a production web scraping framework (such as Scrapy) and dealing with real-world issues (such as CAPTCHA solving, proxy IP rotation, and more). Code used in the book is provided to help you understand the concepts in practice and to write your own web crawler to power your business ideas.
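
As a preview of the framework the tutorial uses, here is a minimal Scrapy spider sketch; the target site (a public scraping sandbox) and the selectors are illustrative, not taken from the book:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
        custom_settings = {
            "ROBOTSTXT_OBEY": True,   # stay polite by default
            "DOWNLOAD_DELAY": 1.0,    # crude rate limiting
        }

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination; Scrapy de-duplicates requests for us.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it can be run with scrapy runspider quotes_spider.py -o quotes.json to get structured output with no extra plumbing.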


What You Will Learn

  • Understand web scraping, its applications and uses, and how to avoid scraping altogether by hitting publicly available REST API endpoints to get data directly
  • Develop a web scraper and crawler from scratch using the lxml and Beautiful Soup libraries, and learn about scraping from JavaScript-enabled pages using Selenium
  • Use AWS-based cloud computing with EC2, S3, Athena, SQS, and SNS to analyze, extract, and store useful insights from crawled pages (see the boto3 sketch after this list)
  • Use SQL on PostgreSQL running on Amazon Relational Database Service (RDS) and on SQLite via SQLAlchemy (see the SQLAlchemy sketch after this list)
  • Review scikit-learn, Gensim, and spaCy to perform NLP tasks on scraped web pages, such as named entity recognition (see the spaCy sketch after this list), topic clustering (k-means, agglomerative clustering), topic modeling (LDA, NMF, LSI), topic classification (naive Bayes, gradient boosting classifier), and text similarity (cosine distance-based nearest neighbors)
  • Handle web archival file formats and explore Common Crawl open data on AWS
  • Illustrate practical applications of web crawl data by building a similar-websites tool and a technology profiler like builtwith.com
  • Write scripts to create a backlinks database at web scale, similar to Ahrefs.com, Moz.com, Majestic.com, etc., for search engine optimization (SEO), competitor research, and determining website domain authority and ranking
  • Use web crawl data to build a news sentiment analysis system or an alternative financial analysis covering stock market trading signals
  • Write a production-ready crawler in Python using the Scrapy framework, and deal with practical workarounds for CAPTCHAs, IP rotation, and more
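
For the AWS item above, here is a minimal boto3 sketch of pushing scraper output to S3; the bucket name and key are placeholders, and credentials are assumed to come from the environment or an attached IAM role:

    import boto3

    s3 = boto3.client("s3")
    s3.upload_file(
        Filename="articles.json",           # e.g., output from a scraper run
        Bucket="my-crawl-output-bucket",    # hypothetical bucket name
        Key="raw/2020-11-13/articles.json",
    )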

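For the SQL item, here is a minimal SQLAlchemy sketch, assuming SQLAlchemy 1.4+; the table and columns are illustrative, and swapping the connection URL is the only change needed to target PostgreSQL on RDS instead of SQLite:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Page(Base):
        __tablename__ = "pages"
        id = Column(Integer, primary_key=True)
        url = Column(String, unique=True)
        title = Column(String)

    # For RDS: create_engine("postgresql+psycopg2://user:pass@host/dbname")
    engine = create_engine("sqlite:///crawl.db")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(Page(url="https://example.com", title="Example Domain"))
        session.commit()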

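And for the NLP item, a minimal named entity recognition sketch with spaCy; it assumes the small English model has been installed with python -m spacy download en_core_web_sm, and the sample sentence is invented:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = ("Jay Patel founded Specrom Analytics and previously worked "
            "at the US Environmental Protection Agency.")

    for ent in nlp(text).ents:
        print(ent.text, ent.label_)   # e.g., PERSON, ORG
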
Who This Book Is For

The primary audience is data analysts and scientists with little to no exposure to real-world data processing challenges. The secondary audience is experienced software developers doing web-heavy data processing who need a primer. Finally, business owners and startup founders who need to know more about implementation, so they can better direct their technical team, will also benefit.

About the Author

Jay M. Patel is a software developer with over 10 years of experience in data mining, web crawling/scraping, machine learning, and natural language processing (NLP) projects. He is a co-founder and principal data scientist of Specrom Analytics, providing content, email, social marketing, and social listening products and services using web crawling/scraping and advanced text mining.

Jay worked at the US Environmental Protection Agency (EPA) for five years, where he designed workflows to crawl and extract useful insights from hundreds of thousands of documents that were part of regulatory filings from companies. He also led one of the first research teams within the agency to use Apache Spark-based workflows for cheminformatics and bioinformatics applications such as chemical similarity and quantitative structure-activity relationships. He developed recurrent neural networks and more advanced LSTM models in TensorFlow for chemical SMILES generation.

Jay graduated with a bachelor's degree in engineering from the Institute of Chemical Technology, University of Mumbai, India, and a Master of Science degree from the University of Georgia, USA. Jay serves as an editor of a publication titled Web Data Extraction and also blogs about personal projects, open source packages, and his experiences as a startup founder on his personal site, jaympatel.com.
