Instant Web Scraping with Java

Ryan Mitchell

Product Description

Build simple scrapers or vast armies of Java-based bots to untangle and capture the Web

Overview

  • Learn something new in an Instant! A short, fast, focused guide delivering immediate results
  • Get your Java environment set up and running
  • Gather clean, formatted web data into your own database
  • Learn how to work around crawler-resistant websites and legally subvert security measures
  • Use built-in Java features to perform parallel processing and distributed scraping
  • Build test cases for your own websites using JUnit

In Detail

Java is often thought of as a stuffy enterprise language, while web scraping is the often-murky domain of scripting languages. By combining the robustness and extensibility of Java with the flexibility and power of web scraping, we can create immensely useful tools that can solve very difficult problems.

Instant Web Scraping with Java will guide you, step by step, through setting up your Java environment. You will also learn how to write simple web scrapers and distributed networks of crawlers. Throughout the book, we will provide useful tips, out-of-the-box working code, and additional resources to build expert knowledge.

Instant Web Scraping with Java will teach you how to build your own web scrapers using real-world scraping examples that collect and store data from Wikipedia, public records data sites, IP address geolocation services, and more. You will learn how to run scrapers across multiple servers, run them in parallel, and subvert common methods of anti-scraper security used on modern websites. This book will also provide you with detailed step-by-step instructions, out-of-the-box working code, and expert pointers to further resources on key topics.
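
As a rough illustration of one step a simple scraper performs, the sketch below pulls link targets out of an HTML string using only the Java standard library. The class name, method name, and sample HTML are hypothetical and are not taken from the book. A real scraper would first fetch the HTML over the network and would probably use a proper HTML parser instead of a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; not code from the book.
public class LinkExtractor {
    // Matches href="..." inside anchor tags (double-quoted values only).
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every double-quoted href value found in anchor tags, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup standing in for a downloaded page (an assumption for this sketch).
        String html = "<html><body>"
            + "<a href=\"/wiki/Java\">Java</a>"
            + "<a class=\"ext\" href=\"https://example.com/\">Example</a>"
            + "</body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The regular expression deliberately handles only the common double-quoted form, which keeps the sketch short but means it will miss single-quoted or unquoted attribute values.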

Instant Web Scraping with Java will show you how to view and collect any Internet data at the speed of your processor!

What you will learn from this book

  • Set up your Java environment and work with the Eclipse IDE
  • Execute complicated web crawlers that run without intervention
  • Handle errors, write documentation, and write robust code
  • Log scraped data for later retrieval and analysis
  • Write code to test website content and functionality with the JUnit framework
  • Learn techniques for getting around website security measures designed to prevent automated scraping
  • Fill and submit forms automatically
  • Use threading to run scrapers in parallel
  • Use Java’s Remote Method Invocation (RMI) to create multi-server distributed scrapers
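
To give a sense of what running scrapers in parallel with threads can look like, here is a minimal sketch built on the standard `java.util.concurrent` thread pool. The class and method names are hypothetical and not taken from the book, and the "scraping" is a stand-in that reads a title out of an in-memory HTML string so the example runs without network access.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; not code from the book.
public class ParallelScraper {
    // Stand-in for real scraping work: extracts the <title> text from an HTML string.
    // Returns an empty string when no title element is present.
    public static String scrapeTitle(String html) {
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start < 0 || end < start) {
            return "";
        }
        return html.substring(start + "<title>".length(), end).trim();
    }

    // Runs scrapeTitle on every page using a fixed-size thread pool and
    // returns the results in the same order as the input list.
    public static List<String> scrapeAll(List<String> pages, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> scrapeTitle(page);
                futures.add(pool.submit(task));
            }
            List<String> titles = new ArrayList<>();
            for (Future<String> f : futures) {
                titles.add(f.get()); // blocks until that task finishes
            }
            return titles;
        } finally {
            pool.shutdown(); // stop accepting work so the JVM can exit
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = List.of(
            "<html><head><title>Page One</title></head></html>",
            "<html><head><title>Page Two</title></head></html>",
            "<html><body>No title here</body></html>");
        System.out.println(scrapeAll(pages, 2));
    }
}
```

Collecting `Future` objects in submission order keeps the output aligned with the input list even though the tasks may finish in any order.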

Approach

This book is filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks. It is full of short, concise recipes that teach a variety of useful web scraping techniques using Java. You will start with a simple recipe for setting up your Java environment and gradually move on to more advanced recipes, such as using complex scrapers.

Who this book is written for

Instant Web Scraping with Java is aimed at developers who, while not necessarily familiar with Java, are at least ready to dive into the complexities of this language with simple, step-by-step instructions leading the way. It is assumed that you have at least an intermediate knowledge of HTML, some knowledge of MySQL, and access to an Internet-connected computer while doing most of the exercises (after all, scraping the Web is difficult if your code can’t get online!).
