Spidering Hacks (Paperback)

Kevin Hemenway, Tara Calishain

  • Publisher: O'Reilly
  • Publication date: 2003-12-02
  • List price: $1,250
  • VIP price: $1,188 (5% off list)
  • Language: English
  • Pages: 424
  • Binding: Paperback
  • ISBN: 0596005776
  • ISBN-13: 9780596005771
  • Categories: Python, Web crawlers
  • Imported title, special order (checked out separately)

Product Description

Summary

The Internet, with its profusion of information, has made us hungry for ever more, ever better data. Out of necessity, many of us have become pretty adept with search engine queries, but there are times when even the most powerful search engines aren't enough. If you've ever wanted your data in a different form than it's presented, or wanted to collect data from several sites and see it side-by-side without the constraints of a browser, then Spidering Hacks is for you.

Spidering Hacks takes you to the next level in Internet data retrieval--beyond search engines--by showing you how to create spiders and bots to retrieve information from your favorite sites and data sources. You'll no longer feel constrained by the way host sites think you want to see their data presented--you'll learn how to scrape and repurpose raw data so you can view it in a way that's meaningful to you.

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content. By the time you finish Spidering Hacks, you'll be able to:


  • Aggregate and associate data from disparate locations, then store and manipulate the data as you like
  • Gain a competitive edge in business by knowing when competitors' products are on sale, and comparing sales ranks and product placement on e-commerce sites
  • Integrate third-party data into your own applications or web sites
  • Make your own site easier to scrape and more usable to others
  • Keep up to date with your favorite comic strips, news stories, stock tips, and more without visiting the site every day


Like the other books in O'Reilly's popular Hacks series, Spidering Hacks brings you 100 industrial-strength tips and tools from the experts to help you master this technology. If you're interested in data retrieval of any type, this book provides a wealth of data for finding a wealth of data.
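The hacks themselves are written mostly in Perl, built on modules such as LWP and on plain regular expressions. As a rough illustration of the kind of fetch the early hacks start from (LWP::Simple is covered in Hack #9, regular expressions in Hack #23), the short sketch below grabs a page and pulls out its title; the URL is a placeholder and the snippet is not taken from the book:

    #!/usr/bin/perl
    # Minimal sketch, not from the book: fetch a page with LWP::Simple
    # and pull its <title> out with a regular expression.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my $url  = 'http://www.example.com/';   # placeholder URL
    my $html = get($url);
    die "Couldn't fetch $url\n" unless defined $html;

    if ($html =~ m{<title>(.*?)</title>}is) {
        print "Page title: $1\n";
    }

The spiders in the book layer politeness on top of this kind of fetch, for example by honoring robots.txt (Hack #17) and respecting the target site's bandwidth (Hack #16).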

Table of Contents

  • Credits  

    Preface  

    Chapter 1. Walking Softly 

          1. A Crash Course in Spidering and Scraping  

          2. Best Practices for You and Your Spider  

          3. Anatomy of an HTML Page  

          4. Registering Your Spider  

          5. Preempting Discovery  

          6. Keeping Your Spider Out of Sticky Situations  

          7. Finding the Patterns of Identifiers  

    Chapter 2. Assembling a Toolbox 

        Perl Modules

        Resources You May Find Helpful

          8. Installing Perl Modules  

          9. Simply Fetching with LWP::Simple  

          10. More Involved Requests with LWP::UserAgent  

          11. Adding HTTP Headers to Your Request  

          12. Posting Form Data with LWP  

          13. Authentication, Cookies, and Proxies  

          14. Handling Relative and Absolute URLs  

          15. Secured Access and Browser Attributes  

          16. Respecting Your Scrapee's Bandwidth  

          17. Respecting robots.txt  

          18. Adding Progress Bars to Your Scripts  

          19. Scraping with HTML::TreeBuilder  

          20. Parsing with HTML::TokeParser  

          21. WWW::Mechanize 101  

          22. Scraping with WWW::Mechanize  

          23. In Praise of Regular Expressions  

          24. Painless RSS with Template::Extract  

          25. A Quick Introduction to XPath  

          26. Downloading with curl and wget  

          27. More Advanced wget Techniques  

          28. Using Pipes to Chain Commands  

          29. Running Multiple Utilities at Once  

          30. Utilizing the Web Scraping Proxy  

          31. Being Warned When Things Go Wrong  

          32. Being Adaptive to Site Redesigns  

    Chapter 3. Collecting Media Files 

          33. Detective Case Study: Newgrounds  

          34. Detective Case Study: iFilm  

          35. Downloading Movies from the Library of Congress  

          36. Downloading Images from Webshots  

          37. Downloading Comics with dailystrips  

          38. Archiving Your Favorite Webcams  

          39. News Wallpaper for Your Site  

          40. Saving Only POP3 Email Attachments  

          41. Downloading MP3s from a Playlist  

          42. Downloading from Usenet with nget  

    Chapter 4. Gleaning Data from Databases 

          43. Archiving Yahoo! Groups Messages with yahoo2mbox  

          44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups  

          45. Gleaning Buzz from Yahoo!  

          46. Spidering the Yahoo! Catalog  

          47. Tracking Additions to Yahoo!  

          48. Scattersearch with Yahoo! and Google  

          49. Yahoo! Directory Mindshare in Google  

          50. Weblog-Free Google Results  

          51. Spidering, Google, and Multiple Domains  

          52. Scraping Amazon.com Product Reviews  

          53. Receive an Email Alert for Newly Added Amazon.com Reviews  

          54. Scraping Amazon.com Customer Advice  

          55. Publishing Amazon.com Associates Statistics  

          56. Sorting Amazon.com Recommendations by Rating  

          57. Related Amazon.com Products with Alexa  

          58. Scraping Alexa's Competitive Data with Java  

          59. Finding Album Information with FreeDB and Amazon.com  

          60. Expanding Your Musical Tastes  

          61. Saving Daily Horoscopes to Your iPod  

          62. Graphing Data with RRDTOOL  

          63. Stocking Up on Financial Quotes  

          64. Super Author Searching  

          65. Mapping O'Reilly Best Sellers to Library Popularity  

          66. Using All Consuming to Get Book Lists  

          67. Tracking Packages with FedEx  

          68. Checking Blogs for New Comments  

          69. Aggregating RSS and Posting Changes  

          70. Using the Link Cosmos of Technorati  

          71. Finding Related RSS Feeds  

          72. Automatically Finding Blogs of Interest  

          73. Scraping TV Listings  

          74. What's Your Visitor's Weather Like?  

          75. Trendspotting with Geotargeting  

          76. Getting the Best Travel Route by Train  

          77. Geographic Distance and Back Again  

          78. Super Word Lookup  

          79. Word Associations with Lexical Freenet  

          80. Reformatting Bugtraq Reports  

          81. Keeping Tabs on the Web via Email  

          82. Publish IE's Favorites to Your Web Site  

          83. Spidering GameStop.com Game Prices  

          84. Bargain Hunting with PHP  

          85. Aggregating Multiple Search Engine Results  

          86. Robot Karaoke  

          87. Searching the Better Business Bureau  

          88. Searching for Health Inspections  

          89. Filtering for the Naughties  

    Chapter 5. Maintaining Your Collections 

          90. Using cron to Automate Tasks  

          91. Scheduling Tasks Without cron  

          92. Mirroring Web Sites with wget and rsync  

          93. Accumulating Search Results Over Time  

    Chapter 6. Giving Back to the World 

          94. Using XML::RSS to Repurpose Data  

          95. Placing RSS Headlines on Your Site  

          96. Making Your Resources Scrapable with Regular Expressions  

          97. Making Your Resources Scrapable with a REST Interface  

          98. Making Your Resources Scrapable with XML-RPC  

          99. Creating an IM Interface  

          100. Going Beyond the Book  

    Index
