Lucene in Action
Tentative Chinese title: Lucene 實戰

Erik Hatcher, Otis Gospodnetic

  • Publisher: Manning
  • Publication date: 2004-12-01
  • List price: $1,740
  • VIP price: $1,653 (5% off)
  • Language: English
  • Pages: 456
  • Binding: Paperback
  • ISBN: 1932394281
  • ISBN-13: 9781932394283
  • Related category: Full-text search
  • Out of print

Product description

Description:

Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.

Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.

Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.

 

Table of Contents:

foreword xvii
preface xix

acknowledgments xxii
about this book xxv

Part 1 Core Lucene 1

1 Meet Lucene 3
1.1 Evolution of information organization and access 4
1.2 Understanding Lucene 6
What Lucene is 7
What Lucene can do for you 7
History of Lucene 9
Who uses Lucene 10
Lucene ports: Perl, Python, C++, .NET, Ruby 10
1.3 Indexing and searching 10
What is indexing, and why is it important? 10
What is searching? 11
1.4 Lucene in action: a sample application 11
Creating an index 12
Searching an index 15
1.5 Understanding the core indexing classes 18
IndexWriter 19
Directory 19
Analyzer 19
Document 20
Field 20
1.6 Understanding the core searching classes 22
IndexSearcher 23
Term 23
Query 23
TermQuery 24
Hits 24
1.7 Review of alternate search products 24
IR libraries 24
Indexing and searching applications 26
Online resources 27
1.8 Summary 27
 
2 Indexing 28
2.1 Understanding the indexing process 29
Conversion to text 29
Analysis 30
Index writing 31
2.2 Basic index operations 31
Adding documents to an index 31
Removing Documents from an index 33
Undeleting Documents 36
Updating Documents in an index 36
2.3 Boosting Documents and Fields 38
2.4 Indexing dates 39
2.5 Indexing numbers 40
2.6 Indexing Fields used for sorting 41
2.7 Controlling the indexing process 42
Tuning indexing performance 42
In-memory indexing: RAMDirectory 48
Limiting Field sizes: maxFieldLength 54
2.8 Optimizing an index 56
2.9 Concurrency, thread-safety, and locking issues 59
Concurrency rules 59
Thread-safety 60
Index locking 62
Disabling index locking 66
2.10 Debugging indexing 66
2.11 Summary 67
 
3 Adding search to your application 68
3.1 Implementing a simple search feature 69
Searching for a specific term 70
Parsing a user-entered query expression: QueryParser 72
3.2 Using IndexSearcher 75
Working with Hits 76
Paging through Hits 77
Reading indexes into memory 77
3.3 Understanding Lucene scoring 78
Lucene, you got a lot of ‘splainin’ to do! 80
3.4 Creating queries programmatically 81
Searching by term: TermQuery 82
Searching within a range: RangeQuery 83
Searching on a string: PrefixQuery 84
Combining queries: BooleanQuery 85
Searching by phrase: PhraseQuery 87
Searching by wildcard: WildcardQuery 90
Searching for similar terms: FuzzyQuery 92
3.5 Parsing query expressions: QueryParser 93
Query.toString 94
Boolean operators 94
Grouping 95
Field selection 95
Range searches 96
Phrase queries 98
Wildcard and prefix queries 99
Fuzzy queries 99
Boosting queries 99
To QueryParse or not to QueryParse? 100
3.6 Summary 100
 
4 Analysis 102
4.1 Using analyzers 104
Indexing analysis 105
QueryParser analysis 106
Parsing versus analysis: when an analyzer isn’t appropriate 107
4.2 Analyzing the analyzer 107
What’s in a token? 108
TokenStreams uncensored 109
Visualizing analyzers 112
Filtering order can be important 116
4.3 Using the built-in analyzers 119
StopAnalyzer 119
StandardAnalyzer 120
4.4 Dealing with keyword fields 121
Alternate keyword analyzer 125
4.5 “Sounds like” querying 125
4.6 Synonyms, aliases, and words that mean the same 128
Visualizing token positions 134
4.7 Stemming analysis 136
Leaving holes 136
Putting it together 137
Hole lot of trouble 138
4.8 Language analysis issues 140
Unicode and encodings 140
Analyzing non-English languages 141
Analyzing Asian languages 142
Zaijian 145
4.9 Nutch analysis 145
4.10 Summary 147
 
5 Advanced search techniques 149
5.1 Sorting search results 150
Using a sort 150
Sorting by relevance 152
Sorting by index order 153
Sorting by a field 154
Reversing sort order 154
Sorting by multiple fields 155
Selecting a sorting field type 156
Using a nondefault locale for sorting 157
Performance effect of sorting 157
5.2 Using PhrasePrefixQuery 157
5.3 Querying on multiple fields at once 159
5.4 Span queries: Lucene’s new hidden gem 161
Building block of spanning, SpanTermQuery 163
Finding spans at the beginning of a field 165
Spans near one another 166
Excluding span overlap from matches 168
Spanning the globe 169
SpanQuery and QueryParser 170
5.5 Filtering a search 171
Using DateFilter 171
Using QueryFilter 173
Security filters 174
A QueryFilter alternative 176
Caching filter results 177
Beyond the built-in filters 177
5.6 Searching across multiple Lucene indexes 178
Using MultiSearcher 178
Multithreaded searching using ParallelMultiSearcher 180
5.7 Leveraging term vectors 185
Books like this 186
What category? 189
5.8 Summary 193
 
6 Extending search 194
6.1 Using a custom sort method 195
Accessing values used in custom sorting 200
6.2 Developing a custom HitCollector 201
About BookLinkCollector 202
Using BookLinkCollector 202
6.3 Extending QueryParser 203
Customizing QueryParser’s behavior 203
Prohibiting fuzzy and wildcard queries 204
Handling numeric field-range queries 205
Allowing ordered phrase queries 208
6.4 Using a custom filter 209
Using a filtered query 212
6.5 Performance testing 213
Testing the speed of a search 213
Load testing 217
QueryParser again! 218
Morals of performance testing 220
6.6 Summary 220

Part 2 Applied Lucene 221

7 Parsing common document formats 223
7.1 Handling rich-text documents 224
Creating a common DocumentHandler interface 225
7.2 Indexing XML 226
Parsing and indexing using SAX 227
Parsing and indexing using Digester 230
7.3 Indexing a PDF document 235
Extracting text and indexing using PDFBox 236
Built-in Lucene support 239
7.4 Indexing an HTML document 241
Getting the HTML source data 242
Using JTidy 242
Using NekoHTML 245
7.5 Indexing a Microsoft Word document 248
Using POI 249
Using TextMining.org’s API 250
7.6 Indexing an RTF document 252
7.7 Indexing a plain-text document 253
7.8 Creating a document-handling framework 254
FileHandler interface 255
ExtensionFileHandler 257
FileIndexer application 260
Using FileIndexer 262
FileIndexer drawbacks, and how to extend the framework 263
7.9 Other text-extraction tools 264
Document-management systems and services 264
7.10 Summary 265
 
8 Tools and extensions 267
8.1 Playing in Lucene’s Sandbox 268
8.2 Interacting with an index 269
lucli: a command-line interface 269
Luke: the Lucene Index Toolbox 271
LIMO: Lucene Index Monitor 279
8.3 Analyzers, tokenizers, and TokenFilters, oh my 282
SnowballAnalyzer 283
Obtaining the Sandbox analyzers 284
8.4 Java Development with Ant and Lucene 284
Using the <index> task 285
Creating a custom document handler 286
Installation 290
8.5 JavaScript browser utilities 290
JavaScript query construction and validation 291
Escaping special characters 292
Using JavaScript support 292
8.6 Synonyms from WordNet 292
Building the synonym index 294
Tying WordNet synonyms into an analyzer 296
Calling on Lucene 297
8.7 Highlighting query terms 300
Highlighting with CSS 301
Highlighting Hits 303
8.8 Chaining filters 304
8.9 Storing an index in Berkeley DB 307
Coding to DbDirectory 308
Installing DbDirectory 309
8.10 Building the Sandbox 309
Check it out 310
Ant in the Sandbox 310
8.11 Summary 311
 
9 Lucene ports 312
9.1 Ports’ relation to Lucene 313
9.2 CLucene 314
Supported platforms 314
API compatibility 314
Unicode support 316
Performance 317
Users 317
9.3 dotLucene 317
API compatibility 317
Index compatibility 318
Performance 318
Users 318
9.4 Plucene 318
API compatibility 319
Index compatibility 320
Performance 320
Users 320
9.5 Lupy 320
API compatibility 320
Index compatibility 322
Performance 322
Users 322
9.6 PyLucene 322
API compatibility 323
Index compatibility 323
Performance 323
Users 323
9.7 Summary 324
 
10 Case studies 325
10.1 Nutch: “The NPR of search engines” 326
More in depth 327
Other Nutch features 328
10.2 Using Lucene at jGuru 329
Topic lexicons and document categorization 330
Search database structure 331
Index fields 332
Indexing and content preparation 333
Queries 335
JGuruMultiSearcher 339
Miscellaneous 340
10.3 Using Lucene in SearchBlox 341
Why choose Lucene? 341
SearchBlox architecture 342
Search results 343
Language support 343
Reporting Engine 344
Summary 344
10.4 Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™ 344
The system architecture 347
How Lucene has helped us 350
10.5 Alias-i: orthographic variation with Lucene 351
Alias-i application architecture 352
Orthographic variation 354
The noisy channel model of spelling correction 355
The vector comparison model of spelling variation 356
A subword Lucene analyzer 357
Accuracy, efficiency, and other applications 360
Mixing in context 360
References 361
10.6 Artful searching at Michaels.com 361
Indexing content 362
Searching content 367
Search statistics 370
Summary 371
10.7 I love Lucene: TheServerSide 371
Building better search capability 371
High-level infrastructure 373
Building the index 374
Searching the index 377
Configuration: one place to rule them all 379
Web tier: TheSeeeeeeeeeeeerverSide? 383
Summary 385
10.8 Conclusion 385
 
appendix A Installing Lucene 387
appendix B Lucene index format 393
appendix C Resources 408
index 415

