Lucene in Action
Provisional Chinese title: Lucene 實戰
Erik Hatcher, Otis Gospodnetic
- Publisher: Manning
- Publication date: 2004-12-01
- List price: $1,740
- VIP price: 5% off, $1,653
- Language: English
- Pages: 456
- Binding: Paperback
- ISBN: 1932394281
- ISBN-13: 9781932394283
- Related categories: Full-text search
Superseded edition
Product description:
Lucene is a gem in the open-source world--a highly scalable, fast search engine. It delivers performance and is disarmingly easy to use. Lucene in Action is the authoritative guide to Lucene. It describes how to index your data, including types you definitely need to know such as MS Word, PDF, HTML, and XML. It introduces you to searching, sorting, filtering, and highlighting search results.
Lucene powers search in surprising places--in discussion groups at Fortune 100 companies, in commercial issue trackers, in email search from Microsoft, in the Nutch web search engine (that scales to billions of pages). It is used by diverse companies including Akamai, Overture, Technorati, HotJobs, Epiphany, FedEx, Mayo Clinic, MIT, New Scientist Magazine, and many others.
Adding search to your application can be easy. With many reusable examples and good advice on best practices, Lucene in Action shows you how. And if you would like to search through Lucene in Action over the Web, you can do so using Lucene itself as the search engine--take a look at the authors' awesome Search Inside solution. Its results page resembles Google's and provides a novel yet familiar interface to the entire book and book blog.
Table of Contents:
foreword xvii
preface xix
acknowledgments xxii
about this book xxv
Part 1 Core Lucene 1
- 1 Meet Lucene 3
- 1.1 Evolution of information organization and access 4
- 1.2 Understanding Lucene 6
- What Lucene is 7
- What Lucene can do for you 7
- History of Lucene 9
- Who uses Lucene 10
- Lucene ports: Perl, Python, C++, .NET, Ruby 10
- 1.3 Indexing and searching 10
- What is indexing, and why is it important? 10
- What is searching? 11
- 1.4 Lucene in action: a sample application 11
- Creating an index 12
- Searching an index 15
- 1.5 Understanding the core indexing classes 18
- IndexWriter 19
- Directory 19
- Analyzer 19
- Document 20
- Field 20
- 1.6 Understanding the core searching classes 22
- IndexSearcher 23
- Term 23
- Query 23
- TermQuery 24
- Hits 24
- 1.7 Review of alternate search products 24
- IR libraries 24
- Indexing and searching applications 26
- Online resources 27
- 1.8 Summary 27
- 2 Indexing 28
- 2.1 Understanding the indexing process 29
- Conversion to text 29
- Analysis 30
- Index writing 31
- 2.2 Basic index operations 31
- Adding documents to an index 31
- Removing Documents from an index 33
- Undeleting Documents 36
- Updating Documents in an index 36
- 2.3 Boosting Documents and Fields 38
- 2.4 Indexing dates 39
- 2.5 Indexing numbers 40
- 2.6 Indexing Fields used for sorting 41
- 2.7 Controlling the indexing process 42
- Tuning indexing performance 42
- In-memory indexing: RAMDirectory 48
- Limiting Field sizes: maxFieldLength 54
- 2.8 Optimizing an index 56
- 2.9 Concurrency, thread-safety, and locking issues 59
- Concurrency rules 59
- Thread-safety 60
- Index locking 62
- Disabling index locking 66
- 2.10 Debugging indexing 66
- 2.11 Summary 67
- 3 Adding search to your application 68
- 3.1 Implementing a simple search feature 69
- Searching for a specific term 70
- Parsing a user-entered query expression: QueryParser 72
- 3.2 Using IndexSearcher 75
- Working with Hits 76
- Paging through Hits 77
- Reading indexes into memory 77
- 3.3 Understanding Lucene scoring 78
- Lucene, you got a lot of ’splainin’ to do! 80
- 3.4 Creating queries programmatically 81
- Searching by term: TermQuery 82
- Searching within a range: RangeQuery 83
- Searching on a string: PrefixQuery 84
- Combining queries: BooleanQuery 85
- Searching by phrase: PhraseQuery 87
- Searching by wildcard: WildcardQuery 90
- Searching for similar terms: FuzzyQuery 92
- 3.5 Parsing query expressions: QueryParser 93
- Query.toString 94
- Boolean operators 94
- Grouping 95
- Field selection 95
- Range searches 96
- Phrase queries 98
- Wildcard and prefix queries 99
- Fuzzy queries 99
- Boosting queries 99
- To QueryParse or not to QueryParse? 100
- 3.6 Summary 100
- 4 Analysis 102
- 4.1 Using analyzers 104
- Indexing analysis 105
- QueryParser analysis 106
- Parsing versus analysis: when an analyzer isn’t appropriate 107
- 4.2 Analyzing the analyzer 107
- What’s in a token? 108
- TokenStreams uncensored 109
- Visualizing analyzers 112
- Filtering order can be important 116
- 4.3 Using the built-in analyzers 119
- StopAnalyzer 119
- StandardAnalyzer 120
- 4.4 Dealing with keyword fields 121
- Alternate keyword analyzer 125
- 4.5 “Sounds like” querying 125
- 4.6 Synonyms, aliases, and words that mean the same 128
- Visualizing token positions 134
- 4.7 Stemming analysis 136
- Leaving holes 136
- Putting it together 137
- Hole lot of trouble 138
- 4.8 Language analysis issues 140
- Unicode and encodings 140
- Analyzing non-English languages 141
- Analyzing Asian languages 142
- Zaijian 145
- 4.9 Nutch analysis 145
- 4.10 Summary 147
- 5 Advanced search techniques 149
- 5.1 Sorting search results 150
- Using a sort 150
- Sorting by relevance 152
- Sorting by index order 153
- Sorting by a field 154
- Reversing sort order 154
- Sorting by multiple fields 155
- Selecting a sorting field type 156
- Using a nondefault locale for sorting 157
- Performance effect of sorting 157
- 5.2 Using PhrasePrefixQuery 157
- 5.3 Querying on multiple fields at once 159
- 5.4 Span queries: Lucene’s new hidden gem 161
- Building block of spanning, SpanTermQuery 163
- Finding spans at the beginning of a field 165
- Spans near one another 166
- Excluding span overlap from matches 168
- Spanning the globe 169
- SpanQuery and QueryParser 170
- 5.5 Filtering a search 171
- Using DateFilter 171
- Using QueryFilter 173
- Security filters 174
- A QueryFilter alternative 176
- Caching filter results 177
- Beyond the built-in filters 177
- 5.6 Searching across multiple Lucene indexes 178
- Using MultiSearcher 178
- Multithreaded searching using ParallelMultiSearcher 180
- 5.7 Leveraging term vectors 185
- Books like this 186
- What category? 189
- 5.8 Summary 193
- 6 Extending search 194
- 6.1 Using a custom sort method 195
- Accessing values used in custom sorting 200
- 6.2 Developing a custom HitCollector 201
- About BookLinkCollector 202
- Using BookLinkCollector 202
- 6.3 Extending QueryParser 203
- Customizing QueryParser’s behavior 203
- Prohibiting fuzzy and wildcard queries 204
- Handling numeric field-range queries 205
- Allowing ordered phrase queries 208
- 6.4 Using a custom filter 209
- Using a filtered query 212
- 6.5 Performance testing 213
- Testing the speed of a search 213
- Load testing 217
- QueryParser again! 218
- Morals of performance testing 220
- 6.6 Summary 220
Part 2 Applied Lucene 221
- 7 Parsing common document formats 223
- 7.1 Handling rich-text documents 224
- Creating a common DocumentHandler interface 225
- 7.2 Indexing XML 226
- Parsing and indexing using SAX 227
- Parsing and indexing using Digester 230
- 7.3 Indexing a PDF document 235
- Extracting text and indexing using PDFBox 236
- Built-in Lucene support 239
- 7.4 Indexing an HTML document 241
- Getting the HTML source data 242
- Using JTidy 242
- Using NekoHTML 245
- 7.5 Indexing a Microsoft Word document 248
- Using POI 249
- Using TextMining.org’s API 250
- 7.6 Indexing an RTF document 252
- 7.7 Indexing a plain-text document 253
- 7.8 Creating a document-handling framework 254
- FileHandler interface 255
- ExtensionFileHandler 257
- FileIndexer application 260
- Using FileIndexer 262
- FileIndexer drawbacks, and how to extend the framework 263
- 7.9 Other text-extraction tools 264
- Document-management systems and services 264
- 7.10 Summary 265
- 8 Tools and extensions 267
- 8.1 Playing in Lucene’s Sandbox 268
- 8.2 Interacting with an index 269
- lucli: a command-line interface 269
- Luke: the Lucene Index Toolbox 271
- LIMO: Lucene Index Monitor 279
- 8.3 Analyzers, tokenizers, and TokenFilters, oh my 282
- SnowballAnalyzer 283
- Obtaining the Sandbox analyzers 284
- 8.4 Java Development with Ant and Lucene 284
- Using the <index> task 285
- Creating a custom document handler 286
- Installation 290
- 8.5 JavaScript browser utilities 290
- JavaScript query construction and validation 291
- Escaping special characters 292
- Using JavaScript support 292
- 8.6 Synonyms from WordNet 292
- Building the synonym index 294
- Tying WordNet synonyms into an analyzer 296
- Calling on Lucene 297
- 8.7 Highlighting query terms 300
- Highlighting with CSS 301
- Highlighting Hits 303
- 8.8 Chaining filters 304
- 8.9 Storing an index in Berkeley DB 307
- Coding to DbDirectory 308
- Installing DbDirectory 309
- 8.10 Building the Sandbox 309
- Check it out 310
- Ant in the Sandbox 310
- 8.11 Summary 311
- 9 Lucene ports 312
- 9.1 Ports’ relation to Lucene 313
- 9.2 CLucene 314
- Supported platforms 314
- API compatibility 314
- Unicode support 316
- Performance 317
- Users 317
- 9.3 dotLucene 317
- API compatibility 317
- Index compatibility 318
- Performance 318
- Users 318
- 9.4 Plucene 318
- API compatibility 319
- Index compatibility 320
- Performance 320
- Users 320
- 9.5 Lupy 320
- API compatibility 320
- Index compatibility 322
- Performance 322
- Users 322
- 9.6 PyLucene 322
- API compatibility 323
- Index compatibility 323
- Performance 323
- Users 323
- 9.7 Summary 324
- 10 Case studies 325
- 10.1 Nutch: “The NPR of search engines” 326
- More in depth 327
- Other Nutch features 328
- 10.2 Using Lucene at jGuru 329
- Topic lexicons and document categorization 330
- Search database structure 331
- Index fields 332
- Indexing and content preparation 333
- Queries 335
- JGuruMultiSearcher 339
- Miscellaneous 340
- 10.3 Using Lucene in SearchBlox 341
- Why choose Lucene? 341
- SearchBlox architecture 342
- Search results 343
- Language support 343
- Reporting Engine 344
- Summary 344
- 10.4 Competitive intelligence with Lucene in XtraMind’s XM-InformationMinder™ 344
- The system architecture 347
- How Lucene has helped us 350
- 10.5 Alias-i: orthographic variation with Lucene 351
- Alias-i application architecture 352
- Orthographic variation 354
- The noisy channel model of spelling correction 355
- The vector comparison model of spelling variation 356
- A subword Lucene analyzer 357
- Accuracy, efficiency, and other applications 360
- Mixing in context 360
- References 361
- 10.6 Artful searching at Michaels.com 361
- Indexing content 362
- Searching content 367
- Search statistics 370
- Summary 371
- 10.7 I love Lucene: TheServerSide 371
- Building better search capability 371
- High-level infrastructure 373
- Building the index 374
- Searching the index 377
- Configuration: one place to rule them all 379
- Web tier: TheSeeeeeeeeeeeerverSide? 383
- Summary 385
- 10.8 Conclusion 385
appendix A Installing Lucene 387
appendix B Lucene index format 393
appendix C Resources 408
index 415