Simulating Information Retrieval Test Collections
Provisional Chinese title: 模擬資訊檢索測試集
David Hawking, Bodo Billerbeck, Paul Thomas
- Publisher: Morgan & Claypool
- Publication date: 2020-09-04
- Price: $3,530
- VIP price: 5% off, $3,354
- Language: English
- Pages: 184
- Binding: Hardcover - also called cloth, retail trade, or trade
- ISBN: 1681739593
- ISBN-13: 9781681739595
Overseas special-order title (requires separate checkout)
Product description
Simulated test collections may find application in situations where real datasets cannot easily be accessed due to confidentiality concerns or practical inconvenience. They can potentially support Information Retrieval (IR) experimentation, tuning, validation, performance prediction, and hardware sizing. Naturally, the accuracy and usefulness of results obtained from a simulation depend upon the fidelity and generality of the models which underpin it. The fidelity of emulation of a real corpus is likely to be limited by the requirement that confidential information in the real corpus must not be extractable from the emulated version. We present a range of methods exploring trade-offs between emulation fidelity and degree of preservation of privacy.
We present three different simple types of text generator which work at a micro level: Markov models, neural net models, and substitution ciphers. We also describe macro level methods where we can engineer macro properties of a corpus, giving a range of models for each of the salient properties: document length distribution, word frequency distribution (for independent and non-independent cases), word length and textual representation, and corpus growth.
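As a rough illustration of the micro-level idea, and not the book's own implementation, a first-order word-level Markov generator might be sketched in Python as follows. The function names, the toy corpus, and the fixed seed are all assumptions made for this example.

```python
import random
from collections import defaultdict


def train_bigram_model(text):
    """Map each word to the list of words observed immediately after it."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return dict(model)


def generate(model, start, length, seed=0):
    """Emit up to `length` words by repeatedly sampling an observed successor."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    out = [start]
    while len(out) < length:
        successors = model.get(out[-1])
        if not successors:  # dead end: this word was never followed by another
            break
        out.append(rng.choice(successors))
    return " ".join(out)


# Hypothetical toy corpus; a real emulation would train on a full collection.
toy_corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigram_model(toy_corpus)
sample = generate(model, start="the", length=8)
print(sample)
```

Every adjacent word pair in the output also occurs in the training text, so a model this simple leaks short fragments of the source. That is one small instance of the fidelity versus privacy trade-off described above.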
We present results of emulating existing corpora and of scaling up corpora by two orders of magnitude. We show that simulated collections generated with relatively simple methods are suitable for some purposes and can be generated very quickly. Indeed, it may sometimes be feasible to embed a simple lightweight corpus generator into an indexer for the purpose of efficiency studies.
Naturally, a corpus of artificial text cannot support IR experimentation in the absence of a set of compatible queries. We discuss and experiment with published methods for query generation and query log emulation.
We present a proof-of-the-pudding study in which we observe the predictive accuracy of efficiency and effectiveness results obtained on emulated versions of TREC corpora. The study includes three open-source retrieval systems and several TREC datasets. There is a trade-off between confidentiality and prediction accuracy and there are interesting interactions between retrieval systems and datasets. Our tentative conclusion is that there are emulation methods which achieve useful prediction accuracy while providing a level of confidentiality adequate for many applications.