Web Corpus Construction (Paperback)
Roland Schäfer, Felix Bildhauer
- Publisher: Morgan & Claypool
- Publication date: 2013-07-01
- List price: $1,420
- VIP price: 5% off, $1,349
- Language: English
- Pages: 146
- Binding: Paperback
- ISBN: 1608459837
- ISBN-13: 9781608459834
Related categories:
Big Data, Web Crawlers
Ships immediately (stock: 1)
Product Description
The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This approach has several advantages: (i) Working with such corpora obviates the problems encountered when using Internet search engines in quantitative linguistic research (such as non-transparent ranking algorithms). (ii) Creating a corpus from web data is virtually free. (iii) The size of corpora compiled from the WWW may exceed by several orders of magnitude the size of language resources offered elsewhere. (iv) The data is locally available to the user, and it can be linguistically post-processed and queried with the tools he or she prefers. This book addresses the main practical tasks in the creation of web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanups, including boilerplate removal and removal of duplicated content. Linguistic processing, and the problems that the different kinds of noise in web corpora cause for it, are also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).
For additional material please visit the companion website: sites.morganclaypool.com/wcc
Table of Contents: Preface / Acknowledgments / Web Corpora / Data Collection / Post-Processing / Linguistic Processing / Corpus Evaluation and Comparison / Bibliography / Authors' Biographies