Statistical Methods for Annotation Analysis
Tentative Chinese title: 註解分析的統計方法
Paun, Silviu, Artstein, Ron, Poesio, Massimo
- Publisher: Morgan & Claypool
- Publication date: 2022-01-13
- Price: $2,890
- VIP price: 5% off, $2,746
- Language: English
- Pages: 217
- Binding: Quality Paper - also called trade paper
- ISBN: 1636392539
- ISBN-13: 9781636392530
Overseas special-order title (requires separate checkout)
Product description
Labelling data is one of the most fundamental activities in science, and has underpinned practice, particularly in medicine, for decades, as well as research in corpus linguistics since at least the development of the Brown corpus. With the shift towards Machine Learning in Artificial Intelligence (AI), the creation of datasets to be used for training and evaluating AI systems, also known in AI as corpora, has become a central activity in the field as well.
Early AI datasets were created on an ad-hoc basis to tackle specific problems. As larger and more reusable datasets were created, requiring greater investment, the need for a more systematic approach to dataset creation arose to ensure increased quality. A range of statistical methods were adopted, often but not exclusively from the medical sciences, to ensure that the labels used were not subjective, or to choose among different labels provided by the coders. A wide variety of such methods is now in regular use. This book is meant to provide a survey of the most widely used among these statistical methods supporting annotation practice.
As far as the authors know, this is the first book attempting to cover the two families of methods in wider use. The first family of methods is concerned with the development of labelling schemes and, in particular, ensuring that such schemes are such that sufficient agreement can be observed among the coders. The second family includes methods developed to analyze the output of coders once the scheme has been agreed upon, particularly although not exclusively to identify the most likely label for an item among those provided by the coders.
The focus of this book is primarily on Natural Language Processing, the area of AI devoted to the development of models of language interpretation and production, but many if not most of the methods discussed here are also applicable to other areas of AI, or indeed, to other areas of Data Science.