Multidimensional Mining of Massive Text Data
Chao Zhang, Jiawei Han
- Publisher: Morgan & Claypool
- Publication date: 2019-03-21
- Price: $3,350
- VIP price: 5% off, $3,183
- Language: English
- Pages: 198
- Binding: Hardcover - also called cloth, retail trade, or trade
- ISBN: 1681735210
- ISBN-13: 9781681735214
Overseas special-order book (requires separate checkout)
Product description
Unstructured text, as one of the most important data forms, plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to scientific research and healthcare informatics. In many emerging applications, people's information needs from text data are becoming multidimensional: they demand useful insights along multiple aspects of a text corpus. However, acquiring such multidimensional knowledge from massive text data remains a challenging task.
This book presents data mining techniques that turn unstructured text data into multidimensional knowledge. We investigate two core questions. (1) How does one identify task-relevant text data with declarative queries in multiple dimensions? (2) How does one distill knowledge from text data in a multidimensional space? To address these questions, we develop a text cube framework. First, we develop a cube construction module that organizes unstructured data into a cube structure by discovering the latent multidimensional and multi-granular structure of the unstructured text corpus and allocating documents into that structure. Second, we develop a cube exploitation module that models multiple dimensions in the cube space, thereby distilling multidimensional knowledge from user-selected data. Together, these two modules constitute an integrated pipeline: leveraging the cube structure, users can perform multidimensional, multi-granular data selection with declarative queries; and with cube exploitation algorithms, users can extract multidimensional patterns from the selected data for decision making.
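To make the cube idea concrete, here is a minimal Python sketch of organizing documents into cells and selecting them with a declarative, multi-granular query. It is an illustration under stated assumptions, not the authors' method: the toy corpus, the `topic` and `location` dimensions, the hand-assigned labels, and the helper names `build_cube` and `select` are all hypothetical. In the book's framework the dimension labels are discovered with little supervision rather than supplied by hand.

```python
from collections import defaultdict

# Hypothetical toy corpus. The dimension labels are hand-assigned here purely
# for illustration; the book's cube construction discovers them instead.
DOCS = [
    {"id": 1, "topic": "sports", "location": "US/Chicago", "text": "bulls win the game"},
    {"id": 2, "topic": "sports", "location": "US/Boston", "text": "celtics win the game"},
    {"id": 3, "topic": "politics", "location": "US/Chicago", "text": "mayor wins election"},
    {"id": 4, "topic": "sports", "location": "UK/London", "text": "arsenal lose the match"},
]


def build_cube(docs, dimensions):
    """Group document ids into cells keyed by their value along each dimension."""
    cube = defaultdict(list)
    for doc in docs:
        cell = tuple(doc[d] for d in dimensions)
        cube[cell].append(doc["id"])
    return dict(cube)


def select(cube, dimensions, **query):
    """Return the sorted ids of documents in every cell that matches the query.

    A query value matches a cell value if it is equal to it or is a coarser
    path prefix of it (for example "US" matches "US/Chicago"). This prefix
    rule is a crude stand-in for multi-granular selection.
    """
    def matches(cell_value, wanted):
        return cell_value == wanted or cell_value.startswith(wanted + "/")

    hits = []
    for cell, ids in cube.items():
        values = dict(zip(dimensions, cell))
        if all(matches(values[d], v) for d, v in query.items()):
            hits.extend(ids)
    return sorted(hits)


DIMENSIONS = ("topic", "location")
CUBE = build_cube(DOCS, DIMENSIONS)

# Declarative query along two dimensions, at country-level granularity.
US_SPORTS = select(CUBE, DIMENSIONS, topic="sports", location="US")
```

Selecting at a finer granularity only changes the query value, for example `select(CUBE, DIMENSIONS, location="US/Chicago")`. The selected ids would then be handed to whatever pattern-extraction step plays the role of cube exploitation.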
The proposed framework has two distinctive advantages when turning text data into multidimensional knowledge: flexibility and label-efficiency. First, it enables acquiring multidimensional knowledge flexibly, as the cube structure allows users to easily identify task-relevant data along multiple dimensions at varied granularities and further distill multidimensional knowledge. Second, the algorithms for cube construction and exploitation require little supervision; this makes the framework appealing for many applications where labeled data are expensive to obtain.