Data Cleaning
Provisional Chinese title: 數據清理
Ilyas, Ihab F., Chu, Xu
- Publisher: Macmillan
- Publication date: 2019-06-18
- List price: $3,180
- VIP price: 5% off, $3,021
- Language: English
- Pages: 282
- Binding: Hardcover - also called cloth, retail trade, or trade
- ISBN: 1450371523
- ISBN-13: 9781450371520
Overseas special-order title (must be checked out separately)
Product description
Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions.
Poor data across businesses and the U.S. government is reported to cost trillions of dollars a year. Multiple surveys show that dirty data is the most common barrier faced by data scientists. Not surprisingly, developing effective and efficient data cleaning solutions is challenging and rife with deep theoretical and engineering problems.
This book is about data cleaning, which is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views. Specifically, we cover four of the most common and important data cleaning tasks, namely, outlier detection, data transformation, error repair (including imputing missing values), and data deduplication. Furthermore, due to the increasing popularity and applicability of machine learning techniques, we include a chapter that specifically explores how machine learning techniques are used for data cleaning, and how data cleaning is used to improve machine learning models.
This book is intended to serve as a useful reference for researchers and practitioners who are interested in the area of data quality and data cleaning. It can also be used as a textbook for a graduate course. Although we aim at covering state-of-the-art algorithms and techniques, we recognize that data cleaning is still an active field of research and therefore provide future directions of research whenever appropriate.