Hands-On Entity Resolution: A Practical Guide to Data Matching with Python
Shearer, Michael
Product description
Entity resolution is a key analytic technique that enables you to identify multiple data records that refer to the same real-world entity. With this hands-on guide, product managers, data analysts, and data scientists will learn how to add value to data by cleansing, analyzing, and resolving datasets using open source Python libraries and cloud APIs.
Author Michael Shearer shows you how to scale up your data matching processes and improve the accuracy of your reconciliations. You'll be able to remove duplicate entries within a single source and join disparate data sources together when common keys aren't available. Using real-world data examples, this book helps you gain practical understanding to accelerate the delivery of real business value.
With entity resolution, you'll build rich and comprehensive data assets that reveal relationships for marketing and risk management purposes, key to harnessing the full potential of ML and AI. This book covers:
- Challenges in deduplicating and joining datasets
- Extracting, cleansing, and preparing datasets for matching
- Text matching algorithms to identify equivalent entities
- Techniques for deduplicating and joining datasets at scale
- Matching datasets containing persons and organizations
- Evaluating data matches
- Optimizing and tuning data matching algorithms
- Entity resolution using cloud APIs
- Matching using privacy-enhancing technologies