Data Engineering with Apache Spark, Delta Lake, and Lakehouse: Create scalable pipelines that ingest, curate, and aggregate complex data in a timely a
Kukreja, Manoj
- Publisher: Packt Publishing
- Publication date: 2021-10-22
- Price: $2,010
- VIP price: 5% off, $1,910
- Language: English
- Pages: 480
- Binding: Quality Paper - also called trade paper
- ISBN: 1801077746
- ISBN-13: 9781801077743
- Related categories: JVM languages, Spark
- Overseas special-order title (requires separate checkout)
Product Description
Understand the complexities of modern-day data engineering platforms and explore strategies for dealing with them through use-case scenarios led by an industry expert in big data.
Key Features:
- Become well-versed with the core concepts of Apache Spark and Delta Lake for building data platforms
- Learn how to ingest, process, and analyze data that can be later used for training machine learning models
- Understand how to operationalize data models in production using curated data
Book Description:
In the world of ever-changing data and ever-evolving schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on.
Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You'll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you've explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you'll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you'll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way.
By the end of this data engineering book, you'll have learned how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks.
What You Will Learn:
- Discover the challenges you may face in the data engineering world
- Add ACID transactions to Apache Spark using Delta Lake
- Understand effective design strategies to build enterprise-grade data lakes
- Explore architectural and design patterns for building efficient data ingestion pipelines
- Orchestrate a data pipeline for preprocessing data using Apache Spark and Delta Lake APIs
- Automate deployment and monitoring of data pipelines in production
- Get to grips with securing, monitoring, and managing data pipelines and models efficiently
Who this book is for:
This book is for aspiring data engineers and data analysts who are new to the world of data engineering and are looking for a practical guide to building scalable data platforms. If you already work with PySpark and want to use Delta Lake for data engineering, you'll find this book useful. Basic knowledge of Python, Spark, and SQL is expected.