Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications

Haines, Scott

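  • Publisher: Apress
  • Publication date: 2022-03-23
  • Price: $2,410
  • VIP price: $2,290 (5% off)
  • Language: English
  • Pages: 612
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1484274512
  • ISBN-13: 9781484274514
  • Related categories: Spark
  • Overseas special-order title (requires separate checkout)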

Product Description

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey that begins with the basics of data ingestion, processing, and transformation, and ends with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, MinIO (S3), and Apache Airflow.

Apache Spark applications solve a wide range of data problems, from traditional data loading and processing to rich SQL-based analysis, complex machine learning workloads, and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compile reusable applications and modules, and fully test both batch and streaming applications. You will also learn to containerize your applications using Docker and to run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker, and Kubernetes.

Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming Spark applications in a low-stress environment that paves the way for your own path to production.


What You Will Learn

- Simplify data transformation with Spark Pipelines and Spark SQL
- Bridge data engineering with machine learning
- Architect modular data pipeline applications
- Build reusable application components and libraries
- Containerize your Spark applications for consistency and reliability
- Use Docker and Kubernetes to deploy your Spark applications
- Speed up application experimentation using Apache Zeppelin and Docker
- Understand serializable structured data and data contracts
- Harness effective strategies for optimizing data in your data lakes
- Build end-to-end Spark structured streaming applications using Redis and Apache Kafka
- Embrace testing for your batch and streaming applications
- Deploy and monitor your Spark applications


Who This Book Is For
Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem, practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes, data architects who wish to provide clear and concise direction for how best to harness and use Apache Spark within their organization, and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world.


About the Author

Scott Haines is a full-stack engineer with a current focus on real-time, highly available, trustworthy analytics systems. He works at Twilio as a Principal Software Engineer on the Voice Insights team, where he helps drive Spark adoption, creates streaming pipeline architectures, and helps to architect and build out a massive stream and batch processing platform.


Prior to Twilio, Scott wrote the backend Java APIs for Yahoo Games, as well as the real-time game ranking and ratings engine (built on Storm) that provided personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo at Flurry Analytics, where he wrote the alerts and notifications system for mobile devices.
