Data Engineering with Scala and Spark: Build streaming and batch pipelines that process massive amounts of data using Scala

Eric Tome, Rupam Bhattacharjee, David Radford

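  • Publisher: Packt Publishing
  • Publication date: 2024-01-31
  • List price: $1,860
  • VIP price: $1,767 (5% off)
  • Language: English
  • Pages: 300
  • Binding: Quality paper (also called trade paper)
  • ISBN: 1804612588
  • ISBN-13: 9781804612583
  • Related categories: JVM languages, Spark
  • Overseas special-order title (requires separate checkout)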

Product Description

Take your data engineering skills to the next level by learning how to use Scala and functional programming to create continuous and scheduled pipelines that ingest, transform, and aggregate data.


Key Features:


  • Transform data into a clean and trusted source of information for your organization using Scala
  • Build streaming and batch-processing pipelines with step-by-step explanations
  • Implement and orchestrate your pipelines by following CI/CD best practices and test-driven development (TDD)
  • Purchase of the print or Kindle book includes a free PDF eBook


Book Description:


Most data engineers know that performance problems in a distributed computing environment can easily undermine the overall efficiency and effectiveness of data engineering tasks. While Python remains a popular choice for data engineering due to its ease of use, Scala shines in scenarios where the performance of distributed data processing is paramount.


This book will teach you how to use the Scala programming language on the Spark framework, together with the latest cloud technologies, to build continuous and triggered data pipelines. You'll do this by setting up a data engineering environment for local development and scalable distributed cloud deployments using data engineering best practices, test-driven development, and CI/CD. You'll also get to grips with the DataFrame API, the Dataset API, and the Spark SQL API, and learn how to use them. Data profiling and data quality in Scala are also covered, alongside techniques for orchestrating and performance tuning your end-to-end pipelines to deliver data to your end users.


By the end of this book, you will be able to build streaming and batch data pipelines using Scala while following software engineering best practices.


What You Will Learn:


  • Set up your development environment to build pipelines in Scala
  • Get to grips with polymorphic functions, type parameterization, and Scala implicits
  • Use Spark DataFrames, Datasets, and Spark SQL with Scala
  • Read and write data to object stores
  • Profile and clean your data using Deequ
  • Performance tune your data pipelines using Scala


Who this book is for:


This book is for data engineers who have experience in working with data and want to understand how to transform raw data into a clean, trusted, and valuable source of information for their organization using Scala and the latest cloud technologies.
