Hands-On Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing

Antolínez García, Alfonso

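  • Publisher: Apress
  • Publication date: 2023-06-06
  • List price: $2,500
  • VIP price: $2,375 (5% off)
  • Language: English
  • Pages: 403
  • Binding: Quality Paper (also called trade paper)
  • ISBN: 1484293797
  • ISBN-13: 9781484293799
  • Related categories: JVM languages, Spark
  • Overseas-order title (requires separate checkout)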

Product description

This book explains how to scale Apache Spark 3 to handle massive amounts of data through either batch or stream processing. It covers how to use Spark's structured APIs to perform complex data transformations and analyses that you can use to implement end-to-end analytics workflows. It also covers Spark 3's new features, theoretical foundations, and application architecture.

The first section introduces the Apache Spark ecosystem as a unified engine for large-scale data analytics and shows you how to run and fine-tune your first Spark application. The second section centers on batch processing, which is suited to end-of-cycle processing, and on data ingestion through files and databases. It explains the Spark DataFrame API and how to work with structured and unstructured data in Apache Spark. The last section deals with scalable, high-throughput, fault-tolerant streaming workloads for processing real-time data. Here you'll learn about Spark Streaming's execution model and architecture, as well as monitoring, reporting, and recovery for Spark Streaming. A full chapter is devoted to future directions for Spark Streaming.

With real-world use cases, code snippets, and notebooks hosted on GitHub, this book will give you an understanding of large-scale data analysis concepts and help you put them to use.
Upon completing this book, you will have the knowledge and skills to seamlessly implement large-scale batch and streaming workloads to analyze real-time data streams with Apache Spark.
What You Will Learn
  • Master the concepts of Spark clusters and batch data processing
  • Understand data ingestion, transformation, and data storage
  • Gain insight into essential stream processing concepts and different streaming architectures
  • Implement streaming jobs and applications with Spark Streaming

Who This Book Is For
Data engineers, data analysts, machine learning engineers, and Python and R programmers

About the author

Alfonso Antolínez García is a senior IT manager with a long professional career at several multinational companies, including Bertelsmann SE, Lafarge, and TUI AG. He has worked in the media, building materials, and leisure industries. Alfonso also works as a university professor, teaching artificial intelligence, machine learning, and data science. In his spare time, he writes research papers on artificial intelligence, mathematics, physics, and the applications of information theory to other sciences.