Hands-On Guide to Apache Spark 3: Build Scalable Computing Engines for Batch and Stream Data Processing
Antolínez García, Alfonso
Product Description
This book explains how to scale Apache Spark 3 to handle massive amounts of data via either batch or streaming processing. It covers how to use Spark's structured APIs to perform complex data transformations and analyses that you can use to implement end-to-end analytics workflows, and it covers Spark 3's new features, theoretical foundations, and application architecture.

The first section introduces the Apache Spark ecosystem as a unified engine for large-scale data analytics and shows you how to run and fine-tune your first application in Spark. The second section centers on batch processing suited to end-of-cycle workloads and on data ingestion through files and databases. It explains the Spark DataFrame API as well as working with structured and unstructured data in Apache Spark. The last section deals with scalable, high-throughput, fault-tolerant streaming workloads for processing real-time data. Here you'll learn about Spark Streaming's execution model and architecture, and about monitoring, reporting on, and recovering Spark Streaming applications. A full chapter is devoted to future directions for Spark Streaming.

With real-world use cases, code snippets, and notebooks hosted on GitHub, this book will give you an understanding of large-scale data analysis concepts and help you put them to use.
Upon completing this book, you will have the knowledge and skills to seamlessly implement large-scale batch and streaming workloads to analyze real-time data streams with Apache Spark.
What You Will Learn
- Master the concepts of Spark clusters and batch data processing
- Understand data ingestion, transformation, and data storage
- Gain insight into essential stream processing concepts and different streaming architectures
- Implement streaming jobs and applications with Spark Streaming
Who This Book Is For
Data engineers, data analysts, machine learning engineers, and Python and R programmers
About the Author
Alfonso Antolínez García is a senior IT manager with a long professional career at several multinational companies, including Bertelsmann SE, Lafarge, and TUI AG, spanning the media, building materials, and leisure industries. Alfonso also works as a university professor, teaching artificial intelligence, machine learning, and data science. In his spare time, he writes research papers on artificial intelligence, mathematics, physics, and the applications of information theory to other sciences.