Essential PySpark for Scalable Data Analytics: A beginner's guide to harnessing the power and ease of PySpark 3
- Author: Sreeram Nudurupati
- Publisher: Packt Publishing
- Publication date: 2021-10-29
- List price: $1,750
- Sale price: $1,575 (10% off)
- Language: English
- Pages: 322
- Binding: Quality Paper (trade paperback)
- ISBN-10: 1800568878
- ISBN-13: 9781800568877
Related categories: JVM Languages, Spark, Data Science
Product Description
Get started with distributed computing using PySpark, a single unified framework for end-to-end data analytics at scale
Key Features:
- Discover how to convert huge amounts of raw data into meaningful and actionable insights
- Use Spark's unified analytics engine for end-to-end analytics, from data preparation to predictive analytics
- Perform data ingestion, cleansing, and integration for ML, data analytics, and data visualization
Book Description:
Apache Spark is a unified data analytics engine designed to process huge volumes of data quickly and efficiently. PySpark is Apache Spark's Python language API, which offers Python developers an easy-to-use scalable data analytics framework.
Essential PySpark for Scalable Data Analytics starts by exploring the distributed computing paradigm and provides a high-level overview of Apache Spark. You'll begin your analytics journey with the data engineering process, learning how to perform data ingestion, cleansing, and integration at scale. This book helps you build real-time analytics pipelines that enable you to gain insights much faster. You'll then discover methods for building cloud-based data lakes, and explore Delta Lake, which brings reliability and performance to data lakes. The book also covers the Data Lakehouse, an emerging paradigm that combines the structure and performance of a data warehouse with the scalability of cloud-based data lakes. Later, you'll perform scalable data science and machine learning tasks using PySpark, such as data preparation, feature engineering, and model training and productionization. Finally, you'll learn ways to scale out standard Python ML libraries, along with a new pandas API on top of PySpark called Koalas.
By the end of this PySpark book, you'll be able to harness the power of PySpark to solve business problems.
What You Will Learn:
- Understand the role of distributed computing in the world of big data
- Gain an appreciation for Apache Spark as the de facto standard for big data processing
- Scale out your data analytics process using Apache Spark
- Build data pipelines using data lakes, and perform data visualization with PySpark and Spark SQL
- Leverage the cloud to build truly scalable and real-time data analytics applications
- Explore the applications of data science and scalable machine learning with PySpark
- Integrate your clean and curated data with BI and SQL analysis tools
Who this book is for:
This book is for practicing data engineers, data scientists, data analysts, and data enthusiasts who want to make their analytics distributed and scalable. Basic to intermediate knowledge of data engineering, data science, and SQL analytics is expected. General proficiency in a programming language, especially Python, and working knowledge of performing data analytics with frameworks such as pandas and SQL will help you get the most out of this book.