Simplify Big Data Analytics with Amazon EMR: A beginner's guide to learning and implementing Amazon EMR for building data analytics solutions

Name: Simplify Big Data Analytics with Amazon EMR: A beginner's guide to learning and implementing Amazon EMR for building data analytics solutions
Price: 1910 TWD
Availability: OnlineOnly
Author: Mishra, Sakti
ISBN: 1801071071

Publisher: Packt Publishing
Publication date: 2022-03-25
List price: $2,010
VIP price: 5% off, $1,910
Language: English
Pages: 430
Binding: Quality paper (also called trade paper)
ISBN: 1801071071
ISBN-13: 9781801071079
Related categories: Big Data, Data Science

Overseas special-order title (requires separate checkout)

Product Description

Design scalable big data solutions using Hadoop, Spark, and AWS cloud native services

Key Features

- Build data pipelines that require distributed processing capabilities on a large volume of data
- Discover the security features of EMR such as data protection and granular permission management
- Explore best practices and optimization techniques for building data analytics solutions in Amazon EMR

Book Description

Amazon EMR, formerly Amazon Elastic MapReduce, provides a managed Hadoop cluster in Amazon Web Services (AWS) that you can use to implement batch or streaming data pipelines. By gaining expertise in Amazon EMR, you can design and implement data analytics pipelines with persistent or transient EMR clusters in AWS.

This book is a practical guide to Amazon EMR for building data pipelines. You'll start by understanding the Amazon EMR architecture, cluster nodes, features, and deployment options, along with their pricing. Next, the book covers the various big data applications that EMR supports. You'll then focus on the advanced configuration of EMR applications, hardware, networking, security, troubleshooting, logging, and the different SDKs and APIs it provides. Later chapters will show you how to implement common Amazon EMR use cases, including batch ETL with Spark, real-time streaming with Spark Streaming, and handling UPSERT in an S3 data lake with Apache Hudi. Finally, you'll orchestrate your EMR jobs and plan the migration of on-premises Hadoop clusters to EMR. Along the way, you'll explore best practices and cost optimization techniques for implementing your data analytics pipeline in EMR.

By the end of this book, you'll be able to build and deploy Hadoop- or Spark-based apps on Amazon EMR and also migrate your existing on-premises Hadoop workloads to AWS.

What you will learn

- Explore Amazon EMR features, architecture, Hadoop interfaces, and EMR Studio
- Configure, deploy, and orchestrate Hadoop or Spark jobs in production
- Implement the security, data governance, and monitoring capabilities of EMR
- Build applications for batch and real-time streaming data analytics solutions
- Perform interactive development with a persistent EMR cluster and Notebook
- Orchestrate an EMR Spark job using AWS Step Functions and Apache Airflow

Who this book is for

This book is for data engineers, data analysts, data scientists, and solution architects who are interested in building data analytics solutions with Hadoop ecosystem services and Amazon EMR. Prior experience with Python, Scala, or Java and a basic understanding of Hadoop and AWS will help you make the most of this book.

Table of Contents

1. An Overview of Amazon EMR
2. Exploring the Architecture and Deployment Options
3. Common Use Cases and Architecture Patterns
4. Big Data Applications and Notebooks Available in Amazon EMR
5. Setting Up and Configuring EMR Clusters
6. Monitoring, Scaling, and High Availability
7. Understanding Security in Amazon EMR
8. Understanding Data Governance in Amazon EMR
9. Implementing Batch ETL Pipeline with Amazon EMR and Apache Spark
10. Implementing Real-Time Streaming with Amazon EMR and Spark Streaming
11. Implementing UPSERT on S3 Data Lake with Apache Spark and Apache Hudi
12. Orchestrating Amazon EMR Jobs with AWS Step Functions and Apache Airflow/MWAA
13. Migrating On-Premises Hadoop Workloads to Amazon EMR
14. Best Practices and Cost Optimization Techniques
