In-Memory Analytics with Apache Arrow - Second Edition: Accelerate data analytics for efficient processing of flat and hierarchical data structures

Topol, Matthew; McKinney, Wes

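  • Publisher: Packt Publishing
  • Publication date: 2024-09-30
  • List price: $2,010
  • VIP price: $1,910 (95% of list price)
  • Language: English
  • Pages: 406
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1835461220
  • ISBN-13: 9781835461228
  • Related categories: Data Science, Algorithms-data-structures
  • Overseas purchase-on-behalf title (must be checked out separately)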

Product Description

Harness the power of Apache Arrow to optimize tabular data processing and develop robust, high-performance data systems with its standardized, language-independent columnar memory format

Key Features:

- Explore Apache Arrow's data types and integration with pandas, Polars, and Parquet

- Work with Arrow libraries such as Flight SQL, Acero compute engine, and Dataset APIs for tabular data

- Enhance and accelerate machine learning data pipelines using Apache Arrow and its subprojects

- Purchase of the print or Kindle book includes a free PDF eBook

Book Description:

Apache Arrow is an open source, columnar in-memory data format designed for efficient data processing and analytics. This book harnesses the author's 15 years of experience to show you a standardized way to work with tabular data across various programming languages and environments, enabling high-performance data processing and exchange.
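As a rough illustration of what "columnar" means in the description above, the sketch below contrasts a row-oriented layout with a column-oriented one. It uses only the Python standard library and is not Arrow's API or an example from the book; the field names `id` and `price` and the sample values are invented for illustration.

```python
from array import array

# Row-oriented layout: one Python object per record (hypothetical sample data).
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented layout: one contiguous, typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# An aggregate over one field scans a single contiguous buffer
# instead of visiting every record object.
total_price = sum(columns["price"])

# Size of the price column's data buffer: 3 values * 8 bytes each.
price_buffer_bytes = columns["price"].itemsize * len(columns["price"])

print(total_price, price_buffer_bytes)
```

A real Arrow array carries more than this (for example, validity information for nulls), so treat the sketch as a picture of the layout idea only.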

This updated second edition gives you an overview of the Arrow format, highlighting its versatility and benefits through real-world use cases. It guides you through enhancing data science workflows, optimizing performance with Apache Parquet and Spark, and ensuring seamless data translation. You'll explore data interchange and storage formats, and Arrow's relationships with Parquet, Protocol Buffers, FlatBuffers, JSON, and CSV. You'll also discover Apache Arrow subprojects, including Flight, Flight SQL, Arrow Database Connectivity (ADBC), and nanoarrow. You'll learn to streamline machine learning workflows, use Arrow Dataset APIs, and integrate with popular analytical data systems such as Snowflake, Dremio, and DuckDB. The later chapters present real-world examples and case studies of products powered by Apache Arrow, offering practical insights into its applications.

By the end of this book, you'll have all the building blocks to create efficient and powerful analytical services and utilities with Apache Arrow.

What You Will Learn:

- Use Apache Arrow libraries to access data files, both locally and in the cloud

- Understand the zero-copy elements of the Apache Arrow format

- Improve the read performance of data pipelines by memory-mapping Arrow files

- Produce and consume Apache Arrow data efficiently by sharing memory with the C API

- Leverage the Arrow compute engine, Acero, to perform complex operations

- Create Arrow Flight servers and clients for transferring data quickly

- Build the Arrow libraries locally and contribute to the community
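Two of the items above, zero-copy access and memory-mapping, can be sketched in miniature with the Python standard library. This is only an analogy under stated assumptions: the file is a raw buffer of 64-bit floats rather than a real Arrow file, the names `column` and `window` are hypothetical, and the Arrow libraries provide their own memory-mapping facilities that are not shown here.

```python
import mmap
import os
import tempfile
from array import array

# Write a raw buffer of 64-bit floats to a temporary file
# (a stand-in for a real Arrow file, which has its own format).
values = array("d", [1.0, 2.0, 3.0, 4.0])
fd, path = tempfile.mkstemp(suffix=".bin")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(values.tobytes())

    # Map the file into memory; the OS pages data in on demand
    # rather than reading the whole file into a Python bytes object.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Reinterpret the mapped bytes as doubles without copying them.
        with memoryview(mapped) as raw, raw.cast("d") as column:
            # Slicing a memoryview yields another view over the same memory.
            with column[1:3] as window:
                # Values are copied out only here, for the slice we need.
                window_values = window.tolist()
finally:
    os.remove(path)

print(window_values)
```

The views are released before the mapping is closed, since a mapping cannot be closed while buffers exported from it are still alive.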

Who this book is for:

This book is for developers, data engineers, and data scientists looking to explore the capabilities of Apache Arrow from the ground up. Whether you're building utilities for data analytics and query engines, or building full pipelines with tabular data, this book can help you regardless of your preferred programming language. A basic understanding of data analysis concepts is helpful, but not required. Code examples are provided in C++, Python, and Go throughout the book.

Table of Contents

- Getting Started with Apache Arrow

- Working with Key Arrow Specifications

- Format and Memory Handling

- Crossing the Language Barrier with the Arrow C Data API

- Acero: A Streaming Arrow Execution Engine

- Using the Arrow Datasets API

- Exploring Apache Arrow Flight RPC

- Understanding Arrow Database Connectivity (ADBC)

- Using Arrow with Machine Learning Workflows

- Powered by Apache Arrow

- How to Leave Your Mark on Arrow

- Future Development and Plans
