Apache Flume: Distributed Log Collection for Hadoop (What You Need to Know)

Steve Hoffman

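  • Publisher: Packt Publishing
  • Publication date: 2013-07-04
  • List price: $1,710
  • VIP price: $1,625 (95% of list price)
  • Language: English
  • Pages: 108
  • Binding: Paperback
  • ISBN: 1782167919
  • ISBN-13: 9781782167914
  • Related categories: Hadoop
  • Overseas special-order title (requires separate checkout)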

Product description

If your role includes moving datasets into Hadoop, this book will help you do it more efficiently using Apache Flume. From installation to customization, it's a complete step-by-step guide on making the service work for you.

Overview

  • Integrate Flume with your data sources
  • Transcode your data en route in Flume
  • Route and separate your data using regular expression matching
  • Configure failover paths and load-balancing to remove single points of failure
  • Use Gzip compression for files written to HDFS

In Detail

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with many failover and recovery mechanisms.

Apache Flume: Distributed Log Collection for Hadoop covers the problems that arise with HDFS and streaming data and logs, and how Flume can resolve them. It explains the general architecture of Flume, including moving data to and from databases and NoSQL-style data stores, and discusses optimizing performance. It also includes real-world scenarios of Flume implementations.

Apache Flume: Distributed Log Collection for Hadoop starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation process and compilation of Flume.

It introduces channels and channel selectors. For each architectural component (Sources, Channels, Sinks, Channel Processors, Sink Groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers on writing custom implementations are included as well.
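To make the component list concrete, here is a minimal sketch of how a source, a channel, and a sink are wired together in a Flume agent's properties file. It is not taken from the book: the agent name `agent1`, the component names, the port, the capacities, and the HDFS path are all assumptions chosen for illustration.

```properties
# Hypothetical agent "agent1"; every name, port, and path below is illustrative.
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = hdfs1

# Source: accepts newline-terminated text on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer (fast, but buffered events are lost if the agent stops)
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 100

# Sink: drains the channel into gzip-compressed files on HDFS
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = /flume/events
agent1.sinks.hdfs1.hdfs.fileType = CompressedStream
agent1.sinks.hdfs1.hdfs.codeC = gzip
```

Assuming the file is saved as `conf/agent1.conf`, an agent like this would typically be started with something along the lines of `bin/flume-ng agent --conf conf --conf-file conf/agent1.conf --name agent1`. Swapping the channel type to `file` trades throughput for durability across restarts.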

By the end, you should be able to construct a series of Flume agents to transport your streaming data and logs from your systems into Hadoop in near real time.

What you will learn from this book

  • Understand the Flume architecture
  • Download and install open source Flume from Apache
  • Discover when to use a memory or file-backed channel
  • Understand and configure the Hadoop File System (HDFS) sink
  • Learn how to use sink groups to create redundant data flows
  • Configure and use various sources for ingesting data
  • Inspect data records and route to different or multiple destinations based on payload content
  • Transform data en route to Hadoop
  • Monitor your data flows
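
As a rough illustration of two of these topics, content-based routing and redundant sink groups, the following properties fragment is a sketch and not an excerpt from the book. The agent, channel, and sink names, the `level` header, and the regular expression are assumptions, and the definitions of the source type, the channels, and the sinks themselves are omitted.

```properties
# Hypothetical fragment for an agent "agent1"; component definitions are omitted.
agent1.sources = src1
agent1.channels = chErrors chDefault
agent1.sinks = k1 k2

# Interceptor: copy a regex capture group from the event body into a "level" header
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = regex_extractor
agent1.sources.src1.interceptors.i1.regex = ^(ERROR|WARN|INFO)
agent1.sources.src1.interceptors.i1.serializers = s1
agent1.sources.src1.interceptors.i1.serializers.s1.name = level

# Channel selector: send ERROR events to one channel, everything else to the default
agent1.sources.src1.channels = chErrors chDefault
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = level
agent1.sources.src1.selector.mapping.ERROR = chErrors
agent1.sources.src1.selector.default = chDefault

# Sink group: k1 is preferred; k2 takes over while k1 is failing
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5
agent1.sinkgroups.g1.processor.maxpenalty = 10000
```

For the failover group to make sense, both sinks would need to read from the same channel. Setting `processor.type = load_balance` instead spreads events across the sinks in the group rather than keeping one in reserve.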

Approach

A starter guide that covers Apache Flume in detail.

Who this book is written for

Apache Flume: Distributed Log Collection for Hadoop is intended for people who are responsible for moving datasets into Hadoop in a timely and reliable manner, such as software engineers, database administrators, and data warehouse administrators.
