An Architecture for Fast and General Data Processing on Large Clusters

Zaharia, Matei

  • Publisher: Morgan & Claypool
  • Publication date: 2016-05-01
  • List price: $2,070
  • VIP price: $1,967 (5% off list)
  • Language: English
  • Pages: 141
  • Binding: Quality Paper (trade paperback)
  • ISBN: 1970001569
  • ISBN-13: 9781970001563
  • Imported title (must be ordered separately)

Product Description

The past few years have seen a major change in computing systems, as growing data volumes and stalling processor speeds require more and more applications to scale out to clusters. Today, myriad data sources, from the Internet to business operations to scientific instruments, produce large and valuable data streams. However, the processing capabilities of single machines have not kept up with the size of data. As a result, organizations increasingly need to scale out their computations over clusters.

At the same time, the speed and sophistication required of data processing have grown. In addition to simple queries, complex algorithms like machine learning and graph analysis are becoming common. And in addition to batch processing, streaming analysis of real-time data is required to let organizations take timely action. Future computing platforms will need not only to scale out traditional workloads, but also to support these new applications.

This book, a revised version of the dissertation that won the 2014 ACM Doctoral Dissertation Award, proposes an architecture for cluster computing systems that can tackle emerging data processing workloads at scale. Whereas early cluster computing systems, like MapReduce, handled batch processing, our architecture also enables streaming and interactive queries, while keeping MapReduce's scalability and fault tolerance. And whereas most deployed systems support only simple one-pass computations (e.g., SQL queries), ours also extends to the multi-pass algorithms required for complex analytics like machine learning. Finally, unlike the specialized systems proposed for some of these workloads, our architecture allows these computations to be combined, enabling rich new applications that intermix, for example, streaming and batch processing.
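To make the "intermix streaming and batch" point concrete, the following Scala sketch (an illustration, not an example from the book) joins a live stream against a precomputed batch dataset using Spark Streaming's transform operation. The HDFS path, the localhost:9999 socket source, and the object name MixedWorkloadSketch are placeholder assumptions.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MixedWorkloadSketch {
      def main(args: Array[String]): Unit = {
        // Local mode with two threads: one for the receiver, one for processing.
        val conf = new SparkConf().setAppName("mixed-workload-sketch").setMaster("local[2]")
        val sc = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(10))

        // Batch side: a static pair RDD built once from stable storage (placeholder path).
        val blocked = sc.textFile("hdfs://namenode/data/blocked_ips.txt").map(ip => (ip, true))

        // Streaming side: text lines arriving on a socket, keyed by their first field.
        val requests = ssc.socketTextStream("localhost", 9999)
          .map(line => (line.split(" ")(0), line))

        // Mixing the two models: each micro-batch of the stream is itself an RDD,
        // so it can be joined directly against the batch-computed dataset.
        val flagged = requests.transform(batch => batch.join(blocked))
        flagged.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

Because both sides are expressed as RDDs, the same fault-tolerance and scheduling machinery applies to the streaming and batch portions alike.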

We achieve these results through a simple extension to MapReduce that adds primitives for data sharing, called Resilient Distributed Datasets (RDDs). We show that this is enough to capture a wide range of workloads. We implement RDDs in the open source Spark system, which we evaluate using synthetic and real workloads. Spark matches or exceeds the performance of specialized systems in many domains, while offering stronger fault tolerance properties and allowing these workloads to be combined. Finally, we examine the generality of RDDs from both a theoretical modeling perspective and a systems perspective.
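The data-sharing primitive is easiest to see in code. The sketch below is loosely modeled on the log-mining example commonly used to introduce RDDs: it caches a filtered dataset in memory and then runs two separate passes over it. The input path and the object name RddSharingSketch are placeholder assumptions, not anything prescribed by the book.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSharingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-sharing-sketch").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Define an RDD lazily over a file in stable storage (placeholder path).
        val lines = sc.textFile("hdfs://namenode/data/app.log")

        // Keep the filtered working set in memory so later passes reuse it
        // instead of rereading the input, as separate one-pass MapReduce jobs would.
        val errors = lines.filter(_.contains("ERROR")).cache()

        // Pass 1 over the shared data: count all error lines.
        val totalErrors = errors.count()

        // Pass 2 over the same in-memory data: count errors from one subsystem.
        val hdfsErrors = errors.filter(_.contains("HDFS")).count()

        println(s"errors: $totalErrors, HDFS errors: $hdfsErrors")
        sc.stop()
      }
    }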

This version of the dissertation makes corrections throughout the text and adds a new section on the evolution of Apache Spark in industry since 2014. In addition, the references have been edited and formatted, and links to them have been added.

