Introduction to HPC with MPI for Data Science (Undergraduate Topics in Computer Science)

Frank Nielsen

Product description

This gentle introduction to High Performance Computing (HPC) for Data Science using the Message Passing Interface (MPI) standard is designed as a first course for undergraduates in parallel programming on distributed-memory models, and requires only basic programming notions.

The book is divided into two parts: the first covers high-performance computing using C++ with the Message Passing Interface (MPI) standard, and the second covers high-performance data analytics on computer clusters.

The first part describes the fundamental notions of blocking versus non-blocking point-to-point communications, global communications (such as broadcast or scatter), and collaborative computations (reduce), together with the Amdahl and Gustafson speed-up laws, before addressing parallel sorting and parallel linear algebra on computer clusters. The common ring, torus, and hypercube topologies of clusters are then explained, and global communication procedures on these topologies are studied. This first part closes with the MapReduce (MR) model of computation, which is well suited to processing big data using the MPI framework.
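As a rough illustration of the two speed-up laws named above, the following C++ sketch evaluates their standard textbook forms. The function names `amdahl_speedup` and `gustafson_speedup` and the parameter `serial_fraction` are hypothetical names chosen for this example and are not taken from the book.

```cpp
#include <cassert>

// Amdahl's law (fixed problem size): if a fraction `serial_fraction` of the
// work cannot be parallelized, the speed-up on `processes` processes is
//   1 / (serial_fraction + (1 - serial_fraction) / processes).
// Hypothetical helper, written for illustration only.
double amdahl_speedup(double serial_fraction, int processes) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(processes >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processes);
}

// Gustafson's law (problem size grows with the machine): here
// `serial_fraction` is the serial share of the time measured on the
// parallel run, and the scaled speed-up is
//   serial_fraction + (1 - serial_fraction) * processes.
// Hypothetical helper, written for illustration only.
double gustafson_speedup(double serial_fraction, int processes) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(processes >= 1);
    return serial_fraction + (1.0 - serial_fraction) * processes;
}
```

With a 10% serial fraction on 10 processes, `amdahl_speedup(0.1, 10)` is about 5.26 and can never exceed 10 however many processes are added, whereas `gustafson_speedup(0.1, 10)` is 9.1 because the parallel part of the workload is assumed to scale with the number of processes.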

In the second part, the book focuses on high-performance data analytics. Flat and hierarchical clustering algorithms are introduced for data exploration, along with how to program these algorithms on computer clusters, followed by machine learning classification and an introduction to graph analytics. This part closes with a concise introduction to data core-sets, which let big data problems be reduced to tiny data problems.

Exercises are included at the end of each chapter so that students can practice the concepts learned, and a final section contains an overall exam that lets them evaluate how well they have assimilated the material covered in the book.
