Mastering Spark for Data Science

Andrew Morgan, Antoine Amend, David George, Matthew Hallett

Product Description

Unlock the complexities of lightning-fast data science

About This Book

  • Develop and apply advanced analytical techniques with Spark
  • Learn how to tell a compelling story in data science using Spark's ecosystem
  • Explore data at scale and work with cutting-edge data science methods

Who This Book Is For

This book is for readers with beginner-level familiarity with the Spark architecture and data science applications who are looking for a challenge and want to learn cutting-edge techniques. It assumes working knowledge of data science, common machine learning methods, and popular data science tools, and that you have previously run proof-of-concept studies and built prototypes.

What You Will Learn

  • Learn the design patterns that integrate Spark into industrialized data science pipelines
  • Understand how commercial data scientists design scalable, reusable code for data science services
  • Get a grasp of cutting-edge data science methods so you can study trends and causality
  • Find out how to use Spark as a universal ingestion engine and as a web scraper
  • Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
  • Get to know the best practices for Extended Exploratory Data Analysis, as commonly performed by commercial data science teams
  • Grasp advanced Spark concepts, as well as solution design patterns and integration architectures
  • Demonstrate powerful data science pipelines
  • Get detailed guidance on how to run Spark in production

In Detail

The purpose of data science is to transform the world using data, and this goal is mainly achieved by disrupting and changing real processes in real industries. To operate at this level, you need to be able to build data science solutions of substance: ones that solve real problems and that run reliably enough for people to trust and act on. Spark has emerged as the big data platform of choice for data scientists.

This book dives deep into Spark to deliver production-grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. We demonstrate the process by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. We use the core Spark APIs and take a deep dive into advanced libraries, including Spark SQL, visual streaming, MLlib, and more.

We introduce advanced techniques and methods to help you build data science solutions, and show you how to construct commercial-grade data products. Using a sequence of tutorials that deliver a working news intelligence service, we explain advanced Spark architectures, unveil sophisticated data science methods, demonstrate how to work with geographic data in Spark, and explain how to tune Spark algorithms so they scale linearly.
