Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling

Javier Luraschi, Kevin Kuo, Edgar Ruiz

Product Description

If you’re like most R users, you have deep knowledge of and love for statistics. But as your organization continues to collect huge amounts of data, adding tools such as Apache Spark makes a lot of sense. With this practical book, data scientists and professionals working with large-scale data applications will learn how to use Spark from R to tackle big data and big compute problems.

Authors Javier Luraschi, Kevin Kuo, and Edgar Ruiz show you how to use R with Spark to solve different data analysis problems. This book covers relevant data science topics, cluster computing, and issues that should interest even the most advanced users.

  • Analyze, explore, transform, and visualize data in Apache Spark with R
  • Create statistical models to extract information and predict outcomes; automate the process in production-ready workflows
  • Perform analysis and modeling across many machines using distributed computing techniques
  • Use large-scale data from multiple sources and different formats with ease from within Spark
  • Learn about alternative modeling frameworks for graph processing, geospatial analysis, and genomics at scale
  • Dive into advanced topics including custom transformations, real-time data processing, and creating custom Spark extensions

About the Authors

Javier Luraschi is a software engineer with experience in technologies ranging from desktop, web, mobile, and backend to augmented reality and deep learning applications. He previously worked for Microsoft Research and SAP and holds a double degree in Mathematics and Software Engineering. He is the author of various R packages, including sparklyr, cloudml, r2d3, mlflow, tfdeploy, and kerasjs.

Kevin Kuo builds open source libraries for machine learning and model deployment. He has held data science positions in various industries, including insurance, where he was a credentialed actuary. Kevin is the creator of various R packages, including mlflow, mleap, and sparkxgb. He is also an amateur mixologist and sommelier.

Edgar Ruiz has a background in deploying enterprise reporting and business intelligence solutions. He is the author of multiple articles and blog posts sharing analytics insights and server infrastructure for data science. Edgar is the author and administrator of the db.rstudio.com web site and the current administrator of the sparklyr web site. He is also the co-author of the dbplyr package and the creator of the dbplot, tidypredict, and modeldb packages.
