Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark
Akash Tandon, Sandy Ryza, Uri Laserson
Product Description
The amount of data being generated today is staggering--and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming.
Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques--including classification, clustering, collaborative filtering, and anomaly detection--to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.
If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
- Familiarize yourself with Spark's programming model and ecosystem
- Learn general approaches in data science
- Examine complete implementations that analyze large public datasets
- Discover which machine learning tools make sense for particular problems
- Explore code that can be adapted to many uses
About the Authors
Akash Tandon is an independent consultant and experienced full-stack data engineer. Previously, he was a senior data engineer at Atlan, where he built software for enterprise data science teams. In another life, he worked on data science projects for governments and built risk assessment tools at a FinTech startup. As a student, he wrote open source software with the R project for statistical computing and Google. In his free time, he researches things for no good reason.
Sandy Ryza is a software engineer at Elementl. Previously, he developed algorithms for public transit at Remix and was a senior data scientist at Cloudera and Clover Health. He is an Apache Spark committer, Apache Hadoop PMC member, and founder of the Time Series for Spark project.
Uri Laserson is founder & CTO of Patch Biosciences. Previously, he worked on big data and genomics at Cloudera.
Sean Owen is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author of Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.
Josh Wills is an independent data science and engineering consultant, the former head of data engineering at Slack and data science at Cloudera, and wrote a tweet about data scientists once.