Distributed Machine Learning with PySpark: Migrating Effortlessly from Pandas and Scikit-Learn

Testas, Abdelaziz

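  • Publisher: Apress
  • Publication date: 2023-11-24
  • List price: $1,870
  • VIP price: $1,777 (5% off)
  • Language: English
  • Pages: 490
  • Binding: Quality Paper - also called trade paper
  • ISBN: 1484297504
  • ISBN-13: 9781484297506
  • Related categories: Spark, Machine Learning
  • Overseas special-order title (requires separate checkout)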

Product Description

Migrate from pandas and scikit-learn to PySpark to handle vast amounts of data and process it faster. This book shows you how to make the transition by adapting your existing skills and leveraging the similarities in syntax, functionality, and interoperability between these tools.

Distributed Machine Learning with PySpark offers a roadmap to data scientists considering transitioning from small data libraries (pandas/scikit-learn) to big data processing and machine learning with PySpark. You will learn to translate Python code from pandas/scikit-learn to PySpark to preprocess large volumes of data and build, train, test, and evaluate popular machine learning algorithms such as linear and logistic regression, decision trees, random forests, support vector machines, Naïve Bayes, and neural networks.

After completing this book, you will understand the foundational concepts of data preparation and machine learning and will have the skills necessary to apply these methods using PySpark, the industry standard for building scalable ML data pipelines.

What You Will Learn

  • Master the fundamentals of supervised learning, unsupervised learning, NLP, and recommender systems
  • Understand the differences between PySpark, scikit-learn, and pandas
  • Perform linear regression, logistic regression, and decision tree regression with pandas, scikit-learn, and PySpark
  • Distinguish between the pipelines of PySpark and scikit-learn

Who This Book Is For

Data scientists, data engineers, and machine learning practitioners who have some familiarity with Python, but who are new to distributed machine learning and the PySpark framework.

About the Author

Abdelaziz Testas, Ph.D., is a data scientist with over a decade of experience in data analysis and machine learning, specializing in the use of standard Python libraries and Spark distributed computing. He holds a Ph.D. in Economics from Leeds University and a Master's degree in Finance from Glasgow University. He has also earned several certificates in computer science and data science.

In the last ten years, he has worked for Nielsen in Fremont, California as a Lead Data Scientist focused on improving the company's audience measurement through planning, initiating, and executing end-to-end data science projects and methodology work. He has created advanced solutions for Nielsen's digital ad and content rating products by leveraging subject matter expertise in media measurement and data science. He is passionate about helping others improve their machine learning skills and workflows, and is excited to share his knowledge and experience with a wider audience through this book.
