Applied Data Science Using Pyspark: Learn the End-To-End Predictive Model-Building Cycle

Kakarla, Ramcharan; Krishnan, Sundar; Dhamodharan, Balaji

  • Publisher: Apress
  • Publication date: 2024-11-18
  • List price: $2,350
  • VIP price: $2,233 (5% off)
  • Language: English
  • Pages: 390
  • Binding: Quality Paper - also called trade paper
  • ISBN: 9798868808197
  • ISBN-13: 9798868808197
  • Related categories: Spark, Data Science, Machine Learning
  • Not yet released; cannot be ordered

Product description

This comprehensive guide, featuring hand-picked examples of daily use cases, will walk you through the end-to-end predictive model-building cycle using the latest techniques and industry tricks. In Chapters 1, 2, and 3, we will begin by setting up the environment and covering the basics of PySpark, focusing on data manipulation. Chapter 4 delves into the art of variable selection, demonstrating various techniques available in PySpark. In Chapters 5, 6, and 7, we explore machine learning algorithms, their implementations, and fine-tuning techniques. Chapters 8 and 9 will guide you through machine learning pipelines and various methods to operationalize and serve models using Docker/API. Chapter 10 will demonstrate how to unlock the power of predictive models to create a meaningful impact on your business. Chapter 11 introduces some of the most widely used and powerful modeling frameworks to unlock real value from data.

In this new edition, you will learn predictive modeling frameworks that can quantify customer lifetime values and estimate the return on your predictive modeling investments. This edition also includes methods to measure engagement and identify actionable populations for effective churn treatments. Additionally, a dedicated chapter on experimentation design has been added, covering steps to efficiently design, conduct, test, and measure the results of your models. All code examples have been updated to reflect the latest stable version of Spark.

You will:

  • Gain an overview of end-to-end predictive model building
  • Understand multiple variable selection techniques and their implementations
  • Learn how to operationalize models
  • Perform data science experiments and learn useful tips

About the authors

Ramcharan Kakarla is currently Principal ML at Altice USA. He is a passionate data science and artificial intelligence advocate with 10 years of experience. He holds a master's degree from Oklahoma State University with a specialization in data mining. He is currently pursuing a master's in management at the University of California, Los Angeles. Prior to UCLA and OSU, he received his bachelor's in electrical and electronics engineering from Sastra University in India. He was born and raised in the coastal town of Kakinada, India. He started his career as a performance engineer working with several Fortune 500 clients, including State Farm, British Airways, Comcast, and JP Morgan Chase. In his current role, he is focused on building data science solutions and frameworks leveraging big data. He has published several papers and posters in the field of predictive analytics. He served as SAS Global Ambassador for 2015.

Sundar Krishnan is a Senior Data Science Manager at CVS Health. He has 12+ years of experience leading cross-functional data science teams and is an AI, ML, and cloud platform expert. He has a proven track record of building high-performing teams and implementing innovative AI strategies to optimize operational costs and generate substantial revenue. An expert in 0-to-1 product development, he has successfully led teams from conception to market-ready products in Gen AI and data science. Sundar was born and raised in Tamil Nadu, India, and has a bachelor's degree from the Government College of Technology, Coimbatore. He completed his master's at Oklahoma State University, Stillwater. He blogs about his data science work on Medium in his spare time.

Balaji Dhamodharan is an award-winning global data science leader, guiding teams to develop and implement innovative, scalable ML solutions. He currently leads the AI/ML and MLOps strategy initiatives at NXP Semiconductors. He has over a decade of experience delivering large-scale technology solutions across diverse industries. His expertise spans software engineering, enterprise AI platforms, AutoML, MLOps, and generative AI technologies. Balaji holds master's degrees in Management Information Systems and Data Science from Oklahoma State University and Indiana University. Originally from Chennai, India, Balaji currently resides in Austin, TX, USA.

Venkata Gunnu is a Senior Executive Director of Knowledge Management and Innovation at JPM Chase. He is an executive with a successful background crafting enterprise-wide data and data science solutions, GenAI, process improvements, and data and data science-centric products. Concept to implementation strategist with demonstrated success controlling multiple projects that elevate organizational efficiency while optimizing resources. Data-focused and analytical with a track record of automating functions, standardizing data management protocol, and introducing new business intelligence solutions.
