Big Data Analysis with Python (Paperback): Combine Spark and Python to unlock the powers of parallel computing and machine learning

Ivan Marin, Ankit Shukla, Sarang VK

Product Description

Key Features

  • Get a hands-on, fast-paced introduction to the Python data science stack
  • Explore ways to create useful metrics and statistics from large datasets
  • Create detailed analysis reports with real-world data

Book Description

Processing big data in real time is challenging because of scalability, information inconsistency, and fault-tolerance concerns. Big Data Analysis with Python teaches you how to use tools that can control this data avalanche for you. With this book, you'll learn practical techniques to aggregate data into useful dimensions for later analysis, extract statistical measurements, and transform datasets into features for other systems.

The book begins with an introduction to data manipulation in Python using pandas. You'll then get familiar with statistical analysis and plotting techniques. Through multiple hands-on activities, you'll learn to analyze data that is distributed across several computers by using Dask. As you progress, you'll study how to aggregate data for plots when the full dataset does not fit in memory. You'll also explore Hadoop (HDFS and YARN), which will help you tackle larger datasets. The book also covers Spark and explains how it interacts with other tools.

By the end of this book, you'll be able to bootstrap your own Python environment, process large files, and manipulate data to generate statistics, metrics, and graphs.
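
The idea of aggregating data that does not fit in memory, mentioned in the description above, can be illustrated with a minimal sketch. This is not taken from the book: it uses only the Python standard library, and the function name `streaming_group_mean`, the column names, and the sample rows are all hypothetical. It reads one row at a time and keeps only running totals, so memory use depends on the number of groups rather than the number of rows.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(rows, key, value):
    """Return {group: mean of value}, reading rows one at a time.

    Only a running total and a count per group are kept in memory.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row[value])
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample data; a real run would pass an open file handle
# for a large CSV file instead of an in-memory string.
sample = io.StringIO(
    "city,temperature\n"
    "Campinas,24.0\n"
    "Sao Paulo,20.0\n"
    "Campinas,26.0\n"
    "Sao Paulo,22.0\n"
)

means = streaming_group_mean(csv.DictReader(sample), key="city", value="temperature")
print(means)
```

The resulting small dictionary of per-group means is what would be handed to a plotting library. pandas offers the same pattern at a higher level through the `chunksize` argument of `read_csv`, and Dask extends it across several machines.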

What you will learn

  • Use Python to read and transform data into different formats
  • Generate basic statistics and metrics using data on disk
  • Work with computing tasks distributed over a cluster
  • Convert data from various sources into storage or querying formats
  • Prepare data for statistical analysis, visualization, and machine learning
  • Present data in the form of effective visuals

Who this book is for

Big Data Analysis with Python is designed for Python developers, data analysts, and data scientists who want to get hands-on with methods to control data and transform it into impactful insights. Basic knowledge of statistical measurements and relational databases will help you to understand various concepts explained in this book.

 

About the Authors

Ivan Marin is a Systems Architect and Data Scientist working at Daitan Group, a Campinas-based software company. He designs Big Data systems for large volumes of data and implements Machine Learning pipelines end to end using Python and Spark. He is also an active organizer of Data Science, Machine Learning, and Python in São Paulo and has given Python for Data Science courses at university level.

Sarang VK is a data scientist whose responsibilities include identifying data sources, preparing data, and developing and evaluating predictive and optimization models for production-level machine learning and statistical solutions, including back-end and front-end development. He also supports pre-sales, stakeholder communication, requirement gathering, scoping, and solutions.

His strengths are Machine/Deep Learning, SQL, Predictive Analytics, Time-Series, Simulation Modelling, Optimization, Image/Text Analytics, NLP, Python, R, Spark, TensorFlow, Keras, h2o, SAP-PAL, AWS, SAP Predictive Factory, Azure, Financial Analytics, Supply Chain, Banking and Insurance, Retail/Customer Analytics, Trading Analytics, Healthcare Analytics, RPA, IPA.

Ankit Shukla is a Data Scientist with a passion for using data science and advanced analytics to solve real-life problems and bring ideas to fruition. He is skilled in using Machine Learning/AI and statistical modelling techniques to solve business problems and create actual dollar value for clients. He is experienced in working with copious amounts of data, using the latest Big Data technologies to design data pipelines and generate impactful data-driven insights and reports.

His skill sets are: R, Python, SQL, HiveQL, Excel, Linux Shell Scripting, SAS (Working Knowledge), Docker.

  • Frameworks: Keras, OpenCV, XGBoost, NumPy, Scikit-learn, Caret, ggplot2, recommenderlab
  • Big Data: Hadoop, Hive, Impala, PySpark, SparkR, Pig, AWS (S3, EC2, EMR, SageMaker, Redshift)
  • Machine Learning: Regression, Classification, Clustering, Feature Selection, Model Selection/Assessment, Recommender Systems, Neural Networks, Deep Learning, Transfer Learning
  • Visualization: Tableau, R, Shiny

Table of Contents

  1. The Python Data Science Stack
  2. Statistical Visualizations
  3. Working with Big Data Frameworks
  4. Diving Deeper with Spark
  5. Handling Missing Values and Correlation Analysis
  6. Exploratory Data Analysis
  7. Reproducibility in Big Data Analysis
  8. Creating a Full Analysis Report
