Data Processing with Optimus: Supercharge big data preparation tasks for analytics and machine learning with Optimus using Dask and PySpark

Leon, Argenis, Aguirre, Luis

  • Publisher: Packt Publishing
  • Publication date: 2021-09-03
  • Price: $1,830
  • VIP price: $1,739 (5% off)
  • Language: English
  • Pages: 300
  • Binding: Quality paper (also called trade paper)
  • ISBN: 1801079560
  • ISBN-13: 9781801079563
  • Related categories: Spark, Big Data, Machine Learning
  • Overseas special-order title (requires separate checkout)

Product Description

Written by the core Optimus team, this comprehensive guide will help you understand how Optimus improves the whole data processing landscape.


Key Features:

  • Load, merge, and save small and big data efficiently with Optimus
  • Learn Optimus functions for data analytics, feature engineering, machine learning, cross-validation, and NLP
  • Discover how Optimus improves other data frame technologies and helps you speed up your data processing tasks


Book Description:

Optimus is a Python library that works as a unified API for cleaning, processing, and merging data. It can handle small and big data on your local laptop or on remote clusters, using CPUs or GPUs.


The book begins by covering the internals of Optimus and how it works in tandem with existing technologies to serve your data processing needs. You'll then learn how to use Optimus for loading and saving data from text data formats such as CSV and JSON files, exploring binary files such as Excel, and for columnar data processing with Parquet, Avro, and ORC. Next, you'll get to grips with the profiler and its data types, a unique feature of the Optimus Dataframe that assists with data quality. You'll see how to use the plots available in Optimus, such as histograms, frequency charts, and scatter and box plots, and understand how Optimus lets you connect to libraries such as Plotly and Altair. You'll also delve into advanced applications such as feature engineering, machine learning, cross-validation, and natural language processing functions, and explore the advancements in Optimus. Finally, you'll learn how to create data cleaning and transformation functions and add a hypothetical new data processing engine with Optimus.


By the end of this book, you'll be able to improve your data science workflow with Optimus easily.


What You Will Learn:

  • Use more than 100 data processing functions on columns and other string-like values
  • Reshape and pivot data to get the output in the required format
  • Find out how to plot histograms, frequency charts, scatter plots, box plots, and more
  • Connect Optimus with popular Python visualization libraries such as Plotly and Altair
  • Apply string clustering techniques to normalize strings
  • Discover functions to explore, fix, and remove poor quality data
  • Use advanced techniques to remove outliers from your data
  • Add engines and custom functions to clean, process, and merge data
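
To give a feel for the string clustering item above, here is a minimal sketch of one common technique, key-collision ("fingerprint") clustering. It is written in plain Python with the standard library only and is not taken from the book or from the Optimus API; the function names `fingerprint`, `cluster_strings`, and `normalize` are hypothetical and chosen purely for illustration.

```python
import string
import unicodedata
from collections import Counter, defaultdict

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def fingerprint(value: str) -> str:
    """Return a key that is identical for spellings differing only in
    case, accents, punctuation, or word order (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and replace punctuation with spaces.
    text = text.lower().translate(_PUNCT_TO_SPACE)
    # Deduplicate and sort the tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster_strings(values):
    """Group values that share the same fingerprint."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return dict(clusters)


def normalize(values):
    """Replace each value with the most frequent spelling in its cluster."""
    canonical = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in cluster_strings(values).items()
    }
    return [canonical[fingerprint(value)] for value in values]


if __name__ == "__main__":
    cities = ["New York", "new york", "York, New", "New York", "Boston"]
    print(cluster_strings(cities))
    print(normalize(cities))
```

The sketch only merges values whose fingerprints collide exactly; spellings with typos (for example "New Yrok") would need a distance-based method instead.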


Who this book is for:

This book is for Python developers who want to explore, transform, and prepare big data for machine learning, analytics, and reporting using Optimus, a unified API to work with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and Spark. Although not necessary, beginner-level knowledge of Python will be helpful. Basic knowledge of the CLI is required to install Optimus and its requirements. For using GPU technologies, you'll need an NVIDIA graphics card compatible with NVIDIA's RAPIDS library, which is compatible with Windows 10 and Linux.
