Supervised Machine Learning for Text Analysis in R
Tentative Chinese title: R語言文本分析的監督式機器學習

Emil Hvitfeldt, Julia Silge

Product Description

Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features for machine learning from language. Supervised Machine Learning for Text Analysis in R explains how to preprocess text data for modeling, train models, and evaluate model performance using tools from the tidyverse and tidymodels ecosystem. Models like these can be used to make predictions for new observations, to understand what natural language features or characteristics contribute to differences in the output, and more. If you are already familiar with the basics of predictive modeling, use the comprehensive, detailed examples in this book to extend your skills to the domain of natural language processing.

This book provides practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate unstructured text data into their modeling pipelines. Learn how to use text data for both regression and classification tasks, and how to apply more straightforward algorithms like regularized regression or support vector machines as well as deep learning approaches. Natural language must be dramatically transformed to be ready for computation, so we explore typical text preprocessing and feature engineering steps like tokenization and word embeddings from the ground up. These steps influence model results in ways we can measure, both in terms of model metrics and other tangible consequences such as how fair or appropriate model results are.

About the Authors

Emil Hvitfeldt is a clinical data analyst working in healthcare and an adjunct professor at American University, where he teaches statistical machine learning with tidymodels. He is also an open source R developer and the author of the textrecipes package.

Julia Silge is a data scientist and software engineer at RStudio PBC, where she works on open source modeling tools. She is an author, an international keynote speaker and educator, and a real-world practitioner focusing on data analysis and machine learning practice.
