Linguistic Resources for Natural Language Processing: On the Necessity of Using Linguistic Methods to Develop NLP Software

Silberztein, Max

  • Publisher: Springer
  • Publication date: 2024-03-14
  • Price: $6,270
  • VIP price: $5,957 (5% off)
  • Language: English
  • Pages: 217
  • Binding: Hardcover - also called cloth, retail trade, or trade
  • ISBN: 3031438108
  • ISBN-13: 9783031438103
  • Related category: Text-mining
  • Overseas-order book (must be checked out separately)

Product description

Empirical -- data-driven, neural network-based, probabilistic, and statistical -- methods seem to be the modern trend. Recently, OpenAI's ChatGPT, Google's Bard, and Microsoft's Sydney chatbots have garnered a lot of attention for their detailed answers across many knowledge domains. As a consequence, most AI researchers are no longer interested in trying to understand what common intelligence is or how intelligent agents construct scenarios to solve various problems. Instead, they now develop systems that extract solutions from massive databases used as cheat sheets. In the same manner, Natural Language Processing (NLP) software that uses training corpora associated with empirical methods is trendy, as most researchers in NLP today use large training corpora, always to the detriment of the development of formalized dictionaries and grammars.

Without questioning the intrinsic value of many software applications based on empirical methods, this volume aims to rehabilitate the linguistic approach to NLP. In the introduction, the editor uncovers several limitations and flaws of using training corpora to develop NLP applications, even the simplest ones, such as automatic taggers.

The first part of the volume is dedicated to showing how carefully handcrafted linguistic resources could be successfully used to enhance current NLP software applications. The second part presents two representative cases where data-driven approaches cannot be implemented simply because there is not enough data available for low-resource languages. The third part addresses the problem of how to treat multiword units in NLP software, which is arguably the weakest point of NLP applications today but has a simple and elegant linguistic solution.

It is the editor's belief that readers interested in Natural Language Processing will appreciate the importance of this volume, both for its questioning of the training corpus-based approaches and for the intrinsic value of the linguistic formalization and the underlying methodology presented.



About the author

Max Silberztein is a Professor of Linguistics, Computational Linguistics and Computer Science at the Université de Franche-Comté. He is the author of three NLP software platforms (INTEX, NooJ and ATISHS) and two books (Dictionnaires électroniques et analyse automatique de textes: le système INTEX, Masson 1993; Formalizing Natural Languages: the NooJ approach, Wiley 2016), and the editor of over 15 volumes of selected proceedings in the Springer CCIS and LNCS series.

