Computational Methods for Integrating Vision and Language (Synthesis Lectures on Computer Vision)
Tentative Chinese title: 整合視覺與語言的計算方法(計算機視覺綜合講座)

Kobus Barnard

  • Publisher: Morgan & Claypool
  • Publication date: 2016-04-21
  • List price: $2,730
  • VIP price: $2,594 (95% of list)
  • Language: English
  • Pages: 228
  • Binding: Paperback
  • ISBN: 1608451127
  • ISBN-13: 9781608451128
  • Related categories: Computer Vision
  • Overseas import title (must be checked out separately)

Product Description

Modeling data from visual and linguistic modalities together creates opportunities for better understanding of both, and supports many useful applications. Examples of dual visual-linguistic data include images with keywords, video with narrative, and figures in documents. We consider two key task-driven themes: translating from one modality to another (e.g., inferring annotations for images) and understanding the data using all modalities, where one modality can help disambiguate information in another. The multiple modalities can either be essentially semantically redundant (e.g., keywords provided by a person looking at the image) or largely complementary (e.g., metadata such as the camera used). Redundancy and complementarity are two endpoints of a scale, and we observe that good performance on translation requires some redundancy, and that joint inference is most useful where some information is complementary.

Computational methods discussed are broadly organized into ones for simple keywords, ones going beyond keywords toward natural language, and ones considering sequential aspects of natural language. Methods for keywords are further organized based on localization of semantics, going from words about the scene taken as a whole, to words that apply to specific parts of the scene, to relationships between parts. Methods going beyond keywords are organized by the linguistic roles that are learned, exploited, or generated. These include proper nouns, adjectives, spatial and comparative prepositions, and verbs. More recent developments in dealing with sequential structure include automated captioning of scenes and video, alignment of video and text, and automated answering of questions about scenes depicted in images.
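To make the first theme (translating from images to keywords) concrete, here is a minimal, hypothetical sketch of keyword transfer by nearest neighbors: the keywords attached to the training images closest to a query image in feature space are copied to the query. The feature vectors, keyword sets, and the `annotate` helper are invented for illustration and are not taken from the book, which surveys far richer statistical models.

```python
import numpy as np

# Toy training set: each row is an image feature vector (e.g., color/texture statistics),
# paired with the keywords a person assigned to that image.
train_features = np.array([
    [0.9, 0.1, 0.0],   # a "sky"-dominated image
    [0.1, 0.8, 0.1],   # a "grass"-dominated image
    [0.0, 0.2, 0.9],   # a "water"-dominated image
])
train_keywords = [
    {"sky", "cloud"},
    {"grass", "field"},
    {"water", "lake"},
]

def annotate(query_features, k=2):
    """Translate image features into keywords by pooling labels of the k nearest images."""
    distances = np.linalg.norm(train_features - query_features, axis=1)
    nearest = np.argsort(distances)[:k]
    predicted = set()
    for idx in nearest:
        predicted |= train_keywords[idx]
    return predicted

# A query image whose features look mostly "sky-like" inherits sky-related keywords.
print(annotate(np.array([0.85, 0.15, 0.05])))
```

This toy transfer works only when the modalities are somewhat redundant, echoing the observation above that translation requires redundancy between what the features encode and what the keywords describe.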
