Computational Auditory Scene Analysis: Principles, Algorithms, and Applications
Chinese title (tentative): 計算聽覺場景分析:原則、演算法與應用

DeLiang Wang, Guy J. Brown

  • Publisher: IEEE
  • Publication date: 2006-10-01
  • List price: $5,120
  • VIP price: $4,864 (5% off)
  • Language: English
  • Pages: 395
  • Binding: Hardcover
  • ISBN: 0471741094
  • ISBN-13: 9780471741091
  • Related categories: Algorithms-data-structures
  • Imported title (must be checked out separately)


Product Description

Description

How can we engineer systems capable of "cocktail party" listening?

Human listeners are able to perceptually segregate one sound source from an acoustic mixture, such as a single voice from a mixture of other voices and music at a busy cocktail party. How can we engineer "machine listening" systems that achieve this perceptual feat?

Albert Bregman's book Auditory Scene Analysis, published in 1990, drew an analogy between the perception of auditory scenes and visual scenes, and described a coherent framework for understanding the perceptual organization of sound. His account has stimulated much interest in computational studies of hearing. Such studies are motivated in part by the demand for practical sound separation systems, which have many applications including noise-robust automatic speech recognition, hearing prostheses, and automatic music transcription. This emerging field has become known as computational auditory scene analysis (CASA).

Computational Auditory Scene Analysis: Principles, Algorithms, and Applications provides a comprehensive and coherent account of the state of the art in CASA, in terms of the underlying principles, the algorithms and system architectures that are employed, and the potential applications of this exciting new technology. With a Foreword by Bregman, its chapters are written by leading researchers and cover a wide range of topics including:

  • Estimation of multiple fundamental frequencies
  • Feature-based and model-based approaches to CASA
  • Sound separation based on spatial location
  • Processing for reverberant environments
  • Segregation of speech and musical signals
  • Automatic speech recognition in noisy environments
  • Neural and perceptual modeling of auditory organization

The text is written at a level that will be accessible to graduate students and researchers from related science and engineering disciplines. The extensive bibliography accompanying each chapter will also make this book a valuable reference source. A web site accompanying the text (www.casabook.org) features software tools and sound demonstrations.


Table of Contents

Foreword.

Preface.

Contributors.

Acronyms.

1. Fundamentals of Computational Auditory Scene Analysis (DeLiang Wang and Guy J. Brown).

1.1 Human Auditory Scene Analysis.

1.1.1 Structure and Function of the Auditory System.

1.1.2 Perceptual Organization of Simple Stimuli.

1.1.3 Perceptual Segregation of Speech from Other Sounds.

1.1.4 Perceptual Mechanisms.

1.2 Computational Auditory Scene Analysis (CASA).

1.2.1 What Is CASA?

1.2.2 What Is the Goal of CASA?

1.2.3 Why CASA?

1.3 Basics of CASA Systems.

1.3.1 System Architecture.

1.3.2 Cochleagram.

1.3.3 Correlogram.

1.3.4 Cross-Correlogram.

1.3.5 Time-Frequency Masks.

1.3.6 Resynthesis.

1.4 CASA Evaluation.

1.4.1 Evaluation Criteria.

1.4.2 Corpora.

1.5 Other Sound Separation Approaches.

1.6 A Brief History of CASA (Prior to 2000).

1.6.1 Monaural CASA Systems.

1.6.2 Binaural CASA Systems.

1.6.3 Neural CASA Models.

1.7 Conclusions.

Acknowledgments.

References.

2. Multiple F0 Estimation (Alain de Cheveigné).

2.1 Introduction.

2.2 Signal Models.

2.3 Single-Voice F0 Estimation.

2.3.1 Spectral Approach.

2.3.2 Temporal Approach.

2.3.3 Spectrotemporal Approach.

2.4 Multiple-Voice F0 Estimation.

2.4.1 Spectral Approach.

2.4.2 Temporal Approach.

2.4.3 Spectrotemporal Approach.

2.5 Issues.

2.5.1 Spectral Resolution.

2.5.2 Temporal Resolution.

2.5.3 Spectrotemporal Resolution.

2.6 Other Sources of Information.

2.6.1 Temporal and Spectral Continuity.

2.6.2 Instrument Models.

2.6.3 Learning-Based Techniques.

2.7 Estimating the Number of Sources.

2.8 Evaluation.

2.9 Application Scenarios.

2.10 Conclusion.

Acknowledgments.

References.

3. Feature-Based Speech Segregation (DeLiang Wang).

3.1 Introduction.

3.2 Feature Extraction.

3.2.1 Pitch Detection.

3.2.2 Onset and Offset Detection.

3.2.3 Amplitude Modulation Extraction.

3.2.4 Frequency Modulation Detection.

3.3 Auditory Segmentation.

3.3.1 What Is the Goal of Auditory Segmentation?

3.3.2 Segmentation Based on Cross-Channel Correlation and Temporal Continuity.

3.3.3 Segmentation Based on Onset and Offset Analysis.

3.4 Simultaneous Grouping.

3.4.1 Voiced Speech Segregation.

3.4.2 Unvoiced Speech Segregation.

3.5 Sequential Grouping.

3.5.1 Spectrum-Based Sequential Grouping.

3.5.2 Pitch-Based Sequential Grouping.

3.5.3 Model-Based Sequential Grouping.

3.6 Discussion.

Acknowledgments.

References.

4. Model-Based Scene Analysis (Daniel P. W. Ellis).

4.1 Introduction.

4.2 Source Separation as Inference.

4.3 Hidden Markov Models.

4.4 Aspects of Model-Based Systems.

4.4.1 Constraints: Types and Representations.

4.4.2 Fitting Models.

4.4.3 Generating Output.

4.5 Discussion.

4.5.1 Unknown Interference.

4.5.2 Ambiguity and Adaptation.

4.5.3 Relations to Other Separation Approaches.

4.6 Conclusions.

References.

5. Binaural Sound Localization (Richard M. Stern, Guy J. Brown, and DeLiang Wang).

5.1 Introduction.

5.2 Physical and Physiological Mechanisms Underlying Auditory Localization.

5.2.1 Physical Cues.

5.2.2 Physiological Estimation of ITD and IID.

5.3 Spatial Perception of Single Sources.

5.3.1 Sensitivity to Differences in Interaural Time and Intensity.

5.3.2 Lateralization of Single Sources.

5.3.3 Localization of Single Sources.

5.3.4 The Precedence Effect.

5.4 Spatial Perception of Multiple Sources.

5.4.1 Localization of Multiple Sources.

5.4.2 Binaural Signal Detection.

5.5 Models of Binaural Perception.

5.5.1 Classical Models of Binaural Hearing.

5.5.2 Cross-Correlation-Based Models of Binaural Interaction.

5.5.3 Some Extensions to Cross-Correlation-Based Binaural Models.

5.6 Multisource Sound Localization.

5.6.1 Estimating Source Azimuth from Interaural Cross-Correlation.

5.6.2 Methods for Resolving Azimuth Ambiguity.

5.6.3 Localization of Moving Sources.

5.7 General Discussion.

Acknowledgments.

References.

6. Localization-Based Grouping (Albert S. Feng and Douglas L. Jones).

6.1 Introduction.

6.2 Classical Beamforming Techniques.

6.2.1 Fixed Beamforming Techniques.

6.2.2 Adaptive Beamforming Techniques.

6.2.3 Independent Component Analysis Techniques.

6.2.4 Other Localization-Based Techniques.

6.3 Location-Based Grouping Using Interaural Time Difference Cue.

6.4 Location-Based Grouping Using Interaural Intensity Difference Cue.

6.5 Location-Based Grouping Using Multiple Binaural Cues.

6.6 Discussion and Conclusions.

Acknowledgments.

References.

7. Reverberation (Guy J. Brown and Kalle J. Palomäki).

7.1 Introduction.

7.2 Effects of Reverberation on Listeners.

7.2.1 Speech Perception.

7.2.2 Sound Localization.

7.2.3 Source Separation and Signal Detection.

7.2.4 Distance Perception.

7.2.5 Auditory Spatial Impression.

7.3 Effects of Reverberation on Machines.

7.4 Mechanisms Underlying Robustness to Reverberation in Human Listeners.

7.4.1 The Role of Slow Temporal Modulations in Speech Perception.

7.4.2 The Binaural Advantage.

7.4.3 The Precedence Effect.

7.4.4 Perceptual Compensation for Spectral Envelope Distortion.

7.5 Reverberation-Robust Acoustic Processing.

7.5.1 Dereverberation.

7.5.2 Reverberation-Robust Acoustic Features.

7.5.3 Reverberation Masking.

7.6 CASA and Reverberation.

7.6.1 Systems Based on Directional Filtering.

7.6.2 CASA for Robust ASR in Reverberant Conditions.

7.6.3 Systems that Use Multiple Cues.

7.7 Discussion and Conclusions.

Acknowledgments.

References.

8. Analysis of Musical Audio Signals (Masataka Goto).

8.1 Introduction.

8.2 Music Scene Description.

8.2.1 Music Scene Descriptions.

8.2.2 Difficulties Associated with Musical Audio Signals.

8.3 Estimating Melody and Bass Lines.

8.3.1 PreFEst-front-end: Forming the Observed Probability Density Functions.

8.3.2 PreFEst-core: Estimating the F0’s Probability Density Function.

8.3.3 PreFEst-back-end: Sequential F0 Tracking by Multiple-Agent Architecture.

8.3.4 Other Methods.

8.4 Estimating Beat Structure.

8.4.1 Estimating Period and Phase.

8.4.2 Dealing with Ambiguity.

8.4.3 Using Musical Knowledge.

8.5 Estimating Chorus Sections and Repeated Sections.

8.5.1 Extracting Acoustic Features and Calculating Their Similarity.

8.5.2 Finding Repeated Sections.

8.5.3 Grouping Repeated Sections.

8.5.4 Detecting Modulated Repetition.

8.5.5 Selecting Chorus Sections.

8.5.6 Other Methods.

8.6 Discussion and Conclusions.

8.6.1 Importance.

8.6.2 Evaluation Issues.

8.6.3 Future Directions.

References.

9. Robust Automatic Speech Recognition (Jon Barker).

9.1 Introduction.

9.2 ASA and Speech Perception in Humans.

9.2.1 Speech Perception and Simultaneous Grouping.

9.2.2 Speech Perception and Sequential Grouping.

9.2.3 Speech Schemes.

9.2.4 Challenges to the ASA Account of Speech Perception.

9.2.5 Interim Summary.

9.3 Speech Recognition by Machine.

9.3.1 The Statistical Basis of ASR.

9.3.2 Traditional Approaches to Robust ASR.

9.3.3 CASA-Driven Approaches to ASR.

9.4 Primitive CASA and ASR.

9.4.1 Speech and Time-Frequency Masking.

9.4.2 The Missing-Data Approach to ASR.

9.4.3 Marginalization-Based Missing-Data ASR Systems.

9.4.4 Imputation-Based Missing-Data Solutions.

9.4.5 Estimating the Missing-Data Mask.

9.4.6 Difficulties with the Missing-Data Approach.

9.5 Model-Based CASA and ASR.

9.5.1 The Speech Fragment Decoding Framework.

9.5.2 Coupling Source Segregation and Recognition.

9.6 Discussion and Conclusions.

9.7 Concluding Remarks.

References.

10. Neural and Perceptual Modeling (Guy J. Brown and DeLiang Wang).

10.1 Introduction.

10.2 The Neural Basis of Auditory Grouping.

10.2.1 Theoretical Solutions to the Binding Problem.

10.2.2 Empirical Results on Binding and ASA.

10.3 Models of Individual Neurons.

10.3.1 Relaxation Oscillators.

10.3.2 Spike Oscillators.

10.3.3 A Model of a Specific Auditory Neuron.

10.4 Models of Specific Perceptual Phenomena.

10.4.1 Perceptual Streaming of Tone Sequences.

10.4.2 Perceptual Segregation of Concurrent Vowels with Different F0s.

10.5 The Oscillatory Correlation Framework for CASA.

10.5.1 Speech Segregation Based on Oscillatory Correlation.

10.6 Schema-Driven Grouping.

10.7 Discussion.

10.7.1 Temporal or Spatial Coding of Auditory Grouping.

10.7.2 Physiological Support for Neural Time Delays.

10.7.3 Convergence of Psychological, Physiological, and Computational Approaches.

10.7.4 Neural Models as a Framework for CASA.

10.7.5 The Role of Attention.

10.7.6 Schema-Based Organization.

Acknowledgments.

References.

Index.

