Introduction to Speech Recognition

  1. Overview
    1.1. Decoding
      1.1.1. Speech Capture
      1.1.2. Pre-processing
      1.1.3. Acoustic Model
        1.1.3.1. GMM + HMM
        1.1.3.2. Triphone Model
      1.1.4. Lexicon (Pronunciation Model)
      1.1.5. Language Model
      1.1.6. Decoder
    1.2. Training
    1.3. Models
      1.3.1. GMM
      1.3.2. HMM
        1.3.2.1. Viterbi
    1.4. Summary
      1.4.1. Formula
    1.5. Other Approaches

Overview

Decoding

┌─────────┐
│  Wave   │
└─────────┘
     │ MFCC / PLP / XX_PLP ...
     ▼
┌─────────┐
│ Feature │
│ Vectors │
└─────────┘
     │ Acoustic Model (GMM / DNN)
     ▼
┌─────────┐
│ Phoneme │
│  State  │
└─────────┘
     │ Acoustic Model (HMM)
     ▼
┌─────────┐
│ Phoneme │
└─────────┘
     │ Lexicon
     ▼
┌─────────┐
│  Word   │
└─────────┘
     │ Language Model
     ▼
┌─────────┐
│  Text   │
└─────────┘

Speech Capture

  • Wave: the raw audio samples
  • Mono: a single channel is enough for recognition
  • VAD (voice activity detection): drop silence and noise-only segments (a naive sketch follows this list)
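
As a rough illustration of VAD, here is a naive energy-based sketch (NumPy only): a frame counts as speech when its log-energy, relative to the loudest frame, exceeds a threshold. The frame sizes and threshold are illustrative assumptions; real systems typically use model-based VADs.

```python
import numpy as np

def energy_vad(wave, frame_len=400, hop=160, threshold_db=-35.0):
    """Naive energy-based VAD: a frame counts as speech when its
    log-energy, relative to the loudest frame, exceeds threshold_db.
    `wave` is a 1-D NumPy array of samples (e.g. 16 kHz mono)."""
    frames = [wave[i:i + frame_len]
              for i in range(0, len(wave) - frame_len + 1, hop)]
    energy = np.array([float(np.sum(f ** 2)) for f in frames])
    rel_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    return rel_db > threshold_db  # boolean speech mask, one flag per frame
```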

Pre-processing

  • Short-time analysis: process the signal window by window; human speech stays roughly stable over 20 ms to 50 ms, so a 25 ms window with a 10 ms step is the common choice.
  • FFT: convert each window of raw wave data to the frequency domain.
  • Cut: keep only the frequency range that most human voices cover.
  • Other… (DCT)
  • Feature Extraction (a minimal extraction sketch follows this list)
    • MFCC (mel-frequency cepstral coefficients), 39 dimensions
    • PLP (perceptual linear prediction), 42 dimensions
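
A minimal feature-extraction sketch using librosa, assuming a 16 kHz mono recording (the file name is a placeholder): 13 MFCCs per 25 ms window with a 10 ms step, plus deltas and delta-deltas to reach the 39 dimensions mentioned above.

```python
import numpy as np
import librosa

# Load a 16 kHz mono recording (the file name is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# 13 MFCCs per frame: 25 ms window (400 samples at 16 kHz), 10 ms hop
# (160 samples), matching the short-time analysis described above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Append delta and delta-delta coefficients to reach 39 dimensions.
delta1 = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta1, delta2])  # shape: (39, n_frames)
```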

Production-based Analysis

  • Spectral Envelope
  • Cepstral Analysis
  • Linear Predictive Analysis

Perception-based Analysis

  • Mel-Frequency Cepstrum Coefficients
  • Perceptual Linear Prediction

Acoustic Model

  • The acoustic model is a classification model: its input is the feature vectors (MFCC/PLP, …), and its output is phonemes, roughly the 70 most frequently used.
  • Unsupervised learning (frame-level alignments are learned, not hand-labeled)
  • GMM + HMM / HMM + LSTM
  • Baum-Welch (EM) / MLE

GMM + HMM

HMM: models the sequence of phoneme states.
GMM: models the output (emission) probabilities of each state (an emission-scoring sketch follows).
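
To make the GMM side concrete: each HMM state gets its own mixture of Gaussians that scores how well a feature frame matches that state. Below is a minimal sketch using scikit-learn's GaussianMixture on random stand-in data; the component count and the data are illustrative assumptions, not values from this article.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Random stand-ins for the 39-dim feature frames aligned to one HMM state.
state_frames = rng.normal(size=(500, 39))

# One diagonal-covariance GMM per state; 8 components is an arbitrary choice.
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(state_frames)

# Emission log-probabilities log P(x_t | state) for new frames; in a full
# system the HMM combines these scores across states and time.
new_frames = rng.normal(size=(10, 39))
log_b = gmm.score_samples(new_frames)  # shape: (10,)
```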

Triphone Model

The same phoneme may be pronounced differently depending on its context.
So we set up one model per phoneme in context (a triphone: the phoneme together with its left and right neighbours), and tie similar states to keep the number of models tractable.

Lexicon (Pronunciation Model)

Maps each word to its pronunciation in terms of phones (a toy example follows).
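
A toy lexicon sketch in the CMUdict style (stress markers omitted); the three entries are illustrative, not from this article.

```python
# Toy lexicon: words mapped to phone sequences in the CMUdict style.
LEXICON = {
    "hello":  ["HH", "AH", "L", "OW"],
    "world":  ["W", "ER", "L", "D"],
    "speech": ["S", "P", "IY", "CH"],
}

def words_to_phones(words):
    """Expand a word sequence into the phone sequence the HMM must match."""
    return [p for w in words for p in LEXICON[w.lower()]]

print(words_to_phones(["hello", "world"]))
# -> ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```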

Language Model

Models the structure of sentences,
making the recognized word sequence read more like natural human speech.

  • N-gram (a bigram sketch follows this list)
  • WFST (Weighted Finite State Transducer)
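
A minimal bigram (2-gram) sketch with maximum-likelihood counts over a toy corpus; a real system would train on far more data and add smoothing (e.g. Kneser-Ney) for unseen bigrams.

```python
from collections import Counter

# Toy corpus with sentence-boundary markers.
corpus = [["<s>", "how", "are", "you", "</s>"],
          ["<s>", "how", "old", "are", "you", "</s>"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def p(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p("are", "how"))  # -> 0.5 ("how" is followed by "are" in 1 of 2 cases)
```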

Decoder

Viterbi algorithm: dynamic programming that combines all of the models above to get the most probable word sequence from speech (an implementation sketch appears in the Viterbi section below).

Training

  • Labels: marked by hand (hand-crafted transcriptions)
  • Acoustic model (EM)
    Gaussians for computing P(X|Q)
  • Lexicon (hand-crafted)
    HMM: what phones can follow each other
  • Language model (n-gram, RNN)

Models

GMM

HMM

$$ \lambda = (\pi, A, B) $$

  • π: initial state probability vector
  • A: hidden-state transition probability matrix
  • B: output probability matrix (also called the confusion matrix)
  • S: hidden states
  • O: observations

Problem & Solution

  • Evaluation: given the observations and λ, find the probability of the observation sequence - Forward algorithm (a sketch follows this list)
  • Decoding: given the observations and λ, find the hidden states - Viterbi algorithm
  • Learning: given the observations, learn λ - Baum-Welch (forward-backward algorithm)
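
To make the Evaluation problem concrete, here is a toy HMM $\lambda = (\pi, A, B)$ with made-up numbers and a minimal forward algorithm computing $P(O|\lambda)$.

```python
import numpy as np

# Toy HMM lambda = (pi, A, B): 2 hidden states, 3 observation symbols.
# All numbers are made up for illustration.
pi = np.array([0.6, 0.4])          # initial state probabilities
A = np.array([[0.7, 0.3],          # A[i, j] = P(state j | state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # B[i, k] = P(symbol k | state i)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Evaluation problem: P(O | lambda) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]             # initialize with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step, then emit
    return alpha.sum()

print(forward([0, 1, 2]))  # probability of observing symbols 0, 1, 2
```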

Viterbi

Viterbi is also used for forced alignment during training.
Viterbi animated demo
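
A minimal Viterbi sketch over the same toy HMM as in the forward-algorithm example (numbers are made up): dynamic programming in log space returns the most probable hidden-state path.

```python
import numpy as np

# Same toy HMM as in the forward-algorithm sketch above.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Decoding problem: most probable hidden-state path for obs."""
    delta = np.log(pi) + np.log(B[:, obs[0]])    # best log-prob per state
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)      # scores[i, j]: reach j via i
        backptr.append(scores.argmax(axis=0))    # best predecessor of each j
        delta = scores.max(axis=0) + np.log(B[:, o])
    path = [int(delta.argmax())]                 # best final state...
    for bp in reversed(backptr):                 # ...then trace back
        path.append(int(bp[path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 2]))  # -> [0, 0, 1]
```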

Summary

Formula

  • X: acoustic feature vectors (observations)
  • W: a word sequence
  • Q: a hidden state sequence

Find the most probable word sequence $W^* = w_1, w_2, \ldots, w_M$
given the acoustic observations $X = x_1, x_2, \ldots, x_n$:
$$ W^* = argmax_{W}P(W|X) $$

Applying Bayes’ Theorem:
$$ W^* = argmax_{W}\frac{P(X|W)P(W)}{P(X)} $$

P(X) does not depend on W, so it can be dropped:
$$ W^* = argmax_{W}P(X|W)P(W) $$

Words are composed of state sequences, so we may express
this criterion by summing over all state sequences
$Q = q_1, q_2, \ldots, q_n$:
$$ W^* = argmax_{W}P(W)\sum_Q{P(Q|W)P(X|Q)} $$

Other Approaches

  • HMM + GMM -> HMM + DNN + GMM (the GMM kept for frame alignment only)
  • HMM + DNN -> RNN
  • End-to-end: RNN + CTC (Connectionist Temporal Classification), as in the loss sketch below
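
To show where CTC fits, here is a minimal PyTorch sketch of the CTC loss over random stand-in RNN outputs; all shapes and the 29-label alphabet (28 labels plus blank) are illustrative assumptions, not values from this article.

```python
import torch
import torch.nn as nn

# Shapes only: T input frames, batch of N, C labels including blank (id 0).
# The "RNN outputs" here are random stand-ins for a real network's output.
T, N, C = 50, 1, 29
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

targets = torch.randint(1, C, (N, 10), dtype=torch.long)  # label ids, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # in end-to-end training this gradient flows into the RNN
```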