Explicit N-best formant features for segment-based speech recognition
Top 18% of 1996 papers
Abstract
This thesis investigates the use of explicit speech knowledge in computer speech recognition. Speech knowledge is generally expressed in terms of acoustic events occurring near phonetic segment boundaries and the location, shape, and dynamics of formant trajectories. This suggests the creation of a segment-based recognition framework and the use of explicit formant features in a flexible integration scheme to ultimately improve phonetic recognition accuracy. We describe a segmentation algorithm that produces a lattice of segment hypotheses, each with an associated broad phonetic identity. We build a single phonetic segment classifier, along with separate vowel/semi-vowel and consonant classifiers, based on traditional cepstral features, paying attention to reducing the mismatch between training and deployment conditions. We develop a robust N-best formant tracking algorithm that generates a list of up to N consistent formant interpretations. The N-best feature paradigm is based on the observation that there are generally only a handful of reasonable interpretations of the given formant information. Instead of finding the single best formant interpretation through a global cost function that includes energy-maximization and smoothness terms, we delay the selection of the correct interpretation until after segment classification and the phonetic search. We use the formant interpretations to extract features for a vowel/semi-vowel segment classifier. The formant trajectories are approximated either by three line segments or by a third-order Legendre polynomial. We show that, together with formant amplitudes, formant bandwidths, pitch, and segment duration, these features produce a classifier of comparable performance to a cepstral-based classifier. We further demonstrate the potential of the N-best classification paradigm and show that a combination of formant and cepstral features improves classification accuracy further.
Finally, the validity of the overall approach (a segment-based framework, separate classifiers for vowels and consonants, and explicit formant features) is verified by phonetic recognition experiments.
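To make the trajectory parameterization concrete: the abstract states that each formant trajectory is approximated by a third-order Legendre polynomial, yielding a small fixed-length coefficient vector per segment that can serve as a classifier feature. The following is a minimal sketch of that idea, assuming NumPy; the function name and the example F2 values are illustrative, not taken from the thesis.

```python
import numpy as np

def legendre_formant_features(track_hz, order=3):
    """Fit one formant trajectory with a low-order Legendre polynomial.

    track_hz : formant frequency samples across the segment, one per frame.
    Returns the `order` + 1 Legendre coefficients; the zeroth coefficient
    roughly tracks the mean frequency, while higher ones capture slope
    and curvature of the trajectory.
    """
    # Map the segment's time axis onto [-1, 1], the Legendre domain.
    t = np.linspace(-1.0, 1.0, len(track_hz))
    return np.polynomial.legendre.legfit(t, track_hz, deg=order)

# Hypothetical F2 trajectory (Hz per frame) for a diphthong-like segment.
f2 = np.array([1200.0, 1300.0, 1450.0, 1650.0, 1850.0, 2000.0, 2100.0])
coeffs = legendre_formant_features(f2, order=3)  # 4 features for this formant
```

Because the fit is a least-squares projection, segments of different durations all reduce to the same number of coefficients, which is what makes the representation usable as a fixed-length segment feature alongside amplitudes, bandwidths, pitch, and duration.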