Broad class phoneme detection
We categorize American English phonemes into several groups: vowel, semi-vowel, nasal, whisper, fricative/affricate, closure/stop, silence, and a few special phonemes (/q/ and /dx/). Five main groups (vowel, semi-vowel, nasal, fricative, and stop) are examined further. For each of these groups we construct detectors based on acoustic features and compare them with HMM-based systems on continuous speech from TIMIT and on data from adverse conditions, namely TIMIT with additive noise and NTIMIT. For vowels, we implement a compact detector based on only two acoustic features, periodicity and energy. It achieves 86.8% accuracy with a 22.4% total error rate, and its performance remains stable under adverse conditions. For fricatives, we construct several SVM-based detectors using different acoustic features; a typical one achieves 90.6% accuracy with a 24.8% total error rate. For stops, total energy, energy above 3 kHz, and Wiener entropy are used as SVM input features, and the detector achieves 93.2% accuracy with a 19.6% total error rate. All of these results are comparable to or better than those of HMM-based systems. However, detectors based on static acoustic features do not perform as well as expected for nasals and semi-vowels. A detailed examination of the errors reveals the underlying detection problems and motivates a new approach. To handle non-static features, we propose combining HMMs and SVMs to detect phoneme groups, and obtain satisfactory results. We believe this method can also be extended to more general speech recognition applications.
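To illustrate the three stop-detection features named above, the following is a minimal Python sketch and not the authors' implementation. It computes, for one frame, the total spectral energy, the energy above 3 kHz, and the Wiener entropy, taken here in its usual sense as the log ratio of the geometric mean to the arithmetic mean of the power spectrum. The function names `power_spectrum` and `stop_features`, the direct DFT, the frame length, and the small constant `eps` are assumptions made for this example; the abstract does not say how the features were computed.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum of a real frame, using a direct DFT.

    A direct DFT is O(n^2); it is used here only to keep the sketch
    dependency-free. It is an assumption, not the source's method.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def stop_features(frame, sample_rate, cutoff_hz=3000.0, eps=1e-12):
    """Return (total_energy, high_energy, wiener_entropy) for one frame.

    total_energy   : sum of the one-sided power spectrum
    high_energy    : sum of the bins at or above cutoff_hz
    wiener_entropy : log(geometric mean / arithmetic mean) of the power
                     spectrum; 0 for a flat spectrum, strongly negative
                     for a tonal one. eps (hypothetical) avoids log(0).
    """
    power = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)  # frequency spacing between bins

    total_energy = sum(power)
    high_energy = sum(p for k, p in enumerate(power) if k * bin_hz >= cutoff_hz)

    floored = [p + eps for p in power]
    log_geometric_mean = sum(math.log(p) for p in floored) / len(floored)
    log_arithmetic_mean = math.log(sum(floored) / len(floored))
    wiener_entropy = log_geometric_mean - log_arithmetic_mean

    return total_energy, high_energy, wiener_entropy


# Example frames at a 16 kHz sampling rate (the rate is an assumption).
SAMPLE_RATE = 16000
FRAME_LEN = 64
impulse = [1.0] + [0.0] * (FRAME_LEN - 1)  # flat spectrum, burst-like
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
        for t in range(FRAME_LEN)]  # 1 kHz sinusoid

impulse_features = stop_features(impulse, SAMPLE_RATE)
tone_features = stop_features(tone, SAMPLE_RATE)
```

In this sketch the impulse frame has a flat spectrum, so its Wiener entropy is close to zero and much of its energy lies above 3 kHz, whereas the 1 kHz tone has a strongly negative Wiener entropy and almost no energy above 3 kHz. Per-frame vectors of this kind are the sort of input that could be passed to an SVM classifier.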