Speech Recognition using Neural Networks

Joe Tebelskis
May 1995
CMU-CS-95-142

School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890

Submitted in partial fulfillment of the requirements for a degree of Doctor of Philosophy in Computer Science

Thesis Committee:
Alex Waibel, chair
Raj Reddy
Jaime Carbonell
Richard Lippmann, MIT Lincoln Labs

Copyright 1995 Joe Tebelskis

This research was supported during separate phases by ATR Interpreting Telephony Research Laboratories, NEC Corporation, Siemens AG, the National Science Foundation, the Advanced Research Projects Administration, and the Department of Defense under Contract No. MDA904-92-C-5161. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of ATR, NEC, Siemens, NSF, or the United States Government.

Keywords: Speech recognition, neural networks, hidden Markov models, hybrid systems, acoustic modeling, prediction, classification, probability estimation, discrimination, global optimization.

Abstract

This thesis examines how artificial neural networks can benefit a large vocabulary, speaker independent, continuous speech recognition system. Currently, most speech recognition systems are based on hidden Markov models (HMMs), a statistical framework that supports both acoustic and temporal modeling. Despite their state-of-the-art performance, HMMs make a number of suboptimal modeling assumptions that limit their potential effectiveness. Neural networks avoid many of these assumptions, while they can also learn complex functions, generalize effectively, tolerate noise, and support parallelism. While neural networks can readily be applied to acoustic modeling, it is not yet clear how they can be used for temporal modeling.
Therefore, we explore a class of systems called NN-HMM hybrids, in which neural networks perform acoustic modeling, and HMMs perform temporal modeling. We argue that a NN-HMM hybrid has several theoretical advantages over a pure HMM system, including better acoustic modeling accuracy, better context sensitivity, more natural discrimination, and a more economical use of parameters. These advantages are confirmed experimentally by a NN-HMM hybrid that we developed, based on context-independent phoneme models, that achieved 90.5% word accuracy on the Resource Management database, in contrast to only 86.0% accuracy achieved by a pure HMM under similar conditions. In the course of developing this system, we explored two different ways to use neural networks for acoustic modeling: prediction and classification. We found that predictive networks yield poor results because of a lack of discrimination, but classification networks gave excellent results. We verified that, in accordance with theory, the output activations of a classification network form highly accurate estimates of the posterior probabilities P(class|input), and we showed how these can easily be converted to likelihoods P(input|class) for standard HMM recognition algorithms. Finally, this thesis reports how we optimized the accuracy of our system with many natural techniques, such as expanding the input window size, normalizing the inputs, increasing the number of hidden units, converting the network’s output activations to log likelihoods, optimizing the learning rate schedule by automatic search, backpropagating error from word level outputs, and using gender dependent networks.
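As an illustration of the posterior-to-likelihood conversion mentioned above, the following is a minimal sketch and not the thesis's actual implementation. It assumes Bayes' rule, P(input|class) = P(class|input) P(input) / P(class), and drops P(input) because it is the same for every class within a frame, leaving a "scaled likelihood" P(class|input) / P(class). The function name, the `floor` parameter, and all numbers below are hypothetical, chosen only for illustration.

```python
import numpy as np

def posteriors_to_scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert per-frame class posteriors into scaled log likelihoods.

    posteriors: array of shape (frames, classes), estimates of P(class|input)
    priors:     array of shape (classes,), estimates of P(class),
                e.g. relative class frequencies in the training data
    floor:      hypothetical lower bound that avoids log(0)

    Returns log(P(class|input) / P(class)), which equals
    log P(input|class) up to a per-frame constant, log P(input).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    priors = np.asarray(priors, dtype=float)
    log_post = np.log(np.maximum(posteriors, floor))
    log_prior = np.log(np.maximum(priors, floor))
    # Broadcasting subtracts the class priors from every frame's row.
    return log_post - log_prior

# Hypothetical network outputs for two frames and three phoneme classes.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
# Hypothetical class priors.
priors = np.array([0.5, 0.3, 0.2])

scaled = posteriors_to_scaled_log_likelihoods(posteriors, priors)
```

Working in the log domain lets an HMM decoder add these scores along a path rather than multiply small probabilities. Because the dropped term P(input) is shared by all classes in a frame, it does not change which state sequence scores highest.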
Acknowledgements

I wish to thank Alex Waibel for the guidance, encouragement, and friendship that he managed to extend to me during our six years of collaboration over all those inconvenient oceans, and for his unflagging efforts to provide a world-class, international research environment, which made this thesis possible. Alex’s scientific integrity, humane idealism, good cheer, and great ambition have earned him my respect, plus a standing invitation to dinner whenever he next passes through my corner of the world. I also wish to thank Raj Reddy, Jaime Carbonell, and Rich Lippmann for serving on my thesis committee and offering their valuable suggestions, both on my thesis proposal and on this final dissertation. I would also like to thank Scott Fahlman, my first advisor, for channeling my early enthusiasm for neural networks, and teaching me what it means to do good research. Many colleagues around the world have influenced this thesis, including past and present members of the Boltzmann Group, the NNSpeech Group at CMU, and the NNSpeech Group at the University of Karlsruhe in Germany. I especially want to thank my closest collaborators over these years (Bojan Petek, Otto Schmidbauer, Torsten Zeppenfeld, Hermann Hild, Patrick Haffner, Arthur McNair, Tilo Sloboda, Monika Woszczyna, Ivica Rogina, Michael Finke, and Thorsten Schueler) for their contributions and their friendship.
I also wish to acknowledge valuable interactions I’ve had with many other talented researchers, including Fil Alleva, Uli Bodenhausen, Herve Bourlard, Lin Chase, Mike Cohen, Mark Derthick, Mike Franzini, Paul Gleichauff, John Hampshire, Nobuo Hataoka, Geoff Hinton, Xuedong Huang, Mei-Yuh Hwang, Ken-ichi Iso, Ajay Jain, Yochai Konig, George Lakoff, Kevin Lang, Chris Lebiere, Kai-Fu Lee, Ester Levin, Stefan Manke, Jay McClelland, Chris McConnell, Abdelhamid Mellouk, Nelson Morgan, Barak Pearlmutter, Dave Plaut, Dean Pomerleau, Steve Renals, Roni Rosenfeld, Dave Rumelhart, Dave Sanner, Hidefumi Sawai, David Servan-Schreiber, Bernhard Suhm, Sebastian Thrun, Dave Touretzky, Minh Tue Voh, Wayne Ward, Christoph Windheuser, and Michael Witbrock. I am especially indebted to Yochai Konig at ICSI, who was extremely generous in helping me to understand and reproduce ICSI’s experimental results; and to Arthur McNair for taking over the Janus demos in 1992 so that I could focus on my speech research, and for constantly keeping our environment running so smoothly. Thanks to Hal McCarter and his colleagues at Adaptive Solutions for their assistance with the CNAPS parallel computer; and to Nigel Goddard at the Pittsburgh Supercomputer Center for help with the Cray C90. Thanks to Roni Rosenfeld, Lin Chase, and Michael Finke for proofreading portions of this thesis. I am also grateful to Robert Wilensky for getting me started in Artificial Intelligence, and especially to both Douglas Hofstadter and Allen Newell for sharing some treasured, pivotal hours with me.