
Sproat, R. & Olive, J. "Text-to-Speech Synthesis." In Digital Signal Processing Handbook, ed. Vijay K. Madisetti and Douglas B. Williams. Boca Raton: CRC Press LLC, 1999. © 1999 by CRC Press LLC.

46 Text-to-Speech Synthesis

Richard Sproat, Bell Laboratories, Lucent Technologies
Joseph Olive, Bell Laboratories, Lucent Technologies

46.1 Introduction
46.2 Text Analysis and Linguistic Analysis
    Text Preprocessing • Accentuation • Word Pronunciation • Intonational Phrasing • Segmental Durations • Intonation
46.3 Speech Synthesis
46.4 The Future of TTS
References

46.1 Introduction

Text-to-speech synthesis has had a long history, one that can be traced back at least to Dudley's "Voder", developed at Bell Laboratories and demonstrated at the 1939 World's Fair [1]. Practical systems for automatically generating speech parameters from a linguistic representation (such as a phoneme string) were not available until the 1960s, and systems for converting from ordinary text into speech were first completed in the 1970s, with MITalk being the best-known such system [2]. Many projects in text-to-speech conversion have been initiated in the intervening years, and papers on many of these systems have been published.¹

¹For example, [3] gives an overview of recent Dutch efforts in this area. Audio examples of several current projects on TTS can be found at the WWW URL http://www.cs.bham.ac.uk/jpi/synth/museum.html.

It is tempting to think of the problem of converting written text into speech as "speech recognition in reverse": current speech recognition systems are generally deemed successful if they can convert speech input into the sequence of words that was uttered by the speaker, so one might imagine that a text-to-speech (TTS) synthesizer would start with the words in the text, convert each word one-by-one into speech (being careful to pronounce each word correctly), and concatenate the results together. However, when one considers what literate native speakers of a language must do when they read a text aloud, it quickly becomes clear that things are much more complicated than this simplistic view suggests. Pronouncing words correctly is only part of the problem faced by human readers: in order to sound natural and to sound as if they understand what they are reading, they must also appropriately emphasize (accent) some words and deemphasize others; they must "chunk" the sentence into meaningful (intonational) phrases; they must pick an appropriate F0 (fundamental frequency) contour; they must control certain aspects of their voice quality; they must know that a word should be pronounced longer if it appears in some positions in the sentence than if it appears in others, because segmental durations are affected by various factors, including phrasal position.

What makes reading such a difficult task is that all writing systems systematically fail to specify many kinds of information that are important in speech. While the written form of a sentence (usually) completely specifies the words that are present, it will only partly specify the intonational phrases (typically with some form of punctuation), will usually not indicate which words to accent or deaccent, and hardly ever give information on segmental duration, voice quality, or intonation. (One might think that a question mark "?" indicates that a sentence should be pronounced with a rising intonation: generally, though, a question mark merely indicates that a sentence is a question, leaving it up to the reader to judge whether this question should be rendered with a rising intonation.)
The orthographies of some languages — e.g., Chinese, Japanese, and Thai — fail to give information on where word boundaries are, so that even this needs to be figured out by the reader.² Humans are able to perform these tasks because, in addition to being knowledgeable about the grammar of their language, they also (usually) understand the content of the text that they are reading, and can thus appropriately manipulate various extragrammatical "affective" factors, such as appropriate use of intonation and voice quality.

²Even in English, single orthographic words, e.g., AT&T, can actually represent multiple words — A T and T.

The task of a TTS system is thus a complex one that involves mimicking what human readers do. But a machine is hobbled by the fact that it generally "knows" the grammatical facts of the language only imperfectly, and generally can be said to "understand" nothing of what it is reading. TTS algorithms thus have to do the best they can, making use, where possible, of purely grammatical information to decide on such things as accentuation, phrasing, and intonation — and coming up with a reasonable "middle ground" analysis for aspects of the output that are more dependent on actual understanding.

It is natural to divide the TTS problem into two broad subproblems. The first of these is the conversion of text — an imperfect representation of language, as we have seen — into some form of linguistic representation that includes information on the phonemes (sounds) to be produced, their duration, the locations of any pauses, and the F0 contour to be used. The second — the actual synthesis of speech — takes this information and converts it into a speech waveform. Each of these main tasks naturally breaks down into further subtasks, some of which have been alluded to. The first part, text and linguistic analysis, may be broken down as follows:

- Text preprocessing: including end-of-sentence detection, "text normalization" (expansion of numerals and abbreviations), and limited grammatical analysis, such as grammatical part-of-speech assignment.
- Accent assignment: the assignment of levels of prominence to various words in the sentence.
- Word pronunciation: including the pronunciation of names and the disambiguation of homographs.³
- Intonational phrasing: the breaking of (usually long) stretches of text into one or more intonational units.
- Segmental durations: the determination, on the basis of linguistic information computed thus far, of appropriate durations for phonemes in the input.
- F0 contour computation.

³A homograph is a single written word that represents two or more different lexical entries, often having different pronunciations: an example would be bass, which could be the word for a musical range — with pronunciation /bejs/ — or a fish — with pronunciation /bæs/. We transcribe pronunciations using the International Phonetic Association's (IPA) symbol set. Symbols used in this chapter are defined in Table 46.1.

Speech synthesis breaks down into two parts:

- The selection and concatenation of appropriate concatenative units given the phoneme string.
- The synthesis of a speech waveform given the units, plus a model of the glottal source.
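This division implies an intermediate linguistic representation that the text-analysis stage produces and the synthesis stage consumes: phonemes, durations, pause and phrase-break locations, accents, and an F0 contour. As a purely illustrative aid, the sketch below shows one way such a representation might be organized in code; every class, field, and value in it is hypothetical and not drawn from any particular system.

```python
# Purely illustrative sketch of an intermediate linguistic representation;
# all names, fields, and values are hypothetical, not from any actual TTS system.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Phone:
    symbol: str           # phoneme symbol, e.g., an IPA symbol
    duration_ms: float    # filled in by the segmental-duration module

@dataclass
class Word:
    orthography: str
    part_of_speech: str = ""   # from text preprocessing
    accent: str = ""           # e.g., "accented", "deaccented", or "cliticized"
    phones: List[Phone] = field(default_factory=list)

@dataclass
class Utterance:
    words: List[Word] = field(default_factory=list)
    # Indices of words that end an intonational phrase.
    phrase_break_after: List[int] = field(default_factory=list)
    # Target F0 values (Hz), however the system chooses to sample the contour.
    f0_targets: List[float] = field(default_factory=list)

# Example: a tiny fragment of analysis for "the cat", with made-up values.
utt = Utterance(
    words=[
        Word("the", "DET",  "cliticized", [Phone("dh", 40.0), Phone("ax", 45.0)]),
        Word("cat", "NOUN", "accented",   [Phone("k", 70.0), Phone("ae", 120.0), Phone("t", 85.0)]),
    ],
    phrase_break_after=[1],
)
```

The synthesis stage would then take such a structure, together with an inventory of concatenative units and a model of the glottal source, and produce the waveform.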
46.2 Text Analysis and Linguistic Analysis

46.2.1 Text Preprocessing

The input to TTS systems is text encoded using an electronic coding scheme appropriate for the language, such as ASCII, JIS (Japanese), or Big-5 (Chinese). One of the first tasks facing a TTS system is that of dividing the input into reasonable chunks, the most obvious chunk being the sentence. In some writing systems there is a designated symbol used for marking the end of a declarative sentence and for nothing else — in Chinese, for example, a small circle is used — and in such languages end-of-sentence detection is generally not a problem. For English and other languages we are not so fortunate, because a period, in addition to its use as a sentence delimiter, is also used, for example, to mark abbreviations: if one sees the period in Mr., one would not (normally) want to analyze this as an end-of-sentence marker. Thus, before one concludes that a period does in fact mark the end of a sentence, one needs to eliminate some other possible analyses. In a typical TTS system, text analysis would include an abbreviation-expansion module; this module is invoked to check for common abbreviations which might allow one to eliminate one or more possible periods from further consideration. For example, if a preprocessor for English encounters the string Mr. in an appropriate context (e.g., followed by a capitalized word), it will expand it as mister and remove the period.

Of course, abbreviation expansion itself is not trivial, since many abbreviations are ambiguous. For example, is St. to be expanded as Street or Saint? Is Dr. Doctor or Drive? Such cases can be disambiguated via a series of heuristics. For St., for example, the system might first check to see if the abbreviation is followed by a capitalized word (i.e., a potential name), in which case it would be expanded as Saint; otherwise, if it is preceded by a capitalized word, a number, or an alphanumeric (49th), it would be expanded as Street. Another problem that must be dealt with is the conversion of numbers into words: 232 should usually be expanded as two hundred thirty two, whereas if the same sequence occurs as part of 232-3142 — a likely telephone number — it would normally be read two three two.
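To make the preceding heuristics concrete, the following toy fragment mirrors the St. disambiguation and the digit-by-digit reading of telephone-style numbers just described. The rules, patterns, and function names are invented for this sketch; a real text normalizer would draw on much richer context and lexical resources, and would also need a separate routine for reading isolated numbers like 232 as cardinals.

```python
import re

# Toy illustration of the disambiguation heuristics discussed above; the rules
# and names are invented for this sketch and are far from production quality.

def expand_st(prev_token: str, next_token: str) -> str:
    """Very rough 'St.' disambiguation: Saint before a potential name,
    Street after a capitalized word, number, or alphanumeric like 49th."""
    if next_token[:1].isupper():
        return "Saint"
    if prev_token[:1].isupper() or re.fullmatch(r"\d+(st|nd|rd|th)?", prev_token):
        return "Street"
    return "Street"   # arbitrary fallback for this sketch

def read_digits(token: str) -> str:
    """Read a digit string digit by digit, as in a telephone number."""
    names = "zero one two three four five six seven eight nine".split()
    return " ".join(names[int(ch)] for ch in token if ch.isdigit())

print(expand_st("visit", "Louis"))    # "St. Louis"     -> Saint
print(expand_st("Main", "near"))      # "Main St. near" -> Street
print(read_digits("232-3142"))        # telephone style -> two three two three one four two
```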
In languages like English, tokenization into words can to a large extent be done on the basis of white space. In contrast, in many Asian languages, including Chinese, the situation is not so simple because spaces are never used to delimit words. For the purposes of text analysis it is therefore generally necessary to "reconstruct" word boundary information. A minimal requirement for word segmentation is an on-line dictionary that enumerates the word forms of the language. This is not enough on its own, however, since there are many words that will not be found in the dictionary; among these are personal names, foreign names in transliteration, and morphological derivatives of words that do not occur in the dictionary. It is therefore necessary to build models of these non-dictionary words; see [4] for further discussion.

In addition to lexical analysis, the text-analysis portion of a TTS system will typically perform syntactic analysis of various kinds. One commonly performed analysis is grammatical part-of-speech assignment, as information on the part of speech of words can be useful for accentuation and phrasing, among other things. Thus, in a sentence like they can can cans, it is useful for accentuation purposes to know that the first can is a function word — an auxiliary verb — whereas the second and third are content words — respectively a verb and a noun. There are a number of part-of-speech algorithms available, perhaps the best known being the stochastic method of [5], which computes the most likely analysis of a sequence of words, maximizing the product of the lexical probabilities of the parts of speech in the sentence (i.e., the possible parts of speech of each word and their probabilities) and the n-gram probabilities (probabilities of n-grams of parts of speech), which provide a model of the context.

46.2.2 Accentuation

In languages like English, various words in a sentence are associated with accents, which are usually manifested as upward or downward movements of fundamental frequency. Usually, not every word in the sentence bears an accent, however, and the decision on which words should be accented and which should be unaccented is one of the problems that must be addressed as part of text analysis. It is common in prosodic analysis to distinguish three levels of prominence. Two are accented and unaccented, as just described, and the third is cliticized. Cliticized words are unaccented but in addition have lost their word stress, so that they tend to be durationally short: in effect, they behave like unstressed affixes, even though they are written as separate words.

A good first step in assigning accents is to make the accentual determination on the basis of broad lexical categories or parts of speech; a minimal sketch of this step is given at the end of this subsection. Content words — nouns, verbs, adjectives, and perhaps adverbs — tend in general to be accented; function words, including auxiliary verbs and prepositions, tend to be deaccented; short function words tend to be cliticized. But accenting has a wider function than merely communicating lexical category distinctions between words. In English, one important set of constructions where accenting is more complicated than what might be inferred from the above discussion is complex noun phrases — basically, a noun preceded by one or more adjectival or nominal modifiers. In a "discourse-neutral" context, some constructions are accented on the final word (Madison Avenue), some on the penultimate (Wall Street, kitchen towel rack), and some on an even earlier word (sump pump factory). The assignment of accent to complex noun phrases depends on complex lexical and semantic factors; see [6].

Accenting is not only sensitive to syntactic structure and semantics, but also to properties of the discourse. One straightforward effect is contrast, as in the example I didn't ask for cherry pie, I asked for apple pie. For most speakers, the "discourse neutral" accent would be on pie, but in this example there is a clear intention to contrast the ingredients in the pies, and pie is thus deaccented to effect the contrast between cherry and apple. See [7] for a discussion of how these kinds of effects are handled in a TTS system for English. Note that while humanlike accenting capabilities are possible in many cases, there are still some intractable problems. For example, just as one would often deaccent a word that had been previously mentioned, so would one often deaccent a word if a supercategory of that word had been mentioned: My son wants a Labrador, but I'm allergic to dogs. Handling such cases in any general way is beyond the capabilities of current TTS systems.
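The broad-category first step mentioned above can be sketched in a few lines; the part-of-speech tags, the word-length threshold, and the category sets below are invented for this illustration, and the complex-noun-phrase and discourse effects just discussed are ignored entirely.

```python
# Illustrative sketch of accent assignment from broad part-of-speech categories,
# as described above.  Tag names and thresholds are invented for this example.
CONTENT_TAGS  = {"NOUN", "VERB", "ADJ", "ADV"}
FUNCTION_TAGS = {"AUX", "PREP", "DET", "CONJ", "PRON"}

def assign_accent(word: str, tag: str) -> str:
    """Return 'accented', 'deaccented', or 'cliticized' for one word."""
    if tag in CONTENT_TAGS:
        return "accented"
    if tag in FUNCTION_TAGS and len(word) <= 3:   # short function words
        return "cliticized"
    return "deaccented"

# "They can can cans": the short auxiliary is cliticized, the content words accented.
tagged = [("they", "PRON"), ("can", "AUX"), ("can", "VERB"), ("cans", "NOUN")]
print([(w, assign_accent(w, t)) for w, t in tagged])
# [('they', 'deaccented'), ('can', 'cliticized'), ('can', 'accented'), ('cans', 'accented')]
```

Even a sketch this crude makes clear why the part-of-speech assignment of Section 46.2.1 is a prerequisite for accentuation.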
46.2.3 Word Pronunciation

The next stage of analysis involves computing pronunciations for the words in the input, given the orthographic representation of those words. The simplest approach is to have a set of "letter-to-sound" rules that simply map sequences of graphemes into sequences of phonemes, along with possible diacritic information, such as stress placement. This approach is naturally best suited to languages where there is a relatively simple relation between orthography and phonology: languages such as Spanish or Finnish fall into this category. However, languages like English manifestly do not, so it has generally been recognized that a highly accurate word pronunciation module must contain ...
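To make the letter-to-sound idea concrete for a language with a fairly regular orthography, the toy rule set below maps a few Spanish-like grapheme sequences onto phoneme symbols by greedy left-to-right matching. The rules, contexts, and symbols are invented for this sketch and are nowhere near a complete or accurate pronunciation module.

```python
# Toy letter-to-sound rules in the spirit described above, for a language with a
# fairly regular orthography (Spanish-like).  The rule set and phoneme symbols
# are invented for illustration only.

# Ordered rules: (grapheme sequence, phoneme string); multi-letter rules come first
# so that greedy matching prefers the longer grapheme sequence.
RULES = [
    ("ch", "tS"),   # "ch" as in "chico"
    ("qu", "k"),    # "qu" as in "queso"
    ("ll", "j"),    # "ll" as in "llama" (one of several regional realizations)
    ("ñ",  "J"),    # palatal nasal
    ("c",  "k"),    # ignoring the soft "c" before e/i for this sketch
    ("h",  ""),     # orthographic "h" is silent
    ("a", "a"), ("e", "e"), ("i", "i"), ("o", "o"), ("u", "u"),
    ("b", "b"), ("d", "d"), ("f", "f"), ("g", "g"), ("l", "l"),
    ("m", "m"), ("n", "n"), ("p", "p"), ("r", "r"), ("s", "s"), ("t", "t"),
]

def letter_to_sound(word: str) -> list:
    """Greedy left-to-right rewriting of graphemes into phoneme symbols."""
    phones, i = [], 0
    while i < len(word):
        for graph, phone in RULES:
            if word.startswith(graph, i):
                if phone:
                    phones.append(phone)
                i += len(graph)
                break
        else:
            i += 1   # skip characters no rule covers
    return phones

print(letter_to_sound("chile"))   # ['tS', 'i', 'l', 'e']
print(letter_to_sound("queso"))   # ['k', 'e', 's', 'o']
```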