
Richard V. Cox. “Speech Coding.” 2000 CRC Press LLC.

Speech Coding

Richard V. Cox
AT&T Labs-Research

45.1 Introduction
    Examples of Applications
    Speech Coder Attributes
45.2 Useful Models for Speech and Hearing
    The LPC Speech Production Model
    Models of Human Perception for Speech Coding
45.3 Types of Speech Coders
    Model-Based Speech Coders
    Time Domain Waveform-Following Speech Coders
    Frequency Domain Waveform-Following Speech Coders
45.4 Current Standards
    Current ITU Waveform Signal Coders
    ITU Linear Prediction Analysis-by-Synthesis Speech Coders
    Digital Cellular Speech Coding Standards
    Secure Voice Standards
    Performance
References

45.1 Introduction

Digital speech coding is used in a wide variety of everyday applications that the ordinary person takes for granted, such as network telephony or telephone answering machines. By speech coding we mean a method for reducing the amount of information needed to represent a speech signal for transmission or storage applications. For most applications this means using a lossy compression algorithm because a small amount of perceptible degradation is acceptable. This section reviews some of the applications, the basic attributes of speech coders, methods currently used for coding, and some of the most important speech coding standards.

45.1.1 Examples of Applications

Digital speech transmission is used in network telephony. The speech coding used is just sample-by-sample quantization. The transmission rate for most calls is fixed at 64 kilobits per second (kb/s). The speech is sampled at 8000 Hz (8 kHz) and a logarithmic 8-bit quantizer is used to represent each sample as one of 256 possible output values. International calls over transoceanic cables or satellites are often reduced in bit rate to 32 kb/s in order to boost the capacity of this relatively expensive equipment. Digital wireless transmission has already begun.
In North America, Europe, and Japan there are digital cellular phone systems already in operation with bit rates ranging from 6.7 to 13 kb/s for the speech coders.

Secure telephony has existed since World War II, based on the first vocoder. (Vocoder is a contraction of the words voice coder.) Secure telephony involves first converting the speech to a digital form, then digitally encrypting it and transmitting it. At the receiver, it is decrypted, decoded, and converted back to analog.

Current videotelephony is accomplished through digital transmission of both the speech and the video signals. An emerging use of speech coders is for simultaneous voice and data. In these applications, users exchange data (text, images, FAX, or any other form of digital information) while carrying on a conversation.

© 1999 by CRC Press LLC

All of the above examples involve real-time conversations. Today we use speech coders for many storage applications that make our lives easier. For example, voice mail systems and telephone answering machines allow us to leave messages for others. The called party can retrieve the message when they wish, even from halfway around the world. The same storage technology can be used to broadcast announcements to many different individuals.

Another emerging use of speech coding is multimedia. Most forms of multimedia involve only one-way communications, so we include them with storage applications. Multimedia documents on computers can have snippets of speech as an integral part. Capabilities currently exist to allow users to make voice annotations onto documents stored on a personal computer (PC) or workstation.

45.1.2 Speech Coder Attributes

Speech coders have attributes that can be placed in four groups: bit rate, quality, complexity, and delay. For a given application, some of these attributes are pre-determined while tradeoffs can be made among the others.
For example, the communications channel may set a limit on bit rate, or cost considerations may limit complexity. Quality can usually be improved by increasing bit rate or complexity, and sometimes by increasing delay. In the following sections, we discuss these attributes.

Primarily we will be discussing telephone bandwidth speech. This is a slightly nebulous term. In the telephone network, speech is first bandpass filtered from roughly 200 to 3200 Hz. This is often referred to as 3 kHz speech. Speech is sampled at 8 kHz in the telephone network. The usual telephone bandwidth filter rolls off to about 35 dB of attenuation by 4 kHz in order to eliminate the aliasing artifacts caused by sampling.

There is a second bandwidth of interest. It is referred to as wideband speech. The sampling rate is doubled to 16 kHz. The lowpass filter is assumed to begin rolling off at 7 kHz. At the low end, the speech is assumed to be uncontaminated by line noise and only the DC component needs to be filtered out. Thus, the highpass filter cutoff frequency is 50 Hz. When we refer to wideband speech, we mean speech with a bandwidth of 50 to 7000 Hz and a sampling rate of 16 kHz. This is also referred to as 7 kHz speech.

Bit Rate

Bit rate tells us the degree of compression that the coder achieves. Telephone bandwidth speech is sampled at 8 kHz and digitized with an 8-bit logarithmic quantizer, resulting in a bit rate of 64 kb/s (8000 samples per second times 8 bits per sample). For telephone bandwidth speech coders, we measure the degree of compression by how much the bit rate is lowered from 64 kb/s. International telephone network standards currently exist for coders operating from 64 kb/s down to 5.3 kb/s. The speech coders for regional cellular standards span the range from 13 to 3.45 kb/s and those for secure telephony span the range from 16 kb/s to 800 b/s. Finally, there are proprietary speech coders in common use that span the entire range.

Speech coders need not have a constant bit rate. Considerable compression can be gained by not transmitting speech during the silence intervals of a conversation.
Nor is it necessary to keep the bit rate fixed during the talkspurts of a conversation.

Delay

The communication delay of the coder is more important for transmission than for storage applications. In real-time conversations, a large communication delay can impose an awkward protocol on talkers. Large communication delays of 300 ms or greater are particularly objectionable to users even if there are no echoes.

Most low bit rate speech coders are block coders. They encode a block of speech, also known as a frame, at a time. Speech coding delay can be allocated as follows. First, there is algorithmic delay. Some coders have an amount of look-ahead or other inherent delays in addition to their frame size. The sum of frame size and other inherent delays constitutes algorithmic delay. The coder requires computation. The amount of time required for this is called processing delay. It is dependent on the speed of the processor used. Other delays in a complete system are the multiplexing delay and the transmission delay.

Complexity

The degree of complexity is a determining factor in both the cost and power consumption of a speech coder. Cost is almost always a factor in the selection of a speech coder for a given application. With the advent of wireless and portable communications, power consumption has also become an important factor. Simple scalar quantizers, such as linear or logarithmic PCM, are necessary in any coding system and have the lowest possible complexity.

More complex speech coders are first simulated on host processors, then implemented on DSP chips and may later be implemented on special purpose VLSI devices. Speed and random access memory (RAM) are the two most important contributing factors of complexity. The faster the chip or the greater the chip size, the greater the cost. In fact, complexity is a determining factor for both cost and power consumption.
Generally 1 word of RAM takes up as much on-chip area as 4 to 6 words of read only memory (ROM). Most speech coders are implemented on fixed-point DSP chips, so one way to compare the complexity of coders is to measure their speed and memory requirements when efficiently implemented on commercially available fixed-point DSP chips.

DSP chips are available in both 16-bit fixed point and 32-bit floating point. 16-bit DSP chips are generally preferred for dedicated speech coder implementations because the chips are usually less expensive and consume less power than implementations based on floating-point DSPs. A disadvantage of fixed-point DSP chips is that the speech coding algorithm must be implemented using 16-bit arithmetic. As part of the implementation process, a representation must be selected for each and every variable. Some can be represented in a fixed format, some in block floating point, and still others may require double precision. As VLSI technology has advanced, fixed-point DSP chips have come to contain a richer set of instructions to handle the data manipulations required to implement representations such as block floating point.

The advantage of floating-point DSP chips is that implementing speech coders is much quicker. Their arithmetic precision is about the same as that of a high level language simulation, so the steps of determining the representation of each and every variable and how these representations affect performance can be omitted.

Quality

The attribute of quality has many dimensions. Ultimately quality is determined by how the speech sounds to a listener. Some of the factors that affect the performance of a coder are whether the input speech is clean or noisy, whether the bit stream has been corrupted by errors, and whether multiple encodings have taken place.

Speech coder quality ratings are determined by means of subjective listening tests. The listening is done in a quiet booth and may use specified telephone handsets, headphones, or loudspeakers.
The speech material is presented to the listeners at specified levels and is originally prepared to have particular frequency characteristics. The most often used test is the absolute category rating (ACR) test. Subjects hear pairs of sentences and are asked to give one of the following ratings: excellent, good, fair, poor, or bad. A typical test contains a variety of different talkers and a number of different coders or reference conditions. The data resulting from this test can be analyzed in many ways. The simplest way is to assign a numerical ranking to each response, giving a 5 to the best possible rating, 4 to the next best, down to a 1 for the worst rating, then computing the mean rating for each of the conditions under test. This is referred to as a mean opinion score (MOS) and the ACR test is often referred to as a MOS test.

There are many other dimensions to quality besides those pertaining to noiseless channels. Bit error sensitivity is another aspect of quality. For some low bit rate applications such as secure telephones over 2.4 or 4.8 kb/s modems, it might be reasonable to expect the distribution of bit errors to be random and coders should be made robust for low random bit error rates up to 1 to 2%. For radio channels, such as in digital cellular telephony, provision is made for additional bits to be used for channel coding to protect the information bearing bits. Errors are more likely to occur in bursts and the speech coder requires a mechanism to recover from an entire lost frame. This is referred to as frame erasure concealment, another aspect of quality for cellular speech coders.

For the purposes of conserving bandwidth, voice activity detectors are sometimes used with speech coders. During non-speech intervals, the speech coder bit stream is discontinued. At the receiver “comfort noise” is injected to simulate the background acoustic noise at the encoder.
This method is used for some cellular systems and also in digital speech interpolation (DSI) systems to increase the effective number of channels or circuits. Most international phone calls carried on undersea cables or satellites use DSI systems. There is some impact on quality when these techniques are used. Subjective testing can determine the degree of degradation.

45.2 Useful Models for Speech and Hearing

45.2.1 The LPC Speech Production Model

Human speech is produced in the vocal tract by a combination of the vocal cords in the glottis interacting with the articulators of the vocal tract. The vocal tract can be approximated as a tube of varying diameter. The shape of the tube gives rise to resonant frequencies called formants. Over the years, the most successful speech coding techniques have been based on linear prediction coding (LPC). The LPC model is derived from a mathematical approximation to the vocal tract representation as a variable diameter tube. The essential element of LPC is the linear prediction filter. This is an all pole filter which predicts the value of the next sample based on a linear combination of previous samples.

Let $x_n$ be the speech sample value at sampling instant $n$. The object is to find a set of prediction coefficients $\{a_i\}$ such that the prediction error for a frame of size $M$ is minimized:

$$\varepsilon = \sum_{m=0}^{M-1} \left( \sum_{i=1}^{I} a_i x_{n+m-i} + x_{n+m} \right)^2 \qquad (45.1)$$

where $I$ is the order of the linear prediction model. The prediction value for $x_n$ is given by

$$\tilde{x}_n = -\sum_{i=1}^{I} a_i x_{n-i} \qquad (45.2)$$

The prediction error signal $\{e_n\}$, with $e_n = x_n - \tilde{x}_n$, is also referred to as the residual signal. In z-transform notation we can write

$$A(z) = 1 + \sum_{i=1}^{I} a_i z^{-i} \qquad (45.3)$$

$1/A(z)$ is referred to as the LPC synthesis filter and (ironically) $A(z)$ is referred to as the LPC inverse filter.
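As a rough illustration of Eqs. (45.1) to (45.3), the sketch below estimates the coefficients $a_i$ for one frame and computes the residual $e_n$. It assumes the autocorrelation method solved with the Levinson-Durbin recursion, which is one common way to minimize the prediction error; the text above does not say which solution method is intended. The autocorrelation method treats samples outside the frame as zero, so it only approximates the frame-based sum in Eq. (45.1). The function names are hypothetical, and no analysis window is applied.

```python
def autocorrelation(frame, max_lag):
    """Return [r_0, ..., r_max_lag], where r_k = sum_t frame[t] * frame[t-k]."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t - k] for t in range(k, n))
        for k in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Return ([1, a_1, ..., a_order], error energy) for A(z) = 1 + sum_i a_i z^-i.

    Assumption: autocorrelation method; not necessarily the chapter's procedure.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predicted frame: nothing left to model
        # Correlation of the current residual with the sample i steps back.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for stage i
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def residual(x, a):
    """e_n = x_n + sum_i a_i x_{n-i}; samples before x[0] are taken as zero."""
    order = len(a) - 1
    return [
        x[n] + sum(a[i] * x[n - i] for i in range(1, min(order, n) + 1))
        for n in range(len(x))
    ]


if __name__ == "__main__":
    # Toy frame: a decaying exponential, i.e. x_n = 0.9 * x_{n-1}.
    frame = [0.9 ** n for n in range(200)]
    coeffs, energy = levinson_durbin(autocorrelation(frame, 2), 2)
    print("coefficients:", coeffs)
    print("residual energy:", energy)
```

For the toy frame, $a_1$ comes out close to $-0.9$ and $a_2$ close to zero, which matches the sign convention of Eq. (45.2): the predictor $\tilde{x}_n = -a_1 x_{n-1}$ then reproduces $0.9\,x_{n-1}$. Passing the residual through the synthesis filter $1/A(z)$ would recover the original samples.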