CSA3020

Lecture 4 - Sound and Audio

Reference: Steinmetz, R., and Nahrstedt, K. (1995). Multimedia: Computing, Communications & Applications. Prentice Hall. Chapter 3.
Steinmetz, R. and Nahrstedt, K. (2002). Multimedia Fundamentals: Volume 1. Prentice Hall. Chapter 3.

Applications of Sound and Audio

Sound (and its derivatives: speech, music, etc., generally referred to as audio when audible to humans) has a significant part to play in multimedia applications.
From interacting through a multi-modal user interface (e.g., surfing the Web by voice) and text-to-speech systems (Apple Speech Technologies), to software agents capable of expressing themselves in natural language (e.g., VirtualFriend); Internet-based radio and TV (e.g., RealAudio, and Internet radio and TV sites); video-conferencing (e.g., CU-SeeMe) and Internet telephony (e.g., VocalTec Communications); generating computer music, sounds for games, and computer-controlled musical instruments (e.g., MIDI); to personalised elevator music and refrigerators that hum along to your mood, audio is essential.
This lecture presents the general properties of sound and how to convert it into a bit stream that can be manipulated by a computer (digitization). Finally, we give an overview of speech synthesis and analysis.

Properties of Sound

Sound is created by the vibration of matter and manifests itself when the pressure waves in the air created by the vibration reach an acoustic device (such as an ear, tape recorder, microphone, loud speaker, etc.) capable of converting the pressure waves. [Philosophical issues... the world is completely silent, sounds are only "inside our heads". If a tree falls in a forest, and there is nothing to hear it, does it make a sound? In space (a vacuum), nobody hears you scream.]
These vibrations displace the air, and the resulting alterations in pressure propagate through the air in a wave-like motion, called a waveform (see the figure below).

The part of the waveform that repeats at regular intervals is called a period; a periodic waveform sounds musical (e.g., a bird singing), while a waveform that is not periodic sounds like noise (e.g., me singing!).
The frequency of a sound is the number of periods per second and is measured in hertz (Hz); 1000 Hz = 1 kilohertz (kHz).
Frequencies audible to humans lie in the 20 Hz to 20 kHz range. Other frequency ranges are:

Infra-sound    0 Hz - 20 Hz
Ultrasound     20 kHz - 1 GHz
Hypersound     1 GHz - 10 THz
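
To make the relationship concrete, here is a minimal Python sketch (illustrative only, not from the reference text) that classifies a frequency into the ranges above and computes its period (T = 1/f):

    # Classify a frequency into the ranges listed above and compute its period.
    def classify(freq_hz):
        if freq_hz < 20:
            band = "infra-sound"
        elif freq_hz <= 20_000:
            band = "audible"
        elif freq_hz <= 1_000_000_000:
            band = "ultrasound"
        else:
            band = "hypersound"
        return band, 1.0 / freq_hz   # period in seconds

    print(classify(440))      # ('audible', ~0.00227 s) - concert A
    print(classify(40_000))   # ('ultrasound', 0.000025 s)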

The amplitude of a sound is a property subjectively heard as loudness.

Digitizing Audio

Natural sound occurs as continuous, and hence analog, pressure waves. In order to convert these pressure waves into a representation a computer can manipulate, it is necessary to digitize them.
An Analog-to-Digital Converter (ADC) measures the amplitude of the pressure wave at regular time intervals (each measurement is called a sample) to generate a digital representation of the sound. The reverse conversion, to play digital sound through an analog device (such as speakers), is performed by a Digital-to-Analog Converter (DAC).
The number of samples taken per second is called the sampling rate. CD-quality sound is sampled at 44,100 Hz, which means that it is sampled 44,100 times per second. This appears to be well above the frequency range of the human ear. However, the Nyquist sampling theorem states that for lossless digitization, the sampling rate must be at least twice the highest frequency present in the signal. The human ear can hear sound in the range 20 Hz to 20 kHz, so a sampling rate of at least 40,000 Hz is needed to capture the whole audible range. Sampling at 44,100 Hz allows frequencies up to 22,050 Hz (half the sampling rate) to be represented, which comfortably covers the limit of human hearing.
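As a minimal sketch of this arithmetic (Python, illustrative only):

    # Nyquist condition: to capture frequencies up to f_max without loss,
    # the sampling rate must be at least 2 * f_max.
    def min_sampling_rate(f_max_hz):
        return 2 * f_max_hz

    def max_representable_frequency(sampling_rate_hz):
        return sampling_rate_hz / 2   # the "Nyquist frequency"

    print(min_sampling_rate(20_000))             # 40000 Hz needed for the audible range
    print(max_representable_frequency(44_100))   # 22050.0 Hz for CD audio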
Just as the waveform is sampled at discrete times, the value of the sample taken is also represented as a discrete value. The resolution or quantization of a sample value is dependent on the number of bits used to represent the amplitude. The greater the number of bits used, the better the resolution, but the more storage space is required. Typically, amplitude is sampled as either 8-bit (resulting in 256 possible sample values) or 16-bit (yielding 65536 values).
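The sketch below (Python, illustrative only; a real ADC works in hardware) samples a 1 kHz sine wave at a chosen sampling rate and quantizes each sample to a chosen bit depth, showing how both parameters enter the digitization process:

    import math

    # Sample a sine wave and quantize each sample to 'bits' bits.
    def digitize(freq_hz=1000, sampling_rate=8000, bits=8, n_samples=8):
        levels = 2 ** bits                    # 256 levels for 8-bit, 65536 for 16-bit
        samples = []
        for n in range(n_samples):
            t = n / sampling_rate             # time of the n-th sample
            amplitude = math.sin(2 * math.pi * freq_hz * t)   # value in [-1, 1]
            # map the range [-1, 1] onto the available integer levels
            samples.append(round((amplitude + 1) / 2 * (levels - 1)))
        return samples

    print(digitize(bits=8))    # [128, 218, 255, 218, 128, 37, 0, 37]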

Comparison of Audio Quality vs. Data Rate (from Basics of Digital Audio)


Quality     Sample Rate   Bits per   Mono/       Data Rate          Frequency
              (kHz)        Sample    Stereo    (Uncompressed)          Band
---------   -----------   --------   ------    -----------------   ------------
Telephone      8             8        Mono       8   KBytes/sec    200-3,400 Hz
AM Radio      11.025         8        Mono      11.0 KBytes/sec
FM Radio      22.050        16       Stereo     88.2 KBytes/sec
CD            44.1          16       Stereo    176.4 KBytes/sec    20-20,000 Hz
DAT           48            16       Stereo    192.0 KBytes/sec    20-20,000 Hz
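
The uncompressed data rates in the table follow directly from the sample rate, bits per sample, and number of channels; a minimal sketch (Python) of the calculation:

    # Uncompressed data rate = sampling rate * (bits per sample / 8) * channels.
    def data_rate_bytes_per_sec(sample_rate_hz, bits_per_sample, channels):
        return sample_rate_hz * (bits_per_sample / 8) * channels

    print(data_rate_bytes_per_sec(8_000, 8, 1))     # 8000.0   bytes/sec -> telephone quality
    print(data_rate_bytes_per_sec(44_100, 16, 2))   # 176400.0 bytes/sec -> CD quality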

MIDI

See An introduction to MIDI and A Tutorial on MIDI and Music Synthesis for good introductions to MIDI. You should know how MIDI works at an introductory level, although we will not cover it in the lectures.

Speech Synthesis and Analysis

Speech synthesis (generation) and analysis are important aspects of multimedia systems. As multi-modal user interfaces become more common, it will become increasingly important for humans to communicate with computers using spoken language approaching natural language, and for computer systems to communicate with humans using artificially generated speech. Human acceptance of computer-generated speech depends on the speech sounding natural and being easy to understand. Speech synthesis and analysis also have a multitude of other applications: voice recognition is an important class of security system; speech synthesis can give the vocally impaired a means of spoken communication; and speech synthesis and analysis are important in making computer systems usable by illiterate and visually impaired users.

Speech Synthesis in a Nutshell

Real-time speech generation
The easiest way of generating speech in real time is to use pre-recorded speech (e.g., MaltaCom's fault-reporting service, Barbie and Barney). However, the limitation is that if a word is not pre-recorded, it cannot be used. A more flexible, though more time-consuming, solution is to record individual speech units (of which there is a finite set) and then generate speech by concatenating those sounds (see the sketch below).
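A toy sketch of the concatenation approach (Python; the unit names and sample values are invented for illustration, not real recordings):

    # Each speech unit maps to a pre-recorded waveform (here just short lists of
    # numbers standing in for samples); an utterance is produced by joining them.
    recorded_units = {
        "he":  [0, 3, 5, 3],     # placeholder sample values
        "llo": [2, 6, 7, 4],
    }

    def synthesise(units):
        waveform = []
        for unit in units:
            waveform.extend(recorded_units[unit])   # naive join: no smoothing at the seams
        return waveform

    print(synthesise(["he", "llo"]))   # "hello" built from two stored units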
However, consider how you would pronounce "Betty is by the sea" normally, quizzically, and agitatedly. Also consider how "an arm and a leg" would sound with a British accent and with a New York accent. Stress (together with melody, collectively called prosody) plays a large part in sound generation, and getting the prosody right is still a challenge; consequently, computer-generated speech can sound quite unnatural.
Apart from this high-level problem, there are also problems with words which follow each other. Consider the word the: its sound changes depending on whether the following word starts with a vowel or a consonant. These problems can be overcome using coarticulation rules over phone order (a toy example follows below). Other problems which influence pronunciation include ambiguity. Consider the word lead in the following sentences: "The general lead his army to a famous victory", and "In parks, dogs should always be kept on a lead". At face value, in the first sentence lead is (intended as) a past-tense verb, pronounced like "led", and in the second it is a noun, pronounced like "leed", so some pronunciations can be disambiguated using syntactic analysis; on other occasions, semantic analysis is necessary. Despite these problems, it is possible to generate speech to an acceptable level of quality.
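For instance, a single coarticulation-style rule for the could be sketched as follows (Python; the spelling-based vowel test and the informal pronunciations are simplifications for illustration):

    # Toy rule: "the" is pronounced "thee" before a word starting with a vowel
    # and "thuh" otherwise (using the spelling as a crude stand-in for the sound).
    def pronounce_the(next_word):
        return "thee" if next_word[0].lower() in "aeiou" else "thuh"

    print(pronounce_the("apple"))   # thee
    print(pronounce_the("car"))     # thuh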
The figure below (from Steinmetz and Nahrstedt, 1995, pg. 46/Steinmetz and Nahrstedt, 2002, pg. 36) shows the components of a speech synthesis system.

Speech Analysis

The figure above (from Steinmetz and Nahrstedt, 1995, pg. 47/Steinmetz and Nahrstedt, 2002, pg. 37) identifies the research areas concerned with speech analysis.
The primary goal of speech analysis is to determine individual words correctly, with a probability as close to 1 as possible. Reasons why systems fall short of this include ambient noise (humans are remarkably good at speech recognition even in noisy environments), confusion between words that sound alike ("there" and "their", for example), dialect, and stress.
Once individual words in a sentence have been recognised, the probability of recognising the whole sentence correctly is the product of the probabilities of recognising the individual words, i.e., the per-word probability raised to the power of the number of words in the sentence. For example, if the probability of recognising individual words is 0.95, then the probability of correctly recognising a 3-word sentence is 0.95^3, or roughly 0.857. Factors which reduce the probability of sentences being correctly recognised include correctly determining word boundaries (compare "an arm and a leg" spoken with a British and a New York accent - although, obviously, the misinterpretation is by British listeners of a New York accent!), semantics, and time normalization: the same sentence can be spoken quickly or slowly, as can individual words within an utterance.
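A quick check of this arithmetic (Python; 0.95 is just the illustrative figure used above):

    # Probability of recognising a whole sentence, assuming each word is
    # recognised independently with the same probability.
    def sentence_probability(word_probability, n_words):
        return word_probability ** n_words

    print(round(sentence_probability(0.95, 3), 3))    # 0.857
    print(round(sentence_probability(0.95, 10), 3))   # 0.599 - accuracy drops quickly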
Speech recognition systems are divided into speaker-independent and speaker-dependent recognition systems. Speaker-independent systems can be used by many different speakers without training, but recognise only a limited number of words (e.g., some of British Telecom's telephone services recognise only the words "Yes" and "No" - compare this to MaltaCom's services, which require a "9" or "0" tone to be sent in response to questions). A speaker-dependent system, after training, can recognise an extensive vocabulary in excess of 25,000 words.

Related Links

Basics of Digital Audio
YAHOO's Multimedia:Sound Page



In case of any difficulties or for further information e-mail cstaff@cs.um.edu.mt

Date last amended: Tuesday, 24 October, 2002