Data Compression: MPEG-4 and Digital Audio Production

MPEG-4

MPEG-4 will define a method of describing objects (both visual and audible) and how they are “composited” and interact together to form “scenes.” The scene description part of the MPEG-4 standard describes a format for transmitting the spatiotemporal positioning information that specifies how individual audiovisual objects are composed within a scene. A “real world” audio object is defined as an audible semantic entity recorded with a single microphone in the case of a mono recording, or with several microphones at different positions in the case of a multichannel recording. Audio objects can be grouped or mixed together, but objects cannot easily be split into subobjects.

Applications for MPEG-4 audio might include “mix minus 1” applications, in which an orchestra is recorded minus the concerto instrument, allowing a musician to play along on his or her own instrument at home. Another example is a feature film in which all the effects and music tracks are mixed minus the dialogue, allowing very flexible multilingual applications because each language is a separate audio object and can be selected as required in the decoder.

In principle, all of these applications are straightforward; they could be handled by existing digital (or analogue) systems. The problem, once again, is bandwidth. MPEG-4 is designed for very low bit rates, which suggests that MPEG have designed (or integrated) a number of very powerful audio tools to reduce the necessary data throughput. These tools include the MPEG-4 Structured Audio format, which uses low bit-rate algorithmic sound models to code sounds. Furthermore, MPEG-4 includes the functionality to use and control postproduction panning and reverberation effects at the decoder, as well as the SAOL signal-processing language, which enables music synthesis and sound effects to be generated, once again, at the terminal rather than prior to transmission.

Structured Audio

We have already seen how MPEG (and Dolby) coding aims to remove perceptual redundancy from an audio signal, as well as removing other, simpler representational redundancy by means of efficient bit-coding schemes. Structured audio (SA) compression schemes compress sound by first exploiting another type of redundancy in signals: structural redundancy.

Structural redundancy is a natural result of the way sound is created in human situations. The same sounds, or sounds which are very similar, occur over and over again. For example, a performance of a work for solo piano consists of many piano notes. Each time the performer strikes the “middle C” key on the piano, a very similar sound is created by the piano’s mechanism. To a first approximation, we could view the sound as exactly the same upon each strike; to a closer one, we could view it as the same except for the velocity with which the key is struck, and so on. In a PCM representation of the piano performance, each note is treated as a completely independent entity; each time the “middle C” is struck, the sound of that note is independently represented in the data sequence. This is even true in a perceptual coding of the sound. The representation has been compressed, but the structural redundancy present in re-representing the same note as separate, unrelated events has not been removed.

In structured coding, we assume that each occurrence of a particular note is the same, except for a difference described by an algorithm with a few parameters. In the model-transmission stage we transmit the basic sound (either a sound sample or another algorithm) and the algorithm which describes the differences. Then, for sound transmission, we need only code the note desired, the time of occurrence, and the parameters controlling the differentiating algorithm.
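As a rough sketch of this idea (in Python, purely illustrative; the NoteEvent structure and decode_events helper are invented here and are not part of the MPEG-4 standard), the decoder stores one prototype recording of the note and reconstructs each occurrence from a short parametric event:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class NoteEvent:
    """One occurrence of the stored note: when it happens and how it differs."""
    start_s: float  # time of occurrence, in seconds
    gain: float     # stands in for the key velocity / loudness difference


def decode_events(prototype: np.ndarray, events: list[NoteEvent],
                  sample_rate: int, total_s: float) -> np.ndarray:
    """Rebuild a performance from one prototype note plus parametric events."""
    out = np.zeros(int(total_s * sample_rate))
    for ev in events:
        start = int(ev.start_s * sample_rate)
        end = min(start + len(prototype), len(out))
        # The differentiating "algorithm" here is just a gain change; a real
        # coder could also pitch-shift, filter or time-stretch the prototype.
        out[start:end] += ev.gain * prototype[: end - start]
    return out


# Usage: three strikes of "middle C" cost one stored sample plus three tiny events.
sr = 44100
prototype = np.sin(2 * np.pi * 261.63 * np.arange(sr) / sr) * np.linspace(1.0, 0.0, sr)
events = [NoteEvent(0.0, 1.0), NoteEvent(0.7, 0.6), NoteEvent(1.5, 0.9)]
performance = decode_events(prototype, events, sr, total_s=3.0)
```

Three strikes of the same note thus cost three small events rather than three complete PCM renderings of the note.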

SAOL

SAOL (pronounced “sail”) stands for “Structured Audio Orchestra Language” and falls into the music-synthesis category of “Music V” languages. Its fundamental processing model is based on the interaction of oscillators running at various rates. Note that this approach is different from the idea (used in the multimedia world) of using MIDI information to drive synthesis chips on sound cards. The latter approach has the disadvantage that, because IC technology varies, the music will sound different depending on which sound card realizes it. Using SAOL (a much “lower-level” language than MIDI), realizations will always sound the same.

At the beginning of an MPEG-4 session involving SA, the server transmits to the client a stream information header, which contains a number of data elements. The most important of these is the orchestra chunk, which contains a tokenized representation of a program written in SAOL. The orchestra chunk consists of the description of a number of instruments. Each instrument is a single parametric signal-processing element that maps a set of parametric controls to a sound. For example, a SAOL instrument might describe a physical model of a plucked string. The model is transmitted as code that implements it, using the repertoire of delay lines, digital filters, fractional-delay interpolators and so forth that are the basic building blocks of SAOL.
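To make the plucked-string example concrete, here is a minimal sketch in Python (not SAOL syntax; the function name and parameters are invented) of the kind of delay-line-plus-filter model such an instrument might implement, in the style of the well-known Karplus-Strong algorithm:

```python
import numpy as np


def plucked_string(freq_hz: float, duration_s: float, sample_rate: int = 44100,
                   damping: float = 0.996) -> np.ndarray:
    """Karplus-Strong style pluck: a delay line fed back through a simple
    averaging (low-pass) filter, the sort of structure a SAOL instrument
    could build from delay lines and digital filters."""
    n = int(duration_s * sample_rate)
    delay_len = int(sample_rate / freq_hz)           # delay-line length sets the pitch
    delay = np.random.uniform(-1.0, 1.0, delay_len)  # a noise burst is the "pluck"
    out = np.zeros(n)
    for i in range(n):
        out[i] = delay[i % delay_len]
        nxt = delay[(i + 1) % delay_len]
        # Average adjacent delay-line samples and damp slightly on every pass,
        # which rounds off and decays the tone like a real string.
        delay[i % delay_len] = damping * 0.5 * (out[i] + nxt)
    return out


note = plucked_string(220.0, 1.0)  # a one-second pluck at 220 Hz
```

An actual SAOL instrument would express the same delay-line and filter structure in SAOL itself, so that every compliant decoder produces an identical result.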

The bit stream data itself, which follows the header, is made up mainly of time-stamped parametric events. Each event refers to an instrument described in the orchestra chunk in the header and provides the parameters required for that instrument. Other sorts of data may also be conveyed in the bit stream: tempo and pitch changes, for example.
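Conceptually, the decoded event stream behaves something like the following sketch (Python; the ScoreEvent fields and render helper are hypothetical, and the real bit stream is a binary, tokenized format):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScoreEvent:
    time_s: float       # time stamp within the session
    instrument: str     # which orchestra-chunk instrument the event refers to
    params: dict        # the parametric controls that instrument expects


# A hypothetical fragment of decoded events for a plucked-string instrument.
score = [
    ScoreEvent(0.0, "pluck", {"freq_hz": 220.0, "duration_s": 1.0}),
    ScoreEvent(0.5, "pluck", {"freq_hz": 330.0, "duration_s": 1.0}),
]


def render(score: list[ScoreEvent], instruments: dict[str, Callable]) -> None:
    """Dispatch each time-stamped event to the instrument it names."""
    for ev in sorted(score, key=lambda e: e.time_s):
        instruments[ev.instrument](**ev.params)  # the instrument maps params to sound
```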

Unfortunately, at the time of writing (and probably for some time beyond!) the techniques required for automatically producing a structured audio bit stream from an arbitrary, prerecorded sound are beyond today’s state of the art, although they are an active research topic. These techniques are often called “automatic source separation” or “automatic transcription.” In the meantime, composers and sound designers will use special content-creation tools to create SA bit streams directly. This is not considered to be a fundamental obstacle to the use of MPEG-4 structured audio, because these tools are very similar to the ones that contemporary composers and editors already use; all that is required is to make their tools capable of producing MPEG-4 output bit streams. There is an interesting parallel here with MPEG-4 video: while we are not yet capable of integrating and coding real-world images and sounds, there are immediate applications for directly synthesized programs. MPEG-4 audio also foresees the use of text-to-speech conversion systems.

Audio Scenes

Just as video scenes are made from visual objects, audio scenes may be usefully described as the spatiotemporal combination of audio objects. An “audio object” is a single audio stream coded using one of the MPEG-4 coding tools, such as structured audio. Audio objects are combined by mixing, effects processing, switching and delaying them, and may be panned to a particular three-dimensional location. The effects processing is described abstractly in terms of a signal-processing language: the same language used for SA.
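As a rough illustration (a Python sketch, not the MPEG-4 scene-description syntax; the pan_stereo and compose_scene helpers are invented here), a simple stereo scene could be assembled by delaying, gain-scaling, panning and summing independently coded object streams:

```python
import numpy as np


def pan_stereo(mono: np.ndarray, pan: float) -> np.ndarray:
    """Constant-power pan of a mono object; pan runs from -1 (left) to +1 (right)."""
    angle = (pan + 1.0) * np.pi / 4.0
    return np.stack([np.cos(angle) * mono, np.sin(angle) * mono], axis=1)


def compose_scene(objects, sample_rate: int, total_s: float) -> np.ndarray:
    """Mix audio objects into one stereo scene.

    objects: a list of (samples, start_s, gain, pan) tuples, where each samples
    array is one independently coded audio object (dialogue, music, effects...).
    """
    scene = np.zeros((int(total_s * sample_rate), 2))
    for samples, start_s, gain, pan in objects:
        placed = pan_stereo(gain * samples, pan)
        start = int(start_s * sample_rate)
        end = min(start + len(placed), len(scene))
        scene[start:end] += placed[: end - start]
    return scene
```

Because the objects remain separate until this final composition step, a decoder can drop, replace or reposition any one of them (for example, substituting a different-language dialogue object) without touching the rest of the mix.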

Digital Audio Production

We’ve already looked at the technical advantages of digital signal processing and recording over its older analogue counterpart. We now come to consider the operational impact of this technology, which has brought with it a raft of new tools and some new problems.
