Audio data comes in a sometimes bewildering variety of forms. The number of fundamental ways in which sound can be represented is actually fairly small; the variety of audio file types is due to the fact that there are quite a few approaches to compressing audio data and a number of different ways of packaging it. We first describe how the audio data itself is represented, then how it is packaged into files. People often speak of audio formats loosely, without distinguishing between data formats and file formats, but it is critical to keep this distinction in mind: many file formats can contain data represented in more than one way, and most data representations can be packaged in more than one file format. Saying that a sound file is a ".wav" file says nothing about the audio data format. Similarly, saying that a file contains PCM data says nothing about the file format.
Sound consists of audible variation in air pressure. Microphones convert variation in air pressure into a varying voltage. To represent sound digitally, we must convert this varying voltage into a series of numbers representing its amplitude. This process is known as analog-to-digital conversion. Audio data consisting of such numbers is said to be in pulse code modulation format, abbreviated PCM. Audio data is often stored in other formats, usually in order to compress it, but it almost always starts off in PCM format.
The numbers produced by an analog-to-digital converter are, in general, arbitrary. Although the original pressure data has dimensions of dynes per square centimeter, the relationship between these actual pressure values and the numbers generated by analog-to-digital conversion is determined both by the response characteristics of the microphone and by the preamplifier in the analog-to-digital converter. We rarely know the exact properties of either the microphone or the analog-to-digital converter. Furthermore, we normally adjust the analog-to-digital converter, or the preamplifier that precedes it, to choose the best input level. We want to use the largest dynamic range possible, so as to take advantage of the full detail of the signal, while at the same time ensuring that we do not exceed the limits of the electronics and avoid clipping, which distorts the signal. Therefore, it is almost never the case that we know how many dynes per square centimeter the numbers represent. For most purposes this does not matter, as all we care about is the relative amplitude of the signal. Absolute pressure levels are of interest for some work in auditory psychophysics. In this case, it is necessary to calibrate the system and fix its parameters (such as the preamplifier gain).
The air pressure variation, and therefore the corresponding voltage produced by a microphone, is continuous in two respects: the values vary continuously, and they exist at every point in time. However, a digital system such as a computer cannot directly represent a continuous signal. Instead, it must measure the signal at a finite set of discrete times; this is known as sampling. Furthermore, it must make use of a finite number of discrete amplitude levels; this is known as quantization. The number of levels used is known as the resolution. The resolution is usually expressed in bits, that is, as the base-2 logarithm of the actual number of levels. A system with a resolution of 8 bits makes use of 2^8 = 256 levels. A system with 16 bit resolution makes use of 2^16 = 65,536 levels. The sampling rate and resolution determine the quality of the digital representation of the sound. "CD-quality" sound has a resolution of 16 bits and a sampling rate of 44,100 samples per second.
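The relationship between resolution in bits and the number of levels, and the act of mapping a continuous amplitude onto the nearest level, can be sketched in a few lines of Python. The `quantize` helper is hypothetical and for illustration only, assuming amplitudes normalized to ±1.0:

```python
# Resolution in bits determines the number of quantization levels:
for bits in (8, 16, 24):
    print(f"{bits}-bit resolution: {2 ** bits:,} levels")
# 8-bit resolution: 256 levels
# 16-bit resolution: 65,536 levels
# 24-bit resolution: 16,777,216 levels

def quantize(x, bits):
    """Map an amplitude in [-1.0, 1.0] onto the nearest signed level.

    Hypothetical helper for illustration only: simple rounding,
    clamped to the representable two's-complement range.
    """
    max_level = 2 ** (bits - 1) - 1      # e.g. 32767 for 16 bits
    min_level = -(2 ** (bits - 1))       # e.g. -32768
    return max(min_level, min(max_level, round(x * max_level)))

print(quantize(1.0, 16))     # 32767
print(quantize(0.5, 8))      # 64  (half of 127, rounded)
```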
[Figure: a continuous waveform.]
The sampling rate is the number of times per second that the amplitude of the signal is measured and so has dimensions of samples per second. The higher the sampling rate, the more accurately the sampled signal will represent the original signal. The choice of sampling rate is determined by the Nyquist Sampling Theorem. This theorem states that if the maximum frequency at which the original signal contains energy is F, then if it is sampled at a rate strictly greater than 2F samples per second, it will be possible to reconstruct the original signal perfectly from the sampled signal. In other words, the sampled signal will contain all of the information in the original signal.
It is important to note that the sampling rate must be strictly greater than 2F. Sampling at a rate of exactly 2F can result in error. For example, suppose that the original signal is a sine wave of frequency F. If we sample at a frequency of 2F, with the first sample at time 0, all of our samples will have value 0. The original sine wave cannot be reconstructed from such a sampled signal. This is easily seen in the following illustrations. All four show 1Hz sine waves. The two on the left have phase 0. The two on the right have phase π. The top two have amplitude 1.0. The bottom two have amplitude 2.0. If we sample at two samples per second with an offset of 0, all of our samples will be 0 in all four cases. The sampled signal will not contain the information necessary to decide which of the four original signals to reconstruct.
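The failure at exactly the Nyquist rate is easy to verify numerically. In this sketch, a sine of frequency F sampled at exactly 2F samples per second, starting at time 0, yields samples that are all (effectively) zero regardless of amplitude:

```python
import math

# A sine of frequency F sampled at exactly 2F samples per second,
# starting at time 0: every sample lands on a zero crossing,
# whatever the amplitude of the original signal.
F = 1.0            # Hz
rate = 2 * F       # exactly twice F -- too low
for amplitude in (1.0, 2.0):
    samples = [amplitude * math.sin(2 * math.pi * F * n / rate)
               for n in range(8)]
    peak = max(abs(s) for s in samples)
    print(f"amplitude {amplitude}: largest sample magnitude = {peak:.1e}")
# Both runs report magnitudes on the order of 1e-15: pure floating-point
# round-off. The two signals are indistinguishable once sampled, so
# neither can be reconstructed.
```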
Furthermore, the Nyquist theorem is based on the assumption that the original signal is infinitely long. If it is not, sampling at just over the Nyquist rate of 2F will not necessarily permit perfect reconstruction. If the sampling rate is only a little greater than the Nyquist rate, the increase may not shift the sampling points far enough to make the quantized values different. Consider again our example of the sine wave. Let us suppose that its frequency is 1000 Hz. If we sample at a rate of 2000 samples per second we may end up with every sample equal to 0. Suppose we sample at a rate of 2001 samples per second, which technically satisfies the Nyquist criterion. If the signal is long enough, even if the first sample falls at a zero, some samples will be a significant distance from a zero and there will be enough information to reconstruct the signal. But if the signal is fairly short, the samples will be taken at points where the value is so close to zero that it may not be distinguishable from zero when quantized. As a result, for short signals it is necessary to use a sampling rate significantly higher than the Nyquist rate.
Intuitively it may seem that if a signal is sampled at too low a rate the result will be that the higher frequency components will be lost but that the lower frequency components will be unaffected. Unfortunately, this is not the case. Instead, what happens is that the energy from the higher frequencies is treated as if it were at lower frequencies; its energy is added to the energy actually present at these lower frequencies. This distortion is known as aliasing.
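The folding of high-frequency energy down into the sampled band can be computed directly. The `alias_frequency` helper below is illustrative, assuming a real-valued signal:

```python
def alias_frequency(f, fs):
    """Apparent frequency of a component at f Hz when sampled at fs.

    Energy above fs/2 folds back ("aliases") into the 0..fs/2 band;
    components below fs/2 are unaffected.
    """
    return abs(f - fs * round(f / fs))

fs = 8000                       # samples per second
for f in (1000, 3000, 5000, 9000):
    print(f, "Hz appears at", alias_frequency(f, fs), "Hz")
# 1000 Hz appears at 1000 Hz   (below fs/2: unaffected)
# 3000 Hz appears at 3000 Hz   (below fs/2: unaffected)
# 5000 Hz appears at 3000 Hz   (folded down)
# 9000 Hz appears at 1000 Hz   (folded down)
```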
In order to digitize a signal without danger of aliasing, it is customary to pass it through a low-pass filter first in order to remove any energy above the Nyquist frequency. For example, if we are interested in energy up to 8 kHz, we can sample at a frequency a little over 16,000 samples per second, after filtering out the energy above 8 kHz. However, physically realizable low-pass filters do not simply pass 100% of the energy below the cutoff frequency and eliminate all of the energy above it; the reduction in energy above the nominal cutoff frequency is gradual. If we use a filter with a nominal cutoff of 8 kHz, we may still have significant energy in the region a little above 8 kHz. Therefore, to be safe, if we know that we are interested in energy up to frequency F, we use a low-pass filter with nominal cutoff at F and choose a sampling rate significantly above 2F. That way, the sampling rate will be high enough that any energy in the region just above the nominal cutoff frequency will not be aliased. A common practice is to use a sampling rate of 2.5F.
Older analog-to-digital converters that allow a choice of sampling rates have to have variable analog filters before the digitizer. Nowadays, signal processing hardware is so fast and cheap that the usual approach is to use a low-pass filter with a fixed cutoff frequency, then sample at a high rate. This sampled signal is then downsampled (converted to a lower sampling rate) after digital low-pass filtering.
The most common sampling rate nowadays is 44,100 samples per second. This is the sampling rate used for music CDs. Since the music market is much larger than any other (such as the market for acoustic phonetics research), off-the-shelf hardware and software are designed to its specifications. This sampling rate allows for frequency content up to a little over 20 kHz, which covers the entire range that human beings can hear. Indeed, most adults cannot hear frequencies nearly that high.
When digitizing material for linguistic research, you can save space by using a lower sampling rate, say 22,050 samples per second. All of the linguistic information in speech is below 8 kHz, so this rate is more than adequate.
Another sampling rate that is sometimes seen is 8,000 samples per second. This rate corresponds (using the multiplier of 2.5 rather than 2.0 discussed above) to a maximum frequency of 3.2 kHz, which is the upper bound of the telephone band. This rate is therefore appropriate for applications involving telephone speech. It is too low for good quality speech or music, and not acceptable for most phonetic research.
The error produced by quantizing a signal is known as quantization noise. The quality of a quantized signal may be measured by computing the signal-to-noise ratio (SNR), where the noise in question is the quantization noise. Each bit of resolution adds approximately 6 decibels to the signal-to-noise ratio. A resolution of 8 bits therefore yields an SNR of approximately 48dB. A resolution of 16 bits yields an SNR of approximately 96dB. Some older digitizers, such as those used on PCs and Macintoshes in the 1980s and early 1990s, provided only 8 bit resolution, so one occasionally encounters old 8-bit sound files. 16 bit resolution is considered desirable for purposes such as acoustic phonetics research and professional quality music. Almost all digitizers in use today provide 16 bit resolution or higher.
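The 6-dB-per-bit rule can be checked against the standard formula for an ideal quantizer driven by a full-scale sine wave, SNR ≈ 6.02·b + 1.76 dB; the rule of thumb in the text keeps only the 6-dB-per-bit term, which is why it gives slightly lower figures (48 dB and 96 dB) than the sketch below:

```python
def quantization_snr_db(bits):
    """Approximate SNR of an ideal quantizer for a full-scale sine.

    The commonly quoted formula is 6.02 * bits + 1.76 dB; the
    dominant term is about 6 dB per bit of resolution.
    """
    return 6.02 * bits + 1.76

for bits in (8, 16, 24):
    print(f"{bits} bits: ~{quantization_snr_db(bits):.0f} dB")
# 8 bits: ~50 dB
# 16 bits: ~98 dB
# 24 bits: ~146 dB
```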
The effects of quantization may be understood intuitively by comparing a quantized signal with the original continuous signal. [Figure: a continuous signal overlaid with a 2-bit quantization.]
When digitizing an analog signal, it is important to set the input level of the digitizer correctly. This means that the extreme values of the input signal should be just within the range of the digitizer. If they exceed the range of the digitizer, the result will be a form of distortion known as clipping. Significant overloading may also damage the digitizer. On the other hand, if the input level is set too low, the result will effectively be a smaller than optimum resolution. If the digitizer has 65,536 levels available but the signal ranges over only half the input range, only half of the levels are used; in effect the signal is digitized with 15 bits of resolution rather than the available 16.
A single stream of sound, such as that from an ordinary monaural recording, constitutes one channel. Stereo requires two channels. Quadraphonic music requires four channels. Recordings made in professional music studios may have many channels prior to mixing, one for each instrument and singer. In practice, the common values are one and two.
The integers used to represent amplitude values may be signed or unsigned. A signed number is one that may be either positive or negative. An unsigned number may never be negative. Whether the numbers used are signed or unsigned has no effect on the resolution. The number of distinct amplitude levels remains the same. For example, in the usual machine-level representation of integers (known as 2's-complement representation), 16 bit signed integers range between -32,768 and 32,767. The 16 bit unsigned integers range in value from 0 to 65,535. In both cases the total number of amplitude levels is 65,536. Therefore, it doesn't really matter whether signed or unsigned representation is used, but to operate on the values correctly it is necessary to know which representation is intended. For example, the bit pattern 1111111111111111 represents the value 65,535, the maximum value, as an unsigned integer but -1, at the middle of the amplitude range, as a 2's-complement signed integer.
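The two readings of the same bit pattern can be demonstrated directly; the sign-extension expression below is the usual two's-complement conversion:

```python
# The same 16 bits interpreted as unsigned vs. two's-complement signed.
bits = 0xFFFF                # the pattern 1111111111111111

unsigned = bits              # 65,535: the maximum unsigned value
# If the high bit is set, subtract 2^16 to get the signed reading:
signed = bits - 0x10000 if bits & 0x8000 else bits

print(unsigned)              # 65535
print(signed)                # -1

# Either way there are 65,536 distinct levels; only the labels differ:
print(2 ** 16)               # 65536
```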
Recall from our previous explanation of endianness that different computers interpret multibyte sequences in different ways. Since 16 bit integers take up two bytes, they are affected by endianness. Audio software will generally convert data of the wrong byte order as necessary. You will only need to deal with endianness if you write low-level audio processing software or if you encounter raw audio data, so that your software cannot determine the byte order of the data from the header. If you do encounter a raw file with the wrong byte order, it should be easy to detect as it will sound like noise.
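A quick sketch of what byte order means for 16-bit samples, using Python's struct module. Reading bytes with the wrong byte order scrambles the values, which is why a raw file with the wrong endianness sounds like noise:

```python
import struct

# A 16-bit sample with value 1000 (hex 0x03E8), packed both ways.
sample = 1000
little = struct.pack('<h', sample)   # bytes E8 03 (low byte first)
big = struct.pack('>h', sample)      # bytes 03 E8 (high byte first)
print(little.hex(), big.hex())       # e803 03e8

# Reading little-endian bytes as if they were big-endian gives garbage:
wrong = struct.unpack('>h', little)[0]
print(wrong)                         # -6141 instead of 1000

# Repairing a wrong-byte-order stream is just a byte swap:
swapped = bytes(b for pair in zip(little[1::2], little[0::2]) for b in pair)
print(struct.unpack('>h', swapped)[0])   # 1000
```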
Audio data takes up a lot of space, at least in comparison with text. A single second of compact disc audio takes up about as much space as 15,000 words of ASCII text, that is, 60 pages of a typical book. Here is a chart showing the amount of space occupied by different durations of monaural sound at different sampling rates. A 10GB disk, for example, will hold only about 31 hours of audio at the CD-rate.
                                   1 second    1 minute    1 hour
  44,100 samples/second, 16 bit    88.2 KB     5.3 MB      317.5 MB
  22,050 samples/second, 16 bit    44.1 KB     2.6 MB      158.8 MB
  16,000 samples/second, 16 bit    32.0 KB     1.9 MB      115.2 MB
Note that in this chart KB stands for 1,000 bytes and MB for 1,000,000 bytes. These are the definitions used by the International Electrotechnical Commission, the international body that sets standards in the electronics and electrical areas. Disk manufacturers use these units to describe the size of their products. In contrast, computer programmers generally use KB to mean 1,024 bytes and MB to mean 1,024 × 1,024 = 1,048,576 bytes.
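The figures in the chart follow from simple multiplication: samples per second times bytes per sample times channels times duration. A sketch:

```python
def audio_bytes(rate, bits, channels, seconds):
    """Uncompressed PCM size: samples/sec x bytes/sample x channels x time."""
    return rate * (bits // 8) * channels * seconds

# One hour of monaural 16-bit audio at the CD sampling rate:
size = audio_bytes(44100, 16, 1, 3600)
print(size / 1_000_000, "MB")          # 317.52 MB, matching the chart

# How many hours of such audio fit on a 10 GB disk:
print(round(10_000_000_000 / size, 1))  # 31.5 hours
```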
Since audio data occupies a lot of space, there has long been motivation to compress it. Indeed, audio compression precedes the use of digital computers and digital data transmission. Bell Laboratories carried out pioneering research on the location of the information in speech in the frequency domain so that AT&T could pack as many telephone conversations onto a single line as possible. This research showed that most of the information in speech lies between 300 and 3,000 Hz. That is why, even today, telephone circuits filter out all energy outside of this band.
Compression techniques are of two basic types: lossless and lossy. A lossless compression technique is one that yields a compressed signal from which the original signal can be reconstructed perfectly. No information is lost as a result of the compression. A lossy compression technique is one that discards information. The original signal cannot be reconstructed perfectly from a signal compressed by a lossy method.
A program or hardware device that compresses and decompresses data is known as a codec, short for "compressor - decompressor".
As processor speeds, data transmission rates, and the capacity of hard drives and other storage media have increased, the motivation for compressing audio data has decreased. For research data, there is no reason to make use of compression, certainly not of lossy compression. If you are concerned with the space taken up by your own recordings, here are some suggestions for minimizing storage while avoiding lossy compression. However, commercial audio data, such as music, is still frequently compressed in order to increase the amount that will fit on portable players and reduce the time needed for downloads.
Lossless compression techniques are not widely used because the amount of compression that they produce is relatively small. The degree of compression obtained depends on the content of the file. With speech, lossless compression reduces the size of the file at best to about 25% of its original size, at worst to about 50%. Quiet classical music compresses almost as well as speech, while "noisy" modern music tends to compress poorly, often to about 75% of its original size. At present, the main users of lossless compression appear to be fans of recordings of live concerts. Two formats, FLAC and SHN (Shorten), are especially popular.
In areas such as phonetics research, the use of lossless compression is desirable. Those generating audio data should consider using one of the lossless techniques if they are going to compress at all. Further information on lossless compression may be found here.
There are numerous lossy compression techniques, most of which are now rarely encountered. Two lossy compression techniques are of some importance: minidisc and mp3. Minidisc compression is important because minidisc recorders have been used for the collection of linguistic data. MP3 compression is important because a great deal of audio is distributed in this form.
Minidiscs are small (7cm x 7cm) storage media introduced by Sony in 1991. They provide digital data storage on devices much cheaper than digital audio tape recorders.
The audio data on minidiscs is compressed using Adaptive Transform Acoustic Coding, usually known as ATRAC. The ATRAC algorithm is described here. ATRAC coding compresses the data to about 20% of its original size with minimal loss of information. Minidisc audio is generally considered to be of "near CD quality". Because of the complexity of the compression algorithm, it is difficult to specify exactly how it distorts the signal. However, it appears that minidisc compression has no effect on most phonetic measurements. The only sort of work for which minidisc compression may be problematic is the measurement of low-level, high frequency components of the spectrum, e.g. in weak fricatives or stop bursts. A careful comparison of vowel formant measurements obtained from uncompressed speech and minidisc compressed speech by Maciej Baranowski revealed no differences.
MP3 compression is widely used for music as well as for speech streamed over the internet. MP3 is a form of MPEG compression, a standard designed for multimedia, including video as well as audio. MPEG stands for Moving Picture Experts Group, a joint working group of the International Organization for Standardization and the International Electrotechnical Commission. MP3 is an abbreviation for "MPEG, Layer 3". MPEG is actually a set of compression algorithms, for various types of data and degrees of compression. From time to time a new version of the MPEG standard is issued, containing additional algorithms so as to accommodate new input data formats and bitrates. These versions should not be confused with the layers. The various standards can be read here.
For further information on MPEG compression, see the MPEG Audio Web Page.
Audio data may come without any packaging at all. Files that contain nothing but audio data are known as raw files. They usually contain uncompressed monaural pulse code modulation (PCM) data. In order to use such files, it is necessary to know, or be able to figure out, the sampling rate, resolution, signedness, and endianness of the data.
The simpler form of packaging consists of a header. This is some information at the beginning ("head") of the file. The header will typically contain one or more bytes, known as a "magic number", identifying the file type, and basic information about the audio data, namely the sampling rate, resolution, and number of channels. The header may also identify the compression used, if any, and specify accidental aspects of the representation, e.g. whether the data is signed or unsigned. The header may also indicate the amount of data, that is, the number of bytes or samples that follow the header. Since the dominant market for audio data is the music industry, the header may contain information such as a title and performer. File formats intended for research may contain a record of the processing that the data has undergone.
The more complex forms of packaging provide a sort of tree structure, in which the file consists of "chunks" of information, each of which in turn may contain other chunks. The WAVE and AIFF file formats are of this type. Such file formats typically allow the file to contain multiple pieces of audio data, such as several songs. They may also provide for the inclusion of additional information about the audio, such as a play list. Some file formats are intended as general multimedia file formats. They therefore provide not only for audio data but for other types of data, such as video, still images, or animations.
Here we describe the most common audio file types. The discussion here will also give the reader a good general idea of the organization of audio files.
Sound files that consist of nothing but PCM audio data are called raw sound files. Some audio I/O devices, especially older devices intended for research rather than the commercial market, produce such files. They are no longer commonly seen.
Since raw sound files have no header in which to store information, it is necessary to know their sampling rate, resolution, signedness, and number of channels. Some of these parameters are occasionally encoded in the filename, but there are no truly general conventions. Filename suffixes are sometimes used to convey the resolution and signedness. For example, the suffix .sb is likely to indicate that each sample consists of one byte, that is, has a resolution of 8 bits, and is signed. The suffix .uw in this system indicates that each sample is represented by a two-byte word, that is, has a resolution of 16 bits, and is unsigned.
AU files are a good example of a simple file type consisting of a header followed by data. There are actually two kinds of AU file. The suffix .au was originally used by Sun for headerless audio files containing μ-law compressed audio sampled at 8,000 samples per second. Subsequently, the present format was adopted. The SND format on NeXT computers is the same as the AU format. An AU file consists of a header in the following format, followed by a single chunk of audio data. The numerical values in the header must be stored in big-endian format.
Here is the structure of an AU format file header:
  Bytes   Offset   Content
  4       0        Magic number: .snd
  4       4        Offset of the sound data from the beginning of the file = 24 + N
  4       8        Number of bytes of audio data
  4       12       Sound format code
  4       16       Sampling rate in samples per second
  4       20       Number of channels
  N       24       Optional text describing the data
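Building such a header is straightforward, since every field after the magic number is a 32-bit big-endian integer. This sketch writes the 24-byte fixed header, assuming format code 3 (16-bit linear PCM):

```python
import struct

def au_header(data_bytes, format_code, rate, channels, info=b""):
    """Build a minimal AU header: magic number plus five 32-bit
    big-endian integers, optionally followed by descriptive text.

    The data offset is the 24-byte fixed header plus the text length.
    """
    return struct.pack(">4s5I", b".snd",
                       24 + len(info),   # offset of the audio data
                       data_bytes,       # number of bytes of audio data
                       format_code,      # 3 = 16-bit linear PCM
                       rate,
                       channels) + info

header = au_header(data_bytes=88200, format_code=3, rate=44100, channels=1)
print(len(header))       # 24
print(header[:4])        # b'.snd'
```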
Here are the sound format codes.
What is probably the most common format in use today is the WAVE format, usually marked by the suffix .wav. WAVE files are actually a special case, for audio, of the RIFF format for multimedia files. The RIFF format is a Microsoft standard. The full specification is contained in the document Microsoft Multimedia Standards Update, Revision 3.0, April 15, 1994. A copy may be downloaded here. The RIFF format is a derivative of the Interchange Format Files format developed by Electronic Arts.
There is also a slight variant of the RIFF format known as the RIFX format. The two differ only in the endianness of integer data and in the magic number: RIFF data is required to be little-endian, while RIFX data is required to be big-endian, and the magic number is RIFX rather than RIFF. The RIFF format was developed for use with Intel processors; the RIFX format is an adaptation for Motorola processors, which have the opposite endianness.
A beautifully-illustrated explanation of the WAVE format can be found here.
Here is the structure of the simplest standard-conforming WAVE file:
  Bytes   Offset   Content
  4       0        Magic number: RIFF (or RIFX)
  4       4        RIFF chunk size = file size - 8
  4       8        Form type identifier: WAVE
  4       12       Format chunk identifier: fmt<space>
  4       16       Format chunk size: 16
  2       20       Sound format code
  2       22       Number of channels
  4       24       Sampling rate in samples per second
  4       28       Average data rate in bytes per second
  2       32       Block alignment: bytes per sample frame, across all channels
  2       34       Bits per sample
  4       36       Data chunk identifier: data
  4       40       Data chunk length in bytes: N
Here are the sound format codes.
One occasionally encounters simpler files that do not conform to the standard. In these, the file header is immediately followed by the audio data. The data chunk identifier and chunk length are missing. Such files can be converted into the standard format by RepairWave.
It is possible for standard-conforming WAVE files to be more complex. They may contain multiple data chunks, and they may contain chunks of other types, such as play lists, cue lists, padding (to cause the audio data to start at a specific location), and text containing information about the origin of the file and the processing it has undergone. It is permissible for WAVE files to include non-standard chunk types. Standard-conforming software will simply skip over chunks that it does not know how to handle.
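Putting the pieces together, a minimal standard-conforming WAVE file can be produced by packing the canonical 44-byte header in front of the sample data. This is a sketch for 16-bit signed mono PCM (sound format code 1 = uncompressed PCM), with every integer little-endian as RIFF requires:

```python
import struct

def minimal_wav(samples, rate=22050):
    """Pack 16-bit signed mono samples into a minimal WAVE file.

    Sketch of the canonical 44-byte header: RIFF chunk, 16-byte
    fmt chunk (format code 1 = PCM), then a data chunk.
    """
    data = struct.pack("<%dh" % len(samples), *samples)
    channels, bits = 1, 16
    block_align = channels * bits // 8
    byte_rate = rate * block_align
    header = struct.pack("<4sI4s4sIHHIIHH4sI",
                         b"RIFF", 36 + len(data), b"WAVE",
                         b"fmt ", 16, 1, channels,
                         rate, byte_rate, block_align, bits,
                         b"data", len(data))
    return header + data

wav = minimal_wav([0, 1000, -1000, 0])
print(len(wav))              # 44-byte header + 8 bytes of data = 52
print(wav[:4], wav[8:12])    # b'RIFF' b'WAVE'
```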
AIFF format is widely used on Apple computers and, as a result, in professional audio processing software. Like RIFF/WAVE, AIFF ("Audio Interchange File Format") is a derivative of the Interchange Format Files format developed by Electronic Arts. It is simpler than RIFF/WAVE format in that it is intended only for audio data and supports a smaller range of audio data formats.
An AIFF file consists of a header followed by one or more "chunks". A minimal AIFF sound file therefore consists of a header and a sound chunk. In addition to sound chunks, a variety of other chunks are possible, including markers of positions in the waveform data, MIDI synthesizer commands, and comments.
The audio data in an AIFF file is always uncompressed PCM. The header contains information about the number of channels, sampling rate, and resolution. The audio data, like all integer data in this format, is stored in big-endian format.
Further information can be found here.
MP3 files are characterized primarily by their compression, as described above. However, unlike some other types of compressed data, which may be packaged in a variety of ways, MP3 compressed data is usually packaged in a particular way. It is of course possible to embed data in MP3 format, that is, complete with frame headers, within another kind of package. For example, a WAVE file may contain MP3 data. In this case, the data chunk consists of MP3 compressed audio plus frame headers. The fact that the data consists of MPEG data is indicated by the value 80 for the data format code.
An MP3 file consists of a set of data frames. Each data frame begins with a four byte header. The header may be followed by two bytes of Cyclic Redundancy Check data (for error detection/correction), flagged by the setting of bit 16 of the header. The remainder of the frame contains the audio data. In a pure MP3 file, there is no overall file header, though there may be one if the MP3 data is embedded within another package.
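The fixed fields of the frame header can be picked apart with a little bit twiddling. This sketch checks the 11-bit sync pattern and reads the version, layer, and protection bits; the example header bytes are a typical MPEG-1 Layer III frame with no CRC:

```python
def parse_mp3_frame_header(b):
    """Decode the first fields of a 4-byte MP3 frame header.

    Bits 31-21: frame sync (all ones); bits 20-19: MPEG version;
    bits 18-17: layer; bit 16: protection bit (0 means a 16-bit
    CRC follows the header).
    """
    h = int.from_bytes(b, "big")
    assert (h >> 21) & 0x7FF == 0x7FF, "not a frame sync"
    version_bits = (h >> 19) & 0x3     # 3 = MPEG-1
    layer_bits = (h >> 17) & 0x3       # 1 = Layer III
    crc_present = not (h >> 16) & 0x1  # protection bit clear -> CRC present
    return version_bits, layer_bits, crc_present

print(parse_mp3_frame_header(bytes([0xFF, 0xFB, 0x90, 0x00])))
# (3, 1, False): MPEG-1, Layer III, no CRC
```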
A good, detailed explanation of MP3 file format may be found here.
Ogg Vorbis is a new audio format intended primarily for music. It is roughly comparable to MP3 format, but is non-proprietary and open-source. Ogg Vorbis consists of a compression algorithm and a file format. The compression algorithm is a perceptual compression algorithm, like ATRAC and MP3. It is reported to sound better than MP3 at lower bit rates.
The file format is a little different from the others we have discussed. An Ogg file or bitstream must begin with three header packets: an identification header, a comment header, and a setup header.
The ogg format specification can be read on-line here or downloaded as a PDF file here. Further information and software can be obtained from the Ogg Vorbis Project.
The extension ram or ra is commonly used for Real Audio files. These are used for streaming audio. A RAM file is a plain text file each line of which contains the URL of an audio file. The URL contains not only the location of the audio file but parameters to be passed to the program that plays the streaming audio. There is no particular RAM audio format. The audio files to which the URLs point may be in any format supported by the player, e.g. MP3. Information about Real Audio can be found here.
If you need to download the audio files listed in a RAM file, you may find curl useful. In some cases, it is problematic to download the audio files. In this case, you can play the streaming audio and capture it using vsound.
sox can convert most common audio formats. (Be sure to get a current version of Sox. There is at least one outdated, formerly official Sox web page still up [http://www.spies.com/Sox/]. The current web site is the Sourceforge site.) A tutorial on the use of sox can be found here.
One format that sox does not handle is MPEG. You can use the MPEG player mpg123 to convert MPEG to raw format (16 bit stereo linear PCM, native byte order) by using the -s command line flag to send the output to the standard output instead of the sound card. You can then use sox to convert the raw data to another format. For instance, this sequence of commands will convert an MP3 with a sampling rate of 22,050 samples per second to WAVE format:
mpg123 -s foo.mp3 > foo.raw
sox -w -s -r 22050 -c 1 foo.raw foo.wav
Another format that sox does not handle is ogg format. The program oggdec, part of the vorbistools package downloadable from the Ogg Vorbis Project website, converts ogg format to WAVE or raw PCM.
A similar tool is sndfile-convert, which is part of the libsndfile package. libsndfile is a library that allows C programmers to read and write a variety of audio file formats. sndfile-convert is provided as a demonstration of the use of the library. It is found in the examples directory of the libsndfile distribution. It does not have all of the functions of sox, such as sound effects, but has a simpler syntax. It also handles some data formats that sox does not, such as floating point data.
Another useful tool is ffmpeg. It is aimed primarily at converting video formats, but since video files usually contain audio as well, it can also convert a variety of audio formats, especially formats used for commercial purposes and not normally encountered in linguistic research.
The audio stream can be extracted from RealMedia video files (extension usually .rm) using RealMedia Analyzer, which is available for GNU/Linux, OS/2, MS Windows, and DOS.
A more general solution is available on GNU/Linux systems for any situation in which you have the ability to play an audio file but have no program that can extract it. This situation arises when a proprietary codec is available only in binary form or only as part of a closed-source program. vsound deals with this situation by intercepting calls to open /dev/dsp, the device file generally used as the interface to the sound card, and substituting a normal file for the sound card. The result is that the decoded audio data that would have been sent to the sound card is instead written into a file.
As discussed above, the file name suffix often provides information about the format of an audio file. The file utility recognizes many audio file formats. (Note that the file program provided with some versions of Unix, such as SunOS, is inferior to this one.) The sndfile-info program, provided as a demonstration of the use of the libsndfile library, will provide detailed information about sound files in a variety of formats. InfoWave provides a detailed description of the content of WAVE files.