MPEG
(Taken from the Compression FAQ; written by Mark Adler around
January 1992; modified and updated by Harald Popp in March 94.)
Q: What is MPEG, exactly?
A: MPEG is the "Moving Picture Experts Group", working under the
joint direction of the International Organization for Standardization (ISO)
and the International Electro-Technical Commission (IEC). This
group works on standards for the coding of moving pictures and
associated audio.
Q: What is the status of MPEG's work, then? What about MPEG-1, -2,
and so on?
A: MPEG approaches the growing need for multimedia standards step-by-
step. Today, three "phases" are defined:
MPEG-1: "Coding of Moving Pictures and Associated Audio for
Digital Storage Media at up to about 1.5 MBit/s"
Status: International Standard IS-11172, completed in 10.92
MPEG-2: "Generic Coding of Moving Pictures and Associated Audio"
Status: Committee Draft CD 13818 as found in documents MPEG93 /
N601, N602, N603 (11.93)
MPEG-3: no longer exists (has been merged into MPEG-2)
MPEG-4: "Very Low Bitrate Audio-Visual Coding"
Status: Call for Proposals 11.94, Working Draft in 11.96
Q: MPEG-1 is ready for use. What does the standard look like?
A: MPEG-1 consists of 4 parts:
IS 11172-1: System
describes synchronization and multiplexing of video and audio
IS 11172-2: Video
describes compression of non-interlaced video signals
IS 11172-3: Audio
describes compression of audio signals
CD 11172-4: Compliance Testing
describes procedures for determining the characteristics of coded
bitstreams and the decoding process, and for testing compliance
with the requirements stated in the other parts
Q. Does MPEG have anything to do with JPEG?
A. Well, it sounds the same, and they are part of the same
subcommittee of ISO along with JBIG and MHEG, and they usually meet
at the same place at the same time. However, they are different
sets of people with few or no common individual members, and they
have different charters and requirements. JPEG is for still image
compression.
Q. Then what's JBIG and MHEG?
A. Sorry I mentioned them. Ok, I'll simply say that JBIG is for binary
image compression (like faxes), and MHEG is for multi-media data
standards (like integrating stills, video, audio, text, etc.).
For an introduction to JBIG, see question 74 below.
Q. So how does MPEG-1 work? Tell me about video coding!
A. First off, it starts with a relatively low resolution video
sequence (possibly decimated from the original) of about 352 by
240 pixels at 30 frames/s (US--different numbers for Europe),
but original high (CD) quality audio. The images are in color,
but converted to YUV space, and the two chrominance channels
(U and V) are decimated further to 176 by 120 pixels. It turns
out that you can get away with a lot less resolution in those
channels and not notice it, at least in "natural" (not computer
generated) images.
The basic scheme is to predict motion from frame to frame in the
temporal direction, and then to use DCT's (discrete cosine
transforms) to organize the redundancy in the spatial directions.
The DCT's are done on 8x8 blocks, and the motion prediction is
done in the luminance (Y) channel on 16x16 blocks. In other words,
given the 16x16 block in the current frame that you are trying to
code, you look for a close match to that block in a previous or
future frame (there are backward prediction modes where later
frames are sent first to allow interpolating between frames).
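To make the block-matching idea concrete, here is a rough sketch in
Python with numpy (my illustration; the standard does not prescribe
any particular search strategy, and the +/-8 pixel window and SAD
cost here are just choices for the example):

    import numpy as np

    def motion_search(cur, ref, by, bx, search=8):
        # Find the best 16x16 match for the block at (by, bx) of the
        # current frame within +/-search pixels in the reference
        # frame.  Returns (dy, dx, sad), minimizing the sum of
        # absolute differences.
        block = cur[by:by+16, bx:bx+16].astype(np.int32)
        best = (0, 0, np.inf)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if (y < 0 or x < 0 or
                        y + 16 > ref.shape[0] or x + 16 > ref.shape[1]):
                    continue  # candidate falls outside the frame
                cand = ref[y:y+16, x:x+16].astype(np.int32)
                sad = int(np.abs(block - cand).sum())
                if sad < best[2]:
                    best = (dy, dx, sad)
        return best

A real encoder uses much faster search strategies than this full
search, but the output (a motion vector plus a residual block) is
the same kind of thing.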
The DCT coefficients (of either the actual data, or the difference
between this block and the close match) are "quantized", which
means that you divide them by some value to drop bits off the
bottom end. Hopefully, many of the coefficients will then end up
being zero. The quantization can change for every "macroblock"
(a macroblock is 16x16 of Y and the corresponding 8x8's in both
U and V). The results of all of this, which include the DCT
coefficients, the motion vectors, and the quantization parameters
(and other stuff) is Huffman coded using fixed tables. The DCT
coefficients have a special Huffman table that is "two-dimensional"
in that one code specifies a run-length of zeros and the non-zero
value that ended the run. Also, the motion vectors and the DC
DCT components are DPCM (subtracted from the last one) coded.
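Here is a sketch of the quantization and run-length step in Python
(numpy and scipy are my tool choices; a real coder uses the
standard's quantization matrices and zigzag scan order, which are
simplified away here):

    import numpy as np
    from scipy.fftpack import dct

    def quantize_block(block, qscale=16):
        # 2-D DCT of an 8x8 block, then uniform quantization.  MPEG-1
        # actually divides by a per-coefficient quantization matrix;
        # a single flat qscale is a simplification for this sketch.
        coeffs = dct(dct(block.astype(float), axis=0, norm='ortho'),
                     axis=1, norm='ortho')
        return np.round(coeffs / qscale).astype(int)

    def run_levels(q):
        # Emit (zero_run, value) pairs -- the events that the
        # "two-dimensional" Huffman table codes.  MPEG scans the
        # block in zigzag order; row-major order is used here for
        # brevity.
        out, run = [], 0
        for v in q.flatten():
            if v == 0:
                run += 1
            else:
                out.append((run, v))
                run = 0
        return out

With a large enough qscale, most coefficients quantize to zero,
which is exactly what makes the run-length/Huffman stage effective.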
Q. So is each frame predicted from the last frame?
A. No. The scheme is a little more complicated than that. There are
three types of coded frames. There are "I" or intra frames. They
are simply a frame coded as a still image, not using any past
history. You have to start somewhere. Then there are "P" or
predicted frames. They are predicted from the most recently
reconstructed I or P frame. (I'm describing this from the point
of view of the decompressor.) Each macroblock in a P frame can
either come with a vector and difference DCT coefficients for a
close match in the last I or P, or it can just be "intra" coded
(like in the I frames) if there was no good match.
Lastly, there are "B" or bidirectional frames. They are predicted
from the closest two I or P frames, one in the past and one in the
future. You search for matching blocks in those frames, and try
three different things to see which works best. (Now I've switched
to the point of view of the compressor, just to confuse you.) You try
using the forward vector, the backward vector, and you try
averaging the two blocks from the future and past frames, and
subtracting that from the block being coded. If none of those work
well, you can intracode the block.
The sequence of decoded frames usually goes like:
IBBPBBPBBPBBIBBPBBPB...
where there are 12 frames from I to I (for US and Japan, anyway).
This is based on a random access requirement that you need a
starting point at least once every 0.4 seconds or so. The ratio
of P's to B's is based on experience.
Of course, for the decoder to work, you have to send that first
P *before* the first two B's, so the compressed data stream ends
up looking like:
0xx312645...
where those are frame numbers. xx might be nothing (if this is
the true starting point), or it might be the B's of frames -2 and
-1 if we're in the middle of the stream somewhere.
You have to decode the I, then decode the P, keep both of those
in memory, and then decode the two B's. You probably display the
I while you're decoding the P, and display the B's as you're
decoding them, and then display the P as you're decoding the next
P, and so on.
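A little sketch of that reordering, assuming a fixed IBBP pattern
(the pattern itself is an encoder choice, not mandated by the
standard):

    def coded_order(n_frames, gop="IBBPBBPBBPBB"):
        # Turn display-order frame numbers into transmission order:
        # every I or P is sent before the B frames predicted from it.
        pending = []            # B's waiting for their future reference
        for i in range(n_frames):
            ftype = gop[i % len(gop)]
            if ftype == 'B':
                pending.append(i)
            else:
                yield (i, ftype)    # send the reference first...
                for b in pending:   # ...then the B's that needed it
                    yield (b, 'B')
                pending = []

    # For 7 frames this prints [0, 3, 1, 2, 6, 4, 5], matching the
    # 0xx312645 example above (with xx empty).
    print([f for f, t in coded_order(7)])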
Q. You've got to be kidding.
A. No, really!
Q. Hmm. Where did they get 352x240?
A. That derives from the CCIR-601 digital television standard which
is used by professional digital video equipment. It is (in the US)
720 by 243 by 60 fields (not frames) per second, where the fields
are interlaced when displayed. (It is important to note though
that fields are actually acquired and displayed a 60th of a second
apart.) The chrominance channels are 360 by 243 by 60 fields a
second, again interlaced. This degree of chrominance decimation
(2:1 in the horizontal direction) is called 4:2:2. The source
input format for MPEG-1, called SIF, is CCIR-601 decimated by 2:1
in the horizontal direction, 2:1 in the time direction, and an
additional 2:1 in the chrominance vertical direction. And some
lines are cut off to make sure things divide by 8 or 16 where
needed.
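The arithmetic, spelled out in code form (this is just the
derivation above; the crops to multiples of 16 are what give
352x240):

    # CCIR-601 (US) luminance: 720 samples x 243 lines x 60 fields/s.
    # SIF: 2:1 horizontal, 2:1 temporal (keep one field per frame),
    # plus an extra 2:1 vertical for chrominance; then crop so the
    # dimensions divide by 16 (whole macroblocks).
    luma_w = 720 // 2             # 360, cropped to 352 = 22 * 16
    luma_h = 243                  # one field, cropped to 240 = 15 * 16
    rate   = 60 // 2              # 30 frames/s
    chroma = (352 // 2, 240 // 2) # (176, 120)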
Q. What if I'm in Europe?
A. For 50 Hz display standards (PAL, SECAM) change the number of lines
in a field from 243 or 240 to 288, and change the display rate to
50 fields/s or 25 frames/s. Similarly, change the 120 lines in
the decimated chrominance channels to 144 lines. Since 288*50 is
exactly equal to 240*60, the two formats have the same source data
rate.
Q. What will MPEG-2 do for video coding?
A. As I said, there is a considerable loss of quality in going from
CCIR-601 to SIF resolution. For entertainment video, it's simply
not acceptable. You want to use more bits and code all or almost
all the CCIR-601 data. From subjective testing at the Japan
meeting in November 1991, it seems that 4 MBits/s can give very
good quality compared to the original CCIR-601 material. The
objective of MPEG-2 is to define a bit stream optimized for
these resolutions and bit rates.
Q. Why not just scale up what you're doing with MPEG-1?
A. The main difficulty is the interlacing. The simplest way to extend
MPEG-1 to interlaced material is to put the fields together into
frames (720x486x30/s). This results in bad motion artifacts that
stem from the fact that moving objects are in different places
in the two fields, and so don't line up in the frames. Compressing
and decompressing without taking that into account somehow tends to
muddle the objects in the two different fields.
The other thing you might try is to code the even and odd field
streams separately. This avoids the motion artifacts, but as you
might imagine, doesn't get very good compression since you are not
using the redundancy between the even and odd fields where there
is not much motion (which is typically most of the image).
Or you can code it as a single stream of fields. Or you can
interpolate lines. Or, etc. etc. There are many things you can
try, and the point of MPEG-2 is to figure out what works well.
MPEG-2 is not limited to considering only derivations of MPEG-1.
There were several non-MPEG-1-like schemes in the competition in
November, and some aspects of those algorithms may or may not
make it into the final standard for entertainment video
compression.
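The frame-assembly problem is easy to see in code. A minimal weave
of two fields into one frame (numpy again; my own sketch, not
anything from the standard):

    import numpy as np

    def weave(field_even, field_odd):
        # Interleave two fields, captured 1/60 s apart, into a frame.
        # Anything that moved between the two field times ends up
        # misaligned on alternate lines ("combing"), which is what
        # hurts naive frame-based compression.
        h, w = field_even.shape
        frame = np.empty((2 * h, w), field_even.dtype)
        frame[0::2] = field_even    # lines 0, 2, 4, ...
        frame[1::2] = field_odd     # lines 1, 3, 5, ...
        return frame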
Q. So what works?
A. Basically, derivations of MPEG-1 worked quite well, as did one
scheme that used wavelet subband coding instead of DCT's. Also
among the worked-very-well's was a scheme that did not
use B frames at all, just I and P's. All of them, except maybe
one, did some sort of adaptive frame/field coding, where a decision
is made on a macroblock basis as to whether to code that one as
one frame macroblock or as two field macroblocks. Another open
aspect is how to code I-frames--some suggest predicting the even
field from the odd field. Or you can predict evens from evens and
odds or odds from evens and odds or any field from any other field,
etc.
Q. So what works best?
A. Ok, we're not really sure what works best yet. The next step is
to define a "test model" to start from, that incorporates most of
the salient features of the worked-very-well proposals in a
simple way. Then experiments will be done on that test model,
making one mod at a time, and seeing what makes it better and what
makes it worse. Example experiments are: B's or no B's, DCT vs.
wavelets, various field prediction modes, etc. The requirements,
such as implementation cost, quality, random access, etc. will all
feed into this process as well.
Q. When will all this be finished?
A. I don't know. I'd hope for about a year or less.
Q: Talking about MPEG audio coding, I heard a lot about "Layer 1, 2
and 3". What does it mean, exactly?
A: MPEG-1, IS 11172-3, describes the compression of audio signals
using high performance perceptual coding schemes. It specifies a
family of three audio coding schemes, simply called Layer-1,-2,-3,
with increasing encoder complexity and performance (sound quality
per bitrate). The three codecs are compatible in a hierarchical
way, i.e. a Layer-N decoder is able to decode bitstream data
encoded in Layer-N and all Layers below N (e.g., a Layer-3
decoder may accept Layer-1,-2 and -3, whereas a Layer-2 decoder
may accept only Layer-1 and -2.)
Q: So we have a family of three audio coding schemes. What does the
MPEG standard define, exactly?
A: For each Layer, the standard specifies the bitstream format and
the decoder. To allow for future improvements, it does *not*
specify the encoder, but an informative chapter gives an example
encoder for each Layer.
Q: What have the three audio Layers in common?
A: All Layers use the same basic structure. The coding scheme can be
described as "perceptual noise shaping" or "perceptual subband /
transform coding".
The encoder analyzes the spectral components of the audio signal
by calculating a filterbank or transform and applies a
psychoacoustic model to estimate the just noticeable noise-
level. In its quantization and coding stage, the encoder tries
to allocate the available number of data bits in a way to meet
both the bitrate and masking requirements.
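As a sketch of the idea only (this is not any Layer's actual
allocation algorithm): repeatedly give one more bit of quantizer
resolution to whichever subband currently has the worst
noise-to-mask ratio, until the frame's bit budget is spent.

    import numpy as np

    def allocate_bits(signal_db, mask_db, budget, n_bands=32):
        # signal_db: signal level per subband (array of 32 values);
        # mask_db: masking threshold per subband, as estimated by
        # the psychoacoustic model.  Each extra bit of resolution
        # lowers the quantization noise by about 6 dB.
        bits = np.zeros(n_bands, dtype=int)
        for _ in range(budget):
            noise_db = signal_db - 6.02 * bits  # rough noise floor
            nmr = noise_db - mask_db            # noise-to-mask ratio
            bits[int(np.argmax(nmr))] += 1      # help the worst band
        return bits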
The decoder is much less complex. Its only task is to synthesize
an audio signal out of the coded spectral components.
All Layers use the same analysis filterbank (polyphase with 32
subbands). Layer-3 adds an MDCT to increase the frequency
resolution.
All Layers use the same "header information" in their bitstream,
to support the hierarchical structure of the standard.
All Layers use a bitstream structure that contains parts that are
more sensitive to biterrors ("header", "bit allocation",
"scalefactors", "side information") and parts that are less
sensitive ("data of spectral components").
All Layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:
Layer-1: from 32 kbps to 448 kbps
Layer-2: from 32 kbps to 384 kbps
Layer-3: from 32 kbps to 320 kbps
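Concretely, the shared header carries the Layer, the bitrate index
and the sampling frequency, among other fields. A sketch of parsing
the common fixed fields follows (the per-Layer bitrate tables are
omitted; treat this as an illustration, not a reference
implementation):

    def parse_audio_header(data):
        # data: the first 4 bytes of an MPEG-1 audio frame.
        h = int.from_bytes(data[:4], 'big')
        if (h >> 20) != 0xFFF:           # 12-bit syncword, all ones
            raise ValueError('no syncword')
        layer = 4 - ((h >> 17) & 0x3)    # bits 11 -> Layer-1, ...,
                                         # 01 -> Layer-3 (00 reserved)
        bitrate_index = (h >> 12) & 0xF  # index into a per-Layer table
        srate = {0: 44100, 1: 48000,
                 2: 32000}[(h >> 10) & 0x3]  # value 3 is reserved
        return layer, bitrate_index, srate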
Q: What are the main differences between the three Layers, from a
global view?
A: From Layer-1 to Layer-3,
complexity increases (mainly true for the encoder),
overall codec delay increases, and
performance increases (sound quality per bitrate).
Q: Which Layer should I use for my application?
A: Good question. Of course, it depends on all your requirements.
But as a first approach, you should consider the available bitrate
of your application, as the Layers have been designed to support
certain ranges of bitrates most efficiently, i.e. with a minimum
drop in sound quality.
Let us look a little closer at the strong domains of each Layer.
Layer-1: Its ISO target bitrate is 192 kbps per audio channel.
Layer-1 is a simplified version of Layer-2. It is most useful at
"high" bitrates around or above 192 kbps. A version of Layer-1 is
used as "PASC" in the DCC recorder.
Layer-2: Its ISO target bitrate is 128 kbps per audio channel.
Layer-2 is identical with MUSICAM. It has been designed as a
trade-off between sound quality per bitrate and encoder
complexity. It is most useful at "medium" bitrates of 128 or even
96 kbps per audio channel. The DAB (EU 147) proponents have
decided to use Layer-2 in the future Digital Audio Broadcasting
network.
Layer-3: Its ISO target bitrate is 64 kbps per audio channel.
Layer-3 merges the best ideas of MUSICAM and ASPEC. It has been
designed for best performance at "low" bitrates around 64 kbps or
even below. The Layer-3 format specifies a set of advanced
features that all address one goal: to preserve as much sound
quality as possible even at rather low bitrates. Today, Layer-3 is
already in use in various telecommunication networks (ISDN,
satellite links, and so on) and speech announcement systems.
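Purely as a rule of thumb reflecting those ISO target bitrates (the
cut-off points below are my own illustration, not part of any
standard):

    def suggest_layer(kbps_per_channel):
        # Crude first guess from the target bitrates above; a real
        # design also weighs encoder complexity, delay, and so on.
        if kbps_per_channel >= 160:
            return 1    # high bitrates: Layer-1, simplest codec
        if kbps_per_channel >= 96:
            return 2    # medium bitrates: Layer-2
        return 3        # 64 kbps and below: Layer-3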
Q: Tell me more about sound quality. How do you assess that?
A: Today, there is no alternative to expensive listening tests.
During the ISO-MPEG-1 process, three international listening
tests were performed with many trained listeners, supervised by
Swedish Radio. They took place in 7.90, 3.91 and 11.91. Another
international listening test was performed by CCIR, now ITU-R, in
92.
All these tests used the "triple stimulus, hidden reference"
method and the CCIR impairment scale to assess the audio quality.
The listening sequence is "ABC", with A = the original and B, C =
the original and the coded signal in random order; the listener
has to grade both B and C with a number between 1.0 and 5.0. The
meaning of these values is:
5.0 = transparent (this should be the original signal)
4.0 = perceptible, but not annoying (first differences noticeable)
3.0 = slightly annoying
2.0 = annoying
1.0 = very annoying
With perceptual codecs (like MPEG audio), traditional quality
parameters (like SNR, THD+N, bandwidth) are largely useless.
Fraunhofer-IIS works on objective quality assessment tools, like
the NMR meter (Noise-to-Mask-Ratio), too. BTW: If you need more
information about NMR, please contact nmr@iis.fhg.de.
Q: Now that I know how to assess quality, come on, tell me the
results of these tests.
A: Well, for low bitrates, the main result is that at 60 or 64 kbps
per channel, Layer-2 always scored between 2.1 and 2.6, whereas
Layer-3 scored between 3.6 and 3.8. This is a significant increase
in sound quality, indeed! Furthermore, the selection process for
critical sound material showed that it was rather difficult to
find worst-case material for Layer-3 whereas it was not so hard to
find such items for Layer-2.
Q: OK, a Layer-2 codec at low bitrates may sound poor today, but
couldn't that be improved in the future? I guess you just told me
before that the encoder is not fixed in the standard.
A: Good thinking! As the sound quality mainly depends on the encoder
implementation, it is true that there is no such thing as a
"Layer-N" quality. So we really only know the performance of the
reference codecs during the international tests. Who knows what
will happen in the future? What we do know now, is:
Today, Layer-3 already provides a sound quality that comes very
near to CD quality at 64 kbps per channel. Layer-2 is far away
from that.
Tomorrow, both Layers may improve. Layer-2 has been designed as a
trade-off between quality and complexity, so the bitstream format
allows only limited innovations. In contrast, even the current
reference Layer-3-codec exploits only a small part of the powerful
mechanisms inside the Layer-3 bitstream format.
Q: All in all, you sound as if anybody should use Layer-3 for low
bitrates. Why on earth do some vendors still offer only Layer-2
equipment for these applications?
A: Well, maybe because they started to design and develop their
system rather early, e.g. in 1990. As Layer-2 is identical with
MUSICAM, it has been available since the summer of 90 at the
latest. In that year, Layer-3 development started; it was
successfully finished in spring 92. So, for a certain time,
vendors could only
exploit the existing part of the new MPEG standard.
Now the situation has changed. All Layers are available, the
standard is completed, and new systems need not limit themselves,
but may capitalize on the full features of MPEG audio.
Q: How do I get the MPEG documents?
A: You may order them from your national standards body.
E.g., in Germany, please contact:
DIN-Beuth Verlag, Auslandsnormen
Mrs. Niehoff, Burggrafenstr. 6, D-10772 Berlin, Germany
Phone: 030-2601-2757, Fax: 030-2601-1231
E.g., in the USA, you may order them from ANSI [phone
(212) 642-4900] or buy them from companies like OMNICOM
[phone +44 438 742424, FAX +44 438 740154].
Q. How do I join MPEG?
A. You don't join MPEG. You have to participate in ISO as part of a
national delegation. How you get to be part of the national
delegation is up to each nation. I only know the U.S., where you
have to attend the corresponding ANSI meetings to be able to
attend the ISO meetings. Your company or institution has to be
willing to sink some bucks into travel since, naturally, these
meetings are held all over the world. (For example, Paris,
Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de
Janeiro, London, etc.)