MPEG
(Taken from the Compression FAQ; written by Mark Adler around
January 1992; modified and updated by Harald Popp in March 94.)
Q: What is MPEG, exactly?
A: MPEG is the "Moving Picture Experts Group", working under the
joint direction of the International Organization for Standardization (ISO)
and the International Electro-Technical Commission (IEC). This
group works on standards for the coding of moving pictures and
associated audio.
Q: What is the status of MPEG's work, then? What about MPEG-1, -2,
and so on?
A: MPEG approaches the growing need for multimedia standards step-by-
step. Today, three "phases" are defined:
MPEG-1: "Coding of Moving Pictures and Associated Audio for
Digital Storage Media at up to about 1.5 MBit/s"
Status: International Standard IS-11172, completed in 10.92
MPEG-2: "Generic Coding of Moving Pictures and Associated Audio"
Status: Committee Draft CD 13818 as found in documents MPEG93 /
N601, N602, N603 (11.93)
MPEG-3: no longer exists (has been merged into MPEG-2)
MPEG-4: "Very Low Bitrate Audio-Visual Coding"
Status: Call for Proposals 11.94, Working Draft in 11.96
Q: MPEG-1 is ready for use. What does the standard look like?
A: MPEG-1 consists of 4 parts:
IS 11172-1: System
describes synchronization and multiplexing of video and audio
IS 11172-2: Video
describes compression of non-interlaced video signals
IS 11172-3: Audio
describes compression of audio signals
CD 11172-4: Compliance Testing
describes procedures for determining the characteristics of coded
bitstreams and the decoding process, and for testing compliance
with the requirements stated in the other parts
Q. Does MPEG have anything to do with JPEG?
A. Well, it sounds the same, and they are part of the same
subcommittee of ISO along with JBIG and MHEG, and they usually meet
at the same place at the same time. However, they are different
sets of people with few or no common individual members, and they
have different charters and requirements. JPEG is for still image
compression.
Q. Then what's JBIG and MHEG?
A. Sorry I mentioned them. Ok, I'll simply say that JBIG is for binary
image compression (like faxes), and MHEG is for multi-media data
standards (like integrating stills, video, audio, text, etc.).
For an introduction to JBIG, see question 74 below.
Q. So how does MPEG-1 work? Tell me about video coding!
A. First off, it starts with a relatively low resolution video
sequence (possibly decimated from the original) of about 352 by
240 pixels at 30 frames/s (US--different numbers for Europe),
but original high (CD) quality audio. The images are in color,
but converted to YUV space, and the two chrominance channels
(U and V) are decimated further to 176 by 120 pixels. It turns
out that you can get away with a lot less resolution in those
channels and not notice it, at least in "natural" (not computer
generated) images.
The basic scheme is to predict motion from frame to frame in the
temporal direction, and then to use DCT's (discrete cosine
transforms) to organize the redundancy in the spatial directions.
The DCT's are done on 8x8 blocks, and the motion prediction is
done in the luminance (Y) channel on 16x16 blocks. In other words,
given the 16x16 block in the current frame that you are trying to
code, you look for a close match to that block in a previous or
future frame (there are backward prediction modes where later
frames are sent first to allow interpolating between frames).
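To make the block-matching idea concrete, here is a rough sketch in
Python with numpy (my illustration; the standard does not prescribe
any particular search strategy, and the +/-8 pixel window and SAD
cost here are just choices for the example):

    import numpy as np

    def motion_search(cur, ref, by, bx, search=8):
        # Find the best 16x16 match for the block at (by, bx) of the
        # current frame within +/-search pixels in the reference
        # frame.  Returns (dy, dx, sad), minimizing the sum of
        # absolute differences.
        block = cur[by:by+16, bx:bx+16].astype(np.int32)
        best = (0, 0, np.inf)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = by + dy, bx + dx
                if (y < 0 or x < 0 or
                        y + 16 > ref.shape[0] or x + 16 > ref.shape[1]):
                    continue  # candidate falls outside the frame
                cand = ref[y:y+16, x:x+16].astype(np.int32)
                sad = int(np.abs(block - cand).sum())
                if sad < best[2]:
                    best = (dy, dx, sad)
        return best

A real encoder uses much faster search strategies than this full
search, but the output (a motion vector plus a residual block) is
the same kind of thing.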
The DCT coefficients (of either the actual data, or the difference
between this block and the close match) are "quantized", which
means that you divide them by some value to drop bits off the
bottom end. Hopefully, many of the coefficients will then end up
being zero. The quantization can change for every "macroblock"
(a macroblock is 16x16 of Y and the corresponding 8x8's in both
U and V). The results of all of this, which include the DCT
coefficients, the motion vectors, and the quantization parameters
(and other stuff) is Huffman coded using fixed tables. The DCT
coefficients have a special Huffman table that is "two-dimensional"
in that one code specifies a run-length of zeros and the non-zero
value that ended the run. Also, the motion vectors and the DC
DCT components are DPCM (subtracted from the last one) coded.
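Here is a sketch of the quantization and run-length step in Python
(numpy and scipy are my tool choices; a real coder uses the
standard's quantization matrices and zigzag scan order, which are
simplified away here):

    import numpy as np
    from scipy.fftpack import dct

    def quantize_block(block, qscale=16):
        # 2-D DCT of an 8x8 block, then uniform quantization.  MPEG-1
        # actually divides by a per-coefficient quantization matrix;
        # a single flat qscale is a simplification for this sketch.
        coeffs = dct(dct(block.astype(float), axis=0, norm='ortho'),
                     axis=1, norm='ortho')
        return np.round(coeffs / qscale).astype(int)

    def run_levels(q):
        # Emit (zero_run, value) pairs -- the events that the
        # "two-dimensional" Huffman table codes.  MPEG scans the
        # block in zigzag order; row-major order is used here for
        # brevity.
        out, run = [], 0
        for v in q.flatten():
            if v == 0:
                run += 1
            else:
                out.append((run, v))
                run = 0
        return out

With a large enough qscale, most coefficients quantize to zero,
which is exactly what makes the run-length/Huffman stage effective.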
Q. So is each frame predicted from the last frame?
A. No. The scheme is a little more complicated than that. There are
three types of coded frames. There are "I" or intra frames. They
are simply a frame coded as a still image, not using any past
history. You have to start somewhere. Then there are "P" or
predicted frames. They are predicted from the most recently
reconstructed I or P frame. (I'm describing this from the point
of view of the decompressor.) Each macroblock in a P frame can
either come with a vector and difference DCT coefficients for a
close match in the last I or P, or it can just be "intra" coded
(like in the I frames) if there was no good match.
Lastly, there are "B" or bidirectional frames. They are predicted
from the closest two I or P frames, one in the past and one in the
future. You search for matching blocks in those frames, and try
three different things to see which works best. (Now I've switched
to the point of view of the compressor, just to confuse you.) You try
using the forward vector, the backward vector, and you try
averaging the two blocks from the future and past frames, and
subtracting that from the block being coded. If none of those work
well, you can intracode the block.
The sequence of decoded frames usually goes like:
IBBPBBPBBPBBIBBPBBPB...
where there are 12 frames from I to I (for US and Japan, anyway).
This is based on a random access requirement that you need a
starting point at least once every 0.4 seconds or so. The ratio
of P's to B's is based on experience.
Of course, for the decoder to work, you have to send that first
P *before* the first two B's, so the compressed data stream ends
up looking like:
0xx312645...
where those are frame numbers. xx might be nothing (if this is
the true starting point), or it might be the B's of frames -2 and
-1 if we're in the middle of the stream somewhere.
You have to decode the I, then decode the P, keep both of those
in memory, and then decode the two B's. You probably display the
I while you're decoding the P, and display the B's as you're
decoding them, and then display the P as you're decoding the next
P, and so on.
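A little sketch of that reordering, assuming a fixed IBBP pattern
(the pattern itself is an encoder choice, not mandated by the
standard):

    def coded_order(n_frames, gop="IBBPBBPBBPBB"):
        # Turn display-order frame numbers into transmission order:
        # every I or P is sent before the B frames predicted from it.
        pending = []            # B's waiting for their future reference
        for i in range(n_frames):
            ftype = gop[i % len(gop)]
            if ftype == 'B':
                pending.append(i)
            else:
                yield (i, ftype)    # send the reference first...
                for b in pending:   # ...then the B's that needed it
                    yield (b, 'B')
                pending = []

    # For 7 frames this prints [0, 3, 1, 2, 6, 4, 5], matching the
    # 0xx312645 example above (with xx empty).
    print([f for f, t in coded_order(7)])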
Q. You've got to be kidding.
A. No, really!
Q. Hmm. Where did they get 352x240?
A. That derives from the CCIR-601 digital television standard which
is used by professional digital video equipment. It is (in the US)
720 by 243 by 60 fields (not frames) per second, where the fields
are interlaced when displayed. (It is important to note though
that fields are actually acquired and displayed a 60th of a second
apart.) The chrominance channels are 360 by 243 by 60 fields a
second, again interlaced. This degree of chrominance decimation
(2:1 in the horizontal direction) is called 4:2:2. The source
input format for MPEG-1, called SIF, is CCIR-601 decimated by 2:1
in the horizontal direction, 2:1 in the time direction, and an
additional 2:1 in the chrominance vertical direction. And some
lines are cut off to make sure things divide by 8 or 16 where
needed.
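The arithmetic, spelled out in code form (this is just the
derivation above; the crops to multiples of 16 are what give
352x240):

    # CCIR-601 (US) luminance: 720 samples x 243 lines x 60 fields/s.
    # SIF: 2:1 horizontal, 2:1 temporal (keep one field per frame),
    # plus an extra 2:1 vertical for chrominance; then crop so the
    # dimensions divide by 16 (whole macroblocks).
    luma_w = 720 // 2             # 360, cropped to 352 = 22 * 16
    luma_h = 243                  # one field, cropped to 240 = 15 * 16
    rate   = 60 // 2              # 30 frames/s
    chroma = (352 // 2, 240 // 2) # (176, 120)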
Q. What if I'm in Europe?
A. For 50 Hz display standards (PAL, SECAM) change the number of lines
in a field from 243 or 240 to 288, and change the display rate to
50 fields/s or 25 frames/s. Similarly, change the 120 lines in
the decimated chrominance channels to 144 lines. Since 288*50 is
exactly equal to 240*60, the two formats have the same source data
rate.
Q. What will MPEG-2 do for video coding?
A. As I said, there is a considerable loss of quality in going from
CCIR-601 to SIF resolution. For entertainment video, it's simply
not acceptable. You want to use more bits and code all or almost
all the CCIR-601 data. From subjective testing at the Japan
meeting in November 1991, it seems that 4 MBits/s can give very
good quality compared to the original CCIR-601 material. The
objective of MPEG-2 is to define a bit stream optimized for
these resolutions and bit rates.
Q. Why not just scale up what you're doing with MPEG-1?
A. The main difficulty is the interlacing. The simplest way to extend
MPEG-1 to interlaced material is to put the fields together into
frames (720x486x30/s). This results in bad motion artifacts that
stem from the fact that moving objects are in different places
in the two fields, and so don't line up in the frames. Compressing
and decompressing without taking that into account somehow tends to
muddle the objects in the two different fields.
The other thing you might try is to code the even and odd field
streams separately. This avoids the motion artifacts, but as you
might imagine, doesn't get very good compression since you are not
using the redundancy between the even and odd fields where there
is not much motion (which is typically most of the image).
Or you can code it as a single stream of fields. Or you can
interpolate lines. Or, etc. etc. There are many things you can
try, and the point of MPEG-2 is to figure out what works well.
MPEG-2 is not limited to considering only derivations of MPEG-1.
There were several non-MPEG-1-like schemes in the competition in
November, and some aspects of those algorithms may or may not
make it into the final standard for entertainment video
compression.
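The frame-assembly problem is easy to see in code. A minimal weave
of two fields into one frame (numpy again; my own sketch, not
anything from the standard):

    import numpy as np

    def weave(field_even, field_odd):
        # Interleave two fields, captured 1/60 s apart, into a frame.
        # Anything that moved between the two field times ends up
        # misaligned on alternate lines ("combing"), which is what
        # hurts naive frame-based compression.
        h, w = field_even.shape
        frame = np.empty((2 * h, w), field_even.dtype)
        frame[0::2] = field_even    # lines 0, 2, 4, ...
        frame[1::2] = field_odd     # lines 1, 3, 5, ...
        return frame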
Q. So what works?
A. Basically, derivations of MPEG-1 worked quite well, as did one
scheme that used wavelet subband coding instead of DCT's. Also
among the worked-very-well's was a scheme that did not
use B frames at all, just I and P's. All of them, except maybe
one, did some sort of adaptive frame/field coding, where a decision
is made on a macroblock basis as to whether to code that one as
one frame macroblock or as two field macroblocks. Another open
aspect is how to code I-frames--some suggest predicting the even
field from the odd field. Or you can predict evens from evens and
odds or odds from evens and odds or any field from any other field,
etc.
Q. So what works best?
A. Ok, we're not really sure what works best yet. The next step is
to define a "test model" to start from, that incorporates most of
the salient features of the worked-very-well proposals in a
simple way. Then experiments will be done on that test model,
making one mod at a time, and seeing what makes it better and what
makes it worse. Example experiments are: B's or no B's, DCT vs.
wavelets, various field prediction modes, etc. The requirements,
such as implementation cost, quality, random access, etc. will all
feed into this process as well.
Q. When will all this be finished?
A. I don't know. I'd hope for about a year or less.
Q: Talking about MPEG audio coding, I heard a lot about "Layer 1, 2
and 3". What does it mean, exactly?
A: MPEG-1, IS 11172-3, describes the compression of audio signals
using high performance perceptual coding schemes. It specifies a
family of three audio coding schemes, simply called Layer-1,-2,-3,
with increasing encoder complexity and performance (sound quality
per bitrate). The three codecs are compatible in a hierarchical
way, i.e. a Layer-N decoder is able to decode bitstream data
encoded in Layer-N and all Layers below N (e.g., a Layer-3
decoder may accept Layer-1,-2 and -3, whereas a Layer-2 decoder
may accept only Layer-1 and -2.)
Q: So we have a family of three audio coding schemes. What does the
MPEG standard define, exactly?
A: For each Layer, the standard specifies the bitstream format and
the decoder. To allow for future improvements, it does *not*
specify the encoder, but an informative chapter gives an example
encoder for each Layer.
Q: What have the three audio Layers in common?
A: All Layers use the same basic structure. The coding scheme can be
described as "perceptual noise shaping" or "perceptual subband /
transform coding".
The encoder analyzes the spectral components of the audio signal
by calculating a filterbank or transform and applies a
psychoacoustic model to estimate the just noticeable noise-
level. In its quantization and coding stage, the encoder tries
to allocate the available number of data bits in a way to meet
both the bitrate and masking requirements.
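As a sketch of the idea only (this is not any Layer's actual
allocation algorithm): repeatedly give one more bit of quantizer
resolution to whichever subband currently has the worst
noise-to-mask ratio, until the frame's bit budget is spent.

    import numpy as np

    def allocate_bits(signal_db, mask_db, budget, n_bands=32):
        # signal_db: signal level per subband (array of 32 values);
        # mask_db: masking threshold per subband, as estimated by
        # the psychoacoustic model.  Each extra bit of resolution
        # lowers the quantization noise by about 6 dB.
        bits = np.zeros(n_bands, dtype=int)
        for _ in range(budget):
            noise_db = signal_db - 6.02 * bits  # rough noise floor
            nmr = noise_db - mask_db            # noise-to-mask ratio
            bits[int(np.argmax(nmr))] += 1      # help the worst band
        return bits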
The decoder is much less complex. Its only task is to synthesize
an audio signal out of the coded spectral components.
All Layers use the same analysis filterbank (polyphase with 32
subbands). Layer-3 adds an MDCT to increase the frequency
resolution.
All Layers use the same "header information" in their bitstream,
to support the hierarchical structure of the standard.
All Layers use a bitstream structure that contains parts that are
more sensitive to biterrors ("header", "bit allocation",
"scalefactors", "side information") and parts that are less
sensitive ("data of spectral components").
All Layers may use 32, 44.1 or 48 kHz sampling frequency.
All Layers are allowed to work with similar bitrates:
Layer-1: from 32 kbps to 448 kbps
Layer-2: from 32 kbps to 384 kbps
Layer-3: from 32 kbps to 320 kbps
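Concretely, the shared header carries the Layer, the bitrate index
and the sampling frequency, among other fields. A sketch of parsing
the common fixed fields follows (the per-Layer bitrate tables are
omitted; treat this as an illustration, not a reference
implementation):

    def parse_audio_header(data):
        # data: the first 4 bytes of an MPEG-1 audio frame.
        h = int.from_bytes(data[:4], 'big')
        if (h >> 20) != 0xFFF:           # 12-bit syncword, all ones
            raise ValueError('no syncword')
        layer = 4 - ((h >> 17) & 0x3)    # bits 11 -> Layer-1, ...,
                                         # 01 -> Layer-3 (00 reserved)
        bitrate_index = (h >> 12) & 0xF  # index into a per-Layer table
        srate = {0: 44100, 1: 48000,
                 2: 32000}[(h >> 10) & 0x3]  # value 3 is reserved
        return layer, bitrate_index, srate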
Q: What are the main differences between the three Layers, from a
global view?
A: From Layer-1 to Layer-3,
complexity increases (mainly true for the encoder),
overall codec delay increases, and
performance increases (sound quality per bitrate).
Q: Which Layer should I use for my application?
A: Good question. Of course, it depends on all your requirements.
But as a first approach, you should consider the available bitrate
of your application, as the Layers have been designed to support
certain ranges of bitrates most efficiently, i.e. with a minimum
drop in sound quality.
Let us look a little closer at the strong domains of each Layer.
Layer-1: Its ISO target bitrate is 192 kbps per audio channel.
Layer-1 is a simplified version of Layer-2. It is most useful at
"high" bitrates around or above 192 kbps. A version of Layer-1 is
used as "PASC" in the DCC recorder.
Layer-2: Its ISO target bitrate is 128 kbps per audio channel.
Layer-2 is identical with MUSICAM. It has been designed as a
trade-off between sound quality per bitrate and encoder
complexity. It is most useful at "medium" bitrates of 128 or even
96 kbps per audio channel. The DAB (EU 147) proponents have
decided to use Layer-2 in the future Digital Audio Broadcasting
network.
Layer-3: Its ISO target bitrate is 64 kbps per audio channel.
Layer-3 merges the best ideas of MUSICAM and ASPEC. It has been
designed for best performance at "low" bitrates around 64 kbps or
even below. The Layer-3 format specifies a set of advanced
features that all address one goal: to preserve as much sound
quality as possible even at rather low bitrates. Today, Layer-3 is
already in use in various telecommunication networks (ISDN,
satellite links, and so on) and speech announcement systems.
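Purely as a rule of thumb reflecting those ISO target bitrates (the
cut-off points below are my own illustration, not part of any
standard):

    def suggest_layer(kbps_per_channel):
        # Crude first guess from the target bitrates above; a real
        # design also weighs encoder complexity, delay, and so on.
        if kbps_per_channel >= 160:
            return 1    # high bitrates: Layer-1, simplest codec
        if kbps_per_channel >= 96:
            return 2    # medium bitrates: Layer-2
        return 3        # 64 kbps and below: Layer-3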
Q: Tell me more about sound quality. How do you assess that?
A: Today, there is no alternative to expensive listening tests.
During the ISO-MPEG-1 process, three international listening
tests were performed with many trained listeners, supervised by
Swedish Radio. They took place in 7.90, 3.91 and 11.91. Another
international listening test was performed by CCIR, now ITU-R, in
92.
All these tests used the "triple stimulus, hidden reference"
method and the CCIR impairment scale to assess the audio quality.
The listening sequence is "ABC", with A = the original and B, C =
the original and the coded signal in random order; the listener
has to grade both B and C with a number between 1.0 and 5.0. The
meaning of these values is:
5.0 = transparent (this should be the original signal)
4.0 = perceptible, but not annoying (first differences noticeable)
3.0 = slightly annoying
2.0 = annoying
1.0 = very annoying
With perceptual codecs (like MPEG audio), traditional quality
parameters (like SNR, THD+N, bandwidth) are largely useless.
Fraunhofer-IIS works on objective quality assessment tools, like
the NMR meter (Noise-to-Mask-Ratio), too. BTW: If you need more
information about NMR, please contact nmr@iis.fhg.de.
Q: Now that I know how to assess quality, come on, tell me the
results of these tests.
A: Well, for low bitrates, the main result is that at 60 or 64 kbps
per channel, Layer-2 always scored between 2.1 and 2.6, whereas
Layer-3 scored between 3.6 and 3.8. This is a significant increase
in sound quality, indeed! Furthermore, the selection process for
critical sound material showed that it was rather difficult to
find worst-case material for Layer-3 whereas it was not so hard to
find such items for Layer-2.
Q: OK, a Layer-2 codec at low bitrates may sound poor today, but
couldn't that be improved in the future? I guess you just told me
before that the encoder is not fixed in the standard.
A: Good thinking! As the sound quality mainly depends on the encoder
implementation, it is true that there is no such thing as a
"Layer-N" quality. So we really only know the performance of the
reference codecs during the international tests. Who knows what
will happen in the future? What we do know now, is:
Today, Layer-3 already provides a sound quality that comes very
near to CD quality at 64 kbps per channel. Layer-2 is far away
from that.
Tomorrow, both Layers may improve. Layer-2 has been designed as a
trade-off between quality and complexity, so the bitstream format
allows only limited innovations. In contrast, even the current
reference Layer-3-codec exploits only a small part of the powerful
mechanisms inside the Layer-3 bitstream format.
Q: All in all, you sound as if anybody should use Layer-3 for low
bitrates. Why on earth do some vendors still offer only Layer-2
equipment for these applications?
A: Well, maybe because they started to design and develop their
system rather early, e.g. in 1990. As Layer-2 is identical with
MUSICAM, it has been available since the summer of 90 at the
latest. In that year, Layer-3 development started; it was
successfully finished in spring 92. So, for a certain time,
vendors could only
exploit the existing part of the new MPEG standard.
Now the situation has changed. All Layers are available, the
standard is completed, and new systems need not limit themselves,
but may capitalize on the full features of MPEG audio.
Q: How do I get the MPEG documents?
A: You may order them from your national standards body.
E.g., in Germany, please contact:
DIN-Beuth Verlag, Auslandsnormen
Mrs. Niehoff, Burggrafenstr. 6, D-10772 Berlin, Germany
Phone: 030-2601-2757, Fax: 030-2601-1231
E.g., in the USA, you may order them from ANSI [phone
(212) 642-4900] or buy them from companies like OMNICOM
[phone +44 438 742424, FAX +44 438 740154].
Q. How do I join MPEG?
A. You don't join MPEG. You have to participate in ISO as part of a
national delegation. How you get to be part of the national
delegation is up to each nation. I only know the U.S., where you
have to attend the corresponding ANSI meetings to be able to
attend the ISO meetings. Your company or institution has to be
willing to sink some bucks into travel since, naturally, these
meetings are held all over the world. (For example, Paris,
Santa Clara, Kurihama Japan, Singapore, Haifa Israel, Rio de
Janeiro, London, etc.)