CS6825: Computer Vision

JPEG

(Sections of this are taken from the JPEG FAQ, written by Tom Lane.)
JPEG (pronounced "jay-peg") is a standardized image compression mechanism. JPEG stands for Joint

Photographic Experts Group, the original name of the committee that wrote the standard. JPEG is designed for

compressing either full-color or gray-scale digital images of "natural", real-world scenes. It does not work very

well on non-realistic images, such as cartoons or line drawings. 



JPEG does not handle black-and-white (1-bit-per-pixel) images, nor does it handle motion picture

compression. Related standards for compressing those types of images exist, and are called JBIG and MPEG

respectively. 



Regular JPEG is "lossy", meaning that the image you get out of decompression isn't quite identical to what you

originally put in. The algorithm achieves much of its compression by exploiting known limitations of the human

eye, notably the fact that small color details aren't perceived as well as small details of light-and-dark. Thus,

JPEG is intended for compressing images that will be looked at by humans. If you plan to machine-analyze your

images, the small errors introduced by JPEG may be a problem for you, even if they are invisible to the eye.

The JPEG standard includes a separate lossless mode, but it is rarely used and does not give nearly as much

compression as the lossy mode. 



There are a lot of parameters to the JPEG compression process. By adjusting the parameters, you can trade off

compressed image size against reconstructed image quality over a *very* wide range. You can get image

quality ranging from op-art (at 100x smaller than the original 24-bit image) to quite indistinguishable from the

source (at about 3x smaller). Usually the threshold of visible difference from the source image is somewhere

around 10x to 20x smaller than the original, i.e., 1 to 2 bits per pixel for color images. Grayscale images do not

compress as much. In fact, for comparable visual quality, a grayscale image needs perhaps 25% less space than

a color image; certainly not the 66% less that you might naively expect.

JPEG defines a "baseline" lossy algorithm, plus optional extensions for progressive and hierarchical coding. There is also a separate lossless compression mode; this typically gives about 2:1 compression, i.e., about 12 bits per color pixel. Most currently

available JPEG hardware and software handles only the baseline mode. 
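
As a quick check on those figures, here is the bits-per-pixel arithmetic for a 24-bit color image (pure arithmetic, using only the ratios quoted above):

    ORIGINAL_BPP = 24                       # 8 bits each for R, G and B

    for ratio in (3, 10, 20, 100):          # ratios mentioned in the text
        print(f"{ratio:>3}x smaller -> {ORIGINAL_BPP / ratio:.2f} bits per pixel")

    # 10x-20x works out to roughly 1.2-2.4 bits per pixel, i.e. the "1 to 2 bits
    # per pixel" threshold of visible difference quoted above.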



Here's the outline of the baseline compression algorithm (a short code sketch of the core per-block steps follows the list): 



1. Transform the image into a suitable color space.  This is a no-op for

grayscale, but for color images you generally want to transform RGB into a

luminance/chrominance color space (YCbCr, YUV, etc).  The luminance component

is grayscale and the other two axes are color information.  The reason for

doing this is that you can afford to lose a lot more information in the

chrominance components than you can in the luminance component: the human eye

is not as sensitive to high-frequency chroma info as it is to high-frequency

luminance.  (See any TV system for precedents.)  You don't have to change the

color space if you don't want to, since the remainder of the algorithm works

on each color component independently, and doesn't care just what the data

is.  However, compression will be less since you will have to code all the

components at luminance quality.  Note that colorspace transformation is

slightly lossy due to roundoff error, but the amount of error is much smaller

than what we typically introduce later on.



2. (Optional) Downsample each component by averaging together groups of

pixels.  The luminance component is left at full resolution, while the chroma

components are often reduced 2:1 horizontally and either 2:1 or 1:1 (no

change) vertically.  In JPEG-speak these alternatives are usually called 2h2v

and 2h1v sampling, but you may also see the terms "411" and "422" sampling.

This step immediately reduces the data volume by one-half or one-third.

In numerical terms it is highly lossy, but for most images it has almost no

impact on perceived quality, because of the eye's poorer resolution for chroma

info.  Note that downsampling is not applicable to grayscale data; this is one

reason color images are more compressible than grayscale.



3. Group the pixel values for each component into 8x8 blocks.  Transform each

8x8 block through a discrete cosine transform (DCT).  The DCT is a relative of

the Fourier transform and likewise gives a frequency map, with 8x8 components.

Thus you now have numbers representing the average value in each block and

successively higher-frequency changes within the block.  The motivation for

doing this is that you can now throw away high-frequency information without

affecting low-frequency information.  (The DCT transform itself is reversible

except for roundoff error.)  See the JPEG FAQ for fast DCT algorithms.



4. In each block, divide each of the 64 frequency components by a separate

"quantization coefficient", and round the results to integers.  This is the

fundamental information-losing step.  The larger the quantization

coefficients, the more data is discarded.  Note that even the minimum possible

quantization coefficient, 1, loses some info, because the exact DCT outputs

are typically not integers.  Higher frequencies are always quantized less

accurately (given larger coefficients) than lower, since they are less visible

to the eye.  Also, the luminance data is typically quantized more accurately

than the chroma data, by using separate 64-element quantization tables.

Tuning the quantization tables for best results is something of a black art,

and is an active research area.  Most existing encoders use simple linear

scaling of the example tables given in the JPEG standard, using a single

user-specified "quality" setting to determine the scaling multiplier.  This

works fairly well for midrange qualities (not too far from the sample tables

themselves) but is quite nonoptimal at very high or low quality settings.



5. Encode the reduced coefficients using either Huffman or arithmetic coding.

(Strictly speaking, baseline JPEG only allows Huffman coding; arithmetic

coding is an optional extension.)   Notice that this step is lossless, so it

doesn't affect image quality.  The arithmetic coding option uses Q-coding;

it is identical to the coder used in JBIG.  Be aware that

Q-coding is patented.  Most existing implementations support only the Huffman

mode, so as to avoid license fees.  The arithmetic mode offers maybe 5 or 10%

better compression, which isn't enough to justify paying fees.



6. Tack on appropriate headers, etc, and output the result.  In a normal

"interchange" JPEG file, all of the compression parameters are included

in the headers so that the decompressor can reverse the process.  These

parameters include the quantization tables and the Huffman coding tables.

For specialized applications, the spec permits those tables to be omitted

from the file; this saves several hundred bytes of overhead, but it means

that the decompressor must know a priori what tables the compressor used.

Omitting the tables is safe only in closed systems.
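
To make steps 1 through 4 concrete, the sketch below shows the per-block arithmetic in Python with NumPy/SciPy: the JFIF-style RGB-to-YCbCr conversion, 2h2v chroma averaging, the level shift and 8x8 DCT, and quantization against the example luminance table from the JPEG spec. The function names and the toy gradient input are illustrative only; a real encoder also tiles the whole image into blocks, zigzag-orders the quantized coefficients, and entropy-codes them (step 5).

    import numpy as np
    from scipy.fft import dctn   # 2-D type-II DCT; norm="ortho" matches the JPEG FDCT scaling

    # Example luminance quantization table from the JPEG spec (Annex K).
    Q_LUMA = np.array([
        [16, 11, 10, 16,  24,  40,  51,  61],
        [12, 12, 14, 19,  26,  58,  60,  55],
        [14, 13, 16, 24,  40,  57,  69,  56],
        [14, 17, 22, 29,  51,  87,  80,  62],
        [18, 22, 37, 56,  68, 109, 103,  77],
        [24, 35, 55, 64,  81, 104, 113,  92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103,  99],
    ])

    def rgb_to_ycbcr(rgb):
        # Step 1: RGB -> YCbCr (JFIF full-range convention); rgb is HxWx3, values 0..255.
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299    * r + 0.587    * g + 0.114    * b
        cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
        cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
        return y, cb, cr

    def downsample_2h2v(chroma):
        # Step 2: average each 2x2 group of chroma samples (assumes even dimensions).
        h, w = chroma.shape
        return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    def encode_block(block, qtable):
        # Steps 3-4: level-shift, 8x8 DCT, then quantize -- the information-losing step.
        coeffs = dctn(block - 128.0, norm="ortho")
        return np.round(coeffs / qtable).astype(int)

    # Toy usage on a smooth 16x16 gradient; smooth content leaves mostly zeros in the
    # high-frequency (lower-right) part of each quantized block.
    xx, yy = np.meshgrid(np.arange(16.0), np.arange(16.0))
    rgb = np.dstack([8 * xx, 8 * yy, 4 * (xx + yy)])
    y, cb, cr = rgb_to_ycbcr(rgb)
    cb_small, cr_small = downsample_2h2v(cb), downsample_2h2v(cr)
    print(encode_block(y[:8, :8], Q_LUMA))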





The decompression algorithm reverses this process. The decompressor multiplies the reduced coefficients by

the quantization table entries to produce approximate DCT coefficients. Since these are only approximate, the

reconstructed pixel values are also approximate, but if the design has done what it's supposed to do, the errors

won't be highly visible. A high-quality decompressor will typically add some smoothing steps to reduce

pixel-to-pixel discontinuities. 
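
Continuing the encoding sketch above, the corresponding per-block decode is simply those two lossy steps run backwards. The helper below assumes the Q_LUMA table and the quantized block produced by encode_block in the earlier sketch; smoothing and chroma upsampling are omitted.

    import numpy as np
    from scipy.fft import idctn   # inverse 2-D DCT

    def decode_block(quantized, qtable):
        # Multiply by the quantization table to get approximate DCT coefficients,
        # then inverse-DCT and undo the level shift.
        coeffs = quantized * qtable
        block = idctn(coeffs, norm="ortho") + 128.0
        return np.clip(np.round(block), 0, 255)

    # e.g. recon = decode_block(encode_block(y[:8, :8], Q_LUMA), Q_LUMA)
    #      (recon - y[:8, :8]) shows small, structured errors rather than exact recovery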



The JPEG standard does not specify the exact behavior of compressors and decompressors, so there's some

room for creative implementation. In particular, implementations can trade off speed against image quality by

choosing more accurate or faster-but-less-accurate approximations to the DCT. Similar tradeoffs exist for the

downsampling/upsampling and colorspace conversion steps. (The spec does include some minimum accuracy

requirements for the DCT step, but these are widely ignored, and are not too meaningful anyway in the absence

of accuracy requirements for the other lossy steps.) 





Extensions: 

The progressive mode is intended to support real-time transmission of images. It allows the DCT coefficients to

be sent piecemeal in multiple "scans" of the image. With each scan, the decoder can produce a higher-quality

rendition of the image. Thus a low-quality preview can be sent very quickly, then refined as time allows. The

total space needed is roughly the same as for a baseline JPEG image of the same final quality. (In fact, it can be

somewhat *less* if a custom Huffman table is used for each scan, because the Huffman codes can be optimized

over a smaller, more uniform population of data than appears in a baseline image's single scan.) The decoder

must do essentially a full JPEG decode cycle for each scan: inverse DCT, upsample, and color conversion must

all be done again, not to mention any color quantization for 8-bit displays. So this scheme is useful only with fast

decoders or slow transmission lines. Up until 1995, progressive JPEG was a rare bird, but its use is now

spreading as software decoders have become fast enough to make it useful with modem-speed data

transmission. 



The hierarchical mode represents an image at multiple resolutions. For example, one could provide 512x512,

1024x1024, and 2048x2048 versions of the image. The higher-resolution images are coded as differences from

the next smaller image, and thus require many fewer bits than they would if stored independently. (However, the

total number of bits will be greater than that needed to store just the highest-resolution frame in baseline form.)

The individual frames in a hierarchical sequence can be coded progressively if desired. Hierarchical mode is not

widely supported at present. 
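
The difference-coding idea behind hierarchical mode can be sketched in a few lines. This is only a conceptual illustration: the function names, the 2x2 averaging, and the nearest-neighbour upsampling are stand-ins, while the standard defines its own downsampling/upsampling filters and codes each residual with the usual DCT machinery.

    import numpy as np

    def downsample(img):
        # Average 2x2 groups (illustrative filter only).
        h, w = img.shape
        return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

    def upsample(img):
        # Nearest-neighbour 2x upsampling, used here as the prediction of the finer level.
        return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

    def hierarchical_split(image, levels=3):
        """Return the coarsest image plus the residual needed at each finer level.
        Assumes the image dimensions are divisible by 2**(levels - 1)."""
        pyramid = [image.astype(float)]
        for _ in range(levels - 1):
            pyramid.append(downsample(pyramid[-1]))
        residuals = [finer - upsample(coarser)
                     for coarser, finer in zip(pyramid[1:], pyramid[:-1])]
        return pyramid[-1], residuals[::-1]   # coarse-to-fine order

    # The residuals are small in magnitude compared with the full images, which is
    # why the higher-resolution frames need many fewer bits than independent storage.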



Part 3 of the JPEG standard, approved at the end of 1995, introduces several new extensions. The one most

likely to become popular is variable quantization, which allows the quantization table to be scaled to different

levels in different parts of the image. In this way the "more critical" parts of the image can be coded at higher

quality than the "less critical" parts. A signaling code can be inserted at any DCT block boundary to set a new

scaling factor. 



Another Part 3 extension is selective refinement. This feature permits a scan in a progressive sequence, or a

refinement frame of a hierarchical sequence, to cover only part of the total image area. This is an alternative way

of solving the variable-quality problem. My (Tom Lane's) guess is that this will not get widely implemented, with

variable quantization proving a more popular approach, but I've been wrong before. 



The third major extension added by Part 3 is a "tiling" concept that allows an image to be built up as a

composite of JPEG frames, which may have different sizes, resolutions, quality settings, even colorspaces. (For

example, a color image that occupies a small part of a mostly-grayscale page could be represented as a

separate frame, without having to store the whole page in color.) Again, there's some overlap in functionality

with variable quantization and selective refinement. The general case of arbitrary tiles is rather complex and is

unlikely to be widely implemented. In the simplest case all the tiles are the same size and use similar quality

settings. This case may become popular even if the general tiling mechanism doesn't, because it surmounts the

64K-pixel-on-a-side image size limitation that was (not very foresightedly) built into the basic JPEG standard.

The individual frames are still restricted to 64K for compatibility reasons, but the total size of a tiled JPEG

image can be up to 2^32 pixels on a side. 





Lossless JPEG:

The separate lossless mode does not use DCT, since roundoff errors prevent a DCT calculation from being

lossless. For the same reason, one would not normally use colorspace conversion or downsampling, although

these are permitted by the standard. The lossless mode simply codes the difference between each pixel and the

"predicted" value for the pixel. The predicted value is a simple function of the already-transmitted pixels just

above and to the left of the current one (for example, their average; 8 different predictor functions are

permitted). The sequence of differences is encoded using the same back end (Huffman or arithmetic) used in

the lossy mode. 
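
The prediction scheme is easy to sketch. The example below uses the average of the left and upper neighbours as the predictor (one of the eight permitted predictors); boundary handling is simplified for illustration, and the resulting differences are what the Huffman or arithmetic back end would code.

    import numpy as np

    def lossless_differences(img):
        # Predict each pixel from the average of its left and upper neighbours and
        # return the prediction errors; these are what the entropy coder sees.
        img = img.astype(int)
        h, w = img.shape
        diff = np.zeros_like(img)
        for y in range(h):
            for x in range(w):
                left = img[y, x - 1] if x > 0 else 0
                above = img[y - 1, x] if y > 0 else 0
                diff[y, x] = img[y, x] - (left + above) // 2
        return diff

    def lossless_reconstruct(diff):
        # Invert the prediction exactly; no information is lost.
        h, w = diff.shape
        img = np.zeros_like(diff)
        for y in range(h):
            for x in range(w):
                left = img[y, x - 1] if x > 0 else 0
                above = img[y - 1, x] if y > 0 else 0
                img[y, x] = (left + above) // 2 + diff[y, x]
        return img

    # Round-trip check on a small ramp image:
    #   x = np.arange(64).reshape(8, 8)
    #   assert (lossless_reconstruct(lossless_differences(x)) == x).all()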



Lossless JPEG with the Huffman back end is certainly not a state-of-the-art lossless compression method, and

wasn't even when it was introduced. The arithmetic-coding back end may make it competitive, but you're

probably best off looking at other methods if you need only lossless compression. 



The main reason for providing a lossless option is that it makes a good adjunct to the hierarchical mode: the

final scan in a hierarchical sequence can be a lossless coding of the remaining differences, to achieve overall

losslessness. This isn't quite as useful as it may at first appear, because exact losslessness is not guaranteed

unless the encoder and decoder have identical IDCT implementations (i.e., identical roundoff errors). And you

can't use downsampling or colorspace conversion either if you want true losslessness. But in some applications

the combination is useful. 





References: For a good technical introduction to JPEG, see: Wallace, Gregory K. "The JPEG Still Picture

Compression Standard", Communications of the ACM, April 1991 (vol. 34 no. 4), pp. 30-44. 



(Adjacent articles in that issue discuss MPEG motion picture compression, applications of JPEG, and related

topics.) If you don't have the CACM issue handy, a PostScript file containing a revised version of this article is

available at ftp://ftp.uu.net/graphics/jpeg/wallace.ps.gz. This file (actually a preprint for a later article in IEEE

Trans. Consum. Elect.) omits the sample images that appeared in CACM, but it includes corrections and some

added material. Note: the Wallace article is copyright ACM and IEEE, and it may not be used for commercial

purposes. 



An alternative, more leisurely explanation of JPEG can be found in "The Data Compression Book" by Mark

Nelson (1991). This book provides excellent introductions to many data compression

methods including JPEG, plus sample source code in C. The JPEG-related source code is far from

industrial-strength, but it's a pretty good learning tool. 



An excellent textbook about JPEG is "JPEG Still Image Data Compression Standard" by William B.

Pennebaker and Joan L. Mitchell. Published by Van Nostrand Reinhold, 1993, ISBN 0-442-01272-1. 650

pages, price US$59.95. (VNR will accept credit card orders at 800/842-3636, or get your local bookstore to

order it.) This book includes the complete text of the ISO JPEG standards, DIS 10918-1 and draft DIS

10918-2. Review by Tom Lane: "This is by far the most complete exposition of JPEG in existence. It's written

by two people who know what they are talking about: both served on the ISO JPEG standards committee. If

you want to know how JPEG works or why it works that way, this is the book to have." 



The official specification of JPEG is not currently available on-line, and is not likely ever to be available for free

because of ISO and ITU copyright restrictions. You can order it from your national standards agency as ISO

standards IS 10918-1, 10918-2, 10918-3, or as ITU-T standards T.81, T.83, T.84. See

ftp://ftp.uu.net/graphics/jpeg/jpeg.documents.gz for more info. NOTE: buying the Pennebaker and Mitchell

textbook is a much better deal than purchasing the standard directly: it's cheaper and includes a lot of useful

explanatory material along with the full draft text of the spec. The book unfortunately doesn't include Part 3 of

the spec, but if you need Part 3, buy the book and just that part and you'll still be ahead. 





© Lynne Grewe