Data Compression

Literature references and annotations by Dick Grune, dick@dickgrune.com. Last update: Thu Dec 10 17:56:34 2009.
These references and annotations were originally intended for personal use and are presented here only in the hope that they may be useful to others. There is no claim to completeness or even correctness. Each annotation represents my understanding of the text at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.

* D. Salomon, "Data Compression", Springer, New York, 1998, pp. 448.

* Alistair Moffat, Andrew Turpin, "On the Implementation of Minimum-Redundancy Prefix Codes", in Data Compression Conference, 1996, pp. 170-179.
Describes in detail a modification to Huffman compression, which can be implemented very efficiently. Full of nifty and complicated algorithms.

* R.N. Horspool, G.V. Cormack, "Constructing word-based text compression algorithms", in 2nd IEEE Data Compression Conference, Snowbird, April 1992,

* Gregory K. Wallace, "The JPEG still picture compression standard", Commun. ACM, 34, #4, April 1991, pp. 31-44.
JPEG stands for "Joint Photographic Experts Group". The picture is divided into blocks of 8 × 8 pixels. Each such block is subjected to a Forward Discrete Cosine Transform, a Fourier Transform derivative, which yields in principle 64 coefficients: the DC component and 63 spatial harmonics, the AC components. The blocks are treated in "reading order", from top left to bottom right. Three forms of compression are applied: 1. the DC component is replaced by its difference from that of the previous block; this is lossless. 2. depending on the desired compression, the AC components are quantized, either by reduction to one of the values in a given table or by truncation to their N most significant bits; this is lossy. 3. the resulting stream undergoes Huffman coding.
There is a completely different scheme for doing lossless JPEG compression. It works by defining a predictor (e.g. the average of the pixel above and the one to the left) and then recording the difference between this prediction and the actual value, followed by the usual Huffman coding.
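As an illustration of the lossless scheme, here is a minimal Python sketch, not the standard's actual algorithm: each pixel is predicted as the average of the pixel above and the pixel to the left, and only the residual is kept (the subsequent Huffman step is omitted, and border handling is simplified).

    def predict_residuals(image):
        # image: a list of equal-length rows of integer pixel values.
        residuals = []
        for y, row in enumerate(image):
            out = []
            for x, pixel in enumerate(row):
                if x > 0 and y > 0:
                    prediction = (row[x - 1] + image[y - 1][x]) // 2
                elif x > 0:
                    prediction = row[x - 1]          # top row: use the left neighbour only
                elif y > 0:
                    prediction = image[y - 1][x]     # left column: use the pixel above only
                else:
                    prediction = 0                   # very first pixel: no prediction
                out.append(pixel - prediction)       # residuals cluster around 0
            residuals.append(out)
        return residuals

    # Smooth images give small residuals, which the Huffman step then compresses well.
    print(predict_residuals([[100, 102, 103],
                             [101, 103, 105]]))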

* Didier Le Gall, "MPEG: a video compression standard for multi-media applications", Commun. ACM, 34, #4, April 1991, pp. 47-58.
In principle MPEG consists of a sequence of JPEG pictures; in practice there is a lot of additional trickery, which has to do with inter-picture compression. The pictures in an MPEG stream are of three types: I pictures (I for intra), which are new, independent pictures, used to start a new set of related pictures; P pictures (P for predicted), which follow a given I picture and which are specified as differences with the I picture; and B pictures (B for bidirectional), which come in between the I and P pictures and which are specified as the weighted average of the surrounding I and P pictures. A possible sequence could be: I B B P B B P B B I ... MPEG allows but does not require much more sophisticated motion-detection methods.
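A minimal sketch of the B-picture idea, not the actual MPEG algorithm (no motion compensation, and an assumed 50/50 weighting): a bidirectional picture is predicted as a weighted average of the two surrounding reference pictures, so only the small residuals have to be coded.

    def predict_b_picture(prev_ref, next_ref, weight=0.5):
        # Both references are equally sized lists of rows of pixel values.
        return [[weight * a + (1.0 - weight) * b for a, b in zip(row_p, row_n)]
                for row_p, row_n in zip(prev_ref, next_ref)]

    def b_residual(actual, prev_ref, next_ref):
        # What actually needs to be stored for the B picture.
        predicted = predict_b_picture(prev_ref, next_ref)
        return [[a - p for a, p in zip(row_a, row_p)]
                for row_a, row_p in zip(actual, predicted)]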

* Ming Liou, "Overview of the p×64 kbit/s video coding standard", Commun. ACM, 34, #4, April 1991, pp. 60-63.
Very brief description. The official name of the coding is "CCITT Recommendation H.261 Video Format". It consists of a hierarchical data structure, with each field consisting of a header and a sequence of subfields, each of which has again a similar format. The lowest level contains the data, compressed in various ways, or may even be omitted, if there was no change.
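A minimal sketch of such a hierarchical layout (the names below are illustrative, not the actual fields of H.261): each level is a header plus a sequence of subfields of the same shape, and the data at the lowest level may be absent when nothing changed.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class H261Field:
        header: bytes
        subfields: List["H261Field"] = field(default_factory=list)
        data: Optional[bytes] = None          # only at the lowest level; None = omitted

    def total_size(f: H261Field) -> int:
        # Walk the hierarchy; omitted data contributes nothing.
        return (len(f.header)
                + (len(f.data) if f.data is not None else 0)
                + sum(total_size(s) for s in f.subfields))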

* Timothy Bell, Ian H. Witten, John G. Cleary, "Modeling for Text Compression", ACM Computing Surveys, 21, #4, Dec. 1989, pp. 557-591.

* G.V. Cormack, "Data Compression", Encyclopedia of Science and Technology Yearbook, McGraw-Hill, 1988,

* D.W. Jones, "Application of splay trees to data compression", Commun. ACM, 31, #8, Aug. 1988, pp. 996-1007.

* G.V. Cormack, R.N. Horspool, "Data compression using dynamic Markov modelling", Computer J., 30, #6, Dec. 1987, pp. 541-550.

* R.N. Horspool, G.V. Cormack, "Comments on "A locally adaptive data compression scheme"", Commun. ACM, 30, #9, Sept. 1987, pp. 792-793.

* I.H. Witten, R.M. Neal, J.G. Cleary, "Arithmetic coding for data compression", Commun. ACM, 30, #6, June 1987, pp. 520-540.

* H. Tanaka, "Data structure of Huffman codes and its application to efficient coding and decoding", IEEE Trans. Inform. Theory, 33, #1, Jan. 1987, pp. 154-156.

* J.L. Bentley, D.D. Sleator, R.E. Tarjan, V.K. Wei, "A locally adaptive data compression scheme", Commun. ACM, 29, #4, April 1986, pp. 320-330.

* R.N. Horspool, G.V. Cormack, "Comments on "Data compression using static Huffman code-decode tables"", Commun. ACM, 29, #2, Feb. 1986, pp. 150-152.

* Gordon V. Cormack, "Data compression on a database system", Commun. ACM, 28, #12, Dec. 1985, pp. 1336-1342.
The exit and entry hooks for record retrieval are used to compress and decompress the records. Since the records are relatively small, little is known about their contents, speed is essential, and space is at a premium, the following drastic scheme was conceived. Basically, each character is compressed using a Huffman table determined by the previous character; the Huffman tables are determined offline by sampling the data base.
Two problems had to be solved, both relating to the size of the fixed tables inside the routine. The first was that the usual Huffman tables have some entries of excessive length; these are used very infrequently but still take up space inside the routine. This was solved by a modification to the Huffman table generation algorithm that limits the length of any encoding to 15 bits. This degrades the tables a little, but not much.
The second was that 256 Huffman tables just occupy too much space, period. This was solved as follows: for each pair of tables, the degradation caused by combining them was computed, and the pair with the lowest degradation penalty was combined; this was repeated until the tables fitted.
The algorithm achieved a 65% size reduction on the records, and a 42% reduction in database size; the difference is caused by non-record material in the database. The compression slowed down the computation by 17%, but sped up I/O by 33%, resulting in a net gain.
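A minimal Python sketch of the core idea (without the 15-bit length limit or the table-merging step): gather, offline, one frequency table per preceding character, build a Huffman code for each, and encode every character with the table selected by its predecessor.

    import heapq
    from collections import Counter, defaultdict

    def huffman_code(freqs):
        # Standard Huffman construction; returns {symbol: bit string}.
        if len(freqs) == 1:
            return {next(iter(freqs)): "0"}
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tag = len(heap)                       # tie-breaker so code dicts are never compared
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + b for s, b in c1.items()}
            merged.update({s: "1" + b for s, b in c2.items()})
            heapq.heappush(heap, (f1 + f2, tag, merged))
            tag += 1
        return heap[0][2]

    def build_context_tables(sample):
        # One frequency table per preceding character, gathered offline from a sample.
        contexts = defaultdict(Counter)
        prev = "\0"
        for ch in sample:
            contexts[prev][ch] += 1
            prev = ch
        return {ctx: huffman_code(freqs) for ctx, freqs in contexts.items()}

    def compress(record, tables):
        # Each character is coded with the table chosen by its predecessor.
        bits, prev = [], "\0"
        for ch in record:
            bits.append(tables[prev][ch])
            prev = ch
        return "".join(bits)

    tables = build_context_tables("the cat sat on the mat ")
    print(compress("the cat ", tables))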

* R.N. Horspool, G.V. Cormack, "A practical method for data compression with computer applications", in CIPS Session 84, Calgary, 1984,

* G.V. Cormack, R.N. Horspool, "Algorithms for adaptive Huffman codes", Inf. Proc. Letters, 18, #3, 1984, pp. 159-165.

* John G. Cleary, Ian H. Witten, "Data Compression Using Adaptive Coding and Partial String Matching", IEEE Transactions on Communications, 32, 1984, pp. 396-402.
Improves on Lempel-Ziv in two respects: it records the frequency of the words in the dictionary and adapts the code accordingly, and it uses arithmetic encoding.

* T.A. Welch, "A technique for high-performance data compression", IEEE Comput., 17, #6, June 1984, pp. 8-19.

* A.M.M. Al-Hussaini, "File compression using probabilistic grammars and LR parsing", Ph.D. Thesis, Loughborough, 1983, pp. ???.

* D. Severance, "A practitioner's guide to data base compression", Inf. Syst., 8, #1, 1983, pp. 51-62.

* J.A. Storer, T.G. Szymanski, "Data compression via textual substitution", J. ACM, 29, #4, Oct. 1982, pp. 928-951.

* H. Reghbati, "An overview of data compression techniques", Computer, 14, #4, May 1981, pp. 71-75.

* M. Rodeh, V.R. Pratt, S. Even, "Linear algorithm for data compression via string matching", J. ACM, 28, #1, Jan. 1981, pp. 16-24.

* J. Ziv, A. Lempel, "A universal algorithm for sequential data compression", IEEE Trans. Inf. Theory, 23, #3, May 1977, pp. 337-343.

* E.J. Schuegraf, "A survey of data compression methods for non-numeric records", Can. J. Inf. Sci., 2, #1, May 1977, pp. 93-105.

* P. Elias, "Universal codeword sets and representations of the integers", IEEE Trans. Inform. Theory, 21, #2, March 1975, pp. 194-203.

* S.E. Hutchins, "Data compression on context-free languages", in IFIP '71, North-Holland Publ., 1972, pp. 104-109.

* D. Huffman, "A method for the construction of minimum redundancy codes", Proc. I.R.E., 40, #9, Sept. 1952, pp. 1098-1101.