Compression

Introduction

Compression is a very interesting topic. How can you make documents, images, movies and programs smaller? Please read the following essays by Joa Koester; they will give you the necessary background:

  1. 1st part missing
  2. Probability and Crunching
  3. Adapting to the sources (Huffman revisited)
  4. Leaving Huffman and entering the realms of R'lyeh, eh, RLE
  5. Interlude and the Mystery of Lempel-Ziv Part I
  6. The secret of Zip and Rar (LZW explained)

The FAQ

The FAQ of comp.compression and comp.compression.research has grown really big. It was split into three parts; download them here:

Arithmetic Coding

Arithmetic coding is the successor of Huffman coding. It provides even better compression and can be used almost everywhere. Check these PostScript files (they can be viewed with Ghostscript):
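
To show the core idea, here is a minimal Python sketch of my own (a fixed three-symbol model with exact fractions; real coders adapt the model on the fly and use renormalized integer arithmetic instead):

    from fractions import Fraction

    # A fixed model: each symbol owns a slice of [0, 1) as wide as its
    # probability. Exact fractions keep this sketch short and lossless.
    PROBS = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
    IV = {}
    _lo = Fraction(0)
    for _sym, _p in PROBS.items():
        IV[_sym] = (_lo, _lo + _p)   # each symbol's subinterval of [0, 1)
        _lo += _p

    def encode(msg):
        # narrow [lo, hi) to each symbol's nested subinterval in turn
        lo, hi = Fraction(0), Fraction(1)
        for sym in msg:
            width = hi - lo
            s_lo, s_hi = IV[sym]
            lo, hi = lo + width * s_lo, lo + width * s_hi
        return lo  # any number in [lo, hi) identifies the whole message

    def decode(code, length):
        out = []
        for _ in range(length):
            for sym, (s_lo, s_hi) in IV.items():
                if s_lo <= code < s_hi:
                    out.append(sym)
                    code = (code - s_lo) / (s_hi - s_lo)  # undo the narrowing
                    break
        return "".join(out)

    msg = "abcaab"
    print(decode(encode(msg), len(msg)) == msg)  # -> True

The punch line: frequent symbols narrow the interval only a little, so the final number needs fewer bits to write down. A symbol can effectively cost a fraction of a bit, which is where arithmetic coding beats Huffman.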

Burrows and Wheeler

In May 1994, M. Burrows and D.J. Wheeler presented "A Block-sorting Lossless Data Compression Algorithm", which achieves speed comparable to algorithms based on the techniques of Lempel and Ziv but obtains compression close to the best statistical modelling techniques. It works very well, and I use it e.g. to compress all kinds of text files. It isn't widely used yet, but I hope this will change soon. Try it yourself!
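
To give you an idea of what the transform does, here is a naive Python sketch of my own (the paper sorts suffixes much more cleverly; the sentinel character is assumed not to occur in the data):

    # Naive Burrows-Wheeler transform: sort all rotations of the input
    # and keep the last column. O(n^2 log n) - fine for a demonstration.
    def bwt(s, eos="\0"):
        s += eos
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def ibwt(last, eos="\0"):
        # rebuild the sorted rotation table column by column; the row
        # ending in the sentinel is the original string
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        row = next(r for r in table if r.endswith(eos))
        return row[:-1]

    text = "banana"
    transformed = bwt(text)
    print(transformed.replace("\0", "$"))  # -> annb$aa
    print(ibwt(transformed) == text)       # -> True

The transform doesn't make anything shorter by itself; the point is that similar contexts end up adjacent, so the last column is full of runs that a simple back end (move-to-front plus an entropy coder) compresses very well.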

Prediction by Partial Matching

PPM is a very clever algorithm with a high compression ratio but lower speed. I think today's computers are fast enough, so the least you can do is give it a try. Public domain source code is available by FTP, and a guy named Andreas Muegge has written a C++ class and an essay at CodeProject.

The Theory (from "Unbounded length contexts for PPM" by John G. Cleary, W. J. Teahan, and Ian H. Witten)

Prediction by partial matching, or PPM, is a finite-context statistical modeling technique that can be viewed as blending together several fixed-order context models to predict the next character in the input sequence. Prediction probabilities for each context in the model are calculated from frequency counts which are updated adaptively, and the symbol that actually occurs is encoded relative to its predicted distribution using arithmetic coding. The maximum context length is a fixed constant, and it has been found that increasing it beyond about six or so does not generally improve compression.

The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called finite-context models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k, from 0 up to some predetermined maximum, to predict upcoming characters.

For each model, a note is kept of all characters that have followed every length-k subsequence observed so far in the input, and the number of times that each has occurred. Prediction probabilities are calculated from these counts. The probabilities associated with each character that has followed the last k characters in the past are used to predict the upcoming character. Thus from each model, a separate predicted probability distribution is obtained.
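
As a concrete illustration of this bookkeeping, here is a small Python sketch of my own (not code from the paper; build_models and predict are names I made up):

    from collections import defaultdict

    def build_models(text, max_order):
        # models[k] maps each length-k context to the counts of the
        # characters that have followed it in the input so far
        models = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
        for i, ch in enumerate(text):
            for k in range(max_order + 1):
                if i >= k:
                    models[k][text[i - k:i]][ch] += 1
        return models

    def predict(models, k, history):
        # the order-k model's predicted distribution for the next character
        ctx = history[len(history) - k:] if k else ""
        counts = models[k].get(ctx, {})
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()} if total else {}

    models = build_models("abracadabra", 2)
    print(predict(models, 1, "abracadabra"))  # {'b': 0.5, 'c': 0.25, 'd': 0.25}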

These distributions are effectively combined into a single one, and arithmetic coding is used to encode the character that actually occurs, relative to that distribution. The combination is achieved through the use of escape probabilities. Recall that each model has a different value of k. The model with the largest k is, by default, the one used for coding. However, if a novel character is encountered in this context, which means that the context cannot be used for encoding it, an escape symbol is transmitted to signal the decoder to switch to the model with the next smaller value of k. The process continues until a model is reached in which the character is not novel, at which point it is encoded with respect to the distribution predicted by that model. To ensure that the process terminates, a model is assumed to be present below the lowest level, containing all characters in the coding alphabet. This mechanism effectively blends the different order models together in a proportion that depends on the values actually used for escape probabilities.
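
Continuing the sketch above with the escape mechanism (again my own toy code, repeated in full so it runs on its own; I use the "method C" convention, where the escape gets a weight equal to the number of distinct symbols seen in the context, and I leave out the exclusion refinement of real PPM implementations):

    from collections import defaultdict

    ALPHABET_SIZE = 256  # the bottom-level model: uniform over all byte values

    def build_models(text, max_order):
        # same counting scheme as in the previous sketch
        models = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
        for i, ch in enumerate(text):
            for k in range(max_order + 1):
                if i >= k:
                    models[k][text[i - k:i]][ch] += 1
        return models

    def prob(models, history, ch, max_order):
        # The probability PPM assigns to ch after history: start at the
        # highest-order context and, whenever ch is novel there, multiply
        # in the escape probability and drop to the next lower order.
        p = 1.0
        for k in range(max_order, -1, -1):
            ctx = history[len(history) - k:] if k else ""
            counts = models[k].get(ctx)
            if not counts:
                continue  # this context was never seen at all: fall through
            total, distinct = sum(counts.values()), len(counts)
            if ch in counts:
                return p * counts[ch] / (total + distinct)
            p *= distinct / (total + distinct)  # transmit an escape
        return p / ALPHABET_SIZE  # down here every character is possible

    models = build_models("abracadabra", 2)
    print(prob(models, "abracadabra", "c", 2))  # 0.5: 'c' is known after "ra"
    print(prob(models, "abracadabra", "z", 2))  # tiny: escapes all the way down

An arithmetic coder then writes each character, and each escape, at exactly the probability computed this way - which is why PPM and arithmetic coding are usually mentioned in the same breath.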

Exe-Packer

Exe-packers used to be very popular in DOS times. They had two functions: compressing any executable (EXE, COM) and protecting it from being disassembled and modified. The latter was a nice dream... most (if not all) packed programs were reverse-engineered, and unpackers were written for them. Compression, though, worked very well. Some time had to pass until you could pack Win32 EXEs and DLLs too. The best packer I have found is UPX - you can use it for free; visit its homepage for more information. It provides very good compression rates and seems to work stably. Of course, you should still keep backups of your files. The source code is available, but with one drawback: the best compression module is not included, so if you compile your own version, it will compress worse than the official program. On the one hand, I can understand that the authors don't want to give away everything for free and let other companies profit from their ideas; on the other hand, I have no motivation to spend my time on a program whose official version will always be better.
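
For reference, the everyday UPX command lines look like this (myprog.exe stands for whatever executable you want to pack):

    upx --best myprog.exe     compress in place with the best ratio
    upx -t myprog.exe         test the integrity of a packed file
    upx -d myprog.exe         decompress, i.e. restore the original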
