On “Data Compression Using Long Common Strings”, McIlroy & Bentley

Am just reading up on BMDiff:

[BigTable] compression looks for similar values along the rows, columns, and times. They use variations of BMDiff and Zippy. BMDiff gives them high write speeds (~100MB/s) and even faster read speeds (~1000MB/s).

…which led to reading the paper:

We propose a data compression scheme that recognizes the second occurrence of a long string in the input text as a repetition. It then represents the second string with a reference to the first, using just a few bytes.

…and thence:

Because we represent only long common strings, we are free to use an inefficient representation. We will represent a repeated string by the sequence “<start,length>”, where start is the initial position and length is the size of the common sequence. For instance, the Constitution begins:

The Constitution of the United States PREAMBLE We, the people of the United States, in order to form a more perfect Union, …

Its compressed form begins:

The Constitution of the United States PREAMBLE We, the people <16,21>, in order to form a more perfect Union, …
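Decoding that representation is about as simple as it gets; here is a minimal sketch in Python (mine, not the paper’s) which expands literal <start,length> references against the text decoded so far:

    import re

    # A reference looks like "<start,length>", both fields decimal.
    REF = re.compile(r"<(\d+),(\d+)>")

    def expand(compressed):
        """Expand <start,length> references against the text decoded so far."""
        out = []
        pos = 0
        for m in REF.finditer(compressed):
            out.append(compressed[pos:m.start()])        # copy the literal text
            start, length = int(m.group(1)), int(m.group(2))
            decoded = "".join(out)
            out.append(decoded[start:start + length])    # replay the earlier string
            pos = m.end()
        out.append(compressed[pos:])                     # trailing literal text
        return "".join(out)

In the Preamble example, the 21 characters starting at position 16 of the decoded text (counting from zero) are “ of the United States”, which is exactly what <16,21> replays.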

The reason I am reading this paper with joy is not only that I like Doug McIlroy’s writing style, but also the authors’ conclusions about combining BMDiff with gzip:

The file sizes are given in megabytes, and the percentages show the effectiveness of each compression. By itself, com 1000 reduces the file to about 86% of its original size, while gzip reduces it to about 35% of its original size. If we apply gzip after applying com 1000, though, gzip is almost as effective as before: the two methods appear to factor. Our precompression algorithm compressed three different kinds of long common strings that went unnoticed by gzip …
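(If the two ratios really do factor, then com 1000 followed by gzip should land at roughly 0.86 × 0.35 ≈ 0.30 of the original size.)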

…comments about gzip and pre-processing which are startlingly familiar from somewhere:

How does the DAWG dictionary-compression algorithm work?

Essentially it is a preprocessor for gzip that removes redundancy from a sorted list of words, and typically shrinks an input wordlist by some 50% without negatively impacting gzip’s ability to further compress the file. In the new version of the DAWG code – slightly improved over the version that ships with Crack v5.0, but fundamentally the same – all you need do is:

  1. sort the wordlist into normal Unix order. (beware localization!)
  2. for each word that the DAWG preprocessor reads…
  3. count how many leading characters it shares with the previous word that was read…
  4. encode that number as a character from the set [0-9A-Za-z] for values 0..61 (if the value is >61 then stop there)
  5. print said character (the encoded number) and the remaining stem of the word
  6. end-for-loop

[...]

Inspiration for using DAWG in Crack came from Paul Leyland back in the early 1990s, who mentioned something similar being used to encode dictionaries for crossword-puzzle solving programs; we continue to be astonished at how effective DAWG is on sorted inputs without materially impacting subsequent compression (ie: gzip); a gzipped-DAWG file is also typically about 50% of the size of the gzipped non-DAWGed file. Just goes to prove that knowledge of the sort of input you’ll be dealing with, can beat a general-purpose program hands-down; there are also interesting conclusions that can be drawn regarding the entropy of human languages after sorting.
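For the curious, the front-coding step described in the numbered list above fits in a dozen lines; this is a sketch of the idea in Python rather than the actual Crack v5.0 code:

    import sys

    # Shared-prefix lengths 0..61 are encoded as one character from [0-9A-Za-z].
    ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

    def dawg_encode(words):
        """Front-code an already-sorted wordlist: one <prefix-char><stem> line per word."""
        prev = ""
        for word in words:
            # Count the leading characters shared with the previous word,
            # capping the count at 61 so it fits a single alphabet character.
            shared = 0
            limit = min(len(prev), len(word), len(ALPHABET) - 1)
            while shared < limit and prev[shared] == word[shared]:
                shared += 1
            yield ALPHABET[shared] + word[shared:]
            prev = word

    if __name__ == "__main__":
        # The input must already be sorted in plain byte order (LC_ALL=C sort).
        for encoded in dawg_encode(line.rstrip("\n") for line in sys.stdin):
            print(encoded)

A hypothetical pipeline (the script and file names here are just for illustration) would be: LC_ALL=C sort words.txt | python3 dawg_encode.py | gzip -9 > words.dwg.gz — and the decoder is the obvious inverse, keeping the shared prefix of the previously decoded word and appending the stem.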

It’s a really nice feeling when you can put someone else’s white paper into a personal context.

It’s better yet when you beat them to some of the conclusions by several years. :-)

2 thoughts on “On “Data Compression Using Long Common Strings”, McIlroy & Bentley”

  1. Federico Jose Farina

    Good article! But I don’t see where the McIlroy & Bentley paper refers to the BMDiff algorithm; they never name it. What is BMDiff? Thanks!
