Question
Huffman encoding performs best when the distribution of the symbols of the alphabet used by the string to be encoded is dyadic (every symbol probability an exact power of 1/2).
Given an arbitrary bit string S, how can we find the best alphabet for encoding it? Suppose S is an ASCII file. Given the regularity of the 1-byte characters that such files exhibit, we would expect an optimal, or at least fairly good, alphabet to consist of, say, 8-bit or 16-bit words (for which we then build codes after constructing the Huffman tree).
Is there an algorithm for finding the optimal word width? (Assume we use constant-length words.)
I would guess that to evaluate an alphabet, it would only be fair to count the cost of storing the actual encoding as well. This addresses the degenerate case where the alphabet is just one symbol: the entire original string. Technically the message would be just one bit, but the encoding tree that's stored would have to indicate that this one bit is a code for the original string, so we've trivially increased our message by two bits!
(Constant-length encoding metadata such as the word width, the size of the encoding table, etc., need not be considered in the comparisons, of course.)
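To make that concrete, here is a sketch in Python of the cost function I have in mind. The tree-storage term (2n - 1 shape bits plus w bits per leaf symbol) is just one common succinct representation I'm assuming, not the only possibility:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Huffman code length per symbol, given a symbol -> count mapping,
    via the standard two-smallest-merge heap construction."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}  # degenerate one-symbol alphabet
    # Heap entries: (weight, unique tiebreak, {symbol: depth-so-far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, d1 = heapq.heappop(heap)
        c2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level down.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]

def total_cost_bits(S, w):
    """Total bits to Huffman-code bit string S with fixed-width w-bit words:
    encoded message plus an (assumed) stored tree of 2n - 1 shape bits
    plus w bits per leaf."""
    words = [S[i:i + w] for i in range(0, len(S), w)]  # last word may be short
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    message = sum(freqs[s] * lengths[s] for s in freqs)
    n = len(freqs)
    tree = (2 * n - 1) + n * w  # one possible tree representation (assumption)
    return message + tree
```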
Explanation / Answer
I'm having trouble answering your question for two reasons. First, the entropy changes as you change the alphabet, so the "best" alphabet depends on the correlations between characters in the class of strings you are trying to encode, not just on the "dyadicness". (This is the problem with the notion of entropy: it depends on your model of how the string was generated; it is not a fundamental property of the string itself.) So I can't think of an algorithm that would do better than "try every character length and see which one ends up with the best result."
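That brute-force search is at least easy to state; a sketch, reusing the total_cost_bits function from the question's cost model above, with an arbitrary assumed cap max_w on the widths tried:

```python
def best_width(S, max_w=32):
    """Try every word width from 1 to max_w bits and return the width
    that minimizes total output size (encoded message + stored tree)."""
    return min(range(1, max_w + 1), key=lambda w: total_cost_bits(S, w))
```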
Second, it's not clear (to me) why you are trying to choose an alphabet that gets closest to having a "dyadic" probability distribution (I had to look that term up), when in practice the main reason people use Huffman coding is that there is an adaptive version (i.e., one that doesn't need to store the encoding) which is "good enough", not because it is optimal. In practice (for example in the old Unix pack utility, or in the Huffman coding done at the end of MPEG encoding), the input alphabet size is chosen to be some "natural" size (bytes if you are encoding Unix files in the 1980s, some much larger alphabet if you know you are encoding Unicode).
If you want a non-adaptive encoding with a fixed-width alphabet (given that you have no reason to expect correlations between symbols), then you should use arithmetic coding, which gets closer to optimal for non-dyadic distributions.
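To see the gap numerically: for a dyadic distribution, Huffman's expected code length equals the entropy exactly, while for a skewed non-dyadic one it can overshoot badly; arithmetic coding approaches the entropy in both cases. A small check with made-up distributions:

```python
import heapq
from math import log2

def huffman_expected_length(probs):
    """Expected Huffman code length, computed as the sum of the merged
    weights produced while building the tree (a standard identity)."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

dyadic = [0.5, 0.25, 0.125, 0.125]   # every probability a power of 1/2
skewed = [0.9, 0.05, 0.05]           # non-dyadic

for dist in (dyadic, skewed):
    print(f"entropy={entropy(dist):.3f}  huffman={huffman_expected_length(dist):.3f}")
```

On the dyadic distribution both numbers come out to 1.750 bits/symbol; on the skewed one Huffman spends 1.100 bits/symbol against an entropy of about 0.569.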
If you have reason to believe that there are correlations between nearby characters, then you might use something like PPM; and if you believe there are likely to be repeated substrings (not necessarily nearby), you might use some kind of Lempel-Ziv compression. (Various kinds of Lempel-Ziv are used by Unix's gzip and compress.)
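For the dictionary flavor, here is a minimal sketch of LZW, the variant behind Unix's compress (gzip instead uses DEFLATE, which combines LZ77 with Huffman coding); the output is simplified to a list of integer codes rather than a packed bit stream:

```python
def lzw_compress(data: bytes) -> list[int]:
    """Minimal LZW: greedily match the longest known phrase, emit its
    code, and add (phrase + next byte) as a new dictionary entry."""
    dictionary = {bytes([i]): i for i in range(256)}  # all 1-byte phrases
    next_code = 256
    phrase = b""
    out = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            phrase = candidate          # keep extending the match
        else:
            out.append(dictionary[phrase])
            dictionary[candidate] = next_code
            next_code += 1
            phrase = bytes([byte])
    if phrase:
        out.append(dictionary[phrase])
    return out

# Repeated substrings compress well as the phrase dictionary grows:
print(lzw_compress(b"abababababab"))  # 12 bytes -> 6 codes
```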