Data compression is a way of finding economical ways to encode information, for

ID: 3913189 • Letter: D

Question

Data compression is a way of finding economical ways to encode information, for storing files using limited disk space, or for transferring files over a network. Sophisticated data compression algorithms for text files take into account sequences of characters that occur often, so that they can encode those sequences in an efficient way. For example, if the word Rumplestiltskin occurs often, then there will be a very short code for that word. But a simpler kind of data compression works on individual characters, without paying attention to how characters are grouped. This assignment has to do with that simple form of data compression, using an algorithm called Huffman coding.

A file is a sequence of bits. Codes such as Ascii and Unicode use a fixed number of bits to encode each character. For example, in the 8-bit Ascii code, the letters 'a', 'b' and 'c' have the following encodings.

'a'

01100001

'b'

01100010

'c'

01100011

In the 16 bit Unicode standard, the codes for 'a', 'b' and 'c' are as follows.

'a'

0000000001100001

'b'

0000000001100010

'c'

0000000001100011

Some characters are used more than others. For an economical encoding, instead of using the same number of bits for every character, it would be better to use shorter codes for characters that occur more frequently. For example, suppose that a document contains only the letters a, b and c, and suppose that the letter b occurs far more often than either a or c. An alternative character encoding is as follows.

'a'

'b'

'c'

This code has the property that no character code is a prefix of any other character code. Using this kind of character code (called a prefix code) allows you to break apart a string of bits into the characters that it encodes. For example, string of bits "01001110" encodes "babca". To decode a string, you repeatedly remove a prefix that encodes a character. Since no code is a prefix of any other code, there is only one way to select the next character.

Requirements

For this assignment you will write two programs. The first program will be given two file names, which I will refer to as A and B. It should do the following.

Count how many occurrences of each character are in file A.

Show the character frequencies on the standard output.

Construct a Huffman code for the characters that occur in file A (ignoring characters that do not occur at all).

Write the code on the standard output.

Reread file A and, using the constructed Huffman code, write an encoded version of file A into file B.

The second program will also be given two file names, A and B. It reads file A, which must be a file that was written by your first program, and it writes file B, which should be the decoded version of A. The decoded file should look identical to the file that you encoded.

The encoder will be called huffman and the decoder will be called unhuffman.

The standard output

If the input String being encoded contains only characters a, b, c, d and e, the program might show the following information on the standard output, in addition to encoding and decoding files.

The character frequencies are:

a 500

b 230

c 300

d 250

e 700

A Huffman code is as follows.

a 10

b 010

c 00

d 011

e 11

Algorithmic Issues

There is an algorithm that is due to Huffman for finding efficient prefix codes; they are called Huffman codes. The input to the algorithm is a String of characters with their frequencies.

Trees and codes

A tree is used to define a prefix code. Each node of the tree is either a leaf or a nonleaf. Each nonleaf has two children, one labeled the '0' child, the other the '1' child. Each leaf contains a character. An example is the following tree.

. .

/ /

0/ 0/

/ /

c b a d

A tree defines a code. You find the code for a character by following the path from the top (the root) of the tree to that character, writing down the 0's and 1's that you see. For example, in the tree above, the code for 'b' is 01, and the code for 'a' is 10. What is the code for 'c'? A different code is given by the following tree.

b .

a c

where 'b' has code 0 and 'a' has code 10.

A tree can be thought of as a composite character. A tree with leaves a and c stands for the combination of a and c. A character by itself is thought of as a tree that has just one leaf.

Huffman's algorithm

The algorithm to create the tree starts by making each character into a tree that is just a leaf holding that character. Each tree has a frequency, which is the character frequency. For example, if the frequency input is

a 500

b 230

c 300

d 250

e 700

then the algorithm creates five leaves, one holding a, with frequency of 500, one holding b with a frequency of 230, etc.

(Note. A frequency is associated with an entire tree, not with each node of a tree. Initially, each tree has just one node, but as the algorithm runs the trees will grow. You still have just one frequency per tree.)

Now the algorithm repeatedly performs the following, as long as there are still at least two trees left in the collection of trees.

Remove the two trees that have the lowest frequencies from the collection of trees. Suppose that they are trees S and T. Suppose that tree S has frequency FS and tree T has frequency FT.

Build a tree that looks as follows.

S T

Say that this tree has frequency FS + FT, the sum of the frequencies of S and T.

Add the new tree, with its new frequency, to the collection of trees.

The number of trees decreases each time because you remove two trees and put back just one tree. When there is only one tree left, the algorithm is done. That tree is the result.

A notation for trees

It is a little awkward to draw pictures of the trees, so a notation for describing trees is useful. The notation groups the two subrees of a node in parentheses. So a pair of parentheses stands for a nonleaf node in the tree. For example, tree

e c

is written (e c). Tree

b .

a c

is written (b (a c)). Tree

. .

/ /

0/ 0/

/ /

c b a d

is written ((c b) (a d)). We use this notation to demonstrate the algorithm.

'a'

01100001

'b'

01100010

'c'

01100011

Explanation / Answer

package huffmancompression;

import java.io.File;

import java.io.FileNotFoundException;

import java.util.PriorityQueue;

import java.util.Scanner;

import java.util.TreeMap;

// Defines a class for encoding and decoding

public class HuffmanCompression

{

// To store file reading status

static final boolean readFromFile = true;

// To store new text status

static final boolean newTextBasedOnOldOne = false;

// Creates a priority quee node class object

static PriorityQueue<Node> nodes = new PriorityQueue<>((o1, o2) -> (o1.value < o2.value) ? -1 : 1);

// Creates a Tree map class object

static TreeMap<Character, String> codes = new TreeMap<>();

// To store original text

static String data = "";

// To store encoded text

static String encodedData = "";

// To store decoded text

static String decodedData = "";

// To store ASCII values

static int ASCII[] = new int[128];

// main method definition

public static void main(String[] args) throws FileNotFoundException

{

// Opens the file for reading

Scanner scanner = new Scanner(new File("Source.txt"));

// Sets the decision to 1

int decision = 1;

// Loops till decision is not -1

while (decision != -1)

{

// Calls the method to execute the decision

if (executeDecision(scanner, decision))

continue;

// Calls the method to read file contents

decision = readFile(scanner);

}// End of while loop

}// End of method

// Method to read file contents

private static int readFile(Scanner scanner)

{

int data;

// Reads the data from file and converts it into integer

data = Integer.parseInt(scanner.nextLine());

// Checks the reading file status

if (readFromFile)

System.out.println("Decision: " + data);

return data;

}// End of method

// Method to executes the decision type

private static boolean executeDecision(Scanner scanner, int decision)

{

// Checks if the decision is one read the new data from file

if (decision == 1)

{

// Calls the method to deal with new data

if (readNewData(scanner))

return true;

}// End of if condition

// Otherwise checks if decision is two encode the data

else if (decision == 2)

{

// Calls he method to encode data

if (encodingData(scanner))

return true;

}// End of else if condition

// Otherwise checks if decision is three decode the data

else if (decision == 3)

{

// Calls the method to decode the data

decodingData(scanner);

}// End of else if condition

return false;

}// End of method

// Method to read data from file

private static boolean readNewData(Scanner scanner)

{

// Stores the length

int oldTextLength = data.length();

// Reads the data

data = scanner.nextLine();

// Checks for the text validity

if (newTextBasedOnOldOne && (oldTextLength != 0 && !checkSameCharacter()))

{

System.out.println("Not Valid input");

data = "";

return true;

}// End of if condition

// Reset the variables and objects to null

ASCII = new int[128];

nodes.clear();

codes.clear();

encodedData = "";

decodedData = "";

// Displays the data

System.out.println("Data = " + data);

// Calls the method to calculate character Interval

calculateCharInterval(nodes, true);

// Calls the method create tree

createTree(nodes);

// Calls the method to create code

createCodes(nodes.peek(), "");

// Calls the method to display codes

displayCodes();

System.out.println("-- Encoding/Decoding --");

// Calls the method to encode data

encodeData();

// Calls the method to decode data

decodeData();

return false;

}// End of method

// Method to encoding data

private static boolean encodingData(Scanner scanner)

{

// Reads the dta from file

data = scanner.nextLine();

// Displays the data

System.out.println("Data to Encode: " + data);

// Checks for same character

if (!checkSameCharacter())

{

System.out.println("ERROR: Not Valid input");

data = "";

return true;

}// End of if condition

// Calls the method encode data

encodeData();

return false;

}// End of method

// Method to decoding data

private static void decodingData(Scanner scanner)

{

// Reads the data from file

encodedData = scanner.nextLine();

System.out.println("Data to Decode: " + encodedData);

// Calls the method to decode data

decodeData();

}// End of method

// Method to return same character status

private static boolean checkSameCharacter()

{

boolean foundStatus = true;

// Loops till length of the data

for (int i = 0; i < data.length(); i++)

// Checks the current character availability of the character in the ASCII array

if (ASCII[data.charAt(i)] == 0)

{

foundStatus = false;

break;

}// End of if condition

// Returns the flag status

return foundStatus;

}// End of method

// Method to decode the data

private static void decodeData()

{

decodedData = "";

// Peeps a node

Node node = nodes.peek();

// Loops till end of the encoded data

for (int counter = 0; counter < encodedData.length(); )

{

// Creates a temporary node

Node tmpNode = node;

// Checks for leaf node

while (tmpNode.leftNode != null && tmpNode.rightNode != null && counter < encodedData.length())

{

// Checks if the character is '1'

if (encodedData.charAt(counter) == '1')

// Extracts the right node

tmpNode = tmpNode.rightNode;

// Otherwise extracts right node

else

tmpNode = tmpNode.leftNode;

// Increase the loop counter by one

counter++;

}// End of while loop

// Checks if the node is NULL

if (tmpNode != null)

// Checks if the length is one

if (tmpNode.character.length() == 1)

// Concatenates the cahracter with decode data

decodedData += tmpNode.character;

// Otherwise display error message for invalid data

else

System.out.println("ERROR: Not Valid Data");

}// End of for loop

// Displays the decoded data

System.out.println("Decoded Data: " + decodedData);

}// End of method

// Method to encode data

private static void encodeData()

{

encodedData = "";

// Loops till length of the data

for (int counter = 0; counter < data.length(); counter++)

// Concatenates the data with encoded data

encodedData += codes.get(data.charAt(counter));

System.out.println("Encoded Data: " + encodedData);

}// End of method

// Method to create a tree

private static void createTree(PriorityQueue<Node> vector)

{

// Loops till size is greater than one

while (vector.size() > 1)

// Adds the node

vector.add(new Node(vector.poll(), vector.poll()));

}// End of method

// Method to display the code

private static void displayCodes()

{

System.out.println("--- Displaying Codes ---");

codes.forEach((k, v) -> System.out.println("'" + k + "' : " + v));

}// End of method

// Method to claculate character Interval

private static void calculateCharInterval(PriorityQueue<Node> vector, boolean intervalStatus)

{

// Checks the interval status

if (intervalStatus)

System.out.println("*********** Character Intervals ***********");

// Loops till end of the data

for (int counter = 0; counter < data.length(); counter++)

// Stores the character frequency

ASCII[data.charAt(counter)]++;

// Loops till length of the ASCII array

for (int counter = 0; counter < ASCII.length; counter++)

// Checks if the current counter value of the ASCII array is grater than zero

if (ASCII[counter] > 0)

{

vector.add(new Node(ASCII[counter] / (data.length() * 1.0), ((char) counter) + ""));

// Checks the interval status

if (intervalStatus)

// Displays the character and it Interval

System.out.println("'" + ((char) counter) + "' : " + ASCII[counter] / (data.length() * 1.0));

}// End of if condition

}// End of method

// Recursive method to create the code

private static void createCodes(Node node, String s)

{

// Checks if node is not null

if (node != null)

{

// Checks if right node is not null

if (node.rightNode != null)

// Recursively call the method for right child

createCodes(node.rightNode, s + "1");

// Checks if left node is not null

if (node.leftNode != null)

// Recursively call the method for left child

createCodes(node.leftNode, s + "0");

// Checks for leaf node

if (node.leftNode == null && node.rightNode == null)

codes.put(node.character.charAt(0), s);

}// End of if condition

}// End of method

}// End of class

// Class to generate node

class Node

{

// Instance variable to store left and right child node

Node leftNode, rightNode;

// To store number

double value;

// To store the character

String character;

// Parameterized constructor definition

public Node(double value, String character)

{

this.value = value;

this.character = character;

leftNode = null;

rightNode = null;

}// End of parameterized constructor

// Overrides parameterized constructor with left and right child node

public Node(Node left, Node right)

{

this.value = left.value + right.value;

character = left.character + right.character;

// Checks if left chaild value is less than right child value

if (left.value < right.value)

{

this.rightNode = right;

this.leftNode = left;

}// End of if condition

// Otherwise

else

{

this.rightNode = left;

this.leftNode = right;

}// End of else

}// End of method

}// End of class Node

Sample Output:

run:
Data = Ahmed Kamel Taha
*********** Character Intervals ***********
' ' : 0.125
'A' : 0.0625
'K' : 0.0625
'T' : 0.0625
'a' : 0.1875
'd' : 0.0625
'e' : 0.125
'h' : 0.125
'l' : 0.0625
'm' : 0.125
--- Displaying Codes ---
' ' : 101
'A' : 0101
'K' : 0100
'T' : 1000
'a' : 111
'd' : 1001
'e' : 001
'h' : 110
'l' : 000
'm' : 011
-- Encoding/Decoding --
Encoded Data: 0101110011001100110101001110110010001011000111110111
Decoded Data: Ahmed Kamel Taha
Decision: 2
Data to Encode: Rahim
ERROR: Not Valid input
Data to Encode: 3
ERROR: Not Valid input
Data to Encode: 0101110011001100110101001110110010001011000111110111
ERROR: Not Valid input
Data to Encode: 3
ERROR: Not Valid input
Data to Encode: 0100111011001000
ERROR: Not Valid input
Data to Encode: 1
ERROR: Not Valid input
Data to Encode: Taha
Encoded Data: 1000111110111
Decision: -1

Navigate

Data compression algorithms (like gzip) take as input a bit string i of length n

Data compression is a way of finding economical ways to encode information, for

Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.

Data compression is a way of finding economical ways to encode information, for

Question

Explanation / Answer

Related Questions

Navigate