Friday, December 25, 2009

Burrows–Wheeler transform

Authors: Michael Burrows and David Wheeler
As some of you might noticed, I haven't posted in quite some time, this is because I'm working on a secret compression algorithm, but I'm willing to share some knowledge about how compression works and what I've implemented or discovered.
I'm writing this post about Burrows–Wheeler transform(a.K.a. block sorting) because it really really really shaked me about how it works.
This block sorting is done before compression so that a lot of repeating bytes/characters in data will get one-after-another, other methods are also used after BWT, like Move-To-Front(which I'm saving for another post... sorry).
More information about history and authors can be found on Wikipedia.
The idea is that you take some data(let's say 256 bytes) encode it and your output should be 257 bytes, why?
input = 256 bytes -encode-> output = 256 bytes + 1 byte for index.
As you read this post you will understand that the more data you encode the bigger the index is but the more memory you need(but not necessarily).
BWT Encoding
Definition: take the data that you wish to encode, create a matrix with first row being the data, for each byte/character in data add a new row with first byte/character to be the second character in precedent row until you have a square matrix(number of columns = number of rows), i.e.
data: "abcde"
constructing the matrix:
abcde <-- the data !! bcdea <-- shifted row from original data(precedent) cdeab <-- shifted row from precedent row deabc eabcd -------- stop! no. columns = no. rows!! -------- Now sort the first column of the matrix in alphabetical order(along with it's rows), in the above example there's no need for sorting! now search in the matrix for the row(first row = ZERO as index) that matches your original data and that row number will tell you the index with which you can reconstruct the original data, in the above example the INDEX = 0 (ZERO). The output is the last column of the matrix "eabcd" and index = 0.
Decoding
The extraordinary thing about this encoding is the decoding, it's out of the box!
So we know the last column of the matrix and the index, we know that first column of the matrix is the encoded data in alphabetical order.
WARNING!! In order to decode you do not need to reconstruct the entire matrix, you just need to sort the encoded data in alphabetical order and put the sorted column next to the encoded one and you get:
a _ _ _ e
b _ _ _ a
c _ _ _ b
d _ _ _ c
e _ _ _ d
Now assign indexes to the left column starting from row zero = 0 and increment it by 1 for each row and you get:
0 a _ _ _ e
1 b _ _ _ a
2 c _ _ _ b
3 d _ _ _ c
4 e _ _ _ d
at this point you just need to assign indexes to the last column so that the index of char "a" from first column will be assigned to the char "a" from last column like so:
0 a _ _ _ e 4
1 b _ _ _ a 0
2 c _ _ _ b 1
3 d _ _ _ c 2
4 e _ _ _ d 3
Now we know:
- first column
- last column
- the index
- the first and last char of the original data "a" and "e"
Decoding is done by starting from last column where the char's index is equal to the index you received along with encoded data.
So index = 0(which is "a", we already know the first char) we go along the row to the first column and we find out that the 2nd char is "b"(1 b <-- a 0), now we know the 2nd char and the index = 1, we go to the last column and where index is 1 and go along the row from right to left and we get the 3rd char "c" along with a new index which is 2, again we go to the last column where index = 3 and go along the row from right to left and find out that the 4th char is "d", again we get a new index which is 3, go to the last column find the the row where index is 3, go along it to the left and find the new index which is not important because we already know the last char but just for the sake of the theory we got the last char which is "e", voia! you have decoded the message, of course the algorithm can be improved by using indexes directly, but first you must understand how it works! Now a real example of the power of BWT. Let's presume you wish to encode "i can what you can", create the matrix:

i can what you can <-- original data can what you cani can what you cani an what you cani c n what you cani ca what you cani can what you cani can hat you cani can w at you cani can wh t you cani can wha you cani can what you cani can what ou cani can what y u cani can what yo cani can what you cani can what you ani can what you c ni can what you ca





After creating the matrix, you have to sort it by first column of each row, in order of appearance of each character that is repeating and the sort result is
can what you cani
 what you cani can
 you cani can what
 cani can what you
an what you cani c
at you cani can wh
ani can what you c
can what you cani 
cani can what you 
hat you cani can w
i can what you can
n what you cani ca
ni can what you ca
ou cani can what y
t you cani can wha
u cani can what yo
what you cani can 
you cani can what

the output is " ntuchc wnaayao "(last two chars of out put is two spaces, copy-paste it" and index = 10(counting first row as 0).
Problem, reconstruct the initial data based on the small example presented above with matrix sorted, index known and everything in front of you, to help you even more I make the indexes also here there are:
0  |  can what you cani | 10
1  |  what you cani can | 11
2  |  you cani can what | 14
3  |  cani can what you | 15
4  | an what you cani c | 7
5  | at you cani can wh | 9
6  | ani can what you c | 8
7  | can what you cani  | 0
8  | cani can what you  | 1
9  | hat you cani can w | 16
10 | i can what you can | 12
11 | n what you cani ca | 4
12 | ni can what you ca | 5
13 | ou cani can what y | 17
14 | t you cani can wha | 6
15 | u cani can what yo | 13
16 | what you cani can  | 2
17 | you cani can what  | 3

Have fun and drop me a comment if you like the post or the sort.

No comments:

Post a Comment

Blogroll(General programming and Delphi feeds)