Experimental Results of

"Sliding Window Update Using Suffix Arrays"

 

 

 

Experimental Setup

  • Experimental tests were carried out on a laptop with a 2 GHz Intel Core2Duo T7300 CPU and 2 GB of RAM, using a single core.

  • Our source code is written in 'C', using Microsoft Visual Studio 2008.

  • The linear time SA construction algorithm SA-IS available here was used.

  • For comparison purposes, we also present the results of the Binary Tree Encoder by Mark Nelson, GZip  and LZMA (LZ Markov chain algorithm), namely the LZMA SDK, version 4.65, released on 3 February 2009.

  • The sources for our code, GZIP, and LZMA were compiled with the same settings.

  • The standard corpora Calgary (18 files, 3 MB) and Silesia (12 files, 211 MB) were used.

                   

    

Calgary Corpus

 

Table 1: Amount of memory, total encoding time (in seconds), and average compression ratio (in bpb), for several lengths of (|dictionary|, |LAB|) on the Calgary Corpus, using “best” compression. GZip “fast” obtains Time=0.5 and bpb=3.20 while GZip “best” does Time=1.2 and bpb=2.79. The best encoding time is underlined.

 

Calgary Corpus

Our SA Algorithm

Binary Tree

LZMA

#

Dict.

LAB

Memory

Time

Bpb

Memory

Time

Bpb

Memory

Time

Bpb

1

2048

1024

24576

2.2

5.77

26636

3.9

5.65

4217856

4.7

2.99

2

4096

1024

43008

2.5

5.40

53260

4.3

4.98

4241408

4.8

2.82

3

4096

2048

48128

2.4

5.75

53260

11.1

5.48

4241408

4.8

2.82

4

8192

2048

84992

3.8

5.49

106508

11.7

4.88

4288512

5.1

2.69

5

16384

256

149760

9.1

4.36

213004

4.5

4.12

4382720

5.2

2.61

6

32768

256

297216

18.4

4.31

425996

5.5

4.08

4571136

4.9

2.54

7

32768

1024

301056

11.1

4.86

425996

7.5

4.40

4571136

4.9

2.54

8

32768

2048

306176

9.5

5.16

425996

15.8

4.57

4571136

4.9

2.54

 

Figure 1: Time-memory trade-off between our SA Algorithm  (A#) and Binary Tree (T#) for the Calgary Corpus, on the 8 encoding tests of Table 1. We include GZip for comparison: Gf is GZip “fast”, Gb is GZip “best”. Our encoders use less memory than the BT-encoder. For some compression settings, they are close to GZip encoding time.

  

 

Silesia Corpus

Table 2: Amount of memory, total encoding time (in seconds) and average compression ratio (in bpb), for several lengths of (|dictionary|, |LAB|) on the Silesia Corpus, using “best” compression. GZip “fast” obtains Time=19.5 and bpb=3.32 while GZip “best” does Time=74.4 and bpb=2.98. The best encoding time is underlined.

 

Silesia Corpus

Our SA Algorithm

Binary Tree

LZMA

#

Dict.

LAB

Memory

Time

Bpb

Memory

Time

Bpb

Memory

Time

Bpb

1

2048

1024

24576

118.7

5.66

26636

249.5

5.65

4217856

333.53

3.05

2

4096

1024

43008

116.9

5.41

53260

303.4

5.25

4241408

349.05

2.90

3

4096

2048

48128

112.9

5.68

53260

694.9

5.63

4241408

349.05

2.90

4

8192

2048

84992

143.4

5.44

106508

668.9

5.27

4288512

356.77

2.76

5

16384

256

149760

319.1

4.55

213004

254.6

4.44

4382720

366.47

2.62

6

32768

256

297216

542.7

4.41

425996

318.1

4.31

4571136

356.34

2.52

7

32768

1024

301056

322.2

4.80

425996

382.6

4.64

4571136

356.34

2.52

8

32768

2048

306176

302.3

5.02

425996

979.8

4.81

4571136

356.34

2.52

 

Figure 2: Time-memory trade-off between our SA Algorithm  (A#) and Binary Tree (T#) for the Silesia Corpus, on the 8 encoding tests of Table 1. We include GZip for comparison: Gf is GZip “fast”, Gb is GZip “best”.

 

 

 

Some Detailed Experiments of our SA encoder

 

Calgary Corpus

 

LZSS SA Algorithm on Calgary Corpus

(Dict.,LAB)=(2048,512)

  ---------------------------------------------------------------------------------------------------------------------

  File                       Original                 Packed                 Time       Bpb    Check

---------------------------------------------------------------------------------------------------------------------

  bib                       111274                     77349                 0.094     5.561  OK

  book1                 768784                   608507                 0.687     6.332  OK

  book2                 610866                   430924                 0.531     5.643  OK

  geo                      102410                  104504                 0.093     8.164  OK

  news                  3  77119                  293007                 0.328     6.216  OK

  obj1                        21514                    16373                 0.031      6.088  OK

  obj2                      246824                  148807                 0.172     4.823  OK

  paper1                   53171                    36477                  0.047     5.488  OK

  paper2                   82209                    58301                  0.078     5.673  OK

  paper3                   46536                    34554                  0.047     5.940  OK

  paper4                   13296                      9770                  0.015      5.878  OK

  paper5                   11964                      8953                  0.015      5.987  OK

  paper6                   38115                    26353                  0.031     5.531  OK

  pic                        513226                    82911                  0.531     1.292  OK

  progc                     39621                    27418                   0.032     5.536  OK

  progl                      71656                    35899                   0.063     4.008  OK

  progp                    49389                    24578                    0.031     3.981  OK

  trans                      93705                    53479                   0.062     4.566  OK

---------------------------------------------------------------------------------------------------------------------

Total Packed  = 2078164

Total Time    = 2.888

Average Time  = 0.160

Average Bpb   = 5.373 

Total Mem.    = 11270 bytes

 

 

 

LZSS SA Algorithm on Calgary Corpus

(Dict.,LAB)=(8192,512)

 ------------------------------------------------------------------------------------------------------------

   File               Original             Packed            Time      Bpb       Check

 ------------------------------------------------------------------------------------------------------------

  bib                  111274               63735              0.078    4.582  OK

  book1             768784             528337             0.797    5.498  OK

  book2             610866             364087             0.578    4.768  OK

  geo                102410              102838             0.110    8.033  OK

  news              377119              259336             0.328    5.501  OK

  obj1                  21514                16741              0.047    6.225  OK

  obj2                246824              140748             0.172    4.562  OK

  paper1             53171               31365              0.047    4.719  OK

  paper2             82209               50346              0.078    4.899  OK

  paper3             46536               30141              0.063    5.182  OK

  paper4             13296                9183               0.016    5.525  OK

  paper5             11964                8577               0.016    5.735  OK

  paper6             38115               22805              0.047    4.787  OK

  pic                 513226               85474               1.407    1.332  OK

  progc               39621               23763               0.031    4.798  OK

  progl                71656                29159               0.063    3.255  OK

  progp               49389               19719               0.047    3.194  OK

  trans                 93705               37327               0.062    3.187  OK

------------------------------------------------------------------------------------------------------------

Total Packed  = 1823681

Total Time    = 3.987

Average Time  = 0.221

Average Bpb   = 4.766 

Total Mem.    = 74762 bytes

 

 

 

Silesia Corpus

LZSS SA Algorithm on Silesia Corpus

(Dict.,LAB)=(2048,512)

  -----------------------------------------------------------------------------------------------------------

   File               Original             Packed            Time                Bpb    Check

 -----------------------------------------------------------------------------------------------------------

dickens            10192448          7627228            9.609               5.987  OK

 mozilla            51220482          30862871          40.703              4.820  OK

 mr                      9970566            6443750           8.406               5.170  OK

 nci                   33553447           9111824           21.718              2.172  OK

 ooffice              6152194           5036342           5.000                6.549  OK

 osdb               10085686           9455648           9.187                7.500  OK

 reymont            6627204           3780591           5.062                4.564  OK

 samba            21606402          11099679          15.219              4.110  OK

 sao                   7251946            7104230           6.672                7.837  OK

 webster          41458705          24381498          33.672              4.705  OK

 xml                    5345282           1665844           3.375                2.493  OK

 x-ray                 8474242            9439387           7.859                8.911  OK

-----------------------------------------------------------------------------------------------------------

Total Packed  = 126008892

Total Time    = 166.482

Average Time  = 13.873

Average Bpb   = 5.402 

Total Mem.    = 11270 bytes

 


 

LZSS SA Algorithm on Silesia Corpus

(Dict.,LAB)=(16384,1024)

 -----------------------------------------------------------------------------------------------------------

   File               Original             Packed            Time                Bpb      Check

 ------------------------------------------------------------------------------------------------------------

 dickens            10192448          6345041           12.734              4.980  OK

 mozilla             51220482          31213367          55.371              4.875  OK

 mr                      9970566            6340350           12.439              5.087  OK

 nci                   33553447           5347453           22.147              1.275  OK

 ooffice             6152194           5171200               5.735               6.724  OK

 osdb              10085686           6253754              6.689               4.960  OK

 reymont            6627204           3041540              6.017               3.672  OK

 samba              21606402          9649370           16.007              3.573  OK

 sao                   7251946            7195505             5.878               7.938  OK

 webster          41458705          20431101          37.436              3.942  OK

 xml                   5345282            1166151             2.714               1.745  OK

 x-ray                 8474242           11548922           6.769               10.903  OK

 ------------------------------------------------------------------------------------------------------------

Total Packed  = 113703754

Total Time    = 189.936

Average Time  = 15.828

Average Bpb   = 4.973 

Total Mem.    = 153102 bytes

 

Canterbury Corpus

 

LZSS SA Algorithm on Canterbury Corpus

(Dict.,LAB)=(8192,1024)

 ------------------------------------------------------------------------------------------------------------

   File               Original             Packed            Time                Bpb   Check

 ------------------------------------------------------------------------------------------------------------

 alice29.txt          152091                96229           0.141               5.062  OK

 arj241a.exe        223858              269397           0.156               9.627  OK

 asyoulik.txt        125181                88615           0.109               5.663  OK

 bible.txt           4047394            2171012           3.297               4.291  OK

 cp.htm                24605                14223           0.016               4.624  OK

 E.coli             4638692            2112222           10.625              3.643  OK

 fields.c                11152                  6393           0.000               4.586  OK

 grammar.lsp          3723                  3046           0.000               6.545  OK

 kennedy.xls    1029746               492467           0.500                3.826  OK

 lcet10.txt          426756               258989           0.375                4.855  OK

 plrabn12.txt      481863               343675           0.469                5.706  OK

 ptt5                   513218                91364           1.297                1.424  OK

 sum                    38242                 26082           0.047                5.456  OK

 world192.txt      2473402          1682806            1.922                5.443  OK

 xargs.1                 4229                   3345           0.000                6.328  OK

 -----------------------------------------------------------------------------------------------------------

Total Packed  = 7659865

Total Time    = 18.954

Average Time  = 1.264

Average Bpb   = 5.139 

Total Mem.    = 79374 bytes

 

 

Main Conclusions

  • Our encoders use less memory than the tree-based encoders.

  • With some compression setting they are also faster than the tree based encoders.

  • To the best of our knowledge, this is the first LZ SA-encoder faster than tree-based encoders.