KEYWORDS: Sensor fusion, Associative arrays, Computer programming, Data storage, Data archive systems, Data communications, Data processing, Data backup, Data centers, Receivers
Data deduplication is a simple, dictionary based compression method that became very popular in storage
archiving and backup. It has the advantage of direct, random access to any piece ("chunk") of a file in one table
lookup; that's not the case with differential file compression, the other common storage archival method. The
compression efficiency (chunk matching) of deduplication improves for smaller chunk sizes, however the sequence
of hashes replacing the deduplicated object (file) increases significantly. Within the sequence of chunks that an
object is decomposed, sub-sequences of adjacent chunks tend to repeat. We exploit this insight first in an online
scheme used to reduce the amount of hash metadata generated. With each newly created entry in the chunk
repository we add a "chronological" pointer linking this entry with the next new entry, in time order. When the
hashes produced by the chunker follow the chronological pointers we encode them as a "sequence of hashes" by
specifying the first hash in the sequence and the length of the sequence. The shrinkage is orders of magnitude
smaller than what a customary compression algorithm (gzip) achieves and has a significant impact on the overall
deduplication efficiency when relatively small chunks are used. A second scheme is also introduced that optimizes
the chunk sizes by joining repeated sub-sequences of small chunks into new "super chunks" with the constraint
to achieve practically the same matching performance. We employ suffix arrays to find these repeating subsequences
and to determine a new encoding that covers the original sequence. As a result, fewer chunks are used
to represent a file, reducing fragmentation i.e. the number of disk accesses needed to reconstruct the file, and
requiring fewer entries in the chunk dictionary and fewer hashes to encode a file. As the experimental results
show, this method provides over 10 time reduction in fragmentation and over 5 times reduction in the number
of entries in repository while achieving similar or slightly better overall deduplication ratios.
There are applications (such as Internet search engines) where short textual strings, for example abstracts or pieces of Web pages, need to be compressed independently of each other. The usual adaptive compression algorithms perform poorly on these short strings due to the lack of necessary data to learn.
In this manuscript, we introduce a compression algorithm targeting short text strings; e.g., containing a few hundred symbols. We also target natural language insensitivity, to facilitate its robust compression and fast implementation. The algorithm is based on the following findings. Applying the move-to-front transform (MTFT) after the Burrows-Wheeler transform (BWT) brings the short textual strings to a "normalized form" where the distribution of the resulting "ranks" has a shape similar over the set of natural language strings. This facilitates the use of a static coding method with few variations, which we call shortBWT, where no on-line learning is needed, to encode the ranks. Finally, for short strings, shortBWT runs very fast because the strings fit into the cache of most current computers.
The introduction for this paper will review the mathematical bases of BWT and MTF, it will also review our recently published metric for rapidly pre-characterizing the compressibility of such short textual strings when using these transforms.
Access to the requested content is limited to institutions that have purchased or subscribe to SPIE eBooks.
You are receiving this notice because your organization may not have SPIE eBooks access.*
*Shibboleth/Open Athens users─please
sign in
to access your institution's subscriptions.
To obtain this item, you may purchase the complete book in print or electronic format on
SPIE.org.
INSTITUTIONAL Select your institution to access the SPIE Digital Library.
PERSONAL Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.