Courtesy of Data Center Knowledge
Jeff Tofano, the chief technology officer at SEPATON, Inc., has more than 30
years of experience in the data protection, storage and
high-availability industries. He leads SEPATON's technical
direction, addressing the data protection needs of large
enterprises.
IT managers in big data environments are aware of the important
role that deduplication plays in their ongoing struggle to back up,
replicate, retain and restore massive, fast-growing data volumes.
However, few are aware that inline/hash-based deduplication
technologies are not able to efficiently deduplicate multi-stream
database data, multiplexed data or data in progressive incremental
backup environments. Understanding the limitations of
inline/hash-based deduplication and the impact of these limitations
on data protection can save "big data" organizations hundreds of
thousands of dollars annually in disk capacity, power and
cooling.
Hash vs Content Aware Deduplication
While several different deduplication technologies are suitable
for smaller data volumes, IT managers in big data environments have
two broad categories of deduplication to choose from:
inline/hash-based deduplication and content aware byte differential
deduplication.
Inline/hash-based technologies are designed to
find matches in data before it is written to disk. They analyze
segments of data as they are being backed up and assign a unique
identifier called a fingerprint to each segment. Most of these
technologies use an algorithm that computes a cryptographic hash
value from a fixed or variable segment of data in the backup
stream, regardless of the data type. The fingerprints are stored in
an index. As each backup is performed, the fingerprints of incoming
data are compared to those already in the index. If the fingerprint
exists in the index, the incoming data is replaced with a pointer
to the existing data. If the fingerprint does not exist, the data is
written to the disk as a new unique chunk. The fingerprint assignment,
index lookup, and pointer replacement steps must all be performed
before the data is written to disk. To contain the size of the index,
inline/hash-based deduplication technologies are purposely designed
for small-to-medium sized enterprises whose data volumes and change
rates are small enough to be deduplicated without causing a
bottleneck in backup "ingest" performance.
And even in these solutions designed for smaller data
sets, most hash-based technologies rely on large duplicate chunk
sizes and ignore small duplicates to achieve reasonable
performance.
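The fingerprint, index lookup and pointer replacement steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's implementation: the fixed 8-byte chunk size, the in-memory dictionary used as the index, and the names dedupe_inline and restore are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 8  # bytes; unrealistically small so the example is easy to follow


def dedupe_inline(stream: bytes, index: dict, store: list) -> list:
    """Deduplicate `stream` chunk by chunk before anything is "written".

    `index` maps a fingerprint to the position of its chunk in `store`.
    Returns a recipe: the list of store positions that rebuilds the stream.
    """
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        position = index.get(fingerprint)
        if position is None:
            # Unknown fingerprint: store the chunk as new unique data.
            store.append(chunk)
            position = len(store) - 1
            index[fingerprint] = position
        # Known or new, the backup itself only keeps a pointer.
        recipe.append(position)
    return recipe


def restore(recipe: list, store: list) -> bytes:
    """Reassemble a backup by following its pointers into the chunk store."""
    return b"".join(store[position] for position in recipe)


index, store = {}, []
backup1 = b"AAAAAAAABBBBBBBBCCCCCCCC"
backup2 = b"AAAAAAAAXXXXXXXXCCCCCCCC"  # only the middle chunk changed
recipe1 = dedupe_inline(backup1, index, store)
recipe2 = dedupe_inline(backup2, index, store)
```

In this sketch the second backup adds only one new chunk to the store, but every chunk of every backup still costs a hash computation and an index lookup during ingest, which is the step that becomes a bottleneck as the index grows.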
Content aware technologies (including SEPATON's
ContentAware byte differential deduplication) work in a
fundamentally different way. These schemes extract metadata from
the incoming data stream and use it to identify duplicate data.
They then analyze this small subset of data that contains
duplicates at the byte level for optimal capacity reduction. The
deduplication process is performed outside the ingest operation
(concurrent processing) so that it does not slow backup or restore
processes. Because there is no index, and because the analysis of
suspect duplicates can be done in parallel, these technologies are
able to scale processing across multiple nodes and to scale capacity
to store tens of petabytes in a single system with
deduplication.
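As a rough illustration of the idea, the sketch below uses metadata to pick a likely earlier version of an object and then compares the two versions at the byte level. It is not SEPATON's algorithm: the catalog keyed by host and object name is hypothetical, and Python's standard difflib.SequenceMatcher is only a convenient stand-in for a real byte differential engine.

```python
from difflib import SequenceMatcher


def byte_delta(reference: bytes, incoming: bytes) -> list:
    """Describe `incoming` as copies from `reference` plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, reference, incoming, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # bytes already stored
        elif j2 > j1:
            ops.append(("data", incoming[j1:j2]))  # genuinely new bytes
        # "delete" opcodes contribute nothing to the new version
    return ops


def apply_delta(reference: bytes, ops: list) -> bytes:
    """Rebuild the incoming version from the reference and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical metadata catalog: (host, object name) -> previous version.
catalog = {("db01", "accounts"): b"customer=42;balance=100;status=ok"}

incoming = b"customer=42;balance=175;status=ok"
reference = catalog[("db01", "accounts")]  # metadata selects the comparison
ops = byte_delta(reference, incoming)
literal_bytes = sum(len(op[1]) for op in ops if op[0] == "data")
```

Because the metadata narrows the comparison to one likely match, the expensive byte-level work runs only on that pair, and pairs for different objects could be processed independently on different nodes.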
Poor Capacity Reduction for Databases and Progressive Incremental Backup Environments
Databases, such as Oracle, SAP, DB2, and Exchange, as well as
data streams found in big data environments, typically change
data in segments of 8 KB or smaller. This granularity of storing
structured and semi-structured data poses a significant problem for
inline deduplication technologies simply because the granularity of
change is too small for them to deal with effectively. Also,
because hash-based schemes are unaware of content and data types,
there is no way to control what data is compared against what -
every incoming chunk is compared against the entire index
regardless of the probability of duplication. Most importantly, in
these technologies, examining data in sections smaller than 8 KB
typically causes a severe performance bottleneck and prohibits use
of all capacity provided. As a result, hash-based schemes leave a
large volume of duplicate data from critical databases and
analytical tools completely unhandled.
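The effect of chunk size on small, scattered changes can be shown with a toy experiment. The sketch below is an assumption-laden illustration, not a benchmark: it builds a synthetic 1 MiB "database" of 8 KB pages, changes a single byte in every eighth page, and measures how much of the second backup a fixed-size hash-based scheme would recognize as duplicate at two chunk sizes.

```python
import hashlib

PAGE = 8 * 1024  # 8 KB, the change granularity discussed above


def make_pages(count: int) -> list:
    """Build `count` distinct, deterministic 8 KB pages."""
    return [hashlib.sha256(str(i).encode()).digest() * (PAGE // 32)
            for i in range(count)]


def duplicate_fraction(old: bytes, new: bytes, chunk_size: int) -> float:
    """Fraction of `new` whose fixed-size chunks already exist in `old`."""
    known = {hashlib.sha256(old[o:o + chunk_size]).digest()
             for o in range(0, len(old), chunk_size)}
    duplicate_bytes = 0
    for o in range(0, len(new), chunk_size):
        chunk = new[o:o + chunk_size]
        if hashlib.sha256(chunk).digest() in known:
            duplicate_bytes += len(chunk)
    return duplicate_bytes / len(new)


pages = make_pages(128)                  # 128 pages of 8 KB = 1 MiB
old_backup = b"".join(pages)

changed = list(pages)
for i in range(0, len(changed), 8):      # touch one page in every eight
    page = changed[i]
    changed[i] = bytes([page[0] ^ 0xFF]) + page[1:]  # flip a single byte
new_backup = b"".join(changed)

fine = duplicate_fraction(old_backup, new_backup, PAGE)         # 8 KB chunks
coarse = duplicate_fraction(old_backup, new_backup, 64 * 1024)  # 64 KB chunks
```

With 8 KB chunks, seven of every eight pages still match. With 64 KB chunks, every chunk contains one changed page, so nothing matches even though only 16 bytes out of 1 MiB differ. Shrinking the chunk size recovers the duplicates but multiplies the number of fingerprints the index must hold and search.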
The sub-8 KB limitation of hash-based deduplication is also a
problem in the progressive incremental backup environments commonly
used in big data enterprises, including non-file backups, TSM
progressive incremental backups and backups from applications that
fragment their data, such as NetWorker and HP Data Protector. Just as
in database deduplication, inline/hash-based deduplication
technologies cannot examine data in these common big data backup
environments at a fine enough level of granularity, which reduces
capacity reduction efficiency.
The "triage" approach used by ContentAware technologies enables
them to focus their process-intensive deduplication examination on
the subset of data that contains duplicates and to examine that
data at the individual byte level for maximum efficiency.
Scalability Limitations
Another limitation of hash-based deduplication is the lack of
scalability. Providing a single global coherent index across
multiple nodes is extremely hard to do and beyond the abilities of
all current hash-based engines. Relatedly, hash-based schemes that
support multiple federated indexes further exacerbate the issue of
unhandled regions of duplicate data.
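A small sketch shows why separate indexes leave duplicates unhandled. The node names and chunk contents below are made up for illustration, and the function stored_chunks is a hypothetical helper, not part of any product.

```python
import hashlib


def stored_chunks(streams_by_node: dict, shared_index: bool) -> int:
    """Count the chunks written to disk across all nodes.

    With shared_index=True every node consults one global index;
    otherwise each node only knows the fingerprints it has seen itself.
    """
    global_index = set()
    stored = 0
    for node, chunks in streams_by_node.items():
        index = global_index if shared_index else set()
        for chunk in chunks:
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in index:
                index.add(fingerprint)
                stored += 1
    return stored


# Two backup nodes receive overlapping data.
streams = {
    "node-a": [b"os-image", b"db-page-1", b"db-page-2"],
    "node-b": [b"os-image", b"db-page-1", b"db-page-3"],
}

with_global_index = stored_chunks(streams, shared_index=True)
with_federated_indexes = stored_chunks(streams, shared_index=False)
```

In this example a global index stores four unique chunks, while per-node indexes store six, because neither node can see that the other already holds the shared data.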
Big data enterprises need to move massive and fast-growing data
volumes to the safety of a backup environment within a fixed backup
window. Without the ability to scale, hash-based deduplication
technologies force big data environments to create "sprawl" by
dividing backups onto numerous individual, single-node backup
systems. Introducing each of these systems requires significant
load balancing and adjustment. The big data enterprise then has
multiple, often dozens of, individual systems that need ongoing
tuning, upgrading and maintenance, as well as space, power and cooling
in the data center. Fragmenting the backup among these systems
results in inherently less efficient deduplication and in overbuying
when new systems are added for performance before added capacity is
needed.
As discussed above, these inline hash-based solutions
are not scalable and cannot handle large data volumes efficiently.
Restore performance is also a challenge for hash-based technologies,
as they have a single node to perform the compute-intensive tasks
of reassembling the most recent backup while it continues to
perform all backup, deduplication, and replication processes.
Unpredictable performance makes restore time objectives, including
those for vaulting to tape, difficult to achieve with any measure of
confidence. ContentAware byte differential deduplication, in contrast,
keeps a fully hydrated copy of the most recent backup data intact
for immediate restores/tape vaulting and can apply as many as eight
processing nodes simultaneously to all data protection processes to
sustain deterministic high performance.
Replication Challenges
Most hash-based deduplication technologies enable efficient
replication of data across a WAN by engaging in complex fingerprint
negotiations in an attempt to avoid sending duplicate data. The
fingerprint negotiation phase often generates significant transfer
latency and resultant "dead-time" on the wire. Consequently, many
hash-based replication schemes don't run faster than
non-deduplicated schemes unless the deduplication rate is very high.
They struggle, and often run significantly slower, when
database data or other high-change rate data is replicated. Content
aware byte differential technologies solve these replication
problems by streaming delta requests to target systems. Data is
only pulled from the source system when the delta rules can't be
applied, effectively minimizing the amount of data transferred,
while also effectively utilizing the full bandwidth of the wire and
avoiding costly latency gaps.
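The trade-off between negotiation latency and bytes saved can be illustrated with a deliberately simple timing model. Everything in it is an assumption chosen for the example: one round trip of idle time per batch of fingerprints, a 10 GiB backup, 4 MiB batches, a 20 ms round trip and a link that moves 125 MB per second. It ignores the size of the fingerprints themselves and does not describe any particular product.

```python
import math


def plain_seconds(total_bytes: float, bandwidth: float) -> float:
    """Send everything with no deduplication and no negotiation."""
    return total_bytes / bandwidth


def negotiated_seconds(total_bytes: float, duplicate_fraction: float,
                       batch_bytes: float, rtt: float,
                       bandwidth: float) -> float:
    """Toy model: each batch waits one round trip for the target to say
    which fingerprints it lacks, then only the missing data is sent."""
    batches = math.ceil(total_bytes / batch_bytes)
    payload = total_bytes * (1.0 - duplicate_fraction)
    return batches * rtt + payload / bandwidth


# Hypothetical parameters, chosen only for illustration.
TOTAL = 10 * 2**30       # 10 GiB backup
BATCH = 4 * 2**20        # 4 MiB of data described per negotiation round
RTT = 0.020              # 20 ms WAN round trip
BANDWIDTH = 125e6        # bytes per second (roughly a 1 Gb/s link)

baseline = plain_seconds(TOTAL, BANDWIDTH)
high_dedup = negotiated_seconds(TOTAL, 0.95, BATCH, RTT, BANDWIDTH)
low_dedup = negotiated_seconds(TOTAL, 0.20, BATCH, RTT, BANDWIDTH)
```

Under these assumptions the negotiated transfer beats the plain transfer when 95 percent of the data is duplicate, but loses to it when only 20 percent is, because the idle round trips cost more time than the skipped bytes save.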
For large enterprise organizations with big data environments to
protect, many deduplication solutions that are well-suited to
smaller environments fall short. The sheer volume and overall
complexity of the big data backup environment require a higher
level of deduplication efficiency, performance, scalability and
flexibility than hash-based deduplication technologies can deliver.
Content aware deduplication technologies that were designed
specifically for large, data-intensive environments offer a more
efficient and cost-effective alternative.
Industry Perspectives is a content channel at Data Center
Knowledge highlighting thought leadership in the data center arena.