Data deduplication: Reducing storage bloat

Data storage needs continue to grow unabated, straining backup and disaster recovery systems while requiring more online spindles, using more power, and generating more heat. No one expects a respite from this explosion in data growth. That leaves IT professionals to search for technology solutions that can at least lighten the load.

One solution particularly well-suited to backup and disaster recovery is data deduplication, which takes advantage of the enormous amount of redundancy in business data. Eliminating duplicate data can shrink the storage space required by ratios ranging from 10:1 to 50:1 and beyond, depending on the technology used and the level of redundancy. With a little help from data deduplication, admins can reduce costs, lighten backup requirements, and accelerate data restoration in the event of an emergency.
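
To make those ratios concrete, here is a small arithmetic sketch in Python. The function names and the 100TB figure are illustrative assumptions, not numbers from any vendor; the only inputs taken from the text are the 10:1 and 50:1 ratios.

```python
def physical_tb(logical_tb: float, ratio: float) -> float:
    """Storage actually consumed when logical data dedupes at ratio:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return logical_tb / ratio


def savings_percent(ratio: float) -> float:
    """Percentage of space saved at a given ratio:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 100 * (ratio - 1) / ratio


# Hypothetical example: 100TB of backup data before deduplication.
for ratio in (10, 50):
    print(f"{ratio}:1 -> {physical_tb(100, ratio):.0f}TB stored, "
          f"{savings_percent(ratio):.0f}% saved")
```

Note how the returns diminish: moving from 10:1 to 50:1 multiplies the ratio by five but raises the space saved only from 90 percent to 98 percent.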


Deduplication takes several different forms, each with its own approach and optimal role in backup and disaster recovery scenarios. Ultimately, few doubt that data deduplication technology will extend beyond the backup tier and apply its benefits across business storage systems. But first, let’s take a look at why data deduplication has become so attractive to so many organizations.

Too much data, too little time

Duplicated data is strewn all over the enterprise. Files are saved to a file share in the data center, with other copies located on an FTP server facing the Internet, and yet another copy (or two) located in users’ personal folders. Sometimes copies are made as a backup version prior to exporting to another system or updating to new software. Are users good about deleting these extra copies? Not so much.

A classic example of duplicate data is the email blast. It goes like this: Someone in human resources wants to send out the new Internet acceptable use policy PDF to 100 users on the network. So he or she creates an email, addresses it to a mailing list, attaches the PDF, and presses Send. The mail server now has 100 copies of the same attachment in its storage system. Only one copy of the attachment is really necessary, yet with no deduplication system in place, all the copies sit in the mail store taking up space.
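
The email example can be sketched as a tiny single-instance store in Python. This is a minimal illustration under stated assumptions, not how any particular mail server works: the `SingleInstanceStore` class, its methods, and the stand-in PDF bytes are all hypothetical. Each attachment is identified by its SHA-256 digest, so identical content is kept once and every mailbox holds only a reference to it.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical store that keeps one copy of each unique blob."""

    def __init__(self):
        self._blobs = {}  # digest -> the single stored copy
        self._refs = {}   # digest -> number of references to it

    def put(self, data: bytes) -> str:
        """Store data if unseen; return its digest as a reference."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = data
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    @property
    def logical_bytes(self) -> int:
        """Size if every reference held its own full copy."""
        return sum(len(self._blobs[d]) * n for d, n in self._refs.items())

    @property
    def physical_bytes(self) -> int:
        """Size actually stored."""
        return sum(len(b) for b in self._blobs.values())


# Stand-in bytes for the policy PDF, delivered to 100 mailboxes.
policy_pdf = b"acceptable use policy " * 1000
store = SingleInstanceStore()
mailboxes = {f"user{i}": store.put(policy_pdf) for i in range(100)}

print(store.logical_bytes // store.physical_bytes)  # reduction factor
```

Every mailbox ends up with the same digest, and the store holds the attachment's bytes once rather than 100 times. A real system would also need to handle deletion (decrementing reference counts) and persistence, which this sketch omits.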

Server virtualization is another area rife with duplicate data. The whole idea of virtualization is to “do more with less” and maximize hardware utilization by spinning up multiple virtual machines on one physical server. This equates to less hardware expense, lower utility costs, and (hopefully) easier management.

Each virtualized server is contained in a file. For instance, VMware uses a single VMDK (virtual machine disk) file as the virtual hard disk for the virtual machine. As you would expect, VMDK files tend to be rather large — at least 2GB in size, and usually much larger.