Tuesday, November 3, 2009

Sun Adds De-Duplication to ZFS

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS - Sun's Zettabyte File System that is at the heart of Sun's Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:
Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Deduplication in ZFS can be Verified

[If you are paranoid about potential "hash collisions"] ZFS provies a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performace of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk -- but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

Jeff Bonwick's Blog, November 2, 2009



What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.


What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPU's available in the mainstream, this problem is very easily resolved. Just like the L2ARC tables consume main memory, the DDT's will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.


When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, "in roughly a month."


Updated: 11/04/2009 - Link to Nexenta corrected. Was incorrectly linked to "nexent.com" - typo - now correctly linked to "http://www.nexenta.com"


No comments:

Post a Comment