Wednesday, December 30, 2009

Looking at memory prices one last time before the year is out, we see prices of our "benchmark" Kingston DDR3 server DIMMs on the decline. While the quad rank 8G DDR3/1066 DIMMs are below the $565 target price (at $514) we predicted back in August, the dual rank equivalents (on our benchmark list) are still hovering around $670 each. Likewise, while retail prices on the 8G DDR2/667 parts continue to rise, inventory and promotional pricing have managed to keep them flat at $433 each, giving large-footprint DDR2 systems a $2,000 price advantage (based on 64GB systems).
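
For a quick sanity check, here is the arithmetic behind that DDR2 advantage as a minimal Python sketch, using the December spot prices from the tables below and assuming eight 8GB DIMMs per 64GB system:

```python
# Spot-price math behind the "$2,000 price advantage" claim (December prices below).
ddr2_8gb = 433.00     # KVR667D2D4P5/8G
ddr3_8gb = 667.00     # KVR1066D3D4R7S/8G
print(8 * (ddr3_8gb - ddr2_8gb))   # 1872.0 - roughly the $2,000 cited
```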

Benchmark Server (Spot) Memory Pricing - Dual Rank DDR2 Only

| DDR2 Reg. ECC Series (1.8V) | Price Jun '09 | Price Sep '09 | Price Dec '09 |
| --- | --- | --- | --- |
| KVR800D2D4P6/4G - 4GB 800MHz DDR2 ECC Reg with Parity CL6 DIMM Dual Rank, x4 (5.400W operating) | $100.00 | $117.00 (up 17%) | $140.70 (up 23%; promo price, retail $162) |
| KVR667D2D4P5/4G - 4GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (5.940W operating) | $80.00 | $103.00 (up 29%) | $97.99 (down 5%; retail $160) |
| KVR667D2D4P5/8G - 8GB 667MHz DDR2 ECC Reg with Parity CL5 DIMM Dual Rank, x4 (7.236W operating) | $396.00 | $433.00 | $433.00 (flat; promo price, retail $515) |

Benchmark Server (Spot) Memory Pricing - Dual Rank DDR3 Only

| DDR3 Reg. ECC Series (1.5V) | Price Jun '09 | Price Sep '09 | Price Dec '09 |
| --- | --- | --- | --- |
| KVR1333D3D4R9S/4G - 4GB 1333MHz DDR3 ECC Reg w/Parity CL9 DIMM Dual Rank, x4 w/Therm Sen (3.960W operating) | $138.00 | $151.00 (up 10%) | $135.99 (down 10%) |
| KVR1066D3D4R7S/4G - 4GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (5.085W operating) | $132.00 | $151.00 (up 15%) | $137.59 (down 9%; retail $162) |
| KVR1066D3D4R7S/8G - 8GB 1066MHz DDR3 ECC Reg w/Parity CL7 DIMM Dual Rank, x4 w/Therm Sen (4.110W operating) | $1035.00 | $917.00 (down 11.5%) | $667.00 (down 28%; avail. 1/10) |

As the year ends, OEMs are expected to "pull up inventory," according to DRAMeXchange, in advance of a predicted market shortfall somewhere in Q2/2010. Demand for greater memory capacities is being driven by Windows 7 and 64-bit processors, with 4GB the well-established minimum system footprint ending 2009. With Server 2008 systems demanding 6GB+ and an increased shift towards large-memory-footprint virtualization servers and blades, the market price for DDR3 - just turning the corner in Q1/2010 versus DDR2 - will likely flatten based on growing demand.

SOLORI's Take: With Samsung and Hynix doubling CAPEX spending in 2010, we'd be surprised to see anything more than a 30% drop in retail 4GB and 8GB server memory by Q3/2010 given the anticipated demand. That puts 8G DDR3/1066 at $470/stick versus $330 for 2x 4GB - on track with our August 2009 estimates. The increase in compute, I/O and memory densities in 2010 will be market changing, and memory demand will play a small (but significant) role in that development.

In the battle to "feed" the virtualization servers of 2H/2010, the 4-channel "behemoth" Magny-Cours system could have a serious memory/price advantage with 8-DIMM-per-socket (2-DPC) or 12-DIMM-per-socket (3-DPC) configurations of 64GB (2.6GB/thread) and 96GB (3.9GB/thread) DDR3/1066 using only 4GB sticks (assuming a 2P configuration). Similar GB/thread loads on Nehalem-EP6 "Gulftown" (6-core/12-thread) could be had with 72GB DDR3/800 (18x 4GB, 3-DPC) or 96GB DDR3/1066 (12x 8GB, 2-DPC), leaving the solution architect to choose between a performance crunch (memory bandwidth) and a price crunch (about $2,900 more). This means Magny-Cours could show a $2-3K price advantage (per system) versus Nehalem-EP6 in $/VM optimized VDI implementations.

Where the rubber starts to meet the road, from a virtualization context, is with the (unannounced) Nehalem-EP8 (8-core/16-thread), which would need 96GB (12x 8GB, 2-DPC) to maintain 2.6GB/thread parity with Magny-Cours. This creates a memory-based price differential - in Magny-Cours' favor - of about $3K per system/blade in the 2P space. At the high end (3.9GB/thread), the EP8 system would need a full 144GB (running at DDR3/800 timing) to maintain GB/thread parity with 2P Magny-Cours - this creates a $5,700 system price differential and possibly a good reason why we'll not actually see an 8-core/16-thread variant of Nehalem-EP in 2010.
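
For the curious, the capacity arithmetic behind these DIMM counts can be sketched as follows. This is our own back-of-the-envelope math in Python; the channel counts (4 per Magny-Cours socket, 3 per Nehalem-EP socket) and the EP8 part itself are assumptions, not vendor data:

```python
# GB/thread capacity math for the 2P configurations discussed above (a sketch only).

def required_gb(threads, gb_per_thread):
    """Raw capacity needed to hit a GB/thread target."""
    return threads * gb_per_thread

def balanced_config(sockets, channels_per_socket, dpc, dimm_gb):
    """DIMM count and capacity of an evenly populated configuration."""
    dimms = sockets * channels_per_socket * dpc
    return dimms, dimms * dimm_gb

# 2P Magny-Cours (24 threads), 4GB DDR3/1066 sticks:
print(required_gb(24, 2.6), balanced_config(2, 4, 2, 4))   # ~62GB -> 16x 4GB (2-DPC) = 64GB
print(required_gb(24, 3.9), balanced_config(2, 4, 3, 4))   # ~94GB -> 24x 4GB (3-DPC) = 96GB

# Hypothetical 2P Nehalem-EP8 (32 threads), 8GB DDR3 sticks:
print(required_gb(32, 2.6), balanced_config(2, 3, 2, 8))   # ~83GB -> 12x 8GB (2-DPC) = 96GB
print(required_gb(32, 3.9), balanced_config(2, 3, 3, 8))   # ~125GB -> 18x 8GB (3-DPC) = 144GB
```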

Assuming that EP8 has 30% greater thread capacity than Magny-Cours (32-threads versus 24-threads, 2P system), a $5,700 difference in system price would require a 2P Magny-Cours system to cost about $19,000 just to make it an even value proposition. We'd be shocked to see a MC processor priced above $2,600/socket, making the target system price in the $8-9K range (24-core, 2P, 96GB DDR3/1066). That said, with VDI growth on the move, a 4GB/thread baseline is not unrealistic (4 VM/thread, 1GB per virtual desktop) given current best practices. If our numbers are conservative, that's a $100 equipment cost per virtual desktop - about 20% less than today's 2P equivalents in the VDI space. In retrospect, this realization makes VMware's decision to license VDI per-concurrent-user and NOT per socket a very forward-thinking one!
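
The even-value and $/desktop arithmetic above reduces to a few lines; the system price range and the 4-VM-per-thread density are taken from the paragraph itself, not from any vendor quote:

```python
# Even-value check: a ~30% thread advantage only balances a $5,700 memory premium
# if the baseline Magny-Cours system costs about $5,700 / 0.30.
print(round(5700 / 0.30))               # ~19,000

# VDI $/desktop on a 2P, 24-thread Magny-Cours host at roughly $8,500-9,000:
desktops = 24 * 4                       # 4 desktop VMs per thread at 1GB each = 96 desktops
for system_cost in (8500, 9000):
    print(round(system_cost / desktops))    # ~89-94, i.e. roughly $100 per virtual desktop
```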

Of course, we're talking about rack servers and double-size or non-standard blades here: after all, where can we put 24 DIMM slots (2P, 3-DPC, 4-channel memory) on an SFF blade? Vendors will have a hard enough time with 8-DIMM-per-processor (2P, 2-DPC, 4-channel memory) configurations today. Plus, all that dense compute and I/O will need to get out of the box somehow (10GE, IB, etc.). It's easy to see that HPC and virtualization platform demands are converging, and we think that's good for both markets.

SOLORI's 2nd Take: Why does the 8GB stick require less power than the 4GB stick at the same speed and voltage? The 4GB stick is based on 36x 256M x 4-bit DDR3-1066 FBGAs (60nm) and the 8GB stick is based on 36x 512M x 4-bit DDR3-1066 FBGAs (likely 50nm). According to Samsung, the smaller feature size offers nearly 40% improvement in power consumption (per FBGA). Since the sticks use the same number of FBGA components (1Gb versus 2Gb devices), the roughly 20% module-level power savings seems reasonable.
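
A quick module-level check using the operating specs from the DDR3 table above; the per-FBGA figures are simply the module power split evenly across 36 devices:

```python
p_4gb = 5.085    # W, KVR1066D3D4R7S/4G (36x 1Gb, 60nm FBGAs)
p_8gb = 4.110    # W, KVR1066D3D4R7S/8G (36x 2Gb, likely 50nm FBGAs)

print(round(p_4gb / 36 * 1000), round(p_8gb / 36 * 1000))   # ~141mW vs ~114mW per FBGA
print(round((p_4gb - p_8gb) / p_4gb * 100, 1))              # ~19.2% module-level savings
```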

The prospect of lower power at higher memory densities will drive additional market share to modules based on 2Gb DRAM devices. The gulf between DDR2 and DDR3 will continue to expand as tooling shifts to majority-DDR3 production. While minority leader Hynix announced a 50nm 2Gb DDR2 part earlier this year (2009), the chip giant Samsung continues to use 60nm for its 2Gb DDR2. Recently, Hynix announced successful validation of its 40nm-class 2Gb DDR3 part operating at 1333MHz and saving up to 40% power compared to the 50nm design. Similarly, Samsung is leading the DRAM arms race with 30nm-class, 4Gb DDR3 production, which will show up in 1.35V, 16GB UDIMMs and RDIMMs in 2010, offering additional power savings over 40-50nm designs. Meanwhile, Samsung has all but abandoned advances on DDR2 feature sizes.

The writing is on the wall for DDR2 systems: unit costs are rising, demand is shrinking, research is stagnant and a new wave of DDR3-based hardware is just over the horizon (1H/2010). These factors will show the door to DDR2-based systems (which enjoyed a brief resurgence in 2009 due to DDR3 supply problems and marginal power differences) as demand and DDR3's advantages heat up later in 2010 - kudos to AMD for calling the adoption curve spot on!

Saturday, December 5, 2009

vSphere, Hardware Version 7 and Hot Plug

VMware's vSphere added hot plug features in hardware version 7 (first introduced in VMware Workstation 6.5) that were not available in the earlier version 4 virtual hardware. Virtual hardware version 7 adds the following new features to VMware virtual machines:

  • LSI SAS virtual device - provides support for Windows Server 2008 fail-over cluster configurations

  • Paravirtual SCSI devices - recently updated to allow booting, these can deliver higher performance (greater throughput and lower CPU utilization) than the standard virtual SCSI adapter, especially in SAN environments where I/O-intensive applications are used. Currently supported in Windows Server 2003/2008 and Red Hat Enterprise Linux 5 - although any version of Linux could be modified to support PVSCSI.

  • IDE virtual device - useful for older OSes that don't support SCSI drivers

  • VMXNET 3 - next-generation VMXNET device with enhanced performance and enhanced networking features.

  • Hot plug virtual devices, memory and CPU - supports hot add/remove of virtual devices, memory and CPU for supported OSes.


While the "upgrade" process from version 4 to version 7 is well-known, some of the side effects are not well publicised. The most obvious change after the migration from version 4 to version 7 is the effect hot plug has on the PCI bus adapters - some are now hot plug by default, including the network adapters!

[caption id="attachment_1342" align="aligncenter" width="417" caption="Safe to remove network adapters. Really?"]Safe to remove network adapters. Really?[/caption]

Note that the above example also demonstrates that the updated hardware re-enumerates the network adapters (see #3 and #4) because they have moved to a new PCI bus - one that supports hot plug. Removing the "missing" devices requires a trip to Device Manager (set devmgr_show_nonpresent_devices=1 in your shell environment first). This hot plug PCI bus also allows an administrator to mistakenly remove the device from service - potentially disconnecting tier 1 services from operations (totally by accident, of course).

[caption id="attachment_1340" align="aligncenter" width="450" caption="Devices that can be added while the VM runs with hardware version 4"]Devices that can be added while the VM runs with hardware version 4[/caption]

In virtual hardware version 4, only SCSI devices and hard disks were allowed to be added to a running virtual machine. Now with hardware version 7,

[caption id="attachment_1341" align="aligncenter" width="450" caption="Devices that can be added while the VM runs with hardware version 7"]Devices that can be added while the VM runs with hardware version 7[/caption]

additional devices (USB and Ethernet) are available for hot add. You could change memory and CPU on the fly too, if the OS supports that feature and they are enabled in the virtual machine properties prior to running the VM:

[caption id="attachment_1339" align="aligncenter" width="450" caption="CPU and Memory Hot Plug Properties"]CPU and Memory Hot Plug Properties[/caption]

The hot plug NIC issue isn't discussed in the documentation, but Carlo Costanzo at VMwareInfo.com passes on Chris Hahn's great tip to disable hot plug behaviour in his blog post, complete with visual aids. The key is to add a new "Advanced Configuration Parameter" to the virtual machine configuration: the new parameter is called "devices.hotplug" and its value should be set to "false." However, adding this parameter requires the virtual machine to be turned off, so it is currently an off-line fix.

Monday, November 30, 2009

Quick Take: VirtualBox adds Live Migra... uh, Teleportation

Sun announced the 3.1.0 release of its desktop hypervisor - VirtualBox - with its own take on live virtual machine migration between hosts, called "teleporting." Teleporting, according to the user's manual, is defined as:
"moving a virtual machine over a network from one VirtualBox host to another, while the virtual machine is running. This works regardless of the host operating system that is running on the hosts: you can teleport virtual machines between Solaris and Mac hosts, for example."

Teleportation operates like an in-place replacement of a VM's facilities, requiring that the "target" host have a virtual machine defined in VirtualBox with exactly the same hardware settings as the "source" VM. The source and target VMs must also share the same storage: they must use either the same VirtualBox-accessible iSCSI targets or some other network storage (NFS or SMB/CIFS) - and have no snapshots.
"The hosts must have fairly similar CPUs. While VirtualBox can simulate some CPU features to a degree, this does not always work. Teleporting between Intel and AMD CPUs will probably fail with an error message."

The recipe for teleportation begins on the target and is given in an example, leveraging VirtualBox's VBoxManage command syntax:
VBoxManage modifyvm <targetvmname|uuid> --teleporter on --teleporterport <port>

On the source, the running virtual machine is instructed to teleport with the following:
VBoxManage controlvm <sourcevmname|uuid> teleport --host <targethost> --port <port>

For testing, same-host teleportation is allowed (source and target equal loopback). Obviously, a preparation and clean-up script would be involved to copy the settings to the target location, manage the teleport and clean up the former VM configuration made obsolete by the teleportation. In the case of an error, the running VM stays running on the source host, and the target VM fails to initialize.

SOLORI's Take: This represents the writing on the wall for VMware and vMotion. Perhaps the shift from VMotion to vMotion telegraphs the reduced value VMware already sees in the "now standard" feature. Adding vMotion to vSphere Essentials and Essentials Plus would garner a lot of adoption from the SMB market that is moving quickly to Hyper-V over Citrix and VMware. With VirtualBox's obvious play in desktop virtualization - where minimalist live migration features would be less of a burden - VMware's market could quickly become divided in 2010 with some crafty third-party integration along with open VDI. It's a ways off, but the potential is there...

VMware View 4, Current Certified HCL

Given the recent release of VMware View 4.0, we thought it would be handy to showcase the current state of the View "certified" HCL for "hardware" thin clients. As of November 30, 2009, the following hardware thin clients are "officially" on VMware's HCL:

| OEM | Model | OS Variant | Certified For | Compatible With | Supports PCoIP | Unit Cost (Est. $) |
| --- | --- | --- | --- | --- | --- | --- |
| Dell | OptiPlex FX160 | Windows XPe SP2 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $512 |
| DevonIT | TC5 | Windows Embedded Standard 2009 | View 4.0, View 3.1 | View 3.0, VDM 2.1, VDM 2.0 | Y | $299 |
| HP | GT7720 | Windows Embedded Standard | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $799 |
| HP | t5630 | Windows XPe SP3 | View 4.0, View 3.1 | View 3.0, VDM 2.1, VDM 2.0 | Y | $632 |
| HP | t5630W | Windows Embedded Standard | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $440 |
| HP | t5720 | Windows XPe SP3 | View 4.0, View 3.1 | View 3.0, VDM 2.1, VDM 2.0 | Y | $410 (refurbished) |
| HP | t5730 | Windows XPe SP3 | View 4.0, View 3.1 | View 3.0, VDM 2.1, VDM 2.0 | Y | $349 |
| HP | t5730W | Windows Embedded Standard | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $550 |
| HP | t5740 | Windows Embedded Standard | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $429 |
| HP | vc4820t | Windows Embedded Standard | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | N/A |
| Wyse | C90LEW | Windows Embedded Standard 2009 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $498 |
| Wyse | R90LEW | Windows Embedded Standard 2009 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $640 |
| Wyse | R90LW | Windows Embedded Standard 2009 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | Y | $593 |
| Wyse | S10 | WTOS 6.5 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | N | $252 |
| Wyse | V10L | WTOS 6.5 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | N | $315 |
| Wyse | V10L Dual DVI | WTOS 6.5 | View 4.0 | View 3.1, View 3.0, VDM 2.1, VDM 2.0 | N | $447 |

Devices not on this list may "work" with VMware View 4.0 but may not support all of View 4's features. VMware addresses certified and compatible as follows:
Certified and Compatible Thin Clients:
Certified - A thin client device listed against a particular VMware View release in the Certified For column has been tested by the thin client manufacturer against that specific VMware View release and includes a minimum set of features supported in that VMware View version.

Compatible - A thin client device certified against a specific VMware View release is compatible with previous and subsequent VMware View releases according to the compatibility guarantees published as part of that specific VMware View release (typically two major releases in both directions). However, a compatible thin client may not include all of the features of the newer VMware View release. Please refer to your VMware View Client documentation to determine which features are included.

Unlisted thin clients may embed VMware's "software client" along with a more general purpose operating system to deliver View 4 compatibility. Support for this class of device may be restricted to the device vendor only. Likewise, thin clients that are compatible with earlier versions of View may support only a subset of View 4's features. When in doubt, contact the thin client manufacturer before deploying with View 4.

Updated: 1-December-2009 -  added price reference for listed thin clients.

vSphere 4, Update 1 and ESXi

On November 19, 2009 VMware released Update 1 for vSphere 4.0 which, among other bug fixes and errata, adds the following new features:

  • ESX/ESXi

    • VMware View 4.0 support (previously unsupported)

    • Windows 7 and Windows 2008 R2 support (previously "experimental") - guest customizations now supported

    • Enhanced Clustering support for Microsoft Windows - adds support for VMware HA and DRS by allowing HA/DRS to be disabled per MSCS VM instead of per ESX host

    • Enhanced VMware Paravirtualized SCSI support (pvSCSI boot disks now supported in Windows 2003/2008)

    • Improved vNetwork Distributed Switch performance

    • Increased vCPU per Core limit (raised from 20 to 25)

    • Intel Xeon 3400 series support (uni-processor variant of Nehalem)



  • vCenter Server

    • Support for IBM DB2 (Enterprise, Workgroup and Express 9, Express C)

    • Windows 7 and Windows 2008 R2 support (previously "experimental") - guest customizations now supported



  • vCenter Update Manager

    • Does not support IBM DB2

    • Still no scan or remediate for Windows Server 2003 SP2/R2 64-bit, Windows Server 2008 or Windows 7



  • vSphere Client


  • vSphere Command-Line Interface

    • Allows the use of comma-separated bulletins with the --bundle option in vihostupdate




Authorized VMware users can download the necessary updates for vSphere Update 1 directly from VMware. For ESX and ESXi, updates can be managed and installed from the vCenter Update Manager within the vSphere Client. In addition to the normal backup procedure and the steps recommended by VMware, the following observations may be helpful to you:

  • DRS/HA cluster members CAN be updated auto-magically; however, we observed very low end-to-end success rates in our testing lab. We recommend the following:

    • Manually enter maintenance mode for the ESXi server

    • Manually stage/remediate the patches to avoid conflicts

    • Manually re-boot ESXi servers if they do not reboot on their own

    • Re-scan patches when re-boot is complete, to check/confirm upgrade success

    • Manually recover from maintenance mode and confirm HA host configuration



  • For the vSphere Client on Windows 7, completely remove the "hacked" version and clean-install the latest version (download from the updated ESX/ESXi server(s))


SOLORI's Notes: When upgrading ESXi "auto-magically" we experienced the following failures and unwanted behavior:

  • Update Manager failed to stage pending updates and upgrades correctly, resulting in a "time-out" failure; however, updates were applied properly after a manual reboot.

  • DRS/DPM conflicts with upgrade process:

    • inadequate time given for servers to recover from sleep mode

    • Hosts were sent to sleep while updates were being evaluated, causing DRS to hold maintenance mode while sleeping hosts awakened and resulting in a failed/timed-out update process



  • Off-line VMs and templates were not automatically migrated during the update (maintenance mode) process, causing extended unavailability of these assets during the update


Additional Notes: As to the question of which updates/patches are "rolled up" into this update, the release notes are very clear. However, for the sake of convenience, we repeat them here:

Patch Release ESX400-Update01 contains the following individual bulletins:




ESXi 4.0 Update 1 also contains all fixes in the following previously released bundles:


Patch Release ESX400-200906001
Patch Release ESX400-200907001
Patch Release ESX400-200909001

Thursday, November 19, 2009

NEC Offers "Dunnington" Liposuction, Tops 64-Core VMmark

NEC's venerable Express5800/A1160 is back at the top of the VMmark chart, this time establishing the brand-new 64-core category with a score of 48.23@32 tiles - surpassing its 48-core 3rd place posting by over 30%. NEC's new 16-socket, 64-core, 256GB "Dunnington" X7460 Xeon-based score represents a big jump in performance over its predecessor, with a per-tile ratio of 1.507 - up 6% from the 48-core ratio of 1.419.


To put this into perspective, the highest VMmark achieved, to date, is the score of 53.73@35 tiles (tile ratio 1.535) from the 48-core HP DL785 G6 in August, 2009. If you are familiar with the "Dunnington" X7460, you know that it's a 6-core, 130W giant with 16MB of L3 cache and a 1,000-unit price just south of $3,000 per socket. So that raises the question: how does 6 cores x 16 sockets = 64? Well, it's not pro-rationing from the Obama administration's "IT fairness" czar. NEC chose to disable the 4th and 6th core of each socket to reduce the working cores from 96 to 64.


At $500/core, NEC's gambit may represent an expensive form of "core liposuction" but it was a necessary one to meet VMware's "logical processor per host" limitation of 64. That's right, currently VMware's vSphere places a limit on logical processors based on the following formula:


CPU_Sockets x Cores_Per_Socket x Threads_Per_Core <= 64


According to VMware, the other 32 cores would have been "ignored" by vSphere had they been enabled. Since "ignored" is a nebulous term (aka "undefined"), NEC did the "scientific" thing by disabling 32 cores and calling the system a 64-core server. The win here: a net 6% improvement in performance per tile over the 6-core configuration - ostensibly from the reduced core loading on the 16MB of L3 cache per socket and reduction in memory bus contention.
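
A minimal sketch of that logical-processor arithmetic, with the configurations discussed here and in the paragraph below plugged in:

```python
def logical_cpus(sockets, cores_per_socket, threads_per_core=1):
    return sockets * cores_per_socket * threads_per_core

LIMIT = 64
for cfg in [(16, 6, 1), (16, 4, 1), (4, 8, 2), (8, 8, 1), (4, 12, 1)]:
    print(cfg, logical_cpus(*cfg), logical_cpus(*cfg) <= LIMIT)
# (16, 6, 1) -> 96, False : all six Dunnington cores enabled
# (16, 4, 1) -> 64, True  : NEC's 4-of-6-core configuration
# (4, 8, 2)  -> 64, True  : 4P, 8-core Nehalem-EX with HT
# (8, 8, 1)  -> 64, True  : 8P Nehalem-EX with HT disabled
# (4, 12, 1) -> 48, True  : 4P Magny-Cours
```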


Moving forward to 2010, what does this mean for vSphere hardware configurations in the wake of 8-core, 16-thread Intel Nehalem-EX and 12-core, 12-thread AMD Magny-Cours processors? With a 4-socket Magny-Cours system limitation, we won't be seeing any VMmarks from the boys in green beyond 48-cores. Likewise, the boys in blue will be trapped by a VMware limitation (albeit, a somewhat arbitrary and artificial one) into a 4-socket, 64-thread (HT) configuration or an 8-socket, 64-core (HT-disabled) configuration for their Nehalem-EX platform - even if using the six-core variant of EX. Looks like VMware will need to lift the 64-LCPU embargo by Q2/2010 just to keep up.

Wednesday, November 11, 2009

Fujitsu's RX300 S5 rack server takes the top spot in VMware's VMmark for 8-core systems today with a score of 25.16@17 tiles. Loaded with two of Intel's top-bin 3.33GHz, 130W Nehalem-EP processors (W5590, turbo to 3.6GHz per core) and 96GB of DDR3-1333 R-ECC memory, the RX300 bested the former champ - the HP ProLiant BL490c G6 blade - by only about 2.5%.

With 17 tiles and 102 virtual machines on a single 2U box, the RX300 S5 demonstrates precisely how well vSphere scales on today's x86 commodity platforms. It also appears to demonstrate both the value and the limits of Intel's "turbo mode" in its top-bin Nehalem-EP processors - especially in the virtualization use case - we'll get to that later. In any case, the resulting equation is:

More * (Threads + Memory + I/O) = Dense Virtualization


We could have added "higher execution rates" to that equation; however, virtualization is a scale-out application where threads, memory pool and I/O capabilities dominate the capacity equation - not clock speed. Adding 50% more clock provides smaller virtualization gains than adding 50% more cores, and reducing memory and context latency likewise provides better gains than simply upping the clock speed. That's why a dual quad-core 2.6GHz Nehalem system will crush a quad dual-core 3.5GHz (ill-fated) Tulsa.

Speaking of Tulsa: unlike Tulsa's rather anaemic first-generation hyper-threading, Intel's improved SMT in Nehalem "virtually" adds more core "power" to the Xeon by contributing up to 100% more thread capacity. This is demonstrated by Nehalem-EP's contribution of 2 tiles per core to VMmark, where AMD's six-core Istanbul provides only about 1 tile per core. But exactly what is a VMmark tile, and how does core versus thread play into the result?

[caption id="attachment_1306" align="aligncenter" width="450" caption="The Illustrated VMmark "Tile" Load "]Single-Tile-Relationship[/caption]

As you can see, a "VMmark tile" - or just "tile" for short - is composed of 6 virtual machines, half running Windows, half running SUSE Linux. Likewise, half of the VMs in a tile run in 64-bit mode while the other half run in 32-bit mode. As a whole, the tile is composed of 10 virtual CPUs, 5GB of RAM and 62GB of storage. Looking at how the parts contribute to the whole, the tile is relatively balanced:


| Operating System / Mode | 32-bit | 64-bit | Memory | vCPU | Disk |
| --- | --- | --- | --- | --- | --- |
| Windows Server 2003 R2 | 67% | 33% | 45% | 50% | 58% |
| SUSE Linux Enterprise Server 10 SP2 | 33% | 67% | 55% | 50% | 42% |
| 32-bit | 50% | N/A | 30% | 40% | 58% |
| 64-bit | N/A | 50% | 70% | 60% | 42% |

If we stop here and accept that today's best x86 processors from AMD and Intel are capable of providing 1 tile for each thread, we can look at the thread count and calculate the number of tiles and the resulting memory requirement. While that sounds like a good "rule of thumb" approach, it ignores use cases where synthetic threads (like HT and SMT) do not scale linearly the way core threads do - cases where SMT accounts for only about a 12% gain over a single-threaded core, clock-for-clock. For this reason, processors from AMD and Intel in 2010 will feature more cores - 12 for AMD and 8 for Intel in their Magny-Cours and Nehalem-EX (aka "Beckton"), respectively.
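
As a sketch only, applying that rule of thumb (one tile per hardware thread, ~5GB per tile from the table above) gives a rough sizing picture. The hypervisor overhead figure here is our own assumption, and the published 2P Nehalem-EP runs actually managed 17 tiles on 96GB:

```python
def vmmark_estimate(hw_threads, gb_per_tile=5, hypervisor_overhead_gb=6):
    tiles = hw_threads                 # rule of thumb: ~1 tile per hardware thread
    vms = tiles * 6                    # 6 VMs per tile
    ram_gb = tiles * gb_per_tile + hypervisor_overhead_gb
    return tiles, vms, ram_gb

print(vmmark_estimate(16))   # 2P Nehalem-EP (8C/16T): ~16 tiles, ~96 VMs, ~86GB
print(vmmark_estimate(12))   # 2P Istanbul (12C/12T): ~12 tiles, ~72 VMs, ~66GB
```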

Learning from the Master


If we want to gather some information about a specific field, we consult an expert, right? Judging from the results, Fujitsu's latest dual-processor entry has definitely earned the title "Master of VMmark" in 2P systems - at least for now. So instead of the usual VMmark $/VM analysis (which is well established for recent VMmark entries), let's look at the solution profile and try to glean some nuggets to take back to our data centers.

It's Not About Raw Speed


First, we've noted that the processor used is not Intel's standard "rack server" fare, but the more workstation oriented W-series Nehalem at 130W TDP. With "turbo mode" active, this CPU is capable of driving the 3.33GHz core - on a per-core basis - up to 3.6GHz. Since we're seeing only a 2.5% improvement in overall score versus the ProLiant blade at 2.93GHz, we can extrapolate that the 2.93GHz X5570 Xeon is spending a lot of time at 3.33GHz - its "turbo" speed - while the power-hungry W5590 spends little time at 3.6GHz. How can we say this? Looking at the tile ratio as a function of the clock speed.

We know that the X5570 can run up to 3.33GHz, per core, according to thermal conditions on the chip. With proper cooling, this could mean up to 100% of the time (sorry, Google). Assuming for a moment that this is the case in the HP test environment (and there is sufficient cause to think so), then the ratio of tile score to tile count and CPU frequency is 0.433 (24.54/17/3.33). If we examine the same ratio for the W5590, assuming a clock speed of 3.33GHz, we get 0.444 - a difference of 2.5%, or the contribution of "turbo" in the W5590. Likewise, if you back-figure the "apparent speed" of the X5570 using the ratio of the clock-locked W5590, you arrive at 3.25GHz for the X5570 (an 11% gain over base clock). In either case, it is clear that "turbo" is a better value at the low end of the Nehalem spectrum, as there isn't enough thermal headroom for it to work well in the W-series.
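
Spelled out, the ratio math above looks like this (same numbers as the paragraph, no new data):

```python
hp_ratio      = 24.54 / 17 / 3.33    # X5570 assumed to live at its 3.33GHz turbo bin
fujitsu_ratio = 25.16 / 17 / 3.33    # W5590 assumed locked at 3.33GHz

print(round(hp_ratio, 3), round(fujitsu_ratio, 3))        # ~0.433 vs ~0.444
print(round((fujitsu_ratio / hp_ratio - 1) * 100, 1))     # ~2.5% - the W5590's "turbo" contribution

apparent_x5570 = 24.54 / 17 / fujitsu_ratio               # back-figured "apparent" clock
print(round(apparent_x5570, 2), round((apparent_x5570 / 2.93 - 1) * 100))   # ~3.25GHz, ~11% over base
```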

VMmark Equals Meager Network Use


Second, we're not seeing "fancy" networking tricks out of VMmark submissions. In the past, we've commented on the use of "consumer grade" switches in VMmark tests. For this reason, we can consider VMmark's I/O dependency as related almost exclusively to storage. With respect to networking, the Fujitsu team simply interfaced three 1Gbps network adapter ports to the internal switch of the blade enclosure used to run the client-side load suite and ran the test. Here's what that looks like:

[caption id="attachment_1310" align="aligncenter" width="410" caption="Networking Simplified: The "leaders" simple virtual networking topology."]ESX-Network-Configuration[/caption]

Note that the network interfaces used for the VMmark trial are not from the on-board i82575EB network controller but from the PCI-Express quad-port adapter using its older cousin - the i82571EB. What is key here is that VMmark is not closely tied to network performance; if anything, additional network ports would be more likely to increase IRQ sharing and reduce performance than to "optimize" network flows.

Keeping Storage "Simple"


Third, Fujitsu's approach to storage is elegantly simple: several "inexpensive" arrays with intelligent LUN allocation. For this, Fujitsu employed eight of its ETERNUS DX80 Disk Storage Systems with 7 additional storage shelves for a total of 172 working disks and 23 LUNs. For simplicity, Fujitsu used a pair of 8Gbps FC ports to feed ESX and at least one port per DX80 - all connected through a Brocade 5100 fabric switch. The result looked something like this:

[caption id="attachment_1311" align="aligncenter" width="450" caption="Fujitsu's VMmark Storage Topology: 8 Controllers, 7 Shelves, 172 Disks and 23 LUNs."]ESX-Storage-Configuration[/caption]

And yes, the ESX server is configured to boot from SAN, using no locally attached storage. Note that the virtual machine configuration files, VM swap and ESX boot/swap are contained in a separate DX80 system. This "non-default" approach allows the working VMDKs of the virtual machines to be isolated - from a storage perspective - from the swap file overhead, about 5GB per tile. Again, this is a benchmark scenario, not an enterprise deployment, so trade-offs are in favour of performance, not CAPEX or OPEX.

Even if the DX80 solution falls into the $1K/TB range, to say that this approach to storage is "economic" requires a deeper look. At 33 rack units for the solution - including the FC switch but not including the blade chassis - this configuration has a hefty datacenter footprint. In contrast to the old-school server/blade approach, one rack at roughly 3 VMs per U is a huge savings over the 2 racks of blades or 3 racks of 1U rack servers it replaces. Had each of those servers or blades had a mirrored disk pair, we'd be talking about 200+ disks spinning in those racks versus the 172 disks in the ETERNUS arrays, so the consolidated storage still represents a savings of about 15.7% in storage-related power/space.

When will storage catch up?


Compared to a 98% reduction in network ports, a 30-80% reduction in server/storage CAPEX (based on a $1K/TB SAN) and a 50-75% reduction in overall datacenter footprint, why is a 15% reduction in datacenter storage footprint acceptable? After all, storage - in the Fujitsu VMmark case - now represents 94% of the datacenter footprint. Even if the load were less aggressively spread across five ESX servers (a conservative 20:1 loading), the amount of space taken by storage only falls to 75%.

How can storage catch up to virtualization densities? First, with 2.5" SAS drives, a bank of 172 disks can be made to occupy only 16U with very strong performance. This drops storage to only about 60% of the datacenter footprint - 10U for hypervisors, 16U for storage, 26U total in this example. Moving from 3.5" drives to 2.5" drives takes care of the physical scaling issue with acceptable returns, but results in only minimal gains in terms of power savings.
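
The footprint percentages in the last two paragraphs work out as follows; the rack-unit counts are our assumptions (2U per DX80 controller or shelf, a 1U fabric switch, 2U per ESX host), chosen to be consistent with the 33U total quoted above:

```python
storage_u = 8 * 2 + 7 * 2 + 1        # 8 DX80 controllers + 7 shelves + FC switch = 31U
print(round(storage_u / (storage_u + 2) * 100))        # single 2U RX300 S5: ~94% storage
print(round(storage_u / (storage_u + 5 * 2) * 100))    # five 2U hosts (20:1 loading): ~76%
print(round(16 / (16 + 10) * 100))                     # 2.5" SAS rebuild, 16U vs 10U: ~62%, roughly 60%
```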

Saving power in storage platforms is not going to be achieved by simply shrinking disk drives - shrinking the NUMBER of disks required per "effective" LUN is what's necessary to overcome the power demands of modern, high-performance storage. This is where non-traditional technology like FLASH/SSD is being applied to improve performance while utilizing fewer disks and proportionately less power. For example, instead of dedicating disks on a per LUN basis, carving LUNs out of disk pools accelerated by FLASH (a hybrid storage pool) can result in a 30-40% reduction in disk count - when applied properly - and that means 30-40% reduction in datacenter space and power utilization.

Lessons Learned


Here are our "take aways" from the Fujitsu VMmark case:

1) Top-bin performance is at the losing end of diminishing returns. Unless your budget can accommodate this fact, purchasing decisions about virtualization compute platforms need to be aligned with $/VM within an acceptable performance envelope. When shopping CPU, make sure the top-bin's "little brother" has the same architecture and feature set, and go with the unit priced for the mainstream. (Don't forget to factor memory density into the equation...) Regardless, try to stick within a $190-280/VM equipment budget for your hypervisor hardware and shoot for a 20-to-1 consolidation ratio - that's at least $3,800-5,600 per server/blade (see the quick budget sketch after these take-aways).


2) While networking is not important to VMmark, this is likely not the case for most enterprise applications. Therefore, VMmark is not a good comparison case for your network-heavy applications. Also, adding more network ports increases capacity and redundancy but does so at the risk of IRQ-sharing (ESX, not ESXi) problems, not to mention the additional cost/number of network switching ports. This is where we think 10GE will significantly change the equation in 2010. Remember to add up the total number of in-use ports - including out-of-band management - when factoring in switch density. For net-new installations, look for a switch that provides 10GE/SR or 10GE/CX4 options, and go with 10GE/SR if power savings are driving your solution.


3) Storage should be simple, easy to manage, cheap (relatively speaking), dense and low-power. To meet these goals, look for storage technologies that utilize FLASH memory, tiered spindle types, smart block caching and other approaches to limit spindle count without sacrificing performance. Remember to factor in at least the cost of DAS when approximating your storage budget - about $150/VM in simple consolidation cases and $750/VM for more mission critical applications (that's a range of $9,000-45,000 for a 3-server virtualization stack). The economies in managed storage come chiefly from the administration of the storage, but try to identify storage solutions that reduce datacenter footprint including both rack space and power consumption. Here's where offerings from Sun and NexentaStor are showing real gains.
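
Here is the quick budget sketch referenced in take-away 1, covering the compute and storage guidance above (straight multiplication of the figures given, using the suggested 20:1 consolidation and a 3-host stack):

```python
vms_per_host = 20                      # the suggested 20-to-1 consolidation ratio
hosts = 3                              # a small 3-server virtualization stack

print(vms_per_host * 190, vms_per_host * 280)      # hypervisor hardware: $3,800 - $5,600 per host
total_vms = vms_per_host * hosts                   # 60 VMs across the stack
print(total_vms * 150, total_vms * 750)            # storage: $9,000 - $45,000 for the stack
```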


We'd like to see VMware update VMmark to include system power specifications so we can better gauge - from the sidelines - which solution stack(s) perform according to our needs. VMmark served its purpose by giving the community a standard from which different platforms could be compared in terms of the resultant performance. With the world's eyes on power consumption and the ecological impact of datacenter choices, adding a "power utilization component" to the "server side" of the VMmark test would not be that significant of a "tweak." Here's how we think it can be done:

  1. Require power consumption of the server/VMmark related components be recorded, including:

    1. the ESX platform (rack server, blade & blade chassis, etc.)

    2. the storage platform providing ESX and test LUN(s) (all heads, shelves, switches, etc.)

    3. the switching fabric (i.e. Ethernet, 10GE, FC, etc.)



  2. Power delivered to the test harness platforms, client load machines, etc. can be ignored;

  3. Power measurements should be recorded at the following times:

    1. All equipment off (validation check);

    2. Start-up;

    3. Single tile load;

    4. 100% tile capacity;

    5. 75% tile capacity;

    6. 50% tile capacity;



  4. Power measurements should be recorded using a time-power data-logger with readings recorded as 5-minute averages;

  5. Notations should be made concerning "cache warm-up" intervals, if applicable, where "cache optimized" storage is used.


Why is this important? In the wake of the VCE announcement, solution stacks like VCE need to be measured against each other in an easy to "consume" way. Is VCE the best platform versus a component solution provided by your local VMware integrator? Given that the differentiated VCE components are chiefly UCS, Cisco switching and EMC storage, it will be helpful to have a testing platform that can better differentiate "packaged solutions" instead of uncorrelated vendor "propaganda."

Let us know what your thoughts are on the subject, either on Twitter or on our blog...

Tuesday, November 3, 2009

Sun Adds De-Duplication to ZFS

Yesterday Jeff Bonwick (Sun) announced that deduplication is now officially part of ZFS - Sun's Zettabyte File System that is at the heart of Sun's Unified Storage platform and NexentaStor. In his post, Jeff touched on the major issues surrounding deduplication in ZFS:
Deduplication in ZFS is Block-level

ZFS provides block-level deduplication because this is the finest granularity that makes sense for a general-purpose storage system. Block-level dedup also maps naturally to ZFS's 256-bit block checksums, which provide unique block signatures for all blocks in a storage pool as long as the checksum function is cryptographically strong (e.g. SHA256).

Deduplication in ZFS is Synchronous

ZFS assumes a highly multithreaded operating system (Solaris) and a hardware environment in which CPU cycles (GHz times cores times sockets) are proliferating much faster than I/O. This has been the general trend for the last twenty years, and the underlying physics suggests that it will continue.

Deduplication in ZFS is Per-Dataset

Like all zfs properties, the 'dedup' property follows the usual rules for ZFS dataset property inheritance. Thus, even though deduplication has pool-wide scope, you can opt in or opt out on a per-dataset basis. Most storage environments contain a mix of data that is mostly unique and data that is mostly replicated. ZFS deduplication is per-dataset, which means you can selectively enable dedup only where it is likely to help.

Deduplication in ZFS is based on a SHA256 Hash

Chunks of data -- files, blocks, or byte ranges -- are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.

Deduplication in ZFS can be Verified

[If you are paranoid about potential "hash collisions"] ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not.

Deduplication in ZFS is Scalable

ZFS places no restrictions on your ability to dedup. You can dedup a petabyte if you're so inclined. The performance of ZFS dedup will follow the obvious trajectory: it will be fastest when the DDTs (dedup tables) fit in memory, a little slower when they spill over into the L2ARC, and much slower when they have to be read from disk -- but the point I want to emphasize here is that there are no limits in ZFS dedup. ZFS dedup scales to any capacity on any platform, even a laptop; it just goes faster as you give it more hardware.

Jeff Bonwick's Blog, November 2, 2009



What does this mean for ZFS users? That depends on the application, but highly duplicated environments like virtualization stand to gain significant storage-related value from this small addition to ZFS. Considering the various ways virtualization administrators deal with virtual machine cloning, even the basic VMware template approach (not using linked-clones) will now result in significant storage savings. This restores parity between storage and compute in the virtualization stack.


What does it mean for ZFS-based storage vendors? More main memory and processor threads will be necessary to limit the impact on performance. With 6-core and 8-thread CPUs available in the mainstream, this problem is very easily resolved. Just like the L2ARC tables consume main memory, the DDTs will require an increase in main memory for larger datasets. Testing and configuration convergence will likely take 2-3 months once dedupe is mainstream.
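
To get a feel for why DDTs push main-memory requirements up, here is a heavily hedged sizing sketch. The ~320 bytes per in-core DDT entry is a commonly cited ballpark - not a figure from Bonwick's post - and real entry sizes vary with pool layout and ZFS version:

```python
def ddt_ram_gb(unique_data_tb, avg_block_kb=64, bytes_per_entry=320):
    """Rough in-memory DDT size for a given amount of unique (post-dedup) data."""
    unique_blocks = unique_data_tb * 1024**3 / avg_block_kb   # TB -> KB, then block count
    return unique_blocks * bytes_per_entry / 1024**3          # bytes -> GB

print(round(ddt_ram_gb(1)))     # ~5GB of DDT for 1TB of unique 64KB blocks
print(round(ddt_ram_gb(10)))    # ~50GB for 10TB - hence L2ARC spill-over and bigger DRAM builds
```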


When can we expect to see dedupe added to ZFS (i.e. OpenSolaris)? According to Jeff, "in roughly a month."


Updated: 11/04/2009 - Link to Nexenta corrected. Was incorrectly linked to "nexent.com" - typo - now correctly linked to "http://www.nexenta.com"


Friday, October 9, 2009

Quick Take: Red Hat and Microsoft Virtual Inter-Op

This week Red Hat and Microsoft announced support for certain of each other's OSes as guests in their respective hypervisor implementations: Kernel-based Virtual Machine (KVM) and Hyper-V. This comes on the heels of Red Hat's Enterprise Linux 5.4 announcement last month.

KVM is Red Hat's new hypervisor that leverages the Linux kernel to accelerate support for hardware and capabilities. It was Red Hat and AMD that first demonstrated live migration between AMD and Intel-based hypervisors using KVM late last year - then somewhat of a "Holy Grail" of hypervisor feats. With nearly a year of improvements and integration into their Red Hat Enterprise Server and Fedora "free and open source" offerings, Red Hat is almost ready to strike-out in a commercially viable way.

Microsoft now officially supports the following Red Hat guest operating systems in Hyper-V:

Red Hat Enterprise Linux 5.2, 5.3 and 5.4


Red Hat likewise officially supports the following Microsoft guest operating systems in KVM:

Windows Server 2003, 2008 and 2008 R2


The goal of the announcement and associated agreements between Red Hat and Microsoft was to enable a fully supported virtualization infrastructure for enterprises with Red Hat and Microsoft assets. As such, Microsoft and Red Hat are committed to supporting their respective products whether the hypervisor environment is all Red Hat, all Hyper-V or totally heterogeneous - mixing Red Hat KVM and Microsoft Hyper-V as necessary.
"With this announcement, Red Hat and Microsoft are ensuring their customers can resolve any issues related to Microsoft Windows on Red Hat Enterprise Virtualization, and Red Hat Enterprise Linux operating on Microsoft Hyper-V, regardless of whether the problem is related to the operating system or the virtualization implementation."

- Red Hat press release, October 7, 2009



Many in the industry cite Red Hat's adoption of KVM as a step backwards [from Xen], requiring the re-development of a significant amount of support code. However, Red Hat's use of libvirt as a common management API has allowed the change to happen much more rapidly than critics' assumptions had allowed. At Red Hat Summit 2009, key Red Hat officials were keen to point out just how tasty their "dog food" is:
Tim Burke, Red Hat's vice president of engineering, said that Red Hat already runs much of its own infrastructure, including mail servers and file servers, on KVM, and is working hard to promote KVM with key original equipment manufacturer partners and vendors.

And Red Hat CTO Brian Stevens pointed out in his Summit keynote that with KVM inside the Linux kernel, Red Hat customers will no longer have to choose which applications to virtualize; virtualization will be everywhere and the tools to manage applications will be the same as those used to manage virtualized guests.

- Xen vs. KVM, by Pam Derringer, SearchDataCenter.com



For system integrators and virtual infrastructure practices, Red Hat's play is creating opportunities for differentiation. With a focus on light-weight, high-performance, I/O-driven virtualization applications and no need to support years-old established processes that are dragging on Xen and VMware, KVM stands to leap-frog the competition in the short term.

SOLORI's Take: This news is good for all Red Hat and Microsoft customers alike. Indeed, it shows that Microsoft realizes that its licenses are being sold into the enterprise whether or not they run on physical hardware. With 20+:1 consolidation ratios now common, that represents a 5:1 license to hardware sale for Microsoft, regardless of the hypervisor. With KVM's demonstrated CPU agnostic migration capabilities, this opens the door to an even more diverse virtualization infrastructure than ever before.

On the Red Hat side, it demonstrates how rapidly Red Hat has matured its offering following the shift to KVM earlier this year. While KVM is new to Red Hat, it is not new to Linux or to aggressive early adopters, having been part of the Linux kernel since 2.6.20 back in February of 2007. With support already in active projects like ConVirt (VM life cycle management), OpenNebula (cloud administration tools), Ganeti, and Enomaly's Elastic Computing Platform, the game of catch-up for Red Hat and KVM is very likely to be a short one.

Wednesday, October 7, 2009

Quick Take: Nehalem/Istanbul Comparison at AnandTech

Johan De Gelas and crew present an interesting comparison of Dunnington, Shanghai, Istanbul and Nehalem in a new post at AnandTech this week. In the test line-up are the "top bin" parts from Intel and AMD in 4-core and 6-core incarnations:

  • Intel Nehalem-EP Xeon, X5570 2.93GHz, 4-core, 8-thread

  • Intel "Dunnington" Xeon, X7460, 2.66GHz, 6-core, 6-thread

  • AMD "Shanghai" Opteron 2389/8389, 2.9GHz, 4-core, 4-thread

  • AMD "Istanbul" Opteron 2435/8435, 2.6GHz, 6-core, 6-thread


Most importantly for virtualization systems architects is how the vCPU scheduling affects "measured" performance. The telling piece comes from the difference in comparison results where vCPU scheduling is equalized:

[caption id="attachment_1276" align="aligncenter" width="450" caption="AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6, 2009."]AnandTech's Quad Sockets v. Dual Sockets Comparison. Oct 6,  2009.[/caption]

When comparing the results, De Gelas hits on the I/O factor which chiefly separates VMmark from vAPUS:
The result is that VMmark with its huge number of VMs per server (up to 102 VMs!) places a lot of stress on the I/O systems. The reason for the Intel Xeon X5570's crushing VMmark results cannot be explained by the processor architecture alone. One possible explanation may be that the VMDq (multiple queues and offloading of the virtual switch to the hardware) implementation of the Intel NICs is better than the Broadcom NICs that are typically found in the AMD based servers.

Johan De Gelas, AnandTech, Oct 2009



This is yet another issue that VMware architects struggle with in complex deployments. The latency in "Dunnington" is a huge contributor to its downfall and why the Penryn architecture was a dead-end. Combined with 8 additional threads in the 2P form factor, Nehalem delivers twice the number of hardware execution contexts than Shanghai, resulting in significant efficiencies for Nehalem where small working data sets are involved.

When larger sets are used - as in vAPUS - the Istanbul's additional cores allows it to close the gap to within the clock speed difference of Nehalem (about 12%). In contrast to VMmark which implies a 3:2 advantage to Nehalem, the vAPUS results suggest a closer performance gap in more aggressive virtualization use cases.

SOLORI's Take: We differ with De Gelas on the reduction in vAPUS' data set to accommodate the "cheaper" memory build of the Nehalem system. While this offers some advantages in testing, it also diminishes one of Opteron's greatest strengths: access to cheap and abundant memory. Here we have the testing conundrum: fit the test around the competitors or the competitors around the test. The former approach presents a bias on the "pure performance" aspect of the competitors, while the latter is more typical of use-case testing.

We do not construe this issue as intentional bias on AnandTech's part, however it is another vector to consider in the evaluation of the results. De Gelas delivers a report worth reading in its entirety, and we view this as a primer to the issues that will define the first half of 2010.

Thursday, October 1, 2009

Quarter in Review: Top 5's of Q3/2009

SOLORI's top blog posts of Q3/2009



  1. In-the-Lab: Full ESX Test Lab in a Box - 18%

    1. Part 1, Setup and Getting Started with ESXi

    2. Part 2, Selecting a Virtual Storage Appliance (VSA)

    3. Part 3, Building and Provisioning the VSA

    4. Part 4, Creating the Cluster-in-a-Box

    5. Part 5, Deploying vCenter, Update Manager, et al



  2. Installing FreeNAS to USB Flash: Easy as 1, 2, 3 - 17%

  3. Preview: Installing vSphere ESXi to Flash - 11%

  4. Installing ESXi on the Tyan Transport GT28 - 4%

  5. In-the-Lab: vSphere DPM, Quirky but Functional - 3%


SOLORI's top search engine keywords for Q3/2009



  1. USB flash install - 5.6%

  2. FreeNAS - 4.7%

  3. ESXi - 1.7%

  4. Virtual SAN - 0.6%

  5. AMD - 0.4%


Summary and Comments


With about 17K visits this quarter, FreeNAS and ESXi related posts are clearly the most popular. We've seen a great deal of traffic generated by the ESX-on-ESX series, but the popular FreeNAS project comes a close second. Judging by the search engine results, nearly 6% of our visitors find the SolutionOriented Blog while trying to locate tips on installing FreeNAS or ESXi to USB flash. We'll take that as a hint for next quarter to deliver more information on alternative and open storage solutions that fit virtualization use cases: stay tuned.

Monday, September 28, 2009

In Part 4 of this series we created two vSphere virtual machines - one running ESX and one running ESXi - from a set of master images we can use for rapid deployment in case we want to expand the number of ESX servers in our lab. We showed you how to use NexentaStor to create snapshots of NFS and iSCSI volumes and create ZFS clone images from them. We then showed you how to stage the startup of the VSA and ESX hosts to "auto-start" the lab on boot-up.

In this segment, Part 5, we will create a VMware Virtual Center (vCenter) virtual machine and place the ESX and ESXi machines under management. Using this vCenter instance, we will complete the configuration of ESX and ESXi using some of the new features available in vCenter.

Part 5, Managing our ESX Cluster-in-a-Box


With our VSA and ESX servers purring along in the virtual lab, the only thing stopping us from moving forward with vMotion is the absence of a working vCenter to control the process. Once we have vCenter installed, we have 60-days to evaluate and test vSphere before the trial license expires.

Prepping for vCenter Server for vSphere


We are going to install Microsoft Windows Server 2003 STD for the vCenter Server operating system. We chose Server 2003 STD since we have limited CPU and memory resources to commit to the management of the lab and because our vCenter has no need of 64-bit resources in this use case.

Since one of our goals is to have a fully functional vMotion lab with reasonable performance, we want to create a vCenter virtual machine with at least the minimum requirements satisfied. In our 24GB lab server, we have committed 20GB to ESX, ESXi and the VSA (8GB, 8GB and 4GB, respectively). Our base ESXi instance consumes 2GB, leaving only 2GB for vCenter - or does it?

Memory Use in ESXi


VMware ESX (and ESXi) does a good job of conserving resources by limiting commitments for memory and CPU. This is not unlike any virtual-memory-capable system that puts a premium on "real" memory by moving less frequently used pages to disk. With a lot of idle virtual machines, this ability alone can create significant over-subscription possibilities for VMware; this is why it could be possible for 32GB worth of VMs to run on a 16-24GB host.

Do we really want this memory paging to take place? The answer - for the consolidation use cases - is usually "yes." This is because consolidation is born out of the need to aggregate underutilized systems in a more resource efficient way. Put another way, administrators tend to provision systems based on worst case versus average use, leaving 70-80% of those resources idle in off-peak times. Under ESX's control those underutilized resources can be re-tasked to another VM without impacting the performance of either one.

On the other hand, our ESX and VSA virtual machines are not the typical use case. We intend to fully utilize their resources and let them determine how to share them in turn. Imagine a good number of virtual machines running on our virtualized ESX hosts: will they perform well with the added hardship of memory paging? Also, when we begin to use vMotion, those CPU and memory resources will appear on BOTH virtualized ESX servers at the same time.

It is pretty clear that if all of our lab storage is committed to the VSA, we do not want to page its memory. Remember that any additional memory not in use by the SAN OS in our VSA is employed as ARC cache for ZFS to increase read performance. Paging memory that is assumed to be "high performance" by NexentaStor would result in poor storage throughput. The key to "recursive computing" is knowing how to anticipate resource bottlenecks and deploy around them.

This brings the question: how much memory is left after reserving 4GB for the VSA? To figure that out, let's look at what NexentaStor uses at idle with 4GB provisioned:

[caption id="attachment_1169" align="aligncenter" width="374" caption="NexentaStor's RAM footprint with 4GB provisioned, at idle."]NexentaStor's RAM footprint with 4GB provisioned, at idle.[/caption]

As you can see, we have specified a 4GB reservation which appears as "4233 MB" of Host Memory consumed (4096MB+137MB). Looking at the "Active" memory we see that - at idle - the NexentaStor is using about 2GB of host RAM for OS and to support the couple of file systems mounted on the host ESXi server (recursively).

Additionally, we need to remember that each VM has a memory overhead to consider that increases with the vCPU count. For the four-vCPU ESX/ESXi servers, the overhead is about 220MB each; the NexentaStor VSA consumes an additional 140MB with its two vCPUs. Totaling up the memory plus overhead identifies a commitment of at least 21,828MB of memory to run the VSA and both ESX guests - that leaves a little under 1.5GB for vCenter if we used a 100% reservation model.

Memory Over Commitment


The same concerns about memory hold true for our ESX and ESXi hosts - albeit in a less obvious way. We obviously want to "reserve" the memory required by the VMM - about 2.8GB and 2GB for ESX and ESXi, respectively. Additionally, we want to avoid over-subscription of memory on the host ESXi instance - if at all possible - since it will already be working hard running our virtual ESX and ESXi machines.

Friday, September 25, 2009

Quick Take: HP Blade Tops 8-core VMmark w/OC'd Memory

HP's ProLiant BL490c G6 server blade now tops the VMware VMmark table for 8-core systems - just squeaking past rack servers from Lenovo and Dell with a score of 24.54@17 tiles: a new 8-core record. The half-height blade was equipped with two quad-core Intel Xeon X5570 processors (Nehalem-EP, 95W TDP) and 96GB of ECC Registered DDR3-1333 memory (12x 8GB, 2 DIMMs per channel).

In our follow-up, we found that HP's on-line configuration tool does not allow for DDR3-1333 memory so we went to the street for a comparison. For starters, we examined the on-line price from HP with DDR3-1066 memory and the added QLogic QMH2462 Fiber Channel adapter ($750) and additional NC360m dual-port Gigabit Ethernet controller ($320) which came to a grand total of $28,280 for the blade (about $277/VM, not including Blade chassis or SAN storage).

Stripping memory from the build-out results in a $7,970 floor for the hardware, sans memory. Going to the street to find 8GB sticks with DDR3-1333 ratings and HP support yielded the Kingston KTH-PL313K3/24G kit (3x 8GB DIMMs), of which we would need three to complete the build-out. At $4,773 per kit, the completed system comes to $22,289 (about $218/VM, not including chassis or storage), which may do more to demonstrate Kingston's value in the marketplace than HP's penchant for "over-priced" memory.
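
The street-price arithmetic, spelled out (102 VMs comes from 17 tiles at 6 VMs each):

```python
base_no_memory = 7970                  # BL490c G6 build-out, sans memory
kingston_kit   = 4773                  # KTH-PL313K3/24G (3x 8GB DDR3-1333), three kits needed
vms            = 17 * 6                # 17 tiles x 6 VMs = 102 VMs

street_total = base_no_memory + 3 * kingston_kit
print(street_total)                    # 22289
print(round(street_total / vms, 1))    # ~218.5 -> about $218/VM
print(round(28280 / vms, 1))           # ~277.3 -> about $277/VM for HP's DDR3-1066 config
```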

Now, the interesting disclosure from HP's testing team is this:

[caption id="attachment_1203" align="aligncenter" width="450" caption="Notes from HP's VMmark submission."]Notes from HP's VMmark submission.[/caption]

While this appears to boost memory performance significantly for HP's latest run (compared to the 24.24@17 tiles score back in May, 2009) it does so at the risk of running the Nehalem-EP memory controller out of specification - essentially, driving the controller beyond the rated load. It is hard for us to imagine that this specific configuration would be vendor supported if used in a problematic customer installation.

SOLORI's Take: Those of you following closely may be asking yourselves: "Why did HP choose to over-clock the memory controller in this run by pushing a 1066MHz, 2-DPC limit to 1333MHz?" It would appear the answer is self-evident: the extra 6% was needed to put them over the Lenovo machine. This issue raises a new question about the VMmark validation process: "Should out-of-specification configurations be allowed in the general benchmark corpus?" It is our opinion that VMmark should represent off-the-shelf, fully-supported configurations only - not esoteric configuration tweaks and questionable over-clocking practices.

Should there be an "unlimited" category in the VMmark arena? Who knows? How many enterprises knowingly commit their mission-critical data and processes to systems running over-clocked processors and over-driven memory controllers? No hands? That's what we thought... Congratulations anyway to HP for clawing their way to the top of the VMmark 8-core heap...