This past week saw the publication of the second whitepaper by Jeff Buell at VMware on Hadoop performance within a vSphere environment. The first whitepaper was published in 2013 and was one of the earliest pieces of information I latched onto when I started my journey down the virtualized Hadoop road. Once again, Jeff does a great job, running the same set of tests performed in 2013 to show the performance gains that can be realized when virtualizing Hadoop on bare metal servers with VMware vSphere 6.

I first met Jeff last year at Hadoop Summit in San Jose, CA, where I had an opportunity to talk to him about his work. He is a brilliant individual with amazing insight into both Hadoop and vSphere environments. He was also instrumental in the work we did at Adobe last summer with EMC on our large-scale POC utilizing EMC Isilon storage for the HDFS layer. The latest whitepaper continues his work and improves upon everything he has evangelized in the past on virtualized Hadoop.

Rather than regurgitate the information published in the whitepapers, I just want to mention the items that stood out most to me:

  1. Single-queue versus multi-queue (default) settings when using the VMXNET3 virtual NIC driver.
  2. Using pRDM devices rather than SAN or Isilon storage.
  3. The ESXi scheduler being able to isolate a VM to a single NUMA node for both CPU and memory, increasing the performance of the VM when running Hadoop tasks.
  4. Best practices for virtualizing Hadoop so the cluster can realize all of the possible performance gains.
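
To make the third item a bit more concrete, here is a minimal sketch of the sizing check I have in mind when I think about keeping a VM inside a single NUMA node. It is my own illustration, not something taken from the whitepaper: the `Host` and `VM` classes, the host and VM sizes, and the assumptions that there is one NUMA node per socket and that memory is split evenly across nodes are all hypothetical. It also ignores hyperthreading and hypervisor memory overhead unless you pass a reserve fraction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical physical host; assumes one NUMA node per socket."""
    sockets: int
    cores_per_socket: int
    memory_gb: float  # total memory, assumed evenly split across NUMA nodes


@dataclass(frozen=True)
class VM:
    """Hypothetical VM sizing (vCPU count and configured memory)."""
    vcpus: int
    memory_gb: float


def fits_single_numa_node(host: Host, vm: VM, mem_reserve_fraction: float = 0.0) -> bool:
    """Return True if the VM's vCPUs and memory both fit within one NUMA node.

    mem_reserve_fraction is an assumed share of each node's memory held back
    for hypervisor overhead (0.0 means no reserve).
    """
    node_cores = host.cores_per_socket
    node_memory_gb = (host.memory_gb / host.sockets) * (1.0 - mem_reserve_fraction)
    return vm.vcpus <= node_cores and vm.memory_gb <= node_memory_gb


# Example values are made up for illustration only.
host = Host(sockets=2, cores_per_socket=12, memory_gb=256)
for vm in (VM(vcpus=12, memory_gb=120), VM(vcpus=16, memory_gb=120), VM(vcpus=8, memory_gb=200)):
    print(vm, "fits in one NUMA node:", fits_single_numa_node(host, vm))
```

A VM that fails this check has to span nodes for either CPU or memory, which is the situation the scheduler cannot fix for you by placement alone.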

I still find it curious that the whitepaper did not use the VMware Big Data Extensions (BDE) for cluster deployment and configuration; I believe both efforts would stand to gain if they were more closely aligned. I have incorporated many of the best practices Jeff mentions into my own BDE environments, but I would like to see more of them committed into the BDE/Serengeti code. Beyond that, I think the next step is to run the same set of tests on VMware's vCloud Air public cloud offering, to help consumers understand the advantages and disadvantages of running a virtualized Hadoop workload in that environment.

I still hear engineers and system administrators make blanket statements that Hadoop is a workload that cannot be virtualized. What I appreciate most about the work done by Jeff and VMware is that it gives me one more piece of published information to help change people’s minds. I highly encourage you to spend a few minutes reading both whitepapers, and the other works referenced in them, to better understand what Jeff and others have done in this space.