After a long week off, I am back and should be posting 2-3x per week leading up to VMworld 2014 in August.
I keep getting this question from various software engineers, system engineers and managers so I thought it would be a good topic to address here.
Disclaimer: Mileage will vary depending on your compute hardware, disk systems (DAS or NAS) and individual Hadoop workloads.
Now that the disclaimer is out of the way, let me spend some time answering the general form of the question. First, there are several whitepapers that show what virtualizing Hadoop looks like with various workloads, Hadoop distributions and hardware.
Generally speaking, running Hadoop within a virtual machine incurs a less than 5% performance penalty. However, that is based on no modification to the configuration of the Hadoop cluster and that means you likely don’t fully understand the workload you are hosting. If you saw my earlier post on YARN containers, you will hopefully come to the same conclusion that I have and that is customization is key for the infrastructure. Hadoop is very much not a one-size-fit-all system.
Reading the whitepaper from Dell, Intel and VMware it shows that the overall result of the tests they ran, several of the virtualized clusters outperformed the performance from a physical cluster utilizing the same hardware. There is the one test, DFSIOE-READ, which ran significantly worse when virtualized.
In the tests I have been running against different sized datasets with the very same Hive job, the performance within the virtual cluster have been within the +/- 5% threshold we aim to accomplish when working with an Engineering team to show them the benefits of virtualizing. The advantage we then have with virtualizing lies with rightsizing both the size and number of VMs within the cluster to then outperform a physical cluster while keeping the total cost of ownership lower when compared to trying to create/deploy/scale a physical cluster.
Bottom line: Is there a penalty to virtualizing Hadoop?
Simple answer: Yes, a 5-10% penalty in general.
Better answer: Yes. However, if you understand your workload and customize the cluster (using the tools available within BDE) the penalty quickly becomes nonexistent and a virtualized Hadoop cluster should outperform a physical cluster running on the very same physical hardware.