Setting up Big Data Extensions Orchestrator workflows for Hadoop

The default VMware Orchestrator plugin for Big Data Extensions is setup for deploying only Apache Hadoop clusters. That may be enough for your organization, but if you have already setup additional Hadoop distributions you ought to have them available to your vCloud Automation Center catalog. In order to do so, there are a couple of options available to you.

  1. Edit existing workflows to take a variable where you specify the Hadoop distribution.
  2. Duplicate the workflows and edit them to work only with a specific Hadoop distribution.

I chose to go with option #2 within my Hadoop Platform-as-a-Service offerings. Continue reading “Setting up Big Data Extensions Orchestrator workflows for Hadoop”

Deploying a HDFS cluster for consumption

There have been a number of discussions recently around what a next generation architecture should look like for a large-scale infrastructure. As I have discovered over the past few months, there is a stark difference from what current public cloud and private cloud offerings generally have and what Google is doing publicly and publishing in their technical documents. The piece that gets me most excited is the discussion around this, what others see as a solution, including their perspective, and then realizing how close the things I am passionate about align with what is coming next.

So along those lines, and after reading a few different white papers, I came to the conclusion that the current form of BDE could be used as a foundation for offering up a piece of the next-generation architecture in a PaaS focused infrastructure. The solution itself is rather simplistic and BDE yields itself very easily to accomplishing the end-goal of offering up a pure HDFS layer for enterprise data.

HortonWorks has an excellent image depicting the type of solution I, and others, are envisioning. Once you have the base layer presenting the data through HDFS, there are a variety of services/applications that can then be utilized in upper-layers of the platform stack to consume said data.


The example shows YARN being used for the cluster resource manager, but even then you are not specifically tied down. You could use another resource manager, like Apache Mesos, to handle that role. That is part of flexibility of building this type of infrastructure from the ground up — based on the needs of the platform, you can use the tools that best fit the requirements.

First thing you have to realize is the pieces are already there within BDE. The Big Data Extensions already allow a person to deploy a compute-only cluster and point it to an HDFS cluster. What we are talking about here is just doing the exact opposite and it all starts with a new cluster definition file. A good working example can be found in the whitepaper “Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure.”

By virtualizing the HDFS nodes within your rack-mount servers with DAS, you can customize how each node appears within the cluster. The cluster is then capable of taking advantage of using the Hadoop Virtulized Extensions (whitepaper) to ensure data locality within the physical nodes. At which point you have a pure HDFS layer that can be presented to any number of cluster resource managers, be they physical, virtual or a combination of the two. The flexibility gained from having this sort of layer quickly adds benefit to the PaaS offerings.

Building out HDFS using this set of technologies is a great first step towards building  a next-generation cluster level PaaS offering. I will keep you updated as hardware and performance testing determines the architecture  design.

What is the virtualization penalty with Hadoop?

After a long week off, I am back and should be posting 2-3x per week leading up to VMworld 2014 in August.

I keep getting this question from various software engineers, system engineers and managers so I thought it would be a good topic to address here.

Disclaimer: Mileage will vary depending on your compute hardware, disk systems (DAS or NAS) and individual Hadoop workloads.

Now that the disclaimer is out of the way, let me spend some time answering the general form of the question. First, there are several whitepapers that show what virtualizing Hadoop looks like with various workloads, Hadoop distributions and hardware.

Generally speaking, running Hadoop within a virtual machine incurs a less than 5% performance penalty. However, that is based on no modification to the configuration of the Hadoop cluster and that means you likely don’t fully understand the workload you are hosting. If you saw my earlier post on YARN containers, you will hopefully come to the same conclusion that I have and that is customization is key for the infrastructure. Hadoop is very much not a one-size-fit-all system.

Reading the whitepaper from Dell, Intel and VMware it shows that the overall result of the tests they ran, several of the virtualized clusters outperformed the performance from a physical cluster utilizing the same hardware. There is the one test, DFSIOE-READ, which ran significantly worse when virtualized.

In the tests I have been running against different sized datasets with the very same Hive job, the performance within the virtual cluster have been within the +/- 5% threshold we aim to accomplish when working with an Engineering team to show them the benefits of virtualizing. The advantage we then have with virtualizing lies with rightsizing both the size and number of VMs within the cluster to then outperform a physical cluster while keeping the total cost of ownership lower when compared to trying to create/deploy/scale a physical cluster.

Bottom line: Is there a penalty to virtualizing Hadoop?

Simple answer: Yes, a 5-10% penalty in general.

Better answer: Yes. However, if you understand your workload and customize the cluster (using the tools available within BDE) the penalty quickly becomes nonexistent and a virtualized Hadoop cluster should outperform a physical cluster running on the very same physical hardware.

Rightsizing YARN containers for virtual machines

Working on a specific use-case at work has required that I modify the Chef recipe templates for mapred-site.xml and yarn-site.xml to configure the memory allocations correctly. The container sizes themselves will depend on the size of VMs you are creating, and BDE has some generic settings by default, but again with each workload being different it is necessary to tune these parameters just as you would with a physical Hadoop cluster.

The virtual machines within this compute-only (Isilon-backed HDFS + NameNode) cluster utilized the ‘Medium’ sized node within BDE. That means:

  • 2 vCPU
  • 7.5GB RAM
  • 100GB drives

The specific YARN and MapReduce settings I have used to take advantage of the total memory allocated to the cluster was:

155 <% else %>
156 <property>
157   <name></name>
158   <value>-Xmx1024m</value>
159 </property>
161 <!-- <property> -->
162 <!--  <name>mapred.child.ulimit</name> -->
163 <!--  <value><%= node[:hadoop][:java_child_ulimit] %></value> -->
164 <!-- </property> -->
166 <property>
167   <description>MapReduce map memory, in MB</description>
168   <name></name>
169   <value>1024</value>
170 </property>
172 <property>
173   <description>MapReduce map java options</description>
174   <name></name>
175   <value>-Xmx819m</value>
176 </property>
178 <property>
179   <description>MapReduce reduce memory, in MB</description>
180   <name>mapreduce.reduce.memory.mb</name>
181   <value>2048</value>
182 </property>
184 <property>
185   <description>MapReduce reduce java options</description>
186   <name></name>
187   <value>-Xmx1638m</value>
188 </property>
190 <property>
191   <description>MapReduce task IO sort, in MB</description>
192   <name></name>
193   <value>409</value>
194 </property>
196 <% end %>

 72 <property>
 73   <description>Amount of physical memory, in MB, that can be allocated
 74     for containers.</description>
 75   <name>yarn.nodemanager.resource.memory-mb</name>
 76   <!-- <value><%= node[:yarn][:nm_resource_mem] %></value> -->
 77   <value>6122</value>
 78 </property>
 80 <property>
 81   <description>The amount of memory the MR AppMaster needs.</description>
 82   <name></name>
 83   <!-- <value><%= node[:yarn][:am_resource_mem] %></value> -->
 84   <value>2048</value>
 85 </property>
 87 <property>
 88   <description>Scheduler minimum memory, in MB, that can be allocated.</description>
 89   <name>yarn.scheduler.minimum-allocation-mb</name>
 90   <value>1024</value>
 91 </property>
 93 <property>
 94   <description>Scheduler maximum memory, in MB, that can be allocated.</description>
 95   <name>yarn.scheduler.maximum-allocation-mb</name>
 96   <value>6122</value>
 97 </property>
 99 <property>
100   <description>Application master options</description>
101   <name></name>
102   <value>-Xmx1638m</value>
103 </property>
126 <property>
127   <description>Disable the vmem check that is turned on by default in Yarn.</description>
128   <name>yarn.nodemanager.vmem-check.enabled</name>
129   <value>false</value>
130 </property>
Again, mileage will vary depending on your Hadoop workload, but these configuration settings should allow you to utilize the majority of the memory resources within a cluster deployed with the ‘Medium’ sized nodes within BDE.
I used the following articles as guidelines when tuning my cluster, along with trial and error.

Apache Flume node in VMware vSphere BDE – Part 1

Apache Flume is an open-source project to assist Hadoop users to ingest data into HDFS. It provides a reliable mechanism for sending raw data into HDFS to be used by the Hadoop cluster. Many new users of Hadoop often ask, “How can I get my data into HDFS for me to begin taking advantage of it?” As a result, a Flume node within the Hadoop cluster is a good first step.

There is a white paper, written by VMware, that describes how to include an Apache Flume node within a BDE-deployed Hadoop cluster. The steps and use-case described in the white paper are quite adequate for deploying a node that can be made available to the cluster. However, as I began thinking about how to offer this as part of a Hadoop-as-a-Service offering, I realized that the ability to deploy a Flume node through BDE needed to happen at the time of deployment — not afterwards. I certainly did not want to have to go through many of the manual steps to configure Flume when all of that information is available to BDE at the time of the cluster deployment.

Continue reading “Apache Flume node in VMware vSphere BDE – Part 1”