HDFS-Only Cluster through Big Data Extensions

After working with EMC over the summer to evaluate Isilon storage as an HDFS layer, including the NameNode, I started thinking about how VMware Big Data Extensions could be used to create the exact same functionality. If you’ve read any of the other posts about extending the capabilities of BDE beyond what it ships with, you’ll know the framework allows an administrator to do nearly anything they can imagine.

As with creating a Zookeeper-only cluster, all of the functionality for an HDFS-only cluster is already built into BDE; it is just a matter of unlocking it. It took less than 10 minutes to set up all the pieces.

I needed to add Cloudera 5.2.1 support to my new BDE 2.1 lab environment, so the first command sets that up. After that, the remaining commands are all that is needed:

# config-distro.rb --name cdh5 --vendor CDH --version 5.2.1 --repos http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
# cd /opt/serengeti/www/specs/Ironfan/hadoop2/
# mkdir -p donly
# cp conly/spec.json donly/spec.json
# vim donly/spec.json
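Before editing the spec, it can be worth confirming the distro actually registered. On a stock Serengeti/BDE management server, config-distro.rb records distros in a JSON manifest (the path below assumes the default install layout):

```shell
# The manifest is plain JSON; cdh5 should now appear among the distro names
grep '"name"' /opt/serengeti/www/distros/manifest
```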

I configured the donly/spec.json file to include the following:

{
  "nodeGroups":[
    {
      "name": "DataMaster",
      "description": "It is the VM running the Hadoop NameNode service. It manages HDFS data and assigns tasks to workers. The number of VM can only be one. User can specify size of VM.",
      "roles": [
        "hadoop_namenode"
      ],
      "groupType": "master",
      "instanceNum": "[1,1,1]",
      "instanceType": "[MEDIUM,SMALL,LARGE,EXTRA_LARGE]",
      "cpuNum": "[2,1,64]",
      "memCapacityMB": "[7500,3748,max]",
      "storage": {
        "type": "[SHARED,LOCAL]",
        "sizeGB": "[50,10,max]"
      },
      "haFlag": "on"
    },
    {
      "name": "DataWorker",
      "description": "They are VMs running the Hadoop DataNode services. They store HDFS data. User can specify number and size of VMs in this group.",
      "roles": [
        "hadoop_datanode"
      ],
      "instanceType": "[SMALL,MEDIUM,LARGE,EXTRA_LARGE]",
      "groupType": "worker",
      "instanceNum": "[3,1,max]",
      "cpuNum": "[1,1,64]",
      "memCapacityMB": "[3748,3748,max]",
      "storage": {
        "type": "[LOCAL,SHARED]",
        "sizeGB": "[100,20,max]"
      },
      "haFlag": "off"
    }
  ]
}
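A note on the bracketed values: in BDE spec files they read as [default, minimum, maximum]. Since a stray comma in the JSON will break cluster provisioning, a quick parse check of the edited file is worthwhile; assuming the Python that ships on the management server, something like:

```shell
# json.tool exits non-zero and points at the offending line on a parse error
python -m json.tool /opt/serengeti/www/specs/Ironfan/hadoop2/donly/spec.json > /dev/null \
  && echo "donly/spec.json parses cleanly"
```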

The final part is to add an entry for an HDFS-only cluster to the /opt/serengeti/www/specs/map file:

  {
    "vendor" : "CDH",
    "version" : "^\\w+(\\.\\w+)*",
    "type" : "HDFS Only Cluster",
    "appManager" : "Default",
    "path" : "Ironfan/hadoop2/donly/spec.json"
  },
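The version field is worth a second look: it is a regular expression, and (JSON escaping aside) ^\w+(\.\w+)* matches any dotted version string, so this single entry covers every CDH release. That is easy to sanity-check in a shell with the equivalent POSIX character class, since extended regular expressions do not understand \w:

```shell
# POSIX ERE equivalent of the map file's ^\w+(\.\w+)* version pattern
version_re='^[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*'
for v in 5.2.1 5.3.0 cdh5; do
  echo "$v" | grep -Eq "$version_re" && echo "$v is matched by the map entry"
done
```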

Restart the Tomcat service on the management server and the new cluster type becomes available. When provisioning completed, the cluster looked like this in the VMware vCenter Web Client:


You can view the status of the HDFS layer through the standard interface:


Having this functionality within VMware Big Data Extensions allows an environment to offer a dedicated HDFS data warehouse layer to your applications and other application cells.

Virtualized Hadoop + Isilon HDFS Benchmark Testing

During the VMworld EMEA presentation (Tuesday, October 14, 2014), the question of performance came up again: what are the positives and negatives of using Isilon as the HDFS data warehouse layer? As with any benchmark or performance testing, results will vary based on the data set you have, the hardware you are leveraging, and how you have the clusters configured. However, there are some things I have learned over the last year and a half that apply on a broad scale and show the advantages of leveraging Isilon as the HDFS layer, especially when you have very large data sets (10+ petabytes).

There are two benchmarking tests I want to focus on for this post. The tests themselves demonstrate the necessity for understanding the workload (Hadoop job), the size of the data set, and the individual configuration settings (YARN, MapReduce, and Java) for the compute worker nodes.

Continue reading “Virtualized Hadoop + Isilon HDFS Benchmark Testing”

Deploying an HDFS cluster for consumption

There have been a number of discussions recently about what a next-generation architecture should look like for a large-scale infrastructure. As I have discovered over the past few months, there is a stark difference between what current public and private cloud offerings generally provide and what Google is doing publicly and publishing in its technical documents. What excites me most is the discussion around this, seeing what others view as a solution and why, and realizing how closely the things I am passionate about align with what is coming next.

So along those lines, and after reading a few different white papers, I came to the conclusion that the current form of BDE could serve as a foundation for offering up a piece of the next-generation architecture in a PaaS-focused infrastructure. The solution itself is rather simple, and BDE lends itself very easily to accomplishing the end goal of offering up a pure HDFS layer for enterprise data.

Hortonworks has an excellent image depicting the type of solution I, and others, are envisioning. Once you have the base layer presenting the data through HDFS, a variety of services and applications can be utilized in the upper layers of the platform stack to consume that data.


The example shows YARN being used as the cluster resource manager, but even there you are not tied down. You could use another resource manager, such as Apache Mesos, to fill that role. That is part of the flexibility of building this type of infrastructure from the ground up: based on the needs of the platform, you can use the tools that best fit the requirements.

The first thing to realize is that the pieces are already there within BDE. Big Data Extensions already lets you deploy a compute-only cluster and point it at an HDFS cluster. What we are talking about here is doing the exact opposite, and it all starts with a new cluster definition file. A good working example can be found in the whitepaper “Scaling the Deployment of Multiple Hadoop Workloads on a Virtualized Infrastructure.”
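For context, the compute-only direction works by pointing the compute cluster's spec at an external NameNode; in BDE spec files this is done with an externalHDFS entry. The fragment below is a minimal sketch only: the hostname is a placeholder, and the role and group names are assumptions drawn from the Ironfan hadoop2 templates.

```json
{
  "externalHDFS": "hdfs://datamaster.example.local:8020",
  "nodeGroups": [
    {
      "name": "ComputeWorker",
      "roles": [ "hadoop_nodemanager" ],
      "groupType": "worker",
      "instanceNum": 3
    }
  ]
}
```

An HDFS-only cluster built from the donly spec is exactly the thing such an externalHDFS entry would point at.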

By virtualizing the HDFS nodes on your rack-mount servers with DAS, you can customize how each node appears within the cluster. The cluster can then take advantage of the Hadoop Virtualization Extensions (whitepaper) to ensure data locality on the physical nodes. At that point you have a pure HDFS layer that can be presented to any number of cluster resource managers, be they physical, virtual, or a combination of the two. The flexibility gained from having this sort of layer quickly adds value to the PaaS offerings.

Building out HDFS with this set of technologies is a great first step towards a next-generation, cluster-level PaaS offering. I will keep you updated as hardware and performance testing shapes the architecture design.