Cloudera Manager Deployments in VMware Big Data Extensions



For a while now I have been wanting to spend some time with the Application Manager feature built into Big Data Extensions and use Cloudera Manager to manage the SDLC of a Hadoop cluster in my lab environment. I was able to find time over the weekend to install Cloudera Manager onto a virtual machine and tie it into the Big Data Extensions deployment in my vSphere lab. The genius of Cloudera Manager is the ability it gives a consumer — administrator, engineer or data scientist — to have a single pane of glass for managing a Hadoop cluster. The Cloudera website states:

Cloudera Manager makes it easy to manage Hadoop deployments of any scale in production. Quickly deploy, configure, and monitor your cluster through an intuitive UI – complete with rolling upgrades, backup and disaster recovery, and customizable alerting.

I remember trying to get Cloudera Manager working with a previously deployed Hadoop cluster back when BDE was on v1.0 — it was not successful. Having the functionality built into BDE now further expands its capabilities to all types of Hadoop environments, whether they are one-off clusters or part of a Hadoop-as-a-Service offering.

This post will go through the steps required to install Cloudera Manager and use it for Hadoop deployments through VMware Big Data Extensions.

Installing Cloudera Manager

Note: CentOS 6 is my preferred Linux distribution and I run it in my lab environment for almost all of my Linux management VM roles. The instructions to follow are specific to running Cloudera Manager on a CentOS 6 VM. Your mileage may vary.

The installer file for Cloudera Manager needs to be downloaded from Cloudera before it can be installed on a virtual machine in the environment. The Cloudera website has a link for downloading the bin file, as seen in the following screenshot.

[Screenshot: Cloudera Manager download page on the Cloudera website]

Once the installer has been downloaded onto the VM designated for running Cloudera Manager, a few steps are required to complete the installation.

Disable SELinux

SELinux will need to be disabled in order for Cloudera Manager to install successfully. Edit the /etc/sysconfig/selinux file and change line 7 so that the SELINUX variable is set to disabled.
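For reference, the relevant line in /etc/sysconfig/selinux should read as follows after the edit. Running setenforce additionally drops the running system into permissive mode, which is typically enough for the installer to proceed without a reboot:

SELINUX=disabled

[root@cloudera ~]# setenforce 0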

[Screenshot: /etc/sysconfig/selinux edited to disable SELinux]


Disable IPTables

Initially, I added an IPTABLES ruleset to allow incoming traffic on ports 80, 443 and 7180 for Cloudera Manager. Although that allowed the UI to work correctly, early testing of Hadoop deployments failed; only after stopping the service altogether were the agents able to be installed on the Hadoop nodes.

[root@cloudera ~]# service iptables stop
[root@cloudera ~]# chkconfig iptables off

The final step is to run the installer (cloudera-manager-installer.bin) from the command line.

[root@cloudera ~]# chmod u+x cloudera-manager-installer.bin
[root@cloudera ~]# ./cloudera-manager-installer.bin

Accept the EULAs and the process is off to the races. Upon completion, the following screen should appear in the terminal.
[Screenshot: Cloudera Manager installer completion screen showing the web UI address and credentials]
As the screenshot shows, the installer displays the web UI address and the username/password information for the newly installed instance of Cloudera Manager.
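If you want to verify the service from the command line before moving on, Cloudera Manager listens on port 7180 by default (the prompt and hostname below are just from my lab VM):

[root@cloudera ~]# curl -I http://localhost:7180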

Adding an Application Manager to BDE

Once Cloudera Manager is installed, the next step is to tie it into the Big Data Extensions installation in the vSphere environment. To do so, log onto the vSphere Web UI and go to the Big Data Extensions tab. Under the Application Managers selection in the left-side menu, click the plus icon and fill out the form.

[Screenshot: Add Application Manager form in the vSphere Web UI]

Now the Big Data Extensions framework is capable of using the Cloudera Manager for installing and managing Hadoop clusters.
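The same registration can also be performed from the Serengeti CLI on the BDE management server. The exact flag names can vary between BDE releases, and the name and URL below are just from my lab, but the command looks roughly like this:

appmanager add --name cloudera-manager --type ClouderaManager --url http://cloudera.lab.local:7180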

Deploy a Hadoop Cluster using Cloudera Manager

In the vSphere Web UI, deploy a Hadoop cluster using Big Data Extensions — the only difference now is selecting the CDH5 Manager as the Application Manager.

[Screenshot: Selecting the CDH5 Manager as the Application Manager during cluster creation]

The deployment process will initially proceed in the same way it would without Cloudera Manager. The Big Data Extensions framework will clone the template VM, configure the clones based on the memory, disk and CPU specified, and power on all of the VMs. Once the VMs have their initial configuration, BDE hands them off to Cloudera Manager, which installs the local agent and then the proper Hadoop services.
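The deployment can also be initiated from the Serengeti CLI. The cluster name, distro name and application manager name below are assumptions from my lab, and the flag spelling may differ slightly between releases, so treat this as a sketch rather than exact syntax:

cluster create --name cdh5cluster --distro CDH-5 --appManager cloudera-manager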

Once the deployment is complete, the newly deployed Hadoop cluster is visible within Cloudera Manager.

[Screenshot: The newly deployed Hadoop cluster in Cloudera Manager]

There were a few other minor tweaks within Cloudera Manager that I found necessary to have it working ‘just so’ in my vSphere environment. I will cover those tweaks, along with other parts of Cloudera Manager that assist in the SDLC management of Hadoop clusters, in other posts this week.


Using VMware Instant Clone with Big Data Extensions


Over the past weekend, I wiped my home lab environment that was running vSphere 5.5 and installed a fresh set of vSphere 6.0U1 bits. The decision to move to vSphere 6.0 in the home lab was largely due to no longer needing vSphere 5.5 for my VCAP-DCA studying and wanting to finally begin using the Instant Clone technology for deploying Hadoop clusters with Big Data Extensions. As I mentioned in an earlier post, the release of version 2.2 of Big Data Extensions exposed the ability to use VMware Instant Clone technology. It did not, however, enable the setting by default, leaving it to the vSphere administrator to determine whether or not to enable it.

Turning on the feature is simple enough. Edit the /opt/serengeti/conf/ file on the BDE management server, changing line 104 so that the cluster.clone.service variable is set to instant. The exact syntax is highlighted in the screenshot below.
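For reference, the variable name and value come straight from the configuration; the line ends up reading:

cluster.clone.service = instant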


After saving the file and restarting the Tomcat service, the environment is ready to go!

For those unfamiliar with the VMware Instant Clone technology, a brief description from the VMware Blog states:

The Instant Clone capability allows admins to ‘fork’ a running virtual machine, meaning, it is not a full clone. The parent virtual machine is brought to a state where the admin will Instant Clone it, at which time the parent virtual machine is quiesced and placed in a state of a ‘parent VM’. This allows the admins to create as many “child VMs” as they please. These child VMs are created in mere seconds (or less) depending on the environment (I’ve seen child VMs created in .6 seconds). The reason these child VMs can be created so quickly is because they leverage the memory and disk of the parent VM. Once the child VM is created, any writes are placed in delta disks.

When the parent virtual machine is quiesced, a prequiesce script cleans up certain aspects of the parent VM while placing it in its parent state, allowing the child VMs to receive unique MAC addresses, UUID, and other information when they are instantiated. When spinning up the child VMs a post clone script can be used to set properties such as the network information of the VM, and/or kick off additional scripts or actions within the child VM.

The ability to deploy new Hadoop VMs in an extremely quick manner through Big Data Extensions is amazing! In addition, because of the extensibility of BDE, the VMware Instant Clone feature is used when any type of cluster deployment is initiated — Apache Spark, Apache Mesos, Apache Hadoop, etc.

VMware Instant Clone VMs

The new cloned VMs launched in less than 1 minute — for my Intel NUC home lab that was really impressive! I’ve read all the press stating Instant Clones would launch in single-digit seconds, but you never know how something is going to work until you see it in an environment you control. Seeing is believing!

The interesting bit I did not anticipate was the fact that a single parent VM was spun up on each ESXi host inside my home lab when the first Hadoop cluster was deployed. You can see the parent VMs in a new resource pool created during the deployment that is separate from the Hadoop cluster resource pool.


Because of these additional parent VMs, which did not exist when the original cluster was deployed without VMware Instant Clone, the total resource utilization across the cluster actually increased.

Cluster Utilization Before VMware Instant Clone



Cluster Utilization After VMware Instant Clone



Increased Hadoop Cluster Utilization with VMware Instant Clone

The next step was to increase the cluster size to see what sort of savings could be realized through the Instant Clone technology. I increased the number of worker nodes through BDE using the new interface.


Note: The new adjustment field and the way it shows the new node count are outstanding!
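Resizing can also be done from the Serengeti CLI with the cluster resize command; the cluster name and node count below are just examples from my lab:

cluster resize --name cdh5cluster --nodeGroup worker --instanceNum 10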

After the new nodes were added, the total cluster utilization showed some savings.



I am looking forward to using the new VMware Instant Clone technology further in the lab — including Photon nodes — to see what further savings I can get out of my home lab.

Workload-based cluster sizing for Hadoop

There is a quote in the book "Hadoop Operations" by Eric Sammer (O’Reilly) that states:

“The complexity of sizing a cluster comes from knowing — or more commonly, not knowing — the specifics of such a workload: its CPU, memory, storage, disk I/O, or frequency of execution requirements. Worse, it’s common to see a single cluster support many diverse types of jobs with conflicting resource requirements.”

In my experience that is a factual statement. It does not, however, preclude one from determining that very information so that an intelligent decision can be made. In fact, VMware vCenter Operations Manager becomes an invaluable tool in the toolbox when developing the ability to maintain the entire SDLC of a Hadoop cluster.

Initial sizing of the Hadoop cluster in the Engineering|Pre-Production|Chaos environment of your business will include some amount of guessing. You can stick with the tried and true methodology of answering the following two questions — “How much data do I have for HDFS initially?” and “How much data do I need to ingest into HDFS daily|monthly?” It is at this point that you’ll need to start monitoring the workload(s) placed on the Hadoop cluster and begin making determinations for the cluster size once it moves into the QE, Staging and Production environments.
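As a purely hypothetical illustration of what answering those two questions can yield (the numbers are made up, and the 3x replication factor plus roughly 25% of space for temporary/intermediate data are common rules of thumb rather than figures from this environment):

10 TB initial data + (1 TB/month x 12 months) = 22 TB of user data
22 TB x 3 (HDFS replication) = 66 TB of raw capacity
66 TB x 1.25 (temp/intermediate overhead) ≈ 83 TB of raw disk across the worker nodes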


Adding Hadoop Jobtracker History retention in BDE

As we’ve been working with very large datasets tied back to an Isilon array for the HDFS layer, we discovered that the history server functionality was missing from BDE (both 1.1 and 2.0). After talking to a few individuals and getting some direction, but no solution, I realized the ability to turn the feature on was available — it just was not enabled.

In order to turn on the jobtracker history server functionality, so that you can see the job logs after jobs complete, add the following properties to the mapred-site.xml.erb template file:

  • BDE 1.1: /opt/serengeti/cookbooks/cookbooks/hadoop_cluster/templates/default/mapred-site.xml.erb
  • BDE 2.0: /opt/serengeti/chef/cookbooks/hadoop_cluster/templates/default/mapred-site.xml.erb

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value><%= @resourcemanager_address %>:19888</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value><%= @resourcemanager_address %>:10020</value>
</property>
As always, be sure to run ‘knife cookbook upload -a’ after editing the file; the updated template will then be used during your cluster deployments.
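Once a cluster is deployed (or redeployed) with the updated template, the job history web UI should be reachable on the ResourceManager address at port 19888, per the property above. A quick check from any node, with the placeholder standing in for your actual ResourceManager host:

[root@hadoop-client ~]# curl -I http://<resourcemanager_address>:19888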