Category: Apache Hadoop

 

vmware-slider

Version 2.3.1 of VMware Big Data Extensions was released on March 29, 2016. It includes the fix for the glibc vulnerability disclosed in February. The 2.3 branch gained many new features when v2.3 was released back in December, including an updated CentOS 6.7 template and support for multiple VM templates within the BDE vApp. The full release notes for the 2.3 branch can be viewed on the VMware site.

I’ve been anxious to upgrade my lab environment to 2.3 for the past several months, but time has been extremely limited due to a heavy workload and family life. Fortunately, the Bay Area experienced a rather rainy weekend, and with all the little league baseball games cancelled, I was able to sit down and deploy the latest version into my vSphere 6.0 lab.

One of the major improvements made to VMware Big Data Extensions (BDE) is the administrative HTTPS interface running on port 5480. Once the vApp is powered on and you have changed the default random password, point your browser to the interface and log in. You will be greeted with a summary screen showing the status of the running services. While the BDE management server is initializing, you can monitor its progress and see any error messages that occur.

BDE_2.3.1_Mgmt_screen_1

Clicking the ‘Details…’ link to the right of Initialization Status will load the following pop-up that allows you to watch the progress of the management server.

BDE_2.3.1_Mgmt_screen_2

Once all of the initialization steps complete successfully, the Summary screen can be refreshed and it should show all of the services operational.

BDE_2.3.1_Mgmt_screen_3

At this point, log out and back into the vSphere Web Client to see the Big Data Extensions icon and begin managing the vApp.

The BDE management server is missing two key packages, mailx and wsdl4j, and deployments will fail without them. The BDE documentation includes the following instructions for adding these packages to the management server:

The wsdl4j and mailx RPM packages are not embedded within Big Data Extensions due to licensing agreements. For this reason you must install them within the internal Yum repository of the Serengeti Management Server.

In order to install these packages properly, you will need to execute the following commands on the BDE management server.

# su - serengeti
$ umask 022 
$ cd /opt/serengeti/www/yum/repos/centos/6/base/RPMS/
$ wget http://mirror.centos.org/centos/6/os/x86_64/Packages/mailx-12.4-8.el6_6.x86_64.rpm
$ wget http://mirror.centos.org/centos/6/os/x86_64/Packages/wsdl4j-1.5.2-7.8.el6.noarch.rpm
$ createrepo ..
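
As a sanity check after the downloads, something like the following sketch can report which of the two packages still has no RPM in a directory. The `missing_rpms` helper name is hypothetical and not part of BDE; the directory in the example comment is the one used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper (not part of BDE): print each required package
# that has no matching RPM file in the given directory.
missing_rpms() {
    dir="$1"
    for pkg in mailx wsdl4j; do
        # With no match the glob stays literal and ls exits non-zero.
        ls "$dir"/"$pkg"-*.rpm >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example: missing_rpms /opt/serengeti/www/yum/repos/centos/6/base/RPMS
```

No output means both RPMs are present in the directory.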

After verifying the proper VMFS datastores and Networks are configured within the BDE application, I always like to perform a test deployment of a basic Hadoop cluster. Doing so allows me to be sure everything is working as expected before I begin modifying the BDE management server. A test deployment is also a good way to see if anything in the BDE workflow has changed — it just so happens there is now a nifty new drop-down menu for selecting the VM template that should be used for the deployment.

BDE_2.3.1_Dropdown

A successful installation of a basic Hadoop cluster means the VMware Big Data Extensions application is ready for consumption and modification to support the Cloud Native Applications (Marathon, Mesos, Kubernetes, etc) I require in my lab environment.

Enjoy.

The VMware BDE template uses a snapshot to perform the cloning operation as it deploys a cluster. The ability to create a cloned VM from a snapshot is exposed in the vSphere API with the CloneVM_Task. As part of regular template maintenance, I run a yum update command to make sure the OS gets regular updates and security patches. It helps when installing packages like Docker to make sure I’m as close to the stable CentOS 7 branch as possible. However, if you simply power on the template and run an OS update, those changes will not show up in new cluster deployments, because clusters are cloned from the snapshot and not from the template's current state.

If you look at your BDE template, you can see the snapshot the management server uses.

bde-template-2

By deleting the snapshot, any changes you have made to the template will be used during future cluster deployments. It is not necessary to do anything else. On the next cluster deployment, if the snapshot is missing, the BDE framework will create a new one and proceed to use it.

The ability to update the BDE template will assist you in the lifecycle management of your Hadoop, Apache Mesos and any other cluster deployments you run through the VMware Big Data Extensions framework. Enjoy!

isc-logo

The ISC Cloud & Big Data conference is taking place next week in Frankfurt, Germany from September 28-30, 2015. I will be speaking on the benefits of virtualizing Hadoop and how to identify use-cases within an organization. The conference is focused on the latest developments in cloud and big data technologies. If you are going to be in Germany for the conference, please come listen to the session on Wednesday morning.

ISC Big Data

Hope to see you there!

Cloudera_logo_rgb

 

I have been wanting to spend some time with the Application Manager feature built into Big Data Extensions and use Cloudera Manager to manage the SDLC of a Hadoop cluster in my lab environment for a while now. I was able to find time over the weekend to install Cloudera Manager onto a virtual machine and tie it into the Big Data Extensions deployed in my vSphere lab environment. The genius behind the Cloudera Manager is the ability it gives a consumer (administrator, engineer or data scientist) to have a single pane of glass for managing a Hadoop cluster. The Cloudera website states:

Cloudera Manager makes it easy to manage Hadoop deployments of any scale in production. Quickly deploy, configure, and monitor your cluster through an intuitive UI – complete with rolling upgrades, backup and disaster recovery, and customizable alerting.

I remember trying to get Cloudera Manager working with a previously deployed Hadoop cluster back when BDE was on v1.0, and it was not successful. Having the functionality built into BDE now expands its capabilities to all types of Hadoop environments, whether they are one-off clusters or a Hadoop-as-a-Service offering.

This post will go through the steps required to install Cloudera Manager for Hadoop deployments through VMware Big Data Extensions.

Installing Cloudera Manager

Note: CentOS 6 is my preferred Linux distribution and I run it in my lab environment for almost all of my Linux management VM roles. The instructions to follow are specific to running Cloudera Manager on a CentOS 6 VM. Your mileage may vary.

The installer file for Cloudera Manager needs to be downloaded from Cloudera before it can be installed on a virtual machine in the environment. The Cloudera website has a link for downloading the bin file, as seen in the following screenshot.

cloudera manager download

Once the installer has been downloaded onto the VM designated for running Cloudera Manager, a few steps are required to complete the installation.

Disable SELinux

SELinux needs to be disabled for Cloudera Manager to install successfully. Edit the /etc/sysconfig/selinux file and change the SELINUX setting on line 7 to disabled.

cloudera manager
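
The edit described above can also be scripted. The following is a minimal sketch, not something taken from the Cloudera documentation: `set_selinux_disabled` is a hypothetical helper name, and it assumes the file contains an uncommented `SELINUX=` line. It writes back through the original path instead of replacing the file, since /etc/sysconfig/selinux is normally a symlink to /etc/selinux/config on CentOS.

```shell
#!/bin/sh
# Hypothetical helper: set SELINUX=disabled in the given config file
# (defaults to /etc/sysconfig/selinux when no path is passed).
set_selinux_disabled() {
    conf="${1:-/etc/sysconfig/selinux}"
    tmp="$(mktemp)" || return 1
    # Rewrite only the active SELINUX= line; comment lines and
    # SELINUXTYPE= do not match this pattern and are left alone.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$conf" > "$tmp" \
        || { rm -f "$tmp"; return 1; }
    # Write through the existing path so a symlink stays a symlink.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example (as root): set_selinux_disabled
```

The change takes effect at the next reboot; running `setenforce 0` switches the running system to permissive mode in the meantime.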

Disable IPTABLES

Initially, I added an IPTABLES ruleset to allow incoming traffic on ports 80, 443 and 7180 for Cloudera Manager. Although that allowed the UI to run correctly, early test deployments of Hadoop failed. Only after stopping the service altogether were the agents able to install on the Hadoop nodes.

[root@cloudera ~]# service iptables stop
[root@cloudera ~]# chkconfig iptables off

The final step is to run the installer (cloudera-manager-installer.bin) from the command line.

[root@cloudera ~]# chmod u+x cloudera-manager-installer.bin
[root@cloudera ~]# ./cloudera-manager-installer.bin

Accept the EULAs and the process is off to the races. Upon completion, the following screen should appear in the terminal.

cloudera manager

The screenshot shows the web UI address and the username/password information for the newly installed instance of Cloudera Manager.

Adding an Application Manager to BDE

Once Cloudera Manager is installed, the next step is to tie it into the Big Data Extensions installation in the vSphere environment. To do so, log in to the vSphere Web UI and go to the Big Data Extensions tab. Under the Application Managers selection on the left-side menu, click the plus icon and fill out the form.

cloudera manager

Now the Big Data Extensions framework is capable of using the Cloudera Manager for installing and managing Hadoop clusters.

Deploy a Hadoop Cluster using Cloudera Manager

In the vSphere Web UI, deploy a Hadoop cluster using Big Data Extensions — the only difference now is selecting the CDH5 Manager as the Application Manager.

cloudera manager

The deployment process will initially proceed in the same way it would without using Cloudera Manager. The Big Data Extensions framework will clone the template VM, configure it based on the memory, disk and CPU specified and power on all of the VMs. Once the VMs have their initial configuration, BDE hands them off to Cloudera Manager for installing the local agent and then the proper Hadoop applications.

Once the deployment is complete, the newly deployed Hadoop cluster is visible in Cloudera Manager.

cloudera manager

There were a few other minor tweaks within Cloudera Manager I found necessary to have it working ‘just so’ in my vSphere environment. I will be posting what those tweaks were and going over other parts of Cloudera Manager that will assist in the SDLC management of Hadoop clusters in other posts this week.

Enjoy!

vsphere6-login-custom

Over the past weekend, I wiped my home lab environment that was running vSphere 5.5 and installed a fresh set of vSphere 6.0U1 bits. The decision to move the home lab to vSphere 6.0 was largely due to no longer needing vSphere 5.5 for my VCAP-DCA studying and wanting to finally begin using the Instant Clone technology for deploying Hadoop clusters with Big Data Extensions. As I mentioned in an earlier post, version 2.2 of Big Data Extensions exposed the ability to use VMware Instant Clone technology. It did not, however, enable the setting by default, leaving it to the vSphere administrator to decide whether to enable it.

Turning on the feature is simple enough. Edit the /opt/serengeti/conf/serengeti.properties file on the BDE management server, setting the cluster.clone.service variable on line 104 to instant. The exact syntax is highlighted in the screenshot below.

bde-instant-clone-ss

After saving the file and restarting the Tomcat service, the environment is ready to go!
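
For repeatable lab rebuilds, the properties change can be scripted. The sketch below is only an illustration: `set_clone_service` is a hypothetical helper, and the `key = value` spacing it writes is an assumption, so adjust it to match the syntax already used in your file.

```shell
#!/bin/sh
# Hypothetical helper: set cluster.clone.service in a properties file.
# Usage: set_clone_service FILE VALUE
set_clone_service() {
    conf="$1"
    value="$2"
    tmp="$(mktemp)" || return 1
    # Replace the value on the existing cluster.clone.service line,
    # tolerating optional whitespace before the '='.
    sed "s/^cluster\.clone\.service[[:space:]]*=.*/cluster.clone.service = ${value}/" "$conf" > "$tmp" \
        || { rm -f "$tmp"; return 1; }
    # Write back through the original path to keep ownership and permissions.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example: set_clone_service /opt/serengeti/conf/serengeti.properties instant
```

It is worth keeping a backup copy of serengeti.properties before running anything like this, and the Tomcat restart is still required afterwards.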

For those unfamiliar with the VMware Instant Clone technology, a brief description from the VMware Blog states:

The Instant Clone capability allows admins to ‘fork’ a running virtual machine, meaning, it is not a full clone. The parent virtual machine is brought to a state where the admin will Instant Clone it, at which time the parent virtual machine is quiesced and placed in a state of a ‘parent VM’. This allows the admins to create as many “child VMs” as they please. These child VMs are created in mere seconds (or less) depending on the environment (I’ve seen child VMs created in .6 seconds). The reason these child VMs can be created so quickly is because they leverage the memory and disk of the parent VM. Once the child VM is created, any writes are placed in delta disks.

When the parent virtual machine is quiesced, a prequiesce script cleans up certain aspects of the parent VM while placing it in its parent state, allowing the child VMs to receive unique MAC addresses, UUID, and other information when they are instantiated. When spinning up the child VMs a post clone script can be used to set properties such as the network information of the VM, and/or kick off additional scripts or actions within the child VM.

The ability to deploy new Hadoop VMs in an extremely quick manner through Big Data Extensions is amazing! In addition, because of the extensibility of BDE, the VMware Instant Clone feature is used when any type of cluster deployment is initiated — Apache Spark, Apache Mesos, Apache Hadoop, etc.

VMware Instant Clone VMs

The new cloned VMs launched in less than 1 minute — for my Intel NUC home lab it was really impressive! I’ve read all the press stating Instant Clones would launch in single-digit seconds, but you never know how something is going to work in an environment you control. Seeing is believing!

The interesting bit I did not anticipate was the fact that a single parent VM was spun up on each ESXi host inside my home lab when the first Hadoop cluster was deployed. You can see the parent VMs in a new resource pool created during the deployment that is separate from the Hadoop cluster resource pool.

instant-clones-created

Because these parent VMs did not exist when the original cluster was deployed without VMware Instant Clone, the total resource utilization across the cluster actually increased.

Cluster Utilization Before VMware Instant Clone

hadoop-cluster-before

hadoop-before-utilization

Cluster Utilization After VMware Instant Clone

hadoop-cluster-after

hadoop-cluster-utilization-1

Increased Hadoop Cluster Utilization with VMware Instant Clone

The next step was to increase the cluster size to see what sort of savings could be realized through the Instant Clone technology. I increased the number of worker nodes through BDE using the new interface.

hadoop-cluster-increase-ui

Note: The new adjustment field, which shows the new node count, is outstanding!

After the new nodes were added, the total cluster utilization suggested there were some savings.

hadoop-increased-cluster

hadoop-increased-cluster-utilization

I am looking forward to using the new VMware Instant Clone technology further in the lab — including Photon nodes — to see what further savings I can get out of my home lab.
