Category: Hadoop Ecosystem

isc-logo

The ISC Cloud & Big Data conference is taking place next week in Frankfurt, Germany, from September 28-30, 2015. I will be speaking on the benefits of virtualizing Hadoop and how to identify use cases within an organization. The conference is focused on the latest developments in cloud and big data technologies. If you are going to be in Germany for the conference, please come listen to the session on Wednesday morning.

ISC Big Data

Hope to see you there!

Read More

Cloudera_logo_rgb


I have been wanting to spend some time with the Application Manager feature built into Big Data Extensions and use Cloudera Manager to manage the SDLC of a Hadoop cluster in my lab environment for a while now. I was able to find time over the weekend to install Cloudera Manager onto a virtual machine and tie it into the Big Data Extensions deployment in my vSphere lab environment. The genius behind Cloudera Manager is the ability it gives a consumer (administrator, engineer, or data scientist) to have a single pane of glass for managing a Hadoop cluster. The Cloudera website states:

Cloudera Manager makes it easy to manage Hadoop deployments of any scale in production. Quickly deploy, configure, and monitor your cluster through an intuitive UI – complete with rolling upgrades, backup and disaster recovery, and customizable alerting.

I remember trying to get Cloudera Manager working with a previously deployed Hadoop cluster back when BDE was on v1.0; it was not successful. Having this functionality built into BDE now further expands its capabilities to all types of Hadoop environments, whether they are one-off clusters or a Hadoop-as-a-Service offering.

This post will go through the steps required to install Cloudera Manager and use it for Hadoop deployments through VMware Big Data Extensions.

Installing Cloudera Manager

Note: CentOS 6 is my preferred Linux distribution and I run it in my lab environment for almost all of my Linux management VM roles. The instructions to follow are specific to running Cloudera Manager on a CentOS 6 VM. Your mileage may vary.

The installer file for Cloudera Manager needs to be downloaded from Cloudera before it can be installed on a virtual machine in the environment. The Cloudera website has a link for downloading the bin file, as seen in the following screenshot.

cloudera manager download
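If you would rather pull the file straight down to the VM, wget works fine. The URL below is the CM5 archive path I recall Cloudera publishing at the time, so double-check it against the current download page before relying on it.

[root@cloudera ~]# wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin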

Once the installer has been downloaded onto the VM designated for running Cloudera Manager, a few steps are required to complete the installation.

Disable SELinux

SELinux will need to be disabled in order for Cloudera Manager to install successfully. Edit the /etc/sysconfig/selinux file and change line 7 so that it reads SELINUX=disabled.

cloudera manager
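If you would rather make the change from the shell, something along these lines should do it. The sed pattern assumes the stock CentOS 6 file layout; note that setenforce 0 only drops SELinux to permissive until the next reboot, while the file edit makes the change permanent.

[root@cloudera ~]# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/sysconfig/selinux
[root@cloudera ~]# setenforce 0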

Disable IPTABLES

Initially, I added an iptables ruleset to allow incoming traffic on ports 80, 443, and 7180 for Cloudera Manager. Although that allowed the UI to run correctly, early testing of Hadoop deployments failed; only after stopping the service altogether were the agents able to be installed on the Hadoop nodes.
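For reference, the sort of ruleset I tried first looked roughly like this (a sketch assuming a default INPUT policy, not the exact rules I used):

[root@cloudera ~]# iptables -A INPUT -p tcp --dport 80 -j ACCEPT
[root@cloudera ~]# iptables -A INPUT -p tcp --dport 443 -j ACCEPT
[root@cloudera ~]# iptables -A INPUT -p tcp --dport 7180 -j ACCEPT
[root@cloudera ~]# service iptables save

What ultimately worked was stopping and disabling the service entirely: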

[root@cloudera ~]# service iptables stop
[root@cloudera ~]# chkconfig iptables off

The final step is to run the installer (cloudera-manager-installer.bin) from the command line.

[root@cloudera ~]# chmod u+x cloudera-manager-installer.bin
[root@cloudera ~]# ./cloudera-manager-installer.bin

Accept the EULAs and the process is off to the races. Upon completion, the following screen should appear in the terminal.

cloudera manager

The screenshot shows the web UI address and the username/password information for the newly installed instance of Cloudera Manager.
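As a quick sanity check after the installer finishes, it is worth confirming the server process is running before moving on. The service name below is the one the installer registers on CentOS 6, to the best of my recollection; the UI listens on port 7180 with admin/admin as the default credentials.

[root@cloudera ~]# service cloudera-scm-server status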

Adding an Application Manager to BDE

Once Cloudera Manager is installed, the next step is to tie it into the Big Data Extensions installation in the vSphere environment. To do so, log onto the vSphere Web UI and go to the Big Data Extensions tab. Under the Application Managers section in the left-side menu, click the plus icon and fill out the form.

cloudera manager
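For those who prefer the Serengeti CLI over the Web UI, the same registration can be done from the command line. The commands below are from memory against the BDE 2.x CLI guide, and the hostnames and names are placeholders from my lab, so verify the flags against your version before relying on them.

serengeti> connect --host bde-mgmt.lab.local:8443
serengeti> appmanager add --name CDH5Manager --type ClouderaManager --url http://cloudera.lab.local:7180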

Now the Big Data Extensions framework is capable of using the Cloudera Manager for installing and managing Hadoop clusters.

Deploy a Hadoop Cluster using Cloudera Manager

In the vSphere Web UI, deploy a Hadoop cluster using Big Data Extensions — the only difference now is selecting the CDH5 Manager as the Application Manager.

cloudera manager
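The same deployment can also be kicked off from the Serengeti CLI. This is a sketch based on how I remember the cluster create command and its --appManager option working in BDE 2.x; the cluster, manager, and distro names are placeholders, so adjust them for your environment.

serengeti> cluster create --name cdh5cluster --appManager CDH5Manager --distro cdh5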

The deployment process will initially proceed in the same way it would without Cloudera Manager. The Big Data Extensions framework will clone the template VM, configure each clone based on the memory, disk, and CPU specified, and power on all of the VMs. Once the VMs have their initial configuration, BDE hands them off to Cloudera Manager to install the local agent and then the proper Hadoop applications.

Once the deployment is complete, the newly deployed Hadoop cluster is visible within Cloudera Manager.

cloudera manager

There were a few other minor tweaks within Cloudera Manager I found necessary to get it working ‘just so’ in my vSphere environment. I will be posting what those tweaks were, and going over other parts of Cloudera Manager that assist in the SDLC management of Hadoop clusters, in other posts this week.

Enjoy!

Read More

vsphere6-login-custom

Over the past weekend, I wiped my home lab environment that was running vSphere 5.5 and installed a fresh set of vSphere 6.0U1 bits. The decision to begin using vSphere 6.0 in the home lab was largely due to no longer needing vSphere 5.5 for my VCAP-DCA studying and wanting to finally begin using the Instant Clone technology for deploying Hadoop clusters with Big Data Extensions. As I mentioned in an earlier post, the release of version 2.2 of Big Data Extensions exposed the ability to use VMware Instant Clone technology. It did not, however, enable the setting by default, leaving it to the vSphere administrator to determine whether or not to enable it.

Turning on the feature is simple enough. Edit the /opt/serengeti/conf/serengeti.properties file on the BDE management server, changing line 104 so that the cluster.clone.service variable is set to instant. The exact syntax is highlighted in the screenshot below.

bde-instant-clone-ss

After saving the file and restarting the Tomcat service, the environment is ready to go!
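For reference, the edit and restart can be done from the BDE management server shell along these lines. The property name is taken from my own serengeti.properties file, the prompt hostname is just my lab's, and the Tomcat service name may differ slightly depending on the appliance version.

[root@bde-mgmt ~]# sed -i 's/^cluster.clone.service.*/cluster.clone.service = instant/' /opt/serengeti/conf/serengeti.properties
[root@bde-mgmt ~]# service tomcat restart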

For those unfamiliar with the VMware Instant Clone technology, a brief description from the VMware Blog states,

The Instant Clone capability allows admins to ‘fork’ a running virtual machine, meaning, it is not a full clone. The parent virtual machine is brought to a state where the admin will Instant Clone it, at which time the parent virtual machine is quiesced and placed in a state of a ‘parent VM’. This allows the admins to create as many “child VMs” as they please. These child VMs are created in mere seconds (or less) depending on the environment (I’ve seen child VMs created in .6 seconds). The reason these child VMs can be created so quickly is because they leverage the memory and disk of the parent VM. Once the child VM is created, any writes are placed in delta disks.

When the parent virtual machine is quiesced, a prequiesce script cleans up certain aspects of the parent VM while placing it in its parent state, allowing the child VMs to receive unique MAC addresses, UUID, and other information when they are instantiated. When spinning up the child VMs a post clone script can be used to set properties such as the network information of the VM, and/or kick off additional scripts or actions within the child VM.

The ability to deploy new Hadoop VMs so quickly through Big Data Extensions is amazing! In addition, because of the extensibility of BDE, the VMware Instant Clone feature is used when any type of cluster deployment is initiated: Apache Spark, Apache Mesos, Apache Hadoop, etc.

VMware Instant Clone VMs

The new cloned VMs launched in less than 1 minute; for my Intel NUC home lab that was really impressive! I have read all the press stating Instant Clones launch in single-digit seconds, but you never really know how something is going to perform until you see it in an environment you control. Seeing is believing!

The interesting bit I did not anticipate was the fact that a single parent VM was spun up on each ESXi host inside my home lab when the first Hadoop cluster was deployed. You can see the parent VMs in a new resource pool created during the deployment that is separate from the Hadoop cluster resource pool.

instant-clones-created

Because of these additional parent VMs, which did not exist when the original cluster was deployed without VMware Instant Clone, the total resource utilization across the cluster actually increased.

Cluster Utilization Before VMware Instant Clone

hadoop-cluster-before

hadoop-before-utilization

Cluster Utilization After VMware Instant Clone

hadoop-cluster-after

hadoop-cluster-utilization-1

Increased Hadoop Cluster Utilization with VMware Instant Clone

The next step was to increase the cluster size to see what sort of savings could be realized through the Instant Clone technology. I increased the number of worker nodes through BDE using the new interface.

hadoop-cluster-increase-ui

Note: The new adjustment field showing the new node count is an outstanding addition!
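For anyone doing the same thing outside of the Web UI, the Serengeti CLI has a resize command for this. The syntax below is how I recall it from the BDE CLI guide; the cluster name and node count are placeholders.

serengeti> cluster resize --name cdh5cluster --nodeGroup worker --instanceNum 10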

After the new nodes were added, the total cluster utilization showed some savings.

hadoop-increased-cluster

hadoop-increased-cluster-utilization

I am looking forward to using the new VMware Instant Clone technology more in the lab, including with Photon nodes, to see what additional savings I can get out of my home lab.

Read More

VMworld 2015

Monday was my first day at VMworld 2015 in San Francisco, CA and it was outstanding! I followed the Cloud Native Apps track throughout the day and it kicked off with a bang with the announcements around Photon Platform and vSphere Integrated Containers.

I attended the following sessions today:

  • STO5443 Case Study of Virtualized Hadoop on VMware Software Defined Storage Technologies
  • CNA6261 Internals: How We Integrated Kubernetes with the SDDC
  • CNA6649-S Build and run Cloud Native Apps in your Software-Defined Data Center

All great sessions in their own right, but I am most excited about the things coming out of the Cloud Native Apps team and the Office of the CTO at VMware. Here are a few of the key takeaways I had from today's sessions.

Cells as the new layer of abstraction

One of the new pieces of technology coming out of VMware in the coming months is the ability to deploy clusters through a self-service cloud portal to the IaaS layer of your choosing (vCloud Air, vSphere, EC2, and GCE). The PaaS offering plans to include support for Kubernetes and Mesos clusters, along with many other applications such as MongoDB, Cassandra, etc. The motivating idea behind the technology is to abstract the notion of VMs away from developers and instead deliver the cell as the new abstraction layer. This should allow developers to standardize management of all the apps across the enterprise and move workloads between the different public and private cloud offerings.

I understand they wrote this framework from scratch and did not use the framework Big Data Extensions currently uses. I am excited to get a look at it to see how they are accomplishing this and what the key differentiators from BDE are.

vSphere Integrated Containers

This is really outstanding. The VMware website describes the project's key points as follows.

With VMware vSphere at its foundation, the new offering will help IT operations team meet the following enterprise requirements for containers:

  • Security and Isolation – Assuring the integrity and authenticity of containers and their underlying infrastructure, Project Bonneville, a technology preview, isolates and starts up each container in a virtual machine with minimal overhead using the Instant Clone feature of VMware vSphere 6.
  • Storage and Data Persistence – While many container services are stateless today, customers have the desire to enable stateful services to support cloud-native databases. VMware vSphere Integrated Containers will enable provisioning of persistent data volumes for containers in VMware vSphere environments. This will enable IT operations and development teams to take advantage of the speed and portability of containerized applications in conjunction with highly resilient VMware vSphere storage, including VMware Virtual SAN™ and VMware vSphere Virtual Volumes™-enabled external storage.
  • Networking – VMware NSX™ supports production container deployments today. With VMware NSX, IT can apply fine-grained network micro-segmentation and policy-based security to cloud-native applications. Additionally, VMware NSX provides IT with greater visibility into the behavior of containers. Finally, with VMware NSX, containers can be integrated with the rest of the data center, and can be connected to quarantine, forensics and/or monitoring networks for additional monitoring and troubleshooting.
  • Service-Level Agreements (SLAs) – IT teams will be able to assure service-level agreements for container workloads with VMware vSphere Distributed Resource Scheduler as well as reduce planned and unplanned downtime with VMware vSphere High Availability and VMware vSphere vMotion®.
  • Management – Administrators will be able to use VMware vCenter Server™ to view and manage their containers without the need for new tools or additional training through Project Bonneville, which will enable the seamless integration of containers into VMware vSphere. Customers can further achieve consistent management and configuration compliance across private and public clouds using the VMware vRealize™ Suite.

Pretty outstanding stuff!

Photon Platform

Kit talked about the Photon Platform today, and some of the features it will bring to Cloud Native App environments are really going to be game changers. I got a look at this technology almost a year ago when it was known by another codename, and it looks even better now! The announcement that the Photon Controller will be released as open source follows the same path Project Photon and Project Lightwave took with their announcements a few months ago.

The other announcement, the vSphere driver for Flocker, is a welcome addition to the Cloud Native App storyline. Persistent data in containers is one of the bigger challenges the industry is still working to solve in a manner that will work for enterprise environments. Having the container itself own a VMDK that is persistent and available when the container migrates through the environment is huge. The code is available on the VMware GitHub account and I am anxious to get my hands on it ASAP!

Conclusion

Overall, an outstanding way to start the conference. I am excited for what tomorrow is going to bring and look forward to working with many of these technologies coming from VMware. To be completely honest, the story VMware is telling in the Cloud Native App space, and the internal projects surrounding it, are among the primary reasons I joined VMware this summer. It is really great to work for a company that is passionate about technology and pushing the envelope with what is possible today!

I love my job!

Read More

VMware released the latest version of Big Data Extensions during Hadoop Summit on June 4, 2015. Included in the release notes are two features that have me really excited about this version.

Resize Hadoop Clusters on Demand. You can reduce the number of virtual machines in a running Hadoop cluster, allowing you to manage resources in your VMware ESXi and vCenter Server environments. The virtual machines are deleted, releasing all resources such as memory, CPU, and I/O.

Increase Cloning Performance and Resource Usage of Virtual Machines. You can rapidly clone and deploy virtual machines using Instant Clone, a feature of vSphere 6.0. Using Instant Clone, a parent virtual machine is forked, and then a child virtual machine (or instant clone) is created. The child virtual machine leverages the storage and memory of the parent, reducing resource usage.

These are really outstanding features, especially the ability to use the Instant Clone (aka VMFork) functionality introduced in vSphere 6.0. The Instant Clone technology has very interesting implications when considered with the work in the Cloud Native Application space. Deploying a very small Photon VM to immediately launch a Docker container workload in a matter of seconds will add a huge benefit to running virtualized Apache Mesos clusters (with Marathon) that have been deployed using the BDE framework.

I had hoped to see the functionality from the BDE Fling, which included recipes for Mesos and Kubernetes, incorporated into the official VMware release. On a positive note, the cookbooks for Mesos, Marathon, and Kubernetes are present on the management server. It will just take a little effort to unlock those features.

06.27.2015 UPDATE: I have confirmed all of the cookbooks for Mesos, Docker and Kubernetes are present on the BDE v2.2 management server. I will have a post shortly describing how to unlock them for use.

Overall, a big release from the VMware team, and it appears they are on the right track to increase the functionality of the BDE framework!

Read More