Apache Storm cluster deployment through vSphere Big Data Extensions

Last year at VMworld, Andy and I spoke about the data pipeline, all of the different pieces involved, and how their interactions lead to congestion. For any organization, how you deal with that congestion will affect how much data an application can process. Fortunately, much like what Apache Hadoop has done for batch processing, Apache Storm has entered the real-time processing arena to help applications more efficiently process big data.

In case you are unfamiliar with Apache Storm, a basic explanation of its purpose and design can be seen on the Apache Storm site.

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, is used by many companies, and is a lot of fun to use!

Hortonworks has provided some great insight into how Apache Storm can be utilized alongside Hadoop to allow organizations to become even more agile and efficient in their data pipeline processing.

vSphere 6 Hadoop Performance Whitepaper

This past week saw the second whitepaper published by Jeff Buell at VMware on Hadoop performance within a vSphere environment. The initial whitepaper was published in 2013 and was one of the first pieces of information I latched onto when I started my journey down the virtualized Hadoop road. Once again, Jeff Buell does a great job going through the same set of tests performed in 2013 to show the performance gains that can be realized when virtualizing Hadoop with VMware vSphere 6 on bare-metal servers.

I first met Jeff last year at Hadoop Summit in San Jose, CA, where I had an opportunity to talk to him about his work. He is a brilliant individual with amazing insight into both Hadoop and vSphere environments. He was also instrumental in the work we did at Adobe last summer with EMC on our large-scale POC utilizing EMC Isilon storage for the HDFS layer. The latest whitepaper continues his work and improves upon everything he has evangelized in the past on virtualized Hadoop.

Rather than regurgitate the information published in the whitepapers, I want to spend a moment mentioning a few of the items that stood out most to me:

  1. Single-queue versus multi-queue (default) settings when using the VMXNET3 virtual NIC driver.
  2. Using pRDM devices rather than SAN or Isilon storage.
  3. The ESXi scheduler being able to isolate a VM to a single NUMA node for both CPU and memory, increasing the performance of the VM when running Hadoop tasks (a small sketch of how to experiment with this follows the list).
  4. Best practices for virtualizing Hadoop to ensure the cluster is able to realize all of the performance gains possible.
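
Number 3 is the easiest one to poke at in a home lab. The whitepaper's point is that the scheduler keeps a right-sized VM within a single NUMA node on its own, but if you want to force the behavior for a quick test, the numa.nodeAffinity advanced option can be set from PowerCLI. A minimal sketch of my own, with a made-up vCenter address and VM name (not something pulled from the whitepaper):

    # Minimal sketch: restrict a single worker VM to NUMA node 0 for testing.
    # The vCenter address and VM name are placeholders.
    Connect-VIServer -Server "vcenter.example.com"

    $vm = Get-VM -Name "hadoop-worker-01"

    # numa.nodeAffinity is a documented VMX advanced option; "0" keeps the VM
    # on node 0. The scheduler normally handles placement on its own when the
    # VM's vCPU count and memory fit within a single node, so treat this as an
    # experiment rather than a standing configuration.
    New-AdvancedSetting -Entity $vm -Name "numa.nodeAffinity" -Value "0" -Force -Confirm:$false

    # Review the NUMA-related settings now applied to the VM.
    Get-AdvancedSetting -Entity $vm |
        Where-Object { $_.Name -like "numa*" } |
        Select-Object Name, Value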

I still find it curious that the whitepaper did not utilize VMware Big Data Extensions for cluster deployment and configuration; I believe both parties could stand to gain if they were more closely aligned. I have incorporated many of the best practices mentioned by Jeff within my own BDE environments, but I would like to see more of them committed into the BDE/Serengeti code. Beyond that, I think the next step is going to be performing the same set of tests against the VMware vCloud Air public cloud offering, to help consumers understand the advantages and disadvantages of running a virtualized Hadoop workload in that environment.

I am still hearing engineers and other system administrators make blanket statements that Hadoop is a workload that cannot be virtualized. I most appreciate the work done by Jeff and VMware because it gives me one more piece of published information to help change people’s minds. I highly encourage you to spend a few minutes reading both whitepapers and the other works referenced in them to better understand the work Jeff and others have done in this space.

vExpert 2015 – now what?

First off, I want to say thank you to VMware for selecting me as a vExpert for 2015. I will be honest in stating that it was a stretch goal for me last year to feel qualified enough to fill out the application for vExpert. I am honored to be included with so many worthy individuals who contribute back to the community and get me excited for the challenges we face daily!

The great part for me is the challenge I have set for myself to do even more this year, both professionally and personally, on the virtualization front and in the community. I like to challenge myself, and working on the things that excite me (Hadoop, Mesos, Docker, etc.) helps me to always be improving. Right now I have my head down studying for the VCAP exams, automating operational tasks through PowerCLI, and evangelizing Big Data Extensions internally to teams who are trying to work through many of the challenges facing them with intelligent cluster deployments. It is pretty easy to keep myself busy for 16+ hours a day right now.

One of the things I am working through in my head right now is how to extend Big Data Extensions further — by making it possible to initiate cluster deployments through the OpenStack API. What does that look like and how challenging is it to accomplish?

This was the first year I applied for vExpert and the first time I was selected. I plan to continue to challenge myself to keep this going for several years and give back to the community that has given me so much!

PowerCLI script for vCenter configuration

As part of my efforts to automate further within the private cloud environment at work, set up a home lab, and study for the VCAP-DCA exam, I have been working on a PowerCLI script to configure vCenter once it is deployed. (Note: we are also working on deploying vCenter itself using Puppet, but that’s another post.)

It has been really good to work through the functionality exposed through PowerCLI (Version 5.8 Release 1) and the pieces that are only exposed through the vSphere API. I would also like to note that @vBrianGraf from VMware Tech Marketing has been an awesome resource for several different pieces that are not currently exposed directly through PowerCLI.

The goals for the script were as follows:

  1. Set up datacenter and folder management.
  2. Create cluster(s) with High Availability and DRS configured.
    1. HA would set the HA Admission Control Policy to reserve 25% of CPU and memory (see the sketch after this list).
  3. Create a vDS and default portgroup.
  4. Set up Auto Deploy.
    1. For the Intel NUC, this included adding a custom VIB for the e1000e network driver.
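
The admission control piece is a good example of a setting I could not find exposed directly through the PowerCLI 5.8 cmdlets, so it is handled by dropping down to the vSphere API via Get-View. A rough, simplified sketch of that approach (the cluster name is a placeholder, and this is an approximation rather than the script itself):

    # Assumes an existing Connect-VIServer session.
    # Set the HA admission control policy to reserve 25% of CPU and memory.
    $clusterView = Get-Cluster -Name "Cluster01" | Get-View

    $spec = New-Object VMware.Vim.ClusterConfigSpecEx
    $spec.DasConfig = New-Object VMware.Vim.ClusterDasConfigInfo
    $spec.DasConfig.AdmissionControlEnabled = $true

    # Percentage-based policy: reserve 25% of cluster CPU and memory capacity.
    $policy = New-Object VMware.Vim.ClusterFailoverResourcesAdmissionControlPolicy
    $policy.CpuFailoverResourcesPercent = 25
    $policy.MemoryFailoverResourcesPercent = 25
    $spec.DasConfig.AdmissionControlPolicy = $policy

    # The second argument ($true) means "modify", merging with the existing config.
    $clusterView.ReconfigureComputeResource_Task($spec, $true)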

The script can be downloaded here.
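
To give a sense of the overall flow without the download, here is a heavily condensed sketch. Every name, path, and address in it is a placeholder, the depot and VIB references are examples only, and it assumes an existing Connect-VIServer session; this is the general shape, not the script itself.

    # 1. Datacenter and folder layout.
    $dc = New-Datacenter -Location (Get-Folder -NoRecursion) -Name "Lab"
    $vmRoot = Get-Folder -Name "vm" -Location $dc
    New-Folder -Name "Management" -Location $vmRoot | Out-Null

    # 2. Cluster with HA and DRS enabled. The percentage-based admission
    #    control policy is applied separately through the API, as sketched above.
    $cluster = New-Cluster -Location $dc -Name "Cluster01" `
        -HAEnabled -HAAdmissionControlEnabled `
        -DrsEnabled -DrsAutomationLevel FullyAutomated

    # 3. Distributed switch and a default portgroup.
    $vds = New-VDSwitch -Name "vds01" -Location $dc -NumUplinkPorts 2
    New-VDPortgroup -VDSwitch $vds -Name "VM Network" -VlanId 10 | Out-Null

    # 4. Auto Deploy: clone an image profile, add the NIC driver VIB the
    #    Intel NUC needs, and attach a deploy rule. Depot paths and package
    #    names are examples only.
    Add-EsxSoftwareDepot "C:\Depot\ESXi-5.5-offline-bundle.zip"
    Add-EsxSoftwareDepot "C:\Depot\net-e1000e-offline-bundle.zip"
    $imageProfile = New-EsxImageProfile -CloneProfile "ESXi-5.5.0-standard" `
        -Name "ESXi-5.5.0-nuc" -Vendor "homelab"
    Add-EsxSoftwarePackage -ImageProfile $imageProfile -SoftwarePackage "net-e1000e"
    New-DeployRule -Name "nuc-hosts" -Item $imageProfile, $cluster -AllHosts
    Add-DeployRule -DeployRule "nuc-hosts"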


Virtualizing Hadoop in Large-Scale Infrastructures Whitepaper

For a large part of 2014, I was involved in a Proof-of-Concept (POC) at work with EMC and a great Adobe storage engineer, Jason (@jason_farns), working on using Isilon as the HDFS layer for virtualized Hadoop clusters. After many hours, long weekends, and serious amounts of trial and error, a whitepaper has been published on the work we did. This is the first published paper I have been involved in, and it is really exciting to see it out there for everyone to read.

There is always more work to be done, but this was a great start.

http://community.emc.com/docs/DOC-41473