Workload-based cluster sizing for Hadoop

There is a quote in the book Hadoop Operations by Eric Sammer (O’Reilly) that states:

“The complexity of sizing a cluster comes from knowing — or more commonly, not knowing — the specifics of such a workload: its CPU, memory, storage, disk I/O, or frequency of execution requirements. Worse, it’s common to see a single cluster support many diverse types of jobs with conflicting resource requirements.”

In my experience, that is an accurate statement. It does not, however, preclude you from gathering that very information so that an intelligent decision can be made. In fact, VMware vCenter Operations Manager becomes an invaluable tool in the toolbox for managing the entire SDLC of a Hadoop cluster.

Initial sizing of the Hadoop cluster in the Engineering|Pre-Production|Chaos environment of your business will involve some amount of guessing. You can stick with the tried-and-true methodology of answering two questions: “How much data do I have for HDFS initially?” and “How much data do I need to ingest into HDFS daily|monthly?” From that point, you will need to start monitoring the workload(s) placed on the Hadoop cluster and use that data to determine the cluster size once it moves into the QE, Staging, and Production environments.
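As a rough starting point for the capacity side of that math, here is a minimal back-of-the-envelope sketch in Python. Every input figure (initial data, daily ingest, retention, replication factor, overhead, and usable disk per DataNode) is a hypothetical placeholder to be replaced with your own numbers, and the calculation only covers storage; CPU, memory, and disk I/O still need to be profiled from the monitored workloads.

# Back-of-the-envelope HDFS capacity estimate; every input below is a placeholder.
initial_data_tb = 50.0          # data loaded into HDFS on day one
daily_ingest_tb = 0.5           # new data ingested per day
retention_days = 365            # how long ingested data is retained
replication_factor = 3          # HDFS default block replication
overhead_factor = 1.25          # headroom for intermediate/scratch data and logs
usable_disk_per_node_tb = 24.0  # usable data disk per DataNode

raw_data_tb = initial_data_tb + daily_ingest_tb * retention_days
required_capacity_tb = raw_data_tb * replication_factor * overhead_factor
datanodes_needed = int(-(-required_capacity_tb // usable_disk_per_node_tb))  # ceiling division

print(f"Raw data after {retention_days} days: {raw_data_tb:.1f} TB")
print(f"Required HDFS capacity: {required_capacity_tb:.1f} TB")
print(f"Estimated DataNodes needed: {datanodes_needed}")

Note that if the HDFS layer is backed by an Isilon array (as in the BDE post further down), OneFS handles data protection itself, so the replication multiplier in this sketch would not apply in the same way.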


VMworld 2014 US session schedule

With VMworld 2014 in the United States fast approaching, I have been working on building out my schedule based on my personal objectives and checking the popular blogger sites for their recommendations. In that spirit, I thought I would share the sessions I am most excited about this year in San Francisco.

Last year was my first year at VMworld and I focused on the Hands-on-Labs (HoLs) and generic sessions to better understand the VMware ecosystem. This year I am focused on three primary topics:

  • VMware NSX
  • OpenStack|Docker|Containers with VMware
  • VMware VSAN

Here are the sessions I am focused on:

  • SEC1746 NSX Distributed Firewall Deep Dive
  • NET1966 Operational Best Practices for VMware NSX
  • NET1949 VMware NSX for Docker, Containers & Mesos
  • SDDC3350 VMware and Docker — Better Together
  • SDDC2370 Why Openstack runs best with the vCloud suite
  • STO1279 Virtual SAN Architecture Deep Dive
  • STO1424 Massively Scaling Virtual SAN implementations

In addition to that, I am also excited for my own sessions at VMworld this year around Hadoop, VMware BDE, and building a Hadoop-as-a-Service!

  • VAPP1428 Hadoop-as-a-Service: Utilizing VMware Cloud Automation Center and Big Data Extensions at Adobe (Monday & Wednesday sessions)

I am excited for the week to kick off and to see all the great things coming to our virtualized world.

Linux VM boot error workaround

Not specifically related to Hadoop or Big Data Extensions, but I came across this bug tonight. There is a KB article on the VMware website (here), but the syntax it lists is incorrect.

The error I was seeing on the VM console was “vmsvc [warning] [guestinfo] RecordRoutingInfo: Unable to collect IPv4 routing table” immediately after eth0 was brought online. The workaround to fix the issue, beyond upgrading arping in the OS, is to add the following line to the virtual machine’s .vmx file:

rtc.diffFromUTC = "0"

The quotes are missing from the VMware knowledge base article and are indeed necessary to fix the issue and get the virtual machine past this point in the boot process.

Adding Hadoop Jobtracker History retention in BDE

As we’ve been working on very large datasets tied back to an Isilon array for the HDFS layer, we discovered that the history server functionality was missing from BDE (both 1.1 and 2.0). After talking to a few individuals and getting some direction, but no solution, I realized the ability to turn the feature on was available; it simply was not enabled by default.

In order to turn on the jobtracker history server functionality, so that you can see the job logs after they complete, add the following properties to the mapred-site.xml.erb template for your BDE version:

  • BDE 1.1: /opt/serengeti/cookbooks/cookbooks/hadoop_cluster/templates/default/mapred-site.xml.erb
  • BDE 2.0: /opt/serengeti/chef/cookbooks/hadoop_cluster/templates/default/mapred-site.xml.erb

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value><%= @resourcemanager_address %>:19888</value>
</property>

<property>
  <name>mapreduce.jobhistory.address</name>
  <value><%= @resourcemanager_address %>:10020</value>
</property>

As always, be sure to run ‘knife cookbook upload -a’ after editing the file; the updated template will then be used during your cluster deployments.
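Once a cluster is deployed from the updated template, a quick sanity check is to query the history server’s web port. The sketch below assumes Hadoop 2.x, where the history server exposes a REST endpoint at /ws/v1/history/info; the hostname used here is a placeholder for whatever @resourcemanager_address resolves to in your deployment.

import json
import urllib.request

# Placeholder host; substitute the node that @resourcemanager_address points at.
history_url = "http://resourcemanager.example.com:19888/ws/v1/history/info"

# The MapReduce JobHistory server publishes basic status via its REST API.
with urllib.request.urlopen(history_url, timeout=10) as resp:
    info = json.load(resp)

print("JobHistory server responded:", info)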

VMworld 2014 Session information

The schedule has been announced for VMworld 2014 and I will be speaking with Andrew Nelson at two different times.

VAPP1428 – Hadoop as a Service: Utilizing VMware Cloud Automation Center and Big Data Extensions at Adobe

Looking forward to discussing Hadoop-as-a-Service in great detail. Hope to see you all there!

UPDATE: I’ve been informed that our session has also been picked up for VMworld EMEA in Barcelona, Spain this October!