Rightsizing YARN containers for virtual machines

Working on a specific use-case at work has required that I modify the Chef recipe templates for mapred-site.xml and yarn-site.xml to configure the memory allocations correctly. The container sizes themselves will depend on the size of VMs you are creating, and BDE has some generic settings by default, but again with each workload being different it is necessary to tune these parameters just as you would with a physical Hadoop cluster.

The virtual machines within this compute-only (Isilon-backed HDFS + NameNode) cluster utilized the ‘Medium’ sized node within BDE. That means:

  • 2 vCPU
  • 7.5GB RAM
  • 100GB drives

The specific YARN and MapReduce settings I have used to take advantage of the total memory allocated to the cluster was:

155 <% else %>
156 <property>
157   <name>mapred.child.java.opts</name>
158   <value>-Xmx1024m</value>
159 </property>
161 <!-- <property> -->
162 <!--  <name>mapred.child.ulimit</name> -->
163 <!--  <value><%= node[:hadoop][:java_child_ulimit] %></value> -->
164 <!-- </property> -->
166 <property>
167   <description>MapReduce map memory, in MB</description>
168   <name>mapreduce.map.memory.mb</name>
169   <value>1024</value>
170 </property>
172 <property>
173   <description>MapReduce map java options</description>
174   <name>mapreduce.map.java.opts</name>
175   <value>-Xmx819m</value>
176 </property>
178 <property>
179   <description>MapReduce reduce memory, in MB</description>
180   <name>mapreduce.reduce.memory.mb</name>
181   <value>2048</value>
182 </property>
184 <property>
185   <description>MapReduce reduce java options</description>
186   <name>mapreduce.reduce.java.opts</name>
187   <value>-Xmx1638m</value>
188 </property>
190 <property>
191   <description>MapReduce task IO sort, in MB</description>
192   <name>mapreduce.task.io.sort.mb</name>
193   <value>409</value>
194 </property>
196 <% end %>

 72 <property>
 73   <description>Amount of physical memory, in MB, that can be allocated
 74     for containers.</description>
 75   <name>yarn.nodemanager.resource.memory-mb</name>
 76   <!-- <value><%= node[:yarn][:nm_resource_mem] %></value> -->
 77   <value>6122</value>
 78 </property>
 80 <property>
 81   <description>The amount of memory the MR AppMaster needs.</description>
 82   <name>yarn.app.mapreduce.am.resource.mb</name>
 83   <!-- <value><%= node[:yarn][:am_resource_mem] %></value> -->
 84   <value>2048</value>
 85 </property>
 87 <property>
 88   <description>Scheduler minimum memory, in MB, that can be allocated.</description>
 89   <name>yarn.scheduler.minimum-allocation-mb</name>
 90   <value>1024</value>
 91 </property>
 93 <property>
 94   <description>Scheduler maximum memory, in MB, that can be allocated.</description>
 95   <name>yarn.scheduler.maximum-allocation-mb</name>
 96   <value>6122</value>
 97 </property>
 99 <property>
100   <description>Application master options</description>
101   <name>yarn.app.mapreduce.am.command-opts</name>
102   <value>-Xmx1638m</value>
103 </property>
126 <property>
127   <description>Disable the vmem check that is turned on by default in Yarn.</description>
128   <name>yarn.nodemanager.vmem-check.enabled</name>
129   <value>false</value>
130 </property>
Again, mileage will vary depending on your Hadoop workload, but these configuration settings should allow you to utilize the majority of the memory resources within a cluster deployed with the ‘Medium’ sized nodes within BDE.
I used the following articles as guidelines when tuning my cluster, along with trial and error.

Apache Flume node in VMware vSphere BDE – Part 1

Apache Flume is an open-source project to assist Hadoop users to ingest data into HDFS. It provides a reliable mechanism for sending raw data into HDFS to be used by the Hadoop cluster. Many new users of Hadoop often ask, “How can I get my data into HDFS for me to begin taking advantage of it?” As a result, a Flume node within the Hadoop cluster is a good first step.

There is a white paper, written by VMware, that describes how to include an Apache Flume node within a BDE-deployed Hadoop cluster. The steps and use-case described in the white paper are quite adequate for deploying a node that can be made available to the cluster. However, as I began thinking about how to offer this as part of a Hadoop-as-a-Service offering, I realized that the ability to deploy a Flume node through BDE needed to happen at the time of deployment — not afterwards. I certainly did not want to have to go through many of the manual steps to configure Flume when all of that information is available to BDE at the time of the cluster deployment.

Continue reading “Apache Flume node in VMware vSphere BDE – Part 1”

Performance Tuning for Hadoop Clusters

As I stated previously, the session I learned the most from at Hadoop Summit was about performance tuning the OS to ensure the cluster is getting the most from the infrastructure (slides can be found here). In order to do so, I had to modify the Chef recipes inside of the BDE management server to have the updates installed on all new clusters.

  • Disable swampiness, increase the proc and file limits in /opt/serengeti/cookbooks/cookbooks/hadoop_cluster/recipes/dedicated_server_tuning.rb
   3 ulimit_hard_nofile = 32768
   4 ulimit_soft_nofile = 32768
   5 ulimit_hard_nproc = 32768
   6 ulimit_soft_nproc = 32768
   7 vm_swappiness = 0
   8 redhat_transparent_hugepage = "never"
   9 vm_swappiness_line = "vm.swappiness = 0"
  11 def set_proc_sys_limit desc, proc_path, limit
  12   bash desc do
  13     not_if{ File.exists?(proc_path) && (File.read(proc_path).chomp.strip == limit.to_s) }
  14     code  "echo #{limit} > #{proc_path}"
  15   end
  16 end
  18 def set_swap_sys_limit desc, file_path, limit
  19   bash desc do
  20     not_if{ File.exists?(file_path) && (File.read(file_path).chomp.strip == limit.to_s) }
  21     code  "echo #{limit} > #{file_path}"
  22   end
  23 end
  25 set_proc_sys_limit "VM overcommit ratio", '/proc/sys/vm/overcommit_memory', overcommit_memory
  26 set_proc_sys_limit "VM overcommit memory", '/proc/sys/vm/overcommit_ratio',  overcommit_ratio
  27 set_proc_sys_limit "VM swappiness", '/proc/sys/vm/swappiness', vm_swappiness
  28 set_proc_sys_limit "Redhat transparent hugepage defag", '/sys/kernel/mm/redhat_transparent_hugepage/defrag', redhat_transparent_hugepage
  29 set_proc_sys_limit "Redhat transparent hugepage enable", '/sys/kernel/mm/redhat_transparent_hugepage/enabled', redhat_transparent_hugepage
  31 set_swap_sys_limit "SYSCTL swappiness setting", '/etc/sysctl.conf', vm_swappiness_line
  • Remove root reserved space from the filesystems in /opt/serengeti/cookbooks/cookbooks/hadoop_common/libraries/default.rb

335 function format_disk_internal()
336 {
337   kernel=`uname -r | cut -d'-' -f1`
338   first=`echo $kernel | cut -d '.' -f1`
339   second=`echo $kernel | cut -d '.' -f2`
340   third=`echo $kernel | cut -d '.' -f3`
341   num=$[ $first*10000 + $second*100 + $third ]
343   # we cannot use [[ "$kernel" < "2.6.28" ]] here becase linux kernel
344   # has versions like "2.6.5"
345   if [ $num -lt 20628 ];
346   then
347     mkfs -t ext3 -b 4096 -m 0 $1;
348   else
349     mkfs -t ext4 -b 4096 -m 0 $1;
350   fi;
351 }
Once the recipes were updated, in order for the changes to take effect, be sure to execute the command:
# knife cookbook upload -a
At which point, BDE will now be configured to include several of the commonly missed performance enhancements on a Hadoop cluster. There are several more configuration changes that can be made that I will cover in a future post.

Hadoop Summit 2014 Recap

Having recently returned from Hadoop Summit 2014 in San Jose, I wanted to take some time to jot down my thoughts on the sessions. I primarily focused on the sessions that revolved around operational management of Hadoop to see how other companies are tackling the same problems I am facing. It is comforting to know that I am not alone in my quest to deliver a reliable Hadoop platform across the development lifecycle for my internal customers to consume. However, one of the frustrating things to witness was the inherent lack of large-scale organizations operating within their own private cloud environments. Many of the demonstrations involved utilizing resources from AWS or made the assumption you would never run out of bare-metal hardware to deploy on. My experience is wholly different.

The challenging part of offering a true Hadoop-as-a-Service platform is the expectation that additional resources will always be available for an Engineering team or Operations team to consume at a moments notice. For that to be the case, in my experience, AWS becomes too expensive too quickly and bare-metal hardware is difficult to procure at a moments notice within a large, publicly traded organization. For that, a private cloud environment is perfect — but no one wants to openly talk about running Hadoop on a virtual platform. Which, when you start thinking about it is quite humorous because most demonstrations showed Hadoop running in AWS — what do they think an EC2 instance is exactly?

My talk with Andrew Nelson on running a production Hadoop-as-a-Service platform using VMware vCenter Big Data Extensions went well. The audience was well-educated and we received some rather good questions at the end. Virtualizing Hadoop for my organization has been a great way to solve many of the lifecycle management issues faced in today’s rapidly changing environment.

All that being said, here are the key takeaways/questions I gained from Hadoop Summit 2014:

  • Failure handling (Docker) — What happens when a container is lost and you are waiting for a new container to be created?
  • Docker can easily be virtualized within existing private cloud environments.
  • Data locality and container affinity can be accomplished with existing private cloud environments.
  • Writing an Application Master is hard and error prone.
  • Performance evaluation != Workload
  • Hbase regionserver splits are expensive & it is suggested that you pre-split the region — elasticity is really lacking here.
  • Best session by far was from Alex Moundalexis @Cloudera: http://tiny.cloudera.com/7steps
  • Performance tuning the Linux OS is key and oftentimes overlooked by DevOps. There are several low lift, high yield changes that can be made.
  • Ambari Apache Project is another manager to evaluate.
  • Many projects trying to solve the same problems that VMware vCloud Automation Center already offers at a larger-scale and more feature rich solution.
  • Read and re-read Hadoop Operations.