Blog Posts

Deploying a HDFS cluster for consumption

There have been a number of discussions recently about what a next-generation architecture should look like for large-scale infrastructure. As I have discovered over the past few months, there is a stark difference between what current public and private cloud offerings generally provide and what Google is doing publicly and publishing in its technical documents. The piece […]

What is the virtualization penalty with Hadoop?

After a long week off, I am back and should be posting 2-3x per week leading up to VMworld 2014 in August. I keep getting this question from various software engineers, system engineers, and managers, so I thought it would be a good topic to address here. Disclaimer: Mileage will vary depending on your compute hardware, disk systems (DAS or […]

Rightsizing YARN containers for virtual machines

A specific use case at work has required that I modify the Chef recipe templates for mapred-site.xml and yarn-site.xml to configure the memory allocations correctly. The container sizes themselves will depend on the size of the VMs you are creating, and BDE has some generic settings by default, but again, with each workload being different, it is necessary to tune […]

Apache Flume node in VMware vSphere BDE – Part 1

Apache Flume is an open-source project that helps Hadoop users ingest data into HDFS. It provides a reliable mechanism for sending raw data into HDFS to be used by the Hadoop cluster. New users of Hadoop often ask, “How can I get my data into HDFS for me to begin taking advantage of it?” As a result, a […]

Performance Tuning for Hadoop Clusters

As I stated previously, the session I learned the most from at Hadoop Summit was about performance tuning the OS to ensure the cluster is getting the most from the infrastructure (slides can be found here). In order to do so, I had to modify the Chef recipes inside the BDE management server to have the updates installed on […]
