Apache Flume is an open-source project that helps Hadoop users ingest data into HDFS. It provides a reliable mechanism for sending raw data into HDFS so it can be used by the Hadoop cluster. Many new Hadoop users ask, “How can I get my data into HDFS so I can begin taking advantage of it?” A Flume node within the Hadoop cluster is a good first step toward answering that question.

There is a white paper, written by VMware, that describes how to include an Apache Flume node within a BDE-deployed Hadoop cluster. The steps and use case described in the white paper are adequate for deploying a node that can be made available to the cluster. However, as I began thinking about how to offer this as part of a Hadoop-as-a-Service offering, I realized that the ability to deploy a Flume node through BDE needed to happen at deployment time, not afterwards. I certainly did not want to go through all of the manual configuration steps for Flume when that information is already available to BDE at the time of cluster deployment.

To accommodate this added piece of automation, the functionality has to be added to the BDE management server itself. The steps are not overly complicated, but unless you are already a contributor to the open-source Serengeti project, you will likely find yourself painstakingly working through the details, as I did over the past week.

Here is what I needed to do:

  • Download the Apache Flume packages to the management server.
# cd /opt/serengeti/www/distros/apache/1.2.1/
# wget http://apache.mirrors.lucidnetworks.net/flume/1.5.0/apache-flume-1.5.0-bin.tar.gz
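Before referencing the tarball in the distro map, it is worth a quick sanity check that the download is intact; simply listing the archive contents is enough:
# tar -tzf apache-flume-1.5.0-bin.tar.gz | head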
  • Update the distro map file to include a new role called ‘flume_client’.
      {
        "roles": [
          "flume_client"
        ],
        "tarball": "apache/1.2.1/apache-flume-1.5.0-bin.tar.gz"
      }
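For context, the snippet above is simply a new entry appended to the packages list for the distro; the distro map itself lives under /opt/serengeti/www/distros/ on the management server. Field names can vary slightly between releases, but the overall shape of the Apache 1.2.1 entry ends up looking roughly like this:
{
  "name": "apache",
  "vendor": "Apache",
  "version": "1.2.1",
  "packages": [
    ... existing package entries ...
    {
      "roles": [
        "flume_client"
      ],
      "tarball": "apache/1.2.1/apache-flume-1.5.0-bin.tar.gz"
    }
  ]
}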
  • For the Hadoop distribution of your choosing, update the JSON specification file.
{
      "name": "Flume",
      "description": "These VMs contain Apache Flume. Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.",
      "roles": [
        "flume_client",
        "hadoop_client",
        "hbase_client"
      ],
      "instanceType": "[SMALL,MEDIUM,LARGE,EXTRA_LARGE]",
      "groupType": "flume",
      "instanceNum": "[0,1,max]",
      "cpuNum": "[1,1,64]",
      "memCapacityMB": "[4096,4096,max]",
      "storage": {
        "type": "[SHARED,LOCAL]",
        "sizeGB": "[100,20,max]"
      },
      "haFlag": "on"
}
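A single stray comma or unquoted value in the specification file will keep it from being parsed, so it is worth running the edited file through a JSON validator before restarting anything. Python’s built-in json.tool is already available on the management server (the path below is just a placeholder for wherever your spec file lives):
# python -m json.tool /path/to/your/spec.json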
  • Restart the tomcat service on the management server.
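On my management server the service is registered simply as tomcat, so the restart is a one-liner (adjust if your service name differs):
# service tomcat restart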

At this point you’d think that everything would be all set up and all that would be needed was to add a few Chef recipes for the new node. Sadly, that is not the case. I will be covering the rest of the configuration in Part 2.