Apache Flume is an open-source project that helps Hadoop users ingest data into HDFS. It provides a reliable mechanism for sending raw data into HDFS for use by the Hadoop cluster. New users of Hadoop often ask, “How can I get my data into HDFS so I can begin taking advantage of it?” Including a Flume node within the Hadoop cluster is a good first step toward answering that question.
There is a white paper, written by VMware, that describes how to include an Apache Flume node within a BDE-deployed Hadoop cluster. The steps and use case described in the white paper are adequate for deploying a node that can be made available to the cluster. However, as I began thinking about how to offer this as part of a Hadoop-as-a-Service offering, I realized that the Flume node needed to be deployed through BDE at the time of cluster deployment, not afterwards. I certainly did not want to go through the many manual steps to configure Flume when all of that information is already available to BDE at the time of the cluster deployment.