Cluster service discovery is an integral part of running any distributed system, and it can be solved through a number of different methods. Until recently, a common approach relied on manual intervention by a developer or systems administrator. The larger the cluster or distributed application, the greater the lift became to configure every node with the proper IP and port information for the running service. At the scale of hundreds or thousands of nodes, that methodology simply does not work.

So how can a developer or systems engineer automate the process of service discovery in a manner that is not limited to a single application, but instead works for any distributed application? Here are a few options for solving service discovery:

  • Mesosphere offers a solution that runs HAProxy on top of the Mesos slaves, periodically checking which services are running and generating a new HAProxy configuration. They offer instructions on how to implement this method here.
  • Zookeeper can be used for cluster service discovery, though there are some who believe it isn’t a viable solution due to its eventually consistent state.
  • Eureka from Netflix is another option for service discovery.

Any of these can be a viable solution to the service discovery problem. The option I have been using, from an operations perspective, is accomplished through a Chef library. That library also happens to be included in VMware Big Data Extensions, and it was one of the early reasons I became so engrossed in BDE.

The library itself includes multiple methods that solve the service discovery problem in a unique way. Essentially, when a set of nodes connects to the Chef server and runs their respective recipes based on their roles, they register themselves with the Chef server. That registration identifies which cluster, or application deployment, the nodes are being deployed for. In addition to registering which cluster they are a part of, the nodes also register what role they are running. This allows the library to build out configuration files based on those roles using IP addresses, short hostnames, FQDNs, and port information.
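
To illustrate the idea, a role-and-cluster lookup along these lines can be sketched with Chef's search API. The query shape and the cluster_name attribute below are my assumptions for illustration; the actual cluster-service-discovery.rb library may implement the lookup differently.

    # Conceptual sketch only, written for a recipe context where the search DSL
    # and the node object are available. The cluster_name attribute and query
    # are assumptions; cluster-service-discovery.rb may work differently.
    def fqdns_for_role(role_name)
      matches = search(:node, "role:#{role_name} AND cluster_name:#{node[:cluster_name]}")
      matches.map { |match| match["fqdn"] }.sort
    end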

For example, when I was building out the Apache Cassandra option within my BDE environment, the cluster service discovery library helped me automate the configuration of the cassandra.yaml file. The configuration file needed to know which nodes were registered as the seeds within the cluster. I accomplished this by writing a Chef library that utilized the cluster-service-discovery.rb library. The code included the following:

    module Cassandra

      # True when this node is registered with the cassandra_seed role.
      def is_seed
        node.role?("cassandra_seed")
      end

      # FQDNs of every node registered with the cassandra_seed role in this cluster.
      def cassandra_seeds_ip
        servers = all_providers_fqdn_for_role("cassandra_seed")
        Chef::Log.info("Cassandra seed nodes in cluster #{node[:cluster_name]} are: #{servers.inspect}")
        servers
      end

      # FQDNs of every node registered with the cassandra_node role in this cluster.
      def cassandra_nodes_ip
        servers = all_providers_fqdn_for_role("cassandra_node")
        Chef::Log.info("Cassandra worker nodes in cluster #{node[:cluster_name]} are: #{servers.inspect}")
        servers
      end

      # Non-seed nodes block here until every seed has registered its service
      # with the Chef server, so configuration never proceeds with only a
      # portion of the seed nodes known.
      def wait_for_cassandra_seeds(in_ruby_block = true)
        return if is_seed

        run_in_ruby_block __method__, in_ruby_block do
          set_action(HadoopCluster::ACTION_WAIT_FOR_SERVICE, node[:cassandra][:seed_service_name])
          seed_count = all_nodes_count({"role" => "cassandra_seed"})
          all_providers_for_service(node[:cassandra][:seed_service_name], true, seed_count)
          clear_action
        end
      end

    end

    class Chef::Recipe; include Cassandra; end

Using the Chef library, I was able to access three methods: all_providers_fqdn_for_role, all_nodes_count, and all_providers_for_service. The first allowed me to gather the fully qualified domain name (FQDN) of each node registered with the seed role on the Chef server. The second returned an integer for the total number of seed nodes. The third allowed me to wait for that total number of nodes to register themselves with the Chef server before proceeding with the configuration. The last bit is key: you don't want to write a configuration file that contains only a portion of the nodes.
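
To show how those helpers fit together, here is a sketch of a recipe that waits for the seeds and then renders the Cassandra configuration through a template resource. The file path, template source, ownership, and variable names are illustrative assumptions on my part, not the exact contents of the BDE cookbook.

    # Sketch of a consuming recipe; paths and names are assumptions.
    wait_for_cassandra_seeds

    template "/etc/cassandra/conf/cassandra.yaml" do
      source "cassandra.yaml.erb"
      owner  "cassandra"
      group  "cassandra"
      mode   "0644"
      variables(
        :seeds        => cassandra_seeds_ip,
        :cluster_name => node[:cluster_name]
      )
    end

Inside the ERB template, the seed list can then be joined into the seeds entry that cassandra.yaml expects.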

The inclusion of the Chef library on the BDE management server also allows clusters deployed through it to be automatically updated when nodes are added to or removed from the deployed cluster. This helps with dynamic scaling: you can monitor for a trigger, scale the cluster up or down, and not have to worry about manually updating the configuration files for the application service. Ultimately, the goal is to have service discovery happen automatically.
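
If the Cassandra daemon is managed by a Chef service resource (an assumption for this sketch), one way to let those membership changes take effect is to have the service subscribe to the rendered configuration, so the chef-client run that rewrites the seed list also restarts the service:

    # Assumes a service resource named "cassandra" and the template shown above.
    service "cassandra" do
      supports :restart => true, :status => true
      action [:enable, :start]
      subscribes :restart, "template[/etc/cassandra/conf/cassandra.yaml]", :delayed
    end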

The question then becomes: how can you leverage a Chef library, running on a Chef server, to assist in service discovery for applications running within containers? The team at Chef has made tools available for managing containers, including Docker, with a Chef server. More information can be found here.

There is no magic bullet for solving service discovery within a distributed application. Depending on what you are trying to do, the existing mindset and skill set within the organization, and the preferred technologies, the problem can be solved with various methods. The Chef library described above is one method I am using today to assist in the automated deployment of distributed applications.

Enjoy.