Docker for Ansible + VMware NSX Automation

I am writing this as I sit and watch the annual viewing of The Hobbit and The Lord of the Rings trilogy over the Christmas holiday. The next couple of weeks should provide the time necessary to hopefully complete the Infrastructure-as-Code project I undertook last month. As part of that project, I spoke previously about how Ansible is being used to provide the automation layer for the deployment and configuration of the SDDC Kubernetes stack. As part of the bootstrapping effort, I have decided to create a Docker image with the necessary components to perform the initial virtual machine deployment and NSX configuration.

The Dockerfile for the Ubuntu-based Docker container is hosted both on Docker Hub and within the GitHub repository for the larger Infrastructure-as-Code project.

When the Docker container is launched, it includes the necessary components to interact with the VMware stack, including additional modules for VM folders, resource pools and VMware NSX.

To launch the container, I am running it with the following options to include the local copies of the Infrastructure-as-Code project.

$ docker run -it --name ansible -v /Users/cmutchler/github/vsphere-kubernetes/ansible/:/opt/ansible virtualelephant/ubuntu-ansible

The Docker container is a bit on the larger side, but it is designed to run locally on a laptop or desktop. The image includes the required Python and NSX bits so that the additional GitHub repositories that are cloned into the image will operate correctly. The OpenShift project includes additional modules for interacting with vSphere folders and resource pools, while the NSX modules from the VMware GitHub repository include the necessary bits for leveraging Ansible with NSX.
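I have not reproduced the published Dockerfile here, but an Ubuntu-based Ansible image of this sort generally looks something like the sketch below; the package list and layout are assumptions on my part, not the actual file:

```dockerfile
FROM ubuntu:16.04

# Ansible plus the Python libraries the VMware/NSX modules rely on;
# exact package names are assumptions, not the published Dockerfile.
RUN apt-get update && \
    apt-get install -y ansible python-pip git && \
    pip install pyvmomi nsxramlclient

# The extra vSphere/NSX modules and the NSX RAML spec would be cloned
# into the image at this point (repository URLs omitted here).

WORKDIR /opt/ansible
ENTRYPOINT ["/bin/bash"]
```

Mounting the project directory at /opt/ansible, as shown in the docker run command below, is what lets the container pick up the local copies of the playbooks.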

Once running, the Docker container is then able to bootstrap the deployment of the Infrastructure-as-Code project using the Ansible playbooks I’ve published on GitHub. Enjoy!

VCDX Quick Hit – Monitoring and Alerting

This is the first post in what I plan to be a sporadic, yet on-going series highlighting certain aspects of a VCDX / Architect skillset. These VCDX Quick Hits will cover a range of topics and key in on certain aspects of the VCDX blueprint. It is my hope they will trigger some level of critical thinking on the reader’s part and help them improve their skillset.

The idea for this post came after listening to a post-mortem call for a recent incident that occurred at work. The incident itself was a lower priority Severity 2 incident, meaning it only impacted a small subset of customers in a small failure domain (a single vCenter Server). As architects, we know monitoring is a key component of any architecture design — whether it is intended for a VCDX submission or not.

In IT Architect: Foundation in the Art of Infrastructure Design (Amazon link), the authors state:

“A good monitoring solution will identify key metrics of both the physical and virtual infrastructure across all key resources: compute, storage, and networking.”

The post-mortem call got me thinking about maturity within our monitoring solutions and improving our architecture designs by striving to understand the components better earlier in the design and pilot phases.

It is common practice to identify the key components and services of an architecture we designed, or are responsible for, and to outline which of them are critical to supporting the service offering. When I wrote the VMware Integrated OpenStack design documentation, which later became the basis for my VCDX defense, I identified specific OpenStack services which needed to be monitored. The following screen capture shows how I captured the services within the documentation.

As you can see from the above graphic, I identified each service definition with a unique ID, documented the component/service, documented where the service should be running, and a brief description of the component/service. The information was used to create the Sprint story for the monitoring team to create the alert definitions within the monitoring solution.

All good, right?

The short answer is, not really. What I provided in my design was adequate for an early service offering, but left room for further maturity. Going back to the post-mortem call, this is where additional maturity in the architecture design would have helped reduce the MTTR of the incident.

During the incident, two processes running on a single appliance were being monitored to determine if they were running. Just like my VMware Integrated OpenStack design, these services had been identified and were being monitored per the architecture specification. However, what was not documented was the dependency between the two processes. In this case, process B was dependent on process A and although process A was running, it was not properly responding to the queries from process B. As a result, the monitoring system believed everything was running correctly — it was, from an alert definition perspective — and the incident was not discovered immediately. Once process A was restarted, it began responding to the queries from process B and service was restored.

So what could have been done?

First, the architecture design could have specified an alert definition for the key services (or processes) that went beyond just measuring whether the service is running.

Second, the architecture design could have captured the inter-dependencies between these two processes and specified a more detailed alert definition. In this case, a log entry was written each time process A did not correctly respond to process B. Having an alert definition for this entry in the logs would have allowed the monitoring system to generate an alert.
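As a sketch of that idea, a log-based alert definition of this kind boils down to a pattern match over the service log. The log path and message text below are made up for illustration, not taken from the actual incident:

```shell
# Simulated service log -- the path and message are hypothetical.
LOG=/tmp/processB.log
PATTERN='no response from process A'
printf 'startup ok\nno response from process A\n' > "$LOG"

# The alert definition: fire whenever the dependency failure appears.
if grep -q "$PATTERN" "$LOG"; then
  echo "ALERT: process A is not answering process B"
fi
```

A real monitoring system expresses this as an alert rule rather than a shell script, but the logic is the same: alert on the symptom of the broken dependency, not just on process liveness.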

Third, the architecture design could have used canary testing as a way to provide a mature monitoring solution. It may be necessary to clarify what I mean when I use the term canary testing.

“Well into the 20th century, coal miners brought canaries into coal mines as an early-warning signal for toxic gases, primarily carbon monoxide. The birds, being more sensitive, would become sick before the miners, who would then have a chance to escape or put on protective respirators.” (Wikipedia link)

Canary testing would then imply a method of checking the service for issues prior to a customer discovering them. Canary testing should include common platform operations a customer would typically do — this can also be thought of as end-to-end testing.

For example, a VMware Integrated OpenStack service offering with NSX would need to ensure that both the NSX Manager is online, but also that the OpenStack Neutron service is able to communicate to it. A good test could be to make an OpenStack Neutron API call to deploy a NSX Edge Service Gateway, or create a new tenant network (NSX logical switch).
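A minimal harness for such checks might look like the sketch below. The probe commands here are stand-ins; in the VIO example they would be real OpenStack CLI calls, such as creating and then deleting a tenant network:

```shell
# Run an end-to-end probe and report pass/fail -- a trivial sketch of
# a canary harness, not a production monitor.
canary_check() {
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "CANARY PASS: $name"
  else
    echo "CANARY FAIL: $name"
    return 1
  fi
}

# Stand-in probes; a real one might be something like:
#   canary_check "neutron-net" openstack network create canary-net
canary_check "always-up" true
canary_check "always-down" false || true
```

Scheduling a handful of these probes and alerting on any FAIL line would have surfaced the process A/process B incident before a customer did.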

There are likely numerous ways a customer will interact with your service offering, and defining these additional tests within the architecture design itself is something I challenge you to consider.

Infrastructure-as-Code: Ansible for VMware NSX

As the project moves into the next phase, Ansible is beginning to be relied upon for the deployment of the individual components that will define the environment. This installment of the series is going to cover the use of Ansible with VMware NSX. VMware has provided a set of Ansible modules for integrating with NSX on GitHub. The modules easily allow the creation of NSX Logical Switches, NSX Distributed Logical Routers, NSX Edge Services Gateways (ESG) and many other components.

The GitHub repository can be found here.

Step 1: Installing Ansible NSX Modules

In order to support the Ansible NSX modules, it was necessary to install several supporting packages on the Ubuntu Ansible Control Server (ACS).

$ sudo apt-get install python-dev libxml2 libxml2-dev libxslt1-dev zlib1g-dev npm
$ sudo pip install nsxramlclient
$ sudo npm install -g
$ sudo npm install -g
$ sudo npm install -g raml-fleece

In addition to the Ansible NSX modules, the ACS server will also require the NSX for vSphere RAML repository. The RAML specification describes the NSX for vSphere API. The repo will need to be cloned to a local directory on the ACS before execution of an Ansible playbook will work.

Now that all of the prerequisites are met, the Ansible playbook for creating the NSX components can be written.

Step 2: Ansible Playbook for NSX

The first thing to know is that the GitHub repo for the NSX modules includes many great examples within the test_*.yml files, which were leveraged to create the playbook below. To understand what the Ansible playbook has been written to create, let’s first review the logical network design for the Infrastructure-as-Code project.


The design calls for three layers of NSX virtual networking to exist — the NSX ECMP Edges, the Distributed Logical Router (DLR) and the Edge Services Gateway (ESG) for the tenant. The Ansible Playbook below assumes the ECMP Edges and DLR already exist. The playbook will focus on creating the HA Edge for the tenant and configuring the component services (SNAT/DNAT, DHCP, routing).

The GitHub repository for the NSX Ansible modules provides many great code examples. The playbook that I’ve written to create the k8s_internal logical switch and the NSX HA Edge (aka ESG) took much of the content provided and collapsed it into a single playbook. The NSX playbook I’ve written can be found in the Virtual Elephant GitHub repository for the Infrastructure-as-Code project.

As I’ve stated, this project is mostly about providing me a detailed game plan for learning several new (to me) technologies, including Ansible. The NSX playbook is the first time I’ve used an answer file to keep several of the sensitive variables needed for my environment out of the playbook itself. The nsxanswer.yml file includes the variables required for connecting to the NSX Manager, which is the component Ansible will be communicating with to create the logical switch and ESG.

Ansible Answer File: nsxanswer.yml (link)

nsxmanager_spec:
  raml_file: '/HOMEDIR/nsxraml/nsxvapi.raml'
  host: 'usa1-2-nsxv'
  user: 'admin'
  password: 'PASSWORD'

The nsxvapi.raml file is the API specification file that we cloned in step 1 from the GitHub repository. The path should be modified for your local environment, as should the password: variable line for the NSX Manager.
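Since the answer file holds the NSX Manager password in clear text, it is worth at minimum restricting its permissions to the owner (ansible-vault is the fuller solution). The touch below simply stands in for the real answer file:

```shell
# Stand-in for the real answer file; restrict it to owner read/write.
touch nsxanswer.yml
chmod 600 nsxanswer.yml
stat -c '%a %n' nsxanswer.yml
```

The stat call prints "600 nsxanswer.yml", confirming only the owner can read the credentials.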

Ansible Playbook: nsx.yml (link)

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
    - nsxanswer.yml
  vars_prompt:
  - name: "vcenter_pass"
    prompt: "Enter vCenter password"
    private: yes
  vars:
    vcenter: "usa1-2-vcenter"
    datacenter: "Lab-Datacenter"
    datastore: "vsanDatastore"
    cluster: "Cluster01"
    vcenter_user: "administrator@vsphere.local"
    switch_name: "{{ switch }}"
    uplink_pg: "{{ uplink }}"
    ext_ip: "{{ vip }}"
    tz: "tzone"
  tasks:
  - name: NSX Logical Switch creation
    nsx_logical_switch:
      nsxmanager_spec: "{{ nsxmanager_spec }}"
      state: present
      transportzone: "{{ tz }}"
      name: "{{ switch_name }}"
      controlplanemode: "UNICAST_MODE"
      description: "Kubernetes Infra-as-Code Tenant Logical Switch"
    register: create_logical_switch

  - name: Gather MOID for datastore for ESG creation
    vcenter_gather_moids:
      hostname: "{{ vcenter }}"
      username: "{{ vcenter_user }}"
      password: "{{ vcenter_pass }}"
      datacenter_name: "{{ datacenter }}"
      datastore_name: "{{ datastore }}"
      validate_certs: False
    register: gather_moids_ds
    tags: esg_create

  - name: Gather MOID for cluster for ESG creation
    vcenter_gather_moids:
      hostname: "{{ vcenter }}"
      username: "{{ vcenter_user }}"
      password: "{{ vcenter_pass }}"
      datacenter_name: "{{ datacenter }}"
      cluster_name: "{{ cluster }}"
      validate_certs: False
    register: gather_moids_cl
    tags: esg_create

  - name: Gather MOID for uplink
    vcenter_gather_moids:
      hostname: "{{ vcenter }}"
      username: "{{ vcenter_user }}"
      password: "{{ vcenter_pass }}"
      datacenter_name: "{{ datacenter }}"
      portgroup_name: "{{ uplink_pg }}"
      validate_certs: False
    register: gather_moids_upl_pg
    tags: esg_create

  - name: NSX Edge creation
    nsx_edge_router:
      nsxmanager_spec: "{{ nsxmanager_spec }}"
      state: present
      name: "{{ switch_name }}-edge"
      description: "Kubernetes Infra-as-Code Tenant Edge"
      resourcepool_moid: "{{ gather_moids_cl.object_id }}"
      datastore_moid: "{{ gather_moids_ds.object_id }}"
      datacenter_moid: "{{ gather_moids_cl.datacenter_moid }}"
      interfaces:
        vnic0: {ip: "{{ ext_ip }}", prefix_len: 26, portgroup_id: "{{ gather_moids_upl_pg.object_id }}", name: 'uplink0', iftype: 'uplink', fence_param: 'ethernet0.filter1.param1=1'}
        vnic1: {ip: '', prefix_len: 20, portgroup_id: "{{ switch_name }}", name: 'int0', iftype: 'internal', fence_param: 'ethernet0.filter1.param1=1'}
      default_gateway: "{{ gateway }}"
      remote_access: 'true'
      username: 'admin'
      password: "{{ nsx_admin_pass }}"
      firewall: 'false'
      ha_enabled: 'true'
    register: create_esg
    tags: esg_create

The playbook expects to be provided three extra variables from the CLI when it is executed — switch, uplink and vip. The switch variable defines the name of the logical switch, the uplink variable defines the uplink VXLAN portgroup the tenant ESG will connect to, and the vip variable is the external VIP to be assigned from the network block. At the time of this writing, these sorts of variables continue to be command-line based, but will likely be moved to a single Ansible answer file as the project matures. Having a single answer file for the entire set of playbooks should simplify the adoption of the Infrastructure-as-Code project into other vSphere environments.
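For reference, the invocation looks something like the line below. The variable values are placeholders for my lab, and the command is echoed rather than executed so the example is self-contained:

```shell
# Placeholder values -- substitute your own logical switch name,
# uplink portgroup and external VIP.
switch=k8s_internal
uplink=uplink-pg
vip=192.168.100.10
echo ansible-playbook nsx.yml \
  --extra-vars "switch=$switch uplink=$uplink vip=$vip"
```

Drop the echo to actually run the playbook against your environment.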

Now that Ansible playbooks exist for creating the NSX components and the VMs for the Kubernetes cluster, the next step will be to begin configuring the software within CoreOS to run Kubernetes.

Stay tuned.

Infrastructure-as-Code: Getting started with Ansible

The series so far has covered the high level design of the project, how to bootstrap CoreOS and understanding how Ignition works to configure a CoreOS node. The next stage of the project will begin to leverage Ansible to fully automate and orchestrate the instantiation of the environment. Ansible will initially be used to deploy the blank VMs and gather the IP addresses and FQDNs of each node created.

Ansible is one of the new technologies that I am using the Infrastructure-as-Code project to learn. My familiarity with Chef was helpful, but I still wanted to get a good primer on Ansible before proceeding. Fortunately, Pluralsight is a great training tool and the Hands-on Ansible course by Aaron Paxon was just the thing to start with. Once I worked through the video series, I dived right into writing the Ansible playbook to deploy the virtual machines for CoreOS to install. I quickly learned there were a few extras I needed on my Ansible control server before it would all function properly.

Step 1: Configure Ansible Control Server

As I stated before, I have deployed an Ubuntu Server 17.10 node within the environment where tftpd-hpa is running for the CoreOS PXEBOOT system. The node is also being leveraged as the Ansible control server (ACS). The ACS node required a few additional packages to be present on the system in order for Ansible to be the latest version and include the VMware modules needed.

Out of the box, the Ubuntu repositories only include Ansible v2.3.1.0 — which is not from the latest 2.4 branch.

There are several VMware module updates in Ansible 2.4 that I wanted to leverage, so I needed to first update Ansible on the Ubuntu ACS.

$ sudo apt-add-repository ppa:ansible/ansible
$ sudo apt-get update
$ sudo apt-get upgrade

If you have not yet installed Ansible on the local system, run the following command:

$ sudo apt-get install ansible

If you need to upgrade Ansible from the Ubuntu package to the new PPA repository package, run the following command:

$ sudo apt-get upgrade ansible

Now the Ubuntu ACS is running Ansible v2.4.1.0.

In addition to just having Ansible and Python installed, there are additional Python pieces we need in order for all of the VMware Ansible modules to work correctly.

$ sudo apt-get install python-pip
$ sudo pip install --upgrade pyvmomi
$ sudo pip install pysphere
$ sudo pip list | grep pyvmomi

Note: Make sure pyvmomi is running a 6.5.x version to have all the latest code.
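That check can be scripted as follows (assuming pip is on the PATH; if it is not, or pyvmomi is absent, the script reports it as missing):

```shell
# Report whether pyvmomi is installed and on the 6.5.x train.
ver=$(pip show pyvmomi 2>/dev/null | awk '/^Version:/ {print $2}')
case "$ver" in
  6.5.*) msg="pyvmomi OK ($ver)" ;;
  *)     msg="pyvmomi missing or too old (${ver:-none})" ;;
esac
echo "$msg"
```

A check like this is handy at the top of a bootstrap script so the VMware modules fail fast with a clear message instead of a cryptic import error.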

The final piece I needed to configure was to include an additional Ansible module to allow for new VM folders to be created. There is a third-party module, called vmware_folder, which includes the needed functionality. After cloning the openshift-ansible-contrib repo, I copied the following file into the ACS directory /usr/lib/python2.7/dist-packages/ansible/modules/cloud/vmware.

The file can be found on GitHub at the following link.

The Ubuntu ACS node now possesses all of the necessary pieces to get started with the environment deployment.

Step 2: Ansible Playbook for deployment

The starting point for the project is to write the Ansible playbook that will deploy the virtual machines and power them on — thus allowing the PXEBOOT system to download and install CoreOS onto each node. Ansible has several VMware modules that will be leveraged as the project progresses.

The Infrastructure-as-Code project source code is hosted on GitHub and is available for download and use. The project is currently under development and is being written in stages. By the end of the series, the entire instantiation of the environment will be fully automated. As the series progresses, the playbooks will get built out and become more complete.

The main.yml Ansible playbook currently includes two tasks — one for creating the VM folder and a second for deployment of the VMs. It uses a blank VM template that already exists on the vCenter Server.

When the playbook is run from the ACS, it will deploy a dynamic number of nodes, create a new VM folder and allow the user to specify a VM-name prefix.

When the deployment is complete, the VMs will be powered on and booting CoreOS. Depending on the download speeds in the environment, the over/under for the CoreOS nodes to be fully online is roughly 10 minutes right now.

The environment is now deployed and ready for Kubernetes! Next week, the series will focus on using Ansible for installing and configuring Kubernetes on the nodes post-deployment. As always, feel free to reach out to me over Twitter if you have questions or comments.

[Introduction] [Part 1 – Bootstrap CoreOS with Ignition] [Part 2 – Understanding CoreOS Ignition] [Part 3 – Getting started with Ansible]

Infrastructure-as-Code: Understanding CoreOS Ignition

The previous post introduced the Ignition file that is being used to configure the CoreOS nodes that will eventually be used for running Kubernetes. The Ignition file is a JSON-formatted flat-file that needs to include certain information and is particularly sensitive when improperly written. In an effort to help users of Ignition, the CoreOS team has provided a Config Validator and a Config Transpiler binary for taking a YAML coreos-cloudinit file and converting it into the JSON format.

This post will review how to use the Config Transpiler to generate a valid JSON file for use by Ignition. After demonstrating its use, I will cover the stateful-config.ign Ignition file being used to configure the CoreOS nodes within the environment.

Step 1: CoreOS Config Transpiler

The CoreOS Config Transpiler is delivered as a binary that can be downloaded to a local system and used to generate a working JSON file for Ignition. After downloading the binary to my Mac OS laptop, I began by writing one section at a time for the stateful-config.ign file and then running it through the Config Validator to verify it had correct syntax. Generally, when working on a project of this magnitude, I will write small pieces of code and test them before moving on to the next part. This helps me when there are issues, as the Config Validator is not the most verbose tool when there is a misconfiguration. Building small blocks of code allows me to build the larger picture slowly and have confidence in the parts that are working.

One piece, which will be covered in greater detail later in the post, was to install Python on CoreOS. For that portion, I decided to have Ignition write a script file to the local filesystem when it boots. To accomplish this, I built the following YAML file:

    - path: /home/deploy/
      filesystem: root
      mode: 0644
      contents:
        inline: |
          sudo mkdir -p /opt/bin
          cd /opt
          sudo wget
          sudo tar -zxf ActivePython-
          sudo mv ActivePython- apy
          sudo /opt/apy/ -I /opt/python
          sudo ln -sf /opt/python/bin/easy_install /opt/bin/easy_install
          sudo ln -sf /opt/python/bin/pip /opt/bin/pip
          sudo ln -sf /opt/python/bin/python /opt/bin/python
          sudo ln -sf /opt/python/bin/python /opt/bin/python2
          sudo ln -sf /opt/python/bin/virtualenv /opt/bin/virtualenv
          sudo rm -rf /opt/ActivePython-

Once the YAML file was written, I used the CoreOS Config Transpiler to generate the JSON output. The screenshot below shows how to run the binary to produce the JSON output, which is written to the terminal.

From there, you can copy the entire output into an Ignition JSON file, or copy-and-paste just the bits that are needed to be added to an existing Ignition JSON file.

You’ve likely noticed there are lots of special characters in the JSON output that are necessary to write the script that will install Python, as described by the YAML file. In addition to that, the output is also one big blob of text — it does not have whitespace formatting, so you’ll need to decide how you want to format your own Ignition file. I personally prefer to take the time to properly format it in a reader-friendly way, as can be seen in the stateful-config.ign file.
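Any JSON pretty-printer will do for that formatting pass. For example, piping the blob through Python's json.tool both validates the JSON and indents it (the snippet below is a tiny stand-in, not my real config):

```shell
# Validate and pretty-print a minimal Ignition blob; json.tool exits
# non-zero if the JSON is malformed.
echo '{"ignition":{"version":"2.1.0"}}' | python3 -m json.tool
```

This prints the same document re-indented with one key per line, which is a quick way to turn the transpiler's single-line output into something reader-friendly.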

Step 2: Understanding the PXEBOOT CoreOS Ignition File

pxeboot-config.ign (S3 download link)

The Ignition file can include a great number of configuration items within it. The Ignition specification includes sections for networking, storage, filesystems, systemd drop-ins and users. The pxeboot-config.ign Ignition file is much smaller compared to the one used when the stateful installation of CoreOS is performed. There is one section I want to highlight independently since it is crucial for it to be in place before the installation can begin.


The storage section includes a portion where fdisk is used to create a partition table on the local disk within the CoreOS virtual machine. The code included in this file will work regardless of what size disk is attached to the virtual machine. Right now I am creating a 50GB disk on my vSAN datastore; however, if I change the VM specification later to be larger or smaller, this bit of code will continue to work without modification.

The final part of the storage section then formats the partition using ext4 as the filesystem format. Ignition supports other filesystem types, such as xfs, if you choose to use a different format.

Step 3: Understanding the Stateful CoreOS Ignition File

stateful-config.ign (S3 download link)

Now we will go through each section of code included in the stateful-config.ign file I am using when the stateful installation of CoreOS is performed on one of the deployed nodes. At a minimum, an Ignition file should include at least one user, with an associated SSH key to allow for remote logins to be successful.

There are many examples available from the CoreOS site itself and these were used as reference points when I was building this Ignition file.

Now I will go through each section and describe what actions will be performed when the file is run.

Lines 1-5 define the Ignition version that is to be used — like an OpenStack Heat template, the version will unlock certain features contained in the specification.

The storage section of the Ignition file is where local files can be created (shell scripts, flat files, etc) and where storage devices are formatted. Lines 7-17 define the first file that needs to be created on the local filesystem. The file itself — /etc/motd — is a simple flat file that I wanted to write so that I would know the stateful installation had been performed on the local node. The contents section requires special formatting and this is where the Config Transpiler is helpful. As shown above, a YAML file can be created and the Config Transpiler used to convert it into the correctly formatted JSON code. The YAML file snippet looked like:

  - path: /etc/motd
    filesystem: root
    mode: 0644
    contents:
      inline: |
        Stateful CoreOS Installation.
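For comparison, the transpiler renders that snippet roughly as follows: the inline contents become a percent-encoded data URL and the octal mode 0644 becomes decimal 420. This is a hand-written approximation, not captured transpiler output:

```json
{
  "filesystem": "root",
  "path": "/etc/motd",
  "mode": 420,
  "contents": {
    "source": "data:,Stateful%20CoreOS%20Installation.%0A"
  }
}
```

The encoding is what makes hand-editing these sections error-prone and why writing the YAML first is the safer workflow.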

Lines 18-28 create the /home/deploy/ shell script that will be used later to actually perform the installation. Remember, the storage section in the Ignition file is not executing any files; it is merely creating them.

Lines 29-41 are now defining another shell script, /home/deploy/, that will be used to assign the FQDN as the hostname of the CoreOS node. This is an important piece since each node will be receiving a DHCP address and as we get further into the automation/orchestration with Ansible, it will be necessary to know exactly which FQDNs exist within the environment.

Line 41 closes off the storage section of the Ignition file. The next section is for systemd units and drop-ins.

Line 42 tells Ignition we are now going to be providing definitions we expect systemd to use during the boot process. This is where Ignition shows some of its robustness — it allows us to create systemd units early enough in the boot process to affect how the system will run when it is brought online fully.

Lines 44-48 define the first systemd unit. Using the /home/deploy/ shell script that was defined in the storage section, the Ignition file creates the /etc/systemd/system/set-hostname.service file that will be run during the boot process. The formatting of the contents section here is less severe than the contents section inside a files unit (above). Here we can simply type the characters, including spaces and use the familiar ‘\n’ syntax for newlines.

As you can see the unit above creates the /etc/systemd/system/set-hostname.service file with the following contents:

[Unit]
Description=Use FQDN to set hostname.

[Service]
ExecStartPre=/usr/bin/chmod 755 /home/deploy/
ExecStartPre=/usr/bin/chown deploy:deploy /home/deploy/
Lines 49-53 take the Python installation script Ignition created and create a systemd unit for it as well. I confess that this may not be the ideal method for installing Python, but it works.

The /etc/systemd/system/env-python.service file is created with the following contents:

[Unit]
Description=Install Python for Ansible.

[Service]
ExecStartPre=/usr/bin/chmod 755 /home/deploy/
ExecStartPre=/usr/bin/chown deploy:deploy /home/deploy/


There is a systemd caveat I want to go over that was instrumental in being able to deliver a functional Ignition file. As I worked through setting the hostname — which should be a relatively easy task — I ran into all sorts of issues. After adding debugging messages to the shell script, I was able to determine the systemd unit was being run before the network was fully online — resulting in the script’s inability to successfully query a DNS server to resolve the FQDN. After reading through more blog posts and GitHub pages, I came across the syntax for making sure my systemd services were not executed until after the network was fully online.

The two key lines here are:

This instructs systemd to not execute this unit until after the network is confirmed to be online. There is another systemd target,, but it does not guarantee the network is actually fully online. Instead the unit is released after the interface is configured, not necessarily after all of the networking components are fully operational. Using the unit ensured the two shell scripts I needed systemd to execute were able to leverage the functioning network.

Lines 54-59 define the last systemd unit in my Ignition file, which tells CoreOS to start the etcd2 service. The configuration of etcd2 will be performed by Ansible and covered in a later post.


The final portion of the Ignition file defines the users the CoreOS system should have when it is fully configured. In the file I have configured a single user, deploy, and assigned an SSH key that can be used to log into the CoreOS node. The code also defines the user to be part of the sudo and the docker groups, which are predefined in the operating system.

Feel free to reach out over Twitter if you have any questions or comments.

[Introduction] [Part 1 – Bootstrap CoreOS with Ignition] [Part 2 – Understanding CoreOS Ignition] [Part 3 – Getting started with Ansible]