Ansible NSX module for creating NAT rules

After working on the code last weekend and testing the functionality this past week, I am proud to announce that it is now possible to create NAT rules on an NSX Edge through Ansible! The ability to create SNAT and DNAT rules on the NSX Edge was a necessity for the Infrastructure-as-Code project, as each environment deployed uses its own micro-segmented network. I am doing this so that each environment can be stood up multiple times within the same vSphere environment while remaining entirely self-contained.

The current Ansible module allows for the creation of both SNAT and DNAT rules. I used the NSX API Guide to determine which variables can be passed for either type of NAT rule and included each one in the module, so there are no features missing from it today.

The module can be downloaded from the virtualelephant/nsxansible repo on GitHub.

To use the module, I have created an example Ansible playbook (also available on GitHub):

test_edge_nat.yml

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
    - nsxanswer.yml

  tasks:
  - name: Create SSH DNAT rule
    nsx_edge_nat:
      nsxmanager_spec: '{{ nsxmanager_spec }}'
      mode: 'create'
      name: '{{ edge_name }}'
      rule_type: 'dnat'
      vnic: '0'
      protocol: 'tcp'
      originalAddress: '10.0.0.1'
      originalPort: '22'
      translatedAddress: '192.168.0.2'
      translatedPort: '22'

  - name: Create default outbound SNAT rule
    nsx_edge_nat:
      nsxmanager_spec: '{{ nsxmanager_spec }}'
      mode: 'create'
      name: '{{ edge_name }}'
      rule_type: 'snat'
      vnic: '0'
      protocol: 'any'
      originalAddress: '192.168.0.0/20'
      originalPort: 'any'
      translatedAddress: '10.0.0.1'
      translatedPort: 'any'
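
The nsxanswer.yml vars file referenced above supplies the nsxmanager_spec dictionary that each of the nsxansible modules consumes. For reference, a minimal example, following the layout used in the upstream nsxansible documentation, would look something like the following; the host, credentials, and RAML file path are placeholders for your own environment:

nsxmanager_spec:
  raml_file: '/opt/nsxraml/nsx_vsphere_api_spec.raml'
  host: 'nsxmanager.example.com'
  user: 'admin'
  password: 'VMware1!'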

Update Jan 30th:

The module was updated a few days ago to include the ability to delete a NAT rule from an Edge. With that functionality, a playbook containing the following information will delete an individual rule.

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
    - nsxanswer.yml
    - envanswer.yml

  tasks:
  - name: Delete HTTP NAT rule
    nsx_edge_nat:
      nsxmanager_spec: '{{ nsxmanager_spec }}'
      mode: 'delete'
      name: '{{ edge_name }}'
      ruleId: '196622'
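
The ruleId value is assigned by NSX Manager when the rule is created, so you will typically need to look it up before deleting a rule. One way to do that is with the built-in Ansible uri module against the NSX Manager REST API; the sketch below assumes the NSX-v /api/4.0/edges/{edgeId}/nat/config endpoint, basic authentication, and a hypothetical edge_id variable holding the Edge identifier (for example 'edge-1'), not its display name:

  - name: Retrieve the Edge NAT configuration to locate a ruleId
    uri:
      url: "https://{{ nsxmanager_spec.host }}/api/4.0/edges/{{ edge_id }}/nat/config"
      method: GET
      user: "{{ nsxmanager_spec.user }}"
      password: "{{ nsxmanager_spec.password }}"
      force_basic_auth: yes
      validate_certs: no
      return_content: yes
    register: nat_config

  - name: Display the returned NAT rules (XML) to find the ruleId
    debug:
      var: nat_config.content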

Let me know if you are using the NSX Ansible modules and what other functionality you would like to see added.

Enjoy!

Infrastructure-as-Code: Project Update

The Infrastructure-as-Code project is progressing rather well. When I set out on the project in November of 2017, I wanted to use it as a means to learn several new technologies: Ansible, Python, CoreOS and Kubernetes. The initial stages focused on understanding how CoreOS works and how to automate the installation of a CoreOS VM within a vSphere environment. Once that was complete, I moved on to automating the initial deployment of the environment and its supporting components, which is where the bulk of my time has been spent over the past several weeks.

As previous posts have shown, Ansible is a powerful automation framework within a vSphere environment. The challenge has been leveraging the existing, publicly available modules to perform all of the actions required to completely automate the deployment. The Ansible NSX modules available on GitHub are a good starting point, but they lack some of the functionality I need.

That missing functionality led me to fork the project into my own repo and, shortly after adding the necessary DHCP functionality, submit my very first pull request on GitHub.

The process of adding the desired functionality has become a bit of a rabbit hole. Even so, I am enjoying working through all of the tasks and seeing the pieces begin to come together.

Thus far, I have the following working through Ansible:

  • NSX logical switch and NSX edge deployment.
  • DHCP and firewall policy configuration on NSX edge.
  • Ubuntu SSH bastion host deployment and configuration.
  • Ubuntu DNS hosts deployment and configuration.
  • Ubuntu-based Kubernetes master nodes deployment (static number).
  • CoreOS-based Kubernetes minion nodes deployment (dynamic number).

In addition to the Ansible playbooks I have written to automate the project, creating a Docker image to act as the Ansible control server, with all of the required third-party modules included, has really helped streamline the project and should make it something I can 'release' for others to use to duplicate my efforts.

The remaining work before the project is complete:

  • Add DNAT/SNAT configuration functionality to Ansible for NSX edge (testing in progress).
  • Update CoreOS nodes to use logical switch DVS port group.
  • Kubernetes configuration through Ansible.

I’ve really enjoyed all the challenges and new technologies (to me) the project has allowed me to learn. I am also looking forward to being able to contribute back to the Ansible community with additional capabilities for NSX modules.

 

Enhanced NSX Modules for Ansible

The published NSX modules from VMware lack certain functionality that I’ve needed as I worked on the Infrastructure-as-Code project over the holiday break. A few of the things I need to be able to do include:

  • Enable Edge firewall and add/delete rules
  • Enable DHCP and add IP pools
  • Search for DVS VXLAN port group
  • Associate vNIC with default gateway
  • Create Edge SNAT/DNAT rules

As I investigated methods for accomplishing these tasks, I found another VMware repository of Python scripts that had some of the functionality. The library is designed as a command-line tool, but I was able to take several of the code blocks and modify them for use within an Ansible playbook. In order to track the changes that I’ve made to the Ansible modules, I’ve forked the vmware/nsxansible repo into virtualelephant/nsxansible on Github. After a bit of work over the holiday, I’ve managed to add functionality for all but the SNAT/DNAT rule creation.

In addition to writing the Python modules, I have modified the ubuntu-ansible Docker container I spoke about previously to include my fork of the vmware/nsxansible modules.

Creating DHCP Pools

The nsx_edge_dhcp.py module allows an Ansible playbook to create a new IP pool and enable the DHCP service on a previously deployed NSX Edge. The module currently supports all of the basic IP pool options, as seen in the following image:

The playbook can contain the following code:

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
    - nsxanswer.yml
    - envanswer.yml

  tasks:
  - name: Create DHCP pool on NSX Edge
    nsx_edge_dhcp:
      nsxmanager_spec: "{{ nsxmanager_spec }}"
      name: '{{ edge_name }}'
      mode: 'create_pool'
      ip_range: '{{ ip_range }}'
      subnet: '{{ netmask }}'
      default_gateway: '{{ gateway }}'
      domain_name: '{{ domain }}'
      dns_server_1: '{{ dns1_ip }}'
      dns_server_2: '{{ dns2_ip }}'
      lease_time: '{{ lease_time }}'
      next_server: '{{ tftp_server }}'
      bootfile: '{{ bootfile }}'
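
For completeness, the envanswer.yml vars file referenced above carries the environment-specific values consumed by the task. The example below is purely illustrative; every value is a placeholder for your own environment:

edge_name: 'k8s-edge-01'
ip_range: '192.168.0.50-192.168.0.250'
netmask: '255.255.240.0'
gateway: '192.168.0.1'
domain: 'lab.example.com'
dns1_ip: '192.168.0.10'
dns2_ip: '192.168.0.11'
lease_time: '86400'
tftp_server: '192.168.0.20'
bootfile: 'pxelinux.0'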

A future enhancement to the module will allow the DHCP Options variables to be updated as well. This is key for the project, since the DHCP scope needs to point to the TFTP server that CoreOS is installed from.

Update: The ability to add a TFTP next-server and specify the boot filename for download has been added and is contained in the virtualelephant/nsxansible repo on GitHub.

Edge Firewall Rules

Fortunately, another author on GitHub had already written this module and even submitted a pull request to have it included in the vmware/nsxansible repo last year. Since that pull request had yet to be merged, I forked the module into my own repo for use.

The nsx_edge_firewall.py module allows you to modify the default rule and create new rules on an NSX Edge device.

The Ansible playbook contains the following to create the default firewall policy:

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
    - nsxanswer.yml
    - envanswer.yml

  tasks:
  - name: Set default firewall rule policy
    nsx_edge_firewall:
      nsxmanager_spec: "{{ nsxmanager_spec }}"
      mode: 'set_default_action'
      edge_name: '{{ edge_name }}'
      default_action: 'accept'

Specify vNIC for Default Gateway

The original nsx_edge_router.py module included code to create the default gateway; however, it did not allow you to modify the MTU or specify which vNIC should be associated with the default gateway. The forked nsx_edge_router.py version in the VirtualElephant GitHub repo includes the necessary code to specify both of those options.

def config_def_gw(client_session, esg_id, dfgw, vnic, mtu, dfgw_adminDistance):
    # Fall back to the standard MTU if none was supplied
    if not mtu:
        mtu = '1500'
    # Read the current static routing configuration for the Edge
    rtg_cfg = client_session.read('routingConfigStatic', uri_parameters={'edgeId': esg_id})['body']
    if dfgw:
        try:
            rtg_cfg['staticRouting']['defaultRoute'] = {'gatewayAddress': dfgw, 'vnic': vnic, 'mtu': mtu, 'adminDistance': dfgw_adminDistance}
        except KeyError:
            rtg_cfg['staticRouting']['defaultRoute'] = {'gatewayAddress': dfgw, 'vnic': vnic, 'adminDistance': dfgw_adminDistance, 'mtu': mtu}
    else:
        # No gateway supplied, so clear any existing default route
        rtg_cfg['staticRouting']['defaultRoute'] = None

    # Push the updated static routing configuration back to NSX Manager
    cfg_result = client_session.update('routingConfigStatic', uri_parameters={'edgeId': esg_id},
                                       request_body_dict=rtg_cfg)
    if cfg_result['status'] == 204:
        return True
    else:
        return False

The Ansible playbook is then able to include the following bits to create the default gateway with the preferred settings:

  - name: NSX Edge creation
    nsx_edge_router:
      nsxmanager_spec: "{{ nsxmanager_spec }}"
      state: present
      name: "{{ edge_name }}"
      description: "{{ description }}"
      resourcepool_moid: "{{ gather_moids_cl.object_id }}"
      datastore_moid: "{{ gather_moids_ds.object_id }}"
      datacenter_moid: "{{ gather_moids_cl.datacenter_moid }}"
      interfaces:
        vnic0: {ip: "{{ ext_ip }}", prefix_len: 26, logical_switch: "{{ uplink }}", name: 'uplink0', iftype: 'uplink', fence_param: 'ethernet0.filter1.param1=1'}
        vnic1: {ip: '192.168.0.1', prefix_len: 20, logical_switch: "{{ switch_name }}", name: 'int0', iftype: 'internal', fence_param: 'ethernet0.filter1.param1=1'}
      default_gateway: "{{ default_route }}"
      default_gateway_vnic: '0'
      mtu: '9000'
      remote_access: 'true'
      username: 'admin'
      password: "{{ nsx_pass }}"
      firewall: 'true'
      ha_enabled: 'true'
    register: create_esg
    tags: esg_create

When specifying the vNIC to use for the default gateway, the value is not the name the Ansible playbook gives the vNIC — uplink0 — but rather the vNIC number within the Edge — which will be 0 if you are using my playbook.
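
If you want to confirm that the default route landed on the expected vNIC with the desired MTU, one option is to query the Edge's static routing configuration after the playbook runs. The task below is only a sketch; it assumes the NSX-v /api/4.0/edges/{edgeId}/routing/config/static endpoint and a hypothetical edge_id variable (the Edge identifier, such as 'edge-1'), and it is not part of the published playbook:

  - name: Verify the default route configuration on the Edge
    uri:
      url: "https://{{ nsxmanager_spec.host }}/api/4.0/edges/{{ edge_id }}/routing/config/static"
      method: GET
      user: "{{ nsxmanager_spec.user }}"
      password: "{{ nsxmanager_spec.password }}"
      force_basic_auth: yes
      validate_certs: no
      return_content: yes
    register: static_routing

  - name: Display the static routing configuration (XML)
    debug:
      var: static_routing.content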

Once I have the SNAT/DNAT functionality added, I will write another blog post and progress on the Infrastructure-as-Code project will be nearly complete.

Enjoy!

 

Docker for Ansible + VMware NSX Automation

I am writing this as I sit and watch the annual viewing of The Hobbit and The Lord of the Rings trilogy over the Christmas holiday. The next couple of weeks should provide the time necessary to hopefully complete the Infrastructure-as-Code project I undertook last month. As part of that project, I spoke previously about how Ansible is being used to provide the automation layer for the deployment and configuration of the SDDC Kubernetes stack. As part of the bootstrapping effort, I have decided to create a Docker image with the necessary components to perform the initial virtual machine deployment and NSX configuration.

The Dockerfile for the Ubuntu-based Docker container is hosted both on Docker Hub and within the Github repository for the larger Infrastructure-as-Code project.

When the Docker container is launched, it includes the necessary components to interact with the VMware stack, including additional modules for VM folders, resource pools and VMware NSX.

To launch the container, I am running it with the following options so that it includes the local copies of the Infrastructure-as-Code project:

$ docker run -it --name ansible -v /Users/cmutchler/github/vsphere-kubernetes/ansible/:/opt/ansible virtualelephant/ubuntu-ansible

The Docker image is a bit on the larger side, but it is designed to run locally on a laptop or desktop. The image includes the required Python and NSX bits so that the additional GitHub repositories cloned into it will operate correctly. The OpenShift project provides additional modules for interacting with vSphere folders and resource pools, while the NSX modules from the VMware GitHub repository provide the necessary bits for leveraging Ansible with NSX.

Once running, the Docker container is then able to bootstrap the deployment of the Infrastructure-as-Code project using the Ansible playbooks I’ve published on Github. Enjoy!

VCDX Quick Hit – Monitoring and Alerting

This is the first post in what I plan to be a sporadic, yet ongoing, series highlighting certain aspects of a VCDX / architect skillset. These VCDX Quick Hits will cover a range of topics and key in on certain aspects of the VCDX blueprint. It is my hope they will trigger some level of critical thinking on the reader's part and help them improve their skillset.

The idea for this post came after listening to a post-mortem call for a recent incident that occurred at work. The incident itself was a lower priority Severity 2 incident, meaning it only impacted a small subset of customers in a small failure domain (a single vCenter Server). As architects, we know monitoring is a key component of any architecture design — whether it is intended for a VCDX submission or not.

In IT Architect: Foundation in the Art of Infrastructure Design (Amazon link), the authors state:

“A good monitoring solution will identify key metrics of both the physical and virtual infrastructure across all key resources: compute, storage, and networking.”

The post-mortem call got me thinking about maturity within our monitoring solutions and improving our architecture designs by striving to understand the components better earlier in the design and pilot phases.

It is common practice to identify the components and services of an architecture we have designed, or are responsible for, and to outline which of them are key to supporting the service offering. When I wrote the VMware Integrated OpenStack design documentation, which later became the basis for my VCDX defense, I identified the specific OpenStack services that needed to be monitored. The following screen capture shows how I captured those services within the documentation.

As you can see from the above graphic, I gave each service definition a unique ID, documented the component/service, documented where the service should be running, and included a brief description of the component/service. That information was used to create the Sprint story for the monitoring team to build the alert definitions within the monitoring solution.

All good, right?

The short answer is: not really. What I provided in my design was adequate for an early service offering, but it left room for further maturity. Going back to the post-mortem call, this is where additional maturity in the architecture design would have helped reduce the MTTR of the incident.

During the incident, two processes running on a single appliance were being monitored to determine if they were running. Just like my VMware Integrated OpenStack design, these services had been identified and were being monitored per the architecture specification. However, what was not documented was the dependency between the two processes. In this case, process B was dependent on process A and although process A was running, it was not properly responding to the queries from process B. As a result, the monitoring system believed everything was running correctly — it was from an alert definition perspective — and the incident was not discovered immediately. Once process A was restarted, it began responding to the queries from process B and service was restored.

So what could have been done?

First, the architecture design could have specified alert definitions for the key services (or processes) that went beyond simply measuring whether the service was running.

Second, the architecture design could have better captured the inter-dependencies between these two processes and specified a more detailed alert definition. In this case, a log entry was written each time process A did not correctly respond to process B. An alert definition for that log entry would have allowed the monitoring system to generate an alert.

Third, the architecture design could have used canary testing as a way to provide a more mature monitoring solution. It may be necessary to clarify what I mean when I use the term canary testing.

“Well into the 20th century, coal miners brought canaries into coal mines as an early-warning signal for toxic gases, primarily carbon monoxide. The birds, being more sensitive, would become sick before the miners, who would then have a chance to escape or put on protective respirators.” (Wikipedia link)

Canary testing, then, implies a method of checking the service for issues before a customer discovers them. Canary testing should include common platform operations a customer would typically perform; this can also be thought of as end-to-end testing.

For example, a VMware Integrated OpenStack service offering with NSX would need to ensure not only that the NSX Manager is online, but also that the OpenStack Neutron service is able to communicate with it. A good test could be to make an OpenStack Neutron API call to deploy an NSX Edge Services Gateway or to create a new tenant network (NSX logical switch).
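
To make that concrete, the sketch below uses the standard Ansible OpenStack module os_network to create and then remove a short-lived tenant network as a canary test. The cloud and network names are hypothetical, and in a real design this play would run on a schedule with any failure feeding the monitoring and alerting system:

---
- hosts: localhost
  connection: local
  gather_facts: False

  tasks:
  # Requires the OpenStack client libraries (shade) on the Ansible control host
  - name: Canary - create a temporary tenant network through Neutron
    os_network:
      cloud: 'vio-canary'
      name: 'canary-net'
      state: present

  - name: Canary - remove the temporary tenant network
    os_network:
      cloud: 'vio-canary'
      name: 'canary-net'
      state: absent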

There are likely numerous ways a customer will interact with your service offering, and defining these additional tests within the architecture design itself is something I challenge you to consider.