Post-Defense VCDX Thoughts

“Nothing in the world is worth having or worth doing unless it means effort, pain, difficulty… I have never in my life envied a human being who led an easy life. I have envied a great many people who led difficult lives and led them well.” -Theodore Roosevelt

That quote from Theodore Roosevelt sums up rather well the VCDX certification. The VCDX certification takes a great deal of effort, pain and difficulty to accomplish. My personal journey included multiple defense attempts — much to my dismay and benefit. Fortunately, it was all worth it!

I am VCDX #257!

The VCDX certification requires a significant amount of time to earn. If I had to estimate it, I would say I spent more than 200 hours working on my design documentation, defense presentation, mock defenses, Q&A sessions and general research. The submitted design was also an actual work project, so some of that time investment was for my job — an added benefit not all candidates have.

The one lesson I would share with others thinking about or pursuing their own VCDX certification is this — be careful whom you ask for advice and whose advice you take. If they have not been a panelist, their view of what to do (or not to do) is going to be mostly opinion. The VCDX program held a Q&A call the Friday before the defenses began in May.

On the call were Joe Silvagi, Simon Long and Karl Childs — all three are heavily involved in the program. The most frequent questions from candidates started with the phrase “My mentor says” or “The community says”. In nearly every instance, the response from Joe was along the lines of “that isn’t right.”

Attend one (or more) VCDX workshops prior to submitting so that you can ask questions and reach out to the people running the workshops to get trustworthy responses.

That’s all the advice I have to give.

There is an African proverb — the quote hangs outside one of the VMware conference rooms — that says:

“If you want to go fast, go alone. If you want to go far, go together.”

This is true of the VCDX certification. I got to this point not because I went alone, but because I went with others.

My wife – No one on this earth has supported me more. She endured countless late nights over the past two decades as I strove to advance my career. This is as much her certification as it is my own.

Rich Steck (Adobe) – He mentored me during one of the most difficult years in my career. He challenged me to figure out where I wanted to go and to find paths to get there. Most importantly, he listened.

Frans van Rooyen (Adobe) – Already a brilliant cloud architect in his own right, he mentored me in my role as a Compute Platform Engineer for two years. He let me constantly challenge all of the decisions we were making (on-the-fly) as we built a rather large private cloud across the globe. He introduced me to VMware technologies and helped me gain the skills I would need to land my dream job at VMware in two short years.

Andrew Nelson (VMware) – While at Adobe, Frans introduced me to Andrew. Andy and I spoke at VMworld together in San Francisco and Barcelona in 2014. We briefly worked on a book together, during which time he told me that if I wanted to get a job at VMware, I’d be surprised how quickly it would happen. I had an offer for my current role barely a month later.

OneCloud Architecture Team (VMware) – My dream job came with the opportunity to work with three double-VCDX certification holders. On the first architecture review board call I attended, they tore into another architect over his vRA design, and at that moment I knew I would have to step up my game significantly to play with them. What a blessing it has been to work with them for the past two years — each of them has helped me grow my skills as an architect immensely. They taught me to critically challenge a design decision, not for the sake of arguing, but in order to understand the rationale behind it.

Their support continued from afar as I went through the process of submitting and defending my design for my own certification. When I got the email saying I was now VCDX #257, they were right there celebrating my success with me.

Thank you to each of you for helping me realize my dreams and earn the VCDX certification!

 

Install VMware Integrated OpenStack on VCF

I am currently pursuing my VCDX certification and the design I have submitted is based on VMware Cloud Foundation and VMware Integrated OpenStack. As part of the required documentation, I included a deployment guide — unfortunately, it is not as simple as laying down the SDDC components and the VIO vApp for the deployment.

This blog post will cover a couple of items that are needed to get the two pieces playing together.


Shared Edge & Workload Cluster

The VCF architecture currently has a limitation that a vCenter Server can only have a single vSphere cluster — a 1:1 relationship. VMware Integrated OpenStack, however, requires either three clusters in a single vCenter Server, or a management cluster in one vCenter Server instance and two clusters in a second vCenter Server. Neither of these options fits within VCF’s 1:1 limitation.

In order to make it work, we are going to use a two vCenter Server deployment of VMware Integrated OpenStack and modify the OMS server to combine the NSX Edge and Workload Clusters into one. We do this by editing a single configuration file and restarting the oms service running on the VIO vApp Management (OMS) VM.

$ cd /opt/vmware/vio/etc
$ sudo vim moms.properties

Add the following line to the end of the file:
oms.allow_shared_edge_cluster = true

$ sudo restart oms
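
If you would rather script the change than edit the file by hand, the same steps collapse into two commands — a minimal sketch, assuming the default file location shown above:

$ echo "oms.allow_shared_edge_cluster = true" | sudo tee -a /opt/vmware/vio/etc/moms.properties
$ sudo restart oms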

VMware Integrated OpenStack can now be deployed on top of VMware Cloud Foundation.


VXLAN-backed External Network

This one is a bit trickier and is an obstacle whether or not you are using VMware Cloud Foundation as the infrastructure layer.

Logically, the end result for the OpenStack external network is to attach to a VXLAN port group created by NSX. The NSX logical switch network is attached to the internal interface on an NSX Distributed Logical Router.

The following is the logical diagram for the architecture.

[Figure: logical diagram of the external OpenStack network architecture]

The issue is that during the deployment of an OpenStack instance using VMware Integrated OpenStack, you have to specify an external network — but the deployment wizard will not allow a vSphere Administrator to select a VXLAN port group. I worked around this by creating a non-VXLAN port group on the DVS that is used only for the deployment.

Once the OpenStack deployment was complete, I needed to attach the actual VXLAN-backed port group as the external network.

SSH to the OMS server:
$ ssh -l viouser oms.domain.local

SSH to an OpenStack controller VM and load the cloud admin credentials:
$ ssh controller01
$ sudo cp /root/cloudadmin_v3.rc .
$ source cloudadmin_v3.rc

Launch the neutron CLI in interactive mode:
$ neutron

(neutron) net-list
(neutron) net-create --provider:network_type=portgroup --provider:physical_network=virtualwire-XX vio-external-network
(neutron) net-list

The network will now appear in the OpenStack network list. Go ahead and create your subnet for the external IP addresses, based on the network assignment in your environment.
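
As a sketch, that subnet creation might look like the following from the same neutron shell — the CIDR, gateway and allocation pool are placeholders for whatever your environment uses, and DHCP is disabled because the upstream physical network handles addressing:

(neutron) subnet-create vio-external-network 192.0.2.0/24 --name vio-external-subnet --gateway 192.0.2.1 --allocation-pool start=192.0.2.10,end=192.0.2.200 --disable-dhcp
(neutron) subnet-list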

If you have questions or issues with implementing these changes in your environment, please reach out.

Using the VMware Validated Design Reference Material

Caution: this post is highly opinionated.

I am deep into the process of completing my VCDX design documentation and application for (hopefully) a Q2 2017 defense. As it so happens, a short conversation took place on Twitter today regarding a post on the VMware Communities site about the VMware Validated Design for SDDC 3.x, including a new design decision checklist.

[Image: screenshot of the Twitter conversation]

The latest version of the VMware Validated Design (VVD) is a pretty awesome product for customers to reference when starting out on their private cloud journey. That being said, it is by no means a VCDX design or a set of materials that could simply be re-purposed for a VCDX design.

Why? Because there are no customer requirements.

For the same reason a hypothetical (or fake) design is often discouraged by people in the VCDX community, the VVD suffers from the same issue. In a vacuum you can make any decision you want, because there are no ramifications from your design decision. In the real-world this is simply not the case.

Taking a look at the Design Decisions Checklist, it walks through the more than 200 design decisions made in the course of developing the VVD reference architecture. The checklist does a good job of laying out the fields each design decision covers, such as:

  • Design Decision
  • Design Justification
  • Design Implication

Good material. But if you’ve read my other post on design decisions — which you may or may not agree with — it highlights that a design justification is made based on a requirement.

Let’s take a look at just one of the design decisions made by the VVD product and highlighted in the checklist.

[Image: example design decision from the VVD checklist]

The decision is to limit a single compute pod to a single physical rack — no cross-rack clusters. That sounds like a reasonable decision, especially if the environment had a restriction on L2 boundaries or some other requirement. But what if I have a customer requirement that says a compute node must be able to join any compute pod (cluster) regardless of its physical rack location within a data center?

Should I ignore that requirement because the VVD says to do otherwise?

Of course not.

My issue with the Twitter conversation is two-fold:

  1. The VVD design decisions are not in fact design decisions, but design recommendations. They can help a company, group or architect determine, based on their own requirements, which of these “decisions” should be leveraged within their environment. They are not hard-and-fast decisions that must be adhered to.
  2. From a VCDX perspective, blindly assuming you could copy/paste any of these design decisions and use them in a VCDX defense is naive. You must have a justification for every design decision made, and it has to map back to a customer requirement, risk or constraint.

I also do not think that is what the original poster was saying when he initially responded to the tweet about the checklist. I do think, though, that some people may actually believe they can just take the VVD, wrap it in a bow and call it good.

My suggestion is to take the VVD design documentation and treat it as reference material, just like the many other great books and online resources available to the community. It won’t work for everyone, because every design has different requirements, constraints and risks. Take the bits that work for you and expand upon them. Most importantly, understand why you are making each design decision.

Let me know what you think on Twitter.

Again, this post is highly opinionated and written from my own limited perspective. Do not mistake it for the opinion of VMware or of any VCDX certified individuals.

NSX DLR Designated Instance


Despite sharing a name with a great TV show, we are going to talk about something slightly different — the NSX Distributed Logical Router (DLR) Designated Instance. NSX has many great features, and also many caveats when implementing some of those features — like needing a Designated Instance when using a DLR.

So what is a Designated Instance? Honestly, I did not know what it was until a conversation earlier today with a few co-workers who are a bit more knowledgeable about NSX than I am. Essentially, a Designated Instance is an elected ESXi host that initially answers all new requests — also known as a single point of failure.

Let’s look at the logical network diagram I posted yesterday.

[Figure: NSX DLR and OpenStack logical network diagram]

Pretty sweet, right?

The issue arises when the DLR is connected directly to a VLAN. While technically not a problem — it does exactly what you’d expect — it requires one of the ESXi hosts in the transport zone to act as the Designated Instance. The result is that if the Designated Instance ESXi host fails, any new traffic will fail until the election process completes and a new Designated Instance is chosen.

So is it possible to not need a Designated Instance when using a DLR? Yes.

It involves introducing another logical NSX layer into the virtual network design. If you saw my tweet earlier, this is what I meant.

I like NSX, but sometimes I think it adds a little too much complexity for operational simplicity.

Adding a set of ECMP edges above the DLR and connecting the two together eliminates the requirement for NSX to use a Designated Instance. Here is what an alternative to the previous design would look like.

[Figure: alternative design with ECMP edges between the DLR and the physical network]

Essentially what I’ve done is create another VXLAN with a corresponding NSX Logical Switch and connect the uplink from the DLR to it. The ECMP Edges then use the same Logical Switch as their internal interface. It is on the uplink side of the ECMP Edges where the physical-to-virtual (P2V) layer takes place and the VLAN is connected.
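
For those curious about the plumbing, that transit logical switch between the DLR uplink and the ECMP Edge internal interfaces can be created through the NSX Manager API — a rough sketch only, assuming NSX-v, a transport zone ID of vdnscope-1 and a manager at nsxmgr.domain.local (all placeholders for your environment):

$ curl -k -u 'admin:password' -X POST \
    -H 'Content-Type: application/xml' \
    -d '<virtualWireCreateSpec><name>transit-dlr-ecmp</name><description>Transit between DLR uplink and ECMP Edge internal interfaces</description><tenantId>default</tenantId></virtualWireCreateSpec>' \
    https://nsxmgr.domain.local/api/2.0/vdn/scopes/vdnscope-1/virtualwires

The response body contains the new virtualwire-XX identifier — the same style of ID referenced by the neutron net-create command in the previous post.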

Using this design allows the environment to run a dynamic routing protocol both between the DLR and the ECMP Edges, and between the ECMP Edges and the upstream physical network — although mileage may vary depending on your physical network. The ECMP Edges add scalability — although limited to eight — based on the amount of North-South network traffic and the bandwidth required to meet tenant needs. Features like vSphere anti-affinity rules can mitigate the failure of a single ESXi host, which you cannot do when there is a Designated Instance. The design can also take into consideration an N+x scenario for when to scale the ECMP Edges.

So many options open up when NSX is introduced into an architecture, along with a lot of extra complexity. Ultimately the decision should be based on the requirements and the stakeholders’ risk acceptance. Relying on a Designated Instance may be acceptable to one stakeholder, while adding more complexity to the design may not be.

Until next time, enjoy!

Understanding a Design Decision


The last couple of months leading into the end of the year have seen me focusing once again on earning the VCDX certification. After a fair amount of honest examination of my skills — especially my areas of weakness — I knew a new design was needed. Fortunately, a new project at work had me building an entirely new VMware Integrated OpenStack service offering, and being able to work on the design from inception to POC to pilot has been a great learning opportunity. One of my weaknesses has been making sure I understand the ramifications of each design decision made in the architecture. As I worked through documenting all of the design decisions, I settled on a template within the document.

The following table is replicated for each design decision within the architecture.

[Image: design decision summary table template]
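
The image lays out the exact template, but the fields it captures — the names here are my paraphrase based on the sections discussed below, not the literal template — run along these lines:

  • Design Decision
  • Design Justification
  • Requirements Achieved
  • Decision Risks
  • Risk Mitigation
  • Design Impact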

One of the ways I worked to improve my understanding of how to document a proper design was the book IT Architect: Foundation in the Art of Infrastructure Design. In it, I noticed the authors made sure to highlight the design justifications throughout every chapter. I wanted to incorporate those same justifications within my VCDX architecture document, and also document the risks, the impacts and the requirements that were achieved by each decision.

In the design I am currently working on, an example of the above table in action can be found in the following image.

[Image: example design decision summary for the compute platform]

Here a decision was made to use the Dell PowerEdge R630 server for the compute platform. Requirements like the SLA also had to be taken into consideration, which you can see reflected in the risks and risk mitigation. The table helps to highlight when a design decision actually adds new requirements to the architecture — usually surfaced in the Impact or Decision Risks sections. In the case of the example, the table notes,

Dell hardware has been prone to failures, includes drives, SD cards and controller failures.

I documented the risk based on knowledge acquired over nearly a decade of using Dell hardware, especially in my current role. That risk needed to be addressed, which created an ancillary requirement for the design. The subsequent risk mitigation fulfills the new requirement.

A 4-hour support contract is purchased for each compute node. In addition, an on-site hardware locker is maintained at the local data center, which contains common components to reduce the mean-time-to-resolution when a failure occurs.

The subsequent decision to purchase a 4-hour support contract from Dell, combined with the on-site hardware locker, allows the design to account for the SLA requirements of the service offering while also addressing a known risk — hardware failure. In my previous VCDX attempt, I did not do a good enough job working through this thought process, and that is a key reason why I was not successful.

The process of documenting the table has helped me make sure the proper amount of time is spent thinking through every decision. I am also finding that documenting all the decisions is helpful as I review the design with others. All in all, it has been a great process to work through, and it is helping me to know and comprehend every aspect of the design.

As noted previously, I am still pursuing my VCDX certification, so these opinions may not be shared by those who have already earned theirs.