Caution: this post is highly opinionated.
I am deep into the process of completing my VCDX design documentation and application for a (hopefully) Q2 2017 defense. As it happens, a short conversation took place on Twitter today regarding a post on the VMware Communities site for the VMware Validated Design for SDDC 3.x, which includes a new design decision checklist.
The latest version of the VMware Validated Design (VVD) is a pretty awesome product for customers to reference when starting out on their private cloud journey. That being said, it is by no means a VCDX design or a set of materials that could simply be re-purposed for a VCDX design.
Why? Because there are no customer requirements.
The VVD suffers from the same issue that leads people in the VCDX community to discourage hypothetical (or fake) designs: in a vacuum you can make any decision you want, because there are no ramifications from your design decision. In the real world this is simply not the case.
The Design Decisions Checklist walks through the more than 200 design decisions made in the course of developing the VVD reference architecture. It does a good job of laying out the fields each design decision covers, like:
- Design Decision
- Design Justification
- Design Implication
Good material. But as I argued in my other post on design decisions, which you may or may not agree with, a decision's justification must be grounded in a requirement.
Let’s take a look at just one of the design decisions made by the VVD product and highlighted in the checklist.
The decision is to limit a single compute pod to a single physical rack, as in no cross-rack clusters. Sounds like a reasonable decision, especially if the environment had a restriction on L2 boundaries or some other requirement. But what if I have a customer requirement that says a compute node must be able to join any compute pod (cluster) regardless of its physical rack location within the data center?
Should I ignore that requirement because the VVD says to do otherwise?
Of course not.
My issue with the Twitter conversation is two-fold:
- The VVD design decisions are not in fact design decisions, but design recommendations. They can help a company, group, or architect determine, based on their requirements, which of these “decisions” should be leveraged within their environment. They are not hard-and-fast decisions that must be adhered to.
- From a VCDX perspective, blindly assuming you could copy/paste any of these design decisions and use them in a VCDX defense is naive. You must have a justification for every design decision made and it has to map back to a customer requirement, risk or constraint.
I also do not think that is what @brianwelch was saying when he initially responded to the Tweet about the checklist. I do think though that some people may actually think they can just take the VVD, wrap it in a bow and call it good.
My suggestion is to take the VVD design documentation and consider it reference material, just like the many other great books and online resources available to the community. It won’t work for everyone, because every design has different requirements, constraints and risks. Take the bits that work for you and expand upon them. Most importantly, understand why you are using or making that design decision.
Let me know what you think on Twitter.
Again, this post is highly opinionated from my own limited perspective. Do not mistake it for the opinion of VMware or any VCDX certified individuals.
While a great show, we are going to talk about something slightly different — the NSX Distributed Logical Router (DLR) Designated Instance. NSX has many great features and also many caveats when implementing some of those great features — like having a Designated Instance when using a DLR.
So what is a Designated Instance? Honestly, I did not know what it was until a conversation earlier today with a few co-workers who are a bit more knowledgeable with NSX than me. Essentially, a Designated Instance is an elected ESXi host that initially answers all new requests — also known as a single point of failure.
Let’s look at the logical network diagram I posted yesterday.
Pretty sweet right?
The issue arises when the DLR is connected directly to a VLAN. While technically not a problem — it does exactly what you’d expect — it requires one of the ESXi hosts in the transport zone to act as the Designated Instance. The result is that if the Designated Instance host fails, any new traffic will fail until the election process completes and a new Designated Instance is chosen.
So is it possible to not need a Designated Instance when using a DLR? Yes.
It involves introducing another logical NSX layer into the virtual network design. If you saw my tweet earlier, this is what I meant.
Adding a set of ECMP edges above the DLR and connecting the two together eliminates the requirement for NSX to use a Designated Instance. Here is what an alternative to the previous design looks like.
Essentially what I’ve done is create another VXLAN, with a corresponding NSX Logical Switch, and connect the uplink from the DLR to it. The ECMP Edges then use the same Logical Switch as their internal interface. It is on the uplink side of the ECMP Edges where the physical-to-virtual (P2V) layer takes place and the VLAN is connected.
Using this design allows the environment to run a dynamic routing protocol both between the DLR and the ECMP Edges, and between the ECMP Edges and the upstream physical network — although mileage may vary depending on your physical network. The ECMP Edges add scalability — though limited to eight equal-cost paths — based on the amount of North-South network traffic and the bandwidth required to meet tenant needs. Features like vSphere anti-affinity rules can mitigate the failure of a single ESXi host, which is not possible when there is a Designated Instance. The design can also take an N+x scenario into consideration when deciding when to scale the ECMP Edges.
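To make the N+x scaling idea concrete, here is a minimal sketch of sizing the ECMP Edge count from a North-South bandwidth target. The per-edge throughput figure is purely an assumption for illustration — size from real measurements of your Edge appliances — but the eight-path ceiling reflects the ECMP limit noted above.

```python
import math

# Hypothetical sizing helper for the ECMP Edge count using N + x redundancy.
# The per-edge throughput value is an assumed figure for illustration only.
ECMP_PATH_LIMIT = 8  # NSX supports at most eight equal-cost paths


def ecmp_edge_count(required_gbps, per_edge_gbps=10.0, spare=1):
    """Edges needed to carry required_gbps North-South, plus `spare` for N + x."""
    needed = math.ceil(required_gbps / per_edge_gbps) + spare
    if needed > ECMP_PATH_LIMIT:
        raise ValueError(
            f"{needed} edges exceeds the {ECMP_PATH_LIMIT}-path ECMP limit")
    return needed


# 35 Gbps of tenant traffic at an assumed 10 Gbps per edge, with N + 1 redundancy
print(ecmp_edge_count(35))
```

The same arithmetic works in reverse: if the N+x count would exceed eight paths, the bandwidth requirement has to be met some other way.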
So many options open up when NSX is introduced into an architecture, along with a lot of extra complexity. Ultimately the decision should be based on the requirements and the stakeholders’ risk acceptance. Relying on a Designated Instance may be acceptable to a stakeholder, while adding more complexity to the design may not be.
Until next time, enjoy!
The last couple of months leading into the end of the year have seen me focusing once again on earning the VCDX certification. In the process of examining my skills, especially my areas of weakness, I knew a new design was needed. Fortunately, a new project at work had me building an entirely new VMware Integrated OpenStack service offering. Being able to work on the design from inception to POC to Pilot has provided a great learning opportunity. One of my weaknesses has been making sure I understand the ramifications of each design decision made in the architecture. As I worked through the process of documenting all of the design decisions, I settled on a template within the document.
The following table is replicated for each design decision within the architecture.
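A sketch of such a table, with field names reconstructed from the fields discussed in this post (the exact labels in my document may differ):

```
| Design Decision       | <the decision made>                                 |
| Design Justification  | <why — mapped back to a requirement>                |
| Decision Risks        | <risks the decision introduces>                     |
| Risk Mitigation       | <how each risk is addressed>                        |
| Impact                | <downstream effects, including new requirements>    |
| Requirements Achieved | <the requirements this decision satisfies>          |
```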
One of the ways I worked to improve my understanding of how to document a proper design was reading the book IT Architect: Foundation in the Art of Infrastructure Design. I noticed the authors made sure to highlight the design justifications throughout every chapter. I wanted to incorporate those same justifications within my VCDX architecture document, and also document the risks, impacts, and requirements achieved by each decision.
In the design I am currently working on, an example of the above table in action can be found in the following image.
Here a decision was made to use the Dell PowerEdge R630 server for the compute platform. Requirements like the SLA also had to be taken into consideration, which you see reflected in the risks and risk mitigation. The table helps highlight when a design decision actually adds additional requirements to the architecture — usually found in the Impact or Decision Risks section of the table. In the case of the example, the table notes,
Dell hardware has been prone to failures, including drive, SD card and controller failures.
I documented the risk based on knowledge acquired over nearly a decade of using Dell hardware, especially most recently in my current role. Based on that knowledge, I documented it as a risk, which in turn created an ancillary requirement that needed to be addressed. The subsequent risk mitigation fulfills the new requirement.
A 4-hour support contract is purchased for each compute node. In addition, an on-site hardware locker is maintained at the local data center, which contains common components to reduce the mean-time-to-resolution when a failure occurs.
The subsequent decision to purchase a 4-hour support contract from Dell, combined with the on-site hardware locker, allows the design to account for the SLA requirements of the service offering while also addressing a known risk — hardware failure. In my previous VCDX attempt, I did not do a good enough job working through this thought process, and that is a key reason why I was not successful.
The process of documenting the table has helped me make sure the proper amount of time is spent thinking through every decision. I am also finding that documenting all the decisions is helpful as I review the design with others. All in all, it has been a great process to work through and is helping me know and comprehend every aspect of the design.
As noted previously, I am still pursuing my VCDX certification, so these opinions may not be shared by those who have already earned theirs.
The previous post discussed the use of the vRealize Operations Management Pack for OpenStack and Endpoint Agent in order to provide detailed service-level monitoring within an environment. The management pack comes with nearly 200 pre-defined alerts for OpenStack that can be leveraged to understand what is occurring within the environment. As I’ve gone through the alerts, these are the key alerts that can be leveraged to understand when any of the OpenStack services are experiencing a partial or complete outage.
OpenStack Compute Alerts
| Service | Alert Name | Alert Description |
|---|---|---|
| Nova | All nova-network services are unavailable | All nova-network services are unavailable |
| Nova | All nova-xvpvncproxy services are unavailable | All nova-xvpvncproxy services are unavailable |
| Nova | All nova-scheduler services are unavailable | All nova-scheduler services are unavailable |
| Nova | All nova-api services are unavailable | All nova-api services are unavailable |
| Nova | All nova-consoleauth services are unavailable | All nova-consoleauth services are unavailable |
| Nova | All nova-cert services are unavailable | All nova-cert services are unavailable |
| Nova | All nova-compute services are unavailable | All nova-compute services are unavailable |
| Nova | All nova-conductor services are unavailable | All nova-conductor services are unavailable |
| Nova | All nova-console services are unavailable | All nova-console services are unavailable |
| Nova | All nova-novncproxy services are unavailable | All nova-novncproxy services are unavailable |
| Nova | All nova-objectstore services are unavailable | All nova-objectstore services are unavailable |
| Nova | The nova-compute service is unavailable | Nova-compute status is unknown |
| Nova | The nova-objectstore service is unavailable | Nova-objectstore status is unknown |
| Nova | The nova-conductor service is unavailable | Nova-conductor status is unknown |
| Nova | The nova-api service is unavailable | Nova-api status is unknown |
| Nova | The nova-cert service is unavailable | Nova-cert status is unknown |
| Nova | The nova-console service is unavailable | Nova-console status is unknown |
| Nova | The nova-consoleauth service is unavailable | Nova-consoleauth status is unknown |
| Nova | The nova-network service is unavailable | Nova-network status is unknown |
| Nova | The nova-novncproxy service is unavailable | Nova-novncproxy status is unknown |
| Nova | The nova-scheduler service is unavailable | Nova-scheduler status is unknown |
| Nova | The nova-xvpvncproxy service is unavailable | Nova-xvpvncproxy status is unknown |
OpenStack Storage Alerts
| Service | Alert Name | Alert Description |
|---|---|---|
| Glance | All glance-api services are unavailable | All glance-api services are unavailable |
| Glance | All glance-registry services are unavailable | All glance-registry services are unavailable |
| Glance | The glance-api service is unavailable | Glance-api status is unknown |
| Glance | The glance-registry service is unavailable | Glance-registry status is unknown |
| Cinder | All cinder-api services are unavailable | All cinder-api services are unavailable |
| Cinder | All cinder-scheduler services are unavailable | All cinder-scheduler services are unavailable |
| Cinder | All cinder-volume services are unavailable | All cinder-volume services are unavailable |
| Cinder | The cinder-volume service is unavailable | Cinder-volume status is unknown |
| Cinder | The cinder-api service is unavailable | Cinder-api status is unknown |
| Cinder | The cinder-scheduler service is unavailable | Cinder-scheduler status is unknown |
OpenStack Network Alerts
| Service | Alert Name | Alert Description |
|---|---|---|
| Neutron | All neutron-dhcp-agent services are unavailable | All neutron-dhcp-agent services are unavailable |
| Neutron | All neutron-l3-agent services are unavailable | All neutron-l3-agent services are unavailable |
| Neutron | All neutron-lbaas-agent services are unavailable | All neutron-lbaas-agent services are unavailable |
| Neutron | All neutron-metadata-agent services are unavailable | All neutron-metadata-agent services are unavailable |
| Neutron | All neutron-server services are unavailable | All neutron-server services are unavailable |
| Neutron | The neutron-dhcp-agent service is unavailable | Neutron-dhcp-agent status is unknown |
| Neutron | The neutron-l3-agent service is unavailable | Neutron-l3-agent status is unknown |
| Neutron | The neutron-lbaas-agent service is unavailable | Neutron-lbaas-agent status is unknown |
| Neutron | The neutron-metadata-agent service is unavailable | Neutron-metadata-agent status is unknown |
| Neutron | The neutron-server service is unavailable | Neutron-server status is unknown |
OpenStack Auxiliary Alerts
| Service | Alert Name | Alert Description |
|---|---|---|
| Heat | All heat-api services are unavailable | All heat-api services are unavailable |
| Heat | All heat-api-cfn services are unavailable | All heat-api-cfn services are unavailable |
| Heat | All heat-api-cloudwatch services are unavailable | All heat-api-cloudwatch services are unavailable |
| Heat | All heat-engine services are unavailable | All heat-engine services are unavailable |
| Heat | The heat-api service is unavailable | Heat-api status is unknown |
| Heat | The heat-api-cfn service is unavailable | Heat-api-cfn status is unknown |
| Heat | The heat-api-cloudwatch service is unavailable | Heat-api-cloudwatch status is unknown |
| Heat | The heat-engine service is unavailable | Heat-engine status is unknown |
| Keystone | All keystone-all services are unavailable | All keystone-all services are unavailable |
| Keystone | The keystone-all service is unavailable | Keystone-all status is unknown |
| MySQL | All MySQL services are unavailable | All MySQL services are unavailable |
| MySQL | The MySQL Database service is unavailable | MySQL status is unknown |
| Apache | All Apache services are unavailable | All Apache services are unavailable |
| Apache | The Apache service is unavailable | Apache status is unknown |
| Jarvis | All Jarvis services are unavailable | All Jarvis services are unavailable |
| Memcached | All Memcached services are unavailable | All Memcached services are unavailable |
| Memcached | The memcached service is unavailable | Memcached status is unknown |
| RabbitMQ | All RabbitMQ services are unavailable | All RabbitMQ services are unavailable |
| RabbitMQ | The Rabbit Messaging service is unavailable | Rabbit Message Queue status is unknown |
| OMS | All tc-oms services are unavailable | All tc-oms services are unavailable |
| OMS | All tc-osvmw services are unavailable | All tc-osvmw services are unavailable |
| vPostgres | All vPostgres services are unavailable | All vPostgres services are unavailable |
| vPostgres | The vpostgres service is unavailable | vPostgres status is unknown |
| Ceilometer | The ceilometer-agent-central service is unavailable | Ceilometer-agent-central status is unknown |
| Ceilometer | The ceilometer-agent-compute service is unavailable | Ceilometer-agent-compute status is unknown |
| Ceilometer | The ceilometer-agent-notification service is unavailable | Ceilometer-agent-notification status is unknown |
| Ceilometer | The ceilometer-alarm-evaluator service is unavailable | Ceilometer-alarm-evaluator status is unknown |
| Ceilometer | The ceilometer-alarm-notifier service is unavailable | Ceilometer-alarm-notifier status is unknown |
| Ceilometer | The ceilometer-api service is unavailable | Ceilometer-api status is unknown |
| Ceilometer | The ceilometer-collector service is unavailable | Ceilometer-collector status is unknown |
Use of these alerts will help the environment be ready for a production deployment where an SLA can be attached. Enjoy!
Over the weekend I focused on two things — taking care of my six kids while my wife was out of town and documenting my VCDX design. During the course of working through the Monitoring portion of the design, I found myself focusing on the technical reasons for some of the design decisions I was making to meet the SLA requirements. That prompted the tweet you see to the left. When working on any design, you have to understand where the goal posts are in order to make intelligent decisions. With regard to an SLA, that means understanding what the SLA target is and on what frequency the SLA is calculated. As you can see from the image, an SLA calculated against a daily metric will vary considerably from an SLA calculated on a weekly or monthly basis.
So what can be done to meet the target SLA? If the monitoring solution is inside the environment, shouldn’t it have a higher target SLA than the thing it is monitoring? As I looked at the downtime numbers, I realized there were places where vSphere HA would not be adequate (by itself) to meet the SLA requirement of the design if it was being calculated on a daily or weekly basis. The ever elusive 99.99% SLA target eliminates vSphere HA altogether if it is being calculated on any less than a yearly basis.
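The arithmetic behind those goal posts is simple: the allowed downtime is (1 - target) times the measurement window, which is why the calculation frequency matters so much. A quick sketch (window lengths use a 30-day month for simplicity):

```python
# Allowed downtime per SLA measurement window: (1 - target) * window length.
# A 30-day month is assumed for simplicity.
WINDOWS_HOURS = {"daily": 24, "weekly": 24 * 7, "monthly": 24 * 30, "yearly": 24 * 365}


def allowed_downtime_minutes(sla_percent, window="monthly"):
    """Minutes of downtime permitted before the SLA target is breached."""
    return (1 - sla_percent / 100) * WINDOWS_HOURS[window] * 60


for window in WINDOWS_HOURS:
    print(f"99.99% {window}: {allowed_downtime_minutes(99.99, window):.2f} min")
```

At 99.99% calculated daily, the budget is under nine seconds of downtime; even calculated monthly it is only about four minutes, which is well below the time vSphere HA needs to restart a failed workload.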
As the architect of a project it is important to discuss the SLA requirements with the stakeholders and understand where the goal posts are. Otherwise you are designing in the vacuum of space with no GPS to guide you to the target.
SLAs within SLAs
The design I am currently working on had requirements for a central log repository and an SLA target of 99.9% for the tenant workload domain, calculated on a monthly basis. As I worked through the design decisions, I came to realize, however, that the central logging capability vRealize Log Insight provides to the environment should be more resilient than the 99.9% uptime of the workload domain it supports. This type of SLA within an SLA is the sort of thing you may find yourself having to design against. So how could I increase the uptime of Log Insight to support a higher target SLA?
The post on Friday discussed the clustering capabilities of Log Insight, and that came about as I was working through this problem. If the clustering capability of Log Insight could be leveraged to increase the uptime of the solution, even on physical infrastructure only designed to provide a lower 99.9% SLA, then I could meet the higher target sub-SLA. By including a 3-node Log Insight cluster and creating anti-affinity rules on the vSphere cluster to ensure the Log Insight virtual appliances are never located on the same physical node, I was able to increase the SLA potential of the solution. The last piece of the puzzle was incorporating the internal load balancing mechanism of Log Insight and using the VIP as the target for all of the systems’ remote logging. This allowed me to create a central logging repository with a higher target SLA than the underlying infrastructure SLA.
Designing for an SLA, and justifying the decisions made to support it, is one of the more trying issues in any architecture, at least in my mind. Understanding how each decision positively or negatively influences the SLA goals of the design is something every architect needs to do. This is one area where I was weak during my previous VCDX defense and was not able to accurately articulate. After spending significant time thinking through the key points of my current design, I have definitely learned more and can better understand the effects of the choices I am making.
The opinions expressed above are my own and as I have not yet acquired my VCDX certification, these opinions may not be shared by those who have.