Tag: VCDX

Caution: this post is highly opinionated.

I am deep into the process of completing my VCDX design documentation and application for (hopefully) a Q2 2017 defense. As it happens, a short conversation took place on Twitter today regarding a post on the VMware Communities site for the VMware Validated Design for SDDC 3.x, including a new design decision checklist.

[Image: twitter-screen]

The latest version of the VMware Validated Design (VVD) is a pretty awesome product for customers to reference when starting out on their private cloud journey. That being said, it is by no means a VCDX design or a set of materials that could simply be re-purposed for a VCDX design.

Why? Because there are no customer requirements.

For the same reason a hypothetical (or fake) design is often discouraged by people in the VCDX community, the VVD suffers from the same issue. In a vacuum you can make any decision you want, because there are no ramifications from your design decision. In the real-world this is simply not the case.

The Design Decisions Checklist walks through the more than 200 design decisions made in the course of developing the VVD reference architecture. The checklist does a good job of laying out the fields each design decision covers, such as:

  • Design Decision
  • Design Justification
  • Design Implication

Good material. But if you’ve read my other post on design decisions (which you may or may not agree with), you’ll know it argues that a design justification is made based on a requirement.

Let’s take a look at just one of the design decisions made by the VVD product and highlighted in the checklist.

[Image: vvd_decision_screencap]

The decision is to limit a single compute pod to a single physical rack, meaning no cross-rack clusters. That sounds like a reasonable decision, especially if the environment had a restriction on L2 boundaries or some other relevant requirement. But what if I have a customer requirement that says a compute node must be able to join any compute pod (cluster) regardless of its physical rack location within a data center?

Should I ignore that requirement because the VVD says to do otherwise?

Of course not.

My issue with the Twitter conversation is two-fold:

  1. The VVD design decisions are not in fact design decisions, but design recommendations. They can help a company, group or architect determine, based on their requirements, which of these “decisions” should be leveraged within their environment. They are not hard-and-fast decisions that must be adhered to.
  2. From a VCDX perspective, blindly assuming you could copy/paste any of these design decisions and use them in a VCDX defense is naive. You must have a justification for every design decision made and it has to map back to a customer requirement, risk or constraint.

I also do not think that is what was being said in the initial response to the tweet about the checklist. I do think, though, that some people may actually believe they can just take the VVD, wrap it in a bow and call it good.

My suggestion is to take the VVD design documentation and consider it reference material, just like the many other great books and online resources available to the community. It won’t work for everyone, because every design has different requirements, constraints and risks. Take the bits that work for you and expand upon them. Most importantly, understand why you are using or making that design decision.

Let me know what you think on Twitter.

Again, this post is highly opinionated from my own limited perspective. Do not mistake it for the opinion of VMware or any VCDX certified individuals.

Read More

[Image: nsx designated instance]

While a great show, we are going to talk about something slightly different: the NSX Distributed Logical Router (DLR) Designated Instance. NSX has many great features, and also many caveats when implementing some of them, like needing a Designated Instance when using a DLR.

So what is a Designated Instance? Honestly, I did not know until a conversation earlier today with a few co-workers who are a bit more knowledgeable about NSX than I am. Essentially, a Designated Instance is an elected ESXi host that initially answers all new requests; in other words, a single point of failure.

Let’s look at the logical network diagram I posted yesterday.

nsx-dlr-openstack

Pretty sweet right?

The issue arises when the DLR is connected directly to a VLAN. While technically not a problem (it does exactly what you’d expect it to do), it requires one of the ESXi hosts in the transport zone to act as the Designated Instance. The result is that if the Designated Instance ESXi host fails, any new traffic will fail until the election process completes and a new Designated Instance is chosen.

So is it possible to not need a Designated Instance when using a DLR? Yes.

It involves introducing another logical NSX layer into the virtual network design. If you saw my tweet earlier, this is what I meant.

I like NSX, but sometimes I think it adds a little too much complexity at the expense of operational simplicity.

Adding a set of ECMP Edges above the DLR and connecting the two together eliminates the requirement for NSX to use a Designated Instance. Here is what an alternative to the previous design would look like.

[Image: external openstack]

Essentially, what I’ve done is create another VXLAN with a corresponding NSX Logical Switch and connect the uplink from the DLR to it. The ECMP Edges then use the same Logical Switch as their internal interface. The physical-to-virtual boundary sits on the uplink side of the ECMP Edges, where the VLAN is connected.

Using this design allows the environment to run a dynamic routing protocol both between the DLR and the ECMP Edges and between the ECMP Edges and the upstream physical network, although mileage may vary depending on your physical network. The ECMP Edges introduce additional scalability (although limited to 8) based on the amount of North-South network traffic and the bandwidth required to meet tenant needs. Features like vSphere anti-affinity rules can mitigate the failure of a single ESXi host, which you cannot do when there is a Designated Instance. The design can also take into consideration an N+x scenario for when to scale the ECMP Edges.
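Since the ECMP Edge appliances are ordinary virtual machines from vSphere’s point of view, the anti-affinity piece can be scripted rather than clicked through. Below is a minimal pyVmomi sketch of adding such a DRS rule; the vCenter address, cluster name, edge VM names and rule name are hypothetical placeholders, and error handling is omitted.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Hypothetical connection details; replace with your own environment.
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find_by_name(vim_type, name):
    """Return the first inventory object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim_type], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

cluster = find_by_name(vim.ClusterComputeResource, "edge-cluster")   # placeholder names
edges = [find_by_name(vim.VirtualMachine, n) for n in ("ecmp-edge-01", "ecmp-edge-02")]

# DRS anti-affinity rule: keep the ECMP Edge appliances on separate ESXi hosts.
rule = vim.cluster.AntiAffinityRuleSpec(name="ecmp-edge-separation", enabled=True,
                                        mandatory=True, vm=edges)
spec = vim.cluster.ConfigSpecEx(rulesSpec=[vim.cluster.RuleSpec(info=rule, operation="add")])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

Disconnect(si)
```

The same rule can of course be created in the vSphere Web Client; the point is simply that the separation of the edges is itself a design decision that has to be captured and enforced somewhere.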

So many options open up when NSX is introduced into an architecture, along with a lot of extra complexity. Ultimately the decision should be based on the requirements and the stakeholders’ risk acceptance. Relying on a Designated Instance may be acceptable to a stakeholder, while adding more complexity to the design may not be.

Until next time, enjoy!

Read More


The last couple of months leading into the end of the year have seen me focusing once again on earning the VCDX certification. After a fair amount of examination of my skills, especially my areas of weakness, I knew a new design was needed. Fortunately, a new project at work had me building an entirely new VMware Integrated OpenStack service offering. Being able to work on the design from inception to POC to pilot has provided me with a great learning opportunity. One of my weaknesses has been making sure I understand the ramifications of each design decision made in the architecture. As I worked through the process of documenting all of the design decisions, I settled on a template within the document.

The following table is replicated for each design decision within the architecture.

[Image: dd_summary_template]

One of the ways I worked to improve my understanding of how to document a proper design was the book IT Architect: Foundation in the Art of Infrastructure Design. In it I noticed the authors made sure to highlight the design justifications throughout every chapter. I wanted to incorporate those same justifications within my VCDX architecture document, and also document the risks, impacts and requirements achieved by each decision.

In the design I am currently working on, an example of the above table in action can be found in the following image.

[Image: dd_summary_example01]

Here a decision was made to use the Dell PowerEdge R630 server as the compute platform. Requirements like the SLA also had to be taken into consideration, which you see reflected in the risks and risk mitigation. The table helps to highlight when a design decision actually adds additional requirements to the architecture, usually captured in the Impact or Decision Risks section of the table. In the case of the example, the table notes,

Dell hardware has been prone to failures, including drive, SD card and controller failures.

I documented the risk based on knowledge acquired over nearly a decade of using Dell hardware, most recently in my current role. Calling it out as a risk that had to be addressed created an ancillary requirement, and the subsequent Risk Mitigation fulfills that new requirement.

A 4-hour support contract is purchased for each compute node. In addition, an on-site hardware locker is maintained at the local data center, which contains common components to reduce the mean-time-to-resolution when a failure occurs.

The decision to purchase a 4-hour support contract from Dell, combined with the on-site hardware locker, allows the design to account for the SLA requirements of the service offering while also addressing a known risk: hardware failure. In my previous VCDX attempt I did not do a good enough job working through this thought process, and that is a key reason why I was not successful.
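To make the template concrete, here is roughly how that compute decision reads once the fields are filled in. The field names and wording below are paraphrased from the discussion above rather than copied from the actual design document.

```
Design Decision      : Use the Dell PowerEdge R630 as the compute node platform.
Requirements Achieved: Service offering SLA for the tenant workload domain.
Decision Risks       : Dell hardware has been prone to failures, including drive,
                       SD card and controller failures.
Risk Mitigation      : 4-hour support contract purchased for each compute node;
                       on-site hardware locker stocked with common components to
                       reduce mean-time-to-resolution.
Impact               : The identified risk creates an ancillary requirement
                       (support contract and spare parts) the design must fulfill.
```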

The process of documenting the table has helped me make sure the proper amount of time is spent thinking through every decision. I am also finding that documenting all the decisions is helpful as I review the design with others. All in all, it has been a great process to work through, and it is helping me know and comprehend every aspect of the design.

As noted previously, I am still pursuing my VCDX certification right now and so these opinions may not be shared by those who have already earned their VCDX certifications.

Read More


The previous post discussed the use of the vRealize Operations Management Pack for OpenStack and the Endpoint Agent to provide detailed service-level monitoring within an environment. The management pack comes with nearly 200 pre-defined alerts for OpenStack that can be leveraged to understand what is occurring within the environment. As I’ve gone through them, these are the key alerts for understanding when any of the OpenStack services are experiencing a partial or complete outage.

OpenStack Compute Alerts

| Service | Alert Name | Triggers |
|---------|------------|----------|
| Nova | All nova-network services are unavailable | All nova-network services are unavailable |
| Nova | All nova-xvpvnc-proxy services are unavailable | All nova-xvpvnc-proxy services are unavailable |
| Nova | All nova-scheduler services are unavailable | All nova-scheduler services are unavailable |
| Nova | All nova-api services are unavailable | All nova-api services are unavailable |
| Nova | All nova-consoleauth services are unavailable | All nova-consoleauth services are unavailable |
| Nova | All nova-cert services are unavailable | All nova-cert services are unavailable |
| Nova | All nova-compute services are unavailable | All nova-compute services are unavailable |
| Nova | All nova-conductor services are unavailable | All nova-conductor services are unavailable |
| Nova | All nova-console services are unavailable | All nova-console services are unavailable |
| Nova | All nova-novncproxy services are unavailable | All nova-novncproxy services are unavailable |
| Nova | All nova-objectstore services are unavailable | All nova-objectstore services are unavailable |
| Nova | The nova-compute service is unavailable | Nova-compute status is unknown |
| Nova | The nova-objectstore service is unavailable | Nova-objectstore status is unknown |
| Nova | The nova-conductor service is unavailable | Nova-conductor status is unknown |
| Nova | The nova-api service is unavailable | Nova-api status is unknown |
| Nova | The nova-cert service is unavailable | Nova-cert status is unknown |
| Nova | The nova-console service is unavailable | Nova-console status is unknown |
| Nova | The nova-consoleauth service is unavailable | Nova-consoleauth status is unknown |
| Nova | The nova-network service is unavailable | Nova-network status is unknown |
| Nova | The nova-novncproxy service is unavailable | Nova-novncproxy status is unknown |
| Nova | The nova-scheduler service is unavailable | Nova-scheduler status is unknown |
| Nova | The nova-xvpvnc-proxy service is unavailable | Nova-xvpvnc-proxy status is unknown |

OpenStack Storage Alerts

| Service | Alert Name | Triggers |
|---------|------------|----------|
| Glance | All glance-api services are unavailable | All glance-api services are unavailable |
| Glance | All glance-registry services are unavailable | All glance-registry services are unavailable |
| Glance | The glance-api service is unavailable | Glance-api status is unknown |
| Glance | The glance-registry service is unavailable | Glance-registry status is unknown |
| Cinder | All cinder-api services are unavailable | All cinder-api services are unavailable |
| Cinder | All cinder-scheduler services are unavailable | All cinder-scheduler services are unavailable |
| Cinder | All cinder-volume services are unavailable | All cinder-volume services are unavailable |
| Cinder | The cinder-volume service is unavailable | Cinder-volume status is unknown |
| Cinder | The cinder-api service is unavailable | Cinder-api status is unknown |
| Cinder | The cinder-scheduler service is unavailable | Cinder-scheduler status is unknown |

OpenStack Network Alerts

| Service | Alert Name | Triggers |
|---------|------------|----------|
| Neutron | All neutron-dhcp-agent services are unavailable | All neutron-dhcp-agent services are unavailable |
| Neutron | All neutron-l3-agent services are unavailable | All neutron-l3-agent services are unavailable |
| Neutron | All neutron-lbaas-agent services are unavailable | All neutron-lbaas-agent services are unavailable |
| Neutron | All neutron-metadata-agent services are unavailable | All neutron-metadata-agent services are unavailable |
| Neutron | All neutron-server services are unavailable | All neutron-server services are unavailable |
| Neutron | The neutron-dhcp-agent service is unavailable | Neutron-dhcp-agent status is unknown |
| Neutron | The neutron-l3-agent service is unavailable | Neutron-l3-agent status is unknown |
| Neutron | The neutron-lbaas-agent service is unavailable | Neutron-lbaas-agent status is unknown |
| Neutron | The neutron-metadata-agent service is unavailable | Neutron-metadata-agent status is unknown |
| Neutron | The neutron-server service is unavailable | Neutron-server status is unknown |

OpenStack Auxiliary Alerts

| Service | Alert Name | Triggers |
|---------|------------|----------|
| Heat | All heat-api services are unavailable | All heat-api services are unavailable |
| Heat | All heat-api-cfn services are unavailable | All heat-api-cfn services are unavailable |
| Heat | All heat-api-cloudwatch services are unavailable | All heat-api-cloudwatch services are unavailable |
| Heat | All heat-engine services are unavailable | All heat-engine services are unavailable |
| Heat | The heat-api service is unavailable | Heat-api status is unknown |
| Heat | The heat-api-cfn service is unavailable | Heat-api-cfn status is unknown |
| Heat | The heat-api-cloudwatch service is unavailable | Heat-api-cloudwatch status is unknown |
| Heat | The heat-engine service is unavailable | Heat-engine status is unknown |
| Keystone | All keystone-all services are unavailable | All keystone-all services are unavailable |
| Keystone | The keystone-all service is unavailable | Keystone-all status is unknown |
| MySQL | All MySQL services are unavailable | All MySQL services are unavailable |
| MySQL | The MySQL Database service is unavailable | MySQL status is unknown |
| Apache | All Apache services are unavailable | All Apache services are unavailable |
| Apache | The Apache service is unavailable | Apache status is unknown |
| Jarvis | All Jarvis services are unavailable | All Jarvis services are unavailable |
| Memcached | All Memcached services are unavailable | All Memcached services are unavailable |
| Memcached | The memcached service is unavailable | Memcached status is unknown |
| RabbitMQ | All RabbitMQ services are unavailable | All RabbitMQ services are unavailable |
| RabbitMQ | The Rabbit Messaging service is unavailable | Rabbit Message Queue status is unknown |
| OMS | All tc-oms services are unavailable | All tc-oms services are unavailable |
| OMS | All tc-osvmw services are unavailable | All tc-osvmw services are unavailable |
| vPostgres | All vPostgres services are unavailable | All vPostgres services are unavailable |
| vPostgres | The vpostgres service is unavailable | Vpostgres status is unknown |
| Ceilometer | The ceilometer-agent-central service is unavailable | Ceilometer-agent-central status is unknown |
| Ceilometer | The ceilometer-agent-compute service is unavailable | Ceilometer-agent-compute status is unknown |
| Ceilometer | The ceilometer-agent-notification service is unavailable | Ceilometer-agent-notification status is unknown |
| Ceilometer | The ceilometer-alarm-evaluator service is unavailable | Ceilometer-alarm-evaluator status is unknown |
| Ceilometer | The ceilometer-alarm-notifier service is unavailable | Ceilometer-alarm-notifier status is unknown |
| Ceilometer | The ceilometer-api service is unavailable | Ceilometer-api status is unknown |
| Ceilometer | The ceilometer-collector service is unavailable | Ceilometer-collector status is unknown |
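For a quick spot check of the same services outside of vROps (for example, while validating that these alerts fire as expected), something like the following can be run on a controller node. This is only a sketch; it assumes the services are managed by systemd and that the unit names match the service names above, so adjust the list for the actual deployment.

```python
#!/usr/bin/env python3
"""Spot-check OpenStack control-plane services via systemd (illustrative only)."""
import subprocess

# Trim or extend this list to match the node role being checked.
SERVICES = [
    "nova-api", "nova-scheduler", "nova-conductor", "nova-compute",
    "glance-api", "glance-registry",
    "cinder-api", "cinder-scheduler", "cinder-volume",
    "neutron-server", "neutron-dhcp-agent", "neutron-l3-agent", "neutron-metadata-agent",
    "heat-api", "heat-engine",
    "rabbitmq-server", "memcached", "apache2", "mysql",
]

def is_active(unit):
    """Return True if systemd reports the unit as active."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0

if __name__ == "__main__":
    down = [svc for svc in SERVICES if not is_active(svc)]
    for svc in down:
        print("WARNING: %s is not active" % svc)
    print("%d of %d services active" % (len(SERVICES) - len(down), len(SERVICES)))
```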

Using these alerts will help ensure the environment is ready for a production deployment where an SLA can be attached. Enjoy!

Read More

[Image: twitter-post-sla]

Over the weekend I focused on two things: taking care of my six kids while my wife was out of town and documenting my VCDX design. While working through the Monitoring portion of the design, I found myself focusing on the technical reasons for some of the design decisions I was making to meet the SLA requirements of the design. That prompted the tweet you see to the left. When working on any design, you have to understand where the goal posts are in order to make intelligent decisions. With regards to an SLA, that means understanding what the SLA target is and on what frequency the SLA is being calculated. As you can see from the image, an SLA calculated against a daily metric will vary a considerable amount from an SLA calculated on a weekly or monthly basis.
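To put numbers on that, the allowed downtime for a given SLA target is simple arithmetic against the length of the measurement window. A quick sketch, approximating a month as 30 days:

```python
# Allowed downtime per measurement window for a given SLA target.
WINDOWS_HOURS = {"daily": 24, "weekly": 24 * 7, "monthly": 24 * 30, "yearly": 24 * 365}

def allowed_downtime_minutes(sla_percent, window):
    """Minutes of downtime permitted in the window while still meeting the SLA."""
    return WINDOWS_HOURS[window] * 60 * (1 - sla_percent / 100.0)

for sla in (99.9, 99.99):
    for window in WINDOWS_HOURS:
        print("%.2f%% over a %s window: %.2f minutes of downtime allowed"
              % (sla, window, allowed_downtime_minutes(sla, window)))

# 99.9%:  ~1.4 min/day, ~10 min/week, ~43 min/month, ~526 min/year
# 99.99%: ~0.14 min/day, ~1 min/week, ~4.3 min/month, ~53 min/year
```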

So what can be done to meet the target SLA? If the monitoring solution is inside the environment, shouldn’t it have a higher target SLA than the thing it is monitoring? As I looked at the downtime numbers, I realized there were places where vSphere HA would not be adequate (by itself) to meet the SLA requirement of the design if it was being calculated on a daily or weekly basis. The ever-elusive 99.99% SLA target eliminates vSphere HA altogether if it is being calculated on anything less than a yearly basis.

As the architect of a project it is important to discuss the SLA requirements with the stakeholders and understand where the goal posts are. Otherwise you are designing in the vacuum of space with no GPS to guide you to the target.

SLAs within SLAs

The design I am currently working on has requirements for a central log repository and an SLA target of 99.9% for the tenant workload domain, calculated on a monthly basis. As I worked through the design decisions, however, I came to realize that the central logging capability vRealize Log Insight provides to the environment should be more resilient than the 99.9% uptime of the workload domain it supports. This type of SLA within an SLA is the sort of thing you may find yourself having to design against. So how could I increase the uptime of Log Insight to support a higher target SLA?

Friday’s post discussed the clustering capabilities of Log Insight, and that came about as I was working through this problem. If the clustering capability of Log Insight could be leveraged to increase the uptime of the solution, even on physical infrastructure designed to provide only a 99.9% SLA, then I could meet the higher target sub-SLA. By including a 3-node Log Insight cluster and creating anti-affinity rules on the vSphere cluster to ensure the Log Insight virtual appliances are never located on the same physical node, I was able to increase the SLA potential of the solution. The last piece of the puzzle was incorporating the internal load balancing mechanism of Log Insight and using the VIP as the target for all of the systems’ remote logging. This allowed me to create a central logging repository with a higher target SLA than the underlying infrastructure SLA.
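That reasoning can be sanity-checked with a simplified availability model. The sketch below assumes node failures are independent (which is what the anti-affinity rules are meant to approximate) and that the cluster, fronted by the ILB VIP, keeps accepting log traffic as long as a minimum number of nodes is up; that is a simplification for illustration, not a statement of exact Log Insight behavior.

```python
from math import comb

def cluster_availability(node_availability, nodes=3, required=1):
    """Probability that at least `required` of `nodes` independent nodes are up."""
    p = node_availability
    return sum(comb(nodes, k) * p**k * (1 - p)**(nodes - k)
               for k in range(required, nodes + 1))

single = 0.999  # node availability inherited from the 99.9% infrastructure SLA
print("at least 1 of 3 nodes up: %.9f" % cluster_availability(single, 3, 1))  # ~0.999999999
print("at least 2 of 3 nodes up: %.9f" % cluster_availability(single, 3, 2))  # ~0.999997
```

Even the conservative two-of-three case lands well above 99.99%, which is the intuition behind claiming a higher sub-SLA for the logging repository than for the infrastructure beneath it.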

Designing for and justifying the decisions made to support an SLA is one of the more trying parts of any architecture, at least in my mind. Understanding how the decisions made influence the SLA goals of the design, positively or negatively, is something every architect needs to do. This is one area where I was weak during my previous VCDX defense and was not able to accurately articulate. After spending significant time thinking through the key points of my current design, I have definitely learned more and have been able to understand what effects the choices I am making have.

The opinions expressed above are my own and as I have not yet acquired my VCDX certification, these opinions may not be shared by those who have.

 

Read More