VMware SDDC Failure Scenarios
A considerable part of the VCDX certification, as well as of any well-thought-out architecture, is the Business Continuity / Disaster Recovery requirements and design decisions. A myriad of monitoring tools and services can be leveraged, to varying degrees, to monitor physical and virtual infrastructure. Over the course of my career, I have leveraged nearly a dozen different tools for monitoring and alerting. This article is not focused on recommending one tool over another; choose what is best for your organization and make it work for your needs.
This article will highlight several of the most common failure scenarios and draw attention to how the VMware SDDC will assist in providing visibility into these scenarios.
Please do not consider this list exhaustive. Depending on hardware, data center conditions or other environmental factors, your results may vary.
As you can see, there are a fair number of common potential failure scenarios within any VMware SDDC environment. Many depend on the type of hardware purchased, the number of single points of failure (SPOFs) within that hardware and the quality of its components. This list isn't even exhaustive: fan failures, backplane failures in blade chassis and other hardware faults can occur too.
At a minimum, every VMware SDDC environment should leverage the built-in alarms within vCenter Server and have them configured to send out notifications (SNMP or SMTP) to a monitoring tool.
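One way to check that baseline is to audit the alarm inventory for alarms that are enabled but would never notify anyone. The sketch below is a minimal illustration in Python and makes several assumptions: the `AlarmDefinition` class is a hypothetical, simplified record (not the vSphere API object model), the `actions` labels `"snmp"` and `"smtp"` are made up for the example, and the sample inventory is illustrative data rather than an export from a real vCenter Server.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified view of an alarm definition. This is NOT the
# vSphere API object model, just enough structure to illustrate the audit.
@dataclass
class AlarmDefinition:
    name: str
    enabled: bool = True
    actions: List[str] = field(default_factory=list)  # e.g. ["snmp"], ["smtp"]

# Action labels that count as "someone gets notified" (assumed names).
NOTIFY_ACTIONS = {"snmp", "smtp"}

def alarms_without_notification(alarms):
    """Return names of enabled alarms that have no SNMP or SMTP action."""
    return [
        alarm.name
        for alarm in alarms
        if alarm.enabled and not NOTIFY_ACTIONS.intersection(alarm.actions)
    ]

# Illustrative sample data only.
inventory = [
    AlarmDefinition("Host connection and power state", actions=["snmp"]),
    AlarmDefinition("Datastore usage on disk", actions=[]),
    AlarmDefinition("Host hardware fan status", enabled=False),
    AlarmDefinition("vSphere HA host status", actions=["smtp", "snmp"]),
]

silent = alarms_without_notification(inventory)
print(silent)
```

In practice the inventory would come from whatever export or API client your organization already uses; the point of the sketch is only the check itself, which flags alarms that can fire without reaching a monitoring tool.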
Regardless of which tool(s) your organization is leveraging to monitor the VMware SDDC, runbooks or documentation should exist for every failure scenario your organization feels it is at risk for. These runbooks can become the foundation for writing automation that auto-remediates the alerts or alarms causing the most headaches for your operational teams.
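As a rough illustration of how runbooks can grow into automation, the sketch below keeps a registry that maps each failure scenario to a handler and falls back to escalation when no runbook exists. Everything named here is an assumption for the example: the scenario keys (`host_not_responding`, `datastore_usage_high`), the host and datastore names, and the handler bodies, which only return a description of the step instead of touching any infrastructure.

```python
from typing import Callable, Dict

# Registry mapping a failure scenario name to its runbook handler.
RUNBOOKS: Dict[str, Callable[[str], str]] = {}

def runbook(scenario: str):
    """Decorator that registers a handler for one failure scenario."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        RUNBOOKS[scenario] = func
        return func
    return register

# Hypothetical scenarios; real handlers would call your own tooling.
@runbook("host_not_responding")
def host_not_responding(target: str) -> str:
    return f"check management network and hardware health for {target}"

@runbook("datastore_usage_high")
def datastore_usage_high(target: str) -> str:
    return f"review snapshots and reclaim space on {target}"

def handle_alert(scenario: str, target: str) -> str:
    """Run the matching runbook, or escalate if none is documented."""
    handler = RUNBOOKS.get(scenario)
    if handler is None:
        return f"no runbook for '{scenario}'; escalate {target} to on-call"
    return handler(target)

print(handle_alert("host_not_responding", "esx01"))
print(handle_alert("fan_failure", "chassis-2"))
```

The explicit fallback is the useful part of the design: any alert that lands in the escalation branch is a failure scenario that still needs a documented runbook.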