VMware SDDC Failure Scenarios

Recently I have been having conversations about operationalizing SDDC environments: what to monitor and which failure scenarios are most common. A myriad of monitoring tools and services can be leveraged, to varying degrees, to monitor physical and virtual infrastructure. Over the course of my career I’ve used nearly a dozen different tools for monitoring and alerting, and this article is not going to recommend one over another. Choose what is best for your organization and make it work for your needs.

That being said, I wanted to highlight several of the most common failure scenarios and show how the vSphere stack (ESXi or vCenter Server) provides visibility into them.

Please do not consider this list exhaustive. Depending on the hardware, data center conditions or other environmental factors, your results may vary.

Failure Scenario – CPU Fault

Failure Scenario – Memory DIMM

Failure Scenario – Motherboard

Failure Scenario – Power Supply

Failure Scenario – Network Controller

Failure Scenario – HBA Controller

Failure Scenario – vSAN Cache Disk

Failure Scenario – vSAN Capacity Disk

Failure Scenario – SFP Faults

Failure Scenario – Fiber Cable Faults

Failure Scenario – Network Switch

Failure Scenario – Purple Screen of Death (PSOD)

Failure Scenario – Disconnect ESXi Host

Final Thoughts

As you can see, there are a fair number of common failure scenarios within any VMware SDDC environment. Many depend on the type of hardware purchased, the number of single points of failure (SPOFs) within that hardware, and the quality of the components. This list isn’t even exhaustive: fan failures, backplane failures in rack-mount servers, and other potential failures exist too.

At a minimum, every VMware SDDC environment should leverage the built-in alarms within vCenter Server and have them configured to send out notifications (email or SNMP) to a monitoring tool.
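One way to check that minimum is to audit which enabled alarms have no notification action attached. The sketch below is illustrative only: the `AlarmDefinition` shape, the alarm names, and the action labels are hypothetical stand-ins for whatever your own export of the vCenter Server alarm definitions looks like, and it does not call any vSphere API.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmDefinition:
    """Hypothetical, simplified view of one exported alarm definition."""
    name: str
    enabled: bool = True
    actions: list = field(default_factory=list)  # e.g. ["email"] or ["snmp"]


# Action labels treated as "notifies someone" (assumed labels, not vSphere names).
NOTIFY_ACTIONS = {"email", "snmp"}


def alarms_without_notification(alarms):
    """Return the names of enabled alarms that have no email or SNMP action."""
    return [
        alarm.name
        for alarm in alarms
        if alarm.enabled and not NOTIFY_ACTIONS.intersection(alarm.actions)
    ]


# Placeholder inventory; replace with data exported from your environment.
inventory = [
    AlarmDefinition("Host power supply status", actions=["email"]),
    AlarmDefinition("Host memory status"),
    AlarmDefinition("Host connection state", actions=["snmp"]),
    AlarmDefinition("Lab test alarm", enabled=False),
]

silent = alarms_without_notification(inventory)
print(silent)
```

Any alarm name this prints would trigger in vCenter Server without ever reaching the monitoring tool.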

Regardless of which tool(s) your organization leverages for monitoring the VMware SDDC, run books or documentation should exist for every failure scenario the organization expects to encounter. These run books can become the foundation for writing automation that auto-remediates the alerts or alarms causing the most headaches for your operations team.
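As a minimal sketch of that foundation, a lookup table can tie each failure scenario to its run book, so that an incoming alert always points an operator (or later, an automation job) to a documented procedure. The scenario keys reuse headings from this article; the file paths, the host name, and the `route_alert` helper are hypothetical.

```python
# Map failure scenarios to run book locations (paths are placeholders).
RUNBOOKS = {
    "Power Supply": "runbooks/power-supply.md",
    "Memory DIMM": "runbooks/memory-dimm.md",
    "vSAN Cache Disk": "runbooks/vsan-cache-disk.md",
    "Purple Screen of Death (PSOD)": "runbooks/psod.md",
}


def route_alert(scenario, host):
    """Return the next step for an alert: a run book to follow, or an escalation."""
    runbook = RUNBOOKS.get(scenario)
    if runbook is None:
        # A missing entry is itself a finding: this scenario needs documentation.
        return f"{host}: no run book for '{scenario}', escalate to on-call"
    return f"{host}: follow {runbook}"


print(route_alert("Power Supply", "esxi-01"))
print(route_alert("Fan Failure", "esxi-02"))
```

Scenarios that fall through to the escalation path are good candidates for the next run book to write.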
