Troubleshooting Tiered Tragedy: A Peek Into Failure
Failure is complicated. Sometimes an incident reveals latent failures that have been sitting dormant in your systems, waiting for the right combination of factors to activate them. In this talk, Jeff Smith will walk through a real failure scenario and the process Centro uses to highlight issues that go beyond the life cycle of an outage. We’ll walk through the importance of looking into signals before they become catastrophic and of ensuring your team has the capacity to do so. We’ll examine how monitoring the same system from multiple vantage points can help avoid confusion and bring clarity during an incident, how the Product organization plays a vital role in protecting system uptime, and lastly how a collaborative culture can decrease your Mean Time to Recovery.
What will the audience learn from this talk?
The audience will learn practical troubleshooting steps to take when encountering an issue, along with the pitfalls of working without a diverse suite of tools at each layer. The audience will also learn tips for reading, prioritizing, and resolving early warning signs to prevent outages before they occur. Lastly, we’ll learn how a company’s workflow can have a hidden impact on accountability and responsibility for system stability.
Does it feature code examples and/or live coding?
Prerequisite attendee experience level: