Resilient Architectures - Devoxx Poland 2017
Matt Stine, an expert in resilient architectures, discusses the critical need for building systems that can withstand and recover from failures. Matt opens the talk by sharing anonymized technology headlines that highlight the real-world impact of system failures on businesses, such as revenue loss and operational disruptions. Matt emphasizes that the traditional approach of preventing mistakes through extensive testing and approval processes often fails, and that it can create heavy, slow processes that hinder quick recovery.
Instead, Matt advocates for embracing failure and focusing on reducing the mean time to recovery (MTTR) rather than trying to prevent failures entirely. Matt introduces the concept of resilient architectures, which enhance observability, leverage resiliency patterns, and embrace chaos. With better observability, teams can measure and monitor system health using service level indicators (SLIs) and service level objectives (SLOs). Matt also discusses practical tools and techniques for implementing resiliency patterns such as timeouts, retries, bulkheads, and circuit breakers to build more robust systems.
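To make two of the patterns named above concrete, here is a minimal Python sketch of a circuit breaker combined with retries and exponential backoff. It is an illustration under stated assumptions, not code from the talk: the names `CircuitBreaker`, `CircuitOpenError`, and `retry`, and the default thresholds, are hypothetical. Timeouts and bulkheads are not shown; in practice a timeout is usually set on the underlying client call itself.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.clock = clock                  # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise the breaker is half-open: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit would defeat its purpose
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller might wrap a remote call as `retry(lambda: breaker.call(fetch_profile, user_id))`, where `fetch_profile` stands in for any function that can fail. The retries absorb brief glitches, while the breaker stops further calls once the dependency looks persistently unhealthy.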
Finally, Matt highlights the importance of practicing failure through game day exercises and chaos engineering. By regularly simulating failures in a controlled environment, organizations can improve their ability to respond to real incidents, ensuring that both the system and its maintainers are better prepared for unexpected disruptions. Matt concludes by reiterating the need to stop trying to prevent mistakes, embrace failure, and continually enhance system resilience through proactive measures and continuous learning.
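As a rough illustration of simulating failures in a controlled way, the Python sketch below wraps a function so that a configurable fraction of calls fails on purpose. It is a hypothetical example, not tooling from the talk: `with_chaos`, `InjectedFault`, and the `enabled` switch are assumed names, and a real game day would typically rely on dedicated fault-injection tooling with a clearly limited blast radius.

```python
import random


class InjectedFault(Exception):
    """A deliberate failure raised by the chaos wrapper, not by the wrapped function."""


def with_chaos(func, failure_rate, rng=random.random, enabled=lambda: True):
    """Return a wrapper around func that fails on purpose at roughly `failure_rate`.

    `rng` returns a float in [0, 1) and is injectable so experiments can be made
    deterministic. `enabled` acts as a kill switch so injection can be turned off
    immediately during an exercise.
    """
    def wrapper(*args, **kwargs):
        if enabled() and rng() < failure_rate:
            raise InjectedFault("injected failure for game-day exercise")
        return func(*args, **kwargs)
    return wrapper
```

For example, `with_chaos(lookup_price, failure_rate=0.05)` would make about 5% of calls to a stand-in `lookup_price` function fail, which lets a team observe whether alerts fire and fallbacks behave as expected.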