Case Study: The Key - Site Reliability Engineering

Site Reliability Engineering (SRE) (2016/17) - Technical Architect

The Customer: Whole Portfolio

The Users: Development team

The Challenge

The development team kept hitting the same platform issues. Continuous Integration/Continuous Deployment (CI/CD) pipeline failures had become common, often causing downtime for our customers and eroding their trust in our products.

The situation came to a head when we risked losing a major contract because we were not meeting its 99.999% uptime Service Level Agreement (SLA). Our reactive, firefighting approach was clearly unsustainable, and a fundamental change was needed to ensure the reliability and quality of our services.
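
To put the 99.999% figure in perspective, the short sketch below converts an uptime percentage into the downtime it permits. The function name and the measurement windows (a 365-day year and a 30-day month) are assumptions for illustration; the contract's actual measurement window is not stated here.

```python
from decimal import Decimal

SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(sla_percent: str, period_days: int) -> Decimal:
    """Return the downtime (in seconds) an uptime SLA permits over a period.

    sla_percent is passed as a string (e.g. "99.999") so Decimal can
    represent it exactly, avoiding binary floating-point rounding.
    """
    unavailable_fraction = (Decimal(100) - Decimal(sla_percent)) / Decimal(100)
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical windows: the real SLA window is an assumption here.
    print(allowed_downtime_seconds("99.999", 365))  # seconds per 365-day year
    print(allowed_downtime_seconds("99.999", 30))   # seconds per 30-day month
```

Under these assumptions, "five nines" leaves roughly 5 minutes 15 seconds of downtime per year, or about 26 seconds per 30-day month, which is why a single failed deployment could put the SLA at risk.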

The Approach

To address these systemic issues, I proposed and led the initiative to establish the organization's first Site Reliability Engineering (SRE) team. This wasn't just about creating a new team; it was about introducing a new mindset focused on proactive, data-driven reliability.

Our approach included several key initiatives:

Improving Observability: We implemented comprehensive monitoring and logging to get real-time insights into system health, moving from reactive problem-solving to proactive issue detection.
Enhancing CI/CD Processes: We re-architected our deployment pipelines, introducing feature flags and blue-green deployments. Blue-green deployments let us roll back instantly if issues were detected, and feature flags decoupled deployment from release, so new features could be released more safely.
Empowering the Team: The SRE team was given the autonomy and budget to address long-term technical debt, with dedicated time for architectural improvements, automation, and tooling to improve developer productivity and system resilience.
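
The two deployment techniques named above can be sketched in a few lines. This is a minimal, in-memory illustration and not the pipeline we actually built: the class names, the "blue"/"green" labels, and the hash-based percentage rollout are all assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def cut_over(self) -> str:
        """Point traffic at the idle environment and return the new live one."""
        self.live = self.idle
        return self.live

    def rollback(self) -> str:
        """Rolling back is the same pointer flip in the other direction."""
        return self.cut_over()


@dataclass
class FeatureFlags:
    """Maps a flag name to the percentage of users who should see it."""

    rollout_percent: Dict[str, int] = field(default_factory=dict)

    def is_enabled(self, flag: str, user_id: str) -> bool:
        # Unknown flags default to off, so deployed code stays dark
        # until someone explicitly releases it.
        percent = self.rollout_percent.get(flag, 0)
        # Hash flag + user so each user lands in a stable bucket (0-99)
        # and keeps the same experience across requests.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:2], "big") % 100
        return bucket < percent


if __name__ == "__main__":
    router = BlueGreenRouter()
    print(router.cut_over())   # new version goes live on the idle environment
    print(router.rollback())   # problem detected: flip traffic back

    flags = FeatureFlags({"new-checkout": 10})  # hypothetical flag name
    print(flags.is_enabled("new-checkout", "user-42"))
```

The design point is that the deployment (the cut-over) and the release (raising a flag's percentage) are separate switches: code can be live in production while the feature remains off for every user.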

The Outcome

The introduction of the SRE team and its practices had a transformative impact. We met and exceeded the 99.999% uptime target, securing the critical contract. More importantly, we fostered a culture of reliability and engineering excellence. Developer satisfaction increased as developers spent less time fighting operational fires and more time building value. The platform became significantly more stable, which led to a dramatic reduction in customer-facing incidents and restored trust in our services.

Harry Hunter