- Satya Narayanan Sundararajan
In the ever-evolving landscape of technology, ensuring the reliability and stability of digital services is paramount. In this article, we'll delve into the fundamentals of Site Reliability Engineering, its key principles, and how it has become a cornerstone for DNB in striving to deliver seamless and reliable user experiences.
Understanding Site Reliability Engineering (SRE):
Site Reliability Engineering is a discipline that combines aspects of software engineering with IT operations. Coined by Google, SRE emerged as a response to the challenges posed by rapidly growing and complex systems. Unlike traditional operations teams, SRE teams are composed of individuals with a software engineering background who apply their skills to solve operational problems. The primary goal of SRE is to create scalable and highly reliable software systems by applying software engineering principles to infrastructure and operations problems.
Reliability Hierarchy as proposed by Google Site Reliability Engineering.
Key Cornerstones of Site Reliability Engineering:
1. Service Level Objectives (SLOs):
At the heart of SRE is the concept of Service Level Objectives. SLOs are specific, quantifiable metrics that define the desired level of reliability for a service. SRE teams work to ensure that these objectives are met and strive to strike a balance between system reliability and feature development.
How to create SLOs:
- Identify the critical services that matter most to the users
- Engage with product and business teams to set SLO targets and the measurement period
- Identify and map the technical components that deliver those services
- Determine the metrics to use as service-level indicators (SLIs) to track the user experience of those services
- Build dashboards to visualise SLIs, SLOs, and error budgets
- Shift from alerting on metrics to alerting on SLOs
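To make the relationship between an SLI and an SLO concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (good requests divided by total requests); the function names, the 99.9% target, and the request counts are illustrative assumptions, not figures from our services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float     # measured fraction of good events in the period
    target: float  # SLO target agreed with product and business teams
    met: bool      # True when the SLI is at or above the target


def availability_sli(good_events: int, total_events: int) -> float:
    """Fraction of events that were good; no traffic counts as fully available."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def evaluate_slo(good_events: int, total_events: int, target: float) -> SloReport:
    """Compare the measured SLI with the SLO target for one measurement period."""
    sli = availability_sli(good_events, total_events)
    return SloReport(sli=sli, target=target, met=sli >= target)


# Illustrative numbers only: 1,000,000 requests, 588 of them failed.
report = evaluate_slo(good_events=999_412, total_events=1_000_000, target=0.999)
print(f"SLI={report.sli:.4%} target={report.target:.1%} met={report.met}")
```

In practice the event counts would come from whatever telemetry backs the dashboards; the comparison itself stays this simple.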
2. Error Budgets:
SRE introduces the notion of an error budget, which represents the acceptable amount of downtime or errors within a given timeframe. By defining and managing error budgets, teams can make informed decisions about whether to prioritize new feature development or focus on stability.
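The arithmetic behind an error budget is short enough to sketch. The example below assumes a time-based budget: with a 99.9% target over a 30-day window, the budget is 0.1% of 43,200 minutes, or about 43.2 minutes. The function names and the simple "budget left means features, budget spent means stability" rule are illustrative assumptions; real policies are usually more nuanced.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed 'bad' minutes in the window for a given SLO target."""
    return (1.0 - slo_target) * window_days * MINUTES_PER_DAY


def budget_remaining_fraction(slo_target: float, window_days: int,
                              bad_minutes: float) -> float:
    """Share of the budget still unspent; negative when the budget is overspent."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0:
        return 0.0
    return 1.0 - bad_minutes / budget


def recommended_focus(remaining_fraction: float) -> str:
    """Hypothetical policy: ship features while budget remains, else stabilise."""
    return "feature development" if remaining_fraction > 0 else "stability"


budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining_fraction(0.999, 30, bad_minutes=21.6)
print(f"budget={budget:.1f} min, remaining={remaining:.0%}, "
      f"focus={recommended_focus(remaining)}")
```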
3. Automation:
SRE emphasizes the importance of automation in managing and operating systems. By automating repetitive tasks and processes, SRE teams increase efficiency, reduce human error, and ensure consistency across environments. For example, we have built real-time dashboards that show the status of our services, which has reduced the number of manual checks.
4. Incident Response and Blameless Root Cause Analysis:
SRE places a strong emphasis on learning from incidents. When issues arise, SRE teams conduct a thorough, blameless root cause analysis to understand what went wrong and implement preventive measures. This continuous feedback loop ensures that systems become more resilient over time.
5. Monitoring and Alerting:
Comprehensive monitoring and alerting are critical components of SRE. By implementing effective monitoring solutions, teams can proactively identify and address issues before they impact users. Well-defined alerting systems enable rapid response to incidents, minimizing downtime.
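One common way to alert on SLOs rather than on raw metrics is burn-rate alerting: page only when the error budget is being consumed much faster than planned. The sketch below is an assumption about how this could look, not a description of our alerting setup. The 14.4 threshold follows the convention in Google's SRE Workbook (at that rate, one hour consumes 2% of a 30-day budget), and checking a long and a short window together is there to avoid paging on issues that have already stopped.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn budget faster than the threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


# Illustrative error ratios: 2% over the last hour, 3% over the last 5 minutes.
print(should_page(0.02, 0.03, slo_target=0.999))   # sustained, still ongoing
print(should_page(0.02, 0.001, slo_target=0.999))  # spike that has recovered
```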
Observability is about understanding what's happening inside a system by analysing its external outputs. In the world of software, it helps developers and operations teams keep a real-time check on how applications are performing, behaving, and staying healthy. Unlike traditional monitoring, observability goes the extra mile by focusing on collecting and analysing a variety of data types, such as logs, metrics, traces, and other telemetry.
Observability seeks to unravel the mysteries of digital ecosystems, offering valuable insights for users to make well-informed decisions. The effectiveness of these insights hinges on the quality of the data they're based on. When diverse data sets come together harmoniously, decision-making becomes more informed, especially in crucial areas like application architecture, user experience (UX), and pivotal business choices, where context adds depth to the insights provided.
Recognizing that our teams are experiencing increasing pressure to monitor and manage situations in their multi-cloud and diverse technology platforms, it's clear that the complexity and scale of dynamic system designs only continue to grow. As a result, our technology teams are seeking greater insight into these increasingly complex and diverse computing systems.
Why it is important, and the benefits realised
In the realm of IT and cloud services, observability means being able to evaluate a system's current state based on the logs, measurements, and traces it produces. With the challenges posed by cloud-native setups and the difficulty of pinpointing the root causes of errors or anomalies, observability has become increasingly crucial in recent years.
The teams that have embraced observability have seen numerous benefits, such as:
- Faster Detection and Resolution: Issues are detected and resolved more quickly
- Proactive Performance Tuning: Performance bottlenecks are identified and addressed before they affect users
- Enhanced User Experience: Observability allows teams to monitor user interactions and identify areas for improvement, leading to a better overall user experience
- Efficient Capacity Planning: By analysing metrics and trends, teams can make informed decisions about resource allocation, scaling, and capacity planning, optimizing infrastructure usage
- Increased Reliability: Observability helps in early detection of issues, minimizing the impact on users and ensuring a more reliable and resilient system
We have built value chain dashboards that provide a comprehensive visualisation of the end-to-end flow of user requests. These give the teams insight across upstream and downstream dependencies, so anomalies are detected faster and business impact is mitigated.
Why Site Reliability Engineering Matters:
- Enhanced User Experience: SRE's focus on reliability translates into improved user experiences. Systems that are highly available and performant contribute to customer satisfaction and trust.
- Efficient Resource Utilization: SRE's emphasis on automation and efficient incident response enables organizations to make optimal use of resources. This results in cost savings and a more streamlined operational environment.
- Agile and Resilient Systems: SRE promotes agility by allowing teams to quickly adapt to changing requirements. Moreover, the discipline instils resilience in systems, reducing the impact of failures.
How can we get customers to choose us every time? Reliability is the key!
For customers to choose us, reliability must be the #1 feature of a product. We have adopted SRE practices to build and operate reliable services for our users. At DNB, we started the SRE journey with one team in 2022. Following strong results and improved reliability, several other teams adopted SRE practices and established SRE teams, and more than 10 teams have now adopted SRE. Some of the benefits realised are:
a. Faster detection and resolution: Time to detect and time to resolve incidents improved by 60%. This was achieved by embedding observability into the services and enabling end-to-end value chain monitoring of the critical services.
b. Measuring the right signals: We track the signals that indicate user experience and service health, the golden signals (errors, latency, availability, and saturation).
c. Toil reduction: We automated manual, repeatable, tactical tasks that add no lasting value. For example, one team that previously staffed operations 24/7 moved to 16/5 after starting its SRE journey.
d. Automation of health checks: Automating health checks is a critical aspect of SRE, ensuring continuous monitoring and stability of systems.
e. Batch optimisation: We systematically improved performance and reliability by managing and executing multiple tasks in batches more efficiently, streamlining operational workflows.
f. CMDB accuracy: We improved the accuracy of the CMDB (Configuration Management Database).
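For readers who want to track the detection and resolution times mentioned in item a, here is a minimal sketch of how mean time to detect (MTTD) and mean time to resolve (MTTR) could be computed from incident records. The `Incident` fields and the sample timestamps are illustrative assumptions, and MTTR is measured here from the start of the incident to its resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started: datetime   # when the user impact began
    detected: datetime  # when an alert fired or someone noticed
    resolved: datetime  # when the service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of impact and detection."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between the start of impact and resolution."""
    return timedelta(seconds=mean(
        (i.resolved - i.started).total_seconds() for i in incidents))


# Made-up sample incidents, for illustration only.
sample = [
    Incident(datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 10),
             datetime(2024, 1, 8, 11, 0)),
    Incident(datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 20),
             datetime(2024, 1, 9, 14, 40)),
]
print("MTTD:", mean_time_to_detect(sample))
print("MTTR:", mean_time_to_resolve(sample))
```

Comparing these two numbers before and after an observability rollout is one way to quantify the improvement.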
Reliability To the Next Level
We are also working towards taking reliability to the next level in 2024. While we have established conventional best practices for SRE, it's vital to assess our SRE maturity because it can help us:
- Figure out the direction that we need to take in the future
- Build the next course of action
- Structure our SRE roadmap better
- Understand the current maturity of teams and tech family requirements
- Identify improvement areas in the performance and capabilities
We organize workshops with the teams where we assess their current maturity and identify the areas to focus on in the upcoming year. We evaluate 15 different questions related to SLOs and SLIs, MTTD, MTTR, self-healing, chaos engineering, etc., and rate the performance against the attributes of five maturity levels: Absent, Reactive, Proactive, Strategic, and Visionary.
Team culture usually waxes and wanes between levels, as it takes effort to maintain a strategic reliability culture. Therefore, we always strive to move forward from one level to the next. The more the organization embraces, engages with, and emphasizes reliability as a key feature, the more maintenance costs decrease.
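As an illustration of how such an assessment could be summarised, the sketch below maps the five maturity levels to numbers, derives an overall level, and picks the lowest-rated topics as focus areas. The aggregation rule (rounding the average down), the topic names, and the sample ratings are hypothetical; they are not the scoring method used in our workshops.

```python
from enum import IntEnum


class Maturity(IntEnum):
    ABSENT = 1
    REACTIVE = 2
    PROACTIVE = 3
    STRATEGIC = 4
    VISIONARY = 5


def overall_level(ratings: dict[str, Maturity]) -> Maturity:
    """Hypothetical rule: the average rating, rounded down to a whole level."""
    return Maturity(int(sum(ratings.values()) / len(ratings)))


def focus_areas(ratings: dict[str, Maturity], n: int = 2) -> list[str]:
    """The n lowest-rated topics; ties keep their original order."""
    return sorted(ratings, key=lambda topic: ratings[topic])[:n]


# Made-up ratings for a single team, for illustration only.
team_ratings = {
    "SLOs & SLIs": Maturity.PROACTIVE,
    "MTTD": Maturity.REACTIVE,
    "MTTR": Maturity.PROACTIVE,
    "Self-healing": Maturity.ABSENT,
    "Chaos engineering": Maturity.REACTIVE,
}
print(overall_level(team_ratings).name, focus_areas(team_ratings))
```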
Roadmap Ahead for Building and Operating Reliable Services:
While we have started the SRE journey, we have an interesting roadmap ahead to explore the following areas and improve the reliability of the services for our users.
- Observability of the services
- Improving and maintaining the user experience
- Automation of toil, with a cap of no more than 20% of time spent on toil
- Automated remediation capabilities
- Exploring the Generative AI capabilities
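The 20% toil cap in the list above can be tracked with very little code. The sketch below assumes a team logs its hours per category each week; the category names, the sample hours, and the function names are illustrative assumptions.

```python
TOIL_CAP = 0.20  # at most 20% of time spent on toil


def toil_share(hours_by_category: dict[str, float], toil_key: str = "toil") -> float:
    """Fraction of all logged hours that went to toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    return hours_by_category.get(toil_key, 0.0) / total


def over_cap(hours_by_category: dict[str, float], cap: float = TOIL_CAP) -> bool:
    """True when the toil share exceeds the cap."""
    return toil_share(hours_by_category) > cap


# Made-up weekly log for one engineer, for illustration only.
week = {"toil": 10, "engineering": 26, "meetings": 4}
print(f"toil share={toil_share(week):.0%}, over cap={over_cap(week)}")
```

A share above the cap is a signal to prioritise automation work for the tasks that generate the most toil.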
Conclusion:
Site Reliability Engineering represents a paradigm shift in the way organizations approach system reliability and operational excellence. By combining software engineering principles with operational expertise, SRE empowers teams to create and maintain robust, scalable, and highly reliable systems. As technology continues to advance, embracing Site Reliability Engineering is not just a best practice; it is a strategic imperative for organizations aiming to thrive in the digital era.