Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operations to create scalable and highly reliable systems. The SRE process focuses on balancing the need for reliability with the pace of innovation. This document outlines the key practices, principles, and workflows involved in the SRE process.
Before beginning any practice or exercise, clone this repository to your local machine using the following command:
```bash
git clone https://github.com/cguilencr/sre-abc-training.git
cd sre-abc-training
```
This ensures you have access to all the required files and directory structures needed to complete the exercises.
- Software Development Lifecycle (SDLC)
- Core Principles
- SLIs, SLOs, SLAs, and Error Budgets
- Monitoring and Incident Management
- Operational Readiness Reviews (ORR)
- Change Management and Automation
- Cost Optimization
- Continuous Improvement
The Software Development Lifecycle (SDLC) is a structured approach to software development. It involves multiple phases that ensure software is designed, developed, tested, and deployed systematically. SRE plays a critical role in ensuring the reliability and scalability of the systems throughout the SDLC.
✏️ Practice #1 - Python app
Apply the changes in Practice #1 to create a Python REST API. This application will serve as a practical case for demonstrating the integration of SRE at each phase, from planning to maintenance, including the implementation of SLOs, monitoring of metrics, and deployment automation.
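The repository ships its own application for this exercise; as a dependency-free illustration of what a minimal REST API looks like, the sketch below uses only the Python standard library to expose a `/health` endpoint (the route and response body are assumptions, not taken from the exercise):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal REST handler exposing a /health endpoint returning JSON."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging for the example

# To run locally:
# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

A health endpoint like this becomes the natural probe target for the monitoring and synthetic checks introduced in later exercises.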
✏️ Practice #2 - App packaged as image
Apply the changes in Practice #2 to package the application as a Docker image, making it easier to deploy in multiple locations.
✏️ Practice #3 - App image into a registry
Apply the changes in Practice #3 to store the image in a remote registry (Docker Hub) and use it as the source for the application during deployments.
SRE integrates with SDLC during the deployment and maintenance phases, ensuring smooth releases and reliable operations post-deployment. This process involves proactive monitoring, incident management, and automation to minimize downtime and maintain high availability.
SRE is guided by several core principles that shape how operations are managed:
- Emphasize Reliability: Ensure that systems maintain high availability and performance.

  In this section, a Kubernetes cluster is used to run our application with 3 replicas so that if one fails, the other two can take over the load, increasing the availability of the service.
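Why 3 replicas help can be seen with back-of-envelope arithmetic. Assuming each replica is 99% available, failures are independent, and the service is up whenever at least one replica is up (both numbers are assumptions for illustration):

```python
# Availability of a service behind 3 replicas, assuming independent
# failures and that the service is up while at least one replica is up.
replica_availability = 0.99   # assumed per-replica availability
replicas = 3

# The service is down only when ALL replicas are down at the same time.
service_availability = 1 - (1 - replica_availability) ** replicas
print(round(service_availability, 6))   # 0.999999
```

Three modestly reliable replicas yield a far more reliable service than any single replica could, which is exactly the effect the Kubernetes Deployment exploits.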
✏️ Practice #4 - Running the app as a service
Apply the changes in Practice #4 to achieve an infrastructure like this one.
- Use SLIs and SLOs: Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure reliability.
- Blameless Postmortems: Learn from failures without blaming individuals, focusing on how systems can be improved.
- Automate Operations: Use automation to reduce manual interventions and improve system scalability.
✏️ Practice #4.1 - Running the app as a service
Apply the changes in Practice #4.1 to use an Infrastructure as Code (IaC) approach that automates the infrastructure setup with Ansible, instead of running commands directly in the shell.
Service Level Objectives (SLOs) are key metrics that define the expected performance and availability of a service. They are derived from Service Level Indicators (SLIs), which measure system behavior.
- Service Level Indicator (SLI): A metric that measures specific system performance attributes, such as latency, availability, or error rate.
- Service Level Objective (SLO): A target for acceptable performance of an SLI, e.g., 99.9% availability. SLOs aim to balance reliability and development speed.
- Service Level Agreement (SLA): A formal agreement with customers defining expected performance and penalties for unmet objectives.
- Error Budget: The allowable margin for failure within an SLO, which provides flexibility for feature development without excessive risk.
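The error budget follows directly from the SLO by simple arithmetic. For example, a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of allowed downtime:

```python
# Error budget implied by a 99.9% availability SLO over a 30-day window.
slo = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day month

error_budget_minutes = window_minutes * (1 - slo)
print(round(error_budget_minutes, 1))  # ~43.2 minutes of allowed downtime
```

Any downtime beyond that budget means the SLO is missed; any budget left over is room to ship risky changes.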
📈 SLOs list
Create a list of 4 SLOs to eventually use in the observability strategy. Consider adding a new SLO based on a percentile of HTTP request latencies. Percentiles measure how often certain performance thresholds are met for the majority of users. For instance, a 95th percentile latency of 100 ms means that 95% of requests are served within 100 ms, offering a realistic view of user experience by focusing on the performance seen by most requests, rather than averages or extreme cases alone.
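A percentile can be computed from raw latency samples with the nearest-rank method (a common convention; monitoring backends may interpolate differently):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))   # 1-based nearest rank
    return ordered[rank - 1]

# 100 samples with latencies 1..100 ms: 95% of requests finish in <= 95 ms.
samples = list(range(1, 101))
print(percentile(samples, 95))   # 95
```

Note that p95 is insensitive to the slowest 5% of requests, which is why high-percentile SLOs describe typical user experience better than a mean skewed by outliers.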
In white-box monitoring, metrics, traces, and logs are the primary data sources for monitoring Service Level Indicators (SLIs), ensuring that Service Level Objectives (SLOs) are met. For example:
- Metrics measure uptime and latency, forming the basis for availability and performance SLOs.
- Traces reveal the journey of each request, ensuring SLOs related to request latency are achieved.
- Logs provide additional context for errors and anomalies, supporting SLO adherence by enabling detailed investigations when issues arise.
Monitoring is crucial for detecting issues early and responding swiftly to incidents.
- Set Up Monitoring Systems: Tools like Prometheus, Grafana, and OpenTelemetry provide insights into system performance using metrics, traces, and logs.
In this section, a Prometheus server is installed on the node that will eventually be used as a repository for traces and metrics.
✏️ Practice #5 - Including Prometheus in the cluster
Apply the changes in Practice #5 to achieve an infrastructure like this one.
In this section, a Grafana server is installed, which will allow data visualization from different sources, in this case, Prometheus. Eventually, it will be used to deploy an observability and monitoring strategy.
✏️ Practice #6 - Including Grafana in the cluster
Apply the changes in Practice #6 to achieve an infrastructure like this one.
- Metrics: Quantitative data points such as CPU usage, memory consumption, and request rates.
✏️ Practice #7 - Sharing node metrics
Apply the changes in Practice #7 to achieve an infrastructure like this one.
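Prometheus scrapes metrics over HTTP in a simple line-based text format. Real applications normally use the `prometheus_client` library to produce it; the sketch below renders the same exposition format by hand purely to show what a scrape returns (the metric name `app_requests_total` is made up, not from the exercises):

```python
# Render a dict of {metric_name: value} in the Prometheus text format
# that a /metrics endpoint serves to the Prometheus scraper.
def render_metrics(counters):
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")   # metadata line
        lines.append(f"{name} {value}")          # sample line
    return "\n".join(lines) + "\n"

print(render_metrics({"app_requests_total": 42}))
# # TYPE app_requests_total counter
# app_requests_total 42
```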
- Traces: Record the journey of requests as they flow through different services, useful for diagnosing performance bottlenecks.
✏️ Practice #8 - Sharing app traces
Apply the changes in Practice #8 to achieve an infrastructure like this one.
✏️ Practice #9 - Creating Metrics from Traces
Apply the changes in Practice #9 to achieve an infrastructure like this one.
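The core idea of a trace span is small: record a named unit of work together with its duration, and nest spans to reconstruct a request's journey. Production code would use the OpenTelemetry SDK; this standard-library sketch only illustrates the mechanism:

```python
import time
from contextlib import contextmanager

SPANS = []  # collected spans; a real tracer exports these to a backend

@contextmanager
def span(name):
    """Record the name and duration of a unit of work, like a trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name, "duration_s": time.perf_counter() - start})

# Nested spans: the inner one closes (and is recorded) first.
with span("handle_request"):
    with span("db_query"):
        time.sleep(0.01)

print([s["name"] for s in SPANS])   # ['db_query', 'handle_request']
```

Because each span carries its own timing, slow steps (here, the simulated `db_query`) stand out immediately when the spans are visualized.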
- Logs: Detailed records of system events that provide context and historical information during incidents.
✏️ Practice #10 - Sharing app logs
Apply the changes in Practice #10 to achieve an infrastructure like this one.
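Logs are much easier to query during an incident when each record is structured rather than free text. One common approach, sketched here with only the standard library, is to emit one JSON object per line so a log pipeline can parse fields without regexes:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed for order %s", "1234")
# {"level": "INFO", "logger": "app", "message": "payment failed for order 1234"}
```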
- Golden Signals: Monitoring should focus on four critical Golden Signals to track the health of a service:
- Latency: The time it takes to service a request.
- Traffic: The amount of demand or load being placed on the system (e.g., requests per second).
- Errors: The rate of failed requests.
- Saturation: How full a service's resources are, such as CPU or memory.
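Given a window of raw request records, all four signals reduce to simple aggregations. The field names and sample values below are illustrative, not taken from the exercises:

```python
# Compute the four Golden Signals from a window of request records.
requests = [
    {"latency_ms": 40,  "status": 200},
    {"latency_ms": 90,  "status": 200},
    {"latency_ms": 500, "status": 503},
    {"latency_ms": 60,  "status": 200},
]
window_seconds = 2
cpu_used, cpu_total = 6.0, 8.0   # sample resource figures

latency_avg = sum(r["latency_ms"] for r in requests) / len(requests)
traffic_rps = len(requests) / window_seconds
error_rate = sum(r["status"] >= 500 for r in requests) / len(requests)
saturation = cpu_used / cpu_total

print(latency_avg, traffic_rps, error_rate, saturation)
# 172.5 2.0 0.25 0.75
```

In practice these aggregations run inside Prometheus (as PromQL queries) rather than in application code; the point here is only what each signal measures.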
✏️ Practice #11 - Dashboard creation
Apply the changes in Practice #11.
- Define Alerts: Use alerts based on SLIs, metrics, traces, and logs to notify teams of potential problems before they affect customers.
✏️ Practice #12 - Alerts definition
Apply the changes in Practice #12 to create alerts like these for each SLO defined above.
- Incident Response: When issues occur, follow a clear incident management process to resolve them quickly:
- Detect the issue using Golden Signals, monitoring, and alerting systems.
- Respond to the alert and acknowledge the incident.
- Mitigate the problem using workarounds or rollbacks to minimize customer impact.
- Document the incident for review and postmortem analysis.
✏️ Time to Detect, Time to Acknowledge and Time to Resolve
These metrics are important for measuring the performance of an SRE (Site Reliability Engineering) team. By monitoring them, the team can plan its work efficiently and improve its incident response over time.
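Each metric is just the elapsed time between two points in the incident timeline. The timestamps below describe one hypothetical incident:

```python
from datetime import datetime

# Timestamps for one (hypothetical) incident lifecycle.
failure_start = datetime(2024, 5, 1, 10, 0)
detected      = datetime(2024, 5, 1, 10, 4)   # alert fired
acknowledged  = datetime(2024, 5, 1, 10, 9)   # engineer took the page
resolved      = datetime(2024, 5, 1, 10, 39)  # mitigation in place

ttd = (detected - failure_start).total_seconds() / 60   # Time to Detect
tta = (acknowledged - detected).total_seconds() / 60    # Time to Acknowledge
ttr = (resolved - failure_start).total_seconds() / 60   # Time to Resolve

print(ttd, tta, ttr)   # 4.0 5.0 39.0
```

Averaged across many incidents, these become the MTTD/MTTA/MTTR figures a team tracks to see whether detection, paging, and mitigation are improving.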
✏️ Practice #13 - Automate runbooks with Ansible
Apply the changes in Practice #13 to achieve an infrastructure like this one.
In this setup, SREs are responsible for ensuring reliability using tools like Grafana, Jaeger, and AWX, while customers interact with the application.
📈 Synthetic Monitoring
Previously, Ansible was used to automate a task. Now, you must use it to create a synthetic monitor that simulates a browser accessing the endpoint. To achieve this, you need to create an Ansible script that runs on AWX and connects to the endpoint, ensuring the status code is 200.
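The exercise asks for an Ansible playbook running on AWX (which would typically use Ansible's `uri` module for the HTTP check). The probe's logic, however, is tiny, and can be sketched in plain Python to make the acceptance criterion concrete (the URL below is a placeholder):

```python
import urllib.request

def check_endpoint(url, timeout=5):
    """Synthetic probe: True when the endpoint answers with HTTP 200,
    False on any connection error, timeout, or non-200 status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# Example usage against a hypothetical endpoint:
# check_endpoint("http://localhost:8080/health")
```

Running a probe like this on a schedule, from outside the cluster, verifies the user-facing path end to end rather than just individual components.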
Operational Readiness Reviews ensure that services are ready for production deployment. ORRs evaluate the robustness of infrastructure, the maturity of monitoring, and the ability to handle failures.
- Ensure service has monitoring and alerting in place.
- Review the capacity plan and ensure scaling capabilities.
- Confirm all dependencies are resilient to failures.
- Review disaster recovery strategies.
📈 ORR
Previously, in section 4: SLOs, a list of 4 SLOs was created. Now, based on the information in the Operational Readiness Review (ORR), please attach the results of a review conducted by you in the same document.
Automating repetitive tasks and following structured change management processes helps reduce risk during deployments.
- Infrastructure as Code (IaC): Use tools like Terraform or Kubernetes for automated infrastructure management.
✏️ Practice #16 - Helm chart PENDING
Apply the changes in Practice #16.
- CI/CD Pipelines: Implement continuous integration and continuous delivery pipelines to deploy changes in a controlled manner.
✏️ Practice #17 - Github actions PENDING
Apply the changes in Practice #17.
✏️ Practice #18 - Argo CD PENDING
Apply the changes in Practice #18.
- Automate Rollbacks: Set up automated rollback strategies for failed deployments.
✏️ Practice #19 - Kubernetes rollback PENDING
Apply the changes in Practice #19.
- Perform Chaos Engineering: Test system resilience by simulating failures in a controlled way.
✏️ Practice #20 - PENDING
Apply the changes in Practice #20.
SRE focuses not only on reliability but also on ensuring efficient use of resources.
- Autoscaling: Adjust resources dynamically based on demand.
- Capacity Planning: Regularly review resource utilization and plan for growth.
- Optimize Cloud Usage: Ensure cloud services are provisioned based on actual needs.
SRE processes should continuously evolve based on feedback and lessons learned from incidents and performance reviews.
- Blameless Postmortems: Conduct post-incident reviews to identify root causes and improvements.
- Regularly Review SLOs: Ensure that SLOs remain aligned with business needs.
- Invest in Tooling: Continuously improve monitoring, alerting, and automation systems.
Pending: Terraform, Runbooks, Ansible, Argo CD, Helm charts, CI/CD
```bash
cd exercises3
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.0 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.0
```

```bash
cd exercises8
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.1 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.1
```

```bash
cd exercises10
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.23 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.23
```