SRE (Site Reliability Engineering) Process


Introduction

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to operations to create scalable and highly reliable systems. The SRE process focuses on balancing the need for reliability with the pace of innovation. This document outlines the key practices, principles, and workflows involved in the SRE process.


Getting Started

Before beginning any practice or exercise, clone this repository to your local machine using the following command:

git clone https://github.com/cguillencr/sre-abc-training.git
cd sre-abc-training

This ensures you have access to all the required files and directory structures needed to complete the exercises.


Table of Contents

  1. Software Development Lifecycle (SDLC)
  2. Core Principles
  3. SLIs, SLOs, SLAs, and Error Budgets
  4. Monitoring and Incident Management
  5. Operational Readiness Reviews (ORR)
  6. Change Management and Automation
  7. Cost Optimization
  8. Continuous Improvement

Software Development Lifecycle (SDLC)

The Software Development Lifecycle (SDLC) is a structured approach to software development. It involves multiple phases that ensure software is designed, developed, tested, and deployed systematically. SRE plays a critical role in ensuring the reliability and scalability of the systems throughout the SDLC.

✏️ Practice #1 - Python app

Apply the changes in Practice 1 to create a Python REST API. This application will serve as a practical case to demonstrate the integration of SRE at each phase, from planning to maintenance, including the implementation of SLOs, monitoring of metrics, and deployment automation.

Infra
✏️ Practice #2 - App packaged as image

Apply the changes in Practice 2 to package the application into a Docker image, which makes it easier to deploy in multiple locations.

Infra
✏️ Practice #3 - App image in a registry

Apply the changes in Practice 3 to store the image in a remote registry (Docker Hub), which serves as the source for the application during deployments.

Infra

SRE integrates with SDLC during the deployment and maintenance phases, ensuring smooth releases and reliable operations post-deployment. This process involves proactive monitoring, incident management, and automation to minimize downtime and maintain high availability.

Core Principles

SRE is guided by several core principles that shape how operations are managed:

  • Emphasize Reliability: Ensure that systems maintain high availability and performance.

    In this section, a Kubernetes cluster is used to run the application with 3 replicas so that if one fails, the other 2 can take over the load, thus increasing the availability of the service.

    ✏️ Practice #4 - Running the app as a service

    Apply the changes in Practice 4 to achieve an infrastructure like this one:

    Infra
  • Use SLIs and SLOs: Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure reliability.

  • Blameless Postmortems: Learn from failures without blaming individuals, focusing on how systems can be improved.

  • Automate Operations: Use automation to reduce manual interventions and improve system scalability.

    ✏️ Practice #4.1 - Running the app as a service

    Apply these changes in Practice 4.1 to use an Infrastructure as Code (IaC) approach to automate infrastructure setup with Ansible, instead of running commands directly in the shell.

    Infra
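
The replica argument under "Emphasize Reliability" can be made concrete with a small calculation. The sketch below is only an illustration: it assumes replicas fail independently and that the service stays healthy as long as at least 2 of the 3 replicas are up. The 99% per-replica availability and the function name `prob_at_least_k_up` are hypothetical and do not come from the practices.

```python
from math import comb


def prob_at_least_k_up(n: int, k: int, availability: float) -> float:
    """Probability that at least k of n replicas are up.

    Assumes every replica has the same availability and that
    failures are independent (a simplification).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return sum(
        comb(n, up) * availability**up * (1.0 - availability) ** (n - up)
        for up in range(k, n + 1)
    )


# Hypothetical per-replica availability of 99%.
single_replica = prob_at_least_k_up(1, 1, 0.99)
two_of_three = prob_at_least_k_up(3, 2, 0.99)
print(f"single replica: {single_replica:.6f}")
print(f"at least 2 of 3: {two_of_three:.6f}")
```

Real replicas often share failure causes (same node, same bad deployment), so the independence assumption makes this an optimistic estimate.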

SLIs, SLOs, SLAs, and Error Budgets

Service Level Objectives (SLOs) are targets that define the expected performance and availability of a service. They are set on Service Level Indicators (SLIs), which measure system behavior.

Key Concepts:

  1. Service Level Indicator (SLI): A metric that measures specific system performance attributes, such as latency, availability, or error rate.

  2. Service Level Objective (SLO): A target for acceptable performance of an SLI, e.g., 99.9% availability. SLOs aim to balance reliability and development speed.

  3. Service Level Agreement (SLA): A formal agreement with customers defining expected performance and penalties for unmet objectives.

  4. Error Budget: The allowable margin for failure within an SLO, which provides flexibility for feature development without excessive risk.
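
As a rough illustration of how an error budget follows from an SLO, the sketch below converts an availability target into allowed downtime over a window. The 30-day window, the 99.9% target, and the function names are assumptions chosen for the example, not values prescribed by this training.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window for an availability SLO (0 < slo <= 1)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after the observed downtime; negative means the SLO is breached."""
    return error_budget(slo, window) - downtime


# Hypothetical values: 99.9% availability over a 30-day window.
budget = error_budget(0.999, timedelta(days=30))
left = budget_remaining(0.999, timedelta(days=30), timedelta(minutes=10))
print(f"budget: {budget}, remaining after 10 min of downtime: {left}")
```

A team can compare the remaining budget against planned releases: plenty of budget left supports faster feature delivery, while an exhausted budget argues for prioritizing reliability work.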

📈 SLO list

Create a list of 4 SLOs to eventually use in the observability strategy. This is an example. Consider adding an SLO based on a percentile of HTTP request latencies. A percentile states the threshold that a given share of requests meets. For instance, a 95th percentile latency of 100ms means that 95% of requests are served within 100ms, which reflects the experience of most users better than an average or the extreme cases alone.
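
To make the percentile idea concrete, here is a minimal sketch using the nearest-rank definition, which is one of several common percentile definitions (monitoring tools may interpolate or estimate from histogram buckets instead). The sample latencies are invented for illustration.

```python
import math


def percentile_nearest_rank(values: list[float], p: int) -> float:
    """Return the p-th percentile (1..100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    # Smallest rank such that at least p% of the samples are <= the result.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 18, 20, 22, 25, 30, 35, 40, 45,
                50, 55, 60, 70, 80, 90, 95, 100, 250, 900]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
mean = sum(latencies_ms) / len(latencies_ms)
print(f"p50={p50}ms p95={p95}ms mean={mean:.1f}ms")
```

In this sample the mean is pulled up by a single very slow request, while the median and the 95th percentile describe what typical and nearly-all requests experience.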

White-Box Monitoring and SLOs

In white-box monitoring, metrics, traces, and logs are the primary data sources for monitoring Service Level Indicators (SLIs), ensuring that Service Level Objectives (SLOs) are met. For example:

  • Metrics measure uptime and latency, forming the basis for availability and performance SLOs.
  • Traces reveal the journey of each request, ensuring SLOs related to request latency are achieved.
  • Logs provide additional context for errors and anomalies, supporting SLO adherence by enabling detailed investigations when issues arise.
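
As an illustrative sketch of how a metric becomes an SLI and is checked against an SLO, the code below derives a request-based availability SLI from two counters and reports how much of the error budget has been used. The counter values, the 99.9% target, and the function names are assumptions for the example.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Share of requests that succeeded (request-based availability SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def budget_consumed(sli: float, slo: float) -> float:
    """Fraction of the error budget used: observed error rate / allowed error rate."""
    allowed_error_rate = 1.0 - slo
    if allowed_error_rate <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (1.0 - sli) / allowed_error_rate


# Hypothetical counters, e.g. read from the metrics backend for the SLO window.
sli = availability_sli(good_requests=999_400, total_requests=1_000_000)
consumed = budget_consumed(sli, slo=0.999)
print(f"SLI={sli:.4%} budget consumed={consumed:.0%} SLO met={sli >= 0.999}")
```

A consumed value above 1.0 means the SLO has been missed for the window.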

Monitoring and Incident Management

Monitoring is crucial for detecting issues early and responding swiftly to incidents.

Best Practices:

  • Set Up Monitoring Systems: Tools like Prometheus, Grafana, and OpenTelemetry provide insights into system performance using metrics, traces, and logs.

    In this section, a Prometheus server is installed on the node. It will eventually serve as a repository for traces and metrics.

    ✏️ Practice #5 - Including Prometheus in the cluster

    Apply the changes in Practice 5 to achieve an infrastructure like this one:

    Infra

    In this section, a Grafana server is installed, which will allow data visualization from different sources, in this case, Prometheus. Eventually, it will be used to deploy an observability and monitoring strategy.

    ✏️ Practice #6 - Including Grafana in the cluster

    Apply the changes in Practice 6 to achieve an infrastructure like this one:

    Infra
  • Golden Signals: Monitoring should focus on four critical Golden Signals to track the health of a service:

    1. Latency: The time it takes to service a request.
    2. Traffic: The amount of demand or load being placed on the system (e.g., requests per second).
    3. Errors: The rate of failed requests.
    4. Saturation: How full a service's resources are, such as CPU or memory.
    ✏️ Practice #11 - Dashboard creation

    Apply the changes in Practice 11:

    Infra
  • Define Alerts: Use alerts based on SLIs, metrics, traces, and logs to notify teams of potential problems before they affect customers.

    ✏️ Practice #12 - Alerts definition

    Apply the changes in Practice 12 to create alerts like these for each SLO defined above:

    Infra
  • Incident Response: When issues occur, follow a clear incident management process to resolve them quickly:

    1. Detect the issue using Golden Signals, monitoring, and alerting systems.
    2. Respond to the alert and acknowledge the incident.
    3. Mitigate the problem using workarounds or rollbacks to minimize customer impact.
    4. Document the incident for review and postmortem analysis.
    ✏️ Time to Detect, Time to Acknowledge and Time to Resolve

    These metrics measure the performance of the SRE (Site Reliability Engineering) team. By tracking them, an SRE team can plan its work and improve its response to incidents.

    ✏️ Practice #13 - Automate runbooks with Ansible

    Apply the changes in Practice 13 to achieve an infrastructure like this one:

    Infra

    In this setup, SREs are responsible for ensuring reliability using tools like Grafana, Jaeger, and AWX, while customers interact with the application.

    📈 Synthetic Monitoring

    Previously, Ansible was used to automate a task. Now, use it to create a synthetic monitor that simulates a browser accessing the endpoint: write an Ansible playbook that runs on AWX, connects to the endpoint, and verifies that the status code is 200.
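
The Time to Detect, Time to Acknowledge and Time to Resolve metrics mentioned under Incident Response can be derived from incident timestamps. The sketch below is one possible way to compute their averages. Definitions vary between teams; here, as an assumption, detection and resolution are measured from the start of the incident and acknowledgement from the moment of detection. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime       # when the fault began affecting the service
    detected: datetime      # when monitoring raised the alert
    acknowledged: datetime  # when an engineer acknowledged the alert
    resolved: datetime      # when the service was restored


def mean(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def incident_metrics(incidents: list[Incident]) -> dict[str, timedelta]:
    """Mean time to detect, acknowledge, and resolve across incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return {
        "time_to_detect": mean([i.detected - i.started for i in incidents]),
        "time_to_acknowledge": mean([i.acknowledged - i.detected for i in incidents]),
        "time_to_resolve": mean([i.resolved - i.started for i in incidents]),
    }


# Two invented incidents on the same day.
incidents = [
    Incident(datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 10, 4),
             datetime(2024, 1, 10, 10, 6), datetime(2024, 1, 10, 10, 40)),
    Incident(datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 2),
             datetime(2024, 1, 10, 14, 8), datetime(2024, 1, 10, 14, 20)),
]
metrics = incident_metrics(incidents)
for name, value in metrics.items():
    print(f"{name}: {value}")
```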

Operational Readiness Reviews (ORR)

Operational Readiness Reviews ensure that services are ready for production deployment. ORRs evaluate the robustness of infrastructure, the maturity of monitoring, and the ability to handle failures.

Checklist for Operational Readiness Review (ORR):

  • Ensure service has monitoring and alerting in place.
  • Review the capacity plan and ensure scaling capabilities.
  • Confirm all dependencies are resilient to failures.
  • Review disaster recovery strategies.
📈 ORR

Previously, in section 3 (SLIs, SLOs, SLAs, and Error Budgets), a list of 4 SLOs was created. Now, conduct a review based on the Operational Readiness Review (ORR) checklist above and attach the results to the same document.

Change Management and Automation

Automating repetitive tasks and following structured change management processes helps reduce risk during deployments.

Key Automation Practices:

Cost Optimization

SRE focuses not only on reliability but also on ensuring efficient use of resources.

Strategies:

  • Autoscaling: Adjust resources dynamically based on demand.
  • Capacity Planning: Regularly review resource utilization and plan for growth.
  • Optimize Cloud Usage: Ensure cloud services are provisioned based on actual needs.
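
To show what adjusting resources based on demand can look like, the sketch below applies the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The replica bounds, the 10% tolerance, and the CPU figures are assumptions picked for the example, and the function is a simplified model, not the autoscaler's actual implementation.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,    # hypothetical lower bound
    max_replicas: int = 10,   # hypothetical upper bound
    tolerance: float = 0.1,   # skip scaling when within 10% of the target
) -> int:
    """Replica count suggested by the ratio of observed to target metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target; avoid flapping
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas averaging 90% CPU against a 60% target: scale out.
scale_out = desired_replicas(3, current_metric=90, target_metric=60)
# 3 replicas averaging 20% CPU against a 60% target: scale in.
scale_in = desired_replicas(3, current_metric=20, target_metric=60)
# 63% against a 60% target is within tolerance: no change.
steady = desired_replicas(3, current_metric=63, target_metric=60)
print(scale_out, scale_in, steady)
```

The minimum bound protects availability during quiet periods, while the maximum bound caps cost when demand spikes.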

Continuous Improvement

SRE processes should continuously evolve based on feedback and lessons learned from incidents and performance reviews.

Key Activities:

  • Blameless Postmortems: Conduct post-incident reviews to identify root causes and improvements.
  • Regularly Review SLOs: Ensure that SLOs remain aligned with business needs.
  • Invest in Tooling: Continuously improve monitoring, alerting, and automation systems.

Pending

  • Terraform
  • Runbooks
  • Ansible
  • ArgoCD
  • Helm charts
  • CI/CD

# Run each group below starting from the repository root.

# Exercise 3: build and push version 0.0.0
cd exercises3
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.0 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.0

# Exercise 8: build and push version 0.0.1
cd exercises8
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.1 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.1

# Exercise 10: build and push version 0.0.23
cd exercises10
podman login docker.io
podman build -t cguillenmendez/sre-abc-training-python-app:latest .
podman build -t cguillenmendez/sre-abc-training-python-app:0.0.23 .
podman push cguillenmendez/sre-abc-training-python-app:latest
podman push cguillenmendez/sre-abc-training-python-app:0.0.23
