Modern cloud-native and distributed systems operate at a scale and complexity where traditional, reactive SRE practices — based on static thresholds, dashboards, and manual incident response — are no longer sufficient. This programme strengthens the technical depth, architectural judgement, and implementation capability of Senior DevOps Engineers and Architects through a progressive, system-centric model.
All learning is anchored in real microservices-based reference applications implemented in Java (Spring Boot) and .NET (ASP.NET Core). Learners instrument, observe, stress, break, recover, and optimise these services across the programme lifecycle. AI augments, rather than replaces, SRE decision-making: it detects issues earlier, improves diagnosis, helps prevent failures, and enables safe, automated remediation under defined guardrails.
WHO SHOULD ATTEND
- Senior DevOps Engineers and Architects responsible for operating, scaling, and stabilising cloud-native systems at production scale
- SRE Practitioners who manage monitoring, automation, and CI/CD pipelines and are expected to architect reliability and intelligent automation
- Platform Engineers with direct involvement in production incidents, RCA, and post-incident improvement activities
Experience: 3–8+ years required.
PRE-REQUISITES
Must-Have
- Experience with Java (Spring Boot) / .NET (ASP.NET Core) services in production
- Strong knowledge of AWS or Azure — compute, networking, storage, IAM, monitoring
- Practical Kubernetes experience including troubleshooting pod, service, and node-level failures
- Experience building and maintaining CI/CD pipelines
- Direct production incident experience
Good to Have
- Database administration and optimisation
- Hands-on proficiency with Prometheus, Jaeger, Grafana, ELK, or Terraform
- Microservice architecture knowledge
KEY OUTCOMES
- Define business-aligned SLIs and SLOs at application and transaction levels and implement SLI instrumentation within Java / .NET services
- Assess solution, database, and infrastructure architectures from performance, scalability, and reliability perspectives
- Conduct Fault Vulnerability Analysis (FVA) using historical data, incident patterns, and AI-assisted insights
- Design and implement high-fidelity observability architectures enabling AI-driven anomaly detection, signal correlation, and contextual analysis
- Design and validate chaos engineering strategies by executing controlled failure scenarios
- Participate in production incident RCA leveraging AI-assisted correlation, blast-radius analysis, and intelligent incident summarisation
- Define release management and rollback strategies aligned to SLOs, error budgets, and AI-supported risk-based change analysis
- Design and implement automation for toil reduction and self-healing with AI-assisted closed-loop remediation under defined guardrails