Site Reliability Engineering (SRE) Lead – Developer Experience & CI Reliability

AMD

San Jose, CA • Full-time • 3/17/2026

Undergraduate with 5+ Years of Experience


Job Description

As a Site Reliability Engineering (SRE) Lead at AMD, you will enhance the reliability and operational maturity of Kubernetes-based CI and Developer Experience platforms. This role involves establishing reliability standards, improving incident management, and driving systemic improvements to ensure sustainable platform growth.

Requirements

7+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles supporting production systems.
Strong hands-on experience operating and debugging production Kubernetes environments at scale.
Deep experience designing, maintaining, and troubleshooting CI/CD systems (e.g. GitHub Actions, Jenkins).
Demonstrated ability to define and operate SLOs/SLIs, using reliability signals to drive sustainable operational improvements.
Strong cross-functional communication skills, with the ability to lead reliability improvements through hands-on technical expertise.

Responsibilities

Establish measurable reliability standards (SLOs/SLIs) for CI and Developer Experience services and drive adoption across teams.
Evolve existing on-call practices into a well-defined reliability model that scales with platform growth.
Elevate the operational maturity of Kubernetes-based CI infrastructure by identifying systemic failure patterns and implementing structural fixes.
Partner with DevOps and platform teams to improve ownership clarity and first-line Kubernetes debugging capabilities.
Promote reliability engineering best practices and mentor engineers to build a culture of operational excellence.

Benefits

AMD provides a competitive 'Total Rewards' package that focuses on financial growth, health, and work-life balance.
