Site Reliability Engineering (SRE) Lead – Developer Experience & CI Reliability
AMD • San Jose, CA • Full-time • 3/17/2026
Undergraduate with 5+ Years of Experience
Approval 98.6% • Total filings 728 • New hires 184 • FY 2025
✓ Established Sponsor
Job Description
As a Site Reliability Engineering (SRE) Lead at AMD, you will enhance the reliability and operational maturity of Kubernetes-based CI and Developer Experience platforms. This role involves establishing reliability standards, improving incident management, and driving systemic improvements to ensure sustainable platform growth.
Requirements
- 7+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles supporting production systems.
- Strong hands-on experience operating and debugging production Kubernetes environments at scale.
- Deep experience designing, maintaining, and troubleshooting CI/CD systems (e.g. GitHub Actions, Jenkins).
- Demonstrated ability to define and operate SLOs/SLIs, using reliability signals to drive sustainable operational improvements.
- Strong cross-functional communication skills, with the ability to lead reliability improvements through hands-on technical expertise.
Responsibilities
- Establish measurable reliability standards (SLOs/SLIs) for CI and Developer Experience services and drive adoption across teams.
- Evolve existing on-call practices into a well-defined reliability model that scales with platform growth.
- Elevate the operational maturity of Kubernetes-based CI infrastructure by identifying systemic failure patterns and implementing structural fixes.
- Partner with DevOps and platform teams to improve ownership clarity and first-line Kubernetes debugging capabilities.
- Promote reliability engineering best practices and mentor engineers to build a culture of operational excellence.
Benefits
- AMD provides a competitive 'Total Rewards' package that focuses on financial growth, health, and work-life balance.
