Senior Manager, Site Reliability Engineering
Job description
NVIDIA is seeking a Senior Manager of Site Reliability Engineering to lead IT operations at scale, focusing on building AI-powered systems that enhance reliability and employee experience. This role emphasizes transforming traditional service management into an intelligent, automated operating model. The ideal candidate will drive the adoption of observability and automation while partnering with engineering and business teams to align operations with service reliability goals. This position offers an opportunity to make a significant impact in a diverse and innovative environment.
Requirements
- BS, MS, or PhD in Computer Science, Electrical/Computer Engineering, Physics, Mathematics, other Engineering or related fields or equivalent experience.
- 5+ years of experience leading and managing global IT operations or service management teams.
- 12+ years of overall experience in Site Reliability Engineering and IT Service Management, with a focus on Incident Management, Problem Management, and Configuration Management.
- Proven proficiency in Incident, Problem, and CM with a consistent record of delivering measurable gains in reliability and efficiency.
- Demonstrated experience applying AI, automation, or advanced analytics to improve operational outcomes.
- Solid understanding of observability, monitoring ecosystems, and modern reliability practices.
- Strong leadership capability with experience building and scaling engineering-focused teams.
- Ability to deliver executive-level communication and insights.
Responsibilities
- Manage the full lifecycle of Incident, Problem, and Change Management as a 24×7 operational function.
- Transform incident response by implementing AI detection, correlation, and guided remediation.
- Build and scale intelligent incident workflows that integrate monitoring, telemetry, and service context.
- Evolve Problem Management into a data-driven discipline, using AI and analytics to identify patterns.
- Modernize Change Management by introducing risk-aware, data-driven decision-making.
- Drive the adoption of observability to ensure service-level visibility and actionable insights.
- Lead the development of automation and orchestration platforms to reduce manual effort.
- Partner closely with engineering, infrastructure, and business teams to align operations with service reliability goals.
Benefits
- Employees at NVIDIA are often offered comprehensive, day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
