Principal Software Engineer, At-Scale Reliability and Fleet Intelligence — CSP Engagements

NVIDIA

Location

Santa Clara, CA

Type

Full-time

Posted

6/27/2026

Compensation

$272,000 - $431,250 per year

Undergraduate with 5+ Years of Experience

Job description

The Principal Software Engineer will join the CSP Engagements team, focusing on fleet-scale reliability for NVIDIA platforms. The role works directly with the engineering teams of key cloud service provider (CSP) and hyperscale customers to achieve target Mean Time Between Interruptions (MTBI) in production. The engineer will complement NVIDIA's internal software and quality teams with a dedicated CSP-facing focus, driving work streams that improve reliability through data analysis and collaboration. The position requires a deep understanding of reliability software and firmware architecture, and the ability to carry improvements proven in the lab over to real customer environments.

Requirements

15+ years of experience in systems software at datacenter scale or reliability engineering with a focus on at-scale challenges.
BS or MS in Computer Science, Electrical Engineering, Statistics, or a related field, or equivalent experience.
Deep expertise in multi-NUMA, rack-scale system software and firmware.
Experience with fleet-level telemetry and observability systems, including time-series databases and anomaly detection.
Understanding of hardware failure modes in large-scale GPU/accelerator deployments.
Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems.
Strong communication skills to present statistical reliability findings to technical and executive audiences.

Responsibilities

Drive reliability work streams with CSP engineering teams to ensure shared understanding of MTBI measurement methodology.
Gather and synthesize CSP fleet reliability data to identify failure patterns across multiple customers.
Define a consistent MTBI measurement methodology that works across different CSP monitoring environments.
Conduct fleet-scale failure pattern analysis using statistical methods to classify failures.
Drive fleet health monitoring integration architecture to align with CSP operational workflows.
Define burn-in reliability test environment and cluster certification criteria in collaboration with quality teams.
Collaborate with CSPs to ensure reliability-related integration work is complete ahead of at-scale launch.
Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments.

Benefits

Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
