Senior System Architect, Infrastructure Reliability

NVIDIA

Location

Santa Clara, CA; Westford, MA; Austin, TX; Durham, NC; Redmond, WA

Type

Full-time

Posted

6/16/2026

Compensation

$184,000 - $356,500 per year

Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · Elite Sponsor · FY 2025

Job description

NVIDIA is looking for a Senior System Architect to tackle the challenge of Failure Attribution at Scale in accelerated computing. The role involves developing an automated framework that analyzes telemetry from CPU and GPU clusters to identify the root causes of job failures. The engineer will work on building scalable frameworks and automated diagnostics while collaborating with hardware and infrastructure teams. This position requires expertise in distributed systems and a strong background in systems programming.

Requirements

  • BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience), with 6+ years in systems programming.
  • Experience building automated Root Cause Analysis (RCA) pipelines for HPC or cloud-scale environments.
  • Expert knowledge of x86/ARM node-level metrics including IPC, cache contention, NUMA imbalance, and hardware interrupts.
  • Strong C++ and Python skills with the ability to build high-performance daemons.
  • Familiarity with cluster resource managers such as Slurm, LSF, or Kubernetes.

Responsibilities

  • Architect Failure Attribution Frameworks to capture high-fidelity state across CPU, GPU, and Fabric at the moment of failure.
  • Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions.
  • Implement low-overhead tracing mechanisms for job execution across multi-node Slurm or Kubernetes clusters.
  • Develop heuristics and machine-learning models to classify failures.
  • Work closely with hardware and infrastructure teams to define signals of impending failure.

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
