Senior Software Engineer - NVLink Rack Scale Stability and Reliability

NVIDIA

Location: USA (Multiple Locations)

Type: Full-time

Posted: 5/23/2026

Compensation: $152,000 - $287,500 per year

Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · Elite Sponsor · FY 2025

Job description

NVIDIA is seeking highly motivated Senior Software Engineers to join its Fabric Networking team, focusing on NVLink Rack-Scale Systems Stability and Reliability. The role involves collaborating with architects and developers to transform innovative platforms into stable, production-ready systems. Candidates will tackle complex system-level challenges related to resiliency, diagnostics, and large-scale AI infrastructure. This position is integral to enhancing the software foundation that supports next-generation datacenter deployments.

Requirements

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
  • 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
  • Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
  • Strong system-level debugging skills across software, firmware, hardware, and networking layers.
  • Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
  • Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
  • Strong communication and collaboration skills across engineering, customer, and operations teams.

Responsibilities

  • Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
  • Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
  • Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
  • Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
  • Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
  • Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
  • Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
