Senior Software Engineer - NVLink Rack Scale Stability and Reliability
NVIDIA
Location
USA (Multiple Locations)
Type
Full-time
Posted
5/23/2026
Compensation
$152,000 - $287,500 per year
Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · 👑 Elite Sponsor · FY 2025
Job description
NVIDIA is seeking highly motivated Senior Software Engineers to join its Fabric Networking team, focusing on NVLink Rack-Scale Systems Stability and Reliability. The role involves collaborating with architects and developers to transform innovative platforms into stable, production-ready systems. Candidates will tackle complex system-level challenges related to resiliency, diagnostics, and large-scale AI infrastructure. This position is integral to enhancing the software foundation that supports next-generation datacenter deployments.
Requirements
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience.
- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems.
- Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus.
- Strong system-level debugging skills across software, firmware, hardware, and networking layers.
- Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis.
- Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging.
- Strong communication and collaboration skills across engineering, customer, and operations teams.
Responsibilities
- Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems.
- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support.
- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution.
- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments.
- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability.
- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness.
- Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency.
Benefits
- Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
