Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA

Location

Remote; Santa Clara, CA; Austin, TX; Redmond, WA

Type

Full-time

Posted

6/5/2026

Compensation

$184,000 - $356,500 per year

Undergraduate with 5+ Years of Experience

Approval 99.2% · Filings 1,781 · New hires 873 · 👑 Elite Sponsor · FY 2025

Job description

NVIDIA is seeking a Senior Software Engineer to lead the optimization and benchmarking of distributed training and inference workloads on GPU platforms. This role involves setting technical direction across communication libraries and model frameworks to ensure efficient operation of large language model workloads. The engineer will conduct deep performance investigations and build resilience capabilities for large clusters. This position requires a hands-on approach and collaboration with various teams to enhance performance and reliability.

Requirements

Bachelor's or Master's degree in Computer Science or a related technical field, or equivalent experience.
8+ years of experience developing software infrastructure for large-scale AI or HPC systems.
Expertise in debugging and triaging AI applications across the full stack.
Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads.
Proven track record of architecting, debugging, and scaling large-scale distributed systems.
Expert-level Python and C/C++ programming skills.
Experience operating workloads in scheduled, containerized cluster environments.
Excellent analytical, debugging, and communication skills.

Responsibilities

Lead bring-up, validation, and debugging of large-scale AI clusters and workloads.
Tune and benchmark AI pre-training, post-training, and inference workloads using NVIDIA AI software stacks.
Profile and optimize workload performance across compute, memory, networking, and communication layers.
Analyze scaling efficiency for distributed LLM workloads and provide tuning guidance.
Conduct root-cause analysis of complex failures in large distributed environments.
Define and build resilience and failure-attribution capabilities across the cluster.
Build benchmark suites, automation, acceptance criteria, and qualification workflows.
Tune runtime settings and deployment configurations in collaboration with other teams.
Deliver actionable recommendations based on profiling and benchmark results.
Mentor engineers and drive technical standards within the organization.

Benefits

Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
