Principal AI and ML Infra Software Engineer, GPU Clusters

NVIDIA

Location

USA (Multiple Locations)

Type

Full-time

Posted

5/10/2026

Compensation

$272,000 - $431,250 per year

Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · 👑 Elite Sponsor · FY 2025

Job description

The Principal AI and ML Infra Software Engineer at NVIDIA will play a crucial role in enhancing the efficiency of AI and ML research by addressing infrastructure deficiencies within GPU Clusters. This position involves close collaboration with researchers to identify their needs and implement actionable improvements. The engineer will also monitor and optimize infrastructure performance while advocating for the integration of new AI/ML technologies. This role is part of the Hardware Infrastructure team, focused on creating effective and scalable solutions for AI/ML technology.

Requirements

  • BS in Computer Science or a related field, or equivalent experience.
  • 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
  • Hands-on experience in using or operating High Performance Computing (HPC) grade infrastructure.
  • In-depth knowledge of accelerated computing, storage, scheduling & orchestration, high-speed networking, and container technologies.
  • Ability to supervise and improve large-scale distributed training operations using frameworks such as PyTorch, NeMo, or JAX.
  • Proficiency in programming and scripting languages such as Python, Go, and Bash.
  • Familiarity with cloud computing platforms like AWS, GCP, or Azure.
  • Dedication to ongoing learning and staying updated on new technologies in the AI/ML infrastructure sector.
  • Excellent communication and collaboration skills.

Responsibilities

  • Engage closely with AI and ML research teams to discern their infrastructure requirements and barriers.
  • Proactively identify researcher efficiency bottlenecks and lead initiatives to systematically improve them.
  • Monitor and optimize infrastructure performance, ensuring high availability and efficient resource utilization.
  • Help define and improve measures of AI researcher efficiency.
  • Work closely with researchers, data engineers, and DevOps professionals to develop a cohesive AI/ML infrastructure ecosystem.
  • Stay updated with recent developments in AI/ML technologies and advocate for their integration within the organization.

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
