Senior AI and ML HPC Cluster Engineer

NVIDIA

Location

USA (Multiple Locations)

Type

Full-time

Posted

5/10/2026

Compensation

$152,000 - $287,500 per year

Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · Elite Sponsor · FY 2025

Job description

As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters for deep learning and high-performance computing. This role focuses on addressing strategic challenges related to compute, networking, and storage for large-scale workloads. You will also be responsible for enhancing the ecosystem around GPU-accelerated computing and fostering relationships with customers and cross-functional teams. Your expertise will help optimize resource utilization and support researchers in their computational needs.

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Experience with advanced AI/HPC job schedulers, such as Slurm, K8s, PBS, RTDA, or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, or Salt
  • In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
  • Proficiency in Python programming and Bash scripting
  • Applied experience with AI/HPC workflows that use MPI
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads
  • Passion for continual learning and staying ahead of emerging technologies in HPC and AI/ML infrastructure fields

Responsibilities

  • Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
  • Develop and improve the ecosystem around GPU-accelerated computing, including scalable automation solutions
  • Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
  • Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
  • Support researchers in running their workloads, including performance analysis and optimization
  • Conduct root cause analysis and suggest corrective actions
  • Proactively find and fix potential issues before they cause problems

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras such as FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
