Senior AI and ML HPC Cluster Engineer

NVIDIA

Location

USA (Multiple Locations)

Type

Full-time

Posted

5/10/2026

Compensation

$152,000 - $287,500 per year

Undergraduate with 5+ Years of Experience

Elite Sponsor · FY 2025 · Approval 99.2% · Filings 1,781 · New hires 873

Job description

As a member of the GPU AI/HPC Infrastructure team at NVIDIA, you will lead the design and implementation of advanced GPU compute clusters for deep learning and high-performance computing. This role focuses on addressing strategic challenges related to compute, networking, and storage for large-scale workloads. You will also be responsible for enhancing the ecosystem around GPU-accelerated computing and fostering relationships with customers and cross-functional teams. Your expertise will help optimize resource utilization and support researchers with their computational needs.

Requirements

Bachelor's degree in Computer Science, Electrical Engineering or related field or equivalent experience
5+ years of experience designing and operating large-scale compute infrastructure
Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, PBS, RTDA or LSF
Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
Solid understanding of cluster configuration management tools such as Ansible, Puppet, and Salt
In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, and Charliecloud
Proficiency in Python programming and Bash scripting
Applied experience with AI/HPC workflows that use MPI
Experience analyzing and tuning performance for a variety of AI/HPC workloads
Passion for continual learning and staying ahead of emerging technologies in HPC and AI/ML infrastructure fields

Responsibilities

Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
Develop and improve the ecosystem around GPU-accelerated computing, including scalable automation solutions
Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
Support researchers in running their workloads, including performance analysis and optimization
Conduct root cause analysis and suggest corrective actions
Proactively identify and address potential issues before they occur

Benefits

Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
