HPC Systems Engineer

KLA

Location

Ann Arbor, MI

Type

Full-time

Posted

5/13/2026

Compensation

$105,900 - $180,000 per year

Undergraduate with 5+ Years of Experience
Master's with 2+ Years of Experience
PhD Entry-Level
Approval 97.8% · Filings 803 · New hires 321 · 💎 Strong Sponsor · FY 2025

Job description

The HPC Systems Engineer role at KLA focuses on supporting and evolving a high-performance Linux cluster that powers R&D innovation. This position is crucial for maintaining the reliability, performance, and scalability of a mission-critical HPC environment. The engineer will collaborate closely with infrastructure, DevOps, and application teams to ensure optimal performance for demanding computational tasks. This role is part of the Global Products Group, which is dedicated to advancing semiconductor manufacturing technologies.

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of hands-on Linux systems administration experience
  • Direct experience working with HPC or large-scale compute environments
  • Practical experience with at least one HPC scheduler such as SLURM, LSF, or PBS
  • Strong Linux troubleshooting skills in processes, memory, I/O, networking, and performance analysis

Responsibilities

  • Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
  • Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
  • Monitor cluster health, performance, and capacity; respond to incidents and degradations
  • Configure, tune, and support HPC job schedulers
  • Assist users with job submission issues, resource requests, and queue optimization
  • Help optimize scheduler policies to balance throughput, fairness, and utilization
  • Install, configure, and maintain Linux operating systems across compute and service nodes
  • Manage OS updates, kernel changes, drivers, and system hardening
  • Troubleshoot complex Linux performance, networking, storage, and process-level issues
  • Support high-throughput and parallel workloads across CPU and GPU resources
  • Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
  • Assist with scaling activities such as node expansions and hardware refreshes
  • Use automation and configuration management tools to ensure consistency across the cluster
  • Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
  • Participate in on-call or escalation rotations as required to support a production R&D platform
  • Partner with internal engineering teams to understand workload requirements and usage patterns
  • Provide guidance and best practices for running workloads efficiently on shared HPC systems
  • Contribute to internal documentation and operational runbooks

Benefits

  • Employees at KLA are often offered competitive pay with bonuses, a 401(k) match, an employee stock purchase program, and financial perks like student-debt assistance, planning support, and group insurance discounts.
  • Health and lifestyle benefits typically include medical/dental/vision, life and other voluntary coverages, paid time off and holidays, family leave, backup care, wellness rewards, gym discounts, and community-volunteering opportunities.
  • Employees also get strong growth support through tuition reimbursement, KLA’s corporate learning center, education awards, and engineering certification programs.
