HPC Systems Engineer
Job description
The HPC Systems Engineer role at KLA focuses on supporting and evolving a high-performance Linux cluster that powers R&D innovation. This position is crucial for maintaining the reliability, performance, and scalability of a mission-critical HPC environment. The engineer will collaborate closely with infrastructure, DevOps, and application teams to ensure optimal performance for demanding computational tasks. This role is part of the Global Products Group, which is dedicated to advancing semiconductor manufacturing technologies.
Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
- 3+ years of hands-on Linux systems administration experience
- Direct experience working with HPC or large-scale compute environments
- Practical experience with at least one HPC scheduler such as SLURM, LSF, or PBS
- Strong Linux troubleshooting skills in processes, memory, I/O, networking, and performance analysis
Responsibilities
- Operate and maintain a large-scale, Linux-based HPC cluster used for internal R&D workloads
- Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
- Monitor cluster health, performance, and capacity; respond to incidents and degradations
- Configure, tune, and support HPC job schedulers
- Assist users with job submission issues, resource requests, and queue optimization
- Help optimize scheduler policies to balance throughput, fairness, and utilization
- Install, configure, and maintain Linux operating systems across compute and service nodes
- Manage OS updates, kernel changes, drivers, and system hardening
- Troubleshoot complex Linux performance, networking, storage, and process-level issues
- Support high-throughput and parallel workloads across CPU and GPU resources
- Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
- Assist with scaling activities such as node expansions and hardware refreshes
- Use automation and configuration management tools to ensure consistency across the cluster
- Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
- Participate in on-call or escalation rotations as required to support a production R&D platform
- Partner with internal engineering teams to understand workload requirements and usage patterns
- Provide guidance and best practices for running workloads efficiently on shared HPC systems
- Contribute to internal documentation and operational runbooks
Benefits
- Compensation and financial: employees at KLA are often offered competitive pay with bonuses, a 401(k) match, an employee stock purchase program, and financial perks such as student-debt assistance, planning support, and group insurance discounts.
- Health and lifestyle: benefits typically include medical, dental, and vision coverage, life and other voluntary coverages, paid time off and holidays, family leave, backup care, wellness rewards, gym discounts, and community-volunteering opportunities.
- Growth: employees also get strong growth support through tuition reimbursement, KLA’s corporate learning center, education awards, and engineering certification programs.
