Staff HPC Engineer
Job description
The Staff HPC Engineer at KLA is responsible for designing, building, optimizing, and supporting large-scale compute environments for scientific computing and AI/ML workloads. This role requires a blend of systems engineering, performance tuning, and hands-on troubleshooting. The engineer will work closely with researchers, developers, and IT teams to ensure reliable and high-performance compute infrastructure. The position is part of the Information Technology group, which focuses on enhancing technology to empower employees and drive business growth.
Requirements
- Extensive experience with Linux systems engineering in large-scale compute environments.
- Solid understanding of distributed systems and cloud infrastructure.
- Deep knowledge of HPC schedulers, MPI stacks, and parallel computing models.
- Strong understanding of high-speed interconnects and distributed storage systems.
- Proficiency in scripting languages and automation frameworks.
- Experience with GPUs and accelerator-based computing.
- Familiarity with containerization in HPC contexts.
- Strong troubleshooting skills across hardware, OS, and application layers.
- Understanding of networking fundamentals.
- Background in high-availability and distributed systems at scale.
- Doctorate degree with 8 years of related work experience, or a Master's degree with 12 years, or a Bachelor's degree with 15 years.
Responsibilities
- Design and implement HPC clusters, including compute, storage, networking, and job-scheduling components.
- Evaluate and integrate new technologies such as GPUs and accelerators.
- Develop automation for cluster provisioning, configuration, and lifecycle management.
- Architect solutions for large-scale parallel workloads and data-intensive applications.
- Profile and tune applications for performance across various metrics.
- Optimize parallel programming frameworks like MPI and OpenMP.
- Benchmark hardware and software stacks to guide procurement decisions.
- Maintain and monitor HPC clusters and job schedulers.
- Troubleshoot complex system issues across compute, storage, and network layers.
- Implement security best practices and ensure high availability.
- Build and maintain CI/CD pipelines for HPC-related software.
- Develop monitoring and observability solutions.
- Provide technical leadership and mentorship to junior engineers.
- Document architectures, procedures, and best practices.
- Participate in capacity planning and long-term HPC strategy.
Benefits
- Employees at KLA are often offered competitive pay with bonuses, a 401(k) match, an employee stock purchase program, and financial perks like student-debt assistance, planning support, and group insurance discounts. Health and lifestyle benefits typically include medical/dental/vision, life and other voluntary coverages, paid time off and holidays, family leave, backup care, wellness rewards, gym discounts, and community-volunteering opportunities. Employees also get strong growth support through tuition reimbursement, KLA’s corporate learning center, education awards, and engineering certification programs.
