Principal ML Engineer - Large Scale Training Performance Optimization
AMD • San Jose, CA • Full-time • Posted 3/24/2026 • $226.4k - $339.6k per year
PhD • Entry-Level
✓ Established Sponsor (FY 2025) • Approval 98.6% • Total filings 728 • New hires 184
Job Description
AMD is seeking a Principal Machine Learning Engineer to join its Models and Applications team, focusing on distributed training of large models on GPUs. The role involves improving training efficiency, contributing to open-source projects, and collaborating with various teams to influence the direction of AMD's AI platform.
Requirements
- Experience with distributed training pipelines
- Knowledgeable in distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel, ZeRO)
- Familiar with training large models at scale
- Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
- Experience with distributed training frameworks like Megatron-LM, MaxText, TorchTitan
- Experience with LLMs or computer vision, especially large models, is a plus
- Experience with GPU kernel optimization is a plus
- Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
- Experience with ML infra at kernel, framework, or system level
- Strong communication and problem-solving skills
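As a rough illustration of the data-parallel techniques the requirements above name, here is a minimal pure-Python sketch (a toy, not AMD's stack; all function names are hypothetical, and a real pipeline would use `torch.distributed` or similar collectives): each rank computes gradients on its own data shard, an all-reduce averages them across ranks, and every rank applies the same update.

```python
# Toy sketch of Data Parallel training: simulate ranks in one process.
# Model: y = w * x, loss = mean squared error on each rank's shard.

def local_gradients(weights, batch):
    # Gradient of MSE w.r.t. each weight on this rank's shard of data.
    n = len(batch)
    return [sum(2 * (w * x - y) * x for x, y in batch) / n for w in weights]

def all_reduce_mean(per_rank_grads):
    # Simulates the all-reduce collective: average gradients across ranks
    # so every rank steps with the same global gradient.
    world_size = len(per_rank_grads)
    return [sum(g[i] for g in per_rank_grads) / world_size
            for i in range(len(per_rank_grads[0]))]

weights = [0.5]
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]  # each rank sees a different shard
grads = [local_gradients(weights, shard) for shard in shards]
avg = all_reduce_mean(grads)           # identical result on every rank
weights = [w - 0.1 * g for w, g in zip(weights, avg)]  # synchronized SGD step
```

ZeRO extends this pattern by additionally sharding optimizer states, gradients, and parameters across ranks so that no single GPU holds a full replica.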
Responsibilities
- Train large models to convergence on AMD GPUs at scale
- Improve the end-to-end training pipeline performance
- Optimize the distributed training pipeline and algorithms to scale out
- Contribute your changes to open source
- Stay up-to-date with the latest training algorithms
- Influence the direction of AMD's AI platform
- Collaborate with various groups and stakeholders across teams
Benefits
- AMD provides a competitive 'Total Rewards' package that focuses on financial growth, health, and work-life balance.
