Principal ML Engineer - Large Scale Training Performance Optimization
AMD • San Jose, CA • Full-time • Posted 3/24/2026 • $226.4k - $339.6k per year
PhD • Entry-Level
✓ Established Sponsor (FY 2025) • Approval 98.6% • Total filings 728 • New hires 184
Job Description
AMD is seeking a Principal Machine Learning Engineer to join its Models and Applications team, focusing on distributed training of large models on GPUs. The role involves improving training efficiency, contributing to open-source projects, and collaborating with various teams to influence the direction of AMD's AI platform.
Requirements
- Experience with distributed training pipelines
- Knowledgeable in distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel, ZeRO)
- Familiar with training large models at scale
- Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
- Experience with distributed training frameworks like Megatron-LM, MaxText, TorchTitan
- Experience with LLMs or computer vision, especially large models, is a plus
- Experience with GPU kernel optimization is a plus
- Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
- Experience with ML infra at kernel, framework, or system level
- Strong communication and problem-solving skills
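As a rough illustration of the data-parallel techniques the requirements above name, here is a minimal pure-Python sketch (a toy, not AMD's stack; all function names are hypothetical, and a real pipeline would use `torch.distributed` or similar collectives): each rank computes gradients on its own data shard, an all-reduce averages them across ranks, and every rank applies the same update.

```python
# Toy sketch of Data Parallel training: simulate ranks in one process.
# Model: y = w * x, loss = mean squared error on each rank's shard.

def local_gradients(weights, batch):
    # Gradient of MSE w.r.t. each weight on this rank's shard of data.
    n = len(batch)
    return [sum(2 * (w * x - y) * x for x, y in batch) / n for w in weights]

def all_reduce_mean(per_rank_grads):
    # Simulates the all-reduce collective: average gradients across ranks
    # so every rank steps with the same global gradient.
    world_size = len(per_rank_grads)
    return [sum(g[i] for g in per_rank_grads) / world_size
            for i in range(len(per_rank_grads[0]))]

weights = [0.5]
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]  # each rank sees a different shard
grads = [local_gradients(weights, shard) for shard in shards]
avg = all_reduce_mean(grads)           # identical result on every rank
weights = [w - 0.1 * g for w, g in zip(weights, avg)]  # synchronized SGD step
```

ZeRO extends this pattern by additionally sharding optimizer states, gradients, and parameters across ranks so that no single GPU holds a full replica.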
Responsibilities
- Train large models to convergence on AMD GPUs at scale
- Improve the end-to-end training pipeline performance
- Optimize the distributed training pipeline and algorithms to scale out
- Contribute your changes to open source
- Stay up-to-date with the latest training algorithms
- Influence the direction of AMD's AI platform
- Collaborate with various groups and stakeholders across teams
Benefits
- AMD provides a competitive 'Total Rewards' package that focuses on financial growth, health, and work-life balance.
