
Principal ML Engineer - Large Scale Training Performance Optimization

AMD
San Jose, CA · Full-time · 3/24/2026 · $226.4k - $339.6k per year
PhD · Entry-Level
Approval 98.6% · Total filings 728 · New hires 184
Established Sponsor
FY 2025

Job Description

AMD is seeking a Principal Machine Learning Engineer to join its Models and Applications team, focusing on distributed training of large models on GPUs. The role involves improving training efficiency and contributing to open-source projects while collaborating with various teams to influence the direction of AMD's AI platform.

Requirements

  • Experience with distributed training pipelines
  • Knowledgeable in distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, Expert Parallel, ZeRO)
  • Familiar with training large models at scale
  • Experience with ML/DL frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training frameworks such as Megatron-LM, MaxText, or TorchTitan
  • Experience with LLMs or computer vision, especially large models, is a plus
  • Experience with GPU kernel optimization is a plus
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale
  • Experience with ML infra at kernel, framework, or system level
  • Strong communication and problem-solving skills
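Not part of the posting, but to give a concrete sense of the ZeRO technique the requirements list names: a minimal sketch of the per-GPU memory accounting from the ZeRO paper, assuming the standard mixed-precision Adam layout (2 bytes fp16 parameters + 2 bytes fp16 gradients + 12 bytes fp32 optimizer state per parameter). The function name `zero_memory_per_gpu` is a hypothetical helper for illustration, not an API from any framework listed above.

```python
def zero_memory_per_gpu(num_params: float, world_size: int, stage: int) -> float:
    """Approximate per-GPU training-state memory, in bytes, for ZeRO stages 0-3.

    Hypothetical illustration: assumes mixed-precision Adam, i.e. per parameter
    2 bytes fp16 weights (P), 2 bytes fp16 gradients (G), and 12 bytes fp32
    optimizer state (O: master weights, momentum, variance). Activations and
    temporary buffers are not counted.
    """
    P, G, O = 2, 2, 12
    if stage == 0:  # plain data parallel: everything replicated on every GPU
        return num_params * (P + G + O)
    if stage == 1:  # ZeRO-1: optimizer states sharded across ranks
        return num_params * (P + G + O / world_size)
    if stage == 2:  # ZeRO-2: optimizer states + gradients sharded
        return num_params * (P + G / world_size + O / world_size)
    if stage == 3:  # ZeRO-3: parameters, gradients, and optimizer states sharded
        return num_params * (P + G + O) / world_size
    raise ValueError(f"unknown ZeRO stage: {stage}")


# Example: a 7.5B-parameter model on 64 GPUs needs ~120 GB of training state
# per GPU under plain data parallelism, but ~1.9 GB per GPU under ZeRO-3.
print(zero_memory_per_gpu(7.5e9, 64, 0) / 1e9)  # → 120.0
print(zero_memory_per_gpu(7.5e9, 64, 3) / 1e9)  # → 1.875
```

The point of the sketch is the scaling behavior: stages 1 and 2 remove the replication of optimizer state (the dominant term under Adam), while stage 3 divides the entire training state by the number of ranks at the cost of extra communication.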

Responsibilities

  • Train large models to convergence on AMD GPUs at scale
  • Improve the end-to-end training pipeline performance
  • Optimize the distributed training pipeline and algorithm to scale out
  • Contribute your changes to open source
  • Stay up-to-date with the latest training algorithms
  • Influence the direction of AMD's AI platform
  • Collaborate across teams with various groups and stakeholders
  

Benefits

  • AMD provides a competitive 'Total Rewards' package that focuses on financial growth, health, and work-life balance.
