Join the Network.AI Software team at Meta to develop and enhance the software stack around NCCL, enabling reliable and scalable distributed ML training on large-scale GPU infrastructures, with a focus on GenAI/LLM scaling.
Requirements
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
Specialized experience in one or more of the following: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
Knowledge of GPU architectures and CUDA programming
Experience with DL frameworks such as PyTorch, Caffe2, or TensorFlow
Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models
PhD in Computer Science, Computer Engineering, or relevant technical field
Experience with data-parallel and model-parallel training
Experience in HPC and parallel computing
Knowledge of ML, deep learning, and LLMs
Experience with NCCL and with improving distributed GPU reliability and performance over RoCE/InfiniBand
Responsibilities
Enable reliable and highly scalable distributed ML training on Meta's large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling