
Software Engineer, SystemML - Scaling / Performance

Meta
New York City, NY | Full-time | 12/2/2025 | $70.67 an hour
Undergraduate Entry-Level | Master's Entry-Level | PhD Entry-Level

Job Description

Join the Network.AI Software team at Meta to develop and enhance the software stack around NCCL, enabling reliable and scalable distributed ML training on large-scale GPU infrastructures, with a focus on GenAI/LLM scaling.

Requirements

  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
  • Knowledge of GPU architectures and CUDA programming
  • Experience with DL frameworks like PyTorch, Caffe2 or TensorFlow
  • Experience in AI framework and trainer development for accelerating large-scale distributed deep learning models
  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with data parallel and model parallel training
  • Experience in HPC and parallel computing
  • Knowledge of ML, deep learning, and LLMs
  • Experience with NCCL and distributed GPU reliability/performance improvement on RoCE/InfiniBand

Responsibilities

  • Enabling reliable and highly scalable distributed ML training on Meta's large-scale GPU training infra with a focus on GenAI/LLM scaling