JobsMember of Technical Staff, HPC Operations Engineering Manager

Member of Technical Staff, HPC Operations Engineering Manager

Microsoft

Member of Technical Staff, HPC Operations Engineering Manager

Microsoft

Location

Mountain View, CA

Type

Full-time

Posted

5/5/2026

Compensation

$163,000 - $331,200 per year

Undergraduate with 5+ Years of Experience

Approval 98.4%·Filings 6,363·New hires 3,142·

👑 Elite Sponsor

·FY 2025

Job description

Microsoft AI is looking for an experienced High Performance Computing Operations Engineering Manager to lead a team of Site Reliability Engineers within the MAI SuperIntelligence Team. This role focuses on ensuring the reliability and efficiency of large-scale distributed AI infrastructure. The manager will collaborate closely with ML researchers, data engineers, and product developers to design and operate platforms for generative AI models. The team is dedicated to creating AI that amplifies human potential while maintaining control and safety aligned with human values.

Requirements

Bachelor's Degree in Computer Science or related technical field.
8+ years of technical engineering experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering leadership roles.
8+ years of experience with Kubernetes, Docker, and container orchestration.
6+ years of programming/scripting experience in languages such as Python, Go, or Bash.
Master's Degree in Computer Science or related technical field is preferred.
12+ years of technical engineering experience is preferred.
10+ years of experience with public cloud platforms like Azure, AWS, or GCP and infrastructure-as-code is preferred.
6+ years of people management experience is preferred.
8+ years of experience in monitoring and observability tools such as Grafana or Datadog is preferred.
Solid knowledge of distributed systems, networking, and storage is preferred.

Responsibilities

Lead a team of experienced Site Reliability Engineers to ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
Design and maintain monitoring, alerting, and logging systems for real-time visibility into model serving pipelines and infrastructure.
Lead the development of automation for deployments, incident response, scaling, and failover in hybrid cloud and on-premises environments.
Manage on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements.
Ensure data privacy, compliance, and secure operations across model training and serving environments.
Collaborate with ML engineers and platform teams to enhance developer experience and accelerate research-to-production workflows.

Benefits

Employees at Microsoft are often offered comprehensive, “world-class” benefits—including health and mental-wellness programs, competitive pay with bonuses and stock awards, and retirement/savings options. Time-off and flexibility are common, with generous vacation and holidays, parental and caregiver leave, and flexible work schedules, alongside learning support, employee resource groups, product discounts, and matching-gifts/volunteering programs. Specific benefits can vary by region.

Is this posting expired or inaccurate?