JobsMTS - Site Reliability Engineer
Microsoft logo

MTS - Site Reliability Engineer

Microsoft

Location

USA (Multiple Locations)

Type

Full-time

Posted

5/5/2026

Compensation

$119,800 - $304,200 per year

Undergraduate with 5+ Years of Experience
Approval 98.4%·Filings 6,363·New hires 3,142·
👑 Elite Sponsor
·FY 2025

Job description

The Site Reliability Engineer (SRE) role at Microsoft focuses on maintaining the reliability and efficiency of large-scale distributed AI infrastructure. The position involves collaboration with ML researchers, data engineers, and product developers to design and operate platforms for generative AI models. The team is committed to building systems that embody true artificial intelligence while ensuring accessibility for all users. This role blends software and systems engineering to optimize performance and reliability.

Requirements

  • 4+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Strong proficiency in Kubernetes, Docker, and container orchestration.
  • Knowledge of CI/CD pipelines for Inference and ML model deployment.
  • Hands-on experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code.
  • Expertise in monitoring and observability tools such as Grafana, Datadog, and OpenTelemetry.
  • Strong programming/scripting skills in Python, Go, or Bash.
  • Solid knowledge of distributed systems, networking, and storage.
  • Experience running large-scale GPU clusters for ML/AI workloads is preferred.
  • Familiarity with ML training/inference pipelines.
  • Experience with high-performance computing (HPC) and workload schedulers.
  • Background in capacity planning and cost optimization for GPU-heavy environments.

Responsibilities

  • Ensure uptime, resiliency, and fault tolerance of AI model training and inference systems.
  • Design and maintain monitoring, alerting, and logging systems for real-time visibility into model serving pipelines.
  • Analyze system performance and scalability, optimizing resource utilization across compute, GPU clusters, storage, and networking.
  • Build automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem environments.
  • Lead on-call rotations, troubleshoot production issues, and conduct blameless postmortems.
  • Ensure data privacy, compliance, and secure operations across model training and serving environments.
  • Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows.

Benefits

  • Employees at Microsoft are often offered comprehensive, “world-class” benefits—including health and mental-wellness programs, competitive pay with bonuses and stock awards, and retirement/savings options. Time-off and flexibility are common, with generous vacation and holidays, parental and caregiver leave, and flexible work schedules, alongside learning support, employee resource groups, product discounts, and matching-gifts/volunteering programs. Specific benefits can vary by region.

Is this posting expired or inaccurate?