Member of Technical Staff, Hardware Health - MAI Superintelligence Team
Microsoft
Location
USA (Multiple Locations)
Type
Full-time
Posted
5/5/2026
Compensation
$139,900 - $331,200 per year
Undergraduate with 5+ Years of Experience
Approval 98.4% · Filings 6,363 · New hires 3,142 · 👑 Elite Sponsor · FY 2025
Job description
The Member of Technical Staff, Hardware Health at Microsoft AI will focus on ensuring the reliability, performance, and availability of advanced AI training infrastructure. This role involves collaborating with various engineering teams to develop predictive health models and failure detection frameworks. The team is dedicated to advancing consumer AI products like Copilot and generative AI research. The position emphasizes innovation and a growth mindset within a culture of inclusion.
Requirements
- Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR Bachelor's Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Experience working with large-scale HPC or GPU systems (NVIDIA H100/GB200 or equivalent).
- Deep understanding of GPU architecture, high-speed interconnects (NVLink, InfiniBand, RoCE), and large datacenter topologies.
- Proficiency in hardware telemetry, diagnostics, or failure analysis tools.
- Experience with exascale-class systems or cloud-scale AI clusters.
- Familiarity with reliability modeling, machine learning-based anomaly detection, or predictive maintenance.
- Contributions to large-scale infrastructure operations, supercomputing centers, or AI hardware design.
Responsibilities
- Design and develop next-generation hardware health monitoring and diagnostic frameworks for large GPU clusters.
- Build predictive analytics pipelines leveraging telemetry, power, and thermal data to anticipate hardware degradation and systemic issues.
- Collaborate with silicon, firmware, and datacenter engineers to identify root causes and remediate large-scale hardware anomalies.
- Define system health KPIs and integrate them into real-time observability platforms.
- Lead incident triage for high-impact GPU, network, and cooling issues across distributed clusters.
- Drive automation in health management so that manual intervention is needed only for the top 5% of anomalies.
- Partner with cross-functional teams to influence hardware design for reliability, thermal efficiency, and serviceability.
Benefits
- Employees at Microsoft are often offered comprehensive, “world-class” benefits, including health and mental-wellness programs, competitive pay with bonuses and stock awards, and retirement/savings options. Time off and flexibility are common, with generous vacation and holidays, parental and caregiver leave, and flexible work schedules, alongside learning support, employee resource groups, product discounts, and matching-gifts/volunteering programs. Specific benefits can vary by region.
