Site Reliability Engineer II - CTJ - Secret
Job description
The Site Reliability Engineer (SRE) role within the IDEAS organization focuses on automation, incident response, and reliability improvements for services in regulated government cloud environments. The team collaborates with various partners across Microsoft to solve complex problems using modern data platforms and AI-assisted tooling. This position emphasizes improving service reliability and operational excellence while ensuring compliance with security and privacy requirements. Candidates will engage in live site operations and contribute to the evolution of systems at scale.
Requirements
- Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience.
- Bachelor's Degree in Computer Science, or related technical discipline with proven experience coding in languages including C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Experience with automation, live site operations, and incident response in large-scale cloud or distributed systems.
- Proficiency in at least one programming or scripting language such as C#, Java, Python, or PowerShell.
- Strong analytical and problem-solving skills, including experience using telemetry and operational data to inform decisions.
- Effective written and verbal communication skills, and experience collaborating across teams and disciplines.
- Ability to meet Microsoft, customer, and/or government security screening requirements.
Responsibilities
- Participate as a Designated Responsible Individual (DRI) in a 24x7 on-call rotation, monitoring service health and responding to incidents within defined SLAs.
- Design, build, and maintain automation for deployment, operations, and incident mitigation to improve reliability and reduce manual effort.
- Instrument services for observability; collect and analyze telemetry and health signals to guide reliability and performance improvements.
- Collaborate with engineering partners and stakeholders to align on goals and deliver user-focused solutions.
- Apply engineering best practices for development, scaling, and operational excellence to meet performance and customer requirements.
- Support compliance with security, privacy, and accessibility requirements throughout service onboarding and ongoing operations.
- Continuously learn and adopt industry practices and internal tools to improve reliability, performance, and observability.
Benefits
- Microsoft employees are often offered comprehensive, “world-class” benefits, including health and mental-wellness programs, competitive pay with bonuses and stock awards, and retirement and savings options. Time off and flexibility are common, with generous vacation and holidays, parental and caregiver leave, and flexible work schedules, alongside learning support, employee resource groups, product discounts, and matching-gift and volunteering programs. Specific benefits can vary by region.
