Sr. Software Development Engineer, ML Infrastructure Team
Amazon
Location
Cupertino, CA
Type
Full-time
Posted
6/2/2026
Compensation
$193,300 - $261,500 per year
Undergraduate with 5+ Years of Experience
Approval 98.6% · Filings 19,451 · New hires 10,113 · 👑 Elite Sponsor · FY 2025
Job description
The Senior Software Development Engineer will be part of the Machine Learning Infrastructure team at AWS, focusing on enhancing platforms that ensure optimal performance for ML and HPC technologies. This role involves owning and evolving the infrastructure that supports high-quality ML networking software. The team is dedicated to making AWS the leading platform for AI at scale, influencing strategic decisions at the highest levels of the company. Candidates should bring expertise in CI/CD automation, cluster management, and ML/HPC workloads.
Requirements
- 5+ years of experience leading the design or architecture of new and existing systems
- 5+ years of experience across the full software development life cycle, including coding standards and operations
- Experience as a mentor, tech lead, or leading an engineering team
- 5+ years of non-internship professional software development experience
- 5+ years of programming experience with at least one software programming language
- Experience coding in Python and TypeScript, and working with the AWS CDK
- Bachelor's degree in computer science or equivalent
- Experience with HPC job schedulers such as SLURM, with Jenkins, or with GPU compute infrastructure
- Experience with AWS infrastructure services such as EC2, CDK, and CloudFormation
- Familiarity with ML/HPC networking or collective communication libraries
- Experience building automation or tooling using large language models
Responsibilities
- Own the infrastructure that monitors and reports on functionality and performance of testing workloads.
- Build and operate CI/CD systems to automate the testing and delivery of ML networking libraries.
- Write Python code to orchestrate large clusters and run benchmarks across various instance types.
- Use AWS Managed Grafana and Athena to analyze performance data and build dashboards.
- Build intelligent automation using LLMs to analyze test failures and generate reports.
- Drive cross-team readiness for new instance type launches by delivering performance data.
- Manage GPU compute capacity planning and provisioning across the organization.
- Ensure all infrastructure is defined as code, reviewed, and committed to automated pipelines.
Benefits
- Employees at Amazon are often offered comprehensive health benefits, including multiple medical plan options (no pre-existing condition exclusions, 100% covered in-network preventive care), dental and vision plans, a 24/7 medical advice line from day one, expert second-opinion services, and broad mental-health support with several free counseling sessions (including pediatric). Financial wellness typically includes a 401(k) with company match (up to 2%), Restricted Stock Units (equity), FSAs, an emergency savings program, product and partner discounts, and college-savings and home-purchase programs. Overall, the package is designed to support the health, finances, and day-to-day life of employees and their families.
