Sr. Software Engineer, EC2 Instance Networking

Amazon.com, Inc.

8/3/2025

Sunnyvale, CA

Full-time

Salary: $151,300 - $261,500 per year


Job Description

Join our team building the scale-out networking backbone that powers the world's largest AI training clusters. We're developing high-performance RDMA and RoCE solutions that enable distributed training of trillion-parameter models across thousands of compute nodes on AWS infrastructure. As a senior engineer, you'll drive technical architecture decisions and lead the development of next-generation distributed AI training infrastructure.

Requirements

  • Experience as a mentor or tech lead, or leading an engineering team
  • 5+ years of non-internship professional software development experience
  • 5+ years of programming experience with at least one software programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of programming experience in C/C++ with focus on high-performance distributed systems
  • 5+ years of leading design or architecture of large-scale networked systems
  • Deep expertise in RDMA technologies, RoCE implementations, and high-performance networking
  • Extensive experience with collective communication libraries (NCCL, RCCL, OneCCL, MPI)
  • Experience as a technical lead or leading engineering teams on complex infrastructure projects

Responsibilities

  • Lead the design and development of high-performance networking software solutions utilizing RDMA and RoCE technologies for large-scale AI clusters
  • Architect SmartNIC integration strategies with EC2 control plane systems and define API specifications
  • Drive optimization of collective communication patterns and multi-rack networking protocols for distributed AI training
  • Lead development of comprehensive performance monitoring, metrics collection, and benchmarking infrastructure
  • Design automated testing frameworks and stress testing methodologies for large-scale distributed systems
  • Lead complex system-level debugging efforts across hardware acceleration, kernel networking, and distributed applications
  • Define technical architecture and strategy for next-generation scale-out AI cluster networking
  • Provide technical leadership and mentoring to engineering teams
  • Drive cross-functional collaboration with hardware, cloud infrastructure, and AI platform teams
  • Lead technical design reviews and establish engineering best practices

Benefits

  • Medical, dental, and vision coverage (multiple plan options)
  • Health savings and flexible spending accounts
  • 24/7 Employee Assistance Program for mental health support
  • 401(k) with company match and various investment choices
  • Company-paid life/AD&D and disability insurance
  • Restricted stock units (RSUs) for employee ownership
  • Child care and elder care referrals, adoption assistance
  • Paid maternity and parental leave (including Ramp Back)
  • Paid time off (PTO) and company holidays
  • Employee discount on Amazon.com products
  • Career Choice program (95% tuition and textbook coverage)

