Principal Site Reliability Engineer - Enterprise AI Platform

NVIDIA Corporation

7/18/2025

US, CA, Santa Clara

Full-time

Salary: $248,000 - $391,000


Job Description

NVIDIA is looking to hire a deeply technical and creative Site Reliability Engineer to build, support, and maintain the next generation of AI-powered enterprise products.

Requirements

  • 15+ years of working experience in cloud, platform or SRE roles
  • Bachelor's or Master's degree in Engineering, Computer Science, or a related field, or equivalent experience
  • Proficient in one or more programming languages: Python, Go, Perl, or Ruby
  • Hands-on experience operating and scaling distributed systems 24x7x365 in public cloud, private cloud, hybrid cloud, or on-prem environments
  • Has delivered software with a full understanding of deploying applications to Kubernetes clusters, including GPU and CPU pod scheduling
  • Has maintained and managed microservices for AI platforms (Inference, Training, Evaluation, Ingestion)
  • Hands-on experience in deploying, supporting, and supervising new and existing services, platforms, and application stacks
  • Experience with CI/CD systems such as Jenkins or GitHub Actions
  • Background with Infrastructure as Code (IaC) methodologies and relevant tools
  • Extensive experience working with MS Windows Server and/or Linux operating systems
  • Solid communication skills, demonstrating the ability to comprehend and articulate technical issues to a non-technical audience

Responsibilities

  • Collaborate on translating business objectives into actionable plans
  • Address operational challenges, automate processes, and iterate for efficiency
  • Tackle systemic reliability issues with multi-functional teams
  • Monitor, optimize, and manage system performance and resources
  • Institute validated practices for reliability, remediations, and troubleshooting
  • Design, deploy, and automate production support, documenting essential knowledge
  • Navigate intricate tasks with a deep understanding of SRE principles
  • Lead cross-organizational projects from inception to completion
  • Mentor and train junior engineers for professional development
  • Serve as a subject matter expert in core team functions

Benefits

  • Multiple relocation packages
  • Two weeklong shutdowns (mid-summer and year-end) in the US (in addition to PTO)
  • 8-week parental leave
  • 9 Employee Resource Groups
  • Annual bonus offering
  • Flexible work arrangements
  • Up to 6% 401(k) matching
