Senior Site Reliability Engineer, DGX Cloud

NVIDIA

10/15/2025

California

Full-time

Salary: $208k - $333.5k per year


Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to maintain high-performance DGX Cloud clusters for AI researchers and enterprise clients worldwide.

Requirements

  • BS in Computer Science or related technical field, or equivalent experience
  • 12+ years of experience operating production services at scale
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
  • Proficiency in at least one high-level programming language (e.g., Python, Go)
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
  • Demonstrated ability to troubleshoot complex DNS, network, Kubernetes, and systems issues in production environments
  • Strong working knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
  • Experience building and operating comprehensive observability stacks using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, Datadog, etc.

Responsibilities

  • Support large-scale Kubernetes services before they launch
  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Maintain services by measuring and monitoring availability, latency, and overall system health
  • Operate and optimize GPU workloads across various cloud platforms
  • Scale systems sustainably through automation
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in on-call rotation to support production services

Benefits

  • NVIDIA employees are often offered comprehensive day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also come with generous time off and holidays, donation matching (up to $10,000), and extras such as FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, family care resources, and expert medical services.