Senior Site Reliability Engineer, Observability

NVIDIA
Santa Clara, CA | Full-time | 12/10/2025 | $184k - $356.5k per year
Undergraduate with 5+ Years of Experience

Job Description

NVIDIA is seeking Site Reliability Engineers to work on large-scale observability systems that support AI and data services. The role involves designing resilient telemetry pipelines, automating deployments, and establishing reliability standards while collaborating with various engineering teams.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
  • 10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer.
  • 5+ years designing, building, and running observability platforms at scale.
  • Deep hands-on experience with open-source observability stacks, including Prometheus/Thanos/Mimir, Loki or Elasticsearch/OpenSearch, and Tempo/Jaeger/OpenTelemetry.
  • Strong programming ability in Python and Go, with Java experience considered a plus.
  • Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering.
  • Experience architecting multi-region, multi-tenant telemetry pipelines with high availability.
  • Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies.
  • Strong understanding of SLOs, SLIs, error budgets, incident response, and operational processes.
  • Ability to analyze complex distributed systems and drive data-informed debugging.

Responsibilities

  • Architecting and operating large-scale observability systems that span global regions.
  • Designing resilient pipelines for metrics, logs, traces, profiling, and events.
  • Working closely with platform, infrastructure, and application teams to establish telemetry standards.
  • Automating deployments, scaling workflows, and maintenance tasks.
  • Defining and maintaining SLOs, SLIs, error budgets, dashboards, and alerting models.
  • Building self-service tooling and frameworks for observability.
  • Studying real system behavior to uncover bottlenecks and scaling limits.
  • Running day-to-day operations including upgrades and performance tuning.
  • Leading incident response and root-cause investigations.
  • Guiding engineers through design reviews and operational best practices.

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.