Job Description

NVIDIA is seeking Site Reliability Engineers to work on large-scale observability systems that support AI and data services. The role involves designing resilient telemetry pipelines, automating deployments, and establishing reliability standards while collaborating with various engineering teams.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
10+ years operating large-scale production systems in roles such as SRE, Production Engineer, or Platform Engineer.
5+ years designing, building, and running observability platforms at scale.
Deep hands-on experience with open-source observability stacks, including Prometheus/Thanos/Mimir, Loki or Elasticsearch/OpenSearch, and Tempo/Jaeger/OpenTelemetry.
Strong programming ability in Python and Go, with Java experience considered a plus.
Solid grounding in Linux internals, networking, storage systems, distributed systems, concurrency, and performance engineering.
Experience architecting multi-region, multi-tenant telemetry pipelines with high availability.
Proven skill in optimizing PromQL, LogQL, trace queries, ingestion paths, indexing strategies, and retention policies.
Strong understanding of SLOs, SLIs, error budgets, incident response, and operational processes.
Ability to analyze complex distributed systems and drive data-informed debugging.

Responsibilities

Architecting and operating large-scale observability systems that span global regions.
Designing resilient pipelines for metrics, logs, traces, profiling, and events.
Working closely with platform, infrastructure, and application teams to establish telemetry standards.
Automating deployments, scaling workflows, and maintenance tasks.
Defining and maintaining SLOs, SLIs, error budgets, dashboards, and alerting models.
Building self-service tooling and frameworks for observability.
Studying real system behavior to uncover bottlenecks and scaling limits.
Running day-to-day operations including upgrades and performance tuning.
Leading incident response and root-cause investigations.
Guiding engineers through design reviews and operational best practices.

Benefits

Employees at NVIDIA are often offered comprehensive, day-one benefits: medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.