Senior Solutions Architect, AI Factory Observability and Visualization - NVIS

NVIDIA

Location

Remote; Austin, TX; Durham, NC; Santa Clara, CA

Type

Full-time

Posted

6/25/2026

Compensation

$184,000 - $356,500 per year

Undergraduate with 5+ Years of Experience

Elite Sponsor (FY 2025): Approval 99.2% · Filings 1,781 · New hires 873

Job description

NVIDIA's Infrastructure Specialists team is seeking a Senior Solutions Architect focused on AI Factory Observability & Visualization. This remote position improves visibility into HPC systems and AI factories by transforming complex telemetry into actionable insights. The role requires an end-to-end understanding of HPC/AI systems and involves collaboration with hardware, software, networking, datacenter, and product teams to optimize performance. The ideal candidate will be responsible for ensuring system health and readiness through observability and automation.

Requirements

Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Practical experience working with observability systems such as Prometheus, Grafana, or similar.
Experience transforming metrics, logs, and traces into clear, actionable insights for complex distributed environments.
Familiarity with GPU and fabric telemetry and using it to diagnose performance regressions.
Strong communication skills and the ability to work effectively with cross-functional teams.

Responsibilities

Run AI factory validation tools, microbenchmarks, and workloads to assess system health and performance.
Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
Establish the metrics, logs, and signals that confirm a system is functioning well, and define the thresholds that indicate a problem.
Build and extend the telemetry surface across hardware, fabric, and workload.
Serve as the observability expert, investigating gaps in visibility so that the telemetry reflects true system behavior.
Develop automation for collecting, transforming, and presenting system and network data.
Recommend improvements to system visibility, data sources, and reporting.
Collaborate with hardware, software, networking, datacenter, and product groups to prepare HPC systems and AI factories for customer deployment.

Benefits

Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
