Senior Solutions Architect, AI Factory Observability and Visualization - NVIS

NVIDIA

Location

Remote; Austin, TX; Durham, NC; Santa Clara, CA

Type

Full-time

Posted

6/25/2026

Compensation

$184,000 - $356,500 per year

Undergraduate with 5+ Years of Experience
Approval 99.2% · Filings 1,781 · New hires 873 · Elite Sponsor · FY 2025

Job description

NVIDIA's Infrastructure Specialists team is seeking a Senior Solutions Architect focused on AI Factory Observability & Visualization. This remote position aims to enhance the visibility of HPC systems and AI factories by transforming complex telemetry into actionable insights. The role requires a comprehensive understanding of HPC/AI systems and involves collaboration with various teams to optimize performance. The ideal candidate will be responsible for ensuring system health and readiness through observability and automation.

Requirements

  • Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
  • 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
  • Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
  • Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
  • Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
  • Practical experience working with observability systems such as Prometheus, Grafana, or similar.
  • Experience transforming metrics, logs, and traces into clear, actionable insights for complex distributed environments.
  • Familiarity with GPU and fabric telemetry and using it to diagnose performance regressions.
  • Strong communication skills and the ability to work effectively with cross-functional teams.

Responsibilities

  • Run AI factory validation tools, microbenchmarks, and workloads to assess system health and performance.
  • Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
  • Establish metrics, logs, and signals that confirm a system is functioning well and identify thresholds for issues.
  • Build and extend the telemetry surface across hardware, fabric, and workload.
  • Serve as the observability expert, investigating gaps in visibility to ensure it reflects true system behavior.
  • Develop automation for collecting, transforming, and presenting system and network data.
  • Recommend improvements to system visibility, data sources, and reporting.
  • Collaborate with hardware, software, networking, datacenter, and product groups to prepare HPC systems and AI factories for customer deployment.

Benefits

  • Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
