Senior Solutions Architect, AI Factory Observability and Visualization - NVIS
NVIDIA
Location
Remote; Austin, TX; Durham, NC; Santa Clara, CA
Type
Full-time
Posted
6/25/2026
Compensation
$184,000 - $356,500 per year
Undergraduate with 5+ Years of Experience
Elite Sponsor (FY 2025): Approval 99.2% · Filings 1,781 · New hires 873
Job description
NVIDIA's Infrastructure Specialists team is seeking a Senior Solutions Architect focused on AI Factory Observability & Visualization. This remote position aims to enhance the visibility of HPC systems and AI factories by transforming complex telemetry into actionable insights. The role requires a comprehensive understanding of HPC/AI systems and involves collaboration with various teams to optimize performance. The ideal candidate will be responsible for ensuring system health and readiness through observability and automation.
Requirements
- Bachelor's degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
- 6+ years of experience managing Linux-based systems in HPC, distributed systems, or large AI/ML settings.
- Hands-on experience with the architecture of multi-GPU and/or multi-node clusters, including networking and interconnects.
- Solid grasp of how HPC and AI factory systems fit together end to end, from network fabric through compute.
- Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
- Practical experience working with observability systems such as Prometheus, Grafana, or similar.
- Experience transforming metrics, logs, and traces into clear, actionable insights for complex distributed environments.
- Familiarity with GPU and fabric telemetry and using it to diagnose performance regressions.
- Strong communication skills and the ability to work effectively with cross-functional teams.
Responsibilities
- Run AI factory validation tools, microbenchmarks, and workloads to assess system health and performance.
- Gain a comprehensive understanding of the system from start to finish, including network topology, interconnects, and compute.
- Establish metrics, logs, and signals that confirm a system is functioning well and identify thresholds for issues.
- Build and extend the telemetry surface across hardware, fabric, and workload.
- Serve as the observability expert, investigating gaps in visibility to ensure it reflects true system behavior.
- Develop automation for collecting, transforming, and presenting system and network data.
- Recommend improvements to system visibility, data sources, and reporting.
- Collaborate with hardware, software, networking, datacenter, and product groups to prepare HPC systems and AI factories for customer deployment.
Benefits
- Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
