Senior Solutions Architect - AI Factory Deployment

NVIDIA

Location

USA (Multiple Locations)

Type

Full-time

Posted

5/10/2026

Compensation

$184,000 - $356,500 per year

Undergraduate with 5+ Years of Experience

Approval 99.2% · Filings 1,781 · New hires 873 · 👑 Elite Sponsor · FY 2025

Job description

The Senior Solutions Architect - AI Factory Deployment role at NVIDIA focuses on developing, deploying, and validating AI factories end to end. This position is part of the Infrastructure Specialists team in Santa Clara and emphasizes running and debugging AI/LLM workloads on Linux-based GPU clusters. The architect will enhance performance and scalability through observability and automation while collaborating across teams to ensure AI factories are customer-ready. The role requires expertise in troubleshooting and optimizing complex distributed workloads.

Requirements

Bachelor’s degree or equivalent experience in Computer Science, Mathematics, Engineering, Physics, or a related field.
6+ years of experience managing Linux-based systems in HPC, distributed systems, or large-scale AI/ML environments.
Hands-on experience running AI/ML workloads on multi-GPU and/or multi-node clusters, with practical knowledge of NCCL.
Solid grasp of collective communication patterns, particularly AllReduce and AllToAll, and their application in contemporary ML/LLM training.
Familiarity with LLM training and/or inference workflows using frameworks such as PyTorch or TensorFlow.
Proficiency with Python and Shell/Bash for scripting, automation, and tooling.
Experience with benchmarking, including designing, executing, and interpreting performance benchmarks.
Comfortable working with observability data to troubleshoot and optimize complex distributed workloads.
Strong communication skills and the ability to work effectively with cross-functional teams.

Responsibilities

Set up, adjust, and verify AI factory environments across multi-GPU and multi-node Linux clusters.
Ensure configurations align with guidelines for NCCL, collectives, and distributed training frameworks.
Own the execution of key AI/LLM benchmarks, including setup, orchestration, result collection, and analysis.
Investigate and resolve issues when training jobs or benchmarks fail, hang, or underperform.
Build and improve observability for AI factories to understand workload behavior and system health.
Develop automation for running benchmarks, collecting results, and performing regression checks.
Examine communication patterns and NCCL usage for AI/LLM workloads, focusing on collectives such as AllReduce and AllToAll.
Recommend changes to job configuration, parallelism strategies, and cluster settings to improve throughput, latency, and scaling efficiency.
Work closely with hardware, software, networking, datacenter, and product teams to prepare AI factories for customer use.
Contribute to documentation, guidelines, and readiness collateral that support internal collaborators and customer-facing teams.

Benefits

NVIDIA employees are often offered comprehensive day-one benefits, including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a range of extras such as FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
