Senior Software Engineer, AI Resiliency
Job description
NVIDIA is seeking a Senior Software Engineer for AI Resiliency to lead the development of software features that enhance the reliability of AI supercomputers. This role involves working with a team dedicated to minimizing downtime and ensuring robust AI systems. The engineer will contribute to large-scale distributed systems and collaborate with various teams to integrate resiliency features into popular AI frameworks. This position offers the opportunity to work on cutting-edge AI infrastructure challenges.
Requirements
- Bachelor's, Master's, or PhD in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- Proficiency in C++ and Python with experience in writing efficient, high-performance code.
- 6+ years of relevant experience in software engineering.
- Strong understanding of distributed systems concepts, parallel programming, and fault tolerance.
- Familiarity with AI frameworks such as PyTorch, JAX/XLA, or TensorFlow.
- Experience with debugging and profiling tools like gdb, perf, valgrind, or NVIDIA Nsight.
- Excellent problem-solving skills and ability to work in a collaborative environment.
Responsibilities
- Implement and optimize software features that improve AI system reliability at scale.
- Contribute to large-scale distributed systems with high-quality C++ and Python code.
- Work on AI system error handling and develop monitoring tools for failure mitigation.
- Collaborate with senior engineers and AI researchers to integrate resiliency features into AI frameworks.
- Develop and implement tests to ensure the robustness and efficiency of resiliency mechanisms.
- Assist in debugging and performance tuning of large-scale AI workloads in cloud and HPC environments.
Benefits
- Employees at NVIDIA are often offered comprehensive, day-one benefits—including medical, dental, and vision coverage with HSA support, life and disability insurance, an Employee Assistance Program, and a 401(k) with auto-enrollment. Many roles also have generous time off and holidays, donation matching (up to $10,000), and a wide menu of extras like FSAs, commuter benefits, legal and identity-theft protection, pet insurance, and wellness discounts. Optional programs can include student-loan and home-purchase support, plus family care resources and expert medical services.
