Senior Site Reliability Engineer, HPC and LSF

NVIDIA Corporation

4/20/2025

US, CA, Santa Clara

Full-time

Salary: $184,000 - $287,500


Job Description

NVIDIA is seeking a Senior Site Reliability Engineer to manage and support workload and resource schedulers in a large-scale HPC environment and to automate deployment and monitoring processes.

Requirements

  • Extensive knowledge of job scheduler administration (e.g., IBM Spectrum LSF or Slurm)
  • Proficiency in administering CentOS/RHEL Linux distributions
  • In-depth understanding of container technologies such as Docker
  • Proficiency in UNIX scripting languages and Python
  • Excellent problem-solving skills
  • Excellent communication and teamwork skills
  • 10+ years of experience in a large, distributed Linux environment
  • BS in Computer Science or a similar degree, or equivalent experience

Responsibilities

  • Manage and support workload and resource schedulers in a large-scale HPC environment
  • Automate deployment, configuration management, and operational monitoring
  • Develop solutions for complex computing resource management requirements
  • Troubleshoot complex issues from bare metal to application level
  • Develop, define, and document standard methodologies
  • Collaborate with domain experts to improve the chip development process
  • Contribute to overall quality and time to market for next-generation chips

Benefits

  • Multiple relocation packages
  • Two weeklong shutdowns (mid-summer and year-end) in the US (in addition to PTO)
  • 8-week parental leave
  • 9 Employee Resource Groups
  • Annual bonus offering
  • Flexible work arrangements
  • Up to 6% 401(k) matching