Vacancy expired!
- Architect a framework that is readily available and easy to use; when evaluating new architecture, make build-vs-buy decisions and consider cost.
- Work in coordination with other internal teams to ensure the infrastructure fully and effectively supports current and planned application systems
- Troubleshoot OS, Networking, Storage, and Software issues while leveraging internal teams for solutions.
- Deliver changes to the HPC production platforms according to the change control process, communicating with and seeking approvals from business owners.
- Practice network asset management, including maintenance of network component inventory and related documentation
- Develop tools to deploy, manage, monitor, and troubleshoot HPC systems at scale.
- Maintain asset lists of all servers, applications, and licensing, ensuring compliance.
- Maintain security standards according to internal policies.
- Execute the day-to-day activities of the Incident Management process
- Manage and respond to tickets/requests in accordance with SLA timeframes.
- 3-5 years of experience in High Performance Computing System Administration
- Minimum of 2 years of customer-facing experience in HPC and AWS
- Working knowledge of the complete HPC stack for building high-availability infrastructure
- Demonstrated experience with the AWS Cloud platform
- Strong understanding of, and hands-on experience with, deploying and troubleshooting Compute, Networking, Storage, and Database services on AWS
- Experience designing and implementing HPC clusters using AWS ParallelCluster
- Experience with DevOps tools such as Ansible Tower, Bitbucket, Terraform, and CloudFormation
- Experience administering Linux across various distributions such as Red Hat, Amazon Linux, and CentOS
- Team player with willingness to work in 24x7 environment
- Strong verbal and written communications skills are a must
- Familiarity with Cloud platforms, products, and tools.
- Experience with job schedulers like Grid Engine, LSF, PBS, SLURM, Torque, Symphony, TIBCO.
- Experience with compilers and libraries such as GCC, MPI, and CUDA.
- Experience with scripting (bash, Python, PowerShell, etc.).
- Experience with filesystems such as NFS, Lustre, and GPFS.
- Experience installing and troubleshooting applications on CPU- and GPU-based HPC clusters.
- Experience with Docker, Singularity, Kubernetes, and Google Cloud Platform is a plus.
- Knowledge of distributed computing
- Familiarity with Ansible, Jira, Confluence, ServiceNow, and Excel; strong presentation skills
- Experience building clusters from individual machines (rather than a managed service such as EMR)