Linux System Admin Storage Admin job vacancy

Vacancy expired!

GBS requires the services of a Storage System Administrator III to provide labor services to support the DOE National Energy Research Scientific Computing Center (NERSC) Storage Systems Group (SSG) Team’s hardware and software environment at the Lawrence Berkeley National Laboratory’s NERSC facilities in Berkeley, CA. The hardware and software are part of a High Performance Computing (HPC) system environment and include storage systems, servers in support of storage systems, storage services, software, and network components. The work will require active interaction/participation with clients and the Team to troubleshoot and resolve technical issues with production storage systems.

Baseline Hardware and Software Environment Support

Baseline Equipment, QTY Estimated:

14 racks of Elastic Storage System computer storage - Community File System - manufactured by IBM
43 disk arrays - NetApp
80 storage servers - Supermicro
16 storage arrays - Nexsan
12 elastic storage system enclosures - IBM
44 storage servers - test development environment - Supermicro
48 mid-range servers - HPE

164 enterprise tape drives - installed in IBM tape libraries
3 tape libraries - manufactured by IBM
3 director-level Fibre Channel switches - Cisco

Baseline Software:

IBM Spectrum Scale
IBM Red Hat Linux, CentOS
High Performance Storage System

Ansible

Team Interaction/Participation

Participate in weekly team meetings to maintain awareness of open projects and goals, to the extent allowable given the confidentiality of internal information and activities with other vendors (NDAs, etc.)
Monitor Slack for direct messages and other channels for issues related to storage systems

limit to certain channels at the discretion of the University

Respond to email in a timely manner as determined by the University Technical Representative
Participate as a proactive team member
Participate in on-call 24/7 responsibilities
Production storage system problem determination and resolution

engage with other team members for advice when in doubt and vendor support when needed
one-week rotation shared with 3-4 other individuals
average < 5 off-hours calls per person per year
2 hour on-site response time in emergency situations

Education and Training

Knowledge Transfer from the Subcontractor (or other vendors) to University

draw in relevant parties to deliver content
establish connections to knowledge/technology providers
keep storage team aware of developments and research efforts

Hardware activities

Communicate discovered and suspected hardware issues to the storage team

Slack or email for awareness
ServiceNow ticket for tracking status and closure

Monitor for and respond to hardware issues on all systems from multiple vendors as needed, open support cases with upstream vendors

coordinate with SSG team for replacement of components live or with down-time when required
monitoring requires proactive parsing of logs and checking of graphical user interfaces (GUIs) to detect issues, rather than reactively waiting until something fails
see issues through to resolution

e.g. disk controller failure: confirm that the replacement is requested, arrives, and is installed, and that the failed part is sent back under the return material authorization (RMA)

Amber light walk at least weekly
Work with on-site technicians as needed from University and vendors
Install/de-install hardware as needed

rack and cable both new and existing equipment
contribute to larger-scale integration responsibilities shared with other groups; e.g. making storage system available to new compute resources

Software activities

At the Client’s discretion -

Determine for all storage system components (OS/kernel/firmware/etc.) when updates are needed

Read release notes, determine any impact of upgrades, fixes provided
communicate concerns/issues to the team
Via GitLab issues, document upgrade plan, date of change(s) and systems involved, any issues encountered, potential risks
For new systems, implement a new baseline for OS/kernel/Mellanox OFED (MOFED)/GPFS/(Lustre version for data transfer nodes) across all systems, with input from the University’s engineers

Identify areas for routine process optimization and implement solutions

Automation of common tasks, contributing to monitoring infrastructure
Develop scripts and tools and contribute them to the internal GitLab repository
Contribute to integration and implementation planning for future system upgrades and deployments

Assist with debugging integration between storage systems and other systems

e.g. IBM Spectrum Scale (previously GPFS) and Cray Data Virtualization Service (DVS); HPSS client testing

Required skills/Level of Experience:

Bachelor’s degree or equivalent experience, and a minimum of five years of computing or storage experience.
Experience using one or more interpreted programming or scripting languages such as Python and Bash to automate system management tasks.
Working knowledge of parallel storage technologies such as distributed storage systems, parallel file systems, object stores, hierarchical storage management, storage networking, and/or relevant hardware technologies.
Strong understanding of Linux fundamentals including file systems, networking, and automation tools like Ansible or Puppet.
Understanding of file system internals, prior work developing storage systems, or experience troubleshooting and optimizing parallel I/O.
Experience using or administering one or more HPC storage system technologies (e.g., Lustre, Spectrum Scale, HPSS, Panasas).
Strong written and verbal communication skills and the ability to document and describe complex tasks to audiences of varying familiarity with storage technologies.
Ability to work effectively and collaboratively on a team and to lead technical projects, as well as to give and receive constructive feedback to foster communication and trust.
Strong sense of intellectual curiosity, self-direction, and desire to pursue challenging problems and understand complex systems.
Experience running cables, cable management, racking systems, and labeling
Provides technical support and analysis of infrastructure project and production environment
Develops upgrade/improvement recommendations; monitors, plans, measures, and tests new products and services
Works on client/enterprise technologies, software configurations management, operating support systems and distribution, storage area networks
Works on data center technologies such as network (LAN, WAN, router) management, server
Has demonstrated contributions to the high-performance storage community (e.g., conference presentations, open source software). Ability to present and describe systems and issues to technical staff as well as higher level management.
Strong organizational skills and ability to effectively manage priorities across many projects ranging from immediate problem resolution to long-term strategic planning.

Clearance Requirements

U.S. citizen required.


ID	#15502657
State	California
City	Berkeley
Job type	Contract
Salary	USD $50 - $70
Source	CoreHive Computing LLC
Showed	2021-06-16
Date	2021-06-15
Deadline	2021-08-14
Category	Et cetera
