Job Details

ID #15502657
State California
City Berkeley
Job type Contract
Salary USD $50 - $70
Source CoreHive Computing LLC
Showed 2021-06-16
Date 2021-06-15
Deadline 2021-08-14
Category Et cetera

Linux System Admin / Storage Admin

Berkeley, California 94701, USA

Vacancy expired!

GBS requires the services of a Storage System Administrator III to provide labor services to support the hardware and software environment of the DOE National Energy Research Scientific Computing Center (NERSC) Storage Systems Group (SSG) Team at the Lawrence Berkeley National Laboratory’s NERSC facilities in Berkeley, CA. The hardware and software are part of a High Performance Computing (HPC) system environment and include storage systems, servers in support of storage systems, storage services, software, and network components. The work will require active interaction and participation with clients and the Team to troubleshoot and resolve technical issues with production storage systems.

Baseline Hardware and Software Environment Support

Baseline Equipment, QTY Estimated:
  • 14 racks of Elastic Storage System computer storage - Community File System - manufactured by IBM
  • 43 disk arrays - NetApp
  • 80 storage servers - Supermicro
  • 16 storage arrays - Nexsan
  • 12 elastic storage system enclosures - IBM
  • 44 storage servers - test development environment - Supermicro
  • 48 mid-range servers - HPE
  • 164 enterprise tape drives - installed in IBM tape libraries
  • 3 tape libraries - manufactured by IBM
  • 3 director-level Fibre Channel switches - Cisco

Baseline Software:
  • IBM Spectrum Scale
  • IBM Red Hat Linux, CentOS
  • High Performance Storage System
  • Ansible

Team Interaction/Participation
  • Participate in weekly team meetings to maintain awareness of open projects and goals, to the extent allowable given the confidentiality of internal information and activities with other vendors (NDAs, etc.)
  • Monitor Slack for direct messages and other channels for issues related to storage systems
      • limit to certain channels at the discretion of the University
  • Respond to email in a timely manner as determined by the University Technical Representative
  • Participate as a proactive team member
  • Participate in on-call 24/7 responsibilities
  • Production storage system problem determination and resolution
      • engage with other team members for advice when in doubt and vendor support when needed
      • one-week rotation between 3-4 other individuals
      • average < 5 off-hours calls per person per year
      • 2 hour on-site response time in emergency situations

Education and Training
  • Knowledge Transfer from the Subcontractor (or other vendors) to University
      • draw in relevant parties to deliver content
      • establish connections to knowledge/technology providers
      • keep storage team aware of developments and research efforts

Hardware activities
  • Communicate discovered and suspected hardware issues to the storage team
      • Slack or email for awareness
      • ServiceNow ticket for tracking status and closure
  • Monitor for and respond to hardware issues on all systems from multiple vendors as needed, open support cases with upstream vendors
      • coordinate with SSG team for replacement of components live or with down-time when required
      • monitoring requires proactively parsing logs and checking graphical user interfaces (GUIs) to detect issues, rather than reactively waiting until something fails
      • see issues through to resolution
        • e.g. disk controller failure: confirm that the replacement is requested, arrives, and is installed, and that the failed part is sent back under a return material authorization (RMA)
  • Amber light walk at least weekly
  • Work with on-site technicians as needed from University and vendors
  • Install/de-install hardware as needed
      • rack and cable both new and existing equipment
      • contribute to larger-scale integration responsibilities shared with other groups; e.g. making storage system available to new compute resources

Software activities
At the Client’s discretion:
  • Determine for all storage system components (OS/kernel/firmware/etc.) when updates are needed
      • Read release notes, determine any impact of upgrades, fixes provided
      • communicate concerns/issues to the team
      • Via GitLab issues, document the upgrade plan, date of change(s) and systems involved, any issues encountered, and potential risks
      • For new systems, implement a new baseline for OS/kernel/Mellanox OFED (MOFED)/GPFS (and Lustre version for data transfer nodes) across all systems, with input from the University’s engineers
  • Identify areas for routine process optimization and implement solutions
      • Automation of common tasks, contributing to monitoring infrastructure
      • Develop scripts and tools and contribute them to the internal GitLab repository
      • Contribute to integration and implementation planning for future system upgrades and deployments
  • Assist with debugging integration between storage systems and other systems
      • e.g. IBM Spectrum Scale (previously GPFS) and Cray Data Virtualization Service (DVS); HPSS client testing

Required skills/Level of Experience:
  • Bachelor’s degree or equivalent experience and a minimum of five years of computing or storage experience; or equivalent experience
  • Experience using one or more interpreted programming or scripting languages such as Python and Bash to automate system management tasks.
  • Working knowledge of parallel storage technologies such as distributed storage systems, parallel file systems, object stores, hierarchical storage management, storage networking, and/or relevant hardware technologies.
  • Strong understanding of Linux fundamentals including file systems, networking, and automation tools like Ansible or Puppet.
  • Understanding of file system internals, prior work developing storage systems, or experience troubleshooting and optimizing parallel I/O.
  • Experience using or administering one or more HPC storage system technologies (e.g., Lustre, Spectrum Scale, HPSS, Panasas).
  • Strong written and verbal communication skills and the ability to document and describe complex tasks to audiences of varying familiarity with storage technologies.
  • Ability to work effectively and collaboratively on a team and to lead technical projects, as well as to give and receive constructive feedback to foster communication and trust.
  • Strong sense of intellectual curiosity, self-direction, and desire to pursue challenging problems and understand complex systems.
  • Experience running cables, cable management, racking systems, and labeling
  • Provides technical support and analysis of infrastructure project and production environment
  • Develops upgrade/improvement recommendations; monitors, plans, measures, and tests new products and services
  • Works on client/enterprise technologies, software configurations management, operating support systems and distribution, storage area networks
  • Works on data center technologies such as network (LAN, WAN, router) management, server
  • Has demonstrated contributions to the high-performance storage community (e.g., conference presentations, open source software). Ability to present and describe systems and issues to technical staff as well as higher level management.
  • Strong organizational skills and ability to effectively manage priorities across many projects ranging from immediate problem resolution to long-term strategic planning.

Clearance Requirements
U.S. citizen required.

