Job Details

ID #20835438
State Texas
City Houston
Job type Contract
Salary USD Depends on Experience
Source ReqRoute, Inc
Showed 2021-10-08
Date 2021-10-06
Deadline 2021-12-04
Category Et cetera

Sr. Site Reliability Engineer

Texas, Houston, 77001 Houston USA

Vacancy expired!

Job Description:
One of our clients urgently needs a Sr. Site Reliability Engineer for one of their large engagements. As a Senior Site Reliability Engineer, you will be responsible for the Incident, Event Management, and Configuration processes. You will work with your incident management team to assess the severity of reported incidents; identify relevant service owners; participate in, initiate, and lead incident management and daily review calls with relevant parties; communicate the progress of incidents via the relevant communication channels; provide technical recommendations; suggest preventive and corrective actions; ensure proper closure of incidents; and continuously audit and improve the Incident and Event Management process. Using data learned from those incidents, you will drive further improvements to our automation, monitoring, and processes to proactively identify and resolve critical incidents.

Location: Remote/Houston, TX
Contract: 12+ Months

Responsibilities:
  • Communicate, document, and facilitate the identification, handling, status reporting, solution options and alternatives, and change implementation for critical incidents and problems identified reactively or proactively in the IT environment, and bring them to resolution or closure.
  • Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support of systems and build automation to remediate and address root causes, with the goal of automating the response to all non-exceptional service conditions.
  • Diagnostics & Monitoring: Instrument the complete application architecture to collect real user and system performance data, giving insight into the root cause of application bottlenecks and enabling real-time visibility to reduce risk exposure.
  • Provide enterprise-wide application, database, and network support in a mature enterprise environment; cloud infrastructure design experience is preferred.
  • Triage issues as they arise and create strategies for long-term, permanent fixes to critical production incidents.
  • Maintain documentation, build tooling, and create alerts to identify and address infrastructure reliability issues.
  • Understand the ecosystem and provide technical recommendations on major incident calls and post incident reviews.
  • Work closely with Service Owners to deliver a clear, concise picture of incidents and the short-term remediation applied; ensure problem prevention methods and mitigation strategies are continually applied to improve application availability and make recommendations on long-term solutions.
  • Effectively document business cases, solution strategies, event and configuration processes, procedures, and knowledge articles associated with implementing fixes and solutions to existing or predicted IT incidents and problems.
  • Participate in routine root cause analysis and problem review meetings, and provide recommendations to service owners to improve service availability.
  • Work with Support Knowledge Managers to build the team knowledge database.
  • Train and mentor the Incident Management team to perform effectively during their shifts.
  • Work scheduled shifts, including mornings, weekends, and nights, or participate in an on-call roster.

Qualifications:
  • Bachelor’s degree in Computer Science, Computer Information Systems, or Management Information Systems
  • 5+ years of experience in a technical operations role, a systems analysis and support role or a DevOps role
  • 2+ years of experience with and strong working knowledge of Amazon Web Services or similar cloud infrastructure platforms (Azure, Google Cloud, etc.)
  • 2+ years of experience with performance monitoring and diagnostic tools (e.g., Datadog, Dynatrace, Splunk, New Relic, Nagios, etc.)
  • Technical knowledge and experience working with Windows/Linux environments, SQL, Active Directory, scripting, etc.
  • Network troubleshooting knowledge including LAN/WAN, DHCP, TCP/IP, Firewalls, and Routing
  • Proven track record supporting large scale environments and applications
  • Superior English language and communication skills, both written and verbal
  • Ability to articulate technical solutions to both technical and non-technical audiences
  • Ability to direct cross-functional resources through incident closure with proper RCA and through the problem management lifecycle.
