Job Details

ID #41453450
State North Carolina
City Charlotte
Job type Contract
Salary USD Depends on Experience
Source Projas Technologies, LLC
Showed 2022-05-23
Date 2022-05-22
Deadline 2022-07-21
Category Systems/networking

Senior Site Reliability Engineer (Direct End Client)

Charlotte, North Carolina 28201, USA

Vacancy expired!

Incident Management:
- Delivering Incident Command for high-severity incidents
- Running blameless postmortem reviews for high-severity incidents
- Assisting in developing automated incident detection and response improvements

Operational Excellence:
- Delivering data analysis (Incident Management, Change Management, Service Availability, etc.)
- Creating regular reporting and insights, and automating them to reduce manual toil
- Conducting Production Readiness Reviews for new services
- Reviewing upcoming production change requests

- Incident Management: Incident Command for high-severity incidents
- Incident Management: Communications & Updates for high-severity incidents
- Operational Excellence: Reporting and analytics (Incident Management, Change Management, Service Availability, etc.)

- 7+ years of experience in a web-centric Linux production environment, in a NOC or DevOps role, in a continuous release environment
- Experience in running critical incidents from a technical leadership position
- Experience with Computer Engineering with a focus on Infrastructure, Platform, and Application (Cloud, Containerization, Container orchestration, Network, Application Reliability, Database Architecture) and an understanding of full stack and the SDLC (Software Development Life Cycle)
- Experience running and monitoring applications at scale, using metrics and tracing tools like Prometheus, Influx, Grafana, New Relic, Datadog, Stackdriver, Zipkin, etc.
- Professional experience with Python, Go, or similar programming languages
- Experience developing production-quality tooling
- Familiarity with SRE methodologies; passionate about solving operational challenges by using automation and software
- Ability to communicate effectively, both vertically and horizontally within the organization, with demonstrated written and verbal communication skills
- Scala, TypeScript, JS, Java, C

- The team also develops automation and AI capabilities to ensure minimum toil across the engineering organization
- Lead essential incidents in our environment with a focus on troubleshooting and fast restoration of our essential services
- Provide insights on trends in issues affecting reliability, and partner in cross-functional projects to provide scalable solutions
- Review high-risk platform changes to minimize impact to the site
- Work within a large distributed system based on Kubernetes and Google Cloud services
- Maintain an automation-centric vision and incorporate SRE methodologies to increase reliability and decrease toil
- Participate in technical design and architecture decisions and contribute to technical troubleshooting in various parts of the system
