Vacancy expired!
- Independently designs, implements, productionizes and maintains site reliability guidelines, processes and systems
- Service Level Definition, Configuration and Measurement:
  - Define SLIs, SLOs & SLAs specific to each application or system
  - Configuration of monitoring & alerting tools suitable for each product and/or platform team
  - Measure reliability & resilience (through pre-defined SLIs & SLOs) utilizing monitoring/alerting tools to drive continuous improvement based on data analysis
- Incident Management:
  - Facilitation of incident response through the engagement of various teams and stakeholders, while providing robust communication and visibility to the organization during service interruptions
  - Provide Root Cause Analysis for failures
  - Experience with a modern incident management platform to effectively drive incident response and problem resolution
- Monitoring & Alerting:
  - Debug defects and develop dashboards using modern monitoring tools (e.g. New Relic, Splunk, AIOps) to reduce MTTD (mean time to detect) and MTTR (mean time to resolve)
  - Build monitors and alerts designed to manage SLAs, optimize performance, and minimize outages
  - Construct E2E customer journey dashboards and alerts for customized transactions and applications
- Automates reliability requirements into system and application implementations and updates, including the implementation of self-healing solutions (Ansible, Terraform, etc.)
- Work with the product management team to contribute to 1) the identification of reliability features & requirements and 2) level-of-effort estimates
- Candidates should have 10+ years' experience in SRE and in one or both of the following roles: DevOps, Software Engineering, leveraging automation extensively to achieve key deliverables.