Vacancy expired!
- Increase operational efficiencies to proactively reduce and mitigate production incidents.
- Collaborate closely with senior technical leadership and other engineering teams to drive the operational and observability capabilities of the Platform and Services.
- Lead teams of experienced support engineers to meet or exceed expectations on incident SLAs, and take responsibility for customer-facing platforms, applications, and services in a 24x7 environment.
- Provide the call to leadership to mitigate critical incidents.
- Understand the full technology stack of systems in the assigned domain.
- Collaborate with other tech leads and support teams to ensure integrated end-to-end availability, reliability, and performance.
- Define support strategies for systems in the cloud (AWS, Azure).
- Influence resiliency and scalability in production environments in Amazon Web Services and other cloud platforms.
- Identify and drive resolution of monitoring and alerting gaps.
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automated response to all non-exceptional service conditions.
- Engage in service capacity planning and demand forecasting, software performance analysis, and system tuning.
- Lead the team to design, write, and deliver technical and process automation to improve the availability, scalability, latency, and efficiency of 7-Eleven's services.
- Identify and remediate risk to critical and non-critical system KPIs.
- Proven experience defining and implementing enterprise-level observability strategy, including rollout of tools to gather required telemetry (events, metrics, logs, traces).
- Hands-on lead who can drive change and influence the firm's telemetry products and services to meet the needs of CB's business and application teams.
- Ability and relevant experience working with cross-line-of-business teams, service providers, and partner organizations to ensure consistent SRE strategy and practices.
- Experience implementing SRE practices and adopting SLIs and SLOs across large teams.
- Understanding of networking and cloud technologies, for example security, load balancing, and network routing protocols.
- Customer-focused, understanding specific use cases from different groups and challenging the team to build solutions that reduce user issues and can be easily supported.
- Educate application developers on the best way to understand the runtime state of a product effectively through telemetry.
Qualifications
- Deep technical understanding of monitoring and telemetry capabilities.
- Strong knowledge and experience across multiple platforms, including public and private cloud architecture (AWS, Azure, etc.).
- Strong knowledge of and experience with MongoDB.
- Strong working knowledge of modern development technologies and tools such as Agile, CI/CD, Git, and Jenkins.
- Development experience with JavaScript, Node.js, Python, and associated frameworks.
- Strong written and verbal communication and organization skills, including technical writing and documentation.
- Passionate about good engineering practices and new technologies, with proven leadership skills and a drive for continuous improvement.