- Ability to effectively verbalize, document, and communicate critical incidents and problems identified reactively or proactively in the IT environment; facilitate their identification, handling, status reporting, solution options and alternatives, and change implementation; and bring them to resolution and/or closure.
- Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support of systems and build automation to remediate issues and address their root cause, with the goal of automating the response to all non-exceptional service conditions.
- Diagnostics & Monitoring: Instrument the complete application architecture to collect real-user and system performance data that gives insight into the root cause of application bottlenecks and enables real-time visibility to reduce risk exposure.
- Provide enterprise-wide application, database, and network support in a mature enterprise environment; cloud infrastructure design experience preferred.
- Triage issues as they arise and create strategies for long-term, permanent fixes to critical production incidents.
- Maintain documentation, build tooling, and create alerts to both identify and address infrastructure reliability issues.
- Understand the ecosystem and provide technical recommendations on major incident calls and in post-incident reviews.
- Work closely with Service Owners to deliver a clear, concise picture of incidents and the short-term remediation applied; ensure problem prevention methods and mitigation strategies are continually applied to improve application availability; and recommend long-term solutions.
- Effectively document business cases, solution strategies, event and configuration processes, procedures, and knowledge articles associated with implementing fixes and solutions to existing or predicted IT incidents and problems.
- Participate in routine root cause analysis and problem review meetings, and provide recommendations to service owners to improve service availability.
- Work with Support Knowledge Managers to build the team knowledge database.
- Train and mentor the Incident Management team to perform effectively during their shifts.
- Ability to work scheduled shifts, including mornings, weekends, and nights, or to participate in an on-call roster.
- Bachelor’s degree in Computer Science, Computer Information Systems, or Management Information Systems
- 5+ years of experience in a technical operations role, a systems analysis and support role or a DevOps role
- 2+ years of experience with, and strong working knowledge of, Amazon Web Services or similar cloud infrastructure platforms (Azure, Google Cloud, etc.)
- 2+ years of experience with performance monitoring and diagnostic tools (e.g., Datadog, Dynatrace, Splunk, New Relic, Nagios)
- Technical knowledge and experience working with Windows/Linux environments, SQL, Active Directory, scripting, etc.
- Network troubleshooting knowledge, including LAN/WAN, DHCP, TCP/IP, firewalls, and routing
- Proven track record supporting large scale environments and applications
- Superior English language and communication skills, both written and verbal
- Ability to articulate technical solutions to both technical and non-technical audiences
- Ability to direct cross-functional resources through incident closure with proper RCA and through the problem management lifecycle.