Job Details

ID	#23574917
State	Texas
City	Houston
Job type	Contract
Salary	Depends on Experience
Source	ReqRoute, Inc
Showed	2021-11-28
Date	2021-11-23
Deadline	2022-01-22
Category	Et cetera

AWS Cloud Engineer

Texas, Houston

Vacancy expired!

Job Description: One of our clients is in urgent need of an AWS Cloud Engineer for one of their large engagements. Please share your updated resume with contact information if you would like to be evaluated.

As a Senior Site Reliability Engineer, you will be part of the Operations team based in Houston, Texas, and responsible for the Incident, Event Management and Configuration process. You will work with your incident management team to assess the severity of reported incidents, identify relevant service owners, participate in, initiate, and lead incident management and daily review calls with relevant parties, communicate the progress of incidents via relevant communication channels, provide technical recommendations, suggest preventive and corrective actions, ensure proper closure of incidents, and continuously audit and improve the Incident and Event Management process. Using data learned from those incidents, you will drive further improvements into our automation, monitoring, and processes to proactively identify and resolve critical incidents.

Location: Remote/Houston, TX
Contract: 12+ Months

Responsibilities:

Ability to effectively verbalize, document, communicate, and facilitate the identification, handling, status reporting, solution options and alternatives, and change implementation of critical incidents and problems identified reactively or proactively in the IT environment, and to bring them to resolution and/or closure.
Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support of systems and build automation to remediate and address the root cause, with the goal of automating the response to all non-exceptional service conditions.
Diagnostics & Monitoring: Instrument the complete application architecture to capture real user and system performance data, provide insight into the root cause of all application bottlenecks, and enable real-time visibility to reduce risk exposure.
Provide enterprise-wide application, database, and network support in a mature enterprise environment; cloud infrastructure design experience is preferred.
Triage issues as they arise and create strategies for long-term, permanent fixes to critical production incidents.
Maintain documentation, build tooling, and create alerts to both identify and address infrastructure reliability issues.
Understand the ecosystem and provide technical recommendations on major incident calls and post incident reviews.
Work closely with Service Owners to deliver a clear, concise picture of incidents and the short-term remediation applied; ensure problem prevention methods and mitigation strategies are continually applied to improve application availability and make recommendations on long-term solutions.
Effectively document business cases, solution strategies, event and configuration processes, procedures, and knowledge articles associated with implementing fixes and solutions to existing or predicted IT incidents and problems.
Participate in routine root cause analysis and problem review meetings, and provide recommendations to service owners to improve service availability.
Work with Support Knowledge Managers to build the team knowledge database.
Train and mentor the Incident Management team to perform the job effectively during the shift.
Ability to work scheduled shifts including mornings, weekends and nights or participate in an on-call roster.

Qualifications:

Bachelor’s degree in Computer Science, Computer Information Systems, or Management Information Systems
5+ years of experience in a technical operations role, a systems analysis and support role or a DevOps role
2+ years of experience and strong working knowledge in Amazon Web Services or similar cloud infrastructure platforms (Azure, Google Cloud, etc.)
2+ years of experience in performance monitoring & diagnostic tools (e.g., Datadog, Dynatrace, Splunk, New Relic, Nagios, etc.)
Technical knowledge and experience working with Windows/Linux environments, SQL, Active Directory, scripting, etc.
Network troubleshooting knowledge including LAN/WAN, DHCP, TCP/IP, Firewalls, and Routing
Proven track record supporting large scale environments and applications
Superior English language and communication skills - both written and verbal.
Ability to articulate technical solutions to both technical and non-technical audiences.
Ability to direct cross-functional resources through incident closure with proper RCA and through the problem management lifecycle.

Brief about our client: Our client is a leader in building software that drives their customers' business, for enterprises and software product companies with software at the core of their transformation.
