Vacancy expired!
- Implement tools that ensure high availability of the product
- Gain deep knowledge of complex applications
- Identify opportunities to automate or improve processes and then implement the automation
- Coordinate incident response across multiple teams, clearly understanding and communicating what is happening, what the next steps are, and who is responsible for what
- Implement observability tools to ensure visibility into service stability and performance
- Be on-call for production services
- Operating, troubleshooting, and deploying software on Unix systems
- Approaching problems in a systematic, methodical way, especially when troubleshooting
- Expertise in observability and monitoring of applications, services, and networks, using tools such as Prometheus, Grafana, and ELK logging
- Unix/Linux experience, including application installation, configuration, and maintenance
- Significant experience with site reliability, developer productivity, DevOps, or server infrastructure engineering (including on-call incident response)
- Understanding of Internet networking protocols: TCP/IP, TLS, DNS, HTTP/S, SMTP
- Experience troubleshooting issues across the entire stack (hardware, software, network, etc.)
- Experience writing automation scripts and utilities in a scripting language such as Python, Perl, Shell, or PHP
- Experience with incident and problem management
- Strong communication and interpersonal skills
- Experience coding in Rust or C
- Experience supporting large-scale, mission-critical services
- Experience with CI/CD pipelines