Site Reliability Engineer
The Site Reliability Engineer (SRE) position combines strategic engineering and design with hands-on technical work. The ideal candidate is a former Systems Administrator who has moved into DevOps/automation and has the coding skills to automate tasks and build tools that support our service operations. The SRE will configure, tune, and troubleshoot multi-tiered systems to achieve optimal application performance, stability, and availability. The SRE will work closely with software, infrastructure, and network engineers to deploy and maintain our services.
- Strong sense of ownership, customer service, and integrity demonstrated through clear communication.
- Deep understanding of Linux and large-scale system administration.
- Understanding of standard networking protocols and concepts such as HTTP, DNS, TCP/IP, the OSI model, subnetting, and load-balancing strategies.
- Coding experience in a high-level programming language such as Python or Golang.
- Experience running Docker-based workloads in production on a platform like Nomad.
- The successful candidate will be highly self-motivated with a passion for excellence, quality and attention to detail.
Responsibilities of the SRE include the following:
- Keeping the lights on: on-call and alert handling.
- Manage new build-outs (additions and decommissions)
- Develop and maintain scripts used for environment monitoring and task automation (Python, Ansible, Puppet)
- Set up and manage monitoring tools such as Graphite, Prometheus, InfluxDB, and Grafana
- Set priorities and work efficiently in a fast-paced environment
- Measure and optimize system performance
- Demonstrate ability to deliver results on time with high quality
Experience with Docker, Spinnaker, Kubernetes, and AWS is a plus.
To apply for this job please visit www2.jobdiva.com.