Site Reliability Engineer
Oil & Gas
Key Skills and Attributes Required:
- 7 years of experience in software engineering, software development, or systems operations
- Excellent communication skills, both verbal and written
- Knows their way around a Unix/Linux system, can write scripts, and understands Linux internals
- Experience debugging complex problems
- Experience designing, building, and operating large-scale production systems
- Knows Python, Java, Go, Rust, or similar
- Understands networking and messaging, especially between services
- Has hands-on experience using source control (Git, GitHub, GitLab) and feature branching strategies
- Has experience with a variety of open-source databases (Postgres, Redis, etc.)
- Experience with containers and container orchestration, such as Docker and Kubernetes
- Experience with the Elastic Stack, Istio, HashiCorp Vault, Prometheus, and Grafana
- Proven track record of automating processes and building auto-healing (self-healing) systems
- Experience with monitoring and observability tools such as Datadog, Sensu, New Relic, and Nagios
- Experience automating infrastructure, testing, and deployments using tools such as Terraform and Helm, and able to explain the Infrastructure as Code paradigm
- Experience with configuration management
- Understands the idea behind Chaos Engineering, even if they haven’t yet implemented it themselves
- Experience working in regulated industries such as banking, telecom, or power
Do you enjoy working with a highly motivated and talented team to deliver mission-critical software?
Our Renewables and Energy Solutions business is growing, and as the Site Reliability Engineering team we play a key role: we help deploy, manage, troubleshoot, and enhance our complex cloud-based services for a wide variety of customers.
As a Site Reliability Engineer, you will design and implement web applications and REST API services on a microservice-based infrastructure. The new technology stack includes AWS, Docker/Kubernetes (K8s), NoSQL and SQL databases, and a range of monitoring tools. Your focus will be on maximizing system uptime. All team members participate in an on-call rotation.
You will build innovative automated solutions and tools that help debug and resolve problems in production and prevent them from recurring. You will also proactively seek out system weaknesses, using monitoring data, trend analysis, and Chaos Engineering, and fix them before they cause production issues.
Your responsibilities will include:
- Keeping your assigned site or service up and running, and restoring it quickly when failures occur
- Working closely with internal partners and teams to ensure that we ship software that meets security, compliance, SLA, and performance requirements
- Writing, updating, and using documentation, including runbooks/playbooks
- Automating work, including infrastructure needs, testing, failover solutions, and failure mitigation
- Debugging complex problems across the entire stack and creating robust solutions
- Developing CI/CD processes to improve release cadence
- Using Chaos Engineering to test what you build under real-world conditions
Please send us your most recent CV (in Word format) and a cover letter for this role (both in English), together with your availability, any planned vacations, and your all-in hourly rate excluding VAT (BTW).