System Engineer Cloud
Oil and Gas
About Cloud Native Engineering
The Cloud Native Engineering Practice is an organization of engineers who work with our production services throughout their entire life cycle, from design and architecture, through implementation, deployment, and sustaining operation.
Site Reliability Engineers (SREs) deliver important system properties (reliability, performance, efficiency, and scalability) for the products and platforms that our customers use every day.
SREs work in high-performance squads with expertise in large-scale system reliability and an in-depth understanding of the architecture of critical business components, alongside dedicated engineering teams building comprehensive tools, platforms, and infrastructure.
Do you want to work on a team building the client's next-generation AI platform? Do you want to work with the latest open-source toolset? Join our newly formed Cloud Native Engineering team as a Lead Site Reliability Engineer (SRE) and help shape the future of software and data modelling at the client.
As an SRE, you will join a team of reliability engineers who partner with development teams throughout the organization with the goal of improving the reliability of products, features, and flows.
An SRE spends just as much time working on systems as writing code. You'll be tasked with all manner of work: building operational tooling, automating operational workflows, performing architecture and design reviews, investigating system failures and complex outages, improving our monitoring infrastructure, defining service level objectives and agreements for products and flows, and much more.
- Lead and inspire a team of SREs in the US, Europe, and India to deliver world-class services
- Work with development partners to shape the architecture, design, and implementations of new and existing systems to enhance their reliability, performance, efficiency, and scalability
- Ensure all key services are measured and monitored, and that they raise alerts when needed
- Automate deployment and configuration processes
- Develop reliability tools and frameworks for use by all engineers
- Share on-call duty for our most critical systems, and lead incident response and no-blame post-mortem analysis and review
- Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring, and root cause analysis
- We are experts in infrastructure and best practices, and we help development teams use infrastructure more effectively.
- We are on point for capacity planning, and we help teams anticipate and prepare for growth.
What you'll need
- Grit, drive and a deep feeling of ownership.
- BS or MS in Computer Science or a related technical discipline. Equivalent practical experience is a reasonable substitute.
- Experience in the Linux environment and a good understanding of its fundamentals and internals: filesystems and modern memory management, threads and processes, the user/kernel-space divide, etc.
- A good understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems.
- Working knowledge of the TCP/IP stack, internet routing and load balancing.
- Working knowledge of Kubernetes, Terraform, Prometheus, and Jenkins (or a similar toolset).
Please send us your recent CV and a motivation statement for this role as soon as possible, both in English, together with your availability, any planned vacations, and your all-in hourly rate excluding VAT (BTW).