Site Reliability Engineer
Highly valued qualifications & experiences
- Experience with DC/OS
- Fan of automated testing and qualification, ideally as part of a CI/CD pipeline
- Affinity for digging deep into the details of networking issues
- Available to work (remotely) outside regular office hours when it turns out that our attempt to build a fail-safe system has not yet succeeded. We really want this to be the exception, not the rule
Required qualifications & experiences
- Knowledge of, and practical experience with, distributed computing systems (must!)
- Experience with Ansible
- Linux expert
- Networking expert
- Understands RHEV and virtual machines
- Problem-solving / go-fix mentality
- Broad obsession with, e.g., Ansible, networking, Linux, Hadoop, failover designs and virtualization (RHEV)
- Flexible: we are a 5 x 8 organization, but when we must, we do not stop there
- Likes to deliver working products; values functionality over the latest and greatest technology
Our client is one of the world’s leading manufacturers of semiconductor-chip-making equipment. A majority of the world’s microchips receive their critical lithographic patterning in machines made by ASML. In addition, ASML produces metrology tools and advanced applications to analyze and optimize the performance of its customers' production processes.
In one line: do your part to ensure that, as the number of VCP installations grows, the D&E team can stay stable in size. Reduce the manual actions needed for installation and upgrade, resolve structural stability issues, and design automated test suites to qualify both the design and the system setup at the customer site.
Investigate the issues at hand, identify root causes, and design, develop and test structural reliability improvements. Share your knowledge and work with the first-line support team on the most complex issues.
80% of your time is spent on development work; in the remaining time you aid the dedicated support teams of the Managed Operations department in their work.
The platform under development will be the foundation for the applications developed at ASML. Developing those applications is a task for other teams; we do the platform. You develop solutions to ensure that the platform can, for example, support customer-specific hardware configurations and different subsets of the applications that run on it.
Some examples of potential tasks: design and build (or configure) a solution that can determine the configuration of the platform and identify the deltas between the configuration as built and the configuration as designed. Another: troubleshoot problems with the platform DNS and find solutions that prevent crosstalk between internal and external DNS requests.
The Managed Operations (MO) department, active 24/7 in 3 geographical locations (time zones), sits between the customer and the Site Reliability Engineering team. As such, monitoring and alert handling are not in the scope of the Site Reliability Developer (SRD) at ASML. Where MO cannot address a problem, the SRD comes in to help solve the problem at hand. It is the task of the SRD team to enable their MO counterparts to handle alerts without escalation, through clear documentation and well-defined automated corrective actions.
A great SRD will take the lessons learned from each incident to improve the system in the next release. Through automation, automation and more automation, plus a reduction of moving parts, upgrades of critical components or additional alerting, the SRD tries to bring the number of alerts back to ‘0’. The time saved is spent on adding features and capabilities to the platform to further drive ASML's application roadmap.
Responsibilities of this role:
- Define and implement structural improvements to simplify configurations and drive uptime
- Finding, together with your team, the root causes of the most complex issues
- Development of new features that help to further improve the platform stability
- Build maintainable, configurable Ansible playbooks
Context of the position
You will be working at Business Line Applications. This is a product development team. You are not supporting “the business”; you are the business.
There are 3 to 4 infra teams, with 20 to 30 engineers plus Product Owners and Scrum Masters, working on the platform layers.
Application development, which builds the business-critical applications, consists of 15 to 25 teams. The organization is large, and we develop our processes in parallel with the product.
Please send us, as soon as possible, your recent CV and a motivation for this role, both in English, together with your availability (including planned vacations) and your all-in hourly rate, excluding VAT (BTW).