Site Reliability Engineer
- Design, code, test, and deliver software to automate manual operational work.
- Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents.
- Engage with development teams throughout the life cycle to help build software for reliability and scale, minimizing later refactoring or changes.
- Identify application patterns and analytics in support of better service level objectives.
- Design self-healing and resiliency patterns.
- Design automated software and product upgrades, change management, and release management solutions.
- Coach or manage teams as applicable.
- Participate in 24x7 support coverage as needed.
- Bachelor's degree or equivalent experience in a software engineering discipline.
- Expertise in designing, coding, testing, and delivering software in at least one technology stack.
- Proficiency in one or more technology domains; may be a cross-domain expert able to solve complex, mission-critical problems within a business or across the firm.
- Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks).
- Excellent debugging and troubleshooting skills.
- 5+ years of experience developing enterprise software and proficiency in multiple technologies, preferably Java, Spring Boot, and NoSQL databases.
- 2+ years of incident resolution experience in a large-scale operations environment.
- Experience with or knowledge of administering application servers, web servers, and databases (Tomcat, Eureka, Cassandra, Kafka, etc.).
- Proven ability to understand and troubleshoot complex problems under pressure.
- Experience with one or more cloud platforms such as Cloud Foundry, Mesosphere, Kubernetes, AWS, or GCP.
- Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, and Splunk.
- Expertise in Agile and the ability to work with at least one of the common frameworks.