SRE Team Lead (Hands-on)
Job Details
About the Company
ALLSTARSIT is headquartered in San Francisco, US, with operational hubs across Europe, Asia, and LATAM. The company employs over 1,000 professionals in more than 20 countries, working across verticals including AI, cybersecurity, healthcare, fintech, telecom, and media.
About the Project
We are looking for a hands-on SRE Team Lead to own the reliability, scalability, and operational excellence of a cloud-native fintech platform built on microservices.
This role combines technical leadership, architecture ownership, and deep hands-on execution.
You will lead a small SRE team while remaining actively involved in design, coding, incident response, and reliability engineering.
Required skills:
- 8+ years of experience in SRE / Platform / DevOps engineering
- Strong hands-on experience with:
  - AWS (EKS, EC2, RDS, IAM, CloudWatch, ALB)
  - Kubernetes & Docker
  - Microservices architectures
- Strong programming background in Java and/or Node.js
- Deep understanding of:
  - Distributed systems
  - Production debugging
  - Capacity planning
- Experience in fintech or regulated environments is a strong plus
Nice to have:
- Experience with chaos engineering tools
- Security & compliance exposure (PCI-DSS, SOC 2, ISO)
- Prior experience building or scaling SRE teams
Scope of work:
Reliability & Architecture
- Own platform availability, latency, scalability, and resilience across environments
- Define and enforce SLOs, SLIs, error budgets, and operational KPIs
- Design and review resilience patterns: circuit breakers, retries, rate limiting, graceful degradation
- Drive chaos engineering, fault-injection, and disaster-recovery readiness
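
To illustrate the SLO and error-budget work listed above, here is a minimal sketch of how an availability error budget might be calculated. The 99.9% target, the 30-day window, and the function names are assumptions chosen for the example, not figures or tooling from the platform.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical inputs: 99.9% availability target over 30 days,
    # with 10 minutes of downtime recorded so far.
    budget = error_budget_minutes(0.999, 30)
    print(f"Error budget: {budget:.1f} minutes")
    print(f"Remaining: {budget_remaining(0.999, 30, 10.0):.0%}")
```

A negative result from `budget_remaining` would signal that the budget is overspent, which is typically the trigger for prioritizing reliability work over new releases.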
Hands-on Engineering
- Actively contribute code (Java / Node.js) for:
  - Reliability tooling
  - Platform automation
  - Observability integrations
- Review microservice architecture with engineering teams to eliminate single points of failure
Cloud & DevOps Leadership
- Own AWS architecture (VPCs, IAM, EKS, RDS, ALB/NLB, autoscaling)
- Drive Kubernetes best practices (resource tuning, HPA, pod disruption budgets)
- Improve CI/CD pipelines for reliability, speed, and safety
Incident & Operations
- Lead production incident response, root cause analysis (RCA), and postmortems
- Establish blameless postmortem culture
- Reduce MTTR through automation and better observability
- Participate in defining the escalation and on-call strategy (this is not a 24×7 firefighting role)
People & Process
- Mentor SRE DevOps and SRE Full-Stack engineers
- Define operational standards, runbooks, and SRE practices
- Work closely with product, security, and engineering leaders