AdTechTalent
Engineering · 2 months ago · Hybrid

Epsilon

Staff Site Reliability Engineer

python, shell, aws, azure, gcp, kubernetes, terraform, ansible, devops, sre, cloud engineering, automation, monitoring, ci/cd, linux, windows, mlops, infrastructure as code, self-healing systems, incident triage

Key details

Salary

Not specified

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Bengaluru, India

Full job description

Seeking a Staff Site Reliability Engineer with 12+ years of experience in Platform/Cloud Engineering, SRE, and DevOps to lead and evolve infrastructure platforms managing 15,000+ on-premise servers and multi-cloud environments (AWS, Azure, GCP). Responsibilities include leading SRE initiatives, managing Linux and Windows servers, automating workflows using n8n, building self-service platforms with Backstage, architecting scalable AWS infrastructure, and administering Kubernetes clusters. The role also covers driving automation with Python, Shell, Terraform, and Ansible; designing AI agents for observability and incident triage; collaborating across teams to improve CI/CD and security; building monitoring pipelines; participating in on-call rotations; conducting root cause analysis; and promoting best practices in reliability and cost optimization. Required skills include strong coding in Python and Shell; expertise in cloud platforms, Kubernetes, Linux administration, AWS services, Infrastructure as Code tools, and CI/CD pipelines; and monitoring tools such as Zabbix, PagerDuty, Grafana, and ELK. Location: Bengaluru, Karnataka, India.

What you'll do

  • Lead SRE initiatives across hybrid infrastructure (on-premise and multi-cloud: AWS, Azure, GCP)
  • Manage and optimize 15,000+ servers on Linux and Windows platforms
  • Create automation workflows using n8n and integrations across the tech stack
  • Build a self-service platform using Backstage and write product integrations
  • Architect and support scalable, resilient AWS infrastructure (EKS, EC2, S3, RDS, Lambda)
  • Administer Kubernetes clusters at scale, ensuring health, upgrades, and secure deployments
  • Drive infrastructure automation using Python, Shell, Terraform, and Ansible
  • Design and implement AI agents for observability, root cause analysis, and incident triage
  • Collaborate with development, IT Ops, Command Center, cloud, and platform teams to improve CI/CD, security, and SLA adherence
  • Build monitoring and alerting pipelines using Grafana, Prometheus, ELK, PagerDuty, or similar tools
  • Participate in and improve on-call rotations and build self-healing systems
  • Lead root cause analysis exercises and post-incident reviews
  • Promote best practices in reliability, scalability, and cost optimization

Requirements

  • 12+ years of experience in Platform/Cloud Engineering, SRE, and DevOps
  • Strong hands-on coding experience in Python and Shell
  • Strong expertise in cloud platforms, Kubernetes, and Linux administration
  • Hands-on experience with AWS services and Kubernetes
  • Proficiency in Infrastructure as Code tools like Terraform and Ansible
  • Extensive experience delivering an efficient developer experience
  • Extensive knowledge of building CI/CD pipelines
  • Familiarity with monitoring tools such as Zabbix, PagerDuty, Grafana, and ELK

Tech stack

Python, Shell, AWS, Azure, GCP, Linux, Windows, Kubernetes, Terraform, Ansible, n8n, Backstage, EKS, EC2, S3, RDS, Lambda, Grafana, Prometheus, ELK, PagerDuty, Zabbix

Benefits

  • Employee well-being focus
  • Collaborative work environment
  • Opportunities for learning, development, and career advancement
  • Innovation-driven culture
  • Work-life balance and flexibility
  • Commitment to diversity, inclusion, and equal employment opportunities
