Staff Site Reliability Engineer

python, shell, aws, azure, gcp, kubernetes, terraform, ansible, devops, sre, cloud engineering, automation, monitoring, ci/cd, linux, windows, mlops, infrastructure as code, self-healing systems, incident triage

Key details

Salary

Not specified

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Bengaluru, India

Full job description

Seeking a Staff Site Reliability Engineer with 12+ years of experience in Platform/Cloud Engineering, SRE, and DevOps to lead and evolve infrastructure platforms managing 15,000+ on-premise servers and multi-cloud environments (AWS, Azure, GCP). Responsibilities include leading SRE initiatives; managing Linux and Windows servers; automating workflows with n8n; building self-service platforms with Backstage; architecting scalable AWS infrastructure; administering Kubernetes clusters; driving automation with Python, Shell, Terraform, and Ansible; designing AI agents for observability and incident triage; collaborating across teams to improve CI/CD and security; building monitoring pipelines; participating in on-call rotations; conducting root cause analysis; and promoting best practices in reliability and cost optimization. Required skills include strong coding in Python and Shell; expertise in cloud platforms, Kubernetes, Linux administration, AWS services, Infrastructure as Code tools, and CI/CD pipelines; and familiarity with monitoring tools such as Zabbix, PagerDuty, Grafana, and ELK. Location: Bengaluru, Karnataka, India.

What you'll do

Lead SRE initiatives across hybrid infrastructure (on-premise and multi-cloud: AWS, Azure, GCP)
Manage and optimize 15,000+ servers on Linux and Windows platforms
Create automation workflows using n8n and integrations across the tech stack
Build a self-service platform using Backstage and write product integrations
Architect and support scalable, resilient AWS infrastructure (EKS, EC2, S3, RDS, Lambda)
Administer Kubernetes clusters at scale, ensuring health, upgrades, and secure deployments
Drive infrastructure automation using Python, Shell, Terraform, and Ansible
Design and implement AI agents for observability, root cause analysis, and incident triage
Collaborate with development, IT Ops, Command Center, cloud, and platform teams to improve CI/CD, security, and SLA adherence
Build monitoring and alerting pipelines using Grafana, Prometheus, ELK, PagerDuty, or similar tools
Participate in and improve on-call rotations and build self-healing systems
Lead root cause analysis exercises and post-incident reviews
Promote best practices in reliability, scalability, and cost optimization

Requirements

12+ years of experience in Platform/Cloud Engineering, SRE, and DevOps
Strong hands-on coding experience in Python and Shell
Strong expertise in Cloud, Kubernetes, and Linux Administration
Hands-on experience with AWS services and Kubernetes
Proficiency in Infrastructure as Code tools like Terraform and Ansible
Extensive experience delivering an efficient developer experience
Extensive knowledge of building CI/CD pipelines
Familiarity with monitoring tools such as Zabbix, PagerDuty, Grafana, ELK

Tech stack

Python, Shell, AWS, Azure, GCP, Linux, Windows, Kubernetes, Terraform, Ansible, n8n, Backstage, EKS, EC2, S3, RDS, Lambda, Grafana, Prometheus, ELK, PagerDuty, Zabbix

Benefits

Employee well-being focus
Collaborative work environment
Opportunities for learning, development, and career advancement
Innovation-driven culture
Work-life balance and flexibility
Commitment to diversity, inclusion, and equal employment opportunities

