Manager Systems Administration

kubernetes, gke, eks, linux, networking, tcp/ip, dns, monitoring, observability, prometheus, grafana, elk, splunk, zabbix, devops, sre, ci/cd, incident management, itil, disaster recovery, operations, command center, noc, soc

Key details

Salary

Not specified

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Bengaluru, India

Full job description

Lead and manage a DevOps/SRE operations team supporting enterprise-scale infrastructure. Oversee operational activities and drive improvements in service delivery. Requires expertise in Kubernetes (GKE/EKS), Linux, networking protocols, distributed systems, and monitoring tools (OpsRamp, ThousandEyes, Grafana, Prometheus, ELK, Splunk, Zabbix). Define and track SLOs/SLIs, manage incident command, and act as the escalation point for outages. Collaborate with engineering to improve CI/CD and automation. Maintain operational documentation, logs, and processes. Ensure service continuity, disaster recovery readiness, and adherence to ITIL-aligned processes. Bachelor's degree and 12 to 15 years of experience required, with a Command Centre/NOC/SOC background. Location: Bengaluru, India.

What you'll do

Lead, mentor, and upskill a high-performing DevOps/SRE-oriented operations team
Partner with engineering teams to improve CI/CD reliability, release safety, and change automation
Own the observability strategy across metrics, logs, and traces
Optimize monitoring, alerting, and log analytics platforms (e.g., ELK, Splunk, Zabbix, OpsRamp)
Continuously tune alerts to minimize noise and improve signal quality
Ensure consistent observability of platforms and services to maintain optimal uptime
Ensure incident, problem, and change processes are lightweight, automated, and outcome-focused
Drive service continuity, resilience testing, and disaster recovery readiness
Stay in regular contact with peer teams to keep operations running smoothly
Participate in root cause analysis (RCA) of issues and address monitoring and process gaps exposed by major incidents
Coach and guide Operations teams to build capabilities and achieve strategic goals
Ensure all incidents and requests follow documented processes and meet SLA/OLA
Maintain ticket quality through regular assessments
Present management-level reporting on incidents, tickets, projects, and challenges
Evaluate production change proposals and ensure smooth, risk-aware implementation
Ensure service continuity plans are compatible and regularly tested
Review and analyze Operations management toolsets for best practices
Coordinate with clients to prepare operational runbooks
Maintain operational logs, journals, documentation, processes, and diagnostic tools
Ensure maintenance tasks are completed as per procedural documentation

Requirements

Good knowledge of Kubernetes, with hands-on experience using platforms such as GKE or EKS
Experience setting up enterprise observability, alerting, and managing incident command at scale
Deep understanding of Linux, networking protocols (TCP/IP, DNS), and distributed systems
Strong skills in monitoring tools (OpsRamp, ThousandEyes, Grafana, Prometheus)
Ability to define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
Experience acting as an escalation point for critical production outages
Bachelor’s degree in Engineering, Computer Science, IT, or an equivalent field
12 to 15 years of related experience
Command Centre/NOC/SOC experience
Familiarity with application lifecycle and IT Service Management concepts

Tech stack

Kubernetes, Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), Linux, TCP/IP, DNS, OpsRamp, ThousandEyes, Grafana, Prometheus, ELK, Splunk, Zabbix, CI/CD, ITIL

Benefits

Opportunities for growth through learning, development, and career advancement
Focus on employee well-being
Collaborative work environment
Flexibility to balance work and personal life
Inclusive and diverse workplace

