AdTechTalent
Engineering · 1 month ago · Hybrid

Epsilon

Manager Systems Administration

kubernetes, gke, eks, linux, networking, tcp/ip, dns, monitoring, observability, prometheus, grafana, elk, splunk, zabbix, devops, sre, ci/cd, incident management, itil, disaster recovery, operations, command center, noc, soc

Key details

Salary

Not specified

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Bengaluru, India

Full job description

Lead and manage a DevOps/SRE operations team supporting enterprise-scale infrastructure. Oversee operational activities, improve service delivery, and implement service improvements. The role requires expertise in Kubernetes (GKE/EKS), Linux, networking protocols, distributed systems, and monitoring tools (OpsRamp, ThousandEyes, Grafana, Prometheus, ELK, Splunk, Zabbix). Define and track SLOs/SLIs, manage incident command, and act as the escalation point for outages. Collaborate with engineering to improve CI/CD and automation. Maintain operational documentation, logs, and processes. Ensure service continuity, disaster recovery readiness, and adherence to ITIL-aligned processes. A bachelor's degree and 12-15 years of experience are required, along with a Command Centre/NOC/SOC background. Location: Bengaluru, India.

What you'll do

  • Lead, mentor, and upskill a high-performing DevOps/SRE-oriented operations team
  • Partner with engineering teams to improve CI/CD reliability, release safety, and change automation
  • Own the observability strategy across metrics, logs, and traces
  • Optimize monitoring, alerting, and log analytics platforms (e.g., ELK, Splunk, Zabbix, OpsRamp)
  • Continuously tune alerts to minimize noise and improve signal quality
  • Ensure consistent observability of platforms and services to maintain optimal uptime
  • Ensure incident, problem, and change processes are lightweight, automated, and outcome-focused
  • Drive service continuity, resilience testing, and disaster recovery readiness
  • Maintain regular communication with peer teams to keep operations running smoothly
  • Participate in RCA of issues and address monitoring/process gaps during major incidents
  • Coach and guide Operations teams to build capabilities and achieve strategic goals
  • Ensure all incidents and requests follow documented processes and meet SLA/OLA
  • Maintain ticket quality through regular assessments
  • Present management-level reporting on incidents, tickets, projects, and challenges
  • Evaluate production change proposals and ensure smooth, risk-aware implementation
  • Ensure service continuity plans are compatible and regularly tested
  • Review and analyze Operations management toolsets for best practices
  • Coordinate with clients to prepare operational runbooks
  • Maintain operational logs, journals, documentation, processes, and diagnostic tools
  • Ensure maintenance tasks are completed as per procedural documentation

Requirements

  • Good knowledge of Kubernetes, with hands-on experience using platforms such as GKE or EKS
  • Experience setting up enterprise observability, alerting, and managing incident command at scale
  • Deep understanding of Linux, networking protocols (TCP/IP, DNS), and distributed systems
  • Strong skills in monitoring tools (OpsRamp, ThousandEyes, Grafana, Prometheus)
  • Ability to define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
  • Experience acting as an escalation point for critical production outages
  • Bachelor’s degree in Engineering, Computer Science, IT, or an equivalent field
  • 12-15 years of related experience
  • Command Centre/NOC/SOC experience
  • Familiarity with application lifecycle and IT Service Management concepts

Tech stack

Kubernetes, Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), Linux, TCP/IP, DNS, OpsRamp, ThousandEyes, Grafana, Prometheus, ELK, Splunk, Zabbix, CI/CD, ITIL

Benefits

  • Opportunities for growth through learning, development, and career advancement
  • Focus on employee well-being
  • Collaborative work environment
  • Flexibility to balance work and personal life
  • Inclusive and diverse workplace
