AdTechTalent
Engineering · 2 months ago · Hybrid

Epsilon

Manager, System and Platform Operations

site reliability, docker, kubernetes, terraform, java, golang, python, bash, datadog, prometheus, grafana, postgresql, bigtable, api, microservices, devops, itil, cloud, monitoring, observability, leadership, agile, high availability

Key details

  • Salary: Not specified
  • Employment type: Full-time
  • Seniority: Senior
  • Years experience: 5-10
  • Location: London, United Kingdom

Full job description

The System and Platform Operations Manager is a senior technical leadership role responsible for the support, reliability, and stability of Epsilon Retail Media production systems. The role manages the Platform Operation Team within a single geo-region, overseeing deployment, management, monitoring, reporting, troubleshooting, and repair of production systems.

Responsibilities include establishing operational practices, exceeding service level objectives, resolving complex performance and reliability issues, communicating with customers and stakeholders, enabling rapid product releases while maintaining platform stability, collaborating with cross-functional teams, and maintaining expertise in current and emerging technologies.

Requirements include 5+ years in site reliability roles; strong knowledge of Docker, Kubernetes, and Terraform; proficiency in programming and scripting languages (Java, Golang, Python, Bash); experience with monitoring tools (DataDog, Prometheus, Grafana) and database systems (PostgreSQL, Bigtable); an understanding of API and microservices architecture; leadership experience; and familiarity with DevOps, ITIL, Cloud Services, and Agile methodologies. The position offers hybrid work from the London office and includes competitive compensation, benefits, career advancement opportunities, and a commitment to diversity and inclusion.

What you'll do

  • Establish and manage operational practices and a support model that can meet future needs
  • Adopt a 'Measure Everything' approach to exceed service level objectives and agreements
  • Lead resolution of complex issues related to performance, reliability, and scalability
  • Communicate incident resolutions and impacts to customers and stakeholders
  • Empower Delivery teams to release new products, features, updates, and fixes quickly while ensuring platform stability
  • Collaborate with Engineering, Product, Delivery, and Security teams to ensure production/system reliability
  • Identify capabilities needed to meet current and emerging business needs
  • Maintain understanding of current technology, database management, reliability practices, and future trends

Requirements

  • At least 5 years of hands-on experience in Site Reliability focused positions
  • Strong knowledge of containerization technologies (Docker, Kubernetes)
  • Experience with infrastructure as code (Terraform)
  • Solid understanding of networking, security, and system architecture
  • Proficient in programming and scripting languages (Java, Golang, Python, Bash, or similar)
  • Experience with monitoring and observability tools (DataDog, Prometheus, Grafana)
  • Knowledge of database management systems (PostgreSQL, Bigtable)
  • Understanding of API and microservices architecture
  • Strong people leadership skills, with at least one year of experience leading high-performance technical teams
  • Experience with operations teams within enterprise environments including DevOps, ITIL, Cloud Services, IT Infrastructure and Operations
  • Experience establishing Service Delivery strategies aligned with Agile methods
  • Experience delivering IT support services in a high availability (HA) environment such as 24/7 operations

Tech stack

Docker, Kubernetes, Terraform, Java, Golang, Python, Bash, DataDog, Prometheus, Grafana, PostgreSQL, Bigtable, API, microservices

Benefits

  • Competitive compensation
  • Great benefits package
  • Endless opportunities to advance your career
  • Hybrid working opportunities
  • Inclusive and diverse workforce
  • Reasonable adjustments for candidates in application process
