Full job description
Build and maintain reliable, scalable, and high-performance digital media measurement platforms. Implement observability best practices including metrics collection, dashboarding, and alerting. Reduce mean time to recovery (MTTR) for critical incidents through automation and proactive monitoring. Respond to and resolve Sev1/Sev2 incidents. Monitor and maintain infrastructure across GCP, AWS, OCI, and on-premises. Lead technical projects from planning to deployment. Develop automations to improve operational efficiency. Use AI-assisted tools for automation and problem resolution. Implement Infrastructure-as-Code with Terraform, Helm, Python, and configuration management tools. Create and maintain documentation and runbooks. Participate in on-call rotations and post-incident reviews.
Requires 4+ years in SRE, DevOps, or related roles with Linux/Unix administration experience. Proficient in Python, Bash, or Go. Experienced with cloud platforms (GCP, AWS, OCI), Kubernetes, monitoring tools (Prometheus, Grafana, Splunk, Nagios), and Infrastructure-as-Code tools (Terraform, Ansible, Helm). Knowledge of networking, databases, CI/CD, and workflow automation. Strong communication, problem-solving, and ownership mindset.
Preferred qualifications include relevant degrees, certifications, AI-assisted development experience, and security best practices knowledge.
Salary range $89,000 - $178,000 plus bonus, equity, and benefits. Hybrid work model with 3 days per week in office at NYC Global HQ.
What you'll do
- Build and maintain reliable, scalable, and high-performance digital media measurement platforms
- Implement observability best practices including metrics collection, dashboarding, and alerting
- Reduce MTTR for critical incidents through automation and improved observability
- Respond to incidents and manage Sev1/Sev2 situations
- Monitor and maintain high-availability infrastructure and services across GCP, AWS, OCI, and on-premises environments
- Lead technical projects from planning through deployment
- Build and deploy automations to eliminate operational toil and improve efficiency
- Leverage AI-assisted development tools to accelerate automation and problem resolution
- Build custom integrations and MCP servers for monitoring platforms
- Implement Infrastructure-as-Code using Terraform, Helm charts, Python, scripts, and configuration management tools
- Develop production automations for routine operational tasks
- Create and maintain documentation, runbooks, and SOPs in Confluence
- Participate in on-call rotations and post-incident reviews
Requirements
- 4+ years in Site Reliability Engineering, DevOps, or related operational roles
- Proven experience in Linux/Unix systems administration
- Proficiency in scripting and programming languages such as Python, Bash, or Go
- Strong experience with cloud infrastructure and services across GCP, AWS, and OCI
- Experience with container orchestration tools like Kubernetes
- Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, and Nagios
- Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Helm
- Ability to develop and track SLIs, SLOs, and SLAs
- Deep understanding of networking, DNS, load balancing, and CDN technologies
- Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
- Knowledge of CI/CD pipelines, GitLab, and deployment automation
- Experience with workflow automation platforms is a strong plus
- Exceptional communication skills
- Proactive problem-solving approach
- Ownership mentality
- Passion for mentorship and knowledge sharing
- Bachelor's or Master's degree in Computer Science, Engineering, or related field (preferred)
- Industry certifications such as AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA), or Terraform/Grafana certifications (preferred)
- Experience with AI-assisted development tools like ChatGPT, Cursor, Glean, or Copilot (preferred)
- Familiarity with security best practices in cloud and containerized environments (preferred)
Tech stack
Linux, Unix, Python, Bash, Go, GCP, AWS, OCI, Kubernetes, Prometheus, Grafana, Splunk, Nagios, Terraform, Ansible, Helm, SQL, NoSQL, Vertica, MongoDB, Snowflake, GitLab, CI/CD, ChatGPT, Cursor, Glean, Copilot
Benefits
- Bonus/commission (as applicable)
- Equity
- Benefits (unspecified)