AdTechTalent
Engineering · 6 days ago · Hybrid

DoubleVerify

Sr. Site Reliability Engineer I

site reliability engineering, devops, python, bash, go, cloud, gcp, aws, oci, kubernetes, terraform, ansible, helm, prometheus, grafana, splunk, nagios, monitoring, automation, ci/cd, gitlab, infrastructure as code, ai-assisted development, chatgpt, cursor, copilot, networking, dns, load balancing, cdn, sql, nosql, mongodb, snowflake, vertica

Key details

Salary

$89K – $178K

Employment type

Full-time

Seniority

Mid-level

Years experience

3-5

Location

New York City, New York, United States

Full job description

Build and maintain reliable, scalable, and high-performance digital media measurement platforms. Implement observability best practices including metrics collection, dashboarding, and alerting. Reduce mean time to recovery (MTTR) for critical incidents through automation and proactive monitoring. Respond to and resolve Sev1/Sev2 incidents. Monitor and maintain infrastructure across GCP, AWS, OCI, and on-premises. Lead technical projects from planning to deployment. Develop automations to improve operational efficiency. Use AI-assisted tools for automation and problem resolution. Implement Infrastructure-as-Code with Terraform, Helm, Python, and configuration management tools. Create and maintain documentation and runbooks. Participate in on-call rotations and post-incident reviews.

Requires 4+ years in SRE, DevOps, or related roles with Linux/Unix administration experience. Proficient in Python, Bash, or Go. Experienced with cloud platforms (GCP, AWS, OCI), Kubernetes, monitoring tools (Prometheus, Grafana, Splunk, Nagios), and Infrastructure-as-Code tools (Terraform, Ansible, Helm). Knowledge of networking, databases, CI/CD, and workflow automation. Strong communication, problem-solving, and ownership mindset. Preferred qualifications include relevant degrees, certifications, AI-assisted development experience, and security best practices knowledge.

Salary range $89,000 - $178,000 plus bonus, equity, and benefits. Hybrid work model with 3 days per week in office at NYC Global HQ.

What you'll do

  • Build and maintain reliable, scalable, and high-performance digital media measurement platforms
  • Implement observability best practices including metrics collection, dashboarding, and alerting
  • Reduce MTTR for critical incidents through automation and improved observability
  • Respond to incidents and manage Sev1/Sev2 situations
  • Monitor and maintain high availability infrastructure and services across GCP, AWS, OCI, and on-premises
  • Lead technical projects from planning through deployment
  • Build and deploy automations to eliminate operational toil and improve efficiency
  • Leverage AI-assisted development tools to accelerate automation and problem resolution
  • Build custom integrations and MCP servers for monitoring platforms
  • Implement Infrastructure-as-Code using Terraform, Helm charts, Python, scripts, and configuration management tools
  • Develop production automations for routine operational tasks
  • Create and maintain documentation, runbooks, and SOPs in Confluence
  • Participate in on-call rotations and post-incident reviews

Requirements

  • 4+ years in Site Reliability Engineering, DevOps, or related operational roles
  • Proven experience in Linux/Unix systems administration
  • Proficiency in scripting and programming languages such as Python, Bash, or Go
  • Strong experience with cloud infrastructure and services across GCP, AWS, and OCI
  • Experience with container orchestration tools like Kubernetes
  • Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, Nagios
  • Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Helm
  • Ability to develop and track SLIs, SLOs, and SLAs
  • Deep understanding of networking, DNS, load balancing, and CDN technologies
  • Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
  • Knowledge of CI/CD pipelines, GitLab, and deployment automation
  • Experience with workflow automation platforms is a strong plus
  • Exceptional communication skills
  • Proactive problem-solving approach
  • Ownership mentality
  • Passion for mentorship and knowledge sharing
  • Bachelor's or Master's degree in Computer Science, Engineering, or related field (preferred)
  • Industry certifications such as AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA), or Terraform/Grafana certifications (preferred)
  • Experience with AI-assisted development tools like ChatGPT, Cursor, Glean, or Copilot (preferred)
  • Familiarity with security best practices in cloud and containerized environments (preferred)

Tech stack

Linux, Unix, Python, Bash, Go, GCP, AWS, OCI, Kubernetes, Prometheus, Grafana, Splunk, Nagios, Terraform, Ansible, Helm, SQL, NoSQL, Vertica, MongoDB, Snowflake, GitLab, CI/CD, ChatGPT, Cursor, Glean, Copilot

Benefits

Bonus/commission (as applicable), Equity, Benefits (unspecified)
