Sr. Site Reliability Engineer I

site reliability engineering, devops, python, bash, go, cloud, gcp, aws, oci, kubernetes, terraform, ansible, helm, prometheus, grafana, splunk, nagios, monitoring, automation, ci/cd, gitlab, infrastructure as code, ai-assisted development, chatgpt, cursor, copilot, networking, dns, load balancing, cdn, sql, nosql, mongodb, snowflake, vertica

Key details

Salary

$89K – $178K

Employment type

Full-time

Seniority

Mid-level

Years experience

3-5

Location

New York, US

Full job description

Build and maintain reliable, scalable, and high-performance digital media measurement platforms. Implement observability best practices including metrics collection, dashboarding, and alerting. Reduce mean time to recovery (MTTR) for critical incidents through automation and proactive monitoring. Respond to and resolve Sev1/Sev2 incidents. Monitor and maintain infrastructure across GCP, AWS, OCI, and on-premises. Lead technical projects from planning to deployment. Develop automations to improve operational efficiency. Use AI-assisted tools for automation and problem resolution. Implement Infrastructure-as-Code with Terraform, Helm, Python, and configuration management tools. Create and maintain documentation and runbooks. Participate in on-call rotations and post-incident reviews.

Requires 4+ years in SRE, DevOps, or related roles with Linux/Unix administration experience. Proficient in Python, Bash, or Go. Experienced with cloud platforms (GCP, AWS, OCI), Kubernetes, monitoring tools (Prometheus, Grafana, Splunk, Nagios), and Infrastructure-as-Code tools (Terraform, Ansible, Helm). Knowledge of networking, databases, CI/CD, and workflow automation. Strong communication, problem-solving, and ownership mindset. Preferred qualifications include relevant degrees, certifications, AI-assisted development experience, and security best practices knowledge.

Salary range $89,000 - $178,000 plus bonus, equity, and benefits. Hybrid work model with 3 days per week in office at NYC Global HQ.

What you'll do

Build and maintain reliable, scalable, and high-performance digital media measurement platforms
Implement observability best practices including metrics collection, dashboarding, and alerting
Reduce MTTR for critical incidents through automation and improved observability
Respond to incidents and manage Sev1/Sev2 situations
Monitor and maintain high availability infrastructure and services across GCP, AWS, OCI, and on-premises
Lead technical projects from planning through deployment
Build and deploy automations to eliminate operational toil and improve efficiency
Leverage AI-assisted development tools to accelerate automation and problem resolution
Build custom integrations and MCP servers for monitoring platforms
Implement Infrastructure-as-Code using Terraform, Helm charts, Python, scripts, and configuration management tools
Develop production automations for routine operational tasks
Create and maintain documentation, runbooks, and SOPs in Confluence
Participate in on-call rotations and post-incident reviews

Requirements

4+ years in Site Reliability Engineering, DevOps, or related operational roles
Proven experience in Linux/Unix systems administration
Proficiency in scripting and programming languages such as Python, Bash, or Go
Strong experience with cloud infrastructure and services across GCP, AWS, and OCI
Experience with container orchestration tools like Kubernetes
Expertise in monitoring and observability tools such as Prometheus, Grafana, Splunk, Nagios
Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Helm
Ability to develop and track SLIs, SLOs, and SLAs
Deep understanding of networking, DNS, load balancing, and CDN technologies
Familiarity with databases (SQL, NoSQL, Vertica, MongoDB, Snowflake) and data pipeline technologies
Knowledge of CI/CD pipelines, GitLab, and deployment automation
Experience with workflow automation platforms is a strong plus
Exceptional communication skills
Proactive problem-solving approach
Ownership mentality
Passion for mentorship and knowledge sharing
Bachelor's or Master's degree in Computer Science, Engineering, or related field (preferred)
Industry certifications such as AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA), or Terraform/Grafana certifications (preferred)
Experience with AI-assisted development tools like ChatGPT, Cursor, Glean, or Copilot (preferred)
Familiarity with security best practices in cloud and containerized environments (preferred)

Tech stack

Linux, Unix, Python, Bash, Go, GCP, AWS, OCI, Kubernetes, Prometheus, Grafana, Splunk, Nagios, Terraform, Ansible, Helm, SQL, NoSQL, Vertica, MongoDB, Snowflake, GitLab, CI/CD, ChatGPT, Cursor, Glean, Copilot

Benefits

Bonus/commission (as applicable)
Equity
Benefits (unspecified)

