Full job description
Tatari is hiring a Data Platform Engineer focused on systems and infrastructure to ensure the reliability, stability, and operational health of the data platform. This role involves administering, scaling, hardening, and evolving the platform rather than data engineering tasks. Responsibilities include owning platform reliability, enforcing environment promotion discipline, defining SOPs for deployments and maintenance, monitoring platform health, participating in architecture discussions, collaborating with cross-functional teams, identifying and remediating risks, and supporting stable customer-facing and internal systems. Candidates should have 3+ years of experience in cloud infrastructure, SRE, or platform engineering with strong operational discipline, knowledge of high availability architectures, workflow orchestration, Linux scripting, distributed data processing, containerization, data ingestion, infrastructure-as-code, databases, monitoring tools, network infrastructure, and security. MLOps knowledge is a plus. The role is full-time, hybrid with 2 days per week in-office in Los Angeles, California. Compensation ranges from $190,000 to $240,000 plus equity and benefits including health insurance, 401K, education benefits, unlimited PTO, wellness days, and office perks.
What you'll do
- Own the reliability and availability of data platform infrastructure across all environments
- Enforce and improve environment promotion discipline
- Define and uphold SOPs around deployments, maintenance windows, and change management
- Instrument and monitor platform health using observability tooling and build meaningful alerting
- Participate in architecture and deployment discussions and push back when something isn't ready
- Collaborate with data scientists, engineers, and product managers on infrastructure needs
- Identify and remediate reliability risks before they become incidents
- Support customer-facing and internal systems with a bias toward stability over velocity
Requirements
- Operational instinct and discipline around production environments
- 3+ years in cloud infrastructure, SRE, or platform engineering
- Experience with high availability architecture including blue/green deployments, data replication, and load balancing
- Experience with workflow orchestration tools like Airflow or similar
- Strong Linux fundamentals and scripting skills (Bash, Python, or similar)
- Experience with distributed data processing frameworks such as Spark or PySpark
- Experience with containerization and orchestration tools like Kubernetes and Docker
- Experience with data ingestion, ETL, or streaming systems like Kafka or Flink
- Experience with infrastructure-as-code and provisioning tools like Terraform or Helm
- Knowledge of OLAP and OLTP databases such as Clickhouse, Postgres, or Redshift
- Experience with monitoring, logging, and observability tools like Datadog, Prometheus, or Kibana
- Experience administering and scaling managed data platforms like Databricks
- Knowledge of network infrastructure fundamentals including load balancers, DNS, auto-scaling, multi-region topologies, and proxies
- Knowledge of security and access management including least-privilege, secrets management, and controls for data systems
- Familiarity with MLOps concepts or tooling is a plus
- Humility, methodical execution, strong communication, ownership, and independence
Tech stack
AWS, GCP, Azure, Airflow, cron, Bash, Python, Spark, PySpark, Kubernetes, Docker, Kafka, Flink, Terraform, Helm, Clickhouse, Postgres, Redshift, Datadog, Prometheus, Kibana, Databricks
Benefits
- Total compensation ($190,000 - $240,000)
- Equity compensation
- Health insurance coverage for employee and dependents
- 401K, FSA, and commuter benefits
- $150 monthly spending account
- $1,000 annual continued education benefit
- $500 Newbie Productivity Perk
- Unlimited PTO and sick days
- Monthly Company Wellness Day Off
- Snacks, drinks, and catered lunches at the office
- Team building events
- Hybrid return-to-office of 2 days per week