Full job description
Tatari is seeking a Data Platform Engineer focused on systems and infrastructure to ensure the reliability, stability, and operational health of the data platform. The role involves administering, scaling, hardening, and evolving the platform rather than data engineering tasks. Responsibilities include owning platform reliability across environments, enforcing deployment discipline, defining SOPs, monitoring platform health, participating in architecture discussions, collaborating with teams on infrastructure needs, identifying risks, and supporting systems with a focus on stability. Candidates should have 3+ years in cloud infrastructure, SRE, or platform engineering with experience in high availability architecture, workflow orchestration, Linux scripting, distributed data processing, containerization, data ingestion, infrastructure-as-code, OLAP/OLTP databases, monitoring tools, managed data platforms, network infrastructure, and security best practices. MLOps knowledge is a plus. The role values operational discipline, humility, methodical execution, communication, ownership, and independence. Compensation ranges from $190,000 to $240,000 with equity and benefits including health insurance, 401K, education benefits, unlimited PTO, wellness days, snacks, team events, and hybrid work with 2 days in office per week. Location: San Francisco, CA.
What you'll do
- Own the reliability and availability of data platform infrastructure across all environments
- Enforce and improve environment promotion discipline
- Define and uphold SOPs around deployments, maintenance windows, and change management
- Instrument and monitor platform health using observability tooling and build meaningful alerting
- Participate in architecture and deployment discussions and push back when something isn't ready
- Collaborate with data scientists, engineers, and product managers on infrastructure needs
- Identify and remediate reliability risks before they become incidents
- Support customer-facing and internal systems with a bias toward stability over velocity
Requirements
- Strong operational instincts around production systems and maintenance windows
- 3+ years in cloud infrastructure, SRE, or platform engineering
- Experience with high availability architecture including blue/green deployments, data replication, load balancing
- Experience with workflow orchestration tools like Airflow or similar
- Strong Linux fundamentals and scripting skills (Bash, Python, or similar)
- Experience with distributed data processing frameworks like Spark or PySpark
- Experience with containerization and orchestration (Kubernetes, Docker, or similar)
- Experience with data ingestion, ETL, or streaming systems (Kafka, Flink, or similar)
- Experience with infrastructure-as-code and provisioning tools (Terraform, Helm, or similar)
- Knowledge of OLAP and OLTP databases (Clickhouse, Postgres, Redshift, or similar)
- Experience with monitoring, logging, and observability tools (Datadog, Prometheus, Kibana, or similar)
- Experience administering and scaling managed data platforms (Databricks or similar)
- Knowledge of network infrastructure fundamentals (load balancers, DNS, auto-scaling, multi-region topologies, proxies)
- Knowledge of security and access management best practices (least-privilege, secrets management, controls for data systems)
- Familiarity with MLOps concepts or tooling is a plus
- Humility, methodical execution, communication, ownership, and independence
Tech stack
AWS, GCP, Azure, Airflow, Bash, Python, Spark, PySpark, Kubernetes, Docker, Kafka, Flink, Terraform, Helm, Clickhouse, Postgres, Redshift, Datadog, Prometheus, Kibana, Databricks
Benefits
- Total compensation ($190,000 - $240,000)
- Equity compensation
- Health insurance coverage for employee and dependents
- 401K, FSA, and commuter benefits
- $150 monthly spending account
- $1,000 annual continued education benefit
- $500 Newbie Productivity Perk
- Unlimited PTO and sick days
- Monthly Company Wellness Day Off
- Snacks, drinks, and catered lunches at the office
- Team building events
- Hybrid return-to-office of 2 days per week