Senior Site Reliability Engineer (GPU & ML Infrastructure)

site reliability engineering, SRE, GPU, machine learning, ML infrastructure, Ray, Kubernetes, NVIDIA Triton, distributed systems, Python, Go, C#, inference serving, cloud-native, GKE, EKS, platform engineering

Key details

Salary

Not specified

Employment type

Permanent Full Time

Seniority

Senior

Years experience

5-10

Location

Grenoble, France; Paris, France

Full job description

Senior Site Reliability Engineer role focused on GPU and ML infrastructure. Responsibilities include building and operating scalable Ray clusters on Kubernetes, developing self-service distributed computing platforms for ML workloads, and optimizing NVIDIA Triton inference platforms. Requires 5+ years of experience in backend, SRE, or platform engineering with distributed systems, strong Kubernetes skills, hands-on GPU workload experience, and software engineering skills in C#, Python, or Go. Bonus for experience with distributed ML frameworks, inference serving stacks, GPU scheduling, and cloud-native GPU orchestration. Hybrid work model based in Paris and Grenoble, France. Benefits include career development, health and wellness support, an inclusive team environment, a competitive salary, and potential equity.

What you'll do

Build and operate scalable Ray clusters running on Kubernetes
Develop reliable self-service distributed computing platforms for ML workloads
Improve provisioning, observability, reliability, and operational efficiency of Ray-as-a-service environments
Operate and optimize large-scale inference platforms using NVIDIA Triton Inference Server
Improve latency, throughput, scalability, and GPU utilization for deep learning inference workloads
Collaborate closely with ML engineers, data scientists, and infrastructure teams to deliver reliable, production-grade ML platforms

Requirements

5+ years of experience in backend engineering, Site Reliability Engineering, or platform engineering roles focused on distributed systems
Strong experience with Kubernetes, including workload scheduling, dynamic provisioning, and custom controllers/operators
Hands-on experience running or optimizing GPU-based workloads in production, ideally for ML training or inference systems
Strong software engineering skills in C#, Python, Go, or similar languages, with a focus on building reliable distributed systems
Experience building or operating production-grade infrastructure with strong requirements around performance, scalability, and reliability
Strong interest in automation, observability, and designing systems that scale efficiently under high load
Bonus: Experience with distributed ML frameworks such as Ray or similar systems
Bonus: Familiarity with inference serving stacks such as NVIDIA Triton or TensorRT
Bonus: Experience with GPU scheduling, resource management, or multi-tenant GPU platforms
Bonus: Exposure to cloud-native GPU orchestration (GKE, EKS, or on-prem Kubernetes GPU clusters)

Tech stack

Ray, Kubernetes, NVIDIA Triton Inference Server, C#, Python, Go, TensorRT, GKE, EKS

Benefits

Hybrid work model blending home and in-office experiences
Learning, mentorship & career development programs
Health benefits, wellness perks & mental health support
Diverse, inclusive, and globally connected team
Attractive salary with performance-based rewards and family-friendly policies
Potential for equity depending on role and level
