AdTechTalent
Engineering · 56 days ago · On-site

Microsoft

Principal Software Engineer

C++, CUDA, GPU inference, TensorRT, NVIDIA Triton, real-time bidding, RTB, distributed systems, performance engineering, profiling, low latency, machine learning, LLM, deep learning, Kafka, Flink, Spark Streaming, multi-threading, NUMA, DPDK, io_uring, ad serving, online serving, scalability, autoscaling

Key details

Salary

$140K – $275K

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Redmond, Washington, United States

Full job description

Microsoft Advertising seeks a Principal Software Engineer for the Ads Engineering Platform team to advance the ad-serving infrastructure powering Bing Search, MSN, Microsoft Start, and Edge shopping. The role involves designing and optimizing large-scale, distributed GPU/CPU inference and ranking pipelines that handle millions of ad requests per second with low latency and high throughput. Responsibilities include developing inference infrastructure; profiling and optimizing performance across CUDA kernels, GPU pipelines, CPU threads, and OS scheduling; ensuring live-site reliability; and mentoring engineering teams. Required qualifications are a Bachelor's degree in Computer Science or a related field with 6+ years of engineering experience in languages such as C, C++, C#, Java, JavaScript, or Python. Preferred qualifications include a Master's degree with 8+ years of experience (or equivalent), industry experience in advertising or search backend systems, and expertise in real-time data streaming, LLM inference optimization, GPU inference frameworks (NVIDIA Triton, CUDA, TensorRT), and deep systems debugging. The position is full-time and on-site in Redmond, Washington, with a salary range of $139,900 to $274,800 annually in most U.S. locations and a higher range in the San Francisco and New York City areas.

What you'll do

  • Design and lead development of large-scale, distributed online serving systems including GPU-accelerated and CPU-based ranking/inference pipelines to process millions of ad requests per second with ultra-low latency, high throughput, and reliability
  • Architect and optimize end-to-end inference infrastructure including model serving, batching/streaming, caching, scheduling, and resource orchestration across heterogeneous hardware (GPU, CPU, memory tiers)
  • Profile and optimize performance across the full stack, from CUDA kernels and GPU pipelines to CPU threads and OS-level scheduling, identifying bottlenecks and improving cost efficiency
  • Own live-site reliability as a DRI: design telemetry, alerting, and fault-tolerance mechanisms; drive rapid diagnosis and mitigation of performance regressions or outages in globally distributed systems
  • Collaborate and mentor across teams: drive architecture reviews, enforce engineering excellence, promote system-level optimization practices, and mentor others in debugging, profiling, and performance engineering

Requirements

  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree and 12+ years experience, OR equivalent experience
  • Industry experience in advertising or search engine backend systems such as large-scale ad ranking, real-time bidding, or relevance-serving infrastructure
  • Hands-on experience with real-time data streaming systems (Kafka, Flink, Spark Streaming), feature-store integration, and multi-region deployment for low-latency, globally distributed services
  • Familiarity with LLM inference optimization including model sharding, tensor/KV-cache parallelism, paged attention, continuous batching, quantization (AWQ/FP8), and hybrid CPU–GPU orchestration
  • Experience operating large-scale systems with SLA-based capacity forecasting, autoscaling, and performance telemetry
  • Leadership in cross-functional architecture initiatives and technical mentorship
  • Expertise in GPU inference frameworks such as NVIDIA Triton Inference Server, CUDA, and TensorRT, including custom CUDA kernels and GPU optimization
  • Understanding of model-serving trade-offs including batching vs. streaming, latency vs. throughput, quantization, dynamic batching, continuous model rollout, and adaptive inference scheduling
  • Ability to profile and optimize GPU and system workloads including tensor/memory alignment, compute–memory balancing, embedding table management, parameter servers, hierarchical caching, and vectorized inference for transformer/LLM architectures
  • Expertise in low-level system and OS internals including multi-threading, process scheduling, NUMA-aware memory allocation, lock-free data structures, context switching, I/O stack tuning (NVMe, RDMA), kernel bypass (DPDK, io_uring), and CPU/GPU affinity optimization

Tech stack

C, C++, C#, Java, JavaScript, Python, CUDA, NVIDIA Triton Inference Server, TensorRT, Kafka, Flink, Spark Streaming, DPDK, io_uring

Benefits

Certain roles may be eligible for benefits and other compensation (details at https://careers.microsoft.com/us/en/us-corporate-pay)
