Full job description
Microsoft Advertising seeks a Principal Software Engineer for the Ads Engineering Platform team to advance the ad-serving infrastructure that powers Bing Search, MSN, Microsoft Start, and Edge shopping. The role involves designing and optimizing large-scale, distributed GPU/CPU inference and ranking pipelines that handle millions of ad requests per second with low latency and high throughput. Responsibilities include developing inference infrastructure; profiling and optimizing performance across CUDA kernels, GPU pipelines, CPU threads, and OS scheduling; ensuring live-site reliability; and mentoring engineering teams. Required qualifications include a Bachelor's degree in Computer Science or a related field with 6+ years of engineering experience in languages such as C, C++, C#, Java, JavaScript, or Python. Preferred qualifications include a Master's degree with 8+ years of experience (or equivalent experience), industry experience in advertising or search backend systems, and expertise in real-time data streaming, LLM inference optimization, GPU inference frameworks (NVIDIA Triton, CUDA, TensorRT), and deep systems debugging. The position is full-time and on-site in Redmond, Washington. The salary range is $139,900 to $274,800 annually in most U.S. locations, with higher ranges in the San Francisco and New York City areas.
What you'll do
- Design and lead development of large-scale, distributed online serving systems including GPU-accelerated and CPU-based ranking/inference pipelines to process millions of ad requests per second with ultra-low latency, high throughput, and reliability
- Architect and optimize end-to-end inference infrastructure including model serving, batching/streaming, caching, scheduling, and resource orchestration across heterogeneous hardware (GPU, CPU, memory tiers)
- Profile and optimize performance across the full stack, from CUDA kernels and GPU pipelines to CPU threads and OS-level scheduling, identifying bottlenecks and improving cost efficiency
- Own live-site reliability as a Designated Responsible Individual (DRI): design telemetry, alerting, and fault-tolerance mechanisms; drive rapid diagnosis and mitigation of performance regressions or outages in globally distributed systems
- Collaborate and mentor across teams: drive architecture reviews, enforce engineering excellence, promote system-level optimization practices, and mentor others in debugging, profiling, and performance engineering
Requirements
- Required: Bachelor's Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience coding in languages including C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
- Preferred: Master's Degree in Computer Science or a related technical field AND 8+ years of technical engineering experience coding in languages including C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree and 12+ years of experience, OR equivalent experience
- Industry experience in advertising or search engine backend systems such as large-scale ad ranking, real-time bidding, or relevance-serving infrastructure
- Hands-on experience with real-time data streaming systems (Kafka, Flink, Spark Streaming), feature-store integration, and multi-region deployment for low-latency, globally distributed services
- Familiarity with LLM inference optimization including model sharding, tensor/KV-cache parallelism, paged attention, continuous batching, quantization (AWQ/FP8), and hybrid CPU–GPU orchestration
- Experience operating large-scale systems with SLA-based capacity forecasting, autoscaling, and performance telemetry
- Leadership in cross-functional architecture initiatives and technical mentorship
- Expertise in GPU inference frameworks such as NVIDIA Triton Inference Server, CUDA, and TensorRT, including custom CUDA kernels and GPU optimization
- Understanding of model-serving trade-offs including batching vs. streaming, latency vs. throughput, quantization, dynamic batching, continuous model rollout, and adaptive inference scheduling
- Ability to profile and optimize GPU and system workloads including tensor/memory alignment, compute–memory balancing, embedding table management, parameter servers, hierarchical caching, and vectorized inference for transformer/LLM architectures
- Expertise in low-level system and OS internals including multi-threading, process scheduling, NUMA-aware memory allocation, lock-free data structures, context switching, I/O stack tuning (NVMe, RDMA), kernel bypass (DPDK, io_uring), and CPU/GPU affinity optimization
Tech stack
C, C++, C#, Java, JavaScript, Python, CUDA, NVIDIA Triton Inference Server, TensorRT, Kafka, Flink, Spark Streaming, DPDK, io_uring
Benefits
Certain roles may be eligible for benefits and other compensation (details at https://careers.microsoft.com/us/en/us-corporate-pay)