Engineering • 4 months ago • On-site

Microsoft

Principal Software Engineer

C++, CUDA, GPU inference, TensorRT, NVIDIA Triton, real-time bidding, RTB, distributed systems, performance engineering, profiling, low latency, machine learning, LLM, deep learning, Kafka, Flink, Spark Streaming, multi-threading, NUMA, DPDK, io_uring, ad serving, online serving, scalability, autoscaling

Key details

Salary

$140K – $275K

Employment type

Full-time

Seniority

Senior

Years experience

10+

Location

Redmond, United States

Full job description

Microsoft Advertising seeks a Principal Software Engineer for the Ads Engineering Platform team to advance ad-serving infrastructure powering Bing Search, MSN, Microsoft Start, and Edge shopping. The role involves designing and optimizing large-scale, distributed GPU/CPU inference and ranking pipelines handling millions of ad requests per second with low latency and high throughput. Responsibilities include developing inference infrastructure, profiling and optimizing performance across CUDA kernels, GPU pipelines, CPU threads, and OS scheduling, ensuring live-site reliability, and mentoring engineering teams. Required qualifications include a Bachelor's degree in Computer Science or related field with 6+ years of engineering experience in languages such as C, C++, C#, Java, JavaScript, or Python. Preferred qualifications include a Master's degree or equivalent experience with 8+ years, industry experience in advertising or search backend systems, expertise in real-time data streaming, LLM inference optimization, GPU inference frameworks (NVIDIA Triton, CUDA, TensorRT), and deep systems debugging. The position is full-time, on-site in Redmond, Washington, with salary ranges from $139,900 to $274,800 annually in most U.S. locations and higher in San Francisco and New York City areas.

What you'll do

Design and lead development of large-scale, distributed online serving systems including GPU-accelerated and CPU-based ranking/inference pipelines to process millions of ad requests per second with ultra-low latency, high throughput, and reliability
Architect and optimize end-to-end inference infrastructure including model serving, batching/streaming, caching, scheduling, and resource orchestration across heterogeneous hardware (GPU, CPU, memory tiers)
Profile and optimize performance across the full stack, from CUDA kernels and GPU pipelines to CPU threads and OS-level scheduling, identifying bottlenecks and improving cost efficiency
Own live-site reliability as a DRI: design telemetry, alerting, and fault-tolerance mechanisms; drive rapid diagnosis and mitigation of performance regressions or outages in globally distributed systems
Collaborate and mentor across teams: drive architecture reviews, enforce engineering excellence, promote system-level optimization practices, and mentor others in debugging, profiling, and performance engineering

Requirements

Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including C, C++, C#, Java, JavaScript, or Python, OR equivalent experience

Preferred qualifications

Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including C, C++, C#, Java, JavaScript, or Python, OR Bachelor's Degree and 12+ years experience, OR equivalent experience
Industry experience in advertising or search engine backend systems such as large-scale ad ranking, real-time bidding, or relevance-serving infrastructure
Hands-on experience with real-time data streaming systems (Kafka, Flink, Spark Streaming), feature-store integration, and multi-region deployment for low-latency, globally distributed services
Familiarity with LLM inference optimization including model sharding, tensor/KV-cache parallelism, paged attention, continuous batching, quantization (AWQ/FP8), and hybrid CPU–GPU orchestration
Experience operating large-scale systems with SLA-based capacity forecasting, autoscaling, and performance telemetry
Leadership in cross-functional architecture initiatives and technical mentorship
Expertise in GPU inference frameworks such as NVIDIA Triton Inference Server, CUDA, and TensorRT, including custom CUDA kernels and GPU optimization
Understanding of model-serving trade-offs including batching vs. streaming, latency vs. throughput, quantization, dynamic batching, continuous model rollout, and adaptive inference scheduling
Ability to profile and optimize GPU and system workloads including tensor/memory alignment, compute–memory balancing, embedding table management, parameter servers, hierarchical caching, and vectorized inference for transformer/LLM architectures
Expertise in low-level system and OS internals including multi-threading, process scheduling, NUMA-aware memory allocation, lock-free data structures, context switching, I/O stack tuning (NVMe, RDMA), kernel bypass (DPDK, io_uring), and CPU/GPU affinity optimization

Tech stack

C, C++, C#, Java, JavaScript, Python, CUDA, NVIDIA Triton Inference Server, TensorRT, Kafka, Flink, Spark Streaming, DPDK, io_uring

Benefits

Certain roles may be eligible for benefits and other compensation (details at https://careers.microsoft.com/us/en/us-corporate-pay)

Company

Microsoft

Every company has a mission. What's ours? To empower every person and every organization to achieve more. We believe technology can and should be a force for good and that meaningful innovation contributes to a brighter world in the future and today. Our culture doesn’t just encourage curiosity; it embraces it. Each day we make progress together by showing up as our authentic selves. We show up with a learn-it-all mentality. We show up cheering on others, knowing their success doesn't diminish our own. We show up every day open to learning our own biases, changing our behavior, and inviting in differences. Because impact matters. Microsoft operates in 190 countries and is made up of approximately 228,000 passionate employees worldwide.

Industry

Software Development

Company size

10001+

Website

https://news.microsoft.com/
