Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d

Authors:

Yuan Tang: Project Lead, KServe; Senior Principal Software Engineer
Ran Pollak: Manager, AI Catalyst at Red Hat

Originally posted on KServe Blog.

Enterprises today seek to integrate generative AI (GenAI) capabilities into their applications. However, scaling large AI models introduces complexity: managing high-volume traffic from large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.

Platform engineering leaders require more than just model deployment capabilities. They need a robust, Kubernetes-native infrastructure that supports:

Efficient GPU utilization
Intelligent request routing
Distributed inference patterns
Cost-aware autoscaling
Production-grade governance

This article demonstrates how two open-source solutions, KServe and llm-d, can be combined to address these challenges.

We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with deeper focus on KServe’s LLMInferenceService, available since KServe v0.16.

KServe: Simplified Deployment of AI Models on Kubernetes

KServe is a Kubernetes-based model serving platform that simplifies deploying and managing ML models, including LLMs, at scale.

For platform engineers, KServe acts as the model serving control plane: the layer responsible for lifecycle, scaling, and operational governance.

KServe Generative Inference Architecture

Inference as a Service

InferenceService serves as KServe’s core abstraction for model deployment, encapsulating the full serving lifecycle, including:

Automatic deployment creation and reconciliation
Request-based autoscaling with scale-to-zero and autoscaling based on custom metrics
Revision management and canary rollouts
Endpoint exposure and traffic routing
Runtime abstraction across serving backends for both predictive and generative AI
Optional pre-processing/post-processing, inference pipelines, and ensembles

ML engineers provide trained models. Platform engineers retain operational control without writing custom deployment code.

LLMInferenceService in KServe

KServe v0.16 introduces stronger generative AI capabilities, including LLMInferenceService, designed specifically for large language model workloads.

Unlike traditional stateless predictors, LLM workloads require:

Long-running streaming responses
GPU-heavy memory footprints
Prefix KV-cache management
High-concurrency token streaming
OpenAI-compatible APIs

LLMInferenceService shares common foundations with InferenceService but introduces additional capabilities tailored for large language models, including:

Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence

Imagine you want to bring the power of generative AI directly into your applications, but without rewriting your entire stack. It offers OpenAI-compatible endpoints like /v1/chat/completions, complete with streaming token responses and multi-turn support. With prompt templating built in, developers can integrate seamlessly with existing tools—whether it’s the OpenAI SDKs, LangChain, LlamaIndex, Llama Stack, RAG frameworks, or even enterprise GenAI gateways.

Under the hood, KServe connects to LLM-optimized runtimes such as vLLM, Hugging Face TGI, or other GPU-native backends. These engines bring advanced capabilities like continuous batching, memory-efficient paged attention, and KV-cache reuse, delivering high throughput per GPU.

Yet, while these runtime-level optimizations make each pod lightning fast, true cluster-wide efficiency needs more. That’s exactly the role of llm-d: adding an extra layer of intelligence that orchestrates resources and maximizes performance across the entire deployment.

Distributed & Multi-Node Model Support

LLMInferenceService supports advanced parallelism strategies implemented by runtimes, including tensor parallelism, pipeline parallelism, and multi-GPU sharding.

This enables hosting 70B+ parameter models, partitioning models across nodes, and serving models larger than single-GPU memory.

KServe orchestrates the deployment topology, while the runtime manages execution parallelism.

Advanced Autoscaling & Networking (Including Scale-to-Zero)

KServe integrates deeply with Kubernetes to support request- and concurrency-based autoscaling via Knative, GPU-backed scaling, and scale-to-zero for cost control.

It also integrates with the Kubernetes Gateway API for TLS termination, traffic splitting, and advanced routing.

This makes it suitable for development environments, internal copilots, and large-scale production workloads.

Kubernetes Gateway API Integration

KServe integrates with Kubernetes Gateway API for:

Enterprise-grade routing
TLS termination
Traffic splitting
Multi-model routing

This enables integration with modern Kubernetes networking stacks.

Where KServe Alone Is Not Enough

Even with LLMInferenceService and optimized runtimes, KServe does not inherently:

Route requests based on KV-cache locality across replicas
Separate prefill and decode cluster-wide
Perform SLA-aware routing decisions
Optimize GPU utilization across multiple pods

To address these, we introduce llm-d.

llm-d: Distributed Intelligence for LLM Inference

llm-d is a Kubernetes-native distributed inference framework designed to enhance performance and efficiency of LLM workloads.

If KServe is the control plane for models, llm-d is the distributed intelligence scheduling layer.

llm-d Architecture

KV-Cache Aware Scheduling and Disaggregated Inference with llm-d

As LLM deployments mature, scaling is no longer just about adding GPUs. It’s about using them intelligently. Modern runtimes such as vLLM introduced prefix (KV) caching to reduce redundant computation, but without smart scheduling, much of that benefit is lost.

This is where llm-d changes the game.

Disaggregated Inference (Prefill / Decode Separation)

LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute-heavy, processing the full prompt and building the model’s attention context. The decode phase is latency-sensitive, generating tokens step by step where responsiveness directly impacts user experience.

llm-d separates these phases across different GPU groups, assigning compute-optimized resources to prefill and latency-optimized resources to decode. With intelligent scheduling between them, workloads are aligned to the right hardware profile.

This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token by eliminating resource contention between fundamentally different workloads.

Intelligent Inference Scheduler

llm-d’s inference scheduler evaluates the following metrics:

GPU utilization
Queue depth
Cache residency
SLA constraints
Load distribution

It enhances load balancing with an intelligent scheduler to decrease serving latency and increase throughput with prefix-cache aware routing, utilization-based load balancing, fairness and prioritization for multi-tenant serving, and predicted latency balancing.

KServe LLMInferenceService and llm-d

Responsibility Separation

This layered design ensures composability and specialization, providing a complete, production-ready solution for generative AI. KServe acts as the control plane and LLMInferenceService delivers the generative API abstraction, while llm-d provides the cluster-wide optimization.

Layer	Responsibility
KServe	Model lifecycle, scaling, governance
LLMInferenceService	Generative API abstraction
vLLM	Efficient execution inside runtime
llm-d	Cross-runtime routing & cache awareness
Kubernetes	Resource orchestration

Together, KServe and llm-d enable a production-ready, Kubernetes-native inference platform that balances scalability, performance, and cost efficiency, providing the best of both worlds for cloud-native AI inference at scale.

Cost Efficiency Comparison: Naive vs Optimized

Serving LLMs at scale is no longer just a model problem. It is a distributed systems problem where naive load balancing leads to significant inefficiencies and wasted resources — lost cache locality, GPU imbalance, redundant prefill processing, high tail latency, and overprovisioned GPUs.

Naive Problems:

Cache locality loss
GPU imbalance
Redundant prefill processing
High tail latency
Overprovisioned GPUs

Optimized Architecture with KServe + llm-d

The combined KServe and llm-d solution introduces distributed intelligence to solve the problems of naive architectures, delivering superior performance, scalability, and cost control. This optimized architecture is pluggable and extensible to work well with many AI and cloud-native technologies.

KServe Layered Architecture

Benefits:

Cache reuse preserved
Balanced GPU utilization
Reduced recomputation
Lower cost per token
Controlled autoscaling via LLMInferenceService

Benchmark Results: Why Cluster-Level Intelligence Matters

By integrating llm-d’s cache-aware routing, prefill and decode disaggregation, and SLA-based scheduling with KServe’s enterprise-grade generative serving and autoscaling, the system achieves cluster-wide GPU optimization.

Note: The following results are based on benchmarks published by the llm-d project

Optimization Area	Naive Architecture (Round Robin LB)	Optimized (KServe + llm-d)	Source
Cache Locality	Requests routed randomly → KV cache frequently missed	Cache-aware routing preserves prefix locality	llm-d blog
Time to First Token (P90)	Baseline latency under cache-blind scheduling	Up to ~57× faster P90 TTFT in benchmark	llm-d blog
Token Throughput	~4,400 tokens/sec (baseline test cluster)	~8,730 tokens/sec (~2× improvement)	llm-d blog
Throughput at Scale	Degrades under multi-tenant load	Sustained 4.5k–11k tokens/sec	llm-d blog
Tail Latency (P95/P99)	Higher tail latency due to stragglers & imbalance	~50% tail latency reduction (reported tests)	Red Hat Developers
GPU Utilization	Uneven utilization, idle GPUs possible	Improved effective utilization via routing intelligence	llm-d docs
Autoscaling Control	Scale reacts to load only	Works with KServe autoscaling + routing intelligence	KServe docs

Modern GenAI platforms require cache locality awareness, phase-aware scheduling, distributed intelligence, and composable Kubernetes-native design. This combination ensures a production-ready system that meets the demands of large-scale production workloads.

Next Steps

Explore detailed project documentation:

KServe
llm-d

Engage with community resources and Slack channels to stay updated and contribute to ongoing developments: