
Commit 39e6784

Jooho and yuzisun authored
docs: Add LLMInferenceService guides and diagrams (#539)
* docs: Add LLMInferenceService guides and diagrams

  Add detailed documentation covering:
  - Overview and core concepts
  - Architecture and component interactions
  - Configuration patterns and examples
  - Required dependencies and setup

  Signed-off-by: Jooho Lee <[email protected]>

* fix typo

  Signed-off-by: Jooho Lee <[email protected]>

* update docs

  Signed-off-by: Jooho Lee <[email protected]>

---------

Signed-off-by: Jooho Lee <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
1 parent d9081c9 commit 39e6784

16 files changed (+1444, -2 lines)
docs/concepts/architecture/control-plane-llmisvc.md

Lines changed: 328 additions & 0 deletions
@@ -0,0 +1,328 @@
---
sidebar_label: "Control Plane - LLMISVC"
sidebar_position: 2
title: "LLMInferenceService Architecture Deep Dive"
---

# LLMInferenceService Architecture Deep Dive

This guide provides an in-depth look at the LLMInferenceService architecture, component interactions, and advanced patterns for production deployments.

> **Prerequisites**: Familiarity with [core concepts](../../model-serving/generative-inference/llmisvc/llmisvc-overview.md) and [configuration](../../model-serving/generative-inference/llmisvc/llmisvc-configuration.md) is recommended.

---

## System Architecture Overview

<img src={require('../../model-serving/generative-inference/llmisvc/imgs/architecture_overview.png').default} alt="Architecture Overview" style={{width: '700px', maxWidth: '100%'}} />

---

## Gateway Architecture

### What is a Gateway?

A **Gateway** is the entry point for external traffic into the Kubernetes cluster. It is a Kubernetes Gateway API resource that:

- Defines listeners (HTTP, HTTPS, ports)
- Configures TLS termination
- Is managed by a Gateway provider (Envoy Gateway, Istio, etc.)
- Can be cluster-scoped or namespace-scoped

### Managed vs Referenced Gateway

#### Managed Gateway (Default)

```yaml
spec:
  router:
    gateway: {} # KServe creates Gateway automatically
```

**What KServe creates**

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: llama-3-8b-kserve-gateway
  namespace: default
spec:
  gatewayClassName: eg # Default: Envoy Gateway
  listeners:
    - name: http
      port: 80
      protocol: HTTP
```

#### Referenced Gateway (Existing)

```yaml
spec:
  router:
    gateway:
      refs:
        - name: my-custom-gateway
          namespace: istio-system
```

**Use case**: Shared gateway across multiple services.
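
For reference, the `refs` entry above points at a Gateway that already exists in the cluster. A minimal sketch of what such a shared Gateway might look like is shown below; the `istio` gateway class and the HTTPS listener are illustrative assumptions, not values KServe requires:

```yaml
# Hypothetical pre-existing Gateway shared by multiple services
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-custom-gateway
  namespace: istio-system
spec:
  gatewayClassName: istio # assumption: any installed GatewayClass works here
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: my-tls-cert # hypothetical TLS certificate Secret
```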

---

## HTTPRoute Architecture

### What is an HTTPRoute?

An **HTTPRoute** defines path-based routing rules that connect Gateways to backend services (InferencePools or Services).

### Managed HTTPRoute (Default)

```yaml
spec:
  router:
    route: {} # KServe creates HTTPRoute automatically
```

**What KServe creates**

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-3-8b-kserve-route
  namespace: default
spec:
  parentRefs:
    - name: llama-3-8b-kserve-gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-8b-inference-pool
```
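
The managed route above forwards all traffic from its parent Gateway to the InferencePool. If you manage the HTTPRoute yourself, standard Gateway API matchers let you scope it to inference paths. A minimal sketch of such a rule (the `/v1` prefix is an illustrative choice, not a KServe default):

```yaml
# Hypothetical self-managed rule scoped to OpenAI-style inference paths
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /v1
    backendRefs:
      - group: inference.networking.x-k8s.io
        kind: InferencePool
        name: llama-3-8b-inference-pool
```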

### Routing Flow

<img src={require('../../model-serving/generative-inference/llmisvc/imgs/routing_flow.png').default} alt="Routing Flow" style={{width: '700px', maxWidth: '100%'}} />

---

## Scheduler Architecture

### Overview

The **Scheduler** (also called the **Endpoint Picker Pod**, or **EPP**) provides intelligent request routing based on:

- **Prefix cache**: Routes to pods with matching KV cache blocks
- **Load**: Balances requests across available endpoints
- **Prefill-Decode separation**: Routes to the appropriate pool

### Scoring Mechanism

The scheduler tracks KV cache blocks via ZMQ events published by vLLM pods:

- **BlockStored**: A cache block was created (includes block hash, tokens, storage location)
- **BlockRemoved**: A cache block was evicted from memory

These events populate an index mapping `{ModelName, BlockHash}` → `{PodID, DeviceTier}`, allowing the scheduler to track which pods hold which cache blocks.

For each incoming request, the scheduler calculates a weighted score across all endpoints using pluggable scorers:

| Scorer | Weight | Purpose |
|--------|--------|---------|
| **Prefix cache scorer** | 2.0 | Prioritizes pods with matching KV cache blocks |
| **Load-aware scorer** | 1.0 | Balances requests across endpoints |
| **Queue scorer** | (configurable) | Routes based on queue depth |

The request is routed to the highest-scoring pod, optimizing for both cache hit rate and load distribution.
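
As a worked example (assuming each scorer normalizes to a 0-to-1 range, higher being better): a busy pod with a 0.9 prefix-cache match and a 0.4 load score totals 2.0 × 0.9 + 1.0 × 0.4 = 2.2, while an idle pod with no cache match totals 2.0 × 0.0 + 1.0 × 1.0 = 1.0, so cache affinity wins unless the cached pod is far more heavily loaded.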

---

## Scheduler vs No Scheduler

| Feature | No Scheduler | With Scheduler |
|---------|-------------|----------------|
| **Routing** | Kubernetes Service (kube-proxy) | InferencePool (EPP) |
| **Load Balancing** | Round-robin | Intelligent (load-aware, cache-aware) |
| **Prefix Cache** | ❌ | ✅ Routes to pods with matching KV cache |
| **Prefill-Decode** | ❌ | ✅ Automatic pool selection |
| **Resource Overhead** | Minimal (no extra pods) | Low (1 scheduler pod) |
| **Use Case** | Simple/Dev | Production |
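
If the scheduler follows the same managed-resource pattern as the gateway and route shown earlier, enabling it is a matter of declaring it under `spec.router`. A sketch to verify against the configuration guide (the empty `scheduler: {}` stanza is an assumption based on that pattern):

```yaml
spec:
  router:
    gateway: {}   # managed Gateway
    route: {}     # managed HTTPRoute
    scheduler: {} # assumed: managed EPP and InferencePool, per the managed pattern
```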

---

## Request Flow Analysis

### Standard Request Flow

```
1. Client sends request
        ↓
2. Gateway receives (port 80/443)
        ↓
3. HTTPRoute matches path (/v1/completions)
        ↓
4. Routes to InferencePool
        ↓
5. Gateway queries EPP Service
   "Which endpoint should I use?"
        ↓
6. EPP evaluates:
   - Prefix cache match (weight: 2.0)
   - Current load (weight: 1.0)
        ↓
7. EPP returns selected endpoint
   "Use Pod 2 (10.0.1.42:8000)"
        ↓
8. Gateway forwards to Pod 2
        ↓
9. Pod 2 processes inference
        ↓
10. Response flows back to client
```

### Prefill-Decode Request Flow

```
1. Client sends NEW request (no KV cache)
        ↓
2. Gateway → HTTPRoute → InferencePool (Prefill)
        ↓
3. EPP: "This is a new request" → Route to Prefill Pool
        ↓
4. Prefill Pod processes prompt, generates KV cache
        ↓
5. KV cache transferred to Decode Pod via RDMA
        ↓
6. Response includes KV transfer metadata
        ↓
7. Client sends CONTINUATION request (with KV cache ID)
        ↓
8. EPP: "This is a continuation" → Route to Decode Pool
        ↓
9. Decode Pod uses transferred KV cache
        ↓
10. Token-by-token generation
```

---

## Network Flow

### KV Cache Communication

LLM serving requires two types of KV cache communication:

#### 1. KV Cache Event Tracking (ZMQ)

**Purpose**: Real-time monitoring of KV cache blocks for intelligent routing

- **Protocol**: ZeroMQ (ZMQ) over TCP/IP
- **Usage**: vLLM publishes events when cache blocks are created or evicted
- **Consumer**: The Scheduler (EPP) tracks which pods have which cache blocks (see [Scoring Mechanism](#scoring-mechanism))

**Configuration**:

```yaml
spec:
  template:
    containers:
      - name: main
        env:
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv-events-config '{\"enable_kv_cache_events\":true,\"publisher\":\"zmq\",\"endpoint\":\"tcp://scheduler:5557\",\"topic\":\"kv@${POD_IP}@model\"}'"
```
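
The `${POD_IP}` placeholder in the topic assumes a `POD_IP` environment variable exists in the container. If your setup does not inject one, the Kubernetes downward API can supply it; a minimal sketch (check whether KServe already sets this for you):

```yaml
env:
  - name: POD_IP # assumption: not injected automatically in your environment
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
```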

#### 2. KV Cache Data Transfer (NixlConnector)

**Purpose**: Actual KV cache block transfer for prefill-decode separation

- **Protocol**: NixlConnector (RDMA-based, RoCE network)
- **Usage**: Transfers KV cache blocks from prefill pods to decode pods
- **Use Case**: Disaggregated prefill-decode architecture

**Configuration**:

```yaml
spec:
  template:
    containers:
      - name: main
        env:
          - name: KSERVE_INFER_ROCE
            value: "true"
          - name: VLLM_ADDITIONAL_ARGS
            value: "--kv_transfer_config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\"}'"
```
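
In vLLM's `--kv_transfer_config`, `kv_role` accepts `kv_producer`, `kv_consumer`, or `kv_both`; `kv_both` lets a pod both send and receive blocks. In a strict prefill/decode split you could instead pin the roles per workload, as in this sketch (whether KServe expects explicit per-workload roles here is an assumption to verify):

```yaml
# Prefill workload: produces KV cache blocks
- name: VLLM_ADDITIONAL_ARGS
  value: "--kv_transfer_config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_producer\"}'"
# Decode workload: consumes transferred KV cache blocks
- name: VLLM_ADDITIONAL_ARGS
  value: "--kv_transfer_config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_consumer\"}'"
```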

## Advanced Patterns

### Pattern: Multi-Node Prefill-Decode

Combining P/D separation with LeaderWorkerSet:

```yaml
spec:
  # Decode workload (multi-node)
  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"
  worker:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "4"

  # Prefill workload (multi-node)
  prefill:
    parallelism:
      tensor: 4
      data: 16
      dataLocal: 8
    template:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"
    worker:
      containers:
        - name: main
          resources:
            limits:
              nvidia.com/gpu: "8"
```

**Result**:

- Prefill: 2 LWS replicas (16/8), each with 8 GPUs
- Decode: 2 LWS replicas (8/4), each with 4 GPUs
- Total: 24 GPUs (16 prefill + 8 decode)

---

## Next Steps

- **[Configuration Guide](../../model-serving/generative-inference/llmisvc/llmisvc-configuration.md)**: Detailed spec reference
- **[Dependencies](../../model-serving/generative-inference/llmisvc/llmisvc-dependencies.md)**: Install required components

docs/concepts/architecture/index.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ KServe offers two powerful deployment modes to fit diverse operational needs:
  ## In This Section

  - **[Control Plane](control-plane.md)**: Learn how KServe manages model lifecycles, autoscaling, and orchestration
+ - **[Control Plane - LLMISVC](control-plane-llmisvc.md)**: Understand LLMInferenceService architecture, component interactions, and advanced patterns
  - **[Data Plane](./data-plane/data-plane.md)**: Explore how inference requests are processed, including protocol support and performance optimizations

  ## What's Next?
8 binary image files added (architecture diagrams)
