
Commit a1258dd

feat: add model architecture configuration documentation
This commit introduces a new document detailing the model architecture configuration, including support for transformer architectures and their components. It outlines terminology and properties, and provides an example configuration for better understanding and implementation.

Signed-off-by: Zhao Chen <[email protected]>
1 parent d962898 commit a1258dd

File tree: 2 files changed (+375 -0 lines)

docs/architecture.md

Lines changed: 371 additions & 0 deletions
# Model Architecture Configuration

Each model artifact has an associated optional architecture configuration that describes the detailed structure and components of the model. Currently, only decoder-type transformer architectures are supported. Future extensions will include:

- Multi-modal language models
- State Space Models
- Diffusion Models

## Terminology

The transformer is the most popular architecture for LLMs. It consists of a stack of structured layers, where each layer contains a self-attention block and a feed-forward network, together with normalization layers and residual connections. The complete architecture includes a tokenizer, an input embedding layer, a position embedding layer, the transformer layers, and an output embedding layer. The transformer architecture has remained relatively stable since [Attention Is All You Need][attention-paper]. As shown in the table below, current open-weight model architectures are converging, making it feasible to define a common abstraction.

| Model                           | Tokenizer | Position Embedding | Self-Attention | Norm       | Feed-Forward | Residual |
|---------------------------------|-----------|--------------------|----------------|------------|--------------|----------|
| [GPT2][gpt2-repo]               | BPE       | Learned            | MHA            | Layer Norm | MLP          | Yes      |
| [Llama2][llama2-paper]          | BPE       | RoPE               | GQA            | RMS Norm   | MLP          | Yes      |
| [Llama3][llama3-paper]          | BPE       | RoPE               | GQA            | RMS Norm   | MLP          | Yes      |
| [Qwen2][qwen2-paper]            | BPE       | RoPE               | GQA            | RMS Norm   | MoE          | Yes      |
| [Qwen3][qwen3-paper]            | BPE       | RoPE               | GQA            | RMS Norm   | MoE          | Yes      |
| [Gemma2][gemma2-paper]          | BPE       | RoPE               | GQA            | RMS Norm   | MLP          | Yes      |
| [Gemma3][gemma3-paper]          | BPE       | RoPE               | GQA            | RMS Norm   | MLP          | Yes      |
| [Mixtral][mixtral-paper]        | BPE       | RoPE               | SWA            | RMS Norm   | MoE          | Yes      |
| [DeepseekV2][deepseek-v2-paper] | BPE       | RoPE               | MLA            | RMS Norm   | MoE          | Yes      |
| [DeepseekV3][deepseek-v3-paper] | BPE       | RoPE               | MLA            | RMS Norm   | MoE          | Yes      |
| [Kimi-K2][kimi-k2-paper]        | BPE       | RoPE               | MLA            | RMS Norm   | MoE          | Yes      |

*Note: Each model represents the largest variant within its respective series.*

## Properties

- **transformer** _object_, REQUIRED

  Contains the transformer configuration parameters.

  - **architecture_version** _string_, REQUIRED

    The version of the transformer architecture configuration, using semantic versioning. An independent version is required for future extensibility.

  - **type** _string_, REQUIRED

    The type of transformer architecture. Currently supported: `decoder`. The default is `decoder`.

  - **vocabulary_size** _uint64_, REQUIRED

    The vocabulary size of the model.

  - **hidden_size** _uint64_, REQUIRED

    The hidden size of the model.

  - **tokenizer** _object_, REQUIRED

    Contains the tokenizer configuration parameters.

    - **type** _string_, REQUIRED

      The tokenizer type. Currently supported: `bpe`. The default is `bpe`.

    - **library** _string_, REQUIRED

      The name or URL of the tokenizer library. Currently supported: `huggingface`. The default is `huggingface`.

    - **revision** _string_, OPTIONAL

      The revision of the tokenizer library. Can be a branch name, tag name, commit ID, or `main` (the latest version). The default is `main`.

  - **token_embedding** _object_, REQUIRED

    Contains the token embedding configuration parameters.

    - **has_bias** _boolean_, REQUIRED

      Whether the embedding has a bias. The default is `false`.

    - **has_norm** _boolean_, REQUIRED

      Whether the embedding has a normalization layer. The default is `true`. The normalization configuration is defined in the `normalization` property.

    - **shared_embedding** _boolean_, REQUIRED

      Whether the embedding is shared with the model prediction head. The default is `false`.

  - **position_embedding** _object_, REQUIRED

    Contains the position embedding configuration parameters.

    - **type** _string_, REQUIRED

      The position embedding type. Currently supported: `rope`. The default is `rope`. For more details, see [RoPE][rope-paper] and its [PyTorch implementation][rope-pytorch].

    - **max_position_embeddings** _uint64_, REQUIRED

      The maximum number of position embeddings. The default is `1024`.

    - **rope_theta** _float_, REQUIRED

      The theta parameter of the RoPE position embedding. The default is `10000`.

    - **rope_scaling** _object_, OPTIONAL

      The scaling configuration for the RoPE embeddings. The default is `null`.

  - **transformer_layer** _object_, REQUIRED

    Contains the transformer layer configuration parameters. Either `uniform_layers` or `mixed_layers` must be specified.

    - **uniform_layers** _object_, OPTIONAL

      Configuration for uniform layers, where all layers share an identical structure.

      - **num_layers** _uint64_, REQUIRED

        The number of transformer layers. The default is `0`.

      - **attention** _object_, REQUIRED

        Contains the attention configuration parameters.

        - **type** _string_, REQUIRED

          The attention mechanism type. Currently supported: [`mha`][mha-paper], [`gqa`][gqa-paper], [`mla`][mla-paper]. The default is `mha`.

        - **is_causal** _boolean_, REQUIRED

          Whether the attention is causal. The default is `true`.

        - **is_qkv_merged** _boolean_, REQUIRED

          Whether the QKV projection is merged. The default is `false`.

        - **num_attention_heads** _uint64_, REQUIRED

          The number of attention heads. The default is `0`.

        - **num_key_value_heads** _uint64_, REQUIRED

          The number of key-value heads. The default is `0`.

        - **head_dim** _uint64_, REQUIRED

          The attention head dimension. If `0`, it defaults to `hidden_size / num_attention_heads`; for example, with a hidden size of 4096 and 32 attention heads, the head dimension resolves to 128. The default is `0`.

        - **has_residual** _boolean_, REQUIRED

          Whether the attention block has a residual connection. The default is `true`.

        - **has_qkv_bias** _boolean_, REQUIRED

          Whether the QKV projection has a bias. The default is `false`.

        - **has_output_bias** _boolean_, REQUIRED

          Whether the output projection has a bias. The default is `false`.

        - **has_pre_norm** _boolean_, REQUIRED

          Whether the attention block has a pre-normalization. The default is `false`.

        - **has_post_norm** _boolean_, REQUIRED

          Whether the attention block has a post-normalization. The default is `false`.

      - **mlp** _object_, OPTIONAL

        Contains the MLP configuration parameters. Either `mlp` or `moe` must be specified.

        - **intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer. The default is `0`.

        - **activation** _string_, REQUIRED

          The activation function. The default is `gelu`.

        - **use_gated_activation** _boolean_, REQUIRED

          Whether to use a gated activation. The default is `true`.

        - **has_residual** _boolean_, REQUIRED

          Whether the MLP has a residual connection. The default is `true`.

        - **has_bias** _boolean_, REQUIRED

          Whether the MLP has a bias. The default is `false`.

        - **has_pre_norm** _boolean_, REQUIRED

          Whether the MLP has a pre-normalization. The default is `false`.

        - **has_post_norm** _boolean_, REQUIRED

          Whether the MLP has a post-normalization. The default is `false`.

        - **is_mlp_merged** _boolean_, REQUIRED

          Whether the MLP projection is merged. The default is `false`.

      - **moe** _object_, OPTIONAL

        Contains the MoE configuration parameters. An illustrative MoE configuration is shown in the second example in the Example section below.

        - **has_bias** _boolean_, REQUIRED

          Whether the MoE has a bias. The default is `false`.

        - **activation** _string_, REQUIRED

          The activation function. The default is `gelu`.

        - **use_gated_activation** _boolean_, REQUIRED

          Whether to use a gated activation. The default is `true`.

        - **num_experts** _uint64_, REQUIRED

          The number of experts. The default is `0`.

        - **moe_intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer of each routed expert. The default is `0`.

        - **num_shared_experts** _uint64_, REQUIRED

          The number of shared experts. The default is `0`.

        - **shared_expert_intermediate_size** _uint64_, REQUIRED

          The size of the intermediate layer of each shared expert. The default is `0`.

        - **top_k** _uint64_, REQUIRED

          The number of top-k experts to be used. The default is `0`.

        - **scoring_function** _string_, REQUIRED

          The method used to compute expert weights. The default is `softmax`.

        - **norm_topk_prob** _boolean_, REQUIRED

          Whether to normalize the top-k probabilities. The default is `false`.

    - **mixed_layers** _object_, OPTIONAL

      Configuration for mixed layers, where layers have different structures. See the second example in the Example section below for an illustrative configuration.

      - **num_layers** _uint64_, REQUIRED

        The number of transformer layers. The default is `0`.

      - **mlp_layers** _array_, REQUIRED

        The layers that use an MLP. If empty, `moe_frequency` determines the sparsity. The default is `[]`.

      - **pre_norm_layers** _array_, OPTIONAL

        The layers that use pre-normalization. The default is `[]`.

      - **post_norm_layers** _array_, OPTIONAL

        The layers that use post-normalization. The default is `[]`.

      - **moe_frequency** _uint64_, REQUIRED

        The frequency of MoE layers. The default is `0`.

      - **attention** _object_, REQUIRED

        The attention parameters (same structure as in `uniform_layers`).

      - **mlp** _object_, OPTIONAL

        The MLP parameters (same structure as in `uniform_layers`).

      - **moe** _object_, OPTIONAL

        The MoE parameters (same structure as in `uniform_layers`).

  - **normalization** _object_, REQUIRED

    Contains the normalization configuration parameters.

    - **type** _string_, REQUIRED

      The normalization type. Supported: [`rmsnorm`][rmsnorm-paper], [`layernorm`][layernorm-paper]. The default is `rmsnorm`.

    - **epsilon** _float_, REQUIRED

      The epsilon for the normalization. The default is `1e-5`.

## Example

Here is an example transformer architecture configuration:

```json,title=Transformer%20Architecture%20Configuration&mediatype=application/vnd.cncf.model.architecture.v1%2Bjson
{
  "transformer": {
    "vocabulary_size": 32000,
    "hidden_size": 4096,
    "tokenizer": {
      "type": "bpe",
      "library": "huggingface",
      "revision": "main"
    },
    "token_embedding": {
      "has_bias": false,
      "has_norm": true,
      "shared_embedding": false
    },
    "position_embedding": {
      "type": "rope",
      "max_position_embeddings": 2048,
      "rope_theta": 10000.0,
      "rope_scaling": null
    },
    "transformer_layer": {
      "uniform_layers": {
        "num_layers": 32,
        "attention": {
          "type": "gqa",
          "is_causal": true,
          "is_qkv_merged": false,
          "num_attention_heads": 32,
          "num_key_value_heads": 8,
          "head_dim": 128,
          "has_residual": true,
          "has_qkv_bias": false,
          "has_output_bias": false,
          "has_pre_norm": true,
          "has_post_norm": false
        },
        "mlp": {
          "intermediate_size": 11008,
          "activation": "silu",
          "use_gated_activation": true,
          "has_residual": true,
          "has_bias": false,
          "has_pre_norm": false,
          "has_post_norm": true,
          "is_mlp_merged": false
        }
      }
    },
    "normalization": {
      "type": "rmsnorm",
      "epsilon": 1e-5
    }
  }
}
```
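
The example above describes a dense model with uniform layers. The same schema can also describe a sparse model by selecting `mixed_layers` and providing a `moe` block. The configuration below is an illustrative sketch only: all numeric values (layer counts, expert counts, and intermediate sizes) are invented for demonstration and do not correspond to any particular model, and it assumes that the layers listed in `mlp_layers` use the dense MLP while the remaining layers use MoE.

```json
{
  "transformer": {
    "vocabulary_size": 32000,
    "hidden_size": 4096,
    "tokenizer": {
      "type": "bpe",
      "library": "huggingface",
      "revision": "main"
    },
    "token_embedding": {
      "has_bias": false,
      "has_norm": true,
      "shared_embedding": false
    },
    "position_embedding": {
      "type": "rope",
      "max_position_embeddings": 4096,
      "rope_theta": 10000.0,
      "rope_scaling": null
    },
    "transformer_layer": {
      "mixed_layers": {
        "num_layers": 32,
        "mlp_layers": [0, 1, 2],
        "pre_norm_layers": [],
        "post_norm_layers": [],
        "moe_frequency": 1,
        "attention": {
          "type": "gqa",
          "is_causal": true,
          "is_qkv_merged": false,
          "num_attention_heads": 32,
          "num_key_value_heads": 8,
          "head_dim": 128,
          "has_residual": true,
          "has_qkv_bias": false,
          "has_output_bias": false,
          "has_pre_norm": true,
          "has_post_norm": false
        },
        "mlp": {
          "intermediate_size": 11008,
          "activation": "silu",
          "use_gated_activation": true,
          "has_residual": true,
          "has_bias": false,
          "has_pre_norm": true,
          "has_post_norm": false,
          "is_mlp_merged": false
        },
        "moe": {
          "has_bias": false,
          "activation": "silu",
          "use_gated_activation": true,
          "num_experts": 64,
          "moe_intermediate_size": 1408,
          "num_shared_experts": 2,
          "shared_expert_intermediate_size": 2816,
          "top_k": 6,
          "scoring_function": "softmax",
          "norm_topk_prob": true
        }
      }
    },
    "normalization": {
      "type": "rmsnorm",
      "epsilon": 1e-5
    }
  }
}
```

All fields used in this sketch are defined in the Properties section above.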

[attention-paper]: https://arxiv.org/abs/1706.03762
[gpt2-repo]: https://github.com/openai/gpt-2
[llama2-paper]: https://arxiv.org/abs/2307.09288
[llama3-paper]: https://arxiv.org/abs/2407.21783
[qwen2-paper]: https://arxiv.org/abs/2407.10671
[qwen3-paper]: https://arxiv.org/abs/2505.09388
[gemma2-paper]: https://arxiv.org/abs/2408.00118
[gemma3-paper]: https://arxiv.org/abs/2503.19786
[mixtral-paper]: https://arxiv.org/abs/2401.04088
[deepseek-v2-paper]: https://arxiv.org/abs/2405.04434
[deepseek-v3-paper]: https://arxiv.org/abs/2412.19437
[kimi-k2-paper]: https://arxiv.org/abs/2507.20534
[rope-paper]: https://arxiv.org/abs/2104.09864
[rope-pytorch]: https://pytorch.org/torchtune/stable/generated/torchtune.modules.RotaryPositionalEmbeddings.html
[mha-paper]: https://arxiv.org/abs/1706.03762
[gqa-paper]: https://arxiv.org/abs/2305.13245v3
[mla-paper]: https://arxiv.org/abs/2412.19437
[rmsnorm-paper]: https://arxiv.org/abs/1910.07467
[layernorm-paper]: https://arxiv.org/abs/1607.06450

docs/config.md

Lines changed: 4 additions & 0 deletions
  The architecture of the model, such as "transformer", "cnn", or "rnn".

- **architecture_config** _object_, OPTIONAL

  The configuration of the architecture. The details are defined in the [Model Architecture Configuration](./architecture.md) document. An illustrative fragment is shown below.

- **format** _string_, OPTIONAL

  The format for the model, such as "onnx", "safetensors", "gguf", or "pt" (PyTorch format).
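
As a rough sketch of how `architecture_config` composes with the neighboring fields, the fragment below nests the `transformer` object from the architecture document under `architecture_config`. This is an illustration only: the placement and the sibling fields are assumed from the excerpt above, and the `transformer` object is truncated to two fields for brevity (a complete object must include all of its REQUIRED properties).

```json
{
  "architecture": "transformer",
  "format": "safetensors",
  "architecture_config": {
    "transformer": {
      "vocabulary_size": 32000,
      "hidden_size": 4096
    }
  }
}
```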
