
Conversation

@Kuangdd01 (Contributor) commented Jan 13, 2026

Description

  • IMO, GLM4_MOE can be treated as a combination of deepseek_v3 and qwen/llama: it keeps the first dense layer, one shared expert per sparse layer, and a router bias, with GQA attention. So we just combine the templates of qwen and deepseek_v3 and specify a few arguments such as rope_percent (see the sketch after this list).

  • This modification was verified with a tiny random model on LlamaFactory.

[loss curve screenshot]
  • training and merge processes pass ✅
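
To make the combination concrete, here is a minimal, hypothetical sketch of how such a template could be assembled. This is not the actual mcore_adapter template API, and every key name below is illustrative:

```python
# Hypothetical sketch only -- NOT the real mcore_adapter API. It illustrates
# reusing qwen/llama-style attention rules plus deepseek_v3-style MoE rules
# for GLM4-MoE; all mapping names are assumptions for illustration.

# qwen/llama-style GQA attention mapping (HF name -> Megatron-Core name)
QWEN_ATTENTION_RULES = {
    "self_attn.q_proj": "self_attention.linear_q",
    "self_attn.k_proj": "self_attention.linear_k",
    "self_attn.v_proj": "self_attention.linear_v",
    "self_attn.o_proj": "self_attention.linear_proj",
}

# deepseek_v3-style MoE mapping: routed experts, one shared expert per
# sparse layer, and the router bias mentioned in the description
DEEPSEEK_V3_MOE_RULES = {
    "mlp.gate": "mlp.router",
    "mlp.gate.e_score_correction_bias": "mlp.router.expert_bias",
    "mlp.experts": "mlp.experts.local_experts",
    "mlp.shared_experts": "mlp.shared_experts",
}


def build_glm4_moe_template(num_layers: int, rope_percent: float = 0.5,
                            first_k_dense: int = 1):
    """Merge the two rule sets and attach GLM4-MoE-specific arguments."""
    name_rules = {**QWEN_ATTENTION_RULES, **DEEPSEEK_V3_MOE_RULES}
    extra_args = {
        # partial RoPE: only a fraction of each head dim is rotated
        "rotary_percent": rope_percent,
        # keep the first layer(s) dense, the rest sparse
        "moe_layer_freq": [0] * first_k_dense
                          + [1] * (num_layers - first_k_dense),
    }
    return name_rules, extra_args


rules, args = build_glm4_moe_template(num_layers=46)
print(args["moe_layer_freq"][:3])  # [0, 1, 1] -> first layer is dense
```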

Help Needed

  • A real model should still be tested
  • The MTP module should be addressed after GLM4.5

cc @hiyouga @chocoded

@chocoded (Collaborator)

It appears that the final layer (Layer 46) has a different structure compared to the intermediate layers. This requires some special handling which seems to have been overlooked in the current implementation. Could you please look into this?

[Screenshot 2026-01-13 20:14:33]

@PanAndy (Collaborator) commented Jan 13, 2026

@chocoded

@PanAndy PanAndy requested a review from chocoded January 13, 2026 12:31
@chocoded chocoded self-assigned this Jan 13, 2026
@Kuangdd01 (Contributor, Author)

Sure, the last layer is an MTP-specific layer; we will look into this.

@chocoded chocoded closed this Jan 13, 2026
@chocoded chocoded reopened this Jan 13, 2026
@chocoded (Collaborator)

For reference, you can check out the implementation here: converters and hf_invalid_keys. Could you please update the code to account for this?
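
For illustration, a hedged sketch of the hf_invalid_keys idea (hypothetical names throughout; the real patterns live in the linked converters). The final GLM-4.5 layer carries MTP weights with no Megatron-Core counterpart yet, so the converter can declare them invalid and skip them during weight mapping:

```python
# Hypothetical sketch of the hf_invalid_keys mechanism -- not the exact
# mcore_adapter code. Key patterns below are illustrative assumptions.
import re

# Layer 46 is the MTP layer in GLM-4.5-Air; it can be excluded either by
# index or by its MTP-specific module names (names assumed for illustration).
HF_INVALID_KEY_PATTERNS = [
    r"model\.layers\.46\..*",          # whole MTP layer, by index
    r".*\.(eh_proj|enorm|hnorm)\..*",  # or by MTP-specific module names
]


def filter_hf_state_dict(state_dict: dict) -> dict:
    """Drop keys the converter cannot map, keeping everything else."""
    compiled = [re.compile(p) for p in HF_INVALID_KEY_PATTERNS]
    return {
        key: value for key, value in state_dict.items()
        if not any(p.match(key) for p in compiled)
    }


demo = {
    "model.layers.0.self_attn.q_proj.weight": "keep",
    "model.layers.46.eh_proj.weight": "drop",
}
print(filter_hf_state_dict(demo))  # only the layer-0 key survives
```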

@Kuangdd01 (Contributor, Author) commented Jan 16, 2026

[2026-01-17 14:34:40,429] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:40,433] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:40,435] [INFO] [mcore_adapter.initialize]: Initializing mpu on device cuda:0
[2026-01-17 14:34:40,440] [INFO] [mcore_adapter.initialize]: initialized tensor model parallel with size 1
[2026-01-17 14:34:40,440] [INFO] [mcore_adapter.initialize]: initialized pipeline model parallel with size 1
[2026-01-17 14:34:40,450] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:34:40,750] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:34:41,608] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 1.176s
layer:0, layer_in, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, input_layernorm_out, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:0, q_proj_out, diff: 0, diff>1e-05:[0/3145728] diff_max:0.0 diff_mean:0.0
layer:0, k_proj_out, diff: 0, diff>1e-05:[0/262144] diff_max:0.0 diff_mean:0.0
layer:0, v_proj_out, diff: 0, diff>1e-05:[0/262144] diff_max:0.0 diff_mean:0.0
layer:0, o_proj_in, diff: 302008, diff>1e-05:[148236/3145728] diff_max:0.0234375 diff_mean:7.98512264736928e-05
layer:0, o_proj_out, diff: 173126, diff>1e-05:[86498/1048576] diff_max:0.072265625 diff_mean:0.0027410960756242275
layer:0, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:0, attn_out, diff: 173126, diff>1e-05:[86498/1048576] diff_max:0.072265625 diff_mean:0.0027410960756242275
layer:1, layer_in, diff: 143490, diff>1e-05:[72076/1048576] diff_max:0.0625 diff_mean:0.0017221365123987198
layer:1, input_layernorm_out, diff: 141314, diff>1e-05:[71004/1048576] diff_max:0.046875 diff_mean:0.0012908073840662837
layer:1, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:1, q_proj_out, diff: 448024, diff>1e-05:[223810/3145728] diff_max:0.0625 diff_mean:0.001731364638544619
layer:1, k_proj_out, diff: 37220, diff>1e-05:[18592/262144] diff_max:0.05078125 diff_mean:0.0017332927091047168
layer:1, v_proj_out, diff: 37476, diff>1e-05:[18666/262144] diff_max:0.05078125 diff_mean:0.0017366009997203946
layer:1, o_proj_in, diff: 497442, diff>1e-05:[247212/3145728] diff_max:0.0703125 diff_mean:0.00046942406333982944
layer:1, o_proj_out, diff: 180114, diff>1e-05:[90260/1048576] diff_max:6.875 diff_mean:0.18534711003303528
layer:1, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:1, attn_out, diff: 180114, diff>1e-05:[90260/1048576] diff_max:6.875 diff_mean:0.18534711003303528
layer:2, layer_in, diff: 154285, diff>1e-05:[77484/1048576] diff_max:0.3671875 diff_mean:0.003143888898193836
layer:2, input_layernorm_out, diff: 152527, diff>1e-05:[76603/1048576] diff_max:0.21875 diff_mean:0.0020288443192839622
layer:2, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:2, q_proj_out, diff: 472335, diff>1e-05:[236226/3145728] diff_max:0.25390625 diff_mean:0.002649908885359764
layer:2, k_proj_out, diff: 39421, diff>1e-05:[19701/262144] diff_max:0.2734375 diff_mean:0.0026722438633441925
layer:2, v_proj_out, diff: 39165, diff>1e-05:[19602/262144] diff_max:0.234375 diff_mean:0.0026325180660933256
layer:2, o_proj_in, diff: 507630, diff>1e-05:[253508/3145728] diff_max:0.1484375 diff_mean:0.00071554264286533
layer:2, o_proj_out, diff: 180131, diff>1e-05:[90097/1048576] diff_max:7.40625 diff_mean:0.2104431539773941
layer:2, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:2, attn_out, diff: 180131, diff>1e-05:[90097/1048576] diff_max:7.40625 diff_mean:0.2104431539773941
layer:3, layer_in, diff: 160016, diff>1e-05:[80035/1048576] diff_max:0.53125 diff_mean:0.004714156035333872
layer:3, input_layernorm_out, diff: 159897, diff>1e-05:[80014/1048576] diff_max:0.28125 diff_mean:0.0027506439946591854
layer:3, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:3, q_proj_out, diff: 486877, diff>1e-05:[243362/3145728] diff_max:0.390625 diff_mean:0.0035586850717663765
layer:3, k_proj_out, diff: 40539, diff>1e-05:[20323/262144] diff_max:0.296875 diff_mean:0.00354281859472394
layer:3, v_proj_out, diff: 40654, diff>1e-05:[20296/262144] diff_max:0.28125 diff_mean:0.0035621817223727703
layer:3, o_proj_in, diff: 517216, diff>1e-05:[257890/3145728] diff_max:0.19921875 diff_mean:0.0009617977775633335
layer:3, o_proj_out, diff: 180142, diff>1e-05:[90104/1048576] diff_max:7.21875 diff_mean:0.23430882394313812
layer:3, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:3, attn_out, diff: 180142, diff>1e-05:[90104/1048576] diff_max:7.21875 diff_mean:0.23430882394313812
layer:3, lmhead, diff: 6128510, diff>1e-05:[3066149/38797312] diff_max:0.4140625 diff_mean:0.004307578783482313
layer:3, lmhead_weight, diff: 0, diff>1e-05:[0/620756992] diff_max:0.0 diff_mean:0.0
layer:3, lmhead_token, diff: 2, diff>1e-05:[0/256] diff_max:29177 diff_mean:227.9453125
[2026-01-17 14:34:45,938] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:45,940] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:45,943] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:34:46,365] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:34:47,898] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 1.959s
layer:0, layer_in, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, input_layernorm_out, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:0, q_proj_out, diff: 527906, diff>1e-05:[94/3145728] diff_max:1.6450881958007812e-05 diff_mean:1.9595603362176917e-07
layer:0, k_proj_out, diff: 43460, diff>1e-05:[0/262144] diff_max:8.58306884765625e-06 diff_mean:1.4310661811123282e-07
layer:0, v_proj_out, diff: 43534, diff>1e-05:[0/262144] diff_max:9.298324584960938e-06 diff_mean:1.4514284885080997e-07
layer:0, o_proj_in, diff: 534156, diff>1e-05:[0/3145728] diff_max:6.4373016357421875e-06 diff_mean:4.5743480114879276e-08
layer:0, o_proj_out, diff: 180194, diff>1e-05:[88916/1048576] diff_max:0.06640821695327759 diff_mean:0.0027209932450205088
layer:0, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:0, attn_out, diff: 180194, diff>1e-05:[88916/1048576] diff_max:0.06640821695327759 diff_mean:0.0027209932450205088
layer:1, layer_in, diff: 178718, diff>1e-05:[4092/1048576] diff_max:2.6226043701171875e-05 diff_mean:6.740136200278357e-07
layer:1, input_layernorm_out, diff: 178730, diff>1e-05:[712/1048576] diff_max:2.1457672119140625e-05 diff_mean:4.992290882910311e-07
layer:1, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:1, q_proj_out, diff: 536908, diff>1e-05:[12658/3145728] diff_max:2.8252601623535156e-05 diff_mean:6.788021096326702e-07
layer:1, k_proj_out, diff: 44734, diff>1e-05:[836/262144] diff_max:2.0503997802734375e-05 diff_mean:6.560435963365308e-07
layer:1, v_proj_out, diff: 44712, diff>1e-05:[862/262144] diff_max:2.384185791015625e-05 diff_mean:6.59423676552251e-07
layer:1, o_proj_in, diff: 538800, diff>1e-05:[466/3145728] diff_max:2.968311309814453e-05 diff_mean:1.7786358341709274e-07
layer:1, o_proj_out, diff: 180224, diff>1e-05:[90262/1048576] diff_max:6.855518341064453 diff_mean:0.18533337116241455
layer:1, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:1, attn_out, diff: 180224, diff>1e-05:[90262/1048576] diff_max:6.855518341064453 diff_mean:0.18533337116241455
layer:2, layer_in, diff: 179575, diff>1e-05:[45557/1048576] diff_max:0.00011140108108520508 diff_mean:2.1114485662110383e-06
layer:2, input_layernorm_out, diff: 179666, diff>1e-05:[27911/1048576] diff_max:7.492303848266602e-05 diff_mean:1.3763965398538858e-06
layer:2, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:2, q_proj_out, diff: 539107, diff>1e-05:[113206/3145728] diff_max:0.00011390447616577148 diff_mean:1.761038220138289e-06
layer:2, k_proj_out, diff: 44934, diff>1e-05:[9354/262144] diff_max:9.161233901977539e-05 diff_mean:1.7637929659031215e-06
layer:2, v_proj_out, diff: 44907, diff>1e-05:[9531/262144] diff_max:7.283687591552734e-05 diff_mean:1.7631423361308407e-06
layer:2, o_proj_in, diff: 539962, diff>1e-05:[12138/3145728] diff_max:8.916854858398438e-05 diff_mean:4.7652602574999037e-07
layer:2, o_proj_out, diff: 180224, diff>1e-05:[90234/1048576] diff_max:7.374320030212402 diff_mean:0.21047236025333405
layer:2, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:2, attn_out, diff: 180224, diff>1e-05:[90234/1048576] diff_max:7.374320030212402 diff_mean:0.21047236025333405
layer:3, layer_in, diff: 179852, diff>1e-05:[61891/1048576] diff_max:0.00014591217041015625 diff_mean:3.5861353353539016e-06
layer:3, input_layernorm_out, diff: 179865, diff>1e-05:[44659/1048576] diff_max:8.571147918701172e-05 diff_mean:2.102397729686345e-06
layer:3, input_layernorm_out_weight, diff: 0, diff>1e-05:[0/4096] diff_max:0.0 diff_mean:0.0
layer:3, q_proj_out, diff: 539720, diff>1e-05:[160106/3145728] diff_max:0.0001385211944580078 diff_mean:2.682334525161423e-06
layer:3, k_proj_out, diff: 44974, diff>1e-05:[13589/262144] diff_max:0.00012230873107910156 diff_mean:2.689327175176004e-06
layer:3, v_proj_out, diff: 44961, diff>1e-05:[13400/262144] diff_max:0.00010967254638671875 diff_mean:2.678813416423509e-06
layer:3, o_proj_in, diff: 540225, diff>1e-05:[26866/3145728] diff_max:9.620189666748047e-05 diff_mean:6.992754606471863e-07
layer:3, o_proj_out, diff: 180224, diff>1e-05:[90068/1048576] diff_max:7.198688507080078 diff_mean:0.23438535630702972
layer:3, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:3, attn_out, diff: 180224, diff>1e-05:[90068/1048576] diff_max:7.198688507080078 diff_mean:0.23438535630702972
layer:3, lmhead, diff: 6658496, diff>1e-05:[2246431/38797312] diff_max:0.00017070770263671875 diff_mean:3.3481887840025593e-06
layer:3, lmhead_weight, diff: 0, diff>1e-05:[0/620756992] diff_max:0.0 diff_mean:0.0
layer:3, lmhead_token, diff: 0, diff>1e-05:[0/256] diff_max:0 diff_mean:0.0
[2026-01-17 14:34:53,249] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:53,251] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:34:53,254] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:34:53,307] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:34:54,621] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 1.370s
layer:0, layer_in, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, q_proj_out, diff: 540596, diff>1e-05:[262584/3145728] diff_max:0.0013006627559661865 diff_mean:3.577635652618483e-05
layer:0, k_proj_out, diff: 45050, diff>1e-05:[21622/262144] diff_max:0.0011242926120758057 diff_mean:3.559543256415054e-05
layer:0, v_proj_out, diff: 45050, diff>1e-05:[21772/262144] diff_max:0.0011820793151855469 diff_mean:3.586460661608726e-05
layer:0, o_proj_in, diff: 540636, diff>1e-05:[206000/3145728] diff_max:0.0012644529342651367 diff_mean:9.159600267594215e-06
layer:0, o_proj_out, diff: 180224, diff>1e-05:[89846/1048576] diff_max:0.06705456972122192 diff_mean:0.0027215054724365473
layer:0, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:0, attn_out, diff: 180224, diff>1e-05:[89846/1048576] diff_max:0.06705456972122192 diff_mean:0.0027215054724365473
layer:1, layer_in, diff: 180218, diff>1e-05:[89588/1048576] diff_max:0.00469970703125 diff_mean:0.0001344092597719282
layer:1, q_proj_out, diff: 540646, diff>1e-05:[268082/3145728] diff_max:0.004760265350341797 diff_mean:0.00013145888806320727
layer:1, k_proj_out, diff: 45054, diff>1e-05:[22462/262144] diff_max:0.004516184329986572 diff_mean:0.00013177364598959684
layer:1, v_proj_out, diff: 45054, diff>1e-05:[22228/262144] diff_max:0.004531919956207275 diff_mean:0.00013021739141549915
layer:1, o_proj_in, diff: 540668, diff>1e-05:[244736/3145728] diff_max:0.004379630088806152 diff_mean:3.368413672433235e-05
layer:1, o_proj_out, diff: 180224, diff>1e-05:[90268/1048576] diff_max:6.853757381439209 diff_mean:0.1853277087211609
layer:1, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:1, attn_out, diff: 180224, diff>1e-05:[90268/1048576] diff_max:6.853757381439209 diff_mean:0.1853277087211609
layer:2, layer_in, diff: 180216, diff>1e-05:[89390/1048576] diff_max:0.2682913541793823 diff_mean:0.0006241101073101163
layer:2, q_proj_out, diff: 540649, diff>1e-05:[267505/3145728] diff_max:0.21330617368221283 diff_mean:0.0005061804549768567
layer:2, k_proj_out, diff: 45054, diff>1e-05:[22804/262144] diff_max:0.2250310182571411 diff_mean:0.0005112465587444603
layer:2, v_proj_out, diff: 45051, diff>1e-05:[22306/262144] diff_max:0.17135649919509888 diff_mean:0.0004965619300492108
layer:2, o_proj_in, diff: 540665, diff>1e-05:[251722/3145728] diff_max:0.14607185125350952 diff_mean:0.00016303086886182427
layer:2, o_proj_out, diff: 180224, diff>1e-05:[90226/1048576] diff_max:7.372097969055176 diff_mean:0.21046662330627441
layer:2, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:2, attn_out, diff: 180224, diff>1e-05:[90226/1048576] diff_max:7.372097969055176 diff_mean:0.21046662330627441
layer:3, layer_in, diff: 180221, diff>1e-05:[89611/1048576] diff_max:0.4042009711265564 diff_mean:0.0012419320410117507
layer:3, q_proj_out, diff: 540664, diff>1e-05:[268863/3145728] diff_max:0.3084768056869507 diff_mean:0.0008932847413234413
layer:3, k_proj_out, diff: 45055, diff>1e-05:[22475/262144] diff_max:0.24067068099975586 diff_mean:0.0008843119721859694
layer:3, v_proj_out, diff: 45056, diff>1e-05:[22321/262144] diff_max:0.28110820055007935 diff_mean:0.0008722394122742116
layer:3, o_proj_in, diff: 540667, diff>1e-05:[251880/3145728] diff_max:0.25284716486930847 diff_mean:0.00023696967400610447
layer:3, o_proj_out, diff: 180224, diff>1e-05:[90076/1048576] diff_max:7.200575351715088 diff_mean:0.2343575358390808
layer:3, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:3, attn_out, diff: 180224, diff>1e-05:[90076/1048576] diff_max:7.200575351715088 diff_mean:0.2343575358390808
layer:3, lmhead, diff: 6668188, diff>1e-05:[3320419/38797312] diff_max:0.4843890070915222 diff_mean:0.0012724035186693072
layer:3, lmhead_weight, diff: 0, diff>1e-05:[0/620756992] diff_max:0.0 diff_mean:0.0
layer:3, lmhead_token, diff: 0, diff>1e-05:[0/256] diff_max:0 diff_mean:0.0
[2026-01-17 14:35:00,312] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:00,314] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:00,316] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:35:00,376] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:35:01,171] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 0.857s
layer:0, layer_in, diff: 0, diff>1e-05:[0/1048576] diff_max:0.0 diff_mean:0.0
layer:0, q_proj_out, diff: 0, diff>1e-05:[0/3145728] diff_max:0.0 diff_mean:0.0
layer:0, k_proj_out, diff: 0, diff>1e-05:[0/262144] diff_max:0.0 diff_mean:0.0
layer:0, v_proj_out, diff: 0, diff>1e-05:[0/262144] diff_max:0.0 diff_mean:0.0
layer:0, o_proj_in, diff: 16, diff>1e-05:[8/3145728] diff_max:0.001953125 diff_mean:3.2608415967416704e-09
layer:0, o_proj_out, diff: 172330, diff>1e-05:[86134/1048576] diff_max:0.0703125 diff_mean:0.0027189888060092926
layer:0, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:0, attn_out, diff: 172330, diff>1e-05:[86134/1048576] diff_max:0.0703125 diff_mean:0.0027189888060092926
layer:1, layer_in, diff: 15716, diff>1e-05:[7804/1048576] diff_max:0.03125 diff_mean:7.824960630387068e-05
layer:1, q_proj_out, diff: 81434, diff>1e-05:[41014/3145728] diff_max:0.03125 diff_mean:0.00013662966375704855
layer:1, k_proj_out, diff: 6796, diff>1e-05:[3412/262144] diff_max:0.03125 diff_mean:0.00013915784074924886
layer:1, v_proj_out, diff: 6758, diff>1e-05:[3400/262144] diff_max:0.03125 diff_mean:0.0001346649369224906
layer:1, o_proj_in, diff: 224434, diff>1e-05:[111102/3145728] diff_max:0.015625 diff_mean:6.704116822220385e-05
layer:1, o_proj_out, diff: 180090, diff>1e-05:[90268/1048576] diff_max:6.84375 diff_mean:0.18533046543598175
layer:1, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:1, attn_out, diff: 180090, diff>1e-05:[90268/1048576] diff_max:6.84375 diff_mean:0.18533046543598175
layer:2, layer_in, diff: 90346, diff>1e-05:[45081/1048576] diff_max:0.375 diff_mean:0.001155816251412034
layer:2, q_proj_out, diff: 350532, diff>1e-05:[174855/3145728] diff_max:0.251953125 diff_mean:0.0011170278303325176
layer:2, k_proj_out, diff: 29265, diff>1e-05:[14556/262144] diff_max:0.28125 diff_mean:0.001129601034335792
layer:2, v_proj_out, diff: 29117, diff>1e-05:[14567/262144] diff_max:0.25390625 diff_mean:0.0011012377217411995
layer:2, o_proj_in, diff: 440680, diff>1e-05:[218171/3145728] diff_max:0.1484375 diff_mean:0.0003210300055798143
layer:2, o_proj_out, diff: 180128, diff>1e-05:[90201/1048576] diff_max:7.34375 diff_mean:0.21043574810028076
layer:2, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:2, attn_out, diff: 180128, diff>1e-05:[90201/1048576] diff_max:7.34375 diff_mean:0.21043574810028076
layer:3, layer_in, diff: 129079, diff>1e-05:[64589/1048576] diff_max:0.5 diff_mean:0.0025045976508408785
layer:3, q_proj_out, diff: 419785, diff>1e-05:[209664/3145728] diff_max:0.359375 diff_mean:0.001990190939977765
layer:3, k_proj_out, diff: 34886, diff>1e-05:[17393/262144] diff_max:0.3125 diff_mean:0.0019785473123192787
layer:3, v_proj_out, diff: 34966, diff>1e-05:[17555/262144] diff_max:0.3046875 diff_mean:0.001987813040614128
layer:3, o_proj_in, diff: 493371, diff>1e-05:[245988/3145728] diff_max:0.1640625 diff_mean:0.0006154172588139772
layer:3, o_proj_out, diff: 180142, diff>1e-05:[90106/1048576] diff_max:7.1875 diff_mean:0.23433226346969604
layer:3, o_proj_out_weight, diff: 0, diff>1e-05:[0/50331648] diff_max:0.0 diff_mean:0.0
layer:3, attn_out, diff: 180142, diff>1e-05:[90106/1048576] diff_max:7.1875 diff_mean:0.23433226346969604
layer:3, lmhead, diff: 5734653, diff>1e-05:[2869906/38797312] diff_max:0.40625 diff_mean:0.0029410708229988813
layer:3, lmhead_weight, diff: 0, diff>1e-05:[0/620756992] diff_max:0.0 diff_mean:0.0
layer:3, lmhead_token, diff: 3, diff>1e-05:[1/256] diff_max:70042 diff_mean:581.35546875
[2026-01-17 14:35:05,517] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:05,520] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:05,522] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:35:05,556] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:35:06,311] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 0.792s
exp_tp: 1
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4072, device='cuda:0'), 'diff_avg': tensor(0.0043, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2578, device='cuda:0'), 'diff_avg': tensor(0.0033, device='cuda:0')} dtype torch.bfloat16 for config {'transformer_impl': 'local', 'bf16': True, 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
[2026-01-17 14:35:07,826] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:07,828] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:07,831] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:35:07,865] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:35:09,128] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 1.301s
exp_tp: 1
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.0002, device='cuda:0'), 'diff_avg': tensor(3.3513e-06, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.0001, device='cuda:0'), 'diff_avg': tensor(2.6231e-06, device='cuda:0')} dtype torch.float32 for config {'transformer_impl': 'local', 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
[2026-01-17 14:35:11,321] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:11,324] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:11,326] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:35:11,395] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:35:12,903] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 1.580s
exp_tp: 1
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4844, device='cuda:0'), 'diff_avg': tensor(0.0013, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2846, device='cuda:0'), 'diff_avg': tensor(0.0010, device='cuda:0')} dtype torch.float32 for config {'transformer_impl': 'transformer_engine', 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
[2026-01-17 14:35:15,179] [INFO] [mcore_adapter.models.auto.modeling_auto]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:15,181] [INFO] [mcore_adapter.models.model_config]: Did not find ./tiny-glm-4.5-air/mca_config.json, loading HuggingFace config from ./tiny-glm-4.5-air
[2026-01-17 14:35:15,184] [WARNING] [mcore_adapter.models.model_config]: Non-interleaved pipeline parallelism does not support overlapping p2p communication!
[2026-01-17 14:35:15,253] [INFO] [mcore_adapter.models.model_factory]: number of parameters on (tensor, pipeline, expert) model parallel rank (0, 0, 0): 2169532416
[2026-01-17 14:35:16,008] [INFO] [mcore_adapter.models.model_factory]: End loading, cost: 0.828s
exp_tp: 1
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4150, device='cuda:0'), 'diff_avg': tensor(0.0027, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2656, device='cuda:0'), 'diff_avg': tensor(0.0022, device='cuda:0')} dtype torch.bfloat16 for config {'transformer_impl': 'transformer_engine', 'tensor_model_parallel_size': 1, 'bf16': True, 'moe_router_dtype': 'float32'}
-------------------
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4072, device='cuda:0'), 'diff_avg': tensor(0.0043, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2578, device='cuda:0'), 'diff_avg': tensor(0.0033, device='cuda:0')} dtype torch.bfloat16 for config {'transformer_impl': 'local', 'bf16': True, 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.0002, device='cuda:0'), 'diff_avg': tensor(3.3513e-06, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.0001, device='cuda:0'), 'diff_avg': tensor(2.6231e-06, device='cuda:0')} dtype torch.float32 for config {'transformer_impl': 'local', 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4844, device='cuda:0'), 'diff_avg': tensor(0.0013, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2846, device='cuda:0'), 'diff_avg': tensor(0.0010, device='cuda:0')} dtype torch.float32 for config {'transformer_impl': 'transformer_engine', 'tensor_model_parallel_size': 1, 'moe_router_dtype': 'float32'}
model ./tiny-glm-4.5-air forward logits diff: {'max_abs': tensor(0.4150, device='cuda:0'), 'diff_avg': tensor(0.0027, device='cuda:0')} hidden_states diff: {'max_abs': tensor(0.2656, device='cuda:0'), 'diff_avg': tensor(0.0022, device='cuda:0')} dtype torch.bfloat16 for config {'transformer_impl': 'transformer_engine', 'tensor_model_parallel_size': 1, 'bf16': True, 'moe_router_dtype': 'float32'}
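
For context, a minimal sketch of the comparison these summary lines report, assuming hypothetical hf_model and mca_model handles that both produce logits for the same batch; the handles and output shapes are assumptions, not the actual test harness:

```python
# Hypothetical sketch of the logits comparison behind the max_abs / diff_avg
# lines above. hf_model (HuggingFace) and mca_model (mcore_adapter) are
# assumed stand-ins for the two loaded models.
import torch


@torch.no_grad()
def logits_diff(hf_model, mca_model, input_ids: torch.Tensor) -> dict:
    """Compare forward logits element-wise between the two stacks."""
    hf_logits = hf_model(input_ids).logits.float()
    mca_logits = mca_model(input_ids).logits.float()
    diff = (hf_logits - mca_logits).abs()
    return {"max_abs": diff.max(), "diff_avg": diff.mean()}
```

The bf16 runs above land around max_abs ≈ 0.4, while the fp32 'local' run stays near 2e-4, which is the expected numerical spread between bf16 and fp32 execution.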

updated @chocoded

@chocoded chocoded merged commit b5522b0 into alibaba:main Jan 19, 2026
2 checks passed