fix(ollama): honor max_tokens and stop defaulting to the model-max context by RasulOs · Pull Request #361 · droidrun/mobilerun

RasulOs · 2026-06-11T12:29:07Z

Problem (Discord report)

Why is setting max_tokens not doing anything? … the context is 262144?

Two independent bugs in the Ollama path:

- max_tokens was silently dropped. It is not a field on llama-index's Ollama class, so pydantic discarded it: no error, no effect. Verified: load_llm("Ollama", ..., max_tokens=16) produced an 890-char response with additional_kwargs: {}.
- The context defaulted to the model's maximum. llama-index's context_window=-1 default resolves to the model's max context via a hidden client.show() call, allocating the full KV cache (256K-context models → ~19 GB, CPU spill). Because mobilerun sends num_ctx per request, this overrides every Ollama-side setting (OLLAMA_CONTEXT_LENGTH, Modelfile, /set parameter), so users could not fix it from the Ollama side, which is exactly the reporter's confusion.
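The silent drop described above can be reproduced in miniature. This is a minimal sketch using only the standard library: `StandInOllama` and `from_kwargs` are hypothetical names, and the class merely mimics a constructor that keeps declared fields and ignores everything else. It is not llama-index's Ollama class.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict


@dataclass
class StandInOllama:
    """Hypothetical stand-in; NOT llama-index's Ollama class."""

    model: str
    context_window: int = -1
    additional_kwargs: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "StandInOllama":
        # Mimic "ignore extra fields" validation: keep declared fields only,
        # and raise nothing for the rest.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in known})


llm = StandInOllama.from_kwargs(model="gemma4:latest", max_tokens=16)

# max_tokens vanished without an error and never reached additional_kwargs.
print(hasattr(llm, "max_tokens"), llm.additional_kwargs, llm.context_window)
```

Nothing in the resulting object records that max_tokens was ever passed, which is why the symptom was an uncapped response rather than a traceback.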

Fix

A pure _prepare_ollama_kwargs() helper in load_llm's Ollama branch — the single chokepoint covering CLI, SDK, and config-profile paths:

- max_tokens → additional_kwargs.num_predict. An explicit num_predict wins (one-line warning on conflict). Non-integer values warn and are skipped (today they're a silent no-op, so warn+skip strictly improves). A model_fields guard auto-disables the shim if llama-index adds native support.
- context_window defaults to 32768 when neither it nor additional_kwargs.num_ctx is set (in-repo precedent: the DeepSeek branch's context_window default). An explicit num_ctx is mirrored into context_window so they stay aligned and the hidden show() network call is never triggered. context_window: -1 remains the documented escape hatch for model-max on big-GPU machines.
- Other unknown Ollama kwargs log a deduped warning instead of vanishing (warn-only, never popped).

Plus: wizard-created Ollama profiles write an explicit context_window: 32768 (self-documenting configs; the runtime injection is the fallback for hand-written ones), mobilerun configure gains an Ollama-gated "Context window" prompt in advanced settings, the stale config path in the loader docstring is fixed, and config_example.yaml gains a commented Ollama example.

Back-compat: no config migration is needed, because runtime translation is the compatibility layer; existing additional_kwargs.num_predict/num_ctx workaround configs behave byte-for-byte identically under the precedence rules. Two visible changes: wizard-written max_tokens now actually caps output, and unconfigured Ollama profiles drop from model-max to 32K context (a one-line config change restores either previous behavior).

Testing

169/169 unit tests pass, 16 of them new in tests/test_llm_picker.py: translation, both conflict directions, numeric-string coercion, invalid-value warn+skip (string/bool/None), default injection, explicit/-1 preservation, num_ctx mirroring (incl. non-numeric fallback), deduped unknown-kwarg warning, the model_fields future-guard, end-to-end load_llm("Ollama"), and the wizard default.
Live A/B vs released 0.6.5 (real Ollama server, gemma4 = 131072-context model):

| Check | 0.6.5 | this branch |
|---|---|---|
| `ollama ps` CONTEXT | 131072 (model max) | 32768 |
| `max_tokens=16` | ignored → 890-char response | `num_predict: 16` → output capped |
| hidden `show()` call | yes | no |

Live emulator agent run on Ollama (-p Ollama -m gemma4:latest): completed the screen-title task with the 32K context active at 100% GPU.

🤖 Generated with Claude Code

…ntext

Reported on Discord: setting max_tokens did nothing and `ollama ps` showed a 262144 context (19 GB, CPU spill) regardless of Ollama-side settings.

Two root causes, both in how kwargs reach llama-index's Ollama class:

- max_tokens is not an Ollama field, so pydantic silently dropped it. It now translates to additional_kwargs.num_predict in load_llm's Ollama branch (the single chokepoint for CLI, SDK, and config profiles); an explicit num_predict wins with a one-line warning, and the translation auto-disables if llama-index ever grows a native max_tokens field.
- context_window defaulted to -1, which resolves to the model's MAXIMUM context via a hidden client.show() call, and the per-request num_ctx mobilerun sends overrides every Ollama-side knob, so users could not fix it server-side. Unset profiles now default to 32768 (the DeepSeek context_window default is in-repo precedent); an explicit additional_kwargs.num_ctx is mirrored into context_window so the two stay aligned without the network lookup; context_window: -1 remains the escape hatch for model-max.

Also: wizard-created Ollama profiles get an explicit context_window: 32768, the configure wizard gains an Ollama-gated "Context window" prompt, unknown Ollama kwargs log a deduped warning instead of vanishing, the stale loader docstring config path is fixed, and config_example.yaml documents an Ollama profile.

Verified A/B against released 0.6.5 with gemma4 (131072-context model): ollama ps context drops 131072 -> 32768, max_tokens=16 goes from ignored (890-char response) to a hard cap, and a live emulator agent run on Ollama completes with the 32K context active at 100% GPU.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mintlify · 2026-06-11T12:40:22Z

Docs PR opened: https://github.com/droidrun/mobilerun-docs/pull/12

Documented Ollama profile kwargs in the configuration guide, covering max_tokens, the new 32K context_window default, and -1 override.

RasulOs merged commit c4ecac0 into main Jun 11, 2026
4 of 7 checks passed

RasulOs merged 1 commit into main from rasul/fix-ollama-max-tokens-context
