fix(ollama): honor max_tokens and stop defaulting to the model-max context #361

Merged
RasulOs merged 1 commit into main from rasul/fix-ollama-max-tokens-context
Jun 11, 2026

RasulOs (Collaborator) commented Jun 11, 2026

Problem (Discord report)

Why is setting max_tokens not doing anything? … the context is 262144?

Two independent bugs in the Ollama path:

  1. max_tokens was silently dropped. It is not a field on llama-index's Ollama class, so pydantic discarded it — no error, no effect. Verified: load_llm("Ollama", ..., max_tokens=16) produced an 890-char response with additional_kwargs: {}.
  2. The context defaulted to the model's maximum. llama-index's context_window=-1 default resolves to the model's max context via a hidden client.show() call, allocating the full KV cache (256K-context models → ~19 GB, CPU spill). Because mobilerun sends num_ctx per request, this overrides every Ollama-side setting (OLLAMA_CONTEXT_LENGTH, Modelfile, /set parameter) — users could not fix it from the Ollama side, which is exactly the reporter's confusion.

Fix

A pure _prepare_ollama_kwargs() helper in load_llm's Ollama branch — the single chokepoint covering CLI, SDK, and config-profile paths:

  • max_tokens → additional_kwargs.num_predict; an explicit num_predict wins (one-line warning on conflict); non-integer values warn and are skipped (today they are a silent no-op, so warn+skip is a strict improvement); a model_fields guard auto-disables the shim if llama-index adds native support.
  • context_window defaults to 32768 when neither it nor additional_kwargs.num_ctx is set (in-repo precedent: the DeepSeek branch's context_window default). An explicit num_ctx is mirrored into context_window so they stay aligned and the hidden show() network call is never triggered. context_window: -1 remains the documented escape hatch for model-max on big-GPU machines.
  • Other unknown Ollama kwargs log a deduped warning instead of vanishing (warn-only, never popped).
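The precedence rules in the bullets above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code merged in this PR: the name `prepare_ollama_kwargs_sketch`, the `native_fields` parameter (standing in for the model_fields guard), the logger name, and the choice to pop max_tokens after translating it are all hypothetical.

```python
import logging
from typing import Any, Dict, FrozenSet, Optional

logger = logging.getLogger("ollama_kwargs_sketch")  # hypothetical logger name

DEFAULT_CONTEXT_WINDOW = 32768  # the default named in the PR description


def _as_int(value: Any) -> Optional[int]:
    """Return value as an int, or None when it is not a usable integer."""
    # bool is a subclass of int, so it has to be rejected explicitly.
    if value is None or isinstance(value, bool):
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())  # numeric strings such as "16" are coerced
        except ValueError:
            return None
    return None


def prepare_ollama_kwargs_sketch(
    kwargs: Dict[str, Any],
    native_fields: FrozenSet[str] = frozenset(),
) -> Dict[str, Any]:
    """Return a translated copy of kwargs; the input dict is left untouched."""
    out = dict(kwargs)
    extra = dict(out.get("additional_kwargs") or {})

    # 1. max_tokens -> additional_kwargs.num_predict, unless the LLM class
    #    already has a native max_tokens field (the future-guard).
    if "max_tokens" in out and "max_tokens" not in native_fields:
        raw = out.pop("max_tokens")  # assumption: the translated key is removed
        limit = _as_int(raw)
        if limit is None:
            logger.warning("Ignoring non-integer max_tokens=%r", raw)
        elif "num_predict" not in extra:
            extra["num_predict"] = limit
        elif extra["num_predict"] != limit:
            # An explicit num_predict always wins; only warn on a real conflict.
            logger.warning(
                "num_predict=%r overrides max_tokens=%r", extra["num_predict"], limit
            )

    # 2. context_window: keep an explicit value (including -1), otherwise
    #    mirror num_ctx, otherwise fall back to the 32K default.
    if out.get("context_window") is None:
        num_ctx = _as_int(extra.get("num_ctx"))
        out["context_window"] = (
            num_ctx if num_ctx is not None else DEFAULT_CONTEXT_WINDOW
        )

    if extra:
        out["additional_kwargs"] = extra
    return out
```

Under these assumptions, `prepare_ollama_kwargs_sketch({"model": "gemma4", "max_tokens": 16})` yields `additional_kwargs == {"num_predict": 16}` and `context_window == 32768`, while passing `context_window=-1` explicitly is left alone.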

Plus: wizard-created Ollama profiles write an explicit context_window: 32768 (self-documenting configs; the runtime injection is the fallback for hand-written ones), an Ollama-gated "Context window" prompt in the advanced settings of mobilerun configure, a fix for the stale loader docstring path, and a commented Ollama example in config_example.yaml.

Back-compat: no config migration — runtime translation is the compatibility layer; existing additional_kwargs.num_predict/num_ctx workaround configs behave byte-for-byte identically under the precedence rules. Two visible changes: wizard-written max_tokens now actually caps output, and unconfigured Ollama profiles drop from model-max to 32K context (one line restores either).

Testing

  • 169/169 unit tests, 16 new in tests/test_llm_picker.py: translation, both conflict directions, numeric-string coercion, invalid-value warn+skip (string/bool/None), default injection, explicit/-1 preservation, num_ctx mirroring (incl. non-numeric fallback), deduped unknown-kwarg warning, the model_fields future-guard, end-to-end load_llm("Ollama"), and the wizard default.
  • Live A/B vs released 0.6.5 (real Ollama server, gemma4 = 131072-context model):
| | 0.6.5 | this branch |
|---|---|---|
| ollama ps CONTEXT | 131072 (model max) | 32768 |
| max_tokens=16 | ignored → 890-char response | num_predict: 16 → output capped |
| hidden show() call | yes | no |
  • Live emulator agent run on Ollama (-p Ollama -m gemma4:latest): completed the screen-title task with the 32K context active at 100% GPU.

🤖 Generated with Claude Code

…ntext

Reported on Discord: setting max_tokens did nothing and `ollama ps`
showed a 262144 context (19 GB, CPU spill) regardless of Ollama-side
settings.

Two root causes, both in how kwargs reach llama-index's Ollama class:

- max_tokens is not an Ollama field, so pydantic silently dropped it.
  It now translates to additional_kwargs.num_predict in load_llm's
  Ollama branch (the single chokepoint for CLI, SDK, and config
  profiles); an explicit num_predict wins with a one-line warning, and
  the translation auto-disables if llama-index ever grows a native
  max_tokens field.
- context_window defaulted to -1, which resolves to the model's MAXIMUM
  context via a hidden client.show() call — and the per-request num_ctx
  mobilerun sends overrides every Ollama-side knob, so users could not
  fix it server-side. Unset profiles now default to 32768 (the DeepSeek
  context_window default is in-repo precedent); an explicit
  additional_kwargs.num_ctx is mirrored into context_window so the two
  stay aligned without the network lookup; context_window: -1 remains
  the escape hatch for model-max.

Also: wizard-created Ollama profiles get an explicit
context_window: 32768, the configure wizard gains an Ollama-gated
"Context window" prompt, unknown Ollama kwargs log a deduped warning
instead of vanishing, the stale loader docstring config path is fixed,
and config_example.yaml documents an Ollama profile.
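The deduped unknown-kwarg warning mentioned above could look roughly like the following. It is a sketch only: `warn_unknown_kwargs`, the module-level `_already_warned` set, and the logger name are hypothetical, and `known_fields` is passed in by the caller here, whereas the real code presumably derives it from the Ollama class's model_fields.

```python
import logging
from typing import Any, Iterable, List, Mapping, Set

logger = logging.getLogger("ollama_kwargs_sketch")  # hypothetical logger name

# Names that have already been reported, so each one is logged only once
# per process no matter how many times an LLM is loaded.
_already_warned: Set[str] = set()


def warn_unknown_kwargs(
    kwargs: Mapping[str, Any], known_fields: Iterable[str]
) -> List[str]:
    """Log one warning for kwargs the LLM class would drop; return the new names.

    The mapping is only inspected, never modified (warn-only, nothing popped).
    """
    known = set(known_fields)
    unknown = sorted(name for name in kwargs if name not in known)
    fresh = [name for name in unknown if name not in _already_warned]
    if fresh:
        _already_warned.update(fresh)
        logger.warning("Ollama ignores unknown kwargs: %s", ", ".join(fresh))
    return fresh
```

Returning the freshly reported names is a convenience for testing; a caller that only wants the log line can ignore the return value.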

Verified A/B against released 0.6.5 with gemma4 (131072-context model):
ollama ps context drops 131072 -> 32768, max_tokens=16 goes from
ignored (890-char response) to a hard cap, and a live emulator agent
run on Ollama completes with the 32K context active at 100% GPU.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RasulOs merged commit c4ecac0 into main Jun 11, 2026
4 of 7 checks passed
mintlify Bot commented Jun 11, 2026

Docs PR opened: https://github.com/droidrun/mobilerun-docs/pull/12

Documented Ollama profile kwargs in the configuration guide, covering max_tokens, the new 32K context_window default, and -1 override.
