fix(ollama): honor max_tokens and stop defaulting to the model-max context #361
Merged
…ntext

Reported on Discord: setting max_tokens did nothing and `ollama ps` showed a 262144 context (19 GB, CPU spill) regardless of Ollama-side settings.

Two root causes, both in how kwargs reach llama-index's Ollama class:

- max_tokens is not an Ollama field, so pydantic silently dropped it. It now translates to additional_kwargs.num_predict in load_llm's Ollama branch (the single chokepoint for CLI, SDK, and config profiles); an explicit num_predict wins with a one-line warning, and the translation auto-disables if llama-index ever grows a native max_tokens field.
- context_window defaulted to -1, which resolves to the model's MAXIMUM context via a hidden client.show() call, and the per-request num_ctx mobilerun sends overrides every Ollama-side knob, so users could not fix it server-side. Unset profiles now default to 32768 (the DeepSeek context_window default is in-repo precedent); an explicit additional_kwargs.num_ctx is mirrored into context_window so the two stay aligned without the network lookup; context_window: -1 remains the escape hatch for model-max.

Also: wizard-created Ollama profiles get an explicit context_window: 32768, the configure wizard gains an Ollama-gated "Context window" prompt, unknown Ollama kwargs log a deduped warning instead of vanishing, the stale loader docstring config path is fixed, and config_example.yaml documents an Ollama profile.

Verified A/B against released 0.6.5 with gemma4 (131072-context model): ollama ps context drops 131072 -> 32768, max_tokens=16 goes from ignored (890-char response) to a hard cap, and a live emulator agent run on Ollama completes with the 32K context active at 100% GPU.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
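The two translations described in this commit message can be sketched roughly as follows. This is an illustration only, not the code merged in the PR: the function name `prepare_ollama_kwargs_sketch`, the `_as_int` helper, the logger name, and the warning texts are all hypothetical, and the real helper may differ in details such as how equal values are treated on conflict.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("ollama_kwargs_sketch")  # hypothetical logger name

DEFAULT_CONTEXT_WINDOW = 32768  # the default named in the commit message


def _as_int(value: Any) -> Optional[int]:
    """Return value as an int, or None when it is not integer-like."""
    if isinstance(value, bool):  # bool subclasses int; treat it as invalid here
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value.strip())  # numeric strings such as "16" are coerced
        except ValueError:
            return None
    return None


def prepare_ollama_kwargs_sketch(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the PR's helper; returns a new dict."""
    out = dict(kwargs)
    extra = dict(out.get("additional_kwargs") or {})

    # 1. max_tokens -> additional_kwargs.num_predict; an explicit num_predict wins.
    if "max_tokens" in out:
        raw = out.pop("max_tokens")
        max_tokens = _as_int(raw)
        if max_tokens is None:
            logger.warning("ignoring non-integer max_tokens=%r", raw)
        elif "num_predict" in extra:
            logger.warning(
                "num_predict=%r overrides max_tokens=%r", extra["num_predict"], raw
            )
        else:
            extra["num_predict"] = max_tokens

    # 2. Keep an explicit context_window (including -1); otherwise mirror a
    #    numeric num_ctx; otherwise fall back to the 32768 default.
    if "context_window" not in out:
        num_ctx = _as_int(extra.get("num_ctx"))
        out["context_window"] = (
            num_ctx if num_ctx is not None else DEFAULT_CONTEXT_WINDOW
        )

    if extra:
        out["additional_kwargs"] = extra
    return out
```

Under these assumptions, `prepare_ollama_kwargs_sketch({"model": "gemma4", "max_tokens": 16})` yields `additional_kwargs == {"num_predict": 16}` and `context_window == 32768`, while `{"context_window": -1}` passes through unchanged.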
Docs PR opened: https://github.com/droidrun/mobilerun-docs/pull/12

Documented Ollama profile kwargs in the configuration guide, covering max_tokens, the new 32K context_window default, and the -1 override.
Problem (Discord report)

Two independent bugs in the Ollama path:

- `max_tokens` was silently dropped. It is not a field on llama-index's `Ollama` class, so pydantic discarded it: no error, no effect. Verified: `load_llm("Ollama", ..., max_tokens=16)` produced an 890-char response with `additional_kwargs: {}`.
- The `context_window=-1` default resolves to the model's max context via a hidden `client.show()` call, allocating the full KV cache (256K-context models → ~19 GB, CPU spill). Because mobilerun sends `num_ctx` per request, this overrides every Ollama-side setting (`OLLAMA_CONTEXT_LENGTH`, Modelfile, `/set parameter`), so users could not fix it from the Ollama side, which is exactly the reporter's confusion.

Fix

A pure `_prepare_ollama_kwargs()` helper in `load_llm`'s Ollama branch, the single chokepoint covering CLI, SDK, and config-profile paths:

- `max_tokens` → `additional_kwargs.num_predict`; explicit `num_predict` wins (one-line warning on conflict); non-integer values warn and skip (today they're a silent no-op, so warn+skip strictly improves); a `model_fields` guard auto-disables the shim if llama-index adds native support.
- `context_window` defaults to 32768 when neither it nor `additional_kwargs.num_ctx` is set (in-repo precedent: the DeepSeek branch's `context_window` default). An explicit `num_ctx` is mirrored into `context_window` so they stay aligned and the hidden `show()` network call is never triggered. `context_window: -1` remains the documented escape hatch for model-max on big-GPU machines.

Plus: wizard-created Ollama profiles write an explicit `context_window: 32768` (self-documenting configs; the runtime injection is the fallback for hand-written ones), an Ollama-gated "Context window" prompt in `mobilerun configure` advanced settings, a fix for the stale loader docstring path, and a commented Ollama example in `config_example.yaml`.

Back-compat: no config migration; runtime translation is the compatibility layer, and existing `additional_kwargs.num_predict`/`num_ctx` workaround configs behave byte-for-byte identically under the precedence rules. Two visible changes: wizard-written `max_tokens` now actually caps output, and unconfigured Ollama profiles drop from model-max to 32K context (one line restores either).

Testing
- `tests/test_llm_picker.py`: translation, both conflict directions, numeric-string coercion, invalid-value warn+skip (string/bool/None), default injection, explicit/`-1` preservation, `num_ctx` mirroring (incl. non-numeric fallback), deduped unknown-kwarg warning, the `model_fields` future-guard, end-to-end `load_llm("Ollama")`, and the wizard default.
- A/B against released 0.6.5 (`gemma4` = 131072-context model):
  - `ollama ps` CONTEXT: 131072 → 32768
  - `max_tokens=16`: ignored (890-char response) → `num_predict: 16` → output capped
  - hidden `show()` call: no longer triggered
- Live emulator agent run (`-p Ollama -m gemma4:latest`): completed the screen-title task with the 32K context active at 100% GPU.

🤖 Generated with Claude Code
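Two of the behaviors in the test list above, the deduped unknown-kwarg warning and the `model_fields` future-guard, could look roughly like this. Both function names, the module-level `_warned_unknown` set, and the warning text are hypothetical; the only assumption about the real `Ollama` class is that, as a pydantic v2 model, it exposes a `model_fields` mapping keyed by field name.

```python
import logging
from typing import Any, Dict, Iterable, List, Set

logger = logging.getLogger("ollama_kwargs_sketch")  # hypothetical logger name

# Kwarg names already reported in this process (hypothetical module state).
_warned_unknown: Set[str] = set()


def warn_unknown_kwargs(kwargs: Dict[str, Any], known_fields: Iterable[str]) -> List[str]:
    """Log each unrecognised kwarg name once; return the names newly warned about."""
    known = set(known_fields)
    fresh = sorted(
        name for name in kwargs if name not in known and name not in _warned_unknown
    )
    for name in fresh:
        logger.warning("Ollama: unknown kwarg %r will be ignored", name)
    _warned_unknown.update(fresh)
    return fresh


def needs_max_tokens_shim(llm_class: type) -> bool:
    """True while the class declares no native max_tokens field."""
    return "max_tokens" not in getattr(llm_class, "model_fields", {})
```

Calling `warn_unknown_kwargs` twice with the same unknown name logs only on the first call, and `needs_max_tokens_shim` starts returning `False` as soon as the class it is given lists `max_tokens` among its fields, which is the point at which the translation would step aside.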