fix(llm): share post-processing helpers between streaming and non-streaming llm_call#1918
Conversation
…eaming llm_call The streaming path in `_stream_llm_call` set the same context variables and `llm_call_info` fields as the non-streaming path, but it did so inline and skipped `_log_completion`, the `<think>` tag fallback inside `_store_reasoning_traces`, and the shared `_update_token_stats` helper. The two paths could drift, and reasoning models that emit `<think>...</think>` in content (rather than via `delta_reasoning`) had their hidden reasoning left in `llm_call_info.completion` and never extracted into `reasoning_trace_var` during streaming. This change builds the `LLMResponse` once streaming finishes and then runs the same five helpers (`_store_reasoning_traces`, `_log_completion`, `_update_token_stats`, `_store_tool_calls`, `_store_response_metadata`) in the same order as the non-streaming branch. Reasoning extraction now runs before the completion is logged or stored so hidden `<think>` content does not leak into `llm_call_info.completion`. Added four tests in `tests/llm/clients/test_stream_llm_call.py` covering `<think>` extraction from streamed content, `_log_completion` populating `llm_call_info.completion` plus emitting the completion log line, the stripped-completion guarantee, and global token stat increments. Closes NVIDIA-NeMo#1756 Signed-off-by: Aditya Singh <adisin650@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR refactors the streaming LLM call path to apply the same post-processing helpers ( ChangesStreaming LLM post-processing parity
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Greptile SummaryThis PR unifies the post-streaming pipeline in
|
| Filename | Overview |
|---|---|
| nemoguardrails/actions/llm/utils.py | Refactored _stream_llm_call to call the same five post-processing helpers as the non-streaming path; correctly resolves the <think> tag TODO and the missing _log_completion gap, but the switch from _update_token_stats_from_chunk to _update_token_stats introduces new INFO-level log noise for streaming calls where the model omits usage info. |
| tests/llm/clients/test_stream_llm_call.py | Four new tests cover the key scenarios: <think> tag extraction for content-based reasoning, _log_completion populating llm_call_info and emitting a log line, stripped completion stored in llm_call_info, and token stat increment via llm_stats_var; all use correct context-var teardown via reset(token). |
Sequence Diagram
sequenceDiagram
participant Caller
participant llm_call
participant _stream_llm_call
participant StreamingHandler
participant Model
Caller->>llm_call: "llm_call(llm, prompt, streaming_handler=h)"
llm_call->>_stream_llm_call: _stream_llm_call(model, prompt, handler, stop)
loop stream chunks
_stream_llm_call->>Model: stream_async(prompt)
Model-->>_stream_llm_call: LLMResponseChunk
_stream_llm_call->>StreamingHandler: push_chunk(content, metadata)
end
_stream_llm_call->>StreamingHandler: finish()
Note over _stream_llm_call: Build LLMResponse(content=handler.completion, ...)
_stream_llm_call->>_stream_llm_call: _store_reasoning_traces(response)
_stream_llm_call->>_stream_llm_call: _log_completion(response)
_stream_llm_call->>_stream_llm_call: _update_token_stats(response)
_stream_llm_call->>_stream_llm_call: _store_tool_calls(response)
_stream_llm_call->>_stream_llm_call: _store_response_metadata(response)
_stream_llm_call-->>llm_call: LLMResponse
llm_call-->>Caller: LLMResponse
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemoguardrails/actions/llm/utils.py:156
**New log noise for streaming calls without usage info**
`_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.
Reviews (1): Last reviewed commit: "fix(llm): share post-processing helpers ..." | Re-trigger Greptile
| # response.content are stripped before the completion is logged or stored. | ||
| _store_reasoning_traces(response) | ||
| _log_completion(response) | ||
| _update_token_stats(response) |
There was a problem hiding this comment.
New log noise for streaming calls without usage info
_update_token_stats has an else branch that emits logger.info("Token stats in LLM call info cannot be computed for current model!") when response.usage is None. The old streaming path used _update_token_stats_from_chunk, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.
Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/actions/llm/utils.py
Line: 156
Comment:
**New log noise for streaming calls without usage info**
`_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Acknowledged. The unification is exactly what the linked issue #1756 asks for, so the streaming path now matches the non-streaming path one-for-one, including this log line when usage is missing. That keeps the two paths from drifting again.
Two consequences of keeping the log:
- Operators see the same "Token stats in LLM call info cannot be computed for current model!" message regardless of streaming mode, which is useful signal when integrating a new provider that does not report usage.
- The fix and its side effects are localized: removing the log would mean diverging from
_update_token_stats, which is the opposite direction from this PR.
If the maintainers prefer the line to be silenced (or downgraded to debug) for both paths, I am happy to do that in a small follow-up PR so it can land independently of this unification.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The non-streaming branch of
llm_callruns a small fixed pipeline after the model returns:_store_reasoning_traces,_log_completion,_update_token_stats,_store_tool_calls,_store_response_metadata. The streaming branch did the same work inline but skipped a few of those helpers, so the two paths could drift. Notably_log_completionwas never invoked, so the completion log line was missing for streaming, and the<think>...</think>fallback inside_store_reasoning_traceswas never reached, which left hidden reasoning inllm_call_info.completionand an unsetreasoning_trace_varfor models that stream reasoning via tags in content rather than viadelta_reasoning.The fix builds the
LLMResponseonce streaming finishes and then runs the same five helpers in the same order as the non-streaming path._store_reasoning_tracesruns first so any<think>tags inresponse.contentare stripped before the completion is logged or stored, matching the order already enforced for the non-streaming path. The pre-existing TODO in_stream_llm_callabout extracting<think>tags from the completed response is resolved by this reuse.I added four tests in
tests/llm/clients/test_stream_llm_call.pycovering reasoning extraction from streamed<think>content,_log_completionpopulatingllm_call_info.completionand emitting the completion log line, the stripped-completion guarantee, and global token stat increments via the shared helper. All 88 existing tests undertests/test_streaming_*,tests/integrations/langchain/test_streaming.py, andtests/llm/clients/test_stream_llm_call.pycontinue to pass, along with the broader llm/reasoning/tool-call test suites.Closes #1756
Summary by CodeRabbit
Release Notes
Refactor
Tests