Skip to content

fix(llm): share post-processing helpers between streaming and non-streaming llm_call#1918

Open
adityasingh2400 wants to merge 1 commit into
NVIDIA-NeMo:developfrom
adityasingh2400:fix/streaming-post-processing-helpers
Open

fix(llm): share post-processing helpers between streaming and non-streaming llm_call#1918
adityasingh2400 wants to merge 1 commit into
NVIDIA-NeMo:developfrom
adityasingh2400:fix/streaming-post-processing-helpers

Conversation

@adityasingh2400
Copy link
Copy Markdown

@adityasingh2400 adityasingh2400 commented May 22, 2026

The non-streaming branch of llm_call runs a small fixed pipeline after the model returns: _store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata. The streaming branch did the same work inline but skipped a few of those helpers, so the two paths could drift. Notably _log_completion was never invoked, so the completion log line was missing for streaming, and the <think>...</think> fallback inside _store_reasoning_traces was never reached, which left hidden reasoning in llm_call_info.completion and an unset reasoning_trace_var for models that stream reasoning via tags in content rather than via delta_reasoning.

The fix builds the LLMResponse once streaming finishes and then runs the same five helpers in the same order as the non-streaming path. _store_reasoning_traces runs first so any <think> tags in response.content are stripped before the completion is logged or stored, matching the order already enforced for the non-streaming path. The pre-existing TODO in _stream_llm_call about extracting <think> tags from the completed response is resolved by this reuse.

I added four tests in tests/llm/clients/test_stream_llm_call.py covering reasoning extraction from streamed <think> content, _log_completion populating llm_call_info.completion and emitting the completion log line, the stripped-completion guarantee, and global token stat increments via the shared helper. All 88 existing tests under tests/test_streaming_*, tests/integrations/langchain/test_streaming.py, and tests/llm/clients/test_stream_llm_call.py continue to pass, along with the broader llm/reasoning/tool-call test suites.

Closes #1756

Summary by CodeRabbit

Release Notes

  • Refactor

    • Unified post-processing logic for streaming and non-streaming LLM calls to ensure consistent behavior across both code paths.
  • Tests

    • Added comprehensive test coverage for streaming LLM call post-processing, including reasoning extraction, completion logging, and token statistics tracking.

Review Change Stack

…eaming llm_call

The streaming path in `_stream_llm_call` set the same context variables and
`llm_call_info` fields as the non-streaming path, but it did so inline and
skipped `_log_completion`, the `<think>` tag fallback inside
`_store_reasoning_traces`, and the shared `_update_token_stats` helper. The
two paths could drift, and reasoning models that emit `<think>...</think>`
in content (rather than via `delta_reasoning`) had their hidden reasoning
left in `llm_call_info.completion` and never extracted into
`reasoning_trace_var` during streaming.

This change builds the `LLMResponse` once streaming finishes and then runs
the same five helpers (`_store_reasoning_traces`, `_log_completion`,
`_update_token_stats`, `_store_tool_calls`, `_store_response_metadata`) in
the same order as the non-streaming branch. Reasoning extraction now runs
before the completion is logged or stored so hidden `<think>` content does
not leak into `llm_call_info.completion`.

Added four tests in `tests/llm/clients/test_stream_llm_call.py` covering
`<think>` extraction from streamed content, `_log_completion` populating
`llm_call_info.completion` plus emitting the completion log line, the
stripped-completion guarantee, and global token stat increments.

Closes NVIDIA-NeMo#1756

Signed-off-by: Aditya Singh <adisin650@gmail.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 639f6e3b-372e-4e9a-a165-d2837b3666b0

📥 Commits

Reviewing files that changed from the base of the PR and between 2a1d965 and e7748d2.

📒 Files selected for processing (2)
  • nemoguardrails/actions/llm/utils.py
  • tests/llm/clients/test_stream_llm_call.py

📝 Walkthrough

Walkthrough

The PR refactors the streaming LLM call path to apply the same post-processing helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) used by the non-streaming path, eliminating silent behavioral differences. A new test suite validates this parity across reasoning extraction, completion logging, tag stripping, and token statistics.

Changes

Streaming LLM post-processing parity

Layer / File(s) Summary
Streaming path refactor to shared post-processing
nemoguardrails/actions/llm/utils.py
After handler.finish(), streaming path constructs LLMResponse with computed reasoning_content and accumulated metadata, then applies the same post-processing pipeline as non-streaming: storing reasoning traces, logging completion, updating token stats, storing tool calls, and storing response metadata.
Streaming post-processing test suite
tests/llm/clients/test_stream_llm_call.py
Import context variables and types needed for streaming post-processing validation. Add TestStreamLlmCallSharedPostProcessing test suite with four test methods verifying that streaming extracts reasoning from <think> tags, logs completion while populating LLMCallInfo, strips <think> tags from stored completion, and updates both LLMCallInfo and global LLMStats token counts from streamed usage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(llm): share post-processing helpers between streaming and non-streaming llm_call' directly describes the main change—making streaming reuse the same post-processing helpers as non-streaming.
Linked Issues check ✅ Passed The PR fully addresses issue #1756 by ensuring streaming calls all five post-processing helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) in the same order as non-streaming, resolving behavioral divergences.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the stated objective: modifications to the streaming path in llm/utils.py and comprehensive test coverage in test_stream_llm_call.py for the shared post-processing behavior.
Test Results For Major Changes ✅ Passed Four tests documented and verified covering tag extraction, completion logging, tag stripping, and token stats. Commit states existing tests pass.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR unifies the post-streaming pipeline in _stream_llm_call with the non-streaming path by building an LLMResponse once streaming finishes and then running the same five helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) in the same order. This closes the long-standing gaps where _log_completion was never called on streaming responses and <think> tag extraction was never reached for models that stream reasoning through content tags.

  • Core fix: The five post-processing helpers are now shared across both code paths, eliminating drift. _store_reasoning_traces runs first so <think> tags are stripped from response.content before the completion is logged or stored in llm_call_info, matching the non-streaming order.
  • Tests: Four new tests in TestStreamLlmCallSharedPostProcessing verify <think> extraction, llm_call_info.completion population, stripped-completion guarantee, and global token stat increments via the shared helper.
  • Subtle behavioral change: The old streaming path called _update_token_stats_from_chunk (which silently skips when usage is absent) whereas the new path calls _update_token_stats, which emits an INFO log line when response.usage is None; models or providers that omit usage in stream responses will now generate per-call log noise.

Confidence Score: 4/5

Safe to merge; the change correctly unifies the two code paths and the tests verify the key new behaviors.

The refactoring is clean and the logic is sound. The one behavioral change worth watching is that _update_token_stats now logs a per-call INFO message for streaming responses from models that don't report usage, where before the streaming path was silent. This is log noise rather than wrong behavior, and it can be addressed as a follow-up.

No files require special attention; nemoguardrails/actions/llm/utils.py is straightforward and the test file has solid coverage of the new paths.

Important Files Changed

Filename Overview
nemoguardrails/actions/llm/utils.py Refactored _stream_llm_call to call the same five post-processing helpers as the non-streaming path; correctly resolves the <think> tag TODO and the missing _log_completion gap, but the switch from _update_token_stats_from_chunk to _update_token_stats introduces new INFO-level log noise for streaming calls where the model omits usage info.
tests/llm/clients/test_stream_llm_call.py Four new tests cover the key scenarios: <think> tag extraction for content-based reasoning, _log_completion populating llm_call_info and emitting a log line, stripped completion stored in llm_call_info, and token stat increment via llm_stats_var; all use correct context-var teardown via reset(token).

Sequence Diagram

sequenceDiagram
    participant Caller
    participant llm_call
    participant _stream_llm_call
    participant StreamingHandler
    participant Model

    Caller->>llm_call: "llm_call(llm, prompt, streaming_handler=h)"
    llm_call->>_stream_llm_call: _stream_llm_call(model, prompt, handler, stop)
    loop stream chunks
        _stream_llm_call->>Model: stream_async(prompt)
        Model-->>_stream_llm_call: LLMResponseChunk
        _stream_llm_call->>StreamingHandler: push_chunk(content, metadata)
    end
    _stream_llm_call->>StreamingHandler: finish()
    Note over _stream_llm_call: Build LLMResponse(content=handler.completion, ...)
    _stream_llm_call->>_stream_llm_call: _store_reasoning_traces(response)
    _stream_llm_call->>_stream_llm_call: _log_completion(response)
    _stream_llm_call->>_stream_llm_call: _update_token_stats(response)
    _stream_llm_call->>_stream_llm_call: _store_tool_calls(response)
    _stream_llm_call->>_stream_llm_call: _store_response_metadata(response)
    _stream_llm_call-->>llm_call: LLMResponse
    llm_call-->>Caller: LLMResponse
Loading
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemoguardrails/actions/llm/utils.py:156
**New log noise for streaming calls without usage info**

`_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.

Reviews (1): Last reviewed commit: "fix(llm): share post-processing helpers ..." | Re-trigger Greptile

# response.content are stripped before the completion is logged or stored.
_store_reasoning_traces(response)
_log_completion(response)
_update_token_stats(response)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 New log noise for streaming calls without usage info

_update_token_stats has an else branch that emits logger.info("Token stats in LLM call info cannot be computed for current model!") when response.usage is None. The old streaming path used _update_token_stats_from_chunk, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.

Prompt To Fix With AI
This is a comment left during a code review.
Path: nemoguardrails/actions/llm/utils.py
Line: 156

Comment:
**New log noise for streaming calls without usage info**

`_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The unification is exactly what the linked issue #1756 asks for, so the streaming path now matches the non-streaming path one-for-one, including this log line when usage is missing. That keeps the two paths from drifting again.

Two consequences of keeping the log:

  1. Operators see the same "Token stats in LLM call info cannot be computed for current model!" message regardless of streaming mode, which is useful signal when integrating a new provider that does not report usage.
  2. The fix and its side effects are localized: removing the log would mean diverging from _update_token_stats, which is the opposite direction from this PR.

If the maintainers prefer the line to be silenced (or downgraded to debug) for both paths, I am happy to do that in a small follow-up PR so it can land independently of this unification.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: streaming path skips shared post-processing helpers

1 participant