fix(llm): share post-processing helpers between streaming and non-streaming llm_call by adityasingh2400 · Pull Request #1918 · NVIDIA-NeMo/Guardrails

adityasingh2400 · 2026-05-22T10:21:15Z

The non-streaming branch of llm_call runs a small fixed pipeline after the model returns: _store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata. The streaming branch did the same work inline but skipped a few of those helpers, so the two paths could drift. Notably _log_completion was never invoked, so the completion log line was missing for streaming, and the <think>...</think> fallback inside _store_reasoning_traces was never reached, which left hidden reasoning in llm_call_info.completion and an unset reasoning_trace_var for models that stream reasoning via tags in content rather than via delta_reasoning.

The fix builds the LLMResponse once streaming finishes and then runs the same five helpers in the same order as the non-streaming path. _store_reasoning_traces runs first so any <think> tags in response.content are stripped before the completion is logged or stored, matching the order already enforced for the non-streaming path. The pre-existing TODO in _stream_llm_call about extracting <think> tags from the completed response is resolved by this reuse.

I added four tests in tests/llm/clients/test_stream_llm_call.py covering reasoning extraction from streamed <think> content, _log_completion populating llm_call_info.completion and emitting the completion log line, the stripped-completion guarantee, and global token stat increments via the shared helper. All 88 existing tests under tests/test_streaming_*, tests/integrations/langchain/test_streaming.py, and tests/llm/clients/test_stream_llm_call.py continue to pass, along with the broader llm/reasoning/tool-call test suites.

Closes #1756

Summary by CodeRabbit

Release Notes

Refactor
- Unified post-processing logic for streaming and non-streaming LLM calls to ensure consistent behavior across both code paths.
Tests
- Added comprehensive test coverage for streaming LLM call post-processing, including reasoning extraction, completion logging, and token statistics tracking.

…eaming llm_call The streaming path in `_stream_llm_call` set the same context variables and `llm_call_info` fields as the non-streaming path, but it did so inline and skipped `_log_completion`, the `<think>` tag fallback inside `_store_reasoning_traces`, and the shared `_update_token_stats` helper. The two paths could drift, and reasoning models that emit `<think>...</think>` in content (rather than via `delta_reasoning`) had their hidden reasoning left in `llm_call_info.completion` and never extracted into `reasoning_trace_var` during streaming. This change builds the `LLMResponse` once streaming finishes and then runs the same five helpers (`_store_reasoning_traces`, `_log_completion`, `_update_token_stats`, `_store_tool_calls`, `_store_response_metadata`) in the same order as the non-streaming branch. Reasoning extraction now runs before the completion is logged or stored so hidden `<think>` content does not leak into `llm_call_info.completion`. Added four tests in `tests/llm/clients/test_stream_llm_call.py` covering `<think>` extraction from streamed content, `_log_completion` populating `llm_call_info.completion` plus emitting the completion log line, the stripped-completion guarantee, and global token stat increments. Closes NVIDIA-NeMo#1756 Signed-off-by: Aditya Singh <adisin650@gmail.com>

coderabbitai · 2026-05-22T10:24:33Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 639f6e3b-372e-4e9a-a165-d2837b3666b0

📥 Commits

Reviewing files that changed from the base of the PR and between 2a1d965 and e7748d2.

📒 Files selected for processing (2)

nemoguardrails/actions/llm/utils.py
tests/llm/clients/test_stream_llm_call.py

📝 Walkthrough

Walkthrough

The PR refactors the streaming LLM call path to apply the same post-processing helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) used by the non-streaming path, eliminating silent behavioral differences. A new test suite validates this parity across reasoning extraction, completion logging, tag stripping, and token statistics.

Changes

Streaming LLM post-processing parity

Layer / File(s)	Summary
Streaming path refactor to shared post-processing `nemoguardrails/actions/llm/utils.py`	After `handler.finish()`, streaming path constructs `LLMResponse` with computed `reasoning_content` and accumulated metadata, then applies the same post-processing pipeline as non-streaming: storing reasoning traces, logging completion, updating token stats, storing tool calls, and storing response metadata.
Streaming post-processing test suite `tests/llm/clients/test_stream_llm_call.py`	Import context variables and types needed for streaming post-processing validation. Add `TestStreamLlmCallSharedPostProcessing` test suite with four test methods verifying that streaming extracts reasoning from `<think>` tags, logs completion while populating `LLMCallInfo`, strips `<think>` tags from stored completion, and updates both `LLMCallInfo` and global `LLMStats` token counts from streamed `usage`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix(llm): share post-processing helpers between streaming and non-streaming llm_call' directly describes the main change—making streaming reuse the same post-processing helpers as non-streaming.
Linked Issues check	✅ Passed	The PR fully addresses issue `#1756` by ensuring streaming calls all five post-processing helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) in the same order as non-streaming, resolving behavioral divergences.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to the stated objective: modifications to the streaming path in llm/utils.py and comprehensive test coverage in test_stream_llm_call.py for the shared post-processing behavior.
Test Results For Major Changes	✅ Passed	Four tests documented and verified covering tag extraction, completion logging, tag stripping, and token stats. Commit states existing tests pass.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps · 2026-05-22T10:24:57Z

Greptile Summary

This PR unifies the post-streaming pipeline in _stream_llm_call with the non-streaming path by building an LLMResponse once streaming finishes and then running the same five helpers (_store_reasoning_traces, _log_completion, _update_token_stats, _store_tool_calls, _store_response_metadata) in the same order. This closes the long-standing gaps where _log_completion was never called on streaming responses and <think> tag extraction was never reached for models that stream reasoning through content tags.

Core fix: The five post-processing helpers are now shared across both code paths, eliminating drift. _store_reasoning_traces runs first so <think> tags are stripped from response.content before the completion is logged or stored in llm_call_info, matching the non-streaming order.
Tests: Four new tests in TestStreamLlmCallSharedPostProcessing verify <think> extraction, llm_call_info.completion population, stripped-completion guarantee, and global token stat increments via the shared helper.
Subtle behavioral change: The old streaming path called _update_token_stats_from_chunk (which silently skips when usage is absent) whereas the new path calls _update_token_stats, which emits an INFO log line when response.usage is None; models or providers that omit usage in stream responses will now generate per-call log noise.

Confidence Score: 4/5

Safe to merge; the change correctly unifies the two code paths and the tests verify the key new behaviors.

The refactoring is clean and the logic is sound. The one behavioral change worth watching is that _update_token_stats now logs a per-call INFO message for streaming responses from models that don't report usage, where before the streaming path was silent. This is log noise rather than wrong behavior, and it can be addressed as a follow-up.

No files require special attention; nemoguardrails/actions/llm/utils.py is straightforward and the test file has solid coverage of the new paths.

Important Files Changed

Filename	Overview
nemoguardrails/actions/llm/utils.py	Refactored `_stream_llm_call` to call the same five post-processing helpers as the non-streaming path; correctly resolves the `<think>` tag TODO and the missing `_log_completion` gap, but the switch from `_update_token_stats_from_chunk` to `_update_token_stats` introduces new INFO-level log noise for streaming calls where the model omits usage info.
tests/llm/clients/test_stream_llm_call.py	Four new tests cover the key scenarios: `<think>` tag extraction for content-based reasoning, `_log_completion` populating `llm_call_info` and emitting a log line, stripped completion stored in `llm_call_info`, and token stat increment via `llm_stats_var`; all use correct context-var teardown via `reset(token)`.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant llm_call
    participant _stream_llm_call
    participant StreamingHandler
    participant Model

    Caller->>llm_call: "llm_call(llm, prompt, streaming_handler=h)"
    llm_call->>_stream_llm_call: _stream_llm_call(model, prompt, handler, stop)
    loop stream chunks
        _stream_llm_call->>Model: stream_async(prompt)
        Model-->>_stream_llm_call: LLMResponseChunk
        _stream_llm_call->>StreamingHandler: push_chunk(content, metadata)
    end
    _stream_llm_call->>StreamingHandler: finish()
    Note over _stream_llm_call: Build LLMResponse(content=handler.completion, ...)
    _stream_llm_call->>_stream_llm_call: _store_reasoning_traces(response)
    _stream_llm_call->>_stream_llm_call: _log_completion(response)
    _stream_llm_call->>_stream_llm_call: _update_token_stats(response)
    _stream_llm_call->>_stream_llm_call: _store_tool_calls(response)
    _stream_llm_call->>_stream_llm_call: _store_response_metadata(response)
    _stream_llm_call-->>llm_call: LLMResponse
    llm_call-->>Caller: LLMResponse

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nemoguardrails/actions/llm/utils.py:156
**New log noise for streaming calls without usage info**

`_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.

_{Reviews (1): Last reviewed commit: "fix(llm): share post-processing helpers ..." | Re-trigger Greptile}

greptile-apps · 2026-05-22T10:25:01Z

+        # response.content are stripped before the completion is logged or stored.
+        _store_reasoning_traces(response)
+        _log_completion(response)
+        _update_token_stats(response)


New log noise for streaming calls without usage info

_update_token_stats has an else branch that emits logger.info("Token stats in LLM call info cannot be computed for current model!") when response.usage is None. The old streaming path used _update_token_stats_from_chunk, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path.

Prompt To Fix With AI

This is a comment left during a code review. Path: nemoguardrails/actions/llm/utils.py Line: 156 Comment: **New log noise for streaming calls without usage info** `_update_token_stats` has an `else` branch that emits `logger.info("Token stats in LLM call info cannot be computed for current model!")` when `response.usage` is `None`. The old streaming path used `_update_token_stats_from_chunk`, which has no such branch — it silently skips when usage is absent. Every streaming call from a model that doesn't report usage (e.g. many local/self-hosted models, proxies, or providers that omit usage in stream chunks) will now emit this log line on every request, which was previously silent on the streaming path. How can I resolve this? If you propose a fix, please make it concise.

Acknowledged. The unification is exactly what the linked issue #1756 asks for, so the streaming path now matches the non-streaming path one-for-one, including this log line when usage is missing. That keeps the two paths from drifting again.

Two consequences of keeping the log:

Operators see the same "Token stats in LLM call info cannot be computed for current model!" message regardless of streaming mode, which is useful signal when integrating a new provider that does not report usage.

The fix and its side effects are localized: removing the log would mean diverging from _update_token_stats, which is the opposite direction from this PR.

If the maintainers prefer the line to be silenced (or downgraded to debug) for both paths, I am happy to do that in a small follow-up PR so it can land independently of this unification.

codecov · 2026-05-22T10:29:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

greptile-apps Bot reviewed May 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(llm): share post-processing helpers between streaming and non-streaming llm_call#1918

fix(llm): share post-processing helpers between streaming and non-streaming llm_call#1918
adityasingh2400 wants to merge 1 commit into
NVIDIA-NeMo:developfrom
adityasingh2400:fix/streaming-post-processing-helpers

adityasingh2400 commented May 22, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 22, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented May 22, 2026

Confidence Score: 4/5

Sequence Diagram

Uh oh!

greptile-apps Bot May 22, 2026

Uh oh!

adityasingh2400 May 22, 2026

Uh oh!

codecov Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

adityasingh2400 commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented May 22, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot commented May 22, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

adityasingh2400 May 22, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 22, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

adityasingh2400 commented May 22, 2026 •

edited by coderabbitai Bot

Loading