
perf: add FrameType enum, O(1) dispatch table, and lazy Frame fields #3847

Closed

viai957 wants to merge 3 commits into pipecat-ai:main from viai957:perf/frame-type-ids-lazy-init

Conversation


@viai957 viai957 commented Feb 26, 2026

Summary

  • FrameType / FrameCategory enums (frame_types.py): integer type identifiers added as ClassVar[int] type_id on every concrete Frame subclass, enabling zero-cost type identification without isinstance chains.
  • O(1) dispatch tables in LLMUserAggregator and LLMAssistantAggregator: replaced sequential isinstance if/elif chains with a Dict[int, Callable] keyed on frame.type_id, eliminating N comparisons per frame in hot paths.
  • Lazy Frame.name and Frame.metadata: both fields are now computed/allocated only on first access, reducing per-frame allocation cost for the majority of frames that never touch these fields.
  • Optional UserIdleController: user_idle_timeout in LLMUserAggregatorParams defaults to None (was 0); the controller is only instantiated when explicitly configured, and a _IDLE_CONTROLLER_FRAME_TYPES frozenset guards the hot path to skip the controller call for irrelevant frame types.

Motivation

In high-throughput pipelines (e.g. real-time audio with many short frames), process_frame is called thousands of times per second. The original implementation checked frame types via long isinstance chains (O(N) per frame) and unconditionally allocated name/metadata dicts on every frame construction. These changes make the common path O(1) and allocation-free for frames that don't need those fields.

Changes

src/pipecat/frames/frame_types.py (new)

  • FrameCategory: 8-bit category prefixes (AUDIO, TEXT, IMAGE, CONTROL, SYSTEM, etc.)
  • FrameType: 16-bit type IDs composed as (category << 8) | subtype
  • Fast category-check helper functions (is_audio_frame, is_text_frame, etc.)
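The ID composition described above can be sketched roughly as follows. This is illustrative: the category and subtype values here are assumptions, not the PR's actual constants.

```python
from enum import IntEnum

class FrameCategory(IntEnum):
    # Assumed 8-bit category values; the real module may order them differently.
    AUDIO = 0x01
    TEXT = 0x02
    IMAGE = 0x03
    CONTROL = 0x04
    SYSTEM = 0x05

class FrameType(IntEnum):
    # 16-bit IDs: high byte is the category, low byte is the subtype.
    AUDIO_RAW = (FrameCategory.AUDIO << 8) | 0x01
    TTS_AUDIO_RAW = (FrameCategory.AUDIO << 8) | 0x02
    TEXT = (FrameCategory.TEXT << 8) | 0x01
    END = (FrameCategory.CONTROL << 8) | 0x01

def is_audio_frame(type_id: int) -> bool:
    # Category check is a shift and a compare; no isinstance walk.
    return (type_id >> 8) == FrameCategory.AUDIO
```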

src/pipecat/frames/frames.py

  • Every concrete Frame subclass now carries a type_id: ClassVar[int] matching its FrameType constant
  • Frame.name → lazy property (backed by _name, allocated on first access)
  • Frame.metadata → lazy property (backed by _metadata, allocated on first access)
  • All existing fields preserved (id, pts, broadcast_sibling_id, transport_source, transport_destination)

src/pipecat/processors/aggregators/llm_response_universal.py

  • LLMUserAggregator._dispatch: Dict[int, Callable] lookup table replaces isinstance chain in process_frame
  • LLMAssistantAggregator._dispatch: same pattern
  • _IDLE_CONTROLLER_FRAME_TYPES frozenset guards UserIdleController calls
  • UserIdleController only instantiated when user_idle_timeout is set
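The dispatch-table pattern might look roughly like this. The type_id constants and handler names below are hypothetical, not the PR's exact identifiers.

```python
from typing import Callable, Dict, Optional

# Hypothetical type_id constants; frame_types.py defines the real ones.
AUDIO_RAW, TEXT, END = 0x0101, 0x0201, 0x0401

class UserAggregator:
    def __init__(self) -> None:
        # Built once in __init__ and never mutated, so no thread-safety concern.
        self._dispatch: Dict[int, Callable[[object], str]] = {
            AUDIO_RAW: self._handle_audio,
            TEXT: self._handle_text,
            END: self._handle_end,
        }

    def process_frame(self, type_id: int, frame: object) -> str:
        # One hash lookup replaces up to 17 sequential isinstance checks.
        handler: Optional[Callable[[object], str]] = self._dispatch.get(type_id)
        return handler(frame) if handler is not None else self._passthrough(frame)

    def _handle_audio(self, frame: object) -> str: return "audio"
    def _handle_text(self, frame: object) -> str: return "text"
    def _handle_end(self, frame: object) -> str: return "end"
    def _passthrough(self, frame: object) -> str: return "pass"
```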

src/pipecat/processors/frame_processor.py

  • _has_handlers() fast-path guards on push_frame and __process_frame event handler calls (avoids coroutine creation when no handlers registered)
  • PROCESS_TASK_CANCEL_TIMEOUT_SECS for __cancel_process_task() to prevent indefinite hangs
  • broadcast_frame_instance preserves lazy _metadata on copied frames

src/pipecat/transports/base_output.py

  • In-place del self._audio_buffer[:chunk_size] replaces slice copy
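As a toy illustration of the difference: del on a bytearray slice shrinks the buffer in place, whereas `buf = buf[chunk_size:]` would allocate and copy the entire remainder on every chunk.

```python
buf = bytearray(b"abcdefgh")
chunk_size = 3

chunk = bytes(buf[:chunk_size])  # the chunk handed to the transport
del buf[:chunk_size]             # shifts remaining bytes in place; no new bytearray
```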

src/pipecat/utils/base_object.py

  • _has_handlers(event_name): non-async O(1) check for registered event handlers
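A sketch of what such a sync guard could look like. The class shape below is an assumption; the real BaseObject and its event-handler registry differ.

```python
from typing import Callable, Dict, List

class BaseObject:
    def __init__(self) -> None:
        self._event_handlers: Dict[str, List[Callable]] = {}

    def _has_handlers(self, event_name: str) -> bool:
        # Plain dict lookup: no coroutine object is created just to
        # discover that nobody is listening.
        return bool(self._event_handlers.get(event_name))

    async def _call_event_handler(self, event_name: str, *args) -> None:
        for handler in self._event_handlers.get(event_name, []):
            await handler(*args)

    async def push_frame(self, frame) -> None:
        # Fast path: skip the async call entirely when no handlers exist.
        if self._has_handlers("on_push_frame"):
            await self._call_event_handler("on_push_frame", frame)
```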

Test plan

  • uv run pytest passes (406 passed, 0 failed)
  • uv run ruff check passes
  • uv run ruff format --check passes
  • Existing pipeline behaviour is unchanged (dispatch table covers all previously handled frame types)
  • All public API signatures preserved (push_interruption_task_frame_and_wait(timeout=), start_ttfb_metrics(start_time=), broadcast_sibling_id, write_transport_frame)

Notes

  • FrameCategory is re-exported from frames.py via # noqa: F401 for downstream consumers that import from pipecat.frames.frames.
  • The dispatch tables are built once during __init__ and never mutated, so there is no thread-safety concern.
  • Subclasses that override process_frame directly (not via the aggregators) are unaffected; type_id is purely additive.

- Add FrameType / FrameCategory integer enums (frame_types.py); every
  concrete Frame subclass now carries a type_id ClassVar for zero-cost
  type identification without isinstance chains.
- Replace sequential isinstance dispatch in LLMUserAggregator and
  LLMAssistantAggregator with an O(1) dict lookup table keyed on
  frame.type_id, eliminating N comparisons per frame in hot paths.
- Lazy-init Frame.name and Frame.metadata: both are now computed /
  allocated only on first access, reducing per-frame allocation cost.
- Make UserIdleController optional (default None) in
  LLMUserAggregatorParams so idle-detection overhead is zero unless
  explicitly configured.
- Add _IDLE_CONTROLLER_FRAME_TYPES frozenset to skip the idle
  controller call for the vast majority of frames.
RTVI framework accesses frame.broadcast_sibling_id on every frame
(rtvi.py:1262). Removing it from Frame.__post_init__ would cause
AttributeError for non-broadcast frames. Restore the field declaration
and initialization.
@viai957 viai957 marked this pull request as ready for review February 26, 2026 15:11
Contributor

kedar389 commented Feb 26, 2026

I am curious, what are the performance gains from these changes?

Author

viai957 commented Feb 27, 2026

@kedar389
Ohh ya! Here are benchmarks from a MacBook Pro M4 (Python 3.12.10). The benchmark simulates the real LLMUserAggregator dispatch with 17 isinstance branches and a weighted audio-heavy frame mix (60% audio, 15% text/LLM, 10% TTS events, 8% speaking events, 5% transcription, 2% control).

Frame construction (lazy vs eager)

| Variant | ns/op |
| --- | --- |
| Original (eager name str + metadata dict alloc) | 337 |
| Optimized (lazy, no alloc until access) | 158 (53% faster) |

Most frames never have .name or .metadata accessed; they're created, dispatched, and discarded. The lazy path avoids the f"{cls.__name__}#{count}" format string and the empty dict() allocation entirely.
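A rough way to reproduce this construction comparison, using stand-in classes rather than the real pipecat frames; absolute numbers will vary by machine.

```python
import timeit

class EagerFrame:
    _count = 0
    def __init__(self):
        EagerFrame._count += 1
        self.name = f"{type(self).__name__}#{EagerFrame._count}"  # str alloc
        self.metadata = {}                                        # dict alloc

class LazyFrame:
    __slots__ = ("_name", "_metadata")
    def __init__(self):
        self._name = None      # nothing formatted or allocated yet
        self._metadata = None

eager = timeit.timeit(EagerFrame, number=100_000)
lazy = timeit.timeit(LazyFrame, number=100_000)
print(f"eager: {eager:.4f}s  lazy: {lazy:.4f}s")
```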

Dispatch: isinstance chain vs dict lookup (17 branches)

| Branch position | isinstance (ns) | dict (ns) | Speedup |
| --- | --- | --- | --- |
| 1st (TranscriptionFrame) | 32 | 45 | 0.72x (isinstance wins) |
| 14th (AudioRawFrame) | 182 | 44 | 4.1x |
| 17th (EndFrame) | 214 | 44 | 4.8x |

isinstance is O(N) — it checks each branch sequentially. Dict lookup is O(1). For branch 1, isinstance is faster because it short-circuits immediately and
avoids the dict hash. But audio frames are ~60% of all frames in a voice pipeline, and they hit branch 14 of 17. That's where the win matters.

Weighted realistic workload (audio-heavy pipeline)

| Dispatch | ns/frame (avg) |
| --- | --- |
| isinstance dispatch | 174 |
| dict dispatch | 49 (3.5x faster) |

Combined per-frame cost

| | Construct | Dispatch | Total |
| --- | --- | --- | --- |
| Original | 337 ns | 182 ns | 520 ns |
| Optimized | 158 ns | 44 ns | 202 ns (61% reduction) |

Throughput impact

| Frame rate | Savings/sec | Fewer allocs/sec |
| --- | --- | --- |
| 100 fps | 32 µs | 200 |
| 500 fps | 159 µs | 1,000 |
| 1,000 fps | 318 µs | 2,000 |
| 5,000 fps | 1,588 µs | 10,000 |

The absolute microsecond savings are modest at typical frame rates, but the GC-pressure reduction from eliminating two allocations per frame (name string + metadata dict) compounds, especially in long-running pipelines where reduced GC pauses matter for real-time audio latency.

What's NOT premature about this

  1. process_frame is the single hottest function in the framework: every frame in every pipeline passes through it
  2. The isinstance chains in the aggregators are 17 branches long and growing with each new frame type
  3. The type_id approach is additive: existing code that uses isinstance still works, but hot paths can opt into the faster dispatch
  4. The lazy init eliminates allocations that are provably unused (most frames never have .name or .metadata accessed)

Happy to add these benchmarks to the repo if useful, or adjust the approach.

@kedar389
Contributor

Sorry, not to diminish your work, but how is this not a premature optimization? Even if the optimizations have good relative numbers, what are the absolute numbers? What are the absolute numbers a pipeline can save?

On what pipeline has this been tested? Do you know what the average fps is in audio or video pipelines? Have you tried benchmarking a real pipeline before and after to see how many ms you saved?

Even if we had 5,000 fps in a pipeline (the worst case you showed), it would still only save about 1.5 ms from your results, which is dwarfed by other components in the system like the turn model or VAD. A 1.5 ms gain for 800 lines of code seems kind of like premature optimization.

Also, if dispatch is such a problem, why not just move the audio frame, which is hit much more often, to be first in the isinstance comparison? That solves the problem in one line of code and you do not need the dict lookup.

I also feel this ratio is skewed (60% audio, 15% text/LLM, 10% TTS events, 8% speaking events, 5% transcription, 2% control). There are probably many more audio (and video) frames, because they flow non-stop compared to other frames.

Author

viai957 commented Mar 2, 2026

@kedar389, you raised valid points that deserved real data rather than theoretical arguments. I ran pipeline benchmarks on both branches and want to share honest results.

Benchmark Setup

Pushed 10,000 frames (80% TTS audio, 10% text, 5% transcription, 5% raw audio) through chains of PassthroughProcessors to isolate framework overhead. Each processor calls super().process_frame() + push_frame(),
exactly the hot path in a real pipeline. Python 3.12, Apple Silicon, median of 3 runs.

Results

Frame Creation (100k TTSAudioRawFrame)

| Metric | main | this PR |
| --- | --- | --- |
| Per-frame | 3.19 µs | 1.73 µs (1.84x faster) |
| Memory (100k) | 34.6 MB | 22.5 MB (35% less) |

Lazy init defers name formatting + metadata dict until first access. In the hot audio path, neither is touched.

Pipeline Throughput (10k frames)

| Pipeline depth | main (fps) | PR (fps) | Delta |
| --- | --- | --- | --- |
| 1 processor | 66,020 | 69,540 | +5% |
| 3 processors | 41,469 | 44,110 | +6% |
| 6 processors | 24,760 | 29,200 | +18% |

At 6 processors (realistic voice pipeline), 18% higher throughput. The gains come mainly from two things:

  1. _has_handlers() guard: push_frame() and __process_frame() make 4 calls to _call_event_handler per frame per processor, even when zero handlers are registered. Each call enters an async coroutine, checks event_name not in self._event_handlers, and returns. The sync _has_handlers() guard skips the coroutine entirely (2.4x faster for this check).

  2. Lazy frame fields: saves ~1.5 µs per frame creation plus 35% memory by not allocating the name string and metadata dict upfront.

60-Second Conversation (6 proc, 50 fps)

At standard voice rates: ~0.06 ms/sec difference. You're right that at 50fps for a single agent, this is negligible compared to LLM/TTS latency.

Where it matters

The 18% throughput headroom helps in:

  • Multi-agent servers running hundreds of concurrent pipelines on the same machine
  • Video + audio pipelines (30fps video + 50fps audio = much higher frame rates)
  • Memory pressure: 35% less per-frame allocation means fewer GC pauses

Your isinstance reordering suggestion

Valid point. Moving AudioRawFrame to the top of the isinstance chains in base_output.py would capture some of the dispatch gains with zero complexity. The handler-guard and lazy-init savings can't be achieved by reordering, though.

Proposal

I'm happy to slim this PR down to just the highest-impact, lowest-controversy changes:

  1. _has_handlers() sync guard on all 4 event handler call sites (biggest win, ~10 lines)
  2. Lazy name/metadata on Frame (35% memory savings, ~20 lines)
  3. Drop the FrameType enum and dispatch tables (separate PR if there's future interest)

@markbackman
Contributor

Hi 👋

I'm a Pipecat maintainer. Before we do any optimization, it would be nice to understand if there is a performance issue. I've been doing a lot of latency measurements and I haven't found anything that jumps out at me as a problem.

If you have data that shows that there is slowness that needs to be optimized, please share. Until then, I don't think we're ready to make this change as it touches foundational level classes used all over Pipecat.

Author

viai957 commented Mar 3, 2026

@markbackman you're absolutely right that I should start by identifying a concrete performance issue.

Here's the context: I'm planning a deployment of ~3000 concurrent voice agents. When I attempted a load test with thousands of parallel pipelines in a single process, it failed badly. I
dug into the root cause and want to share findings.

Per-Pipeline Resource Cost (Measured)

Each pipeline with 6 processors (default config) allocates:

  • 19 asyncio Tasks, 23 Queues, 18 Events
  • ~132 KB Python heap (386 MB at 3000 pipelines)
  • 1 ThreadPoolExecutor per MediaSender → 1 OS thread

At 3000 concurrent pipelines: 57,000 asyncio tasks, 3000+ OS threads, 300K task wakeups/sec.

Benchmark: Throughput Degradation

| Concurrent pipelines | Total FPS | Per-pipeline FPS |
| --- | --- | --- |
| 5 | 30,833 | 6,167 |
| 100 | 29,758 | 298 |
| 500 | 27,230 | 54.5 |

At 500 concurrent pipelines, per-pipeline throughput is barely above the 50fps audio floor. The event loop starts saturating.

Root Causes Found

  1. ThreadPoolExecutor per MediaSender: at scale, the OS thread limit (ulimit ~4096) is hit around pipeline 4000, crashing the process
  2. Signal handler overwriting: loop.add_signal_handler replaces previous handlers, so only the last PipelineRunner handles SIGINT, making graceful shutdown of N pipelines impossible
  3. Unbounded queues: no backpressure, so a 2-second LLM spike queues hundreds of frames per pipeline
  4. Global threading.Lock for obj_id(): 150K acquisitions/sec at scale (13 ms/sec)
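On point 4, one possible lock-free replacement is sketched below. It assumes obj_id() only needs unique, increasing integers; in CPython, next() on an itertools.count object is a single C-level call and therefore atomic under the GIL.

```python
import itertools

# next() on itertools.count is atomic under CPython's GIL, so no
# threading.Lock is needed for a unique monotonically increasing id.
_id_counter = itertools.count(1)

def obj_id() -> int:
    return next(_id_counter)
```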

Revised Proposal

I realize my original PR was addressing the wrong layer. The real improvements for large-scale deployment are:

  • Shared ThreadPoolExecutor instead of per-pipeline (eliminates OS thread exhaustion)
  • Fix signal handler overwriting (correctness bug for multi-runner scenarios)
  • Bounded queues with backpressure
  • _has_handlers() guard on event handler calls (eliminates unnecessary coroutine overhead at 4 call sites per frame per processor)
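The shared-executor idea could be sketched like this. The names are illustrative and the real MediaSender internals may differ; note the lazy singleton below is itself unguarded against a concurrent first call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

_SHARED_EXECUTOR: Optional[ThreadPoolExecutor] = None

def get_shared_executor(max_workers: int = 32) -> ThreadPoolExecutor:
    # One process-wide pool instead of one pool (and OS thread) per sender.
    global _SHARED_EXECUTOR
    if _SHARED_EXECUTOR is None:
        _SHARED_EXECUTOR = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="media-io"
        )
    return _SHARED_EXECUTOR

class MediaSender:
    def __init__(self) -> None:
        # Previously (per the analysis above): one ThreadPoolExecutor each.
        self._executor = get_shared_executor()
```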

I'm happy to close this PR and open targeted issues/PRs for the above. Or if you'd prefer a single focused PR for just the handler guards and signal handler fix (low-risk, high-impact), I can slim this down.

@markbackman
Contributor

I'm planning a deployment of ~3000 concurrent voice agents. When I attempted a load test with thousands of parallel pipelines in a single process, it failed badly.

I see; that's your problem. Real-time agents require real-time communication which needs a different type of deployment and fixed resources. We recommend a bot per process where each bot is allocated 0.5 vCPU and 1GB of RAM.

For our Pipecat Cloud product, we isolate each agent in its own Python process to ensure it has sufficient resources allocated. Agents scale out successfully without resource issues. I think we should close this PR; it's worth looking at your deployment approach to solve the problem.

Happy to chat more about this in Discord if you have questions.

@markbackman markbackman closed this Mar 3, 2026
