perf: add FrameType enum, O(1) dispatch table, and lazy Frame fields #3847
viai957 wants to merge 3 commits into pipecat-ai:main
Conversation
- Add `FrameType` / `FrameCategory` integer enums (`frame_types.py`); every concrete `Frame` subclass now carries a `type_id` ClassVar for zero-cost type identification without `isinstance` chains.
- Replace sequential `isinstance` dispatch in `LLMUserAggregator` and `LLMAssistantAggregator` with an O(1) dict lookup table keyed on `frame.type_id`, eliminating N comparisons per frame in hot paths.
- Lazy-init `Frame.name` and `Frame.metadata`: both are now computed/allocated only on first access, reducing per-frame allocation cost.
- Make `UserIdleController` optional (default `None`) in `LLMUserAggregatorParams` so idle-detection overhead is zero unless explicitly configured.
- Add `_IDLE_CONTROLLER_FRAME_TYPES` frozenset to skip the idle controller call for the vast majority of frames.
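The type-id plus dispatch-table idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class names mirror the description, but the enum values and handler names are made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, ClassVar, Dict, List


class FrameType(IntEnum):
    # Hypothetical IDs; the PR composes them as (category << 8) | subtype.
    AUDIO_RAW = (0x01 << 8) | 0x01
    TEXT = (0x02 << 8) | 0x01


@dataclass
class Frame:
    type_id: ClassVar[int] = 0  # overridden by every concrete subclass


@dataclass
class AudioRawFrame(Frame):
    type_id: ClassVar[int] = FrameType.AUDIO_RAW
    audio: bytes = b""


@dataclass
class TextFrame(Frame):
    type_id: ClassVar[int] = FrameType.TEXT
    text: str = ""


class Aggregator:
    """Dispatches frames via a dict built once, instead of isinstance if/elifs."""

    def __init__(self) -> None:
        self.seen: List[str] = []
        # O(1) lookup keyed on the class-level integer type_id.
        self._dispatch: Dict[int, Callable[[Frame], None]] = {
            FrameType.AUDIO_RAW: self._handle_audio,
            FrameType.TEXT: self._handle_text,
        }

    def process_frame(self, frame: Frame) -> None:
        handler = self._dispatch.get(frame.type_id)
        if handler is not None:
            handler(frame)

    def _handle_audio(self, frame: Frame) -> None:
        self.seen.append("audio")

    def _handle_text(self, frame: Frame) -> None:
        self.seen.append("text")
```

Frames whose `type_id` has no entry in the table simply fall through, which stands in for the "unknown frame" default branch of an if/elif chain.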
The RTVI framework accesses `frame.broadcast_sibling_id` on every frame (rtvi.py:1262). Removing it from `Frame.__post_init__` would cause an `AttributeError` for non-broadcast frames. Restore the field declaration and initialization.
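The failure mode flagged here can be reproduced in isolation. The class names below are hypothetical; only the pattern (a field dropped from the declaration but still read unconditionally downstream) matches the review comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BroadcastAwareFrame:
    # Field kept in the declaration, so every frame (broadcast or not) has it.
    broadcast_sibling_id: Optional[str] = None


@dataclass
class StrippedFrame:
    # Field removed entirely, as the optimization attempted.
    pass


safe = BroadcastAwareFrame()
print(safe.broadcast_sibling_id)  # None: harmless for non-broadcast frames

try:
    StrippedFrame().broadcast_sibling_id  # what an unconditional access hits
except AttributeError as exc:
    print(f"AttributeError: {exc}")
```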
I'm curious: what are the performance gains from these changes?
@kedar389 Here's what I measured:

**Frame construction (lazy vs eager)** — most frames never have `name`/`metadata` accessed.

**Dispatch: isinstance chain vs dict lookup (17 branches)**

**Weighted realistic workload (audio-heavy pipeline)**

**Combined per-frame cost**

**Throughput impact**

The absolute microsecond savings are modest at typical frame rates, but eliminating 2 allocations per frame (name string + metadata dict) also reduces GC pressure.

**What's NOT premature about this**

Happy to add these benchmarks to the repo if useful, or adjust the approach.
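For reference, a standalone microbenchmark of the chain-vs-dict comparison discussed above can be written with `timeit`. This is a sketch under assumed conditions (17 sibling classes, worst-case last-branch match), not the benchmark code from this thread.

```python
import timeit


class Base:
    pass


# 17 sibling subclasses mimic the length of the dispatch chain in question.
SUBCLASSES = [type(f"Frame{i}", (Base,), {"type_id": i}) for i in range(17)]
frame = SUBCLASSES[-1]()  # worst case: matches only the last branch


def chain_dispatch(f):
    # Stand-in for a sequential if/elif isinstance chain: O(N) checks.
    for cls in SUBCLASSES:
        if isinstance(f, cls):
            return cls.type_id
    return None


TABLE = {i: i for i in range(17)}


def dict_dispatch(f):
    # O(1): one attribute read plus one hash lookup.
    return TABLE[f.type_id]


if __name__ == "__main__":
    n = 100_000
    t_chain = timeit.timeit(lambda: chain_dispatch(frame), number=n)
    t_dict = timeit.timeit(lambda: dict_dispatch(frame), number=n)
    print(f"isinstance chain: {t_chain:.4f}s, dict lookup: {t_dict:.4f}s")
```

Both strategies must agree on the result; only the timing differs, and by how much depends on where the matching branch sits in the chain.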
Sorry, not to diminish your work, but how is this not a premature optimization? On what pipeline has this been tested? Do you know the average fps in audio or video pipelines? Have you benchmarked a real pipeline before and after to see how many ms you saved? Even at 5000 fps (the worst case you showed), it would only save about 1.5 ms based on your results, which is dwarfed by other components in the system like the turn model or VAD. A 1.5 ms gain for 800 lines of code seems like premature optimization. Also, if dispatch is such a problem, why not just move the audio frame, which is hit much more often, to be first in the `isinstance` comparison? That solves the problem in one line of code without the dict lookup. I also feel this ratio is skewed (60% audio, 15% text/LLM, 10% TTS events, 8% speaking events, 5% transcription, 2% control): there are probably far more audio (and video) frames, because they flow non-stop compared to other frames.
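The one-line reordering suggested here looks roughly like this. These are toy classes, not Pipecat's; the point is only the branch order.

```python
class Frame:
    pass


class AudioRawFrame(Frame):
    pass


class TextFrame(Frame):
    pass


def process_frame(frame):
    # Hottest type first: audio frames (the bulk of traffic) exit after a
    # single isinstance check instead of falling through many branches.
    if isinstance(frame, AudioRawFrame):
        return "audio"
    if isinstance(frame, TextFrame):
        return "text"
    return "other"
```

With this ordering the common case costs one `isinstance` call, which captures most of the benefit of a dispatch table without any new machinery.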
@kedar389, you raised valid points that deserved real data rather than theoretical arguments. I ran pipeline benchmarks on both branches and want to share honest results.

**Benchmark Setup**

Pushed 10,000 frames (80% TTS audio, 10% text, 5% transcription, 5% raw audio) through chains of PassthroughProcessors to isolate framework overhead. Each processor calls `push_frame` to pass frames downstream.

**Results**

**Frame Creation (100k TTSAudioRawFrame)**

Lazy init defers the `name`/`metadata` allocations to first access.

**Pipeline Throughput (10k frames)**

At 6 processors (realistic voice pipeline), 18% higher throughput. The gains come mainly from two things: the lazy field init and the dict dispatch.

**60-Second Conversation (6 proc, 50 fps)**

At standard voice rates: ~0.06 ms/sec difference. You're right that at 50fps for a single agent, this is negligible compared to LLM/TTS latency.

**Where it matters**

The 18% throughput headroom helps in high-concurrency deployments.

**Your isinstance reordering suggestion**

Valid point. Moving AudioRawFrame to the top of the `isinstance` chains would capture most of the dispatch benefit in one line.

**Proposal**

I'm happy to slim this PR down to just the highest-impact, lowest-controversy changes.
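The lazy-init pattern discussed in this thread can be sketched independently of Pipecat's actual classes. The naming scheme for `name` is a guess; what matters is that neither the string nor the dict exists until first access.

```python
import itertools
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

_ids = itertools.count(1)


@dataclass
class LazyFrame:
    """Sketch of lazy name/metadata: nothing allocated at construction time."""

    _name: Optional[str] = field(default=None, repr=False)
    _metadata: Optional[Dict[str, Any]] = field(default=None, repr=False)

    @property
    def name(self) -> str:
        if self._name is None:
            # String formatted only when somebody actually reads .name.
            self._name = f"{type(self).__name__}#{next(_ids)}"
        return self._name

    @property
    def metadata(self) -> Dict[str, Any]:
        if self._metadata is None:
            # Dict allocated only for frames whose metadata is actually used.
            self._metadata = {}
        return self._metadata
```

For a frame that is constructed, forwarded, and dropped without anyone touching `name` or `metadata`, this saves one string and one dict allocation per frame, which is where the GC-pressure argument above comes from.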
Hi 👋 I'm a Pipecat maintainer. Before we do any optimization, it would be nice to understand if there is a performance issue. I've been doing a lot of latency measurements and I haven't found anything that jumps out at me as a problem. If you have data that shows that there is slowness that needs to be optimized, please share. Until then, I don't think we're ready to make this change as it touches foundational level classes used all over Pipecat.
@markbackman you're absolutely right that I should start by identifying a concrete performance issue. Here's the context: I'm planning a deployment of ~3000 concurrent voice agents. When I attempted a load test with thousands of parallel pipelines in a single process, it failed badly, so I profiled where the resources were going.

**Per-Pipeline Resource Cost (Measured)**

Each pipeline with 6 processors (default config) allocates:
At 3000 concurrent pipelines: 57,000 asyncio tasks, 3000+ OS threads, 300K task wakeups/sec.

**Benchmark: Throughput Degradation**
At 500 concurrent pipelines, per-pipeline throughput is barely above the 50fps audio floor. The event loop starts saturating.

**Root Causes Found**
**Revised Proposal**

I realize my original PR was addressing the wrong layer. The real improvements for large-scale deployment are:
I'm happy to close this PR and open targeted issues/PRs for the above. Or if you'd prefer a single focused PR for just the handler guards and signal handler fix (low-risk, high-impact), I can slim this down.
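The "handler guards" idea mentioned in this proposal can be illustrated like this. The class and method names are hypothetical, sketching the pattern of a synchronous no-listener check that skips coroutine creation entirely.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable, DefaultDict, List

Handler = Callable[..., Awaitable[None]]


class BaseObject:
    """Sketch of an event emitter with a cheap no-listener fast path."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def add_event_handler(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def _has_handlers(self, event: str) -> bool:
        # Plain (non-async) O(1) check; .get avoids inserting an empty entry.
        return bool(self._handlers.get(event))

    async def _call_event_handler(self, event: str, *args: Any) -> None:
        if not self._has_handlers(event):
            # Fast path: return before any handler coroutine is created.
            return
        for handler in self._handlers[event]:
            await handler(*args)
```

When most frames have no listeners registered, the guard turns a per-frame coroutine allocation into a dict lookup and a truth test.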
I see; that's your problem. Real-time agents require real-time communication, which needs a different type of deployment with fixed resources. We recommend a bot per process, where each bot is allocated 0.5 vCPU and 1GB of RAM. For our Pipecat Cloud product, we isolate each agent in its own Python process to ensure it has sufficient resources allocated. Agents scale out successfully without resource issues. I think we should close this PR; it's worth looking at your deployment approach to solve the problem. Happy to chat more about this in Discord if you have questions.
Summary
- `FrameType`/`FrameCategory` enums (`frame_types.py`): integer type identifiers added as a `ClassVar[int] type_id` on every concrete `Frame` subclass, enabling zero-cost type identification without `isinstance` chains.
- `LLMUserAggregator` and `LLMAssistantAggregator`: replaced sequential `isinstance` if/elif chains with a `Dict[int, Callable]` keyed on `frame.type_id`, eliminating N comparisons per frame in hot paths.
- `Frame.name` and `Frame.metadata`: both fields are now computed/allocated only on first access, reducing per-frame allocation cost for the majority of frames that never touch these fields.
- `UserIdleController`: `user_idle_timeout` in `LLMUserAggregatorParams` defaults to `None` (was `0`); the controller is only instantiated when explicitly configured, and a `_IDLE_CONTROLLER_FRAME_TYPES` frozenset guards the hot path to skip the controller call for irrelevant frame types.

Motivation
In high-throughput pipelines (e.g. real-time audio with many short frames), `process_frame` is called thousands of times per second. The original implementation checked frame types via long `isinstance` chains (O(N) per frame) and unconditionally allocated `name`/`metadata` dicts on every frame construction. These changes make the common path O(1) and allocation-free for frames that don't need those fields.

Changes
- `src/pipecat/frames/frame_types.py` (new)
  - `FrameCategory`: 8-bit category prefixes (AUDIO, TEXT, IMAGE, CONTROL, SYSTEM, etc.)
  - `FrameType`: 16-bit type IDs composed as `(category << 8) | subtype`
  - Predicate helpers (`is_audio_frame`, `is_text_frame`, etc.)
- `src/pipecat/frames/frames.py`
  - Every concrete `Frame` subclass now carries a `type_id: ClassVar[int]` matching its `FrameType` constant
  - `Frame.name` → lazy property (backed by `_name`, allocated on first access)
  - `Frame.metadata` → lazy property (backed by `_metadata`, allocated on first access)
  - Eagerly initialized fields retained (`id`, `pts`, `broadcast_sibling_id`, `transport_source`, `transport_destination`)
- `src/pipecat/processors/aggregators/llm_response_universal.py`
  - `LLMUserAggregator._dispatch`: `Dict[int, Callable]` lookup table replaces the `isinstance` chain in `process_frame`
  - `LLMAssistantAggregator._dispatch`: same pattern
  - `_IDLE_CONTROLLER_FRAME_TYPES` frozenset guards `UserIdleController` calls
  - `UserIdleController` only instantiated when `user_idle_timeout` is set
- `src/pipecat/processors/frame_processor.py`
  - `_has_handlers()` fast-path guards on `push_frame` and `__process_frame` event handler calls (avoids coroutine creation when no handlers are registered)
  - `PROCESS_TASK_CANCEL_TIMEOUT_SECS` for `__cancel_process_task()` to prevent indefinite hangs
  - `broadcast_frame_instance` preserves lazy `_metadata` on copied frames
- `src/pipecat/transports/base_output.py`
  - `del self._audio_buffer[:chunk_size]` replaces slice copy
- `src/pipecat/utils/base_object.py`
  - `_has_handlers(event_name)`: non-async O(1) check for registered event handlers

Test plan
- `uv run pytest` passes (406 passed, 0 failed)
- `uv run ruff check` passes
- `uv run ruff format --check` passes
- Existing public signatures preserved (`push_interruption_task_frame_and_wait(timeout=)`, `start_ttfb_metrics(start_time=)`, `broadcast_sibling_id`, `write_transport_frame`)

Notes
- `FrameCategory` is re-exported from `frames.py` via `# noqa: F401` for downstream consumers that import from `pipecat.frames.frames`.
- The dispatch tables are built in `__init__` and never mutated, so there is no thread-safety concern.
- Callers that invoke `process_frame` directly (not via the aggregators) are unaffected; `type_id` is purely additive.
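One micro-change from the list above, `del self._audio_buffer[:chunk_size]` in `base_output.py`, trades a slice copy for in-place deletion. A minimal demonstration of the difference (buffer contents and chunk size are illustrative):

```python
CHUNK_SIZE = 4

# Before: slicing off the remainder allocates a brand-new bytearray per chunk.
buf_sliced = bytearray(b"abcdefgh")
chunk_a = bytes(buf_sliced[:CHUNK_SIZE])
buf_sliced = buf_sliced[CHUNK_SIZE:]  # fresh allocation + copy of the tail

# After: del shifts the remaining bytes in place, reusing the same object.
buf_inplace = bytearray(b"abcdefgh")
chunk_b = bytes(buf_inplace[:CHUNK_SIZE])
del buf_inplace[:CHUNK_SIZE]  # mirrors del self._audio_buffer[:chunk_size]
```

Both leave the buffer holding the unread tail; the `del` form avoids allocating a new bytearray for the remainder on every audio chunk.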