[BUG] copy.deepcopy(agent) fails on second gateway channel — Agent RLock not picklable #1746

@Dhivya-Bharathy

Labels: bug, gateway, telegram, multi-channel


Overview

PraisonAI's gateway is designed to run multiple messaging channels from a single process — for example, three Telegram bots (CFO, Ops, Content) each routed to a different agent. This is the standard Hermes-style workforce pattern: one VPS, one gateway, many bots.

When the gateway starts the second channel bot, startup fails with a Python pickling error. The first channel may initialize successfully; every subsequent channel is skipped or crashes during _create_bot(). Multi-bot deployments therefore require a userland monkey-patch today.


What the user sees

Console / log output

When starting a gateway with 2+ Telegram channels (each bound to its own agent), the gateway logs something like:

Failed to create bot for 'telegram_ops': cannot pickle '_thread.RLock' object

Or, depending on Python version and call stack:

TypeError: cannot pickle '_thread.RLock' object

In a three-channel workforce setup (telegram_cfo, telegram_ops, telegram_content), typical startup looks like:

Channel 'telegram_cfo' (telegram) initialized
Failed to create bot for 'telegram_ops': cannot pickle '_thread.RLock' object
Failed to create bot for 'telegram_content': cannot pickle '_thread.RLock' object
Started 1 channel bot(s)

The operator believes all three bots are live. Only one actually polls Telegram.

Workaround in the wild

Deployments that need a multi-channel gateway today patch WebSocketGateway._create_bot to skip copy.deepcopy(agent) and pass each channel its pre-built, dedicated agent instance instead. That unblocks startup, but it should not be required on a supported multi-channel path.


Architecture — how multi-channel gateway is supposed to work

flowchart TB
    subgraph Gateway["WebSocketGateway (single process, port 8765)"]
        YAML["gateway.yaml"]
        Agents["agents: { cfo, ops, content }"]
        Channels["channels: { telegram_cfo, telegram_ops, telegram_content }"]
    end

    YAML --> Agents
    YAML --> Channels

    Channels --> C1["_create_bot(telegram_cfo)"]
    Channels --> C2["_create_bot(telegram_ops)"]
    Channels --> C3["_create_bot(telegram_content)"]

    C1 --> B1["TelegramBot → @cfo_bot → cfo agent"]
    C2 --> B2["TelegramBot → @ops_bot → ops agent"]
    C3 --> B3["TelegramBot → @content_bot → content agent"]

    Agents --> A1["Agent(cfo)"]
    Agents --> A2["Agent(ops)"]
    Agents --> A3["Agent(content)"]

    C1 -.->|"deepcopy(agent)"| A1
    C2 -.->|"deepcopy(agent) 💥 RLock"| A2
    C3 -.->|"deepcopy(agent) 💥 RLock"| A3

Intended design: Each channel gets its own agent instance so channel-specific settings (tools, memory, session) do not leak between bots. The gateway achieves isolation by calling copy.deepcopy(agent) inside _create_bot() before wrapping the agent in a TelegramBot.

What breaks: Modern Agent objects hold a threading.RLock (used for cache/thread safety). RLocks are not picklable and not deep-copyable. The moment the gateway tries to clone the second agent, Python raises TypeError: cannot pickle '_thread.RLock' object.
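
A minimal sketch of the failure, using a hypothetical stand-in class rather than the real praisonaiagents Agent. The only assumption carried over from this report is that the object stores a threading.RLock as an instance attribute; the class and helper names are illustrative.

```python
import copy
import threading


class StandInAgent:
    """Hypothetical stand-in; not the real praisonaiagents Agent."""

    def __init__(self, name):
        self.name = name
        # Same kind of attribute the report describes on Agent.
        self._cache_lock = threading.RLock()


def try_deepcopy(obj):
    """Return the deep copy, or the TypeError that copy.deepcopy raised."""
    try:
        return copy.deepcopy(obj)
    except TypeError as exc:
        return exc


result = try_deepcopy(StandInAgent("ops"))
print(type(result).__name__, result)
```

With this stand-in the copy fails on every call, including the first, so it reproduces the error message but does not by itself explain why the first channel can initialize in the real gateway.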


Step-by-step failure sequence

  1. Operator defines gateway.yaml with multiple channels and multiple agents (Hermes workforce pattern).
  2. Gateway loads all agents into self._agents.
  3. Gateway iterates channels and calls _create_bot() for each.
  4. Inside _create_bot():
    import copy
    agent = copy.deepcopy(agent)   # ← fails here on 2nd+ channel
  5. Agent.__deepcopy__ or the default deepcopy walks the object graph and hits self.__cache_lock = threading.RLock().
  6. Python cannot serialize RLock → exception → channel skipped → "Started N channel bot(s)" with N < expected.

Why this matters

| Impact | Detail |
| --- | --- |
| Blocks Hermes parity | Hermes runs multiple Telegram bots on one gateway. PraisonAI cannot do this out of the box. |
| Silent partial failure | Gateway may report "Started 1 channel bot(s)" while 2 channels failed; easy to miss in logs. |
| Forces unsafe workarounds | Monkey-patching _create_bot bypasses intended isolation; operators may share agent state unintentionally. |
| Affects any multi-channel setup | Not Telegram-specific: any second channel that deep-copies an Agent with memory/tools enabled will hit this. |

Root cause (technical)

The gateway assumes Agent is deep-copy safe. It is not, because:

  • praisonaiagents.agent.agent.Agent initializes self.__cache_lock = threading.RLock() eagerly (thread-safe cache access).
  • copy.deepcopy() falls back to the pickle protocol (__reduce_ex__) for objects that define no __deepcopy__ of their own.
  • threading.RLock is an OS-level synchronization primitive with no meaningful duplicate, so Python refuses to copy it.

It is not yet confirmed why the first channel often initializes. One possibility is that the first agent is deep-copied before any concurrent access has complicated its state; another is that the failure is simply deterministic from the second call onward. In both cases, multi-channel startup is broken.
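
To confirm which attribute blocks the copy on a given Agent build, a diagnostic along these lines may help. It is only a sketch: find_uncopyable_attrs is a hypothetical helper, the class is a stand-in, and the helper inspects top-level instance attributes only, so a lock nested inside another attribute is reported under that attribute's name.

```python
import copy
import threading


def find_uncopyable_attrs(obj):
    """Map attribute name -> error text for each top-level attribute deepcopy rejects."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            copy.deepcopy(value)
        except Exception as exc:  # for locks this is a TypeError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class StandInAgent:
    """Hypothetical stand-in; not the real praisonaiagents Agent."""

    def __init__(self):
        self.name = "ops"
        self.tools = ["search"]
        self._cache_lock = threading.RLock()


report = find_uncopyable_attrs(StandInAgent())
print(report)
```

On the real class, a double-underscore attribute such as __cache_lock appears in vars() under its name-mangled form (_Agent__cache_lock).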


Expected behavior

  • Gateway with 3 Telegram channels and 3 agents starts all 3 bots without error.
  • Each channel receives an isolated agent instance (no shared mutable session/tools state).
  • No userland monkey-patch required.

Proposed fix

Option A — Agent.clone_for_channel() (recommended)

Add a first-class SDK method that produces a channel-safe clone:

  • Shallow-copy safe fields (name, model, tools list, instructions).
  • Re-create a fresh RLock instead of copying the old one.
  • Reset or fork session store handle per channel.
  • Gateway calls agent.clone_for_channel() instead of copy.deepcopy(agent).
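
The bullets above could be sketched roughly as follows. Everything here is an assumption made for illustration: the stand-in class, its field names, and the dict used as a placeholder for the session store do not come from the real Agent.

```python
import copy
import threading


class StandInAgent:
    """Hypothetical stand-in; field names are illustrative only."""

    def __init__(self, name, model, tools, instructions):
        self.name = name
        self.model = model
        self.tools = tools
        self.instructions = instructions
        self.session = {}  # placeholder for a session store handle
        self._cache_lock = threading.RLock()

    def clone_for_channel(self):
        # Shallow copy: attribute values are shared and the lock is never pickled.
        clone = copy.copy(self)
        clone.tools = list(self.tools)  # new list so per-channel edits stay local
        clone.session = {}  # fresh session state for the new channel
        clone._cache_lock = threading.RLock()  # new lock, not the shared reference
        return clone


base = StandInAgent("ops", "example-model", ["search"], "Handle ops requests.")
clone = base.clone_for_channel()
clone.tools.append("calendar")
print(base.tools, clone.tools)
```

Note that only the tools list is duplicated in this sketch; the tool objects inside it are still shared. An alternative with a similar effect would be a __deepcopy__ hook on Agent that recreates the lock, which would leave the gateway call site unchanged.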

Option B — Skip deepcopy when agents are already dedicated

If gateway.yaml maps each channel to a distinct agent ID (no sharing), skip clone entirely and document that shared-agent multi-channel requires explicit clone support.
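
One way to decide when the clone can be skipped, assuming the config has already been parsed into a channel-name to agent-ID mapping. That mapping shape and the function name are assumptions for this sketch, not the gateway.yaml schema.

```python
from collections import Counter


def agents_needing_clone(channel_to_agent):
    """Return the agent IDs that are referenced by more than one channel."""
    counts = Counter(channel_to_agent.values())
    return {agent_id for agent_id, n in counts.items() if n > 1}


dedicated = {"telegram_cfo": "cfo", "telegram_ops": "ops", "telegram_content": "content"}
shared = {"telegram_a": "ops", "telegram_b": "ops", "telegram_c": "cfo"}

print(agents_needing_clone(dedicated))  # set()
print(agents_needing_clone(shared))     # {'ops'}
```

An empty result means every channel already has a dedicated agent, so the instance can be passed through as-is; a non-empty result marks the cases that would need explicit clone support.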

Tests

  • Unit test: gateway config with 2+ channels → all _channel_bots populated.
  • Regression test: assert that _create_bot() never calls copy.deepcopy() on an Agent that holds an RLock.
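
A rough shape for both tests, written against stand-ins because the real gateway fixtures are not shown in this issue. StubAgent, create_bots, and the clone_for_channel method (from Option A) are hypothetical; a real test would exercise the gateway's _create_bot() and inspect _channel_bots instead.

```python
import copy
import threading
from unittest import mock


class StubAgent:
    """Hypothetical stand-in with the clone method proposed in Option A."""

    def __init__(self, name):
        self.name = name
        self._cache_lock = threading.RLock()

    def clone_for_channel(self):
        clone = copy.copy(self)
        clone._cache_lock = threading.RLock()
        return clone


def create_bots(channel_to_agent):
    """Hypothetical stand-in for the gateway's per-channel bot creation."""
    return {ch: agent.clone_for_channel() for ch, agent in channel_to_agent.items()}


def test_all_channels_start_without_deepcopy():
    agents = {f"telegram_{n}": StubAgent(n) for n in ("cfo", "ops", "content")}
    # Replace copy.deepcopy with a spy for the duration of bot creation.
    with mock.patch("copy.deepcopy") as deepcopy_spy:
        bots = create_bots(agents)
    deepcopy_spy.assert_not_called()
    assert set(bots) == set(agents)  # every channel got a bot
    assert all(bots[ch] is not agents[ch] for ch in agents)  # each one is a clone
```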

Related issues


Acceptance criteria

  • Gateway with 3 Telegram channels starts all 3 bots — no cannot pickle '_thread.RLock' object
  • Channel-specific agent settings do not leak between channels (session, tools, memory)
  • Unit/integration test covers multi-channel _create_bot without monkey-patch
  • Log message clearly reports partial channel failure if any channel still fails for other reasons
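
For the last criterion, a startup summary along these lines would make a partial failure hard to miss. The function name, logger name, and message wording are assumptions for this sketch, not the gateway's current logging.

```python
import logging

logger = logging.getLogger("gateway.startup")  # hypothetical logger name


def report_startup(expected_channels, started_channels, failures):
    """Log one summary line, at ERROR level when any channel is missing.

    failures maps channel name -> error message.
    Returns True only when every expected channel started.
    """
    missing = [ch for ch in expected_channels if ch not in started_channels]
    total = len(expected_channels)
    if not missing:
        logger.info("Started %d/%d channel bot(s)", len(started_channels), total)
        return True
    details = "; ".join(f"{ch}: {failures.get(ch, 'unknown error')}" for ch in missing)
    logger.error(
        "Started %d/%d channel bot(s); FAILED: %s",
        len(started_channels), total, details,
    )
    return False
```

Returning a boolean lets the caller choose between continuing with a degraded gateway and exiting with a non-zero status.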
