
[Enhancement] /health should expose Telegram polling conflicts (409) and last bot error #1750

@Dhivya-Bharathy


Labels: enhancement, gateway, observability, telegram


Overview

PraisonAI exposes GET /health on the gateway (default http://127.0.0.1:8765/health). Operators use it to verify the workforce is running. Today it reports each channel as "running": true based on a simple boolean flag, with no last error, no Telegram conflict detection, and no distinction between "polling successfully" and "stuck in a retry loop after a 409 Conflict".

When two gateway processes poll the same Telegram bot token, one wins and the other fails with a 409 Conflict from Telegram's getUpdates API. The loser may retry silently forever. Health still shows green.

This was observed in a Hermes workforce deployment: CFO bot stopped replying after a duplicate gateway restart; /health reported all channels healthy.


What the user sees

Health endpoint today (misleading)

{
  "status": "healthy",
  "uptime": 64.76,
  "agents": 3,
  "sessions": 0,
  "clients": 0,
  "channels": {
    "telegram_cfo": { "platform": "telegram", "running": true },
    "telegram_ops": { "platform": "telegram", "running": true },
    "telegram_content": { "platform": "telegram", "running": true }
  }
}

All channels show "running": true even when Telegram polling is broken.

Log output (buried; the operator may not see it)

When a second process steals the Telegram poll:

ERROR Bot 'telegram_cfo' crashed: Conflict: terminated by other getUpdates request;
make sure that only one bot instance is running
INFO Reconnecting 'telegram_cfo' in 5s...

Or from python-telegram-bot:

telegram.error.Conflict: Conflict: terminated by other getUpdates request

The gateway retries up to 5 times with exponential backoff, then gives up, but /health never reflects this.

Operator experience

  1. User messages @mervincfo_bot — no reply.
  2. Operator curls /health — everything looks fine.
  3. Operator assumes an OpenAI/API issue and burns time debugging the wrong layer.
  4. Actual fix: kill duplicate python start_gateway.py processes.

Architecture — health vs reality gap

flowchart TB
    subgraph HealthEndpoint["GET /health"]
        H1["For each channel bot"]
        H2["running = bot.is_running"]
        H3["Return JSON"]
        H1 --> H2 --> H3
    end

    subgraph BotRuntime["Channel bot runtime"]
        R1["_run_bot_safe() retry loop"]
        R2["_start_telegram_bot_polling()"]
        R3["PTB updater.start_polling()"]
        R1 --> R2 --> R3
    end

    subgraph TelegramAPI["Telegram Bot API"]
        T1["getUpdates long poll"]
        T2["409 Conflict if two pollers same token"]
    end

    R3 --> T1
    T1 --> T2

    T2 -.->|"error logged only"| R1
    H2 -.->|"does not read last_error"| H3

    style H2 fill:#FFB6C1
    style T2 fill:#FF6347

The gap: the runtime captures exceptions in _run_bot_safe and logs them. Health reads bot.is_running, which may stay True briefly or never reflect the polling failure at all. The last error is discarded.


Telegram 409 Conflict explained

Telegram allows at most one active getUpdates connection per bot token. If Process A and Process B both call getUpdates:

  • One receives updates normally.
  • The other gets HTTP 409 Conflict with message: "terminated by other getUpdates request".
  • python-telegram-bot surfaces this as telegram.error.Conflict.

This is the most common cause of "bot started but silent" in multi-process or duplicate-gateway scenarios — more common than API quota or code bugs.


Proposed health response shape

{
  "status": "degraded",
  "channels": {
    "telegram_cfo": {
      "platform": "telegram",
      "running": false,
      "error_code": "telegram_conflict",
      "last_error": "telegram_conflict: terminated by other getUpdates request",
      "last_error_at": "2026-05-26T06:10:06Z",
      "retry_count": 3
    }
  }
}

Gateway overall status should be "degraded" when any channel has a fatal polling error, not "healthy".
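
The status rule above could be derived from the per-channel entries roughly as follows. This is a minimal sketch, not the gateway's actual code: the function name overall_status is hypothetical, and it assumes a channel counts as failed whenever its "running" field is false or missing.

```python
from typing import Any, Dict


def overall_status(channels: Dict[str, Dict[str, Any]]) -> str:
    """Return "degraded" if any channel is not running, else "healthy".

    Assumption: each value is a dict shaped like a /health channel entry,
    and a missing "running" key is treated as not running.
    """
    for info in channels.values():
        if not info.get("running", False):
            return "degraded"
    return "healthy"


if __name__ == "__main__":
    channels = {
        "telegram_cfo": {"platform": "telegram", "running": False},
        "telegram_ops": {"platform": "telegram", "running": True},
    }
    print(overall_status(channels))
```

A stricter variant might also report "degraded" while a channel is still retrying; which threshold to use is a design decision for the maintainers.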


Proposed fix

1. Track last error on each channel bot

Add to bot instance or gateway channel registry:

  • last_error: str | null
  • last_error_at: ISO timestamp
  • retry_count: int

Update in _run_bot_safe exception handler.
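
One possible shape for this tracking is sketched below. ChannelHealth and run_bot_safe are hypothetical stand-ins, not the existing PraisonAI classes. The retry limit of 5 and the 5s starting delay are taken from the log output and description in this issue. The doubling backoff formula is an assumption.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


@dataclass
class ChannelHealth:
    """Hypothetical per-channel record that /health could serialize."""
    platform: str
    running: bool = False
    last_error: Optional[str] = None
    last_error_at: Optional[str] = None  # ISO 8601 timestamp in UTC
    retry_count: int = 0

    def record_error(self, exc: BaseException) -> None:
        # Keep the most recent failure instead of only logging it.
        self.running = False
        self.last_error = f"{type(exc).__name__}: {exc}"
        self.last_error_at = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        self.retry_count += 1


async def run_bot_safe(
    start: Callable[[], Awaitable[None]],
    health: ChannelHealth,
    max_retries: int = 5,
    base_delay: float = 5.0,
) -> None:
    """Illustrative retry loop; the real _run_bot_safe may differ."""
    for attempt in range(max_retries):
        try:
            health.running = True
            await start()           # returns only when the bot stops
            health.running = False  # clean shutdown
            return
        except Exception as exc:
            health.record_error(exc)
            if attempt + 1 < max_retries:
                # Assumed backoff: base_delay, 2x, 4x, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
```

With this in place, the health handler could serialize each ChannelHealth (for example via dataclasses.asdict) instead of reading only bot.is_running.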

2. Parse Telegram-specific failures

Detect telegram.error.Conflict or message substring "terminated by other getUpdates" → set error_code: "telegram_conflict" with operator-facing hint:

Another process is polling this bot token. Run praisonai gateway status to find duplicate instances.
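
The detection step could look something like the sketch below. classify_bot_error is a hypothetical helper. It matches on the exception class name and the message substring so that it runs without python-telegram-bot installed; inside the gateway, an isinstance check against telegram.error.Conflict would be the more direct test.

```python
from typing import Optional, Tuple

# Substring Telegram includes in the 409 description, as quoted in this issue.
CONFLICT_MARKER = "terminated by other getUpdates"

CONFLICT_HINT = (
    "Another process is polling this bot token. "
    "Run praisonai gateway status to find duplicate instances."
)


def classify_bot_error(exc: BaseException) -> Tuple[Optional[str], Optional[str]]:
    """Return (error_code, hint) for a bot exception, or (None, None).

    Assumption: a class named "Conflict" or a message containing
    CONFLICT_MARKER indicates a Telegram polling conflict.
    """
    is_conflict = type(exc).__name__ == "Conflict" or CONFLICT_MARKER in str(exc)
    if is_conflict:
        return "telegram_conflict", CONFLICT_HINT
    return None, None


if __name__ == "__main__":
    err = RuntimeError("Conflict: terminated by other getUpdates request")
    code, hint = classify_bot_error(err)
    print(code)
    print(hint)
```

Returning a stable code separately from the free-text message lets /health consumers and the doctor check match on the code rather than parse log strings.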

3. Surface in CLI

praisonai gateway status should print channel errors, not just up/down.
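
As a rough illustration of a human-readable rendering, the sketch below formats channel entries shaped like the proposed /health response. format_channel_status and the column layout are assumptions, not the current CLI output.

```python
from typing import Any, Dict, List


def format_channel_status(channels: Dict[str, Dict[str, Any]]) -> List[str]:
    """Render one line per channel, plus error details when present."""
    lines: List[str] = []
    for name in sorted(channels):
        info = channels[name]
        state = "up" if info.get("running") else "DOWN"
        lines.append(f"{name:<20} {info.get('platform', '?'):<10} {state}")
        if info.get("last_error"):
            when = info.get("last_error_at", "unknown time")
            retries = info.get("retry_count", 0)
            lines.append(f"    last error: {info['last_error']}")
            lines.append(f"    at {when}, retries: {retries}")
    return lines


if __name__ == "__main__":
    sample = {
        "telegram_cfo": {
            "platform": "telegram",
            "running": False,
            "last_error": "telegram_conflict: terminated by other getUpdates request",
            "last_error_at": "2026-05-26T06:10:06Z",
            "retry_count": 3,
        },
        "telegram_ops": {"platform": "telegram", "running": True},
    }
    print("\n".join(format_channel_status(sample)))
```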

4. Doctor check

praisonai doctor gateway:

  • Port 8765 in use by foreign process?
  • Multiple gateway PIDs?
  • Any channel with telegram_conflict in last 5 minutes?
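
The third check could be computed from the /health payload along these lines. This sketch assumes the proposed fields (error_code or last_error, plus last_error_at as a UTC ISO timestamp). The function name recent_conflicts is hypothetical. The port and PID checks are left out because they depend on platform-specific process inspection.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def recent_conflicts(
    channels: Dict[str, Dict[str, Any]],
    now: Optional[datetime] = None,
    window: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return names of channels with a telegram_conflict inside the window."""
    now = now or datetime.now(timezone.utc)
    hits: List[str] = []
    for name, info in channels.items():
        error = info.get("last_error") or ""
        stamp = info.get("last_error_at")
        is_conflict = (
            info.get("error_code") == "telegram_conflict"
            or "telegram_conflict" in error
        )
        if not is_conflict or not stamp:
            continue
        # Older Python versions reject a trailing "Z" in fromisoformat().
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if timedelta(0) <= now - when <= window:
            hits.append(name)
    return sorted(hits)
```

A doctor command could then warn for every name returned and repeat the operator hint about duplicate gateway processes.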

Acceptance criteria

  • /health includes last_error and last_error_at per channel when polling fails
  • Telegram 409 sets running: false and error_code: telegram_conflict
  • Overall gateway status is degraded when any channel is failed
  • praisonai gateway status displays channel errors in human-readable form
  • Doctor warns on duplicate gateway / conflict condition
