[Enhancement] /health should expose Telegram polling conflicts (409) and last bot error

# [Enhancement] `/health` should expose Telegram polling conflicts (409) and last bot error

**Labels:** `enhancement`, `gateway`, `observability`, `telegram`

---

## Overview

PraisonAI exposes `GET /health` on the gateway (default `http://127.0.0.1:8765/health`). Operators use it to verify the workforce is running. Today it reports each channel as `"running": true` based on a simple boolean flag, with **no last error, no Telegram conflict detection, and no distinction between "polling successfully" and "stuck in a retry loop after a 409 Conflict".**

When two gateway processes poll the same Telegram bot token, one wins and the other fails with a **409 Conflict** from Telegram's `getUpdates` API. The loser retries silently, and the failure shows up only in the logs. Health still shows green.

This was observed in a Hermes workforce deployment: the CFO bot stopped replying after a duplicate gateway restart, while `/health` reported all channels healthy.

---

## What the user sees

### Health endpoint today (misleading)

```json
{
  "status": "healthy",
  "uptime": 64.76,
  "agents": 3,
  "sessions": 0,
  "clients": 0,
  "channels": {
    "telegram_cfo": { "platform": "telegram", "running": true },
    "telegram_ops": { "platform": "telegram", "running": true },
    "telegram_content": { "platform": "telegram", "running": true }
  }
}
```

All channels show `"running": true` even when Telegram polling is broken.

### Log output (buried; operator may not see it)

When a second process steals the Telegram poll:

```
ERROR Bot 'telegram_cfo' crashed: Conflict: terminated by other getUpdates request;
make sure that only one bot instance is running
INFO Reconnecting 'telegram_cfo' in 5s...
```

Or from python-telegram-bot:

```
telegram.error.Conflict: Conflict: terminated by other getUpdates request
```

The gateway retries up to 5 times with exponential backoff, then gives up, but `/health` never reflects this.

### Operator experience

1. User messages `@mervincfo_bot` — no reply.
2. Operator curls `/health` — everything looks fine.
3. Operator assumes an OpenAI/API issue and burns time debugging the wrong layer.
4. Actual fix: kill duplicate `python start_gateway.py` processes.

---

## Architecture — health vs reality gap

```mermaid
flowchart TB
    subgraph HealthEndpoint["GET /health"]
        H1["For each channel bot"]
        H2["running = bot.is_running"]
        H3["Return JSON"]
        H1 --> H2 --> H3
    end

    subgraph BotRuntime["Channel bot runtime"]
        R1["_run_bot_safe() retry loop"]
        R2["_start_telegram_bot_polling()"]
        R3["PTB updater.start_polling()"]
        R1 --> R2 --> R3
    end

    subgraph TelegramAPI["Telegram Bot API"]
        T1["getUpdates long poll"]
        T2["409 Conflict if two pollers same token"]
    end

    R3 --> T1
    T1 --> T2

    T2 -.->|"error logged only"| R1
    H2 -.->|"does not read last_error"| H3

    style H2 fill:#FFB6C1
    style T2 fill:#FF6347
```

**The gap:** The runtime catches exceptions in `_run_bot_safe` and only logs them. Health reads `bot.is_running`, which may stay `True` briefly or not reflect the polling failure at all. **The last error is discarded.**

---

## Telegram 409 Conflict explained

Telegram allows **at most one active `getUpdates` connection per bot token**. If Process A and Process B both call `getUpdates`:

- One receives updates normally.
- The other gets HTTP **409 Conflict** with message: *"terminated by other getUpdates request"*.
- python-telegram-bot surfaces this as `telegram.error.Conflict`.

In multi-process or duplicate-gateway scenarios, this is typically the **most common cause** of "bot started but silent", ahead of API quota or code bugs.

---

## Proposed health response shape

```json
{
  "status": "degraded",
  "channels": {
    "telegram_cfo": {
      "platform": "telegram",
      "running": false,
      "error_code": "telegram_conflict",
      "last_error": "telegram_conflict: terminated by other getUpdates request",
      "last_error_at": "2026-05-26T06:10:06Z",
      "retry_count": 3
    }
  }
}
```

The gateway's overall `status` should be `"degraded"`, not `"healthy"`, when any channel has a fatal polling error.
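
The aggregation rule could look roughly like the sketch below. The function name `overall_status` is hypothetical, and treating a channel as failed whenever `running` is false (rather than inspecting `last_error`) is an assumption, chosen so that a channel that has recovered is not held back by a stale error.

```python
from typing import Any, Dict


def overall_status(channels: Dict[str, Dict[str, Any]]) -> str:
    """Return "degraded" if any channel reports running == False, else "healthy".

    Hypothetical helper: the real gateway may use a different rule.
    """
    for info in channels.values():
        if not info.get("running", False):
            return "degraded"
    return "healthy"


channels = {
    "telegram_cfo": {
        "platform": "telegram",
        "running": False,
        "last_error": "telegram_conflict: terminated by other getUpdates request",
    },
    "telegram_ops": {"platform": "telegram", "running": True},
}
print(overall_status(channels))  # degraded
```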

---

## Proposed fix

### 1. Track last error on each channel bot

Add to the bot instance or the gateway channel registry:

- `last_error: str | None`
- `last_error_at: str | None` (ISO 8601 timestamp)
- `retry_count: int`

Update these in the `_run_bot_safe` exception handler.
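
A minimal sketch of that bookkeeping, under stated assumptions: `ChannelState` and this standalone `run_bot_safe` are hypothetical stand-ins, not the gateway's actual `_run_bot_safe`. The limit of 5 retries and the 5s first delay come from the log output above; doubling the delay on each retry is an assumption about what "exponential backoff" means here.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class ChannelState:
    """Per-channel health fields (hypothetical container)."""
    running: bool = False
    last_error: Optional[str] = None
    last_error_at: Optional[str] = None  # ISO 8601, UTC
    retry_count: int = 0


def run_bot_safe(
    start: Callable[[], None],
    state: ChannelState,
    max_retries: int = 5,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run `start`, retrying on failure and recording every error in `state`."""
    for attempt in range(max_retries + 1):
        # Simplification: a real bot would flip this after its first good poll.
        state.running = True
        try:
            start()  # blocks while polling; returns on clean shutdown
            state.running = False
            return
        except Exception as exc:
            state.running = False
            state.last_error = f"{type(exc).__name__}: {exc}"
            state.last_error_at = datetime.now(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            )
            if attempt == max_retries:
                return  # retries exhausted; state keeps the final error
            state.retry_count += 1
            sleep(base_delay * 2 ** attempt)  # assumed schedule: 5s, 10s, 20s, ...


def always_conflict() -> None:
    raise RuntimeError("Conflict: terminated by other getUpdates request")


demo = ChannelState()
run_bot_safe(always_conflict, demo, sleep=lambda seconds: None)  # skip real waits
print(demo.running, demo.retry_count, demo.last_error)
```

Injecting `sleep` keeps the loop testable without waiting through the backoff; the health handler would then read `state` instead of a bare boolean.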

### 2. Parse Telegram-specific failures

Detect `telegram.error.Conflict` or the message substring `"terminated by other getUpdates"`, then set `error_code: "telegram_conflict"` with an operator-facing hint:

> Another process is polling this bot token. Run `praisonai gateway status` to find duplicate instances.
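
The detection step might be sketched as follows. `classify_error`, `CONFLICT_MARKER`, and `CONFLICT_HINT` are hypothetical names. To keep the example free of third-party imports it matches on the exception's class name; inside the gateway, an `isinstance` check against `telegram.error.Conflict` would be the more robust choice.

```python
from typing import Dict, Optional

CONFLICT_MARKER = "terminated by other getUpdates"
CONFLICT_HINT = (
    "Another process is polling this bot token. "
    "Run `praisonai gateway status` to find duplicate instances."
)


def classify_error(exc: BaseException) -> Dict[str, Optional[str]]:
    """Map an exception to an error code and operator hint (hypothetical helper)."""
    # Class-name match stands in for isinstance(exc, telegram.error.Conflict).
    is_conflict = type(exc).__name__ == "Conflict" or CONFLICT_MARKER in str(exc)
    if is_conflict:
        return {"error_code": "telegram_conflict", "hint": CONFLICT_HINT}
    return {"error_code": None, "hint": None}


err = RuntimeError("Conflict: terminated by other getUpdates request")
print(classify_error(err)["error_code"])  # telegram_conflict
```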

### 3. Surface in CLI

`praisonai gateway status` should print channel errors, not just up/down.
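
One possible rendering, sketched from the proposed health payload. The function `format_channel_status` and the exact line layout are illustrative assumptions, not existing CLI output.

```python
from typing import Any, Dict, List


def format_channel_status(health: Dict[str, Any]) -> List[str]:
    """Turn a /health payload into human-readable lines (hypothetical layout)."""
    lines = [f"gateway: {health.get('status', 'unknown')}"]
    for name, info in sorted(health.get("channels", {}).items()):
        state = "up" if info.get("running") else "down"
        line = f"  {name} ({info.get('platform', '?')}): {state}"
        if info.get("last_error"):
            line += f" | {info['last_error']}"
            if info.get("last_error_at"):
                line += f" at {info['last_error_at']}"
            if "retry_count" in info:
                line += f" (retries: {info['retry_count']})"
        lines.append(line)
    return lines


health = {
    "status": "degraded",
    "channels": {
        "telegram_cfo": {
            "platform": "telegram",
            "running": False,
            "last_error": "telegram_conflict: terminated by other getUpdates request",
            "last_error_at": "2026-05-26T06:10:06Z",
            "retry_count": 3,
        },
        "telegram_ops": {"platform": "telegram", "running": True},
    },
}
print("\n".join(format_channel_status(health)))
```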

### 4. Doctor check

`praisonai doctor gateway`:

- Is port 8765 in use by a foreign process?
- Are there multiple gateway PIDs?
- Does any channel show `telegram_conflict` in the last 5 minutes?
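
Two of these checks can be sketched with the standard library alone. `port_in_use` and `recent_conflicts` are hypothetical helpers: the port probe only shows that something is listening, not which process owns the port, and enumerating gateway PIDs is platform-specific, so it is left out. The conflict check assumes the timestamp format shown in the proposed health response.

```python
import socket
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional


def port_in_use(port: int = 8765, host: str = "127.0.0.1") -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        return sock.connect_ex((host, port)) == 0


def recent_conflicts(
    health: Dict[str, Any],
    now: Optional[datetime] = None,
    window: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Names of channels whose last error is a telegram_conflict within `window`."""
    now = now or datetime.now(timezone.utc)
    hits = []
    for name, info in health.get("channels", {}).items():
        is_conflict = (
            info.get("error_code") == "telegram_conflict"
            or "telegram_conflict" in (info.get("last_error") or "")
        )
        stamp = info.get("last_error_at")
        if not is_conflict or not stamp:
            continue
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
            tzinfo=timezone.utc
        )
        if timedelta(0) <= now - when <= window:
            hits.append(name)
    return hits


sample = {
    "channels": {
        "telegram_cfo": {
            "last_error": "telegram_conflict: terminated by other getUpdates request",
            "last_error_at": "2026-05-26T06:10:06Z",
        }
    }
}
check_time = datetime(2026, 5, 26, 6, 12, 0, tzinfo=timezone.utc)
print(recent_conflicts(sample, now=check_time))  # ['telegram_cfo']
```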

---

## Acceptance criteria

- [ ] `/health` includes `last_error` and `last_error_at` per channel when polling fails
- [ ] Telegram 409 sets `running: false` and `error_code: telegram_conflict`
- [ ] Overall gateway `status` is `degraded` when any channel is failed
- [ ] `praisonai gateway status` displays channel errors in human-readable form
- [ ] Doctor warns on duplicate gateway / conflict condition


[Enhancement] /health should expose Telegram polling conflicts (409) and last bot error #1750
