Wrapper defects: scheduler timeout dies in its own worker thread, tool-class loader is duplicated/bypasses ToolResolver, and handoff/web/planning have no YAML surface

## Scope

In-depth analysis of `src/praisonai/praisonai` (the **wrapper** layer), focused only on key architectural/behavioural gaps — **not** docs, tests, coverage, file sizes, or line counts.

These three are deliberately chosen to **not overlap** with the already-open wrapper audits #1722 (dead framework runners / async-vs-sync scheduler drift / process singletons) and #1602 (OpenAI client singleton / `0.0.0.0` bind / duplicated `ScheduleParser`). Each finding below was read end-to-end on `claude/bold-bohr-YAv0w`, every line number was checked against the current tree, and one is reproduced with a runnable snippet. Each maps to a stated principle: *"multi-agent + async safe by default"*, *"DRY / single source of truth"*, and *"each feature runs 3 ways: CLI, YAML, Python."*

---

## 1. `AgentScheduler` advertises "Thread-safe operation" but its `timeout` is dead in its own worker thread, and its stats counters race

### Where
`src/praisonai/praisonai/scheduler/agent_scheduler.py`

The class docstring promises thread-safety:
```python
# agent_scheduler.py:19-28
class AgentScheduler:
    """
    Scheduler for running PraisonAI agents periodically.

    Features:
    - Interval-based scheduling (hourly, daily, custom)
    - Thread-safe operation          # <-- claim
    - Automatic retry on failure
    ...
```

Scheduled runs execute on a background daemon thread:
```python
# agent_scheduler.py:115-120
self._thread = threading.Thread(
    target=self._run_schedule,
    args=(interval, max_retries),
    daemon=True
)
self._thread.start()
```

### 1a. The per-execution timeout cannot fire — `signal.signal` only works on the main thread

`_execute_with_retry` (reached from `_run_schedule` on the worker thread) installs a `SIGALRM` handler:
```python
# agent_scheduler.py:244-259
if self.timeout:
    import signal

    def timeout_handler(signum, frame):
        raise TimeoutError(f"Execution exceeded {self.timeout}s timeout")

    # Set timeout alarm (Unix only)
    try:
        signal.signal(signal.SIGALRM, timeout_handler)   # line 252
        signal.alarm(self.timeout)
        result = self._executor.execute(self.task)
        signal.alarm(0)  # Cancel alarm
    except AttributeError:                                # line 256 — only AttributeError
        # Windows doesn't support SIGALRM, use threading.Timer fallback
        logger.warning("Timeout not supported on this platform, executing without timeout")
        result = self._executor.execute(self.task)
```

In CPython, `signal.signal()` from any non-main thread raises **`ValueError`**, not `AttributeError`. The `except AttributeError` does **not** catch it, so the `ValueError` propagates to the generic `except Exception` at line 291 — the run is logged as an *agent failure*, retried `max_retries` times (all failing identically), and **the agent never executes**. Whenever a `timeout` is set, every scheduled run on the worker thread fails before the task even starts. `from_recipe(...)` makes this the default path: it always sets `timeout` (defaults to `300`, lines 489/497), so recipe-backed schedulers are broken out of the box.

Reproduced:
```python
import signal, threading
def worker():
    try:
        signal.signal(signal.SIGALRM, lambda *a: None)
        print("registered ok")
    except AttributeError as e:
        print("AttributeError:", e)
    except ValueError as e:
        print("ValueError:", e)
t = threading.Thread(target=worker); t.start(); t.join()
# -> ValueError: signal only works in main thread of the main interpreter
```

The codebase already ships the correct, cross-platform, thread-safe primitive — `_wrap_with_timeout` in `agents_generator.py:94-158` runs the call on a worker thread and `join`s with a timeout. The scheduler should use the same approach instead of `SIGALRM`.

### 1b. Stats counters are read/written across threads with no lock

The worker thread mutates four counters with no synchronization:
```python
# agent_scheduler.py (inside _execute_with_retry, on the worker thread)
self._execution_count += 1     # 235
...
self._total_cost += estimated_cost   # 271
self._success_count += 1             # 274
...
self._failure_count += 1             # 302
```
while any caller thread reads them unguarded:
```python
# agent_scheduler.py:163-173
def get_stats(self) -> Dict[str, Any]:
    runtime = (datetime.now() - self._start_time).total_seconds() if self._start_time else 0
    return {
        "is_running": self.is_running,
        "total_executions": self._execution_count,
        "successful_executions": self._success_count,
        "failed_executions": self._failure_count,
        "success_rate": (self._success_count / self._execution_count * 100) if self._execution_count > 0 else 0,
        "total_cost_usd": round(self._total_cost, 4),
        ...
    }
```
There is no lock anywhere in the class, so `get_stats()` can observe torn snapshots (e.g. `failed > total`, `success_rate > 100`). Note the *async* sibling (`async_agent_scheduler.py`) does this correctly under `self._stats_lock` — the sync class simply never got the same treatment. (Minor corroborating smell: `start()` duplicates `self.is_running = True` / `self._stop_event.clear()` at lines 100-101 and again at 106-107.)

### Fix
Replace `SIGALRM` with the existing thread-based timeout, and guard the counters with a lock:
```python
# agent_scheduler.py
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# in __init__:
self._stats_lock = threading.Lock()

# in _execute_with_retry, replace the signal block (244-259) with:
if self.timeout:
    with ThreadPoolExecutor(max_workers=1) as ex:
        fut = ex.submit(self._executor.execute, self.task)
        try:
            result = fut.result(timeout=self.timeout)   # works on ANY thread, cross-platform
        except FuturesTimeout:
            raise TimeoutError(f"Execution exceeded {self.timeout}s timeout")
else:
    result = self._executor.execute(self.task)

# wrap every counter mutation and the get_stats() read body in `with self._stats_lock:`
```

---

## 2. Tool-class loading is duplicated and bypasses the `ToolResolver` that the code itself calls the "single source of truth"

### Where
- `src/praisonai/praisonai/agents_generator.py`
- `src/praisonai/praisonai/tool_resolver.py`

`AgentsGenerator.__init__` builds a resolver and comments it as canonical:
```python
# agents_generator.py:295-297
# Initialize tool resolver with the registry wired in (single source of truth for tool resolution)
from .tool_resolver import ToolResolver
self.tool_resolver = ToolResolver(registry=self.tool_registry)
```

But the same module re-implements the resolver's class-extraction logic verbatim:
```python
# agents_generator.py:459-475
def _extract_tool_classes(self, module):
    result = {}
    for name, obj in inspect.getmembers(module,
        lambda x: inspect.isclass(x) and (
            x.__module__.startswith('langchain_community.tools') or
            (PRAISONAI_TOOLS_AVAILABLE and BaseTool and issubclass(x, BaseTool))
        ) and x is not BaseTool):
        try:
            result[name] = obj()
        except Exception as e:
            self.logger.warning(f"Error instantiating tool class {name}: {e}")
            continue
    return result
```
```python
# tool_resolver.py:472-484  (ToolResolver.get_local_tool_classes — character-for-character the same predicate)
for name, obj in inspect.getmembers(module,
    lambda x: inspect.isclass(x) and (
        x.__module__.startswith('langchain_community.tools') or
        (PRAISONAI_TOOLS_AVAILABLE and BaseTool and issubclass(x, BaseTool))
    ) and x is not BaseTool):
    try:
        result[name] = obj()
        logger.debug(f"Loaded local tool class: {name}")
    except Exception as e:
        logger.warning(f"Error instantiating tool class {name}: {e}")
        continue
```

The `BaseTool` import dance is also duplicated — `agents_generator.py:30-37` vs `tool_resolver.py:459-469` — so the two copies can disagree about what `PRAISONAI_TOOLS_AVAILABLE`/`BaseTool` even are.

Worse, the actual run path uses **both** mechanisms and they diverge: `tools.py` goes through the resolver, but the `tools/` **directory** is loaded directly, bypassing the resolver entirely:
```python
# agents_generator.py:627-639
# Use consolidated ToolResolver for tools.py loading
tools_dict.update(self.tool_resolver.get_local_tool_classes())   # resolver path (tools.py)
if os.path.isfile(tools_py_path):
    self.logger.debug("tools.py exists in the root directory. ...")
elif tools_dir_path.is_dir():
    from ._safe_loader import load_user_module
    for py_file in tools_dir_path.glob("*.py"):                  # bypass: tools/ dir
        if py_file.name.startswith("__"):
            continue
        module = load_user_module(py_file, name=f"tools_{py_file.stem}")
        if module is not None:
            tools_dict.update(self._extract_tool_classes(module))  # the duplicate, not the resolver
```

So there are two independent class-loading code paths with separately-maintained predicates and exception handling. Any fix to discovery rules (new base class, new module prefix, dedup policy) must be made twice or it silently rots — exactly the DRY / single-source-of-truth violation the philosophy calls out.

### Fix
Make the resolver the only loader. Add a directory-aware method to `ToolResolver` and delete the duplicate from `agents_generator.py`:
```python
# tool_resolver.py — extend the existing single source of truth
def get_local_tool_classes_from_dir(self, tools_dir: "os.PathLike|str") -> Dict[str, Any]:
    """Load BaseTool/langchain classes from every *.py in a tools/ directory."""
    from pathlib import Path
    from ._safe_loader import load_user_module
    classes: Dict[str, Any] = {}
    for py_file in Path(tools_dir).glob("*.py"):
        if py_file.name.startswith("__"):
            continue
        module = load_user_module(py_file, name=f"tools_{py_file.stem}")
        if module is not None:
            classes.update(self._extract_tool_classes(module))   # reuse resolver's own predicate
    return classes
```
```python
# agents_generator.py:627-639 — becomes
tools_dict.update(self.tool_resolver.get_local_tool_classes())
if os.path.isfile(tools_py_path):
    self.logger.debug("tools.py exists; skipping tools/ folder.")
elif tools_dir_path.is_dir():
    tools_dict.update(self.tool_resolver.get_local_tool_classes_from_dir(tools_dir_path))
# DELETE AgentsGenerator._extract_tool_classes (459-475) and the duplicate BaseTool import block (30-37);
# import BaseTool from the resolver if still needed.
```

---

## 3. `--handoff*`, `--web`/`--web-fetch`, and `--planning` are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them

### Where
- CLI flags: `src/praisonai/praisonai/cli/main.py`
- Merge + field validation: `src/praisonai/praisonai/agents_generator.py`

The flags exist and are first-class on the CLI:
```python
# cli/main.py:974-981, 1037-1043
parser.add_argument("--web", "--web-search", action="store_true", help="Enable native web search ...")
parser.add_argument("--web-fetch", action="store_true", help="Enable web fetch to retrieve URL content ...")
parser.add_argument("--planning", action="store_true", help="Enable planning mode ...")
...
parser.add_argument("--handoff", type=str, help="Comma-separated agent roles for task delegation")
parser.add_argument("--handoff-policy", type=str, choices=["full", "summary", "none", "last_n"], ...)
parser.add_argument("--handoff-timeout", type=float, ...)
parser.add_argument("--handoff-max-depth", type=int, ...)
parser.add_argument("--handoff-max-concurrent", type=int, ...)
parser.add_argument("--handoff-detect-cycles", type=str, choices=["true", "false"], ...)
```
A repo grep confirms `handoff` and `web_search`/`WebConfig` appear **only** under `cli/` — never in `framework_adapters/`, `agents_generator.py`, or `auto.py`.

When an agent is run from `agents.yaml`, CLI options reach the run only through `_merge_cli_config`, whose allowlist does **not** include any of them:
```python
# agents_generator.py:349-370
agent_level_fields = ['tool_timeout', 'planning_tools', 'autonomy']        # no handoff / web / planning
...
approval_fields = ['trust', 'approval', 'approve_all_tools', 'approval_timeout', 'approve_level']
...
if 'guardrail' in cli_config:
    agent_overrides['guardrails'] = cli_config['guardrail']
```

And the YAML field validator's set of recognized fields has no `handoff` or `web` entry — so putting them in `agents.yaml` produces an "Unknown field" warning and they are ignored:
```python
# agents_generator.py:396-403
known_fields = {
    'role', 'goal', 'instructions', 'backstory', 'tools', 'tasks', 'llm',
    'function_calling_llm', 'allow_delegation', 'max_iter', 'max_rpm',
    'max_execution_time', 'verbose', 'cache', 'system_template',
    'prompt_template', 'response_template', 'tool_timeout', 'planning_tools',
    'planning', 'autonomy', 'guardrails', 'streaming', 'stream',
    'approval', 'skills', 'cli_backend', 'reflection'        # no 'handoff', no 'web'
}
```
```python
# agents_generator.py:426-428
self.logger.warning(
    f"Unknown field '{field_name}' in {entity_name} '{name}'.{suggestion}"
)
```

Net effect: handoff orchestration and native web search/fetch are reachable from the CLI (and from the core SDK directly in Python), but there is **no YAML representation** and a CLI flag like `praisonai --handoff researcher,writer agents.yaml` is silently dropped on the multi-agent YAML path. This breaks the explicit *"each feature runs 3 ways: CLI, YAML, Python"* contract. (`--planning` is a partial case: YAML recognizes a `planning` field at line 401, but the `--planning` flag is still not propagated by `_merge_cli_config`, so the flag has no effect on YAML-driven runs.)

### Fix
Make these first-class on all three surfaces:
1. Add `'handoff'` and `'web'` to `known_fields` (and document a `handoff:` / `web:` agent schema, e.g. `handoff: {to: [...], policy: summary, timeout: 30}`).
2. Extend the `_merge_cli_config` allowlist so the CLI flags land on the YAML config:
```python
# agents_generator.py:349
agent_level_fields = ['tool_timeout', 'planning_tools', 'autonomy', 'planning', 'web', 'web_fetch', 'handoff']
```
   plus map the `--handoff-*` sub-flags into a single `handoff` dict (mirroring the `ApprovalSpec.from_cli` pattern already used at lines 354-366).
3. Have the framework adapter(s) read `handoff` / `web` from the merged agent config and pass them to the core `Agent`, so the YAML path produces the same agent the CLI direct-prompt path does.

---

## Why these three
- **#1** is a live functional bug: the documented `timeout` safety valve never engages on the scheduler's own worker thread, and "thread-safe" stats can tear. Both have a known-good in-repo pattern to copy.
- **#2** is in-progress drift: two hand-maintained copies of the same tool-discovery predicate, with the `tools/` directory already bypassing the "single source of truth."
- **#3** is a contract gap: real features that exist on one leg of the CLI/YAML/Python tripod and are silently missing on another.

All three are scoped to `src/praisonai/praisonai`, are independent of #1722 and #1602, and each comes with a concrete fix. Happy to send PRs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Wrapper defects: scheduler timeout dies in its own worker thread, tool-class loader is duplicated/bypasses ToolResolver, and handoff/web/planning have no YAML surface #1738

Scope

1. `AgentScheduler` advertises "Thread-safe operation" but its `timeout` is dead in its own worker thread, and its stats counters race

Where

1a. The per-execution timeout cannot fire — `signal.signal` only works on the main thread

1b. Stats counters are read/written across threads with no lock

Fix

2. Tool-class loading is duplicated and bypasses the `ToolResolver` that the code itself calls the "single source of truth"

Where

Fix

3. `--handoff*`, `--web`/`--web-fetch`, and `--planning` are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them

Where

Fix

Why these three

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Wrapper defects: scheduler timeout dies in its own worker thread, tool-class loader is duplicated/bypasses ToolResolver, and handoff/web/planning have no YAML surface #1738

Description

Scope

1. AgentScheduler advertises "Thread-safe operation" but its timeout is dead in its own worker thread, and its stats counters race

Where

1a. The per-execution timeout cannot fire — signal.signal only works on the main thread

1b. Stats counters are read/written across threads with no lock

Fix

2. Tool-class loading is duplicated and bypasses the ToolResolver that the code itself calls the "single source of truth"

Where

Fix

3. --handoff*, --web/--web-fetch, and --planning are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them

Where

Fix

Why these three

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1. `AgentScheduler` advertises "Thread-safe operation" but its `timeout` is dead in its own worker thread, and its stats counters race

1a. The per-execution timeout cannot fire — `signal.signal` only works on the main thread

2. Tool-class loading is duplicated and bypasses the `ToolResolver` that the code itself calls the "single source of truth"

3. `--handoff*`, `--web`/`--web-fetch`, and `--planning` are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them