Skip to content

Wrapper defects: scheduler timeout dies in its own worker thread, tool-class loader is duplicated/bypasses ToolResolver, and handoff/web/planning have no YAML surface #1738

@MervinPraison

Description

@MervinPraison

Scope

In-depth analysis of src/praisonai/praisonai (the wrapper layer), focused only on key architectural/behavioural gaps — not docs, tests, coverage, file sizes, or line counts.

These three are deliberately chosen to not overlap with the already-open wrapper audits #1722 (dead framework runners / async-vs-sync scheduler drift / process singletons) and #1602 (OpenAI client singleton / 0.0.0.0 bind / duplicated ScheduleParser). Each finding below was read end-to-end on claude/bold-bohr-YAv0w, every line number was checked against the current tree, and one is reproduced with a runnable snippet. Each maps to a stated principle: "multi-agent + async safe by default", "DRY / single source of truth", and "each feature runs 3 ways: CLI, YAML, Python."


1. AgentScheduler advertises "Thread-safe operation" but its timeout is dead in its own worker thread, and its stats counters race

Where

src/praisonai/praisonai/scheduler/agent_scheduler.py

The class docstring promises thread-safety:

# agent_scheduler.py:19-28
class AgentScheduler:
    """
    Scheduler for running PraisonAI agents periodically.

    Features:
    - Interval-based scheduling (hourly, daily, custom)
    - Thread-safe operation          # <-- claim
    - Automatic retry on failure
    ...

Scheduled runs execute on a background daemon thread:

# agent_scheduler.py:115-120
self._thread = threading.Thread(
    target=self._run_schedule,
    args=(interval, max_retries),
    daemon=True
)
self._thread.start()

1a. The per-execution timeout cannot fire — signal.signal only works on the main thread

_execute_with_retry (reached from _run_schedule on the worker thread) installs a SIGALRM handler:

# agent_scheduler.py:244-259
if self.timeout:
    import signal

    def timeout_handler(signum, frame):
        raise TimeoutError(f"Execution exceeded {self.timeout}s timeout")

    # Set timeout alarm (Unix only)
    try:
        signal.signal(signal.SIGALRM, timeout_handler)   # line 252
        signal.alarm(self.timeout)
        result = self._executor.execute(self.task)
        signal.alarm(0)  # Cancel alarm
    except AttributeError:                                # line 256 — only AttributeError
        # Windows doesn't support SIGALRM, use threading.Timer fallback
        logger.warning("Timeout not supported on this platform, executing without timeout")
        result = self._executor.execute(self.task)

In CPython, signal.signal() from any non-main thread raises ValueError, not AttributeError. The except AttributeError does not catch it, so the ValueError propagates to the generic except Exception at line 291 — the run is logged as an agent failure, retried max_retries times (all failing identically), and the agent never executes. Whenever a timeout is set, every scheduled run on the worker thread fails before the task even starts. from_recipe(...) makes this the default path: it always sets timeout (defaults to 300, lines 489/497), so recipe-backed schedulers are broken out of the box.

Reproduced:

import signal, threading
def worker():
    try:
        signal.signal(signal.SIGALRM, lambda *a: None)
        print("registered ok")
    except AttributeError as e:
        print("AttributeError:", e)
    except ValueError as e:
        print("ValueError:", e)
t = threading.Thread(target=worker); t.start(); t.join()
# -> ValueError: signal only works in main thread of the main interpreter

The codebase already ships the correct, cross-platform, thread-safe primitive — _wrap_with_timeout in agents_generator.py:94-158 runs the call on a worker thread and joins with a timeout. The scheduler should use the same approach instead of SIGALRM.

1b. Stats counters are read/written across threads with no lock

The worker thread mutates four counters with no synchronization:

# agent_scheduler.py (inside _execute_with_retry, on the worker thread)
self._execution_count += 1     # 235
...
self._total_cost += estimated_cost   # 271
self._success_count += 1             # 274
...
self._failure_count += 1             # 302

while any caller thread reads them unguarded:

# agent_scheduler.py:163-173
def get_stats(self) -> Dict[str, Any]:
    runtime = (datetime.now() - self._start_time).total_seconds() if self._start_time else 0
    return {
        "is_running": self.is_running,
        "total_executions": self._execution_count,
        "successful_executions": self._success_count,
        "failed_executions": self._failure_count,
        "success_rate": (self._success_count / self._execution_count * 100) if self._execution_count > 0 else 0,
        "total_cost_usd": round(self._total_cost, 4),
        ...
    }

There is no lock anywhere in the class, so get_stats() can observe torn snapshots (e.g. failed > total, success_rate > 100). Note the async sibling (async_agent_scheduler.py) does this correctly under self._stats_lock — the sync class simply never got the same treatment. (Minor corroborating smell: start() duplicates self.is_running = True / self._stop_event.clear() at lines 100-101 and again at 106-107.)

Fix

Replace SIGALRM with the existing thread-based timeout, and guard the counters with a lock:

# agent_scheduler.py
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# in __init__:
self._stats_lock = threading.Lock()

# in _execute_with_retry, replace the signal block (244-259) with:
if self.timeout:
    with ThreadPoolExecutor(max_workers=1) as ex:
        fut = ex.submit(self._executor.execute, self.task)
        try:
            result = fut.result(timeout=self.timeout)   # works on ANY thread, cross-platform
        except FuturesTimeout:
            raise TimeoutError(f"Execution exceeded {self.timeout}s timeout")
else:
    result = self._executor.execute(self.task)

# wrap every counter mutation and the get_stats() read body in `with self._stats_lock:`

2. Tool-class loading is duplicated and bypasses the ToolResolver that the code itself calls the "single source of truth"

Where

  • src/praisonai/praisonai/agents_generator.py
  • src/praisonai/praisonai/tool_resolver.py

AgentsGenerator.__init__ builds a resolver and comments it as canonical:

# agents_generator.py:295-297
# Initialize tool resolver with the registry wired in (single source of truth for tool resolution)
from .tool_resolver import ToolResolver
self.tool_resolver = ToolResolver(registry=self.tool_registry)

But the same module re-implements the resolver's class-extraction logic verbatim:

# agents_generator.py:459-475
def _extract_tool_classes(self, module):
    result = {}
    for name, obj in inspect.getmembers(module,
        lambda x: inspect.isclass(x) and (
            x.__module__.startswith('langchain_community.tools') or
            (PRAISONAI_TOOLS_AVAILABLE and BaseTool and issubclass(x, BaseTool))
        ) and x is not BaseTool):
        try:
            result[name] = obj()
        except Exception as e:
            self.logger.warning(f"Error instantiating tool class {name}: {e}")
            continue
    return result
# tool_resolver.py:472-484  (ToolResolver.get_local_tool_classes — character-for-character the same predicate)
for name, obj in inspect.getmembers(module,
    lambda x: inspect.isclass(x) and (
        x.__module__.startswith('langchain_community.tools') or
        (PRAISONAI_TOOLS_AVAILABLE and BaseTool and issubclass(x, BaseTool))
    ) and x is not BaseTool):
    try:
        result[name] = obj()
        logger.debug(f"Loaded local tool class: {name}")
    except Exception as e:
        logger.warning(f"Error instantiating tool class {name}: {e}")
        continue

The BaseTool import dance is also duplicated — agents_generator.py:30-37 vs tool_resolver.py:459-469 — so the two copies can disagree about what PRAISONAI_TOOLS_AVAILABLE/BaseTool even are.

Worse, the actual run path uses both mechanisms and they diverge: tools.py goes through the resolver, but the tools/ directory is loaded directly, bypassing the resolver entirely:

# agents_generator.py:627-639
# Use consolidated ToolResolver for tools.py loading
tools_dict.update(self.tool_resolver.get_local_tool_classes())   # resolver path (tools.py)
if os.path.isfile(tools_py_path):
    self.logger.debug("tools.py exists in the root directory. ...")
elif tools_dir_path.is_dir():
    from ._safe_loader import load_user_module
    for py_file in tools_dir_path.glob("*.py"):                  # bypass: tools/ dir
        if py_file.name.startswith("__"):
            continue
        module = load_user_module(py_file, name=f"tools_{py_file.stem}")
        if module is not None:
            tools_dict.update(self._extract_tool_classes(module))  # the duplicate, not the resolver

So there are two independent class-loading code paths with separately-maintained predicates and exception handling. Any fix to discovery rules (new base class, new module prefix, dedup policy) must be made twice or it silently rots — exactly the DRY / single-source-of-truth violation the philosophy calls out.

Fix

Make the resolver the only loader. Add a directory-aware method to ToolResolver and delete the duplicate from agents_generator.py:

# tool_resolver.py — extend the existing single source of truth
def get_local_tool_classes_from_dir(self, tools_dir: "os.PathLike|str") -> Dict[str, Any]:
    """Load BaseTool/langchain classes from every *.py in a tools/ directory."""
    from pathlib import Path
    from ._safe_loader import load_user_module
    classes: Dict[str, Any] = {}
    for py_file in Path(tools_dir).glob("*.py"):
        if py_file.name.startswith("__"):
            continue
        module = load_user_module(py_file, name=f"tools_{py_file.stem}")
        if module is not None:
            classes.update(self._extract_tool_classes(module))   # reuse resolver's own predicate
    return classes
# agents_generator.py:627-639 — becomes
tools_dict.update(self.tool_resolver.get_local_tool_classes())
if os.path.isfile(tools_py_path):
    self.logger.debug("tools.py exists; skipping tools/ folder.")
elif tools_dir_path.is_dir():
    tools_dict.update(self.tool_resolver.get_local_tool_classes_from_dir(tools_dir_path))
# DELETE AgentsGenerator._extract_tool_classes (459-475) and the duplicate BaseTool import block (30-37);
# import BaseTool from the resolver if still needed.

3. --handoff*, --web/--web-fetch, and --planning are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them

Where

  • CLI flags: src/praisonai/praisonai/cli/main.py
  • Merge + field validation: src/praisonai/praisonai/agents_generator.py

The flags exist and are first-class on the CLI:

# cli/main.py:974-981, 1037-1043
parser.add_argument("--web", "--web-search", action="store_true", help="Enable native web search ...")
parser.add_argument("--web-fetch", action="store_true", help="Enable web fetch to retrieve URL content ...")
parser.add_argument("--planning", action="store_true", help="Enable planning mode ...")
...
parser.add_argument("--handoff", type=str, help="Comma-separated agent roles for task delegation")
parser.add_argument("--handoff-policy", type=str, choices=["full", "summary", "none", "last_n"], ...)
parser.add_argument("--handoff-timeout", type=float, ...)
parser.add_argument("--handoff-max-depth", type=int, ...)
parser.add_argument("--handoff-max-concurrent", type=int, ...)
parser.add_argument("--handoff-detect-cycles", type=str, choices=["true", "false"], ...)

A repo grep confirms handoff and web_search/WebConfig appear only under cli/ — never in framework_adapters/, agents_generator.py, or auto.py.

When an agent is run from agents.yaml, CLI options reach the run only through _merge_cli_config, whose allowlist does not include any of them:

# agents_generator.py:349-370
agent_level_fields = ['tool_timeout', 'planning_tools', 'autonomy']        # no handoff / web / planning
...
approval_fields = ['trust', 'approval', 'approve_all_tools', 'approval_timeout', 'approve_level']
...
if 'guardrail' in cli_config:
    agent_overrides['guardrails'] = cli_config['guardrail']

And the YAML field validator's set of recognized fields has no handoff or web entry — so putting them in agents.yaml produces an "Unknown field" warning and they are ignored:

# agents_generator.py:396-403
known_fields = {
    'role', 'goal', 'instructions', 'backstory', 'tools', 'tasks', 'llm',
    'function_calling_llm', 'allow_delegation', 'max_iter', 'max_rpm',
    'max_execution_time', 'verbose', 'cache', 'system_template',
    'prompt_template', 'response_template', 'tool_timeout', 'planning_tools',
    'planning', 'autonomy', 'guardrails', 'streaming', 'stream',
    'approval', 'skills', 'cli_backend', 'reflection'        # no 'handoff', no 'web'
}
# agents_generator.py:426-428
self.logger.warning(
    f"Unknown field '{field_name}' in {entity_name} '{name}'.{suggestion}"
)

Net effect: handoff orchestration and native web search/fetch are reachable from the CLI (and from the core SDK directly in Python), but there is no YAML representation and a CLI flag like praisonai --handoff researcher,writer agents.yaml is silently dropped on the multi-agent YAML path. This breaks the explicit "each feature runs 3 ways: CLI, YAML, Python" contract. (--planning is a partial case: YAML recognizes a planning field at line 401, but the --planning flag is still not propagated by _merge_cli_config, so the flag has no effect on YAML-driven runs.)

Fix

Make these first-class on all three surfaces:

  1. Add 'handoff' and 'web' to known_fields (and document a handoff: / web: agent schema, e.g. handoff: {to: [...], policy: summary, timeout: 30}).
  2. Extend the _merge_cli_config allowlist so the CLI flags land on the YAML config:
# agents_generator.py:349
agent_level_fields = ['tool_timeout', 'planning_tools', 'autonomy', 'planning', 'web', 'web_fetch', 'handoff']

plus map the --handoff-* sub-flags into a single handoff dict (mirroring the ApprovalSpec.from_cli pattern already used at lines 354-366).
3. Have the framework adapter(s) read handoff / web from the merged agent config and pass them to the core Agent, so the YAML path produces the same agent the CLI direct-prompt path does.


Why these three

  • Github actions fix #1 is a live functional bug: the documented timeout safety valve never engages on the scheduler's own worker thread, and "thread-safe" stats can tear. Both have a known-good in-repo pattern to copy.
  • Merge pull request #1 from MervinPraison/develop #2 is in-progress drift: two hand-maintained copies of the same tool-discovery predicate, with the tools/ directory already bypassing the "single source of truth."
  • Main #3 is a contract gap: real features that exist on one leg of the CLI/YAML/Python tripod and are silently missing on another.

All three are scoped to src/praisonai/praisonai, are independent of #1722 and #1602, and each comes with a concrete fix. Happy to send PRs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingclaudeAuto-trigger Claude analysisdocumentationImprovements or additions to documentationenhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions