You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Wrapper defects: scheduler timeout dies in its own worker thread, tool-class loader is duplicated/bypasses ToolResolver, and handoff/web/planning have no YAML surface #1738
In-depth analysis of src/praisonai/praisonai (the wrapper layer), focused only on key architectural/behavioural gaps — not docs, tests, coverage, file sizes, or line counts.
These three are deliberately chosen to not overlap with the already-open wrapper audits #1722 (dead framework runners / async-vs-sync scheduler drift / process singletons) and #1602 (OpenAI client singleton / 0.0.0.0 bind / duplicated ScheduleParser). Each finding below was read end-to-end on claude/bold-bohr-YAv0w, every line number was checked against the current tree, and one is reproduced with a runnable snippet. Each maps to a stated principle: "multi-agent + async safe by default", "DRY / single source of truth", and "each feature runs 3 ways: CLI, YAML, Python."
1. AgentScheduler advertises "Thread-safe operation" but its timeout is dead in its own worker thread, and its stats counters race
1a. The per-execution timeout cannot fire — signal.signal only works on the main thread
_execute_with_retry (reached from _run_schedule on the worker thread) installs a SIGALRM handler:
# agent_scheduler.py:244-259ifself.timeout:
importsignaldeftimeout_handler(signum, frame):
raiseTimeoutError(f"Execution exceeded {self.timeout}s timeout")
# Set timeout alarm (Unix only)try:
signal.signal(signal.SIGALRM, timeout_handler) # line 252signal.alarm(self.timeout)
result=self._executor.execute(self.task)
signal.alarm(0) # Cancel alarmexceptAttributeError: # line 256 — only AttributeError# Windows doesn't support SIGALRM, use threading.Timer fallbacklogger.warning("Timeout not supported on this platform, executing without timeout")
result=self._executor.execute(self.task)
In CPython, signal.signal() from any non-main thread raises ValueError, not AttributeError. The except AttributeError does not catch it, so the ValueError propagates to the generic except Exception at line 291 — the run is logged as an agent failure, retried max_retries times (all failing identically), and the agent never executes. Whenever a timeout is set, every scheduled run on the worker thread fails before the task even starts. from_recipe(...) makes this the default path: it always sets timeout (defaults to 300, lines 489/497), so recipe-backed schedulers are broken out of the box.
Reproduced:
importsignal, threadingdefworker():
try:
signal.signal(signal.SIGALRM, lambda*a: None)
print("registered ok")
exceptAttributeErrorase:
print("AttributeError:", e)
exceptValueErrorase:
print("ValueError:", e)
t=threading.Thread(target=worker); t.start(); t.join()
# -> ValueError: signal only works in main thread of the main interpreter
The codebase already ships the correct, cross-platform, thread-safe primitive — _wrap_with_timeout in agents_generator.py:94-158 runs the call on a worker thread and joins with a timeout. The scheduler should use the same approach instead of SIGALRM.
1b. Stats counters are read/written across threads with no lock
The worker thread mutates four counters with no synchronization:
# agent_scheduler.py (inside _execute_with_retry, on the worker thread)self._execution_count+=1# 235
...
self._total_cost+=estimated_cost# 271self._success_count+=1# 274
...
self._failure_count+=1# 302
There is no lock anywhere in the class, so get_stats() can observe torn snapshots (e.g. failed > total, success_rate > 100). Note the async sibling (async_agent_scheduler.py) does this correctly under self._stats_lock — the sync class simply never got the same treatment. (Minor corroborating smell: start() duplicates self.is_running = True / self._stop_event.clear() at lines 100-101 and again at 106-107.)
Fix
Replace SIGALRM with the existing thread-based timeout, and guard the counters with a lock:
# agent_scheduler.pyimportthreadingfromconcurrent.futuresimportThreadPoolExecutor, TimeoutErrorasFuturesTimeout# in __init__:self._stats_lock=threading.Lock()
# in _execute_with_retry, replace the signal block (244-259) with:ifself.timeout:
withThreadPoolExecutor(max_workers=1) asex:
fut=ex.submit(self._executor.execute, self.task)
try:
result=fut.result(timeout=self.timeout) # works on ANY thread, cross-platformexceptFuturesTimeout:
raiseTimeoutError(f"Execution exceeded {self.timeout}s timeout")
else:
result=self._executor.execute(self.task)
# wrap every counter mutation and the get_stats() read body in `with self._stats_lock:`
2. Tool-class loading is duplicated and bypasses the ToolResolver that the code itself calls the "single source of truth"
Where
src/praisonai/praisonai/agents_generator.py
src/praisonai/praisonai/tool_resolver.py
AgentsGenerator.__init__ builds a resolver and comments it as canonical:
# agents_generator.py:295-297# Initialize tool resolver with the registry wired in (single source of truth for tool resolution)from .tool_resolverimportToolResolverself.tool_resolver=ToolResolver(registry=self.tool_registry)
But the same module re-implements the resolver's class-extraction logic verbatim:
# agents_generator.py:459-475def_extract_tool_classes(self, module):
result= {}
forname, objininspect.getmembers(module,
lambdax: inspect.isclass(x) and (
x.__module__.startswith('langchain_community.tools') or
(PRAISONAI_TOOLS_AVAILABLEandBaseToolandissubclass(x, BaseTool))
) andxisnotBaseTool):
try:
result[name] =obj()
exceptExceptionase:
self.logger.warning(f"Error instantiating tool class {name}: {e}")
continuereturnresult
# tool_resolver.py:472-484 (ToolResolver.get_local_tool_classes — character-for-character the same predicate)forname, objininspect.getmembers(module,
lambdax: inspect.isclass(x) and (
x.__module__.startswith('langchain_community.tools') or
(PRAISONAI_TOOLS_AVAILABLEandBaseToolandissubclass(x, BaseTool))
) andxisnotBaseTool):
try:
result[name] =obj()
logger.debug(f"Loaded local tool class: {name}")
exceptExceptionase:
logger.warning(f"Error instantiating tool class {name}: {e}")
continue
The BaseTool import dance is also duplicated — agents_generator.py:30-37 vs tool_resolver.py:459-469 — so the two copies can disagree about what PRAISONAI_TOOLS_AVAILABLE/BaseTool even are.
Worse, the actual run path uses both mechanisms and they diverge: tools.py goes through the resolver, but the tools/directory is loaded directly, bypassing the resolver entirely:
# agents_generator.py:627-639# Use consolidated ToolResolver for tools.py loadingtools_dict.update(self.tool_resolver.get_local_tool_classes()) # resolver path (tools.py)ifos.path.isfile(tools_py_path):
self.logger.debug("tools.py exists in the root directory. ...")
eliftools_dir_path.is_dir():
from ._safe_loaderimportload_user_moduleforpy_fileintools_dir_path.glob("*.py"): # bypass: tools/ dirifpy_file.name.startswith("__"):
continuemodule=load_user_module(py_file, name=f"tools_{py_file.stem}")
ifmoduleisnotNone:
tools_dict.update(self._extract_tool_classes(module)) # the duplicate, not the resolver
So there are two independent class-loading code paths with separately-maintained predicates and exception handling. Any fix to discovery rules (new base class, new module prefix, dedup policy) must be made twice or it silently rots — exactly the DRY / single-source-of-truth violation the philosophy calls out.
Fix
Make the resolver the only loader. Add a directory-aware method to ToolResolver and delete the duplicate from agents_generator.py:
# tool_resolver.py — extend the existing single source of truthdefget_local_tool_classes_from_dir(self, tools_dir: "os.PathLike|str") ->Dict[str, Any]:
"""Load BaseTool/langchain classes from every *.py in a tools/ directory."""frompathlibimportPathfrom ._safe_loaderimportload_user_moduleclasses: Dict[str, Any] = {}
forpy_fileinPath(tools_dir).glob("*.py"):
ifpy_file.name.startswith("__"):
continuemodule=load_user_module(py_file, name=f"tools_{py_file.stem}")
ifmoduleisnotNone:
classes.update(self._extract_tool_classes(module)) # reuse resolver's own predicatereturnclasses
# agents_generator.py:627-639 — becomestools_dict.update(self.tool_resolver.get_local_tool_classes())
ifos.path.isfile(tools_py_path):
self.logger.debug("tools.py exists; skipping tools/ folder.")
eliftools_dir_path.is_dir():
tools_dict.update(self.tool_resolver.get_local_tool_classes_from_dir(tools_dir_path))
# DELETE AgentsGenerator._extract_tool_classes (459-475) and the duplicate BaseTool import block (30-37);# import BaseTool from the resolver if still needed.
3. --handoff*, --web/--web-fetch, and --planning are CLI-only — no YAML surface, and the CLI→YAML merge silently drops them
Where
CLI flags: src/praisonai/praisonai/cli/main.py
Merge + field validation: src/praisonai/praisonai/agents_generator.py
And the YAML field validator's set of recognized fields has no handoff or web entry — so putting them in agents.yaml produces an "Unknown field" warning and they are ignored:
# agents_generator.py:426-428self.logger.warning(
f"Unknown field '{field_name}' in {entity_name} '{name}'.{suggestion}"
)
Net effect: handoff orchestration and native web search/fetch are reachable from the CLI (and from the core SDK directly in Python), but there is no YAML representation and a CLI flag like praisonai --handoff researcher,writer agents.yaml is silently dropped on the multi-agent YAML path. This breaks the explicit "each feature runs 3 ways: CLI, YAML, Python" contract. (--planning is a partial case: YAML recognizes a planning field at line 401, but the --planning flag is still not propagated by _merge_cli_config, so the flag has no effect on YAML-driven runs.)
Fix
Make these first-class on all three surfaces:
Add 'handoff' and 'web' to known_fields (and document a handoff: / web: agent schema, e.g. handoff: {to: [...], policy: summary, timeout: 30}).
Extend the _merge_cli_config allowlist so the CLI flags land on the YAML config:
plus map the --handoff-* sub-flags into a single handoff dict (mirroring the ApprovalSpec.from_cli pattern already used at lines 354-366).
3. Have the framework adapter(s) read handoff / web from the merged agent config and pass them to the core Agent, so the YAML path produces the same agent the CLI direct-prompt path does.
Why these three
Github actions fix #1 is a live functional bug: the documented timeout safety valve never engages on the scheduler's own worker thread, and "thread-safe" stats can tear. Both have a known-good in-repo pattern to copy.
Merge pull request #1 from MervinPraison/develop #2 is in-progress drift: two hand-maintained copies of the same tool-discovery predicate, with the tools/ directory already bypassing the "single source of truth."
Main #3 is a contract gap: real features that exist on one leg of the CLI/YAML/Python tripod and are silently missing on another.
All three are scoped to src/praisonai/praisonai, are independent of #1722 and #1602, and each comes with a concrete fix. Happy to send PRs.
Scope
In-depth analysis of
src/praisonai/praisonai(the wrapper layer), focused only on key architectural/behavioural gaps — not docs, tests, coverage, file sizes, or line counts.These three are deliberately chosen to not overlap with the already-open wrapper audits #1722 (dead framework runners / async-vs-sync scheduler drift / process singletons) and #1602 (OpenAI client singleton /
0.0.0.0bind / duplicatedScheduleParser). Each finding below was read end-to-end onclaude/bold-bohr-YAv0w, every line number was checked against the current tree, and one is reproduced with a runnable snippet. Each maps to a stated principle: "multi-agent + async safe by default", "DRY / single source of truth", and "each feature runs 3 ways: CLI, YAML, Python."1.
AgentScheduleradvertises "Thread-safe operation" but itstimeoutis dead in its own worker thread, and its stats counters raceWhere
src/praisonai/praisonai/scheduler/agent_scheduler.pyThe class docstring promises thread-safety:
Scheduled runs execute on a background daemon thread:
1a. The per-execution timeout cannot fire —
signal.signalonly works on the main thread_execute_with_retry(reached from_run_scheduleon the worker thread) installs aSIGALRMhandler:In CPython,
signal.signal()from any non-main thread raisesValueError, notAttributeError. Theexcept AttributeErrordoes not catch it, so theValueErrorpropagates to the genericexcept Exceptionat line 291 — the run is logged as an agent failure, retriedmax_retriestimes (all failing identically), and the agent never executes. Whenever atimeoutis set, every scheduled run on the worker thread fails before the task even starts.from_recipe(...)makes this the default path: it always setstimeout(defaults to300, lines 489/497), so recipe-backed schedulers are broken out of the box.Reproduced:
The codebase already ships the correct, cross-platform, thread-safe primitive —
_wrap_with_timeoutinagents_generator.py:94-158runs the call on a worker thread andjoins with a timeout. The scheduler should use the same approach instead ofSIGALRM.1b. Stats counters are read/written across threads with no lock
The worker thread mutates four counters with no synchronization:
while any caller thread reads them unguarded:
There is no lock anywhere in the class, so
get_stats()can observe torn snapshots (e.g.failed > total,success_rate > 100). Note the async sibling (async_agent_scheduler.py) does this correctly underself._stats_lock— the sync class simply never got the same treatment. (Minor corroborating smell:start()duplicatesself.is_running = True/self._stop_event.clear()at lines 100-101 and again at 106-107.)Fix
Replace
SIGALRMwith the existing thread-based timeout, and guard the counters with a lock:2. Tool-class loading is duplicated and bypasses the
ToolResolverthat the code itself calls the "single source of truth"Where
src/praisonai/praisonai/agents_generator.pysrc/praisonai/praisonai/tool_resolver.pyAgentsGenerator.__init__builds a resolver and comments it as canonical:But the same module re-implements the resolver's class-extraction logic verbatim:
The
BaseToolimport dance is also duplicated —agents_generator.py:30-37vstool_resolver.py:459-469— so the two copies can disagree about whatPRAISONAI_TOOLS_AVAILABLE/BaseTooleven are.Worse, the actual run path uses both mechanisms and they diverge:
tools.pygoes through the resolver, but thetools/directory is loaded directly, bypassing the resolver entirely:So there are two independent class-loading code paths with separately-maintained predicates and exception handling. Any fix to discovery rules (new base class, new module prefix, dedup policy) must be made twice or it silently rots — exactly the DRY / single-source-of-truth violation the philosophy calls out.
Fix
Make the resolver the only loader. Add a directory-aware method to
ToolResolverand delete the duplicate fromagents_generator.py:3.
--handoff*,--web/--web-fetch, and--planningare CLI-only — no YAML surface, and the CLI→YAML merge silently drops themWhere
src/praisonai/praisonai/cli/main.pysrc/praisonai/praisonai/agents_generator.pyThe flags exist and are first-class on the CLI:
A repo grep confirms
handoffandweb_search/WebConfigappear only undercli/— never inframework_adapters/,agents_generator.py, orauto.py.When an agent is run from
agents.yaml, CLI options reach the run only through_merge_cli_config, whose allowlist does not include any of them:And the YAML field validator's set of recognized fields has no
handofforwebentry — so putting them inagents.yamlproduces an "Unknown field" warning and they are ignored:Net effect: handoff orchestration and native web search/fetch are reachable from the CLI (and from the core SDK directly in Python), but there is no YAML representation and a CLI flag like
praisonai --handoff researcher,writer agents.yamlis silently dropped on the multi-agent YAML path. This breaks the explicit "each feature runs 3 ways: CLI, YAML, Python" contract. (--planningis a partial case: YAML recognizes aplanningfield at line 401, but the--planningflag is still not propagated by_merge_cli_config, so the flag has no effect on YAML-driven runs.)Fix
Make these first-class on all three surfaces:
'handoff'and'web'toknown_fields(and document ahandoff:/web:agent schema, e.g.handoff: {to: [...], policy: summary, timeout: 30})._merge_cli_configallowlist so the CLI flags land on the YAML config:plus map the
--handoff-*sub-flags into a singlehandoffdict (mirroring theApprovalSpec.from_clipattern already used at lines 354-366).3. Have the framework adapter(s) read
handoff/webfrom the merged agent config and pass them to the coreAgent, so the YAML path produces the same agent the CLI direct-prompt path does.Why these three
timeoutsafety valve never engages on the scheduler's own worker thread, and "thread-safe" stats can tear. Both have a known-good in-repo pattern to copy.tools/directory already bypassing the "single source of truth."All three are scoped to
src/praisonai/praisonai, are independent of #1722 and #1602, and each comes with a concrete fix. Happy to send PRs.