Skip to content

Three high-impact gaps in src/praisonai/praisonai: broken ollama publish, multi-tenant-unsafe tool resolver, sync-only auto-generator with unreliable cleanup #1735

@MervinPraison

Description

@MervinPraison

Scope

In-depth analysis of src/praisonai/praisonai/ (the wrapper layer). Per project philosophy, this layer must be multi-agent + async safe by default, production-ready / safe by default, and have no perf regressions. The three gaps below each violate one or more of those pillars and are validated against the current code at HEAD.

This is intentionally narrow — only the highest-impact gaps, not a full audit, and not docs/tests/coverage/file-size complaints.


1) subprocess.run(["ollama", "serve"]) blocks forever — the entire ollama_save publish path never executes

Where

Same pattern, four places:

  • src/praisonai/praisonai/train.py:555
  • src/praisonai/praisonai/train_vision.py:291
  • src/praisonai/praisonai/upload_vision.py:91
  • src/praisonai/praisonai/train/llm/trainer.py:528

The bug

# src/praisonai/praisonai/train.py:551-557
def create_and_push_ollama_model(self):
    modelfile_content = self.prepare_modelfile_content()
    with open("Modelfile", "w") as file:
        file.write(modelfile_content)
    subprocess.run(["ollama", "serve"])                                          # <-- blocks forever
    subprocess.run(["ollama", "create", f"{self.config['ollama_model']}:{self.config['model_parameters']}", "-f", "Modelfile"])
    subprocess.run(["ollama", "push",   f"{self.config['ollama_model']}:{self.config['model_parameters']}"])

ollama serve is a long-running foreground HTTP server. subprocess.run(...) blocks until the child exits — which it doesn't. So create and push are unreachable, and run()create_and_push_ollama_model() hangs after training completes whenever the user sets ollama_save: true (the default in the example configs). The "publish to Ollama" feature is effectively dead.

Why it matters

  • Documented feature does not work — silent hang after a (potentially long) training run.
  • Wastes the user's GPU time: they finish fine-tuning, then watch the process sit forever while believing the push is in progress.
  • Four copies of the same broken block — fixing in one place misses the others.

How to fix

Don't run ollama serve synchronously; check that the daemon is up (or start it detached), then create/push:

import subprocess, shutil, time, socket, contextlib

def _ollama_ready(host="127.0.0.1", port=11434, timeout=0.2) -> bool:
    with contextlib.suppress(OSError), socket.create_connection((host, port), timeout):
        return True
    return False

def _ensure_ollama_running(self):
    if _ollama_ready():
        return None
    if shutil.which("ollama") is None:
        raise RuntimeError("`ollama` CLI not found; install from https://ollama.com")
    # Detached server — do NOT wait on it.
    proc = subprocess.Popen(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    # Poll until ready (or give up after a few seconds).
    for _ in range(50):
        if _ollama_ready():
            return proc
        time.sleep(0.1)
    raise RuntimeError("ollama serve did not become ready in time")

def create_and_push_ollama_model(self):
    with open("Modelfile", "w") as f:
        f.write(self.prepare_modelfile_content())
    self._ensure_ollama_running()
    tag = f"{self.config['ollama_model']}:{self.config['model_parameters']}"
    subprocess.run(["ollama", "create", tag, "-f", "Modelfile"], check=True)
    subprocess.run(["ollama", "push",   tag], check=True)

Extract this into one shared helper (e.g. praisonai/train/_ollama.py) so all four call sites use the same implementation — that also kills the DRY violation between train.py, train_vision.py, upload_vision.py, and train/llm/trainer.py.

Validation

grep '"ollama", "serve"' src/praisonai/praisonai/ returns exactly the four sites listed above; in all four, the next two lines are the unreachable create and push calls.


2) tool_resolver._default_resolver is a process-wide singleton that binds tools to the first caller's CWD — wrong tools in multi-tenant gateways and any test that chdirs

Where

  • src/praisonai/praisonai/tool_resolver.py:490-528 — the singleton and the convenience wrappers (resolve_tool, resolve_tools, list_available_tools, has_tool, validate_yaml_tools)
  • src/praisonai/praisonai/tool_resolver.py:56-76 — constructor: self._tools_py_path = tools_py_path or "tools.py" (a CWD-relative default)
  • src/praisonai/praisonai/tool_resolver.py:78-120_load_local_tools() populates a per-instance cache once and never invalidates

The bug

# src/praisonai/praisonai/tool_resolver.py:490-528
_default_resolver: Optional[ToolResolver] = None
_default_resolver_lock = threading.Lock()

def _get_default_resolver() -> ToolResolver:
    """...Returns cached ToolResolver that is anchored to the working directory
    at first call. Local tools.py resolution is CWD-dependent and cached
    for the lifetime of the process."""
    global _default_resolver
    if _default_resolver is None:
        with _default_resolver_lock:
            if _default_resolver is None:
                _default_resolver = ToolResolver()      # <-- captures CWD-relative "tools.py" forever
    return _default_resolver

def resolve_tool(name, resolver=None):
    return (resolver or _get_default_resolver()).resolve(name)
# resolve_tools / list_available_tools / has_tool / validate_yaml_tools all do the same
# src/praisonai/praisonai/tool_resolver.py:56-67
def __init__(self, tools_py_path=None, registry=None):
    self._tools_py_path = tools_py_path or "tools.py"   # relative path, resolved later
    ...

Concrete failure modes:

  1. Gateway / FastAPI server (src/praisonai/praisonai/gateway/server.py:1411-1412 instantiates ToolResolver() per request — good — but any code path that calls the module-level resolve_tool(...) instead silently uses the singleton). The first request to hit a resolve_tool() convenience call locks the local-tools cache to that worker's CWD. Subsequent requests from the same worker get the same tools.py regardless of their project context, and the cache cannot be reset.
  2. Multi-project CLI / orchestrator that os.chdirs between projects: only the project active at first-call wins for the rest of the process.
  3. Tests: cannot be isolated without restarting Python; the docstring explicitly waves this off ("create explicit resolver instances") but the convenience functions are imported and used throughout the codebase.

The module already provides per-key invalidate() for the framework-availability cache (src/praisonai/praisonai/_framework_availability.py:53); the tool resolver should have the same affordance.

How to fix

Two complementary changes:

(a) Resolve the path eagerly in the constructor so the binding is explicit and inspectable, not implicit-via-CWD-at-load-time:

# tool_resolver.py
from pathlib import Path

def __init__(self, tools_py_path=None, registry=None):
    self._tools_py_path = str(Path(tools_py_path or "tools.py").resolve())
    ...

(b) Add a reset_default_resolver() hook (mirroring _framework_availability.invalidate) and document that long-running servers must call it on CWD change or per-tenant boundary:

# tool_resolver.py
def reset_default_resolver() -> None:
    """Clear the process-default resolver. Call between tenants / on CWD change."""
    global _default_resolver
    with _default_resolver_lock:
        _default_resolver = None

Better still, for the gateway hot path, switch the convenience wrappers to look up a per-call resolver from a ContextVar (request-scoped), falling back to the global only for single-tenant CLI use. That keeps the perf win of the cache for CLI users without trading it for a correctness bug in servers.

Validation

  • src/praisonai/praisonai/tool_resolver.py:494-509 — the singleton has no reset path.
  • src/praisonai/praisonai/tool_resolver.py:67 — the default tools_py_path is the literal string "tools.py", only resolved against CWD inside load_user_module() (src/praisonai/praisonai/_safe_loader.py:42Path(module_path).resolve() runs at load time, not at construction time).
  • src/praisonai/praisonai/tool_resolver.py:78-120_local_tools_loaded flips to True after the first successful load and is never cleared.
  • src/praisonai/praisonai/_framework_availability.py:53 — the precedent for an explicit invalidate() already exists in this codebase; the tool resolver simply doesn't follow it.

3) BaseAutoGenerator has no async API and depends on __del__ for HTTP cleanup — violates "async-safe by default", leaks sockets

Where

  • src/praisonai/praisonai/auto.py:397-506BaseAutoGenerator, including _get_openai_client, close, __del__, and _structured_completion
  • Call sites that all funnel through the sync path: auto.py:700, 1088, 1110, 1414

The bug

# src/praisonai/praisonai/auto.py:425-456
self._openai_client = None
self._openai_client_lock = threading.Lock()

def _get_openai_client(self):
    if self._openai_client is None:
        with self._openai_client_lock:
            if self._openai_client is None:
                from openai import OpenAI
                cfg = self.config_list[0]
                self._openai_client = OpenAI(
                    api_key=cfg.get("api_key") or os.environ.get("OPENAI_API_KEY"),
                    base_url=cfg.get("base_url"),
                )
    return self._openai_client

def close(self):
    if not hasattr(self, '_openai_client_lock'):
        return
    with self._openai_client_lock:
        client = getattr(self, '_openai_client', None)
        self._openai_client = None
    if client is not None:
        client.close()

def __del__(self):
    """Best-effort cleanup, but the canonical path is explicit close()."""
    self.close()                                # <-- not guaranteed to run
# src/praisonai/praisonai/auto.py:458-506
def _structured_completion(self, response_model, messages, **kwargs):
    model_name = self.config_list[0]['model']

    if _check_litellm_available():
        litellm = _get_litellm()
        response = litellm.completion(...)      # <-- sync, blocks event loop
        return response_model.model_validate_json(response.choices[0].message.content)

    if _check_openai_available():
        client = self._get_openai_client()
        response = client.beta.chat.completions.parse(...)   # <-- sync, blocks event loop
        return response.choices[0].message.parsed
    ...

Two related problems:

(a) No async surface. generate() on AutoGenerator, WorkflowAutoGenerator, and JobWorkflowAutoGenerator is sync only. Any caller in an async context (the gateway, the daemon, the agent scheduler, any user using praisonai from FastAPI) blocks the event loop for the entire LLM round trip. Contrast with _async_bridge.py:83, where the wrapper goes to real trouble to never block the loop from async code — that contract is broken here.

(b) Cleanup via __del__ is unreliable. CPython doesn't guarantee finalizers run (cycles, interpreter shutdown, thread context, async refcount paths) — and the comment "best-effort cleanup, but the canonical path is explicit close()" already concedes that. There is no __enter__/__exit__, no __aenter__/__aexit__, and most call sites never call close(). Each transient generator leaks one HTTP connection pool to the process; over the lifetime of a long-running server (gateway, daemon, MCP server) that adds up.

How to fix

Add an async path and proper context-manager support. Sketch:

# src/praisonai/praisonai/auto.py
class BaseAutoGenerator:
    def __init__(self, config_list=None):
        ...
        self._openai_client = None
        self._async_openai_client = None
        self._client_lock = threading.Lock()

    # --- sync ---
    def close(self):
        client, self._openai_client = self._openai_client, None
        if client is not None:
            client.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False

    # --- async ---
    async def aclose(self):
        client, self._async_openai_client = self._async_openai_client, None
        if client is not None:
            await client.close()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.aclose()
        return False

    async def _astructured_completion(self, response_model, messages, **kwargs):
        if _check_litellm_available():
            litellm = _get_litellm()
            response = await litellm.acompletion(
                model=self.config_list[0]['model'],
                messages=messages,
                response_format=response_model,
                **kwargs,
            )
            return response_model.model_validate_json(response.choices[0].message.content)

        if _check_openai_available():
            from openai import AsyncOpenAI
            if self._async_openai_client is None:
                cfg = self.config_list[0]
                self._async_openai_client = AsyncOpenAI(
                    api_key=cfg.get("api_key") or os.environ.get("OPENAI_API_KEY"),
                    base_url=cfg.get("base_url"),
                )
            response = await self._async_openai_client.beta.chat.completions.parse(
                model=self.config_list[0]['model'],
                messages=messages,
                response_format=response_model,
                **kwargs,
            )
            return response.choices[0].message.parsed
        raise ImportError("Install litellm or openai for structured output.")

    # Remove __del__: rely on context-manager / explicit close instead.

Then expose async def agenerate(...) on the three generator classes that simply awaits _astructured_completion(...) instead of calling the sync version. Callers from async land use:

async with WorkflowAutoGenerator(topic=...) as gen:
    path = await gen.agenerate(pattern="parallel")

Drop __del__ once __exit__ / __aexit__ exist — finalizers shouldn't be the contract.

Validation

  • src/praisonai/praisonai/auto.py:458_structured_completion is def, not async def; both branches use sync calls (litellm.completion and client.beta.chat.completions.parse, not acompletion / AsyncOpenAI).
  • src/praisonai/praisonai/auto.py:454-456 — the only cleanup path is __del__close(); no __enter__/__exit__, no async equivalents.
  • grep -n 'async def' src/praisonai/praisonai/auto.py — none.
  • src/praisonai/praisonai/_async_bridge.py:83 — the wrapper has a documented "do not block the loop from async code" contract that auto.py does not honor.

Summary

# Gap File(s) Principle violated
1 ollama serve called via blocking subprocess.runcreate/push unreachable train.py:555, train_vision.py:291, upload_vision.py:91, train/llm/trainer.py:528 Production-ready / safe by default
2 Process-wide _default_resolver binds tools to first CWD, no reset hook tool_resolver.py:490-528 (+ ctor at :67) Multi-agent + async safe by default
3 BaseAutoGenerator has no async path; cleanup via __del__ only auto.py:397-506 Async-safe by default; production-ready

Happy to break any of these out into separate issues / PRs if preferred — they're independent.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingclaudeAuto-trigger Claude analysisdocumentationImprovements or additions to documentation

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions