Three high-impact gaps in src/praisonai/praisonai: broken ollama publish, multi-tenant-unsafe tool resolver, sync-only auto-generator with unreliable cleanup

## Scope

In-depth analysis of `src/praisonai/praisonai/` (the wrapper layer). Per project philosophy, this layer must be **multi-agent + async safe by default**, **production-ready / safe by default**, and have **no perf regressions**. The three gaps below each violate one or more of those pillars and are validated against the current code at HEAD.

This is intentionally narrow — only the highest-impact gaps, not a full audit, and not docs/tests/coverage/file-size complaints.

---

## 1) `subprocess.run(["ollama", "serve"])` blocks forever — the entire `ollama_save` publish path never executes

### Where

Same pattern, four places:

- `src/praisonai/praisonai/train.py:555`
- `src/praisonai/praisonai/train_vision.py:291`
- `src/praisonai/praisonai/upload_vision.py:91`
- `src/praisonai/praisonai/train/llm/trainer.py:528`

### The bug

```python
# src/praisonai/praisonai/train.py:551-557
def create_and_push_ollama_model(self):
    modelfile_content = self.prepare_modelfile_content()
    with open("Modelfile", "w") as file:
        file.write(modelfile_content)
    subprocess.run(["ollama", "serve"])                                          # <-- blocks forever
    subprocess.run(["ollama", "create", f"{self.config['ollama_model']}:{self.config['model_parameters']}", "-f", "Modelfile"])
    subprocess.run(["ollama", "push",   f"{self.config['ollama_model']}:{self.config['model_parameters']}"])
```

`ollama serve` is a long-running foreground HTTP server. `subprocess.run(...)` blocks until the child exits — which it doesn't. So `create` and `push` are unreachable, and `run()` → `create_and_push_ollama_model()` hangs after training completes whenever the user sets `ollama_save: true` (the default in the example configs). The "publish to Ollama" feature is effectively dead.

### Why it matters

- Documented feature does not work — silent hang after a (potentially long) training run.
- Wastes the user's GPU time: they finish fine-tuning, then watch the process sit forever while believing the push is in progress.
- Four copies of the same broken block — fixing in one place misses the others.

### How to fix

Don't run `ollama serve` synchronously; check that the daemon is up (or start it detached), then create/push:

```python
import subprocess, shutil, time, socket, contextlib

def _ollama_ready(host="127.0.0.1", port=11434, timeout=0.2) -> bool:
    with contextlib.suppress(OSError), socket.create_connection((host, port), timeout):
        return True
    return False

def _ensure_ollama_running(self):
    if _ollama_ready():
        return None
    if shutil.which("ollama") is None:
        raise RuntimeError("`ollama` CLI not found; install from https://ollama.com")
    # Detached server — do NOT wait on it.
    proc = subprocess.Popen(
        ["ollama", "serve"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    # Poll until ready (or give up after a few seconds).
    for _ in range(50):
        if _ollama_ready():
            return proc
        time.sleep(0.1)
    raise RuntimeError("ollama serve did not become ready in time")

def create_and_push_ollama_model(self):
    with open("Modelfile", "w") as f:
        f.write(self.prepare_modelfile_content())
    self._ensure_ollama_running()
    tag = f"{self.config['ollama_model']}:{self.config['model_parameters']}"
    subprocess.run(["ollama", "create", tag, "-f", "Modelfile"], check=True)
    subprocess.run(["ollama", "push",   tag], check=True)
```

Extract this into one shared helper (e.g. `praisonai/train/_ollama.py`) so all four call sites use the same implementation — that also kills the DRY violation between `train.py`, `train_vision.py`, `upload_vision.py`, and `train/llm/trainer.py`.

### Validation

`grep '"ollama", "serve"' src/praisonai/praisonai/` returns exactly the four sites listed above; in all four, the next two lines are the unreachable `create` and `push` calls.

---

## 2) `tool_resolver._default_resolver` is a process-wide singleton that binds tools to the first caller's CWD — wrong tools in multi-tenant gateways and any test that `chdir`s

### Where

- `src/praisonai/praisonai/tool_resolver.py:490-528` — the singleton and the convenience wrappers (`resolve_tool`, `resolve_tools`, `list_available_tools`, `has_tool`, `validate_yaml_tools`)
- `src/praisonai/praisonai/tool_resolver.py:56-76` — constructor: `self._tools_py_path = tools_py_path or "tools.py"` (a CWD-relative default)
- `src/praisonai/praisonai/tool_resolver.py:78-120` — `_load_local_tools()` populates a per-instance cache once and never invalidates

### The bug

```python
# src/praisonai/praisonai/tool_resolver.py:490-528
_default_resolver: Optional[ToolResolver] = None
_default_resolver_lock = threading.Lock()

def _get_default_resolver() -> ToolResolver:
    """...Returns cached ToolResolver that is anchored to the working directory
    at first call. Local tools.py resolution is CWD-dependent and cached
    for the lifetime of the process."""
    global _default_resolver
    if _default_resolver is None:
        with _default_resolver_lock:
            if _default_resolver is None:
                _default_resolver = ToolResolver()      # <-- captures CWD-relative "tools.py" forever
    return _default_resolver

def resolve_tool(name, resolver=None):
    return (resolver or _get_default_resolver()).resolve(name)
# resolve_tools / list_available_tools / has_tool / validate_yaml_tools all do the same
```

```python
# src/praisonai/praisonai/tool_resolver.py:56-67
def __init__(self, tools_py_path=None, registry=None):
    self._tools_py_path = tools_py_path or "tools.py"   # relative path, resolved later
    ...
```

Concrete failure modes:

1. **Gateway / FastAPI server** (`src/praisonai/praisonai/gateway/server.py:1411-1412` instantiates `ToolResolver()` per request — good — but any code path that calls the module-level `resolve_tool(...)` instead silently uses the singleton). The first request to hit a `resolve_tool()` convenience call locks the local-tools cache to that worker's CWD. Subsequent requests from the same worker get the same `tools.py` regardless of their project context, and the cache cannot be reset.
2. **Multi-project CLI / orchestrator** that `os.chdir`s between projects: only the project active at first-call wins for the rest of the process.
3. **Tests**: cannot be isolated without restarting Python; the docstring explicitly waves this off ("create explicit resolver instances") but the convenience functions are imported and used throughout the codebase.

The module already provides per-key `invalidate()` for the framework-availability cache (`src/praisonai/praisonai/_framework_availability.py:53`); the tool resolver should have the same affordance.

### How to fix

Two complementary changes:

(a) Resolve the path **eagerly** in the constructor so the binding is explicit and inspectable, not implicit-via-CWD-at-load-time:

```python
# tool_resolver.py
from pathlib import Path

def __init__(self, tools_py_path=None, registry=None):
    self._tools_py_path = str(Path(tools_py_path or "tools.py").resolve())
    ...
```

(b) Add a `reset_default_resolver()` hook (mirroring `_framework_availability.invalidate`) and document that long-running servers must call it on CWD change or per-tenant boundary:

```python
# tool_resolver.py
def reset_default_resolver() -> None:
    """Clear the process-default resolver. Call between tenants / on CWD change."""
    global _default_resolver
    with _default_resolver_lock:
        _default_resolver = None
```

Better still, for the gateway hot path, switch the convenience wrappers to look up a per-call resolver from a `ContextVar` (request-scoped), falling back to the global only for single-tenant CLI use. That keeps the perf win of the cache for CLI users without trading it for a correctness bug in servers.

### Validation

- `src/praisonai/praisonai/tool_resolver.py:494-509` — the singleton has no reset path.
- `src/praisonai/praisonai/tool_resolver.py:67` — the default `tools_py_path` is the literal string `"tools.py"`, only resolved against CWD inside `load_user_module()` (`src/praisonai/praisonai/_safe_loader.py:42` — `Path(module_path).resolve()` runs at load time, not at construction time).
- `src/praisonai/praisonai/tool_resolver.py:78-120` — `_local_tools_loaded` flips to `True` after the first successful load and is never cleared.
- `src/praisonai/praisonai/_framework_availability.py:53` — the precedent for an explicit `invalidate()` already exists in this codebase; the tool resolver simply doesn't follow it.

---

## 3) `BaseAutoGenerator` has no async API and depends on `__del__` for HTTP cleanup — violates "async-safe by default", leaks sockets

### Where

- `src/praisonai/praisonai/auto.py:397-506` — `BaseAutoGenerator`, including `_get_openai_client`, `close`, `__del__`, and `_structured_completion`
- Call sites that all funnel through the sync path: `auto.py:700`, `1088`, `1110`, `1414`

### The bug

```python
# src/praisonai/praisonai/auto.py:425-456
self._openai_client = None
self._openai_client_lock = threading.Lock()

def _get_openai_client(self):
    if self._openai_client is None:
        with self._openai_client_lock:
            if self._openai_client is None:
                from openai import OpenAI
                cfg = self.config_list[0]
                self._openai_client = OpenAI(
                    api_key=cfg.get("api_key") or os.environ.get("OPENAI_API_KEY"),
                    base_url=cfg.get("base_url"),
                )
    return self._openai_client

def close(self):
    if not hasattr(self, '_openai_client_lock'):
        return
    with self._openai_client_lock:
        client = getattr(self, '_openai_client', None)
        self._openai_client = None
    if client is not None:
        client.close()

def __del__(self):
    """Best-effort cleanup, but the canonical path is explicit close()."""
    self.close()                                # <-- not guaranteed to run
```

```python
# src/praisonai/praisonai/auto.py:458-506
def _structured_completion(self, response_model, messages, **kwargs):
    model_name = self.config_list[0]['model']

    if _check_litellm_available():
        litellm = _get_litellm()
        response = litellm.completion(...)      # <-- sync, blocks event loop
        return response_model.model_validate_json(response.choices[0].message.content)

    if _check_openai_available():
        client = self._get_openai_client()
        response = client.beta.chat.completions.parse(...)   # <-- sync, blocks event loop
        return response.choices[0].message.parsed
    ...
```

Two related problems:

(a) **No async surface.** `generate()` on `AutoGenerator`, `WorkflowAutoGenerator`, and `JobWorkflowAutoGenerator` is sync only. Any caller in an async context (the gateway, the daemon, the agent scheduler, any user using `praisonai` from FastAPI) blocks the event loop for the entire LLM round trip. Contrast with `_async_bridge.py:83`, where the wrapper goes to real trouble to never block the loop from async code — that contract is broken here.

(b) **Cleanup via `__del__` is unreliable.** CPython doesn't guarantee finalizers run (cycles, interpreter shutdown, thread context, async refcount paths) — and the comment "best-effort cleanup, but the canonical path is explicit `close()`" already concedes that. There is no `__enter__`/`__exit__`, no `__aenter__`/`__aexit__`, and most call sites never call `close()`. Each transient generator leaks one HTTP connection pool to the process; over the lifetime of a long-running server (gateway, daemon, MCP server) that adds up.

### How to fix

Add an async path and proper context-manager support. Sketch:

```python
# src/praisonai/praisonai/auto.py
class BaseAutoGenerator:
    def __init__(self, config_list=None):
        ...
        self._openai_client = None
        self._async_openai_client = None
        self._client_lock = threading.Lock()

    # --- sync ---
    def close(self):
        client, self._openai_client = self._openai_client, None
        if client is not None:
            client.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False

    # --- async ---
    async def aclose(self):
        client, self._async_openai_client = self._async_openai_client, None
        if client is not None:
            await client.close()

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.aclose()
        return False

    async def _astructured_completion(self, response_model, messages, **kwargs):
        if _check_litellm_available():
            litellm = _get_litellm()
            response = await litellm.acompletion(
                model=self.config_list[0]['model'],
                messages=messages,
                response_format=response_model,
                **kwargs,
            )
            return response_model.model_validate_json(response.choices[0].message.content)

        if _check_openai_available():
            from openai import AsyncOpenAI
            if self._async_openai_client is None:
                cfg = self.config_list[0]
                self._async_openai_client = AsyncOpenAI(
                    api_key=cfg.get("api_key") or os.environ.get("OPENAI_API_KEY"),
                    base_url=cfg.get("base_url"),
                )
            response = await self._async_openai_client.beta.chat.completions.parse(
                model=self.config_list[0]['model'],
                messages=messages,
                response_format=response_model,
                **kwargs,
            )
            return response.choices[0].message.parsed
        raise ImportError("Install litellm or openai for structured output.")

    # Remove __del__: rely on context-manager / explicit close instead.
```

Then expose `async def agenerate(...)` on the three generator classes that simply awaits `_astructured_completion(...)` instead of calling the sync version. Callers from async land use:

```python
async with WorkflowAutoGenerator(topic=...) as gen:
    path = await gen.agenerate(pattern="parallel")
```

Drop `__del__` once `__exit__` / `__aexit__` exist — finalizers shouldn't be the contract.

### Validation

- `src/praisonai/praisonai/auto.py:458` — `_structured_completion` is `def`, not `async def`; both branches use sync calls (`litellm.completion` and `client.beta.chat.completions.parse`, not `acompletion` / `AsyncOpenAI`).
- `src/praisonai/praisonai/auto.py:454-456` — the only cleanup path is `__del__` → `close()`; no `__enter__`/`__exit__`, no async equivalents.
- `grep -n 'async def' src/praisonai/praisonai/auto.py` — none.
- `src/praisonai/praisonai/_async_bridge.py:83` — the wrapper has a documented "do not block the loop from async code" contract that `auto.py` does not honor.

---

## Summary

| # | Gap | File(s) | Principle violated |
|---|-----|---------|--------------------|
| 1 | `ollama serve` called via blocking `subprocess.run` — `create`/`push` unreachable | `train.py:555`, `train_vision.py:291`, `upload_vision.py:91`, `train/llm/trainer.py:528` | Production-ready / safe by default |
| 2 | Process-wide `_default_resolver` binds tools to first CWD, no reset hook | `tool_resolver.py:490-528` (+ ctor at `:67`) | Multi-agent + async safe by default |
| 3 | `BaseAutoGenerator` has no async path; cleanup via `__del__` only | `auto.py:397-506` | Async-safe by default; production-ready |

Happy to break any of these out into separate issues / PRs if preferred — they're independent.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Three high-impact gaps in src/praisonai/praisonai: broken ollama publish, multi-tenant-unsafe tool resolver, sync-only auto-generator with unreliable cleanup #1735

Scope

1) `subprocess.run(["ollama", "serve"])` blocks forever — the entire `ollama_save` publish path never executes

Where

The bug

Why it matters

How to fix

Validation

2) `tool_resolver._default_resolver` is a process-wide singleton that binds tools to the first caller's CWD — wrong tools in multi-tenant gateways and any test that `chdir`s

Where

The bug

How to fix

Validation

3) `BaseAutoGenerator` has no async API and depends on `del` for HTTP cleanup — violates "async-safe by default", leaks sockets

Where

The bug

How to fix

Validation

Summary

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

#	Gap	File(s)	Principle violated
1	`ollama serve` called via blocking `subprocess.run` — `create`/`push` unreachable	`train.py:555`, `train_vision.py:291`, `upload_vision.py:91`, `train/llm/trainer.py:528`	Production-ready / safe by default
2	Process-wide `_default_resolver` binds tools to first CWD, no reset hook	`tool_resolver.py:490-528` (+ ctor at `:67`)	Multi-agent + async safe by default
3	`BaseAutoGenerator` has no async path; cleanup via `__del__` only	`auto.py:397-506`	Async-safe by default; production-ready

Uh oh!

Three high-impact gaps in src/praisonai/praisonai: broken ollama publish, multi-tenant-unsafe tool resolver, sync-only auto-generator with unreliable cleanup #1735

Description

Scope

1) subprocess.run(["ollama", "serve"]) blocks forever — the entire ollama_save publish path never executes

Where

The bug

Why it matters

How to fix

Validation

2) tool_resolver._default_resolver is a process-wide singleton that binds tools to the first caller's CWD — wrong tools in multi-tenant gateways and any test that chdirs

Where

The bug

How to fix

Validation

3) BaseAutoGenerator has no async API and depends on __del__ for HTTP cleanup — violates "async-safe by default", leaks sockets

Where

The bug

How to fix

Validation

Summary

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1) `subprocess.run(["ollama", "serve"])` blocks forever — the entire `ollama_save` publish path never executes

2) `tool_resolver._default_resolver` is a process-wide singleton that binds tools to the first caller's CWD — wrong tools in multi-tenant gateways and any test that `chdir`s

3) `BaseAutoGenerator` has no async API and depends on `del` for HTTP cleanup — violates "async-safe by default", leaks sockets