Wrapper security & lifecycle gaps: magic-link auth secret mismatch breaks default login, sandbox timeout leaks remote files + orphans workloads, and cloud deploy leaks the API key into gcloud argv #1743

@MervinPraison

Scope

In-depth analysis of src/praisonai/praisonai (the wrapper layer), focused only on key correctness/security gaps that violate the stated "production-ready / safe by default" pillar — not docs, tests, coverage, file sizes, or line counts.

All three were read end-to-end on claude/bold-bohr-pDWTv, and every line number was checked against the current tree. They are deliberately chosen not to overlap with the existing wrapper audits (#1502, #1508, #1614, #1620, #1735, #1738).

Each finding includes a concrete fix. Happy to send PRs.


1) Gateway magic-link / cookie auth signs and verifies with different secrets — magic-link login is broken in the default setup, and HTTP vs WebSocket auth can silently diverge

Where

src/praisonai/praisonai/gateway/server.py and src/praisonai/praisonai/gateway/cookie_auth.py

Three cookie-auth call sites resolve the signing secret from two different sources:

# server.py:339-340  (HTTP cookie VERIFY)        -> secret from os.environ
from .cookie_auth import create_auth_manager_from_env
auth_manager = create_auth_manager_from_env()

# server.py:648      (magic-link MINT)            -> secret from os.environ
auth_manager = create_auth_manager_from_env()
if not auth_manager:
    return JSONResponse({"error": "Cookie authentication not available"}, status_code=500)

# server.py:431-432  (WebSocket cookie VERIFY)    -> secret from self.config.auth_token
from .cookie_auth import CookieAuthManager
auth_manager = CookieAuthManager(secret_key=self.config.auth_token)

create_auth_manager_from_env() reads the secret only from the process environment and returns None when neither var is set:

# cookie_auth.py:223-241
def create_auth_manager_from_env() -> Optional[CookieAuthManager]:
    """Looks for GATEWAY_AUTH_TOKEN or PRAISONAI_SECRET_KEY."""
    import os
    secret = (os.environ.get("GATEWAY_AUTH_TOKEN")
              or os.environ.get("PRAISONAI_SECRET_KEY"))
    if not secret:
        return None
    return CookieAuthManager(secret_key=secret)

But the default token-resolution path generates the token and writes it to a file, never to os.environ:

# server.py:210-236
if hasattr(self.config, 'auth_token') and not self.config.auth_token:
    env_tok = os.environ.get("GATEWAY_AUTH_TOKEN", "").strip()
    if env_tok:
        self.config.auth_token = env_tok          # env case: config == env  (OK)
    else:
        if is_loopback(self.config.bind_host):
            self.config.auth_token = secrets.token_hex(16)   # generated...
            ...
            save_auth_token_to_env(self.config.auth_token)   # ...saved to ~/.praisonai/.env, NOT os.environ
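
The gap between the file write and the process environment can be reproduced in isolation. The sketch below uses hypothetical stand-ins: `save_token_to_env_file` and `manager_secret_from_env` are illustrative names, not the project's `save_auth_token_to_env` or `create_auth_manager_from_env`, and `DEMO_GATEWAY_AUTH_TOKEN` is a made-up variable so the demo cannot collide with a real setting.

```python
import os
import secrets
import tempfile
from pathlib import Path
from typing import Optional

ENV_VAR = "DEMO_GATEWAY_AUTH_TOKEN"  # hypothetical name, used only in this demo


def save_token_to_env_file(token: str, env_file: Path) -> None:
    """Stand-in for a helper that persists the token to a .env file."""
    env_file.write_text(f"{ENV_VAR}={token}\n")


def manager_secret_from_env() -> Optional[str]:
    """Stand-in for an env-only lookup: returns None when the variable is unset."""
    return os.environ.get(ENV_VAR) or None


os.environ.pop(ENV_VAR, None)      # simulate the default setup: nothing exported
token = secrets.token_hex(16)      # the auto-generated token

with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    save_token_to_env_file(token, env_file)
    file_has_token = token in env_file.read_text()

# The file write did not touch os.environ, so the env-only lookup finds nothing.
secret_before_export = manager_secret_from_env()

# Exporting the resolved token into the process environment closes the gap.
os.environ.setdefault(ENV_VAR, token)
secret_after_export = manager_secret_from_env()
```

Under these assumptions the token is on disk while the env-only lookup still returns `None`, which mirrors the default loopback case described here.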

The bug — two concrete failure modes

(a) Default loopback setup → magic-link login is completely broken. No GATEWAY_AUTH_TOKEN env var, no config token → self.config.auth_token is auto-generated and persisted to ~/.praisonai/.env. load_dotenv() does not re-read that file into the running process, so os.environ["GATEWAY_AUTH_TOKEN"] stays unset. When the user opens their minted magic link, the mint handler calls create_auth_manager_from_env() → None → HTTP 500 "Cookie authentication not available" (server.py:649-653). The headline one-click login flow fails out of the box.

(b) Token set in config + a different GATEWAY_AUTH_TOKEN env value → HTTP works, WebSocket silently fails. If config.auth_token = X (so the block at :210 is skipped) while the env holds Y, cookies are minted and HTTP-verified with Y (:648 / :340) but WebSocket-verified with X (:432). A browser that authenticated over HTTP then fails every WebSocket cookie check and is silently downgraded to the deprecated ?token= query-param path (or rejected at :446).

This is a single-source-of-truth violation on a security primitive: the secret that signs a session cookie must be the same one that verifies it on every transport.
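
Failure mode (b) can be illustrated with a generic HMAC-signed cookie. This is a minimal sketch under stated assumptions: the `sign` and `verify` helpers and the `value.signature` format are hypothetical, and the actual `CookieAuthManager` format is not shown in this issue.

```python
import hashlib
import hmac
from typing import Optional


def sign(value: str, secret: str) -> str:
    """Return 'value.signature' using HMAC-SHA256 (illustrative cookie format)."""
    sig = hmac.new(secret.encode(), value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{sig}"


def verify(cookie: str, secret: str) -> Optional[str]:
    """Return the original value if the signature matches, else None."""
    value, _, sig = cookie.rpartition(".")
    expected = hmac.new(secret.encode(), value.encode(), hashlib.sha256).hexdigest()
    return value if hmac.compare_digest(sig, expected) else None


env_secret = "Y-from-environment"   # used by the mint and HTTP-verify paths
config_secret = "X-from-config"     # used by the WebSocket-verify path

cookie = sign("session-123", env_secret)
http_result = verify(cookie, env_secret)     # same secret: accepted
ws_result = verify(cookie, config_secret)    # different secret: rejected
```

The cookie itself is valid; it is rejected on the second transport only because that transport resolved a different secret.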

How to fix

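Resolve the secret once and use it everywhere. Simplest: export the resolved token right after token resolution in __init__, so the env helper and the WS path agree. Note that setdefault leaves an existing env value in place, so this fixes failure mode (a) but not the config/env mismatch in (b):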

# server.py, immediately after the token-resolution block (~:236)
if self.config.auth_token:
    # Single source of truth: the env helper used by mint/HTTP-verify must see the same secret as the WS path.
    os.environ.setdefault("GATEWAY_AUTH_TOKEN", self.config.auth_token)

Better: drop create_auth_manager_from_env() from the request paths entirely and have all three sites build the manager from one resolved value:

# one helper on the gateway
def _cookie_auth_manager(self) -> Optional[CookieAuthManager]:
    return CookieAuthManager(secret_key=self.config.auth_token) if self.config.auth_token else None
# use self._cookie_auth_manager() at :340, :432, and :648

The mint path at :648 should never be able to return 500 when self.config.auth_token is set.

Validation

  • server.py:340 and :648 both call create_auth_manager_from_env(); :432 constructs CookieAuthManager(secret_key=self.config.auth_token) — two different secret sources for the same cookie.
  • cookie_auth.py:233-239 reads only os.environ and returns None when unset.
  • server.py:225-234 generates the token and calls save_auth_token_to_env(...) (file write) without ever setting os.environ["GATEWAY_AUTH_TOKEN"] → create_auth_manager_from_env() returns None in the default loopback case.

2) Sandbox execution timeout leaks a remote temp file on every failed run, and does not actually stop the remote/container workload — the resource-limit guarantee is defeated on the timeout path

Where

src/praisonai/praisonai/sandbox/ssh.py and src/praisonai/praisonai/sandbox/docker.py

2a. SSH backend leaks the remote temp file and orphans the remote process

# ssh.py:168-222  (execute)
try:
    remote_file = f"{self.working_dir}/exec_{execution_id}.{self._get_file_extension(language)}"
    await self.write_file(remote_file, code)                 # remote file created
    command = self._build_command(language, remote_file, limits, env)
    result = await self._run_command_with_limits(command, limits, working_dir or self.working_dir)  # can raise
    await self._connection.run(f"rm -f {shlex.quote(remote_file)}")   # :186 — cleanup, INSIDE try, AFTER the run
    ...
except asyncio.TimeoutError:        # :203 — cleanup skipped
    return SandboxResult(... status=TIMEOUT ...)
except Exception as e:              # :213 — cleanup skipped
    return SandboxResult(... status=FAILED ...)

There is no try/finally. The rm -f at line 186 only runs on the success path, so every timed-out or failed execution permanently leaks exec_<uuid>.<ext> in the remote working dir. Over time the remote host fills with orphaned files.

Compounding it, the timeout itself doesn't stop the remote work:

# ssh.py:526-542  (_run_command_with_limits)
full_command = f"cd {shlex.quote(working_dir)} && {command}"   # no remote-side `timeout`
if timeout:
    result = await asyncio.wait_for(self._connection.run(full_command), timeout=timeout)

asyncio.wait_for cancels the local await; asyncssh does not guarantee the remote process dies. So the remote command (e.g. a runaway python exec_….py) keeps running after we report TIMEOUT.
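
A local analog makes the distinction concrete. In this sketch a child Python process stands in for the remote command; it does not exercise asyncssh, so it shows only the general point that a timed-out `asyncio.wait_for` ends the caller's wait, not the workload.

```python
import asyncio
import sys


async def demo() -> tuple:
    # A child process stands in for the remote command; left alone it runs for 30 s.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)"
    )
    timed_out = False
    try:
        await asyncio.wait_for(proc.wait(), timeout=0.2)
    except asyncio.TimeoutError:
        timed_out = True

    # The timeout ended our wait; the child has not exited.
    still_running = proc.returncode is None

    # Stopping the workload is a separate, explicit step.
    proc.kill()
    await proc.wait()
    reaped = proc.returncode is not None
    return timed_out, still_running, reaped


timed_out, still_running, reaped = asyncio.run(demo())
```

Reporting TIMEOUT at the point where `timed_out` becomes true would be accurate about the wait and wrong about the workload, which is the situation described for the SSH backend.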

2b. Docker backend kills the client, not the container

# docker.py:259-275
docker_cmd = ["docker", "run", "--rm",            # no --name, not detached
              "--memory", f"{limits.memory_mb}m",
              "--cpus", str(limits.cpu_percent / 100), ...]
docker_cmd.extend([self._image, "sh", "-c", cmd_str])

# docker.py:304-306  (on timeout)
except asyncio.TimeoutError:
    proc.kill()          # kills the local `docker run` CLIENT
    await proc.wait()

Killing the docker run client does not stop the container — the daemon keeps it running, and --rm only removes it after it stops. A CPU/network-heavy task that times out therefore keeps consuming the configured --cpus/--memory indefinitely, exactly defeating the limit the sandbox exists to enforce.

Why it matters

The sandbox is a safety component. "Timeout" silently failing to (a) clean up and (b) actually stop the workload turns a hard limit into a soft suggestion, and leaks remote state on every error.

How to fix

SSH — move cleanup into finally, and bound the remote side so the process self-terminates:

# ssh.py — execute()
remote_file = f"{self.working_dir}/exec_{execution_id}.{self._get_file_extension(language)}"
try:
    await self.write_file(remote_file, code)
    command = self._build_command(language, remote_file, limits, env)
    result = await self._run_command_with_limits(command, limits, working_dir or self.working_dir)
    ...
except asyncio.TimeoutError:
    ...
except Exception as e:
    ...
finally:
    try:
        await self._connection.run(f"rm -f {shlex.quote(remote_file)}")
    except Exception:
        pass  # never mask the real result/error with a cleanup failure

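# _run_command_with_limits: wrap the remote command so it self-kills.
# Round up: coreutils `timeout 0` disables the limit, so int(0.5) would remove it.
if timeout:
    remote_timeout = max(1, math.ceil(timeout))   # requires `import math`
    full_command = f"cd {shlex.quote(working_dir)} && timeout {remote_timeout} sh -c {shlex.quote(command)}"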
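
The cleanup half of the SSH fix can be exercised locally. The sketch below is an assumption-laden stand-in: a temp directory plays the remote working dir, `asyncio.sleep` plays the remote command, and `run_with_cleanup` is a hypothetical name rather than the project's `execute()`.

```python
import asyncio
import os
import tempfile


async def run_with_cleanup(workdir: str, run_seconds: float, timeout: float) -> str:
    """Write a scratch file, run a simulated workload under a timeout, always clean up."""
    path = os.path.join(workdir, "exec_demo.py")
    with open(path, "w") as f:
        f.write("print('hello')\n")
    try:
        # asyncio.sleep stands in for the remote command.
        await asyncio.wait_for(asyncio.sleep(run_seconds), timeout=timeout)
        return "completed"
    except asyncio.TimeoutError:
        return "timeout"
    finally:
        # Runs on success, timeout, and unexpected errors alike.
        try:
            os.remove(path)
        except OSError:
            pass  # a cleanup failure should not replace the real result


async def main() -> tuple:
    with tempfile.TemporaryDirectory() as workdir:
        ok = await run_with_cleanup(workdir, run_seconds=0.01, timeout=1.0)
        timed_out = await run_with_cleanup(workdir, run_seconds=1.0, timeout=0.05)
        leftovers = os.listdir(workdir)
    return ok, timed_out, leftovers


ok_status, timeout_status, leftover_files = asyncio.run(main())
```

Both the successful and the timed-out run leave the directory empty, because the removal sits in `finally` instead of after the run inside `try`.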

Docker — give the container a name and force-stop it on timeout (or launch detached and docker rm -f):

container_name = f"praisonai-{execution_id}"
docker_cmd = ["docker", "run", "--rm", "--name", container_name, ...]
...
except asyncio.TimeoutError:
    with contextlib.suppress(Exception):
        kill = await asyncio.create_subprocess_exec("docker", "kill", container_name)
        await kill.wait()
    proc.kill()
    await proc.wait()
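
The key property of the Docker fix is that the run and kill commands address the same container name. A small sketch of that pairing follows; the helper names, the image, and the command string are hypothetical, and only the `docker run` / `docker kill` flags already quoted in this issue are used.

```python
from typing import List


def container_name_for(execution_id: str) -> str:
    """Deterministic name so the timeout handler can address the container."""
    return f"praisonai-{execution_id}"


def build_run_argv(execution_id: str, image: str, shell_cmd: str,
                   memory_mb: int, cpu_percent: int) -> List[str]:
    """`docker run` argv with an explicit --name (all parameters are illustrative)."""
    return [
        "docker", "run", "--rm",
        "--name", container_name_for(execution_id),
        "--memory", f"{memory_mb}m",
        "--cpus", str(cpu_percent / 100),
        image, "sh", "-c", shell_cmd,
    ]


def build_kill_argv(execution_id: str) -> List[str]:
    """`docker kill` argv targeting the same container by name."""
    return ["docker", "kill", container_name_for(execution_id)]


run_argv = build_run_argv("abc123", "python:3.11-slim", "python /code/main.py", 256, 50)
kill_argv = build_kill_argv("abc123")
```

Deriving both argv lists from one function keeps the timeout handler from drifting out of sync with the name used at launch.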

Validation

  • ssh.py:186 → rm -f is inside try, after _run_command_with_limits; ssh.py:203 and :213 except blocks have no cleanup.
  • ssh.py:537 → asyncio.wait_for(self._connection.run(full_command), ...); ssh.py:528 builds cd … && {command} with no remote timeout.
  • docker.py:260 → docker run --rm with no --name; docker.py:305-306 only proc.kill() + proc.wait(); grep "docker kill"/"docker rm" in docker.py → none.

3) Cloud deploy passes the resolved OPENAI_API_KEY inline in the gcloud argv — credential exposed in the process table / CI logs

Where

src/praisonai/praisonai/deploy.py:136-157

# deploy.py:136-141
from praisonai.llm.env import resolve_llm_endpoint
ep = resolve_llm_endpoint()
openai_model = ep.model
openai_key = ep.api_key or 'Enter your API key'
openai_base = ep.base_url

# deploy.py:154-157
['gcloud', 'run', 'deploy', 'praisonai-service',
 '--image', f'us-central1-docker.pkg.dev/{project_id}/praisonai-repository/praisonai-app:latest',
 '--platform', 'managed', '--region', 'us-central1', '--allow-unauthenticated',
 '--set-env-vars', f'OPENAI_MODEL_NAME={openai_model},OPENAI_API_KEY={openai_key},OPENAI_API_BASE={openai_base}']

The bug

The real OPENAI_API_KEY is interpolated directly into a command-line argument. Even though this is an argv list (so not a shell-injection vector), command-line arguments are:

  • visible to any local user via ps / /proc/<pid>/cmdline while gcloud runs,
  • routinely captured verbatim in CI/CD logs and shell history,
  • baked into the Cloud Run service config in plaintext.

For a package whose philosophy is "safe by default," shipping the user's LLM credential through argv is a real exposure. Secondary fragility: --set-env-vars is comma-delimited, so any value containing a comma (a base URL or a model alias) silently corrupts the env mapping.
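
The comma fragility is easy to see with a simplified model of comma-delimited parsing. `naive_parse` below is a hypothetical illustration, not gcloud's actual parser, which is not reproduced here and may reject such input outright rather than accept it.

```python
from typing import Dict, List, Tuple


def naive_parse(arg: str) -> Tuple[Dict[str, str], List[str]]:
    """Split 'K=V,K=V' on commas; return parsed pairs and fragments without '='."""
    pairs: Dict[str, str] = {}
    malformed: List[str] = []
    for piece in arg.split(","):
        key, sep, value = piece.partition("=")
        if sep:
            pairs[key] = value
        else:
            malformed.append(piece)
    return pairs, malformed


model = "primary,fallback"               # a value that happens to contain a comma
base = "https://example.invalid/v1"
arg = f"OPENAI_MODEL_NAME={model},OPENAI_API_BASE={base}"

pairs, malformed = naive_parse(arg)
# The model value is cut at the comma and the remainder is a stray fragment.
```

Whether the real tool truncates or errors, the intended value does not arrive intact, which is one more reason to prefer a file-based or secret-based mechanism over an inline comma-delimited string.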

How to fix

Pass secrets via a file or Secret Manager, never inline in argv:

# Option A — Secret Manager (preferred for managed services)
#   echo -n "$OPENAI_API_KEY" | gcloud secrets create praisonai-openai-key --data-file=-
['gcloud', 'run', 'deploy', 'praisonai-service', ...,
 '--set-env-vars', f'OPENAI_MODEL_NAME={openai_model},OPENAI_API_BASE={openai_base}',
 '--set-secrets', 'OPENAI_API_KEY=praisonai-openai-key:latest']

# Option B — env-vars file (avoids argv + the comma-splitting problem), deleted in finally
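import os
import subprocess
import tempfile
import yaml  # PyYAML
fd, path = tempfile.mkstemp(suffix=".yaml")
os.close(fd)
os.chmod(path, 0o600)  # mkstemp already uses 0600 on POSIX; kept as an explicit guard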
try:
    with open(path, "w") as f:
        yaml.safe_dump({"OPENAI_MODEL_NAME": openai_model,
                        "OPENAI_API_KEY": openai_key,
                        "OPENAI_API_BASE": openai_base}, f)
    subprocess.run([... 'gcloud','run','deploy', ..., '--env-vars-file', path], check=True)
finally:
    os.remove(path)

(Separately worth revisiting: --allow-unauthenticated at :156 deploys a public service by default — out of scope here, but it pairs poorly with leaking the key.)

Validation

  • deploy.py:140 resolves openai_key = ep.api_key or 'Enter your API key'.
  • deploy.py:157 embeds OPENAI_API_KEY={openai_key} into the --set-env-vars argv string, run by subprocess.run(cmd, check=True) at deploy.py:181.

Summary

| # | Gap | File(s) | Impact |
|---|-----|---------|--------|
| 1 | Magic-link cookie signed/verified with different secret sources; broken in default loopback setup (HTTP 500), HTTP↔WS divergence otherwise | gateway/server.py:340,432,648, gateway/cookie_auth.py:223-241 | Core auth feature fails out of the box; latent auth inconsistency |
| 2 | Sandbox timeout leaks remote temp file (no try/finally) and doesn't stop the remote process / docker container | sandbox/ssh.py:186,203,213,537, sandbox/docker.py:260,304-306 | Resource-limit/safety guarantee defeated on timeout; remote-state leak |
| 3 | Cloud deploy inlines OPENAI_API_KEY into gcloud --set-env-vars argv | deploy.py:140,157 | Credential exposed in process table / CI logs; comma-splitting corruption |

All three are scoped to src/praisonai/praisonai, validated against the current tree, and independent of the existing wrapper-audit issues (#1502, #1508, #1614, #1620, #1735, #1738).
