Wrapper security & lifecycle gaps: magic-link auth secret mismatch breaks default login, sandbox timeout leaks remote files + orphans workloads, and cloud deploy leaks the API key into gcloud argv

## Scope

In-depth analysis of `src/praisonai/praisonai` (the **wrapper** layer), focused only on key correctness/security gaps that violate the stated **"production-ready / safe by default"** pillar — **not** docs, tests, coverage, file sizes, or line counts.

These three were each read end-to-end on `claude/bold-bohr-pDWTv` and every line number checked against the current tree. They are deliberately chosen to **not overlap** with the existing wrapper audits:

- The magic-link issues #1502 and #1508 are *feature proposals*; neither flags the env-vs-config secret-source bug below (#1502 even treats the `~/.praisonai/.env` token persistence as a working precondition).
- #1614 covers sandbox *duplication* and a *different* docker bug (blocking `is_available()` on the loop); it does not touch the timeout cleanup leak.
- #1620 cites `deploy.py` only for `OPENAI_API_BASE` env-var precedence; the key-in-argv exposure is not raised anywhere.

Each finding includes a concrete fix. Happy to send PRs.

---

## 1) Gateway magic-link / cookie auth signs and verifies with *different* secrets — magic-link login is broken in the default setup, and HTTP vs WebSocket auth can silently diverge

### Where
`src/praisonai/praisonai/gateway/server.py` and `src/praisonai/praisonai/gateway/cookie_auth.py`

Three cookie-auth call sites resolve the signing secret from **two different sources**:

```python
# server.py:339-340  (HTTP cookie VERIFY)        -> secret from os.environ
from .cookie_auth import create_auth_manager_from_env
auth_manager = create_auth_manager_from_env()

# server.py:648      (magic-link MINT)            -> secret from os.environ
auth_manager = create_auth_manager_from_env()
if not auth_manager:
    return JSONResponse({"error": "Cookie authentication not available"}, status_code=500)

# server.py:431-432  (WebSocket cookie VERIFY)    -> secret from self.config.auth_token
from .cookie_auth import CookieAuthManager
auth_manager = CookieAuthManager(secret_key=self.config.auth_token)
```

`create_auth_manager_from_env()` reads the secret **only** from the process environment and returns `None` when neither var is set:

```python
# cookie_auth.py:223-241
def create_auth_manager_from_env() -> Optional[CookieAuthManager]:
    """Looks for GATEWAY_AUTH_TOKEN or PRAISONAI_SECRET_KEY."""
    import os
    secret = (os.environ.get("GATEWAY_AUTH_TOKEN")
              or os.environ.get("PRAISONAI_SECRET_KEY"))
    if not secret:
        return None
    return CookieAuthManager(secret_key=secret)
```

But the default token-resolution path generates the token and writes it to a **file**, never to `os.environ`:

```python
# server.py:210-236
if hasattr(self.config, 'auth_token') and not self.config.auth_token:
    env_tok = os.environ.get("GATEWAY_AUTH_TOKEN", "").strip()
    if env_tok:
        self.config.auth_token = env_tok          # env case: config == env  (OK)
    else:
        if is_loopback(self.config.bind_host):
            self.config.auth_token = secrets.token_hex(16)   # generated...
            ...
            save_auth_token_to_env(self.config.auth_token)   # ...saved to ~/.praisonai/.env, NOT os.environ
```

### The bug — two concrete failure modes

**(a) Default loopback setup → magic-link login is completely broken.** No `GATEWAY_AUTH_TOKEN` env var, no config token → `self.config.auth_token` is auto-generated and persisted to `~/.praisonai/.env`. `load_dotenv()` does not re-read that file into the running process, so `os.environ["GATEWAY_AUTH_TOKEN"]` stays unset. When the user opens their minted magic link, the mint handler calls `create_auth_manager_from_env()` → `None` → **HTTP 500 "Cookie authentication not available"** (server.py:649-653). The headline one-click login flow fails out of the box.

**(b) Token set in config + a *different* `GATEWAY_AUTH_TOKEN` env value → HTTP works, WebSocket silently fails.** If `config.auth_token = X` (so the block at :210 is skipped) while the env holds `Y`, cookies are **minted and HTTP-verified with `Y`** (:648 / :340) but **WebSocket-verified with `X`** (:432). A browser that authenticated over HTTP then fails every WebSocket cookie check and is silently downgraded to the deprecated `?token=` query-param path (or rejected at :446).

This is a single-source-of-truth violation on a security primitive: the secret that signs a session cookie must be the same one that verifies it on every transport.
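
To make failure mode (b) concrete, here is a minimal standard-library sketch. `sign_cookie` and `verify_cookie` are hypothetical stand-ins, not the real `CookieAuthManager` API, and the secret strings are made up. The only point is that an HMAC signed under one secret never verifies under another.

```python
import hashlib
import hmac


def sign_cookie(payload: str, secret: str) -> str:
    """Return 'payload.signature' using HMAC-SHA256 (stand-in for the mint path)."""
    sig = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def verify_cookie(cookie: str, secret: str) -> bool:
    """Recompute the signature with `secret` and compare in constant time."""
    payload, _, sig = cookie.rpartition(".")
    expected = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


env_secret = "Y-from-GATEWAY_AUTH_TOKEN"    # what mint and HTTP verify would use
config_secret = "X-from-config.auth_token"  # what WebSocket verify would use

cookie = sign_cookie("session=abc123", env_secret)
http_ok = verify_cookie(cookie, env_secret)    # same secret as the mint step
ws_ok = verify_cookie(cookie, config_secret)   # different secret, so this is rejected
```

With a single resolved secret, both checks would agree by construction.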

### How to fix
Resolve the secret **once** and use it everywhere. Simplest: export the resolved token so the env helper and the WS path agree, right after token resolution in `__init__`:

```python
# server.py, immediately after the token-resolution block (~:236)
if self.config.auth_token:
    # Single source of truth: the env helper used by mint/HTTP-verify must see the same secret as the WS path.
    os.environ.setdefault("GATEWAY_AUTH_TOKEN", self.config.auth_token)
```

Better: drop `create_auth_manager_from_env()` from the request paths entirely and have all three sites build the manager from one resolved value:

```python
# one helper on the gateway
def _cookie_auth_manager(self) -> Optional[CookieAuthManager]:
    return CookieAuthManager(secret_key=self.config.auth_token) if self.config.auth_token else None
# use self._cookie_auth_manager() at :340, :432, and :648
```

The mint path at :648 should never be able to return 500 when `self.config.auth_token` is set.

### Validation
- `server.py:340` and `:648` both call `create_auth_manager_from_env()`; `:432` constructs `CookieAuthManager(secret_key=self.config.auth_token)` — two different secret sources for the same cookie.
- `cookie_auth.py:233-239` reads only `os.environ` and returns `None` when unset.
- `server.py:225-234` generates the token and calls `save_auth_token_to_env(...)` (file write) without ever setting `os.environ["GATEWAY_AUTH_TOKEN"]` → `create_auth_manager_from_env()` returns `None` in the default loopback case.
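
Failure mode (a) can be reproduced in isolation. The sketch below assumes nothing about the gateway beyond what is quoted above: `secret_from_env` mimics the env-only lookup, `persist_token` mimics a file-only save, and a plain dict stands in for `os.environ` so the demo does not touch the real process environment. All three names are hypothetical.

```python
import os
import tempfile
from typing import Dict, Mapping, Optional


def secret_from_env(environ: Mapping[str, str]) -> Optional[str]:
    """Env-only lookup: returns None when neither variable is set."""
    return environ.get("GATEWAY_AUTH_TOKEN") or environ.get("PRAISONAI_SECRET_KEY") or None


def persist_token(path: str, token: str) -> None:
    """File-only save: writes KEY=value to disk and leaves the process env alone."""
    with open(path, "w") as f:
        f.write(f"GATEWAY_AUTH_TOKEN={token}\n")


process_env: Dict[str, str] = {}   # stands in for os.environ with no token exported
config_token = "a" * 32            # stands in for the auto-generated token

with tempfile.TemporaryDirectory() as tmp:
    env_file = os.path.join(tmp, ".env")
    persist_token(env_file, config_token)
    with open(env_file) as f:
        file_has_token = config_token in f.read()

before_fix = secret_from_env(process_env)   # None: the mint path would answer HTTP 500
process_env.setdefault("GATEWAY_AUTH_TOKEN", config_token)   # the one-line export
after_fix = secret_from_env(process_env)    # now matches the config token
```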

---

## 2) Sandbox execution timeout leaks a remote temp file on every failed run, and does not actually stop the remote/container workload — the resource-limit guarantee is defeated on the timeout path

### Where
`src/praisonai/praisonai/sandbox/ssh.py` and `src/praisonai/praisonai/sandbox/docker.py`

### 2a. SSH backend leaks the remote temp file and orphans the remote process

```python
# ssh.py:168-222  (execute)
try:
    remote_file = f"{self.working_dir}/exec_{execution_id}.{self._get_file_extension(language)}"
    await self.write_file(remote_file, code)                 # remote file created
    command = self._build_command(language, remote_file, limits, env)
    result = await self._run_command_with_limits(command, limits, working_dir or self.working_dir)  # can raise
    await self._connection.run(f"rm -f {shlex.quote(remote_file)}")   # :186 — cleanup, INSIDE try, AFTER the run
    ...
except asyncio.TimeoutError:        # :203 — cleanup skipped
    return SandboxResult(... status=TIMEOUT ...)
except Exception as e:              # :213 — cleanup skipped
    return SandboxResult(... status=FAILED ...)
```

There is no `try/finally`. The `rm -f` at line 186 only runs on the success path, so **every timed-out or failed execution permanently leaks `exec_<uuid>.<ext>` in the remote working dir**. Over time the remote host fills with orphaned files.

Compounding it, the timeout itself doesn't stop the remote work:

```python
# ssh.py:526-542  (_run_command_with_limits)
full_command = f"cd {shlex.quote(working_dir)} && {command}"   # no remote-side `timeout`
if timeout:
    result = await asyncio.wait_for(self._connection.run(full_command), timeout=timeout)
```

`asyncio.wait_for` cancels the **local** await; asyncssh does not guarantee the **remote** process dies. So the remote command (e.g. a runaway `python exec_….py`) keeps running after we report `TIMEOUT`.
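
The same effect is easy to observe locally. This sketch is only an analogue (a local child process instead of an asyncssh session, and a hypothetical helper name), but it shows the general pattern: timing out the await does not stop the work being awaited.

```python
import asyncio
import sys


async def run_with_local_timeout(timeout: float) -> dict:
    """Start a long-running child, time out the await, report whether the child survived."""
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)"
    )
    timed_out = False
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        timed_out = True
    # returncode stays None while the child is still running
    still_running = proc.returncode is None
    if still_running:
        # Explicit termination is what actually stops the workload
        proc.kill()
        await proc.wait()
    return {
        "timed_out": timed_out,
        "still_running_after_timeout": still_running,
        "exit_code_after_kill": proc.returncode,
    }


outcome = asyncio.run(run_with_local_timeout(0.2))
```

The child only exits once it is killed explicitly, which is the step the SSH path is missing on the remote side.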

### 2b. Docker backend kills the client, not the container

```python
# docker.py:259-275
docker_cmd = ["docker", "run", "--rm",            # no --name, not detached
              "--memory", f"{limits.memory_mb}m",
              "--cpus", str(limits.cpu_percent / 100), ...]
docker_cmd.extend([self._image, "sh", "-c", cmd_str])

# docker.py:304-306  (on timeout)
except asyncio.TimeoutError:
    proc.kill()          # kills the local `docker run` CLIENT
    await proc.wait()
```

Killing the `docker run` client does **not** stop the container — the daemon keeps it running, and `--rm` only removes it *after* it stops. A CPU/network-heavy task that times out therefore keeps consuming the configured `--cpus`/`--memory` indefinitely, exactly defeating the limit the sandbox exists to enforce.

### Why it matters
The sandbox is a *safety* component. "Timeout" silently failing to (a) clean up and (b) actually stop the workload turns a hard limit into a soft suggestion, and leaks remote state on every error.

### How to fix
**SSH** — move cleanup into `finally`, and bound the remote side so the process self-terminates:

```python
# ssh.py — execute()
remote_file = f"{self.working_dir}/exec_{execution_id}.{self._get_file_extension(language)}"
try:
    await self.write_file(remote_file, code)
    command = self._build_command(language, remote_file, limits, env)
    result = await self._run_command_with_limits(command, limits, working_dir or self.working_dir)
    ...
except asyncio.TimeoutError:
    ...
except Exception as e:
    ...
finally:
    try:
        await self._connection.run(f"rm -f {shlex.quote(remote_file)}")
    except Exception:
        pass  # never mask the real result/error with a cleanup failure

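# _run_command_with_limits: wrap the remote command so it self-kills.
# Needs `import math`. Round up: int() would turn 0.5 into 0, and GNU `timeout 0` disables the limit.
if timeout:
    remote_secs = max(1, math.ceil(timeout))
    full_command = f"cd {shlex.quote(working_dir)} && timeout {remote_secs} sh -c {shlex.quote(command)}"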
```
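
A runnable local analogue of the cleanup pattern, as a sketch only: the temp file lives in a local directory rather than on a remote host, `asyncio.sleep` stands in for the remote run, and the status strings are illustrative rather than the real `SandboxResult` values.

```python
import asyncio
import os
import tempfile
from typing import Tuple


async def execute(code: str, timeout: float, workdir: str) -> Tuple[str, bool]:
    """Write `code` to a file, 'run' it under a timeout, and always delete the file."""
    path = os.path.join(workdir, "exec_demo.py")
    try:
        with open(path, "w") as f:
            f.write(code)
        # Stand-in for the remote run: sleeps longer than the timeout allows
        await asyncio.wait_for(asyncio.sleep(5), timeout=timeout)
        status = "COMPLETED"
    except asyncio.TimeoutError:
        status = "TIMEOUT"
    except Exception:
        status = "FAILED"
    finally:
        # Runs on success, timeout, and failure alike
        try:
            os.remove(path)
        except OSError:
            pass  # a cleanup failure must not replace the real status
    return status, os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    status, leaked = asyncio.run(execute("print('hi')", timeout=0.05, workdir=tmp))
```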

**Docker** — give the container a name and force-stop it on timeout (or launch detached and `docker rm -f`):

```python
container_name = f"praisonai-{execution_id}"
docker_cmd = ["docker", "run", "--rm", "--name", container_name, ...]
...
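except asyncio.TimeoutError:
    with contextlib.suppress(Exception):  # needs `import contextlib`
        kill = await asyncio.create_subprocess_exec("docker", "kill", container_name)
        await kill.wait()
    with contextlib.suppress(ProcessLookupError):  # the client may already have exited
        proc.kill()
    await proc.wait()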
```
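
One way to keep the run and kill commands in sync is to build both from the same generated name. The helpers below are hypothetical (they are not functions in `docker.py`), and the image and command are illustrative; the sketch only constructs argv lists and does not invoke Docker.

```python
import uuid
from typing import List, Tuple


def build_run_argv(image: str, cmd_str: str, memory_mb: int, cpu_percent: int) -> Tuple[str, List[str]]:
    """Return (container_name, argv) for a named, auto-removed container."""
    name = f"praisonai-{uuid.uuid4().hex[:12]}"
    argv = [
        "docker", "run", "--rm",
        "--name", name,                    # lets the timeout path address the container
        "--memory", f"{memory_mb}m",
        "--cpus", str(cpu_percent / 100),
        image, "sh", "-c", cmd_str,
    ]
    return name, argv


def build_kill_argv(container_name: str) -> List[str]:
    """argv that stops the container itself, not just the local client."""
    return ["docker", "kill", container_name]


name, run_argv = build_run_argv("python:3.11-slim", "python -c 'while True: pass'", 512, 50)
kill_argv = build_kill_argv(name)
```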

### Validation
- `ssh.py:186` — `rm -f` is inside `try`, after `_run_command_with_limits`; `ssh.py:203` and `:213` `except` blocks have no cleanup.
- `ssh.py:537` — `asyncio.wait_for(self._connection.run(full_command), ...)`; `ssh.py:528` builds `cd … && {command}` with no remote `timeout`.
- `docker.py:260` — `docker run --rm` with no `--name`; `docker.py:305-306` only `proc.kill()` + `proc.wait()`; `grep "docker kill"`/`"docker rm"` in `docker.py` → none.

---

## 3) Cloud deploy passes the resolved `OPENAI_API_KEY` inline in the `gcloud` argv — credential exposed in the process table / CI logs

### Where
`src/praisonai/praisonai/deploy.py:136-157`

```python
# deploy.py:136-141
from praisonai.llm.env import resolve_llm_endpoint
ep = resolve_llm_endpoint()
openai_model = ep.model
openai_key = ep.api_key or 'Enter your API key'
openai_base = ep.base_url

# deploy.py:154-157
['gcloud', 'run', 'deploy', 'praisonai-service',
 '--image', f'us-central1-docker.pkg.dev/{project_id}/praisonai-repository/praisonai-app:latest',
 '--platform', 'managed', '--region', 'us-central1', '--allow-unauthenticated',
 '--set-env-vars', f'OPENAI_MODEL_NAME={openai_model},OPENAI_API_KEY={openai_key},OPENAI_API_BASE={openai_base}']
```

### The bug
The real `OPENAI_API_KEY` is interpolated directly into a command-line argument. Even though this is an argv list (so not a shell-injection vector), command-line arguments are:
- visible to any local user via `ps` / `/proc/<pid>/cmdline` while `gcloud` runs,
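- routinely captured verbatim in CI/CD logs, including the `CalledProcessError` message that `subprocess.run(..., check=True)` produces on failure, which prints the full command list,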
- baked into the Cloud Run service config in plaintext.

For a package whose philosophy is "safe by default," shipping the user's LLM credential through argv is a real exposure. Secondary fragility: `--set-env-vars` is comma-delimited, so any value containing a comma (a base URL or a model alias) silently corrupts the env mapping.
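
The comma problem can be illustrated with a deliberately simplified model of comma-delimited `KEY=VALUE` parsing. This is not gcloud's actual parser, which may reject such input instead of mis-splitting it; the function name, model name, and base URL are all made up for the example.

```python
def naive_parse_env_vars(spec: str) -> dict:
    """Split on commas, then on the first '=' (a simplified model, not gcloud's parser)."""
    result = {}
    for item in spec.split(","):
        key, _, value = item.partition("=")
        result[key] = value
    return result


model = "gpt-4o"
base = "https://proxy.example/v1?tags=a,b"   # hypothetical base URL containing a comma
spec = f"OPENAI_MODEL_NAME={model},OPENAI_API_BASE={base}"
parsed = naive_parse_env_vars(spec)
# The base URL is cut at the comma and a stray "b" key appears
```

Under this model the value is truncated at the comma and the remainder becomes its own bogus entry, which is why a file-based mapping is the safer interface.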

### How to fix
Pass secrets via a file or Secret Manager, never inline in argv:

```python
# Option A — Secret Manager (preferred for managed services)
#   echo -n "$OPENAI_API_KEY" | gcloud secrets create praisonai-openai-key --data-file=-
['gcloud', 'run', 'deploy', 'praisonai-service', ...,
 '--set-env-vars', f'OPENAI_MODEL_NAME={openai_model},OPENAI_API_BASE={openai_base}',
 '--set-secrets', 'OPENAI_API_KEY=praisonai-openai-key:latest']

# Option B — env-vars file (avoids argv + the comma-splitting problem), deleted in finally
import os, subprocess, tempfile, yaml
fd, path = tempfile.mkstemp(suffix=".yaml")  # mkstemp already creates the file with mode 0o600
os.close(fd)
try:
    with open(path, "w") as f:
        yaml.safe_dump({"OPENAI_MODEL_NAME": openai_model,
                        "OPENAI_API_KEY": openai_key,
                        "OPENAI_API_BASE": openai_base}, f)
    subprocess.run([... 'gcloud','run','deploy', ..., '--env-vars-file', path], check=True)
finally:
    os.remove(path)
```
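
Option B can be wrapped so the file's lifetime is tied to the deploy call. This is a sketch under stated assumptions: `env_vars_file` and `build_deploy_argv` are hypothetical helpers, the image name and key are placeholders, and the mapping is written as a JSON object on the assumption that the YAML loader behind `--env-vars-file` accepts it (worth verifying before relying on it). The sketch builds the argv but does not run gcloud.

```python
import contextlib
import json
import os
import stat
import tempfile
from typing import Dict, Iterator, List


@contextlib.contextmanager
def env_vars_file(env: Dict[str, str]) -> Iterator[str]:
    """Yield the path of an owner-only temp file holding `env`; delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=".yaml")   # created readable/writable by owner only
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(env, f)   # JSON object syntax, assumed acceptable to a YAML loader
        yield path
    finally:
        os.remove(path)


def build_deploy_argv(image: str, env_file: str) -> List[str]:
    """gcloud argv that references the file instead of embedding values."""
    return ["gcloud", "run", "deploy", "praisonai-service",
            "--image", image, "--platform", "managed", "--region", "us-central1",
            "--env-vars-file", env_file]


secret = "sk-example-not-a-real-key"
with env_vars_file({"OPENAI_API_KEY": secret}) as path:
    argv = build_deploy_argv("example-image:latest", path)
    key_in_argv = any(secret in arg for arg in argv)
    mode = stat.S_IMODE(os.stat(path).st_mode)
file_removed = not os.path.exists(path)
```

The key never appears in any argv element, and the file is removed even if the deploy step raises.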

(Separately worth revisiting: `--allow-unauthenticated` at :156 deploys a public service by default — out of scope here, but it pairs poorly with leaking the key.)

### Validation
- `deploy.py:140` resolves `openai_key = ep.api_key or 'Enter your API key'`.
- `deploy.py:157` embeds `OPENAI_API_KEY={openai_key}` into the `--set-env-vars` argv string, run by `subprocess.run(cmd, check=True)` at `deploy.py:181`.

---

## Summary

| # | Gap | File(s) | Impact |
|---|-----|---------|--------|
| 1 | Magic-link cookie signed/verified with different secret sources; broken in default loopback setup (HTTP 500), HTTP↔WS divergence otherwise | `gateway/server.py:340,432,648`, `gateway/cookie_auth.py:223-241` | Core auth feature fails out of the box; latent auth inconsistency |
| 2 | Sandbox timeout leaks remote temp file (no `try/finally`) and doesn't stop the remote process / docker container | `sandbox/ssh.py:186,203,213,537`, `sandbox/docker.py:260,304-306` | Resource-limit/safety guarantee defeated on timeout; remote-state leak |
| 3 | Cloud deploy inlines `OPENAI_API_KEY` into `gcloud --set-env-vars` argv | `deploy.py:140,157` | Credential exposed in process table / CI logs; comma-splitting corruption |

All three are scoped to `src/praisonai/praisonai`, validated against the current tree, and independent of the existing wrapper-audit issues (#1502, #1508, #1614, #1620, #1735, #1738).
