server: OpenAI API and llama.cpp Web UI compatibility#121
Conversation
- OpenAI-style /v1/models, /v1/chat/completions (SSE), error bodies, CORS - GET /props and enriched model rows for ggml Web UI - Session file persistence and optional TILES_BOOTSTRAP_* startup load - POST /models/load when TILES_MODEL_CACHE_PATH is set (Web UI load) - Chat SSE: timings (Web UI) and metrics (Tiles CLI memory mode) - Pytest coverage; TILES_SKIP_SESSION_PERSIST in tests - just webui-llamacpp via scripts/phase2_llamacpp_webui.sh Made-with: Cursor
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 16 minutes and 53 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (10)
📝 WalkthroughWalkthroughThe PR introduces OpenAI-compatible API endpoints and llama.cpp Web UI integration to the Tiles server. Key additions include session persistence (saving/loading model state across restarts), new endpoints for health checks ( Sequence Diagram(s)sequenceDiagram
participant Client
participant API as Server API
participant SessionStore as Session Persist
participant Backend
participant Cache as Model Cache
Client->>API: POST /v1/chat/completions<br/>(with/without model, messages, input)
alt No model in request
API->>API: Use _loaded_model_id
alt No loaded model
API-->>Client: HTTP 400<br/>(require /start or model param)
end
end
alt input field present (memory-agent mode)
API->>Backend: generate_chat_stream<br/>(messages with tool result appended)
else input field absent (OpenAI-compat mode)
API->>API: _openai_compat_effective_messages<br/>(flatten content arrays,<br/>optionally prepend persisted<br/>system prompt)
alt stream=true
API->>Backend: generate_chat_stream<br/>(effective messages)
Backend-->>API: SSE stream chunks
else stream=false
API->>Backend: complete_openai_chat_completion<br/>(effective messages)
Backend-->>API: chat.completion object
end
end
Backend-->>API: completion response<br/>(with metrics/timings)
API->>SessionStore: save(model, cache_path,<br/>memory_path, system_prompt)
API-->>Client: SSE stream or JSON response
sequenceDiagram
participant Startup as Server Startup
participant SessionStore as Session Persist
participant API as API Lifespan
participant EnvVars as Environment
participant Backend
Startup->>API: FastAPI(lifespan=_lifespan)
activate API
API->>SessionStore: load()
alt Session file exists & valid
SessionStore-->>API: SessionData<br/>(model, cache_path,<br/>memory_path, system_prompt)
API->>API: _apply_loaded_session<br/>(restore from persisted data)
else Session file missing/invalid/<br/>persistence disabled
API->>EnvVars: read TILES_BOOTSTRAP_*<br/>environment variables
API->>Backend: get_or_load_model<br/>(from TILES_MODEL_CACHE_PATH)
Backend-->>API: loaded model
API->>API: _apply_loaded_session<br/>(bootstrap mode)
end
API->>API: Record global state<br/>(_loaded_model_id,<br/>_loaded_model_cache_path,<br/>_model_loaded_at)
deactivate API
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 11
🧹 Nitpick comments (1)
server/session_persist.py (1)
38-40: Use a unique temp file name to avoid cross-writer collisions.Using a fixed
*.tmppath can race if two saves happen concurrently (or multiple processes share the same session file). A unique temp file per write makes this robust.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/session_persist.py` around lines 38 - 40, The current atomic write uses a fixed tmp path (tmp = p.parent / f"{p.name}.tmp") which can collide across concurrent writers; change it to create a unique temp filename in the same directory (e.g. use tempfile.NamedTemporaryFile(delete=False, dir=p.parent, prefix=p.name + ".", suffix=".tmp") or generate a UUID via uuid.uuid4().hex and build tmp = p.parent / f"{p.name}.{uuid}.tmp"), write the text to that unique tmp, close it, then call os.replace(tmp, p) as before; ensure you add the appropriate import (tempfile or uuid) and clean up the temp file on error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/phase2_llamacpp_webui.sh`:
- Around line 28-32: After running the in-place perl replacement (perl -pi -e)
to swap http://localhost:8080 for ${BACKEND} in the file referenced by VITE_CFG,
immediately verify the substitution succeeded by searching the updated file for
the expected BACKEND string (e.g., grep -qF "$BACKEND" "$VITE_CFG"); if the
check fails, restore the backup ("${VITE_CFG}.tiles.bak"), print a clear error
mentioning BACKEND and VITE_CFG, and exit non‑zero so the script fails fast
instead of silently continuing.
In `@server/api.py`:
- Around line 162-168: The CORS middleware is currently wide open via
app.add_middleware(CORSMiddleware, allow_origins=["*"], ...) — change it to be
opt-in by reading an environment variable or dev flag (e.g., CORS_ALLOWLIST or
ENABLE_DEV_CORS) and use that to populate allow_origins instead of "*" (parse a
comma-separated allowlist or treat a dev boolean as enabling "*"); keep a safe
default of an empty list or no CORS when the env/flag is unset, and ensure
app.add_middleware(...) uses the computed allow_origins value (and document the
env var/flag).
- Around line 195-212: In the models_load async route, avoid calling the
blocking _apply_loaded_session(mid, cache, MEMORY_PATH, sysp) directly; instead
call it via asyncio.to_thread and await the result (e.g., await
asyncio.to_thread(_apply_loaded_session, mid, cache, MEMORY_PATH, sysp)) so the
synchronous runtime.backend.get_or_load_model call runs off the event loop and
doesn't block other requests; update the models_load handler to import/ use
asyncio accordingly and preserve existing error handling/return values after
awaiting the threaded call.
- Around line 78-80: The early return in _restore_session_on_startup caused by
session_persist.skip_persist() prevents the TILES_BOOTSTRAP_* env handling from
running; change the function so bootstrap env processing runs regardless of
skip_persist (either move the TILES_BOOTSTRAP_* branch before the return or
check for those env vars and execute bootstrap logic even when
session_persist.skip_persist() is true), referencing _restore_session_on_startup
and session_persist.skip_persist and ensuring any bootstrap code that reads
TILES_BOOTSTRAP_* is executed unconditionally.
- Around line 260-273: The session persistence write (session_persist.save) is
currently causing a 500 when it fails even though _apply_loaded_session has
already made the model live; wrap the call to session_persist.save in a
try/except/try-catch so persistence errors are handled as best-effort (log the
exception and continue) rather than re-raising, and apply the same pattern for
the analogous save in the /models/load path; ensure you use the module logger
(or session_persist logger) to record the error and do not change the successful
response path when save fails.
In `@server/backend/mlx.py`:
- Around line 233-244: The non-streaming completion can return stop text and
misses sequences spanning token boundaries; after joining parts into text
(variable parts -> text) and before constructing out (out: dict[...] = {...}),
normalize request.stop into stop_sequences (list), search for the earliest
occurrence of any stop sequence in text, truncate text up to that index (text =
text[:idx]) if found, and use this trimmed text when building the assistant
content in out; also remove the per-token containment check that attempted to
stop on individual tokens and ensure the same stop trimming logic is applied to
any returned fields derived from text.
In `@server/llamacpp_webui_compat.py`:
- Around line 76-87: The code in _read_n_ctx assumes cfg is a mapping and calls
cfg.get(...) which raises AttributeError for valid JSON that is not an object;
guard against that by verifying cfg is a mapping (e.g., isinstance(cfg, dict) or
collections.abc.Mapping) before iterating the keys and calling .get on it, and
if it's not a mapping skip to the existing fallback/error handling so malformed
but syntactically-valid JSON cannot crash the function.
In `@server/schemas.py`:
- Around line 84-90: normalize_max_tokens currently only maps -1 to None and
lets other negative values pass through; update the validator (the
`@field_validator`("max_tokens", mode="before") method normalize_max_tokens) to
explicitly reject any negative value other than -1 by raising a ValueError
(e.g., "max_tokens must be -1 or a non-negative integer") so Pydantic will
return a 422 instead of forwarding malformed negatives to the backend.
In `@server/session_persist.py`:
- Around line 39-40: The temporary file created for session persistence is
written with default permissions (tmp.write_text) and then atomically replaced
(os.replace(p)), risking group/world read access; after writing the content to
the temp Path (the variable tmp used with tmp.write_text) but before calling
os.replace(tmp, p), set restrictive owner-only permissions (e.g.,
tmp.chmod(0o600)) so the file is readable/writable only by the owner and then
perform the os.replace; this preserves atomic replacement while ensuring the
persisted session (including system_prompt) is not group/world-readable.
In `@server/tests/conftest.py`:
- Around line 17-27: The reset_api_state fixture currently leaves
api._current_model_path and api._default_max_tokens unchanged, risking
cross-test leaks; update the fixture (the setup/teardown surrounding yield in
reset_api_state) to also set api._current_model_path = None (or empty string per
API convention) and api._default_max_tokens = <default_value> (use the default
from server/api.py) both before yielding and after yield so the global state is
fully cleared between tests.
In `@server/tests/test_openai_compat.py`:
- Around line 8-10: test_v1_models_empty (and other tests) construct
TestClient(app) directly which bypasses FastAPI lifespan hooks; change
instantiation to use a context manager (with TestClient(app) as client:) so the
app's startup/shutdown handlers run and session restore/bootstrap code exercise;
update tests referencing TestClient(app) (e.g., test_v1_models_empty) to the
context-managed form or refactor into a shared fixture that yields a
context-managed client.
---
Nitpick comments:
In `@server/session_persist.py`:
- Around line 38-40: The current atomic write uses a fixed tmp path (tmp =
p.parent / f"{p.name}.tmp") which can collide across concurrent writers; change
it to create a unique temp filename in the same directory (e.g. use
tempfile.NamedTemporaryFile(delete=False, dir=p.parent, prefix=p.name + ".",
suffix=".tmp") or generate a UUID via uuid.uuid4().hex and build tmp = p.parent
/ f"{p.name}.{uuid}.tmp"), write the text to that unique tmp, close it, then
call os.replace(tmp, p) as before; ensure you add the appropriate import
(tempfile or uuid) and clean up the temp file on error.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 3569692a-63c8-4a06-9841-a8ae4ce55464
⛔ Files ignored due to path filters (1)
server/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
justfilescripts/phase2_llamacpp_webui.shserver/api.pyserver/backend/mlx.pyserver/llamacpp_webui_compat.pyserver/main.pyserver/pyproject.tomlserver/schemas.pyserver/session_persist.pyserver/tests/conftest.pyserver/tests/test_openai_compat.py
| if ! grep -qF "$BACKEND" "$VITE_CFG" 2>/dev/null; then | ||
| echo "Patching $VITE_CFG to proxy API routes to $BACKEND" | ||
| cp "$VITE_CFG" "${VITE_CFG}.tiles.bak" | ||
| perl -pi -e "s|http://localhost:8080|${BACKEND}|g" "$VITE_CFG" | ||
| fi |
There was a problem hiding this comment.
Backend reconfiguration can silently fail after the first patch.
If vite.config.ts already contains a non-default backend URL, Line 31 won’t match http://localhost:8080, so switching TILES_BACKEND later may no-op while still printing “Patching…”. Add a post-patch verification (and fail fast) so misconfiguration is visible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/phase2_llamacpp_webui.sh` around lines 28 - 32, After running the
in-place perl replacement (perl -pi -e) to swap http://localhost:8080 for
${BACKEND} in the file referenced by VITE_CFG, immediately verify the
substitution succeeded by searching the updated file for the expected BACKEND
string (e.g., grep -qF "$BACKEND" "$VITE_CFG"); if the check fails, restore the
backup ("${VITE_CFG}.tiles.bak"), print a clear error mentioning BACKEND and
VITE_CFG, and exit non‑zero so the script fails fast instead of silently
continuing.
| async def _restore_session_on_startup() -> None: | ||
| if session_persist.skip_persist(): | ||
| return |
There was a problem hiding this comment.
Don't let TILES_SKIP_SESSION_PERSIST disable env bootstrap.
This early return skips the TILES_BOOTSTRAP_* branch entirely. In a stateless/test setup that intentionally disables session file I/O, the server will no longer auto-load the bootstrap model even when the bootstrap env vars are present.
🐛 Proposed fix
async def _restore_session_on_startup() -> None:
- if session_persist.skip_persist():
- return
-
- data = session_persist.load()
+ data = None if session_persist.skip_persist() else session_persist.load()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/api.py` around lines 78 - 80, The early return in
_restore_session_on_startup caused by session_persist.skip_persist() prevents
the TILES_BOOTSTRAP_* env handling from running; change the function so
bootstrap env processing runs regardless of skip_persist (either move the
TILES_BOOTSTRAP_* branch before the return or check for those env vars and
execute bootstrap logic even when session_persist.skip_persist() is true),
referencing _restore_session_on_startup and session_persist.skip_persist and
ensuring any bootstrap code that reads TILES_BOOTSTRAP_* is executed
unconditionally.
| tmp.write_text(text, encoding="utf-8") | ||
| os.replace(tmp, p) |
There was a problem hiding this comment.
Persisted session file can be created with overly broad read permissions.
Line 39 writes sensitive session content (including system_prompt) using default umask-derived permissions. On common systems this can become group/world-readable. Restrict to owner-only before the replace.
🔒 Proposed fix
tmp = p.parent / f"{p.name}.tmp"
tmp.write_text(text, encoding="utf-8")
+ os.chmod(tmp, 0o600)
os.replace(tmp, p)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tmp.write_text(text, encoding="utf-8") | |
| os.replace(tmp, p) | |
| tmp = p.parent / f"{p.name}.tmp" | |
| tmp.write_text(text, encoding="utf-8") | |
| os.chmod(tmp, 0o600) | |
| os.replace(tmp, p) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/session_persist.py` around lines 39 - 40, The temporary file created
for session persistence is written with default permissions (tmp.write_text) and
then atomically replaced (os.replace(p)), risking group/world read access; after
writing the content to the temp Path (the variable tmp used with tmp.write_text)
but before calling os.replace(tmp, p), set restrictive owner-only permissions
(e.g., tmp.chmod(0o600)) so the file is readable/writable only by the owner and
then perform the os.replace; this preserves atomic replacement while ensuring
the persisted session (including system_prompt) is not group/world-readable.
| def test_v1_models_empty(): | ||
| client = TestClient(app) | ||
| r = client.get("/v1/models") |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
🌐 Web query:
In FastAPI/Starlette, does TestClient(app)run lifespan startup/shutdown handlers automatically, or iswith TestClient(app) as client: required?
💡 Result:
In FastAPI/Starlette, TestClient(app) alone does not automatically run lifespan startup/shutdown handlers. The with TestClient(app) as client: context manager is required to trigger the lifespan handlers. Instantiating TestClient(app) without the with block skips the lifespan execution, while entering the with block runs startup code and exiting runs shutdown code.
Citations:
- 1: https://fastapi.tiangolo.com/advanced/testing-events/
- 2: https://starlette.dev/testclient/
- 3: https://starlette.dev/lifespan/
- 4: Mention the difference between
with TestClient(app) as clientandclient = TestClient(app)Kludex/starlette#1733
Use context manager to exercise the lifespan startup/shutdown.
These tests instantiate TestClient(app) directly, which skips the FastAPI(lifespan=...) startup and shutdown handlers that the PR added. Tests should use with TestClient(app) as client: to properly exercise session restore and bootstrap behavior. Consider refactoring to a shared fixture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@server/tests/test_openai_compat.py` around lines 8 - 10, test_v1_models_empty
(and other tests) construct TestClient(app) directly which bypasses FastAPI
lifespan hooks; change instantiation to use a context manager (with
TestClient(app) as client:) so the app's startup/shutdown handlers run and
session restore/bootstrap code exercise; update tests referencing
TestClient(app) (e.g., test_v1_models_empty) to the context-managed form or
refactor into a shared fixture that yields a context-managed client.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
- Gate CORS behind TILES_CORS_ORIGINS or TILES_CORS_ALLOW_ALL=1 - Run model load via asyncio.to_thread on /start and /models/load - Best-effort session_persist.save after successful load (OSError logged) - Harden _read_n_ctx for non-dict config.json; reject max_tokens < 0 except -1 - Reset _current_model_path and _default_max_tokens in test fixture - phase2: resolve llama.cpp via LLAMA_CPP_ROOT, third_party, or sibling clone - HACKING: optional submodule + Web UI section; add webui compat tests Made-with: Cursor
…:8080 Made-with: Cursor
Pull request: OpenAI-compatible HTTP API + llama.cpp Web UI hooks
Summary
Extends the Tiles Python inference server so OpenAI-style clients and the ggml-org llama.cpp SvelteKit Web UI can talk to Tiles (MLX backend) over HTTP, while keeping the existing Tiles CLI flow (
POST /start, memory-mode chat) intact.Motivation
llama-server; the UI expectsGET /props, enriched/v1/modelsrows, optionalPOST /models/load, OpenAI chat SSE, andtimingson the final stream chunk.metricsfor the Rust CLI’s memory-mode benchmark line./v1/*improve scripting and UI error dialogs.What changed
API (
server/api.py,server/main.py)~/.tiles/server_session.jsonif present; optionalTILES_BOOTSTRAP_MODEL+TILES_BOOTSTRAP_MODEL_CACHE_PATH(and optional memory/system prompt env) to load without a prior/start.POST /start: After successful load, persist session; shared_apply_loaded_sessionused by/start, restore, bootstrap, and/models/load.POST /models/load: For the Web UI’s{"model": "<id>"}body; MLX directory fromTILES_MODEL_CACHE_PATHorTILES_BOOTSTRAP_MODEL_CACHE_PATH.GET /props,GET /v1/models,GET /health/GET /v1/health, router stubsPOST /models/load(implemented) /POST /models/unload(stub).POST /v1/chat/completions: Resolvedmodel, optionalinputmemory path, OpenAI-shaped 422/400/500 for/v1/*, CORS for browser dev.Compatibility helpers
server/llamacpp_webui_compat.py:/props-shaped JSON and/v1/modelsentries withpath,status,in_cacheexpected by the Web UI.Session persistence
server/session_persist.py: JSON file under~/.tiles/server_session.json(override withTILES_SESSION_FILE); disabled in tests viaTILES_SKIP_SESSION_PERSIST=1. Atomic write (temp + replace).Streaming / metrics (
server/backend/mlx.py)timings(predicted_n,predicted_ms) for llama Web UI token/s UI.metrics(ttft_ms,total_tokens,tokens_per_second,total_latency_s) for Tiles CLI memory mode.Schemas (
server/schemas.py)ChatCompletionRequest/ChatMessageadjustments for Web UI (optionalmodel, multimodalcontentlist,extra="ignore",max_tokens: -1→ unlimited).RouterModelsLoadBodyfor/models/load.Tests
server/tests/test_openai_compat.py,server/tests/conftest.py: Cover health, props, models, chat stream, errors,/models/loadwith env path.just test-server:uv run --project server --with pytest pytest server/tests/so pytest is not added touv.lock(keeps lockfile identical tomain).Dev script
scripts/phase2_llamacpp_webui.sh+just webui-llamacpp: Clone/patch llama.cpp Web UI Vite proxy toTILES_BACKEND(defaulthttp://127.0.0.1:6969),npm run dev.How to test
Configuration reference
TILES_SESSION_FILETILES_SKIP_SESSION_PERSIST=1TILES_BOOTSTRAP_MODEL/TILES_BOOTSTRAP_MODEL_CACHE_PATHTILES_BOOTSTRAP_MEMORY_PATH/TILES_BOOTSTRAP_SYSTEM_PROMPTTILES_MODEL_CACHE_PATHPOST /models/load