Migrate graph backend to LadybugDB#4
Conversation
|
Warning Review limit reached
More reviews will be available in 9 minutes and 36 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR migrates OpenKL's graph database backend from the archived Kuzu DB to LadybugDB. The database layer, query processing, vector operations, and type annotations are updated throughout the codebase. Legacy Kuzu data detection and migration guidance are introduced, and tests verify the new initialization and vector search workflows. ChangesKuzu to LadybugDB Migration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
KuzuDB has been archived, and its extension server dependency makes OpenKL's graph and vector-search path brittle for fresh installs. LadybugDB is the active successor and preserves the embedded Cypher/vector-index API shape that OpenKL relies on, so this migrates the runtime backend while keeping the current CLI and graph schema behavior intact. Replace the runtime dependency and lockfile entry from kuzu to ladybug, update database initialization to create new graph stores under ~/.ok/ladybug, and add a legacy ~/.ok/kuzu warning instead of attempting implicit in-place reuse of old derived graph state. This keeps old Kuzu data untouched and makes the migration explicit for users. Update the doctor command and graph/vector-search modules for LadybugDB naming, tighten the type boundary around the migrated db/graph/vector modules, and keep the existing CREATE_VECTOR_INDEX / QUERY_VECTOR_INDEX path covered. The migration also includes smoke tests for schema creation, vector search, and legacy-path warning behavior. Refresh README and RFC references so the documented architecture points to LadybugDB and the new ~/.ok/ladybug location. Verification run before commit: - uv run pytest -q - uv run ruff check . && uv run ruff format --check . - uv run mypy openkl/db.py openkl/graph.py openkl/vector_search.py - isolated HOME CLI smoke covering ok doctor, mem add/search, graph cypher, and graph list-indexes
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
openkl/memory.py (1)
185-196:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse query parameters instead of interpolating escaped text.
At Line 186-187, manual escaping still leaves the update path vulnerable to Cypher injection and edge-case breakage. Build the
SETclause with parameters and pass values viaconn.execute(..., params).Suggested fix
# Build update query updates = [] + params: dict[str, Any] = {"memory_id": memory_id} if text is not None: - escaped_text = text.replace("'", "\\'") - updates.append(f"m.text = '{escaped_text}'") + updates.append("m.text = $text") + params["text"] = text if tags is not None: - tags_str = "[" + ", ".join([f"'{tag}'" for tag in tags]) + "]" - updates.append(f"m.tags = {tags_str}") + updates.append("m.tags = $tags") + params["tags"] = tags if updates: - update_query = ( - f"MATCH (m:MemoryNote {{id: '{memory_id}'}}) SET {', '.join(updates)}" - ) - conn.execute(update_query) + update_query = f"MATCH (m:MemoryNote {{id: $memory_id}}) SET {', '.join(updates)}" + conn.execute(update_query, params)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openkl/memory.py` around lines 185 - 196, The current update builds a Cypher string by interpolating escaped values (variables: memory_id, escaped_text, tags, updates, update_query) and calls conn.execute, which is vulnerable; instead construct the SET clause using parameter placeholders (e.g., m.text = $text_param, m.tags = $tags_param, m.id = $id_param) and collect a params dict with keys like text_param, tags_param, id_param (or similar), convert tags to a list value rather than stringifying, then call conn.execute(update_query, params=params) so all values are bound as parameters rather than interpolated.openkl/vector_search.py (1)
145-167:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate and sanitize
query_vectorbefore Cypher interpolation.
query_vector: Anycombined withvector_str = str(query_vector)allows malformed or injected Cypher when non-numeric/untrusted input reaches these paths. Normalize to a numeric list and serialize from validated floats only (same forkbounds).Suggested fix
+def _normalize_query_vector(query_vector: Any) -> list[float]: + if hasattr(query_vector, "tolist"): + query_vector = query_vector.tolist() + if not isinstance(query_vector, (list, tuple)): + raise TypeError("query_vector must be a list/tuple of numbers") + try: + return [float(v) for v in query_vector] + except (TypeError, ValueError) as exc: + raise ValueError("query_vector must contain only numeric values") from exc + def search_memory_vectors( query_vector: Any, k: int = 5, verbose: bool = False ) -> list[dict[str, Any]]: @@ - if hasattr(query_vector, "tolist"): - query_vector = query_vector.tolist() + query_vector = _normalize_query_vector(query_vector) + if k <= 0: + raise ValueError("k must be > 0") @@ - vector_str = str(query_vector) + vector_str = "[" + ", ".join(f"{v:.17g}" for v in query_vector) + "]" @@ def search_chunk_vectors( query_vector: Any, k: int = 5, verbose: bool = False ) -> list[dict[str, Any]]: @@ - if hasattr(query_vector, "tolist"): - query_vector = query_vector.tolist() + query_vector = _normalize_query_vector(query_vector) + if k <= 0: + raise ValueError("k must be > 0") @@ - vector_str = str(query_vector) + vector_str = "[" + ", ".join(f"{v:.17g}" for v in query_vector) + "]"Also applies to: 196-219
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openkl/vector_search.py` around lines 145 - 167, Normalize and sanitize query_vector and k before interpolating into the Cypher CALL: ensure query_vector is a sequence (list/tuple/np.ndarray), convert all elements to floats (reject/raise if any element cannot be cast to float), and serialize the validated numeric list (e.g., via a safe serializer rather than raw str()) before embedding in the CALL QUERY_VECTOR_INDEX invocation; also validate k is an integer within acceptable bounds (e.g., >=1 and a sensible upper limit) and coerce/raise on invalid input. Update the logic around query_vector, vector_str and the CALL QUERY_VECTOR_INDEX construction in the function that currently uses query_vector and k (references: query_vector, k, get_connection, _ensure_vector_extension_loaded, _ensure_vector_indexes_exist, CALL QUERY_VECTOR_INDEX) and apply the same validation/serialization fixes to the other occurrence noted (the block around lines 196-219).openkl/db.py (1)
60-65:⚠️ Potential issue | 🟠 Major | ⚡ Quick winVector extension failure allows silent downstream breakage.
If the vector extension fails to install/load, the code logs a warning but continues. Downstream vector operations (CREATE_VECTOR_INDEX, QUERY_VECTOR_INDEX) will then fail unexpectedly. Consider either:
- Re-raising the exception if vector search is required functionality, or
- Setting a module-level flag to gracefully degrade vector features.
Option 1: Fail fast if vector is required
try: conn.execute("INSTALL VECTOR;") conn.execute("LOAD VECTOR;") logger.info("Vector extension installed and loaded") except Exception as e: - logger.warning(f"Failed to install vector extension: {e}") + logger.error(f"Failed to install vector extension: {e}") + raise RuntimeError("Vector extension is required for OpenKL") from eOption 2: Track capability for graceful degradation
+_vector_available: bool = False + def init_db(db_path: Path | None = None) -> graphdb.Connection: ... + global _vector_available # Install and load vector extension try: conn.execute("INSTALL VECTOR;") conn.execute("LOAD VECTOR;") logger.info("Vector extension installed and loaded") + _vector_available = True except Exception as e: logger.warning(f"Failed to install vector extension: {e}") + _vector_available = False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@openkl/db.py` around lines 60 - 65, The current try/except around conn.execute("INSTALL VECTOR;") / conn.execute("LOAD VECTOR;") swallows failures which breaks downstream vector operations; update this by either (A) failing fast: re-raise the caught exception inside the except block (or raise a new RuntimeError with context) so callers like CREATE_VECTOR_INDEX and QUERY_VECTOR_INDEX cannot run without the extension, or (B) implement graceful degradation: introduce a module-level boolean (e.g. VECTOR_AVAILABLE = False) and set it to True on successful install/load and to False in the except path while logging the error; then ensure vector-using functions (CREATE_VECTOR_INDEX, QUERY_VECTOR_INDEX) check VECTOR_AVAILABLE and return a clear error or no-op when False. Use the existing logger for detailed error logging and reference conn.execute and the vector-related functions when making the change.
🧹 Nitpick comments (3)
tests/test_db_backend.py (1)
10-11: ⚡ Quick winUse
next(iter(...))for single-element extraction.Per static analysis (RUF015):
list(result)[0]materializes the entire result set just to get the first row. Usenext(iter(result))for efficiency and clarity.Suggested fix
- result = conn.execute("MATCH (m:MemoryNote) RETURN count(m)") - assert list(result)[0][0] == 0 + result = conn.execute("MATCH (m:MemoryNote) RETURN count(m)") + row = next(iter(result)) + assert row[0] == 0🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_db_backend.py` around lines 10 - 11, The test currently materializes the whole query result with list(result)[0][0]; change this to use an iterator to get only the first row: call next(iter(result)) to obtain the first row (e.g., row = next(iter(result))) and assert row[0] == 0; update the assertion in tests/test_db_backend.py where conn.execute(...) is assigned to result to use next(iter(result)) instead of list(result)[0][0].README.md (1)
124-127: ⚡ Quick winConsider adding a link to the detailed migration plan.
The migration note provides clear guidance for users with legacy data. For users who need more context or step-by-step instructions, consider linking to
docs/plans/2026-05-31-ladybug-migration.mdor creating a dedicated migration guide.📚 Suggested enhancement
### LadybugDB Migration -OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy derived index; keep it as a backup and rebuild into `~/.ok/ladybug` before relying on old graph state. +OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy derived index; keep it as a backup and rebuild into `~/.ok/ladybug` before relying on old graph state. + +For detailed migration guidance, see [docs/plans/2026-05-31-ladybug-migration.md](docs/plans/2026-05-31-ladybug-migration.md).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 124 - 127, The README's "### LadybugDB Migration" section lacks a link to the detailed migration plan; update that paragraph to include a clear hyperlink to the migration document (docs/plans/2026-05-31-ladybug-migration.md) or add a sentence pointing users to the dedicated migration guide, referencing the "### LadybugDB Migration" heading so reviewers can locate and verify the added link quickly.docs/plans/2026-05-31-ladybug-migration.md (1)
382-392: ⚡ Quick winConsider resolving open questions before final merge.
The open questions about package management (uv vs PDM), graph rebuild commands, and Python version support are reasonable for a planning document, but they introduce uncertainty for future maintenance. Consider documenting the final decisions made during implementation either in this plan or in a follow-up section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/plans/2026-05-31-ladybug-migration.md` around lines 382 - 392, The plan currently leaves three unresolved items under "Open Questions Before Execution" (package manager choice: "uv" vs PDM, whether to include "ok graph rebuild" in the first migration, and the supported Python versions showing conflicting metadata), so update the document to record the final decisions and rationale: explicitly state whether the repo will retain "uv" or migrate to PDM, confirm whether "ok graph rebuild" will be part of the initial migration or left as a manual follow-up, and reconcile the Python version constraints (adjust classifiers or metadata to match the chosen supported versions); include these decisions either inline in the "Open Questions Before Execution" section or add a "Decisions/Follow-up" subsection and reference "Rollback Plan" and "ok graph rebuild" where relevant so future maintainers have a clear, authoritative action list.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@openkl/db.py`:
- Line 8: The code imports ladybug as graphdb and currently calls
graphdb.Database(str(db_path)), graphdb.Connection(db) and conn.execute(...);
update the vector-extension and schema error handling so failures are not
silently ignored: when executing the vector extension SQL (the INSTALL/LOAD
VECTOR block using conn.execute), detect if the extension failed and either
raise an exception or set a clear flag (e.g., vector_available) so any
vector-dependent functionality immediately fails or is prevented; for schema
initialization (the block that currently checks error strings for "already
exists"), switch to using SQL DDL with IF NOT EXISTS where possible or catch the
specific LadybugDB exception type instead of substring matching, re-raise
unexpected errors, and ensure you reference graphdb.Database,
graphdb.Connection, conn.execute and the vector init / schema init code paths
when making these changes.
In `@tests/test_vector_search.py`:
- Around line 21-22: Add an explicit non-empty assertion for the retrieval
results before indexing into them: in the test around the call to
search_memory_vectors([0.1] * 384, k=1) (variable results), assert that results
is not empty (e.g., assert len(results) > 0 or assert results) before the
existing assert results[0]["id"] == "m-test" so failures show a clear assertion
message instead of an IndexError.
---
Outside diff comments:
In `@openkl/db.py`:
- Around line 60-65: The current try/except around conn.execute("INSTALL
VECTOR;") / conn.execute("LOAD VECTOR;") swallows failures which breaks
downstream vector operations; update this by either (A) failing fast: re-raise
the caught exception inside the except block (or raise a new RuntimeError with
context) so callers like CREATE_VECTOR_INDEX and QUERY_VECTOR_INDEX cannot run
without the extension, or (B) implement graceful degradation: introduce a
module-level boolean (e.g. VECTOR_AVAILABLE = False) and set it to True on
successful install/load and to False in the except path while logging the error;
then ensure vector-using functions (CREATE_VECTOR_INDEX, QUERY_VECTOR_INDEX)
check VECTOR_AVAILABLE and return a clear error or no-op when False. Use the
existing logger for detailed error logging and reference conn.execute and the
vector-related functions when making the change.
In `@openkl/memory.py`:
- Around line 185-196: The current update builds a Cypher string by
interpolating escaped values (variables: memory_id, escaped_text, tags, updates,
update_query) and calls conn.execute, which is vulnerable; instead construct the
SET clause using parameter placeholders (e.g., m.text = $text_param, m.tags =
$tags_param, m.id = $id_param) and collect a params dict with keys like
text_param, tags_param, id_param (or similar), convert tags to a list value
rather than stringifying, then call conn.execute(update_query, params=params) so
all values are bound as parameters rather than interpolated.
In `@openkl/vector_search.py`:
- Around line 145-167: Normalize and sanitize query_vector and k before
interpolating into the Cypher CALL: ensure query_vector is a sequence
(list/tuple/np.ndarray), convert all elements to floats (reject/raise if any
element cannot be cast to float), and serialize the validated numeric list
(e.g., via a safe serializer rather than raw str()) before embedding in the CALL
QUERY_VECTOR_INDEX invocation; also validate k is an integer within acceptable
bounds (e.g., >=1 and a sensible upper limit) and coerce/raise on invalid input.
Update the logic around query_vector, vector_str and the CALL QUERY_VECTOR_INDEX
construction in the function that currently uses query_vector and k (references:
query_vector, k, get_connection, _ensure_vector_extension_loaded,
_ensure_vector_indexes_exist, CALL QUERY_VECTOR_INDEX) and apply the same
validation/serialization fixes to the other occurrence noted (the block around
lines 196-219).
---
Nitpick comments:
In `@docs/plans/2026-05-31-ladybug-migration.md`:
- Around line 382-392: The plan currently leaves three unresolved items under
"Open Questions Before Execution" (package manager choice: "uv" vs PDM, whether
to include "ok graph rebuild" in the first migration, and the supported Python
versions showing conflicting metadata), so update the document to record the
final decisions and rationale: explicitly state whether the repo will retain
"uv" or migrate to PDM, confirm whether "ok graph rebuild" will be part of the
initial migration or left as a manual follow-up, and reconcile the Python
version constraints (adjust classifiers or metadata to match the chosen
supported versions); include these decisions either inline in the "Open
Questions Before Execution" section or add a "Decisions/Follow-up" subsection
and reference "Rollback Plan" and "ok graph rebuild" where relevant so future
maintainers have a clear, authoritative action list.
In `@README.md`:
- Around line 124-127: The README's "### LadybugDB Migration" section lacks a
link to the detailed migration plan; update that paragraph to include a clear
hyperlink to the migration document (docs/plans/2026-05-31-ladybug-migration.md)
or add a sentence pointing users to the dedicated migration guide, referencing
the "### LadybugDB Migration" heading so reviewers can locate and verify the
added link quickly.
In `@tests/test_db_backend.py`:
- Around line 10-11: The test currently materializes the whole query result with
list(result)[0][0]; change this to use an iterator to get only the first row:
call next(iter(result)) to obtain the first row (e.g., row = next(iter(result)))
and assert row[0] == 0; update the assertion in tests/test_db_backend.py where
conn.execute(...) is assigned to result to use next(iter(result)) instead of
list(result)[0][0].
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c21c04ed-c8bb-4cb2-aa31-5523c0786ccd
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
README.mddocs/plans/2026-05-31-ladybug-migration.mdopenkl/cli.pyopenkl/db.pyopenkl/distill.pyopenkl/graph.pyopenkl/memory.pyopenkl/vector_search.pypyproject.tomlrfcs/0000-openkl-design.mdtests/test_db_backend.pytests/test_vector_search.py
There was a problem hiding this comment.
🧹 Nitpick comments (2)
README.md (2)
126-126: Consider hyphenating the compound modifier.The phrase "legacy derived index" functions as a compound modifier and would be clearer as "legacy-derived index" per standard English hyphenation rules. As per coding guidelines, the static analysis tool flagged this as a potential clarity improvement.
✏️ Optional style fix
-OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy derived index; keep it as a backup and rebuild into `~/.ok/ladybug` before relying on old graph state. +OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy-derived index; keep it as a backup and rebuild into `~/.ok/ladybug` before relying on old graph state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` at line 126, Update the phrase "legacy derived index" to the hyphenated compound modifier "legacy-derived index" in the README sentence that mentions OpenKL using LadybugDB and the existing ~/.ok/kuzu graph data so the wording reads "...treated as a legacy-derived index; keep it as a backup and rebuild into ~/.ok/ladybug..."—locate the occurrence of "legacy derived index" and replace it with "legacy-derived index".
124-126: ⚡ Quick winClarify the migration guidance with actionable steps.
The current guidance tells users to "rebuild into
~/.ok/ladybug" but doesn't explain how or mention that no automatic migration exists. Users with existing Kuzu data will be confused about what action to take.Consider adding explicit guidance such as: "No automatic migration is provided; the graph will be rebuilt from your canonical data (files and memory) when you use OpenKL commands. Alternatively, manually re-run your ingestion and memory commands to repopulate the graph."
📝 Suggested improvement
### LadybugDB Migration -OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy derived index; keep it as a backup and rebuild into `~/.ok/ladybug` before relying on old graph state. +OpenKL now uses LadybugDB instead of the archived KuzuDB package. Existing `~/.ok/kuzu` graph data is treated as a legacy derived index; keep it as a backup. The graph will be automatically rebuilt in `~/.ok/ladybug` as you use OpenKL commands. No manual migration is required, but you can also manually re-ingest documents and re-add memory entries to repopulate the graph immediately.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 124 - 126, Update the README migration section to explicitly state there is no automatic migration and provide clear, actionable steps: mention that existing data in ~/.ok/kuzu is only a legacy backup, explain that OpenKL will rebuild the LadybugDB graph from canonical sources when you run OpenKL commands, and add concrete options—either let OpenKL rebuild automatically by running your usual OpenKL commands or manually re-run your ingestion and memory commands to repopulate ~/.ok/ladybug; reference the legacy path (~/.ok/kuzu), the new path (~/.ok/ladybug), and the terms "ingestion" and "memory" commands so users know what to run.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@README.md`:
- Line 126: Update the phrase "legacy derived index" to the hyphenated compound
modifier "legacy-derived index" in the README sentence that mentions OpenKL
using LadybugDB and the existing ~/.ok/kuzu graph data so the wording reads
"...treated as a legacy-derived index; keep it as a backup and rebuild into
~/.ok/ladybug..."—locate the occurrence of "legacy derived index" and replace it
with "legacy-derived index".
- Around line 124-126: Update the README migration section to explicitly state
there is no automatic migration and provide clear, actionable steps: mention
that existing data in ~/.ok/kuzu is only a legacy backup, explain that OpenKL
will rebuild the LadybugDB graph from canonical sources when you run OpenKL
commands, and add concrete options—either let OpenKL rebuild automatically by
running your usual OpenKL commands or manually re-run your ingestion and memory
commands to repopulate ~/.ok/ladybug; reference the legacy path (~/.ok/kuzu),
the new path (~/.ok/ladybug), and the terms "ingestion" and "memory" commands so
users know what to run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 16b7b159-5e20-444f-8bab-e0651b762edb
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
README.mdopenkl/cli.pyopenkl/db.pyopenkl/distill.pyopenkl/graph.pyopenkl/memory.pyopenkl/vector_search.pypyproject.tomlrfcs/0000-openkl-design.mdtests/test_db_backend.pytests/test_vector_search.py
✅ Files skipped from review due to trivial changes (2)
- rfcs/0000-openkl-design.md
- openkl/distill.py
🚧 Files skipped from review as they are similar to previous changes (6)
- pyproject.toml
- tests/test_vector_search.py
- openkl/cli.py
- openkl/memory.py
- openkl/vector_search.py
- openkl/db.py
Address the actionable CodeRabbit review feedback on the LadybugDB migration.\n\nFail fast when the required LadybugDB VECTOR extension cannot be installed or loaded so OpenKL does not continue with a partially functional graph backend. Make schema creation idempotent with IF NOT EXISTS instead of relying on substring checks against exception messages.\n\nParameterize the MemoryManager.update existence check and SET clause so memory text and tags are passed through LadybugDB parameters instead of interpolated Cypher fragments. Validate vector-search inputs before touching the database, constrain k to a bounded positive integer, and serialize only finite numeric vector values into QUERY_VECTOR_INDEX calls.\n\nTighten tests around these behaviors: vector-extension failure now raises, vector search rejects invalid vectors and k before DB access, memory updates handle quoted text and tags through parameters, and the vector-search round trip asserts a non-empty result before indexing.\n\nVerification:\n- uv run pytest -q\n- uv run ruff check . && uv run ruff format --check .\n- uv run mypy openkl/db.py openkl/graph.py openkl/vector_search.py\n- HOME="/var/folders/vf/8zspxjpd00l_hmmb95ygld940000gn/T/tmp.fIREvsujDR" uv run ok doctor
|
Reviewed the remaining CodeRabbit items against the current PR diff and addressed the actionable ones in 00b499f. What changed:
The README/docs-plan suggestions are intentionally not applied. The migration plan document was removed from this PR per maintainer direction, so adding a README link to a non-committed Verification after the fix:
|
Why this change is needed
KuzuDB has been archived, which makes it a risky long-term backend for OpenKL's local-first knowledge graph. OpenKL depends on the embedded graph engine not only for basic Cypher storage, but also for vector index creation and vector search. In practice, the old Kuzu package also keeps a brittle dependency on the Kuzu extension distribution path for
INSTALL VECTOR/LOAD VECTOR, so fresh installs can fail before users ever get to the useful OpenKL workflows.LadybugDB is the active successor to KuzuDB and keeps the API shape OpenKL already relies on: embedded database construction, connection execution, Cypher support, and native vector index procedures. Moving now keeps the project aligned with the maintained upstream while preserving the current OpenKL mental model: files remain canonical, and the graph stays a derived retrieval/index layer.
The migration is intentionally conservative. We do not silently open an existing
~/.ok/kuzudirectory with the new backend, because binary/file-format compatibility is not a safe assumption. Instead, new graph state is created under~/.ok/ladybug, while legacy Kuzu graph data is left untouched and reported as a legacy derived index. That gives users a safer path forward and avoids destructive or ambiguous in-place behavior : )What changed
Runtime dependency and lockfile
kuzu>=0.8.0toladybug>=0.17.0.uv.lockso fresh installs resolveladybug 0.17.0instead of the archived Kuzu package.pytestandmypyto the active dev dependency group so the new migration smoke tests and scoped type checks are installable through the repository's currentuvworkflow.Graph backend initialization
openkl.dbto importladybugand expose it through a localgraphdbalias.~/.ok/kuzuto~/.ok/ladybug.LEGACY_KUZU_DB_PATHcheck. If OpenKL sees an old~/.ok/kuzudirectory and no new Ladybug store exists yet, it logs a warning instead of silently reusing the old path.CLI and user-facing naming
ok doctorso it checks forladybugand reportsLadybugDB available.Graph and vector-search compatibility cleanup
_process_kuzu_resultto_process_graph_resultso result handling is backend-neutral.db,graph, andvector_searchmodules. This keeps Ladybug's dynamic query-result typing from leaking as broadly through the migration path.CREATE_VECTOR_INDEX,QUERY_VECTOR_INDEX, andSHOW_INDEXESflow, which was verified against LadybugDB in smoke tests.Migration documentation
~/.ok/kuzuto~/.ok/ladybugbehavior.Tests
tests/test_db_backend.pyto verify Ladybug-backed schema initialization and the legacy Kuzu-path warning behavior.tests/test_vector_search.pyto verify a full memory vector-index round trip through LadybugDB.Data compatibility behavior
This PR intentionally treats the graph database as a derived index, consistent with the existing design docs. Existing user-authored content and store files remain the durable source of truth. Existing
~/.ok/kuzugraph state is not deleted, modified, or opened by LadybugDB automatically.The safe behavior after this PR is:
~/.ok/ladybug.~/.ok/kuzustate is left in place as a legacy derived index.Verification
Fresh checks run before opening this PR:
Results:
pytest: 3 passedruff check: all checks passedruff format --check: 14 files already formattedmypyon migrated backend modules: no issues foundI also ran an isolated
HOMECLI smoke test so the verification would not touch a real user~/.okdirectory. The smoke covered:ok doctor, confirming LadybugDB is available and the database initializesok mem addok mem search --jsonok graph cypherover theMemoryNote -> HasTopic -> Topicrelationshipok graph list-indexes, confirming bothmemory_vec_idxandchunk_vec_idxare created through LadybugDBRepresentative output from the final smoke:
Known follow-up
The full repository-wide
uv run mypy openklstill reports pre-existing typing debt outside the migrated backend path, mostly in CLI, parsers, citations, memory, and store modules. This PR keeps the migration focused and verifies the changed backend-critical modules with scoped mypy instead of turning the graph backend migration into a full type-hardening project.Summary by CodeRabbit
Documentation
Refactor
~/.ok/ladybug)Tests
Chores