feat: add dataset database handler logic and neo4j/lancedb/kuzu handlers #1776
base: dev
Conversation
…b.com:topoteretes/cognee into feature/cog-3245-enable-multi-user-for-falkor
Walkthrough
Adds a dataset-database handler framework, per-dataset JSON connection fields and migration, Kuzu/LanceDB/Neo4j Aura handlers and registry, config/env keys and resolution at runtime, pruning hooks, auth gating changes, tests, and CI workflow updates.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
```python
@classmethod
async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
    pass
```
Do we want to keep this like that?
In case we want to delete the dataset, do we just delete the data inside a dataset database, or do we also delete the database itself?
We must also delete the database in most cases. For example, in Neo4j an empty, unused dataset still incurs high billing costs, so deleting the database keeps costs lower for empty datasets.
Ok, then I guess we need to handle delete here. In the case of Kuzu that is just calling delete_graph.
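For a file-based Kuzu database, a minimal sketch of what that delete could look like, assuming the per-user `databases/` layout and the legacy `.pkl` naming discussed later in this review; the function name and path layout are illustrative rather than the project's actual API:

```python
import asyncio
import os
import shutil
from typing import Optional
from uuid import UUID, uuid4


async def delete_kuzu_dataset_database(
    dataset_id: Optional[UUID], databases_root: str = "databases"
) -> None:
    """Remove the on-disk Kuzu database for a dataset (hypothetical layout)."""
    if dataset_id is None:
        raise ValueError("dataset_id is required to delete a dataset database")
    db_path = os.path.join(databases_root, f"{dataset_id}.pkl")
    if os.path.isdir(db_path):
        shutil.rmtree(db_path)  # Kuzu may store the database as a directory
    elif os.path.isfile(db_path):
        os.remove(db_path)


if __name__ == "__main__":
    asyncio.run(delete_kuzu_dataset_database(uuid4()))
```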
Actionable comments posted: 0
🧹 Nitpick comments (3)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
9-12: Add a class-level docstring for completeness.
Per coding guidelines, undocumented class definitions are assumed incomplete. Consider adding a brief docstring summarizing the interface's purpose.

```diff
 class DatasetDatabaseHandlerInterface(ABC):
+    """
+    Abstract interface for dataset database handlers.
+
+    Implementations provide logic for creating, resolving, and deleting
+    per-dataset graph or vector database connections in multi-tenant mode.
+    """
     @classmethod
     @abstractmethod
```

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
14-52: Implementation follows the interface contract correctly.
The method validates the provider, fetches configuration, and returns the expected dictionary structure. The inline import on line 27 is acceptable for avoiding circular dependencies.
Regarding the TODO on line 36 about graph file path info: Do you want me to help implement user-specific path construction similar to `LanceDBDatasetDatabaseHandler` (which builds paths under `databases/{user.id}/`)?
54-56: Add missing return type annotation for interface consistency.
The interface declares `delete_dataset(...) -> None`, but this implementation omits the return type. While Python allows this, adding the annotation maintains consistency with the interface contract.

```diff
 @classmethod
-async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
+async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
     pass
```

Given the past discussion about needing database deletion for cost management, consider adding a TODO comment to track the implementation:

```diff
 @classmethod
 async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
+    # TODO: Implement Kuzu database deletion to free resources
     pass
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/db_examples_tests.yml (2 hunks)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1 hunks)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
.github/**
⚙️ CodeRabbit configuration file
.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.
- 'actionlint' erroneously generates false positives when dealing with GitHub's `${{ ... }}` syntax in conditionals.
- 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid `${{ ... }}` syntax.
Files:
.github/workflows/db_examples_tests.yml
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
- create_dataset (19-116), delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
- create_dataset (17-37), delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
.github/workflows/db_examples_tests.yml (2)
64-64: Verify: Why is `ENABLE_BACKEND_ACCESS_CONTROL` absent from the Kuzu example job?
The environment variable `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` is added to both the Neo4j and PGVector example test steps, but is absent from the Kuzu example step. This inconsistency suggests either: (1) Kuzu does not yet support the multi-user mode feature introduced in this PR, or (2) this is an unintended omission.
Clarify whether this is intentional or if the variable should also be added to the Kuzu example.
Also applies to: 146-146
62-78: Environment variable addition aligns with PR objectives.
The addition of `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` to test workflows is consistent with the PR's multi-tenant, multi-user mode feature. Disabling access control in test examples is appropriate for isolated testing scenarios.
Also applies to: 143-155
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
10-67: Well-structured interface with comprehensive documentation.
The interface design is sound: abstract methods for `create_dataset` and `delete_dataset` with a concrete default for `resolve_dataset_connection_info` that can be optionally overridden. The docstrings appropriately emphasize security considerations around credential handling.

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
1-12: Clean imports and proper class structure.
The unused imports from the previous review have been removed. The class correctly implements `DatasetDatabaseHandlerInterface` with an appropriate docstring.
Actionable comments posted: 7
🧹 Nitpick comments (3)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
14-41: Validate `dataset_id` and document the legacy `.pkl` naming.
`dataset_id` is typed as `Optional[UUID]` but is required here to form `graph_db_name`; if it is ever `None` you'll silently get `"None.pkl"`. Consider either:
- enforcing non-`None` with an explicit check and a clear `ValueError`, or
- supporting the `None` case explicitly in the naming logic.

Given the prior decision to keep the `.pkl` suffix for backward compatibility, it would help future maintainers to add a short comment explaining why this extension is used and what older Kuzu deployments expect, instead of leaving it implicit.
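A minimal sketch of the non-`None` check, under the assumption that the name is built straight from the dataset id; `build_graph_db_name` is a hypothetical helper:

```python
from typing import Optional
from uuid import UUID


def build_graph_db_name(dataset_id: Optional[UUID]) -> str:
    """Build the Kuzu database file name for a dataset."""
    if dataset_id is None:
        raise ValueError("dataset_id is required to name a Kuzu dataset database")
    # Legacy .pkl suffix kept for backward compatibility with older deployments.
    return f"{dataset_id}.pkl"
```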
36-51: Avoid persisting raw DB credentials in the `DatasetDatabase` row.
Per the `DatasetDatabaseHandlerInterface` docstring, the dict returned from `create_dataset` is stored verbatim and later passed into `resolve_dataset_connection_info`, and the guidance is to prefer returning references to secrets/roles instead of plaintext credentials. Here you are returning `graph_database_username` and `graph_database_password` directly, which will end up persisted in the relational DB.
Consider adjusting this handler to:
- store only non-secret identifiers (e.g., a secret name/ARN or config key) in `graph_database_connection_info`, and
- override `resolve_dataset_connection_info` in this class to resolve those identifiers into short‑lived credentials at connection time without persisting them.

This keeps the schema compatible while improving the security posture.
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (1)
63-116: Consider adding logging for debugging and observability.
The function lacks logging statements, which would be valuable for debugging dataset database creation issues, tracking handler failures, and monitoring race conditions.
Consider adding logging at key points:

```python
from cognee.shared.logging_utils import get_logger

logger = get_logger(__name__)


async def get_or_create_dataset_database(
    dataset: Union[str, UUID],
    user: User,
) -> DatasetDatabase:
    """..."""
    logger.debug(f"Getting or creating dataset database for dataset={dataset}, user={user.id}")
    db_engine = get_relational_engine()
    dataset_id = await get_unique_dataset_id(dataset, user)

    # If dataset is given as name make sure the dataset is created first
    if isinstance(dataset, str):
        logger.debug(f"Creating dataset '{dataset}' for user {user.id}")
        async with db_engine.get_async_session() as session:
            await create_dataset(dataset, user, session)

    # If dataset database already exists return it
    existing_dataset_database = await _existing_dataset_database(dataset_id, user)
    if existing_dataset_database:
        logger.debug(f"Found existing dataset database for dataset_id={dataset_id}")
        return existing_dataset_database

    logger.debug(f"Fetching handler configs for dataset_id={dataset_id}")
    graph_config_dict = await _get_graph_db_info(dataset_id, user)
    vector_config_dict = await _get_vector_db_info(dataset_id, user)

    # ... rest of function with additional logging for create/error paths
```

Based on coding guidelines for files matching `cognee/**/*.py`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (4 hunks)
cognee/modules/users/models/DatasetDatabase.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
- create_dataset (19-116), delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
- create_dataset (17-37), delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Code Quality for Weighted Edges
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
cognee/modules/users/models/DatasetDatabase.py (2)
3-3: LGTM!
The additional imports (JSON and text) are correctly added to support the new JSON fields with SQL-level defaults.
15-16: LGTM!
The removal of unique constraints on database names was discussed in prior reviews and enables the desired multi-tenant flexibility.
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (2)
38-60: LGTM!
The function is well-documented and implements a straightforward database query to check for existing records. The docstring clearly explains the purpose, parameters, and return value.
104-105: The dict unpacking is safe by design. Both handlers use namespaced keys—graph handlers return `graph_*` prefixed keys and vector handlers return `vector_*` prefixed keys. These disjoint key sets cannot collide, so no silent overrides can occur.
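A tiny self-contained illustration of that namespacing argument, with made-up sample values:

```python
# graph_* and vector_* key sets are disjoint, so merging the two dicts
# cannot silently overwrite either side.
graph_info = {"graph_database_provider": "kuzu", "graph_database_name": "g.pkl"}
vector_info = {"vector_database_provider": "lancedb", "vector_database_name": "v.lance.db"}

merged = {**graph_info, **vector_info}
assert len(merged) == len(graph_info) + len(vector_info)  # no key collisions
```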
```python
# configuration details for different database types. This would make it more flexible to add new database types
# without changing the database schema.
```
Clarify incomplete comment.
The comment starts mid-sentence. Consider adding a subject for clarity.
Apply this diff:
```diff
-# configuration details for different database types. This would make it more flexible to add new database types
+# These fields store configuration details for different database types. This would make it more flexible to add new database types
 # without changing the database schema.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-# configuration details for different database types. This would make it more flexible to add new database types
+# These fields store configuration details for different database types. This would make it more flexible to add new database types
 # without changing the database schema.
```
🤖 Prompt for AI Agents
In cognee/modules/users/models/DatasetDatabase.py around lines 27 to 28, the
inline comment starts mid-sentence and is ambiguous; rewrite it to be a
complete, clear sentence that states the subject (e.g., "Store database
configuration details for different database types") and explains intent briefly
— that the model stores configuration details per database type to allow adding
new database types without changing the schema; update the comment text
accordingly to be grammatically complete and self-contained.
Actionable comments posted: 2
♻️ Duplicate comments (3)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
155-157: Make the unimplemented delete_dataset explicit.
`delete_dataset` currently just `pass`es, which can lead to silent no‑ops if it's ever invoked.
Replacing it with an explicit `NotImplementedError` (and a TODO) will fail fast and make the omission clear:

```diff
 @classmethod
 async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
-    pass
+    # TODO: Implement Neo4j Aura instance deletion via the Aura API.
+    raise NotImplementedError(
+        "Neo4jAuraDatasetDatabaseHandler.delete_dataset is not implemented yet."
+    )
```
58-67: Add timeouts and structured error handling to Aura token + instance creation calls.
Both the OAuth token request and the instance creation call use `requests.post` without a timeout or robust error handling, and they repeatedly call `response.json()`. That can lead to indefinite hangs and obscure failures.
A possible tightening:

```diff
-# Make the request with HTTP Basic Auth
-def get_aura_token(client_id: str, client_secret: str) -> dict:
-    url = "https://api.neo4j.io/oauth/token"
-    data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
-
-    resp = requests.post(url, data=data, auth=(client_id, client_secret))
-    resp.raise_for_status()  # raises if the request failed
-    return resp.json()
+# Make the request with HTTP Basic Auth
+def get_aura_token(client_id: str, client_secret: str) -> dict:
+    url = "https://api.neo4j.io/oauth/token"
+    data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
+
+    try:
+        resp = requests.post(
+            url,
+            data=data,
+            auth=(client_id, client_secret),
+            timeout=30,
+        )
+        resp.raise_for_status()
+    except requests.Timeout as exc:
+        raise TimeoutError("Timed out while requesting Neo4j Aura OAuth token") from exc
+    except requests.RequestException as exc:
+        raise RuntimeError("Failed to request Neo4j Aura OAuth token") from exc
+
+    return resp.json()
@@
-response = requests.post(url, headers=headers, json=payload)
-
-graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-graph_db_url = response.json()["data"]["connection_url"]
-graph_db_key = resp["access_token"]
-graph_db_username = response.json()["data"]["username"]
-graph_db_password = response.json()["data"]["password"]
+try:
+    response = requests.post(
+        url,
+        headers=headers,
+        json=payload,
+        timeout=60,
+    )
+    response.raise_for_status()
+except requests.Timeout as exc:
+    raise TimeoutError(
+        f"Timed out while provisioning Neo4j Aura instance for dataset {dataset_id}"
+    ) from exc
+except requests.RequestException as exc:
+    raise RuntimeError(
+        f"Failed to provision Neo4j Aura instance for dataset {dataset_id}"
+    ) from exc
+
+response_data = response.json()["data"]
+
+graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+graph_db_url = response_data["connection_url"]
+graph_db_key = resp["access_token"]
+graph_db_username = response_data["username"]
+graph_db_password = response_data["password"]
@@
-instance_id = response.json()["data"]["id"]
+instance_id = response_data["id"]
```

This makes failures explicit, prevents indefinite blocking, and avoids reparsing the JSON body multiple times.
Provide the official `requests` documentation page describing the `timeout` parameter and recommended exception handling patterns for HTTP errors.
Also applies to: 90-97, 114-115
98-112: Avoid blocking the async event loop when polling Aura instance status; add timeout and error checks.
`_wait_for_neo4j_instance_provisioning` is `async` but calls `requests.get` directly in a loop with no timeout, which will block the event loop for up to ~5 minutes and still lacks `raise_for_status()`.
One approach that keeps `requests` but avoids blocking:

```diff
-async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
-    # Poll until the instance is running
-    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
-    status = ""
-    for attempt in range(30):  # Try for up to ~5 minutes
-        status_resp = requests.get(
-            status_url, headers=headers
-        )  # TODO: Use async requests with httpx
-        status = status_resp.json()["data"]["status"]
-        if status.lower() == "running":
-            return
-        await asyncio.sleep(10)
-    raise TimeoutError(
-        f"Neo4j instance '{graph_db_name}' did not become ready within 5 minutes. Status: {status}"
-    )
+async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict) -> None:
+    # Poll until the instance is running
+    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
+    status = ""
+    for _ in range(30):  # Try for up to ~5 minutes
+        try:
+            status_resp = await asyncio.to_thread(
+                requests.get,
+                status_url,
+                headers=headers,
+                timeout=30,
+            )
+            status_resp.raise_for_status()
+        except requests.Timeout as exc:
+            raise TimeoutError(
+                f"Timed out while polling Neo4j instance '{instance_id}' provisioning status."
+            ) from exc
+        except requests.RequestException as exc:
+            raise RuntimeError(
+                f"Error while polling Neo4j instance '{instance_id}' provisioning status."
+            ) from exc
+
+        status = status_resp.json()["data"]["status"]
+        if status.lower() == "running":
+            return
+
+        await asyncio.sleep(10)
+
+    raise TimeoutError(
+        f"Neo4j instance '{instance_id}' did not become ready within 5 minutes. Last status: {status}"
+    )
```

Alternatively, switching to `httpx.AsyncClient` throughout would give you a fully async Aura client, but the above is a minimal, localized change.
Find the official Python documentation for `asyncio.to_thread` and any guidance on avoiding blocking I/O inside `async def` functions.
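A short sketch of that `httpx.AsyncClient` alternative, assuming the same Aura status endpoint and response shape as in the diff above; the function name is illustrative:

```python
import asyncio

import httpx


async def wait_until_running(instance_id: str, headers: dict) -> None:
    """Poll the Aura instance status with a fully async client."""
    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
    async with httpx.AsyncClient(timeout=30) as client:
        for _ in range(30):  # up to ~5 minutes
            resp = await client.get(status_url, headers=headers)
            resp.raise_for_status()
            if resp.json()["data"]["status"].lower() == "running":
                return
            await asyncio.sleep(10)
    raise TimeoutError(f"Instance '{instance_id}' did not become ready within 5 minutes.")
```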
🧹 Nitpick comments (2)
cognee/modules/data/deletion/prune_system.py (1)
9-16: Refine access‑control gating: cache flag once and document behavior
The new checks correctly prevent graph/vector pruning when backend access control is enabled, which matches the multi‑user safety goal. Two improvements to consider:
- Avoid repeated calls
Cache the flag once at the top of the function to keep behavior consistent and avoid redundant calls:

```diff
 async def prune_system(graph=True, vector=True, metadata=True, cache=True):
-    # TODO: prune_system should work with multi-user access control mode enabled
-    if graph and not backend_access_control_enabled():
+    # TODO: prune_system should work with multi-user access control mode enabled
+    access_control_enabled = backend_access_control_enabled()
+
+    if graph and not access_control_enabled:
         graph_engine = await get_graph_engine()
         await graph_engine.delete_graph()
 
-    if vector and not backend_access_control_enabled():
+    if vector and not access_control_enabled:
         vector_engine = get_vector_engine()
         await vector_engine.prune()
```
- Document the new semantics
`prune_system` now has mode‑dependent behavior (graph/vector are no‑ops when access control is on). Adding a short docstring that explains this will help callers understand why pruning may be skipped and how the TODO will eventually be addressed (e.g., tenant‑aware pruning instead of a full disable). This also aligns with the project guideline that undocumented functions are considered incomplete.

cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1)
40-41: Clarify naming between dataset identifier and Aura DB name, and correct the slice length.
`graph_db_name` is initially set to `f"{dataset_id}"`, later overwritten with `"neo4j"`, and sliced as `graph_db_name[0:29]` while the comment mentions a 30‑character limit. That's a bit confusing for future readers.
Consider something like:

```diff
-    graph_db_name = f"{dataset_id}"
+    dataset_label = str(dataset_id)
@@
-        "name": graph_db_name[
-            0:29
-        ],  # TODO: Find better name to name Neo4j instance within 30 character limit
+        "name": dataset_label[:30],  # TODO: Find better name within the 30-char limit
@@
-    graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+    graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
```

This makes it clear which value is a human‑friendly instance label vs the fixed database name and aligns the slice with the stated limit.
Also applies to: 82-84, 92-92
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1 hunks)
cognee/modules/data/deletion/prune_system.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/modules/data/deletion/prune_system.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (4)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), resolve_dataset_connection_info (37-67), delete_dataset (71-81)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
- create_dataset (15-51), delete_dataset (54-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)
1-1: Import of `backend_access_control_enabled` is appropriate
The new import cleanly supports the gating logic below; no issues from an API or style perspective.
```python
# Client credentials and encryption
client_id = os.environ.get("NEO4J_CLIENT_ID", None)
client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
```
🧩 Analysis chain
🏁 Script executed:
```shell
wc -l cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
```
Repository: topoteretes/cognee
Length of output: 151
🏁 Script executed:
```shell
cat -n cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py | head -160
```
Repository: topoteretes/cognee
Length of output: 7391
🌐 Web query:
cryptography.fernet key management best practices official documentation
💡 Result:
Short summary of official guidance and best practices for managing cryptography.fernet keys
- Key format & secrecy: Fernet keys are URL-safe base64‑encoded 32‑byte secrets and "must be kept secret; anyone with this key can create and read messages." (cryptography.io)
- Key rotation: Use MultiFernet to implement key rotation—place the new key first to encrypt new tokens, keep older keys to decrypt existing tokens, and remove/retire old keys when safe. Use MultiFernet.rotate() to re-encrypt tokens under the new primary key. Token/key rotation is described as a recommended cryptographic hygiene practice. (cryptography.io)
- Practical key-management controls (recommended, not specific to library):
- Generate keys with a secure RNG and store secrets in a dedicated secrets manager or HSM (do not hard-code keys in source or repos). See NIST and OWASP key‑management guidance for lifecycle controls (generation, storage, rotation, access control, backup/recovery, destruction). (csrc.nist.gov)
- Limit access (least privilege), audit key usage, and have procedures for compromise (immediate rotation + re‑encryption/retirement). MultiFernet supports retiring old keys after rotation. (cryptography.io)
- References (official docs and standards):
- cryptography (Fernet / MultiFernet) official docs. (cryptography.io)
- NIST key‑management guidance (SP 800‑57 series). (csrc.nist.gov)
- OWASP Key Management Cheat Sheet. (cheatsheetseries.owasp.org)
If you want, I can: (a) extract the exact code patterns for generating, rotating, and retiring Fernet keys; or (b) produce a short ops checklist (storage options, rotation schedule, incident playbook) tailored to your environment. Which would you prefer?
Citations:
- 1: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 2: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 3: https://csrc.nist.gov/Projects/Key-Management/faqs?utm_source=openai
- 4: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 5: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 6: https://csrc.nist.gov/Projects/Key-Management/faqs?utm_source=openai
- 7: https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html?utm_source=openai
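A compact sketch of the rotation pattern summarized above, using the documented `Fernet`/`MultiFernet` APIs; the sample plaintext is illustrative:

```python
from cryptography.fernet import Fernet, MultiFernet

# New key first: MultiFernet encrypts with the first key and tries each key
# in order when decrypting, so old tokens remain readable during rotation.
old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

token = old_key.encrypt(b"aura-instance-password")

rotator = MultiFernet([new_key, old_key])
assert rotator.decrypt(token) == b"aura-instance-password"

# rotate() re-encrypts the token under the primary (first) key, after which
# the old key can eventually be retired.
rotated = rotator.rotate(token)
assert new_key.decrypt(rotated) == b"aura-instance-password"
```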
🏁 Script executed:
```shell
rg -A 10 "class DatasetDatabaseHandlerInterface" --type py
```
Repository: topoteretes/cognee
Length of output: 1862
🏁 Script executed:
```shell
cat cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
```
Repository: topoteretes/cognee
Length of output: 4479
Require NEO4J_ENCRYPTION_KEY environment variable and reconsider secret storage design to align with interface guidance.
The code falls back to an insecure default encryption key ("test_key") at lines 46 and 141, which contradicts the cryptography library's fundamental requirement that Fernet keys "must be kept secret" and violates NIST/OWASP key management practices. More critically, the current design persists both an encrypted password ciphertext and a plaintext Aura access token in the relational database, which conflicts with the DatasetDatabaseHandlerInterface contract guidance to "prefer returning only references to secrets or role identifiers, not plaintext credentials" and "do not persist any resolved or decrypted secrets back to the relational database."
Two required changes:
- Remove the insecure default encryption key. Require `NEO4J_ENCRYPTION_KEY` to be explicitly set:

```diff
-encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+if not encryption_env_key:
+    raise ValueError(
+        "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+    )
```

Apply the same fix at line 141 in resolve_dataset_connection_info() with the message: "NEO4J_ENCRYPTION_KEY must be set in order to decrypt Neo4j Aura credentials."
- Store only references to credentials, not the credentials themselves. Instead of persisting the encrypted password ciphertext and plaintext access token, store only the Aura `instance_id` and `tenant_id` in the returned dictionary. In `resolve_dataset_connection_info()`, use those references to call the Aura API (via the stored OAuth credentials or a refresh flow) to obtain fresh, short-lived credentials at connection time. This aligns with the interface's explicit guidance and eliminates the risk that a DB dump exposes long-lived secrets.
See official guidance: cryptography.io Fernet key management, NIST SP 800-57 key management, OWASP Key Management Cheat Sheet.
Also fix the empty delete_dataset() stub at line 156–157 by raising NotImplementedError with a TODO comment.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 # Client credentials and encryption
 client_id = os.environ.get("NEO4J_CLIENT_ID", None)
 client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
 tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
-encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+if not encryption_env_key:
+    raise ValueError(
+        "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+    )
 encryption_key = base64.urlsafe_b64encode(
     hashlib.sha256(encryption_env_key.encode()).digest()
 )
 cipher = Fernet(encryption_key)
```
🤖 Prompt for AI Agents
In
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
around lines 42–50 (and apply same check at line 141), remove the insecure
fallback by requiring NEO4J_ENCRYPTION_KEY to be present (raise a clear error if
missing and at line 141 use the message "NEO4J_ENCRYPTION_KEY must be set in
order to decrypt Neo4j Aura credentials."); stop writing encrypted password
ciphertexts and plaintext access tokens back to the relational DB—modify the
persistence/return values so only Aura references (instance_id and tenant_id)
are stored/returned, and update resolve_dataset_connection_info() to use those
references to obtain short‑lived credentials via the Aura OAuth/refresh flow at
connection time; finally replace the empty delete_dataset() stub at lines
156–157 with a NotImplementedError and a TODO comment indicating proper deletion
behavior to implement later.
@pazone How important do you think this is? It is a risk, but we have it basically everywhere. This ensures that cognee will work with minimal config. The setup will be a bit more burdensome if we don't have those default values and expect users to provide them every time.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
btw the Neo4j Aura password encryption is not safe and not for production use; the idea was just to have this as a simple example with minimal config showing our users how it could be done, so they can do it themselves in a safe way with secret vaults/load balancers etc.
IIUC we don't store credentials here, right?
nit: this is repeated code:
```python
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
```
Extract a private method?
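A minimal sketch of that extraction; `_build_cipher` is a hypothetical name:

```python
import base64
import hashlib
import os

from cryptography.fernet import Fernet


def _build_cipher() -> Fernet:
    """Derive the Fernet cipher from NEO4J_ENCRYPTION_KEY (falling back to 'test_key')."""
    encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
    encryption_key = base64.urlsafe_b64encode(
        hashlib.sha256(encryption_env_key.encode()).digest()
    )
    return Fernet(encryption_key)
```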
We store credentials here, but they are stored encrypted with a custom string that can be set; we decrypt them in the resolution step. If there is a simple service we can call without much setup to store credentials instead, let me know.
This is intended to be more of an example for our users of how a Neo4j dataset database handler can be written than an actual handler we will use in production or recommend they use in production. We can make a private handler for SaaS if we decide to use Neo4j there that will utilize a third-party service for credential management.
Ideally all production users need to write their own credential resolution mechanism at the very least; we support registering and using custom dataset database handlers.
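A round-trip sketch of that encrypt-on-create / decrypt-in-resolution flow, reusing the key derivation from the quoted handler code; the env fallback and sample password are illustrative:

```python
import base64
import hashlib
import os

from cryptography.fernet import Fernet

encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(hashlib.sha256(encryption_env_key.encode()).digest())
cipher = Fernet(encryption_key)

ciphertext = cipher.encrypt(b"aura-instance-password")  # persisted in the DatasetDatabase row
plaintext = cipher.decrypt(ciphertext)                  # happens in the resolution step
assert plaintext == b"aura-instance-password"
```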
```python
from cognee.infrastructure.databases.dataset_database_handler import DatasetDatabaseHandlerInterface


class LanceDBDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
```
Question: Is there a test that covers this class? I tried to run the only changed tests with coverage and there were no hits.
Yes, all the standard Cognee tests run with this and the Kuzu class. The only class not covered currently is Neo4jAura, but we can't cover that one since it's an expensive paid service.
```python
dataset_database = await _get_vector_db_connection_info(dataset_database)
dataset_database = await _get_graph_db_connection_info(dataset_database)
```
Question: do we assign the value twice? Is that expected?
Yes: the vector connection call sets values for vector databases, and the graph connection call sets values for graph databases. They are not overwritten and have separate fields in the Data model.
It's on the right track. I'd recommend ensuring test coverage to avoid future problems.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tests/test_dataset_database_handler.py (1)
39-50: Align Kuzu handler database path with LanceDB handler and test assertions
`KuzuTestDatasetDatabaseHandler.create_dataset` currently uses a relative path:

```python
databases_directory_path = os.path.join("databases", str(user.id))
```

while `LanceDBTestDatasetDatabaseHandler` builds an absolute path under `.cognee_system/test_dataset_database_handler/databases/...`, and `main()` asserts the Kuzu DB file exists under that same absolute base. This inconsistency can cause the Kuzu DB file to be written to a different location than the assertion expects (and from where LanceDB writes its DB).
You can mirror the LanceDB handler's path logic so both handlers and the assertions agree:

```diff
 class KuzuTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
     @classmethod
-    async def create_dataset(cls, dataset_id, user):
-        databases_directory_path = os.path.join("databases", str(user.id))
-        os.makedirs(databases_directory_path, exist_ok=True)
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
+        import pathlib
+
+        cognee_directory_path = str(
+            pathlib.Path(
+                os.path.join(
+                    pathlib.Path(__file__).parent, ".cognee_system/test_dataset_database_handler"
+                )
+            ).resolve()
+        )
+        databases_directory_path = os.path.join(cognee_directory_path, "databases", str(user.id))
+        os.makedirs(databases_directory_path, exist_ok=True)
@@
-        graph_db_name = "test.kuzu"
+        graph_db_name = "test.kuzu"
 
         return {
             "graph_database_name": graph_db_name,
             "graph_database_url": os.path.join(databases_directory_path, graph_db_name),
             "graph_database_provider": "kuzu",
         }
```

This also addresses the earlier concern about inconsistent test artifact locations between the two handlers.
🧹 Nitpick comments (5)
.github/workflows/e2e_tests.yml (2)
104-116: Selective disabling of backend access control looks reasonable
Using `ENV: 'local'`/`'dev'` plus `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` for the telemetry pipeline and deduplication examples is a pragmatic way to keep existing scenarios working while you introduce the new gating. Just make sure there is at least one CI path that runs with backend access control enabled (or defaulted on) so those new checks are exercised as well.
Also applies to: 148-158
216-239: Consider wiring the dataset‑database handler scenario through pytest
The new `test-dataset-database-handler` job mirrors the other integration jobs, but it runs the module directly via `python` instead of through `pytest`. If you add a small `@pytest.mark.asyncio` test wrapper around `main()` in `cognee/tests/test_dataset_database_handler.py` and invoke this job with `uv run pytest cognee/tests/test_dataset_database_handler.py`, you'll get consistent reporting and coverage accounting with the rest of your test suite.

cognee/tests/test_dataset_database_handler.py (3)
4-12: Be cautious with environment mutation at import time
Setting `VECTOR_DATASET_DATABASE_HANDLER` and `GRAPH_DATASET_DATABASE_HANDLER` at module import time is fine for a standalone example script, but if this module is ever imported by pytest as part of a larger test run it will silently affect global process state for other tests. If you intend this to behave as a "real" test module, consider moving these `os.environ[...]` assignments into `main()` (or a pytest fixture) so the scope and lifetime of the override are explicit.
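A small sketch of the fixture approach, using pytest's built-in `monkeypatch` so the override is scoped to each test; the handler values shown are illustrative:

```python
import pytest


@pytest.fixture
def dataset_handler_env(monkeypatch):
    # Scoped alternative to module-level os.environ mutation: these overrides
    # are automatically undone when the test finishes.
    monkeypatch.setenv("VECTOR_DATASET_DATABASE_HANDLER", "lancedb_test_handler")
    monkeypatch.setenv("GRAPH_DATASET_DATABASE_HANDLER", "kuzu_test_handler")
```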
15-36: Minor cleanup for LanceDB test handler (imports and lints)
The implementation looks correct for the test scenario. Two small cleanups you might consider:
- Move the `import pathlib` to the top of the file with the other imports to avoid re-importing it every time `create_dataset` is called.
- `dataset_id` is required by the interface but unused here; if Ruff's `ARG003` is enabled, you can either add a `# noqa: ARG003` on the method definition or add a short comment explaining that it's intentionally unused in this test handler.

Example diff:

```diff
-import asyncio
-import os
+import asyncio
+import os
+import pathlib
@@
 class LanceDBTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
@@
-    async def create_dataset(cls, dataset_id, user):
-        import pathlib
-
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
         cognee_directory_path = str(
             pathlib.Path(
                 os.path.join(
```
53-135: Turn this example into a proper pytest test while keeping the script entry point
The `main()` coroutine plus the `if __name__ == "__main__": ...` block make this a nice executable example, but as-is it doesn't define any pytest tests, so it won't be exercised by pytest-based runs or contribute to pytest-driven coverage.
A lightweight way to get both benefits (example + test) is to add a small async test wrapper that just calls `main()`:

```diff
 from cognee.api.v1.search import SearchType
@@
 async def main():
@@
         os.path.join(cognee_directory_path, "databases", str(default_user.id), "test.lance.db")
     ), "Vector database file not found."
-
-if __name__ == "__main__":
+import pytest
+
+
+@pytest.mark.asyncio
+async def test_dataset_database_handler_example():
+    """Exercise custom LanceDB and Kuzu dataset database handlers end-to-end."""
+    await main()
+
+
+if __name__ == "__main__":
     logger = setup_logging(log_level=ERROR)
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)
```

That keeps the example runnable via `python cognee/tests/test_dataset_database_handler.py` while also letting pytest discover and run it with `pytest`, aligning with your testing guidelines and earlier review feedback about code coverage. Short docstrings for the two handler classes and `main()` would also help future readers understand this example's intent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/workflows/e2e_tests.yml (3 hunks)
cognee/tests/test_dataset_database_handler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
.github/**
⚙️ CodeRabbit configuration file
.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.
- 'actionlint' erroneously generates false positives when dealing with GitHub's
${{ ... }}syntax in conditionals.- 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid
${{ ... }}syntax.
Files:
.github/workflows/e2e_tests.yml
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_dataset_database_handler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_dataset_database_handler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_dataset_database_handler.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_dataset_database_handler.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_dataset_database_handler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges Core Functionality (3.12)
Actionable comments posted: 6
♻️ Duplicate comments (8)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
28-30: Potential `AttributeError` when `user` is `None`.

The `user` parameter is `Optional[User]`, but `user.id` is accessed without a null check. This will raise an `AttributeError` if the caller passes `None`.

Apply this diff to add validation:

```diff
 @classmethod
 async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
     vector_config = get_vectordb_config()
     base_config = get_base_config()

     if vector_config.vector_db_provider != "lancedb":
         raise ValueError(
             "LanceDBDatasetDatabaseHandler can only be used with LanceDB vector database provider."
         )

+    if user is None:
+        raise ValueError("User is required for LanceDB dataset creation.")
+
     databases_directory_path = os.path.join(
         base_config.system_root_directory, "databases", str(user.id)
     )
```
32-32: `dataset_id` is `Optional[UUID]` but used directly.

If `dataset_id` is `None`, this will produce `"None.lance.db"` as the database name, which is likely unintended.

Apply this diff to add validation:

```diff
+    if dataset_id is None:
+        raise ValueError("Dataset ID is required for LanceDB dataset creation.")
+
     vector_db_name = f"{dataset_id}.lance.db"
```

cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (6)
46-50: Insecure default encryption key exposes credentials.

The fallback to `"test_key"` violates cryptographic best practices. As noted in past reviews and developer comments, this implementation is not production-safe. Per extensive past review discussion, require the environment variable explicitly:

```diff
-        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+        if not encryption_env_key:
+            raise ValueError(
+                "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+            )
```

Apply the same fix at Line 141 in `resolve_dataset_connection_info()`.
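For context, a minimal sketch of how such a key is typically used with `cryptography`'s Fernet, which appears to be the cipher behind `cipher.encrypt`/`cipher.decrypt` given the `InvalidToken` mention in the decryption comment below; the password value is a placeholder:

```python
import os

from cryptography.fernet import Fernet, InvalidToken

key = os.environ.get("NEO4J_ENCRYPTION_KEY")
if not key:
    raise ValueError("NEO4J_ENCRYPTION_KEY must be set")

cipher = Fernet(key.encode())  # Fernet keys are 32-byte urlsafe-base64 strings

token = cipher.encrypt(b"instance-password")
try:
    assert cipher.decrypt(token) == b"instance-password"
except InvalidToken:
    # Raised when the key is wrong or the ciphertext is corrupted.
    raise
```

A fresh key can be generated once with `Fernet.generate_key()` and stored in the deployment's secret manager rather than in source control.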
62-62: Add timeout to HTTP request.

The `requests.post` call lacks a timeout, which can cause indefinite blocking if the Neo4j API is unresponsive.

```diff
-        resp = requests.post(url, data=data, auth=(client_id, client_secret))
+        resp = requests.post(url, data=data, auth=(client_id, client_secret), timeout=30)
```

As per coding guidelines, "Prefer explicit, structured error handling in Python code."
90-96: Missing timeout and inefficient JSON parsing.

The instance creation request lacks a timeout and calls `response.json()` multiple times, which is inefficient.

```diff
-        response = requests.post(url, headers=headers, json=payload)
+        response = requests.post(url, headers=headers, json=payload, timeout=60)
+        response.raise_for_status()
+        response_data = response.json()["data"]

         graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-        graph_db_url = response.json()["data"]["connection_url"]
+        graph_db_url = response_data["connection_url"]
         graph_db_key = resp["access_token"]
-        graph_db_username = response.json()["data"]["username"]
-        graph_db_password = response.json()["data"]["password"]
+        graph_db_username = response_data["username"]
+        graph_db_password = response_data["password"]
```
98-112: Synchronous HTTP blocks the async event loop during polling.

Using `requests.get()` inside an async function blocks the event loop for up to 5 minutes. Consider using `httpx` with async support or `asyncio.to_thread()`.

```diff
 async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
     # Poll until the instance is running
     status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
     status = ""
-    for attempt in range(30):  # Try for up to ~5 minutes
-        status_resp = requests.get(status_url, headers=headers)
+    for _ in range(30):  # Try for up to ~5 minutes
+        status_resp = await asyncio.to_thread(
+            requests.get, status_url, headers=headers, timeout=30
+        )
         status = status_resp.json()["data"]["status"]
         if status.lower() == "running":
             return
         await asyncio.sleep(10)
```
146-148: Add error handling for decryption failures.

The decryption lacks error handling for `KeyError` (missing password field) or `InvalidToken` (wrong key/corrupted data).

```diff
-        graph_db_password = cipher.decrypt(
-            dataset_database.graph_database_connection_info["graph_database_password"].encode()
-        ).decode()
+        try:
+            encrypted_password = dataset_database.graph_database_connection_info[
+                "graph_database_password"
+            ]
+            graph_db_password = cipher.decrypt(encrypted_password.encode()).decode()
+        except KeyError:
+            raise ValueError("Missing encrypted password in dataset connection info")
+        except Exception as e:
+            raise ValueError(f"Failed to decrypt Neo4j password: {e}")
```

As per coding guidelines, "Prefer explicit, structured error handling in Python code."
155-157: Implement `delete_dataset` or raise `NotImplementedError`.

The abstract method is unimplemented. Raise `NotImplementedError` to make it explicit that this functionality is pending.

```diff
 @classmethod
 async def delete_dataset(cls, dataset_database: DatasetDatabase):
-    pass
+    # TODO: Implement instance deletion via Neo4j Aura API
+    raise NotImplementedError("Neo4j Aura instance deletion not yet implemented")
```
🧹 Nitpick comments (3)
cognee/modules/data/deletion/prune_system.py (2)
14-16: Move import to module level unless circular dependency exists.

Imports placed inside functions are generally a code smell and can impact readability and performance. Consider moving this import to the top of the module unless there's a specific circular dependency issue.

Apply this diff:

```diff
 from cognee.context_global_variables import backend_access_control_enabled
 from cognee.infrastructure.databases.vector import get_vector_engine
 from cognee.infrastructure.databases.graph.get_graph_engine import get_graph_engine
 from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.infrastructure.databases.vector.config import get_vectordb_config
 from cognee.infrastructure.databases.graph.config import get_graph_config
+from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
+    supported_dataset_database_handlers,
+)
 from cognee.shared.cache import delete_cache
 from cognee.modules.users.models import DatasetDatabase
```

Then remove lines 14-16 from the function.
49-51: Update or remove the outdated TODO comment.

The TODO states "prune_system should work with multi-user access control mode enabled" but the code in lines 52-62 appears to implement exactly this functionality via `prune_graph_databases()` and `prune_vector_databases()`. Either remove the TODO if the implementation is complete, or clarify what remaining work is needed.
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
31-31: Remove redundant import.

`get_graph_config` is already imported at Line 9. Remove this duplicate import from within the method. Apply this diff:

```diff
-        from cognee.infrastructure.databases.graph.config import get_graph_config
-
         graph_config = get_graph_config()
```
🔇 Additional comments (6)
.github/workflows/e2e_tests.yml (2)
215-238: The test job configuration is correct and complete.

The test file exists at `./cognee/tests/test_dataset_database_handler.py` and uses custom LanceDB and Kuzu handlers with local file-based storage. No external service dependencies (postgres, redis, docker) are required. The environment variables for LLM and embedding configuration are already present in the job, and all test setup is self-contained within the test file.
150-150: No actionable change needed.

The `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` setting for the Deduplication Example test is appropriate. This flag controls multi-tenant data isolation at the database layer and does not affect deduplication logic. The test file contains pure deduplication verification without any access control dependencies. This setting pattern is consistent with other non-access-control focused tests in the codebase (e.g., `test_relational_db_migration.py`, `code_graph_example.py`). Deduplication behavior is identical regardless of the access control mode.

cognee/modules/data/deletion/prune_system.py (1)

1-1: LGTM! The new imports are appropriate for the added functionality. Also applies to: 5-6, 8-8
Also applies to: 5-6, 8-8
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
9-80: LGTM: Well-designed interface with clear security guidance.The interface establishes a clean contract for dataset database provisioning with comprehensive docstrings. The security guidance regarding credential handling (preferring secret references over plaintext in
create_datasetand resolving at connection time viaresolve_dataset_connection_info) aligns with best practices.cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)
42-52: LGTM: Proper cleanup implementation.

The `delete_dataset` method correctly creates a vector engine with the dataset's connection info and invokes the prune operation to clean up resources.
57-81: LGTM: Proper graph deletion implementation.The
delete_datasetmethod now correctly constructs the graph file path and invokes the graph engine'sdelete_graph()to remove the dataset's backing database. This addresses the prior review concern about the no-op implementation.
Actionable comments posted: 1
♻️ Duplicate comments (5)
cognee/modules/data/deletion/prune_system.py (5)
17-39: Add docstring to prune_graph_databases function.

The function lacks a docstring. As per coding guidelines, undocumented function definitions are assumed incomplete and should have docstrings explaining purpose, behavior, and any important notes.

Add a concise docstring:

```diff
 async def prune_graph_databases():
+    """
+    Prune all graph databases for registered datasets.
+
+    Iterates through dataset_database records and deletes each graph database
+    using the configured handler. Skips pruning if the dataset_database table
+    is not accessible.
+    """
     async def _prune_graph_db(dataset_database: DatasetDatabase) -> dict:
```

As per coding guidelines, docstrings are required for function definitions.
24-25: Add validation and error handling for handler access and deletion.

The dictionary access on line 24 could raise `KeyError` if the configured handler is not registered. Additionally, the `delete_dataset` call on line 25 lacks error handling, which could cause the entire pruning operation to fail when one dataset deletion fails.

Apply this diff:

```diff
-        handler = supported_dataset_database_handlers[graph_config.graph_dataset_database_handler]
-        return await handler["handler_instance"].delete_dataset(dataset_database)
+        handler_key = graph_config.graph_dataset_database_handler
+        if handler_key not in supported_dataset_database_handlers:
+            logger.error(f"Graph handler '{handler_key}' not registered")
+            raise ValueError(f"Graph dataset database handler '{handler_key}' not registered")
+
+        handler = supported_dataset_database_handlers[handler_key]
+        try:
+            return await handler["handler_instance"].delete_dataset(dataset_database)
+        except Exception as e:
+            logger.error(
+                f"Failed to delete graph database for dataset {dataset_database.dataset_id}: {e}"
+            )
+            raise
```

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling.
41-63: Address duplicate issues from prune_graph_databases.

This function has the same issues as `prune_graph_databases`:

- Missing docstring (lines 41-63)
- Import inside nested function (lines 45-47) should be moved to module level
- Dictionary access without key validation (line 49) could raise `KeyError`
- Missing error handling for the `delete_dataset` call (line 50)

Apply similar fixes as suggested for `prune_graph_databases`:

- Add a docstring describing the function's purpose
- Move the import to module level (already covered in the earlier comment)
- Validate that `vector_config.vector_dataset_database_handler` exists in `supported_dataset_database_handlers` before accessing
- Wrap `delete_dataset` in try-except with logging

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling.
async def prune_system(graph=True, vector=True, metadata=True, cache=True): + """ + Prune system databases and cache. + + Args: + graph: If True, prune graph databases + vector: If True, prune vector databases + metadata: If True, prune metadata database + cache: If True, delete cache + + Note: + This function should not be exposed through the API as it has no permission + checks and will delete all databases. Use only in development/testing. + """ # Note: prune system should not be available through the API, it has no permission checks and willAs per coding guidelines, docstrings are required for function definitions.
72-79: Add error handling for new pruning functions.

The calls to `prune_graph_databases()` (line 73) and `prune_vector_databases()` (line 79) lack error handling. Given that this function is called at pipeline start (based on learnings) and is used in development/testing, failures should be handled gracefully to prevent cascading issues.

Apply this diff:

```diff
     elif graph and backend_access_control_enabled():
-        await prune_graph_databases()
+        try:
+            await prune_graph_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune graph databases: {e}")
+            raise

     if vector and not backend_access_control_enabled():
         vector_engine = get_vector_engine()
         await vector_engine.prune()
     elif vector and backend_access_control_enabled():
-        await prune_vector_databases()
+        try:
+            await prune_vector_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune vector databases: {e}")
+            raise
```

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling. Based on learnings, this function is called at pipeline start, so proper error propagation is critical.
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)
28-38: Good addition of outer error handling.

The error handling for `OperationalError` and `EntityNotFoundError` is a good addition. This gracefully handles cases where the `dataset_database` table is not accessible (e.g., during initial setup or schema migration) and logs appropriately using the shared logger. The use of `logger.debug()` is appropriate here since this is an expected condition in certain environments.

Also applies to: 52-63
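For reference, the general shape of that pattern as a sketch; the helper names here are illustrative, and the exact exception types and logger setup in cognee may differ:

```python
from sqlalchemy.exc import OperationalError

from cognee.shared.logging_utils import get_logger

logger = get_logger(__name__)


async def prune_registered_databases(list_dataset_databases, delete_one):
    """Best-effort pruning that tolerates a missing registry table."""
    try:
        dataset_databases = await list_dataset_databases()
    except OperationalError as error:
        # Expected during initial setup or mid-migration; log at debug level
        # and skip rather than failing the whole prune operation.
        logger.debug("dataset_database table not accessible, skipping prune: %s", error)
        return
    for dataset_database in dataset_databases:
        await delete_one(dataset_database)
```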
Actionable comments posted: 3
♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
18-54: Validate `dataset_id` before using it to build the Kuzu filename.

`dataset_id` is typed as `Optional[UUID]` but is used directly in `graph_db_name = f"{dataset_id}.pkl"`. If `dataset_id` is ever `None`, this will silently create a `"None.pkl"` database name, which is almost certainly unintended and will be hard to diagnose.

Consider failing fast by validating `dataset_id` at the top of the method:

```diff
 @classmethod
 async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
@@
     graph_config = get_graph_config()
@@
     if graph_config.graph_database_provider != "kuzu":
         raise ValueError(
             "KuzuDatasetDatabaseHandler can only be used with Kuzu graph database provider."
         )
+
+    if dataset_id is None:
+        raise ValueError("dataset_id is required for Kuzu dataset creation.")
@@
     graph_db_name = f"{dataset_id}.pkl"
```

This keeps the handler's contract strict and avoids generating ambiguous filenames.
🧹 Nitpick comments (1)
cognee/context_global_variables.py (1)
124-125: TODO comment flags technical debt.

The TODO comment correctly identifies that vector and graph config handling across Cognee needs improvement, particularly regarding LRU_CACHE sensitivity to input order.
Would you like me to open a new issue to track this technical debt, or do you plan to address it in a follow-up PR?
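The order sensitivity is easy to reproduce with `functools.lru_cache` itself: per the stdlib docs, `f(a=1, b=2)` and `f(b=2, a=1)` differ in keyword argument order and may get separate cache entries. A self-contained illustration:

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=None)
def get_config(provider, url):
    global call_count
    call_count += 1
    return {"provider": provider, "url": url}


get_config(provider="kuzu", url="db1")
get_config(url="db1", provider="kuzu")  # same logical call, different kwarg order

print(call_count)  # 2: one logical configuration, two cache entries
```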
🔇 Additional comments (6)
cognee/context_global_variables.py (5)
7-10: LGTM! Import updates align with new configuration patterns.

The imports correctly reflect the refactored configuration utilities and the new dataset database connection info resolver.
82-82: LGTM! Simplified logic reduces duplication.

The direct call to `multi_user_support_possible()` simplifies the logic and ensures consistent validation.
141-146: Verify empty string defaults for credentials are acceptable.

Using `.get()` with empty string defaults for `graph_database_username` and `graph_database_password` might mask missing required credentials. For Neo4j Aura (mentioned in the PR title), these credentials are typically required.

Verify whether:

- Empty credentials are valid for any supported graph database providers
- Downstream code properly validates and reports authentication failures
- The `graph_database_connection_info` dictionary is guaranteed to contain these keys after `resolve_dataset_database_connection_info()` is called

Consider using `None` as the default and explicitly checking for required credentials, or raising an error if credentials are missing for providers that require them.
resolve_dataset_database_connection_info()ensures that all connection details (both vector and graph) are properly resolved before use.
29-30: No action required—handler attributes cannot be None.Both
graph_dataset_database_handlerandvector_dataset_database_handlerare defined asstrtype with non-empty default values ("kuzu" and "lancedb" respectively) in their respective Pydantic BaseSettings config classes. These attributes are guaranteed to always return string values and will not fail theinchecks on lines 35 and 42.cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
1-10: Imports and interface wiring look consistent.The imports are all used, and the handler cleanly depends on the shared
DatasetDatabaseHandlerInterface, graph engine factory, and user/database models. No changes needed here.
```diff
 def multi_user_support_possible():
-    graph_db_config = get_graph_context_config()
-    vector_db_config = get_vectordb_context_config()
-    return (
-        graph_db_config["graph_database_provider"] in GRAPH_DBS_WITH_MULTI_USER_SUPPORT
-        and vector_db_config["vector_db_provider"] in VECTOR_DBS_WITH_MULTI_USER_SUPPORT
-    )
+    graph_db_config = get_graph_config()
+    vector_db_config = get_vectordb_config()
+
+    graph_handler = graph_db_config.graph_dataset_database_handler
+    vector_handler = vector_db_config.vector_dataset_database_handler
+    from cognee.infrastructure.databases.dataset_database_handler import (
+        supported_dataset_database_handlers,
+    )
+
+    if graph_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported graph dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if vector_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported vector dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[graph_handler]["handler_provider"]
+        != graph_db_config.graph_database_provider
+    ):
+        raise EnvironmentError(
+            "The selected graph dataset to database handler does not work with the configured graph database provider. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph database provider: {graph_db_config.graph_database_provider}\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[vector_handler]["handler_provider"]
+        != vector_db_config.vector_db_provider
+    ):
+        raise EnvironmentError(
+            "The selected vector dataset to database handler does not work with the configured vector database provider. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector database provider: {vector_db_config.vector_db_provider}\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    return True
```
🛠️ Refactor suggestion | 🟠 Major
Add a docstring to document validation behavior.
This function lacks a docstring. Per the coding guidelines, undocumented function definitions are assumed incomplete.
Apply this diff to add a docstring:
```diff
 def multi_user_support_possible():
+    """
+    Validate that the configured graph and vector dataset database handlers are supported
+    and compatible with their respective database providers.
+
+    Raises:
+        EnvironmentError: If handlers are unsupported or if handler providers don't match
+            the configured database providers.
+
+    Returns:
+        bool: True if validation passes.
+    """
     graph_db_config = get_graph_config()
     vector_db_config = get_vectordb_config()
```
🤖 Prompt for AI Agents
In cognee/context_global_variables.py around lines 25 to 71, the function
multi_user_support_possible lacks a docstring; add a concise triple-quoted
docstring immediately under the def line that states the function’s purpose
(validate whether configured graph and vector dataset handlers/providers support
multi-user access control), documents that it takes no arguments, returns True
when validation passes, and raises EnvironmentError with explanatory messages
when any handler/provider is unsupported or mismatched; keep the docstring short
(one to three sentences) and include a brief note about the raised exceptions
and return value.
Description
Add the ability to use multi-tenant, multi-user mode with Neo4j.
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit