
Conversation

Collaborator

@dexters1 dexters1 commented Nov 11, 2025

Description

Add the ability to use multi-tenant, multi-user mode with Neo4j.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • New Features
    • Per-dataset multi-user database provisioning via pluggable handlers for graph and vector backends (Neo4j Aura, Kuzu, LanceDB).
  • Configuration
    • New config options and environment variables to select dataset database handlers; backend access control enabled by default.
  • Database Migration
    • Adds JSON connection fields for per-dataset graph/vector configs and updates uniqueness handling for safe upgrades/downgrades.
  • Improvements
    • Handler-driven pruning support and startup warning about multi-user access control.
  • Tests
    • New integration test and CI job exercising per-user dataset provisioning.


@dexters1 dexters1 self-assigned this Nov 11, 2025
@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

Contributor

coderabbitai bot commented Nov 11, 2025

Walkthrough

Adds a dataset-database handler framework, per-dataset JSON connection fields and migration, Kuzu/LanceDB/Neo4j Aura handlers and registry, config/env keys and resolution at runtime, pruning hooks, auth gating changes, tests, and CI workflow updates.

Changes

  • Env & Logging
    Files: /.env.template, cognee/shared/logging_utils.py
    Summary: Added GRAPH_DATASET_DATABASE_HANDLER / VECTOR_DATASET_DATABASE_HANDLER and a startup warning about multi-user backend access control.
  • DB Schema & Migration
    Files: cognee/modules/users/models/DatasetDatabase.py, alembic/versions/..._expand_dataset_database_with_json_*.py
    Summary: Made graph/vector name fields non-unique; added graph_database_connection_info and vector_database_connection_info JSON columns; the migration adds the columns and conditionally drops/recreates unique constraints (Postgres vs SQLite, with SQLite table-recreate helpers).
  • Handler API & Registry
    Files: cognee/infrastructure/databases/dataset_database_handler/__init__.py, .../dataset_database_handler_interface.py, .../supported_dataset_database_handlers.py, .../use_dataset_database_handler.py
    Summary: New DatasetDatabaseHandlerInterface (create/resolve/delete), registry supported_dataset_database_handlers, and use_dataset_database_handler registration helper; package exports updated (see the sketch after this table).
  • Graph config, engine & handlers
    Files: cognee/infrastructure/databases/graph/config.py, .../get_graph_engine.py, .../graph/kuzu/...KuzuDatasetDatabaseHandler.py, .../graph/neo4j_driver/...Neo4jAuraDatasetDatabaseHandler.py
    Summary: Added graph_dataset_database_handler config; extended the create_graph_engine signature; added Kuzu and Neo4j Aura handlers (provisioning, encrypted credentials, resolve/delete hooks).
  • Vector config, engine & handlers
    Files: cognee/infrastructure/databases/vector/config.py, .../create_vector_engine.py, .../vector/lancedb/...LanceDBDatasetDatabaseHandler.py, .../vector/vector_db_interface.py
    Summary: Added vector_dataset_database_handler config; extended the create_vector_engine signature; added a LanceDB handler and dataset lifecycle methods on the vector interface.
  • Utilities: get/create & resolve
    Files: cognee/infrastructure/databases/utils/get_or_create_dataset_database.py, .../resolve_dataset_database_connection_info.py, .../utils/__init__.py
    Summary: get_or_create_dataset_database refactored to use handler helpers and return existing records early; a new resolver module delegates to handlers to resolve connection info asynchronously and is exported.
  • Context, globals & Auth
    Files: cognee/context_global_variables.py, cognee/modules/users/methods/get_authenticated_user.py
    Summary: Context setup now resolves dataset connection info and injects graph credentials from dataset info; dynamic handler inspection replaces static lists; authentication gating switched to the ENABLE_BACKEND_ACCESS_CONTROL env var (defaults to enabled).
  • Prune / Deletion
    Files: cognee/modules/data/deletion/prune_system.py
    Summary: Added prune_graph_databases() and prune_vector_databases() and conditional pruning that delegates to handlers when backend access control is enabled.
  • Tests & CI
    Files: cognee/tests/test_dataset_database_handler.py, .github/workflows/db_examples_tests.yml, .github/workflows/e2e_tests.yml
    Summary: New test registering custom handlers and asserting per-user dataset files; CI sets ENABLE_BACKEND_ACCESS_CONTROL: 'false' for example steps and adds a job to run the handler test.
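
To make the handler surface concrete, here is a minimal sketch of a custom graph handler, pieced together from the interface signatures quoted later in this review (async classmethods taking dataset_id and user). The User import path, the returned key names, and the use_dataset_database_handler call signature are assumptions, so treat this as an illustration rather than the PR's actual code.

from typing import Optional
from uuid import UUID

from cognee.infrastructure.databases.dataset_database_handler import (
    DatasetDatabaseHandlerInterface,
    use_dataset_database_handler,  # registration helper named above; exact signature assumed
)
from cognee.modules.users.models import User  # import path assumed from the PR's file layout


class ExampleGraphDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
    """Illustrative per-dataset graph handler; not the PR's actual Kuzu/Neo4j/LanceDB code."""

    @classmethod
    async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
        # Provision or look up a database for this dataset and return its connection info.
        # Graph handlers are described as returning graph_*-prefixed keys; exact key names assumed.
        return {"graph_database_name": f"{dataset_id}.db"}

    @classmethod
    async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
        # Drop the per-dataset database when the dataset itself is deleted.
        pass


# Handler selection is driven by config/env (GRAPH_DATASET_DATABASE_HANDLER); the registration
# call below assumes a (name, handler class) argument shape, which this PR does not confirm.
use_dataset_database_handler("example_graph", ExampleGraphDatasetDatabaseHandler)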

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Focus review on:
    • Alembic migration: SQLite table recreation and unique-constraint detection/recreation.
    • resolve_dataset_database_connection_info async flow and its integration in context initialization.
    • Neo4jAuraDatasetDatabaseHandler: HTTP/OAuth provisioning, polling, encryption/decryption, and error handling.
    • Changes to DatasetDatabase uniqueness and callers that assumed uniqueness.
    • Tests that create filesystem artifacts and CI env overrides.

Possibly related PRs

Suggested labels

run-checks

Suggested reviewers

  • hajdul88

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Description check ⚠️ Warning: The PR description is minimal and incomplete. While it mentions the main objective (multi-tenant/multi-user mode with Neo4j), most template sections are unchecked and the Pre-submission Checklist is entirely unchecked, indicating insufficient documentation of changes, testing, and adherence to project standards. Resolution: complete the PR description by (1) expanding the Description section with a detailed change rationale, (2) selecting the appropriate Type of Change option, (3) checking the Pre-submission Checklist items to confirm testing and code-standards compliance, and (4) linking any relevant issues if applicable.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 45.65%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: The PR title clearly and specifically describes the main change: adding dataset database handler logic and implementations for the Neo4j, LanceDB, and Kuzu database handlers.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch multi-tenant-neo4j



    @classmethod
    async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
        pass
Member


Do we want to keep this like that?
In case we want to delete the dataset, do we just delete the data inside a dataset database, or do we also delete the database?

Collaborator Author

@dexters1 dexters1 Dec 2, 2025


We must also delete the database in most cases. For example, in Neo4j an empty, unused dataset would still incur high billing costs, so deleting the database lets us keep costs lower for empty datasets.

Member


Ok, then I guess we need to handle delete here. In the case of Kuzu, that is just calling delete_graph.
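
A minimal sketch of that, assuming the dataset-scoped Kuzu engine is the one resolved by the active context; the import paths are inferred from this PR's file listing and the delete_graph() call mirrors prune_system.py, so treat it as illustrative rather than the final implementation.

from typing import Optional
from uuid import UUID

from cognee.infrastructure.databases.graph import get_graph_engine  # import path assumed
from cognee.infrastructure.databases.graph.kuzu.KuzuDatasetDatabaseHandler import (
    KuzuDatasetDatabaseHandler,  # module path taken from the PR's file listing
)
from cognee.modules.users.models import User  # import path assumed


class KuzuHandlerWithDelete(KuzuDatasetDatabaseHandler):
    """Sketch only: fills in the delete hook the way this thread describes."""

    @classmethod
    async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
        # Assumes the per-dataset Kuzu engine for dataset_id is the one active in the
        # current context; dropping it is then a single delete_graph() call.
        graph_engine = await get_graph_engine()
        await graph_engine.delete_graph()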

@borisarzentar borisarzentar changed the title Multi tenant neo4j feat: Multi tenant neo4j Dec 1, 2025
@borisarzentar borisarzentar changed the title feat: Multi tenant neo4j feat: add dataset database handler logic and neo4j/lancedb/kuzu handlers Dec 1, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)

9-12: Add a class-level docstring for completeness.

Per coding guidelines, undocumented class definitions are assumed incomplete. Consider adding a brief docstring summarizing the interface's purpose.

 class DatasetDatabaseHandlerInterface(ABC):
+    """
+    Abstract interface for dataset database handlers.
+
+    Implementations provide logic for creating, resolving, and deleting
+    per-dataset graph or vector database connections in multi-tenant mode.
+    """
     @classmethod
     @abstractmethod
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)

14-52: Implementation follows the interface contract correctly.

The method validates the provider, fetches configuration, and returns the expected dictionary structure. The inline import on line 27 is acceptable for avoiding circular dependencies.

Regarding the TODO on line 36 about graph file path info: Do you want me to help implement user-specific path construction similar to LanceDBDatasetDatabaseHandler (which builds paths under databases/{user.id}/)?


54-56: Add missing return type annotation for interface consistency.

The interface declares delete_dataset(...) -> None, but this implementation omits the return type. While Python allows this, adding the annotation maintains consistency with the interface contract.

     @classmethod
-    async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
+    async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
         pass

Given the past discussion about needing database deletion for cost management, consider adding a TODO comment to track the implementation:

     @classmethod
     async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
+        # TODO: Implement Kuzu database deletion to free resources
         pass
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e493ce and 702cdb4.

📒 Files selected for processing (3)
  • .github/workflows/db_examples_tests.yml (2 hunks)
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1 hunks)
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
.github/**

⚙️ CodeRabbit configuration file

.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.

  • 'actionlint' erroneously generates false positives when dealing with GitHub's ${{ ... }} syntax in conditionals.
  • 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid ${{ ... }} syntax.

Files:

  • .github/workflows/db_examples_tests.yml
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
  • DatasetDatabaseHandlerInterface (9-81)
  • create_dataset (12-34)
  • delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
  • create_dataset (19-116)
  • delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
  • create_dataset (17-37)
  • delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
.github/workflows/db_examples_tests.yml (2)

64-64: Verify: Why is ENABLE_BACKEND_ACCESS_CONTROL absent from the Kuzu example job?

The environment variable ENABLE_BACKEND_ACCESS_CONTROL: 'false' is added to both the Neo4j and PGVector example test steps, but is absent from the Kuzu example step. This inconsistency suggests either: (1) Kuzu does not yet support the multi-user mode feature introduced in this PR, or (2) this is an unintended omission.

Clarify whether this is intentional or if the variable should also be added to the Kuzu example.

Also applies to: 146-146


62-78: Environment variable addition aligns with PR objectives.

The addition of ENABLE_BACKEND_ACCESS_CONTROL: 'false' to test workflows is consistent with the PR's multi-tenant, multi-user mode feature. Disabling access control in test examples is appropriate for isolated testing scenarios.

Also applies to: 143-155

cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)

10-67: Well-structured interface with comprehensive documentation.

The interface design is sound: abstract methods for create_dataset and delete_dataset with a concrete default for resolve_dataset_connection_info that can be optionally overridden. The docstrings appropriately emphasize security considerations around credential handling.

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)

1-12: Clean imports and proper class structure.

The unused imports from the previous review have been removed. The class correctly implements DatasetDatabaseHandlerInterface with an appropriate docstring.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (3)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)

14-41: Validate dataset_id and document the legacy .pkl naming.

dataset_id is typed as Optional[UUID] but is required here to form graph_db_name; if it is ever None you’ll silently get "None.pkl". Consider either:

  • enforcing non-None with an explicit check and clear ValueError, or
  • supporting the None case explicitly in the naming logic.

Given the prior decision to keep the .pkl suffix for backward compatibility, it would help future maintainers to add a short comment explaining why this extension is used and what older Kuzu deployments expect, instead of leaving it implicit.


36-51: Avoid persisting raw DB credentials in the DatasetDatabase row.

Per the DatasetDatabaseHandlerInterface docstring, the dict returned from create_dataset is stored verbatim and later passed into resolve_dataset_connection_info, and the guidance is to prefer returning references to secrets/roles instead of plaintext credentials. Here you are returning graph_database_username and graph_database_password directly, which will end up persisted in the relational DB.

Consider adjusting this handler to:

  • store only non-secret identifiers (e.g., a secret name/ARN or config key) in graph_database_connection_info, and
  • override resolve_dataset_connection_info in this class to resolve those identifiers into short‑lived credentials at connection time without persisting them.

This keeps the schema compatible while improving the security posture.

cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (1)

63-116: Consider adding logging for debugging and observability.

The function lacks logging statements, which would be valuable for debugging dataset database creation issues, tracking handler failures, and monitoring race conditions.

Consider adding logging at key points:

from cognee.shared.logging_utils import get_logger

logger = get_logger(__name__)

async def get_or_create_dataset_database(
    dataset: Union[str, UUID],
    user: User,
) -> DatasetDatabase:
    """..."""
    logger.debug(f"Getting or creating dataset database for dataset={dataset}, user={user.id}")
    
    db_engine = get_relational_engine()
    dataset_id = await get_unique_dataset_id(dataset, user)
    
    # If dataset is given as name make sure the dataset is created first
    if isinstance(dataset, str):
        logger.debug(f"Creating dataset '{dataset}' for user {user.id}")
        async with db_engine.get_async_session() as session:
            await create_dataset(dataset, user, session)
    
    # If dataset database already exists return it
    existing_dataset_database = await _existing_dataset_database(dataset_id, user)
    if existing_dataset_database:
        logger.debug(f"Found existing dataset database for dataset_id={dataset_id}")
        return existing_dataset_database
    
    logger.debug(f"Fetching handler configs for dataset_id={dataset_id}")
    graph_config_dict = await _get_graph_db_info(dataset_id, user)
    vector_config_dict = await _get_vector_db_info(dataset_id, user)
    
    # ... rest of function with additional logging for create/error paths

Based on coding guidelines for files matching cognee/**/*.py.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 702cdb4 and 9244876.

📒 Files selected for processing (3)
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
  • cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (4 hunks)
  • cognee/modules/users/models/DatasetDatabase.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/modules/users/models/DatasetDatabase.py
  • cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/modules/users/models/DatasetDatabase.py
  • cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/modules/users/models/DatasetDatabase.py
  • cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/modules/users/models/DatasetDatabase.py
  • cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
  • DatasetDatabaseHandlerInterface (9-81)
  • create_dataset (12-34)
  • delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
  • create_dataset (19-116)
  • delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
  • create_dataset (17-37)
  • delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Code Quality for Weighted Edges
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
cognee/modules/users/models/DatasetDatabase.py (2)

3-3: LGTM!

The additional imports (JSON and text) are correctly added to support the new JSON fields with SQL-level defaults.


15-16: LGTM!

The removal of unique constraints on database names was discussed in prior reviews and enables the desired multi-tenant flexibility.

cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (2)

38-60: LGTM!

The function is well-documented and implements a straightforward database query to check for existing records. The docstring clearly explains the purpose, parameters, and return value.


104-105: The dict unpacking is safe by design. Both handlers use namespaced keys—graph handlers return graph_* prefixed keys and vector handlers return vector_* prefixed keys. These disjoint key sets cannot collide, so no silent overrides can occur.

Comment on lines +27 to +28
# configuration details for different database types. This would make it more flexible to add new database types
# without changing the database schema.
Contributor


⚠️ Potential issue | 🟡 Minor

Clarify incomplete comment.

The comment starts mid-sentence. Consider adding a subject for clarity.

Apply this diff:

-    # configuration details for different database types. This would make it more flexible to add new database types
+    # These fields store configuration details for different database types. This would make it more flexible to add new database types
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    # configuration details for different database types. This would make it more flexible to add new database types
-    # without changing the database schema.
+    # These fields store configuration details for different database types. This would make it more flexible to add new database types
+    # without changing the database schema.
🤖 Prompt for AI Agents
In cognee/modules/users/models/DatasetDatabase.py around lines 27 to 28, the
inline comment starts mid-sentence and is ambiguous; rewrite it to be a
complete, clear sentence that states the subject (e.g., "Store database
configuration details for different database types") and explains intent briefly
— that the model stores configuration details per database type to allow adding
new database types without changing the schema; update the comment text
accordingly to be grammatically complete and self-contained.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)

155-157: Make the unimplemented delete_dataset explicit.

delete_dataset currently just passes, which can lead to silent no‑ops if it’s ever invoked.

Replacing it with an explicit NotImplementedError (and a TODO) will fail fast and make the omission clear:

     @classmethod
     async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
-        pass
+        # TODO: Implement Neo4j Aura instance deletion via the Aura API.
+        raise NotImplementedError(
+            "Neo4jAuraDatasetDatabaseHandler.delete_dataset is not implemented yet."
+        )

58-67: Add timeouts and structured error handling to Aura token + instance creation calls.

Both the OAuth token request and the instance creation call use requests.post without a timeout or robust error handling, and they repeatedly call response.json(). That can lead to indefinite hangs and obscure failures.

A possible tightening:

-        # Make the request with HTTP Basic Auth
-        def get_aura_token(client_id: str, client_secret: str) -> dict:
-            url = "https://api.neo4j.io/oauth/token"
-            data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
-
-            resp = requests.post(url, data=data, auth=(client_id, client_secret))
-            resp.raise_for_status()  # raises if the request failed
-            return resp.json()
+        # Make the request with HTTP Basic Auth
+        def get_aura_token(client_id: str, client_secret: str) -> dict:
+            url = "https://api.neo4j.io/oauth/token"
+            data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
+
+            try:
+                resp = requests.post(
+                    url,
+                    data=data,
+                    auth=(client_id, client_secret),
+                    timeout=30,
+                )
+                resp.raise_for_status()
+            except requests.Timeout as exc:
+                raise TimeoutError("Timed out while requesting Neo4j Aura OAuth token") from exc
+            except requests.RequestException as exc:
+                raise RuntimeError("Failed to request Neo4j Aura OAuth token") from exc
+
+            return resp.json()
@@
-        response = requests.post(url, headers=headers, json=payload)
-
-        graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-        graph_db_url = response.json()["data"]["connection_url"]
-        graph_db_key = resp["access_token"]
-        graph_db_username = response.json()["data"]["username"]
-        graph_db_password = response.json()["data"]["password"]
+        try:
+            response = requests.post(
+                url,
+                headers=headers,
+                json=payload,
+                timeout=60,
+            )
+            response.raise_for_status()
+        except requests.Timeout as exc:
+            raise TimeoutError(
+                f"Timed out while provisioning Neo4j Aura instance for dataset {dataset_id}"
+            ) from exc
+        except requests.RequestException as exc:
+            raise RuntimeError(
+                f"Failed to provision Neo4j Aura instance for dataset {dataset_id}"
+            ) from exc
+
+        response_data = response.json()["data"]
+
+        graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+        graph_db_url = response_data["connection_url"]
+        graph_db_key = resp["access_token"]
+        graph_db_username = response_data["username"]
+        graph_db_password = response_data["password"]
@@
-        instance_id = response.json()["data"]["id"]
+        instance_id = response_data["id"]

This makes failures explicit, prevents indefinite blocking, and avoids reparsing the JSON body multiple times.


Also applies to: 90-97, 114-115


98-112: Avoid blocking the async event loop when polling Aura instance status; add timeout and error checks.

_wait_for_neo4j_instance_provisioning is async but calls requests.get directly in a loop with no timeout, which will block the event loop for up to ~5 minutes and still lacks raise_for_status().

One approach that keeps requests but avoids blocking:

-        async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
-            # Poll until the instance is running
-            status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
-            status = ""
-            for attempt in range(30):  # Try for up to ~5 minutes
-                status_resp = requests.get(
-                    status_url, headers=headers
-                )  # TODO: Use async requests with httpx
-                status = status_resp.json()["data"]["status"]
-                if status.lower() == "running":
-                    return
-                await asyncio.sleep(10)
-            raise TimeoutError(
-                f"Neo4j instance '{graph_db_name}' did not become ready within 5 minutes. Status: {status}"
-            )
+        async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict) -> None:
+            # Poll until the instance is running
+            status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
+            status = ""
+            for _ in range(30):  # Try for up to ~5 minutes
+                try:
+                    status_resp = await asyncio.to_thread(
+                        requests.get,
+                        status_url,
+                        headers=headers,
+                        timeout=30,
+                    )
+                    status_resp.raise_for_status()
+                except requests.Timeout as exc:
+                    raise TimeoutError(
+                        f"Timed out while polling Neo4j instance '{instance_id}' provisioning status."
+                    ) from exc
+                except requests.RequestException as exc:
+                    raise RuntimeError(
+                        f"Error while polling Neo4j instance '{instance_id}' provisioning status."
+                    ) from exc
+
+                status = status_resp.json()["data"]["status"]
+                if status.lower() == "running":
+                    return
+
+                await asyncio.sleep(10)
+
+            raise TimeoutError(
+                f"Neo4j instance '{instance_id}' did not become ready within 5 minutes. Last status: {status}"
+            )

Alternatively, switching to httpx.AsyncClient throughout would give you a fully async Aura client, but the above is a minimal, localized change.

🧹 Nitpick comments (2)
cognee/modules/data/deletion/prune_system.py (1)

9-16: Refine access‑control gating: cache flag once and document behavior

The new checks correctly prevent graph/vector pruning when backend access control is enabled, which matches the multi‑user safety goal. Two improvements to consider:

  1. Avoid repeated calls
    Cache the flag once at the top of the function to keep behavior consistent and avoid redundant calls:
 async def prune_system(graph=True, vector=True, metadata=True, cache=True):
-    # TODO: prune_system should work with multi-user access control mode enabled
-    if graph and not backend_access_control_enabled():
+    # TODO: prune_system should work with multi-user access control mode enabled
+    access_control_enabled = backend_access_control_enabled()
+
+    if graph and not access_control_enabled:
         graph_engine = await get_graph_engine()
         await graph_engine.delete_graph()
 
-    if vector and not backend_access_control_enabled():
+    if vector and not access_control_enabled:
         vector_engine = get_vector_engine()
         await vector_engine.prune()
  2. Document the new semantics
    prune_system now has mode‑dependent behavior (graph/vector are no‑ops when access control is on). Adding a short docstring that explains this will help callers understand why pruning may be skipped and how the TODO will eventually be addressed (e.g., tenant‑aware pruning instead of a full disable). This also aligns with the project guideline that undocumented functions are considered incomplete.
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1)

40-41: Clarify naming between dataset identifier and Aura DB name, and correct the slice length.

graph_db_name is initially set to f"{dataset_id}", later overwritten with "neo4j", and sliced as graph_db_name[0:29] while the comment mentions a 30‑character limit. That’s a bit confusing for future readers.

Consider something like:

-        graph_db_name = f"{dataset_id}"
+        dataset_label = str(dataset_id)
@@
-            "name": graph_db_name[
-                0:29
-            ],  # TODO: Find better name to name Neo4j instance within 30 character limit
+            "name": dataset_label[:30],  # TODO: Find better name within the 30-char limit
@@
-        graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+        graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura

This makes it clear which value is a human‑friendly instance label vs the fixed database name and aligns the slice with the stated limit.

Also applies to: 82-84, 92-92

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9244876 and 1282905.

📒 Files selected for processing (2)
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1 hunks)
  • cognee/modules/data/deletion/prune_system.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.

Applied to files:

  • cognee/modules/data/deletion/prune_system.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (4)
  • DatasetDatabaseHandlerInterface (9-81)
  • create_dataset (12-34)
  • resolve_dataset_connection_info (37-67)
  • delete_dataset (71-81)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
  • create_dataset (15-51)
  • delete_dataset (54-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Run Telemetry Test
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Basic Graph Tests
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Simple Examples BAML
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Unit Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)

1-1: Import of backend_access_control_enabled is appropriate

The new import cleanly supports the gating logic below; no issues from an API or style perspective.

Comment on lines +42 to +50
# Client credentials and encryption
client_id = os.environ.get("NEO4J_CLIENT_ID", None)
client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
Contributor

@coderabbitai coderabbitai bot Dec 2, 2025


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

wc -l cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py

Repository: topoteretes/cognee

Length of output: 151


🏁 Script executed:

cat -n cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py | head -160

Repository: topoteretes/cognee

Length of output: 7391


🌐 Web query:

cryptography.fernet key management best practices official documentation

💡 Result:

Short summary of official guidance and best practices for managing cryptography.fernet keys

  • Key format & secrecy: Fernet keys are URL-safe base64‑encoded 32‑byte secrets and "must be kept secret; anyone with this key can create and read messages." (cryptography.io)
  • Key rotation: Use MultiFernet to implement key rotation—place the new key first to encrypt new tokens, keep older keys to decrypt existing tokens, and remove/retire old keys when safe. Use MultiFernet.rotate() to re-encrypt tokens under the new primary key. Token/key rotation is described as a recommended cryptographic hygiene practice. (cryptography.io)
  • Practical key-management controls (recommended, not specific to library):
    • Generate keys with a secure RNG and store secrets in a dedicated secrets manager or HSM (do not hard-code keys in source or repos). See NIST and OWASP key‑management guidance for lifecycle controls (generation, storage, rotation, access control, backup/recovery, destruction). (csrc.nist.gov)
    • Limit access (least privilege), audit key usage, and have procedures for compromise (immediate rotation + re‑encryption/retirement). MultiFernet supports retiring old keys after rotation. (cryptography.io)
  • References (official docs and standards):
    1. cryptography (Fernet / MultiFernet) official docs. (cryptography.io)
    2. NIST key‑management guidance (SP 800‑57 series). (csrc.nist.gov)
    3. OWASP Key Management Cheat Sheet. (cheatsheetseries.owasp.org)

If you want, I can: (a) extract the exact code patterns for generating, rotating, and retiring Fernet keys; or (b) produce a short ops checklist (storage options, rotation schedule, incident playbook) tailored to your environment. Which would you prefer?



🏁 Script executed:

rg -A 10 "class DatasetDatabaseHandlerInterface" --type py

Repository: topoteretes/cognee

Length of output: 1862


🏁 Script executed:

cat cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py

Repository: topoteretes/cognee

Length of output: 4479


Require NEO4J_ENCRYPTION_KEY environment variable and reconsider secret storage design to align with interface guidance.

The code falls back to an insecure default encryption key ("test_key") at lines 46 and 141, which contradicts the cryptography library's fundamental requirement that Fernet keys "must be kept secret" and violates NIST/OWASP key management practices. More critically, the current design persists both an encrypted password ciphertext and a plaintext Aura access token in the relational database, which conflicts with the DatasetDatabaseHandlerInterface contract guidance to "prefer returning only references to secrets or role identifiers, not plaintext credentials" and "do not persist any resolved or decrypted secrets back to the relational database."

Two required changes:

  1. Remove the insecure default encryption key. Require NEO4J_ENCRYPTION_KEY to be explicitly set:
-        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+        if not encryption_env_key:
+            raise ValueError(
+                "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+            )

Apply the same fix at line 141 in resolve_dataset_connection_info() with message: "NEO4J_ENCRYPTION_KEY must be set in order to decrypt Neo4j Aura credentials."

  2. Store only references to credentials, not the credentials themselves. Instead of persisting the encrypted password ciphertext and plaintext access token, store only the Aura instance_id and tenant_id in the returned dictionary. In resolve_dataset_connection_info(), use those references to call the Aura API (via the stored OAuth credentials or a refresh flow) to obtain fresh, short-lived credentials at connection time. This aligns with the interface's explicit guidance and eliminates the risk that a DB dump exposes long-lived secrets.

See official guidance: cryptography.io Fernet key management, NIST SP 800-57 key management, OWASP Key Management Cheat Sheet.

Also fix the empty delete_dataset() stub at line 156–157 by raising NotImplementedError with a TODO comment.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        # Client credentials and encryption
-        client_id = os.environ.get("NEO4J_CLIENT_ID", None)
-        client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
-        tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
-        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
-        encryption_key = base64.urlsafe_b64encode(
-            hashlib.sha256(encryption_env_key.encode()).digest()
-        )
-        cipher = Fernet(encryption_key)
+        # Client credentials and encryption
+        client_id = os.environ.get("NEO4J_CLIENT_ID", None)
+        client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
+        tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
+        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+        if not encryption_env_key:
+            raise ValueError(
+                "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+            )
+        encryption_key = base64.urlsafe_b64encode(
+            hashlib.sha256(encryption_env_key.encode()).digest()
+        )
+        cipher = Fernet(encryption_key)
🤖 Prompt for AI Agents
In
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
around lines 42–50 (and apply same check at line 141), remove the insecure
fallback by requiring NEO4J_ENCRYPTION_KEY to be present (raise a clear error if
missing and at line 141 use the message "NEO4J_ENCRYPTION_KEY must be set in
order to decrypt Neo4j Aura credentials."); stop writing encrypted password
ciphertexts and plaintext access tokens back to the relational DB—modify the
persistence/return values so only Aura references (instance_id and tenant_id)
are stored/returned, and update resolve_dataset_connection_info() to use those
references to obtain short‑lived credentials via the Aura OAuth/refresh flow at
connection time; finally replace the empty delete_dataset() stub at lines
156–157 with a NotImplementedError and a TODO comment indicating proper deletion
behavior to implement later.

Member


@pazone How important do you think this is? It is a risk, but we basically have it everywhere. This ensures that cognee works with minimal config. The setup will be a bit more burdensome if we don't have those default values and expect users to provide them every time.

Contributor


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Collaborator Author


By the way, the Neo4j Aura password encryption is not safe and not for production use. The idea was just to have a simple example with minimal config showing our users how it could be done, so they can do it themselves in a safe way with secret vaults, load balancers, etc.

Contributor


IIUC we don't store credentials here, right?

Contributor

@pazone pazone Dec 9, 2025


nit: This is repeated code:

encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)

Extract a private method?
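
A minimal sketch of that extraction, preserving the current behavior (including the default key discussed in this thread); the helper name and module-level placement are assumptions.

import base64
import hashlib
import os

from cryptography.fernet import Fernet


def _build_credentials_cipher() -> Fernet:
    # Helper name assumed; derives the Fernet cipher used to encrypt and decrypt Aura credentials.
    encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
    encryption_key = base64.urlsafe_b64encode(
        hashlib.sha256(encryption_env_key.encode()).digest()
    )
    return Fernet(encryption_key)

Both create_dataset and resolve_dataset_connection_info could then call this single helper instead of repeating the derivation.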

Collaborator Author


We store credentials here, but they are stored encrypted with a custom string that can be set; we decrypt them in the resolution step. If there is a simple service we can call without much setup to store credentials instead, let me know.

This is intended more as an example for our users of how a Neo4j dataset database handler can be written than as an actual handler we will use in production or recommend for production use. If we decide to use Neo4j for SaaS, we can make a private handler there that utilizes a third-party service for credential management.

Ideally, all production users should at the very least write their own credential resolution mechanism; we support registering and using custom dataset database handlers.
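
A hedged sketch of what such a custom credential resolution could look like, for example by overriding resolve_dataset_connection_info on the Kuzu handler; the method signature (stored dict in, resolved dict out), the secret-reference key, and the environment-based lookup are assumptions rather than this PR's actual code.

import os
from typing import Any, Dict

from cognee.infrastructure.databases.graph.kuzu.KuzuDatasetDatabaseHandler import (
    KuzuDatasetDatabaseHandler,  # module path taken from the PR's file listing
)


class VaultResolvedKuzuHandler(KuzuDatasetDatabaseHandler):
    """Sketch: keep only a secret reference in the stored connection info and resolve it late."""

    @classmethod
    async def resolve_dataset_connection_info(cls, connection_info: Dict[str, Any]) -> Dict[str, Any]:
        # Signature assumed from the interface description earlier in this review; swap
        # os.environ for a vault or secret-manager client in a real deployment.
        resolved = dict(connection_info)
        secret_name = resolved.pop("graph_database_password_secret", None)  # hypothetical key
        if secret_name:
            resolved["graph_database_password"] = os.environ[secret_name]
        return resolved

Such a handler is registered with use_dataset_database_handler like any other custom handler, as sketched after the changes table above.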

from cognee.infrastructure.databases.dataset_database_handler import DatasetDatabaseHandlerInterface


class LanceDBDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
Contributor


Question: Is there a test that covers this class? I tried running the only changed test with coverage and there are no hits.

Collaborator Author
@dexters1, Dec 3, 2025

Yes, all the standard Cognee tests run with this and the Kuzu class. The only class not currently covered is Neo4jAura, but we can't cover that one since it's an expensive paid service.

Comment on lines +40 to +41
dataset_database = await _get_vector_db_connection_info(dataset_database)
dataset_database = await _get_graph_db_connection_info(dataset_database)
Contributor

Question: do we assign the value twice? Is that expected?

Collaborator Author

Yes, the vector connection call sets values for vector databases and the graph connection call sets values for graph databases. They are not overwritten and have separate fields in the data model.
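
A small self-contained sketch of why the double assignment is benign (the helper bodies and the stand-in model are illustrative, not the PR's code):

from dataclasses import dataclass, field


@dataclass
class FakeDatasetDatabase:
    # Stand-in for the DatasetDatabase model; only the two JSON fields matter here.
    vector_database_connection_info: dict = field(default_factory=dict)
    graph_database_connection_info: dict = field(default_factory=dict)


async def _get_vector_db_connection_info(dataset_database):
    # Touches only the vector-side field.
    dataset_database.vector_database_connection_info = {"vector_db_provider": "lancedb"}
    return dataset_database


async def _get_graph_db_connection_info(dataset_database):
    # Touches only the graph-side field.
    dataset_database.graph_database_connection_info = {"graph_database_provider": "kuzu"}
    return dataset_database


async def resolve(dataset_database):
    # The same object flows through both helpers; each reassignment refreshes
    # the reference, but the fields they populate do not overlap.
    dataset_database = await _get_vector_db_connection_info(dataset_database)
    dataset_database = await _get_graph_db_connection_info(dataset_database)
    return dataset_database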

@pazone
Contributor

pazone commented Dec 2, 2025

It's on the right track. I'd recommend ensuring test coverage to avoid future problems.


@classmethod
async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
    pass
Member

Ok, then I guess we need to handle delete here. In the case of Kuzu that is just calling delete_graph.
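
A minimal sketch of the idea, assuming the key names returned by create_dataset in this PR (deleting the file directly is only an illustration; the actual handler goes through the graph engine's delete_graph):

import os

from cognee.modules.users.models import DatasetDatabase


async def delete_kuzu_dataset(dataset_database: DatasetDatabase) -> None:
    # The Kuzu handler stores the database file path under graph_database_url
    # in the per-dataset connection info.
    connection_info = dataset_database.graph_database_connection_info or {}
    graph_file_path = connection_info.get("graph_database_url")
    if graph_file_path and os.path.exists(graph_file_path):
        os.remove(graph_file_path)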

Comment on lines +42 to +50
# Client credentials and encryption
client_id = os.environ.get("NEO4J_CLIENT_ID", None)
client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
Member

@pazone How important do you think this is? It is a risk, but we basically have it everywhere. This ensures that cognee will work with minimal config. The setup will be a bit more burdensome if we don't have those default values and expect users to provide them every time.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
cognee/tests/test_dataset_database_handler.py (1)

39-50: Align Kuzu handler database path with LanceDB handler and test assertions

KuzuTestDatasetDatabaseHandler.create_dataset currently uses a relative path:

databases_directory_path = os.path.join("databases", str(user.id))

while LanceDBTestDatasetDatabaseHandler builds an absolute path under .cognee_system/test_dataset_database_handler/databases/..., and main() asserts the Kuzu DB file exists under that same absolute base. This inconsistency can cause the Kuzu DB file to be written to a different location than the assertion expects (and from where LanceDB writes its DB).

You can mirror the LanceDB handler’s path logic so both handlers and the assertions agree:

 class KuzuTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
     @classmethod
-    async def create_dataset(cls, dataset_id, user):
-        databases_directory_path = os.path.join("databases", str(user.id))
-        os.makedirs(databases_directory_path, exist_ok=True)
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
+        import pathlib
+
+        cognee_directory_path = str(
+            pathlib.Path(
+                os.path.join(
+                    pathlib.Path(__file__).parent, ".cognee_system/test_dataset_database_handler"
+                )
+            ).resolve()
+        )
+        databases_directory_path = os.path.join(cognee_directory_path, "databases", str(user.id))
+        os.makedirs(databases_directory_path, exist_ok=True)
@@
-        graph_db_name = "test.kuzu"
+        graph_db_name = "test.kuzu"
         return {
             "graph_database_name": graph_db_name,
             "graph_database_url": os.path.join(databases_directory_path, graph_db_name),
             "graph_database_provider": "kuzu",
         }

This also addresses the earlier concern about inconsistent test artifact locations between the two handlers.

🧹 Nitpick comments (5)
.github/workflows/e2e_tests.yml (2)

104-116: Selective disabling of backend access control looks reasonable

Using ENV: 'local'/'dev' plus ENABLE_BACKEND_ACCESS_CONTROL: 'false' for the telemetry pipeline and deduplication examples is a pragmatic way to keep existing scenarios working while you introduce the new gating. Just make sure there is at least one CI path that runs with backend access control enabled (or defaulted on) so those new checks are exercised as well.

Also applies to: 148-158


216-239: Consider wiring the dataset‑database handler scenario through pytest

The new test-dataset-database-handler job mirrors the other integration jobs, but it runs the module directly via python instead of through pytest. If you add a small @pytest.mark.asyncio test wrapper around main() in cognee/tests/test_dataset_database_handler.py and invoke this job with uv run pytest cognee/tests/test_dataset_database_handler.py, you’ll get consistent reporting and coverage accounting with the rest of your test suite.

cognee/tests/test_dataset_database_handler.py (3)

4-12: Be cautious with environment mutation at import time

Setting VECTOR_DATASET_DATABASE_HANDLER and GRAPH_DATASET_DATABASE_HANDLER at module import time is fine for a standalone example script, but if this module is ever imported by pytest as part of a larger test run it will silently affect global process state for other tests. If you intend this to behave as a “real” test module, consider moving these os.environ[...] assignments into main() (or a pytest fixture) so the scope and lifetime of the override are explicit.
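
One way to keep that override scoped, as a sketch using pytest's monkeypatch (the handler key values below are placeholders; whatever names the handlers are registered under in the test module would go here):

import pytest


@pytest.fixture
def dataset_database_handler_env(monkeypatch):
    # Limit the handler selection to tests that request this fixture instead of
    # mutating os.environ at import time; monkeypatch restores the previous
    # values automatically on teardown.
    monkeypatch.setenv("VECTOR_DATASET_DATABASE_HANDLER", "lancedb_test")
    monkeypatch.setenv("GRAPH_DATASET_DATABASE_HANDLER", "kuzu_test")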


15-36: Minor cleanup for LanceDB test handler (imports and lints)

The implementation looks correct for the test scenario. Two small cleanups you might consider:

  • Move the import pathlib to the top of the file with the other imports to avoid re-importing it every time create_dataset is called.
  • dataset_id is required by the interface but unused here; if Ruff’s ARG003 is enabled, you can either add a # noqa: ARG003 on the method definition or add a short comment explaining that it’s intentionally unused in this test handler.

Example diff:

-import asyncio
-import os
+import asyncio
+import os
+import pathlib
@@
-class LanceDBTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
+class LanceDBTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
@@
-    async def create_dataset(cls, dataset_id, user):
-        import pathlib
-
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
         cognee_directory_path = str(
             pathlib.Path(
                 os.path.join(

53-135: Turn this example into a proper pytest test while keeping the script entry point

The main() coroutine plus the if __name__ == "__main__": ... block make this a nice executable example, but as-is it doesn’t define any pytest tests, so it won’t be exercised by pytest-based runs or contribute to pytest-driven coverage.

A lightweight way to get both benefits (example + test) is to add a small async test wrapper that just calls main():

 from cognee.api.v1.search import SearchType
@@
 async def main():
@@
         os.path.join(cognee_directory_path, "databases", str(default_user.id), "test.lance.db")
     ), "Vector database file not found."
 
-
-if __name__ == "__main__":
+import pytest
+
+
+@pytest.mark.asyncio
+async def test_dataset_database_handler_example():
+    """Exercise custom LanceDB and Kuzu dataset database handlers end-to-end."""
+    await main()
+
+
+if __name__ == "__main__":
     logger = setup_logging(log_level=ERROR)
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)

That keeps the example runnable via python cognee/tests/test_dataset_database_handler.py while also letting pytest discover and run it with pytest, aligning with your testing guidelines and earlier review feedback about code coverage. Short docstrings for the two handler classes and main() would also help future readers understand this example’s intent.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1282905 and 5698c60.

📒 Files selected for processing (2)
  • .github/workflows/e2e_tests.yml (3 hunks)
  • cognee/tests/test_dataset_database_handler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
.github/**

⚙️ CodeRabbit configuration file

.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.

  • 'actionlint' erroneously generates false positives when dealing with GitHub's ${{ ... }} syntax in conditionals.
  • 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid ${{ ... }} syntax.

Files:

  • .github/workflows/e2e_tests.yml
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/tests/test_dataset_database_handler.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/tests/test_dataset_database_handler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/tests/test_dataset_database_handler.py
cognee/tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests

Files:

  • cognee/tests/test_dataset_database_handler.py
cognee/tests/*

⚙️ CodeRabbit configuration file

cognee/tests/*: When reviewing test code:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
  6. As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.

Files:

  • cognee/tests/test_dataset_database_handler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: Basic Tests / Run Linting
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
  • GitHub Check: Test Weighted Edges Core Functionality (3.12)

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (8)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)

28-30: Potential AttributeError when user is None.

The user parameter is Optional[User], but user.id is accessed without a null check. This will raise an AttributeError if the caller passes None.

Apply this diff to add validation:

     @classmethod
     async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
         vector_config = get_vectordb_config()
         base_config = get_base_config()

         if vector_config.vector_db_provider != "lancedb":
             raise ValueError(
                 "LanceDBDatasetDatabaseHandler can only be used with LanceDB vector database provider."
             )
 
+        if user is None:
+            raise ValueError("User is required for LanceDB dataset creation.")
+
         databases_directory_path = os.path.join(
             base_config.system_root_directory, "databases", str(user.id)
         )

32-32: dataset_id is Optional[UUID] but used directly.

If dataset_id is None, this will produce "None.lance.db" as the database name, which is likely unintended.

Apply this diff to add validation:

+        if dataset_id is None:
+            raise ValueError("Dataset ID is required for LanceDB dataset creation.")
+
         vector_db_name = f"{dataset_id}.lance.db"
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (6)

46-50: Insecure default encryption key exposes credentials.

The fallback to "test_key" violates cryptographic best practices. As noted in past reviews and developer comments, this implementation is not production-safe.

Per extensive past review discussion, require the environment variable explicitly:

-        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+        if not encryption_env_key:
+            raise ValueError(
+                "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+            )

Apply the same fix at Line 141 in resolve_dataset_connection_info().


62-62: Add timeout to HTTP request.

The requests.post call lacks a timeout, which can cause indefinite blocking if the Neo4j API is unresponsive.

-            resp = requests.post(url, data=data, auth=(client_id, client_secret))
+            resp = requests.post(url, data=data, auth=(client_id, client_secret), timeout=30)

As per coding guidelines, "Prefer explicit, structured error handling in Python code."


90-96: Missing timeout and inefficient JSON parsing.

The instance creation request lacks a timeout and calls response.json() multiple times, which is inefficient.

-        response = requests.post(url, headers=headers, json=payload)
+        response = requests.post(url, headers=headers, json=payload, timeout=60)
+        response.raise_for_status()

+        response_data = response.json()["data"]
         graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-        graph_db_url = response.json()["data"]["connection_url"]
+        graph_db_url = response_data["connection_url"]
         graph_db_key = resp["access_token"]
-        graph_db_username = response.json()["data"]["username"]
-        graph_db_password = response.json()["data"]["password"]
+        graph_db_username = response_data["username"]
+        graph_db_password = response_data["password"]

98-112: Synchronous HTTP blocks the async event loop during polling.

Using requests.get() inside an async function blocks the event loop for up to 5 minutes. Consider using httpx with async support or asyncio.to_thread().

         async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
             # Poll until the instance is running
             status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
             status = ""
-            for attempt in range(30):  # Try for up to ~5 minutes
-                status_resp = requests.get(status_url, headers=headers)
+            for _ in range(30):  # Try for up to ~5 minutes
+                status_resp = await asyncio.to_thread(
+                    requests.get, status_url, headers=headers, timeout=30
+                )
                 status = status_resp.json()["data"]["status"]
                 if status.lower() == "running":
                     return
                 await asyncio.sleep(10)

146-148: Add error handling for decryption failures.

The decryption lacks error handling for KeyError (missing password field) or InvalidToken (wrong key/corrupted data).

-        graph_db_password = cipher.decrypt(
-            dataset_database.graph_database_connection_info["graph_database_password"].encode()
-        ).decode()
+        try:
+            encrypted_password = dataset_database.graph_database_connection_info["graph_database_password"]
+            graph_db_password = cipher.decrypt(encrypted_password.encode()).decode()
+        except KeyError:
+            raise ValueError("Missing encrypted password in dataset connection info")
+        except Exception as e:
+            raise ValueError(f"Failed to decrypt Neo4j password: {e}")

As per coding guidelines, "Prefer explicit, structured error handling in Python code."


155-157: Implement delete_dataset or raise NotImplementedError.

The abstract method is unimplemented. Raise NotImplementedError to make it explicit that this functionality is pending.

     @classmethod
     async def delete_dataset(cls, dataset_database: DatasetDatabase):
-        pass
+        # TODO: Implement instance deletion via Neo4j Aura API
+        raise NotImplementedError("Neo4j Aura instance deletion not yet implemented")
🧹 Nitpick comments (3)
cognee/modules/data/deletion/prune_system.py (2)

14-16: Move import to module level unless circular dependency exists.

Imports placed inside functions are generally a code smell and can impact readability and performance. Consider moving this import to the top of the module unless there's a specific circular dependency issue.

Apply this diff:

 from cognee.context_global_variables import backend_access_control_enabled
 from cognee.infrastructure.databases.vector import get_vector_engine
 from cognee.infrastructure.databases.graph.get_graph_engine import get_graph_engine
 from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.infrastructure.databases.vector.config import get_vectordb_config
 from cognee.infrastructure.databases.graph.config import get_graph_config
+from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
+    supported_dataset_database_handlers,
+)
 from cognee.shared.cache import delete_cache
 from cognee.modules.users.models import DatasetDatabase

Then remove lines 14-16 from the function.


49-51: Update or remove the outdated TODO comment.

The TODO states "prune_system should work with multi-user access control mode enabled" but the code in lines 52-62 appears to implement exactly this functionality via prune_graph_databases() and prune_vector_databases().

Either remove the TODO if the implementation is complete, or clarify what remaining work is needed.

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)

31-31: Remove redundant import.

get_graph_config is already imported at Line 9. Remove this duplicate import from within the method.

Apply this diff:

-        from cognee.infrastructure.databases.graph.config import get_graph_config
-
         graph_config = get_graph_config()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5698c60 and f4078d1.

📒 Files selected for processing (6)
  • .github/workflows/e2e_tests.yml (2 hunks)
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1 hunks)
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1 hunks)
  • cognee/modules/data/deletion/prune_system.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
.github/**

⚙️ CodeRabbit configuration file

.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.

  • 'actionlint' erroneously generates false positives when dealing with GitHub's ${{ ... }} syntax in conditionals.
  • 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid ${{ ... }} syntax.

Files:

  • .github/workflows/e2e_tests.yml
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
  • cognee/modules/data/deletion/prune_system.py
  • cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.

Applied to files:

  • cognee/modules/data/deletion/prune_system.py
🧬 Code graph analysis (2)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (7)
cognee/infrastructure/databases/graph/get_graph_engine.py (2)
  • get_graph_engine (10-24)
  • create_graph_engine (28-169)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-37)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
  • DatasetDatabaseHandlerInterface (9-80)
  • create_dataset (12-34)
  • delete_dataset (71-80)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
  • create_dataset (21-129)
  • delete_dataset (156-157)
cognee/infrastructure/databases/graph/graph_db_interface.py (1)
  • delete_graph (296-300)
cognee/infrastructure/databases/graph/neo4j_driver/adapter.py (1)
  • delete_graph (839-861)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-37)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
  • create_dataset (19-55)
  • delete_dataset (58-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
  • GitHub Check: Test Weighted Edges Examples
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
🔇 Additional comments (6)
.github/workflows/e2e_tests.yml (2)

215-238: The test job configuration is correct and complete.

The test file exists at ./cognee/tests/test_dataset_database_handler.py and uses custom LanceDB and Kuzu handlers with local file-based storage. No external service dependencies (postgres, redis, docker) are required. The environment variables for LLM and embedding configuration are already present in the job, and all test setup is self-contained within the test file.


150-150: No actionable change needed.

The ENABLE_BACKEND_ACCESS_CONTROL: 'false' setting for the Deduplication Example test is appropriate. This flag controls multi-tenant data isolation at the database layer and does not affect deduplication logic. The test file contains pure deduplication verification without any access control dependencies. This setting pattern is consistent with other non-access-control focused tests in the codebase (e.g., test_relational_db_migration.py, code_graph_example.py). Deduplication behavior is identical regardless of the access control mode.

cognee/modules/data/deletion/prune_system.py (1)

1-1: LGTM!

The new imports are appropriate for the added functionality.

Also applies to: 5-6, 8-8

cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)

9-80: LGTM: Well-designed interface with clear security guidance.

The interface establishes a clean contract for dataset database provisioning with comprehensive docstrings. The security guidance regarding credential handling (preferring secret references over plaintext in create_dataset and resolving at connection time via resolve_dataset_connection_info) aligns with best practices.

cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)

42-52: LGTM: Proper cleanup implementation.

The delete_dataset method correctly creates a vector engine with the dataset's connection info and invokes the prune operation to clean up resources.

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)

57-81: LGTM: Proper graph deletion implementation.

The delete_dataset method now correctly constructs the graph file path and invokes the graph engine's delete_graph() to remove the dataset's backing database. This addresses the prior review concern about the no-op implementation.

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
cognee/modules/data/deletion/prune_system.py (5)

17-39: Add docstring to prune_graph_databases function.

The function lacks a docstring. As per coding guidelines, undocumented function definitions are assumed incomplete and should have docstrings explaining purpose, behavior, and any important notes.

Add a concise docstring:

 async def prune_graph_databases():
+    """
+    Prune all graph databases for registered datasets.
+    
+    Iterates through dataset_database records and deletes each graph database
+    using the configured handler. Skips pruning if the dataset_database table
+    is not accessible.
+    """
     async def _prune_graph_db(dataset_database: DatasetDatabase) -> dict:

As per coding guidelines, docstrings are required for function definitions.


24-25: Add validation and error handling for handler access and deletion.

The dictionary access on line 24 could raise KeyError if the configured handler is not registered. Additionally, the delete_dataset call on line 25 lacks error handling, which could cause the entire pruning operation to fail when one dataset deletion fails.

Apply this diff:

-        handler = supported_dataset_database_handlers[graph_config.graph_dataset_database_handler]
-        return await handler["handler_instance"].delete_dataset(dataset_database)
+        handler_key = graph_config.graph_dataset_database_handler
+        if handler_key not in supported_dataset_database_handlers:
+            logger.error(f"Graph handler '{handler_key}' not registered")
+            raise ValueError(f"Graph dataset database handler '{handler_key}' not registered")
+        
+        handler = supported_dataset_database_handlers[handler_key]
+        try:
+            return await handler["handler_instance"].delete_dataset(dataset_database)
+        except Exception as e:
+            logger.error(f"Failed to delete graph database for dataset {dataset_database.dataset_id}: {e}")
+            raise

As per coding guidelines, use shared logging utilities from cognee.shared.logging_utils and prefer explicit error handling.


41-63: Address duplicate issues from prune_graph_databases.

This function has the same issues as prune_graph_databases:

  1. Missing docstring (lines 41-63)
  2. Import inside nested function (lines 45-47) should be moved to module level
  3. Dictionary access without key validation (line 49) could raise KeyError
  4. Missing error handling for delete_dataset call (line 50)

Apply similar fixes as suggested for prune_graph_databases:

  • Add docstring describing the function's purpose
  • Move the import to module level (already covered in the earlier comment)
  • Validate that vector_config.vector_dataset_database_handler exists in supported_dataset_database_handlers before accessing
  • Wrap delete_dataset in try-except with logging

As per coding guidelines, use shared logging utilities from cognee.shared.logging_utils and prefer explicit error handling.


66-66: Add docstring to prune_system function.

The function lacks a docstring. As per coding guidelines, undocumented function definitions are assumed incomplete and should have docstrings.

Add a docstring explaining the function's purpose, parameters, and the important note about API availability:

 async def prune_system(graph=True, vector=True, metadata=True, cache=True):
+    """
+    Prune system databases and cache.
+    
+    Args:
+        graph: If True, prune graph databases
+        vector: If True, prune vector databases  
+        metadata: If True, prune metadata database
+        cache: If True, delete cache
+    
+    Note:
+        This function should not be exposed through the API as it has no permission
+        checks and will delete all databases. Use only in development/testing.
+    """
     # Note: prune system should not be available through the API, it has no permission checks and will

As per coding guidelines, docstrings are required for function definitions.


72-79: Add error handling for new pruning functions.

The calls to prune_graph_databases() (line 73) and prune_vector_databases() (line 79) lack error handling. Given that this function is called at pipeline start (based on learnings) and is used in development/testing, failures should be handled gracefully to prevent cascading issues.

Apply this diff:

     elif graph and backend_access_control_enabled():
-        await prune_graph_databases()
+        try:
+            await prune_graph_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune graph databases: {e}")
+            raise

     if vector and not backend_access_control_enabled():
         vector_engine = get_vector_engine()
         await vector_engine.prune()
     elif vector and backend_access_control_enabled():
-        await prune_vector_databases()
+        try:
+            await prune_vector_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune vector databases: {e}")
+            raise

As per coding guidelines, use shared logging utilities from cognee.shared.logging_utils and prefer explicit error handling.

Based on learnings, this function is called at pipeline start, so proper error propagation is critical.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1961efc and fd84ede.

📒 Files selected for processing (1)
  • cognee/modules/data/deletion/prune_system.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/modules/data/deletion/prune_system.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/modules/data/deletion/prune_system.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/modules/data/deletion/prune_system.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.

Applied to files:

  • cognee/modules/data/deletion/prune_system.py
🧬 Code graph analysis (1)
cognee/modules/data/deletion/prune_system.py (8)
cognee/context_global_variables.py (1)
  • backend_access_control_enabled (74-83)
cognee/infrastructure/databases/vector/get_vector_engine.py (1)
  • get_vector_engine (5-7)
cognee/infrastructure/databases/graph/get_graph_engine.py (1)
  • get_graph_engine (10-24)
cognee/infrastructure/databases/relational/get_relational_engine.py (1)
  • get_relational_engine (5-21)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-37)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
  • delete_dataset (71-80)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
  • delete_dataset (58-81)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)
  • delete_dataset (42-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: Basic Tests / Run Simple Examples
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)

28-38: Good addition of outer error handling.

The error handling for OperationalError and EntityNotFoundError is a good addition. This gracefully handles cases where the dataset_database table is not accessible (e.g., during initial setup or schema migration) and logs appropriately using the shared logger.

The use of logger.debug() is appropriate here since this is an expected condition in certain environments.

Also applies to: 52-63

Contributor

@coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)

18-54: Validate dataset_id before using it to build the Kuzu filename.

dataset_id is typed as Optional[UUID] but is used directly in graph_db_name = f"{dataset_id}.pkl". If dataset_id is ever None, this will silently create a "None.pkl" database name, which is almost certainly unintended and will be hard to diagnose.

Consider failing fast by validating dataset_id at the top of the method:

     @classmethod
     async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
@@
-        graph_config = get_graph_config()
+        graph_config = get_graph_config()
@@
-        if graph_config.graph_database_provider != "kuzu":
+        if graph_config.graph_database_provider != "kuzu":
             raise ValueError(
                 "KuzuDatasetDatabaseHandler can only be used with Kuzu graph database provider."
             )
+
+        if dataset_id is None:
+            raise ValueError("dataset_id is required for Kuzu dataset creation.")
@@
-        graph_db_name = f"{dataset_id}.pkl"
+        graph_db_name = f"{dataset_id}.pkl"

This keeps the handler’s contract strict and avoids generating ambiguous filenames.

🧹 Nitpick comments (1)
cognee/context_global_variables.py (1)

124-125: TODO comment flags technical debt.

The TODO comment correctly identifies that vector and graph config handling across Cognee needs improvement, particularly regarding LRU_CACHE sensitivity to input order.

Would you like me to open a new issue to track this technical debt, or do you plan to address it in a follow-up PR?

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd84ede and f1c5b9a.

📒 Files selected for processing (3)
  • cognee/context_global_variables.py (6 hunks)
  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/context_global_variables.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/context_global_variables.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
  • cognee/context_global_variables.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (2)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (5)
cognee/infrastructure/databases/graph/get_graph_engine.py (2)
  • get_graph_engine (10-24)
  • create_graph_engine (28-169)
cognee/base_config.py (1)
  • get_base_config (63-64)
cognee/modules/users/models/DatasetDatabase.py (1)
  • DatasetDatabase (7-37)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
  • DatasetDatabaseHandlerInterface (9-80)
  • create_dataset (12-34)
  • delete_dataset (71-80)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
cognee/context_global_variables.py (4)
cognee/infrastructure/databases/vector/config.py (1)
  • get_vectordb_config (72-86)
cognee/infrastructure/databases/graph/config.py (1)
  • get_graph_config (131-144)
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (1)
  • get_or_create_dataset_database (63-116)
cognee/infrastructure/databases/utils/resolve_dataset_database_connection_info.py (1)
  • resolve_dataset_database_connection_info (28-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: End-to-End Tests / Test Feedback Enrichment
  • GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
  • GitHub Check: End-to-End Tests / Concurrent Subprocess access test
  • GitHub Check: End-to-End Tests / Conversation sessions test (FS)
  • GitHub Check: End-to-End Tests / Test Entity Extraction
  • GitHub Check: End-to-End Tests / Test graph edge ingestion
  • GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
  • GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
  • GitHub Check: End-to-End Tests / Server Start Test
  • GitHub Check: End-to-End Tests / S3 Bucket Test
  • GitHub Check: End-to-End Tests / Deduplication Test
  • GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
  • GitHub Check: Basic Tests / Run Integration Tests
  • GitHub Check: Basic Tests / Run Unit Tests
  • GitHub Check: CLI Tests / CLI Integration Tests
  • GitHub Check: CLI Tests / CLI Functionality Tests
  • GitHub Check: CLI Tests / CLI Unit Tests
  • GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
  • GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
  • GitHub Check: Code Quality for Weighted Edges
🔇 Additional comments (6)
cognee/context_global_variables.py (5)

7-10: LGTM! Import updates align with new configuration patterns.

The imports correctly reflect the refactored configuration utilities and the new dataset database connection info resolver.


82-82: LGTM! Simplified logic reduces duplication.

The direct call to multi_user_support_possible() simplifies the logic and ensures consistent validation.


141-146: Verify empty string defaults for credentials are acceptable.

Using .get() with empty string defaults for graph_database_username and graph_database_password might mask missing required credentials. For Neo4j Aura (mentioned in the PR title), these credentials are typically required.

Verify whether:

  1. Empty credentials are valid for any supported graph database providers
  2. Downstream code properly validates and reports authentication failures
  3. The graph_database_connection_info dictionary is guaranteed to contain these keys after resolve_dataset_database_connection_info() is called

Consider using None as the default and explicitly checking for required credentials, or raising an error if credentials are missing for providers that require them.
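
A hedged sketch of that stricter variant (the provider set and the helper name are assumptions; the field names follow the connection info used elsewhere in this PR):

def get_required_graph_credentials(connection_info: dict, provider: str):
    # Providers assumed to require authentication; adjust to the real matrix.
    providers_requiring_credentials = {"neo4j"}

    username = connection_info.get("graph_database_username")
    password = connection_info.get("graph_database_password")

    if provider in providers_requiring_credentials and not (username and password):
        raise ValueError(
            f"Graph database provider '{provider}' requires a username and password, "
            "but the resolved dataset connection info does not contain them."
        )
    return username, password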


112-113: LGTM! Proper connection info resolution added.

The call to resolve_dataset_database_connection_info() ensures that all connection details (both vector and graph) are properly resolved before use.


29-30: No action required—handler attributes cannot be None.

Both graph_dataset_database_handler and vector_dataset_database_handler are defined as str type with non-empty default values ("kuzu" and "lancedb" respectively) in their respective Pydantic BaseSettings config classes. These attributes are guaranteed to always return string values and will not fail the in checks on lines 35 and 42.
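
For reference, the defaults described above amount to something like the following in the two Pydantic settings classes (class names and the pydantic-settings import style are assumptions; the real layout lives in the respective config.py modules):

from pydantic_settings import BaseSettings  # pydantic v2 style; the project may import this differently


class GraphConfig(BaseSettings):
    # Populated from GRAPH_DATASET_DATABASE_HANDLER; defaults to the Kuzu handler.
    graph_dataset_database_handler: str = "kuzu"


class VectorConfig(BaseSettings):
    # Populated from VECTOR_DATASET_DATABASE_HANDLER; defaults to the LanceDB handler.
    vector_dataset_database_handler: str = "lancedb"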

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)

1-10: Imports and interface wiring look consistent.

The imports are all used, and the handler cleanly depends on the shared DatasetDatabaseHandlerInterface, graph engine factory, and user/database models. No changes needed here.

Comment on lines 25 to +71
 def multi_user_support_possible():
-    graph_db_config = get_graph_context_config()
-    vector_db_config = get_vectordb_context_config()
-    return (
-        graph_db_config["graph_database_provider"] in GRAPH_DBS_WITH_MULTI_USER_SUPPORT
-        and vector_db_config["vector_db_provider"] in VECTOR_DBS_WITH_MULTI_USER_SUPPORT
+    graph_db_config = get_graph_config()
+    vector_db_config = get_vectordb_config()
+
+    graph_handler = graph_db_config.graph_dataset_database_handler
+    vector_handler = vector_db_config.vector_dataset_database_handler
+    from cognee.infrastructure.databases.dataset_database_handler import (
+        supported_dataset_database_handlers,
+    )
+
+    if graph_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported graph dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if vector_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported vector dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[graph_handler]["handler_provider"]
+        != graph_db_config.graph_database_provider
+    ):
+        raise EnvironmentError(
+            "The selected graph dataset to database handler does not work with the configured graph database provider. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph database provider: {graph_db_config.graph_database_provider}\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[vector_handler]["handler_provider"]
+        != vector_db_config.vector_db_provider
+    ):
+        raise EnvironmentError(
+            "The selected vector dataset to database handler does not work with the configured vector database provider. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector database provider: {vector_db_config.vector_db_provider}\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    return True
Contributor

🛠️ Refactor suggestion | 🟠 Major

Add a docstring to document validation behavior.

This function lacks a docstring. Per the coding guidelines, undocumented function definitions are assumed incomplete.

Apply this diff to add a docstring:

 def multi_user_support_possible():
+    """
+    Validate that the configured graph and vector dataset database handlers are supported
+    and compatible with their respective database providers.
+
+    Raises:
+        EnvironmentError: If handlers are unsupported or if handler providers don't match
+                         the configured database providers.
+
+    Returns:
+        bool: True if validation passes.
+    """
     graph_db_config = get_graph_config()
     vector_db_config = get_vectordb_config()

As per coding guidelines, all functions should have docstrings.

🤖 Prompt for AI Agents
In cognee/context_global_variables.py around lines 25 to 71, the function
multi_user_support_possible lacks a docstring; add a concise triple-quoted
docstring immediately under the def line that states the function’s purpose
(validate whether configured graph and vector dataset handlers/providers support
multi-user access control), documents that it takes no arguments, returns True
when validation passes, and raises EnvironmentError with explanatory messages
when any handler/provider is unsupported or mismatched; keep the docstring short
(one to three sentences) and include a brief note about the raised exceptions
and return value.
