feat: add dataset database handler logic and neo4j/lancedb/kuzu handlers #1776
base: dev
Conversation
…b.com:topoteretes/cognee into feature/cog-3245-enable-multi-user-for-falkor
Walkthrough
Adds a dataset-database handler framework, per-dataset JSON connection fields and migration, Kuzu/LanceDB/Neo4j Aura handlers and registry, config/env keys and resolution at runtime, pruning hooks, auth gating changes, tests, and CI workflow updates.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
```python
@classmethod
async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
    pass
```
Do we want to keep this like that?
In case we want to delete the dataset, do we just delete the data inside a dataset database, or do we also delete the database itself?
We must also delete the database in most cases. For example, in Neo4j an empty, unused dataset still incurs high billing costs, so deleting the database keeps costs lower for empty datasets.
Ok, then I guess we need to handle delete here. In the case of Kuzu that is just calling delete_graph.
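For a file-based Kuzu database, a minimal sketch of what that delete could look like, assuming the per-user `databases/` layout and the legacy `.pkl` naming discussed later in this review; the function name and path layout are illustrative rather than the project's actual API:

```python
import asyncio
import os
import shutil
from typing import Optional
from uuid import UUID, uuid4


async def delete_kuzu_dataset_database(
    dataset_id: Optional[UUID], databases_root: str = "databases"
) -> None:
    """Remove the on-disk Kuzu database for a dataset (hypothetical layout)."""
    if dataset_id is None:
        raise ValueError("dataset_id is required to delete a dataset database")
    db_path = os.path.join(databases_root, f"{dataset_id}.pkl")
    if os.path.isdir(db_path):
        shutil.rmtree(db_path)  # Kuzu may store the database as a directory
    elif os.path.isfile(db_path):
        os.remove(db_path)


if __name__ == "__main__":
    asyncio.run(delete_kuzu_dataset_database(uuid4()))
```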
Actionable comments posted: 0
🧹 Nitpick comments (3)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
9-12: Add a class-level docstring for completeness.
Per coding guidelines, undocumented class definitions are assumed incomplete. Consider adding a brief docstring summarizing the interface's purpose.

```diff
 class DatasetDatabaseHandlerInterface(ABC):
+    """
+    Abstract interface for dataset database handlers.
+
+    Implementations provide logic for creating, resolving, and deleting
+    per-dataset graph or vector database connections in multi-tenant mode.
+    """
     @classmethod
     @abstractmethod
```

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
14-52: Implementation follows the interface contract correctly.
The method validates the provider, fetches configuration, and returns the expected dictionary structure. The inline import on line 27 is acceptable for avoiding circular dependencies.
Regarding the TODO on line 36 about graph file path info: Do you want me to help implement user-specific path construction similar to `LanceDBDatasetDatabaseHandler` (which builds paths under `databases/{user.id}/`)?
54-56: Add missing return type annotation for interface consistency.
The interface declares `delete_dataset(...) -> None`, but this implementation omits the return type. While Python allows this, adding the annotation maintains consistency with the interface contract.

```diff
 @classmethod
-async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
+async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
     pass
```

Given the past discussion about needing database deletion for cost management, consider adding a TODO comment to track the implementation:

```diff
 @classmethod
 async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> None:
+    # TODO: Implement Kuzu database deletion to free resources
     pass
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/db_examples_tests.yml (2 hunks)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1 hunks)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
.github/**
⚙️ CodeRabbit configuration file
.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.
- 'actionlint' erroneously generates false positives when dealing with GitHub's `${{ ... }}` syntax in conditionals.
- 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid `${{ ... }}` syntax.
Files:
.github/workflows/db_examples_tests.yml
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
- create_dataset (19-116), delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
- create_dataset (17-37), delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
.github/workflows/db_examples_tests.yml (2)
64-64: Verify: Why is `ENABLE_BACKEND_ACCESS_CONTROL` absent from the Kuzu example job?
The environment variable `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` is added to both the Neo4j and PGVector example test steps, but is absent from the Kuzu example step. This inconsistency suggests either: (1) Kuzu does not yet support the multi-user mode feature introduced in this PR, or (2) this is an unintended omission.
Clarify whether this is intentional or if the variable should also be added to the Kuzu example.
Also applies to: 146-146
62-78: Environment variable addition aligns with PR objectives.
The addition of `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` to test workflows is consistent with the PR's multi-tenant, multi-user mode feature. Disabling access control in test examples is appropriate for isolated testing scenarios.
Also applies to: 143-155
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
10-67: Well-structured interface with comprehensive documentation.
The interface design is sound: abstract methods for `create_dataset` and `delete_dataset` with a concrete default for `resolve_dataset_connection_info` that can be optionally overridden. The docstrings appropriately emphasize security considerations around credential handling.

cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
1-12: Clean imports and proper class structure.
The unused imports from the previous review have been removed. The class correctly implements `DatasetDatabaseHandlerInterface` with an appropriate docstring.
Actionable comments posted: 7
🧹 Nitpick comments (3)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
14-41: Validate `dataset_id` and document the legacy `.pkl` naming.
`dataset_id` is typed as `Optional[UUID]` but is required here to form `graph_db_name`; if it is ever `None` you'll silently get `"None.pkl"`. Consider either:
- enforcing non-`None` with an explicit check and a clear `ValueError`, or
- supporting the `None` case explicitly in the naming logic.

Given the prior decision to keep the `.pkl` suffix for backward compatibility, it would help future maintainers to add a short comment explaining why this extension is used and what older Kuzu deployments expect, instead of leaving it implicit.
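A minimal sketch of the non-`None` check, under the assumption that the name is built straight from the dataset id; `build_graph_db_name` is a hypothetical helper:

```python
from typing import Optional
from uuid import UUID


def build_graph_db_name(dataset_id: Optional[UUID]) -> str:
    """Build the Kuzu database file name for a dataset."""
    if dataset_id is None:
        raise ValueError("dataset_id is required to name a Kuzu dataset database")
    # Legacy .pkl suffix kept for backward compatibility with older deployments.
    return f"{dataset_id}.pkl"
```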
36-51: Avoid persisting raw DB credentials in the `DatasetDatabase` row.
Per the `DatasetDatabaseHandlerInterface` docstring, the dict returned from `create_dataset` is stored verbatim and later passed into `resolve_dataset_connection_info`, and the guidance is to prefer returning references to secrets/roles instead of plaintext credentials. Here you are returning `graph_database_username` and `graph_database_password` directly, which will end up persisted in the relational DB.
Consider adjusting this handler to:
- store only non-secret identifiers (e.g., a secret name/ARN or config key) in `graph_database_connection_info`, and
- override `resolve_dataset_connection_info` in this class to resolve those identifiers into short‑lived credentials at connection time without persisting them.

This keeps the schema compatible while improving the security posture.
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (1)
63-116: Consider adding logging for debugging and observability.
The function lacks logging statements, which would be valuable for debugging dataset database creation issues, tracking handler failures, and monitoring race conditions.
Consider adding logging at key points:

```python
from cognee.shared.logging_utils import get_logger

logger = get_logger(__name__)


async def get_or_create_dataset_database(
    dataset: Union[str, UUID],
    user: User,
) -> DatasetDatabase:
    """..."""
    logger.debug(f"Getting or creating dataset database for dataset={dataset}, user={user.id}")
    db_engine = get_relational_engine()
    dataset_id = await get_unique_dataset_id(dataset, user)

    # If dataset is given as name make sure the dataset is created first
    if isinstance(dataset, str):
        logger.debug(f"Creating dataset '{dataset}' for user {user.id}")
        async with db_engine.get_async_session() as session:
            await create_dataset(dataset, user, session)

    # If dataset database already exists return it
    existing_dataset_database = await _existing_dataset_database(dataset_id, user)
    if existing_dataset_database:
        logger.debug(f"Found existing dataset database for dataset_id={dataset_id}")
        return existing_dataset_database

    logger.debug(f"Fetching handler configs for dataset_id={dataset_id}")
    graph_config_dict = await _get_graph_db_info(dataset_id, user)
    vector_config_dict = await _get_vector_db_info(dataset_id, user)

    # ... rest of function with additional logging for create/error paths
```

Based on coding guidelines for files matching `cognee/**/*.py`.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1 hunks)
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (4 hunks)
cognee/modules/users/models/DatasetDatabase.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/users/models/DatasetDatabase.py
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (4)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (3)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), delete_dataset (71-81)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (2)
- create_dataset (19-116), delete_dataset (119-120)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
- create_dataset (17-37), delete_dataset (40-41)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Code Quality for Weighted Edges
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (4)
cognee/modules/users/models/DatasetDatabase.py (2)
3-3: LGTM!
The additional imports (JSON and text) are correctly added to support the new JSON fields with SQL-level defaults.
15-16: LGTM!
The removal of unique constraints on database names was discussed in prior reviews and enables the desired multi-tenant flexibility.
cognee/infrastructure/databases/utils/get_or_create_dataset_database.py (2)
38-60: LGTM!
The function is well-documented and implements a straightforward database query to check for existing records. The docstring clearly explains the purpose, parameters, and return value.
104-105: The dict unpacking is safe by design. Both handlers use namespaced keys—graph handlers return `graph_*` prefixed keys and vector handlers return `vector_*` prefixed keys. These disjoint key sets cannot collide, so no silent overrides can occur.
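A tiny self-contained illustration of that namespacing argument, with made-up sample values:

```python
# graph_* and vector_* key sets are disjoint, so merging the two dicts
# cannot silently overwrite either side.
graph_info = {"graph_database_provider": "kuzu", "graph_database_name": "g.pkl"}
vector_info = {"vector_database_provider": "lancedb", "vector_database_name": "v.lance.db"}

merged = {**graph_info, **vector_info}
assert len(merged) == len(graph_info) + len(vector_info)  # no key collisions
```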
```python
# configuration details for different database types. This would make it more flexible to add new database types
# without changing the database schema.
```
Clarify incomplete comment.
The comment starts mid-sentence. Consider adding a subject for clarity.
Apply this diff:
```diff
-# configuration details for different database types. This would make it more flexible to add new database types
+# These fields store configuration details for different database types. This would make it more flexible to add new database types
 # without changing the database schema.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-# configuration details for different database types. This would make it more flexible to add new database types
+# These fields store configuration details for different database types. This would make it more flexible to add new database types
 # without changing the database schema.
```
🤖 Prompt for AI Agents
In cognee/modules/users/models/DatasetDatabase.py around lines 27 to 28, the
inline comment starts mid-sentence and is ambiguous; rewrite it to be a
complete, clear sentence that states the subject (e.g., "Store database
configuration details for different database types") and explains intent briefly
— that the model stores configuration details per database type to allow adding
new database types without changing the schema; update the comment text
accordingly to be grammatically complete and self-contained.
Actionable comments posted: 2
♻️ Duplicate comments (3)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
155-157: Make the unimplemented delete_dataset explicit.
`delete_dataset` currently just `pass`es, which can lead to silent no‑ops if it's ever invoked.
Replacing it with an explicit `NotImplementedError` (and a TODO) will fail fast and make the omission clear:

```diff
 @classmethod
 async def delete_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]):
-    pass
+    # TODO: Implement Neo4j Aura instance deletion via the Aura API.
+    raise NotImplementedError(
+        "Neo4jAuraDatasetDatabaseHandler.delete_dataset is not implemented yet."
+    )
```
58-67: Add timeouts and structured error handling to Aura token + instance creation calls.
Both the OAuth token request and the instance creation call use `requests.post` without a timeout or robust error handling, and they repeatedly call `response.json()`. That can lead to indefinite hangs and obscure failures.
A possible tightening:

```diff
-# Make the request with HTTP Basic Auth
-def get_aura_token(client_id: str, client_secret: str) -> dict:
-    url = "https://api.neo4j.io/oauth/token"
-    data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
-
-    resp = requests.post(url, data=data, auth=(client_id, client_secret))
-    resp.raise_for_status()  # raises if the request failed
-    return resp.json()
+# Make the request with HTTP Basic Auth
+def get_aura_token(client_id: str, client_secret: str) -> dict:
+    url = "https://api.neo4j.io/oauth/token"
+    data = {"grant_type": "client_credentials"}  # sent as application/x-www-form-urlencoded
+
+    try:
+        resp = requests.post(
+            url,
+            data=data,
+            auth=(client_id, client_secret),
+            timeout=30,
+        )
+        resp.raise_for_status()
+    except requests.Timeout as exc:
+        raise TimeoutError("Timed out while requesting Neo4j Aura OAuth token") from exc
+    except requests.RequestException as exc:
+        raise RuntimeError("Failed to request Neo4j Aura OAuth token") from exc
+
+    return resp.json()
@@
-response = requests.post(url, headers=headers, json=payload)
-
-graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-graph_db_url = response.json()["data"]["connection_url"]
-graph_db_key = resp["access_token"]
-graph_db_username = response.json()["data"]["username"]
-graph_db_password = response.json()["data"]["password"]
+try:
+    response = requests.post(
+        url,
+        headers=headers,
+        json=payload,
+        timeout=60,
+    )
+    response.raise_for_status()
+except requests.Timeout as exc:
+    raise TimeoutError(
+        f"Timed out while provisioning Neo4j Aura instance for dataset {dataset_id}"
+    ) from exc
+except requests.RequestException as exc:
+    raise RuntimeError(
+        f"Failed to provision Neo4j Aura instance for dataset {dataset_id}"
+    ) from exc
+
+response_data = response.json()["data"]
+
+graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+graph_db_url = response_data["connection_url"]
+graph_db_key = resp["access_token"]
+graph_db_username = response_data["username"]
+graph_db_password = response_data["password"]
@@
-instance_id = response.json()["data"]["id"]
+instance_id = response_data["id"]
```

This makes failures explicit, prevents indefinite blocking, and avoids reparsing the JSON body multiple times.
Provide the official `requests` documentation page describing the `timeout` parameter and recommended exception handling patterns for HTTP errors.
Also applies to: 90-97, 114-115
98-112: Avoid blocking the async event loop when polling Aura instance status; add timeout and error checks.
`_wait_for_neo4j_instance_provisioning` is `async` but calls `requests.get` directly in a loop with no timeout, which will block the event loop for up to ~5 minutes and still lacks `raise_for_status()`.
One approach that keeps `requests` but avoids blocking:

```diff
-async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
-    # Poll until the instance is running
-    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
-    status = ""
-    for attempt in range(30):  # Try for up to ~5 minutes
-        status_resp = requests.get(
-            status_url, headers=headers
-        )  # TODO: Use async requests with httpx
-        status = status_resp.json()["data"]["status"]
-        if status.lower() == "running":
-            return
-        await asyncio.sleep(10)
-    raise TimeoutError(
-        f"Neo4j instance '{graph_db_name}' did not become ready within 5 minutes. Status: {status}"
-    )
+async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict) -> None:
+    # Poll until the instance is running
+    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
+    status = ""
+    for _ in range(30):  # Try for up to ~5 minutes
+        try:
+            status_resp = await asyncio.to_thread(
+                requests.get,
+                status_url,
+                headers=headers,
+                timeout=30,
+            )
+            status_resp.raise_for_status()
+        except requests.Timeout as exc:
+            raise TimeoutError(
+                f"Timed out while polling Neo4j instance '{instance_id}' provisioning status."
+            ) from exc
+        except requests.RequestException as exc:
+            raise RuntimeError(
+                f"Error while polling Neo4j instance '{instance_id}' provisioning status."
+            ) from exc
+
+        status = status_resp.json()["data"]["status"]
+        if status.lower() == "running":
+            return
+
+        await asyncio.sleep(10)
+
+    raise TimeoutError(
+        f"Neo4j instance '{instance_id}' did not become ready within 5 minutes. Last status: {status}"
+    )
```

Alternatively, switching to `httpx.AsyncClient` throughout would give you a fully async Aura client, but the above is a minimal, localized change.
Find the official Python documentation for `asyncio.to_thread` and any guidance on avoiding blocking I/O inside `async def` functions.
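A short sketch of that `httpx.AsyncClient` alternative, assuming the same Aura status endpoint and response shape as in the diff above; the function name is illustrative:

```python
import asyncio

import httpx


async def wait_until_running(instance_id: str, headers: dict) -> None:
    """Poll the Aura instance status with a fully async client."""
    status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
    async with httpx.AsyncClient(timeout=30) as client:
        for _ in range(30):  # up to ~5 minutes
            resp = await client.get(status_url, headers=headers)
            resp.raise_for_status()
            if resp.json()["data"]["status"].lower() == "running":
                return
            await asyncio.sleep(10)
    raise TimeoutError(f"Instance '{instance_id}' did not become ready within 5 minutes.")
```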
🧹 Nitpick comments (2)
cognee/modules/data/deletion/prune_system.py (1)
9-16: Refine access‑control gating: cache flag once and document behavior
The new checks correctly prevent graph/vector pruning when backend access control is enabled, which matches the multi‑user safety goal. Two improvements to consider:
- Avoid repeated calls
Cache the flag once at the top of the function to keep behavior consistent and avoid redundant calls:

```diff
 async def prune_system(graph=True, vector=True, metadata=True, cache=True):
-    # TODO: prune_system should work with multi-user access control mode enabled
-    if graph and not backend_access_control_enabled():
+    # TODO: prune_system should work with multi-user access control mode enabled
+    access_control_enabled = backend_access_control_enabled()
+
+    if graph and not access_control_enabled:
         graph_engine = await get_graph_engine()
         await graph_engine.delete_graph()
 
-    if vector and not backend_access_control_enabled():
+    if vector and not access_control_enabled:
         vector_engine = get_vector_engine()
         await vector_engine.prune()
```
- Document the new semantics
`prune_system` now has mode‑dependent behavior (graph/vector are no‑ops when access control is on). Adding a short docstring that explains this will help callers understand why pruning may be skipped and how the TODO will eventually be addressed (e.g., tenant‑aware pruning instead of a full disable). This also aligns with the project guideline that undocumented functions are considered incomplete.

cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1)
40-41: Clarify naming between dataset identifier and Aura DB name, and correct the slice length.
`graph_db_name` is initially set to `f"{dataset_id}"`, later overwritten with `"neo4j"`, and sliced as `graph_db_name[0:29]` while the comment mentions a 30‑character limit. That's a bit confusing for future readers.
Consider something like:

```diff
-    graph_db_name = f"{dataset_id}"
+    dataset_label = str(dataset_id)
@@
-        "name": graph_db_name[
-            0:29
-        ],  # TODO: Find better name to name Neo4j instance within 30 character limit
+        "name": dataset_label[:30],  # TODO: Find better name within the 30-char limit
@@
-    graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
+    graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
```

This makes it clear which value is a human‑friendly instance label vs the fixed database name and aligns the slice with the stated limit.
Also applies to: 82-84, 92-92
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (1 hunks)
cognee/modules/data/deletion/prune_system.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/data/deletion/prune_system.py
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
🧠 Learnings (1)
📚 Learning: 2025-10-11T04:18:24.594Z
Learnt from: Vattikuti-Manideep-Sitaram
Repo: topoteretes/cognee PR: 1529
File: cognee/api/v1/cognify/ontology_graph_pipeline.py:69-74
Timestamp: 2025-10-11T04:18:24.594Z
Learning: The code_graph_pipeline.py and ontology_graph_pipeline.py both follow an established pattern of calling cognee.prune.prune_data() and cognee.prune.prune_system(metadata=True) at the start of pipeline execution. This appears to be intentional behavior for pipeline operations in the cognee codebase.
Applied to files:
cognee/modules/data/deletion/prune_system.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (3)
cognee/infrastructure/databases/graph/config.py (1)
- get_graph_config (131-144)
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (4)
- DatasetDatabaseHandlerInterface (9-81), create_dataset (12-34), resolve_dataset_connection_info (37-67), delete_dataset (71-81)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (2)
- create_dataset (15-51), delete_dataset (54-55)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Run Telemetry Test
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Basic Graph Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Unit Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)
1-1: Import of `backend_access_control_enabled` is appropriate
The new import cleanly supports the gating logic below; no issues from an API or style perspective.
```python
# Client credentials and encryption
client_id = os.environ.get("NEO4J_CLIENT_ID", None)
client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
```
🧩 Analysis chain
🏁 Script executed:
```shell
wc -l cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
```
Repository: topoteretes/cognee
Length of output: 151
🏁 Script executed:
```shell
cat -n cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py | head -160
```
Repository: topoteretes/cognee
Length of output: 7391
🌐 Web query:
cryptography.fernet key management best practices official documentation
💡 Result:
Short summary of official guidance and best practices for managing cryptography.fernet keys
- Key format & secrecy: Fernet keys are URL-safe base64‑encoded 32‑byte secrets and "must be kept secret; anyone with this key can create and read messages." (cryptography.io)
- Key rotation: Use MultiFernet to implement key rotation—place the new key first to encrypt new tokens, keep older keys to decrypt existing tokens, and remove/retire old keys when safe. Use MultiFernet.rotate() to re-encrypt tokens under the new primary key. Token/key rotation is described as a recommended cryptographic hygiene practice. (cryptography.io)
- Practical key-management controls (recommended, not specific to library):
- Generate keys with a secure RNG and store secrets in a dedicated secrets manager or HSM (do not hard-code keys in source or repos). See NIST and OWASP key‑management guidance for lifecycle controls (generation, storage, rotation, access control, backup/recovery, destruction). (csrc.nist.gov)
- Limit access (least privilege), audit key usage, and have procedures for compromise (immediate rotation + re‑encryption/retirement). MultiFernet supports retiring old keys after rotation. (cryptography.io)
- References (official docs and standards):
- cryptography (Fernet / MultiFernet) official docs. (cryptography.io)
- NIST key‑management guidance (SP 800‑57 series). (csrc.nist.gov)
- OWASP Key Management Cheat Sheet. (cheatsheetseries.owasp.org)
If you want, I can: (a) extract the exact code patterns for generating, rotating, and retiring Fernet keys; or (b) produce a short ops checklist (storage options, rotation schedule, incident playbook) tailored to your environment. Which would you prefer?
Citations:
- 1: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 2: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 3: https://csrc.nist.gov/Projects/Key-Management/faqs?utm_source=openai
- 4: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 5: https://cryptography.io/en/45.0.5/fernet/?utm_source=openai
- 6: https://csrc.nist.gov/Projects/Key-Management/faqs?utm_source=openai
- 7: https://cheatsheetseries.owasp.org/cheatsheets/Key_Management_Cheat_Sheet.html?utm_source=openai
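A compact sketch of the rotation pattern summarized above, using the documented `Fernet`/`MultiFernet` APIs; the sample plaintext is illustrative:

```python
from cryptography.fernet import Fernet, MultiFernet

# New key first: MultiFernet encrypts with the first key and tries each key
# in order when decrypting, so old tokens remain readable during rotation.
old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

token = old_key.encrypt(b"aura-instance-password")

rotator = MultiFernet([new_key, old_key])
assert rotator.decrypt(token) == b"aura-instance-password"

# rotate() re-encrypts the token under the primary (first) key, after which
# the old key can eventually be retired.
rotated = rotator.rotate(token)
assert new_key.decrypt(rotated) == b"aura-instance-password"
```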
🏁 Script executed:
```shell
rg -A 10 "class DatasetDatabaseHandlerInterface" --type py
```
Repository: topoteretes/cognee
Length of output: 1862
🏁 Script executed:
```shell
cat cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py
```
Repository: topoteretes/cognee
Length of output: 4479
Require NEO4J_ENCRYPTION_KEY environment variable and reconsider secret storage design to align with interface guidance.
The code falls back to an insecure default encryption key ("test_key") at lines 46 and 141, which contradicts the cryptography library's fundamental requirement that Fernet keys "must be kept secret" and violates NIST/OWASP key management practices. More critically, the current design persists both an encrypted password ciphertext and a plaintext Aura access token in the relational database, which conflicts with the DatasetDatabaseHandlerInterface contract guidance to "prefer returning only references to secrets or role identifiers, not plaintext credentials" and "do not persist any resolved or decrypted secrets back to the relational database."
Two required changes:
- Remove the insecure default encryption key. Require `NEO4J_ENCRYPTION_KEY` to be explicitly set:

```diff
-encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+if not encryption_env_key:
+    raise ValueError(
+        "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+    )
```

Apply the same fix at line 141 in resolve_dataset_connection_info() with the message: "NEO4J_ENCRYPTION_KEY must be set in order to decrypt Neo4j Aura credentials."
- Store only references to credentials, not the credentials themselves. Instead of persisting the encrypted password ciphertext and plaintext access token, store only the Aura `instance_id` and `tenant_id` in the returned dictionary. In `resolve_dataset_connection_info()`, use those references to call the Aura API (via the stored OAuth credentials or a refresh flow) to obtain fresh, short-lived credentials at connection time. This aligns with the interface's explicit guidance and eliminates the risk that a DB dump exposes long-lived secrets.
See official guidance: cryptography.io Fernet key management, NIST SP 800-57 key management, OWASP Key Management Cheat Sheet.
Also fix the empty delete_dataset() stub at line 156–157 by raising NotImplementedError with a TODO comment.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 # Client credentials and encryption
 client_id = os.environ.get("NEO4J_CLIENT_ID", None)
 client_secret = os.environ.get("NEO4J_CLIENT_SECRET", None)
 tenant_id = os.environ.get("NEO4J_TENANT_ID", None)
-encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+if not encryption_env_key:
+    raise ValueError(
+        "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+    )
 encryption_key = base64.urlsafe_b64encode(
     hashlib.sha256(encryption_env_key.encode()).digest()
 )
 cipher = Fernet(encryption_key)
```
🤖 Prompt for AI Agents
In
cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py
around lines 42–50 (and apply same check at line 141), remove the insecure
fallback by requiring NEO4J_ENCRYPTION_KEY to be present (raise a clear error if
missing and at line 141 use the message "NEO4J_ENCRYPTION_KEY must be set in
order to decrypt Neo4j Aura credentials."); stop writing encrypted password
ciphertexts and plaintext access tokens back to the relational DB—modify the
persistence/return values so only Aura references (instance_id and tenant_id)
are stored/returned, and update resolve_dataset_connection_info() to use those
references to obtain short‑lived credentials via the Aura OAuth/refresh flow at
connection time; finally replace the empty delete_dataset() stub at lines
156–157 with a NotImplementedError and a TODO comment indicating proper deletion
behavior to implement later.
@pazone How important do you think this is? It is a risk, but we have it basically everywhere. This ensures that cognee will work with minimal config. The setup will be a bit more burdensome if we don't have those default values and expect users to provide them every time.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
btw the Neo4j Aura password encryption is not safe and not for production use; the idea was just to have this as a simple example with minimal config showing our users how it could be done, so they can do it themselves in a safe way with secret vaults/load balancers etc.
IIUC we don't store credentials here, right?
nit: this is repeated code:
```python
encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(
    hashlib.sha256(encryption_env_key.encode()).digest()
)
cipher = Fernet(encryption_key)
```
Extract a private method?
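A minimal sketch of that extraction; `_build_cipher` is a hypothetical name:

```python
import base64
import hashlib
import os

from cryptography.fernet import Fernet


def _build_cipher() -> Fernet:
    """Derive the Fernet cipher from NEO4J_ENCRYPTION_KEY (falling back to 'test_key')."""
    encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
    encryption_key = base64.urlsafe_b64encode(
        hashlib.sha256(encryption_env_key.encode()).digest()
    )
    return Fernet(encryption_key)
```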
We store credentials here, but they are stored encrypted with a custom string that can be set; we decrypt them in the resolution step. If there is a simple service we can call without much setup to store credentials instead, let me know.
This is intended to be more of an example for our users of how a Neo4j dataset database handler can be written than an actual handler we will use in production or recommend they use in production. We can make a private handler for SaaS if we decide to use Neo4j there that will utilize a third-party service for credential management.
Ideally all production users need to write their own credential resolution mechanism at the very least; we support registering and using custom dataset database handlers.
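A round-trip sketch of that encrypt-on-create / decrypt-in-resolution flow, reusing the key derivation from the quoted handler code; the env fallback and sample password are illustrative:

```python
import base64
import hashlib
import os

from cryptography.fernet import Fernet

encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
encryption_key = base64.urlsafe_b64encode(hashlib.sha256(encryption_env_key.encode()).digest())
cipher = Fernet(encryption_key)

ciphertext = cipher.encrypt(b"aura-instance-password")  # persisted in the DatasetDatabase row
plaintext = cipher.decrypt(ciphertext)                  # happens in the resolution step
assert plaintext == b"aura-instance-password"
```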
```python
from cognee.infrastructure.databases.dataset_database_handler import DatasetDatabaseHandlerInterface


class LanceDBDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
```
Question: Is there a test that covers this class? I tried to run the only changed tests with coverage and there were no hits.
Yes, all the standard Cognee tests run with this and the Kuzu class. The only class not covered currently is Neo4jAura, but we can't cover that one since it's an expensive paid service.
```python
dataset_database = await _get_vector_db_connection_info(dataset_database)
dataset_database = await _get_graph_db_connection_info(dataset_database)
```
Question: do we assign the value twice? Is that expected?
Yes: the vector connection call sets values for vector databases, and the graph connection call sets values for graph databases. They are not overwritten and have separate fields in the Data model.
It's on the right track. I'd recommend ensuring test coverage to avoid future problems.
Actionable comments posted: 0
♻️ Duplicate comments (1)
cognee/tests/test_dataset_database_handler.py (1)
39-50: Align Kuzu handler database path with LanceDB handler and test assertions
`KuzuTestDatasetDatabaseHandler.create_dataset` currently uses a relative path:

```python
databases_directory_path = os.path.join("databases", str(user.id))
```

while `LanceDBTestDatasetDatabaseHandler` builds an absolute path under `.cognee_system/test_dataset_database_handler/databases/...`, and `main()` asserts the Kuzu DB file exists under that same absolute base. This inconsistency can cause the Kuzu DB file to be written to a different location than the assertion expects (and from where LanceDB writes its DB).
You can mirror the LanceDB handler's path logic so both handlers and the assertions agree:

```diff
 class KuzuTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
     @classmethod
-    async def create_dataset(cls, dataset_id, user):
-        databases_directory_path = os.path.join("databases", str(user.id))
-        os.makedirs(databases_directory_path, exist_ok=True)
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
+        import pathlib
+
+        cognee_directory_path = str(
+            pathlib.Path(
+                os.path.join(
+                    pathlib.Path(__file__).parent, ".cognee_system/test_dataset_database_handler"
+                )
+            ).resolve()
+        )
+        databases_directory_path = os.path.join(cognee_directory_path, "databases", str(user.id))
+        os.makedirs(databases_directory_path, exist_ok=True)
@@
-        graph_db_name = "test.kuzu"
+        graph_db_name = "test.kuzu"
 
         return {
             "graph_database_name": graph_db_name,
             "graph_database_url": os.path.join(databases_directory_path, graph_db_name),
             "graph_database_provider": "kuzu",
         }
```

This also addresses the earlier concern about inconsistent test artifact locations between the two handlers.
🧹 Nitpick comments (5)
.github/workflows/e2e_tests.yml (2)
104-116: Selective disabling of backend access control looks reasonable
Using `ENV: 'local'`/`'dev'` plus `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` for the telemetry pipeline and deduplication examples is a pragmatic way to keep existing scenarios working while you introduce the new gating. Just make sure there is at least one CI path that runs with backend access control enabled (or defaulted on) so those new checks are exercised as well.
Also applies to: 148-158
216-239: Consider wiring the dataset‑database handler scenario through pytest
The new `test-dataset-database-handler` job mirrors the other integration jobs, but it runs the module directly via `python` instead of through `pytest`. If you add a small `@pytest.mark.asyncio` test wrapper around `main()` in `cognee/tests/test_dataset_database_handler.py` and invoke this job with `uv run pytest cognee/tests/test_dataset_database_handler.py`, you'll get consistent reporting and coverage accounting with the rest of your test suite.

cognee/tests/test_dataset_database_handler.py (3)
4-12: Be cautious with environment mutation at import time
Setting `VECTOR_DATASET_DATABASE_HANDLER` and `GRAPH_DATASET_DATABASE_HANDLER` at module import time is fine for a standalone example script, but if this module is ever imported by pytest as part of a larger test run it will silently affect global process state for other tests. If you intend this to behave as a "real" test module, consider moving these `os.environ[...]` assignments into `main()` (or a pytest fixture) so the scope and lifetime of the override are explicit.
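A small sketch of the fixture approach, using pytest's built-in `monkeypatch` so the override is scoped to each test; the handler values shown are illustrative:

```python
import pytest


@pytest.fixture
def dataset_handler_env(monkeypatch):
    # Scoped alternative to module-level os.environ mutation: these overrides
    # are automatically undone when the test finishes.
    monkeypatch.setenv("VECTOR_DATASET_DATABASE_HANDLER", "lancedb_test_handler")
    monkeypatch.setenv("GRAPH_DATASET_DATABASE_HANDLER", "kuzu_test_handler")
```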
15-36: Minor cleanup for LanceDB test handler (imports and lints)
The implementation looks correct for the test scenario. Two small cleanups you might consider:
- Move the `import pathlib` to the top of the file with the other imports to avoid re-importing it every time `create_dataset` is called.
- `dataset_id` is required by the interface but unused here; if Ruff's `ARG003` is enabled, you can either add a `# noqa: ARG003` on the method definition or add a short comment explaining that it's intentionally unused in this test handler.

Example diff:

```diff
-import asyncio
-import os
+import asyncio
+import os
+import pathlib
@@
 class LanceDBTestDatasetDatabaseHandler(DatasetDatabaseHandlerInterface):
@@
-    async def create_dataset(cls, dataset_id, user):
-        import pathlib
-
+    async def create_dataset(cls, dataset_id, user):  # noqa: ARG003 - dataset_id unused in test
         cognee_directory_path = str(
             pathlib.Path(
                 os.path.join(
```
53-135: Turn this example into a proper pytest test while keeping the script entry point
The `main()` coroutine plus the `if __name__ == "__main__": ...` block make this a nice executable example, but as-is it doesn't define any pytest tests, so it won't be exercised by pytest-based runs or contribute to pytest-driven coverage.
A lightweight way to get both benefits (example + test) is to add a small async test wrapper that just calls `main()`:

```diff
 from cognee.api.v1.search import SearchType
@@
 async def main():
@@
         os.path.join(cognee_directory_path, "databases", str(default_user.id), "test.lance.db")
     ), "Vector database file not found."
-
-if __name__ == "__main__":
+import pytest
+
+
+@pytest.mark.asyncio
+async def test_dataset_database_handler_example():
+    """Exercise custom LanceDB and Kuzu dataset database handlers end-to-end."""
+    await main()
+
+
+if __name__ == "__main__":
     logger = setup_logging(log_level=ERROR)
     loop = asyncio.new_event_loop()
     asyncio.set_event_loop(loop)
```

That keeps the example runnable via `python cognee/tests/test_dataset_database_handler.py` while also letting pytest discover and run it with `pytest`, aligning with your testing guidelines and earlier review feedback about code coverage. Short docstrings for the two handler classes and `main()` would also help future readers understand this example's intent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.github/workflows/e2e_tests.yml (3 hunks)
cognee/tests/test_dataset_database_handler.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
.github/**
⚙️ CodeRabbit configuration file
.github/**: * When the project is hosted on GitHub: All GitHub-specific configurations, templates, and tools should be found in the '.github' directory tree.
- 'actionlint' erroneously generates false positives when dealing with GitHub's
${{ ... }}syntax in conditionals.- 'actionlint' erroneously generates incorrect solutions when suggesting the removal of valid
${{ ... }}syntax.
Files:
.github/workflows/e2e_tests.yml
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/test_dataset_database_handler.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/test_dataset_database_handler.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/test_dataset_database_handler.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/test_dataset_database_handler.py
cognee/tests/*
⚙️ CodeRabbit configuration file
cognee/tests/*: When reviewing test code:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance, pointing out any violations discovered.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a project rule, Python source files with names prefixed by the string "test_" and located in the project's "tests" directory are the project's unit-testing code. It is safe, albeit a heuristic, to assume these are considered part of the project's minimal acceptance testing unless a justifying exception to this assumption is documented.
- As a project rule, any files without extensions and with names prefixed by either the string "check_" or the string "test_", and located in the project's "tests" directory, are the project's non-unit test code. "Non-unit test" in this context refers to any type of testing other than unit testing, such as (but not limited to) functional testing, style linting, regression testing, etc. It can also be assumed that non-unit testing code is usually written as Bash shell scripts.
Files:
cognee/tests/test_dataset_database_handler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (22)
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: Basic Tests / Run Linting
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Test Weighted Edges with Different Graph Databases (kuzu)
- GitHub Check: Test Weighted Edges with Different Graph Databases (neo4j)
- GitHub Check: Test Weighted Edges Core Functionality (3.12)
Actionable comments posted: 6
♻️ Duplicate comments (8)
cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (2)
28-30: Potential `AttributeError` when `user` is `None`.

The `user` parameter is `Optional[User]`, but `user.id` is accessed without a null check. This will raise an `AttributeError` if the caller passes `None`.

Apply this diff to add validation:

```diff
 @classmethod
 async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
     vector_config = get_vectordb_config()
     base_config = get_base_config()

     if vector_config.vector_db_provider != "lancedb":
         raise ValueError(
             "LanceDBDatasetDatabaseHandler can only be used with LanceDB vector database provider."
         )

+    if user is None:
+        raise ValueError("User is required for LanceDB dataset creation.")
+
     databases_directory_path = os.path.join(
         base_config.system_root_directory, "databases", str(user.id)
     )
```
32-32: `dataset_id` is `Optional[UUID]` but used directly.

If `dataset_id` is `None`, this will produce `"None.lance.db"` as the database name, which is likely unintended.

Apply this diff to add validation:

```diff
+    if dataset_id is None:
+        raise ValueError("Dataset ID is required for LanceDB dataset creation.")
+
     vector_db_name = f"{dataset_id}.lance.db"
```

cognee/infrastructure/databases/graph/neo4j_driver/Neo4jAuraDatasetDatabaseHandler.py (6)
46-50: Insecure default encryption key exposes credentials.

The fallback to `"test_key"` violates cryptographic best practices. As noted in past reviews and developer comments, this implementation is not production-safe. Per extensive past review discussion, require the environment variable explicitly:

```diff
-        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY", "test_key")
+        encryption_env_key = os.environ.get("NEO4J_ENCRYPTION_KEY")
+        if not encryption_env_key:
+            raise ValueError(
+                "NEO4J_ENCRYPTION_KEY must be set to encrypt Neo4j Aura credentials safely."
+            )
```

Apply the same fix at Line 141 in `resolve_dataset_connection_info()`.
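For context, a minimal sketch of how such a key is typically used with `cryptography`'s Fernet, which appears to be the cipher behind `cipher.encrypt`/`cipher.decrypt` given the `InvalidToken` mention in the decryption comment below; the password value is a placeholder:

```python
import os

from cryptography.fernet import Fernet, InvalidToken

key = os.environ.get("NEO4J_ENCRYPTION_KEY")
if not key:
    raise ValueError("NEO4J_ENCRYPTION_KEY must be set")

cipher = Fernet(key.encode())  # Fernet keys are 32-byte urlsafe-base64 strings

token = cipher.encrypt(b"instance-password")
try:
    assert cipher.decrypt(token) == b"instance-password"
except InvalidToken:
    # Raised when the key is wrong or the ciphertext is corrupted.
    raise
```

A fresh key can be generated once with `Fernet.generate_key()` and stored in the deployment's secret manager rather than in source control.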
62-62: Add timeout to HTTP request.

The `requests.post` call lacks a timeout, which can cause indefinite blocking if the Neo4j API is unresponsive.

```diff
-        resp = requests.post(url, data=data, auth=(client_id, client_secret))
+        resp = requests.post(url, data=data, auth=(client_id, client_secret), timeout=30)
```

As per coding guidelines, "Prefer explicit, structured error handling in Python code."
90-96: Missing timeout and inefficient JSON parsing.

The instance creation request lacks a timeout and calls `response.json()` multiple times, which is inefficient.

```diff
-        response = requests.post(url, headers=headers, json=payload)
+        response = requests.post(url, headers=headers, json=payload, timeout=60)
+        response.raise_for_status()
+        response_data = response.json()["data"]

         graph_db_name = "neo4j"  # Has to be 'neo4j' for Aura
-        graph_db_url = response.json()["data"]["connection_url"]
+        graph_db_url = response_data["connection_url"]
         graph_db_key = resp["access_token"]
-        graph_db_username = response.json()["data"]["username"]
-        graph_db_password = response.json()["data"]["password"]
+        graph_db_username = response_data["username"]
+        graph_db_password = response_data["password"]
```
98-112: Synchronous HTTP blocks the async event loop during polling.

Using `requests.get()` inside an async function blocks the event loop for up to 5 minutes. Consider using `httpx` with async support or `asyncio.to_thread()`.

```diff
 async def _wait_for_neo4j_instance_provisioning(instance_id: str, headers: dict):
     # Poll until the instance is running
     status_url = f"https://api.neo4j.io/v1/instances/{instance_id}"
     status = ""
-    for attempt in range(30):  # Try for up to ~5 minutes
-        status_resp = requests.get(status_url, headers=headers)
+    for _ in range(30):  # Try for up to ~5 minutes
+        status_resp = await asyncio.to_thread(
+            requests.get, status_url, headers=headers, timeout=30
+        )
         status = status_resp.json()["data"]["status"]
         if status.lower() == "running":
             return
         await asyncio.sleep(10)
```
146-148: Add error handling for decryption failures.

The decryption lacks error handling for `KeyError` (missing password field) or `InvalidToken` (wrong key/corrupted data).

```diff
-        graph_db_password = cipher.decrypt(
-            dataset_database.graph_database_connection_info["graph_database_password"].encode()
-        ).decode()
+        try:
+            encrypted_password = dataset_database.graph_database_connection_info[
+                "graph_database_password"
+            ]
+            graph_db_password = cipher.decrypt(encrypted_password.encode()).decode()
+        except KeyError:
+            raise ValueError("Missing encrypted password in dataset connection info")
+        except Exception as e:
+            raise ValueError(f"Failed to decrypt Neo4j password: {e}")
```

As per coding guidelines, "Prefer explicit, structured error handling in Python code."
155-157: Implement `delete_dataset` or raise `NotImplementedError`.

The abstract method is unimplemented. Raise `NotImplementedError` to make it explicit that this functionality is pending.

```diff
 @classmethod
 async def delete_dataset(cls, dataset_database: DatasetDatabase):
-    pass
+    # TODO: Implement instance deletion via Neo4j Aura API
+    raise NotImplementedError("Neo4j Aura instance deletion not yet implemented")
```
🧹 Nitpick comments (3)
cognee/modules/data/deletion/prune_system.py (2)
14-16: Move import to module level unless circular dependency exists.

Imports placed inside functions are generally a code smell and can impact readability and performance. Consider moving this import to the top of the module unless there's a specific circular dependency issue.

Apply this diff:

```diff
 from cognee.context_global_variables import backend_access_control_enabled
 from cognee.infrastructure.databases.vector import get_vector_engine
 from cognee.infrastructure.databases.graph.get_graph_engine import get_graph_engine
 from cognee.infrastructure.databases.relational import get_relational_engine
 from cognee.infrastructure.databases.vector.config import get_vectordb_config
 from cognee.infrastructure.databases.graph.config import get_graph_config
+from cognee.infrastructure.databases.dataset_database_handler.supported_dataset_database_handlers import (
+    supported_dataset_database_handlers,
+)
 from cognee.shared.cache import delete_cache
 from cognee.modules.users.models import DatasetDatabase
```

Then remove lines 14-16 from the function.
49-51: Update or remove the outdated TODO comment.

The TODO states "prune_system should work with multi-user access control mode enabled" but the code in lines 52-62 appears to implement exactly this functionality via `prune_graph_databases()` and `prune_vector_databases()`. Either remove the TODO if the implementation is complete, or clarify what remaining work is needed.
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
31-31: Remove redundant import.

`get_graph_config` is already imported at Line 9. Remove this duplicate import from within the method. Apply this diff:

```diff
-        from cognee.infrastructure.databases.graph.config import get_graph_config
-
         graph_config = get_graph_config()
```
🔇 Additional comments (6)
.github/workflows/e2e_tests.yml (2)
215-238: The test job configuration is correct and complete.

The test file exists at `./cognee/tests/test_dataset_database_handler.py` and uses custom LanceDB and Kuzu handlers with local file-based storage. No external service dependencies (postgres, redis, docker) are required. The environment variables for LLM and embedding configuration are already present in the job, and all test setup is self-contained within the test file.
150-150: No actionable change needed.

The `ENABLE_BACKEND_ACCESS_CONTROL: 'false'` setting for the Deduplication Example test is appropriate. This flag controls multi-tenant data isolation at the database layer and does not affect deduplication logic. The test file contains pure deduplication verification without any access control dependencies. This setting pattern is consistent with other non-access-control focused tests in the codebase (e.g., `test_relational_db_migration.py`, `code_graph_example.py`). Deduplication behavior is identical regardless of the access control mode.

cognee/modules/data/deletion/prune_system.py (1)

1-1: LGTM! The new imports are appropriate for the added functionality. Also applies to: 5-6, 8-8
Also applies to: 5-6, 8-8
cognee/infrastructure/databases/dataset_database_handler/dataset_database_handler_interface.py (1)
9-80: LGTM: Well-designed interface with clear security guidance.The interface establishes a clean contract for dataset database provisioning with comprehensive docstrings. The security guidance regarding credential handling (preferring secret references over plaintext in
create_datasetand resolving at connection time viaresolve_dataset_connection_info) aligns with best practices.cognee/infrastructure/databases/vector/lancedb/LanceDBDatasetDatabaseHandler.py (1)
42-52: LGTM: Proper cleanup implementation.

The `delete_dataset` method correctly creates a vector engine with the dataset's connection info and invokes the prune operation to clean up resources.
57-81: LGTM: Proper graph deletion implementation.The
delete_datasetmethod now correctly constructs the graph file path and invokes the graph engine'sdelete_graph()to remove the dataset's backing database. This addresses the prior review concern about the no-op implementation.
Actionable comments posted: 1
♻️ Duplicate comments (5)
cognee/modules/data/deletion/prune_system.py (5)
17-39: Add docstring to prune_graph_databases function.

The function lacks a docstring. As per coding guidelines, undocumented function definitions are assumed incomplete and should have docstrings explaining purpose, behavior, and any important notes.

Add a concise docstring:

```diff
 async def prune_graph_databases():
+    """
+    Prune all graph databases for registered datasets.
+
+    Iterates through dataset_database records and deletes each graph database
+    using the configured handler. Skips pruning if the dataset_database table
+    is not accessible.
+    """
     async def _prune_graph_db(dataset_database: DatasetDatabase) -> dict:
```

As per coding guidelines, docstrings are required for function definitions.
24-25: Add validation and error handling for handler access and deletion.

The dictionary access on line 24 could raise `KeyError` if the configured handler is not registered. Additionally, the `delete_dataset` call on line 25 lacks error handling, which could cause the entire pruning operation to fail when one dataset deletion fails.

Apply this diff:

```diff
-        handler = supported_dataset_database_handlers[graph_config.graph_dataset_database_handler]
-        return await handler["handler_instance"].delete_dataset(dataset_database)
+        handler_key = graph_config.graph_dataset_database_handler
+        if handler_key not in supported_dataset_database_handlers:
+            logger.error(f"Graph handler '{handler_key}' not registered")
+            raise ValueError(f"Graph dataset database handler '{handler_key}' not registered")
+
+        handler = supported_dataset_database_handlers[handler_key]
+        try:
+            return await handler["handler_instance"].delete_dataset(dataset_database)
+        except Exception as e:
+            logger.error(
+                f"Failed to delete graph database for dataset {dataset_database.dataset_id}: {e}"
+            )
+            raise
```

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling.
41-63: Address duplicate issues from prune_graph_databases.

This function has the same issues as `prune_graph_databases`:

- Missing docstring (lines 41-63)
- Import inside nested function (lines 45-47) should be moved to module level
- Dictionary access without key validation (line 49) could raise `KeyError`
- Missing error handling for the `delete_dataset` call (line 50)

Apply similar fixes as suggested for `prune_graph_databases`:

- Add a docstring describing the function's purpose
- Move the import to module level (already covered in the earlier comment)
- Validate that `vector_config.vector_dataset_database_handler` exists in `supported_dataset_database_handlers` before accessing
- Wrap `delete_dataset` in try-except with logging

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling.
async def prune_system(graph=True, vector=True, metadata=True, cache=True): + """ + Prune system databases and cache. + + Args: + graph: If True, prune graph databases + vector: If True, prune vector databases + metadata: If True, prune metadata database + cache: If True, delete cache + + Note: + This function should not be exposed through the API as it has no permission + checks and will delete all databases. Use only in development/testing. + """ # Note: prune system should not be available through the API, it has no permission checks and willAs per coding guidelines, docstrings are required for function definitions.
72-79: Add error handling for new pruning functions.

The calls to `prune_graph_databases()` (line 73) and `prune_vector_databases()` (line 79) lack error handling. Given that this function is called at pipeline start (based on learnings) and is used in development/testing, failures should be handled gracefully to prevent cascading issues.

Apply this diff:

```diff
     elif graph and backend_access_control_enabled():
-        await prune_graph_databases()
+        try:
+            await prune_graph_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune graph databases: {e}")
+            raise

     if vector and not backend_access_control_enabled():
         vector_engine = get_vector_engine()
         await vector_engine.prune()
     elif vector and backend_access_control_enabled():
-        await prune_vector_databases()
+        try:
+            await prune_vector_databases()
+        except Exception as e:
+            logger.error(f"Failed to prune vector databases: {e}")
+            raise
```

As per coding guidelines, use shared logging utilities from `cognee.shared.logging_utils` and prefer explicit error handling. Based on learnings, this function is called at pipeline start, so proper error propagation is critical.
🔇 Additional comments (1)
cognee/modules/data/deletion/prune_system.py (1)
28-38: Good addition of outer error handling.

The error handling for `OperationalError` and `EntityNotFoundError` is a good addition. This gracefully handles cases where the `dataset_database` table is not accessible (e.g., during initial setup or schema migration) and logs appropriately using the shared logger. The use of `logger.debug()` is appropriate here since this is an expected condition in certain environments.

Also applies to: 52-63
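For reference, the general shape of that pattern as a sketch; the helper names here are illustrative, and the exact exception types and logger setup in cognee may differ:

```python
from sqlalchemy.exc import OperationalError

from cognee.shared.logging_utils import get_logger

logger = get_logger(__name__)


async def prune_registered_databases(list_dataset_databases, delete_one):
    """Best-effort pruning that tolerates a missing registry table."""
    try:
        dataset_databases = await list_dataset_databases()
    except OperationalError as error:
        # Expected during initial setup or mid-migration; log at debug level
        # and skip rather than failing the whole prune operation.
        logger.debug("dataset_database table not accessible, skipping prune: %s", error)
        return
    for dataset_database in dataset_databases:
        await delete_one(dataset_database)
```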
Actionable comments posted: 3
♻️ Duplicate comments (1)
cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
18-54: Validate `dataset_id` before using it to build the Kuzu filename.

`dataset_id` is typed as `Optional[UUID]` but is used directly in `graph_db_name = f"{dataset_id}.pkl"`. If `dataset_id` is ever `None`, this will silently create a `"None.pkl"` database name, which is almost certainly unintended and will be hard to diagnose.

Consider failing fast by validating `dataset_id` at the top of the method:

```diff
 @classmethod
 async def create_dataset(cls, dataset_id: Optional[UUID], user: Optional[User]) -> dict:
@@
     graph_config = get_graph_config()
@@
     if graph_config.graph_database_provider != "kuzu":
         raise ValueError(
             "KuzuDatasetDatabaseHandler can only be used with Kuzu graph database provider."
         )
+
+    if dataset_id is None:
+        raise ValueError("dataset_id is required for Kuzu dataset creation.")
@@
     graph_db_name = f"{dataset_id}.pkl"
```

This keeps the handler's contract strict and avoids generating ambiguous filenames.
🧹 Nitpick comments (1)
cognee/context_global_variables.py (1)
124-125: TODO comment flags technical debt.

The TODO comment correctly identifies that vector and graph config handling across Cognee needs improvement, particularly regarding LRU_CACHE sensitivity to input order.
Would you like me to open a new issue to track this technical debt, or do you plan to address it in a follow-up PR?
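The order sensitivity is easy to reproduce with `functools.lru_cache` itself: per the stdlib docs, `f(a=1, b=2)` and `f(b=2, a=1)` differ in keyword argument order and may get separate cache entries. A self-contained illustration:

```python
from functools import lru_cache

call_count = 0


@lru_cache(maxsize=None)
def get_config(provider, url):
    global call_count
    call_count += 1
    return {"provider": provider, "url": url}


get_config(provider="kuzu", url="db1")
get_config(url="db1", provider="kuzu")  # same logical call, different kwarg order

print(call_count)  # 2: one logical configuration, two cache entries
```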
🔇 Additional comments (6)
cognee/context_global_variables.py (5)
7-10: LGTM! Import updates align with new configuration patterns.

The imports correctly reflect the refactored configuration utilities and the new dataset database connection info resolver.
82-82: LGTM! Simplified logic reduces duplication.

The direct call to `multi_user_support_possible()` simplifies the logic and ensures consistent validation.
141-146: Verify empty string defaults for credentials are acceptable.

Using `.get()` with empty string defaults for `graph_database_username` and `graph_database_password` might mask missing required credentials. For Neo4j Aura (mentioned in the PR title), these credentials are typically required.

Verify whether:

- Empty credentials are valid for any supported graph database providers
- Downstream code properly validates and reports authentication failures
- The `graph_database_connection_info` dictionary is guaranteed to contain these keys after `resolve_dataset_database_connection_info()` is called

Consider using `None` as the default and explicitly checking for required credentials, or raising an error if credentials are missing for providers that require them.
resolve_dataset_database_connection_info()ensures that all connection details (both vector and graph) are properly resolved before use.
29-30: No action required—handler attributes cannot be None.Both
graph_dataset_database_handlerandvector_dataset_database_handlerare defined asstrtype with non-empty default values ("kuzu" and "lancedb" respectively) in their respective Pydantic BaseSettings config classes. These attributes are guaranteed to always return string values and will not fail theinchecks on lines 35 and 42.cognee/infrastructure/databases/graph/kuzu/KuzuDatasetDatabaseHandler.py (1)
1-10: Imports and interface wiring look consistent.The imports are all used, and the handler cleanly depends on the shared
DatasetDatabaseHandlerInterface, graph engine factory, and user/database models. No changes needed here.
```diff
 def multi_user_support_possible():
-    graph_db_config = get_graph_context_config()
-    vector_db_config = get_vectordb_context_config()
-    return (
-        graph_db_config["graph_database_provider"] in GRAPH_DBS_WITH_MULTI_USER_SUPPORT
-        and vector_db_config["vector_db_provider"] in VECTOR_DBS_WITH_MULTI_USER_SUPPORT
-    )
+    graph_db_config = get_graph_config()
+    vector_db_config = get_vectordb_config()
+
+    graph_handler = graph_db_config.graph_dataset_database_handler
+    vector_handler = vector_db_config.vector_dataset_database_handler
+    from cognee.infrastructure.databases.dataset_database_handler import (
+        supported_dataset_database_handlers,
+    )
+
+    if graph_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported graph dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if vector_handler not in supported_dataset_database_handlers:
+        raise EnvironmentError(
+            "Unsupported vector dataset to database handler configured. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[graph_handler]["handler_provider"]
+        != graph_db_config.graph_database_provider
+    ):
+        raise EnvironmentError(
+            "The selected graph dataset to database handler does not work with the configured graph database provider. Cannot add support for multi-user access control mode. Please use a supported graph dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected graph database provider: {graph_db_config.graph_database_provider}\n"
+            f"Selected graph dataset to database handler: {graph_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    if (
+        supported_dataset_database_handlers[vector_handler]["handler_provider"]
+        != vector_db_config.vector_db_provider
+    ):
+        raise EnvironmentError(
+            "The selected vector dataset to database handler does not work with the configured vector database provider. Cannot add support for multi-user access control mode. Please use a supported vector dataset to database handler or set the environment variables ENABLE_BACKEND_ACCESS_CONTROL to false to switch off multi-user access control mode.\n"
+            f"Selected vector database provider: {vector_db_config.vector_db_provider}\n"
+            f"Selected vector dataset to database handler: {vector_handler}\n"
+            f"Supported dataset to database handlers: {list(supported_dataset_database_handlers.keys())}\n"
+        )
+
+    return True
```
🛠️ Refactor suggestion | 🟠 Major
Add a docstring to document validation behavior.
This function lacks a docstring. Per the coding guidelines, undocumented function definitions are assumed incomplete.
Apply this diff to add a docstring:
```diff
 def multi_user_support_possible():
+    """
+    Validate that the configured graph and vector dataset database handlers are supported
+    and compatible with their respective database providers.
+
+    Raises:
+        EnvironmentError: If handlers are unsupported or if handler providers don't match
+            the configured database providers.
+
+    Returns:
+        bool: True if validation passes.
+    """
     graph_db_config = get_graph_config()
     vector_db_config = get_vectordb_config()
```
🤖 Prompt for AI Agents
In cognee/context_global_variables.py around lines 25 to 71, the function
multi_user_support_possible lacks a docstring; add a concise triple-quoted
docstring immediately under the def line that states the function’s purpose
(validate whether configured graph and vector dataset handlers/providers support
multi-user access control), documents that it takes no arguments, returns True
when validation passes, and raises EnvironmentError with explanatory messages
when any handler/provider is unsupported or mismatched; keep the docstring short
(one to three sentences) and include a brief note about the raised exceptions
and return value.
Description
Add the ability to use multi-tenant, multi-user mode with Neo4j.
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit