Unified stores configuration with configurable prefixes and filepath_default #1333
Merged
Conversation
- Remove legacy ExternalSettings and ObjectStorageSettings classes
- Update Config to use stores.default and stores.<name> structure
- Update get_store_spec() to support default store (store=None)
- Add partition_pattern and token_length support for stores
- Update secrets loading to support per-store credentials
- Update save_template() to generate unified stores config
- Update _update_from_flat_dict() to handle stores.<name>.<attr> pattern
- Remove external.* from ENV_VAR_MAPPING
Unified stores configuration supports both:
- Hash-addressed storage (<blob@>, <attach@>) via _hash section
- Schema-addressed storage (<object@>, <npy@>) via _schema section
Configuration structure:
stores.default - name of default store
stores.<name>.protocol - storage protocol (file, s3, gcs, azure)
stores.<name>.location - base path (includes project context)
stores.<name>.partition_pattern - schema-addressed partitioning
stores.<name>.token_length - random token length
stores.<name>.subfolding - hash-addressed subfolding
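For orientation, here is what that structure could look like as a plain settings dict. This is a minimal sketch: the keys follow the list above, but the store name, the values, and the subfolding format are invented for the example, not taken from this PR.

```python
# Illustrative only: keys follow the structure above; the store name,
# values, and subfolding format are assumptions made for the example.
stores_config = {
    "stores": {
        "default": "main",                        # name of the default store
        "main": {
            "protocol": "s3",                     # file, s3, gcs, azure
            "location": "my-bucket/lab-project",  # base path, includes project context
            "partition_pattern": "{subject_id}",  # schema-addressed partitioning
            "token_length": 8,                    # random token length
            "subfolding": (2, 2),                 # hash-addressed subfolding (format assumed)
        },
    }
}
```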
- Update hash_registry.py to use get_store_spec() instead of get_object_store_spec()
- Update staged_insert.py to use get_store_spec() for default store
- Update error messages to reference new stores configuration
- Remove references to object_storage.default_store (now stores.default)
- Replace config.external tests with stores credential tests
- Update template test to check for stores structure instead of object_storage
- Update get_store_spec tests for new default behavior (None instead of DEFAULT_SUBFOLDING)
- Add tests for default store lookup (store=None)
- Add tests for loading per-store credentials from .secrets/
- Verify partition_pattern and token_length defaults
- Update mock_stores fixture to use config.stores instead of config.object_storage
- Update mock_object_storage fixture to configure stores.default and stores.local
- Remove project_name from object_storage_config (now embedded in location path)
- Simplify fixture by using unified stores API
- Update mock_stores_update fixture to use config.stores
- Remove project_name (now embedded in location path)
- Simplify fixture using unified stores API
- Add validation to prevent filepath paths starting with _hash/ or _schema/
- Update FilepathCodec docstring to clarify reserved sections
- Filepath gives users maximum freedom while protecting DataJoint-managed sections
- Users can organize files anywhere in the store except the reserved sections
- Test that filepath rejects paths starting with _hash/
- Test that filepath rejects paths starting with _schema/
- Test that filepath allows all other user-managed paths
- Test filepath codec properties and registration
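A minimal sketch of the rule these tests exercise, assuming a plain helper function; the function name and exception type are illustrative, and only the two reserved prefixes come from the commit above (the real check lives in FilepathCodec).

```python
RESERVED_PREFIXES = ("_hash/", "_schema/")  # DataJoint-managed sections

def check_filepath(path: str) -> str:
    """Hypothetical helper: reject user paths that reach into reserved sections."""
    if any(path.startswith(prefix) for prefix in RESERVED_PREFIXES):
        raise ValueError(f"filepath may not start with a reserved section: {path!r}")
    return path  # everything else is user-managed and allowed
```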
The 'secure' parameter is only valid for S3 stores, not for the file/GCS/Azure protocols. Move the default setting into the protocol-specific section to avoid validation errors when using file stores.
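A sketch of that kind of protocol-specific defaulting, assuming a helper that post-processes a store spec; the function name and the default value are assumptions, not code from this PR.

```python
def apply_protocol_defaults(spec: dict) -> dict:
    """Hypothetical: only S3 specs get a 'secure' default, so file/GCS/Azure
    specs never see a parameter their validation would reject."""
    spec = dict(spec)  # avoid mutating the caller's config
    if spec.get("protocol") == "s3":
        spec.setdefault("secure", True)  # assumed default value
    return spec
```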
Allow users to configure custom prefixes for hash-addressed, schema-addressed,
and filepath storage sections per store. This enables mapping DataJoint to
existing storage layouts without restructuring.
Configuration:
- hash_prefix (default: '_hash') - Hash-addressed storage section
- schema_prefix (default: '_schema') - Schema-addressed storage section
- filepath_prefix (default: None) - Optional filepath restriction
Features:
- Validates prefixes don't overlap (mutual exclusion; sketched below)
- FilepathCodec enforces dynamic reserved prefixes
- Optional filepath_prefix to restrict filepath paths
- Backwards compatible defaults
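A sketch of the overlap rule referenced above, assuming "overlap" means one prefix being equal to, or nested under, another; the function is illustrative, not the PR's implementation.

```python
def prefixes_overlap(a: str | None, b: str | None) -> bool:
    """Hypothetical check: two prefixes clash if one equals the other or
    contains it as a parent directory (e.g. 'data' vs 'data/raw')."""
    if a is None or b is None:
        return False  # an unset prefix cannot clash
    a, b = a.rstrip("/") + "/", b.rstrip("/") + "/"
    return a.startswith(b) or b.startswith(a)

# The validation would apply this pairwise to hash_prefix, schema_prefix, filepath_prefix.
```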
Examples:
{
  "stores": {
    "legacy": {
      "protocol": "file",
      "location": "/data/existing",
      "hash_prefix": "content_addressed",
      "schema_prefix": "structured_data",
      "filepath_prefix": "raw_files"
    }
  }
}
Changes:
- settings.py: Add prefix fields, validation logic
- builtin_codecs.py: Dynamic prefix checking in FilepathCodec
- test_settings.py: 7 new tests for prefix validation
- test_codecs.py: 2 new tests for custom prefixes
Filepath storage is NOT part of the Object-Augmented Schema - it only
provides references to externally-managed files. Allow separate default
configuration for filepath references vs integrated storage.
Configuration:
- stores.default - for integrated storage (<blob>, <object>, <npy>, <attach>)
- stores.filepath_default - for filepath references (<filepath>)
This allows:
- Integrated storage on S3 or fast filesystem
- Filepath references to acquisition files on NAS or different location
Example:
{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "s3",
      "bucket": "processed-data",
      "location": "lab-project"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/mnt/nas/acquisition"
    }
  }
}
Usage:
- data : <blob> # Uses stores.default (main)
- arrays : <object> # Uses stores.default (main)
- raw : <filepath> # Uses stores.filepath_default (raw_data)
- raw : <filepath@acq> # Explicitly names store (overrides default)
Changes:
- settings.py: Add use_filepath_default parameter to get_store_spec()
- builtin_codecs.py: FilepathCodec uses use_filepath_default=True
- test_settings.py: Add 3 tests for filepath_default behavior
- settings.py: Update template to include filepath_default example
Architectural rationale:
- Hash/schema storage: integrated into OAS, DataJoint manages lifecycle
- Filepath storage: references only, users manage lifecycle
- Different defaults reflect this fundamental distinction
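A sketch of how that selection could work inside get_store_spec(); the parameter name use_filepath_default is from this commit, but the function body and the fallback to stores.default when filepath_default is unset are assumptions.

```python
def resolve_default_store(stores: dict, use_filepath_default: bool = False) -> str:
    """Hypothetical resolution of the default store name."""
    if use_filepath_default and "filepath_default" in stores:
        return stores["filepath_default"]  # e.g. 'raw_data' for <filepath>
    return stores["default"]               # e.g. 'main' for <blob>, <object>, <npy>, <attach>
```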
…alidation
All 24 object storage test failures were due to test fixtures not creating the directories they configured. StorageBackend validates that file-protocol locations exist, so fixtures must create them.
- conftest.py: Create test_project subdirectory in object_storage_config
- test_update1.py: Create djtest subdirectories in mock_stores_update
Test results: 520 passed, 7 skipped, 0 failures ✓
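A minimal sketch of the pattern these fixes follow, using pytest's tmp_path; the fixture name comes from the commit, but its return shape and contents are illustrative.

```python
import pytest

@pytest.fixture
def object_storage_config(tmp_path):
    """Illustrative fixture: create the directory the store config points at,
    since StorageBackend validates that file-protocol locations exist."""
    location = tmp_path / "test_project"
    location.mkdir(parents=True)  # the fix: create what we configure
    return {"protocol": "file", "location": str(location)}
```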
Add blank line after import statement per PEP 8 style guidelines.
- Removed hardcoded 'objects' directory level from build_object_path()
- Updated path pattern comment to reflect new structure
- Updated all test expectations to match new path format
Previous path: {schema}/{table}/objects/{key}/{file}
New path: {schema}/{table}/{key}/{file}
The 'objects' literal was a legacy remnant intended for future tabular
storage alongside objects. Removing it simplifies the path structure
and aligns with documented behavior.
Verified:
- All test_object.py tests pass (43 tests)
- All test_npy_codec.py tests pass (22 tests)
- All test_hash_storage.py tests pass (14 tests)
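For reference, a sketch of the simplified layout; the signature is illustrative, and only the path order comes from this commit.

```python
from pathlib import PurePosixPath

def build_object_path(schema: str, table: str, key: str, filename: str) -> str:
    """Illustrative: {schema}/{table}/{key}/{file}, with no 'objects' level."""
    return str(PurePosixPath(schema, table, key, filename))

# build_object_path("my_schema", "recording", "subject_id=5", "signal.npy")
# -> 'my_schema/recording/subject_id=5/signal.npy'
```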
- Updated SchemaCodec._build_path() to accept store_name parameter
- _build_path() now retrieves partition_pattern and token_length from store spec
- ObjectCodec and NpyCodec encode methods pass store_name to _build_path
- Enables partitioning configuration like partition_pattern: '{mouse_id}/{session_date}'
This allows organizing storage by experimental structure:
- Without: {schema}/{table}/{mouse_id=X}/{session_date=Y}/...
- With: {mouse_id=X}/{session_date=Y}/{schema}/{table}/...
Partitioning makes storage browsable by subject/session and enables
selective sync/backup of individual subjects or sessions.
The partition_pattern was not preserving the order of attributes specified in the pattern because it was iterating over a set (unordered). This caused paths like 'neuron_id=0/mouse_id=5/session_date=2017-01-05/...' instead of the expected 'mouse_id=5/session_date=2017-01-05/neuron_id=0/...'.
Changes:
- Extract partition attributes as a list to preserve order
- Keep a set for efficient lookup when filtering remaining PK attributes
- Iterate over the ordered list when building partition path components
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
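A sketch of the fix, assuming partition attributes are written as {name} placeholders in the pattern; names are illustrative, not the PR's code.

```python
import re

def partition_components(pattern: str, key: dict) -> list[str]:
    """Hypothetical: iterate an ordered list of partition attributes,
    keeping a set only for fast membership checks."""
    partition_attrs = re.findall(r"\{(\w+)\}", pattern)  # list preserves pattern order
    partition_set = set(partition_attrs)                 # O(1) lookup for filtering
    parts = [f"{attr}={key[attr]}" for attr in partition_attrs]
    parts += [f"{attr}={val}" for attr, val in key.items() if attr not in partition_set]
    return parts

# partition_components("{mouse_id}/{session_date}",
#     {"mouse_id": 5, "session_date": "2017-01-05", "neuron_id": 0})
# -> ['mouse_id=5', 'session_date=2017-01-05', 'neuron_id=0']
```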
Added helper functions for safe 0.14.6 → 2.0 migration using parallel schemas.
New functions in datajoint.migrate:
- create_parallel_schema() - Create _v20 schema copy for testing
- copy_table_data() - Copy data from production to test schema
- compare_query_results() - Validate results match between schemas
- backup_schema() - Create full schema backup before cutover
- restore_schema() - Restore from backup if needed
- verify_schema_v20() - Check if schema is 2.0 compatible
These functions support the parallel schema migration approach, which:
- Keeps production untouched during testing
- Allows unlimited practice runs
- Enables side-by-side validation
- Provides easy rollback (just drop _v20 schemas)
See: datajoint-docs/src/how-to/migrate-to-v20.md
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
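A hedged usage sketch of those helpers; the function names and module come from the commit, but every argument and the overall call sequence shown here is a guess at the intended workflow, not the actual API.

```python
import datajoint.migrate as migrate  # module name from the commit; signatures assumed

# 1. Build a parallel _v20 copy and fill it with production data (hypothetical arguments)
migrate.create_parallel_schema("my_pipeline")
migrate.copy_table_data("my_pipeline", "my_pipeline_v20")

# 2. Validate side by side, back up before cutover, verify compatibility
migrate.compare_query_results("my_pipeline", "my_pipeline_v20")
migrate.backup_schema("my_pipeline")
migrate.verify_schema_v20("my_pipeline_v20")

# Rollback is cheap: drop the _v20 schemas, or use restore_schema() on a backup.
```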
Added helper function for migrating external storage pointers when copying
production data to _v2 schemas during git branch-based migration.
Function: migrate_external_pointers_v2()
- Converts BINARY(16) UUID → JSON metadata
- Points to existing files (no file copying required)
- Enables access to external data in _v2 test schemas
- Supports deferred external storage migration approach
Use case:
When using git branch workflow (main: 0.14.6, migrate-to-v2: 2.0), this
function allows copied production data to access external storage without
moving the actual blob files until production cutover.
Example:
migrate_external_pointers_v2(
    schema='my_pipeline_v2',
    table='recording',
    attribute='signal',
    source_store='external-raw',
    dest_store='raw',
    copy_files=False  # Keep files in place
)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Remove trailing whitespace from SQL query
- Remove unused dest_spec variable
- Fix blank line whitespace (auto-fixed by ruff)
Auto-formatted by ruff-format to collapse multi-line function calls
Summary
This PR implements a unified stores configuration system for DataJoint 2.0, replacing the separate external.* and object_storage.* configurations with a single stores.* configuration.
Key Changes
1. Unified Stores Configuration
- Removes the legacy ExternalSettings and ObjectStorageSettings classes
- Adds a single stores: dict[str, Any] field in Config
2. Configurable Storage Prefixes
- hash_prefix (default: _hash) for hash-addressed storage
- schema_prefix (default: _schema) for schema-addressed storage
- filepath_prefix (default: null) for optional filepath restriction
3. Separate Filepath Default
- stores.filepath_default, separate from stores.default
4. Enhanced Filepath Validation
- Filepath paths may not start with a store's reserved hash/schema prefixes
Configuration Example
{ "stores": { "default": "main", "filepath_default": "raw_data", "main": { "protocol": "file", "location": "/data/fast-storage", "hash_prefix": "_hash", "schema_prefix": "_schema" }, "raw_data": { "protocol": "file", "location": "/data/acquisition", "filepath_prefix": "recordings" } } }Breaking Changes
None. This is pre-production 2.0 work, so there are no backward-compatibility concerns within 2.0 itself.
Testing
Related Documentation
- reference/specs/object-store-configuration.md
- how-to/configure-storage.md
🤖 Generated with Claude Code