Unified stores configuration with configurable prefixes and filepath_default #1333

dimitri-yatsenko · 2026-01-14T06:01:57Z

Summary

This PR implements a unified stores configuration system for DataJoint 2.0, replacing the separate external.* and object_storage.* configurations with a single stores.* configuration.

Key Changes

1. Unified Stores Configuration

Removed ExternalSettings and ObjectStorageSettings classes
Added single stores: dict[str, Any] field in Config
All storage (hash, schema, filepath) now uses the same store configuration

2. Configurable Storage Prefixes

Added hash_prefix (default: _hash) for hash-addressed storage
Added schema_prefix (default: _schema) for schema-addressed storage
Added filepath_prefix (default: null) for optional filepath restriction
Validation ensures prefixes don't overlap (no nesting)
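
The no-overlap rule can be illustrated with a short sketch. This is not the validation code from settings.py; `validate_prefixes` is a hypothetical helper, and comparing whole path components (so `_hash` and `_hash2` are not treated as nested) is an assumption about how nesting is defined.

```python
from pathlib import PurePosixPath


def validate_prefixes(store: dict) -> None:
    """Raise ValueError if any two configured prefixes are equal or nested.

    Hypothetical helper; the defaults mirror those listed in the PR description.
    """
    prefixes = {
        "hash_prefix": store.get("hash_prefix", "_hash"),
        "schema_prefix": store.get("schema_prefix", "_schema"),
        "filepath_prefix": store.get("filepath_prefix"),  # None means unrestricted
    }
    # Split into path components; unset prefixes are skipped.
    parts = {
        name: PurePosixPath(value.strip("/")).parts
        for name, value in prefixes.items()
        if value
    }
    names = sorted(parts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shorter, longer = sorted((parts[a], parts[b]), key=len)
            # Overlap: the shorter prefix is a leading segment of the longer one.
            if longer[: len(shorter)] == shorter:
                raise ValueError(
                    f"{a} and {b} overlap: {prefixes[a]!r} vs {prefixes[b]!r}"
                )
```

With the default prefixes the check passes; a store that sets `hash_prefix` to `data` and `schema_prefix` to `data/schema` would be rejected.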

3. Separate Filepath Default

Added stores.filepath_default separate from stores.default
Reflects architectural distinction: integrated storage (hash/schema) vs. filepath references
Filepath storage is NOT part of the Object-Augmented Schema (OAS) - it only provides references to user-managed files

4. Enhanced Filepath Validation

FilepathCodec validates paths against configured prefixes dynamically
Prevents filepath from using reserved sections (hash_prefix, schema_prefix)
Optional filepath_prefix enforcement for user-defined restrictions
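
A rough sketch of the check described here, under stated assumptions: `check_filepath` is a hypothetical stand-in and not the actual FilepathCodec code, and the error wording is invented for illustration.

```python
def check_filepath(relative_path: str, store: dict) -> str:
    """Return the normalized path, or raise ValueError if it is not allowed.

    Hypothetical helper: reserved sections come from the store's own
    hash_prefix/schema_prefix, so custom prefixes are honored dynamically.
    """
    path = relative_path.strip("/")
    reserved = [
        store.get("hash_prefix", "_hash"),
        store.get("schema_prefix", "_schema"),
    ]

    def under(prefix: str) -> bool:
        # True if path is the prefix itself or lies inside it.
        prefix = prefix.strip("/")
        return path == prefix or path.startswith(prefix + "/")

    for prefix in reserved:
        if under(prefix):
            raise ValueError(f"{relative_path!r} is inside reserved section {prefix!r}")

    required = store.get("filepath_prefix")  # optional user-defined restriction
    if required and not under(required):
        raise ValueError(f"{relative_path!r} must be under {required!r}")
    return path
```

For a store with `filepath_prefix` set to `recordings`, `recordings/s1.dat` is accepted, while `_hash/ab` and `other/s1.dat` are both rejected.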

Configuration Example

{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "file",
      "location": "/data/fast-storage",
      "hash_prefix": "_hash",
      "schema_prefix": "_schema"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/data/acquisition",
      "filepath_prefix": "recordings"
    }
  }
}
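
To show how the two defaults in this example might be resolved, here is a minimal sketch. `resolve_store` is a hypothetical function standing in for `get_store_spec()`; only the `use_filepath_default` parameter name comes from the commit messages. Raising an error when the relevant default is missing is an assumption, and the real function may behave differently.

```python
import json

# The configuration example from the PR description.
CONFIG = json.loads("""
{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "file",
      "location": "/data/fast-storage",
      "hash_prefix": "_hash",
      "schema_prefix": "_schema"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/data/acquisition",
      "filepath_prefix": "recordings"
    }
  }
}
""")


def resolve_store(stores: dict, name=None, use_filepath_default=False):
    """Return (store_name, spec). Hypothetical stand-in for get_store_spec()."""
    if name is None:
        # Integrated storage uses stores.default; <filepath> uses stores.filepath_default.
        key = "filepath_default" if use_filepath_default else "default"
        name = stores.get(key)
        if name is None:
            raise KeyError(f"stores.{key} is not configured")
    spec = stores.get(name)
    if not isinstance(spec, dict):
        raise KeyError(f"store {name!r} is not configured")
    return name, spec


stores = CONFIG["stores"]
blob_store, _ = resolve_store(stores)                              # "main"
file_store, _ = resolve_store(stores, use_filepath_default=True)   # "raw_data"
```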

Breaking Changes

None - this is pre-production 2.0 work. No backward compatibility concerns with 2.0 itself.

Testing

54 configuration tests passing
8 filepath codec tests with prefix validation
Full integration test suite passing

Related Documentation

Comprehensive specification in datajoint-docs: reference/specs/object-store-configuration.md
Updated how-to guide: how-to/configure-storage.md

🤖 Generated with Claude Code

- Remove legacy ExternalSettings and ObjectStorageSettings classes
- Update Config to use stores.default and stores.<name> structure
- Update get_store_spec() to support default store (store=None)
- Add partition_pattern and token_length support for stores
- Update secrets loading to support per-store credentials
- Update save_template() to generate unified stores config
- Update _update_from_flat_dict() to handle stores.<name>.<attr> pattern
- Remove external.* from ENV_VAR_MAPPING

Unified stores configuration supports both:
- Hash-addressed storage (<blob@>, <attach@>) via _hash section
- Schema-addressed storage (<object@>, <npy@>) via _schema section

Configuration structure:
  stores.default - name of default store
  stores.<name>.protocol - storage protocol (file, s3, gcs, azure)
  stores.<name>.location - base path (includes project context)
  stores.<name>.partition_pattern - schema-addressed partitioning
  stores.<name>.token_length - random token length
  stores.<name>.subfolding - hash-addressed subfolding
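
The stores.<name>.<attr> key pattern can be illustrated with a small sketch that folds flat dotted keys into the nested structure. `nest_store_keys` is a hypothetical function, not the body of `_update_from_flat_dict()`, and ignoring keys outside `stores.*` is an assumption made to keep the example short.

```python
def nest_store_keys(flat: dict) -> dict:
    """Fold flat 'stores.*' keys into a nested dict (illustrative only)."""
    stores: dict = {}
    for key, value in flat.items():
        parts = key.split(".")
        if parts[0] != "stores":
            continue  # other settings are out of scope for this sketch
        if len(parts) == 2:
            # stores.default, stores.filepath_default
            stores[parts[1]] = value
        elif len(parts) == 3:
            # stores.<name>.<attr>
            stores.setdefault(parts[1], {})[parts[2]] = value
    return {"stores": stores}


nested = nest_store_keys({
    "stores.default": "main",
    "stores.main.protocol": "file",
    "stores.main.location": "/data/fast-storage",
})
```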

- Update hash_registry.py to use get_store_spec() instead of get_object_store_spec()
- Update staged_insert.py to use get_store_spec() for default store
- Update error messages to reference new stores configuration
- Remove references to object_storage.default_store (now stores.default)

- Replace config.external tests with stores credential tests
- Update template test to check for stores structure instead of object_storage
- Update get_store_spec tests for new default behavior (None instead of DEFAULT_SUBFOLDING)
- Add tests for default store lookup (store=None)
- Add tests for loading per-store credentials from .secrets/
- Verify partition_pattern and token_length defaults

- Update mock_stores fixture to use config.stores instead of config.object_storage
- Update mock_object_storage fixture to configure stores.default and stores.local
- Remove project_name from object_storage_config (now embedded in location path)
- Simplify fixture by using unified stores API

- Update mock_stores_update fixture to use config.stores
- Remove project_name (now embedded in location path)
- Simplify fixture using unified stores API

- Add validation to prevent filepath paths starting with _hash/ or _schema/
- Update FilepathCodec docstring to clarify reserved sections
- Filepath gives users maximum freedom while protecting DataJoint-managed sections
- Users can organize files anywhere in store except reserved sections

- Test that filepath rejects paths starting with _hash/
- Test that filepath rejects paths starting with _schema/
- Test that filepath allows all other user-managed paths
- Test filepath codec properties and registration

The 'secure' parameter is only valid for S3 stores, not for file/GCS/Azure protocols. Move the default setting to protocol-specific section to avoid validation errors when using file stores.
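
A minimal sketch of what protocol-specific defaulting could look like. `apply_protocol_defaults` is a hypothetical function, and the default value chosen for `secure` is an assumption; the commit message does not state it.

```python
S3_SECURE_DEFAULT = True  # assumption: the actual default is not stated in the PR


def apply_protocol_defaults(spec: dict) -> dict:
    """Return a copy of the store spec with protocol-specific defaults filled in."""
    result = dict(spec)
    if result.get("protocol") == "s3":
        # 'secure' is only meaningful for S3, so it is defaulted only here.
        result.setdefault("secure", S3_SECURE_DEFAULT)
    return result
```

With this arrangement a file store never acquires a `secure` key, so validation of file, GCS, or Azure stores does not encounter an unexpected parameter.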

Allow users to configure custom prefixes for hash-addressed, schema-addressed, and filepath storage sections per store. This enables mapping DataJoint to existing storage layouts without restructuring.

Configuration:
- hash_prefix (default: '_hash') - Hash-addressed storage section
- schema_prefix (default: '_schema') - Schema-addressed storage section
- filepath_prefix (default: None) - Optional filepath restriction

Features:
- Validates prefixes don't overlap (mutual exclusion)
- FilepathCodec enforces dynamic reserved prefixes
- Optional filepath_prefix to restrict filepath paths
- Backwards compatible defaults

Examples:
{
  "stores": {
    "legacy": {
      "protocol": "file",
      "location": "/data/existing",
      "hash_prefix": "content_addressed",
      "schema_prefix": "structured_data",
      "filepath_prefix": "raw_files"
    }
  }
}

Changes:
- settings.py: Add prefix fields, validation logic
- builtin_codecs.py: Dynamic prefix checking in FilepathCodec
- test_settings.py: 7 new tests for prefix validation
- test_codecs.py: 2 new tests for custom prefixes

Filepath storage is NOT part of the Object-Augmented Schema - it only provides references to externally-managed files. Allow separate default configuration for filepath references vs integrated storage.

Configuration:
- stores.default - for integrated storage (<blob>, <object>, <npy>, <attach>)
- stores.filepath_default - for filepath references (<filepath>)

This allows:
- Integrated storage on S3 or fast filesystem
- Filepath references to acquisition files on NAS or different location

Example:
{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "s3",
      "bucket": "processed-data",
      "location": "lab-project"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/mnt/nas/acquisition"
    }
  }
}

Usage:
- data : <blob> # Uses stores.default (main)
- arrays : <object> # Uses stores.default (main)
- raw : <filepath> # Uses stores.filepath_default (raw_data)
- raw : <filepath@acq> # Explicitly names store (overrides default)

Changes:
- settings.py: Add use_filepath_default parameter to get_store_spec()
- builtin_codecs.py: FilepathCodec uses use_filepath_default=True
- test_settings.py: Add 3 tests for filepath_default behavior
- settings.py: Update template to include filepath_default example

Architectural rationale:
- Hash/schema storage: integrated into OAS, DataJoint manages lifecycle
- Filepath storage: references only, users manage lifecycle
- Different defaults reflect this fundamental distinction

…alidation

All 24 object storage test failures were due to test fixtures not creating the directories they configured. StorageBackend validates that file protocol locations exist, so fixtures must create them.

- conftest.py: Create test_project subdirectory in object_storage_config
- test_update1.py: Create djtest subdirectories in mock_stores_update

Test results: 520 passed, 7 skipped, 0 failures ✓

Add blank line after import statement per PEP 8 style guidelines.

- Removed hardcoded 'objects' directory level from build_object_path()
- Updated path pattern comment to reflect new structure
- Updated all test expectations to match new path format

Previous path: {schema}/{table}/objects/{key}/{file}
New path: {schema}/{table}/{key}/{file}

The 'objects' literal was a legacy remnant intended for future tabular storage alongside objects. Removing it simplifies the path structure and aligns with documented behavior.

Verified:
- All test_object.py tests pass (43 tests)
- All test_npy_codec.py tests pass (22 tests)
- All test_hash_storage.py tests pass (14 tests)
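
The new path layout can be sketched as follows. `sketch_object_path` is a hypothetical simplification and not the real build_object_path(); the name=value encoding of key attributes is inferred from the path examples in the partitioning commit messages, and details such as random tokens are omitted.

```python
def sketch_object_path(schema: str, table: str, key: dict, filename: str) -> str:
    """Build {schema}/{table}/{key}/{file} with no literal 'objects' level."""
    # Assumption: each primary-key attribute becomes one name=value component.
    key_part = "/".join(f"{name}={value}" for name, value in key.items())
    return f"{schema}/{table}/{key_part}/{filename}"


path = sketch_object_path("lab", "recording", {"mouse_id": 5}, "trace.npy")
```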

- Updated SchemaCodec._build_path() to accept store_name parameter
- _build_path() now retrieves partition_pattern and token_length from store spec
- ObjectCodec and NpyCodec encode methods pass store_name to _build_path
- Enables partitioning configuration like partition_pattern: '{mouse_id}/{session_date}'

This allows organizing storage by experimental structure:
- Without: {schema}/{table}/{mouse_id=X}/{session_date=Y}/...
- With: {mouse_id=X}/{session_date=Y}/{schema}/{table}/...

Partitioning makes storage browsable by subject/session and enables selective sync/backup of individual subjects or sessions.

The partition_pattern was not preserving the order of attributes specified in the pattern because it was iterating over a set (unordered). This caused paths like 'neuron_id=0/mouse_id=5/session_date=2017-01-05/...' instead of the expected 'mouse_id=5/session_date=2017-01-05/neuron_id=0/...'.

Changes:
- Extract partition attributes as a list to preserve order
- Keep a set for efficient lookup when filtering remaining PK attributes
- Iterate over the ordered list when building partition path components

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
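
The ordering fix can be sketched as below: an ordered list drives the partition components, and a set is kept only for membership tests. `partitioned_path` is a hypothetical function, not SchemaCodec._build_path(); placing schema and table between the partition attributes and the remaining key attributes is an assumption drawn from the "With:" layout in the partitioning commit message.

```python
import re


def partitioned_path(schema: str, table: str, key: dict, partition_pattern=None) -> str:
    """Order path components by the partition pattern, then the remaining key."""
    # Ordered list of attribute names as they appear in the pattern.
    ordered = re.findall(r"\{(\w+)\}", partition_pattern or "")
    ordered = [name for name in ordered if name in key]
    in_partition = set(ordered)  # set is used only for fast lookup, never iterated

    leading = [f"{name}={key[name]}" for name in ordered]
    trailing = [
        f"{name}={value}" for name, value in key.items() if name not in in_partition
    ]
    return "/".join(leading + [schema, table] + trailing)


key = {"neuron_id": 0, "mouse_id": 5, "session_date": "2017-01-05"}
with_pattern = partitioned_path("lab", "recording", key, "{mouse_id}/{session_date}")
without_pattern = partitioned_path("lab", "recording", key)
```

Iterating over the list rather than the set is what keeps `mouse_id` ahead of `session_date` regardless of hash ordering.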

Added helper functions for safe 0.14.6 → 2.0 migration using parallel schemas:

New functions in datajoint.migrate:
- create_parallel_schema() - Create _v20 schema copy for testing
- copy_table_data() - Copy data from production to test schema
- compare_query_results() - Validate results match between schemas
- backup_schema() - Create full schema backup before cutover
- restore_schema() - Restore from backup if needed
- verify_schema_v20() - Check if schema is 2.0 compatible

These functions support the parallel schema migration approach which:
- Keeps production untouched during testing
- Allows unlimited practice runs
- Enables side-by-side validation
- Provides easy rollback (just drop _v20 schemas)

See: datajoint-docs/src/how-to/migrate-to-v20.md

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

Added helper function for migrating external storage pointers when copying production data to _v2 schemas during git branch-based migration.

Function: migrate_external_pointers_v2()
- Converts BINARY(16) UUID → JSON metadata
- Points to existing files (no file copying required)
- Enables access to external data in _v2 test schemas
- Supports deferred external storage migration approach

Use case: When using git branch workflow (main: 0.14.6, migrate-to-v2: 2.0), this function allows copied production data to access external storage without moving the actual blob files until production cutover.

Example:
migrate_external_pointers_v2(
    schema='my_pipeline_v2',
    table='recording',
    attribute='signal',
    source_store='external-raw',
    dest_store='raw',
    copy_files=False  # Keep files in place
)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>

- Remove trailing whitespace from SQL query
- Remove unused dest_spec variable
- Fix blank line whitespace (auto-fixed by ruff)

Auto-formatted by ruff-format to collapse multi-line function calls

dimitri-yatsenko added 11 commits January 13, 2026 23:04

test: update test_update1.py fixture for unified stores

1366776


test: Add unit tests for filepath reserved section validation

7df2f97


fix: Only set 'secure' default for S3 protocol

fe29274


docs: clarify <filepath@> error message to enforce @ convention

2d7d935

github-actions bot added the enhancement and feature labels Jan 14, 2026

dimitri-yatsenko added 5 commits January 14, 2026 00:04

chore: bump version to 2.0.0a22 and apply pre-commit formatting

401bffe

test: integration tests - 496 passed, 24 object storage fixture failures

5ccf3aa

test: update summary - all 520 integration tests passing ✓

6487ae4

style: apply ruff-format to conftest.py

35a2c60


dimitri-yatsenko requested a review from d-v-b January 14, 2026 06:52

dimitri-yatsenko and others added 7 commits January 14, 2026 01:09

style: fix linting issues in migrate.py

4b0e9a8


style: apply ruff-format to builtin_codecs.py

63ecba9


dimitri-yatsenko merged commit bf62620 into pre/v2.0 Jan 14, 2026
7 of 8 checks passed

dimitri-yatsenko deleted the feature/unified-stores-config branch January 14, 2026 19:11


2 participants
