@dimitri-yatsenko

Summary

This PR implements a unified stores configuration system for DataJoint 2.0, replacing the separate external.* and object_storage.* configurations with a single stores.* configuration.

Key Changes

1. Unified Stores Configuration

  • Removed ExternalSettings and ObjectStorageSettings classes
  • Added single stores: dict[str, Any] field in Config
  • All storage (hash, schema, filepath) now uses the same store configuration

2. Configurable Storage Prefixes

  • Added hash_prefix (default: _hash) for hash-addressed storage
  • Added schema_prefix (default: _schema) for schema-addressed storage
  • Added filepath_prefix (default: null) for optional filepath restriction
  • Validation ensures prefixes don't overlap (no nesting)
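A minimal sketch of what a "no overlap, no nesting" check could look like. The function names `prefixes_overlap` and `validate_prefixes` are hypothetical and are not taken from settings.py; only the three prefix names and their defaults come from the description above. Prefixes are compared component by component, so `_hash` and `_hashes` count as distinct, while `_hash` and `_hash/sub` count as nested.

```python
from typing import Optional


def _parts(prefix: str) -> list:
    """Split a prefix into path components, ignoring empty segments."""
    return [p for p in prefix.split("/") if p]


def prefixes_overlap(a: str, b: str) -> bool:
    """Return True if one prefix equals the other or is nested inside it."""
    a_parts, b_parts = _parts(a), _parts(b)
    shorter = min(len(a_parts), len(b_parts))
    # An empty prefix has zero components and therefore overlaps everything.
    return a_parts[:shorter] == b_parts[:shorter]


def validate_prefixes(
    hash_prefix: str = "_hash",
    schema_prefix: str = "_schema",
    filepath_prefix: Optional[str] = None,
) -> None:
    """Raise ValueError if any two configured prefixes overlap (hypothetical helper)."""
    named = {"hash_prefix": hash_prefix, "schema_prefix": schema_prefix}
    if filepath_prefix is not None:
        named["filepath_prefix"] = filepath_prefix
    items = list(named.items())
    for i, (name_a, a) in enumerate(items):
        for name_b, b in items[i + 1:]:
            if prefixes_overlap(a, b):
                raise ValueError(f"{name_a}={a!r} overlaps {name_b}={b!r}")
```

Under these assumptions, `validate_prefixes()` passes with the defaults, and `validate_prefixes(filepath_prefix="_schema/raw")` raises because the filepath section would sit inside the schema section.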

3. Separate Filepath Default

  • Added stores.filepath_default separate from stores.default
  • Reflects architectural distinction: integrated storage (hash/schema) vs. filepath references
  • Filepath storage is NOT part of the Object-Augmented Schema (OAS); it only provides references to user-managed files

4. Enhanced Filepath Validation

  • FilepathCodec validates paths against configured prefixes dynamically
  • Prevents filepath from using reserved sections (hash_prefix, schema_prefix)
  • Optional filepath_prefix enforcement for user-defined restrictions
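The validation rules above might be sketched as follows. This is an illustration only: `check_filepath` is a hypothetical helper, not the actual FilepathCodec code, and it assumes paths are store-relative and use `/` as the separator.

```python
from typing import Optional


def check_filepath(
    path: str,
    hash_prefix: str = "_hash",
    schema_prefix: str = "_schema",
    filepath_prefix: Optional[str] = None,
) -> str:
    """Return the normalized relative path, or raise ValueError if it is not allowed."""
    parts = [p for p in path.split("/") if p]
    normalized = "/".join(parts)

    def under(prefix: str) -> bool:
        # Component-wise comparison: '_hashes/x' is NOT under '_hash'.
        prefix_parts = [p for p in prefix.split("/") if p]
        return parts[: len(prefix_parts)] == prefix_parts

    # Reserved sections are managed by DataJoint and are off limits to filepath.
    for name, reserved in (("hash_prefix", hash_prefix), ("schema_prefix", schema_prefix)):
        if under(reserved):
            raise ValueError(f"{normalized!r} is inside the reserved {name} section {reserved!r}")

    # The optional restriction applies only when a filepath_prefix is configured.
    if filepath_prefix is not None and not under(filepath_prefix):
        raise ValueError(f"{normalized!r} must be under filepath_prefix {filepath_prefix!r}")
    return normalized
```

With the `raw_data` store from the configuration example, `check_filepath("recordings/s1.bin", filepath_prefix="recordings")` is accepted, while `check_filepath("_hash/ab/cd")` is rejected regardless of `filepath_prefix`.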

Configuration Example

{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "file",
      "location": "/data/fast-storage",
      "hash_prefix": "_hash",
      "schema_prefix": "_schema"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/data/acquisition",
      "filepath_prefix": "recordings"
    }
  }
}

Breaking Changes

None. This is pre-production 2.0 work, so there are no backward-compatibility concerns within 2.0 itself.

Testing

  • 54 configuration tests passing
  • 8 filepath codec tests with prefix validation
  • Full integration test suite passing

Related Documentation

  • Comprehensive specification in datajoint-docs: reference/specs/object-store-configuration.md
  • Updated how-to guide: how-to/configure-storage.md

🤖 Generated with Claude Code

- Remove legacy ExternalSettings and ObjectStorageSettings classes
- Update Config to use stores.default and stores.<name> structure
- Update get_store_spec() to support default store (store=None)
- Add partition_pattern and token_length support for stores
- Update secrets loading to support per-store credentials
- Update save_template() to generate unified stores config
- Update _update_from_flat_dict() to handle stores.<name>.<attr> pattern
- Remove external.* from ENV_VAR_MAPPING

Unified stores configuration supports both:
- Hash-addressed storage (<blob@>, <attach@>) via _hash section
- Schema-addressed storage (<object@>, <npy@>) via _schema section

Configuration structure:
stores.default - name of default store
stores.<name>.protocol - storage protocol (file, s3, gcs, azure)
stores.<name>.location - base path (includes project context)
stores.<name>.partition_pattern - schema-addressed partitioning
stores.<name>.token_length - random token length
stores.<name>.subfolding - hash-addressed subfolding
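To illustrate the `stores.<name>.<attr>` key pattern listed above, here is a rough sketch of folding flat dotted keys into the nested `stores` dict. The helper `nest_store_keys` is hypothetical; it is not the actual `_update_from_flat_dict()` implementation, and it simply ignores keys outside the `stores.` namespace.

```python
from typing import Any


def nest_store_keys(flat: dict[str, Any]) -> dict[str, Any]:
    """Fold 'stores.default' and 'stores.<name>.<attr>' keys into a nested dict."""
    stores: dict[str, Any] = {}
    for key, value in flat.items():
        parts = key.split(".")
        if parts[0] != "stores":
            continue  # not a store setting; out of scope for this sketch
        if len(parts) == 2:
            # e.g. 'stores.default' -> top-level entry naming the default store
            stores[parts[1]] = value
        elif len(parts) == 3:
            # e.g. 'stores.main.protocol' -> attribute of the store named 'main'
            stores.setdefault(parts[1], {})[parts[2]] = value
        else:
            raise KeyError(f"unsupported stores key: {key!r}")
    return stores


flat = {
    "stores.default": "main",
    "stores.main.protocol": "file",
    "stores.main.location": "/data/fast-storage",
}
nested = nest_store_keys(flat)
# nested == {"default": "main",
#            "main": {"protocol": "file", "location": "/data/fast-storage"}}
```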
- Update hash_registry.py to use get_store_spec() instead of get_object_store_spec()
- Update staged_insert.py to use get_store_spec() for default store
- Update error messages to reference new stores configuration
- Remove references to object_storage.default_store (now stores.default)
- Replace config.external tests with stores credential tests
- Update template test to check for stores structure instead of object_storage
- Update get_store_spec tests for new default behavior (None instead of DEFAULT_SUBFOLDING)
- Add tests for default store lookup (store=None)
- Add tests for loading per-store credentials from .secrets/
- Verify partition_pattern and token_length defaults
- Update mock_stores fixture to use config.stores instead of config.object_storage
- Update mock_object_storage fixture to configure stores.default and stores.local
- Remove project_name from object_storage_config (now embedded in location path)
- Simplify fixture by using unified stores API
- Update mock_stores_update fixture to use config.stores
- Remove project_name (now embedded in location path)
- Simplify fixture using unified stores API
- Add validation to prevent filepath paths starting with _hash/ or _schema/
- Update FilepathCodec docstring to clarify reserved sections
- Filepath gives users maximum freedom while protecting DataJoint-managed sections
- Users can organize files anywhere in store except reserved sections
- Test that filepath rejects paths starting with _hash/
- Test that filepath rejects paths starting with _schema/
- Test that filepath allows all other user-managed paths
- Test filepath codec properties and registration

The 'secure' parameter is only valid for S3 stores, not for file/GCS/Azure
protocols. Move the default setting to the protocol-specific section to avoid
validation errors when using file stores.

Allow users to configure custom prefixes for hash-addressed, schema-addressed,
and filepath storage sections per store. This enables mapping DataJoint to
existing storage layouts without restructuring.

Configuration:
- hash_prefix (default: '_hash') - Hash-addressed storage section
- schema_prefix (default: '_schema') - Schema-addressed storage section
- filepath_prefix (default: None) - Optional filepath restriction

Features:
- Validates prefixes don't overlap (mutual exclusion)
- FilepathCodec enforces dynamic reserved prefixes
- Optional filepath_prefix to restrict filepath paths
- Backwards compatible defaults

Examples:
{
  "stores": {
    "legacy": {
      "protocol": "file",
      "location": "/data/existing",
      "hash_prefix": "content_addressed",
      "schema_prefix": "structured_data",
      "filepath_prefix": "raw_files"
    }
  }
}

Changes:
- settings.py: Add prefix fields, validation logic
- builtin_codecs.py: Dynamic prefix checking in FilepathCodec
- test_settings.py: 7 new tests for prefix validation
- test_codecs.py: 2 new tests for custom prefixes

Filepath storage is NOT part of the Object-Augmented Schema - it only
provides references to externally-managed files. Allow separate default
configuration for filepath references vs integrated storage.

Configuration:
- stores.default - for integrated storage (<blob>, <object>, <npy>, <attach>)
- stores.filepath_default - for filepath references (<filepath>)

This allows:
- Integrated storage on S3 or fast filesystem
- Filepath references to acquisition files on NAS or different location

Example:
{
  "stores": {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {
      "protocol": "s3",
      "bucket": "processed-data",
      "location": "lab-project"
    },
    "raw_data": {
      "protocol": "file",
      "location": "/mnt/nas/acquisition"
    }
  }
}

Usage:
- data : <blob>        # Uses stores.default (main)
- arrays : <object>    # Uses stores.default (main)
- raw : <filepath>     # Uses stores.filepath_default (raw_data)
- raw : <filepath@acq> # Explicitly names store (overrides default)
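The lookup order implied by the usage lines above could be sketched like this. `resolve_store_name` is a hypothetical stand-in, not the real `get_store_spec()`; only the `use_filepath_default` parameter name and the `stores.default` / `stores.filepath_default` keys come from this PR. Raising when `stores.filepath_default` is unset (rather than falling back to `stores.default`) is an assumption of this sketch.

```python
from typing import Any, Optional


def resolve_store_name(
    stores: dict[str, Any],
    store: Optional[str] = None,
    use_filepath_default: bool = False,
) -> str:
    """Pick the store name: an explicit name wins, otherwise the relevant default."""
    if store is not None:
        name = store  # e.g. <filepath@acq> names the store explicitly
    else:
        key = "filepath_default" if use_filepath_default else "default"
        name = stores.get(key)
        if name is None:
            raise KeyError(f"stores.{key} is not configured")
    if not isinstance(stores.get(name), dict):
        raise KeyError(f"store {name!r} is not configured")
    return name


stores = {
    "default": "main",
    "filepath_default": "raw_data",
    "main": {"protocol": "s3", "bucket": "processed-data", "location": "lab-project"},
    "raw_data": {"protocol": "file", "location": "/mnt/nas/acquisition"},
}
blob_store = resolve_store_name(stores)                             # "main"
raw_store = resolve_store_name(stores, use_filepath_default=True)   # "raw_data"
```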

Changes:
- settings.py: Add use_filepath_default parameter to get_store_spec()
- builtin_codecs.py: FilepathCodec uses use_filepath_default=True
- test_settings.py: Add 3 tests for filepath_default behavior
- settings.py: Update template to include filepath_default example

Architectural rationale:
- Hash/schema storage: integrated into OAS, DataJoint manages lifecycle
- Filepath storage: references only, users manage lifecycle
- Different defaults reflect this fundamental distinction
@github-actions github-actions bot added enhancement Indicates new improvements feature Indicates new features labels Jan 14, 2026
…alidation

All 24 object storage test failures were due to test fixtures not creating
the directories they configured. StorageBackend validates that file protocol
locations exist, so fixtures must create them.

- conftest.py: Create test_project subdirectory in object_storage_config
- test_update1.py: Create djtest subdirectories in mock_stores_update
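As a rough illustration of the fixture fix, the directory is created before the store configuration that points at it is returned. `make_file_store_config` is a hypothetical helper, not the code in conftest.py; the store name `local` and the `test_project` subdirectory echo names mentioned in this PR.

```python
import tempfile
from pathlib import Path


def make_file_store_config(root: Path, name: str, subdir: str) -> dict:
    """Create the store directory first, then return a stores config that uses it."""
    location = root / subdir
    # A file-protocol location must exist before the backend validates it.
    location.mkdir(parents=True, exist_ok=True)
    return {
        "default": name,
        name: {"protocol": "file", "location": str(location)},
    }


with tempfile.TemporaryDirectory() as tmp:
    config = make_file_store_config(Path(tmp), "local", "test_project")
    location_exists = Path(config["local"]["location"]).is_dir()
```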

Test results: 520 passed, 7 skipped, 0 failures ✓

Add blank line after import statement per PEP 8 style guidelines.
dimitri-yatsenko and others added 7 commits January 14, 2026 01:09
- Removed hardcoded 'objects' directory level from build_object_path()
- Updated path pattern comment to reflect new structure
- Updated all test expectations to match new path format

Previous path: {schema}/{table}/objects/{key}/{file}
New path: {schema}/{table}/{key}/{file}

The 'objects' literal was a legacy remnant intended for future tabular
storage alongside objects. Removing it simplifies the path structure
and aligns with documented behavior.

Verified:
- All test_object.py tests pass (43 tests)
- All test_npy_codec.py tests pass (22 tests)
- All test_hash_storage.py tests pass (14 tests)
- Updated SchemaCodec._build_path() to accept store_name parameter
- _build_path() now retrieves partition_pattern and token_length from store spec
- ObjectCodec and NpyCodec encode methods pass store_name to _build_path
- Enables partitioning configuration like partition_pattern: '{mouse_id}/{session_date}'

This allows organizing storage by experimental structure:
- Without: {schema}/{table}/{mouse_id=X}/{session_date=Y}/...
- With: {mouse_id=X}/{session_date=Y}/{schema}/{table}/...

Partitioning makes storage browsable by subject/session and enables
selective sync/backup of individual subjects or sessions.

The partition_pattern was not preserving the order of attributes specified
in the pattern because it was iterating over a set (unordered). This caused
paths like 'neuron_id=0/mouse_id=5/session_date=2017-01-05/...' instead of
the expected 'mouse_id=5/session_date=2017-01-05/neuron_id=0/...'.

Changes:
- Extract partition attributes as a list to preserve order
- Keep a set for efficient lookup when filtering remaining PK attributes
- Iterate over the ordered list when building partition path components

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
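The ordering fix described in the commit above can be sketched as follows: an ordered list drives the partition components, and a set is kept only for membership tests. `build_partitioned_path` is a hypothetical function, not `SchemaCodec._build_path()`. It follows the `{mouse_id=X}/{session_date=Y}/{schema}/{table}/...` layout shown earlier, and placing the remaining key attributes after the table is an assumption of this sketch.

```python
import re


def build_partitioned_path(pattern: str, key: dict, schema: str, table: str) -> str:
    """Build a storage path whose leading components follow the pattern's order."""
    # Ordered list: attribute names exactly as they appear in the pattern.
    partition_attrs = re.findall(r"\{(\w+)\}", pattern)
    # Set: used only for fast lookup when filtering the remaining key attributes.
    partition_set = set(partition_attrs)

    parts = [f"{attr}={key[attr]}" for attr in partition_attrs]
    parts += [schema, table]
    parts += [f"{attr}={value}" for attr, value in key.items() if attr not in partition_set]
    return "/".join(parts)


key = {"neuron_id": 0, "mouse_id": 5, "session_date": "2017-01-05"}
path = build_partitioned_path("{mouse_id}/{session_date}", key, "lab", "recording")
# path == "mouse_id=5/session_date=2017-01-05/lab/recording/neuron_id=0"
```

Iterating over `partition_set` instead of `partition_attrs` would make the component order depend on set iteration order, which is the bug the commit describes.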
Added helper functions for safe 0.14.6 → 2.0 migration using parallel schemas:

New functions in datajoint.migrate:
- create_parallel_schema() - Create _v20 schema copy for testing
- copy_table_data() - Copy data from production to test schema
- compare_query_results() - Validate results match between schemas
- backup_schema() - Create full schema backup before cutover
- restore_schema() - Restore from backup if needed
- verify_schema_v20() - Check if schema is 2.0 compatible

These functions support the parallel schema migration approach which:
- Keeps production untouched during testing
- Allows unlimited practice runs
- Enables side-by-side validation
- Provides easy rollback (just drop _v20 schemas)

See: datajoint-docs/src/how-to/migrate-to-v20.md

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Added helper function for migrating external storage pointers when copying
production data to _v2 schemas during git branch-based migration.

Function: migrate_external_pointers_v2()
- Converts BINARY(16) UUID → JSON metadata
- Points to existing files (no file copying required)
- Enables access to external data in _v2 test schemas
- Supports deferred external storage migration approach

Use case:
When using git branch workflow (main: 0.14.6, migrate-to-v2: 2.0), this
function allows copied production data to access external storage without
moving the actual blob files until production cutover.

Example:
  migrate_external_pointers_v2(
      schema='my_pipeline_v2',
      table='recording',
      attribute='signal',
      source_store='external-raw',
      dest_store='raw',
      copy_files=False  # Keep files in place
  )

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Remove trailing whitespace from SQL query
- Remove unused dest_spec variable
- Fix blank line whitespace (auto-fixed by ruff)

Auto-formatted by ruff-format to collapse multi-line function calls
@dimitri-yatsenko dimitri-yatsenko merged commit bf62620 into pre/v2.0 Jan 14, 2026
7 of 8 checks passed
@dimitri-yatsenko dimitri-yatsenko deleted the feature/unified-stores-config branch January 14, 2026 19:11