mulgadc/predastore

Predastore

Predastore, developed by Mulga Defense Corporation, is a distributed, S3-compatible object storage system with Reed-Solomon erasure coding, built for bare-metal, edge, and on-premise deployments. It is the storage backend for Spinifex, an AWS-compatible infrastructure stack for private clouds.

Predastore runs as a distributed cluster with erasure-coded shards, Raft-consensus metadata, and QUIC-based inter-node transport. For development, all nodes run in a single process on loopback.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     S3 Client (AWS CLI/SDK)                     │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                   Predastore S3D (HTTP/TLS)                     │
│         Auth (SigV4)  ·  Routing  ·  Backend Abstraction        │
└────────┬───────────────────────────────────┬────────────────────┘
         │                                   │
         ▼                                   ▼
┌─────────────────────┐       ┌──────────────────────────────────┐
│  s3db Cluster       │       │  QUIC Shard Nodes                │
│  (Raft Consensus)   │       │  ┌────────┐┌────────┐┌────────┐  │
│                     │       │  │ Node 0 ││ Node 1 ││ Node 2 │  │
│  BoltDB (Raft log)  │       │  │ Store  ││ Store  ││ Store  │  │
│  BadgerDB (FSM)     │       │  │(seg+ix)││(seg+ix)││(seg+ix)│  │
└─────────────────────┘       │  └────────┘└────────┘└────────┘  │
                              └──────────────────────────────────┘

S3D serves the S3 HTTP API with AWS Signature V4 authentication. The s3db cluster provides strongly consistent metadata via Raft (HashiCorp Raft + BoltDB + BadgerDB). QUIC shard nodes store erasure-coded object data in append-only segment files, with each shard occupying a contiguous extent indexed by a per-node BadgerDB. Inter-node communication uses persistent QUIC connections with pooled, multiplexed streams, which eliminates per-request TLS handshakes.

See DESIGN.md for the full architecture reference, including the data model, QUIC protocol format, Raft consensus details, hash ring placement, and failure handling.

Key Design Decisions

  • Reed-Solomon erasure coding: objects are split into data + parity shards (configurable, e.g. RS(3,2) tolerates loss of any 2 nodes). No full replication overhead.
  • Raft consensus for metadata: bucket and object metadata is strongly consistent across the cluster. Reads can go to any node; writes go through the leader.
  • QUIC transport: node-to-node shard I/O uses QUIC over UDP with connection pooling. A single long-lived connection per node pair carries multiplexed streams, so shard writes cost only a stream ID allocation, not a TLS handshake.
  • Append-only segments: each shard node writes data to large append-only segment files. A shard occupies a contiguous extent within one segment, pre-allocated to enable lock-free writing to disk. A per-node BadgerDB index maps shard keys to extents.
  • AES-256-GCM encryption at rest: every 8 KiB fragment is sealed under a per-fragment GCM nonce with AAD binding it to its (objectHash, shardIndex, shardNum, fragNum) position, so tampering, replay, and cross-shard splice attempts fail to authenticate. GCM is the sole on-disk integrity authority (no separate CRC). A 32-byte cluster master key is loaded from a 0600 file path supplied via -encryption-key-file / ENCRYPTION_KEY_FILE.
  • Consistent hash ring: shard placement is deterministic via a hash ring with virtual nodes. Adding nodes bumps a ring epoch; old objects stay on the old epoch, new writes use the new one.
  • Single binary: ./bin/s3d runs one cluster node (S3 API server + Raft database + QUIC shard node). A cluster is N s3d processes pointed at the same config; ./scripts/start.sh launches all of them locally on loopback aliases for development.
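
To make the consistent hash ring bullet concrete, here is a minimal sketch of placement with virtual nodes. It is not Predastore's implementation: the FNV-1a hash, the node names, the vnode count, and the `ring`, `newRing`, and `place` names are all assumptions chosen for illustration, and ring epochs are not modeled. See DESIGN.md for the real scheme.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ring is a toy consistent hash ring. Each physical node is hashed onto the
// ring several times ("virtual nodes") to even out the key distribution.
type ring struct {
	points []uint32          // sorted virtual-node positions
	owner  map[uint32]string // position -> physical node
}

// hash32 uses FNV-1a purely for illustration (an assumption, not the real hash).
func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("%s#%d", n, v))
			if _, taken := r.owner[p]; taken {
				continue // rare position collision: keep the first owner
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// place walks clockwise from the key's position and returns the first
// `count` distinct physical nodes, one per shard.
func (r *ring) place(key string, count int) []string {
	h := hash32(key)
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	seen := make(map[string]bool)
	var out []string
	for i := 0; i < len(r.points) && len(out) < count; i++ {
		n := r.owner[r.points[(start+i)%len(r.points)]]
		if !seen[n] {
			seen[n] = true
			out = append(out, n)
		}
	}
	return out
}

func main() {
	r := newRing([]string{"node0", "node1", "node2", "node3", "node4"}, 64)
	// RS(3,2) needs 5 shards, each on a different node.
	fmt.Println(r.place("my-bucket/file.txt", 5))
}
```

Because placement depends only on the key and the set of ring positions, any node can compute where a shard lives without consulting a lookup table. Changing the node set changes the positions, which is why a real system needs something like the ring epoch described above to keep old objects readable.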

S3 API Compatibility

Predastore implements key S3 operations compatible with AWS CLI, SDKs, and existing S3 tools:

| Category | Operations |
|---|---|
| Buckets | CreateBucket, DeleteBucket, ListBuckets, HeadBucket |
| Objects | PutObject, GetObject, DeleteObject, HeadObject, ListObjects/V2 |
| Multipart | InitiateMultipartUpload, UploadPart, CompleteMultipartUpload |
| Auth | AWS Signature V4 |

Quick Start

Build

make build              # builds ./bin/s3d (also generates dev TLS certs)

Run a Dev Cluster

The ./scripts/ directory contains helpers for running a multi-node cluster locally on loopback IP aliases. This is the recommended way to exercise the distributed code paths in development:

./scripts/start.sh 3node        # launch a 3-node cluster
./scripts/start.sh -w 5node     # launch a 5-node cluster, wait until ready
./scripts/stop.sh               # stop all running clusters
./scripts/clean.sh              # stop and wipe cluster data
./scripts/bench.sh 3node        # run warp benchmark against a cluster
./scripts/bench.sh disk         # run raw-disk fio benchmark

Cluster runtime data (logs, PID files, segment files, BadgerDB indexes) lives under $PREDA_DIR (default /tmp/predastore/<clustername>/). The start script sets up loopback IP aliases (requires sudo) and generates TLS certs on first run.

Run a Single Node

./bin/s3d is a single-node process. Invoke it directly to run one node of a cluster (e.g. on a dedicated host in production, or to inspect one node in isolation):

./bin/s3d \
  --config config/3node.toml \
  --node 1 \
  --host 10.11.12.1 \
  --port 8443 \
  --base-path /tmp/predastore/3node \
  --tls-key /tmp/predastore/3node/server.key \
  --tls-cert /tmp/predastore/3node/server.pem \
  --encryption-key-file /tmp/predastore/3node/master.key

The encryption key file must be exactly 32 raw bytes (no base64, no header) with mode 0600; group/other-readable keys are rejected outright. Generate one with ( umask 0177 && openssl rand -out master.key 32 ). The same key must be supplied to every node in a cluster; rotating it is not supported (see Roadmap → envelope encryption).
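
The two key-file rules above (exactly 32 raw bytes, no group/other access) can be checked with a few lines of Go. This is a sketch of such a check, not Predastore's actual loader: `loadMasterKey` and `tempKeyFile` are hypothetical names, and the exact permission test used by s3d may differ.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"os"
)

const keySize = 32 // AES-256 key length in bytes

// loadMasterKey is a hypothetical helper that enforces the documented rules:
// no group/other permission bits, and exactly 32 raw bytes.
func loadMasterKey(path string) ([]byte, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if perm := info.Mode().Perm(); perm&0o077 != 0 {
		return nil, fmt.Errorf("%s: mode %04o grants group/other access, want 0600", path, uint32(perm))
	}
	key, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(key) != keySize {
		return nil, fmt.Errorf("%s: got %d bytes, want exactly %d", path, len(key), keySize)
	}
	return key, nil
}

// tempKeyFile writes `size` random bytes to a temp file with the given mode.
// It exists only so the example is self-contained.
func tempKeyFile(size int, perm os.FileMode) (string, error) {
	f, err := os.CreateTemp("", "master-*.key")
	if err != nil {
		return "", err
	}
	defer f.Close()
	buf := make([]byte, size)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	if _, err := f.Write(buf); err != nil {
		return "", err
	}
	return f.Name(), f.Chmod(perm)
}

func main() {
	path, err := tempKeyFile(keySize, 0o600)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)
	key, err := loadMasterKey(path)
	fmt.Println(len(key), err)
}
```

A base64-encoded key fails the length check, since 32 bytes encode to 44 characters, which is one reason the file must hold raw bytes.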

Configuration

Cluster configurations live under config/ as TOML files, one per topology:

config/
  3node.toml    # 3 db + 3 storage nodes
  5node.toml    # 5 db + 5 storage nodes
  7node.toml    # 7 db + 7 storage nodes

Each config defines [[db]] and [[storage]] sections specifying node IDs, hosts, ports, and Reed-Solomon parameters.

TLS certificates are generated on first build. To generate them explicitly:

make certs              # Generate certs/server.{pem,key}

Standalone TLS Trust

The QUIC inter-node transport and the s3db REST client now verify the server certificate: InsecureSkipVerify is gone from both the production code path and the test fixtures (tests inject an ephemeral CA via quicclient.SetDefaultRootCAs). Standalone operators must install the cluster CA into the host trust store before launching s3d, otherwise nodes cannot dial each other:

# Debian / Ubuntu
sudo cp cluster-ca.pem /usr/local/share/ca-certificates/predastore-cluster-ca.crt
sudo update-ca-certificates

# RHEL / Fedora / Amazon Linux
sudo cp cluster-ca.pem /etc/pki/ca-trust/source/anchors/predastore-cluster-ca.pem
sudo update-ca-trust

When Predastore is deployed by Spinifex, the cluster CA is installed into the host trust store automatically as part of node bootstrap, so no manual action is required.
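
The following self-contained sketch illustrates why a missing cluster CA stops nodes from dialing each other. It creates a throwaway CA and a node certificate in memory, then verifies the node certificate against a root pool with and without that CA. Nothing here touches the host trust store or Predastore's code; `newCert`, `poolWith`, and `trusts` are hypothetical helpers, and the IP address is simply the one used in the examples in this README.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newCert issues a short-lived demo certificate. With parent == nil the
// certificate is self-signed, which is how the throwaway CA is created.
func newCert(cn string, serial int64, isCA bool, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		tmpl.KeyUsage = x509.KeyUsageCertSign
	} else {
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}
		tmpl.IPAddresses = []net.IP{net.ParseIP("10.11.12.1")}
	}
	if parent == nil {
		parent, parentKey = tmpl, key // self-signed
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// poolWith returns a root pool containing exactly the given certificates.
func poolWith(certs ...*x509.Certificate) *x509.CertPool {
	pool := x509.NewCertPool()
	for _, c := range certs {
		pool.AddCert(c)
	}
	return pool
}

// trusts reports whether leaf chains to a root in pool and is valid for host.
func trusts(pool *x509.CertPool, leaf *x509.Certificate, host string) bool {
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool, DNSName: host})
	return err == nil
}

func main() {
	ca, caKey, err := newCert("demo cluster CA", 1, true, nil, nil)
	if err != nil {
		fmt.Println("CA setup failed:", err)
		return
	}
	leaf, _, err := newCert("node1", 2, false, ca, caKey)
	if err != nil {
		fmt.Println("leaf setup failed:", err)
		return
	}
	fmt.Println("roots without cluster CA:", trusts(poolWith(), leaf, "10.11.12.1"))
	fmt.Println("roots with cluster CA:   ", trusts(poolWith(ca), leaf, "10.11.12.1"))
}
```

Installing the CA with update-ca-certificates or update-ca-trust plays the role of `poolWith(ca)` here: it puts the cluster CA among the roots the verifier consults.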

AWS CLI Examples

# Create a bucket
aws --endpoint-url https://10.11.12.1:8443/ s3 mb s3://my-bucket

# Upload a file
aws --endpoint-url https://10.11.12.1:8443/ s3 cp ./file.txt s3://my-bucket/

# List bucket contents
aws --endpoint-url https://10.11.12.1:8443/ s3 ls s3://my-bucket/

# Download a file
aws --endpoint-url https://10.11.12.1:8443/ s3 cp s3://my-bucket/file.txt ./downloaded.txt

Storage Backend

The storage backend is distributed, combining erasure coding, Raft-consensus metadata, and QUIC transport. The simplest way to bring up a cluster locally:

./scripts/start.sh -w 3node     # 3-node cluster on loopback aliases

The distributed backend's data model:

| Unit | Size | Description |
|---|---|---|
| Object | arbitrary | RS-encoded end-to-end into K data + M parity shards |
| Shard | ⌈object_size / K⌉ | Per-node RS slice; occupies a contiguous extent |
| Fragment | 32 B header + 8 KiB body + 16 B GCM tag = 8240 B | On-disk unit; AES-256-GCM seals body with AAD bound to (objectHash, shardIndex, shardNum, fragNum) |
| Segment file | up to 4 GiB | Append-only container holding extents from one or more shards |
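
As a worked example of the sizes in the table, the sketch below computes the shard size, fragment count, and on-disk footprint for a 1 MiB object under RS(3,2). One point is an assumption rather than something this README states: it treats every fragment, including the last one of a shard, as occupying the full 8240 B on disk. The `plan` function and `layout` type are hypothetical names.

```go
package main

import "fmt"

const (
	fragHeader = 32                                // bytes
	fragBody   = 8 * 1024                          // 8 KiB
	fragTag    = 16                                // GCM tag, bytes
	fragOnDisk = fragHeader + fragBody + fragTag   // 8240 bytes
)

// layout summarises how one object maps onto shards and fragments.
type layout struct {
	ShardSize     int64 // ceil(object_size / K)
	FragsPerShard int64 // ceil(shard_size / 8 KiB)
	ShardOnDisk   int64 // bytes per shard, ASSUMING full-size final fragment
	TotalOnDisk   int64 // bytes across all K+M shards
}

func ceilDiv(a, b int64) int64 { return (a + b - 1) / b }

// plan applies the table's formulas for an RS(k, m) configuration.
func plan(objectSize, k, m int64) layout {
	shard := ceilDiv(objectSize, k)
	frags := ceilDiv(shard, fragBody)
	onDisk := frags * fragOnDisk
	return layout{
		ShardSize:     shard,
		FragsPerShard: frags,
		ShardOnDisk:   onDisk,
		TotalOnDisk:   onDisk * (k + m),
	}
}

func main() {
	l := plan(1<<20, 3, 2) // 1 MiB object, RS(3,2)
	fmt.Printf("shard=%d B, fragments/shard=%d, on disk per shard=%d B, total=%d B\n",
		l.ShardSize, l.FragsPerShard, l.ShardOnDisk, l.TotalOnDisk)
}
```

For RS(3,2) the raw expansion is (K+M)/K = 5/3, roughly 1.67x before fragment headers and tags, whereas tolerating two lost copies with plain replication would require 3x.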

See DESIGN.md for full configuration reference, including database node setup, shard node setup, RS tuning, and deployment modes.
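
The fragment row of the data model can be illustrated with Go's standard AES-GCM implementation. The sketch below seals an 8 KiB body with additional authenticated data (AAD) built from (objectHash, shardIndex, shardNum, fragNum), then shows that opening it under a different position fails. Several details are assumptions, not taken from Predastore: the AAD byte layout, the use of a random nonce, using the master key directly as the AES key, and the names `fragmentAAD`, `newAEAD`, `seal`, and `open`.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// fragmentAAD serialises a fragment's position. The field order and the
// 32-byte hash / uint32 widths are assumptions made for this sketch.
func fragmentAAD(objectHash [32]byte, shardIndex, shardNum, fragNum uint32) []byte {
	aad := make([]byte, 32+12)
	copy(aad, objectHash[:])
	binary.BigEndian.PutUint32(aad[32:], shardIndex)
	binary.BigEndian.PutUint32(aad[36:], shardNum)
	binary.BigEndian.PutUint32(aad[40:], fragNum)
	return aad
}

// newAEAD builds AES-256-GCM from a 32-byte key.
func newAEAD(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts body and returns the nonce plus ciphertext||tag.
// A random nonce is an assumption; the real nonce scheme may differ.
func seal(aead cipher.AEAD, body, aad []byte) (nonce, sealed []byte, err error) {
	nonce = make([]byte, aead.NonceSize())
	if _, err = rand.Read(nonce); err != nil {
		return nil, nil, err
	}
	return nonce, aead.Seal(nil, nonce, body, aad), nil
}

// open authenticates and decrypts; it fails if the data or AAD changed.
func open(aead cipher.AEAD, nonce, sealed, aad []byte) ([]byte, error) {
	return aead.Open(nil, nonce, sealed, aad)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		fmt.Println("key generation failed:", err)
		return
	}
	aead, err := newAEAD(key)
	if err != nil {
		fmt.Println("cipher setup failed:", err)
		return
	}
	var objectHash [32]byte
	body := bytes.Repeat([]byte{0xAB}, 8*1024)

	aad := fragmentAAD(objectHash, 0, 5, 0)
	nonce, sealed, err := seal(aead, body, aad)
	if err != nil {
		fmt.Println("seal failed:", err)
		return
	}
	fmt.Println("body + tag bytes:", len(sealed))

	plain, err := open(aead, nonce, sealed, aad)
	fmt.Println("same position ok:", err == nil && bytes.Equal(plain, body))

	// Pretend the fragment was copied to position fragNum = 1.
	_, err = open(aead, nonce, sealed, fragmentAAD(objectHash, 0, 5, 1))
	fmt.Println("moved fragment rejected:", err != nil)
}
```

The sealed output is the 8 KiB body plus the 16 B tag; adding the 32 B header from the table gives the 8240 B on-disk fragment. Because the position is part of the authenticated data, a fragment copied to another slot fails authentication even though its bytes are untouched.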

Spinifex Integration

Predastore is the default S3 storage provider for Spinifex. When running as part of the Spinifex stack, Predastore integrates via NATS messaging and provides storage for:

  • EC2 AMI images: machine images for VM launches
  • EBS volume snapshots: via Viperblock, which uses Predastore as its S3-compatible backend
  • User data: cloud-init configurations and system artifacts

Predastore subscribes to NATS topics (s3.putobject, s3.getobject, s3.createbucket, etc.) for seamless integration with the rest of the Spinifex control plane.

Development

make build            # Build s3d binary (also generates TLS certs)
make certs            # Generate dev TLS certs
make test             # Run tests
make preflight        # Full CI checks (lint, govulncheck, tests, race detector)
make clean            # Clean build artifacts

Docker

make docker_s3d           # Build Docker image
make docker_compose_up    # Start with docker-compose
make docker_compose_down  # Stop services

Performance Tuning

For distributed mode, increase system socket buffers for QUIC:

sudo sysctl -w net.core.rmem_max=7500000
sudo sysctl -w net.core.wmem_max=7500000

Roadmap

  • S3 API core (buckets, objects, multipart)
  • AWS Signature V4 authentication
  • Distributed storage with Reed-Solomon erasure coding
  • Raft-consensus metadata (s3db)
  • QUIC transport with connection pooling
  • Consistent hash ring placement
  • AES-256-GCM encryption at rest (single cluster-wide master key)
  • Envelope encryption (master key rotation, per-bucket / per-tenant keys)
  • Gossip-based node discovery
  • Segment compaction and garbage collection
  • Automatic shard rebalancing
  • Background read-repair
  • Bucket versioning
  • Lifecycle policies

License

Apache 2.0 License. See LICENSE for details.

About

PredaStore is a high-performance, on-premise and edge-ready object storage platform, fully compatible with the Amazon S3 API. Designed for environments where speed, redundancy, and resiliency are critical, PredaStore is the ideal solution for edge data centers, private clouds, and hybrid deployments that demand low latency and high availability.
