Engineering guide for turning the Caselaw Access Project (CAP) corpus + Qwen-3-14B into a continuously improving legal LLM with fully deterministic rewards, using GRPO (Group Relative Policy Optimization).
```
cap-rlvr/
├── README.md # This file
├── CLAUDE.md # Project instructions and SSH setup
├── requirements.txt # Python dependencies for gym environments
├── test_gym_envs.py # Comprehensive test suite for all environments
├── docs/
│ └── cap_rlvr_grpo_plan.md # Comprehensive implementation plan
├── scripts/
│ ├── vast_setup.sh # Remote system setup script
│ ├── prep_utils.py # Shared utilities for data preparation
│ ├── prep_holding_task.py # Generate holding selection tasks
│ ├── prep_bluebook_task.py # Generate citation format tasks
│ ├── prep_summarise_task.py # Generate IRAC summarization tasks
│ ├── prep_retrieval_task.py # Generate case retrieval tasks (original)
│ ├── prep_retrieval_task_streaming.py # Memory-optimized streaming retrieval (SQLite-based)
│ ├── prep_entail_task.py # Generate case relationship tasks
│ ├── build_faiss.py # Build FAISS index for retrieval evaluation
│ ├── format_for_sft.py # Format task data for SFT training (TRL-compatible)
│ ├── migrate_to_lambda.py # Automated Vast.ai -> Lambda Labs migration
│ ├── prep_grpo_dataset.py # Generate GRPO training datasets with scored responses
│ ├── train_grpo.py # Complete GRPO training implementation with eval-only mode
│ ├── validate_stage_progression.py # Validate reward thresholds for stage progression
│ ├── orchestrate_grpo_training.py # Automated multi-stage training pipeline
│ ├── reward_holding.py # Reward function for holding selection
│ ├── reward_bluebook.py # Reward function for citation completion
│ ├── reward_irac.py # Reward function for IRAC summarization
│ ├── reward_retrieval.py # Reward function for case retrieval
│ ├── reward_entail.py # Reward function for relationship classification
│ └── rewards.py # Unified reward interface for all tasks
├── envs/ # OpenAI Gym environments
│ ├── __init__.py # Environment package initialization
│ ├── base_env.py # Base environment class (BaseCapRLVREnv)
│ ├── holding_env.py # Holding selection environment
│ ├── bluebook_env.py # Bluebook citation environment
│ ├── summarise_env.py # IRAC summary environment
│ ├── retrieval_env.py # Case retrieval environment
│ ├── entail_env.py # Entailment environment
│ └── README.md # Environment documentation and usage
├── downloads/
│ ├── cli_download.py # Robust CLI-based dataset download
│ ├── robust_download.py # Alternative download with retry logic
│ └── streaming_download.py # Memory-efficient streaming download
└── cap_rlvr_env/ # Virtual environment with gym dependencies
```

- Install Gym Environment Dependencies:

  ```bash
  python -m venv cap_rlvr_env
  source cap_rlvr_env/bin/activate
  pip install -r requirements.txt
  ```

- Test Gym Environments:

  ```bash
  python test_gym_envs.py  # Expected: 5/5 environments passed testing
  ```

- Use Individual Environments:

  ```python
  from envs import HoldingSelectionEnv, BluebookCitationEnv

  # Create environment
  env = HoldingSelectionEnv(subset_size=100)
  obs = env.reset()
  reward = env.step("A")[1]  # Model chooses option A
  ```
- Setup Environment: Use `scripts/vast_setup.sh` on a remote CPU instance with sufficient storage (80GB+ for the CAP dataset)
- Download Dataset: Run `downloads/cli_download.py` for robust CAP dataset acquisition
- Prepare Tasks: Execute all `scripts/prep_*.py` scripts to generate training data (use `prep_retrieval_task_streaming.py` for memory efficiency)
- Build Embeddings: Run `scripts/build_faiss.py` to create the retrieval index
- Format for SFT: Generate TRL-compatible datasets with `scripts/format_for_sft.py`
For systems with limited RAM (e.g., 32GB), use the streaming retrieval processor:
```bash
# Memory-efficient retrieval task generation
python scripts/prep_retrieval_task_streaming.py

# Benefits:
# - Uses SQLite for indexing (vs in-memory storage)
# - 98% less RAM usage (28GB → 600MB)
# - Handles full CAP dataset on modest hardware
# - Two-phase processing: index building → task generation
```
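Conceptually, the two-phase pattern behind `prep_retrieval_task_streaming.py` looks like the following minimal sketch (the table layout and field names here are illustrative assumptions, not the script's actual schema):

```python
import json
import sqlite3

# Phase 1: stream cases into a SQLite index, one record in memory at a time
def build_index(cases_jsonl: str, db_path: str = "retrieval_index.db") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cases (case_id TEXT PRIMARY KEY, summary TEXT)")
    with open(cases_jsonl) as f:
        for line in f:
            case = json.loads(line)
            conn.execute("INSERT OR REPLACE INTO cases VALUES (?, ?)",
                         (case["id"], case["summary"]))
    conn.commit()
    return conn

# Phase 2: generate retrieval tasks by querying the on-disk index
def generate_tasks(conn: sqlite3.Connection):
    for case_id, summary in conn.execute("SELECT case_id, summary FROM cases"):
        yield {"query": summary, "gold_case_id": case_id}
```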
- Transfer Data: Copy prepared datasets to the GPU training environment with `scripts/migrate_to_lambda.py`
- LoRA SFT Training: Parameter-efficient fine-tuning with optimized configurations for available GPU memory
- Generate GRPO Data: Create multi-response datasets with `scripts/prep_grpo_dataset.py` using the SFT model
- GRPO Training: Execute reinforcement learning with `scripts/train_grpo.py` using the generated datasets
The project supports various GPU configurations with optimized LoRA settings:
```bash
# Memory-efficient LoRA training with streaming datasets
python scripts/train_sft_simple.py \
  --dataset_name kylebrussell/cap-rlvr-sft \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --output_dir models/sft_qwen3_14b_lora

# For development/testing with subset
python scripts/train_sft_simple.py \
  --dataset_name kylebrussell/cap-rlvr-sft \
  --max_samples 10000 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4
```

LoRA Optimizations:
- Parameter Efficiency: Only 0.43% of model parameters trainable (64M/14.8B)
- Memory Efficient: Significant reduction from full fine-tuning
- Flexible Batching: Configurable batch sizes based on available GPU memory
- Quality Retention: 85-95% of full fine-tuning performance
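For orientation, a PEFT-style LoRA setup in this spirit might look like the sketch below (the rank, alpha, and target modules are illustrative assumptions, not the exact settings used by `train_sft_simple.py`):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model (checkpoint name assumed; any causal LM works the same way)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="auto")

lora_config = LoraConfig(
    r=16,                    # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Reports something like "trainable params: 64M || all params: 14.8B || trainable%: 0.43"
model.print_trainable_parameters()
```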
- Complete Gym Environments: Full OpenAI Gym interface for all 5 legal reasoning tasks
- RLHF/GRPO Ready: Environments integrate seamlessly with reinforcement learning training
- Automated Migration Pipeline: Seamless Vast.ai → Lambda Labs data transfer with verification
- TRL-Compatible SFT Formatting: Ready-to-use prompt-completion datasets for supervised fine-tuning
- Apache 2.0 Licensed: Commercial-friendly licensing with proper attribution to dependencies
- Robust Dataset Download: Multiple approaches for handling 78GB CAP dataset with resume capability
- Multi-Task Training Data: 5 legal reasoning tasks (holdings, citations, summaries, retrieval, relationships)
- Complete Reward System: Deterministic scoring functions for all task types with unified interface
- Process Supervision: GRPO training with group-based reward comparisons
- Production Ready: Quantization, serving, and deployment pipeline included
- Holding Selection: Multiple-choice questions identifying correct legal holdings
- Bluebook Citations: Fill-in-the-blank citation format completion
- IRAC Summaries: Structured case summarization using Issue-Rule-Application-Conclusion
- Case Retrieval: Finding analogous cases based on legal concepts
- Relationship Classification: Determining how cases relate (overrule, distinguish, affirm, etc.)
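For concreteness, a holding-selection sample might look like the record below (the field names are hypothetical; the prep scripts define the actual schema):

```python
sample = {
    "task_type": "holding",
    "case_name": "Smith v. Jones",
    "prompt": "Which of the following states the holding of Smith v. Jones?",
    "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",  # deterministic ground truth consumed by the reward functions
}
```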
All task types have comprehensive, deterministic reward functions:
| Task | Reward Components | Score Range |
|---|---|---|
| Holding Selection | Binary accuracy (correct choice = 1.0) | 0.0 - 1.0 |
| Bluebook Citation | Component accuracy (80%) + format validation (20%) | 0.0 - 1.0 |
| IRAC Summary | Structure (40%) + content (30%) + length (15%) + legal language (15%) | 0.0 - 1.0 |
| Case Retrieval | FAISS similarity matching + quantity bonus | 0.0 - 1.0 |
| Relationship | Classification accuracy (60%) + context consistency (25%) + quality (15%) | 0.0 - 1.0 |
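To illustrate how the weighted components in the table combine, here is a rough sketch of a Bluebook-style reward; the component and format checks are simplified stand-ins for the actual logic in `scripts/reward_bluebook.py`:

```python
import re

def bluebook_reward(expected: dict, output: str) -> float:
    """Deterministic reward: component accuracy (80%) + format validation (20%)."""
    components = ["volume", "reporter", "page", "year"]
    hits = sum(1 for c in components if str(expected[c]) in output)
    component_score = hits / len(components)

    # Crude shape check for a citation like "347 U.S. 483 (1954)"
    format_ok = bool(re.search(r"\d+\s+\S+\s+\d+\s+\(\d{4}\)", output))

    return 0.8 * component_score + 0.2 * float(format_ok)
```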
Use the unified interface:
```python
from scripts.rewards import UnifiedRewardFunction

reward_fn = UnifiedRewardFunction()
score = reward_fn.reward(sample, model_output)  # Auto-detects task type
```

Complete Implementation: All 5 legal reasoning tasks have been implemented as OpenAI Gym environments.
- Standard Gym Interface: `reset()`, `step(action)`, `render()`, `close()`
- Text-Based Actions: Natural language model responses as actions
- Unified Reward Integration: Automatic scoring using the reward functions
- Flexible Data Loading: Support for subsets during development
- Task-Specific Observations: Formatted prompts and context for each legal task
- RLHF/GRPO Ready: Direct integration with reinforcement learning pipelines
| Environment | Task | Action Format | Reward Range |
|---|---|---|---|
| `HoldingSelectionEnv` | Multiple choice holding selection | Letter choice (A, B, C, D) or text | 0.0 - 1.0 |
| `BluebookCitationEnv` | Legal citation completion | Complete citation string | 0.0 - 1.0 |
| `IRACsSummaryEnv` | Structured case summarization | IRAC-formatted text | 0.0 - 1.0 |
| `CaseRetrievalEnv` | Analogous case finding | List of case IDs/descriptions | 0.0 - 1.0 |
| `EntailmentEnv` | Case relationship classification | Relationship label (AFFIRMS, etc.) | 0.0 - 1.0 |
```python
# Individual task environment
from envs import HoldingSelectionEnv

env = HoldingSelectionEnv(data_path="data_tasks/holding/train.jsonl")
obs = env.reset()
reward = env.step("A")[1]

# Multi-task training setup
from envs import *

environments = {
    'holding': HoldingSelectionEnv(),
    'citation': BluebookCitationEnv(),
    'summary': IRACsSummaryEnv(),
    'retrieval': CaseRetrievalEnv(),
    'entailment': EntailmentEnv()
}

# GRPO/RLHF training integration
# (`policy` stands for any RL policy wrapper exposing generate/update)
for task, env in environments.items():
    obs = env.reset()
    model_response = policy.generate(obs['inputs'])
    obs, reward, done, info = env.step(model_response)
    policy.update(reward, info)
```

See `envs/README.md` for comprehensive documentation and advanced usage patterns.
The `format_for_sft.py` script converts raw task data into ready-to-use SFT datasets:
```bash
# Generate all SFT format variants
python scripts/format_for_sft.py --format separate   # Individual task files
python scripts/format_for_sft.py --format unified    # Multi-task training
python scripts/format_for_sft.py --format chat       # Chat message format

# View statistics without saving
python scripts/format_for_sft.py --stats-only
```

Output Formats:
- Separate: `data_tasks/sft_formatted/bluebook/train_sft.jsonl` (per-task training)
- Unified: `data_tasks/sft_formatted/unified/train_sft_unified.jsonl` (multi-task)
- Chat: `data_tasks/sft_formatted/chat_format/` (messages format for newer models)
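The prompt-completion and chat variants differ only in record shape; the examples below are illustrative (actual field contents come from the task data):

```python
# TRL-style prompt-completion record (separate/unified formats)
sft_record = {
    "prompt": "Complete the citation: Brown v. Board of Education, ___",
    "completion": "347 U.S. 483 (1954)",
}

# Chat-format record (messages format for newer models)
chat_record = {
    "messages": [
        {"role": "user", "content": "Complete the citation: Brown v. Board of Education, ___"},
        {"role": "assistant", "content": "347 U.S. 483 (1954)"},
    ]
}
```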
Transfer processed data from Vast.ai CPU instances to Lambda Labs filesystem:
```bash
# Check data readiness on Vast.ai
python scripts/migrate_to_lambda.py --check-only

# Transfer data to Lambda Labs
python scripts/migrate_to_lambda.py --lambda-host your-lambda-host

# Test migration without executing
python scripts/migrate_to_lambda.py --dry-run --lambda-host test-host
```

Migration Features:
- Data Transfer Only: Copies prepared datasets to Lambda Labs filesystem
- Validation: Verifies all 5 data prep tasks completed before transfer
- Compression: Creates efficient archive (~5-8GB from ~16GB raw)
- Integrity: MD5 checksums ensure data transfer accuracy (see the sketch after this list)
- Clean Transfer: Both raw task data and SFT-formatted datasets
- No Training Steps: Migration script only handles data movement, not training orchestration
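The integrity check boils down to comparing streamed MD5 digests on both ends; a minimal version is sketched below (the archive name is a placeholder, and the migration script's actual implementation may differ):

```python
import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1MB chunks so large archives never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hash recorded on the source side before transfer (placeholder value)
expected_md5 = "0123456789abcdef0123456789abcdef"
if md5sum("cap_data_archive.tar.gz") != expected_md5:
    raise RuntimeError("Archive corrupted in transit; re-run the transfer")
```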
After SFT training completes, generate multi-response datasets for GRPO training using the fine-tuned model:
```bash
# Run on Lambda Labs GPU instance with SFT model
python scripts/prep_grpo_dataset.py --task all --model_path models/sft --num_candidates 4

# For development/testing with subset
python scripts/prep_grpo_dataset.py --task bluebook --model_path models/sft --subset 1000

# Mock mode for testing script without model loading
python scripts/prep_grpo_dataset.py --task bluebook --model_path models/sft --mock_mode
```

Key Features:
- Multi-response generation: Creates 4 candidate responses per query using different sampling parameters
- Unified reward scoring: Integrates with existing reward functions for consistent evaluation
- GPU-optimized: Designed to run on Lambda Labs GPU instances with model loaded in memory
- Flexible output: Generates JSON files with scored response groups ready for GRPO training
Output: Creates `data_grpo/{task}/train_grpo.json` files with multiple scored responses per query, enabling GRPO's group-based ranking approach.
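Conceptually, the generation step pairs varied sampling parameters with the unified reward; here is a sketch (the `generate` callable stands in for the SFT model's sampling call, and the parameter values are illustrative):

```python
from scripts.rewards import UnifiedRewardFunction

reward_fn = UnifiedRewardFunction()

# Vary sampling parameters so the four candidates differ (values are illustrative)
SAMPLING_PARAMS = [
    {"temperature": 0.2, "top_p": 0.90},
    {"temperature": 0.7, "top_p": 0.90},
    {"temperature": 1.0, "top_p": 0.95},
    {"temperature": 1.2, "top_p": 0.95},
]

def build_grpo_group(sample: dict, generate) -> dict:
    """`generate(prompt, **params) -> str` stands in for the model's sampling call."""
    responses = [generate(sample["prompt"], **params) for params in SAMPLING_PARAMS]
    return {
        "prompt": sample["prompt"],
        "responses": responses,
        # Deterministic scores give GRPO its within-group ranking signal
        "scores": [reward_fn.reward(sample, r) for r in responses],
    }
```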
Execute reinforcement learning training using the complete GRPO implementation and its progressive training sequence:
The project implements a sequential improvement approach:
- SFT Base Model → Bluebook GRPO (citation formatting mastery)
- Bluebook GRPO → Holding GRPO (building on citation knowledge)
- Holding GRPO → Summarise GRPO (adding structured reasoning)
- Summarise GRPO → Entail GRPO (completing legal reasoning suite)
```bash
# Progressive training sequence (recommended)
# 1. Start with SFT model for first task
python scripts/train_grpo.py --task bluebook --model_path models/sft \
    --data_path data_grpo/bluebook/train_grpo.json

# 2. Use previous GRPO model as base for next task
python scripts/train_grpo.py --task holding --model_path models/grpo_bluebook \
    --data_path data_grpo/holding/train_grpo.json

# 3. Continue building on previous improvements
python scripts/train_grpo.py --task summarise --model_path models/grpo_holding \
    --data_path data_grpo/summarise/train_grpo.json

# 4. Final task builds on all previous knowledge
python scripts/train_grpo.py --task entail --model_path models/grpo_summarise \
    --data_path data_grpo/entail/train_grpo.json

# Multi-task GRPO training with evaluation
python scripts/train_grpo.py --task all --multi_task --model_path models/sft \
    --data_path data_grpo/unified/train_grpo.json \
    --eval_data_path data_grpo/unified/eval_grpo.json
```

Key Features:
- Production-Ready: Complete error handling, checkpointing, and resumption
- Multi-Task Support: Train on individual tasks or combined datasets
- Memory Optimized: Conservative batch sizes and gradient accumulation for large models
- Legal-Specific Metrics: Custom callbacks and logging for legal reasoning evaluation
- TRL Integration: Modern TRL library compatibility with proper GRPO implementation
- Evaluation-Only Mode: Stage progression validation with the `--eval_only` flag
The project includes comprehensive automation for the iterative GRPO training pipeline:
```bash
# Generate unified multi-task datasets
python scripts/prep_grpo_dataset.py --task all --unified_output --model_path models/sft

# Validate stage progression
python scripts/validate_stage_progression.py --stage 0 --check_all_tasks --model_path models/grpo/

# Evaluation-only mode for testing
python scripts/train_grpo.py --eval_only --task all --model_path models/grpo/current
```

```bash
# Complete 4-stage automated training from SFT to production
python scripts/orchestrate_grpo_training.py --sft_model_path models/sft --start_stage 0

# Resume from specific stage
python scripts/orchestrate_grpo_training.py --base_model_path models/grpo/stage1_complete --start_stage 2

# Preview execution plan
python scripts/orchestrate_grpo_training.py --sft_model_path models/sft --dry_run
```

Automation Features:
- 4-Stage Pipeline: Individual mastery → Multi-task integration → Curriculum refinement → Production optimization
- Model-Size-Aware Naming: Automatic detection and organization by model size (7B, 14B, 32B, etc.)
- Auto-Validation: Reward thresholds checked automatically between stages
- Smart Retry Logic: Failed stages retry with adjusted parameters (max 2 retries per stage)
- Multi-Hour Training Support: Enhanced monitoring for long-duration runs (up to 6 hours per stage)
- Graceful Shutdown: Signal handling for clean interruption and resumption (see the sketch after this list)
- Progress Persistence: Training state saved to disk for crash recovery
- Resource Monitoring: Memory and system resource tracking during execution
- Heartbeat Logging: Progress updates every 10 minutes during long runs
- Comprehensive Logging: Detailed progress tracking and error reporting
- Flexible Resumption: Start from any stage with appropriate base model
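The graceful-shutdown and progress-persistence behavior amounts to a signal flag checked between stages plus state written to disk; a minimal sketch (the stage runner and state filename are hypothetical, not the orchestrator's exact code):

```python
import json
import signal

shutdown_requested = False

def _request_shutdown(signum, frame):
    """Set a flag; the loop finishes the current stage, then exits cleanly."""
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGINT, _request_shutdown)
signal.signal(signal.SIGTERM, _request_shutdown)

def run_stage(stage: int) -> None:
    ...  # placeholder for the real stage runner

start_stage = 0
for stage in range(start_stage, 4):
    run_stage(stage)
    # Persist progress so a crash or interruption can resume via --start_stage
    with open("training_state.json", "w") as f:
        json.dump({"last_completed_stage": stage}, f)
    if shutdown_requested:
        break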
Stage Progression Thresholds:
- Stage 0: ≥80% reward per individual task
- Stage 1: ≥75% reward across all tasks simultaneously
- Stage 2: ≥85% reward with variance <0.15
- Stage 3: ≥90% reward with variance <0.10
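Checking a stage gate then reduces to comparing mean reward and variance against these thresholds; the sketch below illustrates the logic that `validate_stage_progression.py` enforces (the exact implementation may differ):

```python
from statistics import mean, pvariance

# (minimum mean reward, maximum variance or None) per stage, per the thresholds above
STAGE_GATES = {
    0: (0.80, None),   # checked per individual task
    1: (0.75, None),   # across all tasks simultaneously
    2: (0.85, 0.15),
    3: (0.90, 0.10),
}

def stage_passed(stage: int, rewards: list[float]) -> bool:
    min_mean, max_var = STAGE_GATES[stage]
    if mean(rewards) < min_mean:
        return False
    return max_var is None or pvariance(rewards) < max_var
```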
Model-Size-Aware Output Structure:
```
models/
├── qwen3-14b/
│   ├── grpo/
│   │   ├── qwen3-cap-rlvr-14b-production/  # Final production model
│   │   ├── stage0_complete/
│   │   └── stage1_complete/
│   └── sft/
├── qwen3-7b/
│   └── grpo/
│       └── qwen3-cap-rlvr-7b-production/
└── qwen3-32b/                              # Future large model support
    └── grpo/
```
Based on the Caselaw Access Project (CAP) containing millions of US court decisions, processed into structured training tasks for legal reasoning.
Available on HuggingFace:
Training Datasets:
- `kylebrussell/cap-rlvr-holding`: 20K train, 2.5K val/test
- `kylebrussell/cap-rlvr-bluebook`: 253K train, 32K val/test
- `kylebrussell/cap-rlvr-summarise`: 4.4M train, 555K val/test
- `kylebrussell/cap-rlvr-sft`: 9.9M train, 1.2M val/test (multi-task dataset)
- `kylebrussell/cap-rlvr-retrieval`: Includes FAISS embeddings for case retrieval
GRPO-Trained Models (Progressive Sequence):
- `kylebrussell/cap-rlvr-grpo-bluebook`: Citation formatting specialist (2,988 training pairs)
- `kylebrussell/cap-rlvr-grpo-holding`: Holding selection expert (3,000 training pairs, builds on bluebook)
- `kylebrussell/cap-rlvr-grpo-summarise`: IRAC summarization expert (755 training pairs, builds on holding)
- `kylebrussell/cap-rlvr-grpo-entail`: Case relationship classifier (840 training pairs, builds on summarise)
See `docs/cap_rlvr_grpo_plan.md` for the complete implementation plan and training details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Qwen Models: Apache 2.0 License (Alibaba Cloud) - Base models for fine-tuning
- Caselaw Access Project: Public domain legal case data (Harvard Law School)
- PyTorch & Dependencies: Various open-source licenses - see requirements for details
Commercial use is permitted. See NOTICE file for complete attribution information.