hinbox is a flexible, domain-configurable entity extraction system designed
for historians and researchers. It processes historical documents, academic
papers, news articles, and book chapters to extract structured information about
people, organizations, locations, and events. Originally designed for Guantánamo
Bay media coverage analysis, it now supports any historical or research domain
through a simple configuration system.
- Research-Focused: Designed for historians, academics, and researchers
- Flexible Sources: Process historical documents, academic papers, news articles, book chapters
- Domain-Agnostic: Configure for any historical period, region, or research topic
- Multiple AI Models: Support for both cloud (Gemini by default, but supports anything that `litellm` supports) and local (Ollama by default, but works with `litellm`) models
- Entity Extraction: Automatically extract people, organizations, locations, and events
- Smart Deduplication: RapidFuzz lexical blocking + embedding similarity with per-entity-type thresholds
- Merge Dispute Agent: Second-stage LLM arbitration for ambiguous gray-band entity matches
- 5-Layer Canonical Name Selection: Deterministic scoring picks the best display name across aliases, penalizing acronyms and generic phrases
- Profile Versioning: Track how entity profiles evolve as new sources are processed
- Profile Grounding: Citation-backed claim verification checks that profile text is supported by source articles
- Extraction Quality Controls: Deterministic QC with automatic retry when severe issues (zero entities, high drop rates) are detected
- Extraction Caching: Persistent sidecar cache avoids redundant LLM calls when re-processing unchanged articles
- Parallel Pipeline: Concurrent article extraction with LLM rate limiting and batched embedding computation
- Modular Engine: `src/engine` coordinates article processing, extraction, merging, and profile versioning so new domains can reuse the same pipeline
- Privacy Mode: `--local` flag enforces local-only embeddings and disables all LLM telemetry callbacks
- Web Interface: FastHTML-based "Archival Elegance" UI with confidence badges, alias display, tag pills, and profile version navigation
- Easy Setup: Simple configuration files, no Python coding required
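
To make the canonical name selection feature concrete, here is a minimal sketch of deterministic alias scoring. The five layers, their order, and the `GENERIC_PHRASES` set are illustrative assumptions; the project's actual scoring in `src/utils/name_variants.py` may differ.

```python
# Hypothetical phrases considered too generic to serve as a display name.
GENERIC_PHRASES = {"the committee", "the government", "the company"}


def is_acronym(name: str) -> bool:
    """Treat short all-caps tokens such as 'ICRC' or 'U.N.' as acronyms."""
    compact = name.replace(".", "")
    return compact.isalpha() and compact.isupper() and len(compact) <= 6


def score_name(name: str, mention_count: int) -> tuple:
    """Return a sortable score; tuples compare left to right, higher wins."""
    cleaned = name.strip()
    return (
        cleaned.lower() not in GENERIC_PHRASES,  # 1. avoid generic phrases
        not is_acronym(cleaned),                 # 2. avoid bare acronyms
        len(cleaned.split()) >= 2,               # 3. prefer multi-word names
        mention_count,                           # 4. prefer frequent aliases
        cleaned,                                 # 5. lexical tie-break keeps it deterministic
    )


def pick_canonical(aliases: dict[str, int]) -> str:
    """Pick the best display name from a mapping of alias -> mention count."""
    return max(aliases, key=lambda name: score_name(name, aliases[name]))


aliases = {
    "ICRC": 12,
    "International Committee of the Red Cross": 4,
    "the committee": 7,
}
print(pick_canonical(aliases))
```

Because the score is a plain tuple, the same aliases always produce the same winner, which is the property that matters when entities are re-merged across runs.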

just domains
just init palestine_food_history

Edit the generated files in `configs/palestine_food_history/`:
- `config.yaml` - Research domain settings and data paths
- `prompts/*.md` - Extraction instructions tailored to your sources
- `categories/*.yaml` - Entity type definitions relevant to your research

just process-domain palestine_food_history --limit 5
just frontend

- Python 3.12+
- `uv` (for dependency management)
- Optional: Ollama (for local model support)
- `just` (task runner)

- Clone the repository:
git clone https://github.com/strickvl/hinbox.git
cd hinbox

- Install dependencies:
# Cloud embeddings only (works on all platforms including Intel Mac):
uv sync
# With local embeddings (requires PyTorch; Linux/Apple Silicon/Windows):
uv sync --extra local-embeddings

- Set up environment variables:
# Create a .env file (auto-loaded by just):
echo 'GEMINI_API_KEY=your-gemini-api-key' > .env
# Optional for local processing:
echo 'OLLAMA_API_URL=http://localhost:11434/v1' >> .env

- Set up local model (optional, requires Ollama):
# Pull the default local model (Qwen 2.5 32B, ~23GB download):
ollama pull qwen2.5:32b-instruct-q5_K_M
# Set a realistic context window (Ollama defaults are modest;
# Qwen 2.5 supports up to 131K tokens but more context = more VRAM).
# Add to your shell profile or systemd unit for the Ollama server:
export OLLAMA_CONTEXT_LENGTH=32768

You can override the default models without editing code by setting environment variables in your `.env` file:
# Override the local model (default: ollama/qwen2.5:32b-instruct-q5_K_M):
echo 'HINBOX_OLLAMA_MODEL=ollama/gemma3:27b' >> .env
# Override the cloud model (default: gemini/gemini-2.0-flash):
echo 'HINBOX_CLOUD_MODEL=gemini/gemini-2.5-flash' >> .env

- Verify installation:
just domains
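
For readers wondering how the model override variables from the installation steps might be consumed, the following is a sketch only. The variable names and defaults come from the text above; the `resolve_model` helper is hypothetical and is not taken from the hinbox codebase.

```python
import os

# Defaults as stated in the installation steps above.
DEFAULT_CLOUD_MODEL = "gemini/gemini-2.0-flash"
DEFAULT_OLLAMA_MODEL = "ollama/qwen2.5:32b-instruct-q5_K_M"


def resolve_model(local: bool, env=None) -> str:
    """Pick a model name, letting environment variables override the defaults.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real settings.
    """
    env = os.environ if env is None else env
    if local:
        return env.get("HINBOX_OLLAMA_MODEL", DEFAULT_OLLAMA_MODEL)
    return env.get("HINBOX_CLOUD_MODEL", DEFAULT_CLOUD_MODEL)


print(resolve_model(local=True, env={"HINBOX_OLLAMA_MODEL": "ollama/gemma3:27b"}))
print(resolve_model(local=False, env={}))
```

The first call returns the overridden local model, the second falls back to the cloud default because no override is set.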
just init palestine_food_history
# Edit configs/palestine_food_history/ to focus on:
# - People: farmers, traders, cookbook authors, anthropologists
# - Organizations: agricultural cooperatives, food companies, research institutions
# - Events: harvests, famines, recipe documentation, cultural exchanges
# - Locations: villages, markets, agricultural regions, refugee camps

just init afghanistan_1980s
# Configure for:
# - People: military leaders, diplomats, journalists, mujahideen commanders
# - Organizations: military units, intelligence agencies, NGOs, tribal groups
# - Events: battles, negotiations, refugee movements, arms shipments
# - Locations: provinces, military bases, refugee camps, border crossings

just init medieval_trade
# Set up for:
# - People: merchants, rulers, scholars, travelers
# - Organizations: trading companies, guilds, monasteries, courts
# - Events: trade agreements, diplomatic missions, market fairs
# - Locations: trading posts, cities, trade routes, ports

# Process with different options
just process --domain afghanistan_1980s --limit 20 --verbose
just process-domain palestine_food_history --limit 10 --relevance-check
# Use local models (requires Ollama) - useful for sensitive historical research
just process --domain medieval_trade --local
# Force reprocessing when you update your configuration
just process --domain afghanistan_1980s --force-reprocess

just frontend
Explore extracted entities at http://localhost:5001

# Check processing status
just check
# Reset processing status
just reset
# View available domains
just domains

configs/
├── guantanamo/                  # Example domain shipped with the project
├── template/                    # Starter files copied by `just init`
└── README.md                    # Domain configuration walkthrough
src/
├── process_and_extract.py       # CLI entry point: parallel producer/consumer pipeline
├── engine/
│   ├── article_processor.py     # Relevance → extraction → QC retry orchestration
│   ├── extractors.py            # Unified cloud/local entity extraction
│   ├── mergers.py               # Lexical blocking → embedding similarity → match check → dispute
│   ├── match_checker.py         # LLM-based match verification
│   ├── merge_dispute_agent.py   # Second-stage arbitration for gray-band matches
│   ├── profiles.py              # VersionedProfile history management
│   └── relevance.py             # Domain-specific relevance filtering
├── frontend/                    # FastHTML "Archival Elegance" UI
│   ├── routes/                  # Modular route handlers (home, people, orgs, locations, events)
│   ├── components.py            # Reusable UI building blocks (badges, version selectors, tags)
│   ├── data_access.py           # Centralised Parquet data loading
│   └── static/styles.css        # CSS variables, fonts, layout
├── utils/
│   ├── embeddings/              # EmbeddingManager, cloud/local providers, similarity helpers
│   ├── cache_utils.py           # Thread-safe LRU cache and stable hashing
│   ├── extraction_cache.py      # Persistent sidecar cache for extraction results
│   ├── name_variants.py         # Deterministic name normalisation, acronym detection, canonical scoring
│   ├── processing_status.py     # Sidecar JSON tracker (replaces in-Parquet status)
│   ├── outcomes.py              # PhaseOutcome structured result objects
│   └── quality_controls.py      # Extraction QC, profile QC, and profile grounding verification
├── config_loader.py             # Domain config loader (incl. per-type thresholds, lexical blocking)
├── dynamic_models.py            # Domain-driven Pydantic model factories
├── constants.py                 # Model defaults, embedding settings, thresholds, privacy controls
├── logging_config.py            # Rich-based structured logging with colour-coded decision lines
└── exceptions.py                # Custom exception types used across the pipeline
tests/
├── embeddings/                          # Embedding manager, cloud provider, config integration
├── test_canonical_name.py               # 5-layer canonical name scoring and rekey-on-merge
├── test_cli_privacy_mode.py             # --local flag enforces local embeddings
├── test_domain_paths.py                 # Domain-specific path resolution and batch writes
├── test_entity_merger_merge_smoke.py    # Embedding-based merge smoke tests
├── test_entity_merger_similarity.py     # Similarity scoring, lexical blocking, fingerprints
├── test_extraction_cache.py             # Sidecar cache key determinism and roundtrip
├── test_extraction_retry.py             # QC-triggered retry logic and repair hints
├── test_llm_multiple_tool_calls.py      # Instructor multi-tool-call recovery
├── test_merge_dispute_agent_routing.py  # Gray-band routing and dispute decisions
├── test_name_variants.py                # Name normalisation, acronyms, equivalence groups
├── test_profile_grounding.py            # Citation extraction and grounding verification
├── test_profile_versioning.py           # Versioned profile regression tests
└── test_frontend_versioning.py          # UI behaviour for profile history
data/
├── guantanamo/                  # Default domain data directory (created locally)
└── {domain}/                    # Additional domains maintain their own raw/entity data

Each domain has its own `configs/{domain}/` directory with:

`config.yaml` - Main settings:
domain: "palestine_food_history"
description: "Historical analysis of Palestinian food culture and agriculture"
data_sources:
  default_path: "data/palestine_food_history/raw_sources/historical_sources.parquet"
output:
  directory: "data/palestine_food_history/entities"

`categories/*.yaml` - Entity type definitions:
person_types:
  player:
    description: "Professional football players"
    examples: ["Lionel Messi", "Cristiano Ronaldo"]

`prompts/*.md` - Extraction instructions (plain English!):
You are an expert at extracting people from historical documents about Palestinian food culture.
Focus on farmers, traders, cookbook authors, researchers, and community leaders...

Historical sources should be in Parquet format with columns:
- `title`: Document/article title
- `content`: Full text content
- `url`: Source URL (if applicable)
- `published_date`: Publication/creation date
- `source_type`: "book_chapter", "journal_article", "news_article", "archival_document", etc.
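
Before converting sources to Parquet, it can help to check that each row carries the columns listed above. The sketch below uses only the standard library; the `missing_columns` helper and the sample rows are made up for illustration, and whether every column is strictly required is an assumption.

```python
from datetime import date

# Column names taken from the list above.
EXPECTED_COLUMNS = ("title", "content", "url", "published_date", "source_type")


def missing_columns(row: dict) -> list[str]:
    """Return the expected columns that are absent from a source row."""
    return [col for col in EXPECTED_COLUMNS if col not in row]


# Made-up sample rows: one complete, one incomplete.
rows = [
    {
        "title": "Olive harvest report",
        "content": "Full text of the document...",
        "url": "https://example.org/olive-harvest",
        "published_date": date(1936, 10, 1).isoformat(),
        "source_type": "archival_document",
    },
    {"title": "Untitled fragment", "content": "Partial text"},
]

for index, row in enumerate(rows):
    gaps = missing_columns(row)
    print(index, "ok" if not gaps else f"missing: {', '.join(gaps)}")
```

With pandas and a Parquet engine such as pyarrow installed, validated rows could then be written with `pandas.DataFrame(rows).to_parquet(path)`.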
- Configuration Loading: Read domain-specific settings
- Source Loading: Process historical documents in Parquet format
- Relevance Filtering: Domain-specific content filtering for research focus
- Parallel Extraction: Concurrent article + entity-type extraction with LLM rate limiting (`ThreadPoolExecutor` workers, bounded semaphore for API calls)
- Extraction Caching: Persistent sidecar cache keyed on content hash, model, prompt, and schema; skips redundant LLM calls on re-runs
- Quality Controls: Deterministic QC validates extraction output (required fields, name normalisation, within-article dedup) with automatic retry on severe flags
- Smart Deduplication: Lexical blocking pre-filter → batched embedding similarity → evidence-first merge cost structure (cheap checks before expensive LLM calls)
- Merge Dispute Resolution: Gray-band matches (similarity near threshold with low confidence) get a second-stage LLM arbitration via `MergeDisputeAgent`
- Canonical Name Selection: 5-layer scoring picks the best display name when entities merge, penalizing acronyms and generic phrases
- Profile Generation: Create comprehensive entity profiles with citation-backed claims and automatic versioning
- Profile Grounding: Post-processing verification that profile claims are supported by cited source articles
- Persistence: Batched Parquet writes per entity type (avoiding write amplification), sidecar JSON for processing status
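
The extraction caching step above keys results on content hash, model, prompt, and schema. A minimal sketch of such a key follows; the `cache_key` function and the JSON payload layout are assumptions for illustration, not the format used by `src/utils/extraction_cache.py`.

```python
import hashlib
import json


def cache_key(content: str, model: str, prompt: str, schema: dict) -> str:
    """Derive a stable key from everything that affects an extraction result."""
    payload = json.dumps(
        {
            "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
            "model": model,
            "prompt": prompt,
            "schema": schema,
        },
        sort_keys=True,  # keeps the key independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


schema = {"name": "string", "type": "string"}
first = cache_key("Article text", "gemini/gemini-2.0-flash", "Extract people.", schema)
again = cache_key("Article text", "gemini/gemini-2.0-flash", "Extract people.", schema)
changed = cache_key("Article text", "gemini/gemini-2.0-flash", "Extract events.", schema)

# Same inputs give the same key; editing the prompt invalidates it.
print(first == again, first == changed)
```

Including the prompt and schema in the key means a configuration change naturally misses the cache, while unchanged articles are served from it.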

- `ArticleProcessor` orchestrates relevance checks, extraction dispatch (with QC retry), and per-article metadata aggregation (`src/engine/article_processor.py`)
- `EntityExtractor` unifies cloud and local model calls using domain-specific Pydantic schemas (`src/engine/extractors.py`)
- `EntityMerger` pre-filters with RapidFuzz lexical blocking, compares batched embeddings, calls match-checkers, routes gray-band cases to the dispute agent, and updates persisted Parquet rows (`src/engine/mergers.py`)
- `MergeDisputeAgent` provides second-stage structured LLM analysis for ambiguous merge/skip decisions (`src/engine/merge_dispute_agent.py`)
- `VersionedProfile` and helper functions maintain profile history for each entity (`src/engine/profiles.py`)

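
The merge flow described for `EntityMerger` (cheap lexical check, then embedding similarity, then gray-band routing) can be pictured with a small stdlib-only sketch. Note the assumptions: `difflib` stands in for RapidFuzz, the embeddings are toy vectors, the thresholds and band width are invented, and the LLM match check is omitted entirely. The `MERGE`/`SKIP`/`DISPUTE` labels are borrowed from the log decision names and may not carry the same meaning in the codebase.

```python
import math
from difflib import SequenceMatcher

# Illustrative per-entity-type thresholds; real values come from domain config.
THRESHOLDS = {"people": 0.85, "organizations": 0.80}
GRAY_BAND = 0.05     # assumed half-width of the ambiguous zone around a threshold
LEXICAL_FLOOR = 0.4  # assumed minimum string similarity to remain a candidate


def lexical_ratio(a: str, b: str) -> float:
    """Cheap string similarity; difflib stands in for RapidFuzz here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def decide(entity_type: str, name_a: str, name_b: str,
           emb_a: list[float], emb_b: list[float]) -> str:
    """Return MERGE, SKIP, or DISPUTE for a candidate pair."""
    if lexical_ratio(name_a, name_b) < LEXICAL_FLOOR:
        return "SKIP"  # blocked before any embedding comparison
    similarity = cosine(emb_a, emb_b)
    threshold = THRESHOLDS[entity_type]
    if abs(similarity - threshold) <= GRAY_BAND:
        return "DISPUTE"  # would be handed to second-stage arbitration
    return "MERGE" if similarity > threshold else "SKIP"


print(decide("people", "John Smith", "Jon Smith", [1.0, 0.0], [1.0, 0.0]))
print(decide("people", "John Smith", "Zanzibar", [1.0, 0.0], [1.0, 0.0]))
print(decide("organizations", "Red Cross", "Red Cross Society", [1.0, 0.0], [4.0, 3.0]))
```

Ordering the checks from cheapest to most expensive is the point of the design: most pairs are rejected by string comparison alone, and only the ambiguous remainder would ever reach an LLM.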
- Domain-Agnostic: Easy to configure for any topic
- Multiple AI Models: Cloud (Gemini) and local (Ollama) support
- Privacy Mode: `--local` flag forces local embeddings and disables all LLM telemetry
- Smart Processing: Automatic relevance filtering, caching, and multi-layer deduplication
- Profile Versioning: Track entity profile changes over time with full version history
- Profile Grounding: Citation-backed claim verification for generated profiles
- Modern Interface: FastHTML "Archival Elegance" theme with confidence badges, aliases, tag pills, and version navigation
- Robust Pipeline: Structured `PhaseOutcome` error handling, quality controls, extraction retry, and progress tracking
- Structured Logging: Colour-coded decision lines (`NEW`, `MERGE`, `SKIP`, `DISPUTE`, `DEFER`) for pipeline transparency

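
The quality controls and extraction retry listed above can be sketched as a check-then-retry loop. Everything named here (`QCReport`, `run_qc`, `extract_with_retry`, the 0.5 drop-rate limit) is hypothetical; only the two severe conditions, zero entities and a high drop rate, come from the feature description.

```python
from dataclasses import dataclass, field


@dataclass
class QCReport:
    """Hypothetical QC result; field names are illustrative."""
    kept: int
    dropped: int
    flags: list[str] = field(default_factory=list)


def run_qc(raw_entities: list[dict], max_drop_rate: float = 0.5) -> QCReport:
    """Drop entities without a name and flag severe problems."""
    kept = [e for e in raw_entities if e.get("name", "").strip()]
    dropped = len(raw_entities) - len(kept)
    flags = []
    if not kept:
        flags.append("zero_entities")
    elif dropped / len(raw_entities) > max_drop_rate:
        flags.append("high_drop_rate")
    return QCReport(kept=len(kept), dropped=dropped, flags=flags)


def extract_with_retry(extract, max_attempts: int = 2) -> QCReport:
    """Call `extract(attempt)` again while QC reports a severe flag."""
    report = run_qc(extract(0))
    for attempt in range(1, max_attempts):
        if not report.flags:
            break
        report = run_qc(extract(attempt))
    return report


def fake_extract(attempt: int) -> list[dict]:
    # Stand-in for an LLM call: empty on the first try, usable on the second.
    return [] if attempt == 0 else [{"name": "Jane Example"}, {"name": ""}]


report = extract_with_retry(fake_extract)
print(report)
```

Because the QC step is deterministic, a retry is triggered only by the extraction output itself, never by chance in the checker.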
# Run all tests
just test
# Run specific test files
just test -k test_profile_versioning
just test tests/test_entity_merger_similarity.py

CI runs lint and tests automatically on every PR (`.github/workflows/test.yml`). The test suite covers embedding similarity, lexical blocking, per-type threshold resolution, entity merger behavior, merge dispute agent routing, extraction caching, extraction retry logic, canonical name selection, name variant detection, profile grounding, profile versioning, privacy mode enforcement, and frontend components, all without requiring API keys or a GPU.

# Format code
just format
# Lint code
just lint
# Both together
just check-code
# Run exactly what CI runs (recommended before pushing)
just ci

Contributions welcome! Areas of interest:
- New domain templates and examples
- Additional language model integrations
- Enhanced web interface features
- Performance optimizations
MIT License - see LICENSE file for details.
For questions about:
- Configuration: See `configs/README.md`
- Setup: Check installation steps above
- Usage: Try `just` or `just --list`
- Issues: Open a GitHub issue

Built for: Historians, researchers, and academics working with large document collections
Built with: Python, Pydantic, FastHTML, LiteLLM, Instructor, RapidFuzz, Jina Embeddings, Rich



