SearchPak is a local, self-hosted search service that integrates SearXNG for search results and Playwright/BeautifulSoup for scraping and enrichment, exposed via a FastAPI REST API.
- Search: Proxies requests to a local SearXNG instance.
- Scraping:
  - Static: Uses httpx and BeautifulSoup for fast scraping of static pages.
  - Dynamic: Uses Playwright to render JavaScript-heavy pages.
- Enrichment: Extracts title, content, and metadata.
- API: Clean REST API for programmatic access.
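
The enrichment step listed above (title, content, and metadata) can be illustrated with a small sketch. SearchPak itself uses BeautifulSoup for this; the version below swaps in the standard library's `html.parser` so that it runs without extra dependencies. The names `PageExtractor` and `enrich`, and the shape of the returned dictionary, are hypothetical and not taken from the project.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the page title, named <meta> tags, and visible text.

    Hypothetical helper for illustration; not SearchPak's actual code.
    """

    SKIPPED = {"script", "style"}  # elements whose text is never visible

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metadata = {}
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            attr = dict(attrs)
            # Keep only <meta name="..." content="..."> pairs.
            if attr.get("name") and attr.get("content"):
                self.metadata[attr["name"].lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif not self._skip_depth:
            self.chunks.append(text)


def enrich(html: str) -> dict:
    """Return title, content, and metadata for one HTML document."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "content": " ".join(parser.chunks),
        "metadata": parser.metadata,
    }
```

The same structure carries over to BeautifulSoup; only the parsing calls would change.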
SearchPak uses a unified Docker image that runs both the FastAPI service and SearXNG using supervisor.
- api/: FastAPI application and routes.
- search/: SearXNG client and normalization.
- scraping/: URL classification and scraping logic.
- utils/: Configuration.
- supervisord.conf: Process management for the unified container.
- Docker and Docker Compose
- Start the service:
  docker compose up -d --build
- The API will be available at http://localhost:8000. SearXNG will be available at http://localhost:8080.
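
Once the service is running, the endpoints can also be called from Python rather than curl. The sketch below uses only the standard library and assumes the API is reachable at http://localhost:8000 and returns JSON. The helper names `build_request` and `post_json` are hypothetical, and the response shape is not documented here, so the code returns the decoded body as-is.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address from the quick start above


def build_request(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a JSON POST request for a SearchPak endpoint such as /search."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0):
    """Send the request and decode the JSON response (assumed, not verified)."""
    with urllib.request.urlopen(build_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example call (requires the service to be up):
# results = post_json("/search", {"query": "python fastapi", "limit": 5})
```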
curl -X POST "http://localhost:8000/search" \
-H "Content-Type: application/json" \
-d '{"query": "python fastapi", "limit": 5}'

curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com"]}'

curl -X POST "http://localhost:8000/search-and-scrape" \
-H "Content-Type: application/json" \
-d '{"query": "python fastapi", "scrape_limit": 3}'

Environment variables can be set in .env or docker-compose.yml:
- SEARXNG_BASE_URL: URL of the SearXNG instance.
- PLAYWRIGHT_HEADLESS: Run Playwright in headless mode (default: True).
- REQUEST_TIMEOUT: Timeout for requests in seconds.
- DOMAIN_WHITELIST: Comma-separated list of allowed domains (e.g., example.com,google.com).
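
As a rough illustration of how these variables might be read at startup, the sketch below parses them into a typed settings object. It is not the code in utils/. The `Settings` class and `load_settings` function are hypothetical, and the fallback values for SEARXNG_BASE_URL and REQUEST_TIMEOUT are assumptions; only the headless default of True comes from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


def _parse_bool(value: str) -> bool:
    """Treat common truthy spellings as True; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    searxng_base_url: str
    playwright_headless: bool
    request_timeout: float
    domain_whitelist: Tuple[str, ...]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from the process environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    whitelist = tuple(
        domain.strip().lower()
        for domain in env.get("DOMAIN_WHITELIST", "").split(",")
        if domain.strip()
    )
    return Settings(
        # Assumed fallback: the SearXNG address from the quick start.
        searxng_base_url=env.get("SEARXNG_BASE_URL", "http://localhost:8080"),
        # Documented default: headless mode is on.
        playwright_headless=_parse_bool(env.get("PLAYWRIGHT_HEADLESS", "True")),
        # Assumed fallback of 10 seconds; the real default is not documented here.
        request_timeout=float(env.get("REQUEST_TIMEOUT", "10")),
        domain_whitelist=whitelist,
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the parsing easy to test.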
You can restrict scraping to specific domains globally via DOMAIN_WHITELIST or per-request using the whitelist field:
curl -X POST "http://localhost:8000/scrape" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com", "https://other.com"],
"whitelist": ["example.com"]
}'
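
To make the effect of a whitelist concrete, here is a minimal sketch of how URLs could be checked against it. The matching rules are assumptions rather than SearchPak's documented behavior: the comparison is case-insensitive, subdomains of a listed domain are accepted, and an empty whitelist means no restriction. `is_allowed` and `filter_urls` are hypothetical names.

```python
from typing import Iterable, List, Sequence, Tuple
from urllib.parse import urlsplit


def is_allowed(url: str, whitelist: Sequence[str]) -> bool:
    """Return True if the URL's host is a whitelisted domain or a subdomain of one."""
    if not whitelist:
        return True  # assumption: no whitelist means everything is allowed
    host = (urlsplit(url).hostname or "").lower()
    for domain in whitelist:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


def filter_urls(urls: Iterable[str], whitelist: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (allowed, rejected) according to the whitelist."""
    allowed, rejected = [], []
    for url in urls:
        (allowed if is_allowed(url, whitelist) else rejected).append(url)
    return allowed, rejected
```

Under these assumptions, the request above would scrape https://example.com and skip https://other.com.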