stakpak/local_web_search

SearchPak - Local Search Solution

SearchPak is a local, self-hosted search service. It uses SearXNG for search results and httpx with BeautifulSoup or Playwright for scraping and enrichment, and it exposes both through a FastAPI REST API.

Features

  • Search: Proxies requests to a local SearXNG instance.
  • Scraping:
    • Static: Uses httpx and BeautifulSoup for fast scraping of static pages.
    • Dynamic: Uses Playwright to render JavaScript-heavy pages.
  • Enrichment: Extracts title, content, and metadata.
  • API: Clean REST API for programmatic access.
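To make the enrichment step concrete, here is a minimal sketch of pulling a title, text content, and metadata out of a static HTML page. It is not SearchPak's implementation: the project uses BeautifulSoup, while this sketch substitutes the standard library's `html.parser` so it runs without extra dependencies. The names `PageExtractor` and `enrich`, and the shape of the returned dictionary, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects the <title>, <meta name=...> tags, and visible text of a page."""

    SKIPPED = {"script", "style"}  # text inside these tags is not page content

    def __init__(self):
        super().__init__()
        self.title = ""
        self.metadata = {}
        self.text_parts = []
        self._stack = []  # open <title>, <script>, or <style> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.metadata[attrs["name"]] = attrs["content"]
        elif tag == "title" or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._stack[-1] if self._stack else None
        if current == "title":
            self.title = text
        elif current is None:
            self.text_parts.append(text)


def enrich(html):
    """Return title, content, and metadata for one HTML document (hypothetical shape)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "content": " ".join(parser.text_parts),
        "metadata": parser.metadata,
    }
```

A JavaScript-heavy page would return little or no content from a parser like this, which is the case the Playwright-based dynamic scraper is there to cover.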

Architecture

SearchPak ships as a single Docker image that runs both the FastAPI service and SearXNG as processes managed by supervisor.

  • api/: FastAPI application and routes.
  • search/: SearXNG client and normalization.
  • scraping/: URL classification and scraping logic.
  • utils/: Configuration.
  • supervisord.conf: Process management for the unified container.

Setup

Prerequisites

  • Docker and Docker Compose

Running

  1. Start the service:

    docker compose up -d --build
  2. The API will be available at http://localhost:8000. SearXNG will be available at http://localhost:8080.

API Usage

Search

curl -X POST "http://localhost:8000/search" \
     -H "Content-Type: application/json" \
     -d '{"query": "python fastapi", "limit": 5}'

Scrape

curl -X POST "http://localhost:8000/scrape" \
     -H "Content-Type: application/json" \
     -d '{"urls": ["https://example.com"]}'

Search and Scrape

curl -X POST "http://localhost:8000/search-and-scrape" \
     -H "Content-Type: application/json" \
     -d '{"query": "python fastapi", "scrape_limit": 3}'
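The three curl calls above share one shape: a JSON body sent with POST to an endpoint on port 8000. The following is a rough Python equivalent using only the standard library. The helper names `build_request` and `call` are hypothetical, and the sketch assumes the service replies with a JSON body, which the README does not spell out.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # address given in the Running section


def build_request(endpoint, payload):
    """Build a JSON POST request for a SearchPak endpoint such as "/search"."""
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint, payload, timeout=30):
    """Send the request and decode the reply (assumed to be JSON).

    Needs the service to be running locally.
    """
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container up, `call("/search-and-scrape", {"query": "python fastapi", "scrape_limit": 3})` would send the same request as the last curl example.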

Configuration

Environment variables can be set in .env or docker-compose.yml:

  • SEARXNG_BASE_URL: URL of the SearXNG instance.
  • PLAYWRIGHT_HEADLESS: Run Playwright in headless mode (default: True).
  • REQUEST_TIMEOUT: Timeout for requests in seconds.
  • DOMAIN_WHITELIST: Comma-separated list of allowed domains (e.g., example.com,google.com).
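As an illustration of how these variables might be read, here is a small sketch that loads them into a settings object. It is not the code in utils/. Only the `PLAYWRIGHT_HEADLESS` default of True comes from this README; the other defaults, the accepted boolean spellings, and the names `Settings` and `load_settings` are assumptions.

```python
import os
from dataclasses import dataclass, field


def _parse_bool(value):
    """Treat common truthy spellings as True (assumed convention)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def _parse_list(value):
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


@dataclass
class Settings:
    searxng_base_url: str = "http://localhost:8080"  # assumed default
    playwright_headless: bool = True  # default documented in this README
    request_timeout: float = 10.0  # assumed default, in seconds
    domain_whitelist: list = field(default_factory=list)  # empty: no restriction (assumed)


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        searxng_base_url=env.get("SEARXNG_BASE_URL", defaults.searxng_base_url),
        playwright_headless=_parse_bool(env.get("PLAYWRIGHT_HEADLESS", "true")),
        request_timeout=float(env.get("REQUEST_TIMEOUT", defaults.request_timeout)),
        domain_whitelist=_parse_list(env.get("DOMAIN_WHITELIST", "")),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to exercise without touching the real environment.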

Domain Whitelisting

You can restrict scraping to specific domains globally via DOMAIN_WHITELIST or per-request using the whitelist field:

curl -X POST "http://localhost:8000/scrape" \
     -H "Content-Type: application/json" \
     -d '{
       "urls": ["https://example.com", "https://other.com"],
       "whitelist": ["example.com"]
     }'
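One way such a whitelist check could work is sketched below. The README does not say whether subdomains of a listed domain are accepted or whether rejected URLs are dropped or reported as errors, so this sketch assumes subdomains match and rejected URLs are silently filtered out. The function names `is_allowed` and `filter_urls` are hypothetical.

```python
from urllib.parse import urlsplit


def is_allowed(url, whitelist):
    """Return True if the URL's host is a whitelisted domain or one of its subdomains.

    An empty whitelist allows everything (assumption).
    """
    if not whitelist:
        return True
    host = (urlsplit(url).hostname or "").lower()
    for domain in whitelist:
        domain = domain.lower()
        # Match the domain itself or any host ending in ".<domain>".
        if host == domain or host.endswith("." + domain):
            return True
    return False


def filter_urls(urls, whitelist):
    """Keep only the URLs whose host passes the whitelist check."""
    return [url for url in urls if is_allowed(url, whitelist)]
```

Applied to the request above, `filter_urls(["https://example.com", "https://other.com"], ["example.com"])` keeps only `https://example.com`. Comparing on a leading dot prevents a host such as `notexample.com` from matching `example.com`.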
