Upload a document — PDF, DOCX, TXT, HTML, or CSV — and get back a structured AI-powered summary with key points, detected language, and word count. Built as two separate services (FastAPI backend + Streamlit frontend) communicating over HTTP, mirroring real production architecture.
```
┌──────────────────┐        HTTP         ┌──────────────────────────────┐
│   Streamlit UI   │ ◄────────────────►  │       FastAPI Backend        │
│   (port 8501)    │                     │         (port 8000)          │
│                  │   POST /process     │                              │
│  Upload file ────┼──────────────────►  │  ┌────────────────────────┐  │
│                  │   { job_id }        │  │        Pipeline        │  │
│  Poll status ────┼──────────────────►  │  │  detector → extractor  │  │
│                  │   { status }        │  │  → cleaner → chunker   │  │
│  Fetch result ───┼──────────────────►  │  │  → LLM summarizer      │  │
│                  │   { summary }       │  └────────────────────────┘  │
└──────────────────┘                     └──────────────────────────────┘
```
- Upload — User selects a file in the Streamlit UI and clicks "Summarize it"
- Detect — python-magic reads file bytes to determine the MIME type
- Extract — Format-specific extractor pulls plain text (pypdf, python-docx, BeautifulSoup, csv)
- Clean — Unicode normalization (NFKC), null byte removal, whitespace collapsing
- Chunk — Text is split by paragraph boundaries into chunks of ~8000 characters
- Summarize — Each chunk is sent to Google Gemini, responses are validated against a Pydantic schema
- Merge — Multi-chunk documents get a final merge pass through the LLM
- Return — Structured JSON result: summary, key points, language, word count
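The clean and chunk steps can be sketched roughly as follows; the function names and exact regexes are illustrative, not the project's actual code:

```python
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize unicode (NFKC), drop null bytes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\x00", "")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # but preserve paragraph breaks
    return text.strip()


def chunk(text: str, max_size: int = 8000) -> list[str]:
    """Split on paragraph boundaries, packing paragraphs into ~max_size chunks."""
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A paragraph longer than `max_size` would still become a single oversized chunk in this sketch; a production chunker would need a fallback split.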
Jobs are processed asynchronously — the upload returns instantly with a job_id, and the frontend polls for status every 2 seconds.
| Format | Library | Notes |
|---|---|---|
| PDF | pypdf | Extracts text from all pages |
| DOCX | python-docx | Paragraph-level text extraction |
| HTML | beautifulsoup4 | Strips scripts, styles, and tags |
| CSV | csv (stdlib) | Converts rows to readable text |
| TXT | — | UTF-8 decode with error replacement |
```
pip install fastapi uvicorn google-generativeai pypdf python-docx beautifulsoup4 python-magic python-dotenv pydantic streamlit requests python-multipart
```

Create a .env file in the project root:

```
GEMINI_KEY=your_gemini_api_key
MAX_CHUNK_SIZE=8000
```

Get a Gemini API key at aistudio.google.com/apikey.
Start both services in separate terminals:

```
# Terminal 1 — Backend
uvicorn backend.main:app --reload

# Terminal 2 — Frontend
streamlit run frontend/app.py
```

Open http://localhost:8501 in your browser.
| Method | Endpoint | Description |
|---|---|---|
| POST | /process | Upload a file (multipart), returns { "job_id": "..." } |
| GET | /jobs/{job_id} | Poll job status: pending → processing → done / failed |
| GET | /jobs/{job_id}/result | Fetch result (only when status is done) |
You can test the API directly at http://localhost:8000/docs (auto-generated Swagger UI).
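Driven from Python with requests, the upload → poll → fetch flow looks roughly like this hypothetical client; the endpoint shapes come from the table above, and the polling interval and timeout mirror the frontend's documented behavior:

```python
import time

import requests

BASE = "http://localhost:8000"


def summarize(path: str, poll_interval: float = 2.0, timeout: float = 300.0) -> dict:
    """Upload a file, poll until the job finishes, then fetch the result."""
    with open(path, "rb") as f:
        resp = requests.post(f"{BASE}/process", files={"file": f})
    job_id = resp.json()["job_id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE}/jobs/{job_id}").json()["status"]
        if status == "done":
            return requests.get(f"{BASE}/jobs/{job_id}/result").json()
        if status == "failed":
            raise RuntimeError("summarization job failed")
        time.sleep(poll_interval)
    raise TimeoutError("gave up polling after 5 minutes")
```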
```
backend/
├── main.py              FastAPI app, endpoints, background pipeline runner
├── config.py            Environment variables (GEMINI_KEY, MAX_CHUNK_SIZE), Status enum
├── models.py            DocumentSummary Pydantic model
├── jobs.py              In-memory job store (dict keyed by UUID)
└── pipeline/
    ├── detector.py      MIME type detection via python-magic
    ├── extractor.py     Per-format text extraction (match/case routing)
    ├── cleaner.py       Text normalization (unicode, whitespace, null bytes)
    ├── chunker.py       Paragraph-aware text splitting
    ├── llm.py           Gemini API calls, JSON parsing, retry logic
    └── constants.py     MIME subtype → canonical type mapping
frontend/
└── app.py               Streamlit UI (upload, polling, result display, download)
```
- Unsupported file type — Backend raises ValueError, frontend shows an error message
- LLM quota exceeded (429) — Caught and reported as "AI service quota exceeded"
- LLM server errors (5xx) — Reported as "AI service temporarily unavailable"
- Invalid LLM response — Pydantic validation catches malformed JSON, retries once
- File too large — Backend rejects uploads over 50 MB with HTTP 413
- Polling timeout — Frontend stops polling after 5 minutes and shows timeout error
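The "retries once" behavior for invalid LLM responses can be sketched with Pydantic v2; the model fields mirror the documented result shape, and `call_llm` is a stand-in for the real Gemini call in `pipeline/llm.py`:

```python
from pydantic import BaseModel, ValidationError


class DocumentSummary(BaseModel):
    summary: str
    key_points: list[str]
    language: str
    word_count: int


def summarize_with_retry(call_llm, prompt: str) -> DocumentSummary:
    """Call the LLM; if the response is malformed JSON or fails schema
    validation, retry exactly once before giving up."""
    last_error: ValidationError | None = None
    for _attempt in range(2):  # initial attempt + one retry
        try:
            return DocumentSummary.model_validate_json(call_llm(prompt))
        except ValidationError as e:
            last_error = e
    raise last_error
```

Validating against the schema (rather than merely parsing JSON) also catches responses that are valid JSON but missing required fields.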