This is a workflow to automate the conversion of PDFs to markdown using the Mistral AI OCR API. It extracts text and images from PDFs and organizes the output into structured markdown documents with images properly linked using Obsidian-style wikilinks.
The initial version was a Jupyter Notebook. More recently, I vibe-coded a local web app that does the same thing in a more visual and approachable way. It may have some defects, so feel free to improve it or host it online for others.
You may also find the OCR Extractor plugin for Obsidian, made by jritzi (GitHub), useful. It uses the same Mistral OCR technology.
- Batch processing: Place multiple PDFs in the input folder and process them automatically.
- Text extraction: Converts scanned PDFs into structured markdown format while preserving document hierarchy.
- Image extraction: Saves images separately and links them in the markdown using the Obsidian-compatible ![[image-name]] format.
- Automatic organization: Each processed PDF gets its own output folder with the markdown and images.
- OCR caching: Saves the OCR response as JSON to avoid redundant API calls.
- Notebook mode: Run the OCR processing step by step in a Jupyter Notebook.
Contributions to improve compatibility and robustness are welcome!
pip install -r requirements.txt # (I recommend creating a virtual environment so you don't clutter your system)
python app.py
Then open your browser at http://localhost:5000/
The app inserts --- between PDF pages by default. Set the PAGE_SEPARATOR environment
variable to change this text or leave it empty to merge pages without separators.
The web interface also lets you toggle and edit the separator before processing.
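The separator behaviour described above can be sketched roughly as follows. This is an illustration only, not the app's actual code: the function name `join_pages` and the blank lines placed around the separator are assumptions; only the `PAGE_SEPARATOR` variable and the `---` default come from the description above.

```python
import os


def join_pages(pages, separator=None):
    """Join per-page markdown strings into one document.

    `join_pages` is a hypothetical helper, not a function from this project.
    """
    if separator is None:
        # Assumed lookup: fall back to "---" when PAGE_SEPARATOR is unset.
        separator = os.environ.get("PAGE_SEPARATOR", "---")
    separator = separator.strip()
    if not separator:
        # Empty separator: merge pages with only a blank line between them.
        return "\n\n".join(pages)
    # Surround the separator with blank lines so markdown renders it as a rule.
    return f"\n\n{separator}\n\n".join(pages)
```

Passing the separator explicitly mirrors what the web interface toggle does, while leaving it as `None` defers to the environment variable.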
Ensure you have Python 3.9+. Then install dependencies:
pip install mistralai jupyter python-dotenv
Before running the notebook, get your free API key from Mistral's API Key Console.
Edit env.example with your key, rename it to .env, and you're good to go.
Or set it manually:
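export MISTRAL_API_KEY='your_api_key_here' # For Linux/macOS
set MISTRAL_API_KEY=your_api_key_here
The second form is for the Windows Command Prompt. Leave out the quotes and any trailing comment there, because cmd stores everything after the = sign as part of the value.
Then start the notebook:
jupyter notebook pdf-markdown-ocr.ipynb
Or open the Notebook file directly in your IDE.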
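Before spending time on the OCR cells, it can help to confirm that the API key from the setup step is actually visible to Python. A minimal sketch, assuming the key is read from the `MISTRAL_API_KEY` environment variable; the helper name `get_api_key` is hypothetical and not part of the notebook:

```python
import os


def get_api_key(env=os.environ):
    """Return the Mistral API key or raise a readable error.

    Hypothetical helper for illustration; `env` is injectable for testing.
    """
    # Strip whitespace and stray quotes, which can sneak in when the
    # variable is set with quotes in the Windows Command Prompt.
    key = env.get("MISTRAL_API_KEY", "").strip().strip("'\"")
    if not key or key == "your_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set. Add it to .env or export it first."
        )
    return key
```

If you rely on the .env file, load it (for example with python-dotenv's `load_dotenv()`) before calling the helper so the variable is present in the environment.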
Before first use, create a pdfs_to_process folder in the project directory and drop your PDFs in there.
Go cell by cell and make sure everything runs as expected.
Each processed PDF gets its own folder inside ocr_output, structured like this:
ocr_output/
├── MyDocument/
│ ├── output.md # Extracted markdown with wikilinks
│ ├── ocr_response.json # Raw OCR response (for reuse)
│ ├── images/
│ │ ├── MyDocument_img_1.jpeg
│ │ ├── MyDocument_img_2.jpeg
pdfs-done/
├── MyDocument.pdf # Moved here after OCR completion
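The layout above can be derived from the PDF file name alone. The following sketch shows one way to compute the paths with `pathlib`; the helper names `output_paths` and `image_name` are made up for illustration, and the `.jpeg` extension is simply taken from the example tree.

```python
from pathlib import Path


def output_paths(pdf_path, output_root="ocr_output"):
    """Map a PDF path to the files shown in the tree above (hypothetical helper)."""
    doc_dir = Path(output_root) / Path(pdf_path).stem
    return {
        "dir": doc_dir,
        "markdown": doc_dir / "output.md",
        "response": doc_dir / "ocr_response.json",
        "images": doc_dir / "images",
    }


def image_name(pdf_path, index, ext="jpeg"):
    """Build an image file name such as MyDocument_img_1.jpeg."""
    return f"{Path(pdf_path).stem}_img_{index}.{ext}"
```

Prefixing each image with the document name keeps file names unique once images from several PDFs end up in the same Obsidian attachments folder.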
Move the generated output.md file into your Obsidian vault and also move the images to your attachments folder.
Heads up: for now, Obsidian must be configured to support ![[image-name]] style links. If your setup is different, the script might not work as-is. Feel free to fork and tweak it.
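If your setup expects standard markdown image links instead, one possible tweak is to post-process output.md. The sketch below is not part of the project: the function name `wikilinks_to_markdown` and the `images` folder default are assumptions, and it only handles image embeds of the form ![[name]] or ![[name|size]].

```python
import re

# Matches ![[name]] and ![[name|size]]; the optional part after "|" is dropped.
WIKILINK_IMAGE = re.compile(r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def wikilinks_to_markdown(markdown, image_dir="images"):
    """Rewrite Obsidian image embeds as standard markdown image links."""

    def repl(match):
        name = match.group(1).strip()
        # Note: file names containing spaces would need URL-encoding here.
        return f"![{name}]({image_dir}/{name})"

    return WIKILINK_IMAGE.sub(repl, markdown)
```

For example, `wikilinks_to_markdown(Path("output.md").read_text())` would return the converted text, which you can then write back to a new file.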
- The notebook scans pdfs_to_process for PDFs.
- Each PDF is uploaded to Mistral AI for OCR processing.
- The text is extracted and saved as markdown (output.md).
- Images are extracted, saved in a subfolder, and referenced in the markdown using ![[image-name]].
- The original PDF is moved to pdfs-done to avoid duplicate processing.
- The full OCR response is saved as JSON for later use.
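The steps above can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the notebook's code: `run_ocr` is a hypothetical callable standing in for the Mistral OCR request, the response is assumed to be a dict shaped like `{"pages": [{"markdown": "..."}]}`, and image decoding is left out to keep the example short.

```python
import json
import shutil
from pathlib import Path


def process_pdf(pdf_path, run_ocr, output_root="ocr_output", done_dir="pdfs-done"):
    """Process one PDF: reuse cached OCR JSON if present, write output.md, move the PDF."""
    pdf_path = Path(pdf_path)
    doc_dir = Path(output_root) / pdf_path.stem
    doc_dir.mkdir(parents=True, exist_ok=True)
    cache_file = doc_dir / "ocr_response.json"

    if cache_file.exists():
        # Cached response found: skip the (assumed) API call entirely.
        response = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        response = run_ocr(pdf_path)  # hypothetical stand-in for the OCR request
        cache_file.write_text(json.dumps(response, indent=2), encoding="utf-8")

    # Join page markdown with the default "---" separator.
    markdown = "\n\n---\n\n".join(page["markdown"] for page in response["pages"])
    (doc_dir / "output.md").write_text(markdown, encoding="utf-8")

    # Move the source PDF out of the input folder to avoid reprocessing it.
    done = Path(done_dir)
    done.mkdir(parents=True, exist_ok=True)
    shutil.move(str(pdf_path), str(done / pdf_path.name))
    return doc_dir


def process_folder(input_dir, run_ocr, **kwargs):
    """Run process_pdf on every PDF in input_dir, in name order."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [process_pdf(pdf, run_ocr, **kwargs) for pdf in pdfs]
```

Taking the OCR call as a parameter keeps the file handling testable without network access or an API key; in practice you would pass a function that wraps the Mistral client.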
