Reference implementation of a Retrieval-Augmented Generation (RAG) pipeline supporting ingestion, hybrid retrieval, reranking, and streaming responses.
- Ollama (local model serving)
- Fast / Low-latency
qwen2.5
qwen2.5:0.5b
mistral-small
nomic-embed-text
Ensure the following are installed:
- Docker & Docker Compose
- Ollama
Pull required models:
ollama pull qwen2.5
ollama pull qwen2.5:0.5b
ollama pull nomic-embed-text

docker compose build --no-cache
docker compose up

The API will be available at: http://localhost:8000
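
The containers may need a moment before the API accepts connections. A minimal sketch of a readiness check, assuming only that the server answers HTTP at the URL above: the function name `wait_for_api` and its attempt and delay parameters are hypothetical, and no dedicated health endpoint is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_api is a hypothetical helper, not part of the project.
# Polls a URL until the server answers or the attempt budget runs out.
wait_for_api() {
  local url="${1:-http://localhost:8000}"   # base URL from this README
  local attempts="${2:-30}"                 # assumed default: 30 tries
  local delay="${3:-2}"                     # assumed default: 2 seconds between tries
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # Without --fail, curl exits 0 on any HTTP response (even 404),
    # so success here only means the server is accepting connections.
    if curl --silent --output /dev/null --max-time 2 "$url"; then
      echo "API reachable at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "API not reachable at $url after $attempts attempts" >&2
  return 1
}
```

Running `wait_for_api` with no arguments before the first request gives the stack up to about a minute to start. A non-zero exit status means the API never answered.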
Ingest sample documents to populate the retrieval index.
curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{
"text": "A backup failure occurs when data cannot be written to storage or restored properly.",
"doc_id": "doc1"
}'

curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{
"text": "A database transaction failure happens when ACID properties are violated during commit.",
"doc_id": "doc2"
}'

curl -X POST http://localhost:8000/ingest \
-H "Content-Type: application/json" \
-d '{
"text": "Ollama is a local runtime for running large language models like Qwen and Mistral.",
"doc_id": "doc3"
}'

curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{
"question": "What happens when a backup fails?"
}'

- Retrieves semantically similar content
- Returns explanation of backup failure
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{
"question": "ACID transaction commit failure database"
}'

- Keyword matching is effective
- Relevant document about ACID violations is retrieved
curl -X POST http://localhost:8000/ask \
-H "Content-Type: application/json" \
-d '{
"question": "Why does my system fail to save data when something breaks?"
}'

- Query rewriting improves retrieval
- Correct document is retrieved despite vague phrasing
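
The ingest and query requests above differ only in their JSON fields, so they could be wrapped in small shell helpers. The following is a sketch, not part of the project: the names `API_URL`, `json_escape`, `ingest_payload`, `ask_payload`, `post_json`, `ingest_doc`, and `ask` are all hypothetical. Only the `/ingest` and `/ask` paths and the `text`, `doc_id`, and `question` fields come from the commands shown here.

```shell
#!/usr/bin/env bash
# Sketch only: every function and variable name here is a hypothetical helper.
API_URL="${API_URL:-http://localhost:8000}"

# Escape backslashes and double quotes so a value can sit inside a JSON string.
# Limitation: single-line input only; newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body for /ingest from TEXT and DOC_ID.
ingest_payload() {
  printf '{"text": "%s", "doc_id": "%s"}' "$(json_escape "$1")" "$(json_escape "$2")"
}

# Build the request body for /ask from QUESTION.
ask_payload() {
  printf '{"question": "%s"}' "$(json_escape "$1")"
}

# POST a JSON body ($2) to an endpoint path ($1) such as /ingest or /ask.
post_json() {
  curl --silent --show-error -X POST "${API_URL}$1" \
    -H "Content-Type: application/json" \
    -d "$2"
}

# Convenience wrappers mirroring the curl commands in this README.
ingest_doc() { post_json /ingest "$(ingest_payload "$1" "$2")"; }
ask() { post_json /ask "$(ask_payload "$1")"; }
```

With these defined, `ingest_doc "Some text" "doc4"` followed by `ask "What happens when a backup fails?"` sends the same requests as the longer commands. Escaping the values first keeps quotes inside a document or question from breaking the JSON body.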
- Ragas — retrieval and answer quality evaluation
- DeepEval — behavioral and LLM evaluation
