SDP_LLM (June 2025)
A lightweight chatbot app that lets users upload a PDF and ask questions about its content using Google's Gemini LLM. It integrates PDF uploading, text extraction, embedding generation, vector storage, and Retrieval Q&A, all in one seamless pipeline.
- PDF Upload and Text Extraction: The user uploads a PDF file via a Gradio interface, and the text is extracted using pdfplumber.
- Text Preprocessing: The text is split into manageable chunks using LangChain's CharacterTextSplitter.
- Embeddings & Vector Store: Each chunk is embedded using HuggingFaceEmbeddings. Chunks are stored in Chroma, a vector database. This allows for efficient semantic similarity searches when a user asks a question.
- Question Answering Chain: When a user inputs a question, the chatbot retrieves the most relevant text chunks from ChromaDB and sends them to the Gemini model (gemini-2.5-pro). LangChain’s RetrievalQA chain handles this process.
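The pipeline steps above can be sketched end to end as follows. This is a minimal sketch, not the project's actual code: the function names, the chunk size and overlap, the number of retrieved chunks `k`, and the embedding model name `sentence-transformers/all-MiniLM-L6-v2` are all assumptions chosen for illustration. LangChain import paths have moved between releases, and `HuggingFaceEmbeddings` typically also needs the `sentence-transformers` package, so adjust to the versions you have installed.

```python
"""Sketch of the PDF question-answering pipeline (names and defaults are assumptions)."""


def join_pages(page_texts):
    """Join per-page text, skipping pages where extraction returned None or ''."""
    return "\n".join(text for text in page_texts if text)


def extract_text(pdf_path):
    """Extract the text of every page in the PDF using pdfplumber."""
    import pdfplumber  # imported lazily so join_pages works without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return join_pages(page.extract_text() for page in pdf.pages)


def build_qa_chain(text, chunk_size=1000, chunk_overlap=200, k=4):
    """Split the text, embed and store the chunks, and wrap them in a RetrievalQA chain."""
    from langchain.chains import RetrievalQA
    from langchain.text_splitter import CharacterTextSplitter
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_google_genai import ChatGoogleGenerativeAI

    # Text preprocessing: split on newlines into overlapping chunks.
    splitter = CharacterTextSplitter(
        separator="\n", chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_text(text)

    # Embeddings and vector store: the model name is an assumed default.
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    store = Chroma.from_texts(chunks, embedding=embeddings)

    # Question answering: Gemini reads GOOGLE_API_KEY from the environment.
    llm = ChatGoogleGenerativeAI(model="gemini-2.5-pro")
    retriever = store.as_retriever(search_kwargs={"k": k})
    return RetrievalQA.from_chain_type(llm=llm, retriever=retriever)


def answer(pdf_path, question):
    """Run the whole pipeline for one PDF and one question."""
    chain = build_qa_chain(extract_text(pdf_path))
    return chain.invoke({"query": question})["result"]
```

In a real app the chain would usually be built once per uploaded PDF and reused for later questions, since embedding the chunks is the slow step.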
- Programming Language: Python is used to build and deploy the app
- Libraries Used:
↣ Gradio: Provides the web interface for uploading a PDF and asking questions
↣ pdfplumber: For extracting text from PDF files
↣ LangChain: For creating the retrieval-based QA chain
↣ ChromaDB: Vector database used to store and retrieve text embeddings
- Large Language Model (LLM): Google Generative AI (gemini-2.5-pro)
- Embeddings: HuggingFaceEmbeddings, used to convert text chunks into numerical vectors
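As a rough illustration of how the Gradio front end might connect to the rest of the stack, the sketch below wraps a question-answering callable in a `gr.Interface`. The names `resolve_upload_path`, `build_ui`, and `answer_fn` are hypothetical, and `answer_fn` is assumed to take a PDF path and a question and return a string. Depending on the Gradio version, a file input passes either a path string or a temp-file-like object, so the helper accepts both.

```python
def resolve_upload_path(uploaded):
    """Return a filesystem path for a Gradio file upload.

    Some Gradio versions pass a path string, others an object with a
    .name attribute that holds the temporary file path.
    """
    if uploaded is None:
        raise ValueError("Please upload a PDF first.")
    return uploaded if isinstance(uploaded, str) else uploaded.name


def build_ui(answer_fn):
    """Build a Gradio interface around answer_fn(pdf_path, question) -> str."""
    import gradio as gr  # imported lazily so the helper above works without it

    def respond(pdf_file, question):
        return answer_fn(resolve_upload_path(pdf_file), question)

    return gr.Interface(
        fn=respond,
        inputs=[
            gr.File(label="PDF", file_types=[".pdf"]),
            gr.Textbox(label="Question"),
        ],
        outputs=gr.Textbox(label="Answer"),
        title="SDP_LLM",
    )


# Usage (assumes you have some answer function of the right shape):
# build_ui(my_answer_fn).launch()
```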
- Install Dependencies:
pip install langchain chromadb gradio google-generativeai pdfplumber transformers langchain-google-genai langchain-community
- Setting API Key:
import os
os.environ['GOOGLE_API_KEY'] = 'your_google_api_key_here'
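Writing the key directly into source code makes it easy to leak by accident, for example through version control. One possible alternative, sketched below, is to read the key from the environment and prompt for it only when it is missing. The helper name `ensure_google_api_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_google_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # getpass hides the typed key so it does not appear on screen.
        key = getpass(f"Enter {env_var}: ").strip()
        if not key:
            raise RuntimeError(f"{env_var} is required to call Gemini.")
        # Store it for this process so later library calls can find it.
        os.environ[env_var] = key
    return key
```

Calling `ensure_google_api_key()` once at startup, before the Gemini model is created, keeps the key out of the codebase while still setting `GOOGLE_API_KEY` for the current process.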