AI Research Paper Chatbot

A comprehensive system that crawls, processes, and enables conversational interaction with AI research papers from arXiv. It uses a Retrieval-Augmented Generation (RAG) architecture for intelligent querying of research papers and can also summarize external papers on demand.

Features

  • Automated ArXiv Crawling: Crawls research papers from specific categories (Computer Science, Machine Learning, AI, etc.)
  • PDF Processing: Converts PDFs to markdown format for efficient text processing
  • Vector Database Storage: Uses ChromaDB for semantic search and document retrieval
  • Interactive Chatbot: Streamlit-based interface for querying research papers
  • External Paper Summarization: Fetch and summarize papers from arXiv using paper codes
  • Conversational Memory: Maintains context across conversations with prompt and response summarization

Requirements

Install the required dependencies using pip:

pip install -r requirements.txt

Core Dependencies

  • Python 3.8+
  • Streamlit (for web interface)
  • ChromaDB (vector database)
  • LlamaIndex (document indexing and retrieval)
  • PyMuPDF4LLM (PDF processing)
  • Transformers (language models)
  • Prefect (workflow orchestration)
  • Ollama (language model serving)

Setup and Configuration

1. ChromaDB Setup

Start ChromaDB server using Docker:

cd src/chromadb
docker-compose up -d

The ChromaDB server will be available on port 8300.
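Once the server is up, application code can talk to it through Chroma's Python client. A minimal sketch, assuming the host above and a collection named "arxiv_papers" (the actual collection name is whatever the indexing pipeline created):

```python
def get_chroma_collection(host="172.16.87.75", port=8300, name="arxiv_papers"):
    """Connect to the ChromaDB server started above and fetch a collection.

    The collection name "arxiv_papers" is an assumption; use whatever
    name the indexing pipeline created.
    """
    import chromadb  # third-party; listed in requirements.txt

    client = chromadb.HttpClient(host=host, port=port)
    return client.get_or_create_collection(name)
```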

2. Ollama Setup

Start the Ollama server for language model inference:

cd ollama
docker-compose up -d

The Ollama server will be available on port 16122.
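With the server running, any HTTP client can call Ollama's /api/generate endpoint. A stdlib-only sketch, assuming the port above and a placeholder model name ("llama3") that must match whichever model the server has actually pulled:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:16122/api/generate"  # port from the compose file


def build_generate_payload(prompt, model="llama3"):
    # "llama3" is a placeholder; substitute the model the server has pulled.
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(prompt, model="llama3"):
    """Send a single non-streaming generation request and return the text."""
    data = json.dumps(build_generate_payload(prompt, model)).encode()
    req = request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```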

3. Environment Configuration

The system requires specific proxy configurations and host settings. Update the host addresses in the code files if needed:

  • ChromaDB host: 172.16.87.75:8300
  • Ollama host configuration in docker-compose files

Running the Demo

Option 1: Full Data Pipeline

Run the complete pipeline to crawl, process, and index papers:

  1. Start the Prefect server:
prefect server start
  2. Run the data pipeline:
cd src/pipelines/data_pipeline
python data_pipeline.py

This will:

  • Crawl new papers from arXiv
  • Convert PDFs to markdown
  • Index documents in ChromaDB
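Conceptually, the pipeline is these three stages chained together. The sketch below mirrors that flow with plain placeholder functions; in the repo the stages are orchestrated as Prefect tasks inside data_pipeline.py, and the function names and return values here are illustrative, not the actual ones:

```python
def crawl_papers():
    # Stage 1: fetch new paper PDFs from arXiv (placeholder result).
    return ["2101.00001v1.pdf"]


def convert_to_markdown(pdfs):
    # Stage 2: convert each PDF to a markdown document.
    return [p.replace(".pdf", ".md") for p in pdfs]


def index_documents(docs):
    # Stage 3: embed and store the documents in ChromaDB;
    # here we just report how many would be indexed.
    return len(docs)


def run_pipeline():
    pdfs = crawl_papers()
    docs = convert_to_markdown(pdfs)
    return index_documents(docs)
```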

Option 2: Direct Chatbot Usage

If you already have processed documents in ChromaDB:

  1. Start the chatbot interface:
cd chat-bot
streamlit run bot.py
  2. Access the web interface: open your browser and navigate to http://localhost:8501

Option 3: Individual Components

Crawl ArXiv papers only:

cd src/crawlers
python category_crawler.py
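Under the hood, category crawling amounts to querying arXiv's public Atom API. A sketch of how such a query URL can be built, using cs.LG as an example category (the exact parameters used by category_crawler.py may differ):

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def category_query_url(category="cs.LG", start=0, max_results=10):
    """Build a query URL for arXiv's public Atom API for one category,
    newest submissions first."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"
```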

Process existing PDFs:

cd src/preprocessor
python preprocess.py
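The conversion step boils down to one PyMuPDF4LLM call per file. A minimal sketch, assuming an output directory of markdown/ (the actual file layout in preprocess.py may differ):

```python
from pathlib import Path


def pdf_to_markdown(pdf_path, out_dir="markdown"):
    """Convert one PDF to markdown and write it next to its peers.

    The "markdown" output directory is an assumption for illustration.
    """
    import pymupdf4llm  # third-party; listed in requirements.txt

    md_text = pymupdf4llm.to_markdown(pdf_path)
    out = Path(out_dir) / (Path(pdf_path).stem + ".md")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(md_text, encoding="utf-8")
    return out
```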

Usage

Chatbot Interface

The Streamlit interface supports two main interaction modes:

  1. General Queries: Ask questions about research topics; the system retrieves relevant papers from the database to answer them.

  2. External Paper Summarization: Use the format "Summary this external paper: [arXiv_code]" to fetch and summarize papers directly from arXiv (e.g., "Summary this external paper: 1706.03762v7").
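Dispatching on that command comes down to pattern-matching the arXiv identifier out of the message. A sketch of how the parsing could work (the actual logic in bot.py may differ):

```python
import re

# Matches the chat command described above, e.g.
# "Summary this external paper: 1706.03762v7"
COMMAND = re.compile(
    r"Summary this external paper:\s*(\d{4}\.\d{4,5}(?:v\d+)?)",
    re.IGNORECASE,
)


def extract_arxiv_code(message):
    """Return the arXiv code from a summarization command, or None."""
    m = COMMAND.search(message)
    return m.group(1) if m else None
```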

API Usage

The system also provides API endpoints for programmatic access:

cd chat-bot
python serving.py
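A client can then talk to the service over HTTP. The sketch below is purely illustrative: the /chat route, port 8000, and the "message" field are hypothetical; check serving.py for the real endpoints and payload shape:

```python
import json
from urllib import request


def build_chat_request(question, url="http://localhost:8000/chat"):
    """Build an HTTP request for the serving API.

    The /chat route, port, and JSON field name are hypothetical;
    see serving.py for the actual interface.
    """
    data = json.dumps({"message": question}).encode()
    return request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )


def query_chatbot(question):
    with request.urlopen(build_chat_request(question)) as resp:
        return json.loads(resp.read())
```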

Project Structure

AI-Research-Paper-Chatbot/
├── chat-bot/              # Streamlit chatbot interface and API
├── src/
│   ├── crawlers/          # ArXiv paper crawling scripts
│   ├── preprocessor/      # PDF processing and indexing
│   ├── pipelines/         # Data pipeline orchestration
│   └── chromadb/          # ChromaDB configuration and setup
├── ollama/                # Ollama model serving setup
└── requirements.txt       # Python dependencies

Notes

  • The system is configured for specific network environments with proxy settings
  • ChromaDB and Ollama services must be running before starting the chatbot
  • The data pipeline can take significant time depending on the number of papers being processed
  • Processed PDFs are logged to avoid reprocessing in subsequent runs
