A comprehensive system that crawls, processes, and provides conversational interaction with AI research papers from arXiv. The system uses a RAG (Retrieval-Augmented Generation) architecture to enable intelligent querying of research papers and can also summarize external papers on demand.
- Automated ArXiv Crawling: Crawls research papers from specific categories (Computer Science, Machine Learning, AI, etc.)
- PDF Processing: Converts PDFs to markdown format for efficient text processing
- Vector Database Storage: Uses ChromaDB for semantic search and document retrieval
- Interactive Chatbot: Streamlit-based interface for querying research papers
- External Paper Summarization: Fetches and summarizes papers from arXiv using paper codes
- Conversational Memory: Maintains context across conversations with prompt and response summarization
Install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

The main dependencies are listed below; a matching requirements.txt sketch follows the list.
- Python 3.8+
- Streamlit (for web interface)
- ChromaDB (vector database)
- LlamaIndex (document indexing and retrieval)
- PyMuPDF4LLM (PDF processing)
- Transformers (language models)
- Prefect (workflow orchestration)
- Ollama (language model serving)
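A requirements.txt along these lines would cover the list above; the standard PyPI package names are used here, which is an assumption about the repo's actual file (pin versions as needed):

```text
streamlit
chromadb
llama-index
pymupdf4llm
transformers
prefect
ollama
```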
Start ChromaDB server using Docker:

```bash
cd src/chromadb
docker-compose up -d
```

The ChromaDB server will be available on port 8300.
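To verify the server is reachable, a quick connectivity check from Python (the host and port follow the configuration described in this README):

```python
# Quick connectivity check against the ChromaDB server
import chromadb

client = chromadb.HttpClient(host="172.16.87.75", port=8300)
print(client.heartbeat())  # returns a nanosecond heartbeat when the server is up
```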
Start the Ollama server for language model inference:

```bash
cd ollama
docker-compose up -d
```

The Ollama server will be available on port 16122.
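A quick smoke test through LlamaIndex's Ollama wrapper; the model name and localhost base URL are assumptions, so substitute whatever the container actually serves:

```python
# Smoke test for the Ollama server (model name is a placeholder)
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=120.0)
print(llm.complete("Say hello in one sentence."))
```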
The system requires specific proxy configurations and host settings. Update the host addresses in the code files if needed (a hypothetical example follows this list):
- ChromaDB host: `172.16.87.75:8300`
- Ollama host: configured in the docker-compose files
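The settings to update typically look something like this; the variable names here are hypothetical, so check the actual code files:

```python
# Hypothetical configuration constants — the real names and locations
# in the code files may differ; adjust to your network
CHROMADB_HOST = "172.16.87.75"
CHROMADB_PORT = 8300
OLLAMA_BASE_URL = "http://localhost:16122"
```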
Run the complete pipeline to crawl, process, and index papers:
- Start Prefect server:

  ```bash
  prefect server start
  ```

- Run the data pipeline:

  ```bash
  cd src/pipelines/data_pipeline
  python data_pipeline.py
  ```

This will (see the flow sketch after this list):
- Crawl new papers from arXiv
- Convert PDFs to markdown
- Index documents in ChromaDB
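The pipeline's overall shape, as a minimal Prefect sketch; the task bodies are stubs, since the real crawl, convert, and index logic lives in the pipeline modules:

```python
# Minimal sketch of the pipeline's structure with Prefect (stub task bodies;
# the real crawl/convert/index implementations live in the pipeline modules)
from typing import List

from prefect import flow, task

@task
def crawl_papers() -> List[str]:
    """Crawl new papers from arXiv and return local PDF paths."""
    return ["paper.pdf"]  # placeholder

@task
def convert_to_markdown(pdf_paths: List[str]) -> List[str]:
    """Convert each PDF to a markdown file."""
    return [p.replace(".pdf", ".md") for p in pdf_paths]  # placeholder

@task
def index_documents(md_paths: List[str]) -> None:
    """Index the markdown documents in ChromaDB."""
    print(f"Indexed {len(md_paths)} documents")

@flow
def data_pipeline():
    pdfs = crawl_papers()
    mds = convert_to_markdown(pdfs)
    index_documents(mds)

if __name__ == "__main__":
    data_pipeline()
```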
If you already have processed documents in ChromaDB:
- Start the chatbot interface:

  ```bash
  cd chat-bot
  streamlit run bot.py
  ```

- Access the web interface: open your browser and navigate to http://localhost:8501 (a sketch of the retrieval flow behind the bot follows this list)
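Under the hood, answering a general query is roughly a RAG loop over the indexed papers. A minimal sketch, assuming a collection named "papers", a HuggingFace embedding model, and an Ollama model name — all placeholders for whatever bot.py actually uses:

```python
# Minimal sketch of the chatbot's retrieval flow (collection name, embedding
# model, and LLM name are placeholders — check bot.py for the real ones)
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Connect to the running ChromaDB server and open the paper collection
client = chromadb.HttpClient(host="172.16.87.75", port=8300)
collection = client.get_or_create_collection("papers")  # placeholder name

# Wrap the collection as a LlamaIndex vector store and build a query engine
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
)
llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=120.0)
query_engine = index.as_query_engine(llm=llm)

print(query_engine.query("What are recent approaches to efficient attention?"))
```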
Crawl ArXiv papers only:

```bash
cd src/crawlers
python category_crawler.py
```
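Category crawling boils down to querying the public arXiv Atom API; a minimal sketch (the category and result count are illustrative, and category_crawler.py may work differently):

```python
# Minimal sketch of category crawling via the public arXiv Atom API
# (category and max_results are illustrative choices)
import urllib.request
import xml.etree.ElementTree as ET

URL = "http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=5"
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(URL) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.find(f"{ATOM}title").text.strip()
    # each entry carries several <link> elements; the PDF one is titled "pdf"
    pdf_link = next(
        (link.get("href") for link in entry.findall(f"{ATOM}link")
         if link.get("title") == "pdf"),
        None,
    )
    print(title, "->", pdf_link)
```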
Process existing PDFs only:

```bash
cd src/preprocessor
python preprocess.py
```
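The conversion step itself is a thin wrapper around PyMuPDF4LLM; a minimal sketch with placeholder file paths (preprocess.py may batch entire directories and log processed files):

```python
# Minimal sketch of the PDF-to-markdown step (file paths are placeholders)
import pathlib
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("paper.pdf")  # extract the PDF as markdown
pathlib.Path("paper.md").write_text(md_text, encoding="utf-8")
```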
The Streamlit interface supports two main interaction modes:
- General Queries: ask questions about research topics, and the system retrieves relevant papers from the database to answer them
- External Paper Summarization: use the format "Summary this external paper: [arXiv_code]" to fetch and summarize papers directly from arXiv (e.g., "Summary this external paper: 1706.03762v7"), as sketched below
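External summarization amounts to downloading the PDF by its arXiv code, converting it, and prompting the local model. A minimal sketch; the model name, prompt, and truncation limit are assumptions:

```python
# Minimal sketch of external-paper summarization (model name, prompt, and
# the character cap are placeholders — the bot's actual pipeline may differ)
import urllib.request
import pymupdf4llm
from llama_index.llms.ollama import Ollama

code = "1706.03762v7"
pdf_path = f"{code}.pdf"
urllib.request.urlretrieve(f"https://arxiv.org/pdf/{code}", pdf_path)

md_text = pymupdf4llm.to_markdown(pdf_path)
llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=300.0)
print(llm.complete(f"Summarize this paper:\n\n{md_text[:20000]}"))
```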
The system also provides API endpoints for programmatic access:

```bash
cd chat-bot
python serving.py
```
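A hypothetical example of calling the serving API; the route, port, and payload shape are all assumptions, so check serving.py for the actual endpoints:

```python
# Hypothetical API call — the route, port, and JSON shape are placeholders
import requests

resp = requests.post(
    "http://localhost:8000/query",  # placeholder host/port/route
    json={"question": "What is retrieval-augmented generation?"},
    timeout=60,
)
print(resp.json())
```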
Project structure:

```
AI-Research-Paper-Chatbot/
├── chat-bot/            # Streamlit chatbot interface and API
├── src/
│   ├── crawlers/        # ArXiv paper crawling scripts
│   ├── preprocessor/    # PDF processing and indexing
│   ├── pipelines/       # Data pipeline orchestration
│   └── chromadb/        # ChromaDB configuration and setup
├── ollama/              # Ollama model serving setup
└── requirements.txt     # Python dependencies
```
Notes:
- The system is configured for specific network environments with proxy settings
- ChromaDB and Ollama services must be running before starting the chatbot
- The data pipeline can take significant time depending on the number of papers being processed
- Processed PDFs are logged to avoid reprocessing in subsequent runs