A comprehensive system that crawls, processes, and provides conversational interaction with AI research papers from arXiv. The system uses a RAG (Retrieval-Augmented Generation) architecture to enable intelligent querying of research papers and can also summarize external papers on demand.
- Automated ArXiv Crawling: Crawls research papers from specific categories (Computer Science, Machine Learning, AI, etc.)
- PDF Processing: Converts PDFs to markdown format for efficient text processing
- Vector Database Storage: Uses ChromaDB for semantic search and document retrieval
- Interactive Chatbot: Streamlit-based interface for querying research papers
- External Paper Summarization: Fetches and summarizes papers from arXiv using paper codes
- Conversational Memory: Maintains context across conversations with prompt and response summarization
Install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

The main dependencies are listed below; a matching requirements.txt sketch follows the list.
- Python 3.8+
- Streamlit (for web interface)
- ChromaDB (vector database)
- LlamaIndex (document indexing and retrieval)
- PyMuPDF4LLM (PDF processing)
- Transformers (language models)
- Prefect (workflow orchestration)
- Ollama (language model serving)
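A requirements.txt along these lines would cover the list above; the standard PyPI package names are used here, which is an assumption about the repo's actual file (pin versions as needed):

```text
streamlit
chromadb
llama-index
pymupdf4llm
transformers
prefect
ollama
```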
Start ChromaDB server using Docker:

```bash
cd src/chromadb
docker-compose up -d
```

The ChromaDB server will be available on port 8300.
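To verify the server is reachable, a quick connectivity check from Python (the host and port follow the configuration described in this README):

```python
# Quick connectivity check against the ChromaDB server
import chromadb

client = chromadb.HttpClient(host="172.16.87.75", port=8300)
print(client.heartbeat())  # returns a nanosecond heartbeat when the server is up
```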
Start the Ollama server for language model inference:

```bash
cd ollama
docker-compose up -d
```

The Ollama server will be available on port 16122.
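A quick smoke test through LlamaIndex's Ollama wrapper; the model name and localhost base URL are assumptions, so substitute whatever the container actually serves:

```python
# Smoke test for the Ollama server (model name is a placeholder)
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=120.0)
print(llm.complete("Say hello in one sentence."))
```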
The system requires specific proxy configurations and host settings. Update the host addresses in the code files if needed (a hypothetical example follows this list):
- ChromaDB host: `172.16.87.75:8300`
- Ollama host: configured in the docker-compose files
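The settings to update typically look something like this; the variable names here are hypothetical, so check the actual code files:

```python
# Hypothetical configuration constants — the real names and locations
# in the code files may differ; adjust to your network
CHROMADB_HOST = "172.16.87.75"
CHROMADB_PORT = 8300
OLLAMA_BASE_URL = "http://localhost:16122"
```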
Run the complete pipeline to crawl, process, and index papers:
- Start Prefect server:

  ```bash
  prefect server start
  ```

- Run the data pipeline:

  ```bash
  cd src/pipelines/data_pipeline
  python data_pipeline.py
  ```

This will (see the flow sketch after this list):
- Crawl new papers from arXiv
- Convert PDFs to markdown
- Index documents in ChromaDB
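The pipeline's overall shape, as a minimal Prefect sketch; the task bodies are stubs, since the real crawl, convert, and index logic lives in the pipeline modules:

```python
# Minimal sketch of the pipeline's structure with Prefect (stub task bodies;
# the real crawl/convert/index implementations live in the pipeline modules)
from typing import List

from prefect import flow, task

@task
def crawl_papers() -> List[str]:
    """Crawl new papers from arXiv and return local PDF paths."""
    return ["paper.pdf"]  # placeholder

@task
def convert_to_markdown(pdf_paths: List[str]) -> List[str]:
    """Convert each PDF to a markdown file."""
    return [p.replace(".pdf", ".md") for p in pdf_paths]  # placeholder

@task
def index_documents(md_paths: List[str]) -> None:
    """Index the markdown documents in ChromaDB."""
    print(f"Indexed {len(md_paths)} documents")

@flow
def data_pipeline():
    pdfs = crawl_papers()
    mds = convert_to_markdown(pdfs)
    index_documents(mds)

if __name__ == "__main__":
    data_pipeline()
```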
If you already have processed documents in ChromaDB:
- Start the chatbot interface:

  ```bash
  cd chat-bot
  streamlit run bot.py
  ```

- Access the web interface: open your browser and navigate to http://localhost:8501 (a sketch of the retrieval flow behind the bot follows this list)
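Under the hood, answering a general query is roughly a RAG loop over the indexed papers. A minimal sketch, assuming a collection named "papers", a HuggingFace embedding model, and an Ollama model name — all placeholders for whatever bot.py actually uses:

```python
# Minimal sketch of the chatbot's retrieval flow (collection name, embedding
# model, and LLM name are placeholders — check bot.py for the real ones)
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# Connect to the running ChromaDB server and open the paper collection
client = chromadb.HttpClient(host="172.16.87.75", port=8300)
collection = client.get_or_create_collection("papers")  # placeholder name

# Wrap the collection as a LlamaIndex vector store and build a query engine
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
)
llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=120.0)
query_engine = index.as_query_engine(llm=llm)

print(query_engine.query("What are recent approaches to efficient attention?"))
```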
Crawl ArXiv papers only:

```bash
cd src/crawlers
python category_crawler.py
```
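Category crawling boils down to querying the public arXiv Atom API; a minimal sketch (the category and result count are illustrative, and category_crawler.py may work differently):

```python
# Minimal sketch of category crawling via the public arXiv Atom API
# (category and max_results are illustrative choices)
import urllib.request
import xml.etree.ElementTree as ET

URL = "http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=0&max_results=5"
ATOM = "{http://www.w3.org/2005/Atom}"

with urllib.request.urlopen(URL) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.find(f"{ATOM}title").text.strip()
    # each entry carries several <link> elements; the PDF one is titled "pdf"
    pdf_link = next(
        (link.get("href") for link in entry.findall(f"{ATOM}link")
         if link.get("title") == "pdf"),
        None,
    )
    print(title, "->", pdf_link)
```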
Process existing PDFs only:

```bash
cd src/preprocessor
python preprocess.py
```
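The conversion step itself is a thin wrapper around PyMuPDF4LLM; a minimal sketch with placeholder file paths (preprocess.py may batch entire directories and log processed files):

```python
# Minimal sketch of the PDF-to-markdown step (file paths are placeholders)
import pathlib
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("paper.pdf")  # extract the PDF as markdown
pathlib.Path("paper.md").write_text(md_text, encoding="utf-8")
```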
The Streamlit interface supports two main interaction modes:
- General Queries: ask questions about research topics, and the system retrieves relevant papers from the database to answer them
- External Paper Summarization: use the format "Summary this external paper: [arXiv_code]" to fetch and summarize papers directly from arXiv (e.g., "Summary this external paper: 1706.03762v7"), as sketched below
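External summarization amounts to downloading the PDF by its arXiv code, converting it, and prompting the local model. A minimal sketch; the model name, prompt, and truncation limit are assumptions:

```python
# Minimal sketch of external-paper summarization (model name, prompt, and
# the character cap are placeholders — the bot's actual pipeline may differ)
import urllib.request
import pymupdf4llm
from llama_index.llms.ollama import Ollama

code = "1706.03762v7"
pdf_path = f"{code}.pdf"
urllib.request.urlretrieve(f"https://arxiv.org/pdf/{code}", pdf_path)

md_text = pymupdf4llm.to_markdown(pdf_path)
llm = Ollama(model="llama3", base_url="http://localhost:16122", request_timeout=300.0)
print(llm.complete(f"Summarize this paper:\n\n{md_text[:20000]}"))
```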
The system also provides API endpoints for programmatic access:

```bash
cd chat-bot
python serving.py
```
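A hypothetical example of calling the serving API; the route, port, and payload shape are all assumptions, so check serving.py for the actual endpoints:

```python
# Hypothetical API call — the route, port, and JSON shape are placeholders
import requests

resp = requests.post(
    "http://localhost:8000/query",  # placeholder host/port/route
    json={"question": "What is retrieval-augmented generation?"},
    timeout=60,
)
print(resp.json())
```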
Project structure:

```
AI-Research-Paper-Chatbot/
├── chat-bot/            # Streamlit chatbot interface and API
├── src/
│   ├── crawlers/        # ArXiv paper crawling scripts
│   ├── preprocessor/    # PDF processing and indexing
│   ├── pipelines/       # Data pipeline orchestration
│   └── chromadb/        # ChromaDB configuration and setup
├── ollama/              # Ollama model serving setup
└── requirements.txt     # Python dependencies
```
Notes:
- The system is configured for specific network environments with proxy settings
- ChromaDB and Ollama services must be running before starting the chatbot
- The data pipeline can take significant time depending on the number of papers being processed
- Processed PDFs are logged to avoid reprocessing in subsequent runs