
# LLM Chat

A ChatML chat interface integrated with RAG, web search, and more for LLMs and VLMs. Simply plug in your v1 endpoint from OpenAI, Google AI Studio, Ollama, vLLM, or any other OpenAI-compatible server.
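
Any endpoint that speaks the OpenAI chat completions protocol will do. As a rough sketch (the base URL, API key, and model name below are placeholders, not values from this project):

```bash
# Minimal sketch: chat completion against an OpenAI-compatible v1 endpoint.
# Base URL, API key, and model name are placeholders; substitute your own.
curl https://your-endpoint.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```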

Don't have enough resources? No worries: choose from several models in the drop-down and run LLMs locally in your browser!

## Installation and Setup

Install vLLM, FastAPI, Uvicorn, and optionally ngrok (any other reverse proxy also works):

```bash
pip install -r requirements.txt
```

Terminal 1 (start the app server):

```bash
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
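
To reach the app from outside your machine, a tunnel such as ngrok works (assuming ngrok is installed and authenticated; any reverse proxy is equally fine):

```bash
# Optional: expose the local app server on a public URL via ngrok
ngrok http 3000
```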

## Tested Models on vLLM Server (RTX 3080 Ti, 12 GB)

The following models have been tried successfully on vllm==0.12.x. Errors encountered along the way, and their fixes, are noted in the Remarks column.

| Model Name | Command | Remarks |
| --- | --- | --- |
| Falcon3-7B-Instruct-GPTQ-Int4 | `vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85` | |
| Ministral-3-8B-Instruct-2512-AWQ-4bit | `vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024` | |
| OpenGVLab/InternVL3-8B-AWQ | `vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq` | AWQ-quantized model won't work unless the `--quantization awq` flag is set. |
| OpenGVLab/InternVL3-2B | `vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Cascade 8B | `vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Orchestrator 8B | `vllm serve cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit --served-model-name Nemotron-orchestrator --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Qwen2-VL 2B Instruct | `vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024` | |
| Nvidia Cosmos Reason2 2B | `vllm serve nvidia/Cosmos-Reason2-2B --max-model-len 8192 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.8` | |
| H2OVL Mississippi 2B | `vllm serve h2oai/h2ovl-mississippi-2b --max-model-len 4096 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.75` | Does not support a system prompt; needs to be handled. (Not yet fixed.) |
| Gemma 3 4B Instruct | `vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8` | Original model OOMs; using a GPTQ-quantized model from the community. Max concurrency observed: 9. |
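
After starting any of the servers above, a quick sanity check against vLLM's OpenAI-compatible API (vLLM defaults to port 8000; adjust if you changed it):

```bash
# List the models the vLLM server is serving; a JSON model list confirms it is up
curl http://localhost:8000/v1/models
```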
