
# LLM Chat

A ChatML chat interface integrated with RAG, web search, and more for LLMs and VLMs. Simply plug in your v1 endpoint from OpenAI, Google AI Studio, Ollama, vLLM, or any other OpenAI-compatible server.
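
Any endpoint that speaks the OpenAI chat completions protocol will do. As a rough sketch (the base URL, API key, and model name below are placeholders, not values from this project):

```bash
# Minimal sketch: chat completion against an OpenAI-compatible v1 endpoint.
# Base URL, API key, and model name are placeholders; substitute your own.
curl https://your-endpoint.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "your-model-name",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```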

Don't have enough resources? No worries: choose from several models in the drop-down and run LLMs locally in your browser!

## Installation and Setup

Install vLLM, FastAPI, Uvicorn, and optionally ngrok (any other reverse proxy also works):

```bash
pip install -r requirements.txt
```

Terminal 1 (start the app server):

```bash
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
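
To reach the app from outside your machine, a tunnel such as ngrok works (assuming ngrok is installed and authenticated; any reverse proxy is equally fine):

```bash
# Optional: expose the local app server on a public URL via ngrok
ngrok http 3000
```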

## Tested Models on vLLM Server (RTX 3080 Ti, 12 GB)

The following models have been tried successfully on vllm==0.12.x. Errors encountered along the way, and their fixes, are noted in the Remarks column.

| Model Name | Command | Remarks |
| --- | --- | --- |
| Falcon3-7B-Instruct-GPTQ-Int4 | `vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85` | |
| Ministral-3-8B-Instruct-2512-AWQ-4bit | `vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024` | |
| OpenGVLab/InternVL3-8B-AWQ | `vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq` | AWQ-quantized model won't work unless the `--quantization awq` flag is set. |
| OpenGVLab/InternVL3-2B | `vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Cascade 8B | `vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Nemotron Orchestrator 8B | `vllm serve cyankiwi/Nemotron-Orchestrator-8B-AWQ-4bit --served-model-name Nemotron-orchestrator --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | |
| Qwen2-VL 2B Instruct | `vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024` | |
| Nvidia Cosmos Reason2 2B | `vllm serve nvidia/Cosmos-Reason2-2B --max-model-len 8192 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.8` | |
| H2OVL Mississippi 2B | `vllm serve h2oai/h2ovl-mississippi-2b --max-model-len 4096 --max-num-batched-tokens 2048 --gpu-memory-utilization 0.75` | Does not support a system prompt; needs to be handled. (Not yet fixed.) |
| Gemma 3 4B Instruct | `vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8` | Original model OOMs; using a GPTQ-quantized model from the community. Max concurrency observed: 9. |
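
After starting any of the servers above, a quick sanity check against vLLM's OpenAI-compatible API (vLLM defaults to port 8000; adjust if you changed it):

```bash
# List the models the vLLM server is serving; a JSON model list confirms it is up
curl http://localhost:8000/v1/models
```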
