VrunGPU

Remote GPU execution server for training and inference. Send Python code from your laptop to a GPU server via REST API and get results back.

Features

Sync/Async Execution: Sync for quick tasks, async for long training jobs
Multi-GPU Support: Automatic job distribution across multiple GPUs
ZIP Project Upload: Upload multiple Python files + datasets as a ZIP archive
File Upload API: Streaming multipart upload for datasets/checkpoints — no base64 embed workaround
WebSocket Streaming: Real-time training logs via WebSocket
Web Dashboard: Visual monitoring of GPU status and tasks (D3.js charts)
SQLite Persistence: Task and model data survives server restarts
Model Registry: Register, query, and download trained models
Inference API: Run inference on registered models
Progress Tracking: Real-time training progress monitoring
LLM Chat API: Built-in LLM inference service (Qwen2.5/Qwen3/Qwen3.5, Vision-Language)
Fine-Tuning: SFT LoRA + DPO (RLHF) with TRL DPOTrainer, V100-compatible fp16
Session Management: Server-side multi-turn chat sessions with TTL
Auto GPU Switching: LLM and training jobs share GPU automatically

Who is this for?

ML engineers who want quick GPU experiments without infrastructure overhead
Small teams prototyping before building full MLOps pipelines
Researchers running ad-hoc experiments on remote GPU servers
Anyone tired of the SSH → git pull → run → copy results workflow

When to use VrunGPU vs MLOps platforms

Scenario	MLOps Platforms	VrunGPU
Quick experiments / prototyping	Heavy setup overhead	✅ Lightweight
Ad-hoc GPU tasks	Pipeline setup required	✅ Instant execution
Small teams / individuals	Infrastructure cost	✅ Minimal setup
Production deployment	✅ Recommended	Not designed for
CI/CD pipelines	✅ Recommended	Not designed for
A/B testing / Feature stores	✅ Recommended	Not designed for

Where VrunGPU fits in your ML workflow

ML Development Lifecycle:

[Idea] → [Experiment] → [Development] → [Production]
              ↑               ↑               ↑
          VrunGPU         VrunGPU          MLOps
        (quick tests)   (iteration)    (deployment)

Note: VrunGPU complements MLOps platforms by providing a lightweight execution layer for experimentation and development phases.

Quick Start

Installation

# Clone the repository
git clone https://github.com/maior/vrungpu.git
cd vrungpu

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install PyTorch with CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install dashboard (optional)
cd dashboard && npm install && npm run build

Running the Server

# Start backend (port 9825)
.venv/bin/python server.py

# Start dashboard (port 9824)
cd dashboard && npm run start -- -p 9824

Access Points

Service	URL
Backend API	http://your-server:9825
Dashboard	http://your-server:9824
API Docs (Swagger)	http://your-server:9825/docs

Usage Examples

1. Check Server Status

curl http://your-server:9825/

{
  "service": "VrunGPU",
  "status": "running",
  "version": "0.4.0",
  "gpu_count": 2,
  "available_gpus": 2,
  "total_tasks": 150,
  "total_models": 10
}

2. Synchronous Execution (Quick Tasks)

curl -X POST http://your-server:9825/run/sync \
  -H "Content-Type: application/json" \
  -d '{
    "code": "import torch\nprint(torch.cuda.get_device_name(0))"
  }'

3. Asynchronous Execution (Long Training)

# Submit job - runs until completion (no timeout by default)
curl -X POST http://your-server:9825/run/async \
  -H "Content-Type: application/json" \
  -d '{
    "code": "import time\nfor i in range(10):\n    print(f\"Step {i+1}/10\")\n    time.sleep(1)"
  }'

# Response: {"task_id": "abc-123", "status": "pending", ...}

# Check status
curl http://your-server:9825/task/abc-123

Note: By default, tasks run without timeout. Training can run for hours or days. To set a timeout, add "timeout": 3600 (seconds).
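
The submit-then-poll flow above can be wrapped in a small helper. The following is a minimal sketch using only the standard library. The terminal status names (`completed`, `failed`, `cancelled`) are the ones used elsewhere in this README; the helper names `submit_async`, `get_task`, and `wait_for_task` are hypothetical and are not part of the bundled client.

```python
import json
import time
import urllib.request

SERVER = "http://your-server:9825"  # placeholder host, as in the curl examples
TERMINAL = {"completed", "failed", "cancelled"}


def submit_async(code, timeout=None):
    """POST code to /run/async and return the parsed JSON response."""
    payload = {"code": code}
    if timeout is not None:
        payload["timeout"] = timeout  # seconds; leaving it out means no timeout
    req = urllib.request.Request(
        f"{SERVER}/run/async",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def get_task(task_id):
    """GET /task/{task_id} and return the parsed JSON response."""
    with urllib.request.urlopen(f"{SERVER}/task/{task_id}") as resp:
        return json.load(resp)


def wait_for_task(task_id, fetch=get_task, interval=2.0):
    """Poll until the task reaches a terminal status, then return it."""
    while True:
        task = fetch(task_id)
        if task["status"] in TERMINAL:
            return task
        time.sleep(interval)
```

A typical call would be `wait_for_task(submit_async(code)["task_id"])`. The `fetch` parameter exists so the polling loop can be exercised without a live server.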

4. Specify GPU

curl -X POST http://your-server:9825/run/sync \
  -H "Content-Type: application/json" \
  -d '{
    "code": "import torch\nprint(f\"Running on GPU: {torch.cuda.current_device()}\")",
    "gpu_id": 1
  }'

Progress Tracking

Report training and evaluation progress from your code and monitor it in real-time.

Progress Output Format

# Print progress in this format from your code
print(f"[PROGRESS:{progress_percent}:{message}]")

Example: Training + Evaluation Progress

import torch
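import torch.nn as nn

# Assumes model, train_loader, eval_loader, train_step(), and evaluate_step()
# are defined elsewhere in your training script.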

# === Training Phase (0-85%) ===
print("[PROGRESS:0:Training started]")

num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for batch_idx, batch in enumerate(train_loader):
        # Training logic...
        loss = train_step(batch)

        # Report every 100 batches
        if batch_idx % 100 == 0:
            progress = int((epoch - 1 + batch_idx / len(train_loader)) / num_epochs * 85)
            print(f"[PROGRESS:{progress}:Epoch {epoch} Batch {batch_idx}, Loss: {loss:.4f}]")

# === Evaluation Phase (85-95%) ===
print("[PROGRESS:85:Evaluating...]")
model.eval()

with torch.no_grad():
    for batch_idx, batch in enumerate(eval_loader):
        # Evaluation logic...
        evaluate_step(batch)

        # Report every 10 batches
        if batch_idx % 10 == 0:
            progress = 85 + int((batch_idx + 1) / len(eval_loader) * 10)
            print(f"[PROGRESS:{progress}:Eval {batch_idx+1}/{len(eval_loader)}]")

# === Save Model (95-100%) ===
print("[PROGRESS:95:Saving model...]")
torch.save(model.state_dict(), "model.pt")

print("[PROGRESS:100:Training complete!]")

Recommended Progress Allocation:

Phase	Progress Range	Description
Training	0-85%	Main training loop
Evaluation	85-95%	Validation/test evaluation
Saving	95-100%	Model checkpoint & results
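
The allocation in the table can be captured in a small helper so that each phase only reports its own completion fraction. This is a sketch, not part of VrunGPU: the names `PHASES`, `overall_progress`, and `report` are hypothetical, and only the `[PROGRESS:percent:message]` output format comes from this README.

```python
# Phase boundaries copied from the table above.
PHASES = {
    "training": (0, 85),
    "evaluation": (85, 95),
    "saving": (95, 100),
}


def overall_progress(phase, fraction):
    """Map a 0.0-1.0 fraction within a phase to an overall percentage."""
    start, end = PHASES[phase]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range input
    return int(start + (end - start) * fraction)


def report(phase, fraction, message):
    """Print one line in the [PROGRESS:percent:message] format."""
    line = f"[PROGRESS:{overall_progress(phase, fraction)}:{message}]"
    print(line, flush=True)  # flush so the line is emitted immediately
    return line
```

Inside a training loop this would be called as, for example, `report("training", batch_idx / len(train_loader), "Epoch 1")`.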

Query Progress

curl http://your-server:9825/task/{task_id}

Response:

{
  "task_id": "abc-123",
  "status": "running",
  "progress": 60.0,
  "progress_message": "Epoch 6/10, Loss: 0.0234",
  "task_type": "training"
}

Model Management

Register a Model

curl -X POST http://your-server:9825/model/register \
  -F "name=mnist-classifier-v1" \
  -F "model_file=@model.pt" \
  -F "model_type=classifier" \
  -F "framework=pytorch"

Response:

{
  "model_id": "a1b2c3d4",
  "name": "mnist-classifier-v1",
  "model_type": "classifier",
  "framework": "pytorch",
  "status": "ready",
  "file_size": 12345,
  "created_at": "2026-01-05T12:00:00"
}

List Models

curl http://your-server:9825/models

Get Model Details

curl http://your-server:9825/model/{model_id}

Download Model

curl -O http://your-server:9825/model/{model_id}/download

Delete Model

curl -X DELETE http://your-server:9825/model/{model_id}

Inference API

Run inference on registered models.

Execute Inference

curl -X POST http://your-server:9825/model/{model_id}/inference \
  -H "Content-Type: application/json" \
  -d '{
    "input_data": {"image": "base64...", "batch_size": 1},
    "timeout": 60
  }'

Response:

{
  "task_id": "inference-xyz",
  "status": "pending",
  "message": "Inference task started. Model: mnist-classifier-v1"
}

Get Inference Results

curl http://your-server:9825/task/{task_id}

LLM Chat API

Built-in LLM inference service with multi-GPU support. Run different (or same) models on each GPU simultaneously. Supports Qwen2.5/Qwen3/Qwen3.5, Vision-Language, DeepSeek, GPT-OSS models.

Supported Models

Model aliases (short names) are accepted wherever a model name is expected.

Model	Alias	HuggingFace Name	VRAM (FP16)	Note
Qwen3.5-9B	`qwen3.5-9b`	`Qwen/Qwen3.5-9B`	~18GB	Default, fine-tuning ready
Qwen2.5-7B	`qwen2.5-7b`	`Qwen/Qwen2.5-7B-Instruct`	~16GB
Qwen2.5-VL-7B	`qwen2.5-vl-7b`	`Qwen/Qwen2.5-VL-7B-Instruct`	~16GB	Vision-Language (use `/run/async`)
Qwen3-8B	`qwen3-8b`	`Qwen/Qwen3-8B`	~18GB
DeepSeek-R1-7B (Qwen)	`deepseek-7b`	`deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`	~16GB
DeepSeek-R1-8B (Llama)	`deepseek-8b`	`deepseek-ai/DeepSeek-R1-Distill-Llama-8B`	~18GB
DeepSeek-R1-14B	`deepseek-14b`	`deepseek-ai/DeepSeek-R1-Distill-Qwen-14B`	~28GB	V100 32GB recommended
GPT-OSS-20B	`gpt-oss-20b`	`openai/gpt-oss-20b`	~41GB	MXFP4 → BF16 dequant for V100

Start LLM Service

Each GPU runs an independent inference server (port = 9826 + gpu_id). You can run different or the same models on multiple GPUs simultaneously.

# Start model on GPU 0 (port 9826)
curl -X POST "http://your-server:9825/llm/start?model=qwen3.5-9b&gpu=0"

# Start a different model on GPU 1 (port 9827) — runs simultaneously
curl -X POST "http://your-server:9825/llm/start?model=qwen2.5-7b&gpu=1"

# Or same model on both GPUs for higher throughput
curl -X POST "http://your-server:9825/llm/start?model=qwen2.5-7b&gpu=0"
curl -X POST "http://your-server:9825/llm/start?model=qwen2.5-7b&gpu=1"
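
The port rule stated above (port = 9826 + gpu_id) and the query-string form of the start request can be expressed in a few lines. This is an illustrative sketch; `llm_port` and `start_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

SERVER = "http://your-server:9825"  # placeholder host
BASE_LLM_PORT = 9826                # per the rule: port = 9826 + gpu_id


def llm_port(gpu_id):
    """Port of the per-GPU inference server."""
    return BASE_LLM_PORT + gpu_id


def start_url(model, gpu):
    """Build the /llm/start URL with model and gpu as query parameters."""
    return f"{SERVER}/llm/start?{urlencode({'model': model, 'gpu': gpu})}"
```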

Chat API

# Route to specific GPU
curl -X POST "http://your-server:9825/llm/chat?gpu=0" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_new_tokens": 256,
    "temperature": 0.7
  }'

# Omit gpu → routes to any running instance
curl -X POST http://your-server:9825/llm/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_new_tokens": 256
  }'

Text Generation API

# Route to GPU 1 specifically
curl -X POST "http://your-server:9825/llm/generate?gpu=1" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Python quicksort function",
    "max_new_tokens": 512
  }'

LLM Service Management

# Check all instances
curl http://your-server:9825/llm/status
# Returns: instances[] with per-GPU model, port, pid, running status

# Stop specific GPU only
curl -X POST "http://your-server:9825/llm/stop?gpu=0"

# Stop all instances
curl -X POST http://your-server:9825/llm/stop

Multi-GPU Architecture

Client Request              VrunGPU Server (port 9825)
     │                              │
     ├─ /llm/chat?gpu=0 ──────────►├──► inference_server (GPU 0, port 9826)
     │                              │
     ├─ /llm/chat?gpu=1 ──────────►├──► inference_server (GPU 1, port 9827)
     │                              │
     └─ /llm/chat (no gpu) ───────►├──► any running instance (auto-select)

Independent instances: Each GPU loads its own model, has its own sessions
Auto GPU switching: Training jobs still trigger LLM auto-stop per GPU
Selective stop: Free one GPU for training while keeping the other serving

Python Example

import requests

SERVER = "http://your-server:9825"

# Start models on both GPUs
requests.post(f"{SERVER}/llm/start", params={"model": "qwen3.5-9b", "gpu": 0})
requests.post(f"{SERVER}/llm/start", params={"model": "qwen2.5-7b", "gpu": 1})

# Chat with GPU 0 (Qwen3.5-9B)
r0 = requests.post(f"{SERVER}/llm/chat", params={"gpu": 0}, json={
    "messages": [{"role": "user", "content": "Explain machine learning"}],
    "max_new_tokens": 512
})
print("GPU0:", r0.json()["generated_text"])

# Chat with GPU 1 (Qwen2.5-7B)
r1 = requests.post(f"{SERVER}/llm/chat", params={"gpu": 1}, json={
    "messages": [{"role": "user", "content": "Explain machine learning"}],
    "max_new_tokens": 512
})
print("GPU1:", r1.json()["generated_text"])

# Stop GPU 0, keep GPU 1 running
requests.post(f"{SERVER}/llm/stop", params={"gpu": 0})

# Chat without gpu param → routes to GPU 1 (only running instance)
r = requests.post(f"{SERVER}/llm/chat", json={
    "messages": [{"role": "user", "content": "Hello"}]
})
print(r.json()["generated_text"])

# Stop all
requests.post(f"{SERVER}/llm/stop")

File Upload API

Upload arbitrary files (datasets, checkpoints, configs) via streaming multipart. The returned path can be reused directly in /llm/finetune, /llm/finetune/dpo, /run/async, etc. — no base64 embedding in your code.

Upload a File

curl -X POST http://your-server:9825/upload \
  -F "file=@/local/path/dataset.jsonl"

Response:

{
  "file_id": "f3a1...",
  "filename": "dataset.jsonl",
  "path": "/home/maiordba/projects/vrungpu/data/uploads/f3a1.../dataset.jsonl",
  "size": 12345,
  "sha256": "9a4961...",
  "uploaded_at": "2026-04-13T09:34:39"
}
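
The `sha256` and `size` fields make it possible to confirm that an upload arrived intact. The sketch below assumes that `sha256` is the hex digest of the file contents and that `size` is the byte count; neither is stated explicitly in this README. The function names are hypothetical.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_matches(local_path, upload_response):
    """Compare the local hash and size with the server-reported values."""
    return (
        file_sha256(local_path) == upload_response["sha256"]
        and os.path.getsize(local_path) == upload_response["size"]
    )
```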

Manage Uploads

curl http://your-server:9825/uploads                          # list
curl http://your-server:9825/upload/{file_id}                 # metadata
curl -O http://your-server:9825/upload/{file_id}/download     # download
curl -X DELETE http://your-server:9825/upload/{file_id}       # delete

Python Example

import requests

SERVER = "http://your-server:9825"

# Upload dataset
with open("pairs.jsonl", "rb") as f:
    r = requests.post(f"{SERVER}/upload", files={"file": f})
dataset_path = r.json()["path"]

# Use it in fine-tuning directly
requests.post(f"{SERVER}/llm/finetune/dpo", json={
    "model": "qwen2.5-7b",
    "dataset_path": dataset_path,
    "epochs": 1,
    "beta": 0.1,
})

Fine-Tuning API

LoRA-based fine-tuning with PEFT. Two modes are available: SFT (supervised fine-tuning) and DPO (preference learning / RLHF). Both run in fp16 with AMP GradScaler so that they work on V100 GPUs.

SFT Fine-Tuning

Dataset formats auto-detected (JSONL):

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"instruction": "...", "input": "...", "output": "..."} (Alpaca)
{"text": "..."} (raw)
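
Before uploading, a dataset can be checked locally against the three record shapes listed above. The server's actual detection logic is not shown in this README, so the key-based rules below are an assumption, and `detect_format` and `check_jsonl` are hypothetical names.

```python
import json


def detect_format(record):
    """Classify one parsed JSONL record by the keys it carries."""
    if "messages" in record:
        return "chat"
    if "instruction" in record and "output" in record:
        return "alpaca"
    if "text" in record:
        return "raw"
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


def check_jsonl(lines):
    """Return the set of formats found across non-empty JSONL lines."""
    return {detect_format(json.loads(line)) for line in lines if line.strip()}
```

A result containing more than one format suggests a mixed file that is worth inspecting before training.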

curl -X POST http://your-server:9825/llm/finetune \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-9b",
    "dataset_path": "/path/to/train.jsonl",
    "epochs": 3,
    "lora_r": 16,
    "learning_rate": 2e-4,
    "batch_size": 4,
    "gpu": 1
  }'

DPO Fine-Tuning (RLHF)

Preference-pair dataset (JSONL):

{"prompt": "What is 2+2?", "chosen": "2+2 equals 4.", "rejected": "idk lol"}

curl -X POST http://your-server:9825/llm/finetune/dpo \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "dataset_path": "/path/to/pairs.jsonl",
    "epochs": 1,
    "beta": 0.1,
    "lora_r": 16,
    "batch_size": 2,
    "grad_accum": 4,
    "learning_rate": 5e-6
  }'

Uses TRL DPOTrainer with ref_model=None — the LoRA adapter is disabled to reuse the base model as reference, saving VRAM (critical on V100 32GB).

Progress Monitoring (shared by SFT/DPO)

curl "http://your-server:9825/llm/finetune/status?task_id={task_id}"

Response includes epoch, step, total_steps, loss, progress, and for DPO also reward margins via training logs.

Stop Fine-Tuning

# Graceful stop (saves checkpoint)
curl -X POST "http://your-server:9825/llm/finetune/stop?task_id={task_id}"

Use the Fine-Tuned Adapter

# Start LLM with the adapter merged in
curl -X POST "http://your-server:9825/llm/start?model=qwen3.5-9b&lora_adapter={task_id}&gpu=1"

# Then chat normally via /llm/chat

End-to-End Python Example

import time

import requests

SERVER = "http://your-server:9825"

# 1. Upload preference pairs
with open("pairs.jsonl", "rb") as f:
    dataset_path = requests.post(f"{SERVER}/upload", files={"file": f}).json()["path"]

# 2. Start DPO
job = requests.post(f"{SERVER}/llm/finetune/dpo", json={
    "model": "qwen2.5-7b",
    "dataset_path": dataset_path,
    "epochs": 1, "beta": 0.1, "lora_r": 16,
}).json()
task_id = job["task_id"]

# 3. Poll status
while True:
    s = requests.get(f"{SERVER}/llm/finetune/status", params={"task_id": task_id}).json()
    print(f"{s['status']} {s['progress']:.1f}% loss={s['loss']}")
    if s["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(5)

# 4. Serve the DPO-trained adapter
requests.post(f"{SERVER}/llm/start", params={
    "model": "qwen2.5-7b", "lora_adapter": task_id, "gpu": 0,
})
reply = requests.post(f"{SERVER}/llm/chat", json={
    "messages": [{"role": "user", "content": "What is 2+2?"}]
}).json()
print(reply["generated_text"])

Custom Inference Logic

For complex inference, use the /run/async endpoint:

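import requests

# The submitted script must define its own preprocess() and input_data;
# they are placeholders in this example.
code = """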
import torch
import json

# Load model
model = torch.load('/path/to/model.pt')
model.eval()

# Process input and run inference
input_tensor = preprocess(input_data)
with torch.no_grad():
    output = model(input_tensor)

# Output results
result = {"prediction": output.argmax().item()}
print(json.dumps(result))
"""

response = requests.post(
    "http://your-server:9825/run/async",
    json={"code": code, "timeout": 60}
)

ZIP Project Upload

Upload multiple Python files and datasets as a ZIP archive.

Project Structure Example

my_project/
├── train.py          # Entry point
├── model.py          # Model definition
├── utils.py          # Utility functions
├── config.json       # Configuration
└── data/             # Dataset folder
    ├── train.csv
    └── test.csv

Create ZIP Archive

# Method 1: Compress entire folder
cd /path/to/projects
zip -r my_project.zip my_project/

# Method 2: Compress files from inside folder
cd my_project
zip -r ../my_project.zip .

Upload and Run

Using curl:

curl -X POST http://your-server:9825/run/project \
  -F "file=@my_project.zip" \
  -F "entry_point=train.py"

Using Python:

import requests

with open("my_project.zip", "rb") as f:
    response = requests.post(
        "http://your-server:9825/run/project",
        files={"file": f},
        data={
            "entry_point": "train.py",
            "gpu_id": 0  # Optional: specify GPU
        }
    )

task_id = response.json()["task_id"]
print(f"Task started: {task_id}")

Note: No timeout by default. Add "timeout": 3600 if you want to limit execution time.

Using Dashboard:

Go to http://your-server:9824
Select "File Upload" tab
Choose ZIP file
Enter entry point (e.g., train.py)
Click "Upload & Run"

API Reference

Core Endpoints

Endpoint	Method	Description
`/`	GET	Server status and storage info
`/gpu`	GET	GPU details (memory, utilization)
`/gpu/pool`	GET	GPU pool status (allocation)
`/ws`	WebSocket	Real-time monitoring connection
`/stats`	GET	Server statistics

Task Execution Endpoints

Endpoint	Method	Description
`/run/sync`	POST	Synchronous execution (waits for completion)
`/run/async`	POST	Asynchronous execution (returns task_id)
`/run/project`	POST	Upload ZIP/file and execute
`/task/{task_id}`	GET	Get task status/result/progress
`/tasks`	GET	List tasks
`/task/{task_id}`	DELETE	Delete task record + workspace (⚠ does NOT kill process — cancel first)
`/task/{task_id}/progress`	PUT	Manual progress update

Task Management Endpoints (v0.7.0)

Live observability and control over in-flight tasks — cancel a stuck training run, tail logs, browse workspace files, or subscribe to live stdout via SSE.

Endpoint	Method	Description
`/task/{task_id}/cancel`	POST	Cancel running task: SIGTERM → (timeout) → SIGKILL. Releases GPU. `?timeout=5` (default 5s)
`/task/{task_id}/logs`	GET	Tail logs. `?source=stdout|stderr|all|workspace` · `?tail=200` · running tasks use a 1000-line ring buffer, completed tasks use `task.stdout/stderr`
`/task/{task_id}/logs/stream`	GET	SSE live log stream. Pushes 200-line snapshot, then real-time lines. Emits `event: end` on task completion
`/task/{task_id}/files`	GET	Workspace file tree. `?max_depth=3` (default). Returns path/type/size_bytes/mtime
`/task/{task_id}/files/{path}`	GET	Read individual file from workspace. `?tail=N` for text tail, otherwise full file download

Why it matters — many frameworks (RecBole, TensorFlow, etc.) write logs to their own files instead of stdout, leaving task.stdout empty. GET /task/{id}/logs?source=workspace scans work_dir/**/*.log and tails the most recently modified file, so you can see epoch/loss progress even for file-logger frameworks.
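
The workspace lookup described above (scan for `*.log` files, pick the most recently modified one, return its last lines) can be sketched as follows. This illustrates the technique only and is not the server's code; `latest_log` and `tail` are hypothetical names.

```python
from collections import deque
from pathlib import Path


def latest_log(work_dir):
    """Return the most recently modified *.log file under work_dir, or None."""
    logs = [p for p in Path(work_dir).rglob("*.log") if p.is_file()]
    return max(logs, key=lambda p: p.stat().st_mtime, default=None)


def tail(path, n=50):
    """Return the last n lines of a text file."""
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return list(deque(f, maxlen=n))
```

The bounded `deque` keeps only the last `n` lines in memory, which matters for multi-gigabyte training logs.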

Example — tail RecBole training log:

# Show most recent log file (auto-detected)
curl "http://{SERVER_IP}:9825/task/{TASK_ID}/logs?source=workspace&tail=50"

# Target specific file
curl "http://{SERVER_IP}:9825/task/{TASK_ID}/logs?source=workspace&workspace_file=log/LightGCN/train.log&tail=20"

Example — Python SSE client (live tail -f):

import json

import httpx

SERVER_IP = "your-server"  # placeholder host
TASK_ID = "abc-123"        # placeholder task ID

url = f"http://{SERVER_IP}:9825/task/{TASK_ID}/logs/stream"
with httpx.stream("GET", url, timeout=None) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            # Each data line carries one JSON event with the stream name and log line
            evt = json.loads(line[6:])
            print(f"[{evt['stream']}] {evt['line']}", end="")
        elif line.startswith("event: end"):
            print("\n--- task finished ---")
            break

Example — cancel a stuck task:

# Default 5-second SIGTERM grace period, then SIGKILL
curl -X POST "http://{SERVER_IP}:9825/task/{TASK_ID}/cancel"

# Give the task 30 seconds to flush checkpoints before force-kill
curl -X POST "http://{SERVER_IP}:9825/task/{TASK_ID}/cancel?timeout=30"

The GPU is automatically returned to the pool on cancel. Status transitions from running → cancelled (preserved — not overridden to failed).
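
The SIGTERM, grace period, SIGKILL sequence is a common pattern for stopping a subprocess. A minimal sketch with the standard library follows. It shows the pattern only; how the server tracks processes and releases GPUs is not covered, and `cancel` is a hypothetical name.

```python
import subprocess


def cancel(proc, timeout=5.0):
    """Send SIGTERM, wait up to `timeout` seconds, then force-kill."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

A longer `timeout` gives training code that handles SIGTERM a chance to write a checkpoint before it is force-killed.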

Example — browse workspace for debugging:

# List all files within workspace, depth 3
curl "http://{SERVER_IP}:9825/task/{TASK_ID}/files?max_depth=3"

# Download a config file
curl "http://{SERVER_IP}:9825/task/{TASK_ID}/files/config.yaml" -o config.yaml

# Peek at last 30 lines of any text file
curl "http://{SERVER_IP}:9825/task/{TASK_ID}/files/saved/LightGCN-Mar-30-2026.pth.txt?tail=30"

Path traversal (..) is blocked — requests resolving outside the workspace return 400.
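
One common way to implement this kind of check is to resolve the requested path and verify that it still lies under the workspace root. The sketch below shows that technique; it is an assumption about the approach, not the server's code, and `resolve_in_workspace` is a hypothetical name. A server would map the `ValueError` to an HTTP 400 response.

```python
from pathlib import Path


def resolve_in_workspace(workspace, relative_path):
    """Resolve a requested path and reject anything outside the workspace."""
    root = Path(workspace).resolve()
    target = (root / relative_path).resolve()  # collapses ".." and symlinks
    if not target.is_relative_to(root):        # Path.is_relative_to needs Python 3.9+
        raise ValueError("path escapes the workspace")
    return target
```

Resolving before comparing matters: a plain string-prefix check on the unresolved path would accept `a/../../secret.txt`.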

CPU Task Support (v0.8.0)

For workloads that don't need a GPU (data preprocessing, classic ML, RecBole eval, offline inference), pass device=cpu to keep the GPU pool free for training.

Endpoint	Method	Description
`/pool`	GET	Unified GPU + CPU pool status (v0.8.0+)
`/tasks?device=cpu|gpu`	GET	Filter task list by device
`/run/async`, `/run/sync`, `/run/project`	POST	New `device` field: `"gpu"` (default) or `"cpu"`

CPU slot auto-detection — slot count is computed at server start using os.sched_getaffinity(0) (respects cgroups/taskset), reserving 25% or 4 cores (whichever is larger) as headroom for GPU DataLoader workers and the OS, then dividing the remainder by 4 (assumed average cores per CPU task), capped at 16. Override with VRUNGPU_CPU_SLOTS=N env.

Host cores	Slots	Rationale
4	1	Minimum
16	3	12 effective ÷ 4
32	6	24 effective ÷ 4
64	12	48 effective ÷ 4
128+	16	Upper cap
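
The slot rule can be written out as a short function that reproduces the table. Two details are assumptions rather than facts from this README: the 25% headroom is rounded down to a whole core, and the minimum of one slot is inferred from the 4-core row. The function names are hypothetical.

```python
import os


def cpu_slots(cores, cores_per_task=4, cap=16):
    """Compute CPU task slots from a core count, following the rule above."""
    headroom = max(cores // 4, 4)          # 25% or 4 cores, whichever is larger
    effective = max(cores - headroom, 0)   # cores left for CPU tasks
    return min(max(effective // cores_per_task, 1), cap)  # at least 1, at most cap


def detect_cores():
    """Cores available to this process (honors affinity where supported)."""
    if hasattr(os, "sched_getaffinity"):   # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def resolve_slots(env=os.environ):
    """Apply the VRUNGPU_CPU_SLOTS override, else auto-detect."""
    override = env.get("VRUNGPU_CPU_SLOTS")
    return int(override) if override else cpu_slots(detect_cores())
```

For example, `cpu_slots(32)` reserves 8 cores, leaves 24, and yields 6 slots, which matches the 32-core row.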

Example — submit CPU preprocessing job:

curl -X POST "http://{SERVER_IP}:9825/run/async" \
  -H "Content-Type: application/json" \
  -d '{
    "code": "import pandas as pd\ndf = pd.read_csv(\"/data/large.csv\")\nprint(df.describe())",
    "name": "preprocess_job",
    "device": "cpu"
  }'

Example — run RecBole eval on CPU while training runs on GPU:

curl -X POST "http://{SERVER_IP}:9825/run/project" \
  -F "file=@eval_project.zip" \
  -F "entry_point=run_eval.py" \
  -F "device=cpu"

Example — check pool status:

curl "http://{SERVER_IP}:9825/pool"
# {
#   "gpu": {"total_gpus": 2, "available_gpus": [1], "busy_gpus": {"task-abc": 0}},
#   "cpu": {"max_slots": 6, "in_use": 1, "tasks": ["task-xyz"], "auto_detected": true}
# }

What happens to CPU tasks internally — CUDA_VISIBLE_DEVICES="" and VRUNGPU_DEVICE="cpu" are injected into the subprocess env. The task runs in a separate CPU slot pool from the GPU pool, so a CPU task never blocks GPU scheduling and vice versa. CPU and GPU tasks can run concurrently.
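
The environment injection can be sketched from both sides: how a launcher might build the subprocess environment, and how task code might read it. Only the two variable names and values come from this README. The function names are hypothetical, and falling back to `"cuda"` when `VRUNGPU_DEVICE` is absent is an assumption, since the README does not say whether the variable is set for GPU tasks.

```python
import os


def cpu_task_env(base_env=None):
    """Copy an environment and add the two variables described above."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty value hides every GPU from CUDA
    env["VRUNGPU_DEVICE"] = "cpu"
    return env


def pick_device(env=None):
    """Inside task code: choose a device string from VRUNGPU_DEVICE."""
    env = os.environ if env is None else env
    return "cpu" if env.get("VRUNGPU_DEVICE") == "cpu" else "cuda"
```

Task code can then call `torch.device(pick_device())` so that the same script runs under either pool.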

Recommended CPU workloads: pandas/numpy preprocessing, scikit-learn/xgboost, RecBole eval_args.mode=full evaluation, offline inference on small models.

Not recommended on CPU: LLM fine-tuning and large tensor operations. These are permitted but impractically slow.

File Upload Endpoints

Endpoint	Method	Description
`/upload`	POST	Upload arbitrary file (streaming multipart)
`/uploads`	GET	List uploaded files
`/upload/{file_id}`	GET	Get upload metadata
`/upload/{file_id}/download`	GET	Download uploaded file
`/upload/{file_id}`	DELETE	Delete uploaded file

Fine-Tuning Endpoints

Endpoint	Method	Description
`/llm/finetune`	POST	Start SFT LoRA fine-tuning
`/llm/finetune/dpo`	POST	Start DPO LoRA fine-tuning (RLHF)
`/llm/finetune/status`	GET	Fine-tuning progress (shared SFT/DPO)
`/llm/finetune/stop`	POST	Graceful stop with checkpoint save
`/llm/finetune/models`	GET	List fine-tuned adapter models

Model Management Endpoints

Endpoint	Method	Description
`/models`	GET	List models
`/model/register`	POST	Register model (file upload)
`/model/{model_id}`	GET	Get model info
`/model/{model_id}`	DELETE	Delete model
`/model/{model_id}/download`	GET	Download model file
`/model/{model_id}/inference`	POST	Run inference with model

LLM Service Endpoints

All LLM endpoints accept an optional ?gpu=N parameter to target a specific GPU instance.

Endpoint	Method	Description
`/llm/start`	POST	Start LLM instance on GPU (params: model, gpu, lora_adapter)
`/llm/stop`	POST	Stop instance (?gpu=N for specific, omit for all)
`/llm/status`	GET	All instances status (per-GPU model, port, health)
`/llm/generate`	POST	Text generation (?gpu=N or auto-route)
`/llm/chat`	POST	Chat completion (?gpu=N or auto-route, session_id)
`/llm/sessions`	GET	List active chat sessions (all instances or ?gpu=N)
`/llm/session/{id}`	GET	Get session history
`/llm/session/{id}`	DELETE	Delete session

WebSocket Real-time Monitoring

The dashboard uses WebSocket for real-time updates.

Message Types

// Initial data on connection
{ "type": "init", "gpus": [...], "tasks": [...] }

// GPU status update
{ "type": "gpu_update", "gpus": [...] }

// Task status change
{ "type": "task_update", "task": {...} }

// Real-time output (logs)
{ "type": "task_output", "task_id": "...", "stream": "stdout", "line": "..." }
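
A client outside the browser can fold these four message types into local state with a small dispatcher. The sketch below handles already-received JSON strings and leaves the WebSocket transport out. It assumes every task object carries a `task_id` field, as the `/task/{task_id}` responses do; `new_state` and `handle_message` are hypothetical names.

```python
import json


def new_state():
    """Empty local view of the server: GPUs, tasks by id, log lines by task."""
    return {"gpus": [], "tasks": {}, "logs": {}}


def handle_message(raw, state):
    """Apply one WebSocket message (a JSON string) to the state dict."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "init":
        state["gpus"] = msg["gpus"]
        state["tasks"] = {t["task_id"]: t for t in msg["tasks"]}
    elif kind == "gpu_update":
        state["gpus"] = msg["gpus"]
    elif kind == "task_update":
        task = msg["task"]
        state["tasks"][task["task_id"]] = task
    elif kind == "task_output":
        state["logs"].setdefault(msg["task_id"], []).append(
            (msg["stream"], msg["line"])
        )
    return kind  # unknown types are ignored and simply returned
```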

JavaScript Connection Example

const ws = new WebSocket("ws://your-server:9825/ws");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === "task_output") {
    console.log(data.line);  // Real-time log output
  }
};

Python Client

from client import VrunGPUClient

client = VrunGPUClient("http://your-server:9825")

# Get GPU info
print(client.get_gpu_info())

# Synchronous execution
result = client.run_sync("print('Hello GPU!')")
print(result['stdout'])

# Asynchronous execution + wait for completion
train_code = "print('training...')"  # placeholder: your training script as a string
task_id = client.run_async(train_code)
result = client.wait_for_task(task_id)
print(result['stdout'])

# File upload
task_id = client.run_file("train.py", timeout=600, gpu_id=0)
result = client.wait_for_task(task_id)

Parallel Execution

When multiple GPUs are available, concurrent job submissions run on different GPUs in parallel.

# Submit job 1 -> Allocated to GPU 0
curl -X POST http://your-server:9825/run/async \
  -H "Content-Type: application/json" \
  -d '{"code": "..."}'

# Submit job 2 -> Allocated to GPU 1
curl -X POST http://your-server:9825/run/async \
  -H "Content-Type: application/json" \
  -d '{"code": "..."}'

# Check GPU pool status
curl http://your-server:9825/gpu/pool
# {"total_gpus": 2, "available_gpus": [], "busy_gpus": {"task1": 0, "task2": 1}}

When all GPUs are busy, new tasks wait in queued status and automatically start when a GPU becomes available.
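
The allocate-or-queue behavior can be modeled with a toy pool. This is a sketch of the idea, not the server's scheduler: the class name `GpuPool` is hypothetical and first-in, first-out ordering of queued tasks is an assumption. The `status()` keys mirror the `/gpu/pool` response shown above.

```python
from collections import deque


class GpuPool:
    """Toy allocator: hands out free GPU ids and queues tasks when none are free."""

    def __init__(self, total_gpus):
        self.total_gpus = total_gpus
        self.available = list(range(total_gpus))
        self.busy = {}         # task_id -> gpu_id
        self.queue = deque()   # task ids waiting for a GPU

    def submit(self, task_id):
        """Return the assigned GPU id, or None if the task was queued."""
        if self.available:
            gpu = self.available.pop(0)
            self.busy[task_id] = gpu
            return gpu
        self.queue.append(task_id)
        return None

    def release(self, task_id):
        """Free a task's GPU and hand it to the next queued task, if any."""
        gpu = self.busy.pop(task_id)
        if self.queue:
            self.busy[self.queue.popleft()] = gpu
        else:
            self.available.append(gpu)

    def status(self):
        return {
            "total_gpus": self.total_gpus,
            "available_gpus": sorted(self.available),
            "busy_gpus": dict(self.busy),
        }
```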

Data Storage Structure

All data is persisted starting from v0.4.0.

vrungpu/
└── data/
    ├── vrungpu.db        # SQLite database (tasks, model metadata)
    ├── workspaces/       # Task execution environments and results
    │   ├── {task_id}/
    │   │   ├── main.py
    │   │   └── model.pt
    │   └── inference_{task_id}/
    │       └── inference.py
    ├── models/           # Registered model files
    │   └── {model_id}/
    │       └── model.pt
    └── uploads/          # Uploaded ZIP files

Project Structure

vrungpu/
├── server.py              # FastAPI server (main, multi-GPU LLM management)
├── inference_server.py    # LLM inference server (per-GPU instance)
├── finetune_worker.py     # SFT LoRA fine-tuning worker
├── finetune_dpo_worker.py # DPO fine-tuning worker (TRL DPOTrainer)
├── client.py              # Python client
├── requirements.txt       # Dependencies
├── test_server.py         # Test script
├── README.md              # Documentation
├── data/                  # Persistent storage
│   ├── vrungpu.db         # SQLite database
│   ├── workspaces/        # Task workspaces
│   ├── models/            # Registered + fine-tuned models
│   │   └── finetune/      # LoRA adapter outputs
│   ├── uploads/           # Uploaded files (datasets, etc.)
│   └── logs/              # Per-GPU inference server logs
├── dashboard/             # Next.js dashboard
│   ├── app/
│   │   ├── page.tsx       # Dashboard UI (D3.js charts)
│   │   └── globals.css    # Animation styles
│   └── package.json
└── examples/              # Example projects
    └── mnist_project/

Example: MNIST Training + Model Registration + Inference

1. Run Training

# Compress example project
cd examples
zip -r mnist_project.zip mnist_project/

# Upload and run (no timeout - runs until completion)
curl -X POST http://your-server:9825/run/project \
  -F "file=@mnist_project.zip" \
  -F "entry_point=train.py"

2. Monitor Progress

# Check progress
curl http://your-server:9825/task/{task_id}

# Response shows progress
# "progress": 60.0, "progress_message": "Epoch 6/10..."

3. Register Model

After training completes, register the model from the workspace:

curl -X POST http://your-server:9825/model/register \
  -F "name=mnist-v1" \
  -F "model_file=@/path/to/workspace/{task_id}/model.pt" \
  -F "model_type=classifier"

4. Run Inference

curl -X POST http://your-server:9825/model/{model_id}/inference \
  -H "Content-Type: application/json" \
  -d '{"input_data": {"image": "..."}, "timeout": 30}'

Statistics API

Get server-wide statistics.

curl http://your-server:9825/stats

Response:

{
  "tasks": {
    "by_status": {
      "completed": 150,
      "running": 2,
      "failed": 5
    },
    "avg_duration": {
      "training": 3600,
      "inference": 5
    },
    "last_24h": 25
  },
  "models": {
    "by_status": {
      "ready": 10,
      "archived": 3
    },
    "total": 13
  },
  "gpu": {
    "usage_count": {"0": 100, "1": 80},
    "current": {
      "total_gpus": 2,
      "available_gpus": [0, 1],
      "busy_gpus": {}
    }
  }
}

Requirements

Python 3.10+
CUDA-capable GPU(s)
PyTorch with CUDA support
Node.js 18+ (for dashboard)

Version History

v0.8.0 - CPU Task Support: device="cpu" opt-in for /run/async, /run/sync, /run/project routes the task to a separate CPUSlotPool with auto-detected slot count (os.sched_getaffinity → headroom 25%/4-core → divide by 4, cap 16). GPU pool untouched, so CPU and GPU tasks run concurrently without contention. New GET /pool endpoint (unified GPU+CPU view), ?device=cpu|gpu filter on /tasks, VRUNGPU_CPU_SLOTS=N env override. Subprocesses get CUDA_VISIBLE_DEVICES="" + VRUNGPU_DEVICE="cpu". Idempotent DB migration adds tasks.device column defaulting to "gpu".
v0.7.0 - Task Management API: POST /task/{id}/cancel (SIGTERM→SIGKILL, GPU auto-release), GET /task/{id}/logs?source=stdout|stderr|all|workspace (ring-buffer tail for running tasks, file-based tail for frameworks like RecBole that bypass stdout), GET /task/{id}/logs/stream (SSE live tail), GET /task/{id}/files + /files/{path} (workspace browsing with path-traversal protection). Preserves cancelled status instead of overwriting with failed.
v0.6.0 - Multi-GPU LLM (independent per-GPU instances, concurrent serving), File Upload API, DPO fine-tuning (TRL DPOTrainer + LoRA), fix asyncio LimitOverrunError on large subprocess output
v0.5.0 - LLM Chat API (Qwen2.5/Qwen3/Qwen3.5, Qwen2.5-VL), SFT LoRA fine-tuning, server-side chat sessions, model aliases, auto GPU switching
v0.4.0 - SQLite persistence, model management API, inference API, progress tracking
v0.3.0 - D3.js dashboard with smooth animations
v0.2.0 - ZIP project upload, multi-GPU support
v0.1.0 - Basic sync/async execution

License

Apache License 2.0 - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request
