Final-year PhD @ HKU CS · Hong Kong
I prove the architectural limits of LLM reasoning, and build the systems that route around them.
I'm a final-year PhD candidate in the Department of Computer Science at The University of Hong Kong, advised by Prof. Siu-Ming Yiu. My research sits at the intersection of three threads that keep refusing to be separate:
- What transformers can actually reason about. Tight architectural bounds, plus the tool-delegation systems those bounds force you to build.
- Trustworthy LLMs in regulated settings. Compliance-grade explainability, distribution-free coverage, atomic claim verification.
- Serving infrastructure that respects both. Workflow-atomic GPU scheduling with per-tenant fairness guarantees.
Theorems tell you what cannot be done. Systems make precise what can.
The cycle runs both ways: deployment surfaces the limits worth proving, and the proofs become the constraints that keep deployment honest.
- [05.2026] 🎉🎉🎉 Two papers accepted to ICML 2026: The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary (Main) and Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements (Position Track).
- [05.2026] ✨ On the postdoc market for Fall 2026. Interests: trustworthy / compliance-grade AI, multi-agent systems & mechanism design, LLM theory, and serving systems. Reach out at bettyguo@connect.hku.hk.
- [05.2026] 📝 Serving as a reviewer for NeurIPS, ACM Multimedia (Main & Dataset Tracks), and UAI.
- [04.2026] 🏆🏆🏆 Four papers accepted to ACL 2026 Industry Track: FinGround (atomic claim verification), RouteNLP (conformal LLM routing), AgentEval (DAG-structured agent evaluation), and ComplianceNLP (KG-augmented regulatory gap detection).
- [04.2026] 🚀🚀🚀 SAGA accepted to HPDC 2026. It's a workflow-atomic scheduler for AI agent inference on GPU clusters, with per-tenant fairness guarantees that hold under real multi-tenant load.
- [03.2026] 📣 Adaptive Retrieval for Large Reasoning Models accepted to SIGIR 2026. When to retrieve during reasoning, with bounds, not heuristics.
- [02.2026] 💼 Conformal-bound risk management at Brain Investing is now running against live P&L. That's our HKU FinTech spin-out, and the lab's coverage work has finally made it onto a real trading book.
- [01.2026] 🛠️ Shipped multi-tenant scheduling and conformal-coverage pipelines at Stellaris AI for native-safe foundation-model deployment in regulated industries.
- [09.2025] 🎓 Began the final year of my PhD at HKU CS, advised by Prof. Siu-Ming Yiu. The thesis focuses on the theory-meets-deployment cycle: bounds on transformer reasoning, and the systems those bounds force.
- [08.2025] 🏅 Continuing Cyberport Incubation (2023–2025 intake). That keeps an unbroken 2018–2025 funding run going across TSSSU, HKSTP Incu-Tech, HKU iAXON Deep Tech, and Cyberport.
Eight ICML / SIGIR / ACL / HPDC papers this cycle. Conformal-bound risk on a live trading book. HMAC-signed agent memory in Rust. A 60-paper survey atlas of LLM reasoning theory. Built across HKU CS, Stellaris AI, and Brain Investing.
- 8 papers, 2026 cycle: ICML × 2 · SIGIR · HPDC · ACL Industry × 4
- 54 original OSS repos: 15 research · 8 MCP · 6 agents · 5 benchmarks · 8 tools · 12 curated
- In production: Stellaris AI · Brain Investing (conformal-bound risk on live P&L)
- Continuous funding, 2018–2025: TSSSU · HKSTP Incu-Tech · Cyberport (×2) · iAXON
Projects worth a second look:
- ReaLM-Retrieve · SIGIR 2026. When to retrieve during reasoning, with bounds rather than heuristics. Highest-cloned repo in this account.
- agent-memory. Verifiable memory for LLM agents. Every recalled claim is HMAC-signed back to its originating trajectory span.
| Paper | Venue | Code |
|---|---|---|
| The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary | ICML 2026 (Main) | deterministic-horizon |
| When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models | SIGIR 2026 | realm-retrieve |
| Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements | ICML 2026 (Position Track), position paper | |
| SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters | HPDC 2026 | SAGA |
| FinGround: Atomic Claim Verification for Financial LLM Outputs | ACL 2026 Industry Track | FinGround |
| ComplianceNLP: KG-Augmented Regulatory Gap Detection | ACL 2026 Industry Track | ComplianceNLP |
| RouteNLP: Conformal LLM Routing | ACL 2026 Industry Track | RouteNLP |
| AgentEval: DAG-Structured Agent Evaluation | ACL 2026 Industry Track | AgentEval |
Full publication list, PDFs, and BibTeX at bettyguo.github.io.
Three lines that keep crossing in our papers. Each thread proves a bound and ships the system that meets it.
What softmax attention can realize at inference time, and what it provably cannot. The matching upper and lower bounds become the spec for the tool-delegation layer above them.
📄 The Deterministic Horizon · Adaptive Retrieval for Large Reasoning Models · code: deterministic-horizon, realm-retrieve
Explainability and verification that survive financial-services audit, not benchmark conditions. Distribution-free coverage, atomic claim verification, knowledge-graph-augmented regulatory gap detection.
📄 Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements · FinGround, ComplianceNLP · code: FinGround, ComplianceNLP, TrustKGRAG
Workflow-atomic GPU scheduling with per-tenant fairness guarantees that hold under real multi-tenant load. DAG-structured evaluation harnesses and conformal routing for agent cascades.
📄 SAGA · RouteNLP, AgentEval · code: SAGA, RouteNLP, AgentEval
How we approach problems, across every thread:
- Tight bounds with explicit constants. Upper and lower bounds in the same paper. No asymptotic hand-waving.
- Impossibility paired with construction. When a thing can't be done, that result becomes a design constraint, not a stopping point.
- Guarantees that survive reality. Distribution-free coverage, conformal prediction, fair scheduling. No idealized assumptions.
- Theory and the system that meets it, shipped together. The proof tells the algorithm what to achieve; the algorithm tells the proof what's worth bounding.
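The "guarantees that survive reality" bullet above leans on distribution-free coverage. As a minimal sketch of what that phrase means, assuming absolute-residual scores and a hypothetical stand-in predictor `predict` (neither is taken from the papers), split conformal prediction turns held-out residuals into an interval whose marginal coverage is at least 1 - alpha whenever calibration and test points are exchangeable:

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(calibration_scores)
    # Taking the ceil((n + 1) * (1 - alpha))-th smallest score is what yields
    # marginal coverage >= 1 - alpha under exchangeability.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial interval is valid.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def split_conformal_interval(predict, calibration_set, x_new, alpha=0.1):
    """Symmetric prediction interval around predict(x_new)."""
    scores = [abs(y - predict(x)) for x, y in calibration_set]
    q = conformal_quantile(scores, alpha)
    centre = predict(x_new)
    return centre - q, centre + q


def predict(x):
    # Hypothetical stand-in for any fitted point predictor.
    return 2.0 * x


# Toy calibration pairs (x, y), made up for illustration only.
calibration_set = [(1.0, 2.5), (2.0, 3.0), (3.0, 6.2), (4.0, 8.1), (5.0, 9.0)]
low, high = split_conformal_interval(predict, calibration_set, x_new=6.0, alpha=0.5)
print(low, high)  # 11.5 12.5
```

With five calibration residuals and alpha = 0.5 the rank is ceil(6 × 0.5) = 3, so the half-width is the third-smallest residual, 0.5. Nothing about the noise distribution is assumed, which is the sense in which the guarantee is distribution-free.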
"Theorems tell you what cannot be done. Systems make precise what can."
54 original public repos. Research code behind every paper, plus the developer infrastructure our team relies on every day across HKU CS, Stellaris AI, and Brain Investing.
Browse the full index → github.com/bettyguo?tab=repositories
🔬 15 research · 🔌 8 MCP servers · 🤖 6 agent systems · 🧪 5 benchmarks · 🛠️ 8 dev tools · 📚 12 atlases & lists
One repo per paper. Theory and the system that meets it, in the same artifact.
Reasoning & retrieval
| Repo | What it is |
|---|---|
| deterministic-horizon | ICML '26 companion. Bounds on extended reasoning, and the regime where tool delegation becomes necessary. Explicit constants. |
| realm-retrieve | ReaLM-Retrieve · SIGIR '26 companion. When to retrieve during reasoning, with bounds rather than heuristics. |
Serving & agent infrastructure
| Repo | What it is |
|---|---|
| SAGA | HPDC '26 companion. Workflow-atomic GPU-cluster scheduler. Within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain / AutoGen / CrewAI bridges. |
| RouteNLP | ACL '26 Industry companion. Conformal-coverage router for LLM cascade serving. |
| AgentEval | ACL '26 Industry companion. DAG-structured evaluation harness for multi-step agents. |
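The SAGA entry above measures its eviction quality against Bélády's optimal policy. For readers who have not met that baseline, here is a minimal sketch: Bélády's offline rule evicts the entry whose next use lies farthest in the future, which requires knowing the whole access trace in advance. LRU stands in as a generic online policy purely for comparison. It is an assumption of this sketch, not SAGA's eviction policy, and the trace is a textbook page-reference string, not a KV-cache workload.

```python
from collections import OrderedDict


def belady_misses(trace, capacity):
    """Count misses under Bélády's offline-optimal policy."""
    # next_use[i] = index of the next access to trace[i], or infinity if none.
    next_use = [0] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # key -> index of that key's next access
    misses = 0
    for i, key in enumerate(trace):
        if key in cache:
            cache[key] = next_use[i]
            continue
        misses += 1
        if len(cache) >= capacity:
            # Evict the entry that will not be needed for the longest time.
            victim = max(cache, key=cache.get)
            del cache[victim]
        cache[key] = next_use[i]
    return misses


def lru_misses(trace, capacity):
    """Count misses under least-recently-used eviction (online baseline)."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)
            continue
        misses += 1
        if len(cache) >= capacity:
            cache.popitem(last=False)  # drop the least recently used key
        cache[key] = True
    return misses


trace = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
print(belady_misses(trace, 3), lru_misses(trace, 3))  # 9 12
```

On this toy trace LRU incurs 12 misses against the optimal 9. A "within c× of Bélády-optimal" figure is a ratio of this kind, computed on the workload of interest.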
Trustworthy & regulated AI
| Repo | What it is |
|---|---|
| FinGround | ACL '26 Industry companion. Atomic claim verification for financial LLM outputs. |
| ComplianceNLP | ACL '26 Industry companion. KG-augmented regulatory gap detection. |
| TrustKGRAG | Probabilistic certified robustness and anomaly detection against knowledge-graph poisoning in RAG. |
| conformalized-neural-operators | Distribution-free, spatially adaptive UQ for neural-operator PDE surrogates via physics-informed conformal prediction. |
Theory & foundations
| Repo | What it is |
|---|---|
| SafeAnchor | Safety-preserving continual domain adaptation of LLMs via Fisher-based subspace identification and orthogonal gradient projection. |
| SigGate-GT | Sigmoid-gated attention for graph transformers. Eliminates over-smoothing and stabilizes training via element-wise output gating. |
| pac-learned-index | PAC learning with tight VC-dimension bounds and provable sample-complexity guarantees for learned database indexes. |
| JoinPAC | PAC learnability for join cardinality estimation. Decomposition bounds, drift detection, hybrid-estimation guarantees. |
| neural-precond-spectral | Spectral-equivalence theory with mesh-independent convergence bounds for neural-operator preconditioning of PDE systems. |
| sae-brain-topography | Sparse-autoencoder decomposition of brain–LLM alignment with a priori cortical semantic topography mapping. |
Eight live integrations across our research workflow: code, data, papers, knowledge bases.
| Repo | What it is | Lang |
|---|---|---|
| mcp-gateway | Turns any OpenAPI 3.x spec into a Model Context Protocol server. Auth, rate limiting, and OpenTelemetry baked in. | |
| mcp-postgres | Postgres MCP server for agents. Four-tier safety: role grants, pglast AST guard, per-tx envelope, audit log. Schema introspection, EXPLAIN analysis, pgvector. PG 13 to 17. | |
| mcp-jupyter | MCP server for Jupyter. Live kernel state (variables, dataframes, plots, tracebacks) instead of just the .ipynb JSON. | |
| mcp-wandb-2 | Analytical MCP server for Weights & Biases: hparam importance, sweep summaries, run-delta analysis, inline charts, gated Launch actions. | |
| paperbase-mcp | Research-grade MCP composing arXiv, Semantic Scholar, and OpenAlex. Related work, citation graphs, BibTeX in your chat. | |
| mcp-overleaf | MCP server and Skills bundle for finishing a LaTeX paper: bib cleanup, venue rule packs, latexdiff, related-work drafting. | |
| obsidian_mcp | MCP plus 7 Claude skills for Obsidian vaults. Read, search, write, and link notes from Claude / Cursor / ChatGPT. Filesystem-direct, local-first, round-trip safe. | |
| semantic-grep | Local semantic code search. CLI and MCP server, all on your machine. | |
Local-first when possible; verifiable when not.
| Repo | What it is | Lang |
|---|---|---|
| Vannevar | Open-source agentic harness with citation-grade memory. Every fact carries a source URI, a temporal validity window, and an append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable. | |
| agent-memory | Verifiable memory for LLM agents. Every recalled claim is HMAC-signed back to its originating trajectory span. | Rust |
| computer_use_agent | Open-source local-VLM browser agent. AT-tree-first routing with VLM fallback, refusals enforced in code, honest benchmarks including the failure atlas. | |
| whisper_agent | Hands-free local voice agent: faster-whisper STT, local LLM with tool use, TTS. Runs entirely on your machine. | |
| agent-tracer-2 | OpenTelemetry-native, local-first observability for AI agents. DuckDB on disk, Next.js viewer on localhost, no SaaS. Adapters for Anthropic, OpenAI, LangGraph, AutoGen, CrewAI. | |
| local-deep-research | Self-hosted deep-research agent: multi-step query planning, source synthesis, report generation. Ollama / llama.cpp / vLLM friendly, with SearXNG, FAISS, and BM25. | |
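The agent-memory entry above describes claims that are HMAC-signed back to a trajectory span. A minimal sketch of that idea follows, using only Python's standard library. The record layout, the field names (`trajectory_id`, `span`), and the helper functions are assumptions made for illustration. They are not the repo's actual format or API, and the repo itself is described elsewhere on this page as written in Rust.

```python
import hashlib
import hmac
import json


def _canonical_bytes(claim, trajectory_id, span_start, span_end):
    # Serialize deterministically so signer and verifier hash identical bytes.
    record = {"claim": claim, "trajectory_id": trajectory_id, "span": [span_start, span_end]}
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_claim(key, claim, trajectory_id, span_start, span_end):
    """Bind a claim to the trajectory span it came from with an HMAC-SHA256 tag."""
    message = _canonical_bytes(claim, trajectory_id, span_start, span_end)
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {
        "claim": claim,
        "trajectory_id": trajectory_id,
        "span": [span_start, span_end],
        "tag": tag,
    }


def verify_claim(key, record):
    """Return True only if claim, trajectory id, and span are all unmodified."""
    message = _canonical_bytes(record["claim"], record["trajectory_id"], *record["span"])
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many leading characters match.
    return hmac.compare_digest(expected, record["tag"])


key = b"demo-key-not-for-production"  # hypothetical key, for the example only
record = sign_claim(key, "The user prefers metric units.", "traj-001", 120, 164)
print(verify_claim(key, record))  # True

tampered = dict(record, span=[0, 10])  # re-point the claim at a different span
print(verify_claim(key, tampered))  # False
```

Because the span is part of the signed message, a recalled claim cannot be silently re-attributed to a different part of the trajectory without invalidating the tag.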
Reproducible by default. Probe for contamination and reward hacks before declaring a number.
| Repo | What it is | Lang |
|---|---|---|
| agent_eval | Open-source benchmark for Claude Code skill bundles. Pass@k plus cost plus reliability, content-addressed leaderboard across Anthropic / OpenAI / Google. | |
| bench_audit | Library of probes for agent benchmarks: contamination, gold-answer leaks, harness-injection vulnerabilities, reward hacking. CIs on every result. | |
| rag-bench | Small, reproducible benchmark for RAG pipelines. | |
| agent-arena | Arena-style framework for head-to-head agent comparison. | |
| paper-replay | Replay and reproduce paper experiments with locked seeds, environments, and artifacts. | |
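The agent_eval entry above reports Pass@k. As a sketch of how that metric is commonly computed, the function below uses the standard unbiased estimator: from n sampled attempts of which c passed, it estimates the probability that at least one of k attempts passes. Whether agent_eval uses exactly this estimator is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n attempts with c successes."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimate reduces to the plain success rate c / n. Averaging the per-task estimates across a benchmark gives the headline number.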
Quality layers, lockfiles, and ergonomics for the agent stack.
| Repo | What it is | Lang |
|---|---|---|
| promptlock | Production prompt workflow: semantic diff, eval-on-PR, lockfile, drift detection, and rollback for plain-markdown prompts in your repo. | |
| skill-forge-2 | Quality layer for Claude Code Skills: lint, test, and bench before you ship. | |
| browser-skills | 15 reusable, agent-agnostic browser recipes plus an MCP server. Cookie banners, infinite scroll, calendar widgets, all solved once. | |
| diagram-skills | Generate validated diagrams across Mermaid, PlantUML, Graphviz, D2, and Excalidraw. MCP server, CLI, and Claude Code skills. | |
| see-the-ai-think | Watch an LLM think. Interactive interpretability tool that visualizes sparse-autoencoder features firing live across every token. Runs on your laptop, no GPU required. | |
| paper_pod | Local-first audio overviews for academic papers. Take an arXiv URL, PDF, or BibTeX in, get an 8 to 15 minute two-host podcast out. | |
| paper2repro | Paper to reproducible experiment scaffold. | |
| test_forge | Test-generation toolkit for Python research code. | |
What we had to learn the hard way, written down for the next person.
📓 Atlases & annotated notebooks
| Repo | What it is |
|---|---|
| awesome-llm-circuits-atlas | Interactive atlas of discovered circuits and SAE features in large language models, with Colab reproductions on open-weights models. |
| awesome-reasoning-models-theory | Theory-first map of why reasoning models (o1/o3, DeepSeek-R1, Claude-thinking, Qwen-QwQ) actually work. 8 chapters, 60+ annotated papers, 13 models compared, 5 reproduction notebooks, live benchmarks. |
| retrieval-from-scratch | Modern Information Retrieval from scratch in PyTorch. BM25, dense bi-encoders, ColBERT late interaction, cross-encoder reranking, and RAG, in annotated notebooks that run on a single GPU. |
🗺️ Maps, lists & roadmaps
| Repo | What it is |
|---|---|
| awesome-why-llms-work | Falsifiable-hypothesis atlas of why LLMs work. Five competing research programmes, 41 tracked claims with epistemic status (🟢🟡🔴⚪) and named falsifiers. |
| awesome-llm-reasoning-foundations | Curated, rigorously verified map of the theoretical foundations of LLM reasoning: transformer expressivity, chain-of-thought error bounds, circuit complexity, logical characterizations, learnability. |
| llm-impossibility-results | Verified, assumption-explicit catalog of published impossibility and lower-bound results for LLMs and AI agents: circuit-complexity ceilings, hallucination bounds, watermarking impossibility, alignment. |
| awesome-llm-theory | Companion list: theory papers for LLM behavior, expressiveness, and learnability. |
| build-your-own-ai | Master modern AI by building it from scratch: curated index of the best build-it-yourself guides for tokenizers, attention, training, RAG, agents, and evals. |
| awesome-research-agents | Opinionated, curated list of agents, skills, MCP servers, and tools ML researchers actually use. |
| ai-engineer-roadmap | Interactive end-to-end roadmap for AI engineers. 12 stages, 122 nodes, 276 link-verified resources from math prerequisites to the research frontier. |
| harness-engineer-roadmap | Interactive roadmap for harness engineering: the agent loop, tool layers, context engineering, memory, retrieval, eval. |
| llm-interview-prep | Interview-prep notebook for LLM and ML-systems roles. |
Translational work. Coverage proofs and scheduling guarantees in production, against real workloads.
- Stellaris AI. Conformal-coverage pipelines and multi-tenant scheduling for native-safe foundation models, in regulated deployments.
- Brain Investing. HKU FinTech spin-out. Conformal-bound risk management running against live P&L. The lab's coverage work, in a real trading book.
- Peer review at NeurIPS, ACM Multimedia (Main & Dataset Tracks), UAI.
- Mentoring LLM-infrastructure engineers at Stellaris AI on conformal coverage, agent evaluation, and multi-tenant scheduling.
- Eight consecutive years of competitive funding (2018–2025): TSSSU, HKSTP Incu-Tech, Cyberport (×2 intakes), HKU iAXON Deep Tech.
Open to positions where theory and deployment share a research agenda. Areas: trustworthy & compliance-grade AI · multi-agent systems & mechanism design · LLM theory (descriptive complexity, in-context reasoning) · serving systems for inference. Reach me at bettyguo@connect.hku.hk.
Dongxin (Betty) Guo · The University of Hong Kong · Department of Computer Science
homepage · scholar · orcid · openreview · linkedin
Last updated May 2026




