
Dongxin (Betty) Guo

Final-year PhD @ HKU CS · Hong Kong
I prove the architectural limits of LLM reasoning, and build the systems that route around them.

About

I'm a final-year PhD candidate in the Department of Computer Science at The University of Hong Kong, advised by Prof. Siu-Ming Yiu. My research sits at the intersection of three threads that keep refusing to be separate:

What transformers can actually reason about. Tight architectural bounds, plus the tool-delegation systems those bounds force you to build.
Trustworthy LLMs in regulated settings. Compliance-grade explainability, distribution-free coverage, atomic claim verification.
Serving infrastructure that respects both. Workflow-atomic GPU scheduling with per-tenant fairness guarantees.

Theorems tell you what cannot be done. Systems make precise what can.

The cycle runs both ways: deployment surfaces the limits worth proving, and the proofs become the constraints that keep deployment honest.

🎉 News

[05.2026] 🎉🎉🎉 Two papers accepted to ICML 2026: The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary (Main) and Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements (Position Track).
[05.2026] ✨ On the postdoc market for Fall 2026. Areas: trustworthy / compliance-grade AI, multi-agent systems & mechanism design, LLM theory, and serving systems. Reach out at bettyguo@connect.hku.hk.
[05.2026] 📝 Serving as a reviewer for NeurIPS, ACM Multimedia (Main & Dataset Tracks), and UAI.
[04.2026] 🏆🏆🏆 Four papers accepted to ACL 2026 Industry Track: FinGround (atomic claim verification), RouteNLP (conformal LLM routing), AgentEval (DAG-structured agent evaluation), and ComplianceNLP (KG-augmented regulatory gap detection).
[04.2026] 🚀🚀🚀 SAGA accepted to HPDC 2026. It's a workflow-atomic scheduler for AI agent inference on GPU clusters, with per-tenant fairness guarantees that hold under real multi-tenant load.
[03.2026] 📣 Adaptive Retrieval for Large Reasoning Models accepted to SIGIR 2026. When to retrieve during reasoning, with bounds, not heuristics.
[02.2026] 💼 Conformal-bound risk management at Brain Investing is now running against live P&L. That's our HKU FinTech spin-out, and the lab's coverage work has finally made it onto a real trading book.
[01.2026] 🛠️ Shipped multi-tenant scheduling and conformal-coverage pipelines at Stellaris AI for native-safe foundation-model deployment in regulated industries.
[09.2025] 🎓 Began the final year of PhD at HKU CS, advised by Prof. Siu-Ming Yiu. Thesis focuses on the theory-meets-deployment cycle: bounds on transformer reasoning, and the systems those bounds force.
[08.2025] 🏅 Continuing Cyberport Incubation (2023–2025 intake). That keeps an unbroken 2018–2025 funding run going across TSSSU, HKSTP Incu-Tech, HKU iAXON Deep Tech, and Cyberport.

Theory. Production. Curation.

Eight ICML / SIGIR / ACL / HPDC papers this cycle. Conformal-bound risk on a live trading book. HMAC-signed agent memory in Rust. A 60-paper survey atlas of LLM reasoning theory. Built across HKU CS, Stellaris AI, and Brain Investing.

⚡ At a Glance

8 papers, 2026 cycle
_{ICML × 2 · SIGIR · HPDC · ACL Industry × 4}

54 original OSS repos
_{15 research · 8 MCP · 6 agents · 5 benchmarks · 8 tools · 12 curated}

2 in production
_{Stellaris AI · Brain Investing · conformal-bound risk on live P&L}

10 years of continuous funding
_{TSSSU · HKSTP Incu-Tech · Cyberport (×2) · iAXON}

🌟 Showcase

Four projects worth a second look:

📈 `realm-retrieve`

ReaLM-Retrieve · SIGIR 2026. When to retrieve during reasoning, with bounds rather than heuristics. Highest-cloned repo in this account.

_{Python · ⭐ 18 · 🍴 7 · most-cloned}

🚀 `SAGA`

HPDC 2026. Workflow-atomic GPU-cluster scheduler for AI agents. Within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain / AutoGen / CrewAI bridges.

_{Python · C++ · concrete-metric flagship}

🧬 `Vannevar`

Open-source agentic harness with citation-grade memory: source URI, temporal validity window, append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable.

_{Rust · flagship infrastructure}

🔐 `agent-memory`

Verifiable memory for LLM agents. Every recalled claim is HMAC-signed back to its originating trajectory span.

_{Python · cryptographically grounded}
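As a rough illustration of what "HMAC-signed back to its originating trajectory span" can mean, here is a minimal sketch using only the Python standard library. It is not the `agent-memory` API: the function names, the record fields (`claim`, `trajectory_id`, `span`, `tag`), and the canonical-JSON encoding are all assumptions made for this example.

```python
import hashlib
import hmac
import json
from typing import Tuple


def _canonical_bytes(body: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding to sign.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_claim(key: bytes, claim: str, trajectory_id: str, span: Tuple[int, int]) -> dict:
    """Bind a claim to a (hypothetical) trajectory span with an HMAC-SHA256 tag."""
    record = {"claim": claim, "trajectory_id": trajectory_id, "span": list(span)}
    record["tag"] = hmac.new(key, _canonical_bytes(record), hashlib.sha256).hexdigest()
    return record


def verify_claim(key: bytes, record: dict) -> bool:
    """Recompute the tag over every field except `tag` and compare in constant time."""
    body = {k: v for k, v in record.items() if k != "tag"}
    expected = hmac.new(key, _canonical_bytes(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(record.get("tag", "")))


if __name__ == "__main__":
    key = b"demo-key-not-for-production"
    rec = sign_claim(key, "Q3 revenue rose year over year", "traj-001", (12, 18))
    print(verify_claim(key, rec))                       # record is intact
    print(verify_claim(key, dict(rec, span=[12, 19])))  # span was altered
```

Because the tag covers the claim text and the span together, changing either one after signing makes verification fail.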

📚 Selected Publications

Paper	Venue	Code
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary	ICML 2026 (Main)	`deterministic-horizon`
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models	SIGIR 2026	`realm-retrieve`
Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements	ICML 2026 (Position Track)	_{position paper}
SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters	HPDC 2026	`SAGA`
FinGround: Atomic Claim Verification for Financial LLM Outputs	ACL 2026 Industry Track	`FinGround`
ComplianceNLP: KG-Augmented Regulatory Gap Detection	ACL 2026 Industry Track	`ComplianceNLP`
RouteNLP: Conformal LLM Routing	ACL 2026 Industry Track	`RouteNLP`
AgentEval: DAG-Structured Agent Evaluation	ACL 2026 Industry Track	`AgentEval`

_{Full publication list, PDFs, and BibTeX at bettyguo.github.io.}

🧭 Research Threads

Three lines that keep crossing in our papers. Each thread proves a bound and ships the system that meets it.

🧠 Reasoning & tool use

What softmax attention can realize at inference time, and what it provably cannot. The matching upper and lower bounds become the spec for the tool-delegation layer above them.

_{📄 The Deterministic Horizon (ICML 2026) · Adaptive Retrieval for Large Reasoning Models (SIGIR 2026) · code: deterministic-horizon, realm-retrieve}

🛡️ Trustworthy LLMs for regulated settings

Explainability and verification that survive financial-services audit, not benchmark conditions. Distribution-free coverage, atomic claim verification, knowledge-graph-augmented regulatory gap detection.

_{📄 Current XAI Methods Cannot Satisfy Financial AI Explainability Requirements (ICML 2026, Position Track) · FinGround, ComplianceNLP (ACL 2026 Industry Track) · code: FinGround, ComplianceNLP, TrustKGRAG}

⚡ Serving & agent infrastructure

Workflow-atomic GPU scheduling with per-tenant fairness guarantees that hold under real multi-tenant load. DAG-structured evaluation harnesses and conformal routing for agent cascades.

_{📄 SAGA (HPDC 2026) · RouteNLP, AgentEval (ACL 2026 Industry Track) · code: SAGA, RouteNLP, AgentEval}

📐 Method, in four habits

How we approach problems, across every thread:

Tight bounds with explicit constants. Upper and lower bounds in the same paper. No asymptotic hand-waving.
Impossibility paired with construction. When a thing can't be done, that result becomes a design constraint, not a stopping point.
Guarantees that survive reality. Distribution-free coverage, conformal prediction, fair scheduling. No idealized assumptions.
Theory and the system that meets it, shipped together. The proof tells the algorithm what to achieve; the algorithm tells the proof what's worth bounding.
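To make "distribution-free coverage" concrete, here is a small sketch of split conformal prediction for regression, standard library only. It is a textbook construction and not code from any repo listed here. The predictor `2.0 * x`, the synthetic data, and the helper `simulate_coverage` are assumptions made purely for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def predict_interval(point_prediction, q):
    """Symmetric interval around a point prediction with half-width q."""
    return point_prediction - q, point_prediction + q


def simulate_coverage(alpha=0.1, n_cal=500, n_test=2000, seed=0):
    """Empirical coverage on synthetic data (hypothetical setup for the demo)."""
    rng = random.Random(seed)

    def model(x):
        return 2.0 * x  # stand-in for any fixed, pre-trained predictor

    def sample():
        x = rng.uniform(0.0, 10.0)
        return x, 2.0 * x + rng.gauss(0.0, 1.0)

    # Calibration scores are absolute residuals on held-out data.
    calibration = [sample() for _ in range(n_cal)]
    q = conformal_quantile([abs(y - model(x)) for x, y in calibration], alpha)

    hits = 0
    for _ in range(n_test):
        x, y = sample()
        lo, hi = predict_interval(model(x), q)
        hits += lo <= y <= hi
    return hits / n_test


if __name__ == "__main__":
    print(f"target coverage 0.90, empirical {simulate_coverage():.3f}")
```

The guarantee needs only exchangeability between calibration and test points, not a correct model or a known noise distribution. A worse predictor gives wider intervals, but the same coverage target.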

"Theorems tell you what cannot be done. Systems make precise what can."

🗂️ What lives in this account

54 original public repos. Research code behind every paper, plus the developer infrastructure our team relies on every day across HKU CS, Stellaris AI, and Brain Investing.
_{Browse the full index → github.com/bettyguo?tab=repositories}

🔬 15 _{research}
🔌 8 _{MCP servers}
🤖 6 _{agent systems}
🧪 5 _{benchmarks}
🛠️ 8 _{dev tools}
📚 12 _{atlases & lists}

🔬 Research code

_{One repo per paper. Theory and the system that meets it, in the same artifact.}

Reasoning & retrieval

Repo	What it is
`deterministic-horizon`	ICML '26 companion. Bounds on extended reasoning, and the regime where tool delegation becomes necessary. Explicit constants.
`realm-retrieve`	ReaLM-Retrieve · SIGIR '26 companion. When to retrieve during reasoning, with bounds rather than heuristics.

Serving & agent infrastructure

Repo	What it is
`SAGA`	HPDC '26 companion. Workflow-atomic GPU-cluster scheduler. Within 1.31× of Bélády-optimal KV-cache eviction, with OpenMP-accelerated C++ kernels and LangChain / AutoGen / CrewAI bridges.
`RouteNLP`	ACL '26 Industry companion. Conformal-coverage router for LLM cascade serving.
`AgentEval`	ACL '26 Industry companion. DAG-structured evaluation harness for multi-step agents.
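For readers unfamiliar with the Bélády baseline in the `SAGA` row, the sketch below compares an online policy (LRU) against Bélády's clairvoyant policy on a toy access trace. It is not SAGA's scheduler or its measurement method. The trace, the capacity, and the choice of miss count as the compared quantity are assumptions for illustration only.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count cache misses under least-recently-used eviction."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return misses


def belady_misses(trace, capacity):
    """Count misses under Belady's rule: evict the entry reused furthest ahead."""
    # next_use[i] is the index of the next access to trace[i], or infinity.
    next_use = [0.0] * len(trace)
    upcoming = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = upcoming.get(trace[i], float("inf"))
        upcoming[trace[i]] = i

    cache = {}  # key -> index of its next access
    misses = 0
    for i, key in enumerate(trace):
        if key not in cache:
            misses += 1
            if len(cache) >= capacity:
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[key] = next_use[i]
    return misses


if __name__ == "__main__":
    trace = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
    lru, opt = lru_misses(trace, 3), belady_misses(trace, 3)
    print(f"LRU misses={lru}, Belady misses={opt}, ratio={lru / opt:.2f}")
```

Bélády's policy needs the full future trace, so it is a lower bound to measure against and not something a live scheduler can run.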

Trustworthy & regulated AI

Repo	What it is
`FinGround`	ACL '26 Industry companion. Atomic claim verification for financial LLM outputs.
`ComplianceNLP`	ACL '26 Industry companion. KG-augmented regulatory gap detection.
`TrustKGRAG`	Probabilistic certified robustness and anomaly detection against knowledge-graph poisoning in RAG.
`conformalized-neural-operators`	Distribution-free, spatially adaptive UQ for neural-operator PDE surrogates via physics-informed conformal prediction.

Theory & foundations

Repo	What it is
`SafeAnchor`	Safety-preserving continual domain adaptation of LLMs via Fisher-based subspace identification and orthogonal gradient projection.
`SigGate-GT`	Sigmoid-gated attention for graph transformers. Eliminates over-smoothing and stabilizes training via element-wise output gating.
`pac-learned-index`	PAC learning with tight VC-dimension bounds and provable sample-complexity guarantees for learned database indexes.
`JoinPAC`	PAC learnability for join cardinality estimation. Decomposition bounds, drift detection, hybrid-estimation guarantees.
`neural-precond-spectral`	Spectral-equivalence theory with mesh-independent convergence bounds for neural-operator preconditioning of PDE systems.
`sae-brain-topography`	Sparse-autoencoder decomposition of brain–LLM alignment with a priori cortical semantic topography mapping.

🔌 MCP servers

_{Eight live integrations across our research workflow: code, data, papers, knowledge bases.}

Repo	What it is
`mcp-gateway`	Any OpenAPI 3.x spec into a Model Context Protocol server. Auth, rate-limiting, OpenTelemetry baked in.
`mcp-postgres`	Postgres MCP server for agents. Four-tier safety: role grants, pglast AST guard, per-tx envelope, audit log. Schema introspection, EXPLAIN analysis, pgvector. PG 13 to 17.
`mcp-jupyter`	MCP server for Jupyter. Live kernel state (variables, dataframes, plots, tracebacks) instead of just the `.ipynb` JSON.
`mcp-wandb-2`	Analytical MCP server for Weights & Biases: hparam importance, sweep summaries, run-delta analysis, inline charts, gated Launch actions.
`paperbase-mcp`	Research-grade MCP composing arXiv, Semantic Scholar, and OpenAlex. Related work, citation graphs, BibTeX in your chat.
`mcp-overleaf`	MCP server and Skills bundle for finishing a LaTeX paper: bib cleanup, venue rule packs, latexdiff, related-work drafting.
`obsidian_mcp`	MCP plus 7 Claude skills for Obsidian vaults. Read, search, write, and link notes from Claude / Cursor / ChatGPT. Filesystem-direct, local-first, round-trip safe.
`semantic-grep`	Local semantic code search. CLI and MCP server, all on your machine.

🤖 Agent systems & runtimes

_{Local-first when possible; verifiable when not.}

Repo	What it is
`Vannevar`	Open-source agentic harness with citation-grade memory. Every fact carries a source URI, a temporal validity window, and an append-only provenance ledger. MCP-native, multi-frontend, fully self-hostable.
`agent-memory`	Verifiable memory for LLM agents. Every recalled claim is HMAC-signed back to its originating trajectory span.
`computer_use_agent`	Open-source local-VLM browser agent. AT-tree-first routing with VLM fallback, refusals enforced in code, honest benchmarks including the failure atlas.
`whisper_agent`	Hands-free local voice agent: faster-whisper STT, local LLM with tool use, TTS. Runs entirely on your machine.
`agent-tracer-2`	OpenTelemetry-native, local-first observability for AI agents. DuckDB on disk, Next.js viewer on localhost, no SaaS. Adapters for Anthropic, OpenAI, LangGraph, AutoGen, CrewAI.
`local-deep-research`	Self-hosted deep-research agent: multi-step query planning, source synthesis, report generation. Ollama / llama.cpp / vLLM friendly, with SearXNG, FAISS, and BM25.

🧪 Benchmarks & evaluation

_{Reproducible by default. Probe for contamination and reward hacks before declaring a number.}

Repo	What it is
`agent_eval`	Open-source benchmark for Claude Code skill bundles. Pass@k plus cost plus reliability, content-addressed leaderboard across Anthropic / OpenAI / Google.
`bench_audit`	Library of probes for agent benchmarks: contamination, gold-answer leaks, harness-injection vulnerabilities, reward hacking. CIs on every result.
`rag-bench`	Small, reproducible benchmark for RAG pipelines.
`agent-arena`	Arena-style framework for head-to-head agent comparison.
`paper-replay`	Replay and reproduce paper experiments with locked seeds, environments, and artifacts.
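The Pass@k metric mentioned for `agent_eval` is commonly computed with the unbiased estimator from code-generation evaluation: draw n samples per task, count c passes, and estimate the chance that at least one of k draws passes. The table does not say which estimator the repo uses, so treat the sketch below as the standard formula under that assumption, with a hypothetical function name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k).

    n: samples drawn for the task, c: samples that passed, k: budget.
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n >= 1, 0 <= c <= n, and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 passed: estimate success within 1 and within 5 attempts.
    print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```

Averaging this value over tasks gives the benchmark-level number. Reporting a confidence interval alongside it needs a separate step, such as a bootstrap over tasks.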

🛠️ Developer tools & skills

_{Quality layers, lockfiles, and ergonomics for the agent stack.}

Repo	What it is
`promptlock`	Production prompt workflow: semantic diff, eval-on-PR, lockfile, drift detection, and rollback for plain-markdown prompts in your repo.
`skill-forge-2`	Quality layer for Claude Code Skills: lint, test, and bench before you ship.
`browser-skills`	15 reusable, agent-agnostic browser recipes plus an MCP server. Cookie banners, infinite scroll, calendar widgets, all solved once.
`diagram-skills`	Generate validated diagrams across Mermaid, PlantUML, Graphviz, D2, and Excalidraw. MCP server, CLI, and Claude Code skills.
`see-the-ai-think`	Watch an LLM think. Interactive interpretability tool that visualizes sparse-autoencoder features firing live across every token. Runs on your laptop, no GPU required.
`paper_pod`	Local-first audio overviews for academic papers. Take an arXiv URL, PDF, or BibTeX in, get an 8 to 15 minute two-host podcast out.
`paper2repro`	Paper to reproducible experiment scaffold.
`test_forge`	Test-generation toolkit for Python research code.

📚 Curated knowledge

_{What we had to learn the hard way, written down for the next person.}

📓 Atlases & annotated notebooks

Repo	What it is
`awesome-llm-circuits-atlas`	Interactive atlas of discovered circuits and SAE features in large language models, with Colab reproductions on open-weights models.
`awesome-reasoning-models-theory`	Theory-first map of why reasoning models (o1/o3, DeepSeek-R1, Claude-thinking, Qwen-QwQ) actually work. 8 chapters, 60+ annotated papers, 13 models compared, 5 reproduction notebooks, live benchmarks.
`retrieval-from-scratch`	Modern Information Retrieval from scratch in PyTorch. BM25, dense bi-encoders, ColBERT late interaction, cross-encoder reranking, and RAG, in annotated notebooks that run on a single GPU.
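As a taste of the "from scratch" style, here is a compact BM25 scorer in plain Python. It is not taken from `retrieval-from-scratch`, which the table describes as PyTorch notebooks. The whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant are assumptions chosen for a short, self-contained example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # IDF variant with +1 inside the log, so it never goes negative.
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average docs are penalized.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "dogs chase cats",
        "conformal prediction gives coverage",
    ]
    print(bm25_scores("conformal coverage", corpus))
```

A lexical scorer like this is the usual baseline that dense bi-encoders and late-interaction models are compared against.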

🗺️ Maps, lists & roadmaps

Repo	What it is
`awesome-why-llms-work`	Falsifiable-hypothesis atlas of why LLMs work. Five competing research programmes, 41 tracked claims with epistemic status (🟢🟡🔴⚪) and named falsifiers.
`awesome-llm-reasoning-foundations`	Curated, rigorously-verified map of the theoretical foundations of LLM reasoning: transformer expressivity, chain-of-thought error bounds, circuit complexity, logical characterizations, learnability.
`llm-impossibility-results`	Verified, assumption-explicit catalog of published impossibility and lower-bound results for LLMs and AI agents: circuit-complexity ceilings, hallucination bounds, watermarking impossibility, alignment.
`awesome-llm-theory`	Companion list: theory papers for LLM behavior, expressiveness, and learnability.
`build-your-own-ai`	Master modern AI by building it from scratch: curated index of the best build-it-yourself guides for tokenizers, attention, training, RAG, agents, and evals.
`awesome-research-agents`	Opinionated, curated list of agents, skills, MCP servers, and tools ML researchers actually use.
`ai-engineer-roadmap`	Interactive end-to-end roadmap for AI engineers. 12 stages, 122 nodes, 276 link-verified resources from math prerequisites to the research frontier.
`harness-engineer-roadmap`	Interactive roadmap for harness engineering: the agent loop, tool layers, context engineering, memory, retrieval, eval.
`llm-interview-prep`	Interview-prep notebook for LLM and ML-systems roles.

🏭 Deployment

Translational work. Coverage proofs and scheduling guarantees in production, against real workloads.

Stellaris AI. Conformal-coverage pipelines and multi-tenant scheduling for native-safe foundation models, in regulated deployments.
Brain Investing. HKU FinTech spin-out. Conformal-bound risk management running against live P&L. The lab's coverage work, in a real trading book.

🏅 Service & Recognition

Peer review at NeurIPS, ACM Multimedia (Main & Dataset Tracks), UAI.
Mentoring LLM-infrastructure engineers at Stellaris AI on conformal coverage, agent evaluation, and multi-tenant scheduling.
Ten consecutive years of competitive funding (2018–2025): TSSSU, HKSTP Incu-Tech, Cyberport (×2 intakes), HKU iAXON Deep Tech.

📬 Availability

Postdoc, Fall 2026

Open to positions where theory and deployment share a research agenda.

Areas. Trustworthy & compliance-grade AI · Multi-agent systems & mechanism design · LLM theory (descriptive complexity, in-context reasoning) · Serving systems for inference.

Reach me at bettyguo@connect.hku.hk

_{Dongxin (Betty) Guo · The University of Hong Kong · Department of Computer Science

homepage · scholar · orcid · openreview · linkedin

Last updated May 2026}