Gold-Guided Programmatic Distillation for Verifiable Financial QA (TAT-QA)

Stanford CS224N (Natural Language Processing with Deep Learning) Final Project. Course website: https://web.stanford.edu/class/cs224n/index.html

Overview

Financial disclosures are long, dense, and computation-heavy. While frontier LLMs can answer many finance questions, they are expensive to run and prone to arithmetic errors or unsupported statements. This project investigates whether a smaller language model can approach large-model performance on financial question answering over hybrid tables and text by learning executable reasoning.

We propose Gold-Guided Programmatic Distillation (GGPD), a PaD-style teacher–student framework that distills Program-of-Thought supervision rather than natural-language Chain-of-Thought. For arithmetic questions, a large teacher generates executable Python programs guided by gold symbolic derivations (from TAT-QA), and we retain only samples that pass execution-based verification. A smaller student is then fine-tuned to generate the verified programs conditioned on the question and context, enabling verifiable numeric reasoning via an external interpreter.

Dataset

TAT-QA (Table-and-Text Question Answering). Original paper / dataset: https://github.com/NExTplusplus/TAT-QA

Answer Types

TAT-QA contains heterogeneous supervision:

Arithmetic / computation-heavy (ratios, growth rates, sums, margins, etc.)
Non-arithmetic (extractive spans / short grounded textual answers)

Our pipeline applies program synthesis + execution filtering primarily to arithmetic instances.
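The split between the two answer types can be sketched as below. This is a minimal illustration, not the repository's code: it assumes each question record carries an `answer_type` field whose value is `"arithmetic"` for computation questions, as in the public TAT-QA release, and the helper name `split_by_answer_type` and the sample records are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_answer_type(questions: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition question records into (arithmetic, non_arithmetic).

    Assumption: records expose an "answer_type" key; anything that is not
    labeled "arithmetic" is treated as non-arithmetic.
    """
    arithmetic, non_arithmetic = [], []
    for q in questions:
        if q.get("answer_type") == "arithmetic":
            arithmetic.append(q)
        else:
            non_arithmetic.append(q)
    return arithmetic, non_arithmetic


# Hypothetical sample records, for illustration only.
sample = [
    {
        "question": "What was the percentage change in revenue?",
        "answer_type": "arithmetic",
        "derivation": "(120 - 100) / 100",
    },
    {"question": "Which segment grew fastest?", "answer_type": "span"},
]

arith, other = split_by_answer_type(sample)
print(len(arith), len(other))
```

Only the first group would go through program synthesis and execution filtering; the second keeps its textual answer as the training target.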

Method Summary

Teacher Phase (Gold-Guided Data Synthesis)

Teacher: Qwen2.5-72B-Instruct (data generation only)
Inputs: question and hybrid context (table + associated text) + gold symbolic derivation (when available)
Output: a Python function that implements the derivation with explicit unit/scale normalization when required
Verification: run the program in a sandboxed Python interpreter and compare its output against the gold answer within a tolerance (default: 1e-4)
Keep only verified samples
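The verification step above might look roughly like the following sketch. Several details are assumptions rather than the project's actual implementation: the teacher program is assumed to define a zero-argument function named `solve`, the tolerance is treated as an absolute tolerance, and a plain `exec` with a small builtins whitelist stands in for the sandbox (this is not real isolation; a production setup would run untrusted code in a separate process or container).

```python
import math
from typing import Optional

# Assumption: a tiny whitelist of builtins is enough for arithmetic programs.
SAFE_BUILTINS = {
    "abs": abs, "round": round, "sum": sum,
    "min": min, "max": max, "float": float, "int": int,
}


def run_program(source: str, entry_point: str = "solve") -> Optional[float]:
    """Execute a candidate program and return its numeric result, or None on failure."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    try:
        exec(source, namespace)  # stand-in for a real sandbox
        return float(namespace[entry_point]())
    except Exception:
        return None


def is_verified(source: str, gold: float, tol: float = 1e-4) -> bool:
    """Keep a sample only if it runs and matches the gold answer within tol."""
    result = run_program(source)
    return result is not None and math.isclose(result, gold, rel_tol=0.0, abs_tol=tol)


# Hypothetical teacher outputs, for illustration only.
good_program = """
def solve():
    revenue_2019 = 120.0  # in millions
    revenue_2018 = 100.0  # in millions
    return (revenue_2019 - revenue_2018) / revenue_2018 * 100
"""
bad_program = """
def solve():
    return 1 / 0
"""

candidates = [(good_program, 20.0), (bad_program, 20.0)]
verified = [src for src, gold in candidates if is_verified(src, gold)]
print(len(verified))
```

Programs that crash, return a non-numeric value, or miss the gold answer are all dropped the same way, so the student never sees an unverified target.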

Student Phase (Fine-Tuning)

Student: Qwen2.5-7B-Instruct
Training: SFT + LoRA
Targets:
- Arithmetic: verified Python programs (Program-of-Thought)
- Non-arithmetic: grounded short textual answers (optional evidence span)
Inference:
- Arithmetic: generate program → execute → return numeric answer
- Non-arithmetic: generate answer directly
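The two inference paths can be sketched as a single dispatch function. This is a hedged illustration only: the source does not say how the student's output is routed, so the sketch assumes a generated program is recognizable by a `def solve` definition. The `generate` callable, the prompt layout, and `fake_student` are hypothetical stand-ins for the fine-tuned model.

```python
from typing import Callable, Union


def answer(question: str, context: str, generate: Callable[[str], str]) -> Union[float, str]:
    """Route the student's output: execute programs, return text answers as-is."""
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n"
    output = generate(prompt)
    if "def solve" in output:
        # Arithmetic path: run the generated program and return its numeric result.
        # Assumption: restricted exec stands in for an external sandboxed interpreter.
        namespace = {"__builtins__": {"abs": abs, "round": round, "sum": sum}}
        exec(output, namespace)
        return float(namespace["solve"]())
    # Non-arithmetic path: the generated text is the answer.
    return output.strip()


def fake_student(prompt: str) -> str:
    """Hypothetical stand-in for the fine-tuned student model."""
    if "change" in prompt:
        return "def solve():\n    return (120.0 - 100.0) / 100.0 * 100\n"
    return "North America"


numeric = answer(
    "What is the percentage change in revenue?",
    "Revenue: 120 (2019), 100 (2018)",
    fake_student,
)
textual = answer(
    "Which region had the highest revenue?",
    "North America led all regions.",
    fake_student,
)
print(numeric, textual)
```

Keeping the arithmetic inside an interpreter means the final number comes from executed code, not from the model's token-level arithmetic.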

Repository Structure

data_exploration_plots
data_preparation
dataset_filtered
dataset_raw
evaluation_metrics
flat_data
prompt_optimization
student_infer_eval
student_model
student_model_fine_tune
teacher_model
token_counter
.gitignore
README.md
python
