Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 57 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
# Data files
dataset/*.json

# Cache and generated files
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Logging
*.log
logs/
nohup.out

# OS-specific files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# IDE files
.idea/
.vscode/
*.swp
*.swo

# Result files
evaluation/results/
*.csv

# Jupyter Notebook
.ipynb_checkpoints

# Keep dataset README
!dataset/README.md

# Keep configuration files
!evaluation/my_ragas/data/gt_cache.json
285 changes: 276 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,17 +1,284 @@
## My Project
<p align="center">
<img src="assets/amazon_logo.png" height="80px" alt="Amazon">
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="assets/logo_new.png" height="100px" alt="SenTSR-Bench">
</p>

TODO: Fill this README out!
<h1 align="center">SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning</h1>

Be sure to:
<p align="center">
<sub>Zelin He, Boran Han, Xiyuan Zhang, Shuai Zhang, Haotian Lin, Qi Zhu, Haoyang Fang,<br>Danielle C. Maddix, Abdul Fatir Ansari, Akash Chandrayan, Abhinav Pradhan, Bernie Wang, Matthew Reimherr</sub>
</p>

* Change the title in this README
* Edit your repository description on GitHub
<p align="center">
<a href="https://arxiv.org/abs/2602.19455"><img src="https://img.shields.io/badge/Paper-arXiv%3A2602.19455-red" alt="Paper"></a>
<a href="https://huggingface.co/datasets/ZLHe0/SenTSR-Bench"><img src="https://img.shields.io/badge/Dataset-HuggingFace-yellow" alt="Dataset"></a>
<a href="https://zlhe0.github.io/SenTSR-Bench-Website/"><img src="https://img.shields.io/badge/Website-SenTSR--Bench-blue" alt="Website"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License: Apache 2.0"></a>
<a href="https://aistats.org/aistats2026/"><img src="https://img.shields.io/badge/Venue-AISTATS%202026-purple" alt="Venue"></a>
</p>

## Security
<p align="center">
Official implementation of the <b>SenTSR-Bench</b> knowledge injection framework.<br>
Inject in-domain knowledge from fine-tuned time-series specialists into frozen general reasoning LMs<br>
for robust, context-aware diagnostic reasoning.
</p>

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
---

## License
## Overview

<div align="center">
<img src="assets/Intro.png" alt="SenTSR-Bench Overview" width="800px">
<br>
<em>(a) A fine-tuned TSLM captures key time-series patterns but fails on diagnostic reasoning; (b) A general-purpose GRLM reasons well but overlooks domain-specific patterns; (c) Our knowledge injection steers the GRLM's reasoning with in-domain knowledge from the TSLM, producing the correct diagnosis.</em>
</div>

<br>

**General reasoning LMs** (GRLMs) show strong reasoning but lack domain-specific time-series knowledge. **Time-series specialist LMs** (TSLMs) capture signal patterns but struggle with multi-step diagnostic reasoning. Our framework bridges this gap by **injecting TSLM-generated insights directly into the GRLM's reasoning trace** — no weight updates needed.

<div align="center">
<img src="assets/Paradigm.png" alt="Knowledge Injection Paradigm" width="600px">
<br>
<em>(a) Knowledge injection paradigm; (b) RL-honed thinking traces via RLVR enable effective injection without human supervision.</em>
</div>

### Key Contributions

1. **New Paradigm** &mdash; A framework that injects in-domain knowledge from a TSLM into a GRLM's reasoning process, steering reasoning with domain knowledge
2. **RL-Based Injection** &mdash; Reinforcement learning with verifiable rewards (RLVR) elicits knowledge-rich thinking traces *without manual supervision*
3. **SenTSR-Bench** &mdash; A new benchmark of 110 real-world multivariate sensor streams with 330 human-curated diagnostic questions
4. **Strong Results** &mdash; Surpasses TSLMs by 9.1%&ndash;26.1% and GRLMs by 7.9%&ndash;22.4% across benchmarks

### Main Results

<div align="center">
<img src="assets/result.png" alt="Main Results" width="700px">
<br>
<em>Reasoning performance on SenTSR-Bench. RL-injection consistently outperforms all baselines.</em>
</div>

---

## Table of Contents

- [Overview](#overview)
- [Setup](#setup)
- [Benchmark](#benchmark)
- [SenTSR-Bench Evaluation Benchmark](#sentsr-bench-evaluation-benchmark)
- [Public Benchmark Data Curation](#public-benchmark-data-curation)
- [Synthetic Data Generation Pipeline](#synthetic-data-generation-pipeline)
- [Method: Knowledge Injection](#method-knowledge-injection)
- [Claude (GRLM) + ChatTS (TSLM)](#closed-source-claude-grlm--chatts-tslm)
- [Qwen3 (GRLM) + Qwen-VL (TSLM)](#open-source-qwen3-grlm--qwen-vl-tslm)
- [DeepSeek-R1 (GRLM) + Qwen-VL (TSLM)](#open-source-deepseek-r1-grlm--qwen-vl-tslm)
- [TSLM Training](#tslm-training)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [License](#license)

---

## Setup

### Prerequisites

- Python 3.10+
- Conda for environment management
- AWS account with access to Claude models via Bedrock (for closed-source experiments)
- GPU for running self-hosted model servers (8xA100 recommended)

### Installation

```bash
git clone https://github.com/amazon-science/SenTSR-Bench.git
cd SenTSR-Bench

conda create -n tsr-env python=3.10
conda activate tsr-env
pip install -r requirements.txt

# (For Claude experiments) Configure AWS credentials
aws configure
```

---

## Benchmark

### SenTSR-Bench Evaluation Benchmark

SenTSR-Bench is a first-of-its-kind dataset of **110 multivariate sensor streams** with **330 human-curated diagnostic questions**, built from real-world industrial operations. Each time series contains 3 sensor channels (acceleration, velocity, temperature).

The benchmark evaluates a three-stage diagnostic reasoning chain:

| Stage | Task | Description |
|-------|------|-------------|
| **What Happened** | Anomaly Characterization | Identify key time-series anomaly patterns |
| **How Happened** | Root-Cause Diagnosis | Determine the most likely causes |
| **Suggested Fix** | Action Recommendation | Propose corrective actions |

Download the evaluation benchmark from HuggingFace:

```bash
# Install huggingface_hub if needed
pip install huggingface_hub

# Download dataset to ./dataset/
huggingface-cli download ZLHe0/SenTSR-Bench --repo-type dataset --local-dir ./dataset
```

See `dataset/README.md` for format specifications.

### Public Benchmark Data Curation

We additionally evaluate on two public benchmarks: **TSEvol** (Dataset A) and **TS&Language** (MCQ2 subset).

```bash
python dataset/preprocess_dataset.py \
--dataset_a path/to/dataset_a.json \
--mcq2_source path/to/MCQ_2_TS.jsonl \
--output_dir ./dataset/processed \
--mcq2_sample_size 100
```

### Synthetic Data Generation Pipeline

The synthetic training data pipeline uses VLM-assisted code synthesis to bootstrap realistic simulators from 23 seed signals, producing **6,000 MCQ training entries**.

| Stage | Script | Description |
|-------|--------|-------------|
| 1. Iterative Code Synthesis | `./scripts/run_iterative_generation.sh` | Claude generates Python simulators from real data |
| 2. Stochastic Diversification | `./scripts/run_stochastic_refinement.sh` | Convert to sampling-based generators |
| 3&ndash;4. Benchmark Generation | `./scripts/run_synthetic_benchmark.sh 100` | Generate synthetic time series + QA/MCQ |

> **Note:** Stage 2 requires manual review to select the best stochastic model per sample before proceeding.

See `dataset/synthetic/README.md` for full pipeline documentation.

---

## Method: Knowledge Injection

This project is licensed under the Apache-2.0 License.
The knowledge injection framework injects TSLM-generated insights directly into the GRLM's reasoning trace. We provide end-to-end examples with multiple model combinations:

- **GRLMs** (General Reasoners): Claude 3.7 Sonnet, Qwen3-32B, DeepSeek-R1-Distill-Qwen-32B
- **TSLMs** (Time-Series Specialists): ChatTS-14B, Qwen2.5-VL-3B (SFT/RL fine-tuned)

### Closed-Source: Claude (GRLM) + ChatTS (TSLM)

**Claude** (via AWS Bedrock) serves as the general reasoner; **ChatTS** provides injected observations via an instructional proxy (`<thinking>` tags).

```bash
# 1. Start the ChatTS server
./src/chatts_utils/start_chatts_server.sh

# 2. Run standalone baselines
./scripts/run_chatts_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json
./scripts/run_claude_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Run knowledge injection (generates observations + injects into Claude)
./scripts/run_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 4. Stop the server
./src/chatts_utils/stop_chatts_server.sh
```

### Open-Source: Qwen3 (GRLM) + Qwen-VL (TSLM)

**Qwen3-32B** serves as the GRLM; **Qwen2.5-VL-3B** provides injected thoughts via `continue_final_message` assistant prefill.

```bash
# 1. Start both servers
./src/qwen_utils/start_qwen_vl_server.sh # Qwen2.5-VL-3B on port 5003
./src/qwen3_utils/start_qwen3_server.sh # Qwen3-32B on port 5001

# 2. Run standalone baseline
./scripts/run_qwen_inference.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Run knowledge injection
./scripts/run_qwen3_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 4. Stop servers
./src/qwen_utils/stop_qwen_vl_server.sh
./src/qwen3_utils/stop_qwen3_server.sh
```

### Open-Source: DeepSeek-R1 (GRLM) + Qwen-VL (TSLM)

**DeepSeek-R1-Distill-Qwen-32B** is an alternative open-source GRLM. It shares the same tokenizer and API as Qwen3, so the same injection script supports both via `--model_name`.

```bash
# 1. Start servers
./src/qwen_utils/start_qwen_vl_server.sh # Qwen2.5-VL-3B on port 5003
./src/r1_utils/start_r1_server.sh # DeepSeek-R1 on port 5002

# 2. Run knowledge injection
./scripts/run_r1_injection_workflow.sh --dataset ./dataset/dataset_a_with_mcq2.json

# 3. Stop servers
./src/qwen_utils/stop_qwen_vl_server.sh
./src/r1_utils/stop_r1_server.sh
```

---

## TSLM Training

The time-series specialist (TSLM) is initialized from the public `Qwen2.5-VL-3B-Instruct` checkpoint and post-trained in two stages:

1. **Supervised Fine-Tuning (SFT)** using [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)
2. **Reinforcement Learning (GRPO)** using [VERL](https://github.com/volcengine/verl)

The synthetic training data generated by `dataset/synthetic/` can be used directly with these frameworks. See Appendix B of the paper for hyperparameter details.

---

## Evaluation

Evaluate results with sampling for statistical robustness (mean &plusmn; std over 3 runs):

```bash
python evaluation/evaluate_with_sampling.py \
--exp experiment_name \
--dataset ./dataset/dataset_a_with_mcq2.json \
--generated ./evaluation/results/experiment_name/generated_answer.json
```

The evaluation uses custom metrics based on the [RAGAS](https://github.com/explodinggradients/ragas) framework. Results are saved to `evaluation/exp/<experiment_name>/`.

---

## Citation

If you find this work useful, please cite:

```bibtex
@misc{he2026sentsrbenchthinkinginjectedknowledge,
title={SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning},
author={Zelin He and Boran Han and Xiyuan Zhang and Shuai Zhang and Haotian Lin and Qi Zhu and Haoyang Fang and Danielle C. Maddix and Abdul Fatir Ansari and Akash Chandrayan and Abhinav Pradhan and Bernie Wang and Matthew Reimherr},
year={2026},
eprint={2602.19455},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.19455},
}
```

---

## Acknowledgements

This project is inspired by and builds upon several excellent open-source projects:

- [**ChatTS**](https://github.com/NetManAIOps/ChatTS)
- [**LLaMA-Factory**](https://github.com/hiyouga/LLaMA-Factory)
- [**VERL**](https://github.com/volcengine/verl)
- [**vLLM**](https://github.com/vllm-project/vllm)

---

## License

This project is licensed under the Apache License 2.0.
Binary file added assets/Intro.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added assets/Paradigm.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added assets/amazon_logo.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added assets/logo_new.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added assets/result.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading