
docs: add reward function customization guide #997

Open
NJX-njx wants to merge 2 commits into inclusionAI:main from NJX-njx:feature/reward-model-docs

Conversation

@NJX-njx
Contributor

@NJX-njx NJX-njx commented Mar 6, 2026

Summary

Add comprehensive reward function customization documentation, addressing long-standing documentation gaps from issues #124 and #162. This also covers the Roadmap item "Example of using a generative or critic-like reward model" from #907.

Changes

New files

  • `docs/en/customization/reward.md`: English reward customization guide
  • `docs/zh/customization/reward.md`: Chinese reward customization guide

Modified files

  • `docs/en/_toc.yml`: Added reward page to Customization section
  • `docs/zh/_toc.yml`: Added reward page to Customization section (定制指南)

Guide Contents

  1. Reward Function Signature: explains the standard interface `(prompt, completions, prompt_ids, completion_ids, **kwargs) -> float`
  2. Rule-Based Reward Functions: math verification and composite (format + accuracy) reward examples
  3. Built-in Rewards Reference: table of all built-in reward functions with module paths
  4. Training Integration: how to pass reward functions via string paths or function objects
  5. Generative Reward Model (LLM-as-Judge): complete example using an external LLM API as a reward evaluator
  6. AsyncRewardWrapper: how slow reward functions are handled with configurable timeout, concurrency, and retry
  7. Registering New Built-in Rewards: step-by-step guide for contributing new reward functions
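For reference, a minimal function matching the signature in item 1 might look like this (a sketch only; the `\boxed{}` format check is illustrative and the function name is hypothetical, not one of AReaL's built-ins):

```python
import re

def format_reward_fn(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    """Return 1.0 if the completion wraps an answer in \\boxed{...}, else 0.0."""
    return 1.0 if re.search(r"\\boxed\{.+?\}", completions) else 0.0
```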

Refs: #124, #162, #907

- Add comprehensive reward customization guide covering:
  - Reward function signature and conventions
  - Rule-based reward functions with examples
  - Built-in reward functions reference table
  - Integration with training scripts
  - Generative reward model (LLM-as-Judge) pattern
  - AsyncRewardWrapper usage for slow reward functions
  - How to register new built-in reward functions
- Add both English (docs/en/customization/reward.md) and
  Chinese (docs/zh/customization/reward.md) versions
- Register new pages in both _toc.yml files

Refs: inclusionAI#124, inclusionAI#162
Copilot AI review requested due to automatic review settings March 6, 2026 12:01
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the AReaL documentation by introducing a detailed guide on customizing reward functions. It provides users with clear instructions and examples for implementing various reward mechanisms, from simple rule-based functions to advanced LLM-as-Judge models, and explains how to integrate them into the training workflow. This update addresses critical documentation gaps and improves the overall usability for developers working with reinforcement learning in AReaL.

Highlights

  • New Reward Customization Guide: Introduced a comprehensive guide for customizing reward functions in AReaL, available in both English and Chinese.
  • Addresses Documentation Gaps: Resolved long-standing documentation issues ([Doc] Provide api for the base reward and reward model #124, Reward function & Reward model #162) and a roadmap item ([Roadmap] 2026 Q1 Milestones #907) by providing detailed explanations for various reward models, including generative/critic-like approaches.
  • Detailed Content Coverage: The guide covers essential aspects such as reward function signatures, examples of rule-based and LLM-as-Judge rewards, built-in reward functions, integration into training workflows, asynchronous handling for slow rewards, and steps for registering new built-in reward functions.
  • Documentation Navigation Updates: Updated the English and Chinese tables of contents to include the newly added reward customization pages, improving discoverability.


Changelog
  • docs/en/_toc.yml
    • Added the 'reward' page to the English Customization section in the table of contents.
  • docs/en/customization/reward.md
    • Added a new English guide detailing reward function customization, including rule-based and generative models, training integration, and advanced topics.
  • docs/zh/_toc.yml
    • Added the 'reward' page to the Chinese Customization section in the table of contents.
  • docs/zh/customization/reward.md
    • Added a new Chinese guide detailing reward function customization, including rule-based and generative models, training integration, and advanced topics.

Contributor

Copilot AI left a comment


Pull request overview

This PR adds comprehensive documentation for reward function customization in AReaL, addressing documentation gaps from issues #124, #162, and the roadmap item from #907. The guide covers the reward function interface, rule-based rewards, built-in rewards, training integration, LLM-as-Judge generative reward models, async handling, and contributing new built-in rewards.

Changes:

  • Added English and Chinese reward customization guides covering the full reward function lifecycle (signature, examples, integration, LLM-as-Judge, async wrapper, registration)
  • Added the new reward page to both English and Chinese table of contents files, placed between dataset and agent customization

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
docs/en/customization/reward.md New English guide for reward function customization
docs/zh/customization/reward.md New Chinese guide (translation of English guide)
docs/en/_toc.yml Added customization/reward entry to Customization section
docs/zh/_toc.yml Added customization/reward entry to 定制指南 section


| `gsm8k_reward_fn` | `areal.reward.gsm8k.gsm8k_reward_fn` | GSM8K math |
| `geometry3k_reward_fn` | `areal.reward.geometry3k.geometry3k_reward_fn` | Geometry3K |
| `clevr_count_70k_reward_fn` | `areal.reward.clevr_count_70k.clevr_count_70k_reward_fn` | CLEVR Count |


Copilot AI Mar 6, 2026


The built-in reward functions table lists gsm8k_reward_fn as a built-in reward function, but it is NOT registered in VALID_REWARD_FN or handled by get_custom_reward_fn in areal/reward/__init__.py (line 8: VALID_REWARD_FN = ["clevr_count_70k", "geometry3k"]). Unlike geometry3k_reward_fn and clevr_count_70k_reward_fn, gsm8k_reward_fn cannot be auto-selected by dataset name. The table should either note this distinction, or the function should be registered in areal/reward/__init__.py as well.

Suggested change
Note: `geometry3k_reward_fn` and `clevr_count_70k_reward_fn` are registered in
`areal.reward.__init__.VALID_REWARD_FN` and can be auto-selected based on the dataset
name. `gsm8k_reward_fn` is shipped as a utility but is **not** auto-selected; use its
module path string (`"areal.reward.gsm8k.gsm8k_reward_fn"`) when configuring
`reward_fn`.

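As background on how such module-path strings work, the resolution mechanism can be sketched generically (this is not AReaL's actual `get_custom_reward_fn`; the function name is hypothetical, and a stdlib attribute stands in for a real reward path):

```python
import importlib

def resolve_reward_fn(path: str):
    """Resolve a dotted 'module.attr' string to the object it names."""
    module_path, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_path), attr)

# Stand-in for a path like "areal.reward.gsm8k.gsm8k_reward_fn":
fn = resolve_reward_fn("json.dumps")
```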
| `gsm8k_reward_fn` | `areal.reward.gsm8k.gsm8k_reward_fn` | GSM8K math |
| `geometry3k_reward_fn` | `areal.reward.geometry3k.geometry3k_reward_fn` | Geometry3K |
| `clevr_count_70k_reward_fn` | `areal.reward.clevr_count_70k.clevr_count_70k_reward_fn` | CLEVR Count |


Copilot AI Mar 6, 2026


The built-in reward functions table lists gsm8k_reward_fn as a built-in reward function, but it is NOT registered in VALID_REWARD_FN or handled by get_custom_reward_fn in areal/reward/__init__.py (line 8: VALID_REWARD_FN = ["clevr_count_70k", "geometry3k"]). Unlike geometry3k_reward_fn and clevr_count_70k_reward_fn, gsm8k_reward_fn cannot be auto-selected by dataset name. The table should either note this distinction, or the function should be registered in areal/reward/__init__.py as well.

Suggested change
Note: `geometry3k_reward_fn` and `clevr_count_70k_reward_fn` are registered in `VALID_REWARD_FN` in `areal.reward.__init__.py`, so they can be auto-selected by dataset name; `gsm8k_reward_fn` is currently neither auto-registered nor selectable by dataset name, so as in the examples below it must be passed explicitly to `reward_fn` via its module path string.

your reward function for async execution with timeout handling:

```python
from areal.api import AsyncRewardWrapper
```

Copilot AI Mar 6, 2026


The import from areal.api import AsyncRewardWrapper will not work because there is no __init__.py in areal/api/. All existing usages in the codebase use from areal.api.reward_api import AsyncRewardWrapper (see e.g., areal/workflow/rlvr.py:12, areal/workflow/multi_turn.py:12). The import path should be corrected to from areal.api.reward_api import AsyncRewardWrapper.

Suggested change
from areal.api import AsyncRewardWrapper
from areal.api.reward_api import AsyncRewardWrapper

…for async execution, with timeout handling:

```python
from areal.api import AsyncRewardWrapper
```

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The import from areal.api import AsyncRewardWrapper will not work because there is no __init__.py in areal/api/. All existing usages in the codebase use from areal.api.reward_api import AsyncRewardWrapper (see e.g., areal/workflow/rlvr.py:12, areal/workflow/multi_turn.py:12). The import path should be corrected to from areal.api.reward_api import AsyncRewardWrapper.

Suggested change
from areal.api import AsyncRewardWrapper
from areal.api.reward_api import AsyncRewardWrapper

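For context, the pattern that a wrapper like `AsyncRewardWrapper` provides (run a blocking reward off the event loop, with a timeout fallback) can be sketched generically with asyncio. This is a simplified illustration with hypothetical names, not the actual `areal.api.reward_api.AsyncRewardWrapper` implementation:

```python
import asyncio

def slow_reward_fn(prompt, completions, prompt_ids, completion_ids, **kwargs) -> float:
    # Stand-in for an expensive synchronous verifier (e.g. symbolic math checking).
    return 1.0 if "42" in completions else 0.0

async def reward_with_timeout(reward_fn, *args, timeout=5.0, default=0.0, **kwargs):
    """Run a blocking reward function in an executor, falling back on timeout."""
    loop = asyncio.get_running_loop()
    try:
        return await asyncio.wait_for(
            loop.run_in_executor(None, lambda: reward_fn(*args, **kwargs)),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        return default  # e.g. treat a timed-out reward as zero
```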
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds comprehensive documentation for reward function customization, which is a great addition. The guide is well-structured and covers various aspects from basic rule-based functions to advanced LLM-as-judge models.

My review focuses on the code examples provided in the new documentation files. I've identified a few areas where the examples can be improved to better reflect best practices in terms of performance, robustness, and code style. Specifically, I've suggested changes to:

  • Move imports to the top of the file scope.
  • Instantiate API clients once and reuse them to avoid inefficiency.
  • Add error handling for external API calls to make the code more robust.

These changes will ensure that users who copy the example code will start with a more solid foundation. The same suggestions apply to both the English and Chinese versions of the documentation to maintain consistency.

Comment on lines +142 to +175
import re

def llm_judge_reward_fn(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    """Use an external LLM API to judge response quality."""
    import openai

    judge_prompt = f"""Rate the following response on a scale of 0 to 10.
Only output the numeric score.

Question: {prompt}
Response: {completions}

Score:"""

    client = openai.OpenAI(
        base_url="http://localhost:8000/v1",  # local vLLM/SGLang server
        api_key="unused",
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
        max_tokens=16,
    )
    score_text = response.choices[0].message.content.strip()

    # Extract numeric score
    match = re.search(r"(\d+(?:\.\d+)?)", score_text)
    if match:
        score = float(match.group(1))
        return min(score / 10.0, 1.0)  # Normalize to [0, 1]
    return 0.0
Contributor


high

This example function can be improved in terms of efficiency, robustness, and style.

  1. Efficiency: A new openai.OpenAI client is created on every function call, which is inefficient. The client should be created once and reused.
  2. Style: The import openai statement is inside the function. Imports should be at the top of the file.
  3. Robustness: The API call is not wrapped in a try...except block. Any network or API error will crash the reward calculation.

Here is a revised version that addresses these points:

import re
import openai

# Create the client once at the module level for efficiency.
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # local vLLM/SGLang server
    api_key="unused",
)

def llm_judge_reward_fn(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    """Use an external LLM API to judge response quality."""
    judge_prompt = f"""Rate the following response on a scale of 0 to 10.
Only output the numeric score.

Question: {prompt}
Response: {completions}

Score:"""

    try:
        response = client.chat.completions.create(
            model="Qwen/Qwen3-8B",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0.0,
            max_tokens=16,
        )
        score_text = response.choices[0].message.content.strip()

        # Extract numeric score
        match = re.search(r"(\d+(?:\.\d+)?)", score_text)
        if match:
            score = float(match.group(1))
            return min(score / 10.0, 1.0)  # Normalize to [0, 1]
    except Exception as e:
        # In a real application, you should log this error.
        print(f"LLM-as-judge API call failed: {e}")

    return 0.0

Comment on lines +137 to +170
import re

def llm_judge_reward_fn(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    """使用外部 LLM API 评判回复质量。"""
    import openai

    judge_prompt = f"""请对以下回复进行评分,评分范围 0-10。
只输出数字分数。

问题:{prompt}
回复:{completions}

分数:"""

    client = openai.OpenAI(
        base_url="http://localhost:8000/v1",  # 本地 vLLM/SGLang 服务
        api_key="unused",
    )
    response = client.chat.completions.create(
        model="Qwen/Qwen3-8B",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
        max_tokens=16,
    )
    score_text = response.choices[0].message.content.strip()

    # 提取数字分数
    match = re.search(r"(\d+(?:\.\d+)?)", score_text)
    if match:
        score = float(match.group(1))
        return min(score / 10.0, 1.0)  # 归一化到 [0, 1]
    return 0.0
Contributor


high

This example function can be improved in terms of efficiency, robustness, and style.

  1. Efficiency: A new openai.OpenAI client is created on every function call, which is inefficient. The client should be created once and reused.
  2. Style: The import openai statement is inside the function. Imports should be at the top of the file.
  3. Robustness: The API call is not wrapped in a try...except block. Any network or API error will crash the reward calculation.

Here is a revised version that addresses these points:

import re
import openai

# Create the client once at the module level for efficiency.
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # 本地 vLLM/SGLang 服务
    api_key="unused",
)

def llm_judge_reward_fn(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    """使用外部 LLM API 评判回复质量。"""
    judge_prompt = f"""请对以下回复进行评分,评分范围 0-10。
只输出数字分数。

问题:{prompt}
回复:{completions}

分数:"""

    try:
        response = client.chat.completions.create(
            model="Qwen/Qwen3-8B",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0.0,
            max_tokens=16,
        )
        score_text = response.choices[0].message.content.strip()

        # 提取数字分数
        match = re.search(r"(\d+(?:\.\d+)?)", score_text)
        if match:
            score = float(match.group(1))
            return min(score / 10.0, 1.0)  # 归一化到 [0, 1]
    except Exception as e:
        # In a real application, you should log this error.
        print(f"LLM-as-judge API call failed: {e}")

    return 0.0

Comment on lines +64 to +86
import re

def composite_reward_fn(
    prompt, completions, prompt_ids, completion_ids, answer, **kwargs
) -> float:
    """Reward that combines format compliance and answer accuracy."""
    format_score = 0.0
    accuracy_score = 0.0

    # Check if response follows the expected format (e.g., uses \boxed{})
    if re.search(r"\\boxed\{.+\}", completions):
        format_score = 0.2

    # Check answer accuracy
    from areal.reward import get_math_verify_worker
    try:
        accuracy_score = get_math_verify_worker().verify(
            str(completions), str(answer)
        ) * 0.8
    except Exception:
        pass

    return format_score + accuracy_score
Contributor


medium

This code snippet can be improved by moving the import statement to the top of the file scope, which is a standard Python convention (PEP 8). This improves readability and avoids re-importing on every function call.

import re
from areal.reward import get_math_verify_worker

def composite_reward_fn(
    prompt, completions, prompt_ids, completion_ids, answer, **kwargs
) -> float:
    """Reward that combines format compliance and answer accuracy."""
    format_score = 0.0
    accuracy_score = 0.0

    # Check if response follows the expected format (e.g., uses \boxed{})
    if re.search(r"\\boxed\{.+\}", completions):
        format_score = 0.2

    # Check answer accuracy
    try:
        accuracy_score = get_math_verify_worker().verify(
            str(completions), str(answer)
        ) * 0.8
    except Exception:
        pass

    return format_score + accuracy_score

Comment on lines +61 to +83
import re

def composite_reward_fn(
    prompt, completions, prompt_ids, completion_ids, answer, **kwargs
) -> float:
    """结合格式合规性和答案准确性的奖励函数。"""
    format_score = 0.0
    accuracy_score = 0.0

    # 检查回复是否遵循预期格式(如使用 \boxed{})
    if re.search(r"\\boxed\{.+\}", completions):
        format_score = 0.2

    # 检查答案准确性
    from areal.reward import get_math_verify_worker
    try:
        accuracy_score = get_math_verify_worker().verify(
            str(completions), str(answer)
        ) * 0.8
    except Exception:
        pass

    return format_score + accuracy_score
Contributor


medium

This code snippet can be improved by moving the import statement to the top of the file scope, which is a standard Python convention (PEP 8). This improves readability and avoids re-importing on every function call.

import re
from areal.reward import get_math_verify_worker

def composite_reward_fn(
    prompt, completions, prompt_ids, completion_ids, answer, **kwargs
) -> float:
    """结合格式合规性和答案准确性的奖励函数。"""
    format_score = 0.0
    accuracy_score = 0.0

    # 检查回复是否遵循预期格式(如使用 \boxed{})
    if re.search(r"\\boxed\{.+\}", completions):
        format_score = 0.2

    # 检查答案准确性
    try:
        accuracy_score = get_math_verify_worker().verify(
            str(completions), str(answer)
        ) * 0.8
    except Exception:
        pass

    return format_score + accuracy_score

Collaborator

@garrett4wade garrett4wade left a comment


The new doc is indeed a plus, but it does not fully resolve the issues.

  1. For the LLM-as-a-judge reward, writing it as a normal synchronous function limits concurrency. That's why we usually wrap the expensive math reward with AsyncRewardWrapper and offload the computation to a dedicated process pool. LLM calls, however, are natively async, so we should use native async rewards instead of wrapping them in AsyncRewardWrapper.

  2. I think one of the major concerns in the issues is how to add critic-like rewards, where we use a fine-tuned LLM that outputs a scalar value as the reward. That requires heavy GPU computation and cannot be incorporated into the workflow. Instead, reward computation should happen in the training loop, and the reward model should be offloaded immediately after use.
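The native-async approach in point 1 could be sketched as follows (hypothetical names throughout; the judge call is stubbed where a real implementation would await an async HTTP client such as `openai.AsyncOpenAI`):

```python
import asyncio

async def llm_judge_reward_async(
    prompt, completions, prompt_ids, completion_ids, **kwargs
) -> float:
    # Stub for a natively async judge call; a real version would await
    # an async API client here instead of sleeping.
    await asyncio.sleep(0)
    return 1.0 if completions.strip() else 0.0

async def score_batch(samples):
    # Native async rewards run concurrently on the event loop itself,
    # with no process pool or sync-to-async wrapper needed.
    return await asyncio.gather(
        *(llm_judge_reward_async(p, c, [], []) for p, c in samples)
    )
```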
