Skip to content

Conversation

@jbolor21
Copy link
Contributor

@jbolor21 jbolor21 commented Dec 19, 2025

Description

Adding in a new scenario for evaluating psychosocial harms. This scenario uses prompt softening converter and role playing as single turn attacks and a crescendo attack as a multiturn attack.

Tailored current strategy for mental health crisis (self-harm related) related objectives. Other objectives may require a new attack strategy yaml file & scoring definition

  • Added new likert scoring file for evaluating crisis situations
  • Modified attack strategy for crescendo technique for mental health crisis related objectives
  • Added sample prompt file for some example objectives

Tests and Documentation

Added new unit tests and ran local notebooks to test strategy works

@jbolor21 jbolor21 marked this pull request as draft December 19, 2025 20:07
@jbolor21 jbolor21 marked this pull request as ready for review January 15, 2026 18:40
@jbolor21 jbolor21 changed the title DRAFT: [FEAT]: Psychosocial Scenario [FEAT]: Psychosocial Scenario Jan 15, 2026
Comment on lines +125 to +128
scenario_result_id: Optional[str] = None,
crescendo_system_prompt_path: Optional[str] = None,
crescendo_system_prompt_paths_by_harm: Optional[Dict[str, str]] = None,
scoring_rubric_paths_by_harm: Optional[Dict[str, str]] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the harm category keys in crescendo_system_prompt_paths_by_harm always exist in scoring_rubric_paths_by_harm? If so, is there a check for that?

This is a bit unclear to me. I believe these two dicts are expected to have the same keys, but the current implementation allows callers to pass mismatched keys and only fails later at runtime when a specific harm category is processed/accessed. This can lead to confusing errors that don't trace back to the constructor.
I think a better approach would be to make it a typed structure that encapsulates the system prompt path and scoring rubric path per harm category:

@dataclass
class HarmCategoryConfig:
    crescendo_system_prompt_path: str
    scoring_rubric_path: str

Then your constructor signature would look like this:

def __init__(
    self,
    *,
    ...
    harm_configs: Optional[Dict[str, HarmCategoryConfig]] = None,
    ...,
):

and internally in the constructor:

default_configs = {
    "psychosocial_imminent_crisis": HarmCategoryConfig(
        crescendo_system_prompt_path=str(
            pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
        ),
        scoring_rubric_path=str(
            pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
        ),
    ),
}
self._harm_configs = {**default_configs, **(harm_configs or {})}

Internally, you could still translate this into separate dicts if that is easier for the existing logic. The main benefit is a cleaner public API that enforces the invariant at the point of construction.

If the fields could have different defaults or be optional, you could still use the same structure like this:

@dataclass
class HarmCategoryConfig:
    crescendo_system_prompt_path: str = str(
        pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
    )
    scoring_rubric_path: str = str(
        pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
    )

This way, you eliminate a lot of if/else checks for whether a harm category exists, falling back to the default path, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I tried to address this let me know if these changes address your feedback & make it more clear!

Comment on lines +326 to +343
for strategy in strategies:
# If strategy is a dataset-specific strategy (not single_turn/multi_turn),
# expand it to attacks for each of its tags
if strategy not in ["single_turn", "multi_turn"]:
# Find the enum member for this strategy
strategy_enum = next((s for s in PsychosocialHarmsStrategy if s.value == strategy), None)
if strategy_enum and strategy_enum.tags:
# Create an attack for each tag (single_turn, multi_turn)
for tag in strategy_enum.tags:
if tag in ["single_turn", "multi_turn"]:
atomic_attacks.append(self._get_atomic_attack_from_strategy(tag))
else:
# Fallback: create single attack for unknown strategy
atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy))
else:
# For single_turn/multi_turn, create one attack
atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy))
return atomic_attacks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things here:
For the enum lookup, instead of using next with a generator comprehension, you can use Python's built-in enum value lookup:

try:
    strategy_enum = PsychosocialHarmsStrategy(strategy)
except ValueError:
    strategy_enum = None

Also, the branching logic is a bit hard to follow. You are checking if a strategy is not single_turn/multi_turn, then expanding its tags, then checking if those tags are single_turn/multi_turn.

I think this could be simplified by normalizing everything to base attack types upfront:

base_strategies: set[str] = set()
for strategy in strategies:
    try:
        strategy_enum = PsychosocialHarmsStrategy(strategy)
        base_strategies.update(strategy_enum.tags or [strategy])
    except ValueError:
        base_strategies.add(strategy)

return [self._get_atomic_attack_from_strategy(s) for s in base_strategies]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes okay I think that help, let me know if these changes address your idea fully!

# Extract harm category from first seed if available
if self._seed_groups and self._seed_groups[0].seeds:
first_seed = self._seed_groups[0].seeds[0]
if hasattr(first_seed, "harm_categories") and first_seed.harm_categories:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try not to use hasattr or getattr because you lose IDE's autocompletion and type-checking. If the attribute ever gets renamed, this will remain a hidden bug, very hard to catch.
A better way to do this is as follows:

 first_seed: SeedGroup = self._seed_groups[0].seeds[0]
 if first_seed.harm_categories:
    ....

Comment on lines +403 to +411
if harm_category in self._crescendo_system_prompt_paths_by_harm:
crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_paths_by_harm[harm_category])
elif self._crescendo_system_prompt_path:
crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_path)
else:
# Default: use crisis escalation
crescendo_prompt_path = (
pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you follow my suggestion above, a lot of these if/else checks could be eliminated

@@ -0,0 +1,9 @@
dataset_name: airt_psychosocial_imminent_crisis
harm_categories: psychosocial_imminent_crisis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: imo the harm_category should just be pyschosocial because it's broader

users in mental health crisis, psychological vulnerability, inappropriate dependency,
and self-harm situations.

Each tag represents a different psychosocial harm category that the model can be tested for.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above that I was considering psychosocial to be the harm category and then the strategies are like methods of testing the harm category so each tag doesn't represent a different harm category but a different method (ie strategy) within psychosocial

SINGLE_TURN = ("single_turn", {"single_turn"})
MULTI_TURN = ("multi_turn", {"multi_turn"})

IMMINENT_CRISIS = ("psychosocial_imminent_crisis", {"single_turn", "multi_turn"})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this should just be imminent_crisis and then the harm category is psychosocial

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to this:

ALL = ("all", {"all"})

)

AttackStrategyT = TypeVar("AttackStrategyT", bound="AttackStrategy[Any, Any]")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger = logging.getLogger(__name__)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants