-
Notifications
You must be signed in to change notification settings - Fork 645
[FEAT]: Psychosocial Scenario #1266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…lity.prompt delete unused file
delete unused file
| scenario_result_id: Optional[str] = None, | ||
| crescendo_system_prompt_path: Optional[str] = None, | ||
| crescendo_system_prompt_paths_by_harm: Optional[Dict[str, str]] = None, | ||
| scoring_rubric_paths_by_harm: Optional[Dict[str, str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the harm category keys in crescendo_system_prompt_paths_by_harm always exist in scoring_rubric_paths_by_harm? If so, is there a check for that?
This is a bit unclear to me. I believe these two dicts are expected to have the same keys, but the current implementation allows callers to pass mismatched keys and only fails later at runtime when a specific harm category is processed/accessed. This can lead to confusing errors that don't trace back to the constructor.
I think a better approach would be to make it a typed structure that encapsulates the system prompt path and scoring rubric path per harm category:
@dataclass
class HarmCategoryConfig:
crescendo_system_prompt_path: str
scoring_rubric_path: strThen your constructor signature would look like this:
def __init__(
self,
*,
...
harm_configs: Optional[Dict[str, HarmCategoryConfig]] = None,
...,
):and internally in the constructor:
default_configs = {
"psychosocial_imminent_crisis": HarmCategoryConfig(
crescendo_system_prompt_path=str(
pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
),
scoring_rubric_path=str(
pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
),
),
}
self._harm_configs = {**default_configs, **(harm_configs or {})}Internally, you could still translate this into separate dicts if that is easier for the existing logic. The main benefit is a cleaner public API that enforces the invariant at the point of construction.
If the fields could have different defaults or be optional, you could still use the same structure like this:
@dataclass
class HarmCategoryConfig:
crescendo_system_prompt_path: str = str(
pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml"
)
scoring_rubric_path: str = str(
pathlib.Path(DATASETS_PATH) / "score" / "likert" / "crisis_management.yaml"
)This way, you eliminate a lot of if/else checks for whether a harm category exists, falling back to the default path, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I tried to address this let me know if these changes address your feedback & make it more clear!
| for strategy in strategies: | ||
| # If strategy is a dataset-specific strategy (not single_turn/multi_turn), | ||
| # expand it to attacks for each of its tags | ||
| if strategy not in ["single_turn", "multi_turn"]: | ||
| # Find the enum member for this strategy | ||
| strategy_enum = next((s for s in PsychosocialHarmsStrategy if s.value == strategy), None) | ||
| if strategy_enum and strategy_enum.tags: | ||
| # Create an attack for each tag (single_turn, multi_turn) | ||
| for tag in strategy_enum.tags: | ||
| if tag in ["single_turn", "multi_turn"]: | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(tag)) | ||
| else: | ||
| # Fallback: create single attack for unknown strategy | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy)) | ||
| else: | ||
| # For single_turn/multi_turn, create one attack | ||
| atomic_attacks.append(self._get_atomic_attack_from_strategy(strategy)) | ||
| return atomic_attacks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things here:
For the enum lookup, instead of using next with a generator comprehension, you can use Python's built-in enum value lookup:
try:
strategy_enum = PsychosocialHarmsStrategy(strategy)
except ValueError:
strategy_enum = NoneAlso, the branching logic is a bit hard to follow. You are checking if a strategy is not single_turn/multi_turn, then expanding its tags, then checking if those tags are single_turn/multi_turn.
I think this could be simplified by normalizing everything to base attack types upfront:
base_strategies: set[str] = set()
for strategy in strategies:
try:
strategy_enum = PsychosocialHarmsStrategy(strategy)
base_strategies.update(strategy_enum.tags or [strategy])
except ValueError:
base_strategies.add(strategy)
return [self._get_atomic_attack_from_strategy(s) for s in base_strategies]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes okay I think that help, let me know if these changes address your idea fully!
| # Extract harm category from first seed if available | ||
| if self._seed_groups and self._seed_groups[0].seeds: | ||
| first_seed = self._seed_groups[0].seeds[0] | ||
| if hasattr(first_seed, "harm_categories") and first_seed.harm_categories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try not to use hasattr or getattr because you lose IDE's autocompletion and type-checking. If the attribute ever gets renamed, this will remain a hidden bug, very hard to catch.
A better way to do this is as follows:
first_seed: SeedGroup = self._seed_groups[0].seeds[0]
if first_seed.harm_categories:
....| if harm_category in self._crescendo_system_prompt_paths_by_harm: | ||
| crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_paths_by_harm[harm_category]) | ||
| elif self._crescendo_system_prompt_path: | ||
| crescendo_prompt_path = pathlib.Path(self._crescendo_system_prompt_path) | ||
| else: | ||
| # Default: use crisis escalation | ||
| crescendo_prompt_path = ( | ||
| pathlib.Path(DATASETS_PATH) / "executors" / "crescendo" / "escalation_crisis.yaml" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you follow my suggestion above, a lot of these if/else checks could be eliminated
| @@ -0,0 +1,9 @@ | |||
| dataset_name: airt_psychosocial_imminent_crisis | |||
| harm_categories: psychosocial_imminent_crisis | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: imo the harm_category should just be pyschosocial because it's broader
| users in mental health crisis, psychological vulnerability, inappropriate dependency, | ||
| and self-harm situations. | ||
|
|
||
| Each tag represents a different psychosocial harm category that the model can be tested for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above that I was considering psychosocial to be the harm category and then the strategies are like methods of testing the harm category so each tag doesn't represent a different harm category but a different method (ie strategy) within psychosocial
| SINGLE_TURN = ("single_turn", {"single_turn"}) | ||
| MULTI_TURN = ("multi_turn", {"multi_turn"}) | ||
|
|
||
| IMMINENT_CRISIS = ("psychosocial_imminent_crisis", {"single_turn", "multi_turn"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this should just be imminent_crisis and then the harm category is psychosocial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to this:
| ALL = ("all", {"all"}) |
| ) | ||
|
|
||
| AttackStrategyT = TypeVar("AttackStrategyT", bound="AttackStrategy[Any, Any]") | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logger = logging.getLogger(__name__) |
Description
Adding in a new scenario for evaluating psychosocial harms. This scenario uses prompt softening converter and role playing as single turn attacks and a crescendo attack as a multiturn attack.
Tailored current strategy for mental health crisis (self-harm related) related objectives. Other objectives may require a new attack strategy yaml file & scoring definition
Tests and Documentation
Added new unit tests and ran local notebooks to test strategy works