-
Notifications
You must be signed in to change notification settings - Fork 645
MAINT FIX: Moving Message metadata to AttackResult and attack score fix #1316
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
MAINT FIX: Moving Message metadata to AttackResult and attack score fix #1316
Conversation
| def _get_attack_result_metadata( | ||
| self, | ||
| *, | ||
| context: AttackStrategyContextT, | ||
| request_converters: Optional[list[Any]] = None, | ||
| ) -> dict[str, Any]: | ||
| """ | ||
| Build common metadata fields for AttackResult. | ||
| This helper method extracts metadata and consolidates it for per-attack storage. | ||
| Args: | ||
| context: The attack context containing memory labels and other state. | ||
| request_converters: Optional list of PromptConverterConfiguration objects | ||
| used in the attack. | ||
| Returns: | ||
| Dict: A dictionary containing attack_identifier, objective_target_identifier, | ||
| request_converter_identifiers, and labels that can be unpacked into | ||
| AttackResult constructor. | ||
| """ | ||
| request_converter_identifiers = None | ||
| if request_converters: | ||
| # request_converters is a list of PromptConverterConfiguration objects | ||
| # Each config has a 'converters' list of actual PromptConverter instances | ||
| all_converters = [] | ||
| for config in request_converters: | ||
| if hasattr(config, "converters"): | ||
| all_converters.extend(config.converters) | ||
| elif hasattr(config, "get_identifier"): | ||
| # Direct converter object | ||
| all_converters.append(config) | ||
| if all_converters: | ||
| request_converter_identifiers = [converter.get_identifier() for converter in all_converters] | ||
|
|
||
| return { | ||
| "attack_identifier": self.get_identifier(), | ||
| "objective_target_identifier": self.get_objective_target().get_identifier(), | ||
| "request_converter_identifiers": request_converter_identifiers, | ||
| "labels": context.memory_labels if context.memory_labels else None, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few concerns with this:
request_converters: Optional[list[Any]]is too loosely typed. When you useAny, it removes IDE autocompletion/type checking and makes it really unclear what callers should pass.- Looking at the usage in the code,
request_convertersonly acceptslist[PromptConverterConfiguration]orlist[PromptConverter]. We should make sure the signature explicitly uses a union (e.g.,Optional[list[PromptConverterConfiguration | PromptConverter]]). - Using
hasattrhere is really not ideal because it can fail silently if you ever rename the attribute. For example, ifPromptConverterConfiguration.converterswere renamed toconvs, this code would still type-check and run but would not behave correctly and would be a hidden bug in the code. If we useUnion, it would be safer to branch usingisinstance()and handle each type explicitly.
I would push back a bit on adding all the identifier fields directly to AttackResult and removing the dataclass. It bloats the class, loses the dataclass benefits (__repr__, __eq__, field defaults), and turns AttackResult into a catch-all bucket. I think a cleaner pattern is adding a single structured field that groups related identifiers:
@dataclass
class AttackLineage: # naming is my weakness
objective_target_identifier: dict[str, str]
request_converter_identifiers: Optional[list[dict[str, str]]] = None
response_converter_identifiers: Optional[list[dict[str, str]]] = None
objective_scorer_identifier: Optional[dict[str, str]] = None
adversarial_chat_target_identifier: Optional[dict[str, str]] = NoneThen AttackResult just gets one new field which is lineage: Optional[AttackLineage] = None. This keeps the dataclass, groups related tracking info logically, and is easy to extend later without ever touching AttackResult again.
Here's how the helper methods would look:
def _get_attack_lineage(
self,
*,
request_converters: Optional[list[PromptConverterConfiguration | PromptConverter]] = None,
response_converters: Optional[list[PromptConverterConfiguration | PromptConverter]] = None,
objective_scorer_identifier: Optional[dict[str, str]] = None,
adversarial_chat_target_identifier: Optional[dict[str, str]] = None,
) -> AttackLineage:
return AttackLineage(
objective_target_identifier=self.get_objective_target().get_identifier(),
request_converter_identifiers=self._extract_converter_identifiers(request_converters),
response_converter_identifiers=self._extract_converter_identifiers(response_converters),
objective_scorer_identifier=objective_scorer_identifier,
adversarial_chat_target_identifier=adversarial_chat_target_identifier,
)
@staticmethod
def _extract_converter_identifiers(
converters: Optional[list[PromptConverterConfiguration | PromptConverter]],
) -> Optional[list[dict[str, str]]]:
if not converters:
return None
all_converters: list[PromptConverter] = []
for item in converters:
if isinstance(item, PromptConverterConfiguration):
all_converters.extend(item.converters)
elif isinstance(item, PromptConverter):
all_converters.append(item)
if not all_converters:
return None
return [converter.get_identifier() for converter in all_converters]| @dataclass | ||
| class AttackResult(StrategyResult): | ||
| """Base class for all attack results.""" | ||
| """ | ||
| Base class for all attack results. | ||
| Contains identity information, scoring, metadata moved from per-message storage, | ||
| and methods to retrieve conversation history. | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| *, | ||
| conversation_id: str, | ||
| objective: str, | ||
| attack_identifier: dict[str, str], | ||
| targeted_harm_categories: Optional[List[str]] = None, | ||
| request_converter_identifiers: Optional[List[Dict[str, str]]] = None, | ||
| objective_target_identifier: Optional[Dict[str, str]] = None, | ||
| labels: Optional[Dict[str, str]] = None, | ||
| automated_objective_score: Optional[Score] = None, | ||
| human_objective_score: Optional[Score] = None, | ||
| auxiliary_score_ids: Optional[List[str]] = None, | ||
| executed_turns: int = 0, | ||
| execution_time_ms: int = 0, | ||
| outcome: Optional[AttackOutcome] = None, | ||
| outcome_reason: Optional[str] = None, | ||
| related_conversations: Optional[set[ConversationReference]] = None, | ||
| metadata: Optional[Dict[str, Any]] = None, | ||
| ) -> None: | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned above, we may be able to keep this as a @dataclass and still achieve what you are trying to do here
| labels=message_piece.labels, | ||
| prompt_target_identifier=message_piece.prompt_target_identifier, | ||
| attack_identifier=message_piece.attack_identifier, | ||
| attack_identifier=message_piece._attack_identifier, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why _attack_identifier instead of attack_identifier?
| category_set: set[str] = set() | ||
|
|
||
| for s in scores: | ||
| metadata = combine_dict(metadata, getattr(s, "score_metadata", None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to just do combined_dict(metadaya, s.score_metadata) ? getattr doesn't give you IDE support and if we ever rename the attribute, this will be a bug which is hidden
Moving metadata to AttackResult
This PR moves many fields that were part of
Messageinto theAttackResult. Includingtargeted_harm_categories,converter_identifiers,attack_identifier,scorer_identifier. The old usage still exists but is deprecated. These fields make a lot more sense as part of a conversation/attack and shouldn't be attached to every single message_piece.Deprecated get_prompt_scores in MemoryInterface
This makes less sense, because many of those fields are now
AttackResultfields. So before, the usage might be "get all scores that had this attackID". But now the usage should be "Get the AttackResults with this AttackID" and the scores are attached to those.Crescendo/TAP Scoring Bug Fix
There was also a pretty bad bug in TAP (and to a lesser extent, Crescendo). When we allowed
TrueFalseScorersto be passed in, we lost the float value, so TAP was unable to actually retrieve the "best" conversation (it was either 1 or 0). Also, Crescendo couldn't get that feedback, but it relies on it less.normalize_score_to_floatnow fixes this by adding the float value to metadata, and TAP now requires aFloatScaleThreshholdScorerso this value is set.Tests