From 010b971eebfb6e0b557cedc06977f180dcbaf4e0 Mon Sep 17 00:00:00 2001
From: openhands
Date: Sat, 14 Mar 2026 17:06:10 +0000
Subject: [PATCH 1/2] Add negative-bleedthrough obstacle

Documents how negative tokens bleed into context and can be
counterproductive when instructing LLMs.

Co-authored-by: openhands
---
 documents/obstacles/negative-bleedthrough.md | 29 ++++++++++++++++++++
 documents/relationships.mmd                  |  1 +
 2 files changed, 30 insertions(+)
 create mode 100644 documents/obstacles/negative-bleedthrough.md

diff --git a/documents/obstacles/negative-bleedthrough.md b/documents/obstacles/negative-bleedthrough.md
new file mode 100644
index 0000000..a0a7ea9
--- /dev/null
+++ b/documents/obstacles/negative-bleedthrough.md
@@ -0,0 +1,29 @@
+---
+authors: [juanmichelini]
+---
+
+# Negative Bleedthrough (Obstacle)
+
+## Description
+When you tell an LLM what *not* to do, you're activating the very tokens you want it to avoid. Negation words like "don't", "not", "never" are weak signals compared to the content words around them. The model processes "don't mention the moon" by first attending heavily to "moon" — and now the moon is in the room.
+
+This is well-documented in NLP research. Studies on negation handling in transformer models (e.g., [Kassner & Schütze, 2020](https://aclanthology.org/2020.acl-main.698/) — *Negated and Misprimed Probes for Pretrained Language Models*) show that LLMs struggle to distinguish negated statements from affirmative ones. The model's internal representations for "the moon is a planet" and "the moon is not a planet" are surprisingly similar.
+
+**Example: "List the traditional planets, but not the moon."**
+
+The model sees heavy token activation around "planets" and "moon." The negation "not" is a lightweight modifier that often loses the fight. You'll frequently get the moon in the list anyway.
+
+This isn't just a text problem. Vision models show the same behavior — ask for "a room with no elephants" and you'll likely get elephants. The underlying mechanism is the same: describing what to avoid activates representations of that thing.
+
+## Root Causes
+
+### Token activation doesn't respect negation
+Transformers build meaning by attending to content words. "Not" modifies the intent, but the attention still flows to whatever follows it. By the time the model is generating output, the activated concept is competing with the instruction to suppress it.
+
+### Training data reinforcement
+Most training examples of "X is not Y" still associate X and Y. The model learns co-occurrence patterns, not logical negation. So "planets, not the moon" reinforces the planet–moon association.
+
+## Impact
+- Negative instructions increase the chance of getting exactly what you asked to avoid
+- More negations in a prompt means more unwanted concepts activated in context
+- Workarounds like repeating "do NOT" or using caps don't fix the underlying mechanism — they just add more tokens that activate the concept
diff --git a/documents/relationships.mmd b/documents/relationships.mmd
index 1775dae..7cc3981 100644
--- a/documents/relationships.mmd
+++ b/documents/relationships.mmd
@@ -114,6 +114,7 @@ graph LR
     %% Obstacle → Obstacle relationships (related)
     obstacles/solution-fixation -->|related| obstacles/compliance-bias
     obstacles/selective-hearing -->|related| obstacles/context-rot
+    obstacles/negative-bleedthrough -->|related| obstacles/selective-hearing
 
     %% Obstacle → Anti-pattern relationships (related)
     obstacles/obedient-contractor -->|related| anti-patterns/silent-misalignment

From d787a038665ca8f423e9eb184c52e65942898c7a Mon Sep 17 00:00:00 2001
From: Juan Michelini
Date: Sat, 14 Mar 2026 14:28:38 -0300
Subject: [PATCH 2/2] Update negative-bleedthrough.md

---
 documents/obstacles/negative-bleedthrough.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/documents/obstacles/negative-bleedthrough.md b/documents/obstacles/negative-bleedthrough.md
index a0a7ea9..f78011e 100644
--- a/documents/obstacles/negative-bleedthrough.md
+++ b/documents/obstacles/negative-bleedthrough.md
@@ -5,15 +5,15 @@ authors: [juanmichelini]
 # Negative Bleedthrough (Obstacle)
 
 ## Description
-When you tell an LLM what *not* to do, you're activating the very tokens you want it to avoid. Negation words like "don't", "not", "never" are weak signals compared to the content words around them. The model processes "don't mention the moon" by first attending heavily to "moon" — and now the moon is in the room.
+When you tell an LLM what *not* to do, you're activating the very tokens you want it to avoid. Negation words like "don't", "not", "never" are weak signals compared to the content words around them. The model processes "don't mention the moon" by first attending heavily to "moon", and now the moon is in the room.
 
-This is well-documented in NLP research. Studies on negation handling in transformer models (e.g., [Kassner & Schütze, 2020](https://aclanthology.org/2020.acl-main.698/) — *Negated and Misprimed Probes for Pretrained Language Models*) show that LLMs struggle to distinguish negated statements from affirmative ones. The model's internal representations for "the moon is a planet" and "the moon is not a planet" are surprisingly similar.
+This is well-documented in NLP research. Studies on negation handling in transformer models (e.g., [Kassner & Schütze, 2020](https://aclanthology.org/2020.acl-main.698/), *Negated and Misprimed Probes for Pretrained Language Models*) show that LLMs struggle to distinguish negated statements from affirmative ones. The model's internal representations for "the moon is a planet" and "the moon is not a planet" are surprisingly similar.
 
 **Example: "List the traditional planets, but not the moon."**
 
 The model sees heavy token activation around "planets" and "moon." The negation "not" is a lightweight modifier that often loses the fight. You'll frequently get the moon in the list anyway.
 
-This isn't just a text problem. Vision models show the same behavior — ask for "a room with no elephants" and you'll likely get elephants. The underlying mechanism is the same: describing what to avoid activates representations of that thing.
+This isn't just a text problem. Vision models show the same behavior: ask for "a room with no elephants" and you'll likely get elephants. The underlying mechanism is the same: describing what to avoid activates representations of that thing.
 
 ## Root Causes
 
@@ -26,4 +26,4 @@ Most training examples of "X is not Y" still associate X and Y. The model learns
 ## Impact
 - Negative instructions increase the chance of getting exactly what you asked to avoid
 - More negations in a prompt means more unwanted concepts activated in context
-- Workarounds like repeating "do NOT" or using caps don't fix the underlying mechanism — they just add more tokens that activate the concept
+- Workarounds like repeating "do NOT" or using caps don't fix the underlying mechanism; they just add more tokens that activate the concept
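
The rephrasing the patch's Impact section points toward (state what you want, not what to avoid) can be sketched as below. This is a minimal illustration, not part of the patch; the helper names and prompt strings are hypothetical:

```python
# Hypothetical sketch: rephrase a negative instruction as an allowlist,
# so the unwanted concept's tokens never enter the context at all.

PLANETS = ["Mercury", "Venus", "Earth", "Mars",
           "Jupiter", "Saturn", "Uranus", "Neptune"]

def negative_prompt() -> str:
    # Anti-pattern from the document: "moon" is now an active token.
    return "List the traditional planets, but not the moon."

def positive_prompt(allowed: list[str]) -> str:
    # Allowlist framing: only the desired tokens appear in the prompt.
    return "List exactly these planets, nothing else: " + ", ".join(allowed)

assert "moon" in negative_prompt()
assert "moon" not in positive_prompt(PLANETS).lower()
```

The assertions make the mechanical point: the negative phrasing necessarily carries the unwanted token into context, while the allowlist phrasing never mentions it.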