karafka · mensfeld · Jun 28, 2026 · Jun 28, 2026
diff --git a/Changelog/Karafka-Core.md b/Changelog/Karafka-Core.md
@@ -3,6 +3,9 @@
 
 # Karafka Core Changelog
 
+## 2.6.2 (Unreleased)
+- [Fix] Make assigning a setting on a frozen `Configurable::Node` atomic. The ivar-backed writer evaluated `@configs_refs[name] = value` before `instance_variable_set`, so a frozen node mutated the canonical store and only then raised `FrozenError`, leaving the store and the ivar-backed reader permanently out of sync. It now raises before touching any state.
+
 ## 2.6.1 (2026-06-15)
 - [Enhancement] Speed up `Contract#call` by ~1.25x for minimal and ~1.4x for fully populated data: resolve rule paths with a single `Hash#fetch` per level instead of `key?` + `[]`, inline the per-rule type dispatch into the rules loop, and compare the dig sentinel via `#equal?` so `#==` is never dispatched to the validated (user-provided) values. This is the per-message validation path in WaterDrop producers.
 - [Fix] `Contract#call` with rule paths of 3+ keys no longer raises `NoMethodError` when an intermediate value is not a `Hash` and reports the path as missing instead, consistent with the 2-key path behavior.

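The ordering problem described in the 2.6.2 entry can be illustrated with a minimal sketch. `SketchNode` and its method names are hypothetical and are not karafka-core's actual code; only the order of `@configs_refs[name] = value` before `instance_variable_set` is taken from the changelog entry. The sketch relies on `freeze` being shallow in Ruby: the node is frozen, but the hash it references is not.

```ruby
# Hypothetical stand-in for a configurable node (not karafka-core's real class).
class SketchNode
  def initialize
    @configs_refs = {}
  end

  # Order described as buggy: the hash store is mutated first, and only the
  # later ivar write raises FrozenError on a frozen node.
  def assign_unsafe(name, value)
    @configs_refs[name] = value
    instance_variable_set(:"@#{name}", value)
  end

  # Atomic variant: refuse before touching any state.
  def assign_atomic(name, value)
    raise FrozenError, "can't modify frozen #{self.class}" if frozen?

    @configs_refs[name] = value
    instance_variable_set(:"@#{name}", value)
  end

  # Value held in the canonical store.
  def stored(name)
    @configs_refs[name]
  end

  # Value seen by the ivar-backed reader.
  def read(name)
    instance_variable_get(:"@#{name}")
  end
end

node = SketchNode.new
node.assign_unsafe(:timeout, 1)
node.freeze

begin
  node.assign_unsafe(:timeout, 2)
rescue FrozenError
  # The store now holds 2 while the reader still returns 1.
end
```

With `assign_atomic`, the same write on a frozen node raises before the store changes, so `stored` and `read` stay in agreement.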
diff --git a/Changelog/Karafka.md b/Changelog/Karafka.md
@@ -27,6 +27,7 @@
 - [Enhancement] Contain non-`StandardError` errors (`ScriptError`, `SystemStackError`, etc.) raised in the consumption flow (both the consume and the after-consume phases, where user-extensible DLQ code runs) the same way as standard processing errors: the failure is recorded and the regular retry/pause/DLQ flow engages, instead of the batch being skipped with offsets later committed past it. Process-critical errors (`SystemExit`, `SignalException`, `NoMemoryError`) are not retried in-process and are never dispatched to the DLQ: the failure is recorded, the partition stays paused and a graceful shutdown is initiated by the auto-subscribed `Instrumentation::CriticalErrorsListener` watching the `error.occurred` bus - so critical errors escalate regardless of where in the framework they surface (also from quiet mode) and the failed batch is redelivered after a restart. The critical errors list is configurable via `config.internal.processing.critical_errors` for user-specific fatal error classes.
 - [Enhancement] Bump swarm supervisor `SHUTDOWN_GRACE_PERIOD` from 1s to 15s to give forked nodes enough time to finish post-`shutdown_timeout` cleanup (at_exit handlers, librdkafka finalization, connection pool close) before the supervisor forcefully terminates them, especially on CI where `sleep` granularity and `waitpid` cost stretch each supervision loop iteration.
 - [Enhancement] Add `stability_ttl` to `Kubernetes::LivenessListener` and `Pro::Swarm::LivenessListener` to detect consumers frozen in a single non-`"steady"` librdkafka `cgrp.join_state` (e.g. a broker stuck in `CompletingRebalance` due to bugs like KAFKA-19862). Brief non-steady periods are normal during rebalances: the timer resets on every join_state transition, so only a group frozen in the same non-steady state continuously for longer than the threshold is flagged. The pre-join `init` and mid-join `wait-metadata` states are excluded and clear prior tracking; genuine freezes there are covered by `polling_ttl`. On breach, `stability_ttl_exceeded: true` is reported in the Kubernetes HTTP body (HTTP 500) and status code `4` by the swarm listener; the flag clears automatically once the group returns to `"steady"`. **Requires `statistics.interval.ms` to be set** - without it this check has no effect. Defaults to twice the maximum `max.poll.interval.ms` across all active subscription groups (root `kafka` config as fallback), derived lazily on first health evaluation. When a connection listener finishes its fetch loop (downscale or shutdown), its subscription group's tracking is cleared to prevent false positives.
+- [Fix] `Pro::Iterator#each` reset its EOF/stop tracking (`@stopped_partitions`, `@stopped`) only after the consumer block, so a `break` out of the yielded block (a documented way to stop) skipped the reset. Reusing the same iterator afterwards then saw `done?` immediately true and iterated nothing - silently dropping a fresh backlog until a third run recovered. The reset now runs in an `ensure`, so reuse is consistent regardless of how the previous run ended (normal, `#stop`, or `break`) (Pro).
 - [Fix] Virtual Partitions `VirtualOffsetManager#markable` raised `KeyError` under the `:exact` offset-metadata strategy when the lowest-offset virtual-partition group in a batch was left unmarked: the materialized real offset falls back to `min - 1`, an offset that was never registered, so `@offsets_metadata.fetch` missed. It now falls back to the current offset metadata instead of raising (which had forced a coordinator failure, collapse and full-batch reprocessing) (Pro).
 - [Fix] Encryption crashed on tombstone records (a key with a `nil` payload): producing one ran `cipher.encrypt(nil)` in the encryption middleware, and consuming one carrying an `encryption` header ran `cipher.decrypt(nil)` - both raising instead of passing the tombstone through. Tombstones are now left untouched on produce and consume, so they remain valid tombstones (for example for log compaction) (Pro).
 - [Fix] `Pro::Processing::JobsQueue#clear` (run per subscription group on listener recovery) disabled the async-locking fast path globally, so another subscription group's active `lock_async` locks were ignored until the next `lock_async` call - a listener parked by an advanced scheduler could resume polling early, and `empty?` could report a still-locked group as empty during shutdown. The fast-path flag is now recomputed from the remaining groups' locks instead of being forced off (Pro).
@@ -52,6 +53,7 @@
 - [Maintenance] Use namespaced topic naming format in all integration specs for consistent traceability.
 - [Maintenance] Add `bin/tests_topics_hashes` script for looking up spec files by their topic name hash prefix.
 - [Maintenance] Fix flaky `pro/admin/recovery/coordinator_for_spec.rb` by warming up the `__consumer_offsets` internal topic with a produce + `seek_consumer_group` and retrying the first `Recovery.coordinator_for` call with exponential backoff on `MetadataError`.
+- [Refactor] Remove the unused `@mutex` from `Pro::Connection::Manager`. It was left dead after #1851 replaced lock-based `@changes` eviction with key preloading; the cross-thread `@changes` access (statistics callback on listener threads vs. `touch`/`stable?` on the Runner thread) is intentionally lock-free - only existing keys are mutated, so there is no insert-during-iteration, and the remaining field-level interleavings are benign and self-correcting. The threading model is now documented so the absence of a lock is explicit (Pro).
 - [Change] Require `karafka-rdkafka` `>=` `0.27.1` to pick up the fix for `poll_batch` and `poll_batch_nb` raising `RdkafkaError` with only the integer error code, which discarded topic/partition/offset context from `e.details` and caused `partition_eof` handling to call `@buffer.eof(nil, nil)`, resulting in a `TopicNotFoundError` crash.
 
 ## 2.5.9 (2026-03-30)

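The `Pro::Iterator#each` entry above hinges on how `break` interacts with code placed after a `yield`. The following is a minimal sketch of that behavior, not the Pro iterator itself: `SketchIterator`, `each_leaky`, `each_safe` and the single `@stopped` flag are hypothetical simplifications, and the trailing `nil` yield only imitates the `yield_nil` EOF marker mentioned in the spec description.

```ruby
# Hypothetical simplification of an iterator that tracks an EOF/stop flag.
class SketchIterator
  def initialize(backlog)
    @backlog = backlog
    @stopped = false
  end

  # Reset placed after the yielding code: a `break` in the caller's block
  # unwinds this method immediately, so the reset line never runs.
  def each_leaky
    run { |item| yield item }
    reset
  end

  # Reset placed in `ensure`: it runs on normal completion and on `break`.
  def each_safe
    run { |item| yield item }
  ensure
    reset
  end

  private

  def run
    return if @stopped # stale state makes the whole run a no-op

    @backlog.each { |item| yield item }
    @stopped = true
    yield nil # EOF marker, loosely imitating `yield_nil`
  end

  def reset
    @stopped = false
  end
end

iterator = SketchIterator.new([1, 2, 3])
first = []
iterator.each_leaky { |m| break if m.nil?; first << m }
second = []
iterator.each_leaky { |m| break if m.nil?; second << m }
# first holds the backlog, second is empty because the reset was skipped.
```

Swapping both calls to `each_safe` makes the second pass stream the full backlog again, which is the behavior the changelog entry describes after the fix.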
diff --git a/Development/Karafka-Integration-Tests-Catalog.md b/Development/Karafka-Integration-Tests-Catalog.md
@@ -1161,6 +1161,7 @@
 | `pro/iterator/negative_multiple_topics_lookups_spec.rb` | The author retains all right, title, and interest in this software, including all copyrights, patents, and other intellectual property rights. No patent rights are granted under this license. - Reverse engineering, decompilation, or disassembly of this software Receipt, viewing, or possession of this software does not convey or imply any license or right beyond those expressly stated above. We should be able to subscribe to multiple topics with custom per topic negative lookups and they should work on all partitions |
 | `pro/iterator/negative_topic_multiple_partitions_spec.rb` | We should be able to use different negative offsets for different partitions of the same topic |
 | `pro/iterator/repeated_eof_on_one_partition_spec.rb` | When one partition reaches EOF more than once (because new data arrived after its first EOF and was consumed down to a second EOF), the iterator must not treat the repeated EOF of the same partition as if another partition finished. Iteration should continue until every partition has actually reached its end and all messages from all partitions were yielded. This guards against a bug where EOF events were counted globally instead of per partition: a single live partition receiving new data could exhaust the EOF budget for the whole subscription and silently terminate iteration while other partitions still had a backlog. |
+| `pro/iterator/reuse_after_break_spec.rb` | F34: `Pro::Iterator#each` reset `@stopped_partitions`/`@stopped` only after the `with_consumer` block, so a `break` out of the yielded block (a documented way to stop) skipped the reset. The accumulated EOF state then leaked into the next `#each`: `done?` was immediately true and the whole iteration was a silent no-op, dropping a fresh backlog. Reproduced by iterating to EOF, breaking, then reusing the SAME iterator. With `yield_nil` the drained (single-partition) iterator yields a nil once it reaches EOF - i.e. once `@stopped_partitions` is full - giving us a point to `break` on while the state is full. The second pass over the same iterator must still re-stream the whole backlog. |
 | `pro/iterator/stop_start_with_marking_spec.rb` | We should be able with proper settings |
 | `pro/iterator/with_custom_group_id_spec.rb` | The iterator should use the group.id specified in the settings hash when provided. By default, the iterator uses its own admin group ID (not the routing-defined consumer group), but this can be overridden via the settings hash. This is useful when you want to: - Track iterator progress for resuming later - Use a specific consumer group for offset management - Share offset state between multiple iterator instances |
 | `pro/iterator/with_granular_cleaning_spec.rb` | We should be able to clean messages while keeping the metadata |