
Architecture Refactoring Plan Documentation#535

Merged
danieldotnl merged 5 commits into master from docs/architecture-refactoring-plan
Mar 5, 2026

Conversation

@danieldotnl (Owner) commented Nov 17, 2025

Summary

This PR adds comprehensive documentation for refactoring the ha-multiscrape architecture to address complexity issues identified in the codebase.

What's Included

A detailed refactoring plan covering 6 priority architectural fixes:

  1. Unify HTTP Handling - Replace dual HTTP wrapper instances with a single HttpSession class that properly manages cookies, authentication, and session state

  2. Simplify Variable System - Replace the confusing mix of form variables, template variables, and selector variables with a structured ScrapeContext system

  3. Refactor Scraper Class - Split the monolithic 197-line Scraper class into focused components using the Strategy pattern (Parser, Extractor, Renderer)

  4. Fix Notification Flow - Eliminate backwards dependency flow by using session validation or event-based architecture instead of notify_scrape_exception() chains

  5. Replace Index-Based Discovery - Replace fragile array indices with a type-safe ScraperRegistry using unique IDs

  6. Extract Strategy Pattern for Parsers - Enable native JSON API support alongside HTML scraping using pluggable parser strategies
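As a rough sketch of what fix #2's structured context could look like (ScrapeContext is named in the plan; the exact fields and namespacing below are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class ScrapeContext:
    """Namespaced variables for template rendering (illustrative fields)."""
    form_variables: dict = field(default_factory=dict)    # extracted by form submit
    sensor_variables: dict = field(default_factory=dict)  # per-sensor values
    user_variables: dict = field(default_factory=dict)    # from configuration

    def as_template_vars(self) -> dict:
        # Namespacing makes it obvious where each value came from,
        # instead of merging everything into one ambiguous dict
        return {
            "form": dict(self.form_variables),
            "sensor": dict(self.sensor_variables),
            "user": dict(self.user_variables),
        }

ctx = ScrapeContext(form_variables={"token": "abc"}, user_variables={"city": "Oslo"})
ctx.as_template_vars()["form"]["token"]  # → 'abc'
```

A template would then reference `form.token` rather than guessing which source a bare `token` came from.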

Each Section Includes

  • Current Problem: Detailed analysis with code examples and file references
  • Proposed Solution: Complete implementation with code examples
  • Benefits: Why this improves the architecture
  • Migration Path: Step-by-step transition plan

Additional Content

  • Implementation Timeline: 8-week phased approach
  • Testing Strategy: Unit, integration, and compatibility tests
  • Success Metrics: Measurable improvements
  • Breaking Changes: Analysis (spoiler: no config changes!)
  • Questions & Decisions: Key architectural choices

Why This Matters

The current architecture works but has accumulated complexity that makes it:

  • Hard to understand the request flow (especially with form submit)
  • Difficult to add new features (like JSON API support)
  • Challenging to debug (backwards notification flow, mixed variables)
  • Fragile (index-based discovery can break on reload)

This refactoring plan provides a roadmap to make the codebase more maintainable, testable, and extensible while maintaining backwards compatibility.

Next Steps

This is documentation only - no code changes yet. The plan serves as:

  1. Reference for future refactoring work
  2. Discussion document for architectural decisions
  3. Guide for contributors who want to help

Feedback welcome!

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Architecture Refactor Plan describing centralized per-instance HTTP/session handling, session-based authentication and state, a simplified request flow, modular HTML/JSON parsing and extraction, registry-based platform discovery via unique IDs, optional event-driven lifecycle hooks, extended selector support, and migration, testing, and rollout guidance.

Document six priority fixes to address architectural complexity:

1. Unify HTTP handling - Replace dual HTTP wrappers with single HttpSession
2. Simplify variable system - Structured context instead of mixed dicts
3. Refactor Scraper - Split into Parser/Extractor/Renderer components
4. Fix notification flow - Event-based or session validation instead of backwards calls
5. Replace index-based discovery - Type-safe registry with unique IDs
6. Extract strategy pattern - Support HTML and JSON parsing natively

Includes detailed implementation plans, code examples, migration paths,
and timeline for each fix. Total estimated effort: 6-8 weeks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot, Contributor) commented Nov 17, 2025


📥 Commits

Reviewing files that changed from the base of the PR and between 2a1b30f and 607c811.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)

Walkthrough

Centralizes HTTP into a new HttpSession, introduces typed AuthContext/ScrapeContext and VariableRenderer, refactors scrapers with a ParserFactory, adds parser/extractor abstractions (HTML/JSON), and replaces index-based discovery with a ScraperRegistry and registry-driven platform loading.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Architecture doc (docs/ARCHITECTURE_REFACTORING.md) | Adds the comprehensive refactor plan: new components, wiring, migration path, testing strategy, and deprecation guidance. |
| HTTP & session (http_session.py, ContentRequestManager, .../FormSubmitter; deprecation notes in doc) | Introduces HttpSession and HttpConfig/FormAuthConfig; centralizes HTTP client, cookie and auth state, and request flow through a session-managed client. |
| Auth & context (context.py: AuthContext, ScrapeContext, VariableRenderer) | Adds typed contexts and a VariableRenderer for namespaced template variables and render-time data separation. |
| Scraper core (scraper.py, coordinator.py, .../__init__.py) | Refactors Scraper into a modular design, integrates ParserFactory, and replaces index loaders with get_scraper_for_platform / registry-driven entrypoints. |
| Parsers & extractors (parsers.py, extractors.py: ParserFactory, HtmlParser, JsonParser, HtmlContent, JsonContent, ValueExtractor) | Adds parser abstractions for HTML/JSON, ParsedContent types, JSONPath-like selectors, and a value extractor subsystem. |
| Registry & discovery (registry.py: ScraperRegistry, ScraperInstance, PlatformConfig) | Implements registry-based discovery with unique IDs, lifecycle management, safer reload semantics, and platform config wiring. |
| Events & observability (events.py: ScraperEvent, EventBus) | Adds optional event classes and an EventBus for decoupled lifecycle/error notifications and session-validation hooks. |
| Public API surface (__init__.py, docs, public exports) | Changes public surfaces: promotes HttpSession, exposes registry lookup by unique_id, and marks legacy wrappers/forms for deprecation; updates exported types and signatures. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Coordinator
    participant Registry
    participant Scraper as ScraperInstance
    participant Http as HttpSession
    participant Parser as ParserFactory

    Note over Coordinator,Registry: Load platform by unique_id (registry-driven)
    Coordinator->>Registry: get_scraper(unique_id)
    Registry-->>Coordinator: ScraperInstance
    Coordinator->>Scraper: start_scrape(request, scrape_ctx)
    Scraper->>Http: request(url, auth_ctx)
    Http-->>Scraper: Response (cookies/state updated)
    Scraper->>Parser: parse(response)
    Parser-->>Scraper: ParsedContent
    Scraper->>Scraper: extract values (VariableRenderer / ValueExtractor)
    Scraper-->>Coordinator: scrape_result
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas needing focused review:
    • HttpSession concurrency, auth lifecycle, cookie handling and HttpConfig defaults.
    • ScraperRegistry unique_id mapping, reload semantics, and lifecycle hooks.
    • ParserFactory selection logic, HTML/JSON parsing boundaries, and JSONPath-like selector correctness.
    • Public API changes and deprecation notes to ensure backward-compatibility and migration clarity.

Poem

🐰 I hopped through bytes and tidy threads,
Sessions cozy where HTTP beds.
Parsers sniff JSON, selectors trace,
Registry shelves each scraper's place.
Nibbles of refactor — a carrot chase! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped; CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: a comprehensive architecture refactoring plan document. It is concise, clear, and directly related to the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai (bot) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

344-344: Refine wording for conciseness.

Consider replacing "mixed together without clear boundaries" with "commingled without clear boundaries" or simply "mixed without clear separation" for conciseness.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcff870 and 5bda929.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~344-~344: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

51-51: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


988-988: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
docs/ARCHITECTURE_REFACTORING.md (2)

1728-1763: Validate JSONPath implementation covers intended use cases.

The basic JSONPath implementation (lines 1728–1763) handles dot notation and integer indices, but complex expressions (e.g., filters, wildcards) are not supported. Before implementation, verify this covers the expected JSON API queries from users, or plan to integrate jsonpath-ng as the optional migration mentioned at line 1938.
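A dot-notation resolver of the kind described might look like this (a sketch only; names and behavior are illustrative, and production queries would likely go through jsonpath-ng as noted):

```python
def resolve_path(data, path: str):
    """Resolve a dot-notation path with integer indices, e.g. 'items.0.price'.

    Mirrors the scope described above: no filters, no wildcards.
    """
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric segment indexes into a list
        else:
            current = current[part]       # other segments index into a dict
    return current

doc = {"items": [{"price": 9.99}, {"price": 4.50}]}
resolve_path(doc, "items.1.price")  # → 4.5
```

Filters like `$.items[?(@.price < 5)]` are exactly what this cannot express, which is the scope question the comment raises.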


1-2072: Excellent comprehensive refactoring plan with sound architectural decisions.

This Architecture Refactoring Plan is well-structured and thoughtfully designed. Each of the six priority fixes addresses specific architectural problems with clear problem analysis, proposed solutions with working code examples, realistic benefits, and practical migration paths. The phased 8-week timeline is realistic, success metrics are measurable, and the decision to defer breaking changes to configuration shows good judgment.

The proposed patterns (Strategy for parsers, Registry for discovery, Context objects for variables, Session validation for authentication) are architecturally sound and follow established best practices.

However, address the markdown linting issues (missing language specifications on code blocks) before merging, and consider validating the JSONPath implementation scope.

Comment on lines +26 to +40
```python
# In __init__.py
if form_submit_config:
    form_http = create_http_wrapper(config_name, form_submit_config, hass, file_manager)
    form_submitter = create_form_submitter(
        config_name,
        form_submit_config,
        hass,
        form_http,  # First HTTP wrapper
        file_manager,
        parser,
    )

http = create_http_wrapper(config_name, conf, hass, file_manager)  # Second HTTP wrapper
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to fenced code blocks.

Code blocks at lines 26–40 and 51–76 lack language identifiers in their markdown fences. Lines 51–76 should specify `text` for the ASCII diagram; lines 26–40 should specify `python`.

Update the fenced code block opening from ``` to ```python (or ```text for the diagram).

Also applies to: 51-76

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 26–40 and 51–76, the fenced
code blocks are missing language identifiers; update the opening fence on lines
26–40 to use ```python for the Python snippet and update the opening fence on
lines 51–76 to use ```text for the ASCII diagram so that syntax
highlighting/rendering is applied correctly.

Comment on lines +353 to +371
```python
# In entity.py:145 - Mixing form variables into scraping
attr_value = self.scraper.scrape(
    attr_selector,
    self._name,
    name,
    variables=self.coordinator.form_variables  # ← Form variables passed everywhere
)

# In coordinator.py:233 - Property chain
@property
def form_variables(self):
    return self._request_manager.form_variables  # ← Law of Demeter violation

# In selector.py - Templates receive mixed variable context
value = selector.value_template.async_render(
    variables=variables,  # ← What's in here? Form vars? Template vars? Both?
    parse_result=True
)
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to additional code blocks.

Multiple code blocks lack language identifiers in markdown fences throughout Priority Fix #2 (lines 353–371, 385–453, 457–521, 525–566, 570–595). Add python language specifiers to each.

Also applies to: 385-453, 457-521, 525-566, 570-595

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 353-371 (and additionally
385-453, 457-521, 525-566, 570-595), several fenced code blocks are missing
language specifiers; update each Markdown code fence in those ranges to include
the python language tag (```python) so they render with correct syntax
highlighting — locate every triple-backtick fence in the listed line ranges and
change the opening fence to ```python, leaving the block contents unchanged.

Comment on lines +638 to +667
```python
def scrape(self, selector, sensor, attribute=None, variables: dict = {}):
    """60+ line method doing everything."""

    # Template rendering (should be in VariableRenderer)
    if selector.just_value:
        result = selector.value_template.async_render_with_possible_json_value(...)
        return selector.value_template._parse_result(result)

    # Content type detection (should be in Parser)
    content_stripped = self._data.lstrip() if self._data else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        raise ValueError("JSON cannot be scraped...")

    # CSS selection (should be in Selector)
    if selector.is_list:
        tags = self._soup.select(selector.list)
    else:
        tag = self._soup.select_one(selector.element)

    # Value extraction (should be in Extractor)
    if selector.attribute is not None:
        value = tag[selector.attribute]
    else:
        value = self.extract_tag_value(tag, selector)

    # Template rendering again (should be in VariableRenderer)
    if value is not None and selector.value_template is not None:
        value = selector.value_template.async_render(...)
```
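The decomposition the plan proposes (Parser, Extractor, Renderer composed by a thin Scraper) could be sketched like this. The interfaces and the trivial stand-in implementations are assumptions for illustration, not the plan's final API:

```python
import json

class Parser:
    def parse(self, content: str):
        raise NotImplementedError

class Extractor:
    def extract(self, parsed, selector: str):
        raise NotImplementedError

class Renderer:
    def render(self, value, variables: dict):
        raise NotImplementedError

class Scraper:
    """Thin coordinator: each step lives in a focused, independently testable component."""

    def __init__(self, parser: Parser, extractor: Extractor, renderer: Renderer):
        self._parser = parser
        self._extractor = extractor
        self._renderer = renderer

    def scrape(self, content: str, selector: str, variables=None):
        parsed = self._parser.parse(content)
        value = self._extractor.extract(parsed, selector)
        return self._renderer.render(value, variables or {})

# Trivial stand-ins to show the wiring:
class JsonParser(Parser):
    def parse(self, content):
        return json.loads(content)

class KeyExtractor(Extractor):
    def extract(self, parsed, selector):
        return parsed[selector]

class PassthroughRenderer(Renderer):
    def render(self, value, variables):
        return value

scraper = Scraper(JsonParser(), KeyExtractor(), PassthroughRenderer())
scraper.scrape('{"temp": 21}', "temp")  # → 21
```

Each component can then be unit-tested in isolation, which is what shrinks the 60+ line `scrape()` method down to orchestration.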

⚠️ Potential issue | 🟡 Minor

Add language specifications to code blocks in Priority Fix #3.

All fenced code blocks in this section (lines 638–667, 673–765, 768–816, 819–958) lack python language specifiers. Add them to each fence opening.

Also applies to: 673-765, 768-816, 819-958

🤖 Prompt for AI Agents
docs/ARCHITECTURE_REFACTORING.md lines 638-667 (and also 673-765, 768-816,
819-958): several fenced code blocks are missing the language specifier; update
each opening fence (``` ) in these ranges to include the python specifier
(```python) so syntax highlighting and tooling recognize the blocks; scan the
listed line ranges and replace every ``` with ```python for those code blocks,
leaving the fence closings unchanged.

Comment on lines +988 to +998
```
Entity (scraping fails)
  └─> Coordinator.notify_scrape_exception()
        └─> ContentRequestManager.notify_scrape_exception()
              └─> FormSubmitter.notify_scrape_exception()
                    └─> Sets should_submit = True
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #4.

Multiple code blocks lack language identifiers: lines 988–998 should be `text`, and lines 1002–1013, 1029–1113, 1074–1113, 1115–1180, 1184–1259 should all be `python`. Update each fence opening.

Also applies to: 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

988-988: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 988-998 (and also apply to
1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259), several fenced code
blocks are missing language specifiers; change the fence opening on lines
988-998 to ```text and update the other listed fence openings to ```python so
each code block has the correct language tag; ensure you only modify the fence
lines (the opening triple-backticks) and leave the block contents unchanged.

Comment on lines +1310 to +1339
```python
# In __init__.py:121-142
hass.data[DOMAIN][SCRAPER_DATA].append(
    {SCRAPER: scraper, COORDINATOR: coordinator}
)

for platform_domain in PLATFORMS:
    if platform_domain not in conf:
        continue

    for platform_conf in conf[platform_domain]:
        hass.data[DOMAIN][platform_domain].append(platform_conf)
        platform_idx = len(hass.data[DOMAIN][platform_domain]) - 1

        load = discovery.async_load_platform(
            hass,
            platform_domain,
            DOMAIN,
            {SCRAPER_IDX: scraper_idx, PLATFORM_IDX: platform_idx},  # ← Indices!
            config,
        )

# In platforms (sensor.py, binary_sensor.py, etc):
async def async_get_config_and_coordinator(hass, platform_domain, discovery_info):
    shared_data = hass.data[DOMAIN][SCRAPER_DATA][discovery_info[SCRAPER_IDX]]  # ← Array lookup
    conf = hass.data[DOMAIN][platform_domain][discovery_info[PLATFORM_IDX]]  # ← Array lookup
    coordinator = shared_data[COORDINATOR]
    scraper = shared_data[SCRAPER]
    return conf, coordinator, scraper
```
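For contrast, the registry-based lookup proposed in the plan could be sketched as follows. ScraperRegistry and ScraperInstance are named in the plan; the specific fields and methods shown here are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ScraperInstance:
    unique_id: str
    scraper: Any
    coordinator: Any

class ScraperRegistry:
    """Type-safe lookup by unique ID instead of fragile array indices (sketch)."""

    def __init__(self) -> None:
        self._instances: dict[str, ScraperInstance] = {}

    def register(self, instance: ScraperInstance) -> None:
        self._instances[instance.unique_id] = instance

    def get(self, unique_id: str) -> ScraperInstance:
        return self._instances[unique_id]

    def unregister(self, unique_id: str) -> None:
        # Removing one instance never shifts another instance's position,
        # which is what makes reloads safe compared to array indices
        self._instances.pop(unique_id, None)

registry = ScraperRegistry()
registry.register(ScraperInstance("multiscrape_energy", scraper=object(), coordinator=object()))
registry.get("multiscrape_energy").unique_id  # → 'multiscrape_energy'
```

Platforms would then pass a `unique_id` through discovery_info instead of SCRAPER_IDX/PLATFORM_IDX.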

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #5.

All fenced code blocks in this section (lines 1310–1339, 1354–1453, 1457–1525, 1529–1578) should specify python language.

Also applies to: 1354-1453, 1457-1525, 1529-1578

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 1310 to 1339 (and also apply
the same change to ranges 1354–1453, 1457–1525, 1529–1578), several fenced code
blocks lack a language specifier; update each triple-backtick fence to include
"python" (i.e., change ``` to ```python) so all code blocks in those ranges are
marked as Python.

Comment on lines +1608 to +1641
```python
async def set_content(self, content):
    """Set the content to be scraped."""
    self._data = content

    # Try to detect JSON more robustly
    content_stripped = content.lstrip() if content else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        _LOGGER.debug(
            "%s # Response seems to be json. Skip parsing with BeautifulSoup.",
            self._config_name,
        )
    else:
        try:
            _LOGGER.debug(
                "%s # Loading the content in BeautifulSoup.",
                self._config_name,
            )
            self._soup = await self._hass.async_add_executor_job(
                BeautifulSoup, self._data, self._parser
            )
```

**Then during scraping**:

```python
def scrape(self, selector, sensor, attribute=None, variables: dict = {}):
    # Check AGAIN if content is JSON
    content_stripped = self._data.lstrip() if self._data else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        raise ValueError(
            "JSON cannot be scraped. Please provide a value template to parse JSON response."
        )
```
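A ParserFactory of the kind the plan proposes would centralize this detection in one place instead of repeating it. A minimal sketch (the stand-in parsers here are illustrative, not the plan's actual HtmlParser/JsonParser classes):

```python
import json

def detect_content_type(content: str) -> str:
    """Detect JSON vs. HTML once, in one place."""
    stripped = content.lstrip() if content else ""
    return "json" if stripped[:1] in ("{", "[") else "html"

class ParserFactory:
    """Pick a parser up front instead of re-checking for JSON in every scrape call."""

    @staticmethod
    def for_content(content: str):
        if detect_content_type(content) == "json":
            return json.loads   # stand-in for the plan's JsonParser
        return lambda c: c      # stand-in for HtmlParser (BeautifulSoup in the plan)

parse = ParserFactory.for_content('{"temp": 21}')
parse('{"temp": 21}')  # → {'temp': 21}
```

With the parser chosen once at `set_content` time, the later `scrape()` call never needs to re-inspect the raw payload.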

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #6.

All fenced code blocks lack language specifiers: lines 1608–1641, 1656–1861, and 1865–1904 should be `python`; lines 1908–1919 should be `yaml`.

Also applies to: 1656-1861, 1865-1904, 1908-1919

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 1608–1641, 1656–1861, 1865–1904
and 1908–1919, the fenced code blocks are missing language specifiers; update
the opening fence for the code blocks in lines 1608–1641, 1656–1861 and
1865–1904 to ```python and update the block in lines 1908–1919 to ```yaml so
each fenced block includes the correct language label for syntax highlighting.

danieldotnl and others added 2 commits November 17, 2025 20:32
Corrections based on code review:
- Both HTTP wrappers already use get_async_client(hass) for connection pooling
- The real issue is cookie management, not separate clients
- Per-request cookies are necessary because shared HA client can't maintain
  per-integration session state
- Updated HttpSession design to use get_async_client() and manage cookies
  in _session_cookies dict
- Changed close() to not close shared client, only cleanup session state

The problem was cookie juggling complexity, not connection pooling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e jar

Major improvement based on discussion: Instead of using HA's shared client
and manually managing cookies, use a dedicated httpx.AsyncClient per instance.

Why this is better:
- httpx DOES handle cookies automatically via built-in cookie jar
- Problem: can't use shared HA client's cookie jar (shared by all integrations)
- Solution: Each multiscrape instance gets its own httpx.AsyncClient
- Result: Zero manual cookie extraction/passing - httpx does it all!

This is how httpx is meant to be used for stateful scraping.

Trade-off:
- Pro: Natural session management, simpler code, standards-compliant
- Con: More connection pools (but still pooled within each instance)
- Con: Deviates from HA's get_async_client() pattern (but justified)

The current approach of manual cookie juggling defeats the purpose of
using an HTTP library that has session management built-in.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

26-40: Add language specifiers to all fenced code blocks throughout the document.

All fenced code blocks are missing language identifiers. Prior reviewers flagged this across multiple sections; these remain unaddressed. This affects syntax highlighting, rendering, and tooling support.

Code block language mappings required:

  • Priority Fix #1 (HTTP): Lines 26–40, 91–262, 266–304, 308–360 → python; lines 51–76 → text
  • Priority Fix #2 (Variables): Lines 406–425, 438–506, 510–574, 578–619, 623–648 → python
  • Priority Fix #3 (Scraper): Lines 691–720, 727–818, 820–869, 872–1011 → python
  • Priority Fix #4 (Notification): Lines 1041–1051 → text; lines 1055–1066, 1082–1123, 1125–1166, 1168–1233, 1237–1312 → python
  • Priority Fix #5 (Registry): Lines 1363–1392, 1406–1506, 1510–1578, 1582–1631 → python
  • Priority Fix #6 (Parser): Lines 1661–1694 → text; lines 1708–1914, 1918–1957 → python; lines 1961–1972 → yaml

Fix approach: Use a shell script or find-and-replace to systematically update all fence openings:

```shell
#!/bin/bash
# Find all triple-backtick fences and add language specifiers
# This is a template; adapt to your specific needs

# For Python code blocks (majority):
sed -i '/^```$/s/^```$/```python/' docs/ARCHITECTURE_REFACTORING.md

# Then manually verify and correct specific exceptions (text, yaml blocks)
```
Alternatively, manually edit each fence opening across the listed line ranges to include the language tag.

Also applies to: 51-76, 353-371, 385-453, 457-521, 525-566, 570-595, 638-667, 673-765, 768-816, 819-958, 988-998, 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259, 1310-1339, 1354-1453, 1457-1525, 1529-1578, 1608-1641, 1656-1861, 1865-1904, 1908-1919

🧹 Nitpick comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

397-397: Minor: Consider rewording for conciseness.

Line 397: "are mixed together without clear boundaries" is slightly wordy. Consider: "are intermingled without clear boundaries" or "lack clear separation."

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9abead and 8ee795a.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~397-~397: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

60-60: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


1041-1041: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

1-2120: Comprehensive, well-structured refactoring plan that clearly articulates architectural improvements.

The documentation excels at:

  • Problem identification: Each section opens with concrete code examples and file references showing current issues (e.g., dual HTTP wrappers, cookie juggling, backwards notification chains)
  • Solution design: Proposed solutions follow established design patterns (Strategy, Factory, Registry) with clear rationales
  • Migration path: Each fix includes step-by-step migration guidance
  • Holistic scope: Coverage spans HTTP management, variable context, parser abstraction, event-based error handling, and type-safe discovery—addressing root causes rather than symptoms
  • Tradeoff analysis: Honest discussion of pros/cons (e.g., dedicated httpx client vs. HA's shared client)
  • Realistic timeline: 8-week phased approach with clear milestones and testing strategy
  • Breaking-change clarity: Explicit assurance that no config changes break existing setups

Substantive strengths:

  1. HTTP Session (#1): The reasoning for a dedicated httpx.AsyncClient is sound—leveraging httpx's built-in cookie jar instead of manual extraction/passing is idiomatic and simpler.
  2. Variable Context (#2): Structured AuthContext and ScrapeContext classes eliminate variable-source ambiguity and enable type safety.
  3. Scraper Refactoring (#3): Split of parsing, extraction, and rendering into separate classes reduces cognitive load and enables independent testing.
  4. Parser Strategy (#6): Abstract ContentParser hierarchy with HtmlParser / JsonParser cleanly supports multi-format scraping without modifying core logic.
  5. Registry Pattern (#5): Replaces fragile index-based discovery with typed dataclasses and unique IDs—safer reloads and better debuggability.
  6. Notification Flow (#4): Session validation approach is pragmatic for MVP, with clear path to event-based architecture if needed later.

Minor observations:

  • Success metrics section is measurable and actionable (Scraper 197 lines → ~60 lines, cyclomatic complexity <10, >90% unit test coverage).
  • Questions & Decisions section shows thoughtful trade-off analysis (e.g., basic JSONPath vs. advanced, session validation vs. events).
  • References are helpful (DataUpdateCoordinator, Strategy Pattern, JSONPath spec).

Critical findings from HA and httpx source code analysis:

1. httpx DOES automatically persist cookies by default
   - Any Set-Cookie headers go into client.cookies
   - Stored cookies are sent with subsequent requests
   - This is built-in behavior, not opt-in

2. Home Assistant's get_async_client() does NOT disable this
   - The shared client has an active cookie jar
   - All integrations using it share the same jar!
   - This is fine for stateless requests, problematic for sessions

3. Current multiscrape workaround uses deprecated pattern
   - Passes cookies per-request via cookies= parameter
   - httpx deprecated this in favor of client-level cookies
   - This prevents accidental pollution of shared jar

4. Why dedicated client is correct:
   - HA's shared client is for STATELESS requests
   - Stateful scraping needs isolated cookie jar
   - HA docs mention create_async_httpx_client kwargs "i.e. for cookies"
   - Multiple multiscrape instances would conflict on shared jar

The solution is justified: dedicated httpx client per instance
is the right pattern for session-based scraping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
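The shared-jar conflict described in the findings above can be demonstrated with Python's stdlib cookie jar (a standalone illustration, not httpx or project code):

```python
from http.cookiejar import Cookie, CookieJar

def make_cookie(name: str, value: str, domain: str) -> Cookie:
    # Minimal Cookie construction; most fields are irrelevant for this demo
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=False,
        comment=None, comment_url=None, rest={},
    )

shared = CookieJar()  # one jar shared by every integration
shared.set_cookie(make_cookie("session", "integration-a", "example.com"))
shared.set_cookie(make_cookie("session", "integration-b", "example.com"))
# Same name/domain/path: the second integration clobbers the first
len(shared)  # → 1

jar_a, jar_b = CookieJar(), CookieJar()  # one dedicated jar per instance
jar_a.set_cookie(make_cookie("session", "integration-a", "example.com"))
jar_b.set_cookie(make_cookie("session", "integration-b", "example.com"))
# Each instance keeps its own session cookie intact
```

This is the same mechanics whether the client is httpx or aiohttp: one jar per client, so session-based scraping wants one client per instance.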
@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/ARCHITECTURE_REFACTORING.md (2)

26-40: Add language specifications to all fenced code blocks throughout the document.

All code fences in this document lack language identifiers. These should be added for proper syntax highlighting and tool recognition.

Affected ranges and required language tags:

  • Lines 26–40, 97–268, 272–310, 314–366, 353–371, 385–453, 457–521, 525–566, 570–595, 638–667, 673–765, 768–816, 819–958, 1002–1013, 1029–1113, 1115–1180, 1184–1259, 1310–1339, 1354–1453, 1457–1525, 1529–1578, 1608–1641, 1656–1861, 1865–1904: Use ```python
  • Lines 51–76, 988–998: Use ```text
  • Line 1908–1919: Use ```yaml

For example, change ``` to ```python at line 26, and ``` to ```text at line 51.

Also applies to: 51-76, 97-268, 272-310, 314-366, 353-371, 385-453, 457-521, 525-566, 570-595, 638-667, 673-765, 768-816, 819-958, 988-998, 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259, 1310-1339, 1354-1453, 1457-1525, 1529-1578, 1608-1641, 1656-1861, 1865-1904, 1908-1919


407-407: Simplify wording: "mixed together" → "intermixed" or "combined".

Line 407 uses "mixed together," which is slightly wordy. Consider a more concise alternative.

Current: "Variables come from multiple sources and are mixed together without clear boundaries:"

Suggested: "Variables come from multiple sources and are intermixed without clear boundaries:"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ee795a and 2a1b30f.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~407-~407: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


1051-1051: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Research confirms Home Assistant's aiohttp helper has identical behavior:

1. aiohttp.ClientSession creates CookieJar() by default
   - Automatically stores cookies from Set-Cookie headers
   - Automatically sends cookies with subsequent requests
   - Can be disabled with DummyCookieJar()

2. HA's async_get_clientsession() does NOT disable this
   - No cookie_jar parameter passed to ClientSession
   - Uses default behavior = active cookie jar
   - Same docstring: "with kwargs, i.e. for cookies"

3. Identical issue to httpx
   - Both shared clients have active cookie jars
   - All integrations share the same jar
   - Fine for stateless requests, bad for sessions

Conclusion: Whether using httpx or aiohttp, dedicated client per
instance is the correct pattern for session-based scraping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>