
Architecture Refactoring Plan Documentation#535

Merged
danieldotnl merged 5 commits into master from docs/architecture-refactoring-plan
Mar 5, 2026

Conversation

@danieldotnl (Owner) commented Nov 17, 2025

Summary

This PR adds comprehensive documentation for refactoring the ha-multiscrape architecture to address complexity issues identified in the codebase.

What's Included

A detailed refactoring plan covering 6 priority architectural fixes:

  1. Unify HTTP Handling - Replace dual HTTP wrapper instances with a single HttpSession class that properly manages cookies, authentication, and session state

  2. Simplify Variable System - Replace the confusing mix of form variables, template variables, and selector variables with a structured ScrapeContext system

  3. Refactor Scraper Class - Split the monolithic 197-line Scraper class into focused components using the Strategy pattern (Parser, Extractor, Renderer)

  4. Fix Notification Flow - Eliminate backwards dependency flow by using session validation or event-based architecture instead of notify_scrape_exception() chains

  5. Replace Index-Based Discovery - Replace fragile array indices with a type-safe ScraperRegistry using unique IDs

  6. Extract Strategy Pattern for Parsers - Enable native JSON API support alongside HTML scraping using pluggable parser strategies
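As a rough sketch of what fix #2's structured context could look like (ScrapeContext is named in the plan; the exact fields and namespacing below are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class ScrapeContext:
    """Namespaced variables for template rendering (illustrative fields)."""
    form_variables: dict = field(default_factory=dict)    # extracted by form submit
    sensor_variables: dict = field(default_factory=dict)  # per-sensor values
    user_variables: dict = field(default_factory=dict)    # from configuration

    def as_template_vars(self) -> dict:
        # Namespacing makes it obvious where each value came from,
        # instead of merging everything into one ambiguous dict
        return {
            "form": dict(self.form_variables),
            "sensor": dict(self.sensor_variables),
            "user": dict(self.user_variables),
        }

ctx = ScrapeContext(form_variables={"token": "abc"}, user_variables={"city": "Oslo"})
ctx.as_template_vars()["form"]["token"]  # → 'abc'
```

A template would then reference `form.token` rather than guessing which source a bare `token` came from.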

Each Section Includes

  • Current Problem: Detailed analysis with code examples and file references
  • Proposed Solution: Complete implementation with code examples
  • Benefits: Why this improves the architecture
  • Migration Path: Step-by-step transition plan

Additional Content

  • Implementation Timeline: 8-week phased approach
  • Testing Strategy: Unit, integration, and compatibility tests
  • Success Metrics: Measurable improvements
  • Breaking Changes: Analysis (spoiler: no config changes!)
  • Questions & Decisions: Key architectural choices

Why This Matters

The current architecture works but has accumulated complexity that makes it:

  • Hard to understand the request flow (especially with form submit)
  • Difficult to add new features (like JSON API support)
  • Challenging to debug (backwards notification flow, mixed variables)
  • Fragile (index-based discovery can break on reload)

This refactoring plan provides a roadmap to make the codebase more maintainable, testable, and extensible while maintaining backwards compatibility.

Next Steps

This is documentation only - no code changes yet. The plan serves as:

  1. Reference for future refactoring work
  2. Discussion document for architectural decisions
  3. Guide for contributors who want to help

Feedback welcome!

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Architecture Refactor Plan describing centralized per-instance HTTP/session handling, session-based authentication and state, a simplified request flow, modular HTML/JSON parsing and extraction, registry-based platform discovery via unique IDs, optional event-driven lifecycle hooks, extended selector support, and migration, testing, and rollout guidance.

Document six priority fixes to address architectural complexity:

1. Unify HTTP handling - Replace dual HTTP wrappers with single HttpSession
2. Simplify variable system - Structured context instead of mixed dicts
3. Refactor Scraper - Split into Parser/Extractor/Renderer components
4. Fix notification flow - Event-based or session validation instead of backwards calls
5. Replace index-based discovery - Type-safe registry with unique IDs
6. Extract strategy pattern - Support HTML and JSON parsing natively

Includes detailed implementation plans, code examples, migration paths,
and timeline for each fix. Total estimated effort: 6-8 weeks.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot, Contributor) commented Nov 17, 2025


📥 Commits

Reviewing files that changed from the base of the PR and between 2a1b30f and 607c811.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)

Walkthrough

Centralizes HTTP into a new HttpSession, introduces typed AuthContext/ScrapeContext and VariableRenderer, refactors scrapers with a ParserFactory, adds parser/extractor abstractions (HTML/JSON), and replaces index-based discovery with a ScraperRegistry and registry-driven platform loading.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Architecture doc (docs/ARCHITECTURE_REFACTORING.md) | Adds the comprehensive refactor plan: new components, wiring, migration path, testing strategy, and deprecation guidance. |
| HTTP & session (http_session.py, ContentRequestManager, .../FormSubmitter; deprecation notes in doc) | Introduces HttpSession and HttpConfig/FormAuthConfig; centralizes HTTP client, cookie and auth state, and request flow through a session-managed client. |
| Auth & context (context.py: AuthContext, ScrapeContext, VariableRenderer) | Adds typed contexts and a VariableRenderer for namespaced template variables and render-time data separation. |
| Scraper core (scraper.py, coordinator.py, .../__init__.py) | Refactors Scraper into a modular design, integrates ParserFactory, and replaces index loaders with get_scraper_for_platform / registry-driven entrypoints. |
| Parsers & extractors (parsers.py, extractors.py: ParserFactory, HtmlParser, JsonParser, HtmlContent, JsonContent, ValueExtractor) | Adds parser abstractions for HTML/JSON, ParsedContent types, JSONPath-like selectors, and a value extractor subsystem. |
| Registry & discovery (registry.py: ScraperRegistry, ScraperInstance, PlatformConfig) | Implements registry-based discovery with unique IDs, lifecycle management, safer reload semantics, and platform config wiring. |
| Events & observability (events.py: ScraperEvent, EventBus) | Adds optional event classes and an EventBus for decoupled lifecycle/error notifications and session-validation hooks. |
| Public API surface (__init__.py, docs, public exports) | Changes public surfaces: promotes HttpSession, exposes registry lookup by unique_id, and marks legacy wrappers/forms for deprecation; updates exported types and signatures. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Coordinator
    participant Registry
    participant Scraper as ScraperInstance
    participant Http as HttpSession
    participant Parser as ParserFactory

    Note over Coordinator,Registry: Load platform by unique_id (registry-driven)
    Coordinator->>Registry: get_scraper(unique_id)
    Registry-->>Coordinator: ScraperInstance
    Coordinator->>Scraper: start_scrape(request, scrape_ctx)
    Scraper->>Http: request(url, auth_ctx)
    Http-->>Scraper: Response (cookies/state updated)
    Scraper->>Parser: parse(response)
    Parser-->>Scraper: ParsedContent
    Scraper->>Scraper: extract values (VariableRenderer / ValueExtractor)
    Scraper-->>Coordinator: scrape_result
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas needing focused review:
    • HttpSession concurrency, auth lifecycle, cookie handling and HttpConfig defaults.
    • ScraperRegistry unique_id mapping, reload semantics, and lifecycle hooks.
    • ParserFactory selection logic, HTML/JSON parsing boundaries, and JSONPath-like selector correctness.
    • Public API changes and deprecation notes to ensure backward-compatibility and migration clarity.

Poem

🐰 I hopped through bytes and tidy threads,
Sessions cozy where HTTP beds.
Parsers sniff JSON, selectors trace,
Registry shelves each scraper's place.
Nibbles of refactor — a carrot chase! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped; CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: a comprehensive architecture refactoring plan document. It is concise, clear, and directly related to the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai (bot) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

344-344: Refine wording for conciseness.

Consider replacing "mixed together without clear boundaries" with "commingled without clear boundaries" or simply "mixed without clear separation" for conciseness.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcff870 and 5bda929.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~344-~344: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

51-51: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


988-988: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (2)
docs/ARCHITECTURE_REFACTORING.md (2)

1728-1763: Validate JSONPath implementation covers intended use cases.

The basic JSONPath implementation (lines 1728–1763) handles dot notation and integer indices, but complex expressions (e.g., filters, wildcards) are not supported. Before implementation, verify this covers the expected JSON API queries from users, or plan to integrate jsonpath-ng as the optional migration mentioned at line 1938.
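A dot-notation resolver of the kind described might look like this (a sketch only; names and behavior are illustrative, and production queries would likely go through jsonpath-ng as noted):

```python
def resolve_path(data, path: str):
    """Resolve a dot-notation path with integer indices, e.g. 'items.0.price'.

    Mirrors the scope described above: no filters, no wildcards.
    """
    current = data
    for part in path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric segment indexes into a list
        else:
            current = current[part]       # other segments index into a dict
    return current

doc = {"items": [{"price": 9.99}, {"price": 4.50}]}
resolve_path(doc, "items.1.price")  # → 4.5
```

Filters like `$.items[?(@.price < 5)]` are exactly what this cannot express, which is the scope question the comment raises.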


1-2072: Excellent comprehensive refactoring plan with sound architectural decisions.

This Architecture Refactoring Plan is well-structured and thoughtfully designed. Each of the six priority fixes addresses specific architectural problems with clear problem analysis, proposed solutions with working code examples, realistic benefits, and practical migration paths. The phased 8-week timeline is realistic, success metrics are measurable, and the decision to defer breaking changes to configuration shows good judgment.

The proposed patterns (Strategy for parsers, Registry for discovery, Context objects for variables, Session validation for authentication) are architecturally sound and follow established best practices.

However, address the markdown linting issues (missing language specifications on code blocks) before merging, and consider validating the JSONPath implementation scope.

Comment on lines +26 to +40
```python
# In __init__.py
if form_submit_config:
    form_http = create_http_wrapper(config_name, form_submit_config, hass, file_manager)
    form_submitter = create_form_submitter(
        config_name,
        form_submit_config,
        hass,
        form_http,  # First HTTP wrapper
        file_manager,
        parser,
    )

http = create_http_wrapper(config_name, conf, hass, file_manager)  # Second HTTP wrapper
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to fenced code blocks.

Code blocks at lines 26–40 and 51–76 lack language identifiers in their markdown fences. Lines 51–76 should specify `text` for the ASCII diagram; lines 26–40 should specify `python`.

Update the fenced code block opening from ``` to ```python (or ```text for the diagram).

Also applies to: 51-76

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 26–40 and 51–76, the fenced
code blocks are missing language identifiers; update the opening fence on lines
26–40 to use ```python for the Python snippet and update the opening fence on
lines 51–76 to use ```text for the ASCII diagram so that syntax
highlighting/rendering is applied correctly.

Comment on lines +353 to +371
```python
# In entity.py:145 - Mixing form variables into scraping
attr_value = self.scraper.scrape(
    attr_selector,
    self._name,
    name,
    variables=self.coordinator.form_variables  # ← Form variables passed everywhere
)

# In coordinator.py:233 - Property chain
@property
def form_variables(self):
    return self._request_manager.form_variables  # ← Law of Demeter violation

# In selector.py - Templates receive mixed variable context
value = selector.value_template.async_render(
    variables=variables,  # ← What's in here? Form vars? Template vars? Both?
    parse_result=True
)
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to additional code blocks.

Multiple code blocks lack language identifiers in markdown fences throughout Priority Fix #2 (lines 353–371, 385–453, 457–521, 525–566, 570–595). Add python language specifiers to each.

Also applies to: 385-453, 457-521, 525-566, 570-595

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 353-371 (and additionally
385-453, 457-521, 525-566, 570-595), several fenced code blocks are missing
language specifiers; update each Markdown code fence in those ranges to include
the python language tag (```python) so they render with correct syntax
highlighting — locate every triple-backtick fence in the listed line ranges and
change the opening fence to ```python, leaving the block contents unchanged.

Comment on lines +638 to +667
```python
def scrape(self, selector, sensor, attribute=None, variables: dict = {}):
    """60+ line method doing everything."""

    # Template rendering (should be in VariableRenderer)
    if selector.just_value:
        result = selector.value_template.async_render_with_possible_json_value(...)
        return selector.value_template._parse_result(result)

    # Content type detection (should be in Parser)
    content_stripped = self._data.lstrip() if self._data else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        raise ValueError("JSON cannot be scraped...")

    # CSS selection (should be in Selector)
    if selector.is_list:
        tags = self._soup.select(selector.list)
    else:
        tag = self._soup.select_one(selector.element)

    # Value extraction (should be in Extractor)
    if selector.attribute is not None:
        value = tag[selector.attribute]
    else:
        value = self.extract_tag_value(tag, selector)

    # Template rendering again (should be in VariableRenderer)
    if value is not None and selector.value_template is not None:
        value = selector.value_template.async_render(...)
```
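The decomposition the plan proposes (Parser, Extractor, Renderer composed by a thin Scraper) could be sketched like this. The interfaces and the trivial stand-in implementations are assumptions for illustration, not the plan's final API:

```python
import json

class Parser:
    def parse(self, content: str):
        raise NotImplementedError

class Extractor:
    def extract(self, parsed, selector: str):
        raise NotImplementedError

class Renderer:
    def render(self, value, variables: dict):
        raise NotImplementedError

class Scraper:
    """Thin coordinator: each step lives in a focused, independently testable component."""

    def __init__(self, parser: Parser, extractor: Extractor, renderer: Renderer):
        self._parser = parser
        self._extractor = extractor
        self._renderer = renderer

    def scrape(self, content: str, selector: str, variables=None):
        parsed = self._parser.parse(content)
        value = self._extractor.extract(parsed, selector)
        return self._renderer.render(value, variables or {})

# Trivial stand-ins to show the wiring:
class JsonParser(Parser):
    def parse(self, content):
        return json.loads(content)

class KeyExtractor(Extractor):
    def extract(self, parsed, selector):
        return parsed[selector]

class PassthroughRenderer(Renderer):
    def render(self, value, variables):
        return value

scraper = Scraper(JsonParser(), KeyExtractor(), PassthroughRenderer())
scraper.scrape('{"temp": 21}', "temp")  # → 21
```

Each component can then be unit-tested in isolation, which is what shrinks the 60+ line `scrape()` method down to orchestration.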

⚠️ Potential issue | 🟡 Minor

Add language specifications to code blocks in Priority Fix #3.

All fenced code blocks in this section (lines 638–667, 673–765, 768–816, 819–958) lack python language specifiers. Add them to each fence opening.

Also applies to: 673-765, 768-816, 819-958

🤖 Prompt for AI Agents
docs/ARCHITECTURE_REFACTORING.md lines 638-667 (and also 673-765, 768-816,
819-958): several fenced code blocks are missing the language specifier; update
each opening fence (``` ) in these ranges to include the python specifier
(```python) so syntax highlighting and tooling recognize the blocks; scan the
listed line ranges and replace every ``` with ```python for those code blocks,
leaving the fence closings unchanged.

Comment on lines +988 to +998
```
Entity (scraping fails)
  └─> Coordinator.notify_scrape_exception()
        └─> ContentRequestManager.notify_scrape_exception()
              └─> FormSubmitter.notify_scrape_exception()
                    └─> Sets should_submit = True
```

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #4.

Multiple code blocks lack language identifiers: lines 988–998 should be `text`, and lines 1002–1013, 1029–1113, 1074–1113, 1115–1180, 1184–1259 should all be `python`. Update each fence opening.

Also applies to: 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

988-988: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 988-998 (and also apply to
1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259), several fenced code
blocks are missing language specifiers; change the fence opening on lines
988-998 to ```text and update the other listed fence openings to ```python so
each code block has the correct language tag; ensure you only modify the fence
lines (the opening triple-backticks) and leave the block contents unchanged.

Comment on lines +1310 to +1339
```python
# In __init__.py:121-142
hass.data[DOMAIN][SCRAPER_DATA].append(
    {SCRAPER: scraper, COORDINATOR: coordinator}
)

for platform_domain in PLATFORMS:
    if platform_domain not in conf:
        continue

    for platform_conf in conf[platform_domain]:
        hass.data[DOMAIN][platform_domain].append(platform_conf)
        platform_idx = len(hass.data[DOMAIN][platform_domain]) - 1

        load = discovery.async_load_platform(
            hass,
            platform_domain,
            DOMAIN,
            {SCRAPER_IDX: scraper_idx, PLATFORM_IDX: platform_idx},  # ← Indices!
            config,
        )

# In platforms (sensor.py, binary_sensor.py, etc):
async def async_get_config_and_coordinator(hass, platform_domain, discovery_info):
    shared_data = hass.data[DOMAIN][SCRAPER_DATA][discovery_info[SCRAPER_IDX]]  # ← Array lookup
    conf = hass.data[DOMAIN][platform_domain][discovery_info[PLATFORM_IDX]]  # ← Array lookup
    coordinator = shared_data[COORDINATOR]
    scraper = shared_data[SCRAPER]
    return conf, coordinator, scraper
```
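For contrast, the registry-based lookup proposed in the plan could be sketched as follows. ScraperRegistry and ScraperInstance are named in the plan; the specific fields and methods shown here are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ScraperInstance:
    unique_id: str
    scraper: Any
    coordinator: Any

class ScraperRegistry:
    """Type-safe lookup by unique ID instead of fragile array indices (sketch)."""

    def __init__(self) -> None:
        self._instances: dict[str, ScraperInstance] = {}

    def register(self, instance: ScraperInstance) -> None:
        self._instances[instance.unique_id] = instance

    def get(self, unique_id: str) -> ScraperInstance:
        return self._instances[unique_id]

    def unregister(self, unique_id: str) -> None:
        # Removing one instance never shifts another instance's position,
        # which is what makes reloads safe compared to array indices
        self._instances.pop(unique_id, None)

registry = ScraperRegistry()
registry.register(ScraperInstance("multiscrape_energy", scraper=object(), coordinator=object()))
registry.get("multiscrape_energy").unique_id  # → 'multiscrape_energy'
```

Platforms would then pass a `unique_id` through discovery_info instead of SCRAPER_IDX/PLATFORM_IDX.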

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #5.

All fenced code blocks in this section (lines 1310–1339, 1354–1453, 1457–1525, 1529–1578) should specify python language.

Also applies to: 1354-1453, 1457-1525, 1529-1578

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 1310 to 1339 (and also apply
the same change to ranges 1354–1453, 1457–1525, 1529–1578), several fenced code
blocks lack a language specifier; update each triple-backtick fence to include
"python" (i.e., change ``` to ```python) so all code blocks in those ranges are
marked as Python.

Comment on lines +1608 to +1641
```python
async def set_content(self, content):
    """Set the content to be scraped."""
    self._data = content

    # Try to detect JSON more robustly
    content_stripped = content.lstrip() if content else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        _LOGGER.debug(
            "%s # Response seems to be json. Skip parsing with BeautifulSoup.",
            self._config_name,
        )
    else:
        try:
            _LOGGER.debug(
                "%s # Loading the content in BeautifulSoup.",
                self._config_name,
            )
            self._soup = await self._hass.async_add_executor_job(
                BeautifulSoup, self._data, self._parser
            )
```

**Then during scraping**:

```python
def scrape(self, selector, sensor, attribute=None, variables: dict = {}):
    # Check AGAIN if content is JSON
    content_stripped = self._data.lstrip() if self._data else ""
    if content_stripped and content_stripped[0] in ["{", "["]:
        raise ValueError(
            "JSON cannot be scraped. Please provide a value template to parse JSON response."
        )
```
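A ParserFactory of the kind the plan proposes would centralize this detection in one place instead of repeating it. A minimal sketch (the stand-in parsers here are illustrative, not the plan's actual HtmlParser/JsonParser classes):

```python
import json

def detect_content_type(content: str) -> str:
    """Detect JSON vs. HTML once, in one place."""
    stripped = content.lstrip() if content else ""
    return "json" if stripped[:1] in ("{", "[") else "html"

class ParserFactory:
    """Pick a parser up front instead of re-checking for JSON in every scrape call."""

    @staticmethod
    def for_content(content: str):
        if detect_content_type(content) == "json":
            return json.loads   # stand-in for the plan's JsonParser
        return lambda c: c      # stand-in for HtmlParser (BeautifulSoup in the plan)

parse = ParserFactory.for_content('{"temp": 21}')
parse('{"temp": 21}')  # → {'temp': 21}
```

With the parser chosen once at `set_content` time, the later `scrape()` call never needs to re-inspect the raw payload.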

⚠️ Potential issue | 🟡 Minor

Add language specifications to all code blocks in Priority Fix #6.

All fenced code blocks lack language specifiers: lines 1608–1641, 1656–1861, and 1865–1904 should be `python`; lines 1908–1919 should be `yaml`.

Also applies to: 1656-1861, 1865-1904, 1908-1919

🤖 Prompt for AI Agents
In docs/ARCHITECTURE_REFACTORING.md around lines 1608–1641, 1656–1861, 1865–1904
and 1908–1919, the fenced code blocks are missing language specifiers; update
the opening fence for the code blocks in lines 1608–1641, 1656–1861 and
1865–1904 to ```python and update the block in lines 1908–1919 to ```yaml so
each fenced block includes the correct language label for syntax highlighting.

danieldotnl and others added 2 commits November 17, 2025 20:32
Corrections based on code review:
- Both HTTP wrappers already use get_async_client(hass) for connection pooling
- The real issue is cookie management, not separate clients
- Per-request cookies are necessary because shared HA client can't maintain
  per-integration session state
- Updated HttpSession design to use get_async_client() and manage cookies
  in _session_cookies dict
- Changed close() to not close shared client, only cleanup session state

The problem was cookie juggling complexity, not connection pooling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…e jar

Major improvement based on discussion: Instead of using HA's shared client
and manually managing cookies, use a dedicated httpx.AsyncClient per instance.

Why this is better:
- httpx DOES handle cookies automatically via built-in cookie jar
- Problem: can't use shared HA client's cookie jar (shared by all integrations)
- Solution: Each multiscrape instance gets its own httpx.AsyncClient
- Result: Zero manual cookie extraction/passing - httpx does it all!

This is how httpx is meant to be used for stateful scraping.

Trade-off:
- Pro: Natural session management, simpler code, standards-compliant
- Con: More connection pools (but still pooled within each instance)
- Con: Deviates from HA's get_async_client() pattern (but justified)

The current approach of manual cookie juggling defeats the purpose of
using an HTTP library that has session management built-in.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai (bot) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

26-40: Add language specifiers to all fenced code blocks throughout the document.

All fenced code blocks are missing language identifiers. Prior reviewers flagged this across multiple sections; these remain unaddressed. This affects syntax highlighting, rendering, and tooling support.

Code block language mappings required:

  • Priority Fix #1 (HTTP): Lines 26–40, 91–262, 266–304, 308–360 → python; lines 51–76 → text
  • Priority Fix #2 (Variables): Lines 406–425, 438–506, 510–574, 578–619, 623–648 → python
  • Priority Fix #3 (Scraper): Lines 691–720, 727–818, 820–869, 872–1011 → python
  • Priority Fix #4 (Notification): Lines 1041–1051 → text; lines 1055–1066, 1082–1123, 1125–1166, 1168–1233, 1237–1312 → python
  • Priority Fix #5 (Registry): Lines 1363–1392, 1406–1506, 1510–1578, 1582–1631 → python
  • Priority Fix #6 (Parser): Lines 1661–1694 → text; lines 1708–1914, 1918–1957 → python; lines 1961–1972 → yaml

Fix approach: Use a shell script or find-and-replace to systematically update all fence openings:

```shell
#!/bin/bash
# Find all triple-backtick fences and add language specifiers
# This is a template; adapt to your specific needs

# For Python code blocks (majority):
sed -i '/^```$/s/^```$/```python/' docs/ARCHITECTURE_REFACTORING.md

# Then manually verify and correct specific exceptions (text, yaml blocks)
```
Alternatively, manually edit each fence opening across the listed line ranges to include the language tag.

Also applies to: 51-76, 353-371, 385-453, 457-521, 525-566, 570-595, 638-667, 673-765, 768-816, 819-958, 988-998, 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259, 1310-1339, 1354-1453, 1457-1525, 1529-1578, 1608-1641, 1656-1861, 1865-1904, 1908-1919

🧹 Nitpick comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

397-397: Minor: Consider rewording for conciseness.

Line 397: "are mixed together without clear boundaries" is slightly wordy. Consider: "are intermingled without clear boundaries" or "lack clear separation."

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9abead and 8ee795a.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~397-~397: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

60-60: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


1041-1041: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🔇 Additional comments (1)
docs/ARCHITECTURE_REFACTORING.md (1)

1-2120: Comprehensive, well-structured refactoring plan that clearly articulates architectural improvements.

The documentation excels at:

  • Problem identification: Each section opens with concrete code examples and file references showing current issues (e.g., dual HTTP wrappers, cookie juggling, backwards notification chains)
  • Solution design: Proposed solutions follow established design patterns (Strategy, Factory, Registry) with clear rationales
  • Migration path: Each fix includes step-by-step migration guidance
  • Holistic scope: Coverage spans HTTP management, variable context, parser abstraction, event-based error handling, and type-safe discovery—addressing root causes rather than symptoms
  • Tradeoff analysis: Honest discussion of pros/cons (e.g., dedicated httpx client vs. HA's shared client)
  • Realistic timeline: 8-week phased approach with clear milestones and testing strategy
  • Breaking-change clarity: Explicit assurance that no config changes break existing setups

Substantive strengths:

  1. HTTP Session (#1): The reasoning for a dedicated httpx.AsyncClient is sound—leveraging httpx's built-in cookie jar instead of manual extraction/passing is idiomatic and simpler.
  2. Variable Context (#2): Structured AuthContext and ScrapeContext classes eliminate variable-source ambiguity and enable type safety.
  3. Scraper Refactoring (#3): Split of parsing, extraction, and rendering into separate classes reduces cognitive load and enables independent testing.
  4. Parser Strategy (#6): Abstract ContentParser hierarchy with HtmlParser / JsonParser cleanly supports multi-format scraping without modifying core logic.
  5. Registry Pattern (#5): Replaces fragile index-based discovery with typed dataclasses and unique IDs—safer reloads and better debuggability.
  6. Notification Flow (#4): Session validation approach is pragmatic for MVP, with clear path to event-based architecture if needed later.

Minor observations:

  • Success metrics section is measurable and actionable (Scraper 197 lines → ~60 lines, cyclomatic complexity <10, >90% unit test coverage).
  • Questions & Decisions section shows thoughtful trade-off analysis (e.g., basic JSONPath vs. advanced, session validation vs. events).
  • References are helpful (DataUpdateCoordinator, Strategy Pattern, JSONPath spec).

Critical findings from HA and httpx source code analysis:

1. httpx DOES automatically persist cookies by default
   - Any Set-Cookie headers go into client.cookies
   - Stored cookies are sent with subsequent requests
   - This is built-in behavior, not opt-in

2. Home Assistant's get_async_client() does NOT disable this
   - The shared client has an active cookie jar
   - All integrations using it share the same jar!
   - This is fine for stateless requests, problematic for sessions

3. Current multiscrape workaround uses deprecated pattern
   - Passes cookies per-request via cookies= parameter
   - httpx deprecated this in favor of client-level cookies
   - This prevents accidental pollution of shared jar

4. Why dedicated client is correct:
   - HA's shared client is for STATELESS requests
   - Stateful scraping needs isolated cookie jar
   - HA docs mention create_async_httpx_client kwargs "i.e. for cookies"
   - Multiple multiscrape instances would conflict on shared jar

The solution is justified: dedicated httpx client per instance
is the right pattern for session-based scraping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
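The shared-jar conflict described in the findings above can be demonstrated with Python's stdlib cookie jar (a standalone illustration, not httpx or project code):

```python
from http.cookiejar import Cookie, CookieJar

def make_cookie(name: str, value: str, domain: str) -> Cookie:
    # Minimal Cookie construction; most fields are irrelevant for this demo
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=False,
        comment=None, comment_url=None, rest={},
    )

shared = CookieJar()  # one jar shared by every integration
shared.set_cookie(make_cookie("session", "integration-a", "example.com"))
shared.set_cookie(make_cookie("session", "integration-b", "example.com"))
# Same name/domain/path: the second integration clobbers the first
len(shared)  # → 1

jar_a, jar_b = CookieJar(), CookieJar()  # one dedicated jar per instance
jar_a.set_cookie(make_cookie("session", "integration-a", "example.com"))
jar_b.set_cookie(make_cookie("session", "integration-b", "example.com"))
# Each instance keeps its own session cookie intact
```

This is the same mechanics whether the client is httpx or aiohttp: one jar per client, so session-based scraping wants one client per instance.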
@coderabbitai (bot) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/ARCHITECTURE_REFACTORING.md (2)

26-40: Add language specifications to all fenced code blocks throughout the document.

All code fences in this document lack language identifiers. These should be added for proper syntax highlighting and tool recognition.

Affected ranges and required language tags:

  • Lines 26–40, 97–268, 272–310, 314–366, 353–371, 385–453, 457–521, 525–566, 570–595, 638–667, 673–765, 768–816, 819–958, 1002–1013, 1029–1113, 1115–1180, 1184–1259, 1310–1339, 1354–1453, 1457–1525, 1529–1578, 1608–1641, 1656–1861, 1865–1904: Use ```python
  • Lines 51–76, 988–998: Use ```text
  • Line 1908–1919: Use ```yaml

For example, change ``` to ```python at line 26, and ``` to ```text at line 51.

Also applies to: 51-76, 97-268, 272-310, 314-366, 353-371, 385-453, 457-521, 525-566, 570-595, 638-667, 673-765, 768-816, 819-958, 988-998, 1002-1013, 1029-1113, 1074-1113, 1115-1180, 1184-1259, 1310-1339, 1354-1453, 1457-1525, 1529-1578, 1608-1641, 1656-1861, 1865-1904, 1908-1919


407-407: Simplify wording: "mixed together" → "intermixed" or "combined".

Line 407 uses "mixed together," which is slightly wordy. Consider a more concise alternative.

Current: "Variables come from multiple sources and are mixed together without clear boundaries:"

Suggested: "Variables come from multiple sources and are intermixed without clear boundaries:"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ee795a and 2a1b30f.

📒 Files selected for processing (1)
  • docs/ARCHITECTURE_REFACTORING.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/ARCHITECTURE_REFACTORING.md

[style] ~407-~407: ‘mixed together’ might be wordy. Consider a shorter alternative.
Context: ...bles come from multiple sources and are mixed together without clear boundaries: 1. **Form Va...

(EN_WORDINESS_PREMIUM_MIXED_TOGETHER)

🪛 markdownlint-cli2 (0.18.1)
docs/ARCHITECTURE_REFACTORING.md

66-66: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


1051-1051: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

Research confirms Home Assistant's aiohttp helper has identical behavior:

1. aiohttp.ClientSession creates CookieJar() by default
   - Automatically stores cookies from Set-Cookie headers
   - Automatically sends cookies with subsequent requests
   - Can be disabled with DummyCookieJar()

2. HA's async_get_clientsession() does NOT disable this
   - No cookie_jar parameter passed to ClientSession
   - Uses default behavior = active cookie jar
   - Same docstring: "with kwargs, i.e. for cookies"

3. Identical issue to httpx
   - Both shared clients have active cookie jars
   - All integrations share the same jar
   - Fine for stateless requests, bad for sessions

Conclusion: Whether using httpx or aiohttp, dedicated client per
instance is the correct pattern for session-based scraping.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>