diff --git a/examples/deployment/docker-compose.yml b/examples/deployment/docker-compose.yml new file mode 100644 index 00000000..5b98fa66 --- /dev/null +++ b/examples/deployment/docker-compose.yml @@ -0,0 +1,39 @@ +services: + html2rss: + image: html2rss/web:latest + env_file: .env + + caddy: + image: caddy:2-alpine + depends_on: + - html2rss + command: + - caddy + - reverse-proxy + - --from + - ${CADDY_HOST} + - --to + - html2rss:3000 + ports: + - "80:80" + - "443:443" + volumes: + - caddy_data:/data + + watchtower: + image: containrrr/watchtower + depends_on: + - html2rss + - caddy + command: + - --cleanup + - --interval + - "300" + - html2rss + - caddy + volumes: + - /var/run/docker.sock:/var/run/docker.sock:ro + restart: unless-stopped + +volumes: + caddy_data: diff --git a/src/components/docs/AutoGenerationOptional.astro b/src/components/docs/AutoGenerationOptional.astro index 35039fc9..982682bb 100644 --- a/src/components/docs/AutoGenerationOptional.astro +++ b/src/components/docs/AutoGenerationOptional.astro @@ -2,7 +2,7 @@ import { Aside } from "@astrojs/starlight/components"; --- - --- @@ -160,6 +160,21 @@ html2rss supports many configuration options: 4. **Check the output:** Make sure all items have titles, links, and descriptions +### Useful CLI flags when a site is difficult + +Some sites need a little more request budget than the defaults. + +- Use `--max-redirects` when the site bounces through several canonicalization or tracking redirects before the real page loads. +- Use `--max-requests` when your config needs more than one request, for example pagination or other follow-up fetches. + +```bash +html2rss feed your-config.yml --max-redirects 10 +html2rss feed your-config.yml --max-requests 5 +html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5 +``` + +Keep these values as low as possible. If a site only needs one extra redirect, prefer `--max-redirects 4` over a much larger number. 
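One way to follow the "as low as possible" advice is to probe upward from a small value and stop at the first one that works. The sketch below is illustrative only: the `find_min_redirects` helper and the `HTML2RSS_BIN` override are hypothetical names, and it assumes the CLI exits non-zero when the redirect budget is exhausted, which the text above does not confirm.

```bash
#!/usr/bin/env bash
# Sketch: find the smallest --max-redirects value for which a feed build succeeds.
# Assumption: `html2rss feed` exits non-zero when the redirect budget runs out.
# HTML2RSS_BIN is a hypothetical override so the loop can be tried with a stand-in command.

find_min_redirects() {
  local config="$1"
  local limit="${2:-10}"
  local bin="${HTML2RSS_BIN:-html2rss}"
  local n

  for ((n = 1; n <= limit; n++)); do
    # Discard feed output; only the exit status matters for the probe.
    if "$bin" feed "$config" --max-redirects "$n" >/dev/null 2>&1; then
      echo "$n"
      return 0
    fi
  done

  # No value up to the limit worked.
  return 1
}
```

Calling `find_min_redirects your-config.yml 10` would print the first value that succeeds, or return a non-zero status if none up to the limit does. A failure for reasons unrelated to redirects looks the same to this loop, so treat the result as a hint and not as a diagnosis.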
+ ## Add It To html2rss-web Once the config works locally, add it to your `feeds.yml` or shared config repository and restart your diff --git a/src/content/docs/getting-started.mdx b/src/content/docs/getting-started.mdx index f08de5ba..30ea4175 100644 --- a/src/content/docs/getting-started.mdx +++ b/src/content/docs/getting-started.mdx @@ -1,6 +1,6 @@ --- title: "Getting Started" -description: "Learn how to get RSS feeds from any website. Start with existing feeds or create your own in minutes." +description: "Start html2rss-web locally, verify the web interface, generate your first feed URL, and decide when to move to custom configs." sidebar: order: 1 --- @@ -14,12 +14,30 @@ If you want the recommended path, go to [Run html2rss-web with Docker](/web-appl That guide is the canonical setup flow for: - running `html2rss-web` locally -- confirming your first successful feed -- deciding when to use included feeds, automatic generation, or custom configs +- confirming the interface is working +- generating a first feed URL +- deciding when to use automatic generation or custom configs ## Quick Shortcuts -- **[Run html2rss-web with Docker](/web-application/getting-started)** - Recommended first step -- **[Browse working feed examples](/feed-directory/)** - See what success looks like -- **[Create Custom Feeds](/creating-custom-feeds)** - Write configs when you need more control -- **[Troubleshooting Guide](/troubleshooting/troubleshooting)** - Fix startup or extraction problems +- **[Run html2rss-web with Docker](/web-application/getting-started)**: recommended first step +- **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)**: create a feed directly from a page URL +- **[Browse working feed examples](/feed-directory/)**: see what successful outputs look like +- **[Create Custom Feeds](/creating-custom-feeds)**: write configs when you need more control +- **[Troubleshooting Guide](/troubleshooting/troubleshooting)**: fix startup or 
extraction problems + +## Using the Ruby CLI + +If you are working directly with the gem instead of `html2rss-web`, start with: + +```bash +html2rss auto https://example.com/blog +``` + +If the target site is unusually redirect-heavy or needs extra follow-up requests, the CLI also supports: + +```bash +html2rss auto https://example.com/blog --max-redirects 10 --max-requests 5 +``` + +For config-driven runs, the same flags are available on `html2rss feed`. diff --git a/src/content/docs/index.mdx b/src/content/docs/index.mdx index 349bccae..75ab9d0e 100644 --- a/src/content/docs/index.mdx +++ b/src/content/docs/index.mdx @@ -1,101 +1,69 @@ --- -title: "Turn Any Website Into an RSS Feed - Never Miss Updates Again" -description: "Create RSS feeds from any website - no coding required. Turn blogs, news sites, and forums into RSS feeds you can follow in your favorite reader. Free, open source, and easy to use." +title: "Turn Any Website Into an RSS Feed" +description: "Run html2rss-web with Docker, open the web interface, generate stable feed URLs, and move to custom configs only when you need more control." --- -Run `html2rss-web` with Docker, start with included feeds, and add custom configs only when you need more control. +Run `html2rss-web` with Docker, open the web interface, and generate stable feed URLs from pages you want to follow. -## 🚀 Get Started in 30 Seconds +## Start Here -**Start here:** [Run html2rss-web with Docker](/web-application/getting-started) | [Browse working feed examples](/feed-directory/) +**Recommended path:** [Run html2rss-web with Docker](/web-application/getting-started) -Need more control? [Write a custom feed config](/creating-custom-feeds) +That guide is the canonical onboarding flow for: ---- +- starting a local instance +- verifying the web interface +- generating a first feed URL +- deciding when to use automatic generation or custom configs ## How It Works 1. **Run your own local instance** with Docker -2. 
**Use included feeds or add your own** website targets -3. **Subscribe from your RSS reader** using stable feed URLs - ---- - -## Why RSS Still Matters Today - -**Real examples of what you can do:** - -- Follow your favorite blogs without social media algorithms -- Get notified when your local news site posts about your neighborhood -- Track job postings from multiple company websites -- Monitor product updates from software vendors -- Follow academic papers from your field - -**RSS vs Social Media:** - -- ✅ **No algorithms** deciding what you see -- ✅ **No ads** or sponsored content -- ✅ **Works with any feed reader** you choose -- ✅ **Your data stays private** -- ✅ **Never miss updates** - automatic notifications -- ✅ **Save time** - no more manual checking - ---- +2. **Open the web interface** and paste a page URL +3. **Copy the feed URL into your reader** ## What is html2rss? -html2rss is a toolkit for turning websites into RSS feeds. Think of it as a translator that converts website content into a format your feed reader can understand. +html2rss is a toolkit for turning websites into feeds. -**Most people should start with the web application:** +Most people should start with the web application: -- **🌐 html2rss-web** - The easiest way to run your own feed server with Docker -- **⚙️ html2rss gem** - The underlying engine, CLI, and developer interface +- **`html2rss-web`**: the self-hosted web interface and feed server +- **`html2rss` gem**: the Ruby engine, CLI, and lower-level config workflow ---- - -## 🎯 Choose Your Path +## Choose Your Path ### I want a working instance first -1. 
**[Run html2rss-web with Docker](/web-application/getting-started)** - Recommended starting path -2. **[Browse working feed examples](/feed-directory/)** - See what success looks like -3. **[Use the included configs](/web-application/how-to/use-included-configs/)** - Start with ready-made feeds +1. **[Run html2rss-web with Docker](/web-application/getting-started)**: recommended starting path +2. **[Use automatic feed generation](/web-application/how-to/use-automatic-feed-generation/)**: create a feed directly from a page URL +3. **[Browse working feed examples](/feed-directory/)**: see what working outputs look like ### I need more control -1. **[Creating Custom Feeds](/creating-custom-feeds)** - Write and test your own configs -2. **[Selectors Reference](/ruby-gem/reference/selectors/)** - Learn the matching rules -3. **[Strategy Reference](/ruby-gem/reference/strategy/)** - Use `browserless` for JS-heavy sites +1. **[Creating Custom Feeds](/creating-custom-feeds)**: write and test your own configs +2. **[Selectors Reference](/ruby-gem/reference/selectors/)**: learn the matching rules +3. **[Strategy Reference](/ruby-gem/reference/strategy/)**: decide when `browserless` is justified ### I'm building or integrating -1. **[Ruby Gem Reference](/ruby-gem/)** - Full API documentation -2. **[Advanced Features](/ruby-gem/how-to/advanced-features/)** - Custom HTTP requests, etc. -3. **[Contribute to Core](/get-involved/contributing/)** - Help improve the engine - ---- - -## 🌟 What People Are Using html2rss For - -- **News & Blogs:** Follow your favorite writers without social media -- **Job Hunting:** Track job postings from multiple company sites -- **Product Updates:** Get notified when software you use gets updated -- **Academic Research:** Follow new papers in your field -- **Local News:** Stay updated on your neighborhood and city -- **Hobby Communities:** Follow forums and communities you care about - -[Browse all examples in our Feed Directory →](/feed-directory/) - ---- - -## 🔧 Common Issues? +1. **[Ruby Gem Reference](/ruby-gem/)**: full API documentation +2. **[Advanced Features](/ruby-gem/how-to/advanced-features/)**: custom HTTP requests and advanced extraction +3. 
**[Contribute to Core](/get-involved/contributing/)**: help improve the engine -**Start with Docker, not a public instance.** That gives you the most reliable path and the newest integrated behavior. +## What People Use It For -**Feed not working?** Check our [troubleshooting guide](/troubleshooting/troubleshooting) +- follow blogs and news sites without social media algorithms +- track product updates and release notes +- monitor job postings from company websites +- subscribe to forums and communities that do not publish feeds +- follow local news without repeated manual checking -**Need custom control?** Continue to [Creating Custom Feeds](/creating-custom-feeds) +## Practical Notes -**Need help?** Join our [community discussions](https://github.com/orgs/html2rss/discussions) +- Start with Docker, not a public instance. +- Use the web interface to verify the deployment first. +- Use automatic generation for the first pass. +- Move to custom configs when you need a stable, reviewable setup. -**Found a bug?** [Report it on GitHub](https://github.com/html2rss/html2rss/issues) +**Need help?** Continue to the [troubleshooting guide](/troubleshooting/troubleshooting) or join [GitHub Discussions](https://github.com/orgs/html2rss/discussions). diff --git a/src/content/docs/ruby-gem/how-to/advanced-features.mdx b/src/content/docs/ruby-gem/how-to/advanced-features.mdx index 703bd9e9..cf052a64 100644 --- a/src/content/docs/ruby-gem/how-to/advanced-features.mdx +++ b/src/content/docs/ruby-gem/how-to/advanced-features.mdx @@ -7,13 +7,13 @@ This guide covers advanced features and performance optimizations for html2rss. ## Parallel Processing -html2rss uses parallel processing to improve performance when scraping multiple items. This happens automatically and doesn't require any configuration. +html2rss uses parallel processing in auto-source discovery to improve performance when multiple scrapers inspect the same page. 
This happens automatically and doesn't require any configuration. ### How It Works -- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the page -- **Item processing:** Each scraped item is processed in parallel -- **Performance benefit:** Significantly faster when dealing with many items +- **Auto-source scraping:** Multiple scrapers run in parallel to analyze the same response body +- **Selectors and pagination:** Selector extraction and `rel="next"` pagination stay sequential and share the same request budget +- **Performance benefit:** Faster auto-discovery without changing selector semantics ### Performance Tips @@ -75,6 +75,8 @@ selectors: extractor: "href" ``` +When you use the Browserless strategy, Chromium rejects transport-level headers such as `Host`, `Connection`, `Content-Length`, and `Transfer-Encoding`. html2rss filters those headers before navigation and logs the filtered header names at `info` level. + ## Monitoring and Debugging ### Enable Debug Logging diff --git a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx index 33b6cca3..1a4d10bd 100644 --- a/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx +++ b/src/content/docs/ruby-gem/how-to/custom-http-requests.mdx @@ -5,6 +5,12 @@ description: "Learn how to customize HTTP requests with custom headers, authenti Some websites require custom HTTP headers, authentication, or other request settings to access their content. `html2rss` lets you customize requests for those cases. 
+Keep this structure in mind: + +- `headers` stays top-level +- `strategy` stays top-level +- request-specific controls such as budgets and Browserless options live under `request` + ## When You Need Custom Headers You might need custom HTTP requests when: @@ -35,6 +41,32 @@ selectors: selector: "url" ``` +## Request Controls + +Request budgets are configured under `request`, not as top-level keys: + +```yaml +headers: + User-Agent: "Mozilla/5.0 (compatible; html2rss/1.0)" +request: + max_redirects: 5 + max_requests: 6 +channel: + url: https://example.com/articles +selectors: + items: + selector: article + title: + selector: h2 + url: + selector: a + extractor: href +``` + +- `request.max_redirects` limits redirect hops +- `request.max_requests` limits the total request budget for the feed build +- `request.browserless.*` is reserved for Browserless-only behavior such as preload actions + ## Common Use Cases ### API Authentication diff --git a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx index c0e5e379..0905835e 100644 --- a/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx +++ b/src/content/docs/ruby-gem/how-to/handling-dynamic-content.mdx @@ -9,6 +9,29 @@ Some websites load their content dynamically using JavaScript. The default `html Use the [`browserless` strategy](/ruby-gem/reference/strategy) to render JavaScript-heavy websites with a headless browser. 
+Keep the strategy at the top level and put request-specific options under `request`: + +```yaml +strategy: browserless +request: + max_redirects: 5 + max_requests: 6 + browserless: + preload: + wait_for_network_idle: + timeout_ms: 5000 +channel: + url: https://example.com/app +selectors: + items: + selector: .article + title: + selector: h2 + url: + selector: a + extractor: href +``` + ## When to Use Browserless The `browserless` strategy is necessary when: @@ -18,6 +41,56 @@ The `browserless` strategy is necessary when: - **Infinite scroll** - Content loads as you scroll - **Dynamic forms** - Content changes based on user interaction +## Preload Actions + +For dynamic sites, rendering once is often not enough. Use `request.browserless.preload` to wait, click, or scroll before the +HTML snapshot is taken. + +### Wait for JavaScript Requests + +```yaml +strategy: browserless +request: + browserless: + preload: + wait_for_network_idle: + timeout_ms: 4000 +``` + +### Click "Load More" Buttons + +```yaml +strategy: browserless +request: + browserless: + preload: + click_selectors: + - selector: ".load-more" + max_clicks: 3 + delay_ms: 250 + wait_for_network_idle: + timeout_ms: 3000 +``` + +### Scroll Infinite Lists + +```yaml +strategy: browserless +request: + browserless: + preload: + scroll_down: + iterations: 5 + delay_ms: 200 + wait_for_network_idle: + timeout_ms: 2500 +``` + +These preload steps can be combined in a single config when a site needs several interactions before all items appear. + +If a click or scroll step causes a real navigation, html2rss returns the final document metadata, not the original page-load +metadata. That keeps extracted relative links anchored to the rendered page. 
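As a sketch of the combined case mentioned above, the three preload steps could sit side by side under `request.browserless.preload`. The keys and values are taken from the individual examples. The nesting follows my reading of those examples, and the order in which the steps run is not stated here, so treat both as assumptions to verify against the reference.

```yaml
# Illustrative combination of the preload steps shown above (not part of the patch).
strategy: browserless
request:
  browserless:
    preload:
      # Click "load more" up to three times, pausing between clicks.
      click_selectors:
        - selector: ".load-more"
          max_clicks: 3
          delay_ms: 250
      # Then scroll to trigger lazily loaded items.
      scroll_down:
        iterations: 5
        delay_ms: 200
      # Wait for outstanding requests to settle before the snapshot.
      wait_for_network_idle:
        timeout_ms: 3000
```

Start with the single step a site clearly needs and add the others only when items are still missing, since every extra interaction adds to the render time.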
+ ## Performance Considerations The `browserless` strategy is slower than the default `faraday` strategy because it: diff --git a/src/content/docs/ruby-gem/reference/auto-source.mdx b/src/content/docs/ruby-gem/reference/auto-source.mdx index 33454232..82e92df0 100644 --- a/src/content/docs/ruby-gem/reference/auto-source.mdx +++ b/src/content/docs/ruby-gem/reference/auto-source.mdx @@ -17,16 +17,19 @@ auto_source: {} `auto_source` uses the following strategies to find content: -1. **`schema`:** Parses `