Commit 85ac293

docs: unify crawl caps and fix the runnable code examples (#1002)
Fixes the "Run on Apify" examples in the Python SDK guides.

- Lower the page cap from 50 to 10 across all crawling examples, so the browser-based ones finish within the 180s runnable-demo timeout.
- **Selenium**: slim the runnable example to a plain crawler (the full version was too large to encode into the Run-on-Apify URL and failed with HTTP 414), and move the proxy-auth extension into the "Using Apify Proxy" section as a separate, non-runnable extension snippet.
- **Browser Use**: make it non-runnable (it needs an LLM API key the shared runner cannot provide), with a comment explaining why. Same note added to the Scrapy and Scrapling-browser examples.
- Keep both **Scrapling** examples runnable.
- **Pydantic**: log a readable validation summary and fail cleanly via `Actor.fail` instead of re-raising into a raw traceback.

1 parent 2b5d64d commit 85ac293

17 files changed

Lines changed: 114 additions & 92 deletions

docs/03_guides/04_selenium.mdx

Lines changed: 10 additions & 2 deletions
@@ -4,9 +4,11 @@ title: Browser automation with Selenium
 description: Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.
 ---

+import CodeBlock from '@theme/CodeBlock';
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

 import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';
+import SeleniumProxyExample from '!!raw-loader!./code/04_selenium_proxy.py';

 In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.

@@ -42,9 +44,15 @@ It uses Selenium ChromeDriver to open the pages in an automated Chrome browser,

 ## Using Apify Proxy

-Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
+Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The runnable example Actor skips the proxy to stay simple. This section extends it to route the browser through Apify Proxy. The snippet below isn't a complete, runnable Actor on its own. It shows only the proxy-specific parts you add to the example Actor.

-Chrome ignores the credentials passed in the `--proxy-server` flag. Because of that, configure an authenticated proxy such as Apify Proxy from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
+Chrome ignores the credentials passed in the `--proxy-server` flag. To use an authenticated proxy such as Apify Proxy, configure it from inside a small extension. The `proxy_auth_extension` helper builds one at runtime. Its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. The proxy-aware `build_chrome_driver` below replaces the simple one from the example Actor and loads that extension. The new headless mode (`--headless=new`) is required for Chrome to load it.
+
+<CodeBlock className="language-python">
+{SeleniumProxyExample}
+</CodeBlock>
+
+To wire the proxy into the example Actor, create the proxy configuration in `main` with `Actor.create_proxy_configuration`, get a URL with `await proxy_configuration.new_url()`, and pass it to `build_chrome_driver`. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).

 ## Conclusion

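The proxy section above relies on splitting one proxy URL into a server address and a pair of credentials, because Chrome does not accept the credentials inline. The following is a minimal, standard-library-only sketch of that split. The helper name `split_proxy_url` and the URL are hypothetical placeholders for illustration, not part of the commit or of the Apify SDK.

```python
import json
from urllib.parse import urlsplit


def split_proxy_url(proxy_url: str) -> tuple[str, dict[str, str]]:
    """Split a proxy URL into a credential-free server address and credentials."""
    parts = urlsplit(proxy_url)
    # The server address carries no username or password.
    server = f'{parts.scheme}://{parts.hostname}:{parts.port}'
    credentials = {
        'username': parts.username or '',
        'password': parts.password or '',
    }
    return server, credentials


# Placeholder URL for illustration only, not a real Apify Proxy endpoint.
server, credentials = split_proxy_url(
    'http://groups-DEMO:secret@proxy.example.com:8000'
)
print(server)
print(json.dumps(credentials))
```

In the commit's `proxy_auth_extension`, the same `urlsplit` fields feed the extension's proxy settings and its authentication listener.
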
docs/03_guides/06_scrapy.mdx

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli

 The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.

+{/* Not runnable from the docs: a Scrapy Actor is a multi-file project, while the "Run on Apify" runner executes a single self-contained snippet. */}
 <Tabs>
 <TabItem value="__main__.py" label="__main__.py">
 <CodeBlock className="language-python">

docs/03_guides/07_scrapling.mdx

Lines changed: 3 additions & 4 deletions
@@ -4,11 +4,10 @@ title: Adaptive scraping with Scrapling
 description: Build an Apify Actor that scrapes web pages using the Scrapling adaptive web scraping library.
 ---

-import CodeBlock from '@theme/CodeBlock';
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

 import ScraplingExample from '!!raw-loader!roa-loader!./code/07_scrapling.py';
-import ScraplingBrowserScraper from '!!raw-loader!./code/07_scrapling_browser.py';
+import ScraplingBrowserScraper from '!!raw-loader!roa-loader!./code/07_scrapling_browser.py';

 In this guide, you'll learn how to use the [Scrapling](https://scrapling.readthedocs.io/) library for adaptive web scraping in your Apify Actors.

@@ -101,9 +100,9 @@ scrapling install

 To switch the example from HTTP to a real browser, fetch each page through a browser session instead of `AsyncFetcher`. Opening a fresh browser for every page would be wasteful, so `main` enters an `AsyncDynamicSession` once and reuses it for the whole crawl, while `scrape_page` fetches with `session.fetch`. The parsing API is identical, so the extraction code stays the same:

-<CodeBlock className="language-python">
+<RunnableCodeBlock className="language-python" language="python">
 {ScraplingBrowserScraper}
-</CodeBlock>
+</RunnableCodeBlock>

 Note that:

docs/03_guides/09_browser_use.mdx

Lines changed: 5 additions & 4 deletions
@@ -4,9 +4,9 @@ title: Browser AI agents with Browser Use
 description: Build an Apify Actor that automates a browser with an LLM agent using the Browser Use library.
 ---

-import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
+import CodeBlock from '@theme/CodeBlock';

-import BrowserUseExample from '!!raw-loader!roa-loader!./code/09_browser_use.py';
+import BrowserUseExample from '!!raw-loader!./code/09_browser_use.py';

 In this guide, you'll learn how to use the [Browser Use](https://browser-use.com/) library to drive a browser with an LLM agent in your Apify Actors.

@@ -46,9 +46,10 @@ The following Actor runs a Browser Use agent for a single task and stores its st

 The whole Actor fits in a single file. A `run_agent_task` helper holds the Browser Use-specific logic: it defines the output schema and builds the LLM, browser, and agent. The `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy), runs the agent, and stores the result:

-<RunnableCodeBlock className="language-python" language="python">
+{/* Not runnable from the docs: the agent needs an LLM API key (OPENAI_API_KEY) that the shared example runner does not provide. */}
+<CodeBlock className="language-python">
 {BrowserUseExample}
-</RunnableCodeBlock>
+</CodeBlock>

 Note that:

docs/03_guides/11_pydantic.mdx

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ The following Actor declares its input as a Pydantic `BaseModel`, validates the
 ### About the validation

 - `model_validate` parses the raw dictionary into a typed `ActorInput` instance. It fills in defaults and guarantees every field is valid, or raises a `ValidationError` that describes every problem at once.
-- Catching that error, logging a readable summary, and re-raising makes the Actor fail fast with a clear explanation right at the start, rather than crashing with an obscure error somewhere deep in the run. Because the body runs inside `async with Actor:`, the re-raised exception automatically marks the run as `FAILED`.
+- Catching that error, logging a readable summary, and failing the run with <ApiLink to="class/Actor#fail">`Actor.fail`</ApiLink> marks the run as `FAILED` with a clear status message. It fails fast right at the start with a readable explanation, instead of crashing with a raw traceback deeper in the run.
 - The error messages refer to the fields by their input-schema aliases. For invalid input like `{"searchTerms": [], "maxResults": 999, "outputFormat": "xml"}`, the log shows exactly what's wrong:

 ```text

docs/03_guides/code/01_beautifulsoup_httpx.py

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ async def main() -> None:
             await request_queue.add_request(Request.from_url(url))

         # Cap the crawl. Raise or remove the limit to follow more pages.
-        max_requests = 50
+        max_requests = 10
         handled_requests = 0

         while handled_requests < max_requests and (

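The capped loop that this hunk tunes can be sketched without the Apify SDK. The version below is an illustration under stated assumptions: `InMemoryQueue` is a hypothetical stand-in for the real request queue, and the requests are plain URL strings rather than `Request` objects. It shows why lowering `max_requests` bounds the run: the cap is checked before each fetch, so nothing more is pulled from the queue once it is reached.

```python
from __future__ import annotations

import asyncio
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for a request queue; not the Apify SDK class."""

    def __init__(self, urls: list[str]) -> None:
        self._pending = deque(urls)

    async def fetch_next_request(self) -> str | None:
        # Return the next URL, or None when the queue is empty.
        return self._pending.popleft() if self._pending else None


async def crawl(urls: list[str], max_requests: int = 10) -> list[str]:
    queue = InMemoryQueue(urls)
    handled: list[str] = []
    # The cap is evaluated first, so no request is fetched after it is hit.
    while len(handled) < max_requests and (
        request := await queue.fetch_next_request()
    ):
        handled.append(request)
    return handled


urls = [f'https://example.com/page/{i}' for i in range(25)]
handled = asyncio.run(crawl(urls))
print(len(handled))
```

The loop also stops early when the queue runs dry, so a crawl with fewer pages than the cap simply handles all of them.
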
docs/03_guides/code/02_parsel_impit.py

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ async def main() -> None:
             await request_queue.add_request(Request.from_url(url))

         # Cap the crawl. Raise or remove the limit to follow more pages.
-        max_requests = 50
+        max_requests = 10
         handled_requests = 0

         while handled_requests < max_requests and (

docs/03_guides/code/03_playwright.py

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ async def main() -> None:
             await request_queue.add_request(Request.from_url(url))

         # Cap the crawl. Raise or remove the limit to follow more pages.
-        max_requests = 50
+        max_requests = 10
         handled_requests = 0

         Actor.log.info('Launching Playwright...')

docs/03_guides/code/04_selenium.py

Lines changed: 4 additions & 70 deletions
@@ -1,10 +1,6 @@
 import asyncio
-import json
-from pathlib import Path
-from tempfile import mkdtemp
 from typing import Any
 from urllib.parse import urljoin, urlsplit
-from zipfile import ZipFile

 from selenium import webdriver
 from selenium.webdriver.chrome.options import Options as ChromeOptions
@@ -18,71 +14,17 @@
 # On the Apify platform, it's already in the Actor's Docker image.


-def proxy_auth_extension(proxy_url: str) -> str:
-    """Build a Chrome extension that routes Chrome through an authenticated proxy."""
-    parts = urlsplit(proxy_url)
-
-    manifest = {
-        'name': 'Apify Proxy',
-        'version': '1.0.0',
-        'manifest_version': 3,
-        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
-        'host_permissions': ['<all_urls>'],
-        'background': {'service_worker': 'background.js'},
-        'minimum_chrome_version': '108',
-    }
-
-    # The service worker sets the proxy and answers the auth challenge.
-    proxy_config = json.dumps(
-        {
-            'mode': 'fixed_servers',
-            'rules': {
-                'singleProxy': {
-                    'scheme': parts.scheme,
-                    'host': parts.hostname,
-                    'port': parts.port,
-                },
-            },
-        }
-    )
-    credentials = json.dumps(
-        {'username': parts.username or '', 'password': parts.password or ''}
-    )
-    background = (
-        'chrome.proxy.settings.set('
-        '{value: ' + proxy_config + ', scope: "regular"});\n'
-        'chrome.webRequest.onAuthRequired.addListener(\n'
-        ' () => ({authCredentials: ' + credentials + '}),\n'
-        ' {urls: ["<all_urls>"]},\n'
-        ' ["blocking"],\n'
-        ');\n'
-    )
-
-    extension_path = Path(mkdtemp()) / 'apify_proxy.zip'
-    with ZipFile(extension_path, 'w') as archive:
-        archive.writestr('manifest.json', json.dumps(manifest))
-        archive.writestr('background.js', background)
-    return str(extension_path)
-
-
-def build_chrome_driver(proxy_url: str | None = None) -> webdriver.Chrome:
-    """Create a headless Chrome WebDriver, optionally routed through a proxy."""
+def build_chrome_driver() -> webdriver.Chrome:
+    """Create a headless Chrome WebDriver suitable for a container."""
     chrome_options = ChromeOptions()

     if Actor.configuration.headless:
-        # The new headless mode is required to load the proxy extension.
         chrome_options.add_argument('--headless=new')

     chrome_options.add_argument('--no-sandbox')
     chrome_options.add_argument('--disable-dev-shm-usage')
     chrome_options.add_argument('--disable-gpu')

-    if proxy_url:
-        chrome_options.add_extension(proxy_auth_extension(proxy_url))
-        chrome_options.add_argument(
-            '--disable-features=DisableLoadExtensionCommandLineSwitch'
-        )
-
     return webdriver.Chrome(options=chrome_options)


@@ -140,9 +82,6 @@ async def main() -> None:
             Actor.log.info('No start URLs specified in Actor input, exiting...')
             await Actor.exit()

-        # Selenium proxies at the browser level, so one URL is shared per run.
-        proxy_configuration = await Actor.create_proxy_configuration()
-
         # Open the request queue and enqueue the start URLs (crawl depth 0).
         request_queue = await Actor.open_request_queue()
         for start_url in start_urls:
@@ -151,16 +90,11 @@ async def main() -> None:
             await request_queue.add_request(Request.from_url(url))

         # Cap the crawl. Raise or remove the limit to follow more pages.
-        max_requests = 50
+        max_requests = 10
         handled_requests = 0

-        # Fresh proxy URL for the run (None if no proxy).
-        proxy_url = None
-        if proxy_configuration:
-            proxy_url = await proxy_configuration.new_url()
-
         Actor.log.info('Launching Chrome WebDriver...')
-        driver = build_chrome_driver(proxy_url)
+        driver = build_chrome_driver()

         while handled_requests < max_requests and (
             request := await request_queue.fetch_next_request()

docs/03_guides/code/04_selenium_proxy.py

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+import json
+from pathlib import Path
+from tempfile import mkdtemp
+from urllib.parse import urlsplit
+from zipfile import ZipFile
+
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options as ChromeOptions
+
+from apify import Actor
+
+
+def proxy_auth_extension(proxy_url: str) -> str:
+    """Build a Chrome extension that routes Chrome through an authenticated proxy."""
+    parts = urlsplit(proxy_url)
+
+    manifest = {
+        'name': 'Apify Proxy',
+        'version': '1.0.0',
+        'manifest_version': 3,
+        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
+        'host_permissions': ['<all_urls>'],
+        'background': {'service_worker': 'background.js'},
+        'minimum_chrome_version': '108',
+    }
+
+    # The service worker sets the proxy and answers the auth challenge.
+    proxy_config = json.dumps(
+        {
+            'mode': 'fixed_servers',
+            'rules': {
+                'singleProxy': {
+                    'scheme': parts.scheme,
+                    'host': parts.hostname,
+                    'port': parts.port,
+                },
+            },
+        }
+    )
+    credentials = json.dumps(
+        {'username': parts.username or '', 'password': parts.password or ''}
+    )
+    background = (
+        'chrome.proxy.settings.set('
+        '{value: ' + proxy_config + ', scope: "regular"});\n'
+        'chrome.webRequest.onAuthRequired.addListener(\n'
+        ' () => ({authCredentials: ' + credentials + '}),\n'
+        ' {urls: ["<all_urls>"]},\n'
+        ' ["blocking"],\n'
+        ');\n'
+    )
+
+    extension_path = Path(mkdtemp()) / 'apify_proxy.zip'
+    with ZipFile(extension_path, 'w') as archive:
+        archive.writestr('manifest.json', json.dumps(manifest))
+        archive.writestr('background.js', background)
+    return str(extension_path)
+
+
+def build_chrome_driver(proxy_url: str) -> webdriver.Chrome:
+    """Create a headless Chrome WebDriver routed through an authenticated proxy."""
+    chrome_options = ChromeOptions()
+
+    if Actor.configuration.headless:
+        # The new headless mode is required to load the proxy extension.
+        chrome_options.add_argument('--headless=new')
+
+    chrome_options.add_argument('--no-sandbox')
+    chrome_options.add_argument('--disable-dev-shm-usage')
+    chrome_options.add_argument('--disable-gpu')
+
+    # Load the proxy extension and keep it enabled in headless mode.
+    chrome_options.add_extension(proxy_auth_extension(proxy_url))
+    chrome_options.add_argument(
+        '--disable-features=DisableLoadExtensionCommandLineSwitch'
+    )
+
+    return webdriver.Chrome(options=chrome_options)

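The packaging step in `proxy_auth_extension` (two files written into a zip archive) can be checked in isolation. The sketch below is an assumption-laden illustration rather than the commit's code: it writes to an in-memory buffer instead of a temporary directory, and the manifest and service-worker contents are trimmed placeholders. It then reads the archive back to confirm both entries are present.

```python
import json
from io import BytesIO
from zipfile import ZipFile

# Trimmed placeholder manifest; the commit's manifest has more fields.
manifest = {
    'name': 'Demo Proxy',
    'version': '1.0.0',
    'manifest_version': 3,
    'background': {'service_worker': 'background.js'},
}
background = '// placeholder service worker\n'

# Write both extension files into an in-memory zip archive.
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('manifest.json', json.dumps(manifest))
    archive.writestr('background.js', background)

# Read the archive back to verify its entries and the manifest content.
with ZipFile(buffer) as archive:
    names = archive.namelist()
    loaded = json.loads(archive.read('manifest.json'))

print(names)
print(loaded['background']['service_worker'])
```

A round trip like this is a cheap way to catch a malformed manifest before Chrome is asked to load the extension.
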