docs: unify crawl caps and fix the runnable code examples (#1002)
Fixes the "Run on Apify" examples in the Python SDK guides.
- Lower the page cap from 50 to 10 across all crawling examples, so the
browser-based ones finish within the 180s runnable-demo timeout.
- **Selenium**: slim the runnable example to a plain crawler (the full
version was too large to encode into the Run-on-Apify URL and failed
with HTTP 414), and move the proxy-auth extension into the "Using Apify
Proxy" section as a separate, non-runnable extension snippet.
- **Browser Use**: make it non-runnable (it needs an LLM API key the
shared runner cannot provide), with a comment explaining why. Same note
added to the Scrapy and Scrapling-browser examples.
- Keep both **Scrapling** examples runnable.
- **Pydantic**: log a readable validation summary and fail cleanly via
`Actor.fail` instead of re-raising into a raw traceback.
In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.
@@ -42,9 +44,15 @@ It uses Selenium ChromeDriver to open the pages in an automated Chrome browser,
## Using Apify Proxy
-
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
+
Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The runnable example Actor skips the proxy to stay simple. This section extends it to route the browser through Apify Proxy. The snippet below isn't a complete, runnable Actor on its own. It shows only the proxy-specific parts you add to the example Actor.
-
Chrome ignores the credentials passed in the `--proxy-server` flag. Because of that, configure an authenticated proxy such as Apify Proxy from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
+
Chrome ignores the credentials passed in the `--proxy-server` flag. To use an authenticated proxy such as Apify Proxy, configure it from inside a small extension. The `proxy_auth_extension` helper builds one at runtime. Its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. The proxy-aware `build_chrome_driver` below replaces the simple one from the example Actor and loads that extension. The new headless mode (`--headless=new`) is required for Chrome to load it.
+
+
<CodeBlock className="language-python">
+
{SeleniumProxyExample}
+
</CodeBlock>
+
+
To wire the proxy into the example Actor, create the proxy configuration in `main` with `Actor.create_proxy_configuration`, get a URL with `await proxy_configuration.new_url()`, and pass it to `build_chrome_driver`. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
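The diff does not show the body of `proxy_auth_extension`, so the following is only a minimal sketch of the technique the text describes, not the guide's helper. It writes an unpacked Manifest V3 extension whose service worker sets a fixed proxy server and answers the authentication challenge. The function name `build_proxy_auth_extension`, the file layout, and the example proxy URL are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlsplit


def build_proxy_auth_extension(proxy_url: str, target_dir: Path) -> Path:
    """Write an unpacked Manifest V3 extension that configures an authenticated proxy.

    Hypothetical stand-in for the guide's `proxy_auth_extension` helper.
    """
    parts = urlsplit(proxy_url)
    if not parts.hostname or not parts.port:
        raise ValueError('Proxy URL must include a host and a port.')

    manifest = {
        'manifest_version': 3,
        'name': 'Proxy auth (sketch)',
        'version': '1.0',
        # `webRequestAuthProvider` lets a Manifest V3 extension answer auth challenges.
        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
        'host_permissions': ['<all_urls>'],
        'background': {'service_worker': 'background.js'},
    }

    proxy_settings = {
        'mode': 'fixed_servers',
        'rules': {
            'singleProxy': {
                'scheme': parts.scheme,
                'host': parts.hostname,
                'port': parts.port,
            },
        },
    }
    # Credentials in a URL are percent-encoded, so decode them before use.
    credentials = {
        'username': unquote(parts.username or ''),
        'password': unquote(parts.password or ''),
    }

    # json.dumps produces valid JavaScript literals and escapes the values.
    background_js = (
        f'chrome.proxy.settings.set({{value: {json.dumps(proxy_settings)}, scope: "regular"}});\n'
        'chrome.webRequest.onAuthRequired.addListener(\n'
        f'  (details, callback) => callback({{authCredentials: {json.dumps(credentials)}}}),\n'
        '  {urls: ["<all_urls>"]},\n'
        '  ["asyncBlocking"],\n'
        ');\n'
    )

    target_dir.mkdir(parents=True, exist_ok=True)
    (target_dir / 'manifest.json').write_text(json.dumps(manifest, indent=2), encoding='utf-8')
    (target_dir / 'background.js').write_text(background_js, encoding='utf-8')
    return target_dir


if __name__ == '__main__':
    extension_dir = build_proxy_auth_extension(
        'http://user:secret@proxy.example.com:8000',  # placeholder, not a real proxy
        Path(tempfile.mkdtemp()) / 'proxy_auth',
    )
    print(sorted(path.name for path in extension_dir.iterdir()))
```

In an Actor, the URL would come from `await proxy_configuration.new_url()` rather than a hard-coded string. How the resulting directory is handed to Chrome (for example through a `--load-extension` argument) depends on the Chrome and Selenium versions in use, so treat that step as something to verify against the guide's `build_chrome_driver`.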
docs/03_guides/06_scrapy.mdx
Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli
The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
+
{/* Not runnable from the docs: a Scrapy Actor is a multi-file project, while the "Run on Apify" runner executes a single self-contained snippet. */}
In this guide, you'll learn how to use the [Scrapling](https://scrapling.readthedocs.io/) library for adaptive web scraping in your Apify Actors.
@@ -101,9 +100,9 @@ scrapling install
To switch the example from HTTP to a real browser, fetch each page through a browser session instead of `AsyncFetcher`. Opening a fresh browser for every page would be wasteful, so `main` enters an `AsyncDynamicSession` once and reuses it for the whole crawl, while `scrape_page` fetches with `session.fetch`. The parsing API is identical, so the extraction code stays the same:
In this guide, you'll learn how to use the [Browser Use](https://browser-use.com/) library to drive a browser with an LLM agent in your Apify Actors.
@@ -46,9 +46,10 @@ The following Actor runs a Browser Use agent for a single task and stores its st
The whole Actor fits in a single file. A `run_agent_task` helper holds the Browser Use-specific logic: it defines the output schema and builds the LLM, browser, and agent. The `main` coroutine handles the [Actor](https://docs.apify.com/platform/actors) lifecycle, reads the input, sets up [Apify Proxy](https://docs.apify.com/platform/proxy), runs the agent, and stores the result:
docs/03_guides/11_pydantic.mdx
Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ The following Actor declares its input as a Pydantic `BaseModel`, validates the
### About the validation
- `model_validate` parses the raw dictionary into a typed `ActorInput` instance. It fills in defaults and guarantees every field is valid, or raises a `ValidationError` that describes every problem at once.
-
- Catching that error, logging a readable summary, and re-raising makes the Actor fail fast with a clear explanation right at the start, rather than crashing with an obscure error somewhere deep in the run. Because the body runs inside `async with Actor:`, the re-raised exception automatically marks the run as `FAILED`.
+
- Catching that error, logging a readable summary, and failing the run with <ApiLink to="class/Actor#fail">`Actor.fail`</ApiLink> marks the run as `FAILED` with a clear status message. It fails fast right at the start with a readable explanation, instead of crashing with a raw traceback deeper in the run.
- The error messages refer to the fields by their input-schema aliases. For invalid input like `{"searchTerms": [], "maxResults": 999, "outputFormat": "xml"}`, the log shows exactly what's wrong:
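
Separately from the guide's own code and log output, the summary step could be sketched roughly as follows. This is an assumption-laden illustration: `summarize_validation_errors` is a hypothetical helper, and the sample entries only mimic the shape of the dictionaries returned by Pydantic's `ValidationError.errors()` (`loc` and `msg` keys). Their messages are made up and are not the Actor's real output.

```python
def summarize_validation_errors(errors: list[dict]) -> str:
    """Turn Pydantic-style error entries into a short, readable summary."""
    lines = []
    for error in errors:
        # `loc` is a tuple of path segments, e.g. ('maxResults',) or ('items', 0).
        location = '.'.join(str(part) for part in error['loc']) or '<root>'
        lines.append(f'  - {location}: {error["msg"]}')
    return 'Invalid Actor input:\n' + '\n'.join(lines)


if __name__ == '__main__':
    # Made-up sample entries, for illustration only.
    sample_errors = [
        {'loc': ('searchTerms',), 'msg': 'example message about an empty list'},
        {'loc': ('maxResults',), 'msg': 'example message about an out-of-range value'},
    ]
    print(summarize_validation_errors(sample_errors))
```

In an Actor, the list would come from the caught exception's `errors()` method, and the resulting string could be logged and then passed to `Actor.fail` as the status message, assuming the SDK version in use accepts one.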