Why does Crawlee use a single persistent crawler instance in server mode instead of creating one per request? #3261
Why does the Crawlee documentation use a single long-running crawler instance with a request map when handling multiple HTTP requests on a server?

Reference: https://crawlee.dev/js/docs/guides/running-in-web-server

Doc server example:

```js
import { randomUUID } from 'node:crypto';
import { CheerioCrawler } from 'crawlee';
import { createServer } from 'http';

const requests = new Map();

const crawler = new CheerioCrawler({
    // keep the crawler running even when the request queue is empty
    keepAlive: true,
    requestHandler: async ({ request, $ }) => {
        const send = requests.get(request.uniqueKey);
        send(JSON.stringify({ title: $('title').text() }));
        requests.delete(request.uniqueKey);
    },
});

// start the crawler loop; without this (and keepAlive), queued requests
// would never be processed
crawler.run();

createServer(async (req, res) => {
    const url = new URL(req.url, 'http://localhost:3000').searchParams.get('url');
    if (!url) {
        res.writeHead(400).end(`{"error":"missing url"}`);
        return;
    }
    const uniqueKey = randomUUID();
    requests.set(uniqueKey, (data) => res.writeHead(200).end(data));
    await crawler.addRequests([{ url, uniqueKey }]);
}).listen(3000);
```

Can we instead create a crawler instance inside a function, run it with a single URL, and return the scraped data? Will this cause any issues?

My version:

```js
import { CheerioCrawler } from 'crawlee';

export async function fetchTitleOnce(url) {
    return new Promise(async (resolve) => {
        const crawler = new CheerioCrawler({
            requestHandler: ({ $, request }) => {
                resolve({ url: request.url, title: $('title').text() });
            },
        });
        await crawler.run([{ url }]);
    });
}
```
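One Crawlee-independent pitfall in the one-shot version: passing an `async` function to the `Promise` constructor means any rejection inside it is silently swallowed, so if `crawler.run()` throws, the returned promise never settles. A minimal stdlib-only sketch of a safer shape, where `runCrawl(url, handler)` is a hypothetical stand-in for constructing a crawler and calling `crawler.run()`:

```javascript
// A deferred-style wrapper that avoids the `new Promise(async (resolve) => ...)`
// antipattern. `runCrawl(url, handler)` is a hypothetical stand-in for creating
// a crawler and running it; `handler` plays the role of requestHandler.
async function fetchOnce(url, runCrawl) {
  let settle;
  const result = new Promise((resolve) => { settle = resolve; });
  // An error thrown by runCrawl now rejects fetchOnce's returned promise
  // instead of disappearing inside an async Promise executor.
  await runCrawl(url, (data) => settle(data));
  return result;
}

// Usage with a fake one-request "crawler" that immediately reports a title:
const fakeRun = async (url, handler) => handler({ url, title: 'Example Domain' });
fetchOnce('https://example.com', fakeRun).then((page) => console.log(page.title));
```

Note this still leaves the promise pending if the handler is never invoked (e.g. the request fails before the handler runs), which the original version shares; a production wrapper would also want a rejection path for handler errors.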
Hello and thank you for your interest in Crawlee! With some caveats, it is indeed possible to just make a fresh crawler instance for each request. That said, when you use a single crawler instance, you avoid problems with sharing the filesystem storage (extra setup would be needed to prevent clashes with the crawler-per-request approach). Also, a single instance runs all the request handlers through its `AutoscaledPool`, which automatically scales concurrency to the available system resources.
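Stripped of Crawlee entirely, the single-instance pattern from the guide boils down to a Map of pending resolvers keyed by a unique ID, drained by one long-lived consumer. A stdlib-only sketch, where `processJob`, `submit`, and `drain` are hypothetical stand-ins for the requestHandler, the HTTP handler, and the crawler loop respectively:

```javascript
// Single-consumer correlation pattern: many callers park a resolver in
// `pending` under a unique key; one shared worker drains the queue and routes
// each result back to the caller that asked for it.
const pending = new Map(); // uniqueKey -> resolve function of a waiting caller
const queue = [];          // shared work queue, like crawler.addRequests()
let nextKey = 0;

async function processJob(job) {
  // Pretend to scrape; in the real pattern this is where Cheerio parses the page.
  return { url: job.url, title: `title of ${job.url}` };
}

// One long-lived consumer, analogous to the single persistent crawler instance.
async function drain() {
  while (queue.length > 0) {
    const job = queue.shift();
    const resolve = pending.get(job.uniqueKey);
    pending.delete(job.uniqueKey);
    resolve(await processJob(job));
  }
}

// Each "HTTP request" enqueues work and waits for its own result.
function submit(url) {
  return new Promise((resolve) => {
    const uniqueKey = `key-${nextKey++}`;
    pending.set(uniqueKey, resolve);
    queue.push({ url, uniqueKey });
  });
}

// Two concurrent callers each get back exactly their own result:
Promise.all([submit('https://a.example'), submit('https://b.example')])
  .then((results) => console.log(results.map((r) => r.title)));
drain();
```

For the crawler-per-request approach, the "extra setup" would be isolating storage per instance; if memory serves, Crawlee's `Configuration` exposes options such as `persistStorage: false` for this, though you should check the configuration docs for your version.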
I wouldn't worry about that too much. With a fresh instance each time, each request will be handled with a new session, essentially a fresh "user profile". You'll lose some efficiency, since cookies and sessions won't be reused across requests, but anti-blocking performance should be mostly unaffected. You should still test it, though; this is just an educated guess.