Description
Is your feature request related to a problem? Please describe.
Bots on the internet should honor robots.txt (see RFC 9309).
Describe the solution you'd like
Check the robots.txt of every domain being crawled before crawling the actual content.
I think the tool should provide an option to ignore robots.txt, but complain about it on stdout when that option is enabled.
The downside is one additional request to the server on every crawl attempt.
Describe alternatives you've considered
Provide an additional subcommand that checks the domains in the config against their robots.txt. The user of this tool can run the command to see whether the host allows them to crawl.
This way the additional requests are only made on demand, and the user can decide to remove their crawling attempts.
Maybe integrate this into the check command, which validates the config, and report an error when the robots.txt denies a path?