Conversation

@zazu-22 zazu-22 commented Aug 3, 2025

Feature: Add bulk URL processing and global installation support

Problem

The current script only processes one URL at a time, making it inefficient for documentation sites with multiple pages. Users also need to manage dependencies manually and run the script from the project directory.

Solution

This PR adds bulk processing capabilities and global installation support to streamline documentation workflows.

Key improvements:

  • Bulk processing: Process multiple URLs from a file with --urls-file
  • Global installation: Use uv script metadata for dependency-free execution anywhere (see the metadata sketch after this list)
  • Smart error recovery: Auto-generate retry files for failed URLs
  • Better file naming: Prevent overwrites with path-based, timestamped filenames
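
For context, the global installation relies on uv's inline script metadata (PEP 723): uv reads a header like the sketch below and resolves dependencies on the fly, so the symlinked script runs from any directory. The dependency names here are illustrative, not the script's exact list.

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "requests",
#     "python-dotenv",
# ]
# ///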

Example usage:

# Install globally
ln -sf "$(pwd)/generate-llmstxt.py" ~/.local/bin/generate-llmstxt

# Process multiple URLs
echo "https://docs.example.com/page1" >> urls.txt
echo "https://docs.example.com/page2" >> urls.txt
generate-llmstxt --urls-file urls.txt

# Auto-retry failures
generate-llmstxt --urls-file urls-failed.txt
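
A minimal sketch of the retry behavior, assuming a simple batch loop (process_url and the variable names are placeholders, not the script's actual internals):

failed_urls = []
for url in urls:
    try:
        process_url(url)          # per-URL fetch/generate step
    except Exception as exc:
        print(f"Failed: {url} ({exc})")
        failed_urls.append(url)   # keep going; one failure doesn't stop the batch

if failed_urls:
    with open("urls-failed.txt", "w") as fh:
        fh.write("\n".join(failed_urls) + "\n")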

Output changes:

  • Before: docs.example.com-llms.txt (overwrites on same domain)
  • After: docs_example_page1.txt, docs_example_page2.txt (unique files)
  • Bulk mode: Individual files + docs_example_consolidated_index_20250103_143022.txt
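
A rough sketch of this naming scheme (the helper functions and exact slug rules are assumptions; the real logic lives in generate-llmstxt.py):

from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse

def page_filename(url: str) -> str:
    # https://docs.example.com/page1 -> docs_example_page1.txt
    parsed = urlparse(url)
    domain = "_".join(parsed.netloc.split(".")[:-1]) or parsed.netloc
    path = parsed.path.strip("/")
    stem = "_".join(PurePosixPath(path).with_suffix("").parts) if path else "index"
    return f"{domain}_{stem}.txt"

def index_filename(domain_slug: str) -> str:
    # Timestamp keeps repeated batch runs from overwriting each other.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{domain_slug}_consolidated_index_{stamp}.txt"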

Impact

  • Backward compatible: All existing commands work unchanged
  • Performance: Process entire documentation sites in one command
  • Reliability: Failed URLs don't stop the entire batch
  • Usability: Install once, run anywhere with automatic dependency management

Testing

Tested with various scenarios including API failures, mixed success/failure batches, and different URL patterns from sites like modelcontextprotocol.io.

Files changed:

  • generate-llmstxt.py - Core enhancements
  • README.md - Updated documentation
  • CHANGELOG.md - Detailed change log
  • pyproject.toml - uv dependencies
  • sample-urls.txt - Example file

zazu-22 and others added 5 commits August 2, 2025 21:25
- Add uv script support with inline dependencies for global installation
- Implement intelligent filename generation that removes extensions and creates unique names
- Add bulk processing support via --urls-file argument
- Generate consolidated index for bulk operations instead of separate index files
- Add comprehensive error handling with failed URL tracking
- Auto-generate urls-failed.txt for easy retry of failed URLs
- Improve environment variable loading to work from any directory
- Add sample URLs file for testing
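
One common way to achieve the environment-variable change, shown here as an assumption rather than the commit's exact code, is to resolve .env relative to the script instead of the current working directory:

from pathlib import Path
from dotenv import load_dotenv

# Look for .env next to the real script file, so a symlinked global install
# still finds API keys regardless of the current working directory.
load_dotenv(Path(__file__).resolve().parent / ".env")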

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update README.md with comprehensive documentation of new features:
  * Global installation using uv script support
  * Bulk URL processing via --urls-file
  * Smart filename generation and collision prevention
  * Error recovery with automatic retry file generation
  * Enhanced examples and usage patterns

- Create CHANGELOG.md documenting all improvements:
  * Detailed feature descriptions with technical details
  * Migration guide for existing users
  * Before/after examples showing new capabilities
  * Semver-compliant versioning structure

This documentation update prepares the repository for PR submission
to the main project, clearly outlining the value and scope of improvements.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
When processing multiple batches from the same domain (especially during
retry operations), the consolidated index files would overwrite each other.
The index filename now includes a timestamp in the format
domain_consolidated_index_YYYYMMDD_HHMMSS.txt.

This ensures each batch operation creates a unique index file, making it
easier to track processing history and compare results across runs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update CHANGELOG.md to include timestamp protection feature
- Update README.md with examples of timestamped index files
- Add explanation of how multiple batch processing creates unique files
- Show examples of processing history preservation across runs

This completes the documentation for the timestamp feature that prevents
consolidated index file overwrites during retry operations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…empted

The consolidated index header was showing the total number of URLs attempted
rather than the actual number successfully processed. This was misleading
when some URLs failed during processing.

Changed from: 'X URLs processed' (total attempted)
To: 'X URLs processed successfully' (actual successful count)

Now accurately reflects the number of entries actually in the index file.
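
In sketch form (variable names are hypothetical):

# Before (counted every URL attempted, even failures):
header = f"{len(urls)} URLs processed"

# After (counts only the entries actually written to the index):
header = f"{len(successful_results)} URLs processed successfully"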

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>