Conversation

@zazu-22 zazu-22 commented Aug 3, 2025

Feature: Add bulk URL processing and global installation support

Problem

The current script only processes one URL at a time, making it inefficient for documentation sites with multiple pages. Users also need to manage dependencies manually and run the script from the project directory.

Solution

This PR adds bulk processing capabilities and global installation support to streamline documentation workflows.

Key improvements:

  • Bulk processing: Process multiple URLs from a file with --urls-file
  • Global installation: Use uv script metadata for dependency-free execution anywhere (see the metadata sketch after this list)
  • Smart error recovery: Auto-generate retry files for failed URLs
  • Better file naming: Prevent overwrites with path-based, timestamped filenames
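
For context, the global installation relies on uv's inline script metadata (PEP 723): uv reads a header like the sketch below and resolves dependencies on the fly, so the symlinked script runs from any directory. The dependency names here are illustrative, not the script's exact list.

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "requests",
#     "python-dotenv",
# ]
# ///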

Example usage:

# Install globally
ln -sf "$(pwd)/generate-llmstxt.py" ~/.local/bin/generate-llmstxt

# Process multiple URLs
echo "https://docs.example.com/page1" >> urls.txt
echo "https://docs.example.com/page2" >> urls.txt
generate-llmstxt --urls-file urls.txt

# Auto-retry failures
generate-llmstxt --urls-file urls-failed.txt
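
A minimal sketch of the retry behavior, assuming a simple batch loop (process_url and the variable names are placeholders, not the script's actual internals):

failed_urls = []
for url in urls:
    try:
        process_url(url)          # per-URL fetch/generate step
    except Exception as exc:
        print(f"Failed: {url} ({exc})")
        failed_urls.append(url)   # keep going; one failure doesn't stop the batch

if failed_urls:
    with open("urls-failed.txt", "w") as fh:
        fh.write("\n".join(failed_urls) + "\n")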

Output changes:

  • Before: docs.example.com-llms.txt (overwrites on same domain)
  • After: docs_example_page1.txt, docs_example_page2.txt (unique files)
  • Bulk mode: Individual files + docs_example_consolidated_index_20250103_143022.txt
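
A rough sketch of this naming scheme (the helper functions and exact slug rules are assumptions; the real logic lives in generate-llmstxt.py):

from datetime import datetime
from pathlib import PurePosixPath
from urllib.parse import urlparse

def page_filename(url: str) -> str:
    # https://docs.example.com/page1 -> docs_example_page1.txt
    parsed = urlparse(url)
    domain = "_".join(parsed.netloc.split(".")[:-1]) or parsed.netloc
    path = parsed.path.strip("/")
    stem = "_".join(PurePosixPath(path).with_suffix("").parts) if path else "index"
    return f"{domain}_{stem}.txt"

def index_filename(domain_slug: str) -> str:
    # Timestamp keeps repeated batch runs from overwriting each other.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"{domain_slug}_consolidated_index_{stamp}.txt"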

Impact

  • Backward compatible: All existing commands work unchanged
  • Performance: Process entire documentation sites in one command
  • Reliability: Failed URLs don't stop the entire batch
  • Usability: Install once, run anywhere with automatic dependency management

Testing

Tested with various scenarios including API failures, mixed success/failure batches, and different URL patterns from sites like modelcontextprotocol.io.

Files changed:

  • generate-llmstxt.py - Core enhancements
  • README.md - Updated documentation
  • CHANGELOG.md - Detailed change log
  • pyproject.toml - uv dependencies
  • sample-urls.txt - Example file

zazu-22 and others added 5 commits August 2, 2025 21:25
- Add uv script support with inline dependencies for global installation
- Implement intelligent filename generation that removes extensions and creates unique names
- Add bulk processing support via --urls-file argument
- Generate consolidated index for bulk operations instead of separate index files
- Add comprehensive error handling with failed URL tracking
- Auto-generate urls-failed.txt for easy retry of failed URLs
- Improve environment variable loading to work from any directory
- Add sample URLs file for testing
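
One common way to achieve the environment-variable change, shown here as an assumption rather than the commit's exact code, is to resolve .env relative to the script instead of the current working directory:

from pathlib import Path
from dotenv import load_dotenv

# Look for .env next to the real script file, so a symlinked global install
# still finds API keys regardless of the current working directory.
load_dotenv(Path(__file__).resolve().parent / ".env")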

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update README.md with comprehensive documentation of new features:
  * Global installation using uv script support
  * Bulk URL processing via --urls-file
  * Smart filename generation and collision prevention
  * Error recovery with automatic retry file generation
  * Enhanced examples and usage patterns

- Create CHANGELOG.md documenting all improvements:
  * Detailed feature descriptions with technical details
  * Migration guide for existing users
  * Before/after examples showing new capabilities
  * Semver-compliant versioning structure

This documentation update prepares the repository for PR submission
to the main project, clearly outlining the value and scope of improvements.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
When processing multiple batches from the same domain (especially during
retry operations), the consolidated index files would overwrite each other.
The index filename now includes a timestamp in the format
domain_consolidated_index_YYYYMMDD_HHMMSS.txt.

This ensures each batch operation creates a unique index file, making it
easier to track processing history and compare results across runs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Update CHANGELOG.md to include timestamp protection feature
- Update README.md with examples of timestamped index files
- Add explanation of how multiple batch processing creates unique files
- Show examples of processing history preservation across runs

This completes the documentation for the timestamp feature that prevents
consolidated index file overwrites during retry operations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…empted

The consolidated index header was showing the total number of URLs attempted
rather than the actual number successfully processed. This was misleading
when some URLs failed during processing.

Changed from: 'X URLs processed' (total attempted)
To: 'X URLs processed successfully' (actual successful count)

Now accurately reflects the number of entries actually in the index file.
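
In sketch form (variable names are hypothetical):

# Before (counted every URL attempted, even failures):
header = f"{len(urls)} URLs processed"

# After (counts only the entries actually written to the index):
header = f"{len(successful_results)} URLs processed successfully"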

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>