A robust Python automation tool designed to perform Web Reconnaissance, SEO Audits, and Data Mining on any given URL.
Unlike scrapers tied to site-specific selectors, which break when a layout changes, this tool extracts universal HTML structures (titles, meta tags, headers, links, images), so it stays resilient and works on most websites.
- 🌐 Universal Compatibility: Works on any URL provided by the user (E-commerce, Blogs, Corporate sites).
- 🤖 Headless Automation: Runs silently in the background using Selenium with headless Chrome (ChromeDriver), avoiding the overhead of a visible browser window.
- 📊 SEO & Meta Data Extraction: Automatically grabs Page Titles and Meta Descriptions.
- 🔗 Structural Mapping: Extracts all Headers (H1, H2, H3), Links, and Image sources to map the site's architecture.
- 📂 Dual Export Format:
  - JSON: A structured report for developers and NoSQL databases.
  - CSV: A clean spreadsheet of all extracted links ready for Excel/Analysis.

Technologies used:
- Python: Core logic and scripting.
- Selenium: Dynamic web navigation and JavaScript rendering.
- BeautifulSoup4: High-speed HTML parsing.
- Pandas: Data cleaning, transformation, and CSV export.
- WebDriver Manager: Automated Chrome driver management.
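
The extraction features listed above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the tool's actual code: the function name `scan_html`, the sample HTML, and the extra `links` key are hypothetical. In the real tool the HTML would presumably come from Selenium's `driver.page_source` once the page has rendered; here a literal string is used so the sketch runs without a browser.

```python
from bs4 import BeautifulSoup


def scan_html(html: str, target_url: str) -> dict:
    """Extract layout-independent elements from an HTML string (hypothetical helper)."""
    # "html.parser" is Python's built-in parser, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")

    # A page may lack a <title> or a meta description, so guard against None.
    title = soup.title.get_text(strip=True) if soup.title else None
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content") if meta else None

    # H1-H3 headers in document order.
    headers = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])]

    # Only anchors that actually carry an href, and images that carry a src.
    links = [
        {"text": a.get_text(strip=True), "url": a["href"]}
        for a in soup.find_all("a", href=True)
    ]
    images = [
        {"alt": img.get("alt", ""), "src": img["src"]}
        for img in soup.find_all("img", src=True)
    ]

    return {
        "target_url": target_url,
        "page_title": title,
        "meta_description": description,
        "total_links_found": len(links),
        "headers_structure": headers,
        "images_extracted": images,
        "links": links,  # assumption: kept here so the links can be exported separately
    }


SAMPLE_HTML = """
<html><head><title>Demo Page</title>
<meta name="description" content="A tiny page."></head>
<body><h1>Welcome</h1><h2>News</h2>
<a href="/about/">About</a><a>No href</a>
<img src="/logo.png" alt="Logo"></body></html>
"""

report = scan_html(SAMPLE_HTML, "https://example.com")
```

Because the function only looks for tags that nearly every page has, it does not depend on any site-specific class names or IDs.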

Clone the repository:

    git clone https://github.com/armando-desouza/universal-web-scraper.git
    cd universal-web-scraper

Make sure you have Python installed, then install the dependencies:

    pip install -r requirements.txt

Run the script:

    python universal_scraper.py

The terminal will prompt you for a URL (the prompt is in Portuguese and means "Paste the URL you want to scrape"). Example:

    🌐 Cole a URL que deseja raspar: https://www.python.org

The tool generates two files automatically:

web_scan_report.json (Structured Data):

    {
      "target_url": "https://www.python.org",
      "scraped_at": "2024-01-14 10:30:00",
      "page_title": "Welcome to Python.org",
      "meta_description": "The official home of the Python Programming Language",
      "total_links_found": 145,
      "headers_structure": [
        "Get Started",
        "Downloads",
        "Docs"
      ],
      "images_extracted": [
        {"alt": "Python Logo", "src": "/static/img/python-logo.png"}
      ]
    }

extracted_links.csv (Excel Ready):

| text | url |
|---|---|
| About | /about/ |
| Downloads | /downloads/ |
| Documentation | /document/ |
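
The two output files shown above could be produced along these lines. This is a hedged sketch rather than the tool's own export code: the README lists Pandas for CSV export, whereas this example swaps in the standard library's `csv` module to stay dependency-free, and the function name `export_results` and the demo data are assumptions.

```python
import csv
import json
import tempfile
from datetime import datetime
from pathlib import Path


def export_results(report: dict, links: list[dict], out_dir: str = ".") -> tuple[Path, Path]:
    """Write the JSON report and the links CSV; return both paths (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    json_path = out / "web_scan_report.json"
    csv_path = out / "extracted_links.csv"

    # Stamp the report with the scan time, in the format used by the sample report.
    payload = {**report, "scraped_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
    json_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")

    # newline="" lets the csv module control line endings on every platform.
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["text", "url"])
        writer.writeheader()
        writer.writerows(links)

    return json_path, csv_path


# Demo data (made up), written to a throwaway directory and read back.
demo_report = {"target_url": "https://www.python.org", "page_title": "Welcome to Python.org"}
demo_links = [
    {"text": "About", "url": "/about/"},
    {"text": "Downloads", "url": "/downloads/"},
]

with tempfile.TemporaryDirectory() as tmp:
    json_file, csv_file = export_results(demo_report, demo_links, tmp)
    saved_report = json.loads(json_file.read_text(encoding="utf-8"))
    with csv_file.open(newline="", encoding="utf-8") as fh:
        saved_rows = list(csv.DictReader(fh))
```

With Pandas installed, the CSV step could equally be written as `pandas.DataFrame(links).to_csv(csv_path, index=False)`.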

I built this tool to automate the initial phase of Data Extraction projects. Before building a custom bot for a specific client, I use this scanner to:
- Understand the target website's structure (DOM).
- Check for anti-bot measures.
- Audit internal linking strategies.
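
As one illustration of the internal-linking audit, the URLs in `extracted_links.csv` could be split into internal and external links. This is only a sketch under assumptions: the function name `classify_links` is hypothetical, and "internal" here means an exact host match, so a subdomain such as `docs.python.org` would count as external.

```python
from urllib.parse import urljoin, urlparse


def classify_links(base_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Group hrefs into internal and external absolute URLs (hypothetical helper)."""
    base_host = urlparse(base_url).netloc.lower()
    result: dict[str, list[str]] = {"internal": [], "external": []}

    for href in hrefs:
        # Resolve relative paths such as "/about/" against the scanned page.
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)

        # Skip non-web schemes such as mailto:, tel:, and javascript:.
        if parsed.scheme not in ("http", "https"):
            continue

        bucket = "internal" if parsed.netloc.lower() == base_host else "external"
        result[bucket].append(absolute)

    return result


audit = classify_links(
    "https://www.python.org",
    ["/about/", "/downloads/", "https://pypi.org/", "mailto:someone@example.com"],
)
```

The ratio of internal to external links, or the pages that receive no internal links at all, can then be computed from the two lists.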

Francisco A. de Souza
Python Developer & Data Scientist | Data Specialist

For the installation command to work, make sure the `requirements.txt` file is in the same folder and contains exactly the following:

    selenium
    beautifulsoup4
    pandas
    webdriver-manager