🕵️‍♂️ Universal Web Scanner & Data Extractor

A robust Python automation tool designed to perform Web Reconnaissance, SEO Audits, and Data Mining on any given URL.

Unlike scrapers tied to site-specific selectors, which break when a layout changes, this tool extracts HTML structures that nearly every page has (title, meta tags, headers, links, images), so it stays usable across most websites.

🚀 Key Features.

🌐 Universal Compatibility: Works on any URL provided by the user (E-commerce, Blogs, Corporate sites).
🤖 Headless Automation: Runs Chrome in the background with no visible window, using Selenium (ChromeDriver), which avoids the overhead of drawing the browser UI.
📊 SEO & Meta Data Extraction: Automatically grabs Page Titles and Meta Descriptions.
🔗 Structural Mapping: Extracts all Headers (H1, H2, H3), Links, and Image sources to map the site's architecture.
📂 Dual Export Format:
- JSON: A structured report for developers and NoSQL databases.
- CSV: A clean spreadsheet of all extracted links ready for Excel/Analysis.
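The extraction features listed above can be sketched in a few dozen lines. This is a minimal illustration, not the code in `universal_scraper.py`: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without dependencies, and the names `PageScanner`, `scan_html`, and the `links` key are assumptions made for the example.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects layout-independent parts of a page in a single pass."""

    TEXT_TAGS = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.headers = []  # text of every h1/h2/h3, in document order
        self.links = []    # {"text": ..., "url": ...}
        self.images = []   # {"alt": ..., "src": ...}
        self._open = []    # (tag, href, text chunks) for tags still being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag == "img" and attrs.get("src"):
            self.images.append({"alt": attrs.get("alt") or "", "src": attrs["src"]})
        elif tag == "a" and attrs.get("href"):
            self._open.append((tag, attrs["href"], []))
        elif tag in self.TEXT_TAGS:
            self._open.append((tag, None, []))

    def handle_data(self, data):
        # Text belongs to every open tag, so a link inside a header feeds both.
        for _, _, chunks in self._open:
            chunks.append(data)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if we are tracking one.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, href, chunks = self._open.pop(i)
                break
        else:
            return
        text = " ".join("".join(chunks).split())  # collapse whitespace
        if tag == "title":
            self.title = text
        elif tag == "a":
            self.links.append({"text": text, "url": href})
        else:
            self.headers.append(text)


def scan_html(html):
    """Parse an HTML string and return the extracted fields as a dict."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return {
        "page_title": scanner.title,
        "meta_description": scanner.meta_description,
        "headers_structure": scanner.headers,
        "links": scanner.links,
        "images_extracted": scanner.images,
    }
```

Because nothing here depends on class names or element IDs, the same function applies to any page that uses standard tags. Pages that omit a `<title>` or a description simply yield empty strings.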

🛠️ Tech Stack.

- Python: Core logic and scripting.
- Selenium: Dynamic web navigation and JavaScript rendering.
- BeautifulSoup4: High-speed HTML parsing.
- Pandas: Data cleaning, transformation, and CSV export.
- WebDriver Manager: Automated Chrome driver management.
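As a rough sketch of how Selenium and WebDriver Manager fit together, the function below loads a page in headless Chrome and returns the rendered HTML. The function name `fetch_rendered_html` and the `HEADLESS_ARGS` list are assumptions for illustration; the actual script may configure the browser differently. The `--headless=new` flag assumes a recent Chrome release.

```python
# Chrome flags for a windowless run (assumed values, adjust as needed).
HEADLESS_ARGS = ["--headless=new", "--window-size=1920,1080"]


def fetch_rendered_html(url):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run."""
    # Imported lazily so this module can be imported before the packages are installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)

    # webdriver-manager downloads a matching ChromeDriver and returns its path.
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned string can then be handed to a parser, for example `BeautifulSoup(html, "html.parser")`.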

⚙️ Installation & Usage.

1. Clone the Repository.

    git clone https://github.com/armando-desouza/universal-web-scraper.git
    cd universal-web-scraper

2. Install Dependencies.

Make sure you have Python installed, then run:

    pip install -r requirements.txt

3. Run the Scanner.

    python universal_scraper.py

4. Enter a Target URL.

The terminal will prompt you for a URL. The prompt is printed in Portuguese and means "Paste the URL you want to scrape". Example:

    🌐 Cole a URL que deseja raspar: https://www.python.org
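The README does not say how the pasted URL is validated. One possible approach, shown here as a hypothetical helper named `normalize_url`, is to trim whitespace, assume HTTPS when no scheme is given, and reject anything that is not an `http` or `https` address.

```python
from urllib.parse import urlparse


def normalize_url(raw):
    """Return a cleaned http(s) URL, or raise ValueError if none can be built."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate  # assumption: default to HTTPS
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a usable web URL: {raw!r}")
    return candidate
```

With this in place, `normalize_url(input(...))` would accept both `www.python.org` and `https://www.python.org`.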

📂 Output Example.

The tool generates two files automatically:

web_scan_report.json (Structured Data).

    {
        "target_url": "https://www.python.org",
        "scraped_at": "2024-01-14 10:30:00",
        "page_title": "Welcome to Python.org",
        "meta_description": "The official home of the Python Programming Language",
        "total_links_found": 145,
        "headers_structure": [
            "Get Started",
            "Downloads",
            "Docs"
        ],
        "images_extracted": [
            {"alt": "Python Logo", "src": "/static/img/python-logo.png"}
        ]
    }
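A report with this shape could be assembled and written as follows. The field names come from the example above; the function names `build_report` and `save_report` are assumptions, and the script's own implementation may differ.

```python
import json
from datetime import datetime


def build_report(target_url, page_title, meta_description, links, headers, images):
    """Assemble the report dict using the field names from the sample output."""
    return {
        "target_url": target_url,
        "scraped_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "page_title": page_title,
        "meta_description": meta_description,
        "total_links_found": len(links),
        "headers_structure": headers,
        "images_extracted": images,
    }


def save_report(report, path="web_scan_report.json"):
    """Write the report as indented UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps accented page titles readable in the file.
        json.dump(report, fh, indent=4, ensure_ascii=False)
```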

extracted_links.csv (Excel Ready).

| text | url |
| --- | --- |
| About | /about/ |
| Downloads | /downloads/ |
| Documentation | /document/ |
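The two-column file can be produced from a list of link dictionaries. The project lists Pandas for CSV export; the sketch below swaps in the standard library's `csv` module so it runs without dependencies, and the function name `save_links` is an assumption.

```python
import csv


def save_links(links, path="extracted_links.csv"):
    """Write a list of {"text": ..., "url": ...} dicts to a two-column CSV."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["text", "url"])
        writer.writeheader()
        writer.writerows(links)
```

With Pandas, the equivalent call is `pandas.DataFrame(links).to_csv(path, index=False)`.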

💡 Why this tool?

I built this tool to automate the initial phase of Data Extraction projects. Before building a custom bot for a specific client, I use this scanner to:

- Understand the target website's structure (DOM).
- Check for anti-bot measures.
- Audit internal linking strategies.
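For the internal-linking audit, a natural first step is to separate the extracted links into internal and external ones. The sketch below is one way to do that with `urllib.parse`; the function name `split_links` is hypothetical, and it treats only an exact host match as internal, so subdomains count as external.

```python
from urllib.parse import urljoin, urlparse


def split_links(base_url, hrefs):
    """Resolve each href against base_url and sort it into internal or external."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # turns "/about/" into a full URL
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

Feeding it the `url` column of `extracted_links.csv` together with the target URL gives two lists whose sizes already say something about how heavily a site links to itself.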

👨‍💻 Author

Francisco A. de Souza, Python Developer & Data Scientist | Data Specialist

⚠️ Don't forget the `requirements.txt` file.

For the installation command to work, make sure the `requirements.txt` file is in the same folder and contains exactly the following:

    selenium
    beautifulsoup4
    pandas
    webdriver-manager
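To confirm that the install step worked, a short check like the one below can report which of the four packages are still missing. It is an optional, hypothetical helper (`missing_packages` is not part of the repository) that looks up installed distributions by the same names used in `requirements.txt`.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["selenium", "beautifulsoup4", "pandas", "webdriver-manager"]


def missing_packages(names=REQUIRED):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises if the distribution is not installed
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

An empty list from `missing_packages()` means all dependencies are present.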
