
armando-desouza/Universal-Web-Scanner-Data-Extractor


🕵️‍♂️ Universal Web Scanner & Data Extractor



A robust Python automation tool designed to perform Web Reconnaissance, SEO Audits, and Data Mining on any given URL.

Unlike scrapers that depend on site-specific selectors and break when a layout changes, this tool extracts universal HTML structures (titles, meta tags, headers, links, images), so it stays resilient and works on most websites.


🚀 Key Features.

  • 🌐 Universal Compatibility: Works on any URL provided by the user (E-commerce, Blogs, Corporate sites).
  • 🤖 Headless Automation: Runs Chrome in headless mode through Selenium (ChromeDriver), so no browser window opens while the page is rendered.
  • 📊 SEO & Meta Data Extraction: Automatically grabs Page Titles and Meta Descriptions.
  • 🔗 Structural Mapping: Extracts all Headers (H1, H2, H3), Links, and Image sources to map the site's architecture.
  • 📂 Dual Export Format:
    • JSON: A structured report for developers and NoSQL databases.
    • CSV: A clean spreadsheet of all extracted links ready for Excel/Analysis.
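The extraction described above can be sketched as follows. This is a minimal illustration, not the code in `universal_scraper.py`: the class name `PageScanner` is hypothetical, and it uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup. Nested cases, such as a link inside a header, are not handled.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Hypothetical helper: collects title, meta description, headers, links, images."""

    HEADER_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta_description = ""
        self.headers = []
        self.links = []    # dicts with "text" and "url"
        self.images = []   # dicts with "alt" and "src"
        self._capture = None   # tag whose text is currently being collected
        self._buffer = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.meta_description = attrs.get("content") or ""
        elif tag == "img":
            self.images.append({"alt": attrs.get("alt") or "",
                                "src": attrs.get("src") or ""})
        elif tag == "title" or tag in self.HEADER_TAGS:
            self._capture, self._buffer = tag, []
        elif tag == "a" and attrs.get("href"):
            self._capture, self._buffer = "a", []
            self._href = attrs["href"]

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._capture:
            return
        # Collapse runs of whitespace inside the collected text.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif tag == "a":
            self.links.append({"text": text, "url": self._href})
        else:
            self.headers.append(text)
        self._capture = None


sample_html = """<html><head><title>Welcome to Python.org</title>
<meta name="description" content="The official home of the Python Programming Language">
</head><body><h1>Get Started</h1><a href="/about/">About</a>
<img alt="Python Logo" src="/static/img/python-logo.png"></body></html>"""

scanner = PageScanner()
scanner.feed(sample_html)
```

Because only generic tags are inspected, the same class applies to any page without per-site selectors.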

🛠️ Tech Stack.

  • Python: Core logic and scripting.
  • Selenium: Dynamic web navigation and JavaScript rendering.
  • BeautifulSoup4: High-speed HTML parsing.
  • Pandas: Data cleaning, transformation, and CSV export.
  • WebDriver Manager: Automated Chrome driver management.
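As a rough sketch of how Selenium and WebDriver Manager might fit together, the snippet below builds a headless Chrome driver and returns the rendered HTML. The function names `build_headless_driver` and `fetch_rendered_html` are assumptions for illustration, and the actual script may configure Chrome differently.

```python
def build_headless_driver():
    """Return a headless Chrome driver (requires Chrome to be installed)."""
    # Imported lazily so this module loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless=new")          # no visible browser window
    options.add_argument("--window-size=1920,1080")
    # WebDriver Manager downloads a matching ChromeDriver and returns its path.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


def fetch_rendered_html(url, driver_factory=build_headless_driver):
    """Load `url` in the browser and return the HTML after JavaScript has run."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

Passing the driver factory as a parameter keeps the fetch step easy to exercise with a stub when Chrome is not available.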

⚙️ Installation & Usage.

1. Clone the Repository.

    git clone https://github.com/armando-desouza/universal-web-scraper.git
    cd universal-web-scraper

2. Install Dependencies.

Make sure you have Python installed, then run:

    pip install -r requirements.txt

3. Run the Scanner.

    python universal_scraper.py

4. Enter a Target URL.

The terminal will prompt you for a URL. The prompt is in Portuguese and means "Paste the URL you want to scrape". Example:

    🌐 Cole a URL que deseja raspar: https://www.python.org

📂 Outputs Example.

The tool generates two files automatically:

  1. web_scan_report.json (Structured Data).
    {
        "target_url": "https://www.python.org",
        "scraped_at": "2024-01-14 10:30:00",
        "page_title": "Welcome to Python.org",
        "meta_description": "The official home of the Python Programming Language",
        "total_links_found": 145,
        "headers_structure": [
            "Get Started",
            "Downloads",
            "Docs"
        ],
        "images_extracted": [
            {"alt": "Python Logo", "src": "/static/img/python-logo.png"}
        ]
    }
  2. extracted_links.csv (Excel Ready).
    text,url
    About,/about/
    Downloads,/downloads/
    Documentation,/document/
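For illustration, the two output files could be written along these lines. This is a sketch under stated assumptions: `export_scan` is a hypothetical name, and the standard library's `csv` module stands in for the Pandas export the tool uses.

```python
import csv
import json
from datetime import datetime


def export_scan(report, links,
                json_path="web_scan_report.json",
                csv_path="extracted_links.csv"):
    """Write the JSON report and the links CSV; return the final report dict."""
    # Add the timestamp and link count fields shown in the sample report.
    report = dict(report,
                  scraped_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
                  total_links_found=len(links))
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, ensure_ascii=False, indent=4)
    # newline="" prevents blank rows on Windows.
    with open(csv_path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["text", "url"])
        writer.writeheader()
        writer.writerows(links)
    return report
```

With Pandas, the CSV step would be roughly `pd.DataFrame(links).to_csv(csv_path, index=False)`.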

💡 Why this tool?

I built this tool to automate the initial phase of Data Extraction projects. Before building a custom bot for a specific client, I use this scanner to:

  1. Understand the target website's structure (DOM).

  2. Check for anti-bot measures.

  3. Audit internal linking strategies.
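One way to start an internal-linking audit from the extracted links is to separate internal from external URLs. The helper below is a hypothetical sketch (`split_links` is not part of the tool) that treats a link as internal only when its host matches the target's host exactly, so subdomains count as external.

```python
from urllib.parse import urljoin, urlparse


def split_links(base_url, hrefs):
    """Return (internal, external) lists of absolute http(s) URLs."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # resolve relative paths like /about/
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if parsed.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = split_links(
    "https://www.python.org",
    ["/about/", "https://docs.python.org/3/", "mailto:someone@example.com"],
)
```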

👨‍💻 Author

Francisco A. de Souza
Python Developer & Data Scientist | Data Specialist


⚠️ Don't forget the requirements.txt file.

For the installation command to work, `requirements.txt` must be in the folder where you run `pip install` and must list the following packages:

    selenium
    beautifulsoup4
    pandas
    webdriver-manager
