ML Phishing Page Detection

Classify screenshots or DOM data of login pages to identify fake sites phishing for credentials.

Project Overview

Automatically detect phishing websites that mimic legitimate login pages (e.g., banking, email, social media) by analyzing their visual appearance, HTML structure, and textual content.

Goals

- Detect fake login pages in real time with high precision
- Combine visual, structural, and semantic features for robust detection
- Optionally surface alerts through a browser extension or SOC dashboard

Architecture & Workflow

            ┌────────────┐
            │  Target URL│
            └─────┬──────┘
                  ▼
       ┌──────────────────────────┐
       │ Page Renderer (Headless) │  ← Screenshots, HTML dump
       └─────────┬────────────────┘
                ▼
   ┌────────────┴─────────────┐
   │ Feature Extractor        │
   │ ├── Visual: screenshot   │ ← CNN-based classifier
   │ ├── Text: NLP on content │ ← BERT, TF-IDF, etc.
   │ └── DOM: tag structure   │ ← tag frequency, depth
   └────────────┬─────────────┘
                ▼
       ┌────────┴────────┐
       │   ML Classifier │ ← binary classification: {legitimate, phishing}
       └────────┬────────┘
                ▼
        ┌───────┴────────┐
        │ Alert/Block/Log│
        └────────────────┘
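
The stages in the diagram can be sketched as a thin orchestration layer. This is a minimal, hypothetical skeleton rather than the repository's implementation: `RenderedPage`, `extract_features`, `decide`, and the 0.5 / 0.9 thresholds are illustrative assumptions, and the renderer and classifier themselves are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RenderedPage:
    """Output of the headless renderer stage (hypothetical container)."""
    url: str
    html: str
    screenshot_png: bytes = b""  # raw screenshot bytes, empty if not captured


# An extractor maps a rendered page to named numeric features.
Extractor = Callable[[RenderedPage], Dict[str, float]]


def extract_features(page: RenderedPage,
                     extractors: Dict[str, Extractor]) -> Dict[str, float]:
    """Run every extractor and merge the results under 'group.name' keys."""
    merged: Dict[str, float] = {}
    for group, extractor in extractors.items():
        for name, value in extractor(page).items():
            merged[f"{group}.{name}"] = float(value)
    return merged


def decide(phishing_score: float,
           alert_at: float = 0.5,
           block_at: float = 0.9) -> str:
    """Map a classifier score in [0, 1] to the Alert/Block/Log stage.

    The thresholds are placeholders and would need tuning on real data.
    """
    if phishing_score >= block_at:
        return "block"
    if phishing_score >= alert_at:
        return "alert"
    return "log"
```

Keeping each modality behind the same extractor signature means visual, text, and DOM features can be added or removed without touching the classifier stage.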

Tech Stack

| Area                     | Tool/Library                              | Purpose                                           |
| ------------------------ | ----------------------------------------- | ------------------------------------------------- |
| **Web Scraping**         | `Playwright` / `Selenium`                 | Load page, take screenshots, extract HTML         |
| **Image Classification** | `PyTorch` / `TensorFlow`                  | CNN model to classify page screenshots            |
| **Text Analysis**        | `Transformers (BERT)`                     | Analyze textual clues (e.g., "Login to PayPal")   |
| **DOM Analysis**         | `BeautifulSoup`, `lxml`                   | Parse HTML, extract structural features           |
| **Model Training**       | `scikit-learn`, `XGBoost`                 | Traditional models on structural/textual features |
| **Monitoring**           | `FastAPI`, `Prometheus`                   | Optional API and alerting layer                   |
| **Deployment**           | `Docker`, `Kubernetes`                    | Serve the model in a scalable microservice        |
| **Dataset Sources**      | PhishTank, OpenPhish, Legit scraped pages | Collect real vs. phishing pages                   |

Features

Visual
- Screenshot image (CNN input)
- Logo detection (optional)
- Layout similarities
Textual
- Login-related keywords
- Brand mentions (e.g., PayPal, Google)
- Language models for semantic similarity
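
As a rough illustration of the first two textual features, the sketch below counts login-related keywords and brand mentions in page text. The keyword and brand lists are assumptions chosen for the example (only PayPal and Google appear in this README), and the semantic-similarity feature is not covered here.

```python
import re
from typing import Dict, Sequence

# Illustrative lists only; a real deployment would curate these.
LOGIN_KEYWORDS = ("login", "log in", "sign in", "password", "verify", "account")
BRANDS = ("paypal", "google")


def text_features(text: str,
                  keywords: Sequence[str] = LOGIN_KEYWORDS,
                  brands: Sequence[str] = BRANDS) -> Dict[str, int]:
    """Count keyword and brand occurrences in visible page text."""
    # Lowercase and collapse whitespace so "Log  In" matches "log in".
    normalized = re.sub(r"\s+", " ", text.lower())
    return {
        "keyword_hits": sum(normalized.count(k) for k in keywords),
        "brand_hits": sum(normalized.count(b) for b in brands),
        "distinct_brands": sum(1 for b in brands if b in normalized),
    }
```

For example, `text_features("Login to PayPal")` reports one keyword hit and one brand hit. Plain substring counts like these could feed a traditional model alongside TF-IDF or BERT-derived features.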
DOM Structure
- Element presence
- Form actions, external links
- Suspicious domains in href attributes
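
The DOM features above could be extracted along these lines. To stay dependency-free, this sketch uses the standard library's `html.parser` instead of `BeautifulSoup`. The feature names are hypothetical, counting forms and password inputs is an assumption about which elements matter, and "external" simply means the host differs from the page's own host, which is only a crude proxy for a suspicious domain.

```python
from html.parser import HTMLParser
from typing import Dict
from urllib.parse import urlparse


class DomFeatureParser(HTMLParser):
    """Collect simple structural counts while streaming through the HTML."""

    def __init__(self, page_host: str) -> None:
        super().__init__()
        self.page_host = page_host.lower()
        self.features: Dict[str, int] = {
            "forms": 0,
            "password_inputs": 0,
            "external_form_actions": 0,
            "external_links": 0,
        }

    def _is_external(self, url: str) -> bool:
        # Relative URLs have no hostname and count as internal.
        host = urlparse(url).hostname
        return host is not None and host != self.page_host

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # attribute values may be None
        if tag == "form":
            self.features["forms"] += 1
            if self._is_external(attr.get("action") or ""):
                self.features["external_form_actions"] += 1
        elif tag == "input" and (attr.get("type") or "").lower() == "password":
            self.features["password_inputs"] += 1
        elif tag == "a" and self._is_external(attr.get("href") or ""):
            self.features["external_links"] += 1


def dom_features(html: str, page_url: str) -> Dict[str, int]:
    """Return structural counts for a page served from page_url."""
    parser = DomFeatureParser(urlparse(page_url).hostname or "")
    parser.feed(html)
    parser.close()
    return parser.features
```

A form whose `action` posts credentials to a different host than the page itself is a plausible signal, so it is tracked separately from ordinary external links.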

Output

The final model returns:

{
  "label": "phishing",
  "confidence": 0.96,
  "explanation": "Fake PayPal login on suspicious domain"
}
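
One way to produce a result of this shape from a classifier probability is sketched below. The function name `build_result`, the 0.5 threshold, and the `"legitimate"` label for the negative class are assumptions, since the README only shows the `"phishing"` case.

```python
import json
from typing import Dict, Union


def build_result(phishing_probability: float,
                 explanation: str,
                 threshold: float = 0.5) -> Dict[str, Union[str, float]]:
    """Turn a phishing probability into the label/confidence/explanation dict."""
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("phishing_probability must be within [0, 1]")
    is_phishing = phishing_probability >= threshold
    # Confidence refers to the predicted label, not always the phishing class.
    confidence = phishing_probability if is_phishing else 1.0 - phishing_probability
    return {
        "label": "phishing" if is_phishing else "legitimate",
        "confidence": round(confidence, 2),
        "explanation": explanation,
    }


if __name__ == "__main__":
    result = build_result(0.96, "Fake PayPal login on suspicious domain")
    print(json.dumps(result, indent=2))
```

Reporting confidence for the predicted label keeps the value easy to read in an alert, at the cost of needing the label to interpret it.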
