Skip to content

devjoshi2005/ResilienceOps

Repository files navigation

RESILIENCEOPS: Cloud-Native Incident Response Engine


What Problem Does This Solve?

The Scenario: Imagine your company's cloud infrastructure is like a large building with many rooms (servers, databases, applications). Right now, if someone breaks in or something goes wrong:

  • Security cameras (monitoring tools) record everything but nobody watches them in real-time
  • When an alarm goes off, someone has to manually check 10 different systems to understand what's happening
  • By the time they figure out the problem, the damage is done
  • Fixing the issue requires writing complex code by hand under pressure

The Cost:

  • Average data breach takes 287 days to identify and contain (IBM 2024)
  • During this time, attackers move laterally, steal data, deploy ransomware
  • Manual incident response costs $4.88M per breach on average

Technical Report : Technical Report Business Report : Business Report


Image

How ResilienceOps Works (The "Security Guard" Analogy)

Think of ResilienceOps as an AI-powered security command center that never sleeps:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE RESILIENCEOPS SECURITY COMMAND CENTER                 │
└─────────────────────────────────────────────────────────────────────────────┘

  SECURITY CAMERAS                    COMMAND CENTER                    RESPONSE
  (AWS CloudTrail +                   (AI Brain)                        (Auto-Fix)
   GuardDuty)                                                          
       │                                    │                               │
       ▼                                    ▼                               ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   ┌─────────────┐
│ EKS Cluster │───▶│  S3 Bucket  │───▶│   SQLite    │───▶│    AI       │──▶│   JIRA      │
│   (Apps)    │    │ (Log Store) │    │  (Database) │    │  Analysis   │    │ (Tickets)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘   └─────────────┘
       │                                    │              │                      │
       │                                    │              ▼                      │
       │                                    │       ┌─────────────┐               │
       │                                    │       │  Anomaly    │               │
       │                                    │       │  Detection  │               │
       │                                    │       └─────────────┘               │
       │                                    │              │                      │
       │                                    │              ▼                      │
       │                                    │       ┌─────────────┐               │
       │                                    │       │   OpenAI    │               │
       │                                    │       │ (Terraform  │               │
       │                                    │       │  Remediation│               │
       │                                    │       └─────────────┘               │
       │                                    │              │                      │
       │                                    ▼              ▼                      ▼
       │                              ┌─────────────┐    ┌─────────────┐   ┌─────────────┐
       │                              │    Neo4j    │    │    OPA      │   │ Terraform   │
       │                              │  (Graph DB) │    │  (Policy    │   │   Apply     │
       │                              │  (Threat    │    │   Check)    │   │             │
       │                              │  Mapping)   │    └─────────────┘   └─────────────┘
       │                              └─────────────┘
       │
       ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    IAM      │───▶│ CloudTrail  │───▶│  Prometheus │
│ (Identity)  │    │   Logs      │    │  (Metrics)  │
└─────────────┘    └─────────────┘    └─────────────┘
       │                                    │
       │                                    ▼
       │                              ┌─────────────┐
       │                              │   Grafana   │
       │                              │(Dashboards) │
       │                              └─────────────┘
       │
       ▼
┌─────────────┐
│  EC2 Instance│
│  (Servers)   │
└─────────────┘

Step-by-Step: What Happens During an Attack

Phase 1: Detection (The Cameras Roll)

  • AWS GuardDuty (AI threat detector) spots suspicious activity
    • Example: Someone trying to access your database from an unusual location
  • AWS CloudTrail (activity logger) records every action
    • Example: "User X tried to delete 50 files at 3 AM"
  • EKS Cluster (container monitoring) detects pod anomalies
    • Example: "Container Y is using 10x normal CPU (crypto mining?)"

Phase 2: Collection (Evidence Gathering)

All these alerts flow into S3 buckets (secure storage), then get processed into a SQLite database organized by:

  • When it happened (timestamp)
  • How serious it is (severity: Low/Medium/High/Critical)
  • What was affected (resource: EC2, S3, IAM, etc.)
  • Who did it (account ID, region)

Phase 3: Analysis (The AI Brain Kicks In)

The Anomaly Detector (using Isolation Forest machine learning) asks:

  • "Is this normal behavior?"
  • "Have we seen this pattern before?"
  • "How risky is this combination of events?"

Risk Score Calculation:

Risk Score = (Severity × 40%) + (Anomaly × 30%) + (Rarity × 20%) + (Scope × 10%)

Example: 
- Critical severity (40 points)
- Never seen before (30 points)
- Rare event type (20 points)
- Multiple resources affected (10 points)
= 100/100 RISK SCORE → IMMEDIATE ACTION

Phase 4: Threat Intelligence (Connecting the Dots)

Neo4j Graph Database maps relationships:

  • "This IAM user accessed that S3 bucket"
  • "This EC2 instance talked to that database"
  • "Attack spread from Resource A → Resource B → Resource C"

Why this matters: You can see the attack path and stop it before it spreads.

Phase 5: Automated Response (Fixing the Problem)

For Critical Incidents (Risk Score ≥ 100):

  1. Create JIRA Ticket → Alerts human security team with full context
  2. Generate Terraform Code → AI writes the fix automatically
    • "Block this IP address"
    • "Revoke these permissions"
    • "Enable encryption on this bucket"
  3. Policy Check → OPA validates the fix won't break anything
  4. Auto-Remediation → Apply fix immediately (optional)

Real-World Example: Crypto Mining Attack

The Attack:

  1. Attacker exploits vulnerable Kubernetes pod (using vulnerable.yaml - intentionally insecure for testing)
  2. Deploys crypto miner (simulated by cryptosimulation.yaml)
  3. CPU usage spikes to 100%

ResilienceOps Response:

Time Action System Component
T+0s GuardDuty detects anomalous compute Detection
T+5s Event ingested into SQLite Collection
T+10s Anomaly detector flags 95% risk score Analysis
T+15s Neo4j maps: Pod → Node → IAM Role Threat Intel
T+20s JIRA ticket created with full context Notification
T+30s OpenAI generates Terraform to isolate pod Remediation
T+60s OPA validates: "No destructive actions" Validation
T+90s Pod isolated, attack contained Resolution

Total Response Time: 90 seconds (vs. industry average of 287 days for undetected breaches)


Key Components Explained

1. Data Collection Layer

Component What It Does Real-World Analogy
AWS CloudTrail Records every API call Security camera footage
AWS GuardDuty AI-powered threat detection Motion sensors with AI
EKS Cluster Container monitoring Smart building sensors
Prometheus Metrics collection Utility usage monitors

2. Storage & Processing

Component What It Does Real-World Analogy
S3 Buckets Secure log storage Evidence locker
SQLite Structured event database Incident report filing system
Neo4j Relationship mapping Investigation pinboard with string connections

3. Intelligence Layer

Component What It Does Real-World Analogy
Isolation Forest ML anomaly detection Experienced security guard's gut feeling
Risk Scoring Prioritization engine Triage nurse at emergency room

4. Response Layer

Component What It Does Real-World Analogy
OpenAI GPT-4 Auto-generates fixes Senior engineer writing code instantly
OPA/Rego Policy validation Legal compliance check
JIRA Ticket creation Dispatch calling backup
Terraform Infrastructure fixes Automated repair robots

Business Value

For Startups:

  • Cost: One security engineer costs ₹15-25 LPA
  • ResilienceOps: Automates 70% of tier-1 incident response
  • ROI: Detect and contain breaches in minutes vs. months

For Enterprises:

  • Compliance: SOC2, ISO27001, PCI-DSS require incident response capabilities
  • MTTD/MTTR: Reduce Mean Time To Detect/Respond by 99%
  • Insurance: Lower cyber insurance premiums with demonstrated automation

Comparison: With vs. Without ResilienceOps

Scenario Traditional Approach With ResilienceOps
Detection 24-48 hours (manual log review) 5 seconds (automated)
Analysis 2-4 hours (correlating across tools) 15 seconds (AI + Graph DB)
Prioritization Subjective, inconsistent Risk score 0-100, objective
Response Manual ticket creation, research Auto-generated remediation code
Documentation Post-incident, often incomplete Real-time, comprehensive
Learning Lessons lost after incident Neo4j retains attack patterns

Technical Architecture Summary

┌─────────────────────────────────────────────────────────────────┐
│                        DATA SOURCES                              │
│  AWS CloudTrail │ AWS GuardDuty │ EKS │ EC2 │ IAM │ Prometheus  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      DATA COLLECTION                             │
│  S3 Buckets (Raw Logs) → SQLite (Structured) → Neo4j (Graph)   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      ANALYSIS ENGINE                             │
│  Isolation Forest ML → Risk Scoring → Anomaly Detection         │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      RESPONSE AUTOMATION                         │
│  Critical (≥100) → JIRA + OpenAI + OPA → Terraform Remediation  │
│  High (70-99)    → JIRA + Notification                          │
│  Medium (40-69)  → Dashboard Alert                              │
│  Low (<40)       → Log for Review                               │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      OBSERVABILITY                               │
│  Prometheus Metrics → Grafana Dashboards                         │
└─────────────────────────────────────────────────────────────────┘

About

Github Project demonstrating the working of a cloud native incident response platform

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors