LogicalGagan/NEXUS

Nexus — Multimodal Desktop Intelligence

Nexus is a powerful, self-healing AI agent designed for advanced Windows automation. It combines Local Vision (Moondream2), Speech-to-Text (Whisper), and Deep Windows Integration (COM/PowerShell) to act as a truly autonomous digital assistant.

🚀 Key Features

  • Autonomous Self-Healing: If an action fails, Nexus "looks" at the screen using its vision engine, analyzes the error, and automatically rewrites its plan to succeed.
  • Multimodal Control: Talk to Nexus (Voice), type to it (Chat), or let it observe your screen (Vision).
  • Human Behavior Engine: Natural mouse movements, scrolling, and "Visual Clicking" (finding buttons by their appearance rather than just code).
  • Native Office Automation: Read/write Excel files and send Outlook emails in the background using native Windows COM services.
  • Multi-LLM Support: Dynamically switch between local models (Ollama/Qwen2.5) and cloud models (Gemini, Grok, OpenAI).
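As an illustration of the "natural mouse movements" mentioned under the Human Behavior Engine, here is a minimal sketch that generates a curved cursor path with a quadratic Bezier curve. This is an assumed technique, not necessarily what Nexus does internally; `bezier_path` and its parameters are hypothetical names.

```python
import random


def bezier_path(start, end, steps=20, jitter=80.0, rng=None):
    """Return steps + 1 (x, y) points on a quadratic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x2, y2) = start, end
    # Control point: the midpoint pushed off-axis by a random offset,
    # so the path bows instead of running in a straight line.
    cx = (x0 + x2) / 2 + rng.uniform(-jitter, jitter)
    cy = (y0 + y2) / 2 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y2
        points.append((round(x), round(y)))
    return points
```

Each point could then be passed to a mouse-control library with a short delay between moves; which library Nexus uses for this is not stated here.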

🛠️ Project Evolution (Phases)

Phase 1: The Skeleton

Established the core GUIAgent architecture and tool registry. Integrated basic file system and process management tools.
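A tool registry of the kind described above can be sketched with a decorator that maps tool names to functions. The names `TOOLS`, `tool`, `call_tool`, and `list_dir` are hypothetical and only illustrate the pattern; the actual GUIAgent registry may look different.

```python
import os
from typing import Callable, Dict

# Maps a tool name (as the LLM would refer to it) to the Python function behind it.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function under the given tool name."""
    def decorator(func):
        if name in TOOLS:
            raise ValueError(f"duplicate tool name: {name}")
        TOOLS[name] = func
        return func
    return decorator


@tool("list_dir")
def list_dir(path: str = ".") -> str:
    """Example file system tool: return the directory entries, one per line."""
    return "\n".join(sorted(os.listdir(path)))


def call_tool(name: str, **kwargs) -> str:
    """Look up a tool by name and call it; unknown names yield an error string."""
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](**kwargs)
```

Returning an error string for unknown tools, rather than raising, lets the planner see the failure as ordinary text and react to it.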

Phase 2: Web Integration

Implemented a robust Playwright-based browser automation suite. Enabled Nexus to search the web, navigate sites, and interact with web elements using placeholder-based selectors.
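To show what a placeholder-based selector looks like in practice, the following sketch uses Playwright's sync API to find an input by its placeholder text and submit a query. The function name `fill_by_placeholder` is hypothetical, and the URL and placeholder are whatever the caller supplies; this is not taken from the Nexus source.

```python
def fill_by_placeholder(url: str, placeholder: str, text: str) -> str:
    """Open url, type text into the input with that placeholder, press Enter,
    and return the title of the resulting page."""
    # Imported lazily so the module can be loaded without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Locate the field by its visible placeholder rather than a CSS id or class.
        box = page.get_by_placeholder(placeholder)
        box.fill(text)
        box.press("Enter")
        page.wait_for_load_state()
        title = page.title()
        browser.close()
        return title
```

Placeholder text tends to change less often than generated CSS class names, which is one plausible reason to prefer it for automation.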

Phase 3: The Voice (Jarvis Mode)

Added a real-time voice interface using Faster-Whisper for speech recognition and pyttsx3 for offline text-to-speech. Introduced a global PTT (Push-To-Talk) hotkey for hands-free control.
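The voice pipeline described above can be sketched as three pieces: a push-to-talk state tracker, a Faster-Whisper transcription call, and a pyttsx3 speech call. The class and function names are hypothetical, audio recording is left out, and the use of pynput for the global hotkey is an assumption (the README does not name the hotkey library).

```python
class PushToTalk:
    """Tracks whether the hotkey is held and fires start/stop callbacks once each."""

    def __init__(self, hotkey, on_start, on_stop):
        self.hotkey = hotkey
        self.on_start = on_start
        self.on_stop = on_stop
        self.recording = False

    def on_press(self, key):
        # Key repeat fires on_press many times while held; start only once.
        if key == self.hotkey and not self.recording:
            self.recording = True
            self.on_start()

    def on_release(self, key):
        if key == self.hotkey and self.recording:
            self.recording = False
            self.on_stop()


def transcribe(wav_path: str, model_size: str = "base") -> str:
    """Transcribe an audio file with Faster-Whisper on CPU."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)


def speak(text: str) -> None:
    """Speak text aloud with the offline pyttsx3 engine."""
    import pyttsx3

    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def listen_forever(ptt: PushToTalk) -> None:
    """Block while forwarding global key events to the push-to-talk tracker."""
    from pynput import keyboard  # assumption: pynput provides the global hook

    with keyboard.Listener(on_press=ptt.on_press, on_release=ptt.on_release) as listener:
        listener.join()
```

With pynput, the Right Ctrl key would be passed as `keyboard.Key.ctrl_r` when constructing `PushToTalk`.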

Phase 4: Nexus Vision

Integrated Moondream2, a local vision-language model. This gave Nexus "eyes," allowing it to describe the screen and troubleshoot browser failures visually.

Phase 5: Power Tools & Human Engine (Current)

Finalized deep Windows integration:

  • Native COM: Direct Excel and Outlook control.
  • System API: Native Volume and Brightness management.
  • Human Engine: Autonomous scrolling, coordinate-based clicking, and the vision-driven recovery loop.
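For the Native COM item above, a minimal sketch of background Excel and Outlook control through pywin32 might look like this. The function names are hypothetical and the use of pywin32 is an assumption; the COM object names and members (`Excel.Application`, `Workbooks.Open`, `Outlook.Application`, `CreateItem`) are the standard Office automation interfaces.

```python
import sys


def _require_windows():
    # COM automation is only available on Windows.
    if sys.platform != "win32":
        raise RuntimeError("COM automation requires Windows")


def read_excel_cell(path, row, col):
    """Open a workbook invisibly and return one cell value from the first sheet.
    Excel expects an absolute path."""
    _require_windows()
    import win32com.client  # provided by the pywin32 package

    excel = win32com.client.Dispatch("Excel.Application")
    excel.Visible = False
    try:
        workbook = excel.Workbooks.Open(path)
        try:
            return workbook.Worksheets(1).Cells(row, col).Value
        finally:
            workbook.Close(SaveChanges=False)
    finally:
        excel.Quit()


def send_outlook_mail(to, subject, body):
    """Create and send a plain-text email through the local Outlook profile."""
    _require_windows()
    import win32com.client

    outlook = win32com.client.Dispatch("Outlook.Application")
    mail = outlook.CreateItem(0)  # 0 = olMailItem
    mail.To = to
    mail.Subject = subject
    mail.Body = body
    mail.Send()
```

The nested `try`/`finally` blocks make sure the hidden Excel process is closed even when reading fails, so no orphaned instances are left running.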

📁 Project Structure

DESKTOP_AGENT/
├── core/               # The Brain (Router, LLM Clients, Vision Engine)
├── os_tools/           # The Hands (Browser, Excel, Outlook, Power Tools)
├── voice/              # The Ears/Voice (Whisper STT, pyttsx3 TTS)
├── chats/              # Session logs and history
├── desktop.py          # Main GUI Entry Point (Tkinter/Web Hybrid)
├── .env                # API Keys and Environment Config
└── requirements.txt    # Python dependencies

🚦 Quick Start Guide

1. Prerequisites

  • Python 3.10+ (3.11 recommended)
  • Ollama (Optional, for local LLM support)
  • FFmpeg (For Voice/Whisper support)

2. Installation

# Clone the repository
git clone https://github.com/LogicalGagan/NEXUS.git
cd NEXUS

# Create a virtual environment
python -m venv venv
venv\Scripts\activate  # Windows; on macOS/Linux: source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browser engines
playwright install chromium

3. Configuration

Nexus can run entirely on local models through Ollama, or connect to cloud models for more demanding tasks.

Create a .env file in the root directory:

# Cloud Intelligence (Optional - Recommended for complex tasks)
GEMINI_API_KEY=your_key_here
GROK_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here

# HuggingFace (Required for Vision fallback)
HF_TOKEN=your_huggingface_token
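Since all cloud keys are optional, the application needs some way to tell which providers are usable. A minimal sketch, assuming the `.env` values have already been loaded into the process environment (for example with python-dotenv, which is an assumption here); `PROVIDER_KEYS` and `configured_providers` are hypothetical names.

```python
import os

# Environment variable expected for each cloud provider, as listed in the .env above.
PROVIDER_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "grok": "GROK_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key is set to something other than the placeholder."""
    env = os.environ if env is None else env
    found = []
    for provider, var in PROVIDER_KEYS.items():
        value = env.get(var, "").strip()
        if value and value != "your_key_here":
            found.append(provider)
    return found
```

Treating the literal `your_key_here` as unset avoids failed API calls when the template was copied without being filled in.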

🧠 Intelligence Scaling

Nexus's capability is directly tied to the "Brain" (LLM) you select in the UI dropdown:

| Model Type | Recommended | Performance | Capability |
| --- | --- | --- | --- |
| Local (Fast) | qwen2.5:3b | ⚡ Lightning Fast | Great for simple file tasks and basic navigation. |
| Local (Heavy) | llama3.1:8b | 🐢 Slower | More reliable logic, better at following complex instructions. |
| Cloud (Pro) | Gemini 1.5 Pro | ☁️ API Dependent | The Ultimate Nexus Experience. Exceptional at self-healing and complex browser navigation. |
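To check that a local model from the table responds before selecting it in the UI, a request can be sent straight to Ollama's REST API. This sketch assumes Ollama is running on its default port (11434); `build_request` and `ask_local` are hypothetical helpers and do not reflect how Nexus itself talks to Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model, prompt):
    """Build a non-streaming generate request for the given model tag."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(model, prompt, timeout=120):
    """Send the prompt to a local Ollama model and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

For example, `ask_local("qwen2.5:3b", "Say hello")` should return a short reply once the model has been pulled with `ollama pull qwen2.5:3b`.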

🏗️ Technical Architecture

  • The Brain: core/intent_router.py — Handles planning and tool selection.
  • The Eyes: core/vision.py — Local Moondream2 model for visual reasoning.
  • The Ears: voice/voice_engine.py — Faster-Whisper for instant voice commands.
  • The Hands: os_tools/ — A library of 50+ tools for Browser, Excel, Outlook, and PowerShell.
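The split between the Brain (planning) and the Hands (tools) can be illustrated with a small sketch in which the LLM returns a JSON plan and each step is dispatched to a tool. The plan format (`tool` and `args` fields) and the function names are assumptions made for illustration; the real format used by `core/intent_router.py` is not shown in this README.

```python
import json


def parse_plan(llm_output: str):
    """Parse LLM output into a list of (tool_name, args) steps.
    Raises ValueError when the output is not a well-formed plan."""
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("plan must be a JSON list of steps")
    steps = []
    for i, step in enumerate(data):
        if not isinstance(step, dict) or "tool" not in step:
            raise ValueError(f"step {i} has no 'tool' field")
        steps.append((step["tool"], step.get("args", {})))
    return steps


def run_plan(steps, tools):
    """Execute each step in order against a name -> function mapping."""
    results = []
    for name, args in steps:
        if name not in tools:
            raise KeyError(f"unknown tool: {name}")
        results.append(tools[name](**args))
    return results
```

Validating the plan before running anything keeps a malformed LLM response from executing half of a task.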

🏃 Running Nexus

python desktop.py
  • Hotkey: Hold Right Ctrl to speak to Nexus (Jarvis Mode).
  • Self-Healing: If a task fails, watch the bottom status bar; Nexus will automatically analyze the screen and retry.
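The self-healing behavior described above (fail, look at the screen, rewrite the plan, retry) can be sketched as a bounded retry loop. All names here are hypothetical: `execute`, `describe_screen`, and `replan` stand in for the tool runner, the vision engine, and the LLM planner, and the retry limit is an assumed default.

```python
def run_with_healing(step, execute, describe_screen, replan, max_attempts=3):
    """Run a step; on failure, describe the screen, ask for a revised step, and retry."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return execute(step)
        except Exception as exc:
            last_error = exc
            if attempt == max_attempts:
                break
            # Feed the error text and a description of the screen back to the
            # planner so it can propose a different step.
            observation = describe_screen()
            step = replan(step, str(exc), observation)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Capping the number of attempts matters: without a limit, a task that cannot succeed would keep the agent clicking and replanning indefinitely.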

Built with ❤️ by Antigravity
