Nexus is a powerful, self-healing AI agent designed for advanced Windows automation. It combines Local Vision (Moondream2), Speech-to-Text (Whisper), and Deep Windows Integration (COM/PowerShell) to act as a truly autonomous digital assistant.
- Autonomous Self-Healing: If an action fails, Nexus "looks" at the screen using its vision engine, analyzes the error, and automatically rewrites its plan to succeed.
- Multimodal Control: Talk to Nexus (Voice), type to it (Chat), or let it observe your screen (Vision).
- Human Behavior Engine: Natural mouse movements, scrolling, and "Visual Clicking" (finding buttons by their appearance rather than just code).
- Native Office Automation: Read/write Excel files and send Outlook emails in the background using native Windows COM services.
- Multi-LLM Support: Dynamically switch between local models (Ollama/Qwen2.5) and cloud models (Gemini, Grok, OpenAI).
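The self-healing behaviour described in the feature list can be pictured as a retry loop that consults the vision engine between attempts. The sketch below is illustrative only: `run_with_recovery`, `RecoveryResult`, and the three callables are hypothetical names, not the project's actual API, and the vision model and LLM are passed in as plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RecoveryResult:
    succeeded: bool
    attempts: int
    observations: List[str] = field(default_factory=list)


def run_with_recovery(
    plan: str,
    execute: Callable[[str], None],
    describe_screen: Callable[[], str],
    replan: Callable[[str, str, str], str],
    max_attempts: int = 3,
) -> RecoveryResult:
    """Try a plan; on failure, look at the screen and rewrite the plan."""
    observations: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            execute(plan)
            return RecoveryResult(True, attempt, observations)
        except Exception as exc:
            # Hypothetical hook: ask the vision engine what is on screen.
            seen = describe_screen()
            observations.append(seen)
            # Hypothetical hook: ask the LLM for a revised plan, given the
            # failed plan, the error text, and the screen description.
            plan = replan(plan, str(exc), seen)
    return RecoveryResult(False, max_attempts, observations)
```

Capping the number of attempts keeps a persistently failing task from looping forever; the collected observations can be surfaced in the UI or written to the session log.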
Established the core GUIAgent architecture and tool registry. Integrated basic file system and process management tools.
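A tool registry of the kind mentioned above is often built as a name-to-function map filled in by a decorator. The following is a minimal sketch under that assumption; `ToolRegistry`, `path_exists`, and `current_pid` are hypothetical names chosen for illustration and do not come from the Nexus codebase.

```python
import os
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Maps tool names to plain Python callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            if name in self._tools:
                raise ValueError(f"tool already registered: {name}")
            self._tools[name] = fn
            return fn
        return decorator

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

    def names(self) -> List[str]:
        return sorted(self._tools)


registry = ToolRegistry()


@registry.register("path_exists")
def path_exists(path: str) -> bool:
    # Example file system tool: report whether a path exists.
    return os.path.exists(path)


@registry.register("current_pid")
def current_pid() -> int:
    # Example process tool: return this process's ID.
    return os.getpid()
```

With this shape, a planner only needs to emit a tool name and keyword arguments, for example `registry.call("path_exists", path=".")`.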
Implemented a robust Playwright-based browser automation suite. Enabled Nexus to search the web, navigate sites, and interact with web elements using placeholder-based selectors.
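Placeholder-based selection maps directly onto Playwright's `get_by_placeholder` locator. The sketch below shows one way it might look; `fill_by_placeholder` and `demo` are hypothetical helpers, and any URL or placeholder text passed to them is the caller's assumption about the target page.

```python
def fill_by_placeholder(page, placeholder: str, text: str, submit: bool = True) -> None:
    """Find an input by its placeholder text, type into it, optionally submit."""
    field = page.get_by_placeholder(placeholder)
    field.fill(text)
    if submit:
        field.press("Enter")


def demo(url: str, placeholder: str, query: str) -> str:
    """Open a page, fill the matching input, and return the resulting title."""
    # Imported lazily so the helper above can be used without a browser
    # installed (requires `pip install playwright` and
    # `playwright install chromium`).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        fill_by_placeholder(page, placeholder, query)
        page.wait_for_load_state()
        title = page.title()
        browser.close()
        return title
```

Selecting by placeholder text tends to survive markup changes better than CSS IDs or class names, since it relies on what the user sees in the input field.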
Added a real-time voice interface using Faster-Whisper for speech recognition and pyttsx3 for offline text-to-speech. Introduced a global PTT (Push-To-Talk) hotkey for hands-free control.
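The push-to-talk flow can be modelled as a small state machine: buffer audio while the hotkey is held, then transcribe on release. This is a sketch only; `PushToTalk` and its methods are hypothetical names, and the `transcribe` callable stands in for whatever wraps the speech recognizer in the real project.

```python
from typing import Callable, List


class PushToTalk:
    """Collects audio chunks while a key is held, transcribes on release."""

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        self._transcribe = transcribe
        self._chunks: List[bytes] = []
        self._recording = False

    def key_down(self) -> None:
        # Start a fresh recording; repeated key-down events are ignored.
        if not self._recording:
            self._chunks.clear()
            self._recording = True

    def feed(self, chunk: bytes) -> None:
        # Audio arriving while the key is up is discarded.
        if self._recording:
            self._chunks.append(chunk)

    def key_up(self) -> str:
        # Stop recording and return the transcript ("" if nothing was captured).
        if not self._recording:
            return ""
        self._recording = False
        audio = b"".join(self._chunks)
        return self._transcribe(audio) if audio else ""
```

Keeping the hotkey handling separate from the recognizer makes it possible to exercise the logic with a stub transcriber, without a microphone or a model download.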
Integrated Moondream2, a local vision-language model. This gave Nexus "eyes," allowing it to describe the screen and troubleshoot browser failures visually.
Finalized deep Windows integration:
- Native COM: Direct Excel and Outlook control.
- System API: Native Volume and Brightness management.
- Human Engine: Autonomous scrolling, coordinate-based clicking, and the vision-driven recovery loop.
DESKTOP_AGENT/
├── core/ # The Brain (Router, LLM Clients, Vision Engine)
├── os_tools/ # The Hands (Browser, Excel, Outlook, Power Tools)
├── voice/ # The Ears/Voice (Whisper STT, pyttsx3 TTS)
├── chats/ # Session logs and history
├── desktop.py # Main GUI Entry Point (Tkinter/Web Hybrid)
├── .env # API Keys and Environment Config
└── requirements.txt # Python dependencies
- Python 3.10+ (3.11 recommended)
- Ollama (Optional, for local LLM support)
- FFmpeg (For Voice/Whisper support)
# Clone the repository
git clone https://github.com/your-username/nexus-agent.git
cd nexus-agent
# Create a virtual environment
python -m venv venv
venv\Scripts\activate # On macOS/Linux: source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Install Playwright browser engines
playwright install chromium
Nexus is highly flexible. You can run it 100% locally or connect it to high-end cloud models.
Create a .env file in the root directory:
# Cloud Intelligence (Optional - Recommended for complex tasks)
GEMINI_API_KEY=your_key_here
GROK_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
# HuggingFace (Required for Vision fallback)
HF_TOKEN=your_huggingface_token
Nexus's capability is directly tied to the "Brain" (LLM) you select in the UI dropdown:
| Model Type | Recommended | Performance | Capability |
|---|---|---|---|
| Local (Fast) | qwen2.5:3b | ⚡ Lightning Fast | Great for simple file tasks and basic navigation. |
| Local (Heavy) | llama3.1:8b | 🐢 Slower | More reliable logic, better at following complex instructions. |
| Cloud (Pro) | Gemini 1.5 Pro | ☁️ API Dependent | The Ultimate Nexus Experience. Exceptional at self-healing and complex browser navigation. |
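One way to connect the `.env` keys to the model choice is to fall back to a local model whenever the selected cloud provider has no key configured. The sketch below assumes that behaviour; `pick_backend`, `CLOUD_KEY_VARS`, and `LOCAL_FALLBACK` are hypothetical names, and the real UI may resolve the choice differently.

```python
import os
from typing import Mapping, Optional, Tuple

# Environment variable names taken from the .env example.
CLOUD_KEY_VARS = {
    "gemini": "GEMINI_API_KEY",
    "grok": "GROK_API_KEY",
    "openai": "OPENAI_API_KEY",
}
# Assumed fallback: the "Local (Fast)" model from the table.
LOCAL_FALLBACK = "qwen2.5:3b"


def pick_backend(preferred: str, env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return ("cloud", provider) or ("local", model) for the requested choice."""
    env = os.environ if env is None else env
    key_var = CLOUD_KEY_VARS.get(preferred.lower())
    if key_var is None:
        # Not a known cloud provider: treat the name as a local model tag.
        return ("local", preferred)
    if env.get(key_var, "").strip():
        return ("cloud", preferred.lower())
    # Cloud provider requested but no key is set: degrade to the local model.
    return ("local", LOCAL_FALLBACK)
```

Note that this check only tests whether a key is present; a placeholder such as `your_key_here` would still count as set and fail later at request time.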
- The Brain (core/intent_router.py): Handles planning and tool selection.
- The Eyes (core/vision.py): Local Moondream2 model for visual reasoning.
- The Ears (voice/voice_engine.py): Faster-Whisper for instant voice commands.
- The Hands (os_tools/): A library of 50+ tools for Browser, Excel, Outlook, and PowerShell.
python desktop.py
- Hotkey: Hold Right Ctrl to speak to Nexus (Jarvis Mode).
- Self-Healing: If a task fails, watch the bottom status bar; Nexus will automatically analyze the screen and retry.
Built with ❤️ by Antigravity