This document contains the full technical documentation for Off Grid. For a quick overview, see the README.
| Feature | Android | iOS |
|---|---|---|
| Text Generation (GGUF) | llama.cpp (CPU + OpenCL GPU) | llama.cpp (CPU + Metal) |
| Vision AI | llama.rn multimodal | llama.rn multimodal |
| Image Generation | local-dream (MNN/QNN) | Core ML (ANE + CPU) |
| Voice Transcription | whisper.cpp | whisper.cpp |
| PDF Extraction | PdfRenderer (Kotlin) | PDFKit (Swift) |
| Document Viewer | Intent.ACTION_VIEW | QuickLook |
| Tool Calling | llama.rn (model-dependent) | llama.rn (model-dependent) |
| Background Downloads | Native DownloadManager | RNFS / URLSession |
Multi-model LLM inference using llama.cpp compiled for ARM64 Android via llama.rn native bindings. Supports any GGUF-format model compatible with llama.cpp — downloaded from HuggingFace or imported directly from device storage:
- Streaming inference with real-time token callbacks
- OpenCL GPU offloading on Qualcomm Adreno GPUs (experimental, optional)
- Context window management with automatic truncation and continuation
- Performance instrumentation - Tracks tok/s (overall and decode-only), TTFT, token count
- Custom system prompts via project-based conversation contexts
- KV cache management with manual clear capability for memory optimization
- Configurable KV cache type — Choose between f16 (default), q8_0, and q4_0 quantization for memory/quality tradeoffs
- Flash attention — User-configurable toggle for faster inference (automatically disabled when GPU layers > 0 on Android due to llama.cpp compatibility)
- Markdown rendering — Rich text output with syntax highlighting, lists, and formatting
- llama.cpp parameter constraints — Automatic enforcement of GPU/flash attention/KV cache compatibility to prevent native crashes (SIGSEGV, SIGABRT)
Implementation:
llmService(src/services/llm.ts) wraps llama.rn for model lifecycle and inferencegenerationService(src/services/generationService.ts) provides background-safe orchestrationactiveModelService(src/services/activeModelService.ts) singleton ensures safe model loading/unloading- State managed via Zustand stores (
src/stores/) with AsyncStorage persistence
GPU Acceleration: llama.cpp's OpenCL backend enables GPU offloading on Qualcomm Adreno GPUs. Configurable layer count (0-99) determines CPU/GPU split. Automatic fallback to CPU-only if OpenCL initialization fails. CPU inference uses ARM NEON, i8mm, and dotprod SIMD instructions.
Multimodal understanding via vision-language models (VLMs) with automatic mmproj (multimodal projector) handling:
- Automatic mmproj detection - Vision models automatically download required mmproj companion files
- Combined asset tracking - Model size estimates include mmproj overhead
- Runtime mmproj discovery - If mmproj wasn't linked during download, searches model directory on load
- Camera and photo library integration - React Native Image Picker for image capture/selection
- OpenAI-compatible message format - Uses llama.rn's OAI message structure for vision inference
Implementation:
- mmproj files loaded via
llmService.initMultimodal() - Image URIs converted to
file://paths and passed in OAI message format - Vision models tracked separately with
isVisionModelflag and combined size calculation - SmolVLM 500M-2.2B recommended (fast, 7-10s inference); Qwen models support multilingual vision
Supported Vision Models:
- SmolVLM (500M, 2.2B) - Fast, compact, 7-10s inference on flagship devices
- Qwen3-VL (2B, 8B) - Vision-language with thinking mode, excellent multilingual understanding
- Gemma 3n E4B - Vision + audio, built for mobile with selective activation
- LLaVA - Large Language and Vision Assistant
- MiniCPM-V - Efficient multimodal
On-device Stable Diffusion with platform-native acceleration:
Android:
- MNN backend - Alibaba's MNN framework, works on all ARM64 devices (CPU-only)
- QNN backend - Qualcomm AI Engine (NPU acceleration) for Snapdragon 8 Gen 1+
- Automatic backend detection - Runtime NPU detection with MNN fallback
- Models from xororz's HuggingFace repos (pre-converted MNN/QNN formats)
iOS:
- Core ML backend - Apple's ml-stable-diffusion pipeline with Neural Engine (ANE) acceleration
- DPM-Solver scheduler - Faster convergence, better quality at fewer steps
- Safety checker disabled - Reduced latency (no NSFW classification overhead)
- Palettized models - 6-bit quantized (~1GB) for memory-constrained devices
- Full precision models - fp16 (~4GB) for maximum speed on ANE
- Models from Apple's official HuggingFace repos (compiled Core ML format)
Common:
- Real-time preview - Progressive image display every N steps
- Background generation - Lifecycle-independent service continues when screens unmount
- AI prompt enhancement - Optional LLM-based prompt expansion using loaded text model
Technical Pipeline:
Text Prompt → CLIP Tokenizer → Text Encoder (embeddings)
→ Scheduler (DPM-Solver/Euler) ↔ UNet (denoising, iterative)
→ VAE Decoder → 512×512 Image
Implementation:
localDreamGeneratorService(src/services/localDreamGenerator.ts) bridges to native viaPlatform.select()imageGenerationService(src/services/imageGenerationService.ts) provides orchestration- Android:
LocalDreamModulewraps local-dream C++ lib (MNN/QNN) - iOS:
CoreMLDiffusionModulewraps Apple'sStableDiffusionPipeline(ANE/CPU) - Progress callbacks, preview callbacks, and completion callbacks flow through singleton service
- Gallery persistence via AsyncStorage with automatic cleanup on conversation deletion
Prompt Enhancement: When enabled, uses the currently loaded text model to expand simple prompts into detailed descriptions:
// User input: "Draw a dog"
// LLM enhancement system prompt guides model to add:
// - Artistic style descriptors
// - Lighting and composition details
// - Quality modifiers
// - Concrete visual details
// Result: ~75-word enhanced prompt
// "A golden retriever with soft, fluffy fur, sitting gracefully..."Implementation uses separate message array with enhancement-specific system prompt, calls llmService.generateResponse(), then explicitly resets LLM state (stopGeneration() only, no KV cache clear to preserve vision inference performance).
Image Models (Android):
- CPU (MNN): 5 models (~1.2GB each) - Anything V5, Absolute Reality, QteaMix, ChilloutMix, CuteYukiMix
- NPU (QNN): 20 models (~1.0GB each) - all CPU models plus DreamShaper, Realistic Vision, MajicmixRealistic, etc.
- QNN variants:
min(non-flagship),8gen1,8gen2(8 Gen 2/3/4/5)
Image Models (iOS - Core ML):
- SD 1.5 Palettized (~1GB) - 6-bit quantized, 512×512
- SD 2.1 Palettized (~1GB) - 6-bit quantized, 512×512
- SDXL iOS (~2GB) - 4-bit mixed-bit palettized, 768×768, ANE-optimized
- SD 1.5 Full (~4GB) - fp16, 512×512, fastest on ANE
- SD 2.1 Base Full (~4GB) - fp16, 512×512, fastest on ANE
Generation Performance:
- Android CPU (MNN): ~15s for 512×512 @ 20 steps (Snapdragon 8 Gen 3)
- Android NPU (QNN): ~5-10s for 512×512 @ 20 steps (chipset-dependent)
- iOS ANE (Core ML): ~8-15s for 512×512 @ 20 steps (A17 Pro/M-series), palettized models ~2x slower due to dequantization
On-device speech recognition using whisper.cpp via whisper.rn native bindings:
- Multiple Whisper models - Tiny, Base, Small (speed vs accuracy tradeoff)
- Real-time partial transcription - Streaming word-by-word results
- Hold-to-record interface - Slide-to-cancel gesture support
- No network - All transcription happens on-device
- Auto-download - Whisper models downloaded on first use
- Language support - Multilingual transcription
Implementation:
- whisper.rn native module handles audio recording and inference
- Transcription results passed via callbacks to React Native
- Audio temporarily buffered in native code, cleared after transcription
- Model selection in settings (Tiny: fastest, Base: balanced, Small: most accurate)
Attach documents to chat messages for context-aware conversations. Documents are parsed and included in the LLM context window alongside the user's message.
Supported Formats:
- Text files -
.txt,.md,.log - Code files -
.py,.js,.ts,.jsx,.tsx,.java,.c,.cpp,.h,.swift,.kt,.go,.rs,.rb,.php,.sql,.sh - Data files -
.csv,.json,.xml,.yaml,.yml,.toml,.ini,.cfg,.conf,.html - PDF documents - Native text extraction via platform-specific modules
Features:
- File picker integration - Uses
@react-native-documents/pickerfor native file selection - PDF text extraction - Native modules extract text from PDFs on both platforms
- Persistent storage - Attached files are copied to persistent app storage so they survive temp file cleanup
- Tappable document badges - Tap any attached document in chat to open it with the system viewer (QuickLook on iOS, Intent viewer on Android)
- Paste-as-attachment - Large pasted text can be attached as a document
- File size limit - 5MB maximum, with text content truncated to 50K characters for context window management
PDF Extraction Implementation:
- Android:
PdfExtractorModule(Kotlin) uses Android's nativePdfRendererAPI - iOS:
PDFExtractorModule(Swift) uses Apple'sPDFKitframework - Both extract page-by-page text content with page separators
- Graceful fallback when PDF extraction is unavailable
Document Viewer:
- Uses
@react-native-documents/viewerfor cross-platform document opening - iOS: Opens in QuickLook preview
- Android: Opens with
Intent.ACTION_VIEWusing the appropriate system app
On-device function calling for models that support it. The system automatically detects tool calling capability from the model's jinja chat template at load time.
Available Tools:
- Web Search — Scrapes Brave Search for top 5 results (requires network). Clickable URLs in results open in the system browser.
- Calculator — Safe recursive descent parser supporting
+, -, *, /, %, ^, (). Noeval(). - Date/Time — Returns formatted date/time with optional timezone support.
- Device Info — Battery level, storage usage, and memory stats via
react-native-device-info.
Tool Loop:
- Max 3 iterations per generation (prevents runaway loops)
- Max 5 total tool calls across all iterations
- Flow: LLM generates → tool calls parsed → tools executed → results injected → LLM continues
- Supports both structured tool calls (via llama.rn) and fallback XML tag parsing (
<tool_call>) for smaller models - Empty web search queries fall back to the user's last message
Capability Gating:
- Tool calling support detected at model load time via jinja chat template introspection
- Tools button disabled in UI when model doesn't support function calling
- Tool system prompt hint only injected when model supports tools AND user has tools enabled
- User can configure which tools are enabled in settings (default: calculator + date/time)
Implementation:
src/services/tools/registry.ts— Tool definitions and OpenAI schema conversionsrc/services/tools/handlers.ts— Tool execution dispatchsrc/services/generationToolLoop.ts— Multi-turn tool loop orchestrationsrc/services/llmToolGeneration.ts— Low-level tool-aware LLM generationllmService.supportsToolCalling()— Runtime capability check
Send messages while the LLM is still generating a response. Messages are queued and processed automatically after the current generation completes.
Features:
- Non-blocking input - Send button stays active during generation, alongside a visible stop button
- Queue indicator - Shows count and preview of queued messages in the toolbar
- Clear queue - Tap the "x" on the queue indicator to discard all queued messages
- Aggregated processing - When generation completes, all queued messages are combined into a single prompt and processed together
- Image bypass - Image generation requests skip the queue and process immediately via the separate image generation service
- Stop + Send side-by-side - Both stop and send buttons visible in the input area during active generation
Implementation:
- Queue lives in
generationService.tsas transient state (not persisted across app restarts) - User messages are added to chat only when the queue processor picks them up, preserving correct chronology
- ChatScreen registers a
queueProcessorcallback; when generation resets and the queue has items, the service calls this callback - Multiple queued messages are aggregated: texts joined with
\n\n, attachments combined
Auto Detection:
- AI automatically classifies user intent (text response vs image generation)
- Uses pattern matching or LLM-based classification (configurable)
- Pattern mode: Fast keyword matching ("draw", "generate", "create image")
- LLM mode: More accurate, uses loaded text model for classification
Manual Override:
- Toggle button forces image generation mode
- Useful when auto-detection misclassifies
- Stays active for current message only
Settings Configuration:
// Image generation mode setting
imageGenerationMode: 'off' | 'auto' | 'manual'
// Auto-detection method
autoDetectMethod: 'pattern' | 'llm'Passphrase Lock:
- Protect conversations with passphrase
- App-level security layer on top of OS encryption
- Locks on app backgrounding (configurable timeout)
- Biometric unlock (planned)
Implementation:
- Passphrase stored securely in Android Keystore
- Conversations encrypted with AES-256
- No biometric data stored on device
Text Model Settings:
- Temperature (0-2.0) - Creativity control
- Max tokens (64-4096) - Response length limit
- Top-p (0.1-1.0) - Nucleus sampling threshold
- Repeat penalty (1.0-2.0) - Repetition reduction
- Context length (512-8192) - Conversation memory window
- CPU threads (1-12) - Performance tuning
- Batch size (32-512) - Processing chunk size
- GPU layers (0-99) - GPU offload configuration
- Enable GPU (on/off) - Toggle GPU acceleration
- Flash attention (on/off) - Faster inference (auto-disabled with GPU layers > 0 on Android)
- KV cache type (f16/q8_0/q4_0) - Cache quantization for memory/quality tradeoff
- Enabled tools - Configure which tools are available for tool-calling models
Image Model Settings:
- Steps (4-50) - Quality vs speed (default: 20)
- Guidance scale (1-20) - Prompt adherence (default: 7.5)
- Seed (random/fixed) - Reproducibility control
- Resolution (256x256-512x512) - Output size
- Preview interval (1-10) - Real-time preview frequency
- Enhance prompts (on/off) - AI prompt enhancement toggle
- Detection method (pattern/LLM) - Intent classification mode
- Threads (1-8) - CPU thread count for image generation
Global Configuration: All settings are global and apply to every model. Settings persist across app restarts and affect all models uniformly.
Pre-Load Memory Checks: Before loading any model, system calculates:
// Text models
requiredRAM = fileSize * 1.5 // KV cache, activations
// Vision models
requiredRAM = (modelFileSize + mmProjSize) * 1.5
// Image models
requiredRAM = fileSize * 1.8 // MNN/QNN runtime overheadMemory Budget:
- Uses 60% of device RAM as safe limit
- Warns at 50% (yellow warning UI)
- Blocks at 60%+ (red error, prevents load)
User-Friendly Messages:
"Cannot load Qwen3-7B-Q4_K_M (~5.5GB required) - would exceed
device safe limit of 4.8GB. Unload current model or choose smaller."
CPU Threads:
- More threads = faster inference (to a point)
- Optimal: 4-6 threads on most devices
- Flagship: 6-8 threads
- Diminishing returns beyond 8
Batch Size:
- Smaller (32-128): Faster first token
- Larger (256-512): Better throughput
- Default 256: Balanced for mobile
Context Length:
- Longer context = more memory + slower
- Automatic truncation when exceeded
- Recommendations:
- 512: Short conversations
- 2048: Standard (default)
- 4096-8192: Long conversations (requires 8GB+ RAM)
GPU Offloading:
- Specify number of layers to offload (0-99)
- More layers = faster (if stable)
- OpenCL backend experimental (can crash)
- Start with 0, incrementally increase
- Automatic fallback to CPU if fails
┌──────────────────────────────────────────────────────────────────┐
│ React Native UI Layer │
│ (Brutalist Design System - TypeScript/TSX) │
├──────────────────────────────────────────────────────────────────┤
│ TypeScript Services Layer │
│ │
│ Core Services (background-safe singletons): │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐│
│ │ llmService │ │ whisperService │ │ hardware ││
│ │ (llama.rn) │ │ (whisper.rn) │ │ (RAM/CPU info) ││
│ └─────────────────┘ └─────────────────┘ └─────────────────┘│
│ │
│ Orchestration Services (lifecycle-independent): │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ generationService │ │ imageGenerationService│ │
│ │ (text, background) │ │ (images, background) │ │
│ └───────────────────────┘ └───────────────────────┘ │
│ │
│ Management Services: │
│ ┌────────────────────┐ ┌────────────────────┐ │
│ │activeModelService │ │ modelManager │ │
│ │(singleton, mem mgmt)│ │(download, storage) │ │
│ └────────────────────┘ └────────────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ Native Module Bridge (JNI / ObjC) │
├──────────────────────────────────────────────────────────────────┤
│ Native Implementations: │
│ │
│ Cross-platform: │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ llama.rn │ │ whisper.rn │ │
│ │ (C++ native) │ │ (C++ native) │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Android: iOS: │
│ ┌──────────┐ ┌───────────────┐ ┌──────────────────────┐ │
│ │local-dream│ │DownloadManager│ │CoreMLDiffusionModule │ │
│ │(C++/MNN) │ │ (Kotlin) │ │(StableDiffusionPipe) │ │
│ └──────────┘ └───────────────┘ └──────────────────────┘ │
├──────────────────────────────────────────────────────────────────┤
│ Hardware Acceleration: │
│ Android: iOS: │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │OpenCL (Adreno GPU)│ │ ANE (Neural Engine)│ │
│ │ Text LLMs only │ │ Image gen + LLMs │ │
│ ├──────────────────┤ ├──────────────────┤ │
│ │ QNN (NPU) │ │ Metal (GPU) │ │
│ │ Image gen only │ │ LLM inference │ │
│ └──────────────────┘ └──────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
1. Singleton Services
All core services (llmService, activeModelService, generationService, imageGenerationService) are singleton instances to prevent:
- Duplicate model loading
- Concurrent inference conflicts
- Memory leaks from orphaned contexts
- State desynchronization
Example from activeModelService.ts:
class ActiveModelService {
private loadedTextModelId: string | null = null;
private textLoadPromise: Promise<void> | null = null;
async loadTextModel(modelId: string) {
// Guard against concurrent loads
if (this.textLoadPromise) {
await this.textLoadPromise;
if (this.loadedTextModelId === modelId) return;
}
// ... load logic
}
}
export const activeModelService = new ActiveModelService();2. Background-Safe Orchestration
generationService and imageGenerationService maintain state independently of React component lifecycle:
class GenerationService {
private state: GenerationState = { isGenerating: false, ... };
private listeners: Set<GenerationListener> = new Set();
subscribe(listener: GenerationListener): () => void {
this.listeners.add(listener);
listener(this.getState()); // Immediate state delivery
return () => this.listeners.delete(listener);
}
private notifyListeners(): void {
const state = this.getState();
this.listeners.forEach(listener => listener(state));
}
}Screens subscribe on mount, get current state immediately, continue receiving updates until unmount. Generation continues regardless of UI state.
3. Memory-First Loading
All model loads check available RAM before proceeding:
async loadTextModel(modelId: string) {
const model = store.downloadedModels.find(m => m.id === modelId);
// Estimate: file size × 1.5 for text (KV cache overhead)
const estimatedRAM = (model.fileSize / (1024**3)) * 1.5;
// Check against device RAM budget (60% of total)
const deviceRAM = await hardware.getDeviceInfo();
const budget = (deviceRAM.totalMemory / (1024**3)) * 0.6;
if (estimatedRAM > budget) {
throw new Error('Insufficient RAM');
}
await llmService.loadModel(model.filePath);
}Vision models add mmproj overhead, image models multiply by 1.8x for MNN/QNN runtime.
4. Combined Asset Tracking
Vision models track both main GGUF and mmproj as single logical unit:
interface DownloadedModel {
id: string;
filePath: string;
fileSize: number;
// Vision-specific
mmProjPath?: string;
mmProjFileSize?: number;
isVisionModel: boolean;
}
// Total size calculation
const totalSize = model.fileSize + (model.mmProjFileSize || 0);
const totalRAM = totalSize * 1.5; // Both contribute to RAM estimate5. Aggressive State Cleanup
After prompt enhancement (which uses llmService), explicit cleanup ensures text generation doesn't hang:
// After enhancement completes
await llmService.stopGeneration(); // Clear generating flag
// Note: KV cache NOT cleared to preserve vision inference speedVision inference can be 30-60s slower if KV cache is cleared after every enhancement.
Application state managed via Zustand with AsyncStorage persistence:
appStore (src/stores/appStore.ts):
- Downloaded models (text, image, Whisper)
- Active model IDs
- Settings (temperature, context length, GPU config, image gen params)
- Hardware info (RAM, available memory)
- Gallery (generated images metadata)
- Background generation state (progress, status, preview path)
chatStore (src/stores/chatStore.ts):
- Conversations and messages
- Streaming state (current streaming message)
- Message operations (add, update, delete, edit)
projectStore (src/stores/projectStore.ts):
- Projects (custom system prompts)
- Active project selection
authStore (src/stores/authStore.ts):
- Passphrase lock state
- Authentication status
whisperStore (src/stores/whisperStore.ts):
- Whisper model state and selection
- Transcription configuration
Persistence:
const useAppStore = create<AppStore>()(
persist(
(set, get) => ({
// state and actions
}),
{
name: 'app-storage',
storage: createJSONStorage(() => AsyncStorage),
}
)
);All stores automatically persist to AsyncStorage on state changes, rehydrate on app launch.
Services update stores for UI reactivity:
// imageGenerationService updates appStore during generation
private updateState(partial: Partial<ImageGenerationState>): void {
this.state = { ...this.state, ...partial };
this.notifyListeners();
const appStore = useAppStore.getState();
if ('isGenerating' in partial) {
appStore.setIsGeneratingImage(this.state.isGenerating);
}
if ('progress' in partial) {
appStore.setImageGenerationProgress(this.state.progress);
}
}UI components read from stores, services write to stores. Unidirectional data flow.
Native Android DownloadManager handles model downloads:
Implementation (android/app/src/main/java/ai/offgridmobile/download/DownloadManagerModule.kt):
class DownloadManagerModule(reactContext: ReactApplicationContext) :
ReactContextBaseJavaModule(reactContext) {
fun downloadFile(url: String, fileName: String, modelId: String) {
val request = DownloadManager.Request(Uri.parse(url))
.setTitle(fileName)
.setNotificationVisibility(DownloadManager.Request.VISIBILITY_VISIBLE)
.setDestinationInExternalFilesDir(context, "models", fileName)
val downloadId = downloadManager.enqueue(request)
// Poll download progress
monitorDownload(downloadId, modelId)
}
}Downloads continue even when app is backgrounded or killed. Native notifications show progress. React Native polls for updates via BroadcastReceiver.
Race Condition Fix:
On slow emulators, download completion notification could arrive before React Native received DownloadComplete event. Fixed by tracking event delivery separately:
// Track event delivery separately from completion status
private data class DownloadInfo(
val downloadId: Long,
val modelId: String,
var completedEventSent: Boolean = false // New field
)
// Only send event if not already sent
if (!info.completedEventSent) {
sendDownloadCompleteEvent(modelId)
info.completedEventSent = true
}imageGenerationService maintains generation state independently of React component lifecycle. Native local-dream inference continues on background threads while JavaScript service layer notifies any mounted subscribers.
Flow:
- User starts generation in ChatScreen
- ChatScreen subscribes to
imageGenerationService - User navigates to HomeScreen (ChatScreen unmounts)
- Generation continues, service maintains state
- HomeScreen mounts, subscribes, immediately receives current state (progress, preview)
- User navigates back to ChatScreen
- ChatScreen re-subscribes, receives current state (may be complete)
Subscribers are weakly held, services never leak references.
Text Models:
- Hugging Face API integration (
src/services/modelManager.ts) - Curated recommended models filtered by device RAM (Qwen 3, Llama 3.2, Gemma 3, SmolLM3, Phi-4)
- Advanced filtering: organization (Qwen, Meta, Google, Microsoft, Mistral, DeepSeek, HuggingFace, NVIDIA), size category (tiny/small/medium/large), quantization level, model type (text/vision/code)
- Filters: LM Studio compatible, Official/Verified/Community credibility badges
- Automatic GGUF quantization detection
- RAM compatibility checks based on device memory
Local Model Import (Bring Your Own Model):
- Import
.gguffiles directly from device storage via native file picker - Validates file format, parses model name and quantization from filename
- Handles Android
content://URIs by copying to app storage - Progress tracking during file import
- Imported models appear alongside downloaded models in the model selector
Image Models:
- xororz HuggingFace repos (pre-converted MNN/QNN)
- Dynamic model list fetch
- Backend filtering (CPU/NPU)
- Chipset variant selection for QNN models
Features:
- Background downloads via native Android DownloadManager
- Automatic retry on network interruption
- Storage space pre-check before download
- Combined progress for vision models (shows total for GGUF + mmproj)
- Parallel downloads supported
Storage Management:
- Orphaned file detection (GGUF files not tracked in store)
- Stale download cleanup (invalid entries from interrupted downloads)
- Bulk deletion of orphaned files
- Model size breakdown with mmproj overhead included
Dynamic Memory Budget:
- Uses 60% of device RAM as budget for models
- Warns at 50% usage (yellow warning)
- Blocks at 60%+ usage (red error)
- Text models: file size x 1.5 (KV cache, activations)
- Image models: file size x 1.8 (MNN/QNN runtime, intermediate tensors)
- Vision models: text estimate + mmproj overhead
Pre-load Checks:
async checkMemoryForModel(modelId: string, modelType: 'text' | 'image') {
const deviceRAM = await hardware.getDeviceInfo();
const budget = (deviceRAM.totalMemory / (1024**3)) * 0.60;
const model = findModel(modelId, modelType);
const requiredRAM = estimateModelMemory(model, modelType);
const currentlyLoaded = getCurrentlyLoadedMemory();
const totalRequired = requiredRAM + currentlyLoaded;
if (totalRequired > budget) {
return { canLoad: false, severity: 'critical', message: '...' };
}
// ...
}Prevents OOM crashes by blocking loads that would exceed safe RAM limits.
Off Grid uses a terminal-inspired brutalist design system with full light/dark theme support. The system emphasizes information density and functional minimalism with a monochromatic palette and emerald accent.
Dynamic light/dark theming via src/theme/:
palettes.ts—COLORS_LIGHT,COLORS_DARK,SHADOWS_LIGHT,SHADOWS_DARK,createElevation()index.ts—useTheme()hook,getTheme(mode),ThemetypeuseThemedStyles.ts—useThemedStyles(createStyles)memoized style factory
Theme mode is stored in appStore.themeMode (persisted) and toggled via the Settings screen Dark Mode switch.
Usage pattern (every screen/component):
const { colors } = useTheme();
const styles = useThemedStyles(createStyles);
const createStyles = (colors: ThemeColors, shadows: ThemeShadows) => ({
container: { backgroundColor: colors.background },
card: { ...shadows.medium, backgroundColor: colors.surface },
});Token split:
- Theme-independent tokens stay in
src/constants/:TYPOGRAPHY,SPACING,FONTS - Dynamic tokens come from hooks:
colors.*,shadows.*
Powered by react-native-reanimated with spring-based physics and react-native-haptic-feedback:
AnimatedEntry— Staggered fade+slide entrance for lists and grids, with configurable delay, respectsuseReducedMotion()AnimatedListItem— Combines staggered entry animation with spring-based press scale feedbackAnimatedPressable— Spring scale-down on press with haptic feedback (selection, impact, notification types)AppSheet— Custom swipe-to-dismiss bottom sheet with spring animation, replacing React Native modalsuseFocusTrigger— Hook that replays stagger animations on every tab focus- Tab transitions — Fade animations and haptic feedback on bottom tab switches
- Modal screens —
slide_from_bottomanimation for all modal presentations
Typography (10-level scale, all Menlo monospace):
export const TYPOGRAPHY = {
display: { fontSize: 22, fontWeight: '200' },
h1: { fontSize: 24, fontWeight: '300' }, // Hero text only
h2: { fontSize: 16, fontWeight: '400' }, // Screen titles
h3: { fontSize: 13, fontWeight: '400' }, // Section headers
body: { fontSize: 14, fontWeight: '400' }, // Primary content
bodySmall: { fontSize: 13, fontWeight: '400' }, // Descriptions
label: { fontSize: 10, fontWeight: '400' }, // Uppercase labels
labelSmall: { fontSize: 9, fontWeight: '400' },
meta: { fontSize: 10, fontWeight: '300' }, // Timestamps
metaSmall: { fontSize: 9, fontWeight: '300' },
};Spacing (6-step scale):
export const SPACING = { xs: 4, sm: 8, md: 12, lg: 16, xl: 24, xxl: 32 };Scenario: User travels with no internet access, needs AI assistance for writing, research, or problem-solving.
Implementation:
- Download Qwen 3 0.6B or Llama 3.2 3B (Q4_K_M) from curated recommendations, or import your own .gguf model from device storage
- Create project with custom system prompt: "You are a helpful writing assistant..."
- Generate responses entirely on-device
- All conversations persist locally
Performance: 5-10 tok/s on mid-range devices, 15-30 tok/s on flagships.
Scenario: Artist/designer needs AI-generated images but doesn't want prompts or outputs sent to cloud services.
Implementation:
- Download Anything V5 (CPU) or DreamShaper (NPU) image model
- Enable prompt enhancement for detailed results from simple inputs
- Generate images with seed control for reproducibility
- Save to device gallery or share directly
Privacy: Zero network activity after model download. Prompts never leave device.
Scenario: User needs to analyze receipts, invoices, or documents on the go without internet.
Implementation:
- Download SmolVLM-500M (vision model, ~600MB)
- Capture document photo via camera
- Send to model with prompt: "Extract all line items and totals"
- Receive structured text response
Performance: ~7s inference on flagship devices.
Scenario: Developer needs code assistance without sharing proprietary code with cloud services.
Implementation:
- Download Qwen 3 Coder A3B or Phi-4 Mini (Q4_K_M)
- Create "Code Review" project with system prompt
- Paste code snippets, receive suggestions
- All code stays on device
Use case: Security-sensitive environments, air-gapped development, competitive advantage protection.
Scenario: Language learner practices conversations without subscription or data harvesting.
Implementation:
- Download multilingual model (Qwen3, Command-R)
- Create project: "You are a patient Spanish tutor..."
- Voice input via Whisper for pronunciation practice
- Text responses for grammar explanation
Advantages: Unlimited practice, no usage limits, complete privacy.
Performance Considerations:
- Vision inference can take 10-30+ seconds depending on model size and device
- Larger models (2B+) are slower but provide better understanding
- SmolVLM models offer fastest inference (7-15s on flagship devices)
- Comprehensive logging tracks vision inference progress for debugging
OpenCL Stability:
- OpenCL backend can crash on some Qualcomm devices
- Crash typically happens during layer offload initialization
- Automatic fallback to CPU if GPU initialization fails
- User can manually reduce GPU layers or disable entirely
Recommendation: Start with 0 GPU layers, incrementally increase while monitoring stability.
Flagship devices (Snapdragon 8 Gen 2+):
- CPU: 15-30 tok/s (4-8 threads)
- GPU (OpenCL): 20-40 tok/s (experimental, stability varies)
- TTFT: 0.5-2s depending on context length
Mid-range devices (Snapdragon 7 series):
- CPU: 5-15 tok/s
- TTFT: 1-3s
Factors:
- Model size (larger = slower)
- Quantization (lower bits = faster)
- Context length (more tokens = slower)
- Thread count (4-8 threads optimal)
- GPU layers (more = faster if stable)
SmolVLM 500M:
- Flagship: ~7s per image
- Mid-range: ~15s per image
SmolVLM 2.2B:
- Flagship: ~10-15s per image
- Mid-range: ~25-35s per image
CPU (MNN):
- 512x512, 20 steps: ~15s (Snapdragon 8 Gen 3)
- 512x512, 20 steps: ~30s (Snapdragon 7 series)
NPU (QNN):
- 512x512, 20 steps: ~5-10s (chipset-dependent)
- Requires Snapdragon 8 Gen 1+ with QNN support
GGUF quantization methods and their trade-offs:
| Quantization | Bits | Quality | 7B Size | RAM | Use Case |
|---|---|---|---|---|---|
| Q2_K | 2-3 bit | Lowest | ~2.5 GB | ~3.5 GB | Very constrained devices |
| Q3_K_M | 3-4 bit | Low-Med | ~3.3 GB | ~4.5 GB | Budget devices, testing |
| Q4_K_M | 4-5 bit | Good | ~4.0 GB | ~5.5 GB | Recommended default |
| Q5_K_M | 5-6 bit | Very Good | ~5.0 GB | ~6.5 GB | Quality-focused users |
| Q6_K | 6 bit | Excellent | ~6.0 GB | ~7.5 GB | Flagship devices |
| Q8_0 | 8 bit | Near FP16 | ~7.5 GB | ~9.0 GB | Maximum quality |
Recommendation: Q4_K_M provides best balance. Q5_K_M for quality on devices with 8GB+ RAM.
- React Native 0.83 - Cross-platform mobile framework
- TypeScript 5.x - Type safety and developer experience
- llama.rn - Native bindings for llama.cpp GGUF inference
- whisper.rn - Native bindings for whisper.cpp speech recognition
- local-dream - MNN/QNN Stable Diffusion implementation (Android)
- ml-stable-diffusion - Apple's Core ML Stable Diffusion pipeline (iOS)
- Zustand 5.x - Lightweight state management
- AsyncStorage - Persistent local storage
- React Navigation 7.x - Native navigation
- React Native Reanimated 4.x - Performant native-thread animations
- React Native Haptic Feedback - Haptic responses on interactions
- @react-native-documents/picker - Native document picker for file attachments and local model import
- @react-native-documents/viewer - Native document viewer (QuickLook / Intent.ACTION_VIEW)
llama.rn (Android + iOS):
- Compiles llama.cpp for ARM64
- Android: JNI bindings, OpenCL GPU offloading on Adreno GPUs
- iOS: Metal GPU acceleration
- Handles multimodal (vision) via mmproj
whisper.rn (Android + iOS):
- Compiles whisper.cpp for ARM64
- Real-time audio recording and transcription
- Multiple model sizes (Tiny, Base, Small, Medium)
local-dream (Android only):
- C++ implementation of Stable Diffusion
- MNN backend (CPU, all ARM64 devices)
- QNN backend (NPU, Snapdragon 8 Gen 1+)
- Automatic backend detection and fallback
CoreMLDiffusionModule (iOS only):
- Swift bridge to Apple's
ml-stable-diffusionStableDiffusionPipeline - Neural Engine (ANE) acceleration via
.cpuAndNeuralEnginecompute units - DPM-Solver multistep scheduler for faster convergence
- Safety checker disabled for reduced latency
- Supports palettized (6-bit) and full-precision (fp16) Core ML models
PdfExtractorModule (Android + iOS):
- Android: Kotlin module using
android.graphics.pdf.PdfRendererfor page-by-page text extraction - iOS: Swift module using Apple's
PDFKit(PDFDocument,PDFPage) for text extraction - Both expose
extractText(filePath)returning concatenated page text with separators - Graceful error handling when PDF is corrupted or password-protected
DownloadManager (Android only):
- Native Android DownloadManager wrapper
- Background download support
- Progress polling and event emission to React Native
- Proper cleanup and error handling
- Conversations: AsyncStorage (encrypted at OS level)
- Models: Internal app files directory
- Images: Internal app files directory + optional export to Pictures
- Settings: AsyncStorage
All data stored in app's private storage, inaccessible to other apps (Android sandboxing).
Model download only:
- Hugging Face API (model metadata)
- Hugging Face CDN (model file downloads)
- xororz HuggingFace repos (image model listings)
After model download:
- Zero network activity
- Enable airplane mode and use indefinitely
- All inference happens on-device
- Passphrase lock for sensitive conversations
- Biometric authentication (planned)
- Conversation export/import with encryption (planned)
OffgridMobile/
├── src/
│ ├── components/ # Reusable UI components
│ │ ├── AnimatedEntry.tsx # Staggered fade+slide entrance animation
│ │ ├── AnimatedListItem.tsx # Entry animation + press feedback combo
│ │ ├── AnimatedPressable.tsx # Spring scale + haptic press wrapper
│ │ ├── AppSheet.tsx # Custom swipe-to-dismiss bottom sheet
│ │ ├── ChatInput.tsx # Message input with attachments, vision/image-mode badges
│ │ ├── ChatMessage.tsx # Message bubbles with metadata
│ │ ├── DebugSheet.tsx # Developer debug bottom sheet
│ │ ├── ModelCard.tsx # Model display card (compact/full modes, icon actions)
│ │ ├── ModelSelectorModal.tsx # Quick model switcher
│ │ ├── GenerationSettingsModal.tsx # Image generation settings
│ │ ├── ProjectSelectorSheet.tsx # Project picker bottom sheet
│ │ ├── CustomAlert.tsx # Consistent alert dialogs
│ │ ├── Button.tsx # Reusable button component
│ │ ├── Card.tsx # Reusable card component
│ │ ├── ThinkingIndicator.tsx # LLM thinking animation
│ │ └── VoiceRecordButton.tsx # Voice recording button
│ ├── constants/ # Design tokens and configuration
│ │ └── index.ts # TYPOGRAPHY, SPACING, FONTS
│ ├── hooks/ # Custom React hooks
│ │ ├── useAppState.ts # App lifecycle state
│ │ ├── useFocusTrigger.ts # Focus-based animation replay trigger
│ │ ├── useVoiceRecording.ts # Voice recording logic
│ │ └── useWhisperTranscription.ts # Whisper transcription
│ ├── navigation/ # React Navigation setup
│ │ └── AppNavigator.tsx # Tab and stack navigators
│ ├── screens/ # Main app screens
│ │ ├── HomeScreen.tsx # Dashboard with model status
│ │ ├── ChatScreen.tsx # Main chat interface
│ │ ├── ChatsListScreen.tsx # Conversation list
│ │ ├── ModelsScreen.tsx # Browse and download models
│ │ ├── ModelDownloadScreen.tsx # Model download details
│ │ ├── ModelSettingsScreen.tsx # Text and image model settings
│ │ ├── GalleryScreen.tsx # Generated images gallery
│ │ ├── DownloadManagerScreen.tsx # Download tracking
│ │ ├── StorageSettingsScreen.tsx # Storage management
│ │ ├── SettingsScreen.tsx # App settings hub
│ │ ├── ProjectsScreen.tsx # Project list
│ │ ├── ProjectDetailScreen.tsx # Project detail view
│ │ ├── ProjectEditScreen.tsx # Create/edit projects
│ │ ├── SecuritySettingsScreen.tsx # Security settings
│ │ ├── PassphraseSetupScreen.tsx # Passphrase configuration
│ │ ├── LockScreen.tsx # App lock screen
│ │ ├── OnboardingScreen.tsx # First-launch onboarding
│ │ ├── DeviceInfoScreen.tsx # Device hardware info
│ │ └── VoiceSettingsScreen.tsx # Whisper settings
│ ├── services/ # Core business logic
│ │ ├── llm.ts # Text LLM inference
│ │ ├── activeModelService.ts # Model lifecycle management
│ │ ├── modelManager.ts # Download and storage
│ │ ├── generationService.ts # Text generation orchestration
│ │ ├── imageGenerationService.ts # Image generation orchestration
│ │ ├── localDreamGenerator.ts # local-dream bridge
│ │ ├── imageGenerator.ts # Image generation utilities
│ │ ├── hardware.ts # Device info and memory
│ │ ├── documentService.ts # Document attachment processing
│ │ ├── pdfExtractor.ts # Native PDF text extraction bridge
│ │ ├── authService.ts # Passphrase authentication
│ │ ├── voiceService.ts # Voice recording
│ │ ├── whisperService.ts # Whisper transcription
│ │ ├── intentClassifier.ts # Image generation intent detection
│ │ ├── huggingFaceModelBrowser.ts # HF model browsing
│ │ ├── huggingface.ts # HF API utilities
│ │ ├── coreMLModelBrowser.ts # iOS Core ML model browsing
│ │ ├── backgroundDownloadService.ts # Background download management
│ │ └── index.ts # Service exports
│ ├── stores/ # Zustand state management
│ │ ├── appStore.ts # Global app state
│ │ ├── chatStore.ts # Conversations and messages
│ │ ├── authStore.ts # Authentication state
│ │ ├── projectStore.ts # Projects state
│ │ ├── whisperStore.ts # Whisper model state
│ │ └── index.ts # Store exports
│ ├── theme/ # Dynamic light/dark theme system
│ │ ├── index.ts # useTheme() hook, getTheme(), exports
│ │ ├── palettes.ts # Light/dark color palettes, shadows, elevation
│ │ └── useThemedStyles.ts # Memoized themed stylesheet factory
│ ├── utils/ # Utility functions
│ │ └── haptics.ts # Haptic feedback triggers
│ └── types/ # TypeScript type definitions
│ ├── index.ts # All interfaces and types
│ └── whisper.rn.d.ts # Whisper native module types
├── android/ # Android native code
│ └── app/src/main/java/ai/offgridmobile/
│ ├── MainActivity.kt # Main activity
│ ├── MainApplication.kt # Application entry point
│ ├── download/ # Background download manager
│ │ ├── DownloadManagerModule.kt
│ │ ├── DownloadManagerPackage.kt
│ │ └── DownloadCompleteBroadcastReceiver.kt
│ ├── localdream/ # local-dream native module
│ │ ├── LocalDreamModule.kt
│ │ └── LocalDreamPackage.kt
│ └── pdf/ # PDF text extraction
│ ├── PdfExtractorModule.kt
│ └── PdfExtractorPackage.kt
├── ios/ # iOS native code
│ └── OffgridMobile/
│ ├── AppDelegate.swift # Application delegate
│ ├── CoreMLDiffusion/ # Core ML image generation
│ │ ├── CoreMLDiffusionModule.swift
│ │ └── CoreMLDiffusionModule.m
│ ├── PDFExtractor/ # PDF text extraction
│ │ └── PDFExtractorModule.swift
│ └── Download/ # iOS download manager
│ ├── DownloadManagerModule.swift
│ └── DownloadManagerModule.m
└── docs/ # Documentation
├── ARCHITECTURE.md # This file
├── CODEBASE_GUIDE.md # Comprehensive architecture guide
├── DESIGN_PHILOSOPHY_SYSTEM.md # Design system reference
└── ...
- Node.js 20+
- React Native CLI
Android:
- JDK 17
- Android SDK (API 36)
- Android NDK r27
iOS:
- Xcode 15+ with iOS 17 SDK
- CocoaPods
- Apple Developer account (for device deployment)
# Clone repository
git clone https://github.com/alichherawalla/off-grid-mobile.git
cd off-grid-mobile
# Install JavaScript dependencies
npm install
# Android setup
cd android && ./gradlew clean && cd ..
# iOS setup
cd ios && pod install && cd ..# Start Metro bundler
npm start
# Android (separate terminal)
npm run android
# iOS (separate terminal)
npm run ios
# Or use Xcode
# Open ios/OffgridMobile.xcworkspace
# Select target device → Build & Run# Android
cd android
./gradlew assembleRelease
# Output: android/app/build/outputs/apk/release/app-release.apk
# iOS
# Open Xcode → Product → Archive → DistributeCreate android/gradle.properties:
MYAPP_RELEASE_STORE_FILE=your-release-key.keystore
MYAPP_RELEASE_KEY_ALIAS=your-key-alias
MYAPP_RELEASE_STORE_PASSWORD=***
MYAPP_RELEASE_KEY_PASSWORD=***Contributions welcome. Please:
- Fork repository
- Create feature branch
- Follow existing code style (TypeScript, design tokens)
- Add tests where applicable
- Update documentation
- Submit pull request
- Use
useTheme()anduseThemedStyles()for all styling — never hardcode colors - Use
TYPOGRAPHYandSPACINGfromsrc/constants/for theme-independent tokens - Follow singleton pattern for services
- Implement background-safe operations for long-running tasks
- Add comprehensive logging for debugging
- Check memory before model operations
- Handle errors gracefully with user-friendly messages