
SilverOne Logo

SILVERONE

Accessible Android assistant for voice and text-driven device control

Android API 33 Java 17 Gemini Live MVVM Firebase Fallback

Built to reduce the number of steps between user intent and device action.

Voice or text enters the system, the agent resolves intent, and a deterministic tool layer executes the action. The design favors explicit fallbacks, on-device state, and predictable behavior over speculative automation.

In short: fewer menus, fewer taps, and less state the user has to remember.



Table of Contents

  1. Engineering Challenge
  2. Architecture & Design
  3. Contextual Memory & Personalization: The RAG Engine
  4. Implemented Capabilities
  5. Deployment & Distribution Strategy
  6. Roadmap & Technical Priorities
  7. Getting Started: Building & Running the App
  8. Development & Contribution

Engineering Challenge

The Core Problem: Modern smartphone UI operates 6-7 levels of abstraction away from user intent. Making a call requires parsing a visual hierarchy, tapping precise 48dp targets, completing 5-12 sequential menu interactions, and recovering from silent failures. For the 1.3 billion people with accessibility constraints, this design effectively precludes use.

The Technical Thesis: Eliminate abstraction layers by building an agent-driven interface where the device understands intent (voice, touch, context) and executes actions immediately with spoken confirmation. Reliability comes from fallback orchestration: multiple turn sources (Gemini Live → Gemini API → A2A → Legacy), with a recovery mechanism and user notification on every failure path.

The Current Reality: This is not an edge-AI product yet. The shipped system is a cloud-first agent orchestrator: Gemini Live for streaming voice, Gemini API for chat, and a deterministic tool dispatcher for device actions. On-device intelligence is limited to state, caching, and the RAG memory layer.

Architectural Trade-offs & Known Limitations

Latency vs. Accuracy in Speech Handling

Cloud-first voice handling gives the MVP low integration risk and fast iteration, but it also means the experience depends on network quality. The current trade-off is to keep the runtime lean and accept that offline speech understanding is not solved in this version.

LLM Reasoning vs. Deterministic Execution

The model is used to extract intent and parameters, not to generate arbitrary actions. That keeps the behavior easier to reason about, but it also means the system still depends on the quality of the model output when parsing ambiguous phrasing or local dialects.
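This "model proposes, tool layer decides" split can be sketched in a few lines. The sketch below is illustrative only: `ToolGate`, `ToolCall`, and the `setVolume` validation rule are hypothetical stand-ins, not the app's actual dispatcher.

```java
// Sketch only: the model's output is treated as an untrusted proposal,
// and only known tools with well-formed arguments are executed.
// ToolGate and ToolCall are hypothetical names, not the app's real API.
import java.util.Map;

public class ToolGate {
    record ToolCall(String name, Map<String, String> args) {}

    // Deterministic validation gate in front of every device action.
    static String dispatch(ToolCall call) {
        switch (call.name()) {
            case "setVolume": {
                String raw = call.args().get("level");
                if (raw == null || !raw.matches("\\d{1,3}")) {
                    return "rejected: level must be 0-100";
                }
                int level = Integer.parseInt(raw);
                if (level > 100) return "rejected: level must be 0-100";
                return "ok: volume=" + level;  // a real handler would call AudioManager
            }
            default:
                return "rejected: unknown tool " + call.name();
        }
    }
}
```

Whatever the model emits, only the cases the tool layer explicitly recognizes can reach a device API; everything else becomes a rejection message rather than an action.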

Persistent Connectivity vs. Battery Use

Maintaining an active WebSocket session improves responsiveness, but it also increases power usage. The implementation uses aggressive idle disconnects and explicit session state transitions to keep the cost bounded.


Architecture & Design

Dual-Agent System: Voice vs. Chat

The app exposes two interaction modes that share the same execution layer:

Voice Agent (VoiceAgentViewModel) — streaming voice path
Voice input is captured, streamed, and mapped to agent turns through Gemini Live over WebSocket. The goal is not raw model novelty; it is to keep the feedback loop short enough that the interaction still feels direct.

  • Protocol: Gemini Live WebSocket (bidirectional audio streaming)
  • Activation: User holds button → audio streams → agent responds in real time
  • Latency target: low and predictable, with visible state transitions
  • Concurrency: one active session per device; idle disconnect after 120s
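The session lifecycle implied by these bullets can be modeled as a small state machine with an idle budget. The state names and tick-based timeout below are assumptions for illustration, not the real VoiceAgentViewModel implementation.

```java
// Hedged sketch of the voice session lifecycle: one active session,
// idle disconnect after 120s. State names are illustrative only.
public class SessionLifecycle {
    enum State { IDLE, CONNECTING, STREAMING, DISCONNECTING }

    static final long IDLE_TIMEOUT_MS = 120_000;  // idle disconnect after 120s

    private State state = State.IDLE;
    private long lastActivityMs;

    // Holding the button starts (or refreshes) the streaming session.
    void onHoldButton(long nowMs) {
        state = State.STREAMING;  // CONNECTING/DISCONNECTING are transient here
        lastActivityMs = nowMs;
    }

    // Called periodically; tears the session down once the idle budget is spent.
    State tick(long nowMs) {
        if (state == State.STREAMING && nowMs - lastActivityMs >= IDLE_TIMEOUT_MS) {
            state = State.IDLE;  // a real client would close the WebSocket here
        }
        return state;
    }
}
```

Making the timeout an explicit transition (rather than a side effect of a dropped socket) is what keeps the UI's state indicators honest.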

Chat Agent (ChatAgentViewModel) — typed interaction path
Text input uses the same tool execution backbone, but with a simpler turn model and a clearer audit trail. That makes it easier to debug and easier to explain when a result is wrong.

  • Protocol: Gemini API (stateless REST turns with conversation context)
  • Activation: User types → message is serialized with prior context → response returns
  • Latency target: typical text turnaround; no need to keep audio open
  • Concurrency: one turn at a time for predictable state handling

Intent Flow Diagram

graph TB
    A["User Input<br/>(Voice / Text / Touch)"] -->|Speech intent| B["VoiceAgentViewModel<br/>Gemini Live WebSocket"]
    A -->|Text intent| C["ChatAgentViewModel<br/>Gemini API"]
    
    B -->|Real-time streaming| D["GeminiLiveSessionClient<br/>Bidirectional PCM Audio"]
    C -->|Interactive turns| E["GeminiTurnSource<br/>REST API Calls"]
    
    D -->|Function call| F["AgentConversationCoordinator<br/>(Orchestrator & Tool Dispatcher)"]
    E -->|Function call| F
    
    F -->|Validate & dispatch| G["LiveToolDispatcher<br/>12+ Specialized Handlers"]
    
    G -->|callContact| H1["CallToolHandler<br/>TelephonyManager"]
    G -->|sendSMS| H2["ContactsToolHandler<br/>Provider Queries"]
    G -->|setVolume| H3["VolumeToolHandler<br/>AudioManager"]
    G -->|setFlashlight| H4["FlashlightToolHandler<br/>LEDs"]
    G -->|setAccessibility| H5["AccessibilityToolHandler<br/>Settings"]
    G -->|getDateTime| H6["DateTimeToolHandler<br/>Clock/Calendar"]
    G -->|sendLocation| H7["SafetyToolHandler<br/>GPS + SMS"]
    G -->|openApp| H8["NavigationToolHandler<br/>Intent Launch"]
    
    H1 --> I["Device Action Executed<br/>+ Feedback to User"]
    H2 --> I
    H3 --> I
    H4 --> I
    H5 --> I
    H6 --> I
    H7 --> I
    H8 --> I
    
    I -->|Confirmation audio| B
    I -->|Confirmation text| C

Component & Data Flow

graph LR
    subgraph "UI Layer"
        A1["SeniorHomeFragment<br/>Senior UI"]
        A2["VoiceAgentFragment<br/>Voice UI"]
        A3["ChatAgentFragment<br/>Chat UI"]
    end
    
    subgraph "ViewModel Layer"
        B1["VoiceAgentViewModel<br/>Audio I/O + State"]
        B2["ChatAgentViewModel<br/>Message List + State"]
    end
    
    subgraph "Repository Layer"
        C1["VoiceAgentRepository<br/>Coordinate + Archive"]
        C2["ChatAgentRepository<br/>Coordinate + Archive"]
    end
    
    subgraph "Agent Core (Orchestration)"
        D["AgentConversationCoordinator<br/>Tool dispatch logic"]
        E["AgentStateStore<br/>Current context"]
        F["RagMemoryStore<br/>SQLite + Vector DB"]
    end
    
    subgraph "Turn Sources (LLM Backend)"
        G1["GeminiLiveSessionClient<br/>WebSocket Streaming"]
        G2["GeminiTurnSource<br/>REST API"]
        G3["A2aTurnSource<br/>Fallback 1"]
        G4["AdkHttpTurnSource<br/>Fallback 2"]
    end
    
    subgraph "Device Actions"
        H["LiveToolDispatcher"]
        I["12+ ToolHandlers<br/>Call, SMS, Volume, etc"]
    end
    
    A1 --> B1
    A2 --> B1
    A3 --> B2
    
    B1 --> C1
    B2 --> C2
    
    C1 --> D
    C2 --> D
    
    D --> E
    D --> F
    
    D -->|Primary| G1
    D -->|Primary| G2
    D -->|Fallback| G3
    D -->|Fallback| G4
    
    D --> H
    H --> I
    
    I --> F

Fallback Orchestration: Recovery Paths, Not Just Errors

Cloud systems fail in ordinary ways: the network drops, a session expires, a response times out, or the model returns something unusable. The implementation handles those cases by moving through a small number of known fallbacks instead of exposing raw failure to the user.

The objective is simple: return either a valid action or a clear explanation of why the action could not be completed. There should be no silent failure and no dead-end UI.

Tool Call Initiated
    ↓
[Attempt Primary TurnSource: Gemini Live (WebSocket)]
    ├─ Success → Execute tool, record result, continue
    └─ Failure (network, rate limit, malformed response)
        ↓
    [Attempt Secondary TurnSource: Gemini API (REST)]
        ├─ Success → Execute tool, record result
        └─ Failure (timeout, rate limit, inference error)
            ↓
        [Attempt Tertiary TurnSource: A2A (Fallback Orchestrator)]
            ├─ Success → Execute tool, record result
            └─ Failure (all cloud unreachable OR new device)
                ↓
            [Graceful Local Degradation]
            - Notify user: "Having trouble reaching cloud. Using local mode."
            - Execute from cache: "Call last 5 most frequent contacts"
            - Queue command for later retry: Once connectivity returns, retry with cloud
            - Ensure semantics are preserved: "Send SMS to Mom" → queue, retry on reconnect

In practice, the system either completes the request or degrades to cached behavior and a clear retry path.
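The ladder above reduces to iterating over an ordered list of turn sources until one succeeds. The sketch below is a minimal stand-in: `TurnSource` and the degradation message are illustrative, not the app's actual interfaces.

```java
// Minimal sketch of the fallback ladder: try each source in priority order,
// and degrade to a queued local retry instead of surfacing raw failure.
// TurnSource is a hypothetical stand-in for the real turn-source interface.
import java.util.List;

public class FallbackChain {
    interface TurnSource {
        String name();
        String takeTurn(String request) throws Exception;  // throws on any failure
    }

    static String resolve(List<TurnSource> sources, String request) {
        for (TurnSource source : sources) {
            try {
                return source.name() + ": " + source.takeTurn(request);
            } catch (Exception e) {
                // Log and fall through to the next source; no dead-end UI.
            }
        }
        // Graceful local degradation: a clear message plus a retry path.
        return "local: queued \"" + request + "\" for retry when connectivity returns";
    }
}
```

Because the chain always returns a string (a result or an explanation), the caller never has to distinguish "cloud failed" from "cloud slow"; it just renders whatever comes back.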


Contextual Memory & Personalization: The RAG Engine

RagMemoryStore: On-Device AI Memory Layer

The system implements a Retrieval-Augmented Generation (RAG) layer to keep recent context, preferences, and interaction history available without pushing that state to a remote service.

Architecture:

  • Vector Database: SQLite + FTS5 + vector search extension
  • Context Storage: Turn-by-turn conversation history, user preferences, device state snapshots
  • Update Mechanism: Real-time indexed insertion on every execution; aged data evicted via LRU
  • Query Speed: <50ms semantic search latency (tested on Pixel 4a)

Why this matters:

  1. Privacy by default — The meaningful state lives on the device.

    • Recent contacts, preferences, and turn history stay local.
    • The cloud sees the minimum data needed to fulfill the request.
  2. Behavioral customization — The agent gets more useful as the local history grows.

    • Repeated names and commands become easier to resolve.
    • Frequent actions can be served from cache instead of being reconstructed.
  3. Multi-turn continuity — The system can use earlier turns as context.

    • That helps with instructions like "call the person we spoke about earlier."
    • It also makes later follow-up requests easier to interpret.
  4. Safer execution — Memory improves retrieval, but execution still goes through deterministic handlers.

    • The model suggests parameters.
    • The tool layer decides whether an action is valid and how it is executed.

Data Flow: User Input → Vector Lookup → Gemini + Local Execution

User Voice Input
    ↓
[Extract partial intent from audio] → Partial match in vector DB (~10ms)
    ↓
[Retrieve top-3 candidate contexts] → "Call Marko" (high confidence)
    ↓
[Query Gemini Live with retrieved context] → Cloud confirmation + enrichment
    ↓
[Archive interaction + update vector embeddings] → Next call to same person is cache-hit
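The vector-lookup step can be illustrated with a plain cosine-similarity ranking. The in-memory map below is a toy stand-in for the SQLite-backed store; the real RagMemoryStore presumably delegates this to its vector-search extension.

```java
// Toy illustration of the retrieval step: rank stored contexts by cosine
// similarity to a query embedding. The map is a stand-in for the SQLite store.
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class VectorLookup {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Return the top-k context keys most similar to the query embedding.
    static List<String> topK(Map<String, double[]> store, double[] query, int k) {
        return store.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) -> -cosine(e.getValue(), query)))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```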

Privacy Guarantee: If the user clears the app data, the local history goes with it. There is no hidden cloud backup path in the current implementation.


Implemented Capabilities

SilverOne currently ships with a practical set of interaction and device-control features. The value is in how the pieces fit together and how consistently they execute.

Voice I/O Pipeline

The Challenge: Transcribing fragmented speech affected by dysarthria, background noise, and variable microphone quality.
The Solution: Stream the audio, keep the turn state explicit, and let the tool layer handle execution.

  • Streaming speech capture — Audio is captured and sent through the voice path in real time
  • Turn handling — Partial and final turns are kept separate so state transitions stay readable
  • Intent disambiguation — The agent resolves commands using conversation context and device state
  • Response feedback — The system confirms actions with voice or text, depending on the active mode

Contact Resolution Engine

The Challenge: "Call Marko" could mean 3 different Markos in the contact list, plus a Marko in messages history, plus a Marko the user never explicitly named but refers to contextually ("my son's friend").
The Solution: Probabilistic disambiguation through multiple overlapping signals.

  • Fuzzy Matching — Levenshtein distance with Cyrillic transliteration (matches "Марко" and "Marko" across Cyrillic/Latin spellings)
  • Nickname Handling — Learn variant names ("Mara" → "Marija") from repeated user corrections; vector DB stores name aliases
  • Phonetic Collision Detection — When two contacts sound identical, resolve via context: "Did you mean X (called yesterday) or Y (messaged last week)?"
  • Transitive Queries — "Call the person who called yesterday" (combines temporal + identity resolution; queries CallLog + vector DB semantics)
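The fuzzy-matching bullet can be sketched as transliteration-based normalization followed by edit distance. The mapping table below is a tiny illustrative subset of the Cyrillic alphabet, not the app's actual implementation.

```java
// Sketch of fuzzy contact matching: normalize Cyrillic to Latin, then compare
// with Levenshtein distance. The transliteration table is a small toy subset.
import java.util.Map;

public class FuzzyContact {
    static final Map<Character, String> CYR = Map.of(
            'м', "m", 'а', "a", 'р', "r", 'к', "k", 'о', "o", 'ј', "j", 'и', "i");

    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toLowerCase().toCharArray()) {
            sb.append(CYR.getOrDefault(c, String.valueOf(c)));
        }
        return sb.toString();
    }

    // Classic Levenshtein edit distance via dynamic programming.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // A threshold of 1 edit tolerates small mispronunciations after normalization.
    static boolean matches(String spoken, String contact) {
        return distance(normalize(spoken), normalize(contact)) <= 1;
    }
}
```

With this scheme "Марко" and "Marko" normalize to the same string, so script differences cost zero edits and the threshold is spent only on genuine speech errors.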

Device Integration

The goal: One-command device control where possible, with explicit confirmation when the action is sensitive.

  • Call Control — TelephonyManager-based native dialing with call state machine (pre-call → alerting → active → disconnecting → idle)
  • SMS Handling — Secure message composition, read confirmation (TTS), reply tracking via ContentProvider
  • GPS & Location — Privacy-bounded emergency location dispatch; sends location only to pre-approved emergency contacts
  • Audio Stack — Adaptive volume (ambient noise API triggers auto-levels), intelligent speaker routing (Bluetooth vs. wired vs. speaker)
  • Camera + Vision — ML Kit OCR with real-time text extraction and audio readback; user points camera at mail/bill/sign, system reads aloud
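The call state machine named in the first bullet can be modeled as an enum with a legal-transition table. The transitions below are inferred from the listed states, not taken from the real CallToolHandler.

```java
// Sketch of the call state machine (pre-call → alerting → active →
// disconnecting → idle). The transition table is assumed from the listed
// states, not lifted from the app's actual telephony handler.
import java.util.Map;
import java.util.Set;

public class CallStateMachine {
    enum State { PRE_CALL, ALERTING, ACTIVE, DISCONNECTING, IDLE }

    static final Map<State, Set<State>> LEGAL = Map.of(
            State.IDLE, Set.of(State.PRE_CALL),
            State.PRE_CALL, Set.of(State.ALERTING, State.IDLE),
            State.ALERTING, Set.of(State.ACTIVE, State.DISCONNECTING),
            State.ACTIVE, Set.of(State.DISCONNECTING),
            State.DISCONNECTING, Set.of(State.IDLE));

    private State state = State.IDLE;

    // Reject transitions the telephony flow cannot actually produce.
    boolean transition(State next) {
        if (LEGAL.get(state).contains(next)) {
            state = next;
            return true;
        }
        return false;
    }

    State current() { return state; }
}
```

Rejecting illegal transitions at this layer is what lets the handler give the user an accurate spoken status even when telephony callbacks arrive out of order.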

Adaptive Behavior

The insight: Repeated behavior is useful input, especially for contacts and common actions.

  • Preference Learning — User adjustments (volume = 60%, speech speed = 1.2x, language = Serbian) persisted in vector DB as embeddings; applied to all future interactions
  • Usage Analytics — Track command frequency distribution so cache and shortcuts reflect real use
  • Drift Detection — Track unusual shifts in behavior and surface them as signals, not conclusions
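A minimal version of the usage-analytics bullet is a plain frequency counter that surfaces the most common commands for caching. The class and method names here are illustrative, not the app's real analytics layer.

```java
// Toy sketch of command-frequency tracking: count each executed command and
// expose the most frequent ones so cache and shortcuts reflect real use.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UsageStats {
    private final Map<String, Integer> counts = new HashMap<>();

    void record(String command) {
        counts.merge(command, 1, Integer::sum);
    }

    // Commands worth pinning in the cache, most frequent first.
    List<String> topCommands(int k) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((a, b) -> counts.get(b) - counts.get(a));
        return keys.subList(0, Math.min(k, keys.size()));
    }
}
```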

Offline & Resilience

Connectivity isn't guaranteed, especially on unstable mobile networks, so resilience is a first-class concern:

  • LRU Command Cache — Persistent command cache with eviction so the common paths stay fast after restart
  • Graceful Degradation — If the cloud path is unavailable, the system falls back to cached context and explicit retry behavior
  • Connectivity Awareness — NetworkCallback-based checks keep the active session honest about whether it can continue
  • Battery Optimization — Idle disconnects and request coalescing keep long-lived sessions from draining the device unnecessarily
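The LRU command cache maps naturally onto LinkedHashMap's access-order mode. This is an in-memory sketch with an assumed capacity parameter; the shipped cache is described as persistent, which this example does not cover.

```java
// Minimal in-memory LRU cache in the spirit of the command cache above,
// built on LinkedHashMap's access-order eviction. Capacity is an assumed
// parameter; persistence across restarts is not shown here.
import java.util.LinkedHashMap;
import java.util.Map;

public class CommandCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public CommandCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder=true gives LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the least recently used entry
    }
}
```

Reading an entry (via `get`) refreshes it, so the commands the user actually repeats survive eviction while one-off requests age out.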

Deployment & Distribution Strategy

SilverOne is designed to be installed like a normal Android app, but the runtime assumptions are closer to a managed assistant than a generic consumer app.

Technical Deployment Paths

Path 1: Direct APK Installation
For individual users and internal testing:

  • Sideloadable via Firebase App Distribution
  • Installable from a local debug or release build
  • Useful for pilot testing and regression verification

Path 2: Managed Device Deployment
For organizations that preconfigure devices:

  • Preload the app and permissions during provisioning
  • Inject API configuration through managed device setup
  • Use standard Android update channels for maintenance

Path 3: Public Sector Rollout
For accessibility pilots and social programs:

  • MDM deployment to a managed fleet
  • Region-specific configuration where needed
  • Fallback behavior for unstable connectivity

Operational Notes

The current implementation is not presented here as a load-tested, high-concurrency backend. The useful part of the design is the split between local state, cloud turns, and deterministic action execution.


Roadmap & Technical Priorities

Phase 1: MVP Stabilization (Weeks 1-4)

  • Completed: Voice → Call pipeline
  • Completed: Fallback orchestration
  • In progress: Contact matching robustness (5,000+ user name corpus validation)
  • In progress: Speech handling robustness on low-quality input (coffee shops, traffic)

Phase 2: Feature Completeness (Weeks 5-8)

  • SMS sending with confirmation workflow (TextMessage API + SmsManager)
  • Missed call detection with smart recall logic (CallLog provider querying)
  • Calendar integration (CalendarProvider read/write)
  • Medication reminders (persistent notification + WorkManager scheduled tasks)

Phase 3: Advanced Capabilities (Weeks 9-12)

  • Multi-agent orchestration (GPT-4 + Gemini hybrid reasoning)
  • Cross-device sync (Cloud Firestore bidirectional sync)
  • Family oversight dashboard (Firebase Hosting + Svelte frontend)
  • Expanded language support (additional language packs and improved speech handling)

Phase 4: Production Hardening (Week 13+)

  • Load testing (define realistic device and session targets before release)
  • Security audit (OWASP top 10, Firebase security rules review)
  • Accessibility compliance (WCAG 2.1 Level AA verification)
  • Play Store launch (Google Play Console submission, policy review)

Getting Started: Building & Running the App

System Requirements

Minimum Hardware:

  • Android 13+ (API 33+) — Supports scoped storage + modern permission models
  • RAM: 2GB available (app footprint ~110MB with models + cache)
  • Storage: 50MB free (app + SQLite + vector DB indexes)
  • Network: Optional (graceful operation with degraded connectivity)

Prerequisites

  • Android Studio 2023.1.0 or later (with Android SDK, build tools 33+)
  • Java/Kotlin — JDK 17 LTS (configured in Studio)
  • Gradle — 7.5+ (included in project)
  • Device — Android 13+ (API 33+) for testing
  • API Keys: a Google AI (Gemini) API key; optionally an OpenAI key and the Live/orchestrator endpoint URLs (see Step 1)

Step 1: Clone & Configure Repository

# Clone the repository
git clone https://github.com/gdg-accessibility-agent/gdg-accessibility-agent.git
cd gdg-accessibility-agent

# Create local.properties with API keys
echo "sdk.dir=/path/to/android/sdk" > local.properties
echo "GOOGLE_API_KEY=AIza..." >> local.properties
echo "OPENAI_API_KEY=sk-..." >> local.properties
echo "LIVE_SESSION_URL=wss://..." >> local.properties
echo "ORCHESTRATOR_URL=https://..." >> local.properties

Note:

  • LIVE_SESSION_URL is for Gemini Live WebSocket (optional; system gracefully degrades to REST API if not provided)
  • ORCHESTRATOR_URL is for fallback turn source (optional)
  • If API keys are not in local.properties, the build reads from environment variables

Step 2: Open in Android Studio

# Open the project in Android Studio (macOS)
open -a "Android Studio" .

# Or via the command-line launcher, if installed:
studio .

In Android Studio:

  • Sync Gradle files (Android Studio will auto-prompt)
  • Accept SDK downloads if prompted (build tools, emulator image)
  • Configure SDK location if not auto-detected (File → Project Structure → SDK Location)

Step 3: Build the Debug APK

# Build via command line
./gradlew assembleDebug

# APK output: app/build/outputs/apk/debug/app-debug.apk

# Or build via Android Studio
# Build → Build Bundle(s)/APK(s) → Build APK(s)

Build time: ~45 seconds (Kotlin incremental compilation on subsequent builds)

Step 4: Install & Run on Device

Option A: Physical Device

# Enable USB Debugging on your Android 13+ device
# Settings → Developer Options → USB Debugging (toggle ON)
# Connect via USB

# Install the APK
./gradlew installDebug

# Or use Android Studio (Run → Run 'app' or Shift+F10)

Option B: Android Emulator

# List available Android Virtual Devices (AVDs)
emulator -list-avds

# Launch an emulator (if not already running)
emulator -avd Pixel_4a_API_33 &

# Install via Gradle
./gradlew installDebug

First Launch:

  1. App requests 40+ permissions (calls, SMS, contacts, location, camera, audio, etc.)
  2. Accept permissions for full functionality
  3. Select language (Serbian or English)
  4. Voice button appears on home screen

Step 5: Test Core Features

Voice Agent (WebSocket streaming):

  • Press and hold the voice button on home screen
  • Say: "Call Marko" or "Send SMS to Maja"
  • Listen for confirmation audio

Chat Agent (text input):

  • Tap the chat icon
  • Type: "What's the time?" or "Turn on flashlight"
  • View text response + execution

Device Actions:

  • Volume: "Increase volume" / "Mute"
  • Flashlight: "Turn on flashlight"
  • Location: "Send location to emergency" (opens SMS draft)
  • Contacts: "Who called me?" (plays last caller name)

Step 6: Development & Debugging

Run Tests:

# Unit tests
./gradlew test

# Integration tests (requires connected device or emulator)
./gradlew connectedAndroidTest

View Logs:

# Filter app logs only
adb logcat | grep "GdgAgent\|VoiceAgent\|ChatAgent\|Coordinator"

# Or use Android Studio Logcat
# Android Studio → View → Tool Windows → Logcat

Breakpoint Debugging:

  1. Set breakpoint in Android Studio (click left margin)
  2. Run the app in debug mode from Android Studio (Run → Debug 'app'), or install with ./gradlew installDebug and attach the debugger (Run → Attach Debugger to Android Process)
  3. Step through code, inspect variables, etc.

Common Issues & Troubleshooting

  • "Gradle sync failed" → File → Invalidate Caches → Restart
  • "JAVA_HOME not set" → Studio → Preferences → Build, Execution, Deployment → Gradle → Gradle JDK (select JDK 17)
  • "API key missing" → check local.properties or environment variables; build.gradle logs which key is missing
  • "Device not detected" → enable USB Debugging; check adb devices
  • "Emulator freezes" → cold boot with emulator -avd [name] -no-snapshot-load
  • "WebSocket fails (voice agent)" → LIVE_SESSION_URL is incorrect or unreachable; the app falls back to the REST API automatically

Development & Contribution

Repository Structure

gdg-accessibility-agent/
├── app/
│   ├── src/main/java/com/example/gdgagent/
│   │   ├── activities/          (Launcher, HomeActivity)
│   │   ├── fragments/           (Senior home, chat, voice agent, settings)
│   │   ├── agent/               (Voice agent logic, intent dispatch)
│   │   ├── tools/               (AI instruction parsing/execution)
│   │   ├── database/            (SQLite entities, DAOs)
│   │   └── viewmodels/          (MVVM state management)
│   ├── src/main/res/
│   │   ├── layout/              (XML layouts, accessibility-first)
│   │   ├── drawable/            (Icons, theme resources)
│   │   └── values/              (Strings, themes, dimensions)
│   └── build.gradle
├── functions/                   (Firebase Cloud Functions for inference)
├── README.md                    (This file)
└── IMPLEMENTATION_GUIDE.md      (Developer onboarding)

Build & Test

# Build
./gradlew assembleDebug           # Development APK
./gradlew bundleRelease           # Google Play AAB

# Test
./gradlew test                    # Unit tests
./gradlew connectedAndroidTest    # Integration tests
