🏎️ Alora: The Multi-Agent Automotive Co-Pilot

Deployed on Google Cloud Run | Built with Google Agent Development Kit (ADK)

📥 Installation & Releases

🥽 Meta Quest (XR Experience)

Includes full Multimodal Live API integration and 3D Map Grounding.

🤖 Android

Automated Release Pipeline: Every push to main triggers a GitHub Action that builds, signs, and releases a production-ready APK.

Join Google Play Internal Test

Requires email invitation. Contact team for access.

Build: Capacitor 7.4 + Vite

CI/CD: GitHub Actions -> Gradle -> Signed APK

🍎 iOS

View iOS Builds

Cross-Platform Proof of Concept: Successfully built for iOS 18.2 via Capacitor.

A short demo for XR

Agent Evaluation Journey & Reference

This document serves as an "honest readme" regarding the evolution of our agent evaluation pipeline, from initial failures to a robust, three-tier testing strategy.

1. The Challenge: Mixed Tools & Modern Models

We initially encountered critical failures when upgrading to Gemini 2.5 Flash. The core issue was a strict constraint in the new model architecture: It does not support mixing Google Search citations with other function calls in the same turn.

The Fix: Split & Sequencing

To resolve the ClientError: 400 INVALID_ARGUMENT (Mixed Tools), we refactored the monolithic ResearcherAgent into two specialized components:

SearchAgent: Dedicated solely to using the google_search tool. It outputs raw search results.
ResearchAnalysisAgent: Dedicated to "thinking". It takes the search results as input (context) and uses internal tools/logic to synthesize an answer.
SequentialAgent: Orchestrates them (Search -> Analysis), ensuring the model never sees conflicting tool definitions in a single context.

2. The Solution: Agent Testing Pyramid

To "stretch" our evaluation and ensure reliability beyond just "it didn't crash", we implemented the Agent Testing Pyramid.

Tier 1: Component-Level Unit Tests 🧪

Goal: Ensure individual agents are configured correctly and select the right tools in isolation.
Implementation: pilot/tests/test_search_agent.py & test_analysis_agent.py.
What Works: We now verify that SearchAgent has the correct instructions and tool definitions without needing to run the full expensive pipeline.

Tier 2: Trajectory-Level Integration Tests 🛤️

Goal: Verify the agent behaves correctly, not just that it produced an answer.
Implementation:
- Updated evaluation_dataset.json to include "expected_tool_sequence": ["MainWorkflowAgent"].
- Updated benchmark_prompts.py to trace the execution path.
Metric: trajectory_score. We require a score of 0.8+ (along with semantic similarity) to pass. This catches cases where the agent might hallucinate an answer without actually using the required tools.

Tier 3: End-to-End Human Review 👁️

Goal: Allow humans to inspect the reasoning process for complex queries.
Implementation: pilot/evaluation/human_review.py.
Result: Each run generates a clean Markdown report in pilot/evaluation_reports/ containing the full Q&A trace, tool usage, and scores. This is uploaded as a CI Artifact (human-review-reports) for easy inspection.

3. Architecture Evolution & Metamorphosis 🦋

Our agent architecture has undergone a significant metamorphosis to address the "infinite loop" problem and optimize for cost/latency.

📜 Phase 1: The "Spinning" Researcher (Legacy)

Initially, the agent was a Reactive entity. It blindly entered a research loop for every query.

The Flaw: When validating jargon-heavy queries (e.g., "activate RACE START"), the validator would reject imperfect answers, forcing the agent to research again and again—spinning indefinitely.
The Diagram:

🧠 Phase 2: The Intelligence Center (Modern)

We re-architected the system into a Predictive "Intelligence Center". The agent now acts as a Planner, routing queries based on knowledge state.

The Fix: A Memory-First strategy.
1. Recall: The agent must check Long-Term Memory (Vertex AI) first. If the answer exists, it returns immediately (0 searching cost).
2. Research: Only if memory misses does it deploy the heavy DeepResearchWorkflow tool.
The Diagram:

🔄 The Metamorphosis

This shift transforms the agent from a simple tool-user to a state-aware orchestrator. Metamorphosis

4. Current Limitations (The "Honest" Part)

Monte Carlo Tree Search (MCTS): While intended to be part of the advanced planning capabilities, the MCTS component is currently not fully functional and disabled in the active evaluation path. We are relying on the deterministic SequentialAgent flow for now.
Dependency Speed: The sentence-transformers library (used for similarity scoring) is heavy. We implemented a robust fallback to a mock scorer if the download times out, ensuring the pipeline doesn't flake due to network issues, but this means local runs might sometimes skip semantic verification if the environment isn't cached.

Security & Guardrails (Model Armor) 🛡️

We integrated Google Cloud Model Armor to sanitize inputs before they ever reach our agent logic.

Mechanism: A before_model_callback intercepts every request.
Filters: We use the alora-ma-template which enforces:
- PII Detection: Blocks sharing sensitive personal info.
- Jailbreak/Attack: Prevents prompt injection attempts.
- Malicious URIs: Filters unsafe links.
Result: If a thread is detected, the prompt is scrubbed and replaced with a system refusal instruction, protecting the LLM context.

Audio Integration (ElevenLabs) 🎧

We have integrated ElevenLabs TTS to provide on-demand audio insights for our widgets.

Features

On-Demand Generation: Audio is synthesized only when requested (clicked) to optimize costs.
Custom Waveform: A custom canvas-based visualizer mimics the ElevenLabs UI style.
Caching: Generated audio files are stored in a public Google Cloud Storage bucket (audio_assets/) and served via CDN to avoid re-generating the same audio.

Configuration

Required environment variables in .env or Google Cloud Run:

ELEVENLABS_API_KEY: Your API Key.
AUDIO_BUCKET_NAME: GCS Bucket name (defaults to vigilant-journey-assets).

The service automatically creates the bucket and sets public-read permissions if it doesn't exist.

Observability & Incident Management 📊

We utilize Datadog for full-stack observability, including APM traces, RUM, and Incident Management.

Incident Reporting (Jan 2026)

We have compiled a comprehensive incident report analyzing the stability of the Pilot launch and Outage tracking.

Report: Alora Incident Report (PDF)
Methodology: These incidents were manually declared and managed directly within the Datadog platform to demonstrate the end-to-end Incident Management lifecycle, generating authoritative reports from system records.

This report covers:

Incident Trends: Breakdown of SEV-1/SEV-2 outages.
Response Metrics: MTTR (Mean Time to Repair) analysis.
Post-Mortems: Action items derived from simulated API and Audio failures.

🔮 What's Next

Phase 1: Data Agents

Goal: Automate data collection, ingestion, and user preference learning

Phase 2: Android XR (Glasses)

Goal: AR/VR-compatible app using Jetpack Compose or Multiplatform
Features:
- Heads-up display on engineer's helmet visor
- Spatial audio for directional cues
- Gesture controls for hands-free operation

Phase 3: Public Release

Google Play: Consumer Android app
Meta Horizon: VR experience for sim racing
Integration: Voice-enabled dashboard for race engineers
Build for google glasses

Phase 4: Architecture Evolution (Current Focus)

Unified Persona Architecture: Transitioning from context-switching multiple personas to a single, coherent persona that adapts to the current mode (EV vs. Race). Object Classification with Meta SAM3: Integrating Meta's Segment Anything Model 3 (SAM3) for advanced object detection and classification capabilities. Note: As an open-source model, this will require a hosting strategy (e.g., self-hosted, cloud inference API).