🏎️ Alora: The Multi-Agent Automotive Co-Pilot
Deployed on Google Cloud Run | Built with Google Agent Development Kit (ADK)
📥 Installation & Releases
🥽 Meta Quest (XR Experience)
Includes full Multimodal Live API integration and 3D Map Grounding.
🤖 Android
Automated Release Pipeline: Every push to
maintriggers a GitHub Action that builds, signs, and releases a production-ready APK.
Join Google Play Internal Test
Requires email invitation. Contact team for access.
- Build: Capacitor 7.4 + Vite
- CI/CD: GitHub Actions -> Gradle -> Signed APK
🍎 iOS
Cross-Platform Proof of Concept: Successfully built for iOS 18.2 via Capacitor.
Agent Evaluation Journey & Reference
This document serves as an "honest readme" regarding the evolution of our agent evaluation pipeline, from initial failures to a robust, three-tier testing strategy.
1. The Challenge: Mixed Tools & Modern Models
We initially encountered critical failures when upgrading to Gemini 2.5 Flash. The core issue was a strict constraint in the new model architecture: It does not support mixing Google Search citations with other function calls in the same turn.
The Fix: Split & Sequencing
To resolve the ClientError: 400 INVALID_ARGUMENT (Mixed Tools), we refactored the monolithic ResearcherAgent into two specialized components:
-
SearchAgent: Dedicated solely to using thegoogle_searchtool. It outputs raw search results. -
ResearchAnalysisAgent: Dedicated to "thinking". It takes the search results as input (context) and uses internal tools/logic to synthesize an answer. -
SequentialAgent: Orchestrates them (Search -> Analysis), ensuring the model never sees conflicting tool definitions in a single context.
2. The Solution: Agent Testing Pyramid
To "stretch" our evaluation and ensure reliability beyond just "it didn't crash", we implemented the Agent Testing Pyramid.
Tier 1: Component-Level Unit Tests 🧪
- Goal: Ensure individual agents are configured correctly and select the right tools in isolation.
- Implementation:
pilot/tests/test_search_agent.py&test_analysis_agent.py. - What Works: We now verify that
SearchAgenthas the correct instructions and tool definitions without needing to run the full expensive pipeline.
Tier 2: Trajectory-Level Integration Tests 🛤️
- Goal: Verify the agent behaves correctly, not just that it produced an answer.
- Implementation:
- Updated
evaluation_dataset.jsonto include"expected_tool_sequence": ["MainWorkflowAgent"]. - Updated
benchmark_prompts.pyto trace the execution path.
- Updated
- Metric:
trajectory_score. We require a score of 0.8+ (along with semantic similarity) to pass. This catches cases where the agent might hallucinate an answer without actually using the required tools.
Tier 3: End-to-End Human Review 👁️
- Goal: Allow humans to inspect the reasoning process for complex queries.
- Implementation:
pilot/evaluation/human_review.py. - Result: Each run generates a clean Markdown report in
pilot/evaluation_reports/containing the full Q&A trace, tool usage, and scores. This is uploaded as a CI Artifact (human-review-reports) for easy inspection.
3. Architecture Evolution & Metamorphosis 🦋
Our agent architecture has undergone a significant metamorphosis to address the "infinite loop" problem and optimize for cost/latency.
📜 Phase 1: The "Spinning" Researcher (Legacy)
Initially, the agent was a Reactive entity. It blindly entered a research loop for every query.
- The Flaw: When validating jargon-heavy queries (e.g., "activate RACE START"), the validator would reject imperfect answers, forcing the agent to research again and again—spinning indefinitely.
- The Diagram:

🧠 Phase 2: The Intelligence Center (Modern)
We re-architected the system into a Predictive "Intelligence Center". The agent now acts as a Planner, routing queries based on knowledge state.
- The Fix: A Memory-First strategy.
- Recall: The agent must check Long-Term Memory (Vertex AI) first. If the answer exists, it returns immediately (0 searching cost).
- Research: Only if memory misses does it deploy the heavy
DeepResearchWorkflowtool.
- The Diagram:

🔄 The Metamorphosis
This shift transforms the agent from a simple tool-user to a state-aware orchestrator.

4. Current Limitations (The "Honest" Part)
- Monte Carlo Tree Search (MCTS): While intended to be part of the advanced planning capabilities, the MCTS component is currently not fully functional and disabled in the active evaluation path. We are relying on the deterministic
SequentialAgentflow for now. - Dependency Speed: The
sentence-transformerslibrary (used for similarity scoring) is heavy. We implemented a robust fallback to a mock scorer if the download times out, ensuring the pipeline doesn't flake due to network issues, but this means local runs might sometimes skip semantic verification if the environment isn't cached.
Security & Guardrails (Model Armor) 🛡️
We integrated Google Cloud Model Armor to sanitize inputs before they ever reach our agent logic.
- Mechanism: A
before_model_callbackintercepts every request. - Filters: We use the
alora-ma-templatewhich enforces:- PII Detection: Blocks sharing sensitive personal info.
- Jailbreak/Attack: Prevents prompt injection attempts.
- Malicious URIs: Filters unsafe links.
- Result: If a thread is detected, the prompt is scrubbed and replaced with a system refusal instruction, protecting the LLM context.
Audio Integration (ElevenLabs) 🎧
We have integrated ElevenLabs TTS to provide on-demand audio insights for our widgets.
Features
- On-Demand Generation: Audio is synthesized only when requested (clicked) to optimize costs.
- Custom Waveform: A custom canvas-based visualizer mimics the ElevenLabs UI style.
- Caching: Generated audio files are stored in a public Google Cloud Storage bucket (
audio_assets/) and served via CDN to avoid re-generating the same audio.
Configuration
Required environment variables in .env or Google Cloud Run:
-
ELEVENLABS_API_KEY: Your API Key. -
AUDIO_BUCKET_NAME: GCS Bucket name (defaults tovigilant-journey-assets).
The service automatically creates the bucket and sets public-read permissions if it doesn't exist.
Observability & Incident Management 📊
We utilize Datadog for full-stack observability, including APM traces, RUM, and Incident Management.
Incident Reporting (Jan 2026)
We have compiled a comprehensive incident report analyzing the stability of the Pilot launch and Outage tracking.
- Report: Alora Incident Report (PDF)
- Methodology: These incidents were manually declared and managed directly within the Datadog platform to demonstrate the end-to-end Incident Management lifecycle, generating authoritative reports from system records.
This report covers:
- Incident Trends: Breakdown of SEV-1/SEV-2 outages.
- Response Metrics: MTTR (Mean Time to Repair) analysis.
- Post-Mortems: Action items derived from simulated API and Audio failures.
🔮 What's Next
Phase 1: Data Agents
- Goal: Automate data collection, ingestion, and user preference learning
Phase 2: Android XR (Glasses)
- Goal: AR/VR-compatible app using Jetpack Compose or Multiplatform
- Features:
- Heads-up display on engineer's helmet visor
- Spatial audio for directional cues
- Gesture controls for hands-free operation
Phase 3: Public Release
- Google Play: Consumer Android app
- Meta Horizon: VR experience for sim racing
- Integration: Voice-enabled dashboard for race engineers
- Build for google glasses
Phase 4: Architecture Evolution (Current Focus)
Unified Persona Architecture: Transitioning from context-switching multiple personas to a single, coherent persona that adapts to the current mode (EV vs. Race). Object Classification with Meta SAM3: Integrating Meta's Segment Anything Model 3 (SAM3) for advanced object detection and classification capabilities. Note: As an open-source model, this will require a hosting strategy (e.g., self-hosted, cloud inference API).

Log in or sign up for Devpost to join the conversation.