Hackathon Report: PrivateAI
1. Inspiration
The genesis of PrivateAI stems from the "Privacy Paradox" currently plaguing the generative AI landscape. While Large Language Models (LLMs) offer unprecedented productivity, they often require sending sensitive personal or proprietary enterprise data to cloud-based APIs. This creates significant risks regarding data sovereignty, latency, and internet dependency.
Our team asked a fundamental question: Can we achieve GPT-like intelligence on consumer hardware without a single byte leaving the device?
Inspired by the democratization of AI, our vision was to build a local-first Retrieval-Augmented Generation (RAG) system. We wanted to empower journalists, legal researchers, and developers to query their private documents securely, leveraging the power of edge computing rather than relying on centralized servers.
2. Learning
This hackathon was an intensive crash course in optimizing Small Language Models (SLMs) for edge deployment. We moved beyond simple API calls to understanding the physics of inference.
Key Technical Concepts Mastered:
- Small Language Models (SLMs): We learned that parameter count does not strictly dictate reasoning capability. By utilizing high-quality synthetic data for training, models like Phi-3 or Llama-3-8B can outperform larger predecessors in specific reasoning tasks.
- The RunAnywhere SDK: A crucial component of our stack was the RunAnywhere SDK. It abstracted the hardware layer, allowing our code to dynamically offload matrix operations to the NPU (Neural Processing Unit) or GPU depending on availability, optimizing the thermal profile of the device.
- Quantization: We deep-dived into reducing model precision to fit into limited VRAM. We utilized 4-bit quantization (Q4_K_M) to compress the model weights. The relationship between model size and memory requirements was calculated using:
$$ M_{GB} \approx \frac{P \times Q}{8 \times 10^9} + K_{cache} $$
Where $M_{GB}$ is memory in Gigabytes, $P$ is the parameter count (e.g., 3.8 Billion), and $Q$ is the quantization bit-depth (e.g., 4 bits). Excluding the KV cache overhead ($K_{cache}$), a 3.8B model at 4-bit precision requires roughly:
$$ M \approx \frac{3.8 \times 10^9 \times 4}{8 \times 10^9} = 1.9 \text{ GB} $$
This calculation proved that high-fidelity inference was mathematically possible on standard 8GB RAM laptops.
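The estimate above is easy to reproduce in a few lines of Python. The KV-cache term is left as a parameter, since it depends on context length and architecture:

```python
def model_memory_gb(params: float, bits: int, kv_cache_gb: float = 0.0) -> float:
    """Approximate memory footprint of quantized model weights, in GB.

    params: parameter count (e.g. 3.8e9 for a 3.8B model)
    bits: quantization bit-depth (e.g. 4 for Q4_K_M)
    kv_cache_gb: optional KV-cache overhead, in GB
    """
    return (params * bits) / (8 * 10**9) + kv_cache_gb

# A 3.8B-parameter model at 4-bit precision, ignoring the KV cache:
print(model_memory_gb(3.8e9, 4))  # 1.9
```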
3. Build Process
Our architecture follows a strictly local RAG pipeline:
- Ingestion Layer: We built a Python watcher that detects PDF/TXT files in a designated local folder.
- Vectorization: Using a quantized embedding model (all-MiniLM-L6-v2), we converted text chunks into vector embeddings locally.
- Storage: These vectors were stored in a persistent ChromaDB instance running locally; no cloud vector stores were used.
- Inference Engine: We integrated the RunAnywhere SDK to load a GGUF format version of the Phi-3 model.
- User Interface: A Streamlit frontend was developed to handle user queries, retrieve relevant context from ChromaDB, and inject it into the SLM's context window for the final answer.
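The retrieve-and-inject step at the heart of the pipeline can be sketched as below. To keep the example self-contained we use plain cosine similarity over in-memory vectors; in the actual build, ChromaDB plays this role and the embeddings come from all-MiniLM-L6-v2. The toy `embed` function and the prompt template are illustrative assumptions, not our production code:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (all-MiniLM-L6-v2 in our build):
    # a crude bag-of-letters vector, just to make the example runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank stored chunks by similarity to the query, keep the top k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Inject retrieved context into the SLM's context window.
    context = "\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

The final prompt string is what gets handed to the Phi-3 model loaded through the RunAnywhere SDK.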
4. Challenges
The road to a working prototype was paved with hardware bottlenecks.
The Memory Bandwidth Wall: While we solved the capacity issue with quantization, we hit a bottleneck with memory bandwidth on CPU-only inference. The token generation speed (Tokens Per Second - TPS) is constrained by how fast data moves from RAM to the processor.
$$ TPS_{max} = \frac{B_{mem}}{S_{model}} $$
Where $B_{mem}$ is memory bandwidth (GB/s) and $S_{model}$ is the model size (GB). On a machine with 50 GB/s bandwidth and a 4GB model, the theoretical max is 12.5 tokens/second. We initially saw speeds as low as 3 TPS. We resolved this by utilizing the RunAnywhere SDK to pipeline layers more efficiently and offload the prompt processing (prefill) to the integrated GPU, doubling our inference speed.
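The bandwidth ceiling is straightforward to estimate; the snippet below reproduces the figures quoted above (50 GB/s bandwidth, 4 GB model):

```python
def max_tokens_per_second(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Upper bound on decode speed: generating each token streams the
    full set of weights from RAM, so TPS <= bandwidth / model size."""
    return bandwidth_gbs / model_size_gb

print(max_tokens_per_second(50, 4))  # 12.5
```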
Context Window Limitations: Local SLMs often have smaller context windows. Feeding large legal documents resulted in truncation. We overcame this by implementing a "sliding window" chunking strategy during the vectorization phase, ensuring no semantic data was lost at the boundaries of text blocks.
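The sliding-window strategy can be sketched as follows; the chunk size and overlap values here are illustrative defaults, not the tuned settings from our build:

```python
def sliding_window_chunks(
    tokens: list[str], size: int = 256, overlap: int = 64
) -> list[list[str]]:
    """Split a token sequence into overlapping windows so that text
    spanning a chunk boundary appears whole in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already reaches the end of the sequence
    return chunks
```

Each chunk is then embedded and stored independently, so a sentence cut by one window boundary is still intact in the neighboring window.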
Conclusion
PrivateAI demonstrates that the future of AI is not just in the cloud, but at the edge. By combining efficient SLMs, aggressive quantization, and hardware-aware SDKs, we successfully built a private, offline, and intelligent assistant.
The Local AI Revolution: Aura Nexus Edge Diagnostics

About the Project
Aura Nexus represents the pinnacle of the "Local AI Revolution," a paradigm shift from centralized, cloud-dependent intelligence to ubiquitous edge processing. In a world where medical data is the most sensitive asset a person possesses, the current model of uploading clinical images and symptom lists to black-box APIs is fundamentally broken. Aura Nexus utilizes the RunAnywhere SDK and Nexa AI blueprints to demonstrate a mobile-first, privacy-preserving diagnostic analyst.
The Inspiration
Our team was inspired by the critical failure points in modern healthcare informatics:
- Latency Vulnerability: In emergency trauma scenarios, a 5-second round trip to a cloud server is 5 seconds too long.
- Connectivity Deserts: Rural medics and humanitarian aid workers operating in "dead zones" are currently stripped of AI-augmented capabilities.
- Data Sovereignty: The ethical imperative to keep patient data locally quantized and encrypted.

Technical Implementation & Science
The core architecture follows the logic of quantized tensor processing. We utilize Small Language Models (SLMs) such as DeepSeek-R1-Distill and Llama-3.2-3B, optimized via 4-bit integer quantization (INT4).
The mathematical foundation of our diagnostic confidence interval is derived from Bayesian inference:
$$ P(D \mid S) = \frac{P(S \mid D) \, P(D)}{P(S)} $$
where $D$ represents the diagnosis and $S$ the observed symptoms. By running these computations on the device's NPU, we achieve near-zero latency.
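Plugged into code, the Bayesian update looks like this; the prior and likelihood values below are placeholder numbers for illustration, not clinical data:

```python
def posterior(p_s_given_d: float, p_d: float, p_s: float) -> float:
    """Bayes' rule: P(D|S) = P(S|D) * P(D) / P(S)."""
    return p_s_given_d * p_d / p_s

# Hypothetical figures: the symptom appears in 80% of cases of D,
# D has a 1% prevalence, and the symptom's base rate is 5%.
# posterior(0.8, 0.01, 0.05) is roughly 0.16, a 16x update over the prior.
```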
Challenges Faced
Building for the edge presents unique constraints. Memory pressure (RAM management) was our primary hurdle: modern mobile devices often kill processes exceeding 2GB of active memory. We solved this by implementing a Multi-Stage Loading Pipeline:
- Whisper Tiny: For initial voice-to-text conversion of clinical symptoms.
- Context Eviction: Actively flushing the KV cache of the SLM between diagnostic iterations.
- Hybrid Quantization: Using Q4_K_M for weights and Q8_0 for sensitive clinical embedding layers to maintain accuracy.

The "Offline" Edge Scenario
Imagine a paramedic in a rural mountain pass. There is no 5G. The patient shows signs of acute pulmonary edema. Aura Nexus, running entirely on a standard mobile chipset, performs a differential diagnosis, identifies contraindications for medication, and provides a real-time guidance protocol, all while 100% offline.
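The context-eviction idea can be sketched as a simple cap on retained KV-cache entries between diagnostic iterations. The cache structure below is a stand-in for illustration; real inference runtimes manage the KV cache internally:

```python
from collections import deque

class BoundedKVCache:
    """Keeps only the most recent `max_tokens` entries, evicting the
    oldest first, so resident memory stays under the device's limit."""

    def __init__(self, max_tokens: int):
        self.entries = deque(maxlen=max_tokens)

    def append(self, kv_entry):
        # deque with maxlen drops the oldest entry automatically.
        self.entries.append(kv_entry)

    def flush(self):
        """Full eviction between diagnostic iterations."""
        self.entries.clear()
```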
Roadmap to the Future
We are moving toward Federated Clinical Learning, where models can "learn" from local cases without ever sharing raw patient data, exchanging only synchronized weight updates. This is the Dendrite Nexus vision: a global nervous system of intelligence that respects individual privacy.
Built With
- flask
- geminiapi
- python
- react
- typescript