Inspiration

Every campus, every nonprofit, every hospital is drowning in information. It’s not that the answers don’t exist. They do. But they’re scattered across course catalogs, PDFs, portals, event sites. For students, that means 15 browser tabs open just to check dining hours. For a nonprofit, it means outdated volunteer pages. For a patient, it means getting lost in a maze of hospital forms.

And here’s the real problem: when information is that fragmented, it’s basically invisible. Students miss deadlines. Communities miss opportunities. People give up before they find what they need.

We asked a simple question: what if you could just ask? What if instead of searching across a mess of systems, you typed: “Where can I print on campus tonight?” or “How many volunteer hours were logged this month?”, and got a clear, trusted answer, instantly?

That’s the vision of AGORA. We started with Penn because of PennApps, but it’s never been just about Penn. This is a framework any university, nonprofit, or hospital could deploy in a weekend. AGORA is about democratizing access to information. We’re turning scattered data into something usable, accessible, and human.

What it does

Agora enables real-time, conversational access to high-quality, up-to-date information by indexing trusted web sources at custom frequencies. Many small businesses and campus organizations not only lack the resources to build their own LLM-powered systems, but also struggle with fragmented and inconsistently maintained data. Agora bridges that gap by centralizing and structuring their information, making it accessible through a seamless natural language interface.

How we built it

Agora is built on a modular Django-based multi-service architecture, designed for scalability, flexibility, and maintainability. It integrates a custom web scraper, a vector database for semantic search, and LLM services as distinct yet cohesive components. This setup allows us to aggregate and unify fragmented data sources, keep them up to date, and make the resulting information available through a conversational interface.
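The retrieval side of this pipeline can be sketched in a few lines. This is a minimal, self-contained illustration, not the production code: the real system embeds text with an LLM service and queries Milvus, whereas here a toy word-count "embedding" and cosine similarity stand in for both, and all names (`embed`, `search`, the sample documents) are hypothetical.

```python
import math

# Toy stand-in for the real embedding service: each "embedding" is just a
# word-count vector over a small vocabulary. The production system would use
# LLM embeddings stored in and queried from Milvus instead.
def embed(text: str, vocab: list[str]) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: list[str], vocab: list[str]) -> str:
    """Return the document most semantically similar to the query."""
    qv = embed(query, vocab)
    return max(docs, key=lambda d: cosine(qv, embed(d, vocab)))

vocab = ["print", "dining", "hours", "library", "campus"]
docs = [
    "The library offers print services on campus until midnight.",
    "Dining hall hours are 7am to 9pm on weekdays.",
]
best = search("where can I print on campus", docs, vocab)
```

The same shape scales up directly: swap the toy embedding for a real model, the list of documents for a vector index, and the answer is grounded in the retrieved source page.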

Here’s a breakdown of our project architecture:

Backend

Django: Core framework for the API and dashboard
FastAPI: Vector search microservice
APScheduler: Job scheduling and automation
Milvus: AI-powered vector database
Dashboard: A service for managing the URLs used by the model
SQLite: Default local database
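"Custom frequencies" boils down to a per-source refresh interval. The sketch below shows the idea with plain Python; the table of sources and the `due_for_recrawl` helper are hypothetical, and in the real project these intervals would be registered with APScheduler rather than checked by hand.

```python
# Hypothetical per-source refresh table: each trusted URL carries its own
# recrawl interval in seconds. In production, each entry would become an
# APScheduler interval job rather than a manual check.
SOURCES = {
    "https://example.edu/dining": 3600,    # dining hours change often: hourly
    "https://example.edu/catalog": 86400,  # course catalog: daily is enough
}

def due_for_recrawl(url: str, last_crawled: float, now: float) -> bool:
    """True once the source's custom refresh interval has elapsed."""
    return now - last_crawled >= SOURCES[url]
```

Keeping the frequency next to the source means a fast-changing page (dining hours) and a slow-changing one (the course catalog) don't have to share a crawl schedule.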

Project Structure Highlights

agora_crawler.py: Core scraping service with robots.txt compliance
scheduler: Automated job manager that handles data updates
services.py: Manages all services, both internal and external
ai_services.py: Integrates AI services for text processing, summarization, and embeddings
summarization_pipeline.py: Orchestrates the scraped-page-to-vector-embedding pipeline
enterprise_batch_api.py: Handles bulk data scraping
milvus_service.py: Stores and queries embeddings in the vector database
vector_search_api.py: Queries the vector database
signals.py: Handles data expiration and renewal
file_organization.py: Stores scraped pages and summaries by domain and path structure
Dockerfile + docker-compose.yml: Production deployment
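Two of the pieces above are easy to show concretely: the robots.txt check the crawler runs before fetching, and a domain/path storage layout in the spirit of file_organization.py. This is a sketch under assumptions — the bot name `AgoraBot`, the `.json` suffix, and the exact layout convention are illustrative, and the rules are parsed from inline text here so the example stays offline.

```python
from urllib import robotparser
from urllib.parse import urlparse
from pathlib import PurePosixPath

# robots.txt compliance: parse rules (inline here; the crawler would fetch
# the site's real robots.txt) and check each URL before scraping it.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def allowed(url: str) -> bool:
    return rp.can_fetch("AgoraBot", url)

# file_organization-style layout: derive a storage location from the URL's
# domain and path, so scraped pages group naturally by site and section.
def storage_path(url: str) -> str:
    p = urlparse(url)
    path = p.path.strip("/") or "index"
    return str(PurePosixPath(p.netloc) / path) + ".json"
```

Organizing by domain and path keeps every source's pages together on disk and makes it cheap to expire or re-scrape a whole section at once.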

Testing & QA

test_enterprise_api.py: Batch processing tests
test_agora_integration.py: Agora crawler validation
test_vector_system.py: AI vector database tests
test_recursive_crawler.py: Crawler service tests

Challenges we ran into

Authentication Requirements for Campus Endpoints
While Penn provides a wealth of high-quality public datasets, the most relevant and up-to-date information for students is often gated behind university authentication. Accessing and integrating this data was a significant hurdle.

Data Freshness and Quality Control
Ensuring the accuracy and reliability of our data sources proved to be the most time-consuming aspect of development. Keeping information up to date required careful curation and continuous monitoring.

Accomplishments that we're proud of

Model Accuracy and Source Attribution
Our system not only provides highly accurate responses but also cites its sources transparently, allowing users to verify the information themselves.

Data Quality and Consistency
We curated a robust dataset, prioritizing both accuracy and relevance to student needs.

Seamless User Experience
We focused heavily on crafting a smooth, intuitive interface that enhances usability and engagement.

Automated Data Monitoring
We built a system that tracks data expiration and automatically refreshes outdated information, ensuring long-term reliability.
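The expiration tracking behind that last point can be sketched simply: stamp every scraped record with its fetch time, and sweep for anything past its time-to-live. The names (`is_expired`, `stale_urls`, the 24-hour default) are hypothetical stand-ins for what signals.py does in the real project.

```python
from datetime import datetime, timedelta

# Hypothetical freshness tracker in the spirit of signals.py: every scraped
# record keeps its fetch time, and anything past its TTL is queued for recrawl.
DEFAULT_TTL = timedelta(hours=24)

def is_expired(fetched_at: datetime, now: datetime,
               ttl: timedelta = DEFAULT_TTL) -> bool:
    return now - fetched_at > ttl

def stale_urls(records: dict[str, datetime], now: datetime) -> list[str]:
    """Return URLs whose cached copy has outlived its TTL."""
    return [url for url, fetched_at in records.items()
            if is_expired(fetched_at, now)]
```

Running a sweep like this on a schedule is what keeps answers trustworthy long after the initial scrape.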

What we learned

Indexing and managing large volumes of data while maintaining high standards of accuracy and consistency is especially challenging when enabling real-time, conversational access. Building a system that balances scalability, data quality, and seamless interaction taught us what it takes to make information truly accessible through natural language.

What's next for Agora

We're excited to expand Agora beyond Penn, bringing conversational data access to more universities, nonprofits, and small businesses. These organizations often face similar challenges: fragmented internal data, limited technical resources, and a growing need to provide fast, reliable answers to users. Our next steps include:

Expanding to other universities to help students, faculty, and staff access campus resources through natural language, without needing to navigate outdated websites or portals.

Partnering with nonprofits and small businesses to help them unify their public-facing data and provide AI-powered chat interfaces without needing to hire full development teams.

Improving customizability so organizations can easily control what data is indexed, how it's refreshed, and how the chatbot behaves.

Launching a self-serve platform where any org can set up its own Agora instance in just a few steps.

Ultimately, our goal is to democratize access to high-quality, conversational interfaces so anyone, regardless of technical ability, can turn their information into something usable, searchable, and helpful.
