Inspiration
University course catalogs are treasure troves of information crucial for students, faculty, and administrators. Unfortunately, this valuable data is often trapped within unwieldy, unstructured PDFs, making access and utilization challenging. Our mission is to liberate this data by extracting key elements and converting them into a structured, machine-readable format. Moreover, we're committed to developing a natural-language query interface for precise inquiries, ensuring users receive accurate answers with specific page and paragraph citations.
What it does
Our project, Course Catalog Intelligence, pulls important information from university course catalog PDFs and lets you ask questions in plain language. It helps everyone at the university find what they need more easily.
How we built it
- Indexing: We first indexed the university course catalog PDFs using LangChain's tools.
- Data Modeling: We defined a data model for the extracted course information using Pydantic. This model, named Course, includes fields such as title, code, description, credit hours, and prerequisites. This ensures consistency and clarity in the structured data representation.
- Generation: To generate answers to user queries, we implemented a generation pipeline. This pipeline consists of a prompt template, a language model (in this case, ChatOllama), and an output parser.
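The data model and generation steps above can be sketched roughly as follows. This is a minimal, dependency-free sketch rather than the project's actual code: a standard-library dataclass stands in for the Pydantic Course model, and a hypothetical fake_model function stands in for ChatOllama so the example runs without a local model. The field names credit_hours and prerequisites, the prompt wording, and the sample course are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Course:
    """Stand-in for the Pydantic Course model (assumed field names)."""
    title: str
    code: str
    description: str
    credit_hours: float
    prerequisites: List[str] = field(default_factory=list)


# Prompt template: the first stage of the pipeline (wording is an assumption).
PROMPT_TEMPLATE = (
    "Use only the catalog excerpt below to answer.\n"
    "Return a JSON object with keys: title, code, description, "
    "credit_hours, prerequisites.\n\n"
    "Excerpt:\n{context}\n\nQuestion: {question}\n"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the retrieved excerpt and the user question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def parse_course(raw: str) -> Course:
    """Output parser: turn the model's JSON text into a Course object."""
    data = json.loads(raw)
    return Course(
        title=str(data["title"]),
        code=str(data["code"]),
        description=str(data["description"]),
        credit_hours=float(data["credit_hours"]),
        prerequisites=[str(p) for p in data.get("prerequisites", [])],
    )


def run_pipeline(model: Callable[[str], str], context: str, question: str) -> Course:
    """Chain the three stages: prompt template -> model -> output parser."""
    return parse_course(model(build_prompt(context, question)))


def fake_model(prompt: str) -> str:
    """Hypothetical stub for the chat model; returns made-up sample data."""
    return json.dumps({
        "title": "Data Structures",
        "code": "CS 201",
        "description": "Lists, trees, and graphs.",
        "credit_hours": 3,
        "prerequisites": ["CS 101"],
    })


if __name__ == "__main__":
    excerpt = "CS 201 Data Structures. 3 credit hours. Prerequisite: CS 101."
    print(run_pipeline(fake_model, excerpt, "What are the prerequisites for CS 201?"))
```

Keeping the model behind a plain callable means the stub could be swapped for a real chat model call without changing the prompt or parsing steps.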
Challenges we ran into
We initially used Azure AI Document Intelligence, a cloud-based service that leverages machine learning models to automate data processing within applications and workflows.
Advantages of Azure AI Document Intelligence:
- Efficient Data Extraction: Document Intelligence automates the extraction process, saving time and effort.
- Customization: Create custom models to handle unique document layouts and formats.
- Integration: Seamlessly integrate with your existing applications and workflows.
- Structured Output: Receive structured JSON output, making it easy to work with the extracted data.
- Custom Models and Classification Models: We created a custom model that uses OCR to retrieve fields in a structured manner.
Unfortunately, this approach had some limitations.
Limitations Encountered:
- Training Page Limit: Training was limited to one page, which is challenging when dealing with longer documents.
- OCR Limitations: OCR (Optical Character Recognition) was restricted to a single page, and consistency issues can arise across files.
- Testing Variability: Test results did not remain consistent across different files, which impacts reliability.
Accomplishments that we're proud of
Llama 3 was released 3 days ago, and we were able to leverage the model in our application!
We take immense pride in not only achieving our objectives but also in the learning trajectory that this project offers.
What's next for Course Converse
With a successful prototype in place, we are enthusiastic about expanding our model's horizon. We aim to integrate more diverse datasets, such as responses in the healthcare industry, and apply them to more industry problems.


