Introduction

Imagine a world where you can whisper a question to a vast ocean of knowledge, and it whispers back the perfect answer, tailored to your needs. This is the promise of large language models (LLMs), powerful AI tools that can learn and process information like never before. But there's a catch: sometimes, these AI giants need a little nudge in the right direction.

That's where our project comes in. We've built a smarter way to talk to LLMs, inspired by the groundbreaking "Promptbreeder" work by Google DeepMind. Think of it as a bridge between your thoughts and the LLM's understanding, a whisper translator that ensures your questions land on fertile ground.

The challenge? Crafting the right prompts for LLMs can be tricky. It's like asking a friend for a recipe without mentioning the ingredients you have on hand. Our solution? An automated prompt engineering system that uses a clever algorithm to evolve and refine prompts on the fly. Think of it as a language chef, constantly whipping up the perfect recipe for each LLM task.

This isn't just about making AI smarter; it's about unlocking its true potential. Imagine LLMs that can:

  • Solve complex math problems with laser-like precision.
  • Craft personalized stories that ignite your imagination.
  • Translate languages with nuanced understanding, capturing the essence of every word.

And the best part? This is just the beginning. Our solution is flexible and adaptable, with the potential to expand beyond its current math-focused capabilities. In the following, we will describe the challenges we faced and go into detail about our solution.

Our Solution

We spent the first days learning about genetic algorithms and prompt engineering. Once we understood the task and had an overview of the technologies used in the field, we started drafting the architecture of our solution.

Finalising the architecture took us almost 2 weeks. Once the architecture was set, we quickly wrote the pseudo-code for our project, assigned tasks to every member, and started coding.

From the start, we decided to work on grade school math problems, since data sets are readily available and checking the correctness of an answer should be straightforward. However, as we will see later when we go into detail about the fitness function, even math problems bring their own challenges.

1. Generating Instructions:

  • Use an LLM to create instructions: "Generate Instructions on how to solve a math problem"
  • Save the instructions in the pool of instructions

2. Add a Mutation prompt:

  • Get 1 mutation prompt from the Mutation List:
    “As a really good teacher, explain the instruction, as if you were explaining it to a child.”
  • Combine Mutation Prompt (M) and Instruction (I) to get final instruction (I’)

3. Take 50 Questions from the data set:

  • Retrieve questions: "Gretchen has 110 coins. There are 30 more gold coins than silver coins. How many gold
    coins does Gretchen have?"
  • Combine Question (Q) and Instruction (I’)

4. Feed Q+I’ to LLM and Score the Answer:

  • We use Llama-7B to answer the question with the instruction
  • We score the answer and the in-between steps

5. Terminate after 50 iterations? If fewer than 50 iterations have passed, we continue the algorithm:

  • We keep the 2 best instructions
  • We build a crossover of the 2 best instructions
  • We add another mutation to the 2 best instructions and repeat steps 2-5
  • When we reach 50 iterations, we end the algorithm and choose the best 10 instructions
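The steps above can be sketched as a single evolutionary loop. This is a minimal illustration only: the helper names (`llm`, `fitness`, `mutate`, `crossover`, `evolve`) are ours, not the project's actual API, and the `llm` call is stubbed out where the real system would query Llama and score answers against the data set.

```python
import random

def llm(prompt):
    """Stub for the real Llama call; returns a placeholder answer."""
    return "stub answer"

def fitness(instruction, questions):
    """Placeholder fitness: the real system scores answers and
    intermediate steps; here we just sum answer lengths."""
    return sum(len(llm(q + " " + instruction)) for q in questions)

def mutate(instruction, mutation_prompt):
    """Combine a mutation prompt M with an instruction I to get I'."""
    return mutation_prompt + " " + instruction

def crossover(a, b):
    """Simplest possible crossover: splice the halves of two instructions.
    (The actual project exchanges matching parts of speech instead.)"""
    wa, wb = a.split(), b.split()
    return " ".join(wa[: len(wa) // 2] + wb[len(wb) // 2 :])

def evolve(pool, mutations, questions, iterations=50):
    """Keep the 2 best instructions, add their crossover and two
    mutated variants, and repeat for the given number of iterations."""
    for _ in range(iterations):
        scored = sorted(pool, key=lambda i: fitness(i, questions), reverse=True)
        best1, best2 = scored[0], scored[1]
        pool = [
            best1,
            best2,
            crossover(best1, best2),
            mutate(best1, random.choice(mutations)),
            mutate(best2, random.choice(mutations)),
        ]
    return sorted(pool, key=lambda i: fitness(i, questions), reverse=True)
```

In the real system the loop runs for 50 iterations over 50 questions per evaluation, and the top 10 instructions are kept at the end.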

The Genetic Algorithm

In the following, we will describe the main parts of the genetic algorithm in more detail.

We start with mutations, which in our case are additional instructions or thinking styles that change the prompt. We first mutate the initial set of instructions; later, after the first evaluation, we randomly apply mutations to the best prompts.

For the crossover, we combine parts of the best-performing prompts. We decided to exchange only the same parts of speech between the two crossover prompts; for example, we only exchange verbs with verbs.
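A toy sketch of this verb-exchange idea is shown below. A real implementation would use a proper part-of-speech tagger (e.g. nltk or spaCy); here a tiny hand-written verb list stands in so the mechanic is runnable, and the function name `swap_verbs` is ours, not the project's.

```python
# Toy verb list; a real system would use a POS tagger instead.
VERBS = {"solve", "explain", "check", "simplify", "verify", "compute"}

def swap_verbs(prompt_a, prompt_b):
    """Exchange the verbs of two prompts, leaving all other words in place."""
    words_a, words_b = prompt_a.split(), prompt_b.split()
    verbs_a = [w for w in words_a if w.lower() in VERBS]
    verbs_b = [w for w in words_b if w.lower() in VERBS]

    def replace(words, replacements):
        # Walk the prompt; each verb is replaced by the next verb
        # taken from the other prompt (or kept if none are left).
        out, it = [], iter(replacements)
        for w in words:
            out.append(next(it, w) if w.lower() in VERBS else w)
        return " ".join(out)

    return replace(words_a, verbs_b), replace(words_b, verbs_a)
```

Restricting the exchange to matching parts of speech keeps the offspring prompts grammatical, which is the point of this crossover design.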

Returning to our choice of fitness function: as mentioned before, we restricted ourselves to grade school math problems to simplify the choice of fitness function. However, just evaluating the final result as right or wrong is not ideal: a modified prompt will be scored either 0 or 1, which makes it hard for the genetic algorithm to learn.

We decided to also give scores for intermediate steps, i.e., to rate the logic of the answer. This has several advantages. In particular, LLMs generally do not solve math questions like a calculator; instead, they try to predict the correct next word. Encouraging the correct logic has proven to be a good way for the genetic algorithm to learn which changes to the prompt are good or bad.

The provided solutions in the data set show the intermediate steps; thus, to rate the logic of an answer, we look for the numerical values of those intermediate steps in the answer. This is a fast and easy way to improve our fitness function.

Space for improvement

Fitness Function: Currently, our fitness function scores the logic and final result of a mathematical problem by searching for and comparing numbers in the answer and the data set. It could be improved by analysing the content of the answers, their sentiment, and by using a vector database.

Generalisation: This goes hand in hand with the previous point. At the moment, we are focused on math problems, but a better analysis of the solutions will make it easier to extend our approach to general problems rather than being limited to math. Once the answers can be scored better, generalisation should be easier to achieve.

Applications: Once the code is generalised, insights can be drawn on how to improve prompts. As LLMs have rapidly grown in popularity, a guide on how to write better prompts to get the desired output would be useful, and we could write one.

Challenges

Building the Dream Team in a Crowd

Day 1 threw us into a sea of over 100 faces, all potential teammates but also daunting strangers. Initial conversations felt like mini-interviews, everyone sizing each other up for the perfect fit. We needed a team, but where to start?

Before attending the orientation event, we pre-screened people on Devpost. However, choosing a good candidate was a difficult task. A hackathon is a huge time investment and more experienced hackers might not be able to invest enough time. We looked for people with the right motivation and commitment. While our main goal was to learn and deepen our understanding of LLMs, we also looked for members who were aiming to complete and win the challenge.

During the orientation event, we formed a small team, and afterwards we finalised it at a total of 4 members. We had a quick chat about our ideas on how to tackle the daunting challenge and set up the first meeting.

We started off with a strong team: 2 programmers, 1 project leader and 1 all-rounder, all with their own unique skill sets and expertise. However, during a 3-week long hackathon, teams and priorities change.

Time, Quality and Cost

At first, we aimed really high. We wanted to create the perfect computer program, spending four days making sure every part was just right. But as time went on, we realized that we might have been too ambitious and wouldn't have enough time to actually finish our solution. It felt like we were in a classic project management struggle, trying to balance time, quality, and the limited resources we had, like our computer's power.

So, how did we solve this tough problem? First, we decided to be practical. We switched from being idealistic to being realistic. We stopped trying to find a perfect solution and instead made a clear plan with a schedule. To keep the project on track, we introduced a Kanban board. Each task got a set amount of time, focusing on the most important things and leaving room for unexpected issues.

Next, we dealt with the challenge of our computers not having enough power. At first, we tried to use Llama-2 as our LLM, but the runtimes were too long. On our student budget hardware, we were not able to train our solution efficiently. We switched to a less demanding model. While the runtime was faster, this model answered the questions less accurately. But this was also an opportunity, as it left more room for improvement.

Finally, the biggest challenge was finding time for the hackathon during an already busy Christmas time. After one member left us, we recruited 3 more hackers to join our team and complete the challenge. While everyone had a busy schedule, we managed to support each other and work around everyone’s schedule.

Through all of this, we learned that sometimes the best solutions don't come from having a perfect plan but from being able to change and use what you have, and trusting your team. We managed to balance time, quality, and resources, showing that being practical and supporting each other can make a big difference, even in the busy world of a hackathon.

Other Challenges

  1. Model Performance Issues:
     • The LLAMA2 model encountered difficulties loading and utilizing the Kaggle GPU effectively, resulting in slow performance.
     • The LLAMA2 model exhibited potential performance limitations in terms of accuracy with trivia questions and text generation quality.
  2. Genetic Algorithm Challenges:
     • The algorithm failed to function when the fitness function consistently returned 0 for all individuals. This issue was resolved by fixing inconsistencies in the code.
  3. Mutation Function Considerations:
     • Mutation operators had to be carefully designed to align with the specific problem domain and the representation of individuals. This ensured that mutations introduced meaningful and potentially beneficial variations.
  4. Fitness Function Development Progress:
     • The initial implementation of the fitness function was completed, but further refinement was required to incorporate NLP techniques for context and feature checking.
     • A key question remained: how to base scoring on both the answer and the prompt. Langchain was considered as a possible solution.
     • Due to time limitations, the focus was narrowed to numerical aspects, including logical steps and the final answer.
