Inspiration

Almost every team at this hackathon, and across the tech world, has strongly considered building its own AI solutions. AI sounds great on paper, but it often comes with enormous costs, both in compute and in hiring people specialized enough to apply state-of-the-art optimizations. The result is that top optimization talent aggregates at large companies, while startups and mid-sized companies are left to figure it out on their own. We wanted to create a deterministic, evidence-based engine that democratizes this crucial knowledge so that no compute is wasted.

What it does

Our engine works in a few key steps.

  1. A user's codebase is placed in a Modal sandbox for a configurable amount of time, where it is heavily monitored by profilers that inspect each layer of the software and hardware stack.
  2. The profiling data is passed to specially prompted AI agents armed with selected optimization techniques such as data pre-processing, CUDA graph generation, and memory and attention replacements.
  3. Multiple sandboxes are then spawned in which our optimizations are automatically applied and compete on time, cost, and accuracy within a configured budget.
  4. The winning optimizations are compiled into a report, and the user can deploy any of the models, with their optimizations applied, to Modal.
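The competition step above can be sketched roughly as follows. Everything here is an illustrative stand-in, not our engine's actual code: `run_benchmark` simulates a sandbox run with fake deterministic metrics, and the scoring weights are arbitrary.

```python
import random

def run_benchmark(combo: str) -> dict:
    # Stand-in for launching a sandbox: a real run would apply `combo`
    # to the user's code and measure it on fixed hardware. Here we
    # derive deterministic fake metrics from the combo name.
    rng = random.Random(sum(map(ord, combo)))
    return {
        "combo": combo,
        "latency_s": rng.uniform(0.5, 2.0),
        "cost_usd": rng.uniform(0.01, 0.10),
        "accuracy": rng.uniform(0.95, 1.00),
    }

def score(result: dict) -> float:
    # Higher accuracy is better; latency and cost count against a candidate.
    return 5.0 * result["accuracy"] - result["latency_s"] - result["cost_usd"]

def compete(combos: list[str]) -> dict:
    # Step 3: every candidate runs, and the best overall score wins.
    return max((run_benchmark(c) for c in combos), key=score)

winner = compete(["baseline", "compile + AMP", "fused kernels"])
print(winner["combo"])
```

In the real engine the candidates run in parallel sandboxes rather than a loop, but the select-the-best-scoring-combo logic is the same shape.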

How we built it

The first thing we did as a team was brainstorm every optimization that can be applied to PyTorch code, and which profiles could reveal room for each of them. We then worked out how to architect Modal sandboxes to verify and test combinations of these optimizations, making sure each optimization was both explainable and accurate. After building the tool, we benchmarked it against many popular machine-learning repositories to verify that our engine delivered real, quantifiable value, running full inference passes on identical hardware so that our data was bulletproof.
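The apples-to-apples timing behind those benchmarks can be sketched with a minimal harness like the one below. This is a simplified stand-in for our actual setup (which runs full inference on GPUs); a real GPU benchmark would also synchronize the device before reading the clock.

```python
import statistics
import time

def benchmark(fn, warmup=2, runs=5):
    # Warm up first so one-time setup costs don't pollute the measurement,
    # then take the median of repeated timed runs on the same machine.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

baseline = benchmark(lambda: sum(i * i for i in range(100_000)))
optimized = benchmark(lambda: sum(i * i for i in range(50_000)))
print(f"speedup: {baseline / optimized:.2f}x")
```

Running baseline and optimized variants through the same harness on the same hardware is what makes the before/after numbers comparable.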

Challenges we ran into

Optimizations need to generalize across many different model types and frameworks without ever risking a significant divergence from the user's code. We initially struggled to create an engine that could both find correct optimizations and check for divergence patterns in a time-efficient manner.
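The core idea of the divergence check is simple to sketch: an optimized variant is rejected if any of its outputs drifts beyond tolerance from the reference run. The tolerances and function below are illustrative, not our engine's actual values.

```python
import math

def outputs_diverge(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    # Flag the candidate if any output drifts beyond tolerance
    # from the reference model's output on the same input.
    return any(
        not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )

ref = [0.12, 0.34, 0.54]
ok = [0.120001, 0.340002, 0.539999]  # within tolerance
bad = [0.12, 0.34, 0.60]             # drifted on the last output
print(outputs_diverge(ref, ok))   # → False
print(outputs_diverge(ref, bad))  # → True
```

The hard part in practice is not this comparison but doing it cheaply enough, across enough representative inputs, to catch divergence without burning the time budget.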

Another challenge we faced was having to pivot midway through the weekend. At first, our vision was to create a developer tool focused on optimizing surface-level inference parameters: things like quantization and other data-representation changes. However, we realized that this idea was exactly that, surface level. It is trivial to understand how differences in precision affect a model's speed and accuracy, but actually applying optimizations requires a deeper understanding of the hardware and software. In the last 16 hours, we focused on nothing but execution and managed to break through far past the surface of machine learning.

Accomplishments that we're proud of

Our engine sped up inference for Andrej Karpathy's nanoGPT (54k stars) by ~31% via a stacked torch.compile + AMP combination, an optimizer swap, and Triton fused kernels. It raised average MFU (model FLOPs utilization, a GPU-efficiency measure) from 44.09% to 90.03%, roughly a 2x increase in GPU efficiency, with no noticeable difference in accuracy.
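For a quick sanity check on those figures (reading "~31% speedup" as 1.31x throughput, which is our assumption here):

```python
# Arithmetic behind the reported measurements (illustrative only).
mfu_before, mfu_after = 44.09, 90.03  # measured average MFU, percent
print(f"MFU gain: {mfu_after / mfu_before:.2f}x")  # → MFU gain: 2.04x

speedup = 1.31  # assuming "~31% speedup" means 1.31x throughput
print(f"optimized runtime: {1 / speedup:.0%} of baseline")  # → optimized runtime: 76% of baseline
```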

Being able to pivot from our original idea to a much bigger challenge tested our ability to persevere. Swapping to an idea that was much harder to execute is something we are all proud of, especially with only 16 hours left and running on 3 hours of sleep.

What we learned

The biggest lesson we learned as a team is not to be afraid of challenging problems. Everyone we talked to cited anything involving low-level machine-learning architecture and hardware as an incredibly difficult goal. As freshmen, we all had some experience in machine learning, but not nearly enough to convince others that we could come close to succeeding. We learned that pushing through a weekend with just the right amount of delusion and maximum effort toward a goal may not always work, but in the rare case that it does, it feels incredible.

What's next for AutoProfile

The first thing we want to do is keep adding optimizations and profiling data so our engine can provide even more value. We are in a constantly evolving space, and we need to remain vigilant in order not to fall behind. In the coming weeks, we plan to open-source our code so others can use and contribute to the project without limitation. This will help us accomplish our goal of democratizing optimization knowledge so that no compute is wasted.
