Inspiration

A/B testing can be repetitive and tedious. AB Agent, powered by Fireworks AI's function-calling model firefunction-v1 and the open-source mixtral-8x7b-instruct model, aims to automate A/B test design and inference.

Having worked as data scientists, we noticed that designing A/B tests and interpreting their results often meant reaching for the same tools again and again, such as sample size calculators and t-test functions. We hypothesized that AI agents, given access to these tools as Python functions, could automate this process. Thus, we built AB Agent!

Now, non-technical folks from product teams can design and interpret A/B tests using just natural language.

What It Does

AB Agent automates two parts of the A/B testing workflow:

  1. Design:

    • Initially, we take a user's natural language instruction to set up an experiment. E.g., "Design an A/B test to test a UI change where the metric to increase is browsing time. We want the minimum effect to be 10% and we want to be 95% confident in the results."
    • Then, we pass this through the mixtral-8x7b-instruct model to rephrase the user query in a more statistically oriented way (this assists the function-calling model).
    • Finally, the rephrased query is passed to firefunction-v1, which ideally calls the sample size calculator to determine the sample size and the other details needed to design the A/B test.
  2. Inference: To interpret the results of an A/B test, the function-calling model uses a t-test calculator along with other functions to make a go/no-go decision on whether to implement feature B.
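To make the two tools concrete, here is a minimal sketch of what such Python functions might look like. It is not the project's actual code: the function names, parameters, the default power of 0.8, and the shape of the tool-call dictionary are all assumptions for illustration. To stay dependency-free, the inference function uses a normal approximation to the t-test p-value, which is reasonable only for large samples; a real t-test calculator would use the t distribution.

```python
import json
import math
from statistics import NormalDist, mean, variance


def sample_size_per_group(baseline_mean, baseline_std, min_relative_effect,
                          alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample test on a mean.

    Hypothetical helper: alpha=0.05 matches "95% confident"; power=0.8 is
    an assumed default that the user's instruction does not specify.
    """
    delta = baseline_mean * min_relative_effect  # absolute effect to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * baseline_std / delta) ** 2
    return math.ceil(n)


def go_no_go(control, treatment, alpha=0.05):
    """Welch-style test statistic with a normal-approximation p-value.

    Hypothetical helper: returns "go" only if B is significantly better.
    """
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    stat = (mean(treatment) - mean(control)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    decision = "go" if (p_value < alpha and stat > 0) else "no-go"
    return {"statistic": stat, "p_value": p_value, "decision": decision}


# Registry of tools the function-calling model is allowed to invoke.
TOOLS = {"sample_size_per_group": sample_size_per_group, "go_no_go": go_no_go}


def dispatch(tool_call):
    """Run a tool call shaped like {"name": ..., "arguments": "<JSON string>"}.

    The dictionary shape is an assumption, not a documented response format.
    """
    func = TOOLS[tool_call["name"]]
    kwargs = json.loads(tool_call["arguments"])
    return func(**kwargs)
```

For the example instruction above, the rephrasing step would need to surface a baseline mean and standard deviation for browsing time, since the sample size cannot be computed from the minimum effect and confidence level alone.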

Challenges We Ran Into

AI agents, especially those driven by open-source models, are highly unpredictable. Given more time, we would have dedicated more effort to prompt engineering to stabilize the outputs.

What's Next for AB Agent

Automate more A/B tests!

Slide deck: https://docs.google.com/presentation/d/1XjJ3ju01sKQEU14mc5DJGAM1JGTfrO4-MrNCbLl835s/edit?usp=sharing
