<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Deep Random Thoughts]]></title><description><![CDATA[I write about AI (well, LLMs mostly), founders' health, startup lessons, and send updates and announcements about Aggregate Intellect and AISC community. 
Book me to talk: https://calendly.com/amirfzpr]]></description><link>https://aisc.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!KAvt!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png</url><title>Deep Random Thoughts</title><link>https://aisc.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 17:15:41 GMT</lastBuildDate><atom:link href="https://aisc.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Aggregate Intellect Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aisc@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aisc@substack.com]]></itunes:email><itunes:name><![CDATA[Amir Feizpour (ai.science)]]></itunes:name></itunes:owner><itunes:author><![CDATA[Amir Feizpour (ai.science)]]></itunes:author><googleplay:owner><![CDATA[aisc@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aisc@substack.com]]></googleplay:email><googleplay:author><![CDATA[Amir Feizpour (ai.science)]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Guardians of The Experiment: The New Role of Product Managers in Agentic AI Era]]></title><description><![CDATA[&#8220;What does all this mean for me as a product person?&#8221;]]></description><link>https://aisc.substack.com/p/guardians-of-uncertainty-the-new</link><guid isPermaLink="false">https://aisc.substack.com/p/guardians-of-uncertainty-the-new</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 22 Jul 2025 14:52:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GNkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In 
our work with our commercial clients over the past few years, I have experienced firsthand how my role as the product person has evolved as everything went from regular-software-with-AI-pieces-here-and-there to all-hail-agentic-AI software. Surprisingly, or maybe in hindsight unsurprisingly, the second most common persona attending our agentic AI bootcamp, after developers, is product managers (PMs) and product-oriented founders. And their motivation&#8230;</p><blockquote><p>&#8220;What does all this mean for me as a product person?&#8221;</p></blockquote><p>Product roles, like many other areas of knowledge work, are being invaded from all fronts! The business stakeholders can brainstorm with ChatGPT and send you a long summary of what they think should happen. The CMO can throw some thoughts into Lovable and send you a new frontend with notes about how you should take it from here. And the engineers, who now spend less time finding stupid bugs in the code thanks to their coding agents, spend more time daydreaming about what products could be built.</p><p>At the same time, product people themselves are invading other areas. The most common example is tech or non-tech PMs who are now empowered, via vibe coding, to provide working prototypes to the engineering team instead of vague Jira tickets. 
Or the ones who, after an intense market-research session with Perplexity, think to themselves &#8220;I should really stop following orders from my conservative CEO and launch that business I have wanted for a while.&#8221;</p><p>It seems like we are at a tipping point, at the beginning of a new future, and product people are right at the center of this transformation.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>In the early days of software product management, a PM&#8217;s job was clear: understand user needs, write a clear product requirements document (PRD), work with engineering to scope and build features, then ship them.</p><p>Those days aren&#8217;t gone - but they&#8217;re no longer sufficient. As software begins to rely more heavily on <strong>large language models (LLMs)</strong>, <strong>agentic systems</strong>, and <strong>AI-assisted workflows</strong>, the nature of product development is changing. And so is the role of the PM.</p><p>In this new paradigm, <strong>PMs aren&#8217;t just writing specs in PRDs - they&#8217;re curating experiments</strong>. They&#8217;re not dictating behavior - they&#8217;re defining success through <em>evaluation</em>. 
And they&#8217;re not always building software in the classical sense - they&#8217;re managing emergent systems that must be guided more than constructed.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GNkc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GNkc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png" width="1456" height="971" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2646255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/168956620?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GNkc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GNkc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16a422b2-f19f-4504-97f2-efda2f945741_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>From Certainty to Stochasticity</strong></h3><p>Traditional software systems are deterministic: you click a button, and the code runs exactly as specified. Product managers in this world serve as the glue between business needs and engineering execution, translating ideas into feature sets, wireframes, and task lists.</p><p>But AI-powered systems, particularly those involving LLMs or agents, aren&#8217;t deterministic. They&#8217;re <strong>stochastic</strong>. They produce different outputs on different runs. Their behavior can&#8217;t be fully specified in code. 
Instead, it&#8217;s shaped through:</p><ul><li><p>training data,</p></li><li><p>prompts,</p></li><li><p>context management,</p></li><li><p>agent memory and orchestration,</p></li><li><p>tool usage strategies,</p></li><li><p>and more.</p></li></ul><p>This means the role of the PM has to evolve. The future of product management in AI isn&#8217;t about control; it&#8217;s about <strong>alignment</strong>, <strong>experimentation</strong>, and <strong>iterative refinement</strong>.</p><p>This is not entirely new. In the previous era, while good PMs were busy writing meticulous PRDs, great PMs were all about experimental design and rapid product evolution towards what actually needs to be built. Today, this is no longer an optional stretch goal; it is the definition of the job.</p><p>Instead of shipping static specs, these PMs:</p><ul><li><p>Define product behavior via <strong>evaluation datasets</strong>: curated examples of expected inputs and desirable outputs.</p></li><li><p>Replace checklists with <strong>rubrics</strong>: frameworks for evaluating model behavior across axes like factuality, safety, and, importantly, impact-centric metrics.</p></li><li><p>Manage subjective quality via <strong>preference ranking</strong>: either through human raters or verified LLM-as-a-judge systems.</p></li><li><p>Track progress through <strong>behavioral metrics and evaluation set performance</strong>, not just usage stats or feature completion.</p></li></ul><p>They own the process of asking:</p><blockquote><p>&#8220;What does good look like? How will we know we&#8217;re improving?&#8221;</p></blockquote><p>Rather than defining functionality, they define <strong>desirable behavior</strong> in an open-ended landscape. 
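</p><p>To make this concrete, here is a minimal, hypothetical sketch of what an evaluation dataset, a weighted rubric, and a pairwise preference check might look like. The example cases, rubric axes, and the toy keyword-based scorer are illustrative stand-ins invented for this sketch - in a real system the scoring would come from human raters or a verified LLM-as-a-judge, not a heuristic:</p>

```python
# Hypothetical sketch: defining product behavior via an evaluation
# dataset and a rubric instead of a feature spec. All names and
# cases below are illustrative, not a real product's data.

# Evaluation dataset: curated inputs paired with desirable outputs.
eval_set = [
    {"input": "Cancel my subscription",
     "reference": "Confirms cancellation intent, offers a retention option"},
    {"input": "What's your refund policy?",
     "reference": "States the policy accurately, links to terms"},
]

# Rubric: axes of desirable behavior, weighted by product priority.
rubric = {"factuality": 0.5, "safety": 0.3, "helpfulness": 0.2}

def score(output: str, axis: str) -> float:
    """Toy scorer standing in for a human rater or LLM-as-a-judge.
    It penalizes overconfident claims and rewards substantive answers."""
    if axis == "safety":
        return 0.0 if "guaranteed" in output.lower() else 1.0
    # Naive proxy: longer, more substantive answers score higher.
    return min(len(output.split()) / 20, 1.0)

def rubric_score(output: str) -> float:
    """Weighted aggregate across rubric axes."""
    return sum(w * score(output, axis) for axis, w in rubric.items())

def prefer(a: str, b: str) -> str:
    """Pairwise preference: the comparison PMs track instead of
    binary pass/fail correctness."""
    return a if rubric_score(a) >= rubric_score(b) else b

candidate_a = "Refunds are guaranteed for everyone, always."
candidate_b = ("We offer refunds within 30 days of purchase; "
               "see our terms page for the full policy.")
print(prefer(candidate_a, candidate_b))
```

<p>The artifacts are the point here, not the scorer: the dataset plays the role of a UX spec, the rubric replaces a checklist, and the pairwise comparison is how &#8220;improvement&#8221; gets defined when there is no single correct output.</p><p>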
They become <strong>evaluation strategists</strong>.</p><div><hr></div><h3><strong>Why PMs Matter More Than Ever</strong></h3><p>You can already see this evolution in practice:</p><ul><li><p>PMs working with AI teams now spend more time in <strong>curation tools</strong>, <strong>annotation workflows</strong>, and <strong>model dashboards</strong> than in Jira.</p></li><li><p>They&#8217;re prototyping flows using tools like <strong>GPT-4</strong>, <strong>LangChain</strong>, <strong>Cursor</strong>, <strong>Replit</strong>, and <strong>Figma AI</strong> - not waiting for design and engineering cycles to play out.</p></li><li><p>They&#8217;re defining user experience not through pixel-perfect screens, but through <strong>libraries of I/O examples</strong>, rubrics, and iterative behavioral improvements.</p></li></ul><p>This shift doesn&#8217;t mean PMs are less relevant; it means they&#8217;re <em>differently</em> relevant. There is a lot of talk about PM functions being eliminated and the slack being picked up by engineering and growth teams. But I think that&#8217;s short-sighted and based on a misunderstanding of what (good) PMs bring to the table. 
PMs, from a skillset point of view, are the ones best suited to:</p><ul><li><p>Align stochastic AI behavior with business value using well-maintained evaluation loops</p></li><li><p>Represent the user&#8217;s needs in an ambiguous, generative world</p></li><li><p>Help legal, design, infra, and research teams communicate, not via requirements but via evaluation datasets</p></li><li><p>Decide what &#8220;improvement&#8221; actually means when you can&#8217;t rely on binary &#8220;correctness&#8221;</p></li></ul><p>So, what&#8217;s the playbook if you are a PM or a product-oriented founder?</p><ol><li><p><strong>Learn how evaluation datasets work:</strong> Start thinking of examples as UX specs, not just model tests.</p></li><li><p><strong>Practice writing rubrics and pairwise comparisons: </strong>Define not just &#8220;what&#8221; the system should do, but how to recognize better vs. worse.</p></li><li><p><strong>Prototype with AI tools:</strong> Use Cursor or Claude Code to sketch new flows. Simulate a customer service bot. Try agent frameworks.</p></li><li><p><strong>Think like an experimenter: </strong>Learn how to write and version prompts, get really good at context management for LLMs, and decide the next best intervention based on what you learned in the last experiment.</p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The job of a product manager has always been to navigate ambiguity, advocate for users, and guide software toward value. But in a world of stochastic models and agentic behavior, as the guardians of the experiment, the best PMs will be the ones who can define success even when they can&#8217;t predict the exact steps to get there.</p><p>And that might just be the most exciting job in tech.</p>]]></content:encoded></item><item><title><![CDATA[What do startups get wrong about ai?]]></title><description><![CDATA[Meh! I don't completely agree with what he's saying.]]></description><link>https://aisc.substack.com/p/what-do-startups-get-wrong-about</link><guid isPermaLink="false">https://aisc.substack.com/p/what-do-startups-get-wrong-about</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 20 May 2025 15:10:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3d0a70d4-652e-4e5e-947f-a5e1b14e7f73_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was recently sent a video in which someone asked Sam Altman what he thought startups were misunderstanding about AI. I was asked to react to his response.</p><p>My immediate reaction was:</p><p>&#8220;Meh! I don't completely agree with what he's saying. There are some areas that Open AI and similar firms care a lot about and investing significant effort in fixing something specific in those is a mistake; that I agree. 
But there's a long tail of use cases that OpenAI and such will not care about well into the future, and investing time in those is very worthwhile because you can dominate that market until OpenAI starts to care about it [e.g. Windsurf]&#8221;.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I thought it would be useful to organize and write out my thoughts more clearly. This is an existential question for all of us entrepreneurs, and naturally a question that comes up often in my advisory / bootcamp / training sessions.</p><p>The main issue is that a short snippet is cut out of a broader conversation and, perhaps due to time constraints, articulated briefly. Looking at it out of the context of that conversation, of OpenAI more specifically, and of the market movements more generally results in an incomplete picture. What he says is right, but I think he says it in a tone-deaf way; so, here&#8217;s my unpacking of his response, with the luxury of having pages to babble.</p><blockquote><p>SA: <strong>&#8220;The question is as a startup, do you bet that the technology is as good as it gets, or do you bet that it will get massively better&#8221;.</strong></p></blockquote><p>This is honestly condescending. 
Who, in their right mind, thinks that we&#8217;re at the peak of technology and there&#8217;s nothing else for us to do?</p><p>Of course, all of us understand intellectually that &#8220;as a startup you have to be on your toes and always try to understand what the next important pivot is&#8221;. What I think he&#8217;s correctly pointing out is that a lot of us, against our own wisdom, either get lazy, or are too busy in the weeds, or feel so completely overwhelmed by the number of things we need to track that we forget to watch where technology is going.</p><p>&#8220;Oh my god, there are 6 new models released since the last time I blinked.&#8221;</p><p>Or you open and close a social media app and feel completely drained because every influencer is acting like the info they have is the most urgent thing you need to know or you&#8217;ll die. Balancing this unwelcome amount of inbound and surviving the challenges of startup life, at a personal and professional level, is what a lot of us are mastering every single day. None of us are &#8220;betting that technology isn&#8217;t going to change&#8221;; we&#8217;re just tired, ok? 
I&#8217;d say what he meant to say is something like this:</p><p>&#8220;The question is as a startup founder, do you have the right support system around you to be able to clearly see where tech is going and play ahead of the curve?&#8221;</p><p>Support system = intentional information flow control to be objectively informed + emotional / social / financial / mental support so that you don&#8217;t make bad decisions on bad days + physical / cognitive health routines that keep you sharp and on top of your game.</p><p>Verdict: those of us who don&#8217;t sit on a $10B bank account, and don&#8217;t have an army of assistants and analysts to help us stay ahead of the curve, need to create very efficient systems to ensure we&#8217;re on top of the trends and know with high confidence where tech is going.</p><blockquote><p>SA: <strong>&#8220;If you are building an AI Tutor company, as models get smarter, the level at which students can learn will naturally go up and up. So, maybe it&#8217;s effective for 6th graders, but with the next version it&#8217;s good for 8th graders, and eventually PhD students. So, you get to surf that wave. Or you might say I&#8217;ll put all my effort to barely make this work for the 8th graders in the limited case of history and then do a huge amount of work to have a human in the loop and correct factual errors for this one class. In the first world, you will be very happy when GPT5 comes out, and in the second world you&#8217;ll be really sad&#8221;.</strong></p></blockquote><p>This is a good example of what he is trying to convey, but a bad example in general. OpenAI has a lot of resources, yes, but they don&#8217;t have infinite resources. Yes, they get to care about and work on much larger-scale problems, but ultimately they can only do so for a handful of problems. 
&#8220;AI Tutor for grade-school education&#8221; nicely fits in that small set because there&#8217;s a ton of data their models have been trained on, and it&#8217;s expected that their level of competence in grade-level knowledge handling would generally increase. There are a few other areas that are generally interesting for them, largely highly verifiable spaces like coding and math. The Windsurf acquisition is a very strong signal that they really care about coding as a use case, as they should. I&#8217;m sure they care a lot about the mathematical abilities of their models because finance is a huge space, traditionally data-driven, and dying for more competent models. Probably the same goes for complex multi-reference reasoning for use cases in areas like legal and materials discovery, which are commercially very lucrative. Yes, they are chasing AGI, and that can be more general than we can imagine, etc., but ultimately they too have to hedge their bets and choose their battles, and if we are careful we can align our bets perpendicular to theirs.</p><p>A cynical read on his comment would also be this: of course, he doesn&#8217;t want you to have data that is better than what he has. Because, as unlikely as it is, if you pull it off, then he has to pay a large premium to either out-compete you or acquire you or whatever other options are available to him. It is much cheaper for him to discourage you from pursuing use cases in the spaces OpenAI cares about. 
However, he&#8217;s also serious about what he&#8217;s saying: if you want to go head-to-head with him, you&#8217;d better be damn sure you have the support system to pull it off, because there&#8217;s a 99.999% chance you&#8217;d die there if you&#8217;re not prepared for the battle.</p><p>And of course, if you don&#8217;t have the edge in those spaces, there&#8217;s a long tail of &#8220;unsexy&#8221; use cases that you can go after, scratch your entrepreneurial itch, probably make a bunch of money, and, who knows, maybe even make enough progress to be a future &#8220;Windsurf&#8221; when OpenAI is done with the sexier spaces.</p><p>Regardless of which of these scenarios each of us is operating in, I think what he wanted to say is something like this:</p><p>&#8220;If you are building an AI business, as models get smarter and technology improves, do you have the right build-measure-learn-iterate systems in place to quickly evolve your product / mindset / thought process to keep your product and business model relevant?&#8221;</p><blockquote><p>SA: <strong>&#8220;My intuition would have been that 95% of the entrepreneurs would pick the first world, but it looks like they pick the second world, and then you have this whole </strong><em><strong>open ai killed my startup</strong></em><strong> meme.&#8221;</strong></p></blockquote><p>This is a correct observation, but with the wrong implied reason. 95% of entrepreneurs are working in poorly designed environments and completely lack the support system to pull anything off, let alone a complex AI business, period. That has nothing to do with what OpenAI does. I&#8217;m not even talking about infrastructure issues with funding and supporting innovation from governments, etc. 
I&#8217;m talking about having the right mindset and asking the right questions, creating productive information flow and learning systems, and creating effective, intentional social and financial capital systems to pull off something as complex and fluid as a startup.</p><p>I too have used the <em>open ai killed my product</em> excuse to cover my embarrassment at taking my eyes off the ball when I was in a bad state of mind and low on cognitive sharpness.</p><p>What he probably meant to say is this:</p><p>&#8220;My intuition would have been that 95% of the entrepreneurs invest significant time in sharpening their ability to learn and surround themselves with an environment where learning about tech, their target audience&#8217;s needs, and how those interact with each other is accelerated; yet I see they are distracted by the next shiny thing, asking the wrong questions, and making decisions based on opinions of unqualified social media influencers.&#8221;</p><div><hr></div><p>Summary of what he&#8217;s saying in that unnecessarily dramatic tone: if you&#8217;re a bad startup, you&#8217;ll die; if you&#8217;re a good startup, you&#8217;ll probably still die too, but maybe less quickly. And none of this has anything to do with being an &#8220;AI startup&#8221;, whatever that means. 
</p>]]></content:encoded></item><item><title><![CDATA[Business Case for AI Ethics]]></title><description><![CDATA[What&#8217;s the missing ingredient if we want to sustain AI Ethics efforts in the long run?]]></description><link>https://aisc.substack.com/p/business-case-for-ai-ethics</link><guid isPermaLink="false">https://aisc.substack.com/p/business-case-for-ai-ethics</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 09 May 2025 13:40:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!-aiM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have a hard time recalling any instances where the &#8220;do the right thing&#8221; narrative made significant headway in convincing the people in charge of capital allocation. This is a general statement, but it applies to AI Ethics in particular.</p><p>Any push towards doing the ethical and responsible thing that fails to acknowledge the systemic bias towards maximizing profits is simply wishful thinking. I&#8217;m not trying to be contrarian, and definitely not trying to say that people who are spending their lives advocating for AI Ethics and such are barking up the wrong tree. 
But let&#8217;s be honest: big corporations pretend to care about ethics and responsibility when it helps their PR, and as soon as things get tough, the AI Ethics teams are the first ones to get axed.</p><p>So, what&#8217;s the missing ingredient if we want to sustain AI Ethics efforts in the long run?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Ethical-Economic Paradox</h2><p>Often, arguments about AI ethics start with examples like biased loan application processing systems. They go on to say &#8220;AI might deny loans to people from certain backgrounds due to biased data, and that&#8217;s bad&#8221;. Yes, it is! However, what you&#8217;re failing to mention is that the same AI increases overall processing efficiency, saving the financial institution lots of money, and therefore it has zero incentive to take this kind of outcry seriously. For the CFO of this business, the &#8220;people from certain backgrounds&#8221; you are advocating for are simply statistical errors in an otherwise well-performing, now-improved system.</p><p>You see the issue?</p><p>We create systems that are statistically efficient but cause individual harm, sometimes knowingly, and sometimes without even knowing why, thanks to black-box algorithms. 
The obvious answer is yes, we have a responsibility to make these systems ethically right. But how do we do that in a way that acknowledges the realities of the world?</p><p>You might expect me to say things like &#8220;ethics keeps you out of trouble,&#8221; &#8220;it&#8217;s good for your brand,&#8221; or &#8220;values matter.&#8221; I find these statements often meaningless because we&#8217;ve been saying them for a long time without seeing substantial outcomes.</p><blockquote><p>&#8220;Short-term profit is always at odds with the well-being of the user, society, and environment.&#8221;</p></blockquote><p>It even has a name: <em>the &#8220;ethical-economic paradox&#8221;</em>.</p><p>Let&#8217;s look at this in action:</p><ul><li><p>Startups chasing growth at all costs often deprioritize &#8220;soft values&#8221; like ethics.</p></li><li><p>Private equity firms acquire companies and cut anything that doesn&#8217;t immediately impact the bottom line.</p></li><li><p>Social media platforms are built on maximizing attention and outrage - not on protecting mental health.</p></li><li><p>Big Pharma has minimal incentive to heal when treating symptoms is more profitable.</p></li><li><p>Food industries thrive on sugar, chemicals, and addiction - because they drive sales.</p></li></ul><p>So when people say, &#8220;Let&#8217;s build ethical AI,&#8221; I ask: in this environment - how, exactly, do you expect that to happen?</p><p>This fundamental tension explains why many well-intentioned ethics initiatives collapse when business conditions tighten.</p><p>In nearly all the examples above, the push for short-term gains leads to products or practices that cause long-term harm. While regulations attempt to address this, what we&#8217;ve seen is that the regulators are often a decade behind where the tech is, and by the time they wrap their heads around it, the damage is done. 
Now imagine where we would be in 5 years at the current rate of development in foundation models and AI agents built around them!</p><p>Here's where I think things get interesting: I argue that the only way out of this tension is finding the sweet spot - an &#8220;opportunity zone&#8221; where ethical behavior, profit-making, and structural realities overlap. I believe the most impactful change will come from builders, founders, and entrepreneurs who build in this &#8220;sweet spot&#8221;. Maybe these won&#8217;t necessarily be venture-capital friendly, but they could surely make enough money for the founder to live comfortably while also sleeping peacefully at night.</p><h2>Learning from Environmental Innovation</h2><p>We can learn a lot from how the ESG and clean-tech sectors evolved over the decades.</p><p>Many startups tried to build products that reduce waste or promote clean energy. Their intentions were good, but their methods were flawed.</p><p>These companies often failed because they appealed to morality. They asked investors to fund them out of principle. They asked consumers to buy their products out of conscience. In the real world, that usually doesn&#8217;t work.</p><p>A large number of those startups never got off the ground. Many faded away without scale because they never solved a real business problem. They assumed that if people cared enough, things would change. That didn&#8217;t happen.</p><p>But some did succeed. Let me share two examples that explain how they got it right.</p><ol><li><p>Smart Recycling Bins (MyMatter)</p></li></ol><p>MyMatter created a smart bin that uses computer vision to sort waste automatically. If someone throws a recyclable item into the trash, the bin detects the mistake and moves the item into the correct compartment.</p><p>This solves a practical issue. 
People often don&#8217;t know whether something is recyclable, and they don&#8217;t want to think about it.</p><ul><li><p>The product removes the decision-making burden from the user.</p></li><li><p>It is sold to cities and hotels where a lot of garbage mixing happens.</p></li><li><p>It uses AI to address a problem that would otherwise require behavior change.</p></li></ul><p>This product works because it connects environmental goals with business needs. It reduces waste, saves time, and fits naturally into how people behave.</p><ol start="2"><li><p>Kitchen Waste Monitoring (Winnow)</p></li></ol><p>Winnow places a camera above and a scale below trash bins in hotel kitchens. The system identifies what food is thrown away and weighs it. Each day, the hotel receives a report that shows the exact cost of the waste.</p><p>For example, &#8220;You threw away $140 worth of cucumbers today.&#8221;</p><p>It also gives practical suggestions. Reuse tomato scraps for sauce. Reduce future orders for the items you often waste.</p><ul><li><p>Staff don&#8217;t have to change their process. The system works passively.</p></li><li><p>Executives gain visibility into financial losses, and guess what, they&#8217;re motivated to reduce that.</p></li><li><p>Behavior shifts naturally through awareness and cost savings.</p></li></ul><p>This is another case of solving a structural issue while aligning with both sustainability and profitability.</p><p>These examples succeed because they align doing the right thing (reducing waste) with making money (saving costs) and working around structural biases (making it easy for staff).</p><p>This is what ethical product design should do. It should eliminate resistance. 
It should reduce the friction of doing the right thing.</p><h2>The Opportunity Zone: Where Ethics Meets Business</h2><p>The key idea I want to present is identifying what I call the "opportunity zone" &#8211; the intersection of three critical elements:</p><ol><li><p>Doing the right thing (ethical imperatives)</p></li><li><p>Making money (business viability)</p></li><li><p>Working around structural biases (practical implementation)</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-aiM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-aiM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 424w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 848w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1272w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png" width="887" height="623" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:755653,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/163208037?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-aiM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 424w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 848w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1272w, https://substackcdn.com/image/fetch/$s_!-aiM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2c51df-f442-4670-b752-a99ebdd3491b_887x623.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>This framework shifts our thinking from abstract moral principles to concrete business implications. Instead of just saying "build ethical AI because it's right," we can reframe ethical considerations as business imperatives.</p><p>If you genuinely care about ethical AI, you must figure out how to operate in this intersection. The solution to the paradox lies in balancing these competing forces. 
The environmental examples did precisely this: they linked environmental goals directly to financial outcomes (cost savings) and addressed structural issues (making recycling effortless, automated waste tracking).</p><p>Crucially, you cannot go far doing the right thing and solving structural problems without making money. Ethical initiatives require funding. Whether inside a corporation or as a startup, if your ethical effort isn't tied to the bottom line, it risks being cut. No investor funds a project solely because it's ethical; they invest because they expect a return.</p><p>While massive companies like Meta or Google face the paradox at a different scale, for most of us building products, aligning ethics with these business realities is key to creating sustainable positive impact.</p><p>The ESG playbook evolved from &#8220;reduce waste&#8221; to &#8220;cut costs and access new asset classes.&#8221; AI needs the same shift.</p><h2>The AI Ethics Playbook</h2><p>Perhaps this can be the beginning of a practical checklist for building AI products that succeed ethically and commercially:</p><ul><li><p>Alignment with user preferences isn't just ethical &#8211; it drives adoption</p></li><li><p>Explainability isn't just transparent &#8211; it enables sales to regulated industries and helps users stick around</p></li><li><p>Guardrails aren't just responsible &#8211; they're necessary for business customers who require predictable systems</p></li><li><p>Unbiased data isn't just fair &#8211; it expands your addressable market</p></li></ul><h4>1. Trust is Paramount</h4><p>Trust isn't just a moral virtue &#8211; it's a business necessity. When ChatGPT first launched, initial hallucinations created excitement but quickly eroded user trust for some applications, as people found it unreliable for serious use. Only after addressing these credibility issues did sustained usage follow in many areas. 
While established companies might get second chances, most startups won't have that luxury. You have to get it right the first time.</p><h4>2. Explainability Drives Adoption</h4><p>If users don't understand how your system makes decisions, they won't stick around &#8211; especially in regulated industries. No financial institution or healthcare provider will adopt a black-box system that can't explain its recommendations when challenged. Explainability isn't just about transparency; it's about market access. You often can't sell the product otherwise.</p><h4>3. Data Quality Determines Market Reach</h4><p>Biased training data doesn't just create ethical problems &#8211; it limits your addressable market. If your product works well for urban users but poorly for rural ones due to skewed data, you're unnecessarily constraining your growth and losing out on a portion of the market. Every demographic your system underserves represents lost revenue and opportunity.</p><h4>4. Goal-Oriented Design Creates Value</h4><p>Generative AI systems that produce impressive outputs without helping users achieve concrete goals won't retain users long-term. The crucial question isn't just whether your system can generate compelling text or images, but whether it helps users accomplish meaningful tasks. Value creation drives retention and leads to happy customers.</p><h4>5. Sustainable Engagement Builds Longevity</h4><p>While it might be technically possible to create manipulative AI experiences that maximize short-term engagement, this approach ultimately leads to burnout and abandonment (like my decision to cut out news and social media from my life completely). Aim for sustainable engagement that provides long-term value. 
Even gaming platforms now often include features encouraging breaks because they understand that sustainable engagement creates more lifetime value than exploitation.</p><h2>Closing Thoughts</h2><p>My core message is this: The future of AI ethics isn't about more impassioned moral appeals &#8211; it's about demonstrating that ethical AI is better business (at least in some cases). As AI becomes increasingly integrated into critical domains, the companies that succeed won't be those with the most virtuous mission statements, but those that build trustworthy, explainable, and genuinely helpful systems that align ethical considerations with user needs and business objectives.</p><p>And maybe that&#8217;s the most important insight of all. Sustainable ethics requires sustainable business models. 
By reframing AI ethics as a business imperative - not just a moral one - we create the conditions for those values to survive and thrive in real markets.</p>]]></content:encoded></item><item><title><![CDATA[Agents Bootcamp - Anniversary Reflection]]></title><description><![CDATA[It's been almost a year since we started our bootcamp and it has changed so much!]]></description><link>https://aisc.substack.com/p/agents-bootcamp-anniversary-reflection</link><guid isPermaLink="false">https://aisc.substack.com/p/agents-bootcamp-anniversary-reflection</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Thu, 24 Apr 2025 16:48:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4b58f39f-2329-4560-81d0-5eb82f4f9c20_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have worked in AI for the majority of the past decade as a hands-on-keyboard coder, corporate manager, and founder. One lesson is quite clear across the board: there's a massive gap between theory and practice. Even the first product we launched at Aggregate Intellect in 2020, way before most people had heard of language modelling, was to use AI to reduce the so-called &#8220;translation gap&#8221; between conceptual and activated, practical knowledge. The <em>n</em>-th iteration of that product eventually got killed by ChatGPT in 2022, but the problem is still around and more pronounced than ever.</p><p>We've all seen those flashy linkedin posts and twitter threads about agentic systems doing all kinds of fun and productive things. As is the norm on social media, however, most of them fail to show the complexities that go into creating a system like that for real world use. Most of this is because they&#8217;re probably just cute demos, but some of it might also be because showing the shiny result gets many more likes than the messy hustle along the way. 
They make it sound straightforward, but anyone who's attempted to build one in production knows the truth: building a robust, well-behaved agentic system is incredibly challenging.</p><p>So, now the question is: what do you have to do if you want to build something more serious than a simple demo that can&#8217;t withstand rigorous evaluation?</p><p>This disconnect is precisely what prompted the journey that eventually became our bootcamp, "<a href="https://maven.com/aggregate-intellect/llm-systems">Build Multi-Agent Applications</a>&#8221;.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Check out our next cohort&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/aggregate-intellect/llm-systems"><span>Check out our next cohort</span></a></p><h2>Why Traditional Learning Falls Short</h2><p>The problem with most educational content in general, and content about LLM systems specifically, is that it focuses on concepts rather than implementation challenges. Given what social media rewards, content creators at best try hard to balance education and entertainment, and in most cases fail to offer any meaningfully useful practical content.</p><p>At the end of the day, nothing teaches you like struggling until that magical moment of wrapping your head around it. 
You can read dozens of articles about agent architectures, but they won't prepare you for the obstacles you&#8217;ll encounter while building:</p><ul><li><p>Finding the right scope between ideas that are too big and fluffy and ones that are too small and boring</p></li><li><p>Dealing with unreliable tool integrations that introduce unexpected failure points</p></li><li><p>Navigating complex cloud infrastructure that requires specialized knowledge</p></li><li><p>Building product evals while designing and running experiments</p></li><li><p>Prompt engineering that works in playgrounds but fails in production</p></li></ul><p>These aren't theoretical problems &#8211; they're practical engineering challenges that require hands-on experience to solve effectively.</p><h2>Agent Development Lifecycle</h2><p>Building complex AI systems isn't the neat, linear process that tutorials often make it out to be. In reality, it's messy, unpredictable, and requires constant adaptation. Whether you're rethinking your architecture at the last minute, troubleshooting broken integrations, or trying to wrap your head around evaluating your agent&#8217;s performance, real-world development demands flexibility and problem-solving on the fly.</p><p>In the wild, this would be done in sprints that might take several months and often involves a lot of back and forth between various phases of design, development, deployment, and demoing. This structured chaos is a very efficient engineering process: an intense, fast-paced environment that reflects and tames the unpredictability of agentic product development.</p><p>In designing our bootcamp&#8217;s model we tried hard to mirror this reality. 
Our bootcamp is not a final product yet, and we are still iterating, but our goal remains the same: offering participants an authentic taste of real-world AI engineering in a contained and well-structured sandbox.</p><p>When we first launched our bootcamp, we had a linear curriculum that walked participants through agent development step by step. That was easier to teach, easier to market, and probably easier for participants to &#8220;feel&#8221; they had achieved something. But the honest truth is that it wouldn't prepare them for building in the messy real world, especially in a world where every hour there&#8217;s a new LLM or framework that you could be using.</p><p>Over the past year, we spent time observing those who succeed in the program and tried to design around what the top percentiles of our participants do to thrive:</p><ul><li><p>They were curious, scrappy, and experimental</p></li><li><p>They asked A LOT of questions</p></li><li><p>They enjoyed the structure as long as it didn&#8217;t limit their freedom to play</p></li></ul><p>What did we do in response?</p><ul><li><p>We curated all the theory stuff into a learning path they can learn from before the cohort and use as a reference during it</p></li><li><p>We dropped the lecture-style course and replaced it with a bootcamp-style mentorship program</p></li><li><p>We onboarded several experienced assistants (some are our alumni) whom participants can book 1 on 1 calls with to problem-solve and co-develop</p></li></ul><blockquote><p> &#8220;A mentorship model rooted in extensive experience is precisely what&#8217;s needed to build practical skills and achieve tangible outcomes&#8221;</p><p>~ Mykola, Bootcamp Testimonial</p></blockquote><p>We built our program around a 3-sprint model - design, develop, deploy - that deliberately introduces the kinds of challenges, pivots, and iterative development cycles that characterize real-world engineering. 
And to give everyone a final rush, we cap this off with a final week of preparing for a public demo where we invite guests like experienced founders, investors, and corporate directors to give feedback to the teams.</p><blockquote><p>&#8220;The instructors create an amazing environment for learning through a good 3-sprint structure, by inviting industry speakers and by being there themselves to help with all endeavors&#8221;</p><p>~ Sinan, Bootcamp Testimonials</p></blockquote><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Check out our next cohort&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Check out our next cohort</span></a></p><p>The program is something like this:</p><ul><li><p><strong>Before Week 1:</strong> We run a free workshop and invite registered bootcamp participants and the broader community to join. For the broader community this is an opportunity to get to know the teaching staff and our approach. For bootcamp participants, this is an optional preparation period. We explain our ways, provide templates (see "<a href="https://maven.com/p/b0756d/agents-playbook?utm_medium=lead_magnet_share_link&amp;utm_source=instructor">Agents Playbook</a>"), and help them build something really quick.</p></li></ul><blockquote><p>"I really like the curated material and how your team supports each other in presenting and handling all the questions that are thrown at you during the calls." </p><p>~ Richard, workshop participant</p></blockquote><ul><li><p><strong>Weeks 1-2 (DESIGN): </strong>During these initial weeks, participants create detailed workflow diagrams and requirement documents. Our mentors repeatedly challenge them with variations of "why does this need an agent?" 
- a question that often leads to important realizations about where AI actually adds value.</p></li></ul><blockquote><p>"We had to write a requirements document before building our application which highlights important factors to consider when building an agentic application." </p><p>~ <a href="https://www.linkedin.com/posts/hadia-hameed_llm-agents-generative-activity-7289960464980086785--5Z_?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAADV0qJYB8kXFj14mdc78JJYWQMRrXdsklh0">Hadia Hameed</a></p></blockquote><p>This design-first approach contradicts the typical developer instinct to jump straight into coding. But the reality is that premature implementation almost always leads to architectural problems that become exponentially harder to fix later. The design does not need to be, and often is not, perfect, but having a foundation that you can iterate on is super important for gaining velocity later.</p><p>Another important aspect of the first few weeks is team formation and getting to know the cohort participants. The participants&#8217; immediate teams and the community formed by all the participants are the first and foremost layer of support and learning.</p><blockquote><p>&#8220;I particularly like working within teams and the feedback from fellow bootcamp'ers.&#8221;</p><p>~ Mick, Bootcamp Testimonials</p><p>&#8220;It was also a fantastic opportunity to connect with and collaborate alongside experts in the field, making the [bootcamp] not only educational but also a great networking platform. 
&#8220;</p><p>~ Mykola, Bootcamp Testimonials</p><p>&#8220;I was also pleasantly surprised by the caliber of my peers in the cohort - the mix of expert instructors and highly motivated classmates makes this [bootcamp] a great investment for anyone looking to master agentic applications.&#8221;</p><p>~ Murtaza, Bootcamp Testimonials</p></blockquote><p>The common challenges in this phase are: a) crafting the right scope for the project so that it is not too big or too trivial, b) finding someone who has the pain point in question and can serve as the first app user for feedback and testing, and c) building the simplest version of the app in less than a day so that you can learn from it quickly. And the teaching staff are available via 1 on 1 calls to work through these with the teams.</p><blockquote><p> &#8220;There is a lot going on in the field and the project helped staying grounded by focusing on the process of breaking the problem down and decomposing to tasks workflows and then grow towards agentic structures which was an amazing experience&#8221;</p><p>~ Jayant, Bootcamp Testimonials</p></blockquote><p>Once you come out of these two weeks, what you&#8217;d learn, hopefully, is that good ideas don&#8217;t grow on trees but rather are the outcome of an intentional, rigorous, and iterative process of experimentation, feedback, and learning.</p><ul><li><p><strong>Weeks 3-4 (DEVELOP): </strong>Once the foundation from sprint 1 is laid, participants get to wrestle with development issues, misbehaved prompts, and a lot of &#8220;wait, why is this not doing what I want&#8221; kind of moments. And hopefully these are followed by &#8220;ah, so that&#8217;s how you do it&#8221; moments in 1 on 1 conversations with the teaching staff.</p></li></ul><blockquote><p>&#8220;The instructors &#8230; gave valuable tips and feedback during the [bootcamp] and 1:1 meetings. 
I particularly liked the possibility of learning while building a product, and I enjoyed working with a small team&#8221;</p><p>~ Elena, Bootcamp Testimonials</p></blockquote><p>The common challenges in this phase are:</p><ul><li><p>Debugging code or no-code implementations</p></li><li><p>Creating good evaluation datasets that can actually help you iterate</p></li><li><p>Navigating the jungle of tools and frameworks that you might be able to use</p></li><li><p>Deciding whether to use CrewAI, LangGraph, or a custom Python implementation</p></li><li><p>Discovering late in the development process that AI agents need systematic testing approaches</p></li></ul><p>The hope is that you come out of this experience with a working solution that is evaluated and is starting to look like something that you can tame.</p><ul><li><p><strong>Weeks 5-6 (DEPLOY):</strong> By this point, teams have encountered API rate limits, implemented workarounds in Chainlit, and conducted late-night testing sessions.</p></li></ul><blockquote><p>"Setting up AWS infrastructure for NVIDIA GPUs is difficult. The learning curve is steep but necessary for building production-quality systems." </p><p>~ <a href="https://medium.com/@sinan.ozel_23433/agentic-workflow-learnings-41ab701d81e3">Sinan Ozel</a></p></blockquote><p>The common challenges are: a) what&#8217;s the cheapest and fastest way to deploy this so that some early users can test it? b) how can I expand my evaluation and testing? c) how can I iterate on the implementation quickly to tame the behavior of the agent more?</p><p>Hopefully, you will come out of this sprint with a working app that is deployed and ready to be shown off!</p><ul><li><p><strong>Week 7 (DEMO):</strong> Each team delivers a seven-minute presentation to an audience that includes investors and industry professionals.</p></li></ul><blockquote><p>"The demos exceeded our own expectations. 
It is exciting to have my team interested in continuing to work on the project beyond the [bootcamp]. " </p><p>~ Maher, Bootcamp Testimonials</p></blockquote><p>What's particularly interesting is how rarely projects follow a straight path. Some teams completely change their architecture halfway through. Others discover fundamental limitations in their chosen tech stack. This unpredictability serves a purpose: it prepares participants for the real challenges they'll face when developing AI systems after the bootcamp.</p><h2>Real Projects Built by Real People</h2><p>Now you ask: what kind of projects has this structure produced? Glad you asked! Let me highlight a couple of examples where alumni also walk us through their experiences:</p><ol><li><p>Dungeon-Master Assistant ~ Sinan Ozel</p></li></ol><p>Sinan joined with enthusiasm for Dungeons &amp; Dragons and completed the bootcamp with a functioning multi-agent narrative generator running on dedicated GPUs. In his detailed <a href="https://medium.com/@sinan.ozel_23433/agentic-workflow-learnings-41ab701d81e3">Medium article</a>, he warned about infrastructure challenges:</p><blockquote><p>"You need to configure every component of a Kubernetes cluster, including the complicated IAM settings, before the model will even start working."</p></blockquote><p>What's notable here is that the bootcamp empowers the participants to pursue technically complex projects and still finish on schedule, provided they carefully manage scope and prioritize features.</p><ol start="2"><li><p>ReferWell ~ Mick Lynch</p></li></ol><p>Mick tackled the challenging area of physician-to-specialist referrals by creating an agent system combining healthcare data standards (FHIR), vector search, and human verification steps. 
He <a href="https://medium.com/@micklynch_6905/referwell-improving-specialist-referrals-using-ai-agents-b37f5a6458da">identified specific problems in healthcare</a>:</p><blockquote><p>"The current referral process suffers from specialist mismatches and loses up to 70% of patient data. Our agent workflow maintains complete context throughout the patient journey."</p></blockquote><p>While his final demo showed prototype-level functionality, it represented a practical application with potential for further development into production-ready software. Pshh, don&#8217;t tell anyone, but we might have even facilitated a meeting for them with an angel investor who saw their demo! </p><p>Check out the <a href="https://youtube.com/playlist?list=PLB1nTQo4_y6u_4vzapND6Bm6J7M1KVurF&amp;si=iFskohA55N2M8Yuc">video playlist</a> of the public demos.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Check out our next cohort&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Check out our next cohort</span></a></p><h2>Who Should Consider This Approach?</h2><p>Based on the patterns I've observed in successful participants, this intensive bootcamp model works best for people who:</p><ul><li><p>Are willing to examine and iterate on their idea to identify specific workflows that could benefit from automation</p></li><li><p>Are comfortable sharing unfinished work with others under tight deadlines to get feedback</p></li><li><p>Are interested in direct, honest feedback and the messy process of building rather than cleanly structured theoretical lectures</p></li></ul><p>If you&#8217;re unsure whether the program fits your needs, you can attend our FREE ongoing workshop sessions where you can expect to hear 
about:</p><ul><li><p>how to refine your idea into a design</p></li><li><p>how to take a quick stab at implementing your idea (focus on no-code in this session - coding extensions will come in sessions 2 and 3)</p></li><li><p>how to use workshop sessions to earn bootcamp refunds via the Incentive Program!</p></li><li><p>anything else you want to know about agents, our bootcamp, etc.</p></li></ul><p>The next cohort runs <a href="https://maven.com/p/9ef955/build-an-agentic-app-design-part-1-4?utm_medium=ll_share_link&amp;utm_source=instructor">April 28 &#8211; June 13, 2025</a>.</p><h2>What happens after the Bootcamp</h2><p>Seven weeks cannot turn anyone into a complete expert in agent development &#8211; that&#8217;s just not realistic. However, as shown across multiple cohorts, the program effectively compresses the learning process into a well-scoped timeframe with clear deliverables.</p><p>What&#8217;s particularly interesting is that the impact of the bootcamp doesn&#8217;t end when the formal program concludes. Many participants continue developing their projects long after the bootcamp finishes, often with friends they just met. </p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Mechanistic Interpretability - Decoding Neural Networks Might Need a Physics Degree - Part 1]]></title><description><![CDATA[... To bridge this gap, Mechanistic Interpretability (MI) comes into play which is a systematic approach that dissects neural networks much like physics dissects the natural world.]]></description><link>https://aisc.substack.com/p/mechanistic-interpretability-decoding</link><guid isPermaLink="false">https://aisc.substack.com/p/mechanistic-interpretability-decoding</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 18 Mar 2025 12:51:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fZlJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As someone who is trained in traditional hard sciences, how people do &#8220;science&#8221; in computer science has always bothered me.  It always reminds me of the rivalry we&#8217;ve had with chemists and biologists about how their &#8220;science&#8221; is just empirical &#8220;throwing things at the wall and see what sticks&#8221; kind of research. 
Of course, as I grew older in physics and started to face more and more complex problem statements, my point of view became more modest as I saw really interesting problems that can&#8217;t be solved by a nicely compact formula and can only be tackled by numerical simulations and data-driven approaches. </p><p>One of the things you sacrifice as you start looking at more complex systems, as chemists, biologists, computer scientists, and physicists do, is the ability to neatly explain how the system behaves and why. Fortunately, most of these sciences value <a href="https://aisc.substack.com/p/enhancing-ai-agents-with-causality">causal inference</a> significantly, which means that we end up with much more generalizable and transparent statements about nature. In computer science, however, a mechanistic understanding of how systems behave is at best a secondary consideration. </p><p>The need for transparency and explainability in how complex neural nets work is urgent, though, especially with the exponential rate of adoption of LLMs and the agentic systems they enable, and particularly in high-stakes domains like healthcare and finance, where trust hinges on understanding the reasoning behind AI-driven choices.</p><p>So, you can imagine my delight when I heard about <a href="https://www.neelnanda.io/mechanistic-interpretability/quickstart">Neel Nanda&#8217;s work</a> on <a href="https://open.spotify.com/episode/5XjHhNQxIb16eJZXGmbaCk">MLST</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fZlJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!fZlJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:301210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/159329931?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fZlJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!fZlJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbb6e6f3-dedd-41ff-9ff1-12e2c4a86ef2_1024x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" 
fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A core challenge in this pursuit lies in how neural networks encode information. In an ideal world neurons represent singular, well-defined concepts (mono-semantic), but in reality, for efficiency purposes, they represent multiple overlapping ideas (poly-semantic)! While this boosts efficiency, it complicates interpretability, forcing a trade-off between performance and transparency. According to Neel, to bridge this gap, Mechanistic Interpretability (MI) comes into play which is a systematic approach that dissects neural networks much like physics dissects the natural world. </p><p>In this article series, we&#8217;ll explore how I understand frameworks used to advance MI towards transparent AI with a physics lens.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2><strong>1. 
Introduction to Mechanistic Interpretability</strong></h2><p>Mechanistic Interpretability (MI) is an emerging field focused on reverse-engineering neural networks to understand how they operate at a fundamental level. We can imagine it as disassembling a complex machine - like a car engine - to examine each gear, spring, and bolt, observing how they interact to produce the final output. Similarly, MI seeks to decode neural networks layer by layer, neuron by neuron, to identify the specific features they recognize, the circuits that process information, and the interpretability bases that map these abstract computations to human-understandable concepts. The goal is not just to observe what a model does but to explain how and why it does it, down to the smallest actionable components.</p><h3><strong>Foundational Concepts</strong></h3><p>To understand MI, it&#8217;s really important to grasp foundational concepts that describe how neural networks store and process information:</p><ol><li><p><strong>Features</strong>:</p></li></ol><p>Features are the building blocks of a neural network&#8217;s understanding. They represent specific attributes or patterns in the input data that the model has learned to detect. For example:</p><ul><li><p>In image recognition, a feature might be a horizontal edge, a texture like fur or scales, or even higher-level concepts like "eyes" or "wheels."</p></li><li><p>In language models, features could correspond to grammatical structures (e.g., verb tenses), semantic categories (e.g., "scientific terms" or "emotional language"), or even abstract relationships (e.g., cause-and-effect).</p></li></ul><p>Features are not hand-coded by humans; they emerge organically during training as the model optimizes to solve its task.</p><ol start="2"><li><p><strong>Circuits:</strong></p></li></ol><p>These are groups of a model&#8217;s weights and non-linearities that connect one set of features to another. 
Think of circuits as the pathways that determine how information flows and is processed within the network. For instance:</p><ul><li><p>A circuit in a vision model might link a feature for "edges" to a feature for "shapes," which then activates a feature for "faces."</p></li><li><p>In a language model, a circuit could route a feature for "question words" (e.g., who, what, where) to a feature for "answer structure," ensuring the response matches the query.</p></li></ul><p>Crucially, circuits are not just linear chains of neurons - they involve non-linear transformations (e.g., activation functions like ReLU) and interactions between multiple layers.</p><ol start="3"><li><p><strong>Interpretability Bases:</strong></p></li></ol><p>Interpretability bases are mathematical tools that help researchers "decode" a model&#8217;s internal activations. Neural networks process data in high-dimensional spaces (e.g., thousands of dimensions), which are inherently unintuitive to humans. Interpretability bases project these activations onto specific directions in the space that correspond to human-interpretable features.</p><p>For example, in a sentiment analysis model, one direction in the activation space might align with "positive sentiment," while another aligns with "negative sentiment." By analyzing these bases, researchers can quantify how much each interpretable feature contributes to the model&#8217;s predictions.</p><ol start="4"><li><p><strong>Neurons vs. Layers</strong></p></li></ol><p>Neurons: Individual units that activate in response to specific input patterns (e.g., a neuron in a vision model firing for diagonal edges).</p><p>Layers: Hierarchical collections of neurons. 
Early layers detect simple patterns (edges, textures), while deeper layers assemble these into complex concepts (objects, sentences).</p><ol start="5"><li><p><strong>Attention Heads (in Transformers)</strong></p></li></ol><p>Transformers, which power modern language models, process data using attention heads - specialized sub-circuits that determine which parts of the input to prioritize. Each head can be thought of as a "mini-circuit" with a specific role:</p><ul><li><p>Query-Key-Value Operations: Attention heads compute relationships between words (e.g., linking pronouns like "he" to their antecedents).</p></li><li><p>Specialization: Some heads focus on syntax (e.g., subject-verb agreement), while others track semantic coherence (e.g., ensuring "bank" refers to a river, not a financial institution, based on context).</p></li><li><p>Why this matters: Reverse-engineering attention heads is a cornerstone of MI in transformers, as their behavior directly impacts model outputs.</p></li></ul><ol start="6"><li><p><strong>Superposition</strong></p></li></ol><p>Neural networks often use superposition - a phenomenon where a single neuron or activation encodes multiple unrelated features. For example, a neuron might activate for both "cat ears" and "scientific terminology" in a multimodal model.</p><ul><li><p>Polysemantic Neurons: Neurons that respond to many distinct features (common in large models due to sparse feature space).</p></li><li><p>Monosemantic Neurons: Neurons that activate for a single, specific feature (rarer but easier to interpret).</p></li><li><p>Why this matters: Superposition complicates MI by obfuscating the "clean" mapping between neurons and features, requiring advanced techniques to disentangle overlapping signals.</p></li></ul><ol start="7"><li><p><strong>Activation Functions</strong></p></li></ol><p>These mathematical operations (e.g., ReLU, sigmoid) determine how neurons transform inputs into outputs. 
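</p><p>To make the gating idea concrete, here is a minimal sketch in plain Python. The two-unit &#8220;circuit&#8221; and its weights are made up for illustration, not taken from any real model:</p>

```python
def relu(x):
    # ReLU zeroes out negative pre-activations, closing that pathway entirely
    return max(0.0, x)

# A made-up two-unit "circuit": unit 0 fires for positive inputs, unit 1 for negative
W_IN = [1.0, -1.0]   # input weight into each hidden unit
W_OUT = [2.0, 3.0]   # each hidden unit's contribution to the output

def forward(x):
    hidden = [relu(x * w) for w in W_IN]             # the gate: only one unit survives
    return sum(h * w for h, w in zip(hidden, W_OUT))

# The sign of the input decides which pathway carries information
print(forward(2.0))   # only unit 0 is open: 2 * 2.0 = 4.0
print(forward(-2.0))  # only unit 1 is open: 2 * 3.0 = 6.0
```

<p>Silencing either unit (setting its weight to zero) kills the response to one half of the input space - this gated routing is exactly what MI researchers map when tracing circuits.</p><p>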
In MI, they act as "gates" that shape information flow:</p><ul><li><p>Non-Linearity: Functions like ReLU introduce non-linear decision boundaries, enabling networks to learn complex patterns.</p></li><li><p>Saturation: Functions like sigmoid can "saturate" (e.g., outputting 0 or 1), which MI researchers study to identify when a circuit stops responding to input variations.</p></li><li><p>Why this matters: Activation functions define the "rules" for how circuits combine features, influencing everything from robustness to adversarial attacks to generalization.</p></li></ul><ol start="8"><li><p><strong>Probing vs. Intervening</strong></p></li></ol><p>Two key methodologies in MI:</p><ul><li><p>Probing: Training a simple model (e.g., linear classifier) on a network&#8217;s activations to test if a specific feature (e.g., "sentiment") is present in its representations.</p></li><li><p>Intervening: Actively modifying activations (e.g., ablating a neuron, amplifying a circuit) to observe causal effects on outputs. For example, silencing a circuit might reveal it was responsible for suppressing biased language.</p></li><li><p>Why this matters: Probing identifies correlations ("Feature X is here"), while intervening establishes causality ("Circuit Y causes Behavior Z").</p></li></ul><ol start="9"><li><p><strong>Causal Scrubbing</strong></p></li></ol><p>A technique to validate hypothesized circuits by "scrubbing" (resetting) certain activations and observing if the model&#8217;s output degrades. 
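</p><p>The probing-versus-intervening contrast above can be sketched in a few lines of Python. This is a toy stand-in with hypothetical neuron names and hand-written activations, not a real network:</p>

```python
# Toy stand-in for a network's internals: two hypothetical "neurons"
def neurons(text):
    return {
        "n_sentiment": 1.0 if "great" in text else -1.0,  # tracks sentiment
        "n_length": len(text) / 10.0,                     # tracks input length
    }

def output(text, ablate=None):
    acts = neurons(text)
    if ablate is not None:
        acts[ablate] = 0.0  # intervening: silence a neuron and observe the effect
    return acts["n_sentiment"] + 0.1 * acts["n_length"]

# Probing (correlational): does n_sentiment's activation track the sentiment label?
print([neurons(t)["n_sentiment"] for t in ["great movie", "awful movie"]])  # [1.0, -1.0]

# Intervening (causal): ablating n_sentiment should collapse the output gap
gap = output("great movie") - output("awful movie")
gap_ablated = output("great movie", ablate="n_sentiment") - output("awful movie", ablate="n_sentiment")
print(round(gap, 2), round(gap_ablated, 2))  # large gap vs. near-zero gap
```

<p>Probing only shows that the neuron correlates with sentiment; the ablation shows it causally drives the difference in outputs.</p><p>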
If the hypothesis is correct, scrubbing should disrupt specific behaviors (e.g., failing math problems if a "number detection" circuit is scrubbed).</p><ul><li><p>Why this matters: Causal scrubbing bridges the gap between observational and experimental science in MI, enabling rigorous falsification of theories.</p></li></ul><h3><strong>How MI Differs from General Interpretability</strong></h3><p>While general interpretability aims to provide broad explanations of model behavior (e.g., "The model classifies cats by focusing on fur texture"), MI demands a mechanistic, step-by-step account. It asks questions like:</p><ul><li><p>Which exact neurons detect "fur texture"?</p></li><li><p>How do these neurons communicate with others to trigger the "cat" classification?</p></li><li><p>What happens if we disrupt this circuit?</p></li></ul><p>This granular approach allows researchers to rigorously test hypotheses about a model&#8217;s behavior, similar to how a biologist might study a cell by isolating and manipulating individual proteins. By contrast, general interpretability methods (e.g., attention visualization or feature importance scores) often provide correlational insights rather than causal explanations.</p><h2><strong>2. Drawing Parallels with Physics</strong></h2><p>To understand mechanistic interpretability (MI), it helps to borrow frameworks from physics - a field that has spent centuries decoding the universe&#8217;s most complex systems. Physics and MI share a common goal: to explain how systems work at their most fundamental level. Whether studying particles or neural networks, both fields rely on observation, hypothesis, and experimentation to move from mystery to mechanistic understanding.</p><p>Just as physicists decompose natural phenomena into fundamental principles, MI researchers deconstruct neural networks into interpretable components. 
This process mirrors the scientific method:</p><ol><li><p><strong>Observation</strong></p></li></ol><ul><li><p>In physics: Galileo observed pendulum swings to infer laws of motion; astronomers mapped planetary orbits to deduce gravity&#8217;s role.</p></li><li><p>In MI: Researchers track how neurons activate when a model processes inputs. For example, in a vision model, you might notice a neuron firing every time the input contains a spiral shape (like a galaxy or a seashell).</p></li></ul><ol start="2"><li><p><strong>Hypothesis</strong></p></li></ol><ul><li><p>In physics: Newton proposed that gravity governs both falling apples and orbiting moons.</p></li><li><p>In MI: A researcher hypothesizes that a specific circuit in a language model resolves pronouns (e.g., linking &#8220;it&#8221; to &#8220;the cat&#8221; in the sentence &#8220;The cat sat down because it was tired&#8221;).</p></li></ul><ol start="3"><li><p><strong>Testing and Validation</strong></p></li></ol><ul><li><p>In physics: Young&#8217;s double-slit experiment tested whether light behaves as a wave or particle by observing interference patterns.</p></li><li><p>In MI: To validate the pronoun-resolution hypothesis, researchers might &#8220;ablate&#8221; (disable) the suspected circuit. If the model then fails to link &#8220;it&#8221; to &#8220;the cat,&#8221; the hypothesis gains support.</p></li></ul><p>This iterative cycle will allow MI to build causal explanations, much like physics constructs theories to predict celestial motion or particle interactions.</p><h3><strong>Physics Concepts as Tools for MI</strong></h3><p>Beyond methodology, specific principles from physics illuminate how neural networks operate:</p><ol><li><p><strong>Classical Mechanics and Deterministic Systems</strong></p></li></ol><p>Classical mechanics predicts outcomes from initial conditions. 
For example, knowing a ball&#8217;s position and velocity lets you calculate its trajectory.</p><ul><li><p>MI parallel: MI researchers trace input-to-output pathways in neural nets, looking for ones that behave deterministically in response to particular input properties, much like calculating a ball&#8217;s path.</p></li><li><p>Example: If a vision model always activates Neuron #512 when it &#8220;sees&#8221; a cat&#8217;s eye, you can reverse-engineer how this neuron contributes to the final &#8220;cat&#8221; classification.</p></li></ul><ol start="2"><li><p><strong>Superposition</strong></p></li></ol><p>In wave mechanics (quantum mechanics, electromagnetism, etc.), waves and particles can exist in multiple states simultaneously (superposition) and exhibit correlated behaviors.</p><ul><li><p>MI parallel: Polysemantic neurons activate for multiple unrelated features. For instance, a single neuron might fire for both &#8220;cat ears&#8221; and &#8220;mathematical integrals,&#8221; creating ambiguity.</p></li><li><p>Why it matters: Just as measuring a quantum particle collapses its state, intervening on a polysemantic neuron (e.g., silencing it) can disrupt seemingly unrelated model behaviors.</p></li></ul><ol start="3"><li><p><strong>Statistical Mechanics and Emergent Behavior</strong></p></li></ol><p>Macroscopic phenomena like temperature emerge from countless microscopic interactions (e.g., molecules colliding).</p><ul><li><p>MI parallel: High-level model capabilities (e.g., storytelling) emerge from low-level neuron interactions. 
No single neuron &#8220;knows&#8221; grammar, but circuits across layers collaborate to enforce syntax.</p></li><li><p>Example: A language model&#8217;s ability to write poetry isn&#8217;t stored in one neuron; it arises from how circuits combine words, rhythms, and emotions.</p></li></ul><ol start="4"><li><p><strong>Symmetry Principles</strong></p></li></ol><p>Physical laws often remain unchanged under transformations (e.g., rotating a system doesn&#8217;t alter its energy conservation).</p><ul><li><p>MI parallel: Convolutional Neural Networks (CNNs) exploit translational invariance: they detect edges or textures regardless of their position in an image.</p></li><li><p>Example: A CNN trained to recognize cats will identify a cat&#8217;s ear whether it&#8217;s in the top-left or bottom-right corner of an image.</p></li></ul><ol start="5"><li><p><strong>Perturbation Theory</strong></p></li></ol><p>Physicists study systems by applying small perturbations (e.g., nudging a particle) to observe responses.</p><ul><li><p>MI parallel: Researchers tweak neuron activations to test causality. For example, amplifying a &#8220;positive sentiment&#8221; neuron in a language model should make its output more optimistic.</p></li><li><p>Example: If silencing a circuit reduces a model&#8217;s accuracy on math problems, you&#8217;ve likely found a &#8220;number reasoning&#8221; module.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p>In the remaining parts of this series, we will look at how MI overlaps with physics and where it might go next.</p>]]></content:encoded></item><item><title><![CDATA[Enhancing AI Agents with Causality]]></title><description><![CDATA[Given the remarkable brute force power we have access to, namely lots of data and computational power, is causal inference the &#8220;light weight and feather&#8221; of cognition?]]></description><link>https://aisc.substack.com/p/enhancing-ai-agents-with-causality</link><guid isPermaLink="false">https://aisc.substack.com/p/enhancing-ai-agents-with-causality</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 05 Mar 2025 17:57:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RiMa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We recently hosted <a href="https://www.linkedin.com/in/amlearning/">Ali Madani</a> for an insightful session on the intersection of AI Agents and Causality, a fundamental question that rarely gets enough attention: Can AI agents truly make reliable decisions without understanding cause and effect?</p><p>The distinction between correlation and causation is just like the difference between saying students should skip exams to avoid weight gain (because exams correlate with weight gain) versus addressing the actual causal chain (exams &#8594; stress eating &#8594; weight gain). 
This example, shared during the session, perfectly illustrates why causal reasoning matters in practical applications.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>A few years ago I read the &#8220;<a href="https://en.wikipedia.org/wiki/The_Book_of_Why">Book of Why</a>&#8221; and, as a physicist, I really enjoyed it. The book explores the concept of causality&#8212;how we determine cause-and-effect relationships rather than just correlations. It argues that traditional statistical methods (like correlation and regression - aka foundations of everything we do in ML) are insufficient for understanding causality. 
The book introduces a &#8220;causal inference framework&#8221; based on &#8220;causal diagrams&#8221; and &#8220;do-calculus&#8221;, which allow us to answer counterfactual questions like, <em>What would have happened if X had not occurred?</em> It contrasts different "levels of causation" using the &#8220;Ladder of Causation&#8221;:</p><ul><li><p>Association (Seeing) &#8211; Correlation and pattern recognition (e.g., "Smokers tend to get lung cancer").</p></li><li><p>Intervention (Doing) &#8211; Understanding the effects of actions (e.g., "What happens if we ban smoking?").</p></li><li><p>Counterfactuals (Imagining) &#8211; Reasoning about alternate realities (e.g., "Would this person have avoided cancer if they had never smoked?").</p></li></ul><p>The book critiques traditional statistical methods (like those used in machine learning) for their reliance on correlation without causal understanding. It also discusses real-world applications in medicine, economics, AI, and social sciences. </p><p>The math we use in science often relies heavily on counterfactuals to understand fundamental assertions that generalize very broadly within the boundaries of their assumptions (think <code>f = ma</code> and such). In physics, for instance, sparse causal relationships enable tremendous generalizability. As Ali illustrated: "Newton didn't have millions of data points, it was an apple and then all the experiments and then he came up with the formulas, it worked out." </p><p>By identifying similar sparse causal relationships in other domains, we might achieve similar generalizability without requiring the massive datasets currently needed for correlation-based approaches. That is one of the most compelling aspects of marrying causality and classical ML: the hope of improving generalization with less data, addressing a fundamental challenge in traditional machine learning approaches. 
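</p><p>To ground the first two rungs of the ladder, here is a minimal simulation in Python, using the exam &#8594; stress eating &#8594; weight gain chain from the example above (the mechanisms and probabilities are made up for illustration):</p>

```python
import random
random.seed(0)

# A tiny structural causal model: exam -> stress_eating -> weight_gain
def sample(do_stress_eating=None):
    exam = random.random() < 0.5
    # the do() operator: override the mechanism instead of observing it
    stress_eating = exam if do_stress_eating is None else do_stress_eating
    weight_gain = 1.0 if stress_eating else 0.0
    return exam, weight_gain

# Rung 1 (association): among observed exam-takers, weight gain is common
observed = [sample() for _ in range(1000)]
p_obs = sum(g for e, g in observed if e) / sum(1 for e, g in observed if e)

# Rung 2 (intervention): do(stress_eating=False) severs the causal chain,
# so exams no longer lead to weight gain - no need to skip exams at all
forced = [sample(do_stress_eating=False) for _ in range(1000)]
p_do = sum(g for e, g in forced if e) / sum(1 for e, g in forced if e)

print(p_obs, p_do)  # 1.0 0.0
```

<p>The intervention targets the mediator (stress eating) rather than the correlated cause (exams), which is exactly the distinction the exam example was making.</p><p>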
</p><p>After a few years, I still believe that causal inference can be a significant addition to how we do AI, but I have moderated my view of it from &#8220;absolutely necessary&#8221; to &#8220;practically useful&#8221;. My go-to analogy for this kind of thing is flight: how nature flies is mechanistically very different from how humans fly. The &#8220;artificial&#8221; flight leverages a remarkable brute force power called a jet engine to pick up a significantly heavier object from the ground. That means that the absolutely necessary properties like light weight and feathers and wings, in their natural form, become largely irrelevant. The question that I&#8217;m struggling with these days is this: Given the remarkable brute force power we have access to, namely lots of data and computation, is causal inference the &#8220;light weight and feather&#8221; of cognition?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RiMa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RiMa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!RiMa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!RiMa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 
1272w, https://substackcdn.com/image/fetch/$s_!RiMa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RiMa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:341682,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://aisc.substack.com/i/158453220?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RiMa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!RiMa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!RiMa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!RiMa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fceae2fca-4545-405c-87e1-4388a361e8b2_1024x1024.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Well, I only think about that question when I have my philosopher hat on. 
When I have my pragmatic AI company hat on, I do spend a lot of time thinking about how causal structures can create scaffolding for the agentic systems we build for commercial and research purposes. While there&#8217;s a remarkably successful effort underway to build reasoning into the statistical models we use and love, say <a href="https://aisc.substack.com/p/understanding-deepseek-r1">R1</a>, I think it is still very important, in parallel and for practical applications, to think about causality and counterfactual reasoning when designing agentic systems, especially those involving multi-agent interactions, autonomous decision-making, and adaptive learning.</p><p>Now let&#8217;s get into some notes from the session with Ali.</p><h2>The Promise and Limitations of AI Agents</h2><p>AI agents, at their core, are systems designed to interact with their environment through an iterative process of assessment, information processing, and autonomous decision-making. They're characterized by their ability to learn, adapt, and operate with varying degrees of independence. The recent explosion of large language models has accelerated interest in these agents, particularly for their potential to automate complex tasks across industries.</p><p>In healthcare alone, AI agents could revolutionize prevention, detection, diagnosis, and patient monitoring, not by replacing doctors, but by handling repetitive tasks and providing real-time support. The economic implications are significant, with potential cost reductions across multiple sectors.</p><p>But here's where things get interesting: most AI systems today operate primarily on correlative relationships rather than causal ones. This creates a fundamental limitation.</p><h2>The Correlation Trap</h2><p>"If you go correlative and identify association between different variables, you can see that exams definitely have correlation with gaining weight," Ali noted. 
"So many students go through stress eating through exams and they gain weight."</p><p>Imagine we want to recommend actions to help students avoid weight gain. Data analysis might show a strong correlation between exams and weight gain. A purely correlative approach might suggest the absurd recommendation to "avoid taking exams" to prevent weight gain. However, a causal understanding reveals that exams cause stress eating, which then causes weight gain. With this causal chain identified, we can make more meaningful recommendations targeting the actual mechanism (stress eating) rather than the initial trigger (exams).</p><p>This example highlights why correlation isn't enough for truly intelligent systems. Without causality, AI agents risk making recommendations based on spurious correlations, like <a href="https://tylervigen.com/spurious-correlations">the correlation between wind in Taiwan and Googling &#8220;I&#8217;m tired&#8221;</a>.</p><p>The problem extends beyond obvious examples. In drug discovery, researchers spend years designing chemical compounds without knowing if they'll have the expected effect on patients. Some causal relationships remain unknown even to human experts, creating a significant challenge for AI systems.</p><h2>Bringing Causality to AI Agents</h2><p>There are several approaches to incorporating causality into AI agents, each with different applications:</p><h3>1. Randomized Interventions</h3><p>The gold standard for establishing causality involves randomized interventions, where confounding variables are controlled through randomization. This approach is widely used in clinical trials and allows for direct measurement of causal effects:</p><p>Causal Effect = Outcome(Treatment) - Outcome(Control)</p><p>While powerful, randomization isn't always feasible due to cost constraints or ethical considerations. 
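</p><p>That difference-in-means estimate is simple to simulate. A hedged sketch (my own toy numbers, not from the session): a treatment adds 0.2 to a success probability, a latent trait also shifts it, and random assignment balances the trait across arms so the naive subtraction recovers the true effect:</p>

```python
import random

random.seed(1)

def outcome(treated, trait):
    # Hypothetical outcome model: treatment adds +0.2 to success
    # probability; a latent trait adds up to +0.1 in either arm.
    p = 0.3 + (0.2 if treated else 0.0) + 0.1 * trait
    return random.random() < p

treat, control = [], []
for _ in range(50_000):
    trait = random.random()       # unobserved per-subject trait
    arm = random.random() < 0.5   # randomization breaks any trait/arm link
    (treat if arm else control).append(outcome(arm, trait))

effect = sum(treat) / len(treat) - sum(control) / len(control)
print(effect)  # close to the true +0.2 treatment effect
```

<p>Without randomization (say, if subjects with a higher trait self-selected into treatment), the same subtraction would mix the trait&#8217;s effect into the estimate.</p><p>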
As Ali noted, "From an ethical perspective in many situations, for example in the case of drugs, we cannot test every single thing that we hypothesize to work on human beings."</p><h3>2. Causal Discovery Algorithms</h3><p>These algorithms aim to generate directed acyclic graphs (DAGs) that represent causal relationships between variables. Unlike correlation, which merely shows association, these graphs reveal directionality: which variables cause changes in which others.</p><p>So, for scenarios where controlled experiments aren't possible, causal discovery algorithms can extract causal relationships from observational data:</p><p>"We have causal discovery algorithms that aim to generate causal graphs and directed acyclic graphs... when you provide these values across variables into some of these causal discovery algorithms, what they try to do is to check some of the causality assumptions and at the end generates a directed acyclic graph for you."</p><p>These algorithms come in two main varieties:</p><ul><li><p>Statistical methods (traditional constraint-based or score-based approaches like <a href="https://causal-learn.readthedocs.io/en/latest/search_methods_index/Constraint-based%20causal%20discovery%20methods/PC.html">PC</a>)</p></li><li><p>Machine learning-based gradient algorithms (more computationally efficient)</p></li></ul><p>What's particularly valuable is that these approaches don't require massive datasets: hundreds or thousands of data points can suffice, making them practical for many real-world applications.</p><h3>3. Causal Representation Learning</h3><p>This emerging field aims to learn representations that reveal unknown causal structures. It's based on a fundamental insight from physics: most phenomena are governed by a sparse set of causal rules rather than thousands of continuous features.</p><p>This differs fundamentally from traditional representation learning. 
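</p><p>Before going further, it is worth seeing what the constraint-based discovery methods from the previous subsection actually compute: conditional independence tests. A toy sketch (my own, not a real PC implementation) on a simulated chain X -> Y -> Z, where X and Z correlate but become independent once Y is accounted for:</p>

```python
import math
import random

random.seed(2)
N = 20_000

# Simulated linear-Gaussian chain: X -> Y -> Z.
X = [random.gauss(0, 1) for _ in range(N)]
Y = [x + random.gauss(0, 1) for x in X]
Z = [y + random.gauss(0, 1) for y in Y]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (math.sqrt(sum((x - ma) ** 2 for x in a))
                  * math.sqrt(sum((y - mb) ** 2 for y in b)))

def residuals(a, b):
    # Residuals of a simple least-squares regression of a on b.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    beta = (sum((x - ma) * (y - mb) for x, y in zip(a, b))
            / sum((y - mb) ** 2 for y in b))
    return [x - ma - beta * (y - mb) for x, y in zip(a, b)]

r_xz = pearson(X, Z)                                      # clearly nonzero
r_xz_given_y = pearson(residuals(X, Y), residuals(Z, Y))  # ~ 0 given Y
print(r_xz, r_xz_given_y)
```

<p>A constraint-based algorithm runs many such tests and uses the pattern of (in)dependencies to rule candidate graphs in or out; the vanishing partial correlation here is what lets it drop a direct X&#8211;Z edge.</p><p>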
While traditional approaches summarize raw features into latent variables, causal representation learning aims to uncover the underlying causal structure of data.</p><p>This approach draws inspiration from physics, where sparse sets of fundamental rules determine complex phenomena. As Ali explained: "We have a sparse set of rules that determine a specific phenomena... those rules are based on the causal roots... like gravity for example, the electromagnetic rules."</p><p>This sparsity principle applies across domains. In cancer research, for instance, while there isn't a single gene causing poor outcomes, we don't expect thousands of genes to be equally responsible either. Causal representation learning seeks to identify these sparse causal factors.</p><h3>4. Large Language Models and Causality</h3><p>While LLMs weren't explicitly trained for causal reasoning, research has shown they can effectively tackle certain causal tasks with proper prompting. A paper highlighted during the session demonstrated that models like GPT-4 can achieve up to 96% accuracy in identifying known pairwise causal relationships.</p><p>The key lies in smart but simple prompting strategies. Rather than asking broadly about causal relationships between multiple variables, researchers found success by asking direct questions like: "Which cause and effect relationship is more likely: changing A causes a change in B, or changing B causes a change in A?"</p><p>Importantly, LLMs excel at retrieving known causal relationships but cannot uncover novel ones:</p><p>"This way of using the large language models replace the experts for graph generation... But it doesn't uncover unknown relationship."</p><p>The key insight: domain knowledge is crucial. LLMs can only identify causal relationships they've encountered during pre-training. 
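</p><p>That pairwise question is easy to template. A sketch of how such a prompt might be built (my paraphrase of the strategy, not the paper&#8217;s exact wording):</p>

```python
def pairwise_causal_prompt(a: str, b: str) -> str:
    # Ask only for the more likely causal direction between two
    # named variables, instead of asking for a whole causal graph.
    return (
        f"Which cause-and-effect relationship is more likely?\n"
        f"(1) Changing '{a}' causes a change in '{b}'.\n"
        f"(2) Changing '{b}' causes a change in '{a}'.\n"
        f"Answer with 1 or 2 only."
    )

print(pairwise_causal_prompt("smoking", "lung cancer"))
```

<p>Iterating this over variable pairs (with a domain expert in the loop) yields candidate edges of a graph from knowledge the model already holds.</p><p>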
They excel at retrieving and applying known causal knowledge but cannot uncover truly unknown relationships.</p><p>This creates a natural categorization of causal tasks for AI agents:</p><ul><li><p><strong>Known causal relationships</strong>: LLMs can reliably retrieve these (e.g., smoking causing lung cancer)</p></li><li><p><strong>Abundant data but unclear causality</strong>: Areas where causal discovery algorithms might help (e.g., sales data, web page optimization)</p></li><li><p><strong>Unknown relationships</strong>: Domains requiring experimental validation and specialized causal learning algorithms (e.g., novel drug discovery)</p></li></ul><h3>5. Reinforcement Learning and Causality</h3><p>The final piece of the puzzle involves using reinforcement learning to improve AI agents' causal reasoning. By providing feedback based on causal relationships, either from experts, experiments, or causal modeling, we can fine-tune models to make better causal inferences over time.</p><p>"The success of large language models was partially related to reinforcement learning... 
putting the transformers-based large language models and reinforcement learning for providing the feedback and fine-tuning and penalizing them and rewarding them resulted in huge success in the field."</p><h2>Practical Applications Across Domains</h2><p>The integration of causality with AI agents offers compelling applications:</p><h3>Healthcare</h3><ul><li><p>More accurate diagnosis through root cause identification</p></li><li><p>Prevention and detection capabilities</p></li><li><p>Patient monitoring with causal understanding</p></li><li><p>Treatment recommendation based on causal effects</p></li></ul><h3>Business Applications</h3><ul><li><p>Understanding true drivers of sales beyond correlations</p></li><li><p>Designing effective A/B tests to measure intervention impacts</p></li><li><p>Web optimization based on causal rather than correlative insights</p></li></ul><h3>Drug Discovery</h3><ul><li><p>Target identification for different cancer types</p></li><li><p>Biomarker discovery for drug response prediction</p></li><li><p>Analysis of treatment regimens and patient journeys</p></li></ul><div id="youtube2-0akvZtFcCUo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;0akvZtFcCUo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/0akvZtFcCUo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Conclusion</h2><p>What struck me most about Ali's presentation wasn't a single breakthrough technique, but rather the recognition that enhancing AI agents with causality requires integration across multiple approaches. 
It's not about waiting for perfect causal reasoning models, but about strategically incorporating causal thinking into existing systems.</p><p>As AI agents become more integrated into critical domains like healthcare, finance, and education, their ability to reason causally will directly impact human lives. An AI that recommends interventions based on genuine causal understanding rather than statistical correlation is not just more accurate, it's more trustworthy.</p><p>The future of AI isn't just about bigger models or more data, it's about smarter reasoning. And at the heart of smarter reasoning lies causality: understanding not just what happens, but why.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Q&amp;A</h2><p>Q: What's the relationship between reasoning models like R1 and causality?</p><p>A: While these models demonstrate impressive capabilities, true reasoning arguably requires causal understanding. 
Ali suggested we don&#8217;t need to wait for perfect causal reasoning models; we can immediately begin providing causal feedback to existing models through reinforcement learning approaches while developing more fundamentally causal architectures in parallel.</p><p>Q: How does causal representation learning differ from traditional representation learning?</p><p>A: Traditional representation learning summarizes raw features into latent variables, while causal representation learning aims to uncover underlying causal structures. The latter involves additional assumptions beyond traditional IID (independent and identically distributed) assumptions, with the goal of identifying sparse causal relationships that enable better generalization and out-of-distribution performance.</p><p>Q: Can you give practical examples of how causality has helped in your work?</p><p>A: In drug discovery, Ali's team has used causal discovery and inference to identify new gene targets for different cancer types. They've also applied causal approaches to biomarker discovery, identifying underlying mechanisms related to drug responses. 
While specific results remain confidential, these applications demonstrate the practical value of causal approaches in real-world settings.</p>]]></content:encoded></item><item><title><![CDATA[The Business Impact of DeepSeek R1]]></title><description><![CDATA[For the first time, a non-Western model put in question the dominance of American LLM providers- not just in performance but in accessibility, cost, and infrastructure independence.]]></description><link>https://aisc.substack.com/p/the-business-impact-of-deepseek-r1</link><guid isPermaLink="false">https://aisc.substack.com/p/the-business-impact-of-deepseek-r1</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 24 Feb 2025 13:44:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bcd9baf8-2f58-4af5-bfa3-00548467cb52_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We had a community session on this topic and the following are some notes from that session along with some additional thoughts from me!</p><p>The LLM industry has been dominated by a few major players - OpenAI, Meta, Google DeepMind, and Anthropic, to name a few. A few models from the Middle East and Europe popped up here and there and took the headlines for a few days and then they disappeared as quickly as they showed up. Therefore, US based companies have controlled not only model development but also pricing, infrastructure, and the regulatory landscape around AI. This centralization may be partly due to where the capital for moonshot projects is most readily available, but it has certainly created a single point of failure for markets, entrepreneurs, and businesses alike. 
The dominance that we have experienced so far from the US companies in this market has also created relatively stable market dynamics at a relatively high price point for the end users; well, until a new kid on the block messed things up!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>DeepSeek is a relatively young company and in our last two posts (<a href="https://aisc.substack.com/p/understanding-deepseek-r1">this</a> and <a href="https://aisc.substack.com/p/innovations-leading-up-to-deepseek">this</a>) we looked at the technical side of their newest model: R1, a reasoning model developed in <em>China</em>.</p><p>For the first time, a non-Western model called into question the dominance of American LLM providers: not just in performance but in accessibility, cost, and infrastructure <em>independence</em>. 
While there are still many open questions about deployment, security, and long-term impact, one thing is clear: the mindset has shifted away from &#8220;only Americans can do it&#8221;; competition is here, and that&#8217;s good news for entrepreneurs.</p><p>Even if R1 itself doesn&#8217;t deserve the hype it created, and even if, like other models of non-US origin before it, it can&#8217;t live up to its promise, its emergence signals a breakaway from the idea that all major AI breakthroughs must come from Western corporations. The availability of alternatives fosters competition, drives down costs, and increases the diversity of AI applications, creating new opportunities for businesses to build, experiment, and innovate on their own terms.</p><p>What has made me quite excited about LLMs in the past few years is the equalizing power that they bring to innovation. I know there is a lot of business potential around automating mundane tasks using LLM agents, but the part that gets me out of bed every day is building agentic applications that facilitate knowledge intensive workflows. Even with the earliest versions of LLM apps like ChatGPT we saw the lowering of significant barriers to knowledge that was traditionally reserved for smaller portions of the population, think coding, business strategy, marketing tricks, and product ideation. This led to many more people trying out their crazy ideas or at least feeling more encouraged to explore their options. Now imagine, with the falling cost of operationalizing LLM systems, serious competition that can impact pricing, and more powerful models, what kind of tools we can build to give more founders and founders-to-be a level playing field!</p><div><hr></div><p>The following section covers some select Qs and As from our session.</p><p><strong>1. What makes DeepSeek R1 different from existing AI models?</strong></p><p>DeepSeek R1 is a reasoning model that claims to be smaller and more efficient than existing alternatives. 
However, its real distinction lies in its origin. Unlike models from OpenAI or Anthropic, R1 was developed in China and optimized for deployment in non-Western infrastructure, such as Alibaba Cloud. This means businesses that previously had no choice but to rely on US-based AI providers now have an alternative. The model&#8217;s efficiency also raises questions about the future of AI computing, as it suggests that high-quality reasoning tasks may not require the enormous compute resources traditionally associated with o1-level models. For the Western audience this might not even be an option, but imagine how many countries are out there, say in the Middle East, Africa, and Eastern Europe, that are more than happy to consider their options more broadly now that there are options available to them.</p><p><strong>2. How does R1 impact the cost structure of AI-powered businesses?</strong></p><p>Cost has always been a major factor in AI adoption. OpenAI&#8217;s most advanced models can cost up to <a href="https://openai.com/api/pricing/">$15 per million tokens</a> for reasoning tasks, a price that makes AI-powered applications prohibitively expensive for many startups and small businesses. In contrast, DeepSeek R1 is priced at just <a href="https://api-docs.deepseek.com/quick_start/pricing">$0.14 per million tokens</a> - a staggering difference. While this might be an apples to pears comparison and these figures don&#8217;t necessarily reflect training or operational costs, they indicate that AI reasoning capabilities may soon become dramatically more affordable. This reduction in cost could allow small businesses to integrate advanced AI into their workflows without needing the budgets of Big Tech corporations.</p><p><strong>3. Is DeepSeek a direct threat to Nvidia and Western AI infrastructure?</strong></p><p>The release of R1 led to a short-term drop in Nvidia&#8217;s stock, highlighting the market&#8217;s reaction to potential shifts in AI computing demand. 
Investors had largely assumed that AI adoption worldwide would remain dependent on US cloud providers and American GPUs. However, the rise of R1 and similar models means businesses may increasingly turn to non-US cloud infrastructure and non-Nvidia chips, such as those developed by Huawei. While Nvidia and other Western AI players will likely continue to thrive, the assumption of American AI dominance is no longer a given. To a large extent this is a market correction because there was no reason to assume, this early in the game, that the dominance would remain in the West.</p><p><strong>4. What challenges do businesses face in deploying R1?</strong></p><p>Despite its advantages, R1 has proven difficult to deploy. Its official API has suffered frequent downtime, reportedly operating only 20% of the time due to DDoS attacks. Additionally, its architecture, built on the Mixture of Experts (MoE) framework, adds complexity to serving the model. MoE models have historically been difficult to scale and operate efficiently in production environments, which is why they have not been widely adopted despite their theoretical efficiency benefits. Entrepreneurs looking to integrate R1 into their products will need to consider these operational challenges.</p><p><strong>5. Are there security risks associated with using R1?</strong></p><p>Security is a major concern when adopting any AI model, and R1 is no exception. Some businesses may be hesitant to use a Chinese-developed AI system due to fears about data security and compliance. Those are fair concerns, but at the same time all those concerns can and should exist for any other players in the space. AI models, including R1, could be susceptible to training data poisoning, where adversaries inject subtle biases or vulnerabilities into a model&#8217;s responses. Additionally, LLMs can be manipulated to promote specific software libraries, including ones that contain hidden vulnerabilities. 
Businesses considering any LLM, including R1, must weigh these risks against the cost and efficiency benefits.</p><p><strong>6. How does R1 change the competitive landscape for AI startups?</strong></p><p>Before R1, reasoning-capable AI was largely controlled by OpenAI and a few other firms, meaning startups had little choice but to pay high API fees for access. With R1 and potential future competitors, smaller businesses can now explore alternatives that are both cheaper and more flexible. This shift could make AI-powered startups more viable and profitable, especially in emerging markets where cost constraints previously limited access to high-quality models.</p><p><strong>7. Is R1 a sign that China is overtaking the West in AI?</strong></p><p>It&#8217;s too early to say that China is surpassing the US in AI innovation, but R1 demonstrates that the playing field is becoming more balanced. In 2023, 8 of the top 10 AI models were American, with only two exceptions: one from France (Mistral) and one from Canada (Cohere). By 2025, it is expected that at least half of the leading AI models will be Chinese. Companies like Alibaba, Baidu, and DeepSeek are rapidly catching up, and the assumption that only Western companies can build cutting-edge AI is no longer valid.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://aisc.substack.com/p/the-business-impact-of-deepseek-r1?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[Innovations Leading up to DeepSeek R1]]></title><description><![CDATA[To fully appreciate DeepSeek R1&#8217;s capabilities, it is important to understand the evolution of DeepSeek&#8217;s models and how each step led to the development of this advanced reasoning model.]]></description><link>https://aisc.substack.com/p/innovations-leading-up-to-deepseek</link><guid isPermaLink="false">https://aisc.substack.com/p/innovations-leading-up-to-deepseek</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 14 Feb 2025 13:01:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this session, we explored the architecture evolution and technical innovations that led to the development of <strong>DeepSeek R1</strong>, a model that stands at the forefront of AI advancements. DeepSeek R1 pushes the boundaries of reasoning in artificial intelligence and is designed for efficiency, lower cost, and cutting-edge performance. 
To fully appreciate DeepSeek R1&#8217;s capabilities, it is important to understand the evolution of DeepSeek&#8217;s models and how each step led to the development of this advanced reasoning model.</p><div id="youtube2--_j4GvwSDQk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;-_j4GvwSDQk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/-_j4GvwSDQk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Technical Evolution and Foundation</strong></h2><p>Since its inception in 2023, DeepSeek has continually advanced its large language models (LLMs), with each new release building upon the previous model&#8217;s strengths. Below is a detailed breakdown of the contributions made by each iteration, culminating in the creation of <strong>DeepSeek R1</strong>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3><strong>DeepSeek V2</strong></h3><p>We&#8217;ll start by exploring DeepSeek-V2, a large language model (LLM) released in May 2024. This model introduced significant architectural advancements, notably the integration of multi-head latent attention (MLA) and a mixture of experts (MoE) framework. The MLA mechanism enhanced the model&#8217;s ability to process complex patterns by utilizing compressed latent vectors, thereby improving performance and reducing memory usage during inference. The MoE architecture allowed the model to activate a subset of specialized experts per forward pass, optimizing computational efficiency.</p><p>DeepSeek-V2 was trained on an extensive dataset of 8.1 trillion tokens, with a higher proportion of Chinese text compared to English. The context length was extended from 4,000 to 128,000 tokens using the YaRN method, which improved the model&#8217;s ability to handle longer sequences. 
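</p><p>The routing idea, scoring every expert but running only the top few per token, can be sketched with a toy top-k gate (illustrative only; this is not DeepSeek&#8217;s actual architecture, and all sizes are invented):</p>

```python
import math
import random

random.seed(3)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 4

# Toy "experts": each is just a random DIM x DIM linear map here.
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
# Router: one scoring vector per expert.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_forward(token):
    # Score the token against every expert, keep the TOP_K best,
    # and mix only those experts' outputs, weighted by softmax scores.
    scores = [dot(w, token) for w in router]
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
    weights = [math.exp(scores[i]) for i in top]
    total = sum(weights)
    out = [0.0] * DIM
    for w, i in zip(weights, top):
        y = [dot(row, token) for row in experts[i]]
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, top

output, active = moe_forward([0.5, -1.0, 0.2, 0.8])
print(active)  # only TOP_K of the NUM_EXPERTS experts ran for this token
```

<p>Only the selected experts&#8217; parameters are exercised per token, which is where the compute savings come from, at the cost of more complex load balancing when serving at scale.</p><p>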
The training process involved supervised fine-tuning (SFT) on 1.5 million instances for helpfulness and 300,000 for safety, followed by reinforcement learning (RL) using Group Relative Policy Optimization (GRPO) in two stages: one focused on math and coding problems, and the other on helpfulness, safety, and rule adherence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OSun!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OSun!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 424w, https://substackcdn.com/image/fetch/$s_!OSun!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 848w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1272w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png" width="682" height="607" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:607,&quot;width&quot;:682,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OSun!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 424w, https://substackcdn.com/image/fetch/$s_!OSun!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 848w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1272w, https://substackcdn.com/image/fetch/$s_!OSun!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ecc3c5-9da3-4494-8142-5e4e25ceae74_682x607.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Source: <a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a></p><h3><strong>DeepSeek V3</strong></h3><p>Building upon the V2 architecture, DeepSeek introduced V3 in December 2024. This iteration maintained the MoE framework and MLA, featuring a total of 671 billion parameters with a context length of 128,000 tokens. The training process for V3 involved pretraining on 14.8 trillion tokens, predominantly in English and Chinese, with a higher ratio of math and programming content. The context length was further extended from 4,000 to 128,000 tokens using YaRN.</p><p>SFT was conducted for two epochs on 1.5 million samples of reasoning and non-reasoning data. Expert models were trained to generate synthetic reasoning data in specific domains (math, programming, logic), and model-based reward models were developed to guide the RL process. The final model, DeepSeek-V3, was trained using GRPO with both reward models and rule-based rewards. 
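The GRPO step mentioned above replaces a learned value-function baseline with statistics of a group of sampled responses: each response's advantage is its reward relative to the group mean, scaled by the group's standard deviation. A minimal sketch of that advantage computation (illustrative, not the full GRPO objective):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each reward vs. the group's own mean/std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids divide-by-zero

# rewards for 4 sampled answers to the same prompt (e.g. rule-based: correct=1)
rewards = [1.0, 0.0, 0.0, 1.0]
adv = grpo_advantages(rewards)
print(adv)  # correct answers get positive advantage, wrong ones negative
```

These advantages then weight the usual clipped policy-gradient update; no separate critic network is needed, which is part of what makes the approach cheap at scale.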
This version marked a significant step forward in computational efficiency and reasoning capabilities, ensuring that the model could better handle complex tasks and improve its overall performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gjF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gjF8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 424w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 848w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1272w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png" width="575" height="518" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:518,&quot;width&quot;:575,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gjF8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 424w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 848w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1272w, https://substackcdn.com/image/fetch/$s_!gjF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921b7a2b-2cf9-4b0b-9aad-debe051107d4_575x518.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>Source: <a href="https://arxiv.org/pdf/2412.19437">DeepSeek-V3 Technical Report</a></p></blockquote><h3><strong>DeepSeek R1-Zero</strong></h3><p>In November 2024, DeepSeek released R1-Lite-Preview, an early version of R1, accessible via API and chat interfaces. This model was trained for logical inference, mathematical reasoning, and real-time problem-solving. It was reported to outperform OpenAI&#8217;s o1 model on benchmarks such as the American Invitational Mathematics Examination (AIME) and MATH.</p><p>R1-Lite-Preview was initialized from DeepSeek-V3-Base and shared its architecture. The model employed a Mixture of Experts (MoE) framework with 671 billion parameters, activating 37 billion per forward pass to maintain computational efficiency. The training process for R1-Lite-Preview involved supervised fine-tuning (SFT) on a small dataset of high-quality, readable reasoning examples, followed by reinforcement learning (RL) to further develop its reasoning skills. 
This approach encouraged the autonomous emergence of behaviors such as chain-of-thought reasoning, self-verification, and error correction, setting the foundation for the more advanced R1 model.</p><h3><strong>DeepSeek R1</strong></h3><p>On January 20, 2025, DeepSeek launched R1, an open-source AI model emphasizing reasoning capabilities. R1 was initialized from DeepSeek-V3-Base and shares its architecture, including the MoE framework with 671 billion parameters, activating 37 billion per forward pass to maintain computational efficiency. The training process for R1 involved a four-phase pipeline:</p><ul><li><p><strong>Cold Start</strong>: Supervised fine-tuning on a small dataset of high-quality, readable reasoning examples.</p></li><li><p><strong>Reasoning-Oriented RL</strong>: Large-scale RL focusing on rule-based evaluation tasks, incentivizing accurate and coherent responses.</p></li><li><p><strong>Supervised Fine-Tuning</strong>: Synthesis of reasoning data using rejection sampling, combined with non-reasoning data for comprehensive fine-tuning.</p></li><li><p><strong>RL for All Scenarios</strong>: A second RL phase refining the model&#8217;s helpfulness and harmlessness while preserving advanced reasoning skills.</p></li></ul><p>This approach led to the emergence of complex reasoning patterns, such as self-verification and reflection, without explicit programming. Distilled versions of R1, ranging from 1.5 billion to 70 billion parameters, were also developed to cater to different computational needs. 
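The rejection-sampling step in the third phase can be sketched as: sample several candidate answers per prompt, keep only those a checker accepts, and reuse the survivors as supervised fine-tuning data. The generator and checker below are stand-ins, not DeepSeek's pipeline:

```python
import random

random.seed(0)

def generate(prompt, n=8):
    """Stand-in for sampling n candidate answers from the model."""
    return [f"{prompt} -> answer {random.randint(0, 3)}" for _ in range(n)]

def accept(candidate):
    """Stand-in verifier, e.g. exact match against a known correct result."""
    return candidate.endswith("answer 2")

def rejection_sample(prompts):
    sft_data = []
    for p in prompts:
        kept = [c for c in generate(p) if accept(c)]
        sft_data.extend((p, c) for c in kept)  # survivors become SFT pairs
    return sft_data

data = rejection_sample(["what is 1+1?", "what is 2+0?"])
print(len(data))
```

The key property is that only verifiably good trajectories flow back into training, which is why this works best in domains with checkable answers such as math and code.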
R1&#8217;s focus on improving reasoning and inference abilities while maintaining computational efficiency marked a key advancement in the evolution of DeepSeek&#8217;s models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aBPA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aBPA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 424w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 848w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png" width="1043" height="1200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1200,&quot;width&quot;:1043,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aBPA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 424w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 848w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1272w, https://substackcdn.com/image/fetch/$s_!aBPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb99da26-e096-4143-997c-fab065f1c6c8_1043x1200.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Source: DeepSeek R1 architecture by <a href="https://x.com/SirrahChan/status/1881488738473357753">@SirrahChan</a></p><h2>Multi-Token Prediction Innovation</h2><p>One of the most significant advancements in DeepSeek R1 is its novel approach to multi-token prediction, which enhances both the depth and flexibility of the model&#8217;s output.</p><ul><li><p><strong>Sequential Prediction Modules</strong>: Earlier multi-token prediction approaches generate the extra future tokens in parallel with independent output heads. R1 instead predicts the additional tokens sequentially, module by module, so each extra token&#8217;s prediction is informed by the ones that came before it, leading to more cohesive and meaningful output.</p></li><li><p><strong>Enhanced Internal Representations</strong>: The switch to sequential prediction allows R1 to develop richer internal representations of data. 
This change improves the model&#8217;s planning capabilities and enhances its ability to capture long-term dependencies in the sequence, which is crucial for tasks involving complex logic or narrative structures.</p></li><li><p><strong>Densified Training Signals</strong>: In traditional training setups, models predict a single token at a time, which limits the amount of useful training feedback per step. R1&#8217;s multi-token prediction increases the density of training signals per step, providing more concentrated and effective learning, which contributes to its superior accuracy.</p></li><li><p><strong>Shared Embedding Layers</strong>: By utilizing shared embedding layers in combination with sequential transformer blocks, R1 achieves better cohesion between tokens in a sequence. This improves the consistency of predictions across different tokens and helps the model generate more coherent outputs overall.</p></li></ul><h2>Data Processing and Quality Control</h2><p>DeepSeek R1&#8217;s performance is not just driven by its architecture but also by its innovative approach to managing and processing data, ensuring both efficiency and high-quality outputs.</p><ul><li><p><strong>Cross-Dump Deduplication</strong>: R1 implements cross-dump deduplication across 91 Common Crawl dumps, eliminating redundant or repetitive entries. This ensures that the model is exposed to a broader range of unique, high-quality data during training, which enriches its understanding and generalization capabilities.</p></li><li><p><strong>Strategic Exclusion of Multiple-Choice Questions</strong>: Unlike many other pre-training models that include multiple-choice questions, R1 excludes them. 
This strategic decision allows the model to focus on more complex language tasks that require deeper understanding and nuanced responses, enhancing its ability to process subtler forms of reasoning.</p></li><li><p><strong>Mathematical Content Enhancement</strong>: R1 incorporates an iterative classification approach to enhance the mathematical content within its training data. This process strengthens its ability to process and reason through mathematical concepts, improving its performance in specialized tasks that require advanced mathematical reasoning.</p></li><li><p><strong>Innovative Bin Packing Algorithms</strong>: To address the common issue of document truncation, R1 employs innovative bin packing algorithms. These algorithms optimize the organization of training data, reducing unnecessary loss of information and ensuring that the model has access to as much data as possible to predict the next token.</p></li></ul><h2>Practical Applications and Implementation</h2><p>For those looking to implement R1, its practical applications are key to understanding its capabilities and maximizing its potential.</p><ul><li><p><strong>Best Suited for Verifiable Tasks</strong>: R1 is particularly well-suited for environments where success criteria are clearly defined and verifiable. This makes it an ideal choice for industries that require transparency, accountability, and high levels of precision, such as healthcare, law, and finance.</p></li><li><p><strong>Different Prompting Strategies</strong>: Unlike traditional models that often rely on standard prompting strategies, R1 requires experimentation to determine the most effective prompting techniques. 
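What such prompting experiments might look like in practice: one pattern worth trying with reasoning models is adding explicit delineator tags that mark out a structured reasoning path. The `<plan>`/`<answer>` markers below are invented for illustration, not documented R1 special tokens:

```python
# Hypothetical prompt template: the <plan>/<answer> delineators are invented
# markers, not documented R1 special tokens.
def build_prompt(task):
    return (
        f"Task: {task}\n"
        "<plan>Think step by step and list the sub-tasks here.</plan>\n"
        "<answer>Put only the final answer here.</answer>"
    )

def extract(tag, response):
    """Pull the text between <tag> and </tag> out of a model response."""
    start = response.index(f"<{tag}>") + len(tag) + 2
    end = response.index(f"</{tag}>")
    return response[start:end]

reply = "<plan>1. parse 2. solve</plan>\n<answer>42</answer>"
print(extract("answer", reply))  # 42
```

Structured markers like these make the model's output easy to check programmatically, which pairs naturally with the verifiable-task settings R1 is best suited for.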
This flexibility allows for tailored interactions, enabling users to unlock the model&#8217;s full potential across a variety of applications.</p></li><li><p><strong>Trajectory Planning in Agent-Based Systems</strong>: R1 performs exceptionally well in trajectory planning tasks, where it predicts the best possible path forward for an agent navigating dynamic environments. This ability makes R1 a valuable tool for agentic workflows.</p></li></ul><h2>Future Implications and Research Directions</h2><p>As DeepSeek R1 continues to evolve, several exciting avenues for future research and improvement are emerging:</p><ul><li><p><strong>Optimal Stopping Criteria for Reasoning Chains</strong>: Research into the best points at which reasoning chains should be terminated could significantly optimize both performance and efficiency, preventing unnecessary computation while ensuring high-quality outputs.</p></li><li><p><strong>Cross-Lingual Reasoning Capabilities</strong>: R1&#8217;s performance across languages, particularly in reasoning tasks, suggests a promising area for further exploration. Optimizing R1 for cross-lingual reasoning could expand its applicability to multilingual environments, broadening its scope in global applications.</p></li><li><p><strong>Token Distribution Impact on Reasoning</strong>: Investigating how varying token distribution strategies influence reasoning quality could provide valuable insights into further optimizing R1 for different types of reasoning tasks, from simple queries to complex, multi-step deductions.</p></li><li><p><strong>Integration with Existing Agent Frameworks</strong>: A key research direction is how R1 can be integrated with existing agent frameworks. 
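In its simplest possible form, such an integration is an agent loop that asks a reasoning model to pick the next action and stops on a terminal signal. Everything here (the `reasoner` stub, the action names) is hypothetical:

```python
def reasoner(state):
    """Stub standing in for a call to a reasoning model such as R1.

    A real integration would send `state` to the model and parse the
    proposed action out of its reasoning trace.
    """
    return "search" if "results" not in state else "finish"

def run_agent(goal, max_steps=5):
    state, trajectory = {"goal": goal}, []
    for _ in range(max_steps):
        action = reasoner(state)  # the model plans the next move
        trajectory.append(action)
        if action == "finish":
            break
        state["results"] = f"tool output for {action}"  # execute the tool
    return trajectory

print(run_agent("summarize recent MoE papers"))  # ['search', 'finish']
```

The trajectory-planning strength described above corresponds to the `reasoner` step: the better the model's plan at each step, the shorter and more reliable the loop.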
This could enhance its ability to make autonomous decisions in real-world applications, further extending its utility in dynamic and interactive environments.</p></li></ul><p>By understanding these technical foundations and innovations, we can better leverage DeepSeek R1&#8217;s capabilities while acknowledging areas for future development and optimization. Through thoughtful implementation and ongoing research, DeepSeek R1 holds immense potential to drive forward the field of artificial intelligence reasoning.</p><h2>Resources</h2><ul><li><p><a href="https://arxiv.org/pdf/2501.19393">s1: Simple test-time scaling</a></p></li><li><p><a href="https://arxiv.org/pdf/2412.19437">DeepSeek-V3 Technical Report</a></p></li><li><p><a href="https://arxiv.org/pdf/2404.10830">Fewer Truncations Improve Language Modeling</a></p></li><li><p><a href="https://arxiv.org/pdf/2207.14255">Efficient Training of Language Models to Fill in the Middle</a></p></li><li><p><a href="https://arxiv.org/pdf/2408.10914">To Code, or Not To Code? 
Exploring Impact of Code in Pre-training</a></p></li><li><p><a href="https://arxiv.org/pdf/2404.19737">Better &amp; Faster Large Language Models via Multi-token Prediction</a></p></li><li><p><a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model</a></p></li><li><p><a href="https://arxiv.org/pdf/2406.11931">DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence</a></p></li><li><p><a href="https://arxiv.org/pdf/2402.03300">DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.06066">DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.02954">DeepSeek LLM Scaling Open-Source Language Models with Longtermism</a></p></li><li><p><a href="https://arxiv.org/pdf/2401.14196">DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>FAQ</h2><p><strong>Q: What are the primary innovations leading up to DeepSeek R1?</strong></p><p>A: <em>Multi-head Latent Attention (MLA)</em></p><p>&#10687; What it is: Enhances attention mechanisms by working on latent representations instead of raw token sequences, improving efficiency and scalability.</p><p>&#10687; Prior work: Perceiver (DeepMind, 2021) introduced attention over latent variables, reducing quadratic complexity in long-context scenarios.</p><p>&#10687; Why it matters: Helps with long-context processing by operating on compressed representations, enabling better retrieval and reasoning.</p><p><em>Load Balancing for MoE Models</em></p><p>&#10687; What it is: Ensures even distribution of workload across experts in Mixture of Experts (MoE) models, preventing bottlenecks.</p><p>&#10687; Prior work: GLaM (Google, 2021) improved expert selection using an auxiliary routing loss, optimizing compute efficiency.</p><p>&#10687; Why it matters: Makes MoE models more efficient and scalable, allowing better utilization of compute resources while maintaining high performance.</p><p><em>Fill-in-the-Middle (FIM) Learning Objective</em></p><p>&#10687; What it is: Trains models to generate missing text segments, not just predict the next token, improving bidirectional reasoning.</p><p>&#10687; Prior work: Codex (OpenAI, 2021) leveraged FIM to enhance code completion, significantly improving edit and autocomplete capabilities.</p><p>&#10687; Why it matters: Enables better document completion, code generation, and interactive AI assistants that can modify text instead of 
just appending to it.</p><p><em>FP8 Training (Floating Point 8-bit Precision)</em></p><p>&#10687; What it is: Uses lower-precision floating-point formats to reduce memory usage and accelerate training.</p><p>&#10687; Prior work: NVIDIA Hopper Architecture (2022) introduced hardware-optimized FP8 support, enabling more efficient training of large models.</p><p>&#10687; Why it matters: Reduces training costs and memory constraints, making long-context and large-scale models more feasible.</p><p><em>Multi-token Prediction</em></p><p>&#10687; What it is: Instead of generating tokens one at a time, the model predicts several future tokens at each step, improving response fluency and speed.</p><p>&#10687; Prior work: PaLM 2 (Google, 2023) refined parallel decoding techniques to improve latency and coherence in text generation.</p><p>&#10687; Why it matters: Reduces response time and improves fluency in long-form text generation, making AI models more usable in real-time applications.</p><p><strong>Q: How does the use of multi-token prediction in R1 improve context and coherence compared to traditional models?</strong></p><p>A: In non-reasoning models, the next token is predicted individually at each step, which can sometimes lead to a lack of context or coherence in longer sequences. R1&#8217;s approach of predicting multiple tokens allows for a richer internal representation, which helps in planning and reasoning. 
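The denser training signal has a simple counting intuition: with extra prediction modules, each position in a sequence is supervised on several future tokens instead of one. A toy illustration of that bookkeeping (not a real model):

```python
def supervision_signals(seq_len, extra_depths):
    """Count (position, target) pairs supervised in one pass over a sequence.

    Next-token-only training supervises one target per position; with
    `extra_depths` additional prediction modules, each position also
    predicts tokens further ahead, densifying the training signal.
    """
    signals = 0
    for pos in range(seq_len):
        for depth in range(extra_depths + 1):  # depth 0 = ordinary next token
            if pos + 1 + depth < seq_len:      # target must exist in sequence
                signals += 1
    return signals

print(supervision_signals(10, 0))  # 9: plain next-token prediction
print(supervision_signals(10, 1))  # 17: one extra module nearly doubles it
```

The extra modules pay for themselves at training time and can simply be dropped at inference, which is the trade the answer above describes.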
Although these multi-token prediction modules are removed during inference, the richer training signals learned during the process contribute to more accurate token distributions and better reasoning, ensuring that R1 maintains context and coherence when generating responses.</p><p><strong>Q: How did the inclusion of mathematical and coding tokens during training enhance R1&#8217;s reasoning abilities?</strong></p><p>A: R1 benefited from a large set of math tokens, gathered by using the OpenWebMath dataset as a seed and then applying a classifier to identify math-related documents in Common Crawl data. This methodology enabled R1 to acquire 120 billion math tokens, which were important for improving the model&#8217;s mathematical reasoning abilities. Additionally, the model&#8217;s ability to handle code was enhanced by a pre-processing pipeline that ensured proper ordering of files, including dependencies. Learning from more structured and verifiable domains like math and coding helped R1 learn the mechanics of reasoning, which it then generalized to other types of reasoning during training.</p><p><strong>Q: How does DeepSeek R1 handle model bias?</strong></p><p>A: All models have bias and their creators take steps to mitigate that. One of the observations I (Suhas) made during testing is that the model performed better when reasoning in Mandarin compared to English, especially for tasks requiring logical reasoning. This improvement seems to be related to the higher Shannon entropy of the Chinese character set (9.56 bits per character) compared to the English alphabet (3.9 bits per character), which may allow for richer token distributions and more efficient encoding of information. 
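Per-character Shannon entropy of the kind quoted above can be estimated directly from character frequencies; a minimal sketch (the exact figures depend on the corpus used):

```python
from collections import Counter
from math import log2

def char_entropy(text):
    """Empirical Shannon entropy of a text, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * log2(c / total) for c in counts.values())

print(round(char_entropy("abracadabra"), 3))  # ~2.04 bits/char
```

A writing system spread over many thousands of distinct symbols, as Chinese is, naturally supports a higher per-character entropy than one spread over 26 letters, which is the comparison being made here.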
In terms of mitigating bias, the model seems to respond well to diverse inputs, and further research is ongoing to test how language and token distribution affect reasoning capabilities.</p><p><strong>Q: What are the primary evaluation metrics and testing strategies for assessing the performance of new AI models like R1?</strong></p><p>A: When evaluating new models like R1, I (Suhas) use a variety of strategies to assess reasoning capabilities. One method is to take potentially familiar data from the model&#8217;s pre-training and alter it in such a way that it challenges the model to demonstrate whether it is merely recalling information or engaging in actual reasoning. For instance, I may jumble parts of the prompt to see if the model can still generate a correct answer based on the modified context. Additionally, I break complex tasks into smaller sub-tasks to evaluate whether the model can handle these individual components. If the model performs well on each sub-task independently, it provides insight into whether it can successfully tackle the full, multi-step task. This helps assess both the model&#8217;s ability to memorize and its capacity for generalizing reasoning across different problem types.</p><p><strong>Q: Can DeepSeek R1 be effectively used in agentic applications, where reasoning and planning are required, and if so, how?</strong></p><p>A: Yes. One of its strengths is the ability to sample from a set of potential actions, using heuristics to guide decision-making in agentic trajectories. Even without explicit support for tool calls, the model performs well when tasked with reasoning and planning. In my experience, I (Suhas) have tested the model by providing specific tokens in the prompt, which act as delineators, helping the model to follow a structured reasoning path. 
This approach has shown promising results, demonstrating the model&#8217;s potential for handling tasks involving reasoning and planning.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p><em><strong>ACKNOWLEDGEMENT: </strong>These notes are prepared by Mohsin Iqbal.</em></p>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek R1]]></title><description><![CDATA[Notes from our session on DeepSeek R1 - part 1]]></description><link>https://aisc.substack.com/p/understanding-deepseek-r1</link><guid isPermaLink="false">https://aisc.substack.com/p/understanding-deepseek-r1</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 03 Feb 2025 19:47:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/20525d6b-1026-4c54-9d13-da8ca0623380_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;ve been tracking the explosive rise of DeepSeek R1, which has taken the AI world by storm in recent weeks. In this session, we dove deep into the evolution of the DeepSeek family - from the early models through DeepSeek V3 to the breakthrough R1. 
We also explored the technical innovations that make R1 so special in the world of open-source AI.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mf0O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 424w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 848w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png" width="1082" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mf0O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 424w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 848w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1272w, https://substackcdn.com/image/fetch/$s_!Mf0O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a42dff-8494-46dd-a267-d42ef613d188_1082x260.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random 
Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The DeepSeek Family Tree: From V3 to R1</h2><p>DeepSeek isn&#8217;t just a single model; it&#8217;s a family of increasingly sophisticated AI systems. The evolution goes something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jdFy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jdFy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png" width="1456" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jdFy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 424w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!jdFy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14b45bb7-8e04-4c35-9d9f-a30f8f37b21b_1600x841.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>DeepSeek V2:</h3><p>This was the foundation model. It leveraged a mixture-of-experts architecture, in which only a subset of experts is activated at inference, drastically reducing the processing time for each token. It also featured multi-head latent attention to reduce the memory footprint.</p><h3>DeepSeek V3:</h3><p>This model introduced FP8 training techniques, which helped drive down training costs by over 42.5% compared to previous iterations. FP8 is a less precise format for storing weights inside an LLM, but it can greatly reduce the memory footprint. Training in FP8 is typically unstable, however, and it is hard to obtain the desired results. Nevertheless, DeepSeek used multiple tricks to achieve remarkably stable FP8 training. 
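As a rough illustration of why FP8 trades precision for memory, here is a toy sketch (my own, not DeepSeek's recipe) that rounds a weight to an E4M3-style 3-bit mantissa, ignoring exponent-range limits:

```python
import math

def quantize_e4m3_mantissa(x: float) -> float:
    """Round x to the nearest value with a 3-bit mantissa (plus the implicit
    leading bit), as in the FP8 E4M3 format; exponent range is ignored."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** 4                  # 16 representable mantissas per binade
    return math.copysign(round(abs(m) * steps) / steps, x) * 2 ** e

print(quantize_e4m3_mantissa(0.3))  # 0.3125 -- about 4% relative error
```

Storing billions of weights at 8 bits instead of 16 or 32 halves or quarters memory and bandwidth, which is where the cost savings come from; the hard part, as noted above, is keeping training stable at this precision.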
V3 set the stage as a highly efficient model that was already cost-effective (with claims of being 90% cheaper than some closed-source alternatives).</p><h3>DeepSeek R1-Zero:</h3><p>With V3 as the base, the team then introduced R1-Zero, the first reasoning-focused iteration. Here, the focus was on teaching the model not just to generate answers but to &#8220;think&#8221; before answering. Using pure reinforcement learning, the model was encouraged to generate intermediate reasoning steps, for example, taking extra time (often 17+ seconds) to work through a simple problem like &#8220;1+1.&#8221;</p><p>The key innovation here was the use of group relative policy optimization (<strong>GRPO</strong>). Instead of relying on a conventional process reward model (which would have required annotating every step of the reasoning), GRPO compares multiple outputs from the model. By sampling several potential answers and scoring them (using rule-based measures like exact match for math or verifying code outputs), the system learns to favor reasoning that leads to the correct result without the need for explicit supervision of every intermediate thought.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f3Z5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 424w, 
https://substackcdn.com/image/fetch/$s_!f3Z5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png" width="1456" height="765" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f3Z5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 424w, 
https://substackcdn.com/image/fetch/$s_!f3Z5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 848w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1272w, https://substackcdn.com/image/fetch/$s_!f3Z5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2046c237-acbb-46b8-a80a-6e93d1812ac6_1600x841.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>DeepSeek R1:</h3><p>Recognizing that R1-Zero&#8217;s unsupervised approach produced reasoning outputs that could be hard to read or even mix languages, the developers went back to the drawing board. They used the raw outputs from R1-Zero to generate &#8220;cold start&#8221; data and then manually curated these examples to filter and improve the quality of the reasoning. This human post-processing was then used to fine-tune the original DeepSeek V3 model further&#8212;combining both reasoning-oriented reinforcement learning and supervised fine-tuning. The result is DeepSeek R1: a model that now produces readable, coherent, and reliable reasoning while still maintaining the efficiency and cost-effectiveness of its predecessors.</p><h2>What Makes R1 Series Special?</h2><p>The most fascinating aspect of R1-Zero is how it developed reasoning capabilities without explicit supervision of the reasoning process. These capabilities can be further improved with cold-start data and supervised fine-tuning to produce readable reasoning on general tasks. Here's what sets it apart:</p><h3>Open Source &amp; Efficiency:</h3><p>R1 is open source, allowing researchers and developers to inspect and build upon its innovations. Its cost efficiency is a major selling point, especially when compared to closed-source models (claimed 90% cheaper than OpenAI) that require massive compute budgets.</p><h3>Novel Training Approach:</h3><p>Instead of relying solely on annotated reasoning (which is both expensive and time-consuming), the model was trained using an outcome-based approach. It started with easily verifiable tasks, such as math problems and coding exercises, where the correctness of the final answer could be easily measured.</p><p>By using group relative policy optimization, the training process compares multiple generated answers to determine which ones lead to the desired output. 
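As a sketch of that comparison step (my own simplification, assuming simple binary rule-based rewards), the group-relative scoring can be written as:

```python
def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled answer's reward by the
    group's mean and standard deviation, so no per-step reward model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four answers sampled for one prompt; reward 1.0 when the final answer
# passed the rule-based check (exact match for math, code ran correctly).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Answers that beat their own group's average get a positive advantage and are reinforced; the rest are pushed down, regardless of how the intermediate reasoning was worded.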
This relative scoring mechanism allows the model to learn &#8220;how to think&#8221; even when intermediate reasoning is generated in a freestyle manner.</p><h3>Overthinking?</h3><p>An interesting observation is that DeepSeek R1 sometimes &#8220;overthinks&#8221; simple problems. For example, when asked &#8220;What is 1+1?&#8221; it might spend nearly 17 seconds evaluating different scenarios&#8212;even considering binary representations&#8212;before concluding with the correct answer. This self-questioning and verification process, although it might seem inefficient at first glance, could prove advantageous in complex tasks where deeper reasoning is necessary.</p><h3>Prompt Engineering:</h3><p>Traditional few-shot prompting techniques, which have worked well for many chat-based models, can actually degrade performance with R1. The developers recommend using direct problem statements with a zero-shot approach that specifies the output format clearly. This ensures that the model isn&#8217;t led astray by extraneous examples or hints that might interfere with its internal reasoning process.</p><h2>Getting Started with R1</h2><p>For those looking to experiment:</p><ul><li><p>Smaller variants (7B-8B) can run on consumer GPUs or even only CPUs</p></li><li><p>Larger versions (600B) require significant compute resources</p></li><li><p>Available through major cloud providers</p></li><li><p>Can be deployed locally via Ollama or vLLM</p></li></ul><h2>Looking Ahead</h2><p>We're particularly intrigued by several implications:</p><ol><li><p>The potential for this approach to be applied to other reasoning domains</p></li><li><p>Impact on agent-based AI systems traditionally built on chat models</p></li><li><p>Possibilities for combining with other supervision techniques</p></li><li><p>Implications for enterprise AI deployment</p></li></ol><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Open Questions</h2><ul><li><p>How will this affect the development of future reasoning models?</p></li><li><p>Can this approach be extended to less verifiable domains?</p></li><li><p>What are the implications for multi-modal AI systems?</p></li></ul><p>We'll be watching these developments closely, particularly as the community begins to experiment with and build upon these techniques.</p><h2>Resources</h2><p><a href="https://join.slack.com/t/aisc-to/shared_invite/zt-2mlfe75x1-WCBNwP31Wz9nnGgLQxrHTg">Join our Slack community</a> for ongoing discussions and updates about DeepSeek and other AI developments. 
We're seeing fascinating applications already emerging from our <a href="https://maven.com/aggregate-intellect/llm-systems/">bootcamp participants</a> working with these models.</p><ul><li><p>Chat with DeepSeek: </p></li></ul><p>https://www.deepseek.com/</p><ul><li><p>Papers:</p><ul><li><p><a href="https://arxiv.org/abs/2401.02954">DeepSeek LLM</a></p></li><li><p><a href="https://arxiv.org/pdf/2405.04434">DeepSeek-V2</a></p></li><li><p><a href="https://arxiv.org/pdf/2412.19437v1">DeepSeek-V3</a></p></li><li><p><a href="https://arxiv.org/pdf/2501.12948">DeepSeek-R1</a></p></li></ul></li><li><p>Blog Posts:</p><ul><li><p><a href="https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1?trk=comments_comments-list_comment-text">The Illustrated DeepSeek-R1</a></p></li><li><p><a href="https://aipapersacademy.com/deepseek-r1/">DeepSeek-R1 Paper Explained</a></p></li><li><p><a href="https://medium.com/@mayadakhatib/deepseek-r1-a-short-summary-73b6b8ced9cf">DeepSeek R1 &#8212; a short summary</a></p></li></ul></li><li><p>Cloud Providers:</p><ul><li><p><a href="https://build.nvidia.com/deepseek-ai/deepseek-r1">Nvidia</a></p></li><li><p><a href="https://www.together.ai/">Together.ai</a></p></li><li><p><a href="https://aws.amazon.com/bedrock/">AWS</a></p></li></ul></li></ul><div id="youtube2-L_iWRH3CfQ8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;L_iWRH3CfQ8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/L_iWRH3CfQ8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Q&amp;A</h2><h4><strong>Q1: Which model deserves more attention &#8211; DeepSeek or Qwen2.5Max?</strong></h4><p>A: While Qwen2.5 is also a strong model in the open-source community, the choice 
ultimately depends on your use case. DeepSeek R1 emphasizes advanced reasoning and a novel training approach that may be especially valuable in tasks where verifiable logic is critical.</p><h4><strong>Q2: Why did major providers like OpenAI opt for supervised fine-tuning rather than reinforcement learning (RL) like DeepSeek?</strong></h4><p>A: We should note upfront that they do use RL, at the very least in the form of RLHF. It is very likely that models from major providers that have reasoning capabilities already use something similar to what DeepSeek has done here, but we can&#8217;t be sure. It is also likely that, with access to more resources, they favored supervised fine-tuning for its stability and the ready availability of large annotated datasets. Reinforcement learning, although powerful, can be less predictable and harder to control. DeepSeek&#8217;s approach innovates by applying RL in a reasoning-oriented manner, enabling the model to learn effective internal reasoning with only minimal process annotation - a strategy that has proven promising despite its complexity.</p><h4><strong>Q3: Did DeepSeek use test-time compute strategies similar to those of OpenAI?</strong></h4><p>A: DeepSeek R1&#8217;s design emphasizes efficiency by leveraging techniques such as the mixture-of-experts approach, which activates only a subset of parameters, to reduce compute during inference. This focus on efficiency is central to its cost advantages.</p><h4><strong>Q4: What is the difference between R1-Zero and R1?</strong></h4><p>A: R1-Zero is the initial model that learns reasoning solely through reinforcement learning without explicit process supervision. It generates intermediate reasoning steps that, while sometimes raw or mixed in language, serve as the foundation for learning. DeepSeek R1, on the other hand, refines these outputs through human post-processing and supervised fine-tuning. 
In essence, R1-Zero provides the unsupervised &#8220;spark,&#8221; and R1 is the polished, more coherent version.</p><h4><strong>Q5: How can one stay updated with in-depth, technical research while managing a busy schedule?</strong></h4><p>A: Staying current involves a combination of actively engaging with the research community (like AISC - see link to join slack above), following preprint servers like arXiv, attending relevant conferences and webinars, and participating in discussion groups and newsletters. Continuous engagement with online communities and collaborative research projects also plays a key role in keeping up with technical advancements.</p><h4><strong>Q6: In what use-cases does DeepSeek outperform models like O1?</strong></h4><p>A: The short answer is that it&#8217;s too early to tell. DeepSeek R1&#8217;s strength, however, lies in its robust reasoning capabilities and its efficiency. It is particularly well suited for tasks that require verifiable logic&#8212;such as mathematical problem solving, code generation, and structured decision-making&#8212;where intermediate reasoning can be reviewed and confirmed. Its open-source nature further allows for customized applications in research and enterprise settings.</p><h4><strong>Q7: What are the implications of DeepSeek R1 for enterprises and start-ups?</strong></h4><p>A: The open-source and cost-efficient design of DeepSeek R1 lowers the entry barrier for deploying advanced language models. Enterprises and start-ups can leverage its advanced reasoning for agentic applications ranging from automated code generation and customer support to data analysis. 
Its flexible deployment options&#8212;on consumer hardware for smaller models or cloud platforms for larger ones&#8212;make it an attractive alternative to proprietary solutions.</p><h4><strong>Q8: Will the model get stuck in a loop of &#8220;overthinking&#8221; if no correct answer is found?</strong></h4><p>A: While DeepSeek R1 has been observed to &#8220;overthink&#8221; simple problems by exploring multiple reasoning paths, it incorporates stopping criteria and evaluation mechanisms to prevent infinite loops. The reinforcement learning framework encourages convergence toward a verifiable output, even in ambiguous cases.</p><h4><strong>Q9: Is DeepSeek V3 completely open source, and is it based on the Qwen architecture?</strong></h4><p>A: Yes, DeepSeek V3 is open source and served as the foundation for later iterations. It is built on its own set of innovations&#8212;including the mixture-of-experts approach and FP8 training&#8212;and is not based on the Qwen architecture. Its design emphasizes efficiency and cost reduction, setting the stage for the reasoning innovations seen in R1.</p><h4><strong>Q10: How does DeepSeek R1 perform on vision tasks?</strong></h4><p>A: DeepSeek R1 is a text-based model and does not incorporate vision capabilities. Its design and training focus solely on language processing and reasoning.</p><h4><strong>Q11: Can professionals in specialized fields (for example, labs working on cures) apply these methods to train domain-specific models?</strong></h4><p>A: Yes. The innovations behind DeepSeek R1&#8212;such as its outcome-based reasoning training and efficient architecture&#8212;can be adapted to various domains. Researchers in fields like biomedical sciences can tailor these methods to build models that address their specific challenges while benefiting from lower compute costs and robust reasoning capabilities. It is likely that in deeply specialized fields, however, there will still be a need for supervised fine-tuning to get reliable results.</p><h4><strong>Q12: Were the annotators for the human post-processing experts in technical fields like computer science or mathematics?</strong></h4><p>A: The discussion indicated that the annotators primarily focused on domains where correctness is easily verifiable&#8212;such as math and coding. This suggests that expertise in technical fields was indeed leveraged to ensure the accuracy and clarity of the reasoning data.</p><h4><strong>Q13: Could the model get things wrong if it relies on its own outputs for learning?</strong></h4><p>A: While the model is designed to optimize for correct answers via reinforcement learning, there is always a risk of errors&#8212;especially in ambiguous scenarios. 
However, by evaluating multiple candidate outputs and reinforcing those that lead to verifiable results, the training process minimizes the likelihood of propagating incorrect reasoning.</p><h4><strong>Q14: How are hallucinations minimized in the model given its iterative reasoning loops?</strong></h4><p>A: The use of rule-based, verifiable tasks (such as math and coding) helps anchor the model&#8217;s reasoning. By comparing multiple outputs and using group relative policy optimization to reinforce only those that yield the correct result, the model is guided away from generating unfounded or hallucinated information.</p><h4><strong>Q15: Does the model rely on complex vector mathematics?</strong></h4><p>A: Yes, advanced techniques&#8212;including complex vector math&#8212;are integral to the implementation of mixture-of-experts and attention mechanisms in DeepSeek R1. However, the primary focus is on using these techniques to enable effective reasoning rather than showcasing mathematical complexity for its own sake.</p><h4><strong>Q16: Some worry that the model&#8217;s &#8220;thinking&#8221; may not be as refined as human reasoning. Is that a valid concern?</strong></h4><p>A: Early iterations like R1-Zero did produce raw and sometimes hard-to-read reasoning. However, the subsequent refinement process&#8212;where human experts curated and improved the reasoning data&#8212;has significantly enhanced the clarity and reliability of DeepSeek R1&#8217;s internal thought process. While it remains an evolving system, iterative training and feedback have led to meaningful improvements.</p><h4><strong>Q17: Which model variants are suitable for local deployment on a laptop with 32GB of RAM?</strong></h4><p>A: For local testing, a medium-sized model&#8212;typically in the range of 7B to 8B parameters&#8212;is recommended. 
Larger models (for example, those with hundreds of billions of parameters) require significantly more computational resources and are better suited for cloud-based deployment.</p><h4><strong>Q18: Is DeepSeek R1 &#8220;open source&#8221; or does it offer only open weights?</strong></h4><p>A: DeepSeek R1 is provided with open weights, meaning that its model parameters are publicly accessible. This aligns with the overall open-source philosophy, allowing researchers and developers to further explore and build upon its innovations.</p><h4><strong>Q19: What would happen if the order of training were reversed&#8212;starting with supervised fine-tuning before unsupervised reinforcement learning?</strong></h4><p>A: The current approach allows the model to first explore and generate its own reasoning patterns through unsupervised RL, and then refine these patterns with supervised methods. Reversing the order might constrain the model&#8217;s ability to discover diverse reasoning paths, potentially limiting its overall performance in tasks that benefit from autonomous thought.</p><p><em><strong>ACKNOWLEDGEMENT: </strong>These notes are prepared by Mohsin Iqbal and edited by Boqi (Percy) Chen and myself.</em></p>]]></content:encoded></item><item><title><![CDATA[A.I.'s 6-Year Anniversary and a New Beginning ]]></title><description><![CDATA[This week I'm celebrating 6 years at Aggregate Intellect (Friday) and 5 years of working on it full time (Monday).]]></description><link>https://aisc.substack.com/p/ais-6-year-anniversary-and-a-new</link><guid isPermaLink="false">https://aisc.substack.com/p/ais-6-year-anniversary-and-a-new</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 21 Jan 2025 13:55:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!w5c0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The big news is that for a combination of personal and business reasons, I have relocated to the US. This has been coming for a while but this past year was the year that everything finally came together. This was partly due to the logistics of the move working out (visa etc) and partly because I finally had clarity about the importance of the move happening right away.
The decision was also empowered partly by the fact that our business started doing much better than before (more than 5x the previous year&#8217;s revenue) and partly by my growing emphasis on becoming a lifelong builder.</p><p>Last year, I wrote an <a href="https://aisc.substack.com/p/5-lessons-for-5-years-at-aggregate">anniversary blog post</a> reflecting on five years of running Aggregate Intellect. It was a momentous milestone, and I distilled five hard-earned lessons from that journey:</p><ol><li><p>Instead of building a startup, build a product.</p></li><li><p>Instead of building a product, build a company.</p></li><li><p>Instead of building a company, build a business.</p></li><li><p>Instead of building any business, build a sustainable business.</p></li><li><p>Instead of building a sustainable business, focus on being a healthy founder.</p></li></ol><p>Each of these lessons represented a turning point in my thinking&#8212;a shift from chasing status, to creating value, to surviving, to sustaining, and, finally, to strengthening myself as a founder. Life, as it tends to do, has a way of humbling you with even deeper questions just when you think you&#8217;ve figured things out.
My good friend, Serena M., commented on that article saying that she was excited to see what the next lesson was going to be! I was excited to figure that out too!</p><p>Over the past year, I&#8217;ve lived the answer to that question!</p><p>Now, looking back, I think I know where I&#8217;m headed. You know you learned a good lesson when it changes how you think, live, and work and this time the lesson was yet another pivot in my mindset<em>. </em>In the early years, my goals were mostly external: money raised, revenue targets, product milestones, team growth. But over time, I&#8217;ve realized those goals only make sense when anchored to something deeper: <em>Who do I want to become?</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w5c0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w5c0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 424w, https://substackcdn.com/image/fetch/$s_!w5c0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 848w, https://substackcdn.com/image/fetch/$s_!w5c0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w5c0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w5c0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png" width="1397" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1397,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2006486,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w5c0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 424w, https://substackcdn.com/image/fetch/$s_!w5c0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 848w, https://substackcdn.com/image/fetch/$s_!w5c0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w5c0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F94827dd9-c0f2-45e5-bada-5e1796fc7967_1397x806.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">This reminds me of the movie &#8220;Inside Out&#8221; and how all parts of identity have to work together to help you achieve your goals</figcaption></figure></div><p>This shift&#8212;from focusing on outcomes to focusing on identity&#8212;has been transformative.
Instead of asking, &#8220;What do I want to achieve?&#8221; I&#8217;ve started asking:</p><ul><li><p>What kind of person do I want to be?</p></li><li><p>What kind of impact should my work have on society, especially at this turning point with agentic AI systems becoming the norm?</p></li><li><p>How do I increase my capacity to become what I want to be?</p></li></ul><p>The last question on capacity is an interesting one. Of course, every day I have a limited number of hours and a limited amount of energy to take care of all the things I need. Very quickly it becomes obvious that I can&#8217;t do everything that crosses my mind. So, what&#8217;s the right strategy here?</p><div class="pullquote"><p>Strategy is learning what to say &#8216;no&#8217; to</p></div><p>It&#8217;s easy to think of strategy as a template you fill out, a plan, a roadmap, or even a chessboard where you outmaneuver competitors. But I&#8217;ve come to realize that strategy is as much about subtraction as it is about addition. It&#8217;s not just about deciding <em>what to do</em>; it&#8217;s about learning <em>what to say no to</em>. And honestly, that part is far more important because there are often many more things I shouldn&#8217;t do!</p><p>This past year, I&#8217;ve had to confront some hard truths:</p><ul><li><p>Not every dollar is worth earning.</p></li><li><p>Not every opportunity is worth pursuing.</p></li><li><p>Some "good" things can sabotage the "great" things.</p></li></ul><p>I&#8217;ve seen how easy it is to get distracted by short-term gains or shiny opportunities that don&#8217;t align with the bigger picture. Or how choosing to work on something out of despair results in spending way more time and money to undo that.
Saying no isn&#8217;t just about preserving my time and energy; it&#8217;s about protecting my focus and being brutally honest about how every action serves my long-term goals.</p><p>If you read between the lines of my journey, there&#8217;s a recurring pattern of me learning again and again that if I want to sustain my ability to be a founder long term, I need to be the architect of that future. In other words, I have to build that future with my own hands rather than delegating it to others to do it for me. You might think &#8220;duh, isn&#8217;t that what founders do&#8221;; trust me, every year I&#8217;m discovering a new layer to that simple statement.</p><p>For a long time, I thought the natural evolution of leadership was moving from builder to manager: delegating tasks, overseeing others, and focusing on &#8220;higher-level&#8221; decisions. &#8220;My time is too valuable to spend it on being in the weeds&#8221; is a fallacy that I think I picked up in my corporate days and I still carried traces of it. But I&#8217;ve realized that being a builder isn&#8217;t a phase to grow out of&#8212;it&#8217;s the core of who I am.</p><p>Ironically, the transition from an individual contributor to a manager back in my corporate life was the moment when the AISC community was born. I wanted to stay &#8220;technical&#8221; outside of my managerial work hours, but over the years I had lost my hands-on-keyboard coding abilities. Now I&#8217;m reclaiming that, and the empowerment I feel is exactly what AI agents bring to the conversation, and what I&#8217;m excited to bring to our customers (primarily workers in more traditional businesses like construction).</p><p>This past year, empowered by all the software co-pilots, I found my way back to coding and contributed to our codebase and built several sales demos with <a href="https://sherpa-ai.readthedocs.io/en/latest/">Sherpa</a>.
We used one of those to close our largest deal ever, and several other ones I&#8217;ve been building with help from the team are making major strides towards our next deals.</p><p>I feel that with effective use of technology that is available to me, the same kind of technology that we are developing for our customers, and my ability to hire remarkable assistants and collaborators from the AISC community, I can become the builder that I was meant to be. Combine that with ruthless prioritization of what I should spend my time on, and I no longer have to compromise what I want to be for what I need to be as a business owner.</p><div class="pullquote"><p>The best leaders don&#8217;t order people to build; they inspire others by building <em>with</em> them.</p></div><h3><strong>The Road Ahead</strong></h3><p>If the first five years of Aggregate Intellect were about <em>building a foundation</em>, this next chapter that has begun this year is about <em>refining the essence.</em> It&#8217;s about focusing, scaling with intention, and aligning everything I do&#8212;both personally and professionally&#8212;with the kind of person I want to be and the future I want to build.</p><p>So, if you&#8217;re reading this and wondering what comes after &#8220;being a healthy founder,&#8221; here&#8217;s my answer: You stop thinking of yourself as a founder altogether. You stop chasing the title, the milestones, and the status. 
Instead, you start focusing on becoming the kind of person who builds things that matter&#8212;without losing yourself in the process.</p><p>And that&#8217;s the most exciting journey of all.</p>]]></content:encoded></item><item><title><![CDATA[Should startup founders learn acting?]]></title><description><![CDATA[As a startup founder, you're on "stage" all the time, presenting to customers, pitching to investors, inspiring the team.
But is there more?]]></description><link>https://aisc.substack.com/p/should-startup-founders-learn-acting</link><guid isPermaLink="false">https://aisc.substack.com/p/should-startup-founders-learn-acting</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 06 Dec 2024 13:46:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CenN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have been thinking about writing this piece for a while and a recent initiative my good friend, <a href="https://dianadimauro.com/">Diana Di Mauro</a>, started triggered me to finally do it. She has been one of the active participants in our startup mastermind circles, and has always shared insight that surprises my brain. She always sees the world from an angle that my too-logical brain fails to notice. So, once I asked her &#8220;what&#8217;s your secret for being so in touch and comfortable with your feelings even if they really stink&#8221;. Her answer was &#8220;oh, as an actor I&#8217;m trained to do that&#8221;. And there was a mix of silence and ongoing conversation in my head for a few minutes! &#8220;wait! what does that even mean? like you can train yourself to be ok with hearing rejection after rejection and still be able to function? that&#8217;s surely crazy, right?&#8221; &#8230; </p><p>A while later I asked her (and myself), so should startup founders and small business owners learn acting? There are some obvious cases like as a startup founder, you're on "stage" all the time, presenting to customers, pitching to investors, inspiring the team. And that&#8217;s perhaps why a lot of startup founders do &#8220;improv&#8221; both as a hobby and as a way to learn how to handle spontaneous situations better. 
But there seems to be more!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CenN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CenN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CenN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CenN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 1272w,
https://substackcdn.com/image/fetch/$s_!CenN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CenN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:338690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CenN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CenN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CenN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!CenN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb020de24-b34b-40a6-94b4-d99b30f2db52_1024x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Credit: <a href="https://dianadimauro.com/">Diana Di Mauro</a></figcaption></figure></div><p>Running a business can be an intensely demanding endeavor, often leaving founders and small business owners feeling stretched thin as they navigate the pressures of leadership, decision-making, and relationship management.
Engaging in acting offers a unique and refreshing way to step outside the routine of business as usual, fostering not only mental and social well-being but also critical skills for professional success. Acting provides an opportunity to disconnect from the daily grind, practice creativity, and connect with others in an entirely different context. This break from the norm can help reduce stress, build confidence, and improve overall mental clarity, empowering founders to approach their work with renewed energy and focus.</p><p>Beyond its immediate mental and social health benefits, acting also develops essential business skills in profound and practical ways. It helps founders and business owners improve their communication, storytelling, and improvisational abilities while fostering critical qualities like empathy, adaptability, and resilience. These are the very traits that drive successful leadership, collaboration, and decision-making. By practicing acting, entrepreneurs gain hands-on experience in these areas, strengthening their ability to navigate the dynamic and interpersonal challenges that come with running a business.</p><p>Practicing these skills in the context of business as usual often comes with high stakes, where mistakes can lead to lost deals, damaged relationships, or missed opportunities. For example, fumbling through a critical investor pitch or mishandling a sensitive team conflict can have real consequences for the business. Acting, on the other hand, provides a low-stakes playground where individuals can hone these same skills without the fear of high-stakes failure. In a theater exercise, a founder might practice delivering an emotional monologue to connect with an audience&#8212;an experience that translates into engaging and impactful storytelling for investor presentations. Similarly, role-playing improvisational scenes helps build quick thinking and confidence, skills that are invaluable in negotiations or unplanned client interactions. 
Acting gives entrepreneurs the space to experiment, learn, and refine their approach in a supportive, consequence-free environment.</p><p>Concretely, here are some benefits:</p><ul><li><p><strong>Technical Aspects of Acting</strong></p><ul><li><p><em>Pitching and Presenting</em></p><ul><li><p>How Acting Helps: Acting techniques such as voice modulation, physicality, and focus enable clarity, engagement, and impact during presentations. Actors use vocal exercises to project their voice and control tone, which is directly transferable to speaking confidently in front of an audience.</p></li><li><p>Concrete Example: A startup founder preparing for a pitch can use the technique of mirroring (where an actor mimics the movement of another person) to feel more confident and comfortable in front of investors. They can practice vocal exercises and gestures to make their pitch more dynamic and persuasive.</p></li></ul></li><li><p><em>Storytelling</em></p><ul><li><p>How Acting Helps: Acting is built on the arc of a character&#8212;where they start, how they grow, and the challenges they face. This technique teaches the art of constructing a compelling narrative. Actors also use emotional recall to make stories feel authentic and engaging, which is crucial for effective storytelling in business.</p></li><li><p>Concrete Example: When sharing the business&#8217;s origin story or explaining the product&#8217;s journey, a founder can apply the actor's technique of "beats"&#8212;breaking the story into key emotional or narrative shifts&#8212;to captivate an audience and keep them invested.</p></li></ul></li><li><p><em>Improvisation</em></p><ul><li><p>How Acting Helps: Improvisational techniques, like the "yes, and" principle, encourage quick thinking and adaptability, vital when situations change unexpectedly. 
In business, improvisation helps with making decisions in the moment, without a scripted response.</p></li><li><p>Concrete Example: A founder who is faced with an unexpected question from a customer or investor can use improvisation skills to confidently adapt their answer while staying aligned with the overall message, maintaining composure even under pressure.</p></li></ul></li></ul></li><li><p><strong>Emergent Aspects of Acting</strong></p><ul><li><p><em>Empathy and Emotional Intelligence</em></p><ul><li><p>How Acting Helps: Actors develop deep empathy by embodying a variety of characters and situations, learning to understand different perspectives. Through this practice, they cultivate emotional intelligence, which helps them respond sensitively to the emotions and needs of others in real life.</p></li><li><p>Concrete Example: A founder who has practiced stepping into multiple roles might better understand the perspectives of their team members or customers, leading to more thoughtful and responsive leadership. By embodying a character&#8217;s motivations, they can navigate challenging interpersonal dynamics with greater empathy and effectiveness.</p></li></ul></li><li><p><em>Resilience Under Pressure</em></p><ul><li><p>How Acting Helps: Repeated exposure to performing in front of an audience teaches actors to manage nerves and remain poised in high-stakes environments. 
Techniques such as breathing exercises and relaxation routines help actors maintain composure, which is a critical skill for leaders during stressful moments.</p></li><li><p>Concrete Example: A business owner facing a difficult negotiation can apply acting techniques, such as deep breathing or visualization (common in performance prep), to stay calm and collected, ensuring they don&#8217;t let stress dictate their responses during tense moments.</p></li></ul></li><li><p><em>Overcoming Fear and Self-Doubt</em></p><ul><li><p>How Acting Helps: In acting, embracing vulnerability and taking risks is central to a performance. Actors confront their self-doubt regularly by practicing in front of others, which builds a mindset of acceptance and resilience. This process is crucial for founders when they need to take bold steps or face rejection.</p></li><li><p>Concrete Example: A founder giving their first major public speech may feel nervous, but acting exercises like "being present" (focusing entirely on the moment and the audience rather than internal fears) can help them push through self-doubt and deliver a more confident performance.</p></li></ul></li></ul></li></ul><p>A lot of this might be things that you either already try to practice on your own in other contexts, or know you should practice but can&#8217;t create the environment to do so in a sustained way. So, trying them as part of a fun side hobby like acting might give you a way to sharpen these skills and, in turn, become more successful in your work and life.</p>]]></content:encoded></item><item><title><![CDATA[AI and Talent Development]]></title><description><![CDATA[Easily 9 out of 10 data people I know come from non-computer science backgrounds. Is this the sign of a declining and failing educational system or is it just the natural evolution of things?]]></description><link>https://aisc.substack.com/p/ai-and-talent-development</link><guid isPermaLink="false">https://aisc.substack.com/p/ai-and-talent-development</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 21 Oct 2024 19:46:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e6e216d9-f7a9-4ae4-b3e6-cb0f1f0c225a_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As AI takes over the world across industries, one of the big topics of discussion is: <em>what would humans do then?</em> In more recent history, educational institutions have been responsible for providing an answer to this question. You would go to school to become an accountant, or a lawyer, or a doctor, etc., and then you become one.
The dotted line between &#8220;what do you want to be when you grow up&#8221; and &#8220;what you really ended up doing&#8221; has traditionally been traced through various stages of progressively advanced education. Now the prevalence of the question <em>&#8220;what would humans do&#8221;</em> and its tangential variants tells me that there is a big gap in people&#8217;s heads between what education produces and what is realistic for people to do going forward.&nbsp;</p><p>This is, of course, not a complete surprise. Easily 9 out of 10 data people I know come from non-computer science backgrounds. While &#8220;people I know&#8221; might not be exactly a representative population, I doubt anyone can argue against the possibility and the occurrence of significant career movement to areas that are superficially unrelated to one&#8217;s educational background in the past decade or so. Is this the sign of a declining and failing educational system or is it just the natural evolution of things?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>A Historical Lens</h1><p>If we take a look at how &#8220;education&#8221; has evolved over time, the major shifts over the past few centuries closely align with the various industrial revolutions.
In other words, every time the work context has changed drastically, a combination of market forces, business interests, and government incentives has pushed to align what people learned as they grew up with the kind of workforce that was needed.&nbsp;</p><p>In the 18th and 19th centuries, the shift from manual labor to mechanization and the rise of factories transformed agriculture-focused societies into industrialized urban centers and kickstarted a growing demand for a literate and numerate workforce. Factory jobs required workers who could read instructions, measure, and do simple calculations. The increasing demand for skilled workers led governments to invest in primary education systems.</p><p>During the late 19th and early 20th centuries, electrification, telecommunications, and large-scale infrastructure like railroads created another tectonic shift. As industries grew more specialized, the need for vocational and technical education became apparent. Schools began offering specific training for careers in engineering, manufacturing, and other technical fields. Polytechnic institutions and trade schools emerged to provide practical skills to the working class. Governments began extending compulsory education beyond primary school to better prepare students for skilled labor in an industrial society, and even public universities started to emerge.&nbsp;</p><p>Then, from post-WWII until very recently, we have been going through the digital revolution, in which everything has slowly become computerized. Schools and universities began integrating computers and other digital tools into the classroom. The rise of personal computers in the 1980s and 1990s and the internet in the 1990s dramatically changed how information was accessed and shared. Computer literacy became essential for students. The internet made distance learning and online education possible, democratizing access to education, and then Covid made it the norm.
The rapid pace of technological change meant that workers needed to continually update their skills. Education systems began placing a greater emphasis on lifelong learning and adult education programs to help workers adapt to new digital realities, and corporations started designing and running reskilling and upskilling programs.&nbsp;</p><h1>Rise of Language Models</h1><p>One of the major technological shifts in the past 5 years has been our ability to computationally analyze and generate natural and formal language. While language in itself is not a complete representation of intelligence, the entrance of large language models (LLMs) into the public vocabulary has fueled speculation about the upcoming &#8220;artificial general intelligence&#8221; (AGI) and the 4th industrial revolution. If (when?) that actually happens, we will be among the first generations to experience more than one industrial revolution in a lifetime, which has significant implications for how we live, work, think, learn, entertain ourselves, socialize, find love, and more. The speculations are also partly based on the progress we are making in other technologies like robotics, IoT, bio-tech, and quantum computing, and the hope that more powerful AI systems mean more major breakthroughs in these areas, as we saw in the case of protein folding.&nbsp;</p><p>Even before AGI is here, the rate at which LLMs are getting better at tasks beyond simple linguistic ones is remarkable, and the expectation is that we will see significant progress towards automation in tasks that are more traditionally reserved for the human brain. Multi-modal language models and their cousins, especially when combined with more traditional software scaffolds, are expected to be able to do all sorts of tasks that require entry- to mid-level expertise relatively soon.
This type of automation is displacing many tasks that have traditionally been entry points for junior workers after going through the educational system. Routine tasks like data wrangling, report generation, writing software tests, and basic analyses&#8212;key areas where people typically learn on the job&#8212;are being automated by AI tools. As a result, fewer opportunities for hands-on learning exist at the lower rungs of the professional ladder. The tasks left for humans often require advanced problem-solving, decision-making, and strategic thinking, which are typically handled by senior employees. This could lead to a bifurcated workforce, where junior talent lacks the experience needed to develop into senior positions, potentially leading to talent gaps in more senior, complex roles.</p><p>So, the question is, what do LLMs and their future iterations mean for the future of learning and talent development?</p><h1>Emerging Trends in Talent Development</h1><p>It is hard to predict where things will go given the pace of change and chaos created by lack of preparedness in the society, industry, and academia. But a few general patterns are most likely to happen:</p><ul><li><p><strong>Interdisciplinary Skills:</strong> The future of work is increasingly interdisciplinary. As AI integrates with fields like healthcare, finance, logistics, and even the humanities, talent development will need to focus on cultivating skills that cross boundaries between disciplines (e.g., AI for biology, AI ethics, or AI in creative industries). This shift will encourage universities and industries to promote cross-disciplinary education where AI workers (human or machine) collaborate with domain experts.</p></li><li><p><strong>AI Automation and Tooling:</strong> There&#8217;s a growing trend toward automated workflow execution and no-code / low-code platforms which require less technical know-how and might operate with natural language as an interface. 
Talent development will likely focus on higher-order problem-solving skills rather than specific subject matters like manual coding, as more AI tools abstract away the implementation details. The emphasis will shift to developing business acumen and problem-definition capabilities&#8212;being able to frame business problems and align them with AI solutions will become critical.</p></li><li><p><strong>Fluid Academic Disciplines and Accreditation:</strong> The fluidity in learning and working, particularly in AI-driven environments where automation and interdisciplinary skills are critical, does raise important questions about the future of formal, distinct academic disciplines. Rather than seeing academic disciplines as isolated silos, they will be viewed as building blocks for more fluid, interdisciplinary research and industrial work. And instead of solely focusing on the subject matter of the discipline, education will focus on methods of inquiry and problem-solving that are crucial in interdisciplinary collaboration. For instance, the scientific method in the natural sciences, the design thinking process in engineering, and the interpretative methods in social sciences all become the direct learning objectives. Finally, education could shift to a more modular structure where students can specialize in certain disciplinary areas but take modules from multiple fields to create customized learning pathways. This would maintain the rigor of disciplinary knowledge while allowing flexibility for interdisciplinary applications.</p></li><li><p><strong>Experiential Learning and Research:</strong> Universities will balance traditional disciplinary learning with experiential and project-based learning that reflects the real-world challenges of AI and automation. 
By integrating distinct academic disciplines with applied, hands-on learning experiences, universities will prepare students to work across boundaries in industry while still being grounded in the depth of formal knowledge.</p></li><li><p><strong>Expert Mentors and AI Gyms for Juniors:</strong> With the automation of the grunt work, more senior employees will spend some of their time creating learning paths and challenges in AI-powered learning playgrounds where junior employees work with their AI learning buddies to fill the talent gap and become skilled in advanced problem-solving, critical thinking, and strategic planning.&nbsp;</p></li></ul><p>These are likely trends for undergraduate-level education and early career development, and the next question becomes the impact of the AI revolution on the postgraduate part of academia and research.</p><h1>Blurring Academia / Industry Boundaries</h1><p>Another important trend is that the translation gap, the work required for an academic idea to become operational in industry, has been shrinking as well. With more research and development firms getting funded, more corporations starting their research labs, and more prominent scientists working for the tech giants or starting their own companies, the separation of research responsibilities between academia and industry is quickly vanishing. This intermingling has started to bridge the gap between the technical knowledge required for developing complex AI systems and the operational expertise necessary to manage them.&nbsp;</p><ul><li><p><strong>Bridging the Tech-Business Divide:</strong> With more co-pilots and agent systems deployed, business people are more empowered to interact with traditionally academic artifacts like language models and provide feedback about how those can fit into their operational workflows better.
Applied scientists, on the other hand, spend more time with their industry counterparts learning about what is needed to move their models and pipelines from the abstract world of the lab to the messy, real world.&nbsp;</p></li><li><p><strong>AI System Lifecycle Management: </strong>Academics will focus more on the entire AI system lifecycle&#8212;covering design, deployment, monitoring, and governance&#8212;to ensure researchers and students understand both the technical intricacies of building AI systems and the operational aspects of maintaining and scaling them over time. This would even extend to topics like governance, fairness, bias, and ethics built into AI systems by design.&nbsp;</p></li><li><p><strong>Human-Machine Interaction:</strong> Research and development in human-AI interaction will become increasingly relevant in the gray area between academia and industry as AI systems are integrated into everyday life and workplaces. Researchers and their industry counterparts will focus on designing AI systems that complement human decision-making and create seamless interactions between AI tools and users.</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Verdict</h1><p>The future of work and education is increasingly intertwined with technological advancements, especially as the onset of the 4th industrial revolution reshapes industries. As AI and automation accelerate, the tasks traditionally performed by junior workers may become obsolete, requiring educational systems to evolve beyond rigid structures like the four-year degree. Interdisciplinary learning, lifelong upskilling, and adaptability will be critical as the distinction between academic disciplines blurs in response to workforce demands. The role of industry in driving innovation, as seen in Nobel Prize-winning research from both academic and industrial sectors, underscores the importance of collaboration. 
Universities and organizations must create sustainable talent pipelines that focus on practical problem-solving and continuous learning, ensuring workers remain equipped for complex roles.</p>]]></content:encoded></item><item><title><![CDATA[“Move fast and break things” is a failed VC math]]></title><description><![CDATA[As someone who wasted a lot of time and money in my startup, I advocate for a more measured approach with a special attention to the infrastructure necessary for moving fast.]]></description><link>https://aisc.substack.com/p/move-fast-and-break-things-is-a-failed</link><guid isPermaLink="false">https://aisc.substack.com/p/move-fast-and-break-things-is-a-failed</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 14 Oct 2024 18:22:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!AMKc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I often like to point out why I take the time to write a piece. So, here it is for this one: I hear from some founders-to-be who book me for advice that they want to &#8220;move fast and break things&#8221;. 
They say it in different ways like &#8220;oh, I just need to get this demo together, and then I&#8217;ll raise money and scale quickly&#8221; or &#8220;XYZ raised money with a rough idea and no business model&#8221; etc.</p><p>As someone who bought into that ethos without thinking about it and wasted a lot of time and money, I advocate for a more measured approach with special attention to the infrastructure necessary for moving fast.&nbsp;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1><strong>A History Lesson</strong></h1><p>A quick look at history teaches us a lot about this. &#8220;Move fast and break things&#8221; was a direct response to the traditional waterfall approach, which originated in manufacturing and later became a dominant method in software development. Waterfall emphasizes a sequential, phase-by-phase approach to projects, where each stage (like design, development, and testing) is completed before moving to the next. This method is known for its slow, rigid processes, which can hinder innovation and adaptability.
Waterfall methodologies were popular because they were designed to minimize risk and prevent mistakes by following a linear, highly controlled process.</p><p>In this climate, Mark Zuckerberg's mantra of "move fast and break things" emerged at Facebook in its early stages to foster rapid experimentation, especially in a tech landscape where companies needed to iterate quickly to stay competitive. The idea was to prioritize speed and agility, even if it meant making mistakes along the way. The goal was to learn and adapt faster than competitors, recognizing that perfection often leads to paralysis. This approach stands in stark contrast to the conservative and cautious ethos that was common: one might even say an over-correction for the existing trends.&nbsp;</p><p>It is not surprising that this took over Silicon Valley and later the broader startup conversation in Western countries, given what was going on in VC land. The tech world was recovering from the dot-com crash, new tech trends like social media, search engines, e-commerce, and cloud infra were seeing promising examples, and ideologies like the lean startup were starting to shape up. All this was pushing venture capital more and more towards a &#8220;disruption&#8221;-focused narrative and a &#8220;go big or go home&#8221; mentality. The way this manifested itself was that VCs started coming up with all sorts of theses around how they picked their so-called outliers; of course, they would occasionally make mistakes, and the way their math works, it&#8217;s best for those mistakes to run out of business as quickly as possible in favor of the ones that prove to be more promising. See the connection? </p><p>Tiger Global, one of the early investors in WeWork, was the firm that took this philosophy to another extreme by following a &#8220;what can go right&#8221; mentality and therefore investing 10x more than what the firm needed to be great, in favor of massive velocity.
While this approach had remarkable success, it also produced spectacular failures like WeWork&#8217;s botched IPO, and other examples that a quick Google search would surface.&nbsp;</p><h1><strong>Role of Infrastructure</strong></h1><p>The world we live in today is very different from the one in which &#8220;move fast and break things&#8221; was a relevant anti-trend. Since then, Facebook has changed its mantra to &#8220;move fast with the right infrastructure&#8221; and the VC world has corrected its philosophy to focus on startup infrastructure and ecosystems well past &#8220;throw money at it and hope for the best&#8221;.&nbsp;</p><p>In this journey, the tech world has come up with so many interesting technical, organizational, and operational practices to enable moving fast and safely. DevOps is one of the most remarkable examples, having significantly reduced the cost of failure in software development by introducing tools and processes that automatically integrate, test, monitor, and scale code changes. You no longer need significant investment in racks of hardware in your basement because you can provision, use, and wind down virtual machines on the cloud with a few clicks. And especially in the post-pandemic world, you can find cross-functional collaborators or gig experts within your org or from across the globe with a few online calls.&nbsp;</p><p>So, the right question for a founder in 2024 should be more like: what infrastructure / environment / ecosystem do I need so that I can get the necessary failures out of the way quickly, cheaply, and safely? Capital is definitely one of the important parameters, but certainly not the only one.
Velocity is definitely an important factor but not at the cost of safety or learning quality.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AMKc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AMKc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AMKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/def79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503856,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AMKc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!AMKc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdef79c2a-2219-45bb-86f4-f07df8c870ea_1024x1024.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1><strong>Verdict</strong></h1><p>If you are an early stage founder and you&#8217;re thinking about &#8220;moving fast, and breaking things&#8221;, then take a moment to re-evaluate your context. Are you driving a Fiat 500 on an unmaintained back alley, or a Ferrari LaFerrari on a German Autobahn? If you try to move fast in the former you&#8217;re going to injure yourself, anyone around you, and the vehicle you&#8217;re in. 
You need to think carefully, deeply, and, most importantly, honestly about what you need to fail at first, how you can do that cheaply, and then how you can streamline and refine your experimental process.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 6 - State Management]]></title><description><![CDATA[How can we control the behavior of agents by drawing lines around the boundaries of their agency?]]></description><link>https://aisc.substack.com/p/llm-agents-part-6-state-management</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-6-state-management</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Mon, 02 Sep 2024 14:37:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9c6f00db-bab5-42a7-81a3-b68f361ddde1_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At this point, we've seen how Service-Oriented Architecture (SOA) and Event-Driven Architecture (EDA) boost modularity, responsiveness, and scalability in our multi-agent system.
However, these architectures don't fully address the complexities of managing internal task progression or multi-step workflows within agents. That's where State Management steps in, providing an explicit structure to agent behaviors and system-wide data flow. In this article, we'll explore how State Management can significantly improve multi-agent systems.</p><p>State management in multi-agent systems is all about defining the playground for autonomy. It's like drawing boundary lines that let agents explore and act within a landscape of possible states, guided by the rules of the system. This balance between freedom and structure ensures each agent can play its part while keeping the overall system in harmony.<br><br>Consider LLM agents in our biotech sales example. An agent processing potential leads might freely prioritize and categorize them, but it can't access financial records or communicate with clients directly&#8212;those boundaries are set by the state management system. Additionally, certain states might be conditionally available. For instance, the agent may only access prior communication history for a client if those documents are tagged as unclassified, ensuring that sensitive data is only handled when relevant.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>What Is a State?</h3><p>State, in the context of software systems, represents the condition or status of an application or its components at a specific point in time.</p><p>In a multi-agent system, state can encompass various elements:</p><ul><li><p>Agent Internal State: The current status, knowledge, beliefs, and decision-making parameters of individual agents.</p></li><li><p>Task State: The progress of ongoing tasks or processes within the system.</p></li><li><p>Environment State: The current condition of the environment in which the agents operate.</p></li><li><p>System-wide State: The overall status of the multi-agent system, including inter-agent relationships and global parameters.</p></li></ul><p>As you can see already, there are multiple levels of granularity that can exist when describing the state of the overall system and its components and their sub-components, and so on. </p><p>Each agent in our system maintains its own internal state, which influences its decision-making processes and actions. For example, the Lead Qualification Agent's state might include the criteria it's currently using to evaluate leads, while the Proposal Generation Agent's state could include the sections of a proposal it has completed.</p><h3>What is State Management?</h3><p>State Management is the practice of organizing, tracking, and controlling the state of a software system. In multi-agent systems, it extends to coordinating the states of individual agents, their underlying services, and the overall system state. 
</p><p>State Management provides mechanisms for:</p><ul><li><p>Defining possible states and transitions between them</p></li><li><p>Updating state or transition based on events or actions</p></li><li><p>Propagating state changes to relevant parts of the system</p></li><li><p>Ensuring consistency across distributed components</p></li></ul><p>For example, in our biotech sales system, a Lead Management Agent might progress through states such as "Lead Identified," "Lead Qualified," "Proposal Generated," and "Deal Closed." Each state represents a distinct phase in the sales process, with transitions driven by specific events or conditions.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h3>Benefits of State Management in Multi-Agent Systems</h3><p>Implementing robust State Management in multi-agent systems offers several key advantages:</p><ul><li><p><strong>Agent Autonomy and Interaction: </strong>State Management provides a framework for representing an agent's internal state and its relationship to the overall system state, modeling its decision-making process, and enabling its interaction with other agents. This is crucial in multi-agent systems where agents are autonomous entities making decisions based on their internal states and perceptions of the environment. 
In the absence of effective state management, the state space available to autonomous agents might be too wide to allow efficient decision making and progress.</p></li><li><p><strong>Managing Complexity: </strong>As agents become more sophisticated and handle more complex workflows, it becomes essential to have a clear structure governing how they move through their tasks. State Management provides this structure by explicitly defining a series of states and transitions, ensuring that agents follow logical and predictable paths. The modular nature of a state-based representation of the system dynamics also makes the system easier to understand and improve. For example, we might have a hierarchical structure with top-level states selected via routing-style transitions, followed by a series of transitions within their substructures.&nbsp;</p></li><li><p><strong>Ensuring Task Completion: </strong>By explicitly defining states and transitions, State Management ensures that agents complete all necessary tasks before moving on to the next phase. This is particularly important in processes where each step must be completed before the next can begin. For example, in our biotech sales system, the Business Development Agent must complete the "Qualify Lead" task before moving to the "Assess Viability" task.</p></li><li><p><strong>Improved Coordination:</strong> By clearly defining and managing states, we ensure that all agents have a consistent understanding of the system's status, leading to better coordination. This might involve tracking important variables and their current values, which can be used as inputs to next-best-action selection as part of the system state.&nbsp;</p></li><li><p><strong>Enhanced Reliability:</strong> State Management helps prevent agents from entering invalid states or performing actions out of sequence, reducing errors in complex processes.
This could manifest as a guardrail: for example, preventing a system designed to help sell biotechnology products from answering questions about politics, or preventing the system from placing a sales order before a required checklist of approvals is completed.&nbsp;</p></li><li><p><strong>Increased Scalability:</strong> As we add more agents or expand the system's capabilities, a well-structured State Management approach makes it easier to integrate new components without disrupting existing workflows. Thinking about the system design as states and their corresponding transitions is naturally modular, with easier pathways for extensibility.&nbsp;</p></li><li><p><strong>Better Observability:</strong> With centralized State Management, it becomes easier to monitor the system's overall status, track progress, and identify bottlenecks in various processes. All of us have experienced the nightmare of tracking the values of our variables as they flow through the pipelines, especially as the software becomes more complex, and now we also have to do that for complex data objects that contain lots of natural language strings.</p></li><li><p><strong>Simplified Debugging:</strong> When issues arise, having a clear state model makes it easier to trace the sequence of events and identify the root cause of problems. This is a consequence of higher visibility into the inner workings of the system, but also an outcome of having a more unified pattern of investigation.&nbsp;</p></li><li><p><strong>Adaptive Behavior:</strong> State Management allows agents to adapt their behavior based on their current state and the state of the system, enabling more intelligent and context-aware decision-making.&nbsp;</p></li></ul><p>It's worth noting that state management techniques have been used in traditional software development for decades, particularly in areas such as user interface design, game development, and embedded systems.
This same principle is now being applied to multi-agent systems, allowing us to manage the complexity of agent behaviors in a similar manner.</p><h3>Implementing State Management in the Biotech Sales Example</h3><p>To better understand the role of State Management in our multi-agent system, let's apply it to the biotech sales scenario. Consider the Business Development Agent, which follows a structured series of states to evaluate and qualify leads. This agent might progress through the following states, which align with its services:</p><ol><li><p>Lead Identified (Lead Generation Service)</p></li><li><p>Lead Qualified (Lead Qualification Service)</p></li><li><p>Viability Assessed (Viability Assessment Service)</p></li><li><p>Objections Handled (Objection Handling Service)</p></li><li><p>Meeting Scheduled (Meeting Scheduling Service)</p></li></ol><p>Each state is defined by specific actions and rules for transitioning to the next state and can contain more granular sub-states. For example, the transition from "Lead Qualified" to "Viability Assessed" might depend on whether the lead meets certain qualification criteria set by the Lead Qualification Service.</p><p>This structured approach ensures that the agent doesn't skip crucial steps, like assessing the technical fit of a lead before scoring it. It also enables the agent to handle errors, such as missing data, by transitioning to an error-handling state and retrying the process. Another strength of the State Management approach is that it enables agents to handle routing gracefully.
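</p>

<p>As a rough sketch, the Business Development Agent's states can be captured in an explicit transition table, with illegal moves rejected and an error state for retries. The names and the table itself are assumptions made for illustration, not the article's actual implementation:</p>

```python
# Hypothetical transition table for the Business Development Agent.
# Each state lists the states it may legally move to.
TRANSITIONS = {
    "lead_identified":    {"lead_qualified", "error"},
    "lead_qualified":     {"viability_assessed", "error"},
    "viability_assessed": {"objections_handled", "error"},
    "objections_handled": {"meeting_scheduled", "error"},
    "error":              {"lead_identified"},  # retry from the start
    "meeting_scheduled":  set(),                # terminal state
}

class BusinessDevelopmentAgent:
    def __init__(self):
        self.state = "lead_identified"

    def transition(self, new_state: str) -> None:
        # Reject any move not allowed from the current state.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

agent = BusinessDevelopmentAgent()
agent.transition("lead_qualified")         # allowed
try:
    agent.transition("meeting_scheduled")  # skips two phases, so it is rejected
except ValueError:
    agent.transition("error")              # fall back to error handling
```

<p>Because every move is checked against the table, the agent cannot jump straight from qualifying a lead to scheduling a meeting, which is exactly the guardrail behavior discussed earlier.</p><p>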
For example, it helps in choosing the right chain of actions or pathways in the workflow, ensuring that the agent follows the optimal path given the current state and context.</p><h3>Challenges and Considerations</h3><p>While State Management offers numerous benefits, it also comes with challenges that need to be considered:</p><ul><li><p><strong>Complexity:</strong> As the number of states and transitions grows, the system can become complex and harder to manage. In larger multi-agent systems with multiple agents like our Business Development Agent, Sales Agent, and Customer Simulation Agent, it's crucial to keep the state diagrams well-organized and ensure that transitions are clear and logical.</p></li><li><p><strong>Redundancy:</strong> In some cases, similar actions might need to be performed in multiple states across different agents. To avoid redundancy, it's important to identify common actions and abstract them into reusable components that can be called by different states or services.&nbsp;</p></li><li><p><strong>Debugging Transitions:</strong> While state management can simplify debugging by providing clear states, identifying issues in transitions, especially in complex decision-making logic, can still be challenging.
Careful testing and monitoring are essential to ensure smooth operation across all agents and their services.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Wrapping Up</h2><p>As we've explored throughout this article, the combination of Service-Oriented Architecture (SOA), Event-Driven Architecture (EDA), and State Management forms a solid foundation for building sophisticated multi-agent systems like our biotech sales platform. By combining these architectural patterns, we create multi-agent systems that are Flexible, Scalable, Responsive, Robust, and Maintainable.</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 5 - Communication Protocol in Agentic Systems]]></title><description><![CDATA[How should agents and services communicate, coordinate, and keep track of tasks?]]></description><link>https://aisc.substack.com/p/llm-agents-part-5-communication-protocol</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-5-communication-protocol</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 21 Aug 2024 19:36:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/06d1e28f-5518-42a0-adc4-61fe91821cbf_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In our <a href="https://aisc.substack.com/s/ai-llms">multi-agent systems series</a>, we have started by introducing what agents are and how multi-agent systems emerge as a natural evolution of the software architecture as we move on to more complex workflows. 
We explored how Service-Oriented Architecture (SOA) can be applied to create flexible, <a href="https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm">modular multi-agent systems</a>, and then looked at how it can be used for a <a href="https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi">biotech sales organization as our example</a>.</p><p>While SOA provides a solid foundation for structuring our multi-agent system, it doesn't fully capture the dynamic nature of complex, real-time interactions. SOA tells us what the components are, but it doesn't address how these components interact or manage their internal workflows. To create truly responsive and adaptable systems, ones that can eventually mimic some degree of agency, we need to go beyond static structures and incorporate patterns that handle the flow of information and the progression of tasks.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Today, we&#8217;ll be building on that foundation by introducing another critical architectural component: Event-Driven Architecture (EDA).</p><h2>Service-Oriented Architecture (SOA)</h2><p>Before getting into EDA, let's first recap why Service-Oriented Architecture really matters for multi-agent systems. 
SOA is all about breaking down complex processes into manageable, independent services. In our biotech sales system, SOA allowed us to modularize the entire sales process into independent, specialized services. Each service, such as lead generation, lead qualification, viability assessment, and proposal writing, operates like a specialized team within a business, each performing a distinct role. The key here is that these services communicate with each other through well-defined interfaces, promoting loose coupling. This, we argued, results in a system that won't fall apart every time you need to make a change.</p><p>For example, let's say you decide to develop a more advanced logic for lead qualification. In a monolithic system, this could be a nightmare, potentially affecting everything from data input to proposal generation. But with SOA, you can update the lead generation service without disrupting other services like proposal writing or viability assessment. This flexibility is what makes SOA such a powerful bedrock for multi-agent systems.</p><h2>Event-Driven Architecture (EDA)</h2><p>Now, let's talk about Event-Driven Architecture. EDA isn't new, it's a well-established design pattern that's been around for years. But its application in multi-agent systems, while in its infancy, is where things get interesting, and potentially messy if not handled correctly.</p><p>EDA is a software design pattern that emphasizes real-time system response to events. An "event" is a significant state change such as a new lead being identified or a proposal being finalized. In EDA, components produce and consume these events, triggering further actions across the architecture. It promotes decoupled, asynchronous interactions, making systems more flexible and scalable.</p><p>This approach has been used in systems like enterprise applications, where services like customer orders, payments, and inventory updates function independently but are synchronized through events. 
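</p>

<p>A minimal in-memory event bus illustrates the pattern. This is a toy sketch under assumed names (a production system would typically sit on a message broker), not any specific library's API:</p>

```python
from collections import defaultdict

class EventBus:
    """Routes events from producers to subscribed consumers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
log = []

# The lead qualification service reacts to new leads without knowing
# (or caring) which service produced them.
def qualify_lead(payload):
    log.append(f"qualifying {payload['companyName']}")
    bus.publish("LeadQualified", {**payload, "qualificationScore": 85})

bus.subscribe("NewLeadIdentified", qualify_lead)
bus.subscribe("LeadQualified",
              lambda p: log.append(f"drafting proposal for {p['companyName']}"))

bus.publish("NewLeadIdentified", {"companyName": "PharmaCorp"})
# log now holds ["qualifying PharmaCorp", "drafting proposal for PharmaCorp"]
```

<p>The producer and consumers never reference each other directly; adding, say, a pricing analysis listener is just one more <code>subscribe</code> call.</p><p>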
The same principles now apply to multi-agent systems, where agents can respond to events without being tightly coupled to other agents or services.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h3>Why EDA in Multi-Agent Systems?</h3><p>In the context of our biotech sales system, EDA allows us to design a responsive system where services and agents react to events as they occur, without waiting for direct interactions. For example, when the lead generation service identifies a new potential client, it produces a "New Lead Identified" event without necessarily knowing or being impacted by how other services or agents might interact with that event. This event triggers actions in services that subscribe to that event type such as lead qualification, market analysis, and proposal generation. This architecture choice would lead to flexibility to adapt to evolving business needs by adjusting events and agent interactions without requiring significant system-wide changes:</p><ul><li><p><strong>Real-time Responsiveness:</strong> EDA ensures that when an event like identifying a new lead occurs, multiple agents can start immediately, such as the lead qualification and market analysis agents.&nbsp;</p></li><li><p><strong>Decoupling:</strong> One of the core principles of EDA is decoupling. In this approach, agents or services react to events independently, without any direct connection. In the biotech example, the lead qualification agent doesn&#8217;t need to know how the lead generation agent works. It just reacts to the event that the lead agent produces. 
This allows the system to remain modular and flexible.</p></li><li><p><strong>Scalability:</strong> New agents, say a pricing analysis agent, can be easily integrated to listen for relevant events and act without disrupting existing workflows.</p></li></ul><h3>Mechanics of EDA in Multi-Agent Systems</h3><h4>1. Event Producers and Consumers</h4><p>In EDA, agents or services are categorized as <a href="https://blog.hubspot.com/marketing/event-driven-architecture#:~:text=both%20these%20designs.-,Publisher/Subscriber%20Architecture,-In%20a%20publisher">event producers</a> (those that trigger events) or <a href="https://blog.hubspot.com/marketing/event-driven-architecture#:~:text=both%20these%20designs.-,Publisher/Subscriber%20Architecture,-In%20a%20publisher">event consumers</a> (those that react to events). In many cases, an agent can play both roles. For example:</p><ul><li><p>The lead generation service identifies a potential client and creates a "New Lead Identified" event.&nbsp;</p></li><li><p>The lead qualification service consumes this event, evaluates the lead, and produces a "Lead Qualified" event.&nbsp;</p></li><li><p>The proposal generation service consumes the "Lead Qualified" event to start preparing the proposal.&nbsp;</p></li></ul><p>This approach allows for greater flexibility, as services can operate independently but are still coordinated by events.</p><h4>2. The Event Bus: System Coordination</h4><p>The event bus is the backbone of the EDA system, routing events between producers and consumers. In the biotech sales scenario, it ensures that when the lead qualification service produces a "Lead Qualified" event, it is automatically routed to all relevant services&#8212;such as proposal generation, pricing strategy, and market analysis.</p><p>This centralized coordination ensures that agents and services stay decoupled, yet the entire system stays synchronized as events flow through the architecture.</p><h4>3. 
Event Schemas: Standardizing Communication</h4><p>Event schemas define data structure, standardize communication, and ensure correct data interpretation.</p><p>For example, in the biotech sales system, a "Lead Qualified" event schema might look like this:</p><pre><code>{

"eventType": "LeadQualified"

"lead_Id": "12345"

"companyName": "PharmaCorp"

"potentialValue": 500000,

"productInterest": ["Lab Equipment", "AI Drug Discovery"]

"qualificationScore": 85

}</code></pre><p>This standardization allows agents to communicate consistently, ensuring that data is interpreted correctly, like predefined contracts (APIs) in a microservices architecture.</p><p>In LLM-based agentic architectures, large language models are often used to create and / or interpret some of the components of the schema that are best captured as natural language or domain specific language. For example, the event produced by the &#8220;Proposal Writing&#8221; service might contain a field called &#8220;content&#8221; that provides a nested dictionary with section titles and paragraphs of the proposal. That nested dictionary is likely to be generated using an LLM call (potentially a <a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">RAG based sub-system</a>). On the other hand, the receiver of the event would also likely need an LLM call to interpret and react to that data object.</p><h3>Challenges and Considerations</h3><p>While EDA has many benefits, there are also some challenges to consider:</p><ul><li><p><strong>Event Storage: </strong>As the system grows, the number of events increases, making efficient event storage crucial. Event sourcing patterns, which are commonly used in traditional EDA systems, can be applied to reconstruct system states from past events.</p></li><li><p><strong>Debugging Complexity: </strong>Tracing the flow of events in a large system can be challenging, especially when issues arise. Distributed tracing tools are often required to pinpoint problems in the event chain.</p></li><li><p><strong>Over-communication:</strong> If systems are not carefully managed, they can become overwhelmed with too many events. 
It&#8217;s important to balance responsiveness with efficiency to avoid performance bottlenecks.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Event-based Communication Protocol</h2><p>In this article we explored event-driven architectures as one of the more promising communication protocols in multi-agent systems. We looked at how this architecture choice complements the service-oriented architecture that we previously discussed as a design pattern for constructing multi-agent products based on existing business workflows. In the next parts of this series, we will introduce more architecture concepts, and will eventually discuss how they will combine to create a full picture of multi-agent systems.&nbsp;</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 4 - Workflow to Multi-agent Architecture Design]]></title><description><![CDATA[In this video we walk through the process of mapping a workflow into its components and using that to design the Service Oriented Architecture of the corresponding multi-agent software system.]]></description><link>https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-4-workflow-to-multi</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 14 Aug 2024 18:46:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147716624/a165a6dbf2ba81c27b48471a584bfd70.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h1>BACKGROUND</h1><p>To understand this better, feel free to skim the previous writeups 
first;</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d24277eb-7f72-41b1-aab9-caa899deb8c2&quot;,&quot;caption&quot;:&quot;In this write up we will go over the most important principles you should follow as you ideate, validate, design, and build your LLM product. One thing that you will realize by the end of this is that the principles of building the most sophisticated multi-agent LLM products is the same as the ones for any LLM product and ultimately the same as the ones&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 1 - The &#8220;9&#8221; Commandments: How to Build LLM Products Successfully&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-24T14:04:40.061Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fabc6b99-0675-42df-822f-ba5208156475_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments&quot;,&quot;section_name&quot;:&quot;AI (LLMs, etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:146927799,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:6,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random 
Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;16628f79-a8cd-4901-9b2f-20316e4e03a2&quot;,&quot;caption&quot;:&quot;An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals. These agents can range from simple systems, such as thermostats or basic control mechanisms, to highly complex AI-powered systems. The exact definitions and the thresholds necessary to attribute&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 2 - What the Heck are Agents, anyway?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-30T16:42:12.821Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are&quot;,&quot;section_name&quot;:&quot;AI (LLMs, 
etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:147161934,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;44c08753-98e5-474a-b4cb-44f6a49b0611&quot;,&quot;caption&quot;:&quot;Why write this?&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLM Agents, Part 3 - Multi-Agent LLM Products: A Design Pattern Perspective&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-07T14:37:46.817Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c5993ce-069a-4b3e-85bd-f806578cfacc_1536x1536.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm&quot;,&quot;section_name&quot;:&quot;AI (LLMs, 
etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:147448982,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h1>TRANSCRIPT SUMMARY:</h1><p>Here our focus is to transform the regular sales workflow of a Bio Tech Sales company into a modular, maintainable, and scalable multi-agent architecture. We&#8217;ll explore how we can use a traditional architecture design called Service-Oriented Architecture (SOA) to create a sophisticated multi-agent system.</p><h2>Current Workflow of Bio Tech Sales Scenario</h2><p>Before we dive into architectural patterns, it's crucial to understand the current process we're working with.
Let's first break down the sales workflow of Bio Tech Sales company:</p><ol><li><p>The founder identifies potential pharmaceutical company leads</p></li><li><p>A sales engineer evaluates the leads and asks clarifying questions</p></li><li><p>The technical lead suggests relevant use cases, and the founder estimates ROI</p></li><li><p>The sales engineer assesses the feasibility of use cases with the tech lead</p></li><li><p>A proposal is drafted, refined, and sent to the pharma company prospect</p></li><li><p>The prospect's team (medical scientist, chemist, CFO) reviews the proposal</p></li><li><p>Clarifications and objection handling occur via email</p></li><li><p>A meeting is scheduled if the proposal is accepted</p></li></ol><h2>Service-Oriented Architecture</h2><p>Before we apply SOA to our biotech sales scenario, let's briefly recap what it is and why it matters.</p><p>Service-Oriented Architecture (SOA) is a design approach used in traditional software architectures to break down complex systems into independent, modular services that communicate through standardized interfaces. These services are designed to perform specific business functions while remaining loosely coupled and reusable, allowing for greater flexibility and scalability across the system. SOA enables different services to operate autonomously, while still being able to collaborate when needed, ensuring that systems can evolve and adapt without requiring a complete overhaul.</p><p>You can read about SOA in detail <a href="https://aws.amazon.com/what-is/service-oriented-architecture/">here</a>.</p><h2>Applying SOA to the Biotech Sales System</h2><p>With a better understanding of SOA in traditional software architectures, let's now apply these concepts to our biotech sales scenario.</p><p>In order to identify our key agents, let&#8217;s break down the workflow into discrete, reusable services grouped together based on theme or domain of work. 
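</p>

<p>One way to express the "standardized interfaces" SOA asks for is a shared base class that every service implements. The sketch below is a hedged illustration with assumed names and a made-up qualification threshold, not the actual system:</p>

```python
from abc import ABC, abstractmethod

class Service(ABC):
    """A loosely coupled unit of work with a uniform entry point."""

    @abstractmethod
    def handle(self, request: dict) -> dict:
        ...

class LeadQualificationService(Service):
    MIN_SCORE = 70  # illustrative threshold, not a real business rule

    def handle(self, request: dict) -> dict:
        score = request.get("qualificationScore", 0)
        return {"leadId": request["leadId"], "qualified": score >= self.MIN_SCORE}

class ProposalGenerationService(Service):
    def handle(self, request: dict) -> dict:
        return {"proposal": f"Proposal for lead {request['leadId']}"}

# Callers depend only on the Service interface, so any implementation
# can be swapped, updated, or scaled without touching the others.
result = LeadQualificationService().handle(
    {"leadId": "12345", "qualificationScore": 85}
)
```

<p>Swapping in a smarter qualification model later only means replacing one subclass; nothing upstream or downstream needs to change.</p><p>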
This approach will allow for greater flexibility, easier maintenance, and improved scalability.</p><p>Once we have the grouping based on our workflow analysis, we can identify three key agents:</p><ol><li><p>Business Development Agent</p></li><li><p>Sales Agent</p></li><li><p>Customer Simulation Agent</p></li></ol><p>These agents will act as logical groupings of related services, each responsible for a specific aspect of the sales process. We're also keeping a human in the loop - the founder - who will provide critical inputs and oversight.</p><h3>Business Development Agent Services</h3><ol><li><p>Lead Generation Service: Identifies potential pharma company leads.</p></li><li><p>Lead Qualification Service: Evaluates and qualifies the identified leads.</p></li><li><p>Viability Assessment Service: Assesses the viability of pursuing a lead, including suggesting relevant use cases and estimating ROI.</p></li><li><p>Objection Handling Service: Manages clarifications and objections raised by prospects.</p></li><li><p>Meeting Scheduling Service: Arranges meetings with interested prospects.</p></li></ol><h3>Sales Agent Services</h3><ol><li><p>Feasibility Assessment Service: Evaluates the technical feasibility of proposed solutions, working closely with the technical lead.</p></li><li><p>Proposal Generation Service: Drafts, refines, and sends proposals to pharma company prospects.</p></li></ol><h3>Customer Simulation Agent Service</h3><ol><li><p>Proposal Review Service: Simulates the review process by the pharma company's medical scientist, chemist, and CFO.</p></li></ol><p>Each of these services follows SOA principles with standardized interfaces, loose coupling, abstracted internal complexity, and potential reuse in different contexts.</p><h2>Walkthrough of Service Oriented System</h2><p>Let's walk through how this SOA-based system would handle a typical sales process:</p><ol><li><p>The process begins with the Business Development Agent's Lead Generation Service 
identifying potential pharma company leads using data from sources like Crunchbase and PitchBook.</p></li></ol><ol start="2"><li><p>The Lead Qualification Service then evaluates these leads, possibly using a machine learning model to score them based on predefined criteria.</p></li></ol><ol start="3"><li><p>For qualified leads, the Viability Assessment Service kicks in, suggesting relevant use cases and estimating ROI. This service might use a combination of historical data and AI-driven forecasting.</p></li></ol><ol start="4"><li><p>The system notifies the founder (our human-in-the-loop) via Slack, presenting the qualified leads and viability assessments. The founder can provide feedback, add additional insights, or approve moving forward.</p></li></ol><ol start="5"><li><p>Once approved, the Sales Agent's Feasibility Assessment Service evaluates the technical feasibility of the proposed solutions. This might involve analyzing technical requirements and consulting internal knowledge bases.</p></li></ol><ol start="6"><li><p>The Proposal Generation Service then creates a tailored proposal based on all the gathered information. This could involve using templates and AI-driven customization.</p></li></ol><ol start="7"><li><p>The generated proposal is sent to the Customer Simulation Agent's Proposal Review Service, which simulates the review process of the pharma company's team. 
This service might use NLP to analyze the proposal and generate realistic objections based on historical data.</p></li></ol><ol start="8"><li><p>Any objections or requests for clarification are handled by the Business Development Agent's Objection Handling Service, which might use a combination of pre-defined responses and AI-generated explanations.</p></li></ol><ol start="9"><li><p>If the simulated customer is satisfied, the Meeting Scheduling Service arranges a follow-up meeting, integrating with calendar systems like Google Calendar or Outlook.</p></li></ol><p>Throughout this process, the system maintains a shared memory where all actions and important data are recorded. This allows for better coordination between services and provides a clear audit trail.</p><h2>Benefits of This SOA Approach</h2><p>By applying SOA principles to our biotech sales system, we gain several advantages:</p><ul><li><p>Modularity: Each service can be developed, tested, and maintained independently. If we need to update our lead scoring algorithm, we can do so without touching the proposal generation system.</p></li></ul><ul><li><p>Scalability: Individual services can be scaled based on demand. If we're seeing a surge in lead generation, we can allocate more resources to that service without affecting others.</p></li></ul><ul><li><p>Flexibility: New services can be added or existing ones modified as business needs evolve. 
For instance, if we later want to add a pricing optimization service, we can do so without overhauling the entire system.</p></li></ul><ul><li><p>Reusability: Services like Objection Handling could potentially be reused in different contexts, not just in initial sales but also in account management.</p></li></ul><ul><li><p>Improved Efficiency: By automating many of the time-consuming aspects of the sales process, we free up human resources to focus on high-value activities like relationship building and strategic decision-making.</p></li></ul><h2>Conclusion</h2><p>In this part of our multi-agent series, we demonstrated how to transform a real-world workflow into a modular and scalable multi-agent system using Service-Oriented Architecture (SOA). By breaking down complex processes into discrete and reusable services, we created a flexible system that can adapt easily without any major overhauls.</p><p>Next, we'll explore how these services interact with each other, while examining the communication patterns that enable collaboration between agents.</p>]]></content:encoded></item><item><title><![CDATA[Your Business Needs a Fractional Chief AI Officer]]></title><description><![CDATA[You have heard that most AI projects die at infancy, and you are committed to figure out how to do it right!]]></description><link>https://aisc.substack.com/p/your-business-needs-a-fractional</link><guid isPermaLink="false">https://aisc.substack.com/p/your-business-needs-a-fractional</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Fri, 09 Aug 2024 14:13:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97a82fc3-504b-4f77-abe7-47c4b0c549b8_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Here are some facts about you:</p><ul><li><p>You are excited about what AI can do for your business.</p></li><li><p>You have built a business around your wealth of domain knowledge but AI hasn't been your 
priority yet.&nbsp;</p></li><li><p>You have heard that most AI projects die in infancy (POC or MVP stage), and you are committed to figuring out how to do it right.</p></li></ul><p>If the above describes your general thought process, you are in the company of many other business owners and startup founders who realize the value of AI but also don&#8217;t want to overlook the risks.&nbsp;</p><h2>The AI Adoption Paradox</h2><p>In a recent <a href="https://www2.deloitte.com/content/dam/Deloitte/us/Documents/consulting/us-state-of-gen-ai-report.pdf">Deloitte report</a>, 79% of the participant organizations thought that their businesses would be drastically impacted by generative AI in the next 3 years. However, in the same report and a <a href="https://www.ibm.com/downloads/cas/K82XE4MA">similar one by IBM</a>, participant organizations indicated their biggest barriers to AI adoption as follows:</p><ul><li><p>Can we sustain a <strong>strategic edge</strong> with AI? 66%</p></li><li><p>How can we use AI with <strong>data security</strong> in mind? 53%</p></li><li><p>How can we use AI with <strong>data privacy</strong> in mind? 
51%</p></li><li><p>How can AI achieve our desired <strong>business-grade accuracy</strong>? 47%</p></li></ul><p>These are all very important concerns, but guess what was by far the largest concern of all? <strong>78%</strong> of the organizations felt unprepared in terms of <strong>access to the right AI talent</strong> to execute their vision. This is a big deal because, unlike what some vendors might try to convey, doing AI properly inside your business is not a trivial activity. It requires strategic thinking about your data, your software systems, and the talent available in your business currently and in the near future. Who is most suitable for this then? A person who has done it a few times in a few different contexts.&nbsp;</p><h2><strong>The Role of a Chief AI Officer</strong></h2><p>The title itself is less important than what this person should do for you:</p><ul><li><p>Meet you where you are in terms of <strong>business objectives and concerns</strong>, denoise through the hype, and translate what trendy topics mean for you and what you should pay attention to.</p></li><li><p><strong>Strategize and create a concrete roadmap</strong> of data, software, infrastructure, and AI capabilities that need to come together to build and maintain AI integration into your business workflows.&nbsp;</p></li><li><p>Bring enough <strong>technical chops</strong> to the table <strong>to oversee the execution and implementation</strong> by your current talent or any newly hired ones (that they help vet).&nbsp;</p></li></ul><p>Perhaps that is why the <a href="https://economicgraph.linkedin.com/content/dam/me/economicgraph/en-us/PDF/future-of-work-report-ai-november-2023.pdf">LinkedIn report on the Future of Work</a> observed that the &#8220;head of AI&#8221; title has tripled over the past 5 years!</p><p>Well, at this point you might say, yes, yes, and yes, but first of all, I don&#8217;t know how to fill up 40 hours a week of work for this person, but also I doubt I can afford this 
person for 40 hours a week! You are in luck, my friend! Read on!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://calendly.com/amirfzpr&quot;,&quot;text&quot;:&quot;BOOK A DISCOVERY CALL&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://calendly.com/amirfzpr"><span>BOOK A DISCOVERY CALL</span></a></p><h2><strong>Understanding Fractional Work</strong></h2><p>Fractional work refers to a professional arrangement where an individual provides services to multiple clients on a part-time or project-based basis. In essence, they offer their full range of expertise to you in a "fraction" of their time. Key benefits of fractional executives include:</p><ul><li><p><strong>Lower overhead</strong> compared to full-time executives.</p></li><li><p><strong>Speed and Agility</strong> to address specific challenges or opportunities.</p></li><li><p>Offer a fresh, <strong>objective perspective</strong> on the business and technological opportunities.</p></li></ul><h3><strong>Factors Driving Fractional Work Popularity</strong></h3><p>Several factors contribute to the increasing popularity of fractional work, particularly at the executive level:</p><ul><li><p>Companies are becoming more cautious about committing to full-time, high-salary executive positions. Fractional work offers a flexible and <strong>cost-effective alternative</strong>.</p></li><li><p>In many industries, finding qualified executives can be challenging. Fractional executives provide access to a <strong>broader talent pool</strong> by themselves or through their often large networks.&nbsp;</p></li><li><p>Companies often require <strong>specific skills or experience</strong> for short-term projects. 
Fractional executives can be brought in to address these needs <strong>without the overhead of maintaining a full-time hire after the project ends</strong>.</p></li></ul><p>In the current climate, where the technology landscape rapidly shifts and in turn results in sudden changes in business opportunities and risks, fractional work is likely to become an even more prevalent model for organizations seeking to optimize their talent and resources.</p><h2><strong>Fractional CAIO vs. Consultant: Understanding the Difference</strong></h2><p>While similar in capacity of engagement, i.e. both roles being at arm&#8217;s length to a company, there are differences: Fractional workers typically take on specific roles within an organization for an extended period, often acting as part-time executives. They are involved in day-to-day operations, while consultants are usually project-based, offering advice and recommendations without long-term commitment or deep integration into the company. Consultants, in a lot of cases, engage with companies in a serial-monogamy kind of way, while fractional execs split their time amongst several companies at any given time. There is no right or wrong answer, and the question is what your business needs.&nbsp;</p><p>The very simplified decision heuristic is something like this: if you know what to do and need more of a worker-bee type, you need a consultant, but if you need someone with a broad range of knowledge to strategize with you, then you need a fractional exec. Of course, like everything in life, there&#8217;s a gray area in the middle where you could have a bit of both, but that comes with trade-offs, as you can imagine. 
So, it is important to be very clear about what can most help your business at this time.&nbsp;</p><h2><strong>Finding Your Fractional CAIO</strong></h2><p>Well, the usual places!&nbsp;</p><ul><li><p>Freelance marketplaces like Upwork.&nbsp;</p></li><li><p>Specialized fractional platforms offering a curated pool of talent.</p></li><li><p>LinkedIn, an excellent resource for finding experienced tech professionals.</p></li><li><p>Tech conferences and meetups.</p></li></ul><h3>Finding the Right Fit</h3><p>Yes, the internet is full of guidelines for how to do this, but I think it&#8217;s pretty simple: is this a person you can trust to help you achieve your business goals? There are a few observations you can make: a) is this person a recognizable authority on the topic? b) are they a &#8220;yes-person&#8221; or do they challenge your assumptions in a constructive way? c) do you see yourself brainstorming with them for hours about a problem you need to solve?</p><p>Of course, these might be a bit tough to judge, but you need to be honest with yourself about them and spend enough time with the candidate to determine if this is the right arrangement for your goals; otherwise, the opportunity cost will be high. Here are some questions and topics you can discuss with them to help uncover insight about the above:</p><ol><li><p><strong>How do you see AI driving our specific business objectives?</strong> This question explores their AI knowledge, but importantly within the context of your industry and vertical.</p></li><li><p><strong>What are the steps of your approach for developing and implementing an AI strategy for our organization?</strong> The most important function of this question is to see if they have done it before.</p></li><li><p><strong>What are the biggest risks projects like this face, and how have you helped mitigate them in the past?</strong> Risks are abundant, so this explores how aware of them they are and how much experience they have dealing with them. Hint: if they&#8217;re not concerned about your data and its quality, big red flag!</p></li></ol><p>Ultimately though, these aren&#8217;t yes-or-no questions, and the goal is for you to assess how comfortable you feel with their thought process, communication, and potentially working with them.</p><h1>Can I Hire You as Our Fractional Chief AI Officer?</h1><p>Ah! I thought you&#8217;d never ask! Yes, you can. Let&#8217;s see what those who did say!</p><div class="pullquote"><p>&#8220;We engaged Amir and his team when we had a prototype of our gen ai product. Our goal was to increase the reliability and generalizability of our product to ingest a larger breadth of Canadian immigration law documents. They helped us navigate advanced AI techniques and strategies to organize and modularize our software to add new features without compromising the existing ones. Their domain expertise, effort, and clarity of communication enabled us to scale our immigration law chatbot from handling hundred documents to several thousands, and growing confidently.&#8221; -Josh S. 
CEO, Visto AI</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://calendly.com/amirfzpr&quot;,&quot;text&quot;:&quot;BOOK A DISCOVERY CALL&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://calendly.com/amirfzpr"><span>BOOK A DISCOVERY CALL</span></a></p><p>What makes me unique is this:</p><ul><li><p>As someone who has run a business for nearly 6 years, I understand the importance of ROI and the real challenges preventing businesses from realizing their tech dreams.&nbsp;</p></li><li><p>As someone who has built a community of more than 5000 AI researchers and engineers, I can bring a wide range of talent to the table to deal with any emergent technical challenges.</p></li><li><p>As someone who has been deep in the trenches of the technical details of AI for the past 10 years, I know how to translate and connect the business and tech worlds in the most effective way.&nbsp;</p></li></ul><div><hr></div><p>Hopefully this article helps you understand another option that is available to you, regardless of whether I am that option!</p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 3 - Multi-Agent LLM Products: A Design Pattern Perspective]]></title><description><![CDATA[In this article, we will explore how established software design principles can be applied to the emerging trend of multi-agent large language model (LLM) systems.]]></description><link>https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-3-multi-agent-llm</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 07 Aug 2024 14:37:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1c5993ce-069a-4b3e-85bd-f806578cfacc_1536x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Why write this?</h1><p>I see lots of 
&#8220;multi-agent&#8221; frameworks out there and I, personally, think most of them are nonsense. They are nonsense because they try to paint a rosier picture of what it really takes to build extremely complex intelligent software systems. For example, they claim that if you get a few LLMs to talk to each other in natural language, you have a software system that robustly solves your complex business problems. Or that if you throw a large crew of LLMs at a problem, they can reliably do sales and marketing and operations for your business. I think the creators of (and perhaps those who get excited about) these either have never written serious software, or are just interested in the academic exercise of &#8220;what if&#8221; rather than building anything that can actually go into production.</p><p>Starting from principles is such an important thing to do when proposing and building a new complex framework, and I&#8217;m utterly surprised by how unimportant it seems in many of the proposed frameworks. Hopefully, I have convinced you that going back to the basics of RL is important in thinking through agentic workflows, and in this article my attempt is to convince you that going back to software design principles is the way to go about creating multi-agent systems. </p><h1>Software Architecture and ML</h1><p>In this article, we will explore how established software design principles can be applied to the emerging trend of multi-agent large language model (LLM) systems. We will examine how traditional software design patterns, such as Domain-Driven Design (DDD), Service-Oriented Architecture (SOA), and microservices architecture, contribute to the development of these multi-agent systems.</p><p>Traditional design patterns provide a robust framework for software development. By integrating machine learning (ML) into these patterns, we can introduce a new dimension to software architecture. ML enables probabilistic routing between software components, replacing pre-programmed deterministic routing. This integration not only enhances the functionality of individual components but also introduces new capabilities. Both LLMs and specialized ML models, and often a combination of the two, can be utilized to achieve these improvements.</p><p>The incorporation of LLMs into software systems brings a broad range of benefits, making them more dynamic and flexible. These systems can exhibit a diverse set of behaviors without the need for explicit programming, which offers a significant advantage. However, this flexibility comes at a cost: such systems can be harder to predict, maintain, and debug reliably.</p><p>Communication methods between components, previously services and more recently agents, remain consistent with traditional approaches, using REST, GraphQL, JSON, and DSLs. 
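</p><p>As a toy illustration of the probabilistic routing mentioned above (all names here are hypothetical, and a real system would call an LLM or a trained classifier rather than counting keywords), a scored router can replace a hard-coded dispatch:</p>

```python
# Sketch: ML-driven routing between components, replacing a hard-coded
# if/else dispatch. `score_intent` is a toy stand-in for a model call.

def score_intent(request: str) -> dict[str, float]:
    """Score which service should handle a request (stand-in for an LLM call)."""
    keywords = {
        "billing_service": ["invoice", "refund", "charge"],
        "catalog_service": ["product", "price", "stock"],
        "support_service": ["help", "broken", "error"],
    }
    return {
        service: float(sum(word in request.lower() for word in words))
        for service, words in keywords.items()
    }

def route(request: str, handlers: dict) -> str:
    """Dispatch the request to the highest-scoring service."""
    scores = score_intent(request)
    best = max(scores, key=scores.get)
    return handlers[best](request)

handlers = {
    "billing_service": lambda r: f"billing handled: {r}",
    "catalog_service": lambda r: f"catalog handled: {r}",
    "support_service": lambda r: f"support handled: {r}",
}

print(route("What is the price of this product?", handlers))
```

<p>Swapping the keyword scorer for a real model changes the scores, not the dispatch logic, which is what makes this routing style straightforward to retrofit onto an existing service layer.</p><p>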
However, the introduction of natural language as an interface adds a new layer of complexity, with its own set of advantages and challenges. These hybrid systems, combining predetermined and probabilistic behavior, may become the new standard in software development.</p><p>In the following sections, we will delve deeper into the concepts of DDD, SOA, and microservices architecture. We will explain how DDD focuses on modeling software based on real-world domains with isolated data sharing between domains. We will also explore the benefits and challenges of this new approach, drawing parallels to successful microservices implementations to illustrate suitable use cases.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Domain-Driven Design: The Foundation</h1><p>DDD emphasizes modeling software around the core domain of a business. It advocates for a common language shared by developers and domain experts, ensuring everyone speaks the same language. DDD breaks down the domain into bounded contexts, areas with well-defined and segregated responsibilities, and often minimal dependency on other areas.</p><p>Bounded contexts ensure that complexity is manageable by focusing on specific aspects of the domain. This focus also promotes better communication and understanding between developers and domain experts. By breaking down the domain into bounded contexts, we lay the groundwork for introducing agents with specialized capabilities, each responsible for a specific bounded context within the larger multi-agent LLM system. 
Just as bounded contexts promote modularity and focus within the domain, agents with bounded responsibility will do the same.</p><h3>Example: E-commerce Platform</h3><p>Consider an e-commerce platform. DDD could be used to define several bounded contexts:</p><ul><li><p>Customer Management: Handles customer accounts, profiles, and preferences.</p></li><li><p>Product Catalog: Manages product information, categories, and pricing.</p></li><li><p>Order Processing: Processes orders, manages inventory levels, and handles payments.</p></li><li><p>Content Management: Creates and manages product descriptions, promotions, and other content.</p></li></ul><p>Note that each of these is borrowed from the business domain of commerce, not only to facilitate better communication between stakeholders and developers but also to tap into the robust nature of tried-and-true business workflows. Each bounded context has its own data entities, business rules, and common language. This modular approach allows developers to focus on specific areas of functionality without getting overwhelmed by the complexity of the entire system.</p><h2>From Bounded Contexts to Services</h2><p>SOA takes the concept of bounded contexts from DDD and maps them to services. Each service encapsulates a specific domain functionality and exposes a well-defined interface. This promotes loose coupling, allowing services to evolve independently without impacting others.</p><p>Microservices architecture takes SOA a step further by creating even smaller, more focused services. Unlike SOA, microservice architectures scope each service as narrowly as possible, often down to a single function. This approach offers greater agility, scalability, and resilience. 
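</p><p>To make the loose coupling concrete, here is a minimal sketch (with invented names echoing the e-commerce example) of two services, each owning its data behind a small public interface:</p>

```python
# Sketch: two loosely coupled services. OrderService depends only on
# ProductService's public methods, never on its internal storage.

class ProductService:
    """Owns product data; other services see only its public methods."""
    def __init__(self):
        self._inventory = {"sku-1": {"name": "Widget", "price": 9.99, "stock": 5}}

    def get_product(self, sku: str) -> dict:
        return dict(self._inventory[sku])  # return a copy; internal state stays private

    def reserve(self, sku: str, qty: int) -> bool:
        item = self._inventory[sku]
        if item["stock"] >= qty:
            item["stock"] -= qty
            return True
        return False

class OrderService:
    """Coordinates an order through ProductService's interface."""
    def __init__(self, products: ProductService):
        self._products = products

    def place_order(self, sku: str, qty: int) -> dict:
        if not self._products.reserve(sku, qty):
            return {"status": "rejected", "reason": "insufficient stock"}
        price = self._products.get_product(sku)["price"]
        return {"status": "accepted", "total": round(price * qty, 2)}

orders = OrderService(ProductService())
print(orders.place_order("sku-1", 2))
```

<p>Because OrderService never touches ProductService&#8217;s internal dictionary, the storage behind ProductService could later become its own database or remote API without the order code changing.</p><p>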
Each microservice owns its data and logic, promoting independent development and deployment.</p><h3>Example: E-commerce Platform - Microservices Breakdown</h3><p>The platform would be composed of independent, loosely coupled microservices:</p><ul><li><p>Customer Service: This service manages customer accounts, profiles, login credentials, and preferences. It would expose APIs for user registration, login, profile management, and wishlist functionalities.</p></li><li><p>Product Service: This service handles product information, including descriptions, categories, images, pricing, and availability. It would provide APIs for product search, filtering, retrieving product details, and managing inventory levels.</p></li><li><p>Recommender Service: This service handles proactive product recommendations functionality and integrates with the Product Service for data retrieval.</p></li><li><p>Search Service: This service handles product search functionality and integrates with the Product Service for data retrieval.</p></li><li><p>Order Service: This service oversees the order processing flow. It would handle actions like adding items to the cart, managing shopping carts, initiating checkout, processing payments, and managing order fulfillment. The order service would interact with both the Product Service and the Payment Service.</p></li><li><p>Payment Service: This service handles secure payment processing, integrating with various payment gateways. It would expose APIs for initiating payments, handling authorization, and receiving transaction confirmations.</p></li><li><p>Content Management Service: This service focuses on creating and managing website content, including product descriptions, promotions, blog posts, and other informational pages. It would provide APIs for content creation, editing, and publishing.</p></li></ul><p>Each microservice would expose well-defined APIs for other services to interact with. 
For example:</p><ul><li><p>When a customer adds an item to the cart, the frontend sends an API request to the Cart functionality within the Order Service.</p></li><li><p>The Order Service might then interact with the Product Service to retrieve product details and confirm availability.</p></li><li><p>Upon checkout, the Order Service would communicate with the Payment Service to initiate the payment process.</p></li></ul><p><em>You might notice that some of the things that you&#8217;ve imagined as &#8220;multi-agent&#8221; systems could be achieved simply by a well-designed software system.</em> </p><p>The challenge with a system like this is that, although it might include some narrow-scope AI components, it is ultimately passive and fairly rigid in what it can do. Combining the reliability of software written within those design principles with the flexibility of emergent capabilities offered via LLMs can be a winning formula.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Multi-Agent LLMs: A New Design Pattern</h1><p>Multi-agent LLMs borrow heavily from the principles discussed above. Just like microservices, they consist of multiple, specialized agents (in addition to services), each focusing on a specific aspect of the task. These agents collaborate and leverage services to achieve a common goal, similar to how services interact through APIs.</p><h2>Beyond Microservices: Active Agents vs. Passive Data Handlers</h2><p>Microservices excel in building modular, scalable software systems. 
However, they primarily function as passive data handlers, responding to requests and manipulating data. Multi-agent LLMs, on the other hand, take a leap forward by introducing &#8220;active&#8221; components inside these services, effectively allowing them to &#8220;make decisions&#8221; in scenarios without being deterministically programmed to do so. These agents can:</p><ul><li><p>Continuously monitor the situation, analyze data, and identify potential issues or opportunities.</p></li><li><p>Take initiative and perform actions without explicit instructions. This can involve initiating communication with other agents, retrieving information, or even triggering predefined workflows.</p></li><li><p>Collaborate and negotiate with each other to achieve a common goal. This allows for dynamic decision-making and adaptation to unforeseen circumstances.</p></li></ul><p>This shift from passive data handling to active agents unlocks new possibilities:</p><ul><li><p>Complex Task Automation: Multi-agent LLMs can automate complex tasks that require reasoning, planning, and collaboration across different domains. Imagine a system with a service constantly monitoring traffic patterns, augmented with an agent analyzing weather data, and a second one making decisions about rerouting deliveries to avoid congestion &#8211; a scenario beyond the pre-determined nature of microservices.</p></li><li><p>Emergent Behavior: LLMs themselves show emergent properties; they can classify text, extract entities, and more, although they are only explicitly trained on predicting the next most likely token. When LLM-powered agents, fine-tuned to strengthen any of these properties or augmented with tools that give them specialized capabilities, interact with each other, non-trivial collective behavior might appear. 
The semantic flexibility, although less controllable in nature, combined with the reliability of JSON-based communication between various software services, agentic or otherwise, could result in systems that work in ways that are more &#8220;expected&#8221; by human operators, for example by adapting and responding to situations in ways that might not be explicitly programmed.</p></li><li><p>Continuous Improvement: The modular, and to some extent potentially redundant, nature of multi-agent systems means that improving the overall system is not constrained to improvements in a single component. For example, an agent that is fine-tuned to do task decomposition effectively can help other agents do that task well by providing examples in a few-shot setup. In a more mature setup, each component inside agents, potentially small LM or non-LM models, can have a feedback loop through which it is continuously retrained and improved. This could include models that are involved in the policies of individual agents or the overall system.&nbsp;</p></li></ul><h3>Contracts, Languages, and Communication</h3><p>Microservices architectures thrive on clear and well-defined communication. This communication relies on predefined API contracts, essentially agreements that dictate how services interact with each other. These contracts act like sheet music for an orchestra, ensuring each microservice plays its part seamlessly.</p><p>REST APIs and JSON are the cornerstones of these contracts. REST (Representational State Transfer) defines a standardized architecture for requesting and receiving data between services. 
JSON (JavaScript Object Notation) acts as the &#8220;language&#8221; for transmitting data, offering a lightweight and human-readable format for exchanging information.</p><p>Agentic systems use these existing mechanisms and will also introduce a new dimension to communication, adding two more communication types:</p><ul><li><p>Domain-Specific Languages (DSLs): These are custom languages tailored to a specific domain or purpose. Imagine a trading agent responsible for capital market transactions using a combination of statistics, machine learning, and business logic rules. Communicating this info in natural language is too complex and error-prone, and in JSON it is too limited. However, a DSL (imagine a set of pseudocode snippets describing the logic of the rules) used as a communication contract between the controller agent and the executor agent can be the most efficient channel. DSLs offer more expressiveness and efficiency compared to generic JSON data, but require specialized knowledge to understand and implement.</p></li><li><p>Natural Language (NL): This is the most human-like form of communication. Agents could potentially communicate and share information using natural language processing (NLP) techniques. However, natural language is inherently ambiguous and prone to misinterpretations. While offering the most flexibility, NL communication is also the least robust and requires advanced NLP capabilities to manage effectively.</p></li></ul><p>Even in the realm of multi-agent systems, the established approach of API calls and JSON data exchange remains the most reliable and robust communication method. It provides a clear and well-defined path for information exchange. DSLs offer a middle ground, balancing expressiveness with control. Finally, natural language communication, while offering the most flexibility, comes with the greatest risk of misunderstandings and requires significant development effort to implement effectively. 
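</p><p>Two of these channel types can be sketched in a few lines (the payloads and the toy DSL grammar are invented for illustration): a JSON contract is rigid but machine-checkable, while a small DSL carries rule logic that generic JSON would express awkwardly:</p>

```python
import json

# 1) JSON contract: both sides agree on the schema up front (fields invented).
json_msg = json.dumps({
    "from": "query_analyzer",
    "to": "info_retrieval",
    "rewritten_query": "q3 revenue pharma",
    "filters": {"year": 2024},
})
payload = json.loads(json_msg)
assert payload["to"] == "info_retrieval"  # reliable, schema-checkable routing

# 2) Tiny DSL: more expressive than JSON for rule logic, still parseable.
# Assumed grammar: "IF <field> <op> <value> THEN <action>"
def eval_rule(rule: str, record: dict):
    _, field, op, value, _, action = rule.split(maxsplit=5)
    ok = record[field] > float(value) if op == ">" else record[field] < float(value)
    return action if ok else None

print(eval_rule("IF price > 100 THEN flag_for_review", {"price": 150.0}))
```

<p>A natural-language message would be the third channel; as noted above, it trades this kind of checkability for flexibility.</p><p>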
In all likelihood, a product you design would tap into all of these communication channels between services, and therefore agents, to achieve the best balance between performance and control.</p><h2>Shared Benefits and Challenges</h2><p>Both multi-agent LLMs and microservices architectures offer several advantages:</p><ul><li><p><strong>Modularity:</strong> Break down complex tasks into smaller, manageable units.</p></li><li><p><strong>Scalability:</strong> Scale individual agents or services independently based on needs.</p></li><li><p><strong>Resilience:</strong> If designed right, the adaptability of agent policies provides some redundancy, so the failure of one agent or service doesn't cripple the entire system.</p></li><li><p><strong>Independent Deployment:</strong> Deploy and update individual agents/services without affecting others.</p></li></ul><p>However, both approaches also come with challenges:</p><ul><li><p><strong>Increased Complexity:</strong> Managing interactions and dependencies between agents/services requires careful planning.</p></li><li><p><strong>Testing and Debugging:</strong> Debugging issues that span multiple agents/services can be intricate. Also, the probabilistic nature of agents can make systems built with them considerably harder to debug.</p></li><li><p><strong>Distributed System Management:</strong> Distributing resources and ensuring consistent behavior across agents/services adds complexity.</p></li></ul><p>Therefore, multi-agent systems, unlike what vendors may tell you, are not a silver bullet; choosing to solve a business problem with them comes down to a careful pros/cons analysis. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>How to design multi-agent architectures?</h1><ol><li><p>Map the workflow humans execute to achieve a particular objective, including the people, processes, and tools involved.</p></li><li><p>Draw context boundaries around parts of the process that are self-contained.</p><ol><li><p>You want each of these areas to have minimal data dependency on another one (if they share / exchange a lot of data they might have to be merged).</p></li><li><p>You want each of these areas to have minimal functional dependencies on another one (if most of the time a change in one requires a change in another one, they should be merged into one context).</p></li></ol></li><li><p>Decide whether each of these contexts is a regular software service or needs to become &#8220;agentic&#8221; and mark them as such (&#8220;Payment Processing Service&#8221;, &#8220;Data Analyzer Agent&#8221;).</p><ol><li><p>Note that each agent might contain microservices (&#8220;PDF Parser microservice&#8221;, &#8220;info retrieval microservice&#8221;).</p></li></ol></li><li><p>Add any other services necessary that your architecture doesn&#8217;t explicitly contain (&#8220;User Management Service&#8221;, &#8220;Shared Memory Service&#8221;).</p></li><li><p>Determine how data flows through the system (&#8220;PDF goes from File Upload service to parsing service&#8221;, &#8220;JSON containing rewritten query and search filters goes from query analyzer service to info retrieval service&#8221;).&nbsp;</p><ol><li><p>Revise the system modularity (context boundaries) to minimize data 
movement.</p></li></ol></li><li><p>Determine and document communication protocols between different services and agents.</p><ol><li><p>In most interfaces, the protocol should be REST-based and the payload should be JSON for robustness purposes.</p></li><li><p>If that fails to meet your requirements, try DSLs; only if that also fails, use natural (or formal) language.</p></li><li><p>It is OK to use natural language for most of your interfaces in a quick-and-dirty prototype implementation, but you should remind yourself that, in all likelihood, it will not meet the reliability threshold for user-facing production deployment.&nbsp;</p></li></ol></li><li><p>Determine how you would unit test each microservice.</p><ol><li><p>If conceptualizing a unit test for a microservice is too complex, it might be a sign that it needs to be broken into smaller pieces.</p></li><li><p>The unit test for a service would be all component unit tests passing. The unit test for an agent might be a less trivial heuristic over how the component unit tests behaved (in principle, agents should be able to recover from some failing components; for example, an agent can choose a document search action if the Google search action is failing).</p></li></ol><p></p></li></ol><p><strong>Congratulations! You designed your first multi-agent architecture.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 2 - What the Heck are Agents, anyway?]]></title><description><![CDATA[An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals, ranging from simple systems, such as thermostats to complex AI systems.]]></description><link>https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-2-what-the-heck-are</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Tue, 30 Jul 2024 16:42:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xE1S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>An Intelligent Agent (IA) is an autonomous entity that observes and acts upon an environment to achieve specific goals. These agents can range from simple systems, such as thermostats or basic control mechanisms, to highly complex AI-powered systems. The exact definitions and the thresholds necessary to attribute <a href="https://plato.stanford.edu/entries/agency/">agency</a> to a system are up for debate and can only be contextually discussed. 
However, most IAs possess some or all of the following key properties:</p><ul><li><p>Autonomous operations</p></li><li><p>Reactive to the environment&nbsp;</p></li><li><p>Proactive (goal-directed)&nbsp;</p></li><li><p>Interactive with other agents (via the environment)</p></li></ul><p>To better understand how the concept has evolved, its current state, and the potential future of IAs, it's essential to trace their history and examine the key milestones that have shaped the field. But first&#8230;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Why should I care?</h1><p>Are agents yet another hype cycle that will soon die out? Or are they the next major platform shift?</p><h2>Why are we excited about LLM Agents?</h2><p>The emergence of LLM agents has sparked considerable excitement in the AI community. Their ability to comprehend and generate coherent text, undertake complex tasks, and exhibit autonomous behavior has opened up a wide array of possibilities. 
One of the key factors contributing to this excitement is the potential of LLM agents to serve as <a href="https://arxiv.org/pdf/2402.02716">planning modules</a> for autonomous agents.</p><p>Open-source LLMs have also reached a point where they can effectively drive agentic workflows. For example, the integration of LLMs into systems where they can call tools has further enhanced their capabilities, allowing them to perform more complex and diverse tasks.</p><p>This growth has led to a significant rise in both the development and adoption of LLMs as core components of autonomous agents. The excitement surrounding them is further fueled by their potential to function as artificial general intelligence systems (AGI), capable of performing a wide range of tasks with human-like proficiency. However, it is important to note that there are still significant challenges to be addressed before LLM agents can truly achieve such advanced capabilities.</p><h2>I am working on use case X, should I really care about LLM agents?</h2><p>No, if:</p><ul><li><p>Your task is well-defined and specific.</p></li><li><p>Needs a single function like grammar checking, text summarization, or code generation.</p></li><li><p>Doesn't require remembering past interactions or context.</p></li><li><p>Operates solely on the information provided in the current prompt.</p></li></ul><p>Examples:</p><ul><li><p>Highlighting grammatical errors in a document.</p></li><li><p>Creating a concise summary of a lengthy article.</p></li><li><p>Translating a simple sentence from one language to another.</p></li><li><p>Generating different creative text formats based on a single prompt (e.g., poems, scripts).</p></li></ul><p>Yes, if:</p><ul><li><p>Your task is more complex and involves multiple steps.</p></li><li><p>Needs the ability to remember past interactions and context.</p></li><li><p>Benefits from accessing and interacting with external tools or resources.</p></li><li><p>Requires a level of autonomy in 
completing the task.</p></li></ul><p>Examples:</p><ul><li><p>A virtual assistant that manages your schedule, checks weather data, and books appointments.</p></li><li><p>A system that analyzes customer reviews and recommends product improvements.</p></li><li><p>A chatbot that can answer complex questions by searching the web and integrating information from different sources.</p></li><li><p>A content creation tool that understands your previous creative decisions and generates content that aligns with your overall vision.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Brief History of Intelligent Agents</h1><p>The concept of Intelligent Agents has evolved alongside the development of Artificial Intelligence (AI), with its roots dating back to the 1950s. 
Let's take a look at a brief history of intelligent agents and how they have progressed over time.</p><ol><li><p><strong>1950s and before: The Dawn of AI</strong></p></li></ol><ul><li><p><strong>Turing Machine (1936):</strong> Though not an agent, Alan Turing's <a href="https://en.wikipedia.org/wiki/Turing_machine">theoretical model</a> provided a foundation for defining computation and intelligence.&nbsp;</p></li><li><p><strong>Turing Test (1950): </strong>Proposed by Alan Turing, <a href="https://en.wikipedia.org/wiki/Turing_test">this test</a> established a benchmark for a machine's ability to exhibit human-level intelligence.</p></li></ul><blockquote><p>These early concepts laid the groundwork for the development of autonomous agents in the following decade.</p></blockquote><ol start="2"><li><p><strong>1960s: The Rise of Autonomous Agents</strong></p></li></ol><ul><li><p><strong>ELIZA: </strong>A natural language processing <a href="https://en.wikipedia.org/wiki/ELIZA">program</a> created by Joseph Weizenbaum in the 1960s was one of the earliest intelligent agents capable of simulating a psychotherapist through natural language conversations.</p></li><li><p><strong>General Problem Solver (GPS):</strong> Developed by Herbert Simon, J.C. 
Shaw, and Allen Newell in the late 1950s, GPS was an <a href="https://stacks.stanford.edu/file/druid:zk239tp3547/zk239tp3547.pdf">early intelligent agent system</a> that could solve problems by searching through a space of possible solutions, laying the foundation for future problem-solving agents.</p></li><li><p><strong>SHRDLU:</strong> Developed by Terry Winograd, <a href="https://hci.stanford.edu/winograd/shrdlu/AITR-235.pdf">SHRDLU</a> demonstrated rudimentary natural language processing capabilities to solve tasks in a simulated block world.</p></li></ul><blockquote><p>Building on these early successes, the 1970s and 1980s saw intelligent agents finding applications in specialized domains.</p></blockquote><ol start="3"><li><p><strong>1970s-1980s: Growth and Specialization</strong></p></li></ol><ul><li><p><strong>MYCIN:</strong> An early expert system designed for medical diagnosis, <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2464549/">MYCIN</a> showcased the potential of knowledge-based systems in specialized domains.</p></li><li><p><strong>Shakey the Robot (1970s):</strong> A mobile robot from SRI International, <a href="https://www.sri.com/hoi/shakey-the-robot/">Shakey</a> pioneered basic navigation and manipulation tasks in a controlled environment.</p></li></ul><blockquote><p>As AI technology advanced, the 1990s and 2000s witnessed the rise of intelligent agents in more practical and everyday applications.</p></blockquote><ol start="4"><li><p><strong>1990s-2000s: The Rise of Practical Applications</strong></p></li></ol><ul><li><p><strong>Deep Blue (1997):</strong> IBM's <a href="https://www.chess.com/article/view/deep-blue-kasparov-chess">Deep Blue</a>, a chess-playing computer, defeated chess grandmaster Garry Kasparov, demonstrating AI's potential for complex decision-making.&nbsp;</p></li><li><p><strong>Roomba Vacuum Cleaner (2002):</strong> The <a href="https://about.irobot.com/History">Roomba</a> became a popular example of IAs entering everyday 
life, performing basic cleaning tasks autonomously.</p></li></ul><blockquote><p>In the 21st century, intelligent agents have become increasingly sophisticated and integrated into various aspects of our lives.</p></blockquote><ol start="5"><li><p><strong>2000s-Present: Evolution to Advanced Intelligent Agents</strong></p></li></ol><ul><li><p>Virtual personal assistants such as Siri, Alexa, and Google Assistant are prime examples of intelligent agents.</p></li><li><p>Self-driving cars, recommendation systems, and game-playing AI are other examples of intelligent agents.</p></li><li><p>NASA's mobile agents for human planetary exploration are some of the most advanced machines we have created.</p></li></ul><p>The 21st century has witnessed a remarkable surge in the development and deployment of intelligent agents across various domains. The evolution of powerful machine learning algorithms, coupled with the exponential growth in computing power and data availability, has enabled the creation of highly sophisticated autonomous systems. One of the most significant breakthroughs in this era has been the emergence of Reinforcement Learning (RL) as a key approach for training intelligent agents.</p><p>RL has proven to be a game-changer in the realm of game-playing AI, with notable examples such as <a href="https://deepmind.google/technologies/alphago/">AlphaGo</a>, which made history by defeating world champion Go players in 2016. This achievement highlights the potential of RL in enabling agents to learn and adapt to complex environments through trial-and-error learning and reward maximization. 
</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Agents in Reinforcement Learning</h1><p>Reinforcement Learning (RL) is a subfield of machine learning that focuses on training agents to make sequential decisions in an environment to maximize a cumulative reward signal. In RL, an agent interacts with its environment by taking actions, observing the resulting state, and receiving rewards or penalties based on its actions. The goal here is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2eTr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2eTr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 424w, https://substackcdn.com/image/fetch/$s_!2eTr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 848w, 
https://substackcdn.com/image/fetch/$s_!2eTr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 1272w, https://substackcdn.com/image/fetch/$s_!2eTr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2eTr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png" width="497" height="270" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:497,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17875,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2eTr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 424w, https://substackcdn.com/image/fetch/$s_!2eTr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 848w, 
https://substackcdn.com/image/fetch/$s_!2eTr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 1272w, https://substackcdn.com/image/fetch/$s_!2eTr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F284c39f7-7e29-47aa-a6d3-4fb8032e7aa4_497x270.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In RL, an &#8220;agent&#8221; is strictly defined by its <strong>policy</strong>, a mapping function from the current state to the most 
appropriate next action, informed by the reward earned from all previous actions.&nbsp;</p><p>Let&#8217;s look at an example. Say an RL agent is in charge of controlling the conversation flow inside a customer service chat system.&nbsp;</p><ul><li><p>&#8220;State&#8221; in this case is some indicator of progress extracted from the last utterance from the customer (e.g., an intent such as needing more information to purchase).&nbsp;</p></li><li><p>&#8220;Reward&#8221; could include positive points for retrieving relevant information, positive sentiment from the customer, or improved likelihood of a purchase or service renewal, and negative points for signs of frustration (e.g., repeated asks), negative sentiment from the customer, or an abandoned conversation.</p></li><li><p>The &#8220;environment&#8221; in this case is the current conversation, all the data we have about the customer (their purchase history, demographics, previous communication transcripts), and, say, data related to competing services.&nbsp;</p></li><li><p>&#8220;Actions&#8221; could include offering a solution, retrieving data to answer questions, and asking clarifying questions.&nbsp;</p></li><li><p>The &#8220;policy&#8221; (and therefore the agent) is the decision-making function (which could be learned from historical data, designed based on business logic, or both) that selects the next best action given the current state. For example, a &#8220;perception&#8221; function might evaluate the intent of the last utterance (e.g., complaint) to infer the state and, using that information, the policy determines what the best next action is (e.g., apologize and offer a discount).</p></li></ul><p>Note that while reward has a significant impact on how the agent behaves, it is not an internal property of the agent. It is instead a property determined by the designer of the system (part art, part science) to help it learn the desired goal-seeking behavior. 
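The perception-plus-policy split in this example can be sketched in a few lines; the intents, actions, and keyword rules below are toy assumptions standing in for what would normally be learned models or business logic:

```python
# Toy sketch of the customer-service example: a "perception" function infers
# the state from the last utterance, and the policy maps that state to the
# next best action. All rules here are illustrative assumptions.

def perceive(utterance: str) -> str:
    """Infer a coarse state (intent) from the customer's last utterance."""
    text = utterance.lower()
    if "refund" in text or "broken" in text:
        return "complaint"
    if "?" in text or text.startswith("how"):
        return "needs_information"
    return "browsing"

# Policy: a mapping from states to actions (hand-written here; it could be
# learned from historical data, designed from business logic, or both).
POLICY = {
    "complaint": "apologize_and_offer_discount",
    "needs_information": "retrieve_answer",
    "browsing": "ask_clarifying_question",
}

def next_action(utterance: str) -> str:
    return POLICY[perceive(utterance)]

print(next_action("My order arrived broken!"))  # apologize_and_offer_discount
```

In a real system, `perceive` would be an intent classifier and `POLICY` would be learned by maximizing the reward signal, but the agent is still, at its core, exactly this state-to-action mapping.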
In other words, the reward in this case is externally driven (as opposed to human behavior, for example, where the incentives could be internal or external).&nbsp;</p><h2>Properties of RL agents</h2><p>We can explore the key properties we mentioned at the beginning of this article for RL agents.&nbsp;</p><ol><li><p><strong>Autonomy: </strong>RL agents can make decisions independently based on their policy, without requiring explicit instructions or supervision.</p></li><li><p><strong>Interactivity:</strong> Agents continuously interact with their environment, including other agents, by taking actions and receiving feedback in the form of rewards or penalties.</p></li><li><p><strong>Adaptability:</strong> Through trial-and-error learning, RL agents can adapt their behavior based on the feedback they receive, allowing them to improve their performance over time.</p></li><li><p><strong>Goal-orientation: </strong>RL agents are driven by the objective of maximizing cumulative rewards, which enables them to learn optimal strategies for achieving specific goals.</p></li></ol><h3>Examples of RL agents</h3><ol><li><p><strong>Game-playing agents:</strong></p></li></ol><ul><li><p><strong>AlphaGo: </strong>Developed by DeepMind, <a href="https://www.nature.com/articles/nature24270.epdf?referrer_access_token=cN_-Yetiiun4Y0Ip-zBFfNRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-22SehS6IfIWP6NGb0V5cWu-EoVfAGki0u6km0LszOuzp5t4V7Cfz3E7s7jHnzbQ4qzuSk0MignCMZouLbzvn5mW4_ZI9kEVTtgqcdzq12Y3aiXmloMr_KNd0zP9S3BBjnC6li5BLTZTVhktlHeFkcuKPpJ-Zpdvpea7NURhoUaUYJnUOPG8u9gIO6GwgUwwfi">AlphaGo</a> is an RL agent that learned to play the complex game of Go at a superhuman level, defeating world champion players.</p></li><li><p><strong>OpenAI Five:</strong> Created by <a href="https://openai.com/index/openai-five">OpenAI</a>, this RL agent mastered the multiplayer video game Dota 2, showcasing the potential of RL in complex strategic environments.</p></li></ul><ol start="2"><li><p><strong>Robotics and 
autonomous systems:</strong></p></li></ol><ul><li><p><strong>Autonomous vehicles: </strong>Reinforcement learning is actively being explored for training self-driving cars. Companies like Waymo and Tesla utilize RL for tasks like lane following, obstacle avoidance, and optimizing driving behavior.</p></li><li><p><strong>Robotic manipulation:</strong> RL agents can learn to perform dexterous manipulation tasks, such as grasping and assembling objects, by learning from trial-and-error interactions with the environment.&nbsp;</p></li></ul><ol start="3"><li><p><strong>Recommendation systems:</strong></p></li></ol><ul><li><p><strong>News recommendation:</strong> RL agents can be employed to personalize news article recommendations based on user preferences and engagement, optimizing for long-term user satisfaction. The <a href="https://dl.acm.org/doi/pdf/10.1145/3178876.3185994">DRN</a> framework, for instance, uses Deep Q-learning to deliver personalized news content.</p></li><li><p><strong>E-commerce recommendations:</strong> By learning from user interactions and purchase history, RL agents can provide personalized product recommendations that maximize user engagement and revenue. This <a href="https://arxiv.org/pdf/2210.15451">paper</a> proposed to use deep reinforcement learning to recommend product sequences that sustain user interest and drive purchases.</p></li></ul><p>The examples we've explored demonstrate the wide range of applications for RL agents across various domains, from game-playing and robotics to recommendation systems. As research in RL continues to advance, we can expect to see even more innovative and impactful use cases for these adaptive, goal-oriented agents.</p><h2>RL and NLP</h2><p>One particularly exciting area where RL is making significant advancements is in the field of Natural Language Processing (NLP). 
By leveraging the power of RL, researchers and practitioners are developing agents that can effectively tackle complex language tasks, such as text generation, dialogue management, and summarization.</p><p>The intersection of Reinforcement Learning and Natural Language Processing has given rise to a new generation of language-based agents that can learn to generate, manipulate, and understand human language in increasingly sophisticated ways.</p><ol><li><p><strong>Text Generation Control:</strong></p></li></ol><ul><li><p>RL agents can be employed to control the style and content of text generation tasks, enabling the creation of tailored writing styles for different audiences.</p></li><li><p>For example, Reinforcement Learning with Human Feedback (<a href="https://arxiv.org/abs/2305.18438">RLHF</a>) has been used to train models that can generate text with desired style or tone and even content, opening up new possibilities for creative writing and content generation.</p></li></ul><ol start="2"><li><p><strong>Dialogue Management in Chatbots:</strong></p></li></ol><ul><li><p>RL agents can learn optimal conversation strategies in chatbots, allowing them to engage users more effectively and achieve specific goals.</p></li><li><p>By training RL agents to select appropriate responses based on user input and conversation context, chatbots can maintain engaging discussions, provide relevant information, and even assist in tasks like booking appointments or making recommendations.</p></li></ul><ol start="3"><li><p><strong>Text Summarization:</strong></p></li></ol><ul><li><p>RL agents can be applied to the task of text summarization, learning to generate concise and informative summaries of longer documents. 
This has also been used in the context of making <a href="https://www.youtube.com/watch?v=SGInyKjzF7A">language model prompts more efficient</a>.&nbsp;</p></li><li><p>By designing reward functions that encourage faithfulness to the original text, coherence, and brevity, RL agents can produce high-quality summaries that capture the key points of a document while maintaining readability.</p></li></ul><p>The potential of Reinforcement Learning in Natural Language Processing is truly remarkable. This enables the development of intelligent agents that can generate, manipulate, and understand human language in increasingly sophisticated ways. However, the concept of intelligent agents in NLP is not a recent development. In fact, it has been deeply rooted in the field since its early days, aiming to create systems that can understand and respond to human language in a meaningful way.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Agents in Natural Language Processing</h1><p>The idea of agents in NLP can be traced back to the field's early focus on interaction and reasoning. 
As researchers sought to develop systems capable of engaging in human-like communication, the notion of language-based agents naturally emerged.</p><ol><li><p><strong>Focus on Interaction and Reasoning:</strong></p></li></ol><ul><li><p>NLP has long been motivated by the desire to create systems that can understand and respond to human language, mimicking human-like interaction and reasoning capabilities.</p></li><li><p>This focus on interaction and reasoning naturally led to the conceptualization of agents as entities that can engage with users using natural language.</p></li></ul><ol start="2"><li><p><strong>Early NLP Systems as Agents:</strong></p></li></ol><ul><li><p>Some of the earliest NLP systems, such as SHRDLU (1972), can be considered simple agents in their own right.</p></li><li><p>SHRDLU, for example, could understand and respond to natural language questions about a simulated block world, showcasing basic reasoning capabilities within a limited domain.</p></li><li><p>These early systems laid the foundation for the development of more sophisticated language-based agents in the years to come.</p></li></ul><ol start="3"><li><p><strong>Dialogue Systems and Chatbots:</strong></p></li></ol><ul><li><p>The development of dialogue systems and chatbots heavily relied on the concept of agents, as these systems needed to process user input, understand intent, and generate appropriate responses.</p></li><li><p>Early chatbots, while less advanced than modern language models, were essentially software agents operating in the domain of human-computer conversation.</p></li><li><p>These systems paved the way for the more sophisticated conversational agents we interact with today.</p></li></ul><p>As NLP technologies continue to evolve, the concept of agents has taken on new dimensions, particularly with the advent of large language models. 
LLMs have unlocked unprecedented possibilities for creating intelligent, language-based agents that can understand and generate human-like text with remarkable coherence and contextual awareness.</p><h2>LLM Agents</h2><h3>What are they?</h3><p>LLM agents are a new class of AI systems that combine large language models (LLMs) with the ability to make informed decisions, take actions, and work towards specific goals. An LLM agent can be described as a system that uses an LLM to reason through a problem, create a plan to solve the problem, and execute the plan with the help of a set of tools. LLM agents are typically characterized by three properties:</p><ul><li><p>Memory (equivalent to environment in the context of RL)</p></li><li><p>Tool usage (equivalent to actions in the context of RL)</p></li><li><p>Planning (equivalent to policies in RL, which map states to actions to maximize reward)&nbsp;&nbsp;</p></li></ul><p>This concept enables LLMs to analyze the information they encounter and choose the most appropriate tool for the task at hand based on their available policies. This empowers them to make informed decisions and achieve their goals. This is exactly what we humans do: when we have a task to solve, we gather information and look for tools that help us solve it as easily as possible. Memory and tool usage are relatively well established, but planning still has significant room for debate and improvement.&nbsp;</p><p>Early LLM agents like <a href="https://github.com/Significant-Gravitas/AutoGPT">AutoGPT</a> and <a href="https://github.com/yoheinakajima/babyagi">BabyAGI</a> have shown promise in complex tasks like web searches and code generation.
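</p><p>As a rough illustration, those three properties can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: the <code>plan</code> function and the <code>calculator</code> tool are stand-ins for a real LLM call and real tool integrations.</p>

```python
# Minimal sketch of an LLM agent loop: memory, tool usage, and planning.
# `plan` is a stand-in for an LLM call; a real agent would prompt a model
# with the goal, the memory so far, and the available tool descriptions.

def calculator(expression: str) -> str:
    """A hypothetical tool the agent can invoke."""
    return str(eval(expression, {"__builtins__": {}}))

TOOLS = {"calculator": calculator}

def plan(goal: str, memory: list) -> dict:
    """Stub planner: the policy mapping the current state to an action."""
    if not memory:                      # nothing done yet: use a tool
        return {"tool": "calculator", "input": goal}
    return {"tool": None, "input": memory[-1]["observation"]}  # finished

def run_agent(goal: str, max_steps: int = 5) -> str:
    memory = []                         # the agent's state (its "environment", in RL terms)
    for _ in range(max_steps):
        action = plan(goal, memory)
        if action["tool"] is None:      # planner decided the goal is met
            return action["input"]
        observation = TOOLS[action["tool"]](action["input"])
        memory.append({"action": action, "observation": observation})
    return "gave up"

print(run_agent("2 + 3 * 4"))  # prints 14
```

<p>In a real agent, <code>plan</code> would prompt an LLM with the goal, the memory, and the tool descriptions, then parse the model&#8217;s chosen action.</p><p>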
However, these agents are still under development, and their stability, reliability, and applicability to real-world problems remain open questions.</p><h2>Agents and Autonomy</h2><p>One of the important characteristics of agents is their level of autonomy characterized by their ability to execute increasingly complex tasks with little to no supervision (correlated with the complexity of their policies). A software system at a low level of autonomy might resemble tools and one with a high level of autonomy might behave like an agent. While there is no clear cut differentiation, comparison to autonomous driving can be illuminating.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xE1S!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xE1S!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 424w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 848w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 1272w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!xE1S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png" width="1129" height="604" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1129,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xE1S!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 424w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 848w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 1272w, https://substackcdn.com/image/fetch/$s_!xE1S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae145901-d37b-4cd4-85d4-7d8c0f0d82bf_1129x604.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"></button></div></div></div></a></figure></div><p>Levels 0 and 1 are largely what we have seen in industry in the past 2 decades for narrow and specialized use cases. Level 2 is what we have seen in the past 5 years or so, especially with generative models that can convert effectively between modalities (text to image) or ones that have emergent capabilities although they&#8217;re only trained on single tasks (aka large language models). The important property of level 2 is that systems at this level are combinations of smaller narrow systems and are coupled to each other via carefully designed interfaces (the most speculated architecture for GPT-4 is a <a href="https://medium.com/@seanbetts/peering-inside-gpt-4-understanding-its-mixture-of-experts-moe-architecture-2a42eb8bdcb3">mixture of experts</a>).
In all these levels, systems have a low level of autonomy: they can perform specific actions based on clear instructions but have limited decision-making authority.</p><p>Level 3 is where things start to get interesting. A medium level of autonomy means that systems can choose between different predefined options or strategies based on the context. A lot of the cases we call &#8220;agentic workflows&#8221; today fall into this category, where a pre-trained classifier (eg. intent classifiers in chatbots) routes queries to specific pipelines or action chains. This is a bit of a gray area under the strict definition of agents, but the robustness demands of most near-term applications mean that most systems will follow this design pattern until more robust infrastructure is available for higher levels of autonomy.&nbsp;</p><p>Levels 4 and 5 have high autonomy: the agent can learn from its interactions, set its own goals within a broader framework, and make complex decisions without needing explicit instructions. Very much like what we have experienced with autonomous driving at levels 4 and 5, there are significant infrastructure prerequisites for these levels to exist and deliver robust performance.
For example, there might be a need for significant organizational changes to allow a software system to execute a large number of tasks within a business workflow.&nbsp;</p><h3>What are the challenges of deploying LLM Agents in business workflows in the near term?</h3><p>While LLM agents have the potential to enhance business workflows and enable more intelligent systems, there are significant challenges that need to be addressed before these systems can be widely deployed in real-world settings.</p><h4>Technical Challenges</h4><ul><li><p><strong>Stability and reliability:</strong> Early experiments with LLM agents have shown that they can be prone to erratic or unexpected behavior, often deviating from intended goals or producing nonsensical outputs.</p></li><li><p><strong>Measuring progress and performance:</strong> Evaluating the effectiveness of LLM agents can be complex, as they may take unexpected approaches to achieve goals or potentially deviate from desired outcomes. Developing robust metrics and evaluation frameworks is an active area of research.</p></li></ul><h4>Organizational Challenges</h4><ul><li><p><strong>Integration with existing processes and infrastructures:</strong> Deploying LLM agents may require complex setup and management of data sources, APIs, and tools. Compatibility issues with legacy systems and the need for custom interfaces and integrations are also challenges.</p></li><li><p><strong>Human oversight and intervention:</strong> Mechanisms for humans to monitor, guide, and correct the behavior of LLM agents are important.
Designing workflows and interfaces that allow for seamless collaboration between humans and AI agents is a challenge.</p></li></ul><p>These challenges illustrate that while LLM agents have the potential to augment and automate various business tasks, their deployment requires careful planning, iterative testing, and ongoing monitoring and refinement.</p><h3>Agents and Control Flow</h3><p>In the near term, overcoming these challenges hinges on robust control flow mechanisms. Control flow dictates how the LLM agent navigates interactions, makes decisions, and ultimately achieves its goals.&nbsp;Without it, LLM agents risk producing nonsensical outputs, deviating from intended tasks, or simply becoming unstable.</p><p>Imagine an LLM agent designed to write customer service emails.&nbsp; Control flow ensures it understands the situation (e.g., complaint, inquiry), retrieves relevant information (e.g., customer details, order history), and crafts a professional and appropriate response.&nbsp; This might involve routing the user's request to the appropriate department or dynamically generating different email templates based on the issue.&nbsp; Control flow keeps the agent on track, preventing irrelevant tangents or factual errors.</p><p>Effective control flow also addresses the challenges of measuring progress and integrating with existing systems.&nbsp;By establishing clear decision points and expected behaviors, developers can create metrics to track the LLM agent's performance and identify areas for improvement.&nbsp; Furthermore, control flow allows for the integration of human oversight and intervention.&nbsp; Developers can design control mechanisms that allow humans to guide the LLM towards desired outcomes or step in when the agent encounters unexpected situations.&nbsp;In essence, control flow acts as the bridge between the raw power of LLMs and the need for stability, reliability, and human oversight in real-world applications. 
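</p><p>To make this concrete, here is a minimal, hypothetical sketch of such control flow for the customer service email example. The keyword router, reply templates, and banned-word guardrail below are toy stand-ins for the trained classifiers and checks a production system would use.</p>

```python
# Sketch of control flow around an LLM writing customer service emails:
# a router classifies the request and a guardrail checks the draft.
# The keyword classifier, templates, and banned-word list are toy
# stand-ins for trained models in a production system.

BANNED = {"stupid", "useless"}  # toy guardrail vocabulary

def classify(message: str) -> str:
    """Routing: decide which pipeline handles this request."""
    if "refund" in message.lower() or "broken" in message.lower():
        return "complaint"
    return "inquiry"

def draft_reply(kind: str) -> str:
    """Stand-in for an LLM call using a kind-specific prompt template."""
    templates = {
        "complaint": "We're sorry about the trouble. ",
        "inquiry": "Thanks for reaching out. ",
    }
    return templates[kind] + "A support agent will follow up shortly."

def guardrail(text: str) -> bool:
    """Block drafts containing banned language; a real guardrail might
    also fact-check the draft or verify it stays on topic."""
    return not any(word in text.lower() for word in BANNED)

def handle(message: str) -> str:
    kind = classify(message)        # routing decision point
    draft = draft_reply(kind)
    if not guardrail(draft):        # error handling / human escalation
        return "Escalated to a human agent."
    return draft

print(handle("My order arrived broken, I want a refund"))
```

<p>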
See also:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c5a1f72f-8e36-40eb-bf2c-05ca7a37a616&quot;,&quot;caption&quot;:&quot;Why write this article?&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How to do retrieval augmented generation (RAG) right!&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:16331264,&quot;name&quot;:&quot;Amir Feizpour (ai.science)&quot;,&quot;bio&quot;:&quot;helping _you_ use LLMs | CEO @ ai.science | Recovering quantum physicist / data scientist &quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4b8b13f-3391-43bf-989b-35cbf1a16e2d_400x400.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-05-22T14:04:01.312Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6ec9fdf-baea-431d-99d6-1298caaa5ee5_1094x864.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation&quot;,&quot;section_name&quot;:&quot;AI (LLMs, etc)&quot;,&quot;video_upload_id&quot;:null,&quot;id&quot;:144873316,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Deep Random Thoughts&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b81098a-4865-42e9-bc08-a2589bb79453_654x654.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>You can think of control flows as generalized policies that come from a deep 
understanding of the workflow the agent is trying to automate. They could include things like:</p><ul><li><p><strong>Guardrails:</strong> A type of control flow specifically designed to restrict the LLM's behavior in certain ways. They act like safety rails, preventing the LLM from venturing into undesirable areas or generating harmful outputs. For instance: preventing offensive language, staying on topic, and fact-checking.</p></li><li><p><strong>Routing</strong>: LLMs can be used to make decisions about how to respond to a user's query. This can involve classifying the query type (e.g., question, request, instruction) and then directing the response accordingly. For instance, an LLM might decide the best response to a question is a factual summary, while a request might require completing an action (like booking a flight).</p></li><li><p><strong>Error Handling and Recovery</strong>:&nbsp; When an agentic system encounters an issue, control flow mechanisms allow it to diagnose the problem and take corrective actions. This might involve prompting the user for clarification, reformulating a request, or attempting alternative strategies to complete the task.</p></li><li><p><strong>Prioritization and Decision Making</strong>:&nbsp; Agentic systems often juggle multiple tasks or goals. Control flow structures help them prioritize based on urgency, importance, or available resources.&nbsp; For instance, a virtual assistant might prioritize responding to an urgent message over completing a less time-sensitive task.</p></li><li><p><strong>State Management</strong>:&nbsp; Many agentic systems track their internal state (e.g., conversation history, user preferences) to provide a more consistent and personalized experience. Control flow dictates how the system updates its state based on new information and uses it to inform future actions.
Imagine a chatbot remembering your previous order preferences while recommending a new product.</p></li><li><p><strong>Learning and Adaptation</strong>:&nbsp; Advanced agentic systems can learn and adapt their behavior over time. Control flow allows them to integrate newly acquired knowledge into their decision-making process. For instance, a recommendation system might adjust its suggestions&nbsp; based on your past interactions and positive feedback.</p></li></ul><p>Most common implementations involve training small, specialized models (not necessarily language models) that carry out tasks and provide information or constraints to the overall system including crafting prompts that maximize the likelihood of the desired LLM response.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h2>Multi-agent LLM systems</h2><h3>Why multi-agent systems?</h3><p>Multi-agent systems involve multiple single-agent systems that interact to achieve complex tasks. This is particularly useful when a monolithic agent (one policy based on a singular reward function) is too inefficient or impossible. In our customer service example, imagine an agent who is rewarded by maximizing the likelihood of upselling products. This can hurt the business by sacrificing long term retention of the customer if they start to feel they are being &#8220;sold to&#8221;. One solution could be adding more terms to the reward function to account for the long term retention. Now collapsing all those contradictory requirements into one function might not necessarily result in the best policy learned by the system. 
Alternatively, one could create a system where two agents, one rewarded for maximizing short term profit and one rewarded for reducing the risk of losing the customer, can collaborate and keep each other in check.&nbsp;</p><p>Some of the other reasons for breaking up the system into multiple agents are:</p><ul><li><p>Optimized task allocation: It might be more efficient (from a design, implementation, and maintenance point of view) to break down a complex problem into smaller subproblems and have agents rewarded for solving those subproblems specifically. This more modular design, although unnecessary from a functionality point of view, could be easier to improve and scale.</p></li><li><p>Enhanced response time: The modular design of multiple agents is not only useful when the sub-problems are different but also when similar sub-problems can be solved in parallel to save analysis time.&nbsp;</p></li><li><p>More robust specialization: While it is plausible for one agent to learn how to choose amongst a large number of actions, it might be more robust to partition actions based on some relevant property and have agents specialize in using them more effectively.</p></li></ul><h1>Agents UX and HCI</h1><p>Software agents have become integral components of various Human-Computer Interaction (HCI) applications. These agents, powered by large language models (LLMs), serve various roles, from acting as personal assistants to customizing user interactions based on individual preferences. This section explores the roles of LLM agents in HCI and their impact on user experiences.</p><ul><li><p><strong>Intelligent User Interfaces (IUIs):</strong>&nbsp; Agents can act as intelligent assistants that can understand user needs, provide recommendations, and automate tasks. They offer a more intuitive and efficient means of interaction, reducing the cognitive load on users.
Virtual assistants like Siri, Alexa, or Google Assistant are prime examples of IUIs, where an agent interprets user queries and provides relevant information.</p></li><li><p><strong>Personalization and Recommendation Systems:</strong>&nbsp; Recommendation systems in e-commerce or streaming services can be powered by agents. These agents learn user preferences and recommend products, movies, or music based on that information. However, it's important to acknowledge that LLM agents can inherit biases from the data they're trained on. Transparency in how recommendations are generated is crucial for user trust.</p></li><li><p><strong>Adaptive Interfaces:&nbsp; </strong>Agents can be used to create adaptive interfaces that adjust to user behavior or skill level.&nbsp; For instance, an educational software program might use an agent to tailor the difficulty of exercises based on the user's performance.</p></li><li><p><strong>Embodied Conversational Agents (ECAs):</strong>&nbsp; These are virtual characters that can interact with users through spoken language or gestures. ECAs powered by LLMs can be used for customer service, education, or even companionship. Imagine an ECA tutor that personalizes learning experiences and provides emotional support.</p></li><li><p><strong>Augmented Reality (AR) and Virtual Reality (VR): </strong>&nbsp;Agents can be integrated into AR/VR experiences to guide users, provide information, or even act as companions within the virtual environment. An LLM agent in an AR museum experience could provide historical context about exhibits or answer visitor questions in a natural, conversational way.</p></li></ul><p>LLM agents are revolutionizing HCI by creating more intuitive, efficient, and personalized user experiences. They reduce cognitive load, improve accessibility, and offer a more natural way to interact with technology.
As LLM agents continue to evolve and become more sophisticated, we can expect even more innovative applications that enhance human-computer interaction in the years to come.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><h1>Parting words</h1><p>Hopefully, with this article you have learned more about agents and are as excited as we are about them! Agents are the holy grail of a lot of what we have done and have been wanting to do with computers for the past several decades. Today, with natural language as a new way to interface with machine, we are closer than ever to that dream! </p><p>Happy building!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[LLM Agents, Part 1 - The “9” Commandments: How to Build LLM Products Successfully]]></title><description><![CDATA["It&#8217;s easy to demo a car self-driving around a block, but making it into a product takes a decade." - Karpathy]]></description><link>https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments</link><guid isPermaLink="false">https://aisc.substack.com/p/llm-agents-part-1-the-9-commandments</guid><dc:creator><![CDATA[Amir Feizpour (ai.science)]]></dc:creator><pubDate>Wed, 24 Jul 2024 14:04:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fabc6b99-0675-42df-822f-ba5208156475_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this write up we will go over the most important principles you should follow as you ideate, validate, design, and build your LLM product. One thing that you will realize by the end of this is that the principles of building the most sophisticated multi-agent LLM products is the same as the ones for any LLM product and ultimately the same as the ones for any data-powered software product.&nbsp;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>1. Data is most probably your only moat</h1><p>We are living in a world where powerful open-source models are just a few clicks away, and your proprietary data is likely your only sustainable competitive advantage. While anyone can access state-of-the-art models like GPT-4o, Llama3, and Claude, the data you use to fine-tune and augment these models is what will truly set your product apart. Your data is the secret sauce that enables you to build AI systems that can perform tasks and provide insights your competitors can only dream of. Even if becoming a unicorn is not necessarily your thing, being able to interface LLMs with different types of data (eg. multi-modality) blows up the space of possibilities you can explore in terms of use cases.&nbsp;</p><p>It is crucial to focus on building products and features that allow you to collect unique and valuable data that others can't easily replicate. This might mean targeting niche domains where you have deep expertise, or creating AI-powered tools that incentivize users to contribute their own data. Another strategy is to form data partnerships with organizations that have complementary datasets, allowing you to enhance your models' capabilities without starting from scratch.</p><p>That said, collecting high-quality data can be challenging and resource-intensive. One approach to mitigate this is to create synthetic data that mimics real-world scenarios.
Synthetic data can help augment your existing datasets and improve model performance, especially in cases where real data is scarce or expensive to obtain.&nbsp;</p><p>When it comes to preparing your data for training models, it's important to weigh the benefits of annotating data versus relying solely on unsupervised methods. While unsupervised learning can be appealing due to its potential to reduce manual labor, annotated data often leads to better model performance and faster convergence. Investing in data annotation can pay off in the long run by improving the accuracy and reliability of your AI systems.</p><p>Adopting a data-centric approach to machine learning is key to building a strong competitive moat. By focusing on collecting, curating, and leveraging high-quality data, you can create AI products that are more accurate, insightful, and valuable to your users. Always be on the lookout for opportunities to expand and enrich your data moat, as it will be the foundation upon which your AI business is built.</p><p>The flip side of this advice is that blindly following this principle, without thinking deeply about what earns customers&#8217; long-term loyalty, could simply end in disappointment.
Ultimately, the big question is how to use the data (or any other unique resources you have) to deliver value to your customers in a way that compounds: you win new customers and expand your relationship with the ones that you have.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further reading:</p><ol><li><p><a href="https://www.michaeldempsey.me/blog/2024/04/17/the-meta-moats-of-ai/">The Shifting Dynamics &amp; Meta-Moats Of AI</a></p></li><li><p><a href="https://eugeneyan.com/writing/synthetic/">How to Generate and Use Synthetic Data for Finetuning</a></p></li><li><p><a href="https://a16z.com/the-empty-promise-of-data-moats/">The Empty Promise of Data Moats</a></p></li></ol><h1>2. Follow evaluation-driven development&nbsp;</h1><p>To build successful AI products, you need a rigorous approach for measuring and optimizing performance. This is where evaluation-driven development comes in. This is particularly important for agentic workflows and multi-agent systems. A very common problem in naively built agentic systems is compounding error, which quickly leads to endless loops or nonsensical results. The only way to avoid these problems is to have reliable and granular metrics throughout the system that act as feedback or reward mechanisms keeping the components and overall system in check.</p><p>Start by defining clear, quantitative metrics that capture what "good" looks like for your product - whether that's accuracy, user engagement, task completion rate, or some combination of these.
This has to be done for the components of your architecture as well as the overall performance of the system.</p><p>With your key metrics in place, orient your development process around continuously evaluating your pipelines against these benchmarks and iterate to improve performance. This could involve experimenting with different system designs, model architectures, model combinations, fine-tuning techniques, prompt engineering approaches, and UX designs. The key is to have a solid experimental setup where you're constantly shipping new arrangements of components, measuring their impact on your core metrics, and doubling down on the most promising ideas. This is also particularly important for LLM agent systems since the landscape of potential improvements is so vast that a thorough investigation of all possibilities with limited resources is simply impractical.&nbsp;</p><p>Don't get caught up in chasing the latest shiny model or technique without a clear sense of how it actually moves the needle on your core evaluation criteria. If you can't measure what "better" means, you're at high risk of turning in circles or fixating on the wrong things. 
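</p><p>As a toy illustration of this loop, the sketch below scores two hypothetical pipeline variants against a fixed test set and keeps the better one. In practice each variant would be a different prompt, model, fine-tune, or system architecture, and you would track several metrics rather than one.</p>

```python
# Sketch of evaluation-driven development: score candidate pipelines
# against a fixed test set and keep the best. The two "pipelines" are
# trivial stand-ins for real prompt/model/architecture variants.

TEST_SET = [
    ("capital of france", "paris"),
    ("capital of japan", "tokyo"),
    ("capital of italy", "rome"),
]

def pipeline_a(query):  # hypothetical variant A
    return {"capital of france": "paris", "capital of japan": "tokyo"}.get(query, "?")

def pipeline_b(query):  # hypothetical variant B
    return {"capital of france": "paris"}.get(query, "?")

def accuracy(pipeline) -> float:
    """One clear, quantitative metric; real systems would track several."""
    hits = sum(pipeline(q) == expected for q, expected in TEST_SET)
    return hits / len(TEST_SET)

def pick_best(candidates: dict):
    """Evaluate every candidate and double down on the most promising one."""
    scored = {name: accuracy(fn) for name, fn in candidates.items()}
    return max(scored, key=scored.get), scored

best, scored = pick_best({"A": pipeline_a, "B": pipeline_b})
print(best, scored)  # A wins with 2/3 vs 1/3
```

<p>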
By grounding your development in rigorous evaluation, you can efficiently zoom in on system architecture designs that actually deliver value to your users.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">How to do retrieval augmented generation (RAG) right!</a></p></li><li><p><a href="https://hamel.dev/blog/posts/evals/">Your AI Product Needs Evals</a></p></li><li><p><a href="https://web.stanford.edu/class/cs329t/slides/lecture_16_1.pdf">Evaluating LLM Agents</a></p></li><li><p><a href="https://github.com/THUDM/AgentBench">AgentBench</a></p></li></ol><h1>3. Get your product in the hands of your (ideally paying) users asap&nbsp;</h1><p>One of the biggest pitfalls in AI development is getting bogged down in endless technical tweaks before getting any feedback from real users. This is especially tempting with LLMs, where there's always another parameter to tune or dataset to incorporate. But the reality is, you'll never know if you're building something people actually want until you put it in their hands. The feedback gets even more real if they are paying you (or at least they anticipate having to pay you to use the product).</p><p>The antidote is simple (but not always easy): Build the simplest viable version of your product and get it in front of users as quickly as humanly possible. This might mean starting with a bare-bones MVP that only does one thing, or even launching a "fake" version powered by human labor behind the scenes. 
The point is to start collecting real feedback and data from day one, so you can validate your core assumptions and start iterating in the right direction. Doing this also helps calibrate your understanding of the right metrics to track, as discussed in the previous commandment. It is easy to lose sight of what really matters to the user by hiding behind purely technical metrics like accuracy.</p><p>Ideally, get this initial version into the hands of paying customers, even if it's just a small pilot group. Seeing real people actually fork over their hard-earned cash for your product is the ultimate validation that you're onto something. Plus, having revenue coming in from the get-go will help extend your runway and give you more breathing room to iterate.</p><p>Another important aspect of this is deployment. It is great that your product works on your laptop, but if the user can&#8217;t interact with it, you face significant friction in getting the feedback that you need.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/launch-an-llm-app-in-one-hour/">Launch an LLM App in One Hour</a></p></li><li><p><a href="https://blog.streamlit.io/how-to-build-a-llama-2-chatbot/">How to build a Llama 2 chatbot</a></p></li><li><p><a href="https://docs.streamlit.io/deploy/tutorials/docker">Deploy Streamlit using Docker</a></p></li><li><p><a href="https://a16z.com/a-framework-for-finding-a-design-partner/">A Framework for Finding A Design Partner</a></p></li><li><p><a href="https://ngrok.com/docs/http">Ngrok: Expose local 
application to the internet</a></p></li><li><p><a href="https://www.geekwire.com/2009/bootstrapping-stories-financing-your-startup-through-consulting/">Bootstrapping stories: Financing Your Startup Through Consulting</a></p></li></ol><h1>4. Separate the data and interface layers, and be prepared to invest in data engineering&nbsp;</h1><p>Using LLMs doesn't give you a free pass to ignore established software and data engineering best practices. In fact, as LLM-based systems grow in complexity and capability, it becomes even more critical to architect your systems in a modular, maintainable way.&nbsp;</p><p><strong>A key principle here is maintaining a clear separation between your data and interface layers.</strong></p><p>Designing your system in a way that the LLM itself becomes the source of knowledge is a risky and ill-advised approach. Instead, strive to architect your system, craft your prompts, and provide the relevant context to the LLM to ensure it relies solely on the information you supply to it when generating a response. While this may evolve in the future, cleanly decoupling your data from your interfaces gives you greater control, allows you to layer on additional security and privacy measures, and makes your system more robust to changes in the underlying models. Retrieval-augmented generation (RAG) techniques provide a powerful way to achieve this decoupling while still harnessing the full power of LLMs.</p><p>It is tempting to think that you can just fine-tune one model using your data and it will work as expected with all the controls that you need. The reality is that LLMs are not well-behaved enough to offer the granular level of control necessary for real-world applications and use cases. It is best to keep the data layer (knowledge base documents, structured data, etc.) in already well-established structures (aka databases) with all the necessary controls (e.g. identity and access management) that come with those. 
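</p><p>As a minimal illustration of this decoupling (the document store, retriever, and prompt template below are hypothetical placeholders; a production system would use a real database with embeddings and access controls), the pattern looks roughly like this:</p>

```python
# Sketch: keep knowledge in a separate data layer and pass only retrieved
# context to the model, rather than relying on the LLM's internal knowledge.
# The store, retriever, and prompt template below are hypothetical.

DOCUMENT_STORE = {  # stands in for a real database with access controls
    "doc1": "Acme's refund window is 30 days from purchase.",
    "doc2": "Acme support hours are 9am-5pm EST on weekdays.",
}

def retrieve(query: str, store: dict, top_k: int = 1) -> list[str]:
    """Naive keyword-overlap retrieval; a real system would use embeddings."""
    def overlap(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(store.values(), key=overlap, reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Instruct the model to answer ONLY from the supplied context."""
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n"
        f"Context:\n{ctx}\n\nQuestion: {query}"
    )

query = "What is the refund window?"
prompt = build_prompt(query, retrieve(query, DOCUMENT_STORE))
print(prompt)  # this prompt would then be sent to the LLM of your choice
```

<p>Because the knowledge lives in the store rather than in the model weights, you can update it, audit it, and gate access to it independently of the LLM.</p><p>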
This also makes it easier for you to pre-/post-process that data before feeding it to the model in retrieval-augmented generation (or equivalent) setups.&nbsp;</p><p>Separating the data layer also gives you the ability to build all the logic necessary for processing, storing, and retrieving the data used to train and run the smaller models you use in your control flows or to fine-tune your larger models. This includes data ingestion pipelines, data cleaning and transformation steps, feature engineering, and data versioning. Your interface layer, on the other hand, should focus solely on exposing the capabilities of your models to end-users, whether that's via APIs, chatbots, or interactive GUIs. Of course, the LLM itself can act as a linguistic interface by providing conversational interactions with the user.&nbsp;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://datakitchen.io/your-generative-ai-llm-needs-a-data-journey-a-comprehensive-guide-for-data-engineers/">Your LLM Needs a Data Journey: A Comprehensive Generative AI Guide for Data Engineers</a></p></li><li><p><a href="https://www.private-ai.com/en/2024/05/23/rag-privacy-guide/">Unlocking the Power of Retrieval Augmented Generation with Added Privacy: A Comprehensive Guide</a></p></li><li><p><a href="https://blogs.cisco.com/learning/securing-the-llm-stack">Securing the LLM Stack</a></p></li><li><p><a href="https://a16z.com/emerging-architectures-for-llm-applications/">Emerging Architectures for LLM Applications</a></p></li><li><p><a 
href="https://aisc.substack.com/p/how-to-do-retrieval-augmented-generation">How to do retrieval augmented generation (RAG) right!</a></p></li></ol><h1>5. Do not count on LLMs beyond linguistic interfaces&nbsp;</h1><p>LLMs are incredibly powerful for natural language tasks - they can engage in human-like dialogue, answer questions, summarize long passages, and even write creative fiction. But it's critical not to get swept away by the hype and expect them to be a magic bullet for every use case. As the name suggests, LLMs are language models - they excel at generating statistically plausible sequences of words, but struggle with many other desirable capabilities like reasoning, analysis, and grounding in real-world facts.</p><p>Many people fall into the trap of hoping LLMs will handle complex reasoning, read their minds to infer intent, write flawless code on the first try, or magically handle scheduling and workflow automation. But today's models simply aren't reliable for these types of tasks. Outside of linguistic interfaces, LLMs have significant limitations that constrain their usefulness. They are notoriously prone to "hallucinations" - confidently generating false or nonsensical information that can be hard to detect. They struggle with maintaining coherence over long time horizons or complex multi-step tasks.</p><p>So when architecting LLM-powered products, it's crucial to be ruthlessly realistic about what the models can and can't do. Focus on leveraging LLMs for what they excel at - engaging with users through natural language - and thoughtfully architect supporting systems to handle any downstream tasks. Be prepared to break down complex workflows into atomic steps, provide extensive context and guidance, and double-check outputs for factual and logical consistency. 
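</p><p>One common way to apply this in code is to wrap each atomic model call with an explicit output check and a retry, rather than trusting a single free-form generation. The sketch below assumes a hypothetical <code>call_llm</code> client and is only illustrative:</p>

```python
# Sketch: validate and retry structured LLM output instead of trusting it.
# call_llm() is a hypothetical placeholder for your model client; here it
# returns malformed text on the first attempt and valid JSON on the retry.

import json

def call_llm(prompt: str, attempt: int) -> str:
    """Placeholder model call simulating an unreliable generation."""
    return "Sure! Here you go" if attempt == 0 else '{"city": "Paris", "country": "France"}'

def generate_structured(prompt: str, required_keys: set, max_attempts: int = 3) -> dict:
    """Retry until the output parses and contains the expected fields."""
    for attempt in range(max_attempts):
        raw = call_llm(prompt, attempt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output: ask again rather than pass it on
        if required_keys <= data.keys():
            return data
    raise ValueError("model never produced valid structured output")

result = generate_structured("Return the capital of France as JSON.", {"city", "country"})
print(result)
```

<p>The same wrap-validate-retry shape applies to factual checks, schema validation, or running generated code against tests before showing anything to the user.</p><p>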
By playing to the strengths of LLMs while proactively addressing their limitations, you can design products that harness their power and mitigate their weaknesses.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://medium.com/@glovguy/large-language-models-reasoning-capabilities-and-limitations-951cee0ac642">Large Language Models: Reasoning Capabilities and Limitations</a></p></li><li><p><a href="https://arxiv.org/abs/2311.08516">LLMs cannot find reasoning errors, but can correct them!</a></p></li><li><p><a href="https://arxiv.org/html/2402.01817v2">LLMs Can&#8217;t Plan, But Can Help Planning in LLM-Modulo Frameworks</a></p></li></ol><h1>6. Create Robust Feedback Modules</h1><p>In academia, scientific papers undergo peer review, where different experts independently critique the work before publication. Borrowing from this process, a powerful paradigm for building self-improving AI systems is to train multiple models that play distinct roles akin to authors and reviewers.</p><p>In this setup, you might use a generative model to produce some output, like a dialogue response, a document summary, or a piece of code. You then use a separate "critic" model to evaluate the quality of that output along various dimensions like factual accuracy, logical coherence, style and tone. 
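</p><p>To make the author/reviewer pattern concrete, here is a minimal sketch of a generator-critic loop. Both <code>generator</code> and <code>critic</code> are toy stand-ins for independent LLM calls, and the scoring rule is invented purely for illustration:</p>

```python
# Sketch of a generator/critic loop: one model produces a draft, a separate
# critic scores it, and the generator revises until the critic is satisfied.
# Both "models" here are hypothetical placeholders for real LLM calls.

def generator(task: str, feedback: str = "") -> str:
    """Placeholder generator: produces a fuller draft when given feedback."""
    draft = f"Summary of {task}."
    if feedback:
        draft += " It covers the key findings and their implications."
    return draft

def critic(draft: str) -> tuple[float, str]:
    """Placeholder critic: scores the draft and explains the score."""
    if "key findings" not in draft:
        return 0.3, "Missing the key findings; add them."
    return 0.9, "Acceptable coverage."

def generate_with_review(task: str, threshold: float = 0.8, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):
        draft = generator(task, feedback)
        score, feedback = critic(draft)
        if score >= threshold:
            return draft
    return draft  # best effort after max_rounds

final = generate_with_review("the quarterly report")
print(final)
```

<p>In a real system the generator and critic would be separately trained (or at least separately prompted) models, possibly an ensemble of critics, so the review is not just a rubber stamp of the generator's own biases.</p><p>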
Crucially, these models are trained independently, so the critic acts as an objective assessor, not just a rubber stamp.</p><p>This is particularly important in agentic systems where the goal is for the system to continuously monitor its performance, reflect on the outcome, and try again with improved likelihood of better performance. This is a crucial ingredient for the level of autonomy we seek in agents. Therefore implementing highly reliable, accurate, and trustworthy feedback sub-systems (aka &#8220;reward mechanism&#8221; in the context of RL) is a big part of success in building an agentic product.&nbsp; You can even equip the agents with ensembles of critic tools (including but not limited to occasionally asking for human input) to cover different facets of evaluation, like long-term coherence vs. individual response quality.</p><p>The key benefit of this architecture is that it provides a scalable mechanism for quality control and continuous improvement that doesn't rely solely on human judgement. That said, it's not a total replacement for human evaluation - you'll still want to spot check the system's outputs, especially in the early stages. And there's an art to designing the right training setup and reward functions to get useful feedback while avoiding degenerate equilibrium between the generator and critic. 
But when done right, this approach can imbue your AI systems with the benefits of peer review to make them more robust and self-correcting over time, thereby achieving a higher level of autonomy.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/">What We Learned from a Year of Building with LLMs (Part I)</a></p></li><li><p><a href="https://arxiv.org/abs/2310.01798">Large Language Models Cannot Self-Correct Reasoning Yet (Deepmind)</a></p></li><li><p><a href="https://arxiv.org/html/2402.01817v2">LLMs Can&#8217;t Plan, But Can Help Planning in LLM-Modulo Frameworks</a></p></li></ol><h1>7. Actually Improve Users&#8217; Productivity&nbsp;</h1><p>It's easy to get caught up in building flashy AI demos that showcase the latest and greatest model capabilities. But at the end of the day, the true measure of success for your LLM products is how well they improve users' lives in tangible ways. In particular, since the most commonly promised benefit of LLM agent tools is productivity, it's critical to honestly assess whether you're making people more efficient at important tasks, or just giving them one more thing to babysit.</p><p>To deliver real productivity gains, you need to deeply understand your target users' existing workflows and ruthlessly prioritize AI features that will save them time and effort. 
Productivity does not just mean saving time; it means saving unwanted effort. Therefore, your product has to address workflows that your users:</p><ol><li><p>Spend significant time on, AND&nbsp;</p></li><li><p>Do not want to spend that time on.&nbsp;</p></li></ol><p>Approach every new capability through the lens of "how does this concretely make my user's job more efficient and effective?" If you can't quantify the impact, chances are it's not worth building.</p><p>Another important nuance here is that builders are sometimes excited about taking away the parts of the job that people actually enjoy spending time on rather than the parts that they hate doing. While such a product might theoretically improve productivity, the psychological barrier to using it will make adoption backfire.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further reading:</p><ol><li><p><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321">Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality</a></p></li><li><p><a href="https://hbr.org/2024/01/is-genais-impact-on-productivity-overblown">Is GenAI&#8217;s Impact on Productivity Overblown?</a></p></li></ol><h1>8. Think deeply about integrations into users' tools and workflows&nbsp;</h1><p>To drive successful adoption, your AI product needs to fit seamlessly into users' existing workflows and tool chains. No matter how impressive your models are under the hood, if using your product feels like a clunky, disjointed experience, people simply won't bother. 
On the flip side, if your product slots nicely into the tools and processes users are already using day-to-day, you'll dramatically lower the barriers to adoption and make your AI feel like a natural extension of users' workflows. The last thing people want is yet another siloed app to switch back and forth from. Instead, look for opportunities to embed your AI capabilities right within the apps users already live in day-to-day, whether that's their email client, messaging platform, note-taking tool, or code editor.</p><p>By meeting users where they already work and focusing relentlessly on concrete effort savings, you can ensure your AI product isn't just a novelty, but an essential part of people's daily flow. And those efficiency gains add up fast - saving someone a few minutes or clicks on a task they do 10 times a day is a game changer. Keep humans at the center, measure what matters, and optimize for their productivity above all else.</p><p>To get this right, you need to invest significant time upfront to deeply understand how your target users currently work and what their key pain points are. This means going beyond surface-level interviews and surveys to really immerse yourself in their world. Shadow them as they go about their tasks, paying close attention to all the tools, systems, and collaborators they interact with along the way. Map out their end-to-end workflows to identify bottlenecks, inefficiencies, and opportunities for AI to streamline the process.</p><p>Armed with this deep understanding, architect your AI product to integrate with the specific tools your users depend on, with seamless bridges for importing and exporting data, triggering actions, and collaborating with teammates. 
In many cases, this means delivering your AI capabilities as plugins or add-ons right within users' primary tools, instead of forcing them to switch to a separate app.</p><p>When well executed, this deep integration approach makes your AI product feel less like a tool and more like an intelligent assistant that's always there in the flow of work, ready to lend a hand. Users don't have to disrupt their normal processes or learn new interfaces - they can simply tap into the power of AI whenever and wherever they need it. And that frictionless experience is the key to making AI an indispensable part of people's daily lives.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p>Change Management for LLM-based Products</p></li></ol><h1>9. Design for humans!&nbsp;</h1><p>Amid the excitement around LLMs and other AI breakthroughs, it can be tempting to get carried away imagining a world where machines handle every task and decision. But the reality is, humans are going to remain an essential part of the equation for the vast majority of use cases for the foreseeable future. Even the most sophisticated AI systems today are narrow in scope and brittle in the face of edge cases. They are powerful tools to be wielded by humans, <a href="https://aisc.substack.com/p/ai-as-judgment-machine">not wholesale replacements for human judgement</a>.</p><p>As such, it's critical that we keep real human needs, behaviors, and constraints at the center of our AI product development process. 
At every step along the way, we need to be testing our products with actual users, seeing how they integrate (or don't) into their real-world contexts, and shaping the user experience accordingly. Pretty model performance numbers in a lab setting are meaningless if they don't translate into tangible benefits for humans in the messy real world.</p><p>Prioritizing the human element means investing deeply in thoughtful UX design, extensive user testing, and rapid iteration based on feedback. It means providing robust, accessible user education to help people understand both the capabilities and limitations of the AI systems you're putting in their hands. And it means proactively considering and mitigating the potential risks and unintended negative consequences your product could have in people's lives.</p><p>Ultimately, our north star as AI product builders should be empowering humans to do their best work. We have an incredible opportunity to usher in a new era of productivity and creativity, but it will require the hard, patient work of aligning powerful AI capabilities with real human needs. If we keep humans at the center and measure success by the positive impact we have in their lives, not just our model metrics, we can build an AI-powered future that brings out the best in both machines and people. The road ahead won't be easy, but the destination will be more than worth it. 
So stay focused on those human needs, keep iterating, and let's build the future together!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://maven.com/aggregate-intellect/llm-systems&quot;,&quot;text&quot;:&quot;Build Agents with Us&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://maven.com/aggregate-intellect/llm-systems"><span>Build Agents with Us</span></a></p><p>Further Reading:</p><ol><li><p><a href="https://www.reaktor.com/articles/crafting-llm-powered-interactions-design-principles-for-natural-language-user-interfaces">Crafting LLM-powered Interactions: Design Principles for Natural-Language User Interfaces</a></p></li></ol><h1>Conclusion</h1><p>Building successful multi-agent LLM products requires a multidisciplinary approach that blends technical chops, product sensibilities, and proactive ethical responsibility. The 9 principles we've explored provide a comprehensive roadmap for navigating this complex landscape.</p><p>By focusing on building proprietary data moats, maintaining modularity in your architecture, relentlessly evaluating and iterating on product performance, and deeply integrating with users' workflows, you'll be well on your way to creating AI systems that deliver real value. And by keeping humans at the center of the process and proactively addressing potential negative impacts, you can ensure that value is achieved responsibly and sustainably.</p><p>But while these principles provide a solid foundation, the reality is that building game-changing AI products is hard. It requires grappling with cutting-edge research, wrangling messy real-world data, and constantly iterating in the face of shifting user needs and expectations. There will be setbacks, dead-ends, and pivots along the way. 
The key is to stay focused on your north star of empowering users, stay humble in the face of complexity, and keep pushing forward one experiment at a time.</p><p>The potential for LLMs and multi-agent systems to transform how we live and work is immense - but it won't be realized without diligent, human-centric innovators translating the raw capabilities into meaningful products. By internalizing these 9 principles and tenaciously applying them in practice, you'll be at the vanguard of this exciting frontier. The journey won't be easy, but the destination will be more than worth it. So go forth and build the future!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://aisc.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Deep Random Thoughts! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item></channel></rss>