Snowflake is acquiring the TruEra AI Observability platform (May 22, 2024)
https://truera.com/snowflake-is-acquiring-the-truera-ai-observability-platform/

We are excited to announce that Snowflake is acquiring the TruEra AI Observability platform to bring LLM and ML Observability to its AI Data Cloud. We are looking forward to this next phase in our journey with the Snowflake team, with whom we share a commitment to delivering high-quality, trustworthy AI at scale.

Snowflake is committed to making enterprise AI easy, efficient, and trusted. Nearly ten thousand companies around the globe, including hundreds of the world’s largest, use Snowflake’s AI Data Cloud to share data, build AI and machine learning applications, and power their business. Snowflake is making massive investments in generative AI services (e.g. Snowflake Cortex AI) and end-to-end machine learning capabilities (e.g. Snowpark ML) to help customers build and deploy high-impact AI use cases that maximize the value of their data. 

At the same time, to deliver value to these customers, it is absolutely critical to ensure that generative AI applications and ML models built with these services are effective and trustworthy – that they produce accurate and relevant results and guard against risks such as hallucinations, unfair bias, toxicity, and more. This is exactly the problem that the TruEra AI Observability platform addresses by evaluating, monitoring, and debugging models and apps across the full lifecycle from development to production. 

We are excited to bring these world-leading capabilities to Snowflake and to join the team to enable effective and trustworthy end-to-end generative AI and ML capabilities for its thousands of customers. Our customers have told us they want both innovative technology and end-to-end solutions across the AI lifecycle that are easy to deploy, integrate and use. We are excited now that we can fully deliver that through the combination of TruEra and Snowflake.

Indeed, we believe this path aligns well with TruEra’s mission to ensure effective and trustworthy AI for everyone. The core elements that we founded TruEra on five years ago – world-leading AI research, hands-on experience with building and deploying ML, domain expertise in specific verticals, and most importantly, a shared passion to make AI adoption effective and safer at scale – will remain key in the next phase of our journey with the team at Snowflake. TruEra’s AI Observability product combines a scalable data infrastructure with AI research innovations to enable scalable ML monitoring, precise root cause analysis, and fast, accurate explainability. Our expansion into LLM Evaluation and Observability – initially with our open source TruLens library and more recently with our commercial offering – introduced technical concepts, such as the RAG Triad, and innovative ways of using smaller language models to cost-effectively scale up monitoring of LLM apps. We look forward to bringing these differentiated capabilities to Snowflake’s AI and ML services.

We are grateful to the many TruEra customers and TruLens users who have joined us on our journey and believed in us. We have had the great pleasure of working with innovative organizations, leaders, and practitioners who have trusted us with mission-critical AI projects. These companies include Standard Chartered, AES, Harper Collins, Zurich Insurance, Haystaq, Equinix, Union Bank and KBC, along with a number of other Fortune and Global 500 enterprises whom we cannot name publicly, but upon whose support we have relied. We also appreciate the thousands of developers who have embraced TruLens, joined our community, and provided rapid feedback, shaping both TruLens and our commercial LLM Observability offering.  

Thank you to our partners, who have stood with us and shared many webinars, events, online educational courses, and cocktails, especially our partners at Intel, Nvidia, HPE, and Amazon, and our friends in the AI stack, Llama Index, Pinecone, OpenAI, Google, and Anthropic.

And of course, thank you to our hardworking and steadfast TruEra team members, past and present. Over the years, amazingly talented individuals have dedicated their time, skills, and passion to TruEra, helping us to accomplish great things in a very short amount of time. We have covered a tremendous amount of ground on this journey so far, and we are proud to have worked and to continue to work alongside you.

We are delighted to move forward on this journey with our new colleagues at Snowflake. The Snowflake leadership and engineering teams are truly impressive, and share our passion for high quality, trustworthy AI. We look forward to joining forces to help them drive their AI vision forward. Look for the TruEra team to be a major contributor to Snowflake AI initiatives soon.

Sincerely, 

Anupam Datta, President and Chief Scientist

Shayak Sen, CTO

Will Uppington, CEO

GPT-4 Powered App Creation and Evals – Hackathon Wrap-Up (April 2, 2024)
https://truera.com/gpt-4-powered-app-creation-and-evals-hackathon-wrap-up/

From Feb 16 to 23, TruEra hosted its second hackathon in partnership with OpenAI, Pinecone, Portkey, and LabLabAI. We had 1,604 participants divided into 302 teams. Over a week, talented hackers used GPT-4, TruLens, and complementary technologies to build and iterate on 56 LLM-based applications. As we worked with these teams to build their applications, a few themes emerged:

  • Theme 1: Beyond basic RAG. The teams that separated themselves from the pack innovated beyond question-answering RAG to define user flows and interactions meaningful for their use case.
  • Theme 2: Informed iteration. With TruLens in hand, hackers could quickly discover failure modes and optimize their apps. Teams evaluated different architectures, models and more to find a winning solution.
  • Theme 3: It’s not always necessary to optimize evaluation scores. Using TruLens, the teams were able to understand the tradeoffs between cost and response quality. This allowed them to find when GPT-3.5 was sufficient for their use case, and when GPT-4 was really needed – saving cost!

We received fantastic submissions, but three really stood out. 

🥉Winner #3: SimpliMedi-Assist

[Figure 1: SimpliMedi-Assist]

Team: Timothy Yeboah

SimpliMedi-Assist is an innovative and user-friendly application designed to facilitate the understanding of complex medical reports. It leverages sophisticated language models to translate difficult medical terms into accessible information for patients and non-medical professionals. In doing so, it empowers patients who may otherwise feel intimidated by medical jargon. Also, it promotes health literacy and strengthens the relationship between physicians and patients, which ultimately improves healthcare outcomes. 

🥈Winner #2: Conscious Click

[Figure 2: Conscious Click]

Team: Joey Wang

Conscious Click is a web application that helps fight internet addiction, which has become a global health issue. According to a study done in 2010, 8.2% of the US population suffered from internet addiction, and a more recent study shows that 36.7% of the population in China became addicted to the internet because of COVID-19. Conscious Click leverages insights from Dr. Anna Lembke’s best seller “Dopamine Nation: Finding Balance in the Age of Indulgence” to build a coaching assistant that offers customized 30-day plans to adjust a user’s dopamine levels. It is complemented by a Chrome extension that tracks a user’s internet consumption and app usage. If a user tries to watch a video that is not aligned with their goals, the extension automatically blocks it. 

🥇Winner #1: CogniSmile

[Figure 3: CogniSmile]

Team: Mohamed Regaya, Yahya Samet, Mohamed Aziz Omezine, Aya Omezzine, Hachicha Mohamed

CogniSmile is an AI-powered storytelling app that creates personalized stories and artwork based on children’s preferences and drawings. Here’s how it works. First, kids select themes that they care about and make related drawings. Then, an AI model generates captivating stories based on their selected themes and drawings. In addition, they can turn these experiences into learning opportunities by using quizzes. This app has many benefits: 1) it stimulates kids’ imagination and creativity, and 2) it promotes literacy and language skills.

LLM App Success Requires LLM Evaluations and LLM Observability (March 21, 2024)
https://truera.com/llm-app-success-requires-llm-evaluations-and-llm-observability/

TruEra recently announced the launch of a major update to its AI Observability offering. This launch dramatically expands TruEra’s LLM Observability capabilities, so that it provides value from the individual developer all the way to the largest enterprises, across the complete LLM app lifecycle. A developer can get started for free with the TruLens LLM eval solution and transition seamlessly to using TruEra for LLM Observability in production, at scale, collaborating with a broader team. We are seeing significant increases in the use of TruLens (2-3x growth month over month) and are grateful for the positive feedback we’ve received from organizations actively using it, like Equinix and KBC Group.

We are at the advent of a new era in software

We are at a critical time in the evolution of GenAI technology for the enterprise. GenAI is a paradigm-shifting technology like predictive AI, SaaS, mobile, and other large evolutionary stages in software and computing. But the full extent of its impact on the enterprise, and how quickly that impact will be realized, is still uncertain. In 2023, most enterprises developed prototypes, ran internal POCs, and were charged with enthusiasm about GenAI’s potential. We are at a critical point – will this enthusiasm be sustained and prove itself worthy, or will an embarrassing parade of unready apps create a major deflation?

Figure 1: There are many ways that this LLM app hype cycle could go.

In late 2023, we conducted a survey that showed that very few organizations had actually moved LLM applications into production. Only 11% of enterprises had moved more than 25% of their GenAI initiatives into production. Approximately 70% of enterprises have not yet moved any applications into production. Customers tell us that, now, their top priority is to get these apps into production and to achieve business impact. They want to take advantage of this path-breaking technology and stay ahead of their competition. The questions that these customers and the industry face are fundamentally: How can these LLM applications quickly become successful and fulfill the promise of GenAI? Which factors will determine the ultimate success of these applications?

Figure 2: Most often, LLM apps are still in the development phase, according to a TruEra survey.

The unique challenges of LLM applications

As enterprises and startups get serious about moving LLM applications from the POC stage into production, they experience new pain points. TruEra surveys show that by far the top pain points are quality and governance. Providing a high-quality user experience is key to the success of any application, and LLM-based applications create unique quality considerations, different from other applications. Most software systems, such as CRM systems, tend to be deterministic, providing expected and reliable output based on a given input. LLM apps are less deterministic and less well understood, and can produce unexpected and potentially problematic output.

Here are the key concerns that we hear about: 

  • Relevance and accuracy. In particular, developers need to ensure that LLM apps provide relevant responses, such as an accurate response to a question, or a comprehensive and appropriate summary of content. The LLMs powering these applications have been shown both to provide non-relevant answers (things that may be true, but are not the answer to the question) and to hallucinate, providing made-up, non-factual output. In order to be trustworthy and minimize corporate risk, LLM apps need to meet relevance and accuracy benchmarks.
  • Inappropriate behavior, toxicity, or bias. In order to protect their brand equity, stay in the favor of their customers, and to ensure compliance with regulations, organizations want to ensure that their LLM apps are not inappropriate, rude or unfairly discriminatory.
  • Leaks or misuse of confidential, personal, or other sensitive information. Enterprises are concerned about sending company proprietary information outside of their companies, such as to LLM providers, as part of operating these applications. They are also concerned about risks related to the output of these applications, including leaking confidential company information, or information that they are legally required to restrict, such as sharing private customer information.
  • Cost containment. As the usage of these applications scales, developers also become concerned about costs. For example, variable costs related to LLM usage that are low in aggregate during an internal POC can rise substantially as usage of the application increases. Enterprises are also concerned about optimizing their application, usage, and latency, but to a lesser degree than the top three pain points.
  • Ongoing quality management needs. Quality is constantly changing due to changes in inputs. LLM app developers are going to find that addressing relevance and other key performance metrics will be an ongoing battle, as users enter prompts that they did not anticipate, and the underlying content in LLM app knowledge stores changes.
Figure 3: The leading pain points of organizations building LLM apps are quality/hallucinations and security/governance.

To the rescue: programmatic LLM Evaluation and Observability

How are enterprises addressing quality, governance, and cost challenges? Currently, developers run manual tests, assess quality, watch for governance issues, and calculate costs. This manual approach is limited, time consuming, and painful. It negatively impacts developer productivity, reducing the speed at which developers can improve quality and address stakeholder concerns. In addition, manual methods will, inevitably, miss many quality or governance failures, leading to quality problems during testing or POCs and reducing trust. Organizations are rapidly recognizing that addressing quality, governance, and cost challenges requires new, programmatic evaluation technology during development and production. Or, as Greg Brockman of OpenAI put it: “Evals are surprisingly often all you need.”

"(LLM) evals are surprisingly often all you need" says OpenAI's president, Greg Brockman
Figure 4: Greg Brockman, President of OpenAI, commenting on the importance of evaluations in December, 2023 on X (formerly Twitter).

And these evals are not only needed during development for testing, debugging, and iterative improvement, but they are also needed on an ongoing basis during production. Enterprises thus need both evaluations and observability.

  • LLM Evaluation. Evaluations are critical for identifying quality and governance issues during development, identifying root causes, and then iteratively improving the application. Evaluations address quality and governance issues, all while also managing costs, such as the token cost of using the LLMs. Programmatic evaluations supercharge this by allowing developers to evaluate LLM apps much more quickly, efficiently, and effectively. 
  • LLM Observability. As developers seek to move their applications into production, they will need to programmatically evaluate their applications on an ongoing basis. The inputs to LLM apps will constantly evolve. This is commonplace for language apps. For example, Google has shared that 15% of the keywords entered into Google search are new each day. While the number of new inputs might not be exactly the same in Enterprise LLM applications, LLM application developers should expect significant change in most apps. Similarly, the content summarized in LLM applications will change over time, leading to new relevance and hallucination challenges. Risks, such as toxicity and bias, will also remain. Ongoing programmatic evaluations will be critical to detecting potential failures. The new challenge that faces developers will be how to scale ongoing evaluations for production level usage.

TruEra: the solution for LLM evaluations in development and production

With our new release, TruEra now provides a comprehensive solution for programmatically evaluating and monitoring LLMs across both development and production. What’s unique about our approach is that we offer both a leading open source offering and a proven, scalable commercial solution. Developers can start evaluating during development using TruLens for free, and then seamlessly start using TruEra as they need to scale their evaluations in production. TruEra is built on top of our proven predictive AI platform, so that customers can have confidence in our scalability and security.

TruEra offers critical features for evaluation and monitoring including:

  • Granular evaluations, including RAG Triad analytics
    • In order to effectively evaluate LLM applications, it’s important to run granular evaluations. If a developer were to just run a single evaluation of the relevance of the output of an LLM application, it would not provide enough information to effectively debug the application. Instead, developers need to run multiple, granular evaluations for each of the application’s constituent parts.
    • For example, in order to effectively evaluate and debug a Retrieval-Augmented Generation (RAG) application, TruEra created a concept called the RAG Triad. The RAG Triad requires three granular evaluations:
      • Context relevance: assesses the relevance of the context retrieved from the vector database to the prompt
      • Groundedness: evaluates the truthfulness of an LLM summary relative to the provided context, or, in other words, the extent to which the summary may contain hallucinations
      • Answer relevance: assesses the relevance of the final response relative to the prompt
    • These evaluations enable developers to quickly hone in on whether a poor output is due to the relevance of the context provided by the vector database, LLM hallucinations, or the quality of the LLM summarization capabilities. (A code sketch of these three evaluations appears after this feature list.)
Figure 5: The RAG Triad entails three critical analytics to test for app quality.
  • Comprehensive evaluations for quality and governance
    • In addition to RAG Triad evaluations, TruEra offers over 10 additional out-of-the-box evaluations, including: Embedding distance, Context relevance, Groundedness, Summarization quality, PII Detection, Toxicity, Stereotyping, Jailbreaks, Prompt sentiment, Language mismatch, Conciseness, and Coherence.
    • TruEra also provides the ability to add custom evaluations, such as tracking manual evaluations.
  • Scalable monitoring and alerting
    • TruEra can track evaluation output over time at large scale
    • Developers can create sophisticated alerting rules to proactively inform them when any metric hits problem thresholds
  • Custom metric monitoring
    • In addition to programmatic evaluations, TruEra can track and visualize custom metrics. For example, a developer might instrument their application to track engagement metrics, such as click-through-rate and time-on-page. These metrics can be imported into TruEra  so that developers can visualize them over time and create alerts.
  • Debugging
    • If a developer observes a period of time with problematic values, they can hone in on a particular time range and then create a data set containing all of the underlying application prompts, responses, and metadata during this period. This makes it easy for developers to identify and debug problematic inputs, outputs, and metadata within the application.
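
To make the RAG Triad concrete, here is a minimal sketch of how the three evaluations can be defined with the open source TruLens (trulens_eval) library. The OpenAI feedback provider, the method names (for example, groundedness_measure_with_cot_reasons), and the retrieve lens path are illustrative assumptions that vary by TruLens version and app structure.

import numpy as np

from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider.openai import OpenAI

# LLM-based provider used to score each evaluation (assumes OpenAI credentials are configured).
provider = OpenAI()

# Context relevance: is each retrieved chunk relevant to the user's query?
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on(Select.RecordCalls.retrieve.rets.collect())
    .aggregate(np.mean)
)

# Groundedness: is the final response supported by the retrieved context?
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())
    .on_output()
)

# Answer relevance: does the final response address the original prompt?
f_answer_relevance = Feedback(provider.relevance, name="Answer Relevance").on_input_output()

These three feedback objects are then passed to the recorder that wraps the app, so they run on every recorded call.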

Manage GenAI and predictive AI in one place

TruEra’s evaluation and monitoring solution doesn’t just work for LLM applications. It also provides a comprehensive set of explainability, evaluation, model validation, testing, responsible AI, monitoring, and debugging functionality for predictive ML models. Enterprises can use a single solution for both predictive and generative AI.

Helping developers and enterprises achieve production AI

Most enterprises find themselves at a critical juncture in their generative AI development, and TruEra is proud to offer the first combined open source and commercial offering to help them achieve production AI success. Interested in getting started? Download TruLens LLM evaluation software for free today, or contact us to get started with TruEra.

There is a lot more yet to be done. In particular, monitoring LLM applications will require new lower cost and lower latency programmatic evaluation methods. Current common evaluators will not scale that well. More on this soon!

LLM Evaluation and LLM Observability – Now at Enterprise Scale (March 13, 2024)
https://truera.com/llm-evaluation-and-llm-observability-at-enterprise-scale-for-effective-llm-ops/

Co-author: Anupam Datta, President and Chief Scientist, TruEra

Today we are announcing a significant update to TruEra AI Observability that allows users and enterprises to seamlessly monitor and debug their LLM apps for quality at scale. This launch integrates TruLens, the open source evaluation and experiment tracking framework, with the TruEra AI Observability product. TruLens is used by developers in enterprises to evaluate and track quality during experimentation and before deployment. As more teams are now transitioning their apps to production, they need greater scale and flexible views over time to understand how their apps are doing in the real world. 

Evaluations are critical

Large Language Models have made it straightforward to build rich apps, from chatbots to question answering. However, enterprises are realizing that going from prototype to production requires a significant amount of iteration and improvement to mitigate problems like hallucinations and inaccuracy. Further, once these apps are put in production, they need to be vetted on an ongoing basis. As a result, evaluations are critical throughout the lifecycle.

Evaluations aren’t enough – monitoring is also key

The key insight that powers all TruEra observability solutions is that metrics and evaluations need to be coupled with tools that help users deep dive into their performance and understand the causes of failures and errors. With this release, we’ve integrated LLM monitoring with advanced tracing and root cause analysis that helps users understand why the metrics may be dropping.

2024 looks very different from 2023 for LLM projects

2023 was an inflection point for AI developers. The rise of easily available, hosted LLMs made it straightforward to build compelling prototypes for a variety of use cases, including conversational agents, summarization, and text extraction. This led to a flurry of activity in enterprises to jump on the LLM bandwagon. However, in order to move these apps to production there needs to be a lot more trust in how these apps behave in the real world. As a result, we noticed a much stronger focus on evaluations and quality as 2023 progressed. From November 2023 to January 2024, downloads for our open source evaluation project jumped 8x. At the same time, a lot of enterprises using TruLens also started asking for production monitoring tools, as models were starting to finally make their way to production at scale. 

How to get started with LLM evaluation and observability

Getting started with TruEra is straightforward. Check out this guide (Evaluate and Monitor LLM Apps using TruEra) on how to get going. For most users, the first step is to wrap your app with TruEra like so:

tru_recorder = tru.wrap_custom_app(
    app=rag_app,
    project_name=PROJECT_NAME,
    app_name=APP_NAME,
    feedbacks = [f_groundedness, f_qa_relevance, f_context_relevance],
    dataset_config=dataset_config
)

In the code above, we’re wrapping the app and also specifying feedback functions that will run alongside the app to evaluate different aspects of its quality. In this example, we’re using the RAG Triad of metrics – groundedness, question-answer relevance, and context relevance – to detect and debug hallucinations.

Once you’ve wrapped your app, every call made in the context of the recorder logs all of your app’s inputs, outputs, and internals; the feedback functions run as evaluations and are logged to TruEra.

with tru_recorder as c:
    response = tru_recorder.app.query("What is TruEra?")

At that point, you can create a dashboard for the apps you have logged to TruEra and see how performance changes over time. Dashboards allow you to choose multiple apps and display a range of feedback metrics in a single view.

[Figure: monitoring dashboard for two apps]

Here’s an example of a dashboard with two apps under a canary deployment. There is a drop in the performance of the first app, which is then replaced by an app with improved performance.

When things go wrong, it’s possible to create slices of the data for deeper analysis to understand what might have gone wrong.

Once a time range is pulled out for deeper evaluation, users can view all traces in that time period, with deep dives on feedback evaluations and full trace details.

We just walked through how users would diagnose their apps in production. Taking a step back, let’s take a look at how the TruEra platform helps users understand the performance of their apps throughout the lifecycle of an app.

Quality throughout the AI app lifecycle

In many ways, it’s too late to think about observability when you’re putting an app in production. It’s important to start thinking about quality right at the beginning, while experimenting with your app, and then systematically test your app before launching it in production. The diagram below walks through the journey of building a model and how TruEra helps developers quickly achieve higher performance.

[Figure: the AI app lifecycle, from experimentation through testing to production monitoring]

Experimentation

During experimentation, developers have the most freedom to explore different architectures, parameters and implementations of their idea. As they do this, it’s important to systematically evaluate and track results in order to pick the best choice. 

Testing

Before deployment, developers need to build confidence in the quality of their applications in order to put them into production. At this critical phase, it’s important to test for key edge cases and validate the performance of the applications with users.

Monitoring

Finally, the real world can be an unpredictable place. It’s important to monitor the quality of models on an ongoing basis. This ensures that app issues are caught and then root caused to identify fixes.

What’s new for LLM Observability with this launch

This launch brings together a number of different key developments to provide a rich observability experience to users.

  • Enterprise-class scalability.  In this latest version of TruEra AI Observability, major updates have been made to fit the largest classes of LLM models that customers have in use. TruEra, for example, can now monitor up to hundreds of thousands of events per second. This scalability ensures that TruEra can easily manage the requirements of all of its customers, from the individual developer to the largest enterprises. 
  • Seamless transition from individual developer to team or enterprise use. Existing TruLens users only have to add one line of code to connect their LLM runtime environment to TruEra. When an LLM app matures from being developed and tested by a single developer using TruLens, it can easily transition to TruEra, where it can be managed by the teams that monitor LLM apps to ensure ongoing quality in production. Developers can also start with testing and evaluation in TruEra.
  • Easy-to-use collaboration tools and upgraded UX help apps reach production quality faster. Developing and pushing an LLM app to production requires the coordinated, informed efforts of multiple people across an organization. Now, AI developers can easily share the results of their work using TruEra’s Gen AI dashboards and collaborate using native workflows in a new, easy-to-use UI.
  • Robust monitoring for LLM apps. Previously, users of TruLens had only elementary views into app performance in production. This new version of TruEra AI Observability provides robust, production-grade monitoring views for LLM apps.

Updates to the TruEra AI Observability Platform

The launch builds on a number of significant improvements to the TruEra infrastructure.

Streaming ingestion and the Kappa architecture

With this launch, we have significantly overhauled how data gets streamed into TruEra to reduce latency and improve the throughput of traces being ingested. This update leverages the Kappa streaming architecture, which allows stateful stream processing. For users, this means that batch and stream processing both now go through the same endpoints, leading to more consistent behavior and fresher data. All ingestions now have close to sub-second latency. Further, this allows us to make all the services stateless. 

Instrumentation and Lenses

LLM apps come in all shapes and sizes, with a variety of different control flows. As a result, it’s a challenge to consistently evaluate parts of an LLM application trace. Therefore, we’ve adopted the use of lenses to refer to parts of an LLM stack trace and use those when defining evaluations. For example, the following lens refers to the input named query of the app’s retrieve step:

Select.RecordCalls.retrieve.args.query

Such lenses can then be used to define evaluations, like so:

import numpy as np

from trulens_eval import Feedback, Select

# Question/statement relevance between the question and each context chunk.
# (Assumes an LLM-based feedback provider has already been configured as `provider`.)
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on(Select.RecordCalls.retrieve.args.query)
    .on(Select.RecordCalls.retrieve.rets.collect())
    .aggregate(np.mean)
)

Fitting into the broader LLM tooling ecosystem

The TruEra platform is designed to fit in with the emerging LLM tech stack through native integrations. The diagram below captures the standard LLM tech stack app developers are adopting. The TruEra platform leverages the wide range of TruLens integrations to fit seamlessly on top of the stack and provide observability throughout the stack.

[Figure: the emerging LLM app tech stack, with TruEra observability integrated throughout]

The figure below shows some of the most common integrations our users rely on.

[Figure: common TruLens integrations]

Enterprise Readiness

All of the powerful functionality described above is part of the SOC 2 Type II-compliant TruEra cloud platform. This approach provides the widest range of options for enterprise customers to deploy TruEra: customers can deploy within their VPC, on TruEra cloud, or on an isolated cloud instance. TruEra offers flexible authorization that enables secure collaboration and sharing with a wide range of stakeholders, along with encryption at rest and in motion.

We think that this is the most powerful and comprehensive AI Observability solution available, and you can get started by downloading TruLens today for free, or contacting our TruEra team to get a demo of the full cloud solution.

Best Practices in Developing and Deploying LLM Applications (January 17, 2024)
https://truera.com/best-practices-in-developing-and-deploying-llm-applications-webinar/

For Consumer-Oriented Retail, Brand, Manufacturing and Tech Businesses

TruEra recently conducted a private executive seminar with 40+ senior AI leaders to share lessons learned regarding best practices in developing and deploying LLM applications. The event featured an expert panel including: 

  • Ping Wu, CEO at Cresta
  • Erwan Menard, Director of Cloud AI Product Management, Google Cloud
  • Xun Wang, CTO at BloomReach 
  • Anupam Datta, President and Chief Scientist at TruEra

Key Takeaways from the Expert AI Panel

The panel discussion had the following key take-aways:

  • Companies are moving up the maturity curve. Companies are moving their LLM app focus from demos and experimentation to achieving business success in production.
  • Quality and security are top pain points. Generating relevant, high-quality output and meeting data security and governance requirements are the top two pain points in achieving LLM success.
  • There are best practices for better LLM app quality that can be put in place today. Achieving relevant, high-quality output requires implementing best practices for defining quality, new ways to cost-efficiently evaluate and track LLM app quality throughout the life cycle, and good UX design.
  • Thoughtful design and guardrails are key to governance. Achieving data security and governance requires thoughtful model selection and app architecture design and implementing application-appropriate guardrails.
  • We are just at the beginning of increasing innovation and effectiveness. We are likely to see multiple models and multiple architectural approaches being successful, and continued improvements in cost, latency, and quality.

Moving from Demos to Real Business Success

The panel agreed that 2023 was the year for building prototypes of LLM apps – but that relatively few actually made it into production. TruEra’s past surveys have shown that only  approximately 10% of companies have moved more than 25% of their LLM applications into production.  More than 75% of companies have not moved any LLM apps into production at all.

Figure 1: Share of LLM applications that have moved to production (TruEra survey, November 2023)

Now, however, as Erwan Menard, Director of Cloud AI at Google, pointed out in the panel, companies are under pressure to move more LLM applications into production and, most importantly, to achieve business success with them. Whereas 2023 was the year for investigation and prototyping, 2024 will be the year for achieving business impact with LLM apps. Some of these applications will achieve business success, but some may not, especially if companies do not invest in best practices around the development and deployment of LLM apps. As with any hype cycle, the panel recognized that there may be some disillusionment amid the successes.

Pain Points in Achieving LLM Application Business Success

During the panel, we conducted a poll to understand the audience’s top 3 pain points in building and deploying successful LLM apps. The clear, top two pain points in achieving LLM success were:

  1. Generating high quality output
  2. Meeting data security and governance requirements 

Respondents were also concerned about costs; LLM model selection/optimization (e.g., fine tuning); user adoption; and application optimization (e.g., RAG/vectorDB optimization, prompt engineering, etc.).  

Best Practices for Addressing LLM App Quality, Relevance, and Hallucinations

Panelists and participants in our breakout discussions agreed that the top challenge was ensuring that LLM applications generated high quality, relevant output. For some applications, such as the generation of artistic content, LLM hallucinations are a beneficial feature. But for many business applications, such as summarization and Q&A applications, hallucinations are bugs.

The best practices for addressing these challenges discussed by the panel and within breakout sessions include:

  1. Define quality or relevance for your LLM application. Xun Wang, CTO of BloomReach, identified this as a key challenge that can become a significant unknown in development if not defined. The definition of quality will vary across applications, use cases, and verticals. For financial or healthcare applications, for example, high-quality, relevant results are a must-have. Thus, creating vertical-specific benchmarks is useful for both provider and customer to align on what quality means for a use case. In some applications, for example, there may be value in leveraging some of the general knowledge of the LLMs versus only using a company-identified knowledge store, while finding other ways to mitigate hallucinations using System 2 methods like Chain of Thought.
  2. Develop new ways of evaluating LLM apps. Anupam Datta, Co-Founder & Chief Scientist of TruEra, pointed out that generative apps require new ways of evaluating quality and accuracy compared to prior apps, such as traditional classification/regression ML apps that use accuracy measures, and earlier language apps, such as search. Many measures of quality from language applications, such as precision, recall, and ranking, do apply, but new measures such as groundedness (i.e., ensuring that the final response is grounded in the source documents) are necessary given the new dynamics that arise from hallucinations. TruEra has developed best-practice methodologies for LLM evaluation such as the RAG Triad (see “How to Prevent LLMs from Hallucinating?“).
  3. Track experiments during development. Anupam discussed how evaluations during development can be critical for running application design experiments and understanding the performance of different designs. For example, evaluation experiments that measure quality metrics such as relevance and groundedness, along with cost and latency, can be used to select the best LLM for an application and to optimize prompts and vector database configurations (a code sketch of this kind of experiment tracking appears after this list). 
  4. Monitor quality measures during production. Xun mentioned that it is also critical to continue to track and evaluate quality in production. There is no way to identify all of the potential quality issues during development. In production, app developers will see new prompts that they have not anticipated – just like Google search sees >15% new queries each day. The content in an application’s knowledge store will change. The behavior of the LLMs will also change over time as they are updated. Monitoring will be critical to maintain application quality and consistently achieve business success. Anupam pointed out that developers will need to consider the cost and latency of app evaluations themselves. For example, using other LLMs to evaluate an LLM app might work during development, but this will not scale cost-effectively, since the cost of monitoring could then become 10X the cost of running the app. 
  5. Use UX to create a quality application experience. Ping Wu pointed out that Cresta improves the quality of applications through a number of UX features, including providing references for a response, soliciting and responding to human feedback, incorporating human feedback, and integrating generative content with heuristics and playbooks.
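
To make points 2 and 3 concrete, here is a minimal sketch of evaluation-driven experiment tracking using the open source TruLens (trulens_eval) library. The two candidate apps (chain_v1 and chain_v2), the shared feedbacks list, and the app_id values are hypothetical placeholders, and exact function names may vary across TruLens versions.

# Hypothetical setup: chain_v1 and chain_v2 are two candidate app designs (for example,
# different LLMs or prompts), and `feedbacks` is a shared list of feedback functions
# such as relevance and groundedness.
from trulens_eval import Tru, TruChain

tru = Tru()

# Record each candidate design under its own app_id, with the same evaluations.
recorder_v1 = TruChain(chain_v1, app_id="rag-v1", feedbacks=feedbacks)
recorder_v2 = TruChain(chain_v2, app_id="rag-v2", feedbacks=feedbacks)

for recorder, candidate in [(recorder_v1, chain_v1), (recorder_v2, chain_v2)]:
    with recorder:
        candidate("What is our refund policy?")

# Compare average feedback scores, latency, and cost across the two designs.
print(tru.get_leaderboard(app_ids=["rag-v1", "rag-v2"]))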

Best Practices for Addressing LLM App Data Security and Governance

The other major set of problems discussed included data security and governance. These problems have increased in priority as companies start to define the data security, safety, legal, and governance requirements they need to meet to move applications into production.

The best practices for addressing these challenges discussed by panelists and within breakout sessions include:

  1. Utilizing different types of architectures. Erwan pointed out that there are many flavors of RAG architectures depending on the strength of the governance requirements. For instance, some RAG architectures may use keyword search for retrieval, while others might use highly optimized vector databases.
  2. Utilizing different models. Ping discussed how Cresta gains greater control, efficiency, and explainability by using medium size language models for certain use cases in addition to LLMs. 
  3. Implementing guardrails. Governance can be addressed by implementing guardrails that detect and/or prevent governance issues such as untruthful output. Anupam pointed out that these guardrails are similar to evaluations, but to serve as guardrails they need to be low latency so they can run in real time. This requires specialized models like those that TruEra has developed, as LLM-based evals are both higher latency and higher cost.
  4. Use governance services. Erwan pointed out that new governance tools and services are being developed. For example, Google offers copyright infringement protection.

An exciting multiple model future for GenAI and LLM Apps

The panel expressed universal excitement for the future of generative AI. While there might be some examples of failures, they believed that LLM applications will achieve significant business impact. In addition, the costs and performance of LLMs are bound to improve, increasing ROI and growing the use of LLM applications. The group also envisioned that multiple models will be used for different applications. Xun believed that open source LLMs will see increasing adoption. Evaluations during development and production will significantly increase LLM adoption. New technology solutions for evaluation, such as those being built by TruEra, will be needed to enable broad adoption of LLM applications. The adoption of generative AI is only just beginning.

Interested in seeing the panelists in action? Check out a recording of “Best Practices in Developing and Deploying LLM Applications,” which is now publicly available.


Hackathon Wrap-Up! (January 16, 2024)
https://truera.com/hackathon-wrap-up/

From Dec 1 to 11, TruEra hosted its first-ever hackathon in partnership with Google, Zilliz, LlamaIndex, and LabLabAI. We had 3,361 participants divided into 300 teams. Over ten days, talented hackers used Google Vertex AI, TruLens, Milvus, and LlamaIndex to build and iterate on 40 LLM-based applications. As we worked with these teams to build their applications, a few themes emerged:

  • Theme 1: Quick feedback and logging accelerated development. Reliable, comprehensive evaluations connected to a system for tracking experiments allowed teams to understand the performance of their app with every change they made. This allowed them to quickly find new failure modes and iterate to find the best possible solution.
  • Theme 2: After evaluating the RAG triad (answer relevance, context relevance, and groundedness), teams expanded the coverage of evaluations with evals for harmlessness and helpfulness. This was particularly critical as the winning apps were customer-facing.
  • Theme 3: After establishing that their applications met out-of-the-box criteria for honesty, harmlessness, and helpfulness, the winning teams tuned their applications with custom evals. Because they were the subject matter experts in the domain of their app, the developers used that knowledge to write business requirements and turn those into evaluations of their app (a sketch of one such custom evaluation appears after this list).
  • Theme 4: AI was used for good! The winning teams made it easier to learn, assisted those of us with diverse cognitive abilities, and assisted in solving complex immigration challenges.
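
As a hedged illustration of Theme 3, a custom evaluation in TruLens can be as simple as a provider method that scores a response against a domain-specific business requirement. The BusinessRules class, the simplicity heuristic, and the import path below are hypothetical examples, not the winning teams' actual code.

from trulens_eval import Feedback
from trulens_eval.feedback.provider.base import Provider  # import path may vary by TruLens version

class BusinessRules(Provider):
    # Hypothetical business requirement: responses should use short, plain sentences.
    def simplicity(self, response: str) -> float:
        sentences = [s for s in response.split(".") if s.strip()]
        avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
        # Map average sentence length to a 0-1 score, where 1.0 means very simple.
        return max(0.0, min(1.0, 1.0 - avg_words / 40.0))

# Run the custom check on every app response, alongside the built-in evals.
f_simplicity = Feedback(BusinessRules().simplicity, name="Simplicity").on_output()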

We received fantastic submissions, but three really stood out.

🥉Winner #3: Study Buddy App 

Team: Yesid Leonardo López Sierra, Juan Carlos Ortize Drada, Sara Ortiz Drada and Steven Quintana. 

Study Buddy is an innovative iOS application designed to enhance the studying experience for students. It offers unique functionalities, such as a Q&A chatbot that enables students to query any image or PDF in their course materials, and automated flashcard creation that instantly converts sections of their course materials into custom flashcards. This app is powered by Google Cloud Platform (GCP), using a combination of advanced models (e.g., Cloud Vision for OCR, PaLM for natural language processing tasks, and Google Cloud Text-to-Speech), and uses Zilliz as its vector database. TruLens allowed the Study Buddy team to identify a need for post-processing the OCR step of the application – dramatically improving answer relevance and lowering app latency.

🥈Winner #2: SimplifAI 

Team: Sara Diaz del Ser, Joel Weiss, Ismael Delgado,  Amalia Cid, Paulina Aguiló and Alberto Garcia Garcia.

SimplifAI is a user-friendly web application that effortlessly converts any given English text into ‘Plain English,’ a simplified form of writing designed to enhance understanding for people with diverse cognitive abilities. It can also help students for whom English is not their mother tongue. As a result, its user base is potentially wide, ranging from individuals with learning or mental disabilities to caregivers, educators, and any professional aiming to communicate with a diverse audience. SimplifAI uses a similar tech stack to Study Buddy.

The SimplifAI team showcased a full range of evaluation metrics, including BLEU and language match, along with custom evaluations that confirmed the simplicity of the resulting text, a key business requirement. Lastly, to explain which tokens in the LLM response were most influential to its complexity, the team used an integrated gradients technique available in TruLens-Explain.

🥇Winner #1: Ask Priya 

Team: Bassim Eldath, Fadil Faizal, and Phuc Nguyen.

Ask Priya is a pioneering AI chatbot designed to streamline the process of acquiring US immigration information, leveraging data from the United States Citizenship and Immigration Services (USCIS) website. Ask Priya is a classic retrieval-augmented generation app whose answers are grounded in USCIS-indexed data and evaluated using TruLens feedback functions. Considering the continuous and massive flows of immigrants seeking to move to the U.S. and the complexity of its immigration policy, Ask Priya addresses a pressing issue. 

During development, the Ask Priya team added a “development mode”, which accelerated iteration by allowing the team to test the application while receiving feedback from TruLens on the quality of the LLM responses. On top of the RAG triad to assess hallucinations, the team also built in language match feedback functions as their app seeks to serve a diverse set of users.

Other remarkable submissions

Two other applications came very close to the winning teams and are worth presenting. First, we have Huddle, a professional networking app that enables users to connect with interesting professionals selected based on professional backgrounds, work experiences, and objectives. If there’s a match, the app schedules a video call for both parties, thereby streamlining the networking process. Customer Chatbot API also made a great impression. This app improves customer engagement and optimizes the online shopping experience through more personalized product recommendations and conversational search. Both apps used the RAG triad from TruLens to validate the lack of hallucinations in their final product.

This blog post was co-authored by Lofred Madzou and Josh Reini.

The EU AI Act is here. What does it mean for non-EU firms? (December 12, 2023)
https://truera.com/the-eu-ai-act-is-here-what-does-it-mean-for-non-eu-firms/

The EU Artificial Intelligence Act (AIA) is, after much debate and anticipation, finalized.  The long journey since April 2021, with its tortuous negotiations and messy compromises, can be confusing even for those of us in Europe, who have been following it closely. I suspect it is even more bewildering if you are sitting in California, Dubai, or Singapore, and lack a large European presence to help make sense of it.

What does the AIA actually say?

The final text has not been shared yet, but the broad outline is clear. The AIA aims to ensure that AI systems operating within the European market respect fundamental rights and EU values. In practice, it places specific obligations on public and private organizations that build or use AI systems, based on the perceived risks associated with specific use cases. For instance, emotion recognition and biometric categorisation to infer sensitive data are completely banned. In contrast, CV-sorting software for recruitment procedures and credit applications are authorized but labeled as high-risk and thus subject to strict obligations before they can be put on the market. 

One area that has attracted a lot of attention this year is the AIA’s approach to General Purpose AI (GPAI) and Foundation Models. The latter must comply with specific transparency obligations before they are placed in the market, such as drawing up technical documentation, complying with EU copyright law and providing information about the content used for training. Plus, high-impact GPAI models which pose systemic risk will be subject to additional binding obligations related to performance evaluation, risk management, adversarial testing, and monitoring of incidents. These new obligations will be operationalised through codes of practice developed by a multistakeholder community that includes  industry actors, academics, civil society, and EU commission officials.

So is this “GDPR for AI”?

Many commentators have predicted that the EU AIA will have a global impact, similar to the General Data Protection Regulation (GDPR) which has set the standard on data privacy.  

However, there are some important differences from GDPR. For one thing, the breadth and complexity of the regulatory landscape will be much greater, because the AIA complements sectoral regulations rather than replacing them. That’s particularly true for highly regulated industries such as financial services and healthcare. Navigating the interplay between existing regulations and the AIA will be critical to getting implementation right.

Also, given the strategic importance of AI, other countries will probably develop their own AI governance standards based on their national economic and geopolitical goals. In the long term, these regulations are likely to converge to facilitate cross-border trade. But in the short term, business executives face the possibility of having to deal with different regulatory expectations depending on the jurisdictions in which they operate.

Further, some companies may be tempted to provide a different version of their AI products and services to the EU for the reason mentioned above. Here, I’m particularly thinking about foundation models. Some market leaders have already indicated that they are reluctant to disclose details of their training methods and data sources.

Can’t you just wait for a few years before acting?

The easy answer would be to wait and watch. After all, there is work to be done to establish the underlying standards, and a two-year grace period is usual with such large scale regulations, anyway. But that would be the wrong response, for three very good reasons:

  1. There is solid business rationale to build most of the underlying principles into your product and strategy from the outset. Whether you are a provider or user of AI systems, you will want to ensure that they are trustworthy. You will want to prevent harm to your customers or your own reputation. You are much more likely to achieve greater adoption and value at scale if you address these issues from the start, even if you are not immediately impacted by the regulations.
  2. Social and government pressure on AI safety and quality is not limited to the EU. Many of the same concerns are prominent elsewhere too, as seen in the recent US Executive Order. We expect that other countries and regions will hasten to implement their own guidelines or regulations in the near future, using the EU AIA as a guide, perhaps even setting them into action prior to the time when the EU AIA enters its enforcement period. So building up your “AI quality muscle” using the AIA as a catalyst will help you, no matter where you operate.
  3. Making your AI systems safe and trustworthy is not like flipping a switch. It requires sustained effort over time to be effective, across people, processes, technology and governance layers. Starting now is a necessity, not a luxury.

So what can you do today, sitting outside of the EU?

In order to minimize your general AI risk, or specifically to utilize AI in the EU, there are three places to start:

First, rapidly build a working hypothesis of your regulatory exposure. Are you using any AI system that may fall under the unacceptable risk category? What other AI use cases in your portfolio are likely to get tagged as high risk use cases? To what extent will existing regulations – and your mechanisms to adhere to them – cover you already?

Second, establish AI quality and risk management as core operational capabilities. As AI-based applications are increasingly important drivers of business and increasingly exposed to the end customer, AI quality and risk management need to move to the forefront.  They are not just a corporate affairs or government relations function, but something that is understood and owned by those sponsoring, building and using AI-enabled systems in your organization. 

These capabilities will help you track the evolution of detailed standards in response to the AIA, such as the work at the European Committee for Standardisation (CEN) and the European Committee for Electrotechnical Standardization (CENELEC). They will allow you to update your baseline hypothesis as requirements inevitably change and are refined over time. Most importantly, they will help you ensure that your processes, people and technology are ready for compliance in the coming years.

Third, get serious about the missing “AI quality” layer in your tech stack. Ensuring compliance with something as significant as the AIA (and its equivalents elsewhere in the world) will be highly challenging if you attempt to do it through manual effort alone. A systematic approach is needed – one that designs AI quality into the lifecycle of AI systems. Technology solutions that enable data scientists and ML engineers to address fairness, safety and transparency concerns have existed for some time. By embedding these as an integral part of your MLOps landscape, you can start testing, debugging, explaining, and monitoring your AI systems today.
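To make that concrete, here is a deliberately minimal sketch of a CI-style quality gate that blocks promotion of a model that misses a performance threshold. The file paths, metric, and threshold are illustrative assumptions, not requirements drawn from the AIA; in practice the same pattern extends to fairness, robustness, and explanation checks.

```python
# test_release_gate.py - illustrative pre-deployment check (paths and threshold are placeholders).
import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

def test_candidate_model_meets_release_threshold():
    # Load the candidate model and a held-out evaluation set produced earlier in the pipeline.
    model = joblib.load("artifacts/candidate_model.joblib")
    holdout = pd.read_parquet("artifacts/holdout.parquet")

    # Score the holdout set and compute a headline quality metric.
    scores = model.predict_proba(holdout.drop(columns=["label"]))[:, 1]
    auc = roc_auc_score(holdout["label"], scores)

    # Fail the CI job (and block the release) if quality drops below the agreed bar.
    assert auc >= 0.75, f"Candidate AUC {auc:.3f} is below the release threshold"
```

Running a check like this on every model version, rather than as a one-off audit, is the kind of lifecycle discipline that later compliance work can build on.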

The post The EU AI Act is here.  What does it mean for non-EU firms?  appeared first on TruEra.

TruLens Feedback Functions with Anthropic Claude https://truera.com/trulens-feedback-functions-with-anthropic-claude/ Fri, 03 Nov 2023 12:05:00 +0000 https://truera.com/?p=12726 We're excited to announce the integration of TruLens with LiteLLM to offer evaluations for a wide breadth of models supported by the LiteLLM Interface. Models available via LiteLLM include GPT-4, Llama-2, Claude-2, Cohere Command Nightly and more. Through the use of LiteLLM, anyone can now access the full suite of TruLens LLM evaluations including groundedness, context relevance, toxicity and more using the model best suited for your organization.

We’re excited to announce the integration of TruLens with Anthropic to enable anyone to evaluate LLM applications with Claude on the back end. Evaluate your LLM apps to ensure they are honest, helpful and harmless with TruLens feedback functions run with Claude.

Whether you are building your first prototype or are already deployed in production, the TruLens observability suite gives you the evaluations you need to understand your app’s performance.

Feedback functions available for evaluating LLM apps with TruLens.

How do you get started with TruLens and Anthropic?

Using Claude by Anthropic for automated LLM app evaluations is easy. After selecting Claude-2 or Claude-Instant as your model engine, any LLM-enabled feedback function can be integrated into your application.
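As a rough sketch of what this looks like in code, the snippet below sets up an answer relevance feedback function with Claude as the evaluation engine. It assumes the open-source trulens_eval package and its LiteLLM provider as the route to Claude; the exact provider class and model name may differ from the companion notebook.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider.litellm import LiteLLM  # assumed route to Claude models

# Use Claude as the evaluation engine (assumes ANTHROPIC_API_KEY is configured for LiteLLM).
claude = LiteLLM(model_engine="claude-2")

# Answer relevance: how relevant is the app's response to the user's question?
f_answer_relevance = Feedback(claude.relevance).on_input_output()
```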

After we’ve set up feedback function(s) with TruLens and Claude, we can create our recorder by wrapping our LLM application, in this case named chain, and passing in our feedback function(s).
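A sketch of that wrapping step, continuing the assumptions above (TruChain is the trulens_eval wrapper for LangChain apps; the app_id is just an arbitrary label):

```python
from trulens_eval import TruChain

# Wrap the existing LangChain app (named `chain`) and attach the feedback function(s).
tru_recorder = TruChain(
    chain,
    app_id="chain_evaluated_by_claude",  # arbitrary label for this app version
    feedbacks=[f_answer_relevance],
)
```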

Once we’ve done so, we can use tru_recorder as a context manager for the application. Critically, every call to the application made through the recorder will now be evaluated by Claude.
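In code, that usage might look like this (the query is a placeholder):

```python
# Each call made inside the `with` block is recorded, and the attached feedback
# functions are run with Claude as the evaluation model.
with tru_recorder as recording:
    chain("What does TruLens evaluate?")
```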

How does Anthropic Claude Perform on Evaluation Tasks?

To understand the performance of feedback functions, we ran a set of experiments comparing their scores against a curated set of human evaluations. We ran these experiments on each of the feedback functions commonly used for detecting hallucination: Context Relevance, Groundedness, and Answer Relevance. Finally, we ran these experiments using different backend LLM implementations, including GPT and Claude models.
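The comparison metric reported below is mean absolute error (MAE) against the human scores. As a purely illustrative sketch with placeholder numbers (not the actual experiment data), it can be computed as follows.

```python
import numpy as np

# Placeholder scores on a common 0-1 scale; the real experiments use curated test sets.
human_scores    = np.array([0.8, 0.2, 1.0, 0.6, 0.4])
feedback_scores = np.array([0.7, 0.3, 0.9, 0.6, 0.5])

# Lower MAE means the feedback function tracks human judgment more closely.
mae = np.mean(np.abs(feedback_scores - human_scores))
print(f"MAE vs. human evals: {mae:.2f}")  # 0.08 for these placeholder values
```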

Answer Relevance Performance Results

For the task of evaluating the relevance of LLM responses to questions, Claude Instant and Claude 2 performed nearly as well as GPT-3.5-Turbo, and better than GPT-4, when compared against the human evaluations in the test set. The test cases used for answer relevance were generated by the TruLens team to capture a variety of different question/answer pairs. You can find more information about these experiments in our open-source repository.

MAE compared to human evals for Claude, GPT models on answer relevance task.

Context Relevance Performance Results

For the task of evaluating the relevance of retrieved context to the user’s query, Claude Instant performed the closest to the human evaluations in our test set. The test cases used for context relevance were generated by the TruLens team to capture a variety of different query/context pairs. You can find more information about these experiments in our open-source repository.

MAE compared to human evals for Claude, GPT models on context relevance task.

Groundedness Performance Results

For the task of evaluating the groundedness of LLM responses in the provided source context, Claude Instant and Claude 2 performed nearly as well as GPT-3.5-Turbo and better than GPT-4. The test cases used for groundedness were generated from SummEval.

SummEval is a dataset dedicated to automated evaluation of summarization tasks, which are closely related to groundedness evaluation in RAG: the retrieved context plays the role of the source and the response plays the role of the summary. It contains human-annotated numerical scores (1 to 5) from 3 expert annotators and 5 crowd-sourced annotators. You can find more information about this evaluation in our open-source repository.

MAE compared to human evals for Claude, GPT models on groundedness task.

For all three tasks, we find comparable results across the models in our experiments.

Ready to evaluate LLM apps to ensure they are honest, helpful and harmless with TruLens and Claude? Try the Anthropic integration in the notebook below.

Colab Notebook

The post TruLens Feedback Functions with Anthropic Claude appeared first on TruEra.

The Executive Order on AI: Key Takeaways and Recommendations https://truera.com/the-executive-order-on-ai-key-takeaways-and-recommendations/ Thu, 02 Nov 2023 18:55:00 +0000 https://truera.com/?p=12750 “Develop standards, tools, and tests to help ensure that AI systems are safe, secure, and trustworthy.” This is a  sentence that sounds like it could have come directly from our mission statement here at TruEra. However, it appeared just this week sitting at the top of President Biden’s Executive Order on the Safe, Secure, and […]

“Develop standards, tools, and tests to help ensure that AI systems are safe, secure, and trustworthy.”

This is a sentence that sounds like it could have come directly from our mission statement here at TruEra. However, it appeared just this week, sitting at the top of President Biden’s Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.

For the past four years, my two co-founders, Anupam Datta and Shayak Sen, and I have been working with our dedicated team of AI experts on a very similar problem statement. The main addition is that it is also critical that AI systems are high quality and meet their performance targets while being safe, secure, and trustworthy. Together with a few other firms, we have been on a mission to prove – right here, from the heart of Silicon Valley – that technology can play a constructive role in translating good intentions around Trustworthy AI into practical, market-expanding outcomes. It is a positive vision for AI that our customers – including global banks, retailers, publishers, tech companies, and more – have both subscribed to and benefited from.

I believe that this Executive Order is a good start to achieving wise AI regulation for three reasons.

  1. It embraces messy, real-world complexities: The Executive Order acknowledges that AI safety and trust are too complex to be boxed into a single “AI law.” Instead, we must rely upon an established fabric of existing laws and regulatory frameworks in related areas, such as national security, fairness, workers’ rights, intellectual property, privacy, and industry-specific considerations such as those in finance or healthcare.
  2. It uses both the stick and the carrot: The Executive Order seeks to have real teeth, whether through the significant heft of US government procurement (which is likely to touch every serious AI player) or direct national security obligations for large language models. At the same time, it demonstrates – for example, by attracting and developing AI talent, and catalyzing AI research – that the US has no intention of giving up its position as the powerhouse of AI innovation.
  3. It focuses on pragmatic challenges vs. existential fears: In recognizing the risks that AI poses now – e.g., misinformation, fraud, security and privacy breaches – the Executive Order has implicitly left the existential fears around Artificial General Intelligence (AGI) to another day and time.

However, to ultimately achieve the goal of addressing societal concerns around AI while not inhibiting innovation, I would make the following recommendations.

  • Move faster. Those not in favor of regulation primarily oppose it on the grounds that it could slow down innovation. While heavy-handed regulation can certainly do that, I think that in many industries the uncertainty regarding regulation actually slows down innovation more than the actual regulation would. We have observed this dynamic with financial institutions and with companies in sensitive areas such as HR. The reality is that many companies are limiting AI use today because of the lack of clear rules around, for example, what level of disparate impact might be acceptable (a simple sketch of how disparate impact is typically measured appears after this list). As a result, they don’t know how risky using AI might be. If organizations use AI based on their own internal standards around bias, and regulators later say those standards are not good enough, they might find themselves in trouble even if they had the best of intentions. If regulators were to specify operating standards regarding safety, bias and trustworthiness, then companies could better assess the risk of using the technology and have greater confidence in moving forward with AI initiatives.
  • Don’t worry about whether the technology exists to address bias, safety and transparency concerns around AI. Some worry that companies may not have cost-effective ways to meet potential regulatory standards. They should not be worried: the technology to do this exists. It certainly exists for predictive AI. TruEra, for example, can enable data science teams to measure, test, and explain ML models for both performance and societal challenges, such as bias. As I wrote in the article “Resetting the Conventional Wisdom: Using AI to Reduce Bias,” if you actually care about reducing bias in the world, then you should be doing everything possible to speed up the adoption of AI and of AI quality and observability technologies, because that combination can produce far fairer decision-making than what exists in the world today.

    The situation is slightly different with generative AI. The technology to address these concerns is still being developed – but the bottom line is that reasonable (though not perfect) solutions to these challenges can be achieved. Progress is happening swiftly; regulatory adoption should also swiftly proceed.  
  • Don’t be overly prescriptive. When defining the rules for addressing bias, safety and other AI risks, regulators should avoid strictly specifying how companies must address these issues. Instead, they should set standards, such as the level of acceptable disparate impact, and then let companies determine how to meet those standards. Regulators should require some level of documentation and justification for a company’s chosen methods for achieving these standards, and retain the power to review a company’s documented approaches. They should enforce these standards by selectively reviewing the documented approaches and then publicly and iteratively assessing whether they are acceptable or not. Companies should not be penalized for good-faith efforts to adhere to the regulation, but only for repeated violations after remediation steps have been suggested. Taking a non-prescriptive approach will allow for innovation in how to address AI challenges. Collective learning, over time, can lead to regulations that address concerns in the most effective way.
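To make the disparate impact point from the first recommendation concrete, here is a minimal sketch of the disparate impact ratio that underlies the informal “four-fifths” rule of thumb. The decision data and the 0.8 reference value are illustrative, not a regulatory standard.

```python
import numpy as np

# Illustrative decisions (1 = favorable outcome) for two groups; not real data.
decisions = np.array([1, 1, 1, 0, 1, 1, 0, 0, 1, 0])
group     = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

rate_a = decisions[group == "A"].mean()  # favorable-outcome rate for group A (0.8 here)
rate_b = decisions[group == "B"].mean()  # favorable-outcome rate for group B (0.4 here)

# Disparate impact ratio: the disadvantaged group's rate over the advantaged group's rate.
di_ratio = min(rate_a, rate_b) / max(rate_a, rate_b)
print(f"Disparate impact ratio: {di_ratio:.2f}")  # the four-fifths rule of thumb flags values below 0.8
```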

For those of us at TruEra who have long been grappling with these challenges in their full complexity, the Executive Order renews our resolve for our mission and purpose. We are energized to see that the problems that we have been trying to solve are well-and-truly entering wider societal awareness. 

I have long been an advocate of well designed AI regulation that could speed up innovation, seeing it as the key to greater AI adoption and positive AI outcomes (see “Driving AI innovation in tandem with regulation). As the CEO of a US-headquartered company with clients and colleagues across the globe, I have appreciated the pioneering efforts of governing bodies such as Singapore, the United Kingdom, and the European Commission, which have developed proactive initiatives to encourage the safe and effective use of AI. This Executive Order is a very welcome addition to this chorus of voices for more effective AI governance, as it will bring the weight and breadth of the US government market to bear on these challenges, providing much-needed clarity for driving the safe, fair, and effective use of AI

The post The Executive Order on AI: Key Takeaways and Recommendations appeared first on TruEra.

5 Key Takeaways on the State of AI https://truera.com/takeaways-on-the-state-of-ai-wsai-2023/ Thu, 19 Oct 2023 15:38:45 +0000 https://truera.com/?p=12713 Reflections on GenAI, operationalizing AI, and the role of education from a week in Europe I spent the last week speaking at and hanging out with a few thousand startups, SMBs, and enterprises at World Summit AI in Amsterdam, having in-depth technical conversations with leaders of some of the biggest enterprises in London and Madrid […]

Reflections on GenAI, operationalizing AI, and the role of education from a week in Europe

I spent the last week speaking at World Summit AI in Amsterdam, hanging out with a few thousand startups, SMBs, and enterprises there, and having in-depth technical conversations with leaders of some of the biggest enterprises in London and Madrid that are leaning into AI. I have also been taking some time to reflect, read, and think deeply by the Thames in Richmond.

Here are 5 key takeaways:

1. GenAI conversations are maturing, moving from early experimentation to considerations for moving and maintaining GenAI apps in production.

In this context, teams are carefully thinking through the right tech stack for their purposes: Which LLMs should we use? Which vector databases? Which kind of advanced RAGs and agents make sense for our purpose? What tooling should we set up as we make the journey from experimenting in development to production? How should we evaluate, debug, and monitor our GenAI apps across their lifecycle? What does an MLOps stack look like?


2. Teams are thinking carefully about the business value from GenAI apps in production.

User experience and cost are key considerations that repeatedly came up in that context.

3. Trustworthiness is back in the spotlight.

Are GenAI apps for question-answering, summarization, agents, and more producing reliable answers? How do we guard against hallucinations? How do we ensure the appropriateness of responses, taking into account considerations of safety, toxicity, fairness, and more? This is being driven primarily by business considerations, although regulatory developments, especially the EU AI Act, are another driver. This area has seen significant research progress and rapid transitions from research to practical tooling.

4. Operationalizing and industrializing the adoption of AI models with end-to-end AI tech stacks continues unabated.

This includes the previous generation of AI models (XGBoost, BERT and more) in addition to the latest GenAI models and apps. Scalability, ease of integration, and functional capabilities for training, inference, and observability are key in this context.

5. AI education and training are important areas of focus.

The rise of GenAI has led to significant changes in the pipelines for building, deploying, and observing models and apps, creating a strong need for education and training.

It’s an exciting time for our field as we see significant research advancements and rapid transitions from research to open-source, scalable products, coupled with open education initiatives that are reaching millions worldwide and creating a community that can bring this technology to have a positive impact on the world. For me, it’s exciting to be part of all these facets of the GenAI movement via TruEra and various education initiatives at Stanford University, DeepLearning.AI, and TruEra’s AI Quality Workshop course on Udemy.

The post 5 Key Takeaways on the State of AI appeared first on TruEra.
