Salesforce Engineering Blog
See what's happening beyond the cloud.
https://engineering.salesforce.com/

Beyond CRM: How Salesforce Engineered an Enterprise Agent Platform for Any Workload
https://engineering.salesforce.com/beyond-crm-how-salesforce-engineered-an-enterprise-agent-platform-for-any-workload/
Wed, 11 Mar 2026 13:17:55 +0000

By Muralidhar Krishnaprasad.

Enterprises move quickly to adopt agent-based systems, yet many still assume they need to assemble bespoke stacks on hyperscalers to support serious, non-CRM workloads. Inside Salesforce Engineering, the challenge looked different. Our goal: design Agentforce, Data 360, and the broader platform as the enterprise-standard agent foundation. This foundation supports mission-critical systems, rich data context, and long-lived agent lifecycles without being tied to any single product surface.

Join us as we explore how Salesforce Engineering solved that problem at the platform level. We will examine how established perspectives shaped architectural choices, how the team integrated trust and governance from the start, and how we prioritized data, metadata, and transparency to build an agent platform that scales across enterprises and ecosystems.

Extending Salesforce Beyond CRM to Power Enterprise Agent Workloads

Salesforce not only powers sales and service workflows for enterprises around the world; its foundation now reaches far beyond traditional CRM tasks. Agentforce and Data 360 support enterprise-grade agent systems across industries and mission-critical environments.

The platform allows you to design agents that manage policy engines, custom backends, and specific industry logic within one architecture. Instead of treating agents as simple CRM additions, Salesforce provides the tools and governance needed to work across various systems. This ensures your workloads operate reliably at an enterprise scale.

Internally, our engineering team built the platform with a different intent. Design choices ensure Agentforce remains open, extensible, and customizable. Primitives like AgentScript and AgentGraph introduce deterministic structure into non-deterministic systems.

These primitives do not rely on CRM objects or workflows. Instead, they provide a generic mechanism for orchestrating tools, actions, and reasoning flows across enterprise systems. Data 360 complements this approach, acting as the system of context: it harmonizes and unifies disparate data from inside and outside of CRM, which enables agents to reason over structured data, unstructured data, and metadata.
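AgentScript and AgentGraph are Salesforce-specific primitives whose syntax is not shown in this post. As a language-agnostic sketch of the underlying idea, deterministic control flow wrapped around model-driven steps, a fixed orchestration graph might look like the following; all names here are hypothetical:

```python
def run_graph(graph, start, context, actions):
    """Walk a fixed orchestration graph: each node runs an action, and
    edges are chosen by the action's return value, so control flow stays
    auditable even when a node internally calls a model."""
    node = start
    trace = [node]
    while node in graph:
        outcome = actions[node](context)
        node = graph[node].get(outcome)
        if node is None:
            break
        trace.append(node)
    return context, trace

# Hypothetical flow: classify an intent, then branch deterministically.
graph = {
    "classify": {"refund": "refund_flow", "other": "escalate"},
    "refund_flow": {},
    "escalate": {},
}
actions = {
    "classify": lambda ctx: "refund" if ctx["intent"] == "refund" else "other",
    "refund_flow": lambda ctx: ctx.update({"done": True}) or "end",
    "escalate": lambda ctx: "end",
}
```

The point of the sketch is that the graph, not the model, decides which tools and actions run next; the non-deterministic reasoning is confined to individual nodes.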

Engineering Enterprise Trust, Security, and Governance for Agents

Enterprise agents operate close to sensitive data, business processes, and user identity. This proximity makes trust a non-negotiable requirement. Because even small failures in isolation or access control cause outsized consequences, the architecture treats trust as more than an application-level concern.

Agentforce builds on foundational Salesforce platform capabilities like identity, credential context, and policy enforcement. It also adds specific protections for agentic behavior. A dedicated trust layer addresses threats such as prompt injection and impersonation. This layer ensures that critical variables come from trusted actions and governed data inputs rather than raw user prompts. Furthermore, the system treats agent identity as a first-class concept to enable secure interactions within Salesforce and across external systems.
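The post does not detail the enforcement mechanism. As a rough illustration of the principle that critical variables must come from governed inputs rather than raw user prompts, a provenance guard might look like this; the source labels and function are hypothetical:

```python
# Hypothetical provenance labels for inputs an agent may act on.
TRUSTED_SOURCES = {"governed_data", "verified_action"}

def set_critical_variable(context, name, value, source):
    """Only accept critical variables with trusted provenance; values
    derived from raw user prompts are rejected before they can steer
    agent behavior."""
    if source not in TRUSTED_SOURCES:
        raise PermissionError(
            f"{name} must come from a governed source, not {source!r}"
        )
    context[name] = value
    return context
```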

Data governance remains a priority throughout the Agentforce and Data 360 integration pipeline. The system enforces rigorous guardrails and validates data before it undergoes chunking, indexing, or exposure for reasoning. These steps ensure that only policy-compliant information gets RAG’d into an agent’s context. Together, these controls allow agents to operate across systems and vendors while they preserve enterprise expectations around security, auditability, and data protection.
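As a minimal sketch of this kind of guardrail, where the policy check and chunker are stand-ins rather than Salesforce APIs, validation can gate the pipeline before anything is chunked or indexed:

```python
def ingest_for_rag(documents, policy_check, chunk):
    """Validate each document against governance policy before chunking
    and indexing; non-compliant documents never reach the index, so they
    can never be retrieved into an agent's context."""
    index = []
    for doc in documents:
        if not policy_check(doc):
            continue  # excluded from agent context entirely
        index.extend(chunk(doc))
    return index
```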

Context Beyond Data — Metadata, Personalization, Memory, and Insights for Reliable Agent Reasoning

Reliable agent reasoning requires more than data and tools as context. As enterprise sources evolve, data and tools need accompanying metadata depth and semantic grounding to provide full context. Simple metadata decoration fails in complex environments, so the core platform and Data 360 apply deeper metadata enrichment: they derive relationships, extract implicit structures, and use business terms and glossaries to create rich semantic representations. Agentforce agents reason with deep metadata context that reflects actual meaning instead of relying on static declarations or human descriptions alone.

Further, Data 360 not only stores agentic conversation history and other application engagement signals but also curates them into short-term, long-term, and episodic conversation memory, deriving further affinities and insights that are maintained as a user's personalization profile. Preferences, historical interactions, and behavioral signals unify into an intelligent context layer. Agents can then enrich their context with key conversational memory and user profiles, reasoning with deep user-specific context and responding in a personalized way. When enriched metadata, personalization, and memory context meet core data and tools context, they create a powerful foundation for reliable, trustworthy, enterprise-grade reasoning.
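As an illustrative sketch, with hypothetical class and method names rather than a Salesforce API, tiered conversational memory of this kind might be modeled as:

```python
from collections import deque

class AgentMemory:
    """Sketch of tiered memory: a bounded short-term buffer of recent
    turns, durable long-term preferences and affinities, and episodic
    summaries of past sessions."""

    def __init__(self, short_term_size=10):
        self.short_term = deque(maxlen=short_term_size)  # recent turns
        self.long_term = {}   # durable preferences and affinities
        self.episodic = []    # summaries of completed sessions

    def record_turn(self, turn):
        self.short_term.append(turn)

    def remember(self, key, value):
        self.long_term[key] = value

    def close_session(self, summary):
        # Distill the session into an episode, then reset the buffer.
        self.episodic.append(summary)
        self.short_term.clear()

    def context(self):
        return {
            "recent": list(self.short_term),
            "profile": dict(self.long_term),
            "episodes": list(self.episodic),
        }
```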

Keeping the Agent Platform Open Across Models, Tools, and Execution Surfaces

Enterprise customers demand flexibility. They require the freedom to choose models, integrate existing tools, and deploy agents across different surfaces. Locking into a single model provider or workflow remains unviable for modern business.

Agentforce supports multiple reasoning and prompt-build models, including those users provide. It leverages open standards like MCP to enable structured sharing of data and context and consistent tool invocation among AI agents and external systems. It also uses open standards like A2A to support orchestration of agents running both within and outside the Agentforce ecosystem. With MCP, users can expose tools through MCP servers that they host internally, via MuleSoft Agent Registry as part of MuleSoft Agent Fabric, or elsewhere — making them immediately available to agents. This approach integrates existing systems without duplicating tooling or rewriting logic.

Agents operate across various surfaces. Users can access Agentforce agents from Salesforce applications or external interfaces, meeting people where they work. This flexibility supports incremental adoption, so teams start with focused use cases and expand as confidence grows.

All of this sits on Data 360's open approach to a common data foundation: connectors, zero-copy operations with major ecosystem vendors, and open-format, file-based data sharing.

Avoiding Fragmentation in Multi-Vendor Agent Systems

Architectural fragmentation creates concerns as teams adopt agents across various vendors. Separate stacks for reasoning, orchestration, and governance increase coordination overhead. MuleSoft Agent Fabric addresses this complexity by providing a unified layer for agent discovery, cross-platform orchestration, identity propagation, governance and observability.

MuleSoft Agent Fabric allows you to register and orchestrate agents regardless of the vendor. This ensures heterogeneous agent ecosystems operate without duplicating infrastructure while maintaining strict isolation.

Policy-controlled context sharing remains a central feature. Users define exactly what data moves between agents when they interact across domains. These policies apply at the data and interaction layers to prevent unintended leakage and enable controlled collaboration across systems.
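A minimal sketch of policy-controlled context sharing, with hypothetical agent names and policy shape, could look like this:

```python
def share_context(context, source_agent, target_agent, policies):
    """Filter the fields that may cross an agent boundary. `policies`
    maps (source, target) pairs to the set of allowed field names; any
    field not explicitly allowed is withheld."""
    allowed = policies.get((source_agent, target_agent), set())
    return {k: v for k, v in context.items() if k in allowed}
```

Default-deny is the key design choice here: an undeclared agent pair shares nothing, which prevents unintended leakage across domains.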

Agent Monitoring and Observability — Operating a Fleet of Agents in Production

Enterprise agent programs don’t fail in the build phase — they fail after the first successful deployment. Once agents start handling real customer and employee work, the system can turn into a “black box”: users see outcomes, but not the reasoning path, tool calls, or configuration gaps that caused them. At that point, monitoring is no longer a nice-to-have SRE add-on; it becomes a core platform capability for trust, reliability, and iteration speed.

Agentforce approaches this problem by treating observability as a single mission control for both IT and business teams — not just dashboards, but a feedback system that connects production behavior back to configuration changes. Agentforce observability is positioned explicitly around this loop: monitor, analyze, and optimize performance in near real time, combining deep inspection with adoption and consumption visibility so teams can tie agent behavior to outcomes and cost.

Looking Ahead

Agentforce and Data 360 engineering decisions reflect a core platform philosophy. We build foundational capabilities first so higher-level agent behaviors emerge safely. By prioritizing trust, context, and interoperability, the platform supports both single-agent use cases and complex multi-agent systems, and extends manageability to multi-vendor agents with MuleSoft Agent Fabric.

Responsible agent adoption requires solving platform problems rather than focusing solely on generative AI aspects like model selection or prompt tuning, or narrowly on a few application use cases. Addressing these foundational issues upfront allows agents to operate reliably and securely at scale across a wide variety of workloads — CRM or not.

Learn more

The post Beyond CRM: How Salesforce Engineered an Enterprise Agent Platform for Any Workload appeared first on Salesforce Engineering Blog.

Engineering Platform Trust: Cutting Customer Case Volume 20x with Petabyte-Scale Health Signals
https://engineering.salesforce.com/engineering-platform-trust-cutting-customer-case-volume-20x-with-petabyte-scale-health-signals/
Mon, 09 Mar 2026 17:24:03 +0000

By Sanjeevani Bhardwaj, Ganesh Prasad, Sukumar Surya, and Thomas Bohn.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we spotlight Sanjeevani Bhardwaj, CSG Product Director, who leads the Technical Health Score to make platform trust measurable by scoring Salesforce implementations through analytics pipelines that process petabytes of telemetry and historical context.

Explore how the team engineered a system that converts platform trust into actionable signals by defining technical health consistently across multi-tenant environments and building scalable machine learning pipelines that deliver proactive health insights.

What is your team’s mission in building the Technical Health Score within Customer Success Core?

The team builds a transparency layer for the Salesforce platform to turn trust from a subjective sentiment into a measurable engineering signal. Understanding implementation health becomes difficult as you adopt more products and deepen your customizations. Technical Health provides an objective view of that status and offers a clear path toward improvement.

Trust erodes when health indicators stay fragmented across tools or hidden in logs until incidents occur. To solve this, the team designed a continuous feedback loop that aggregates signals across efficiency, security, operational excellence, customization, and observability. This structure allows you to identify risks and optimize your implementation before issues surface as escalations.

The ultimate goal centers on your independence. Maintaining a healthy Salesforce implementation requires continuous effort as your organization evolves, and this score guides that effort over time. By standardizing technical health through a consistent interface, the team helps you balance innovation with stability throughout the lifecycle of your Salesforce footprint.

Mission framework showing how Technical Health builds a transparency layer, transforming trust from subjective sentiment to measurable engineering signal, enabling customer independence through continuous feedback.

What definition and standardization constraints shaped how the team defined “technical health” for Salesforce customers?

Inconsistency creates a major hurdle for Salesforce users. Customers span various industries and architectural patterns, yet everyone needs a shared definition of health. Without a standard framework, technical status remains subjective and impossible to compare across different organizations.

The team introduced a five-pillar taxonomy to serve as a universal interface for technical health:

  • Security
  • Efficiency
  • Operational Excellence
  • Customization
  • Observability

Every signal maps into one of these pillars. This structure allows the system to evaluate health consistently regardless of which clouds or features you use. This abstraction helps the score scale across an evolving platform while maintaining its core meaning.

Standardization also requires a common health currency. The team normalized diverse metrics into a unified 1–100 scale, which allows you to view health holistically instead of interpreting disconnected indicators. Distribution-based normalization ensures the system evaluates you against peers with similar scale and complexity. This approach creates a definition of technical health that stays both precise and fair.
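Distribution-based normalization of this kind can be sketched as a percentile rank within a peer cohort; the function name and scale bounds below are illustrative, not the production implementation:

```python
from bisect import bisect_left

def normalize_to_scale(value, peer_values, lo=1, hi=100):
    """Map a raw metric onto a common health scale by percentile rank
    within a cohort of peers with similar scale and complexity."""
    ranked = sorted(peer_values)
    # Fraction of peers whose value falls below this one.
    rank = bisect_left(ranked, value) / len(ranked)
    return round(lo + rank * (hi - lo))
```

Because the score is relative to the peer distribution rather than to fixed thresholds, an organization is evaluated against comparable implementations instead of an absolute cutoff.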

What data-scale constraints shaped how the team curated technical health signals from petabytes of Salesforce telemetry?

Extracting meaningful health signals from a massive telemetry surface presents a significant data challenge. These signals originate from UI interactions, API traffic, and security configurations spread across various databases and logs. Many of these sources only retain raw data for short periods.

Engineering architecture addressing petabytes of telemetry through strategic signal curation and off-core analytics platform, ensuring system remains invisible to customer workloads.

The team designed the system around strategic curation instead of ingesting every data point. They identified signals that predict unhealthy behavior by focusing on common pain points like limits, errors, and security vulnerabilities. This method improves the signal-to-noise ratio and keeps the system manageable at scale.

The architecture runs all analytics on an off-core data platform. This isolation from live transactional systems prevents any impact on your daily operations. Aggregation occurs near the source to reduce data volume before ingestion. This approach allows the platform to process massive amounts of telemetry with historical context while remaining invisible to your workloads.

What correctness and explainability constraints shaped how the Technical Health Score distinguishes customer misconfiguration from platform issues?

Maintaining trust requires a clear distinction between platform behavior and user configuration. Performance issues often stem from both sources, but conflating them undermines the credibility of any health metric.

The team engineered a signal-qualification framework based on shared responsibility. Every signal must pass an actionability gate. If you cannot fix the issue through code or configuration changes, the system excludes that signal from your score. This ensures your Technical Health Score reflects your specific implementation choices rather than platform incidents.
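The actionability gate can be sketched in a few lines; the field names are illustrative:

```python
def qualify_signals(signals):
    """Keep only signals the customer can act on through code or
    configuration changes; platform-side issues are excluded so the
    score reflects implementation choices, not platform incidents."""
    return [s for s in signals if s["customer_actionable"]]
```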

Unified framework showing signal qualification mechanism and explainable ML pipeline — ensuring scores reflect only customer-actionable issues with a complete audit trail from score to root cause.

Transparency drives the modeling process. While complex neural networks offer theoretical accuracy, they often fail to explain why a score changed. The team built a multi-stage machine learning pipeline to prioritize explainability:

  • Signals normalize onto a common 0–100 scale using statistical distributions.
  • Partial Least Squares regression weights these signals against historical outcomes.
  • Simple weighted averages aggregate the final data.

This design provides a complete audit trail. You can drill down from a top-level score to individual root causes without any ambiguity.
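The final aggregation stage can be sketched as a weighted average that keeps each signal's contribution visible for drill-down. In the production pipeline the weights come from the PLS regression stage; here they are plain inputs:

```python
def aggregate_score(signal_scores, weights):
    """Aggregate normalized signal scores (0-100) into a pillar score
    via a simple weighted average, returning per-signal contributions
    as an audit trail."""
    total_w = sum(weights[name] for name in signal_scores)
    score = sum(
        signal_scores[name] * weights[name] for name in signal_scores
    ) / total_w
    # Audit trail: each signal's share of the final score.
    contributions = {
        name: signal_scores[name] * weights[name] / total_w
        for name in signal_scores
    }
    return round(score, 1), contributions
```

Because the aggregation is linear, any change in the top-level score decomposes exactly into changes in individual signals, which is what makes the drill-down unambiguous.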

What outcome-validation constraints shaped how the team proved the Technical Health Score drives measurable results?

Validating impact requires operationalizing the score within existing workflows. The team embedded Technical Health into customer success processes to trigger proactive engagement. This shift moves the focus from reactive support to preventive action.

Back-testing confirms the value of this metric. Data shows that users with low scores experience more high-severity incidents and higher costs. Users who improve their score from Fair to Excellent see case volumes drop by nearly 20 times. Support costs for these users also decrease by approximately 35 times.

This system provides significant benefits for both internal teams and users:

  • Internal teams reduce data gathering cycles from weeks to hours.
  • Users access 12 months of curated health history.
  • Proactive refactoring before peak seasons flattens support demand.

These outcomes prove that Technical Health serves as a lever for reliability. It provides a clear path toward sustained success on the platform.

Learn more

The post Engineering Platform Trust: Cutting Customer Case Volume 20x with Petabyte-Scale Health Signals appeared first on Salesforce Engineering Blog.

How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings
https://engineering.salesforce.com/how-data-360-optimized-kubernetes-scheduling-architecture-that-delivered-13-cost-savings/
Thu, 05 Mar 2026 16:37:58 +0000

By Padma Aradhyula, Dongwei Feng, Siddharth Sharma, and Anuja Gore.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we spotlight Padma Aradhyula, Senior Director of Software Engineering on the Data 360 Compute Fabric team, who manages a large-scale platform orchestrating four million Spark applications daily, nearly two million of them on Kubernetes.

Explore how Padma’s team optimized infrastructure cost at global scale: evolving Kubernetes scheduler behavior to eliminate node fragmentation under bursty Spark workloads, redesigning placement logic to proactively consolidate executor pods onto fewer nodes, and embedding efficiency directly into the scheduling layer to resolve the reliability tension created by reactive, autoscaler-driven node churn.

What is your team’s mission in building and operating the Data 360 Compute Fabric platform at Salesforce scale?

Our mission is to provide a resilient, hyper-scale compute foundation that powers the entire Data 360 lifecycle — from ingestion and modeling to activation. By abstracting the complexities of massive-scale distributed processing, we enable a unified ELT-first approach that eliminates fragmented point solutions and provides high data availability across batch and streaming workloads.

To meet Salesforce’s rigorous data freshness guarantees, our team orchestrates millions of Spark jobs daily, processing petabytes of data across global Kubernetes fleets. At this magnitude, we view operational reliability and Cost-to-Serve (CTS) as a single, inseparable objective.

Scaling successfully means ensuring that efficiency never comes at the cost of stability. As part of a broader suite of infrastructure initiatives, we’ve recently prioritized intelligent resource placement and high-density bin-packing. This ensures we maximize utilization while maintaining the “five-nines” reliability required for our customers’ most critical data workloads.

What architectural bottlenecks in the default Kubernetes scheduler placement logic led to node fragmentation and sub-optimal bin-packing for bursty Spark applications?

The primary architectural bottleneck stems from the default kube-scheduler scoring strategy, specifically LeastAllocated. While ideal for persistent microservices where high availability is prioritized through “spreading” (to minimize blast radius), this logic fails in a high-scale Spark environment for three core reasons:

1. Anti-Pattern: The “Scatter” Effect
By default, the scheduler seeks out nodes with the most free resources. In a bursty environment, when a large Spark job requests 100+ executors, the scheduler spreads them across the widest possible footprint. When these executors terminate — often non-deterministically due to Spark’s Dynamic Resource Allocation (DRA) — they leave behind nodes with 90% idle capacity but 10% active pods.

2. The Reactive Autoscaler Conflict (Karpenter)
To solve the idle capacity issue, we enabled Karpenter to consolidate the nodes. While Karpenter’s consolidation logic eventually attempts to “defrag” the cluster by moving pods, this is a reactive process. For Spark, this is often fatal; moving an executor means killing a running task, leading to job retries, extended runtimes, and stage failures. Hence, we had to tune down the consolidation thresholds to minimize pod disruption.

3. Lack of Workload Awareness
The default scheduler treats every pod as an independent unit. For example, it lacks the application awareness to recognize that 500 Spark pods belong to the same job and should ideally be co-located on the fewest possible nodes to facilitate efficient node reclamation once the job completes.
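The contrast between the two scoring directions can be sketched with simplified scoring functions. Real kube-scheduler scoring is more involved; these formulas only capture the direction of each preference:

```python
def least_allocated_score(requested, node_used, node_capacity):
    """Default 'spreading' behavior: nodes with more free capacity
    after placement score higher, scattering pods across the fleet."""
    free = node_capacity - node_used - requested
    return 100 * free / node_capacity

def most_allocated_score(requested, node_used, node_capacity):
    """Bin-packing behavior: already-busy nodes score higher, so new
    executors stack onto utilized nodes before new ones are provisioned."""
    return 100 * (node_used + requested) / node_capacity

# A 4-core executor choosing between a near-empty node (2/32 cores used)
# and a half-full one (16/32): the two strategies prefer opposite nodes.
```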

Padma highlights her team’s research projects.

What trade-offs emerged when autoscaler-driven consolidation was used to reclaim underutilized Kubernetes nodes for Spark workloads?

Using autoscaler-driven consolidation (like Karpenter’s) to reclaim fragmented capacity creates a direct conflict between CTS and job-level SLA stability. While consolidation identifies underutilized nodes, it relies on reactive eviction — terminating nodes and forcing active Spark executors to move.

These disruptions are particularly “expensive” for Spark. Evicting an executor mid-stage triggers task retries and the loss of local shuffle data, which can lead to cascading delays and extended job runtimes. We found that the compute cost of re-running failed stages often offset the 10%–15% gains in raw utilization.

How was the Kubernetes custom scheduler redesigned to implement proactive, high-density bin-packing as a first-order placement primitive for Spark workloads?

Our team solved the challenges described above with the default Kubernetes scheduler and Karpenter node consolidation by moving to a density-focused placement strategy. We introduced a custom scheduler that uses a MostAllocated approach to pack executors onto already-utilized nodes. This change eliminates fragmentation at the source and ensures the cluster behaves efficiently during workload spikes.

The transition to proactive bin-packing required a fundamental shift in the scheduler’s filtering and scoring phases. Our design prioritized “filling” existing nodes to their resource limits before allowing the provisioner to spin up new capacity.

To achieve this, the compute fabric team adopted the MostAllocated scoring strategy through the NodeResourcesFit plugin. This logic assigns the highest score to nodes that are already running workloads but still have available headroom. By “stacking” new Spark executors onto these nodes, we maximize the utilization of already-paid-for compute.
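In upstream Kubernetes, this scoring strategy is configured through the NodeResourcesFit plugin in a scheduler profile. A minimal configuration along these lines might look like the following; the scheduler name is illustrative, and Salesforce’s custom scheduler may differ in its details:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: binpacking-scheduler   # illustrative name
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated          # prefer already-utilized nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Pods opt into this behavior by setting `schedulerName` in their spec, which allows density-packed Spark executors to coexist with default-scheduled workloads in the same cluster.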

What validation challenges did the team address to ensure Kubernetes bin-packing improved efficiency without increasing workload disruption at scale?

The team monitored millions of daily Spark jobs to ensure higher utilization did not compromise stability. Results from the production rollout confirmed that the bin-packing scheduler improved resource efficiency while maintaining performance.

CPU and memory utilization rose by roughly 15% as workloads packed more densely onto active nodes. This shift led to a 13% reduction in compute infrastructure costs. These savings represent a significant impact on the annual budget for the Data 360 platform.

Reliability also improved during this transition. Autoscaling now terminates empty nodes instead of evicting active ones, which cut EC2 node disruption rates by 50%. Spark applications now benefit from fewer executor losses and more predictable runtimes. Proactive scheduling secures both cost efficiency and operational stability.

Padma shares what keeps her at Salesforce.

Learn more

The post How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings appeared first on Salesforce Engineering Blog.

Delivering Accurate, Low-Latency Voice-to-Form AI in Real-World Field Conditions
https://engineering.salesforce.com/delivering-accurate-low-latency-voice-to-form-ai-in-real-world-field-conditions/
Mon, 02 Mar 2026 18:26:48 +0000

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we feature Rajashree Pimpalkhare, SVP of Software Engineering, Field Service, and the team responsible for voice-to-form data capture in the Field Service Mobile application, which delivers AI-powered mobile experiences to a field workforce supporting hundreds of thousands of active technicians each month.

Discover how her team developed a hybrid on-device and cloud architecture to accurately translate unstructured voice input into structured form data at an enterprise scale, ensured reliable performance across various accents and noisy field conditions through real-world voice testing, and managed latency, cost, and privacy by keeping speech-to-text on the device while leveraging cloud LLMs for intelligent field mapping.

AI-driven data flow process diagram.

What is your team’s mission as it relates to building voice-to-form data capture for the Field Service Mobile application?

Our mission focuses on streamlining field work. We empower technicians to capture data quickly, safely, and accurately using natural voice interactions. Field technicians often work in environments where traditional data entry is difficult, such as when wearing gloves, handling equipment, or in dangerous locations. This makes voice a more effective way to input information.

From an engineering standpoint, our mission goes beyond simple speech recognition; it involves intelligent data capture. Technicians provide a natural summary of their work, and the system directly maps that input to structured form fields. Form structures, field semantics, and technician language differ significantly across customers and industries. Therefore, this mapping requires semantic understanding, not just deterministic parsing. Without AI-based semantic reasoning, this method would depend on rigid, form-specific rules, which would not scale across various industries or schemas.

Voice-to-form is a core feature within Field Service Mobile. It integrates directly into existing record editing and form workflows. This approach allows for gradual adoption without introducing new interaction models or requiring user retraining. The outcome is a production-grade experience that enhances efficiency while meeting enterprise demands for accuracy, reliability, and trust.

What accuracy constraints did you encounter when mapping unstructured voice input into structured form fields at enterprise scale?

The central accuracy challenge involved converting free-form speech into correctly populated, structured fields. This task spanned diverse industries, form designs, and technician speaking styles. Technicians commonly use domain-specific terminology, abbreviations, and relative date references. The system must interpret these accurately within each field’s data type and format.

As the number of form schemas increases, deterministic approaches would demand per-form logic to manage overlapping field names, varying data types, and context-dependent references. This quickly leads to a combinatorial maintenance issue. To resolve this, the team developed a hybrid architecture. This combines on-device speech-to-text with cloud-based large language models for semantic field mapping. Each request incorporates schema-driven metadata — field types, constraints, examples, and formatting expectations — encoded directly into the prompt alongside the user’s utterance. This avoids relying on post-processing heuristics.
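The schema-driven prompting described here can be sketched as follows, with a made-up schema shape and prompt wording, since the production prompt is not shown in the post:

```python
def build_mapping_prompt(utterance, form_schema):
    """Encode field names, types, and formatting hints directly into the
    prompt so the LLM can map free-form speech to structured fields
    without per-form post-processing heuristics."""
    field_specs = "\n".join(
        f"- {f['name']} ({f['type']}): {f.get('hint', '')}"
        for f in form_schema
    )
    return (
        "Map the technician's summary to the form fields below.\n"
        f"Fields:\n{field_specs}\n"
        f"Summary: {utterance!r}\n"
        "Respond with JSON using only the listed field names."
    )
```

Because the schema travels with every request, the same prompt builder generalizes across form variations instead of requiring per-form mapping logic.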

AI proved to be the only practical method to generalize intent resolution across hundreds of form variations without hardcoding logic. The team validated accuracy through iterative testing. This involved various device classes, form sizes, and real-world noise conditions, utilizing a growing collection of authentic technician utterances. Evaluation focused on correct field assignment and valid value population, achieving 85% field-level accuracy, which serves as a robust production baseline.

What reliability constraints emerged when supporting diverse voices, accents, and noisy field environments across real technician workflows?

Reliability challenges arose from the varied conditions in real-world field environments. These included differences in accents, speech cadence, vocabulary, and background noise from traffic or machinery. Such conditions can create inconsistency if not specifically addressed in both architecture and testing.

The team established reliability engineering in real-world conditions by creating a Voice Utterance Library. This library contained authentic technician voice clips captured during field ride-alongs. They systematically combined these utterances with various noise profiles and replayed them through the entire pipeline. Failures were categorized based on whether errors originated in transcription, semantic interpretation, or field assignment. This allowed for targeted refinement and made AI behavior observable rather than opaque.
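The replay-and-categorize idea can be sketched as a small harness: each test case carries ground truth for every pipeline stage, and a failure is attributed to the first stage that diverges. Stage names and the result shape are assumptions for this example.

```python
# Illustrative failure-attribution sketch: compare expected vs. actual output
# at each pipeline stage (transcription -> interpretation -> assignment) and
# report the first stage that diverged, making AI behavior observable.

def categorize_failure(expected: dict, actual: dict) -> str:
    """Return the first stage whose output diverged, or 'pass'."""
    for stage in ("transcription", "interpretation", "assignment"):
        if expected[stage] != actual[stage]:
            return stage
    return "pass"

expected = {
    "transcription": "replaced two filters",
    "interpretation": {"parts_used": 2},
    "assignment": {"parts_used": 2},
}
# Simulate a noise-corrupted replay where only the transcript degraded
noisy_run = dict(expected, transcription="replaced blue filters")
```

Running every library utterance, mixed with each noise profile, through this check yields a per-stage failure breakdown that targets refinement where it matters.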

On-device transcription, utilizing native iOS and Android speech frameworks, provided consistent performance in mobile environments. When transcription quality fluctuates, technicians can review and edit the text before processing. This prevents low-confidence inputs from spreading into structured records. This layered strategy ensures reliable performance across diverse field conditions.

What latency constraints shaped how you balanced on-device speech-to-text with server-side text-to-form processing for voice workflows?

Latency directly affects usability in the field. Technicians expect quick feedback, even when network conditions vary. The team needed to minimize perceived delay while still using cloud intelligence for semantic understanding.

The architecture separates transcription from semantic processing. Speech-to-text runs entirely on the device, removing network dependency and providing predictable performance. Only the resulting text and metadata are transmitted to the server for field mapping, which reduces payload size and avoids audio transmission. This separation ensures AI inference applies only where semantic reasoning is necessary.

The system completes text-to-form processing in a single server round-trip, avoiding compounded delays. A review step allows technicians to edit transcriptions before submission, adding a quality gate without stopping progress. Together, these choices enable end-to-end completion in under 15 seconds, preserving responsiveness in real-world conditions.

What user-experience constraints guided the design of a voice workflow for non-technical field service technicians?

The main UX constraint was simplicity. Field technicians complete jobs under time pressure and have no patience for experimenting with AI. The voice workflow needed to be discoverable, intuitive, and require minimal explanation, and it needed to avoid introducing chat-style interfaces.

Voice input is embedded directly into existing form experiences. Technicians start voice capture with a single control and speak naturally without referencing field names. After processing, updated fields are visually highlighted, and inline undo and text-editing controls keep users in full control. This transparency is critical when AI modifies structured records.

Privacy considerations also influenced UX decisions. No voice recordings are stored; audio is discarded immediately after transcription. Extensive beta testing with enterprise customers confirmed that technicians preferred transparency and correction over silent automation. The result is a voice experience that feels native to the workflow.

Voice-powered data entry, natively integrated into existing workflows.

What cost-to-serve and privacy constraints influenced the decision to perform speech-to-text on the device?

Cost and privacy were core architectural limits. Cloud-based transcription would create recurring costs. It would also increase exposure of sensitive audio data.

By performing speech-to-text on the device using native OS frameworks, the team removed transcription costs entirely and ensured audio never leaves the device. Once transcription finishes, the audio is immediately discarded and only the text is processed further. This simplifies compliance by avoiding storage, retention, and audit requirements for raw audio.

Text-to-form processing uses existing cloud LLM infrastructure, which minimizes incremental platform cost while retaining flexibility. Processed data is retained only as long as needed to populate the form. This ensures AI is applied where it adds semantic value, while the rest of the pipeline remains deterministic, cost-efficient, and privacy-safe.
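The privacy contract described in this answer can be sketched minimally: audio is transcribed on-device, the buffer is dropped before anything else runs, and only text flows onward. `fake_on_device_stt` is a stand-in for the native OS speech framework, not a real API.

```python
# Minimal privacy-pipeline sketch (assumed shapes): transcribe, then clear the
# audio buffer so raw audio never persists or transmits; only text continues.

def fake_on_device_stt(audio: bytearray) -> str:
    """Stand-in for a native on-device speech framework."""
    return "replace compressor unit"  # placeholder transcript

def transcribe_and_discard(audio: bytearray) -> str:
    text = fake_on_device_stt(audio)
    audio.clear()  # raw audio is discarded immediately after transcription
    return text

buf = bytearray(b"\x00" * 16)  # stand-in audio samples
text = transcribe_and_discard(buf)
```

Clearing the buffer at the seam between transcription and semantic processing is what keeps storage, retention, and audit obligations scoped to text only.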

The post Delivering Accurate, Low-Latency Voice-to-Form AI in Real-World Field Conditions appeared first on Salesforce Engineering Blog.

Hyperforce Migration at Scale: How Deterministic Automation Replaced Manual Spreadsheets Across 95,000 Organizations https://engineering.salesforce.com/hyperforce-migration-at-scale-how-deterministic-automation-replaced-manual-spreadsheets-across-95000-organizations/ Thu, 26 Feb 2026 15:30:46 +0000 https://engineering.salesforce.com/?p=8121 By Raksha Subramanyam and Vijay Singh. In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we feature an internal decision and automation platform, the Migration Intake and Processing Service (MIPS). Engineers on the Cloud Economics and Capacity Management and Org Migration teams came together to build this platform, […]

By Raksha Subramanyam and Vijay Singh.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we feature an internal decision and automation platform, the Migration Intake and Processing Service (MIPS). Engineers on the Cloud Economics and Capacity Management and Org Migration teams came together to build this platform, which enables Salesforce’s large-scale customer organization (org) migrations to Hyperforce, processing over 95,000 organization migrations through more than 15,000 requests over a span of ~1 year.

Explore how the team replaced manual migration intake with an automated decision engine, architected reliable integration across multiple sources of truth while preserving auditability and customer trust, and balanced auto-approval with human intervention to maintain predictable throughput for 90%+ of the org migration requests managed through the automated system.

What is your team’s mission building MIPS to support Salesforce’s Hyperforce migration program?

Our team enables Salesforce to migrate customer organizations to Hyperforce at an enterprise scale. We achieve this while meticulously honoring customer constraints and preferences, all without introducing operational risk or human throughput bottlenecks.

As Salesforce transitioned from first-party data centers to public cloud infrastructure, the nature of migration planning evolved significantly. What was once a small set of coordinated moves transformed into a high-volume program, where each request carries critical customer preferences, such as destination region and scheduling windows, and must respect customer constraints.

To address this new scale, the team developed the Migration Intake and Processing Service, or MIPS. This platform acts as a centralized decision and automation layer. It transforms migration requests into deterministic outcomes, empowering downstream execution teams to act with confidence. Rather than relying on spreadsheets managed by humans and manual coordination across various systems and teams, MIPS establishes a single, auditable intake path with clear routing instructions. It automates standard cases and escalates exceptions for review, allowing the program to scale efficiently without compromising accuracy, speed or customer trust.

What scalability constraints made manual migration intake unsustainable as Salesforce accelerated Hyperforce adoption?

In the initial phases of Hyperforce migrations, manual coordination proved manageable. This was largely due to the low migration volume, which allowed for communication through email, Slack, and spreadsheets. Salesforce manually validated each request, meticulously checking eligibility, schedulability, regional capacity, and legal requirements across various dashboards and data sources.

However, as migration volumes rapidly escalated into thousands of organizations per month, this manual approach quickly became a significant bottleneck. Each request demanded multiple validations, and performing these checks by hand led to delays, inconsistencies, and growing backlogs. Even minor data inaccuracies could result in incorrect migration decisions.

The primary limiting factor wasn’t data movement itself, but rather the speed of request processing. Without automation, the intake layer severely constrained the entire migration program. MIPS directly addressed this challenge by embedding validation logic into the system. This eliminated the need for human intervention in repetitive checks, allowing intake throughput to scale independently of staffing levels.

What architectural challenges shaped the design of a decision engine evaluating eligibility, schedulability, and capacity at Hyperforce scale?

The core architectural challenge centered on consolidating numerous distributed sources of truth into one reliable decision engine. Migration eligibility, scheduling windows, capacity limitations, and policy rules resided with various teams and within disparate systems. MIPS had to read from and write to these systems, ensuring every decision relied on accurate, current data.

Misinterpretations could lead to migrating an organization to an incorrect region or scheduling a move at an invalid time, creating substantial customer and operational risks. To counter this, the team heavily invested in upfront architecture and functional design. This involved meticulously documenting dependencies, update paths, and ownership boundaries across systems.

To support deterministic decision-making, MIPS evaluates requests against a fixed set of validated inputs. These include:

  • Eligibility signals, such as organization topology and deployment model.
  • Scheduling availability windows, aligned with customer constraints, preferences and platform readiness.
  • Regional and infrastructure capacity data, obtained from upstream systems.
  • Explicit policy rules, agreed upon by migration, infrastructure, and legal teams.

The resulting architecture utilizes well-defined APIs, explicit data contracts, and continuous data quality checks. MIPS validates inputs across multiple dimensions before approving a request. It defaults to manual review when required data is missing or inconsistent, enabling autonomous operation at scale without compromising precision.
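The decision shape described above can be sketched as a deterministic rules pass with a fail-safe default. Check names and the request shape are illustrative assumptions, not the actual MIPS rule set.

```python
# Hedged sketch of deterministic intake decisions: every request is evaluated
# against a fixed set of validated inputs, and any missing input routes the
# request to manual review instead of inferring an answer.

CHECKS = ("eligibility", "schedulability", "capacity", "policy")

def decide(request: dict) -> str:
    """Evaluate deterministic checks in order; fail safe on missing data."""
    for check in CHECKS:
        value = request.get(check)
        if value is None:
            return "manual_review"  # missing/inconsistent input: never guess
        if value is False:
            return "rejected"
    return "auto_approved"

decision = decide({c: True for c in CHECKS})
```

Defaulting to `manual_review` on absent data is the conservative behavior that lets the engine run autonomously without compromising correctness.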

What reliability risks shaped how MIPS integrates with multiple sources of truth while preserving auditability and customer trust?

Reliability and auditability formed the bedrock of the MIPS system, given its direct impact on customer data residency and availability. Every approval or rejection is traceable, allowing engineers to pinpoint the exact decision path if downstream issues emerge. This traceability is critical to ensuring our number one value, trust, is met at all times.

The team tackled this by constructing auditable decision pipelines that capture:

  • All evaluated inputs used in a decision.
  • The specific rules applied during eligibility and schedulability checks.
  • The final decision outcome, including approval or routing for review.

This allows engineers to backtrack through decisions when validating behavior or investigating anomalies.
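An audit record matching the three bullets above might look like the following sketch; field names are assumptions for illustration, not the MIPS schema.

```python
# Illustrative per-decision audit record: the inputs evaluated, the rules
# applied, and the final outcome, captured immutably so engineers can
# backtrack through any decision later.
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionAudit:
    request_id: str
    inputs: dict        # all evaluated inputs used in the decision
    rules_applied: tuple  # eligibility/schedulability checks that ran
    outcome: str        # approval, rejection, or routed for review

record = DecisionAudit(
    request_id="REQ-1042",
    inputs={"eligibility": True, "capacity": True},
    rules_applied=("eligibility", "capacity"),
    outcome="auto_approved",
)
```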

Another significant risk was partial data availability. Instead of inferring or guessing, MIPS was engineered for safe failure. If any required source of truth becomes unavailable or yields unexpected results, the system automatically routes the request for human review. This conservative strategy prevents incorrect automated decisions and safeguards customer trust, ensuring automation never compromises correctness as migration volume grows.

What design tradeoffs guided how MIPS determines which migration requests are auto-approved versus routed for human review?

Timeline pressure to complete most Hyperforce migrations forced the team to carefully select what they could safely automate early on. The goal was not complete automation, but rather high-impact automation without escalating risk.

The team analyzed manual validation steps performed by humans. They separated deterministic, policy-driven checks from those needing human judgment. Eligibility, schedulability, and capacity validations proved reliably automatable, unlike legal verification and certain exception scenarios.

MIPS directly incorporates this distinction into its rules engine. Requests meeting all deterministic criteria receive auto-approval and propagate downstream. Other requests are flagged with explicit reasons and routed for manual review. This method automated approximately 80% of requests, substantially reducing the backlog while retaining human involvement where judgment is necessary. The remaining 20% guides future automation opportunities.

What scalability and throughput challenges emerged as MIPS processed over 15,000 migration requests and routed more than 95,000 org migrations?

Throughput at scale depended as much on data freshness as on request volume. Each migration request initiates multiple validations across eligibility, location, and capacity data. Synchronous lookups for every request would introduce unacceptable latency.

To address this, the team separated data freshness from request processing. Background jobs continuously refresh commonly used datasets on a fixed cadence. This ensures fast access to validated data when requests arrive. This approach reduced reliance on real-time calls and enabled consistent decision latency as volume increased.

Some constraints remained due to upstream system limitations. In those instances, the team defined clear service-level expectations and communicated them explicitly, rather than obscuring the limitations. By combining precomputed data, conservative defaults, and clear service level objectives, MIPS maintains predictable throughput while safely supporting 90%+ of the org migration requests managed through the automated system.
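The freshness-versus-processing split described above can be sketched as a dataset that a background cadence refreshes, so request-time reads never block on upstream calls. The cadence, dataset, and class shape are illustrative assumptions.

```python
# Sketch: decouple data freshness from request processing. A background job
# (modeled here as an explicit refresh call) repopulates the dataset on a
# fixed cadence; the request path only reads precomputed data.
import time

class RefreshedDataset:
    def __init__(self, fetch, max_age_s: float):
        self._fetch, self._max_age = fetch, max_age_s
        self._data, self._stamp = fetch(), time.monotonic()

    def refresh_if_stale(self) -> None:
        if time.monotonic() - self._stamp >= self._max_age:
            self._data, self._stamp = self._fetch(), time.monotonic()

    def get(self):
        return self._data  # request path: no synchronous upstream call

calls = {"n": 0}
def fetch_capacity():
    calls["n"] += 1
    return {"us-east": 120 + calls["n"]}

# max_age_s=0.0 forces staleness so the demo refresh is observable
capacity = RefreshedDataset(fetch_capacity, max_age_s=0.0)
first = capacity.get()
capacity.refresh_if_stale()
second = capacity.get()
```

Because lookups hit only the local copy, decision latency stays consistent as request volume grows, at the cost of bounded staleness governed by the refresh cadence.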


Building an AI-Accelerated Compliance Automation Platform for 24x Faster Audits https://engineering.salesforce.com/building-an-ai-accelerated-compliance-automation-platform-for-24x-faster-audits/ Mon, 23 Feb 2026 14:49:39 +0000 https://engineering.salesforce.com/?p=8117 In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce.Today we spotlight Aastha Goyal, a Senior Manager of Software Engineering, whose team built FastTrack, a production-grade compliance automation platform for Salesforce’s mobile Apple App store and Google Play Store environments that delivers a 24× reduction in audit execution time. Explore […]

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce.
Today we spotlight Aastha Goyal, a Senior Manager of Software Engineering, whose team built FastTrack, a production-grade compliance automation platform for Salesforce’s mobile Apple App Store and Google Play Store environments that delivers a 24× reduction in audit execution time.

Explore how the team replaced fragile, multi-hour screenshot-driven compliance audits with deterministic, API-based automation and used AI-assisted development to compress the path from system design to production, delivering compliance-grade evidence through APIs never designed for audit workflows while maintaining the Salesforce engineering quality bar.

What is your team’s mission as it relates to scaling compliance automation and developer productivity across Salesforce’s mobile environments?

Our team ensures compliance audits across Salesforce mobile environments remain accurate, repeatable, and scalable while minimizing operational risk. As requirements expanded, the team prioritized eliminating fragile manual workflows that consumed engineering time without improving reliability.

The mission evolved to accelerate production system delivery without lowering engineering standards. This shift focused specifically on compliance-critical automation.

AI-assisted development became a core enabler of this strategy. This technology allows the team to focus on system architecture, validation logic, and compliance intent while reducing implementation overhead. These tools ensure platforms like FastTrack reach production quickly and safely, even when engineering resources face constraints.

What workflow constraints shaped the move from manual compliance audits to automated evidence collection?

Manual compliance audits became inherently fragile as mobile environments scaled. Each cycle required engineers to navigate administrative consoles, capture timestamped screenshots, and manually verify permission data. Even small omissions risked invalidating an audit, which created direct business risks for regulated customers.

As the scope of evidence expanded, the process consumed hours per cycle. This methodology depended heavily on individual precision and contextual knowledge. Consequently, operational bottlenecks formed whenever key team members became unavailable.

AI-assisted development changed how quickly the team translates audit requirements into deterministic system workflows. Rather than treating automation as a long-term engineering effort, the team implements architectural decisions around API integration and evidence normalization immediately.

This shift transformed an unsustainable manual process into a scalable automation system. The new approach eliminates fragile touchpoints, reduces risk exposure, and reclaims vital engineering capacity.

What upstream integration constraints shaped how the system handled limitations in the Google Play Console API?

The Apple App Store Connect integration provides the permission granularity required for audit evidence collection. In contrast, the Google Play Console API restricts precise scoping at the application level because it does not expose app-specific user permission data.

The team refused to let this constraint block production automation. Instead, they collaborated with compliance stakeholders to redefine acceptable evidence boundaries. The current solution collects the complete authorized user set within the Google Play Console environment.

AI-assisted implementation accelerates how quickly the team prototypes and hardens alternative evidence models. When constraints surface, the team translates revised compliance requirements directly into functioning system behavior inside FastTrack. This methodology eliminates the need for extended engineering cycles.

This approach preserves audit integrity. It also simplifies both system design and compliance reviews.

What trust and validation requirements shaped how audit outputs were engineered for compliance-grade reliability?

Compliance automation succeeds when governance teams and external auditors accept the evidence. The system functions as a deterministic evidence engine that traces every output directly to authoritative source data.

The design embeds validation logic into runtime execution to enforce fields, timestamps, and permission states. Each audit execution logs the exact API queries to create a transparent, verifiable trail.

AI tooling accelerates the implementation of validation paths and traceability mechanisms within production workflows. This ensures the system enforces compliance correctness programmatically. By making validation a core architectural component, FastTrack delivers reliability at scale.
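The traceability idea in this answer can be sketched as an evidence record that carries the exact query and timestamp that produced it, with validation rejecting any record missing a required field. Field names and shapes are assumptions for the example, not the FastTrack schema.

```python
# Hedged sketch of compliance-grade evidence: each record is traceable to the
# API query that produced it, and runtime validation enforces required fields
# before the record counts as evidence.
from datetime import datetime, timezone

REQUIRED = ("query", "fetched_at", "permission_state")

def make_evidence(query: str, permission_state: str) -> dict:
    return {
        "query": query,  # exact API query logged for the audit trail
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "permission_state": permission_state,
    }

def validate(record: dict) -> bool:
    """Reject records missing any required field or holding empty values."""
    return all(record.get(k) for k in REQUIRED)

ev = make_evidence("GET /v1/users?app=fieldservice", "admin")
```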

What system-design challenges shaped the transition from UI-driven automation to API-first production systems?

Early automation efforts often rely on browser-driven workflows that simulate human interactions. These approaches break easily and create maintenance challenges when interfaces change.

Long-term system reliability and scalability remain core design constraints for compliance workflows. The team adopts an API-first architecture to access authoritative data.

AI-assisted development accelerates how quickly the team explores and implements these architectural decisions. API-based automation pipelines allow for rapid iteration and validation. This shift eliminates failures caused by user interface changes and creates a scalable system for compliance automation.

Scaling Compliance: From Fragile Manual Audits to AI-Powered Automation.

What developer-productivity constraints shaped how AI tooling enabled a production system without traditional coding experience?

The initiative originated outside a formal engineering roadmap and was built without prior professional coding experience, yet the system meets every production quality, security, and compliance standard.

AI tooling shifts development toward architectural intent: the builder defines integrations, authentication flows, and validation logic, while the tooling translates those designs into functioning code. Every component undergoes review against engineering expectations, creating a rapid feedback loop for refinements.

AI does not replace human judgment. It compresses implementation overhead and narrows the gap between system architecture and execution. This allows the system to reach compliance-grade maturity faster than traditional development approaches.

What operational-risk and scalability constraints shaped how automation replaced manual audits?

Compliance failures impact regulated users and organizational trust. Manual workflows increase risk through human error and inconsistent execution. As mobile environments scale, audit complexity grows while operational capacity stays the same.

AI-accelerated delivery replaces fragile processes before risks grow. Automated evidence collection standardizes execution across teams and removes manual steps. Tasks that once required hours now finish in minutes. This reduces operational exposure and allows compliance automation to scale with the mobile footprint.


From Audio to Action: How Speech Invocable Action Powers Native AI Automation Across Salesforce https://engineering.salesforce.com/from-audio-to-action-how-speech-invocable-action-powers-native-ai-automation-across-salesforce/ Sat, 21 Feb 2026 00:53:04 +0000 https://engineering.salesforce.com/?p=8114 By Yaheli Salina, Karthik Prabhu, and Omri Alon. In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we spotlight Software Engineer Yaheli Salina. She and her Agentforce Speech Foundations team developed Speech Invocable Action — a new AI tool that standardizes repeatable actions, delivering sophisticated AI power throughout […]

By Yaheli Salina, Karthik Prabhu, and Omri Alon.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we spotlight Software Engineer Yaheli Salina. She and her Agentforce Speech Foundations team developed Speech Invocable Action — a new AI tool that standardizes repeatable actions, delivering sophisticated AI power throughout the Salesforce ecosystem, including secure, native speech automation housed within the platform trust boundary.

Explore how the team integrated native speech automation by developing speech-to-text as a primary action under rigid multi-tenant constraints, engineering protective barriers to stop automation errors from impacting Flows and Agentforce actions, and leveraging AI tools to speed up architectural exploration while maintaining high security standards.

What is your team’s mission as it relates to building native speech automation on the Salesforce platform?

The team simplifies speech capabilities by creating native building blocks within the Salesforce platform. Previously, speech-to-text required routing audio to third-party services, which forced users to manage credentials and accept security tradeoffs. That model created friction for enterprise environments that prioritize data residency and trust boundaries. The current approach, by contrast, keeps audio data within the Salesforce trust boundary: processing occurs through platform services to preserve privacy while enabling hands-free automation.

By integrating speech capabilities as a suite of standard actions, the team democratizes voice access for all builders. Speech-to-text, text-to-speech, and translation are now standard composable actions. Admins and Developers can trigger voice-driven logic in Flows or Agentforce without writing boilerplate code for audio streaming or WebSocket management. This shift transforms speech into a reusable tool rather than a specialized integration.

The team aims to make voice a natural extension of workflows so users build speech-driven experiences with total confidence.

Inside the Speech Invocable Action architecture: Bridging Salesforce platform consumers with Agentforce Speech Foundations through standardized core actions.

What architectural constraints shaped how native speech automation was built inside the Salesforce platform?

Building inside the Salesforce platform presents architectural realities that differ from deploying external services. The platform operates as a large, multi-tenant system where thousands of features share memory, compute, and execution paths. Every new capability must coexist safely with all other platform processes.

Speech-to-text processing demands significant resources, especially regarding memory usage during audio handling. Since these resources are shared, the team evaluates how speech actions behave when multiple Flows or Agentforce actions run concurrently. Each automation step assumes other platform workloads compete for the same underlying resources.

To manage these demands, the team prioritizes disciplined resource management and rigorous performance testing. They validate usage patterns against Speech Foundations API limits and tune execution paths for maximum efficiency. These efforts maintain platform stability and ensure speech automation performs predictably under heavy loads.

How did reliability requirements influence the design of speech automation for Flows and Agentforce actions?

Speech automation often operates in synchronous contexts like Flows and Agentforce actions, where execution pauses until a task completes. A single failure in these scenarios can stall an entire automation or disrupt an agent interaction. This makes failure behavior as critical as success behavior.

The team uses a defensive design strategy to ensure predictable outcomes. The speech action returns structured error categories instead of generic system errors. This allows builders to handle issues explicitly. Downstream automation can then respond with intentional actions like retrying, branching to a fallback path, or logging the event.

Extensive testing validates this approach through unit, integration, and end-to-end scenarios. These tests ensure the speech action behaves consistently when combined with other platform tools. Controlled failure modes ensure speech automation strengthens workflows and maintains reliability.
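The structured-error design described above can be sketched with an explicit error taxonomy plus a routing function, so downstream automation branches intentionally rather than failing opaquely. The category names and retry policy are assumptions for illustration, not the action's actual error contract.

```python
# Illustrative sketch: the speech action returns structured error categories
# (instead of generic system errors), letting a Flow choose between retrying,
# taking a fallback path, or logging the event.
from enum import Enum

class SpeechError(Enum):
    AUDIO_TOO_LONG = "audio_too_long"
    UNSUPPORTED_FORMAT = "unsupported_format"
    SERVICE_UNAVAILABLE = "service_unavailable"

RETRYABLE = {SpeechError.SERVICE_UNAVAILABLE}

def next_step(error: SpeechError) -> str:
    """Map an error category to an explicit downstream action."""
    if error in RETRYABLE:
        return "retry"
    return "fallback_path"  # e.g. prompt the user to type instead

step = next_step(SpeechError.SERVICE_UNAVAILABLE)
```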

What delivery pressures shaped how the team executed this work with a small team?

The delivery of speech automation happened under fixed timelines and high operational expectations. Because this action operates deep within the platform, the team treated correctness and guardrails as non-negotiable requirements.

The Speech Foundations and Standard Actions teams adhered to a design for bulk processing from the beginning — crucial for scalability and efficient governor limit consumption in Salesforce’s multi-tenant environment. To implement speech tasks (such as transcription) within the complex codebase, the team used AI tools like Claude Code. This enabled a small team to autonomously deliver production-ready code that met these strict constraints with unprecedented speed.

Testing focused on how builders use speech automation inside Flows and Agentforce actions. By validating real execution paths end-to-end, the team ensured the feature could ship confidently despite tight timelines.

How did AI tools change developer productivity while working inside an unfamiliar platform codebase?

Working within the Salesforce platform required navigating a massive codebase and complex internal APIs. Usually, onboarding to such an environment requires weeks of documentation review and trial-and-error exploration.

AI development tools transformed that experience. Tools like Claude and Cursor served as architectural guides and helped the team understand system components and existing patterns. This AI-assisted approach allowed the team to query the codebase, find relevant examples, and generate tests that met internal standards.

The team estimates AI shortened development and discovery time by seven to eight weeks. Beyond speed, AI shaped how engineers learned, reasoned about, and extended a complex system at scale and reduced cognitive overhead. This allowed the team to focus on speech automation logic rather than platform mechanics.


How Agentforce Achieved Accurate Flow Generation Across 461 Billion Monthly Executions Using a Constrained DSL https://engineering.salesforce.com/how-agentforce-achieved-accurate-flow-generation-across-461-billion-monthly-executions-using-a-constrained-dsl/ Mon, 16 Feb 2026 17:35:29 +0000 https://engineering.salesforce.com/?p=8081 By Shipra Shreyasi, Aniket Kumar, Manas Agarwal, and Pragya Kumari In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today we spotlight Shipra Shreyasi, a software engineering architect who directs the team enhancing natural-language-to-Flow creation within Agentforce. This empowers users to build production-ready Flow metadata from simple speech while […]

By Shipra Shreyasi, Aniket Kumar, Manas Agarwal, and Pragya Kumari

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today we spotlight Shipra Shreyasi, a software engineering architect who directs the team enhancing natural-language-to-Flow creation within Agentforce. This empowers users to build production-ready Flow metadata from simple speech while managing automation at a scale surpassing 461 billion monthly executions.

Explore how Shipra’s team boosted natural-language-to-Flow precision by swapping fine-tuned models for a restricted, multi-level DSL framework, and how they maintained reliability across 63+ Flow varieties — including Screen Flows, UI elements, and unique actions — through specialized constraints and staged verification.

What is your team’s mission as it relates to building accurate natural-language-to-Flow generation in Agentforce?

The team simplifies how you create, modify, and understand automation by using large language models to transform plain-language instructions into Flow metadata. This process allows you to deploy business logic directly into Flow Builder with the expectation that every automated task behaves exactly as intended.

Accuracy remains the central focus because Flows serve as vital operational assets. Since a Flow that fails to reflect your intent can introduce hidden errors, the team prioritizes several core requirements:

  • Correctness
  • Debuggability
  • Reliability

This perspective shifts Flow generation from a simple text task into a structured engineering solution. By applying explicit constraints and system-aware reasoning, the team helps you build sophisticated automations with minimal manual effort and high confidence in the final result.

Shipra shares what keeps her at Salesforce.

What accuracy and intent-alignment constraints did fine-tuned models introduce for natural-language-to-Flow generation?

Fine-tuned models created accuracy hurdles that grew more obvious as Flow complexity increased. While these models produced valid metadata, they often missed the actual meaning behind a request. This meant a Flow could go live while failing to perform the specific tasks you originally described.

Adaptability also stayed out of reach. These models struggled to handle your unique customizations, such as custom Apex actions or specific HTTP callouts. This approach created several persistent issues:

  • Retraining cycles increased the risk of system regressions.
  • Diagnosing failures became nearly impossible.
  • Errors remained hidden within a single, complex process.

Ultimately, these limitations made it difficult to enforce accuracy. Because the model operated as a monolith, the team could not determine if a failure happened during planning or the final generation. This lack of transparency prevented the system from delivering reliable, intent-aligned automation at a larger scale.

What architectural constraints drove the shift from fine-tuned models to a constrained, multi-stage DSL for Flow generation?

The architectural shift prioritizes deterministic results and eliminates hallucinations. While standard models often struggle with semantic drift and invalid data combinations, this new structure enforces strict rules for metadata and Flow types.

The team replaced the old monolithic approach with a modular, multi-stage pipeline. This system breaks the generation process into specialized phases with clear validation gates. A new Domain-Specific Language (DSL) defines exactly what the system can build, which stops invalid constructs before they ever exist.

The new model separates design from implementation through these methods:

  • The Architect phase resolves planning and structure first.
  • The Developer phase handles the low-level metadata production.
  • Validation occurs at every stage to prevent errors.

This phased approach ensures accuracy through enforced constraints rather than trying to fix mistakes after the fact.
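
To make the phased design concrete, here is a minimal Python sketch of a plan-then-generate pipeline with a validation gate between the phases. All names, element types, and the stubbed Architect logic are hypothetical illustrations, not the actual Agentforce DSL or prompts; in the real system each phase would be backed by an LLM constrained by the DSL grammar.

```python
from dataclasses import dataclass

# Hypothetical element vocabulary a constrained DSL might define; the real
# DSL is derived from Flow metadata and is far richer.
ALLOWED_ELEMENTS = {"Screen", "Decision", "Assignment", "RecordCreate"}

@dataclass
class Plan:
    flow_type: str
    elements: list  # ordered element kinds chosen by the Architect phase

def architect(prompt: str) -> Plan:
    """Architect phase: resolve planning and structure first (LLM stub)."""
    # Sketch only: pretend every prompt asks for a simple screen flow.
    return Plan(flow_type="ScreenFlow", elements=["Screen", "Assignment"])

def validate_plan(plan: Plan) -> Plan:
    """Validation gate: constructs outside the DSL never reach generation."""
    bad = [e for e in plan.elements if e not in ALLOWED_ELEMENTS]
    if bad:
        raise ValueError(f"invalid elements: {bad}")
    return plan

def developer(plan: Plan) -> dict:
    """Developer phase: emit low-level metadata from the validated plan."""
    return {
        "flowType": plan.flow_type,
        "elements": [{"type": e, "name": f"{e}_{i}"}
                     for i, e in enumerate(plan.elements)],
    }

def generate(prompt: str) -> dict:
    return developer(validate_plan(architect(prompt)))
```

Because the gate sits between planning and generation, a failure can be attributed to a specific phase instead of a monolithic model.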

Natural Language Prompt to Agentforce for Flow Generation using Multi-Stage DSL Generation Pipeline

What innovation-velocity constraints emerged from fine-tuned model training and release pipelines?

Fine-tuned models created operational overhead that slowed innovation velocity. Supporting a new Flow type or fixing correctness issues required assembling datasets, retraining models, and moving through sequential testing environments. These steps meant even small changes often took months to reach users.

This slow cadence made it difficult to respond to evolving platform requirements. Accuracy improvements depended on model release timelines rather than engineering intent, while changing Flow schemas required repeated retraining cycles.

The team eliminated the need for retraining by moving to a DSL-based architecture with open-source large language models. This shift allows the team to address correctness fixes and schema changes through deterministic rule updates. Now, accuracy improves continuously instead of waiting for infrequent, high-risk releases.

What metadata-evolution constraints emerged as Flow schemas and Flow types expanded across Salesforce releases?

Flow operates at a massive scale. It supports over 63 distinct Flow types and features schemas that evolve with every release. Each type carries its own execution semantics and start configurations, which previously made manual generation approaches far too brittle to maintain.

The team solved this by automating DSL generation directly from Flow metadata definitions. These constructs now derive programmatically from the Flow Metadata WSDL. This method ensures that generation rules reflect the platform schema at all times. As the platform introduces new features, the DSL evolves automatically.

Because the DSL pulls from authoritative metadata, the system stays aligned with actual runtime behavior. This change removes the risk of schema-drift errors. It also allows Flow generation accuracy to scale naturally alongside the platform.
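The idea of deriving generation rules from authoritative metadata can be sketched in a few lines. The type definitions below are hypothetical stand-ins for structures parsed from the Flow Metadata WSDL; the point is that the DSL constructs are computed from the schema rather than hand-written, so new types appear in the grammar automatically.

```python
# Stand-in for type definitions parsed from the Flow Metadata WSDL
# (hypothetical shapes, not the real WSDL structure).
wsdl_types = {
    "FlowScreen": {"fields": ["name", "label", "fields"]},
    "FlowDecision": {"fields": ["name", "rules", "defaultConnector"]},
}

def derive_dsl(types: dict) -> dict:
    """Derive DSL constructs programmatically so the grammar tracks the schema."""
    return {
        name.removeprefix("Flow"): {"required": spec["fields"]}
        for name, spec in types.items()
    }
```

Regenerating the DSL on each release keeps generation rules aligned with the platform schema without manual maintenance.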

Shipra spotlights her team’s favorite AI tools.

What correctness constraints emerged when supporting complex Flow types like Screen Flows, UI components, and custom actions?

Complex Flow types present correctness challenges that go beyond static metadata. Screen Flows act as user interfaces, which demand accurate component selection and reactive behavior. Custom actions add another layer of difficulty by introducing specific semantics that models cannot reliably predict.

Start elements also function as polymorphic components. They contain fields that change depending on the Flow type. A single generation approach often fails to handle these variations, leading to incorrect or invalid configurations.

The constrained DSL architecture fixes this by enforcing specific rules at every stage. The pipeline selects valid elements and validates metadata in real time. It also calls dynamic APIs to resolve specific organization details. These steps ensure accuracy even in complex, UI-driven scenarios.

Shipra explains why engineers should join Salesforce.

What measurement and evaluation challenges did you face proving that the constrained DSL architecture improved Flow accuracy?

Measuring accuracy requires more than simple observation. Manual reviews fail to scale, and basic indicators like successful saves do not prove that a Flow honors a user’s intent.

The team solved this by building an automated evaluation framework. This system uses hundreds of prompts and a Flow-as-a-Judge model to test results. The framework evaluates every generated Flow on three specific dimensions:

  • Successful saving
  • Activation readiness
  • Alignment with user intent

By running identical prompts through different methods, the team compared outcomes directly. The constrained DSL approach shows superior fidelity for complex types like Screen Flows. This framework provides the quantitative evidence needed to prove the architectural shift improves accuracy.
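A toy version of such an evaluation harness is shown below. The `judge` function is a deterministic stand-in for the Flow-as-a-Judge model, and the field names are hypothetical; a real harness would run hundreds of prompts through each pipeline variant and score every generated Flow with the judge model.

```python
def judge(flow: dict) -> dict:
    """Stand-in for a Flow-as-a-Judge model scoring three dimensions."""
    return {
        "saves": "elements" in flow,
        "activation_ready": bool(flow.get("elements")),
        "intent_aligned": flow.get("flowType") == flow.get("requested_type"),
    }

def evaluate(flows: list) -> dict:
    """Aggregate per-dimension pass rates across a prompt suite."""
    scores = [judge(f) for f in flows]
    return {dim: sum(s[dim] for s in scores) / len(scores) for dim in scores[0]}
```

Running identical prompt suites through two generation methods and comparing the resulting pass rates is what turns "seems better" into quantitative evidence.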

Learn more

Against the Clock: How Data 360 Launched the Informatica Help Agent in 24 Days https://engineering.salesforce.com/against-the-clock-how-data-360-launched-the-informatica-help-agent-in-24-days/ Wed, 11 Feb 2026 20:17:03 +0000 https://engineering.salesforce.com/?p=8107

The post Against the Clock: How Data 360 Launched the Informatica Help Agent in 24 Days appeared first on Salesforce Engineering Blog.

By Irina Malkova and Alexander Smith.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today, we spotlight Irina Malkova, Vice President of Product and Success Data, who helped deliver the data foundation behind the Informatica Help Agent in just 24 days.

Explore how the team met an ambitious deadline by refining project focus, converting 100,000 unstructured documents into searchable intelligence via Data 360, and applying established architectural frameworks to enable reliable retrieval for live agents.

What is your team’s mission as it relates to building the Data 360 foundation for the Informatica Help Agent?

The team builds trusted, AI-ready context: in this case, a knowledge base that empowers the Informatica Help Agent to reliably answer customer questions and reduce support cases. We support all agents that augment the Customer Success business motion, including those on help.salesforce.com and slack.com/help. Our strategy balances enabling helpful, tailored answers for each agent with building a durable data foundation that can power future agents, too — reducing time to launch and ensuring consistently trusted results across all experiences.

Data 360 is how the team unifies, standardizes, indexes, and activates unstructured knowledge. Data preparation is a notoriously difficult step in building AI — but Data 360 eliminates the need for custom pipelines, accelerates time to launch, and enables reuse — making tight deadlines possible.

Retrieval precision and accuracy defined the success of the Informatica Help Agent. By focusing on AI data readiness as a core engineering task, the team delivers correct answers and scales the system without losing trust.

How Data 360 transforms data into retrievable context for AI Agents.

What delivery constraints shaped the 24-day launch of the Informatica Help Agent after acquisition?

We were challenged to enable the Informatica Help Agent within 30 days of the acquisition closing on November 18, 2025. The ambitious post-acquisition timeline required strict discipline and architectural creativity. The team focused on delivering a production-grade foundation instead of addressing every complex detail in the initial release.

To avoid friction that threatened the deadline, the team leveraged clever architectural approaches. For instance, Informatica’s knowledge base had complex versioning, with many near-duplicate articles differing only slightly across product versions. The team found a way to manage the product versioning through prompting and configuration rather than changing the system logic. This choice kept the primary effort on ingestion and retrieval fundamentals.

Execution relied on reusing established Data 360 patterns while protecting the engineering team from distractions. By following a precise plan and sequencing tasks carefully, the team completed the entire system in 24 days — ahead of the 30-day deadline.

What data quality challenges emerged when preparing Informatica’s unstructured knowledge for AI consumption?

Informatica documentation was written for human readers rather than artificial intelligence. Raw HTML files contained headers, footers, and navigation menus that interfered with retrieval quality. To become AI-ready, the knowledge needed a cleanup — but manual cleaning was impossible at this scale.

Instead, the team used Data 360 patterns to normalize content and remove noise while keeping the original meaning. This process transformed HTML into consistent chunks for better embedding and retrieval.

Preparing this volume of content would have taken weeks without Data 360. By using native ingestion and search features, the team finished data preparation in days and moved quickly to optimizing the performance. Thanks to the data cleanup, they had a solid performance baseline to start with — because context determines the quality of an agent’s response.
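The normalize-and-chunk step described above can be sketched with the standard library alone. This is a minimal illustration, assuming the boilerplate lives in `<nav>`, `<header>`, and `<footer>` tags and that fixed-size word windows are an acceptable chunking strategy; Data 360's actual ingestion is far more sophisticated.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"header", "footer", "nav", "script", "style"}

class TextExtractor(HTMLParser):
    """Keep body text; drop headers, footers, and navigation menus."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # >0 while inside a boilerplate element
        self.parts = []
    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1
    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def clean(raw_html: str) -> str:
    parser = TextExtractor()
    parser.feed(raw_html)
    return " ".join(parser.parts)

def chunk(text: str, size: int = 40) -> list:
    """Split cleaned text into consistent word-window chunks for embedding."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
```

The cleanup matters because embedding noise (menus, footers) degrades every downstream retrieval step.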

What ingestion and storage challenges shaped aggregating 100,000 Informatica documents into Data 360?

The Informatica knowledge base came from different systems with unique structures and metadata. The ingestion process had to handle these differences while remaining reliable at a large scale.

Much of the Informatica knowledge the team sought to use was available through a content management system and hosted on the Informatica website. To ingest it, the team used the new Data 360 “sitemaps” feature, which crawls the website and creates conforming Data 360 knowledge.

For more unique content, Python workflows managed the extraction, while Data 360 handled the ingestion and storage. The first ingestion of developer documentation finished in about three hours. Future updates ran faster as the pipelines stabilized.

The team managed limitations in filtering and refresh timing through preprocessing and configuration. Despite these constraints, Data 360 pipelines supported hundreds of thousands of documents. This approach created a production-ready knowledge base within the necessary timeline.

What retrieval accuracy and performance considerations guided your chunking and indexing strategy?

Accuracy remains vital because documentation varies by product version and user type. Mismatched content risks eroding trust even when responses appear relevant. To solve this, the team reused proven chunking strategies that worked for Customer Success and added filters and metadata tags during ingestion.

These tags enable more precise retrieval and simplify evaluation by narrowing results to the most relevant context. Real-world usage validated this approach following the launch. The Informatica Help Agent achieved an 80% resolution rate with only 5% human escalation. This success demonstrates that retrieval accuracy and performance hold under live traffic without sacrificing quality.
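The effect of metadata tags on retrieval precision can be shown with a tiny sketch. The tag names and values below are hypothetical, but the pattern is the one described: near-duplicate chunks that differ only by product version are disambiguated by filtering on tags attached at ingestion time, before any ranking happens.

```python
# Near-duplicate chunks differing only by version metadata (hypothetical tags).
chunks = [
    {"text": "Configure the mapping task",
     "tags": {"product_version": "10.5", "audience": "admin"}},
    {"text": "Configure the mapping task",
     "tags": {"product_version": "10.4", "audience": "admin"}},
]

def retrieve(query_tags: dict, candidates: list) -> list:
    """Narrow candidates to chunks whose tags match before ranking."""
    return [c for c in candidates
            if all(c["tags"].get(k) == v for k, v in query_tags.items())]
```

Filtering first also simplifies evaluation, since each test query has a well-defined set of admissible chunks.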

What architectural decisions enabled reuse instead of rebuilding prior help-agent data work?

Confidence in existing Data 360 patterns drove the decision to reuse systems and move quickly without adding unnecessary complexity. Rather than rebuilding from scratch, the team extended established configurations for ingestion, chunking, indexing, and retrieval to Informatica content.

Although Informatica data behaves differently than Salesforce-authored content, necessary adjustments remained localized. Because pipelines and infrastructure follow a standard design, tuning did not require systemic changes or a ground-up redesign.

This strategy avoided a rebuild that would have required a much larger team and months of extra work. In practice, reusing proven patterns in Data 360 delivered equivalent outcomes in a fraction of the usual time. The process maintained enterprise quality while establishing a scalable foundation for future agent expansions.

Learn more

How Agentic Memory Enables Durable, Reliable AI Agents Across Millions of Enterprise Users https://engineering.salesforce.com/how-agentic-memory-enables-durable-reliable-ai-agents-across-millions-of-enterprise-users/ Mon, 09 Feb 2026 18:46:29 +0000 https://engineering.salesforce.com/?p=8104

The post How Agentic Memory Enables Durable, Reliable AI Agents Across Millions of Enterprise Users appeared first on Salesforce Engineering Blog.

By Makarand Bhonsle, Christina Abraham, and Jayesh Govindarajan.

In our Engineering Energizers Q&A series, we highlight the engineering minds driving innovation across Salesforce. Today’s discussion features Makarand Bhonsle, a software engineering architect at Salesforce, whose team is developing Agentic Memory within Agentforce to provide durable, governable memory for enterprise agents at massive scale.

Explore how the team addressed the inherent limits of stateless agents with small context windows by introducing Agentic Memory as a durable, structured data layer, and how they tackled the formidable challenge of ensuring its accuracy, governability, and reliability at enterprise scale through confidence scoring, write and read gates, and hybrid semantic validation.

What is your team’s mission in addressing the limitations of stateless AI agents within enterprise workflows?

The fundamental objective is to elevate agents beyond fleeting, stateless exchanges, transforming them into dependable collaborators over extended periods. Across the industry, most AI agent architectures operate within a restricted working space, treating each interaction in isolation. This design severely curtails their capacity to retain user context, past decisions, and crucial enterprise constraints across various business workflows. Consequently, applying these architectures reliably becomes increasingly difficult beyond basic, single-turn interactions.

To overcome this limitation, the team prioritizes equipping agents with a robust, durable memory foundation. This memory persists across interactions, yet remains governable and transparent. Agentic Memory is a core platform capability and allows agents to use relevant information in the chat without referring back to chat history and other large consumer datasets.

While short-term context remains tethered to the active session, enabling agents to reason effectively in the immediate moment, long-term memory is linked to a persistent profile graph. This graph endures across sessions and distinct communication channels. This strategic approach ensures continuity without compromising trust, auditability, or enterprise control. The profile graph refers to an individual profile within Salesforce.

The Agent Memory Platform powered by Data 360.

What constraints of small context windows and stateless execution prevent today’s agents from operating reliably over time?

In the realm of stateless agent designs, agents operate with a severely restricted view of information. Older chats, emails, and CRM records simply vanish from their scope as conversations evolve. Furthermore, this execution model consistently resets an agent’s working context with each interaction, even when the user and the task at hand remain constant. During extended interactions, these limitations often lead to repetitive questioning, inconsistent behavior, or noticeable gaps in the retained context.

Industry efforts to mitigate this by injecting vast quantities of raw historical data into prompts only introduce further latency, increase costs, and generate unnecessary noise. Obsolete or irrelevant information actively distracts the model, diminishing its reasoning capabilities. Without explicit memory records, updating or removing facts as individuals change roles, preferences, or circumstances becomes a formidable challenge, allowing outdated information to resurface persistently.

Agentic Memory directly confronts these constraints by externalizing memory into structured records. These records come with explicit lifecycle control. This design enables agents to retain stable facts, adapt their understanding as conditions shift, and effectively discard information that no longer holds relevance.

Why does treating memory as prompt text break down at enterprise scale, and what architectural shift was required to overcome that?

Prompt-based memory approaches frequently fail at an enterprise scale due to their inherent lack of structure, governance, and explainability. When memory exists solely as transient prompt text, auditing an agent’s knowledge, enforcing access controls, or explaining decision paths becomes increasingly difficult. At the enterprise level, these deficiencies severely limit the capacity to meet trust, governance, and compliance expectations.

To rectify this, the team re-conceptualized memory as a core platform capability, rather than a mere prompt-side technique. Memory now resides in a real-time data layer, distinctly separate from prompts, and possesses explicit structure and lifecycle controls. Short-term session context is isolated from long-term memory, which anchors to a profile graph. Raw signals pass through a pipeline that determines whether to add, update, delete, or disregard each memory candidate.

This fundamental shift makes memory inspectable, governable, and explainable. It also allows for seamless integration with retrieval, planning, and tool execution across various agents.
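The add/update/delete/disregard decision at the heart of that pipeline can be sketched as a small state machine over a key-value store. The field names, confidence threshold, and single-store design are illustrative assumptions, not the actual Agentforce implementation.

```python
def decide(candidate: dict, store: dict) -> str:
    """Route a memory candidate: add, update, delete, or ignore (sketch)."""
    key = candidate["key"]
    if candidate.get("confidence", 0.0) < 0.6:
        return "ignore"          # write gate: low-confidence signals never persist
    if candidate.get("retracted"):
        store.pop(key, None)     # explicit lifecycle control: forgetting is an op
        return "delete"
    if key in store and store[key]["value"] != candidate["value"]:
        store[key] = candidate   # conditions changed; replace the stale fact
        return "update"
    if key not in store:
        store[key] = candidate
        return "add"
    return "ignore"              # duplicate of an existing fact
```

Making each outcome an explicit, logged decision is what turns memory from opaque prompt text into something inspectable and auditable.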

How does adaptive context and session-level tracing help make agent behavior governable and explainable at enterprise scale?

Salesforce serves as the authoritative record for enterprise data. However, effective agent automation needs context that evolves with interactions. Adaptive Context allows agents to dynamically refine, prioritize, and prune information in real time, moving beyond static inputs. This helps agents highlight the most relevant signals from conversations, documents, and enterprise systems as their tasks progress.

During execution, agents create structured reasoning and decision traces. These traces capture how choices are evaluated and which tools or actions are selected. They provide an evidence-backed history, explaining an agent’s actions, which supports accurate auditing and governance. To enhance this visibility, a standardized session trace model organizes session activity, capturing the agent’s complete journey.

Over time, these traces build a relational history connecting decisions to enterprise outcomes. By referring to successful past sessions in similar situations, agents can base future behavior on proven patterns. This remains fully inspectable, auditable, and consistent with organizational policies.

What makes enterprise-grade agentic memory especially difficult to build correctly at scale?

The most challenging aspect involves determining what information merits retention and ensuring its accuracy over time. Storing an excessive amount of data quickly generates noise, while saving too little limits practical utility. Episodic memory introduces additional complexity because order and timing are crucial. Agents must preserve the precise sequence of events to reason accurately.

Conflicts between various sources present an additional risk. Enterprise systems may contradict conversational signals, and memory must represent uncertainty rather than false certainty. Mixing short-term context with long-term memory can also lead to private or one-time information persisting incorrectly across multiple sessions.

The team tackles these challenges through strict write and read gates, confidence scoring, memory compaction processes, and comprehensive source tracking. Hybrid matching combines similarity search with semantic checks to prevent duplication and drift as memory evolves.
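A minimal sketch of that hybrid matching idea: vector similarity proposes a possible duplicate cheaply, and a semantic predicate confirms it before the candidate is rejected. The two-field semantic check and the 0.9 threshold are illustrative assumptions; a production system would use real embeddings and a model-backed check.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_duplicate(candidate: dict, existing: dict, sim_threshold: float = 0.9) -> bool:
    """Hybrid check: similarity proposes, a semantic gate confirms."""
    if cosine(candidate["vec"], existing["vec"]) < sim_threshold:
        return False  # dissimilar vectors short-circuit the expensive check
    # Semantic gate (stub): same subject asserting the same value.
    return (candidate["subject"] == existing["subject"]
            and candidate["value"] == existing["value"])
```

The second gate is what prevents drift: two memories can be near neighbors in vector space while asserting contradictory facts, and only the semantic check catches that.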

What performance and cost constraints shaped how Agentic Memory retrieval was designed?

Every agent turn operates under strict latency and cost constraints. If memory retrieval is slow, the agent appears unresponsive. Frequent invocation of large models quickly escalates costs, becoming unsustainable at scale.

The solution employs compact, structured memory records with precomputed embeddings to facilitate rapid similarity search. Only a small, task-relevant subset of memory is retrieved per interaction, with caching applied to active sessions. Smaller models manage inexpensive steps such as candidate extraction and validation, while larger models are reserved for more complex reasoning when necessary.

This design enables agents to leverage long-term memory without sacrificing responsiveness or efficiency in real-time enterprise environments.
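The retrieval side of that design can be sketched as a top-k search over precomputed embeddings with a cache in front of it. The two-dimensional vectors, memory texts, and cache size below are toy assumptions; the real system uses full-size embeddings and session-scoped caching.

```python
import math
from functools import lru_cache

# Toy memory store with precomputed embeddings (hypothetical 2-d vectors).
MEMORIES = [
    {"text": "prefers email follow-ups", "vec": (0.9, 0.1)},
    {"text": "renewal due in Q3",        "vec": (0.1, 0.9)},
    {"text": "based in EMEA",            "vec": (0.7, 0.3)},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

@lru_cache(maxsize=256)  # stand-in for a session cache on repeated lookups
def retrieve(query_vec: tuple, k: int = 2) -> tuple:
    """Return only a small, task-relevant subset of memory per turn."""
    ranked = sorted(MEMORIES, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return tuple(m["text"] for m in ranked[:k])
```

Keeping k small bounds both latency and the prompt tokens spent on memory, which is what makes per-turn retrieval affordable at scale.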

What scientific and engineering challenges arise when modeling episodic, long-term, and short-term memory in agents?

Episodic memory is inherently temporal. Agents must preserve event order and understand how outcomes relate across time, otherwise they risk applying incorrect lessons. Short-term context must also remain distinct from long-term memory to prevent the persistence of transient or sensitive information.

Uncertainty further complicates the modeling process. Different sources may disagree, and the agent must represent confidence rather than assuming correctness. Measuring memory quality also presents a challenge, requiring evaluation across correctness, freshness, helpfulness, and safety, rather than relying on a single metric.

To address these constraints, the team designed a memory model that supports long-term reasoning while respecting temporal boundaries and uncertainty. Time-bounded episodic chunks, a confidence-first design, and replay-based evaluation help ensure memory remains useful without becoming rigid or unsafe.

Further, we consider data sources beyond just agentic conversations. For example, human agent chats in Service Cloud (from the livechattranscript Salesforce object) and Einstein Bot conversations (from the Bot Conversations Data model) can feed into the memory derivation pipeline. We also include data brought in via the zero-copy connector.

These memory derivation candidates are pre-defined as metadata and fed into the derivation pipeline. This aligns with industry memory solutions, where the system extracts memories or insights from an actor-text-blob tuple or actor-text-actor triplet. The text-blob can come from various sources like agentic, bot, or human conversations, or documents such as Excel or PDF files.

This approach means data from all Data 360 connectors are potential memory candidates. Decoupling data sources makes the derivation pipeline a flexible, extensible enterprise memory platform. Memories are derived from diverse sources, not just conversations, and stored in a standard memory object.

What early R&D approaches show the most promise for keeping agentic memory reliable and governable over time?

Memory is currently being approached as clean, structured data, complete with explicit fields for type, time, source, confidence, and lifecycle controls. Write gates are implemented to ensure that only high-quality candidates become part of this memory. Concurrently, read gates are utilized to limit retrieval exclusively to records relevant to the task at hand.

Hybrid validation is a technique that combines vector similarity with meaning checks. This method aims to prevent both duplication and drift within the memory system. Episodic memory undergoes summarization over time, a process designed to preserve signal while simultaneously reducing noise. Furthermore, trusted enterprise records are prioritized over what might be considered more casual conversational signals.

To assess the system’s performance, replay testing is employed to evaluate correctness, freshness, and safety. These techniques, when combined with a cost-aware model selection process, collectively support the development of long-running, enterprise-grade agentic memory, all without sacrificing trust or reliability.
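As a closing illustration, the "clean, structured data" framing above can be sketched as a record type with explicit lifecycle fields plus a write gate and a source-trust ordering. Every field name, the 0.7 threshold, and the trust ranking are hypothetical assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemoryRecord:
    """Memory treated as clean, structured data with explicit lifecycle fields."""
    kind: str                       # e.g. "preference" or "episodic"
    value: str
    source: str                     # provenance: "crm", "document", "conversation"
    confidence: float               # carried with the record, never assumed
    created_at: str                 # ISO-8601 timestamp
    ttl_days: Optional[int] = None  # lifecycle control; None means durable

def passes_write_gate(rec: MemoryRecord, min_conf: float = 0.7) -> bool:
    """Write gate: only sourced, high-confidence candidates become memory."""
    return bool(rec.source) and rec.confidence >= min_conf

def trust_rank(rec: MemoryRecord) -> int:
    """Prefer trusted enterprise records over casual conversational signals."""
    return {"crm": 0, "document": 1, "conversation": 2}.get(rec.source, 3)
```

Because every record carries type, time, source, and confidence, read gates and replay tests have concrete fields to filter and score on.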

Learn more
