Understanding Retrieval-Augmented Generation (RAG) in AI

LLMs cannot access private company data such as internal documentation, customer records, product specifications, and HR processes. When employees ask questions about company HR policies, a standard LLM can only provide generic responses. Likewise, when customers inquire about specific products, it can only rely on common patterns learned from public internet data.

Retrieval-Augmented Generation (RAG) solves these problems by giving AI systems access to specific documents and data.

RAG does not depend solely on what the model learned during training. Instead, it allows the system to look up relevant information from a particular document collection before generating a response.

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation (RAG) is a generative AI method that enhances LLM performance by combining the model's world knowledge with custom or private knowledge. This combining of knowledge sets in RAG is helpful for several reasons:

  • Providing LLMs with up-to-date information: LLM training data is sometimes incomplete. It may also become outdated over time. RAG allows adding new or updated knowledge without retraining the LLM.
  • Preventing AI hallucinations: The more accurate and relevant the in-context information LLMs have, the less likely they’ll invent facts. They are also less likely to respond out of context.
  • Maintaining a dynamic knowledge base: Custom documents can be updated, added, removed, or modified anytime. This keeps RAG systems up-to-date without retraining.

How Does RAG Work?

RAG works in two phases:

  1. Document preparation phase, which occurs once when the system is set up. It can also run again later, when new documents or sources of information are added to the system.
  2. Query processing phase, which happens in real time whenever a user asks a question.

This two-phase approach is powerful. It provides a separation of concerns between the computationally intensive document-preparation phase and the latency-sensitive query phase.

1. The document preparation phase

In this phase, the documents are first collected and processed: each document, whether a PDF, Word document, web page, or database record, is converted into plain text.

Once the text is extracted, the system breaks it into smaller chunks. Chunking is necessary because documents are usually too long to process as single units; a 100-page technical manual might be split into hundreds of smaller passages, each containing a few paragraphs.

The next step transforms these text chunks into numerical representations known as embeddings. These numbers encode the meaning of the text in a way that allows mathematical comparison: similar concepts produce similar number patterns, which lets the machine find related content even when different words are used.

These embeddings, along with the original text chunks and their metadata, are then stored in a specialized vector database. This database is optimized for finding similar vectors: it indexes the embeddings in a way that allows rapid similarity searches across millions of chunks.
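As a rough sketch of this phase, the toy Python below chunks a document, turns each chunk into a vector, and stores it in an in-memory list standing in for the vector database. The hash-based `embed` function and the `index_document` helper are made up for illustration; a real system would call an embedding model and a real vector store.

```python
import hashlib
import math

def chunk_text(text, chunk_size=200):
    """Split extracted text into fixed-size word chunks (toy strategy)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def embed(text, dims=64):
    """Toy embedding: hash each word into a slot of a fixed-size vector.
    A real system would call an embedding model here instead."""
    vec = [0.0] * dims
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# "Vector database": a plain list of (embedding, chunk, metadata) records.
vector_store = []

def index_document(doc_id, text):
    """Chunk a document, embed each chunk, and store it with metadata."""
    for chunk in chunk_text(text):
        vector_store.append((embed(chunk), chunk, {"doc": doc_id}))
```

The metadata dictionary is where source name, page number, or timestamps would go, so results can later be filtered or cited.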

2. Query Processing Phase

The query processing journey starts when the user enters a question into the system. That question first goes through the same embedding process as the document chunks.

For example, the question “What is our refund policy for electronics?” gets converted into its own numerical vector using the same embedding model that processed the documents.

With the query now in vector form, the system searches the vector database for the most similar document chunks. This similarity search is fast because it uses mathematical operations rather than text comparison. The database might contain millions of chunks, but specialized algorithms can find the most relevant ones in milliseconds. Typically, the system retrieves the top 3 to 10 most relevant chunks based on their similarity scores.
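The retrieval step can be sketched with plain cosine similarity over an in-memory store; the three-dimensional vectors and chunk texts below are hand-made stand-ins for real embeddings, and a real vector database would use approximate-nearest-neighbor indexing rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, store, k=3):
    """Return the k chunks most similar to the query vector."""
    scored = [(cosine(query_vec, vec), chunk) for vec, chunk in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Tiny hand-made store: (vector, chunk) pairs standing in for embeddings.
store = [
    ([1.0, 0.0, 0.0], "Refunds for electronics are accepted within 30 days."),
    ([0.0, 1.0, 0.0], "Our office is closed on public holidays."),
    ([0.9, 0.1, 0.0], "Opened electronics may incur a restocking fee."),
]

# A refund-related query vector lands closest to the two refund chunks.
results = top_k([1.0, 0.0, 0.0], store, k=2)
```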

These retrieved chunks then need to be prepared for the language model. The system assembles them into a context. It often ranks them by relevance. Sometimes, it filters based on metadata or business rules. For example, more recent documents might receive priority over older ones. Some sources might be viewed as more authoritative than others.

The language model now receives both the user’s original question and the relevant context.

The prompt contains the following details:

  • Context documents provided
  • User’s specific question
  • Instructions to answer based on the provided context
  • Guidelines for handling information not found in the context
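A minimal sketch of assembling such a prompt from the retrieved chunks; the template wording is illustrative, and production systems tune it heavily.

```python
def build_prompt(question, chunks):
    """Assemble an augmented prompt: context, question, and instructions."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n"
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

prompt = build_prompt(
    "What is our refund policy for electronics?",
    ["Refunds for electronics are accepted within 30 days.",
     "Opened electronics may incur a restocking fee."],
)
```

Numbering the chunks (`[1]`, `[2]`) gives the model stable labels it can cite in its answer, which the post-processing step can then link back to source documents.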

The language model processes this augmented prompt and generates a response. Since it has specific, relevant information in its context, the response can be accurate and detailed rather than generic.

Finally, the response often goes through post-processing before reaching the user. This may involve adding citations that link back to source documents, formatting the response for better readability, and checking that the answer properly addresses the question.

Embeddings – The Numerical Language of AI

Someone might ask about a computer problem in various ways. They could say, “laptop won’t start,” “computer fails to boot,” “system not powering on,” or “PC is dead.” These phrases share almost no common words, yet they all describe the same issue.

A keyword-based system would treat these as completely different queries and miss troubleshooting guides that use different terminology.

Embeddings solve this by capturing semantic meaning rather than surface-level word matches.
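A quick way to see the problem embeddings solve is to count shared words between those phrases, which is essentially what a keyword search relies on:

```python
def shared_keywords(a, b):
    """Words two phrases have in common -- what keyword search matches on."""
    return set(a.lower().split()) & set(b.lower().split())

phrases = ["laptop won't start", "computer fails to boot",
           "system not powering on", "pc is dead"]

# Every pair of phrases shares zero words, so keyword matching sees
# no connection between them, even though they mean the same thing.
overlaps = [shared_keywords(p, q)
            for i, p in enumerate(phrases) for q in phrases[i + 1:]]
```

An embedding model, by contrast, would place all four phrases close together in vector space because their meanings are similar.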

Embedding Models vs. Large Language Models (LLMs) in RAG

An embedding model converts non-numeric data, such as words, sentences, or images, into dense numerical vectors that capture the semantic meaning of the input.

In contrast, an LLM is a large, sophisticated deep-learning model that uses the numerical representations provided by embeddings to generate human-like text and perform complex language tasks.

This specialization is why RAG systems use two separate models. The embedding model efficiently converts all the documents and queries into vectors, enabling fast similarity search. The LLM then takes the retrieved relevant documents and generates intelligent, contextual responses.

Final Point

Retrieval-Augmented Generation represents a practical solution to the very real limitations of LLMs in business applications. RAG combines the power of semantic search through embeddings with the generation capabilities of LLMs. This combination enables AI systems to provide accurate and specific answers based on the organization’s own documents and data.

Idempotency in API Design

What is Idempotency?

An idempotent API guarantees that repeating the same request has the same effect as making it once.

Without idempotency, retries can create duplicate rows or records and trigger unintended side effects.

Common Use Cases for Idempotency

Order Creation: If an online store allows customers to place orders, retrying an order request could result in duplicate orders. Using an idempotency key prevents this.

Payment Processing: Retrying a payment request without idempotency could lead to multiple charges for the same transaction.

Resource Creation: Creating resources like products, users, or accounts should be idempotent to avoid duplicates.

Booking Systems: Hotel or airline bookings need idempotency to prevent users from booking the same reservation multiple times due to retries.

Strategies for Implementing Idempotency

  1. Unique Request Identifier

A simple way to achieve idempotency is by attaching a unique identifier (idempotency key) to each request. If the server receives a request with the same ID again, it recognizes it as a duplicate and ignores it.

Example: A payment service can require a unique ID for every transaction. If the client retries with the same ID, the server skips the charge to avoid duplicates.
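A minimal in-memory sketch of this idea, assuming a made-up `charge` operation; a real service would persist the keys and results in a database or cache with expiry.

```python
# In-memory idempotency store; production systems would use durable
# storage. The charge operation below is a stand-in, not a real API.
processed = {}

def charge(idempotency_key, amount):
    """Charge once per key; replays return the stored result instead."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # duplicate: no new charge
    result = {"charged": amount, "key": idempotency_key}  # pretend payment
    processed[idempotency_key] = result
    return result

first = charge("txn-42", 100)
retry = charge("txn-42", 100)   # same key: returns the original result
```

Returning the stored result (rather than an error) lets clients retry blindly after a timeout without ever learning whether the first attempt succeeded.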

  2. Database Adjustments (Upsert Operation)

To prevent duplicates in the database, use operations like “upsert” (update or insert). This ensures the database remains consistent without creating duplicates.

Example: Using SQL INSERT … ON CONFLICT can either update or insert a record, avoiding duplicates.
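A runnable sketch using Python's built-in sqlite3 module, which supports the same INSERT ... ON CONFLICT upsert syntax as PostgreSQL; the orders table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")

def upsert_order(order_id, status):
    """Insert the order, or update its status if the ID already exists."""
    conn.execute(
        "INSERT INTO orders (order_id, status) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )

upsert_order("A1", "pending")
upsert_order("A1", "paid")      # retried/updated: still only one row
rows = conn.execute("SELECT order_id, status FROM orders").fetchall()
```

The primary key on `order_id` is what makes the conflict detectable; without a unique constraint, the database has no way to recognize the duplicate.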

  3. Idempotency in Messaging Systems

In messaging systems, store processed message IDs and check each incoming message against this list. If the message ID already exists, it’s ignored.
Example: A unique messageId is checked before processing, ensuring no duplicates.
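A sketch of consumer-side deduplication, assuming each message carries a `messageId` field; a production consumer would persist the seen-ID set rather than keep it in memory.

```python
# Remember processed message IDs and skip any redelivered message.
seen_ids = set()
handled = []

def handle_message(message):
    """Process a message once; return False for ignored duplicates."""
    if message["messageId"] in seen_ids:
        return False                # duplicate delivery: ignore
    seen_ids.add(message["messageId"])
    handled.append(message["body"])
    return True

stream = [
    {"messageId": "m1", "body": "order created"},
    {"messageId": "m1", "body": "order created"},   # redelivered
    {"messageId": "m2", "body": "order shipped"},
]
results = [handle_message(m) for m in stream]
```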

  4. Idempotency in HTTP Methods

HTTP methods can be either idempotent or non-idempotent, which affects how retries are handled. The HTTP specification defines GET, HEAD, PUT, and DELETE as idempotent, while POST is not, and PATCH is not guaranteed to be.

The Benefits

Safe retry: Clients can repeat requests after timeouts or network drops without fear of duplicate actions.

Data integrity: Prevents double charges, duplicate records, or repeated side effects like emails.

Easier recovery: Systems can recover from partial failures by re-sending requests with confidence.

Simpler error handling: Eliminates the worry of not knowing whether an action actually completed after a failed or slow response.

Better user experience: Users don’t get penalized for clicking twice or refreshing at the wrong time.

Operational flexibility: Load balancers and retries at scale become safer because duplicates don’t create extra work.

The Tradeoffs

Extra complexity: You need logic to detect duplicates, store outcomes, and decide what to return.

Storage overhead: Idempotency keys or cached results must be saved, managed, and eventually expired.

Performance impact: Every request adds a lookup step, which can increase latency.

Distributed challenges: Multiple servers must share the same truth to avoid races or inconsistencies.

Implementation Using a .NET API

Implementing Idempotent REST APIs in ASP.NET Core: https://www.milanjovanovic.tech/blog/implementing-idempotent-rest-apis-in-aspnetcore

Blue-Green Deployment with Azure DevOps and App Service

Blue/Green deployment is a deployment model in which we keep two production-like environments, with one always serving production traffic while the other sits idle or is used for testing features. One environment always contains the latest code destined for production, while the other contains the previous production code.

Getting the latest changes into production is as simple as swapping the DNS to point to the environment containing the latest code. Rolling back a deployment that doesn't meet expectations is as simple as switching back to the environment containing the previous production code.

Let’s discuss how we can use Azure Web App Deployment Slots and Azure DevOps Tools like Repos, Pipelines (Build/Release) to automate this process.

Azure Boards provides backlogs and work item tracking to help development teams collaborate and coordinate their work.

Azure Repos fires a trigger to launch a Build Pipeline. The Build Pipeline includes jobs and tasks that clone the repo, install tools, build the solution, and then package and publish artifacts to Azure Artifacts.

Release Pipeline is responsible for deploying the application artifacts to development, QA, and production environments. The Release Pipeline is organized into stages which, although executed sequentially, act independently of each other. In this scenario, the Dev stage deploys the application to a Dev environment. This environment is typically hosted in a non-production Subscription and may share an App Service Plan with other non-production environments such as QA.

Between stages, you use approvals and gates to control when the next stage is executed. This allows your team to perform testing and validation in each stage before moving on to the next.

In a Blue-Green Deployment, the staging slot represents your “green” deployment and the production slot represents your “blue” deployment. Once you validate that everything has been successfully deployed to the staging slot (i.e. green), the Prod stage performs a swap of green and blue. This makes the green deployment live for end-users and moves the blue deployment to your staging slot, where it remains until you remove it. If problems arise with the new green deployment, you can swap again to move blue back to production.


Learn by doing: dapr hands on lab

Dapr is a portable, event-driven runtime that makes it easy for developers to build resilient, stateless and stateful microservice applications that run on the cloud and edge, while embracing the diversity of languages and developer frameworks.

Edwin has created a repository that contains several hands-on assignments that will introduce you to Dapr. You will start with a simple ASP.NET Core Application that contains a number of services. In each assignment, you will change a part of the application so it works with Dapr (or “rub some Dapr on it” as Donovan Brown would say). The Dapr features you will be working with are:

  • Service invocation
  • State-management
  • Publish / Subscribe
  • Secrets

For the assignments, you will be using Dapr in stand-alone mode. As a stretch goal, we added the last assignment that will ask you to run the Dapr application on Kubernetes.

Practice link: https://github.com/EdwinVW/dapr-hands-on

I hope this helps!