Learn Retrieval-Augmented Generation using vectors, cosine similarity, and local LLMs (no cloud, no paid APIs)
- Simple Explanation of RAG
- Models used
- Simple Vector
- Create Vector With LLM
- How do we search Vectors?
- Why Same embedding model for searching?
- Search using cosine search or similarity search
- Real world RAG patterns
- RAG with Vector DB
"Code is cheap but software is expensive"
This statement means that understanding something inside and out is more important than just coding. You have a lot of coding tools available that will produce code however you ask, but you won't be able to ask the right questions for your use case without proper understanding. You will also need to make the generated code better and resolve production issues; coding tools won't help much there, but understanding will help you explain the issue and resolve it as well.
So let us come back to our topic.
LLMs/models don't know your business, your data, or your processes, so how can we make use of these models? We use a model by providing our knowledge at run time so that the model can use that knowledge to respond to questions or take decisions. But how do we provide that knowledge when we have so many documents, processes, etc.? We store our knowledge base in the form of decimal numbers, and these numbers can take values from minus infinity to plus infinity. These lists of decimal values are called vectors and are stored in a vector database.
But how can we search that big vector database to answer questions at run time, i.e. dynamically? We don't search the way we search a SQL database; we search by the meaning of what is asked, and that is called semantic search.
The number of decimal values for a given word/sentence is called the number of dimensions, and creating these vectors from plain text is called embedding.
So how do we search this vector database? These vectors are nothing but points in a multi-dimensional space, and we search by direction, i.e. meaning or semantics. The model first converts the question into a vector and then finds the stored vectors pointing in the same direction in that multi-dimensional space, which makes the search very quick compared to text search. This semantic search is also called cosine search.
You can see in the image below how similar items point in the same direction in the multi-dimensional space.
So to set up RAG we need to do the following steps:
- Take your knowledge and use a model to convert it into vectors; the models that turn text into vectors are called embedding models
- Store the vectors in a vector database
- Now set up your main model, which is going to take up the task of responding to questions
- Each question is again converted into vectors by the same embedding model that converted the knowledge into vectors
- The question's vectors are searched in the vector database to find stored vectors with the same semantic meaning
- Once matching vectors are found, they are converted back into text and given to the main model as context to respond to the question
- The main model uses this context to answer the question
The steps above are also called the RAG pipeline.
So RAG has the following 3 main steps:
- Retrieve the vectors from the database matching what is asked (semantic match)
- Provide the retrieved details to the model as context
- The model uses the context and generates the response
- Embedding models → convert the knowledge base into vectors
- Answering models → take the question and use the provided context (from RAG) to generate an answer
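The three steps above can be sketched in a few lines of Python. Everything here is a toy stand-in: `embed` returns canned 3-dimensional vectors instead of calling a real embedding model, and `answer` only assembles the prompt a real answering model would receive.

```python
# Toy sketch of the RAG steps: retrieve by vector direction, then
# augment the prompt with the retrieved context. All names are illustrative.

def embed(text):
    # Stand-in for an embedding model: canned 3-dimensional vectors.
    canned = {
        "what sport uses a bat?": [0.9, 0.1, 0.0],
        "cricket is played with a bat and ball": [0.8, 0.2, 0.1],
        "an apple a day keeps the doctor away": [0.1, 0.1, 0.9],
    }
    return canned[text.lower()]

def retrieve(question, docs, top_k=1):
    # Retrieve: rank stored vectors against the question's vector.
    q = embed(question)
    scored = sorted(
        docs,
        key=lambda d: sum(a * b for a, b in zip(q, embed(d))),
        reverse=True,
    )
    return scored[:top_k]

def answer(question, context):
    # Augment: a real system would now send this prompt to the main model.
    return f"Context: {context}\nQuestion: {question}"

DOCS = ["Cricket is played with a bat and ball",
        "An apple a day keeps the doctor away"]
context = retrieve("What sport uses a bat?", DOCS)
prompt = answer("What sport uses a bat?", context)
```

The cricket document wins the ranking because its toy vector points in almost the same direction as the question's vector; the apple document points elsewhere.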
Let us install scikit-learn, which can create vectors without using any model, for learning purposes (we will use models later on):
uv pip install scikit-learn
Now let us run the following program:
1-understand-vector.py
Here you will see the vector contains only 1s and 0s and no decimals. This is also a vector, but it has no semantic meaning. There is nothing further to learn in this file; it was just to show that a vector can hold 1s and 0s too (as printed by this file), but such a vector has no semantic meaning or direction, and it will not help if we want to search something by what it means, i.e. by its semantic value.
- Install ollama
curl -fsSL https://ollama.com/install.sh | sh
- Install the model for embedding
ollama pull nomic-embed-text
Now run the following file:
2-create-vector-llm.py
Now you can see that each line has 768 dimensions, and the values are both positive and negative decimals, not just 1s and 0s. The previous example (where we didn't use an LLM) had fewer values, but here we have 768 values for each line because the vector now holds the semantic meaning of the text. These values are not 'mechanical'; they store and reflect semantic meaning, and that meaning is given by the model we used. This process is called embedding.
In the previous file we saw lots of decimal numbers, which are called vectors and hold semantic meaning. But the question is: how do we search for a semantic meaning in a set of vectors or a vector database?
Cosine similarity measures how alike two non-zero vectors are by calculating the cosine of the angle between them; it captures direction rather than magnitude. The main point here is direction, not magnitude. If the direction of what is searched matches the direction of something stored in the vector database, those values are picked in the search.
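A sketch of how such a file might call Ollama's local embedding endpoint (this assumes `ollama serve` is running and `nomic-embed-text` has been pulled; the function name is illustrative):

```python
# Embed one piece of text via Ollama's local /api/embed endpoint.
# Requires a running Ollama server; nothing is called at import time.
import json
import urllib.request

def embed_one(text, model="nomic-embed-text",
              url="http://localhost:11434/api/embed"):
    payload = json.dumps({"model": model, "input": text}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # The response carries a list of embeddings, one per input.
        return json.load(resp)["embeddings"][0]

# vec = embed_one("Cricket is played with a bat")  # 768 floats for nomic-embed-text
```

The actual call is left commented out because it needs the local server; the point is that one HTTP request per line of text is all the "embedding" step is.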
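Stripped down to code, cosine similarity is just a dot product divided by the two vector lengths:

```python
# Cosine similarity from scratch: direction matters, magnitude doesn't.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Same direction, different magnitude -> similarity ~1.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Perpendicular directions -> similarity 0.0
print(cosine_similarity([1, 0], [0, 1]))
```

Note that doubling every value of a vector doesn't change its similarity to anything, which is exactly the "direction, not magnitude" point.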
Now let us understand this with the following example; run this file:
3-cosine-similarity.py
| | doc1 | doc2 | doc3 |
|---|---|---|---|
| doc1 | 1.0000 | 0.5758 | 0.4282 |
| doc2 | 0.5758 | 1.0000 | 0.4062 |
| doc3 | 0.4282 | 0.4062 | 1.0000 |
Looking at these numbers, where one doc is related to another the number is higher, and where they seem unrelated the number is lower. Here are those three docs again, i.e. doc1, doc2, and doc3:
- "doc1": "Cricket is game played by bat and hard leather ball",
- "doc2": "cricket has 11 palyers each side",
- "doc3": "an apple a day is great for health",
doc1 and doc2 are somewhat related, but doc3 is completely different: doc3 is talking about fruit, while doc1 and doc2 are talking about the sport of cricket. Now look at this diagram where doc1, doc2, and doc3 are displayed; you can see that doc3 points in almost a completely different direction.
This is just a 3-dimensional diagram created to explain the vector distribution; in reality the number of dimensions can be in the thousands. So you can imagine that if the directions match, the vectors must carry the same semantic meaning.
Cosine similarity measures the angle between two directions from the origin; that is why doc1 and doc2 are separated by a small angle while doc3 sits at a much larger angle from both. You can also see in the diagram that a document compared with itself has a cosine similarity of 1, because cos(0) = 1.
- Angle between doc1 and doc2 -> 42.1 degrees
- Angle between doc1 and doc3 -> 77.7 degrees
- Angle between doc2 and doc3 -> 78.6 degrees
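The mapping between similarity values and angles is just the arccosine. The values below are generic illustrations (the article's angles come from its own diagram), but the endpoints hold for any vectors:

```python
# A cosine-similarity value is cos(theta); arccos recovers the angle.
import math

def angle_degrees(cosine_value):
    return math.degrees(math.acos(cosine_value))

print(angle_degrees(1.0))  # 0.0  -> identical direction (doc vs itself)
print(angle_degrees(0.5))  # ~60.0 -> partially related
print(angle_degrees(0.0))  # 90.0 -> unrelated directions
```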
Now you have full clarity about the internal mechanism of vector search. It looks complex initially, but once you talk in terms of cosine values and angles/degrees, things become easier to understand. Mind you, none of this is required to develop RAG applications, but knowing the internal mechanism gives a lot of confidence and satisfaction.
Many AI developers ask this question: why can't I use a different model when it comes to searching the vector database? The reasons are the following:
- Each model has its own number of dimensions; another model may not match the number of dimensions used to store that semantic meaning.
- Even the meaning of the axes can differ between models
I hope it is now clear why the same model must be used for searching as well.
Before we start working on cosine search, let us understand a few things:
- Cosine search is a similarity search, so you will get lots of results and have to use only the top few (here we will use the top 2)
- It is not a SQL query where you find an exact match and give it back to the user as is
- The top results first need to be given to the model (the non-embedding one), which uses them as context and provides an answer based on that context
Now let us look at this file:
4-simple-rag.py
One question you will have when you look at this file: why the different endpoint "http://localhost:11434/api/embed", and not the "http://localhost:11434/v1" we were already using?
- http://localhost:11434/v1 --> for reasoning (the real use of an LLM; its results go to users directly)
- http://localhost:11434/api/embed --> for mechanics, i.e. creating vectors (its results are not used by users directly)
In this program you can see that the retrieve function first creates a vector of the question being asked by calling embed_one, which uses the same embedding model.
If we don't use the same model here, we will get incorrect results.
You can run this program and see for yourself that it answers questions correctly.
Also look at the function build_rag_prompt, which brings everything together and creates the prompt containing both the question that was asked and the context found by searching the vectors (the top 2 after the similarity/cosine search).
Now you have a very good basic understanding of what a vector is, what cosine search is, and why we use more than one model in a RAG application.
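The real build_rag_prompt isn't reproduced here, but the idea is simple string assembly along these lines (the wording of the template is illustrative):

```python
# Hypothetical sketch of a build_rag_prompt-style helper: the top search
# hits become the context block, and the user's question is appended.
def build_rag_prompt(question, top_docs):
    context = "\n".join(f"- {doc}" for doc in top_docs)
    return (
        "Answer the question using ONLY the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "How many players per side in cricket?",
    ["cricket has 11 players each side",
     "Cricket is played with a bat and ball"],
)
print(prompt)
```

The answering model never sees the vectors at all; it only sees this plain-text prompt.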
Now the question you may have is: "Do we use only cosine search while searching vectors, or also some other classic filters?" The answer comes in the next section.
A company might have 10 million customers and might store call transcripts, chat transcripts, or behaviour patterns in a vector database. If we do a cosine search over all of that in real time, it will take too much time, so how does it work?
In real RAG applications we don't rely on cosine search alone; we also supply additional filters like customer_id, region, product, and maybe dates. This limits the overall search space. So the next question is: "How do we store customer_id in the vector database?"
Vectors are stored in a table, which is also called a vector index. All of the vectors go in one column of that index, but we still store other values in that index/table, like customer_id, region, etc. So when it comes to searching, we first apply filters on the non-vector columns so that we only look at the vectors belonging to that customer and not to every single customer.
Now let us look at the following program:
5-rag-with-addtional-filters.py
You can see that the apply_filters function applies a few filters before doing the cosine/similarity search, limiting the overall records.
This is the industry standard: never store vectors without other business columns, as search will not only be very slow but will also cost a lot.
So far our vectors are in-memory, but in the real world vectors are not stored in-memory but in a vector DB so that all RAG programs can access them. Let us install a vector DB locally:
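The filter-then-search pattern can be sketched like this (the field names, rows, and 2-dimensional vectors are purely illustrative, not the actual file's data):

```python
# Sketch: filter on plain business columns first, then run cosine
# similarity only over the surviving vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A vector "index": one vector column plus ordinary business columns.
INDEX = [
    {"customer_id": "C1", "region": "EU", "text": "billing complaint", "vector": [0.9, 0.1]},
    {"customer_id": "C2", "region": "US", "text": "login issue",       "vector": [0.1, 0.9]},
    {"customer_id": "C1", "region": "EU", "text": "refund request",    "vector": [0.8, 0.3]},
]

def search(question_vector, customer_id, top_k=1):
    # Cheap exact-match filter first: shrink the candidate set.
    candidates = [row for row in INDEX if row["customer_id"] == customer_id]
    # Expensive similarity ranking only over what survived.
    candidates.sort(key=lambda r: cosine(question_vector, r["vector"]), reverse=True)
    return [row["text"] for row in candidates[:top_k]]

print(search([1.0, 0.2], customer_id="C1"))  # only C1's rows are even scored
```

With 10 million customers, the filter step is what keeps the similarity step affordable.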
uv pip install chromadb
This vector database is good for local setup and testing, but in production environments we use other vector DBs like Pinecone, S3 Vectors, etc. There are many vector DBs in the market today.
You don't have to treat vectors as something different from the other data you store; the only thing that changes with vectors is how they are searched: by semantic meaning, not by their literal values.
Now let us run the following program:
6-rag-with-vector-db.py
So far we are providing the knowledge in the form of JSON in a variable named DOCS, but in the real world knowledge will come from your databases, PDFs, file stores, etc.
I will create a separate repo for vector creation and ingestion, as that is a big topic in itself and requires careful explanation: the method of storing vectors matters for proper semantic search.
The vector creation process is a separate process with its own pipelines; in these code examples we are keeping things simple for understanding purposes. But keep this in mind: you will never have the vector creation process and the retrieval process as part of the same program or setup.
Now you have a full understanding of what RAG is, the mathematics behind RAG/cosine/semantic search, and how we apply basic SQL-style filters before the cosine search. You also know there are two kinds of models we use: one to create vectors and one to interpret the context returned by the cosine search.
Always remember we call it semantic search or cosine search; we don't use the word "filter", because cosine search returns everything with a cosine value until you limit it to the top few that you use for creating the final prompt.


