A simple local RAG (Retrieval-Augmented Generation) guide you can follow to run LLMs locally for free.
- LocalRAG
- Guide
- System Requirements
- Guide 1: Run Local RAG Using AnythingLLM and OpenRouter
  - 1 Install AnythingLLM
  - 2 Generate API Keys for OpenRouter
  - 3 Launch AnythingLLM and Configure It to Use OpenRouter
  - 4 Create a Workspace
  - 5 Set Up the Vector Database and Embed Your Documents into Your RAG
  - 6 Upload Your Documents and Build Your Knowledge Base
  - 7 (Optional) Configure Query Mode
  - 8 Test the RAG System
- Guide 2: Run Local RAG Using AnythingLLM and a Self-Hosted Chat LLM Using LM Studio
- Links
- Models That Work on an NVIDIA RTX 4080 GPU
- Guide
This guide was tested with the following system specs:
- Windows 11
- NVIDIA GeForce RTX 4080 Laptop GPU
- 16 GB RAM
- Intel(R) Core(TM) i7-14650HX processor, 2200 MHz, 16 cores, 24 logical processors
Guide based on AnythingLLM v1.8.2
- (recommended) Download the latest AnythingLLM version from here: https://anythingllm.com/
OpenRouter provides access to a wide range of free LLMs from most of the big players (OpenAI, Meta, Google, xAI's Grok, and more).
- Head to https://openrouter.ai, sign up, and confirm your email
- Head to https://openrouter.ai/settings/keys and generate a key
- Head to Settings -> AI Providers -> LLM
- Select OpenRouter and set your API key
- Select a model marked as free
- In this example I have used meta-llama/llama-4-maverick:free, but other excellent choices include:
- mistralai/mistral-small-3.1-24b-instruct:free
- deepseek/deepseek-r1-distill-llama-70b:free
Now that you have added the LLM you should be able to start chatting right away.
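If you want to sanity-check your OpenRouter key outside AnythingLLM, you can call OpenRouter's OpenAI-compatible chat completions endpoint directly. A minimal sketch, assuming your key is in the `OPENROUTER_API_KEY` environment variable and using this guide's example model:

```python
import json
import os
import urllib.request

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(model: str, prompt: str) -> str:
    """Send one chat message to OpenRouter and return the reply text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__" and "OPENROUTER_API_KEY" in os.environ:
    # Requires network access and a valid key.
    print(ask("meta-llama/llama-4-maverick:free", "Say hello in one word."))
```

This is the same API AnythingLLM talks to under the hood once the provider is configured, so a working reply here means your key and model name are good.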
- Create a workspace
This is the most important step of the RAG system: embedding your documents into a vector database. Embedding essentially generates a vector of numbers (relations) from each document you upload. Similar documents generate similar vectors and are therefore grouped/linked/related more closely than others. The vector database serves as the knowledge base for your LLM (any LLM you plan to use will draw on this database, so you can switch between LLMs).
We require two machine learning models: one for producing the vector database/knowledge base (the embedding model) and one for chatting and answering from that knowledge base (the chat model).
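The "similar documents generate similar vectors" idea can be illustrated with plain cosine similarity. The 3-dimensional vectors below are toy values for illustration; real embedding models produce hundreds to thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: two documents about cats point in nearly the same
# direction, while a finance document points elsewhere.
doc_cats_1 = [0.9, 0.1, 0.0]
doc_cats_2 = [0.8, 0.2, 0.1]
doc_finance = [0.0, 0.1, 0.9]

print(cosine_similarity(doc_cats_1, doc_cats_2))   # close to 1.0
print(cosine_similarity(doc_cats_1, doc_finance))  # much smaller
```

This similarity score is what the vector database uses to decide which document chunks are most relevant to a query.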
- Set up the vector database (for a local setup the default option works fine)
- Select an embedding LLM (for a local setup the default option works fine).
Some other embedding options:
| Model/API | Dim | Free? | API Option | Notes |
|---|---|---|---|---|
| Cohere/embed-english-light-v3.0 | 1024 | ✅ 1M tokens/month | Cohere | Fast, accurate, multilingual |
| OpenAI/text-embedding-3-small | 1536 | ❌ after limit | OpenAI | Great but paid |
| Mistral Embed | 1024 | ✅ via Together.ai/OpenRouter | OpenRouter | Lightweight, decent for chatbots |
| BAAI/bge-base-en-v1.5 | 768 | ✅ HF / Local | HuggingFace, Ollama | Strong multilingual support |
| DeepSeek Embed | 1024 | ✅ via OpenRouter | OpenRouter | Matches DeepSeek RAG pipelines |
| InstructorXL | 768 | ✅ HuggingFace | HuggingFace Spaces | Instruction-based embedding model |
You may upload images, PDFs, text files, etc. to build your knowledge base.
- Head to the home screen and click on embed a document
- Upload your documents and assign them to your workspace
- Click on your workspace settings and change the chat mode to query
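In query mode the LLM answers only from retrieved context: each question is embedded, the most similar chunks are fetched from the vector database, and they are stitched into the prompt. A toy sketch of that retrieval step, using made-up chunk texts and 2-dimensional vectors (real stores hold high-dimensional embeddings):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_chunks(query_vec, store, k=2):
    """Rank stored (text, vector) chunks by similarity to the query vector."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Query-mode style prompt: answer only from the supplied context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy vector store: (chunk text, embedding vector).
store = [
    ("AnythingLLM supports OpenRouter as an LLM provider.", [0.9, 0.1]),
    ("Vector databases rank chunks by similarity scores.",  [0.1, 0.9]),
    ("Workspaces group documents and chat settings.",       [0.7, 0.3]),
]

query_vec = [0.85, 0.15]  # pretend embedding of the user's question
chunks = top_k_chunks(query_vec, store, k=2)
print(build_prompt("Which providers does AnythingLLM support?", chunks))
```

Chat mode lets the model fall back to its general knowledge; query mode constrains it to the retrieved chunks, which is why it gives more grounded answers over your documents.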
- Coming Soon
- LM Studio
  - Try the TypeScript SDK (https://lmstudio.ai/docs/typescript) in a client app project: https://github.com/lmstudio-ai/lmstudio-js
  - https://hub.docker.com/r/noneabove1182/lmstudio-cuda
- AnythingLLM
  - https://github.com/mudler/LocalAI
| Model | Size | Notes |
|---|---|---|
| TinyLLaMA (1.1B) | 1.1B | Great for testing, good trade-off |
| LiteLLaMA (460M) | 460M | Ultra-light for toy use |
| MobileLLaMA (1.4B/2.7B) | 1.4–2.7B | Fast for real-time apps |
| Phi‑3‑mini (3.8B) | 3.8B | Benchmarks rival larger models |
| Llama 3.1‑Minitron (4B) | 4B | Distilled, high efficiency |







