
Quickstart

Get started with Llama Stack in minutes!

Llama Stack is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. You can build and test using a local server first and deploy to a hosted endpoint for production.

In this guide, we'll walk through how to build a RAG application locally using Llama Stack with Ollama as the inference provider for a Llama Model.

πŸ’‘ Notebook Version: You can also follow this quickstart guide in a Jupyter notebook format: quick_start.ipynb

Step 1: Installation and Setup

Install Ollama by following the instructions on the Ollama website, then download the Llama 3.2 3B model and start the Ollama service.

ollama pull llama3.2:3b
ollama run llama3.2:3b --keepalive 60m
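
Before moving on, you can optionally confirm that Ollama is running and has the model pulled. This sketch assumes Ollama's default port (11434) and its `/api/tags` listing endpoint; it only uses the Python standard library.

```python
# Optional sanity check: ask the local Ollama server which models it has pulled.
# Assumes Ollama's default port (11434) and its /api/tags listing endpoint.
import json
import urllib.request


def ollama_has_model(name: str, host: str = "http://localhost:11434") -> bool:
    """Return True if the local Ollama server lists `name` among its models."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            tags = json.load(resp)
    except OSError:
        return False  # server not reachable
    return any(m.get("name", "").startswith(name) for m in tags.get("models", []))


print(ollama_has_model("llama3.2:3b"))
```

If this prints `False`, double-check that `ollama run` is still active before continuing.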

Install uv to set up your virtual environment.

Use curl to download the script and execute it with sh:

curl -LsSf https://astral.sh/uv/install.sh | sh

Set up your virtual environment.

uv sync --python 3.12
source .venv/bin/activate

Step 2: Run the Llama Stack server

We will use uv to install dependencies and run the Llama Stack server.

# Install dependencies for the starter distribution
uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install

# Run the server
OLLAMA_URL=http://localhost:11434/v1 uv run --with llama-stack llama stack run starter
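
The server can take a few seconds to come up. If you are scripting these steps, a small helper that polls the port avoids racing it; this is a minimal standard-library sketch (8321 is the default port the demo script below connects to).

```python
# Minimal readiness check: poll until the Llama Stack port accepts TCP
# connections, or give up after `timeout` seconds.
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once `host:port` accepts a connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(0.5)
    return False
```

For example, call `wait_for_port("localhost", 8321)` before launching the demo script.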

Step 3: Run the demo

Now open a new terminal and copy the following script into a file named demo_script.py.

demo_script.py
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient

vector_store_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://localhost:8321")

models = client.models.list()

# Select the first LLM and the first embedding model.
# Prefer Ollama models since they don't require API keys.
models_list = list(models)
llm_models = [m for m in models_list if m.id and not m.id.startswith("sentence-transformers")]
ollama_models = [m for m in llm_models if "ollama" in m.id.lower()]
model_id = (ollama_models[0] if ollama_models else llm_models[0]).id

# Get embedding model
embedding_models = [m for m in models_list if m.id and m.id.startswith("sentence-transformers")]
em = embedding_models[0] if embedding_models else None
if not em:
    raise ValueError("No embedding model found")
embedding_model_id = em.id
# Default embedding dimension for nomic-embed-text-v1.5 is 768
embedding_dimension = 768

print(f"Using model: {model_id}")

# Download the document content
import requests
source_url = "https://www.paulgraham.com/greatwork.html"
print(f"Downloading document: {source_url}")
response = requests.get(source_url)
content = response.text

# Upload the file
print("Uploading file to server...")
file_obj = client.files.create(
    file=("greatwork.html", content.encode("utf-8"), "text/html"),
    purpose="assistants",
)
file_id = file_obj.id
print(f"File uploaded: {file_id}")

# Create or retrieve vector store
print(f"Creating vector store: {vector_store_id}")
try:
    # Try to retrieve an existing vector store
    vector_store = client.vector_stores.retrieve(vector_store_id)
    print(f"Using existing vector store: {vector_store_id}")
    vector_store_id = vector_store.id

    # Add the file to the existing vector store
    client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=file_id,
    )
    print("Added file to vector store")
except Exception as e:
    # Create a new vector store with the file
    print(f"Creating new vector store (error: {e})")
    vector_store = client.vector_stores.create(
        name=vector_store_id,
        file_ids=[file_id],
    )
    vector_store_id = vector_store.id
    print(f"Created new vector store: {vector_store_id}")

agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant. Use the knowledge_search tool to find relevant information in the ingested documents.",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store_id],
        }
    ],
)

prompt = "How do you do great work?"
print("prompt>", prompt)

use_stream = True
response = agent.create_turn(
    messages=[{"role": "user", "content": prompt}],
    session_id=agent.create_session("rag_session"),
    stream=use_stream,
)

# Only call `AgentEventLogger().log(response)` for streaming responses.
if use_stream:
    for log in AgentEventLogger().log(response):
        if hasattr(log, "print"):
            log.print()
        else:
            # Print text chunks inline without newlines
            print(log, end="", flush=True)
    print()  # Final newline at the end
else:
    print(response)

We will use uv to run the script:

uv run --with llama-stack-client,fire,requests demo_script.py

You should see output like the following.

To do great work, consider the following principles:

1. **Follow Your Interests**: Engage in work that genuinely excites you. If you find an area intriguing, pursue it without being overly concerned about external pressures or norms. You should create things that you would want for yourself, as this often aligns with what others in your circle might want too.

2. **Work Hard on Ambitious Projects**: Ambition is vital, but it should be tempered by genuine interest. Instead of detailed planning for the future, focus on exciting projects that keep your options open. This approach, known as "staying upwind," allows for adaptability and can lead to unforeseen achievements.

3. **Choose Quality Colleagues**: Collaborating with talented colleagues can significantly affect your own work. Seek out individuals who offer surprising insights and whom you admire. The presence of good colleagues can elevate the quality of your work and inspire you.

4. **Maintain High Morale**: Your attitude towards work and life affects your performance. Cultivating optimism and viewing yourself as lucky rather than victimized can boost your productivity. It’s essential to care for your physical health as well since it directly impacts your mental faculties and morale.

5. **Be Consistent**: Great work often comes from cumulative effort. Daily progress, even in small amounts, can result in substantial achievements over time. Emphasize consistency and make the work engaging, as this reduces the perceived burden of hard labor.

6. **Embrace Curiosity**: Curiosity is a driving force that can guide you in selecting fields of interest, pushing you to explore uncharted territories. Allow it to shape your work and continually seek knowledge and insights.

By focusing on these aspects, you can create an environment conducive to great work and personal fulfillment.

Congratulations! You've successfully built your first RAG application using Llama Stack! πŸŽ‰πŸ₯³

HuggingFace access

If you are getting a 401 Client Error from HuggingFace for the all-MiniLM-L6-v2 model, try setting HF_TOKEN to a valid HuggingFace token in your environment.

Next Steps

Now you're ready to dive deeper into Llama Stack!