<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The AI Merge]]></title><description><![CDATA[A technical publication for engineers designing, building, and deploying AI systems beyond demos. Delivered weekly.]]></description><link>https://read.theaimerge.com</link><image><url>https://substackcdn.com/image/fetch/$s_!YHcT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1e32bad-63e7-43b3-8e62-80de774828de_1280x1280.png</url><title>The AI Merge</title><link>https://read.theaimerge.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 20:47:48 GMT</lastBuildDate><atom:link href="https://read.theaimerge.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alex Razvant]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[theaimerge@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[theaimerge@substack.com]]></itunes:email><itunes:name><![CDATA[Alex Razvant]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alex Razvant]]></itunes:author><googleplay:owner><![CDATA[theaimerge@substack.com]]></googleplay:owner><googleplay:email><![CDATA[theaimerge@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alex Razvant]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Building a Local AI Task Manager with PydanticAI and Ollama]]></title><description><![CDATA[A practical introduction to agent-based architectures using typed models, tools, and runtime context in Pydantic AI.]]></description><link>https://read.theaimerge.com/p/building-a-local-ai-task-manager</link><guid 
isPermaLink="false">https://read.theaimerge.com/p/building-a-local-ai-task-manager</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sun, 22 Mar 2026 14:03:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9298843b-bbc8-47a2-8ba9-c0db6a180182_6000x3375.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="pullquote"><p><em>Welcome to The AI Merge, where you learn practical, production-ready AI/ML Engineering. Join <strong><a href="https://multimodalai.substack.com/subscribe">9,000+ engineers</a></strong> and build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p></div><p>In <strong>MAVS&nbsp;(Multi-Agent Vision System)</strong>, my upcoming course, part of the architecture is an A2A agent network. The agents coordinate on tasks, decide what needs to happen next, and split execution across the system.</p><p>While building that layer, I looked at a few frameworks: CrewAI, Google ADK, PydanticAI, and LangGraph. LangGraph was the first candidate and a strong fit for complex agent systems, giving you a lot of control over how agents move through a workflow. 
But I felt there were already a ton of examples on building LG Agents, and I wanted to try something new.</p><p><strong>I ended up choosing PydanticAI.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dUan!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dUan!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 424w, https://substackcdn.com/image/fetch/$s_!dUan!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 848w, https://substackcdn.com/image/fetch/$s_!dUan!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 1272w, https://substackcdn.com/image/fetch/$s_!dUan!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dUan!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png" width="1456" height="1820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:549764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/189547957?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dUan!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 424w, https://substackcdn.com/image/fetch/$s_!dUan!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 848w, https://substackcdn.com/image/fetch/$s_!dUan!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 1272w, https://substackcdn.com/image/fetch/$s_!dUan!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0397cbc0-bcf6-4865-b98c-9b309ecba77b_3375x4219.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is not as flexible or as mature as LangGraph, but it felt much closer to the way I already build Python applications. If you know Pydantic or FastAPI, the dev experience is similar: typed models, explicit structure, clear tool definitions, and code that is easy to follow.</p><p>PydanticAI does have real limits. In my experience, wiring native agent tools can feel rigid, and the flow control is more manual than it should be, but it&#8217;s still a good fit. </p><p>As an exercise to get familiar with the course&#8217;s tech stack, I decided to write a series of articles with quick end-to-end examples of actual tools and frameworks I&#8217;ve used in the course.</p><p>In this article, we will start with a small agent-powered app, a Task Manager that runs local models through Ollama and uses PydanticAI for the agent layer. 
The AI handles intents such as adding a task, marking work complete, or listing overdue items.</p><p>I will break the app down piece by piece and show how the PydanticAI concepts map to a real application architecture.</p><div><hr></div><h3>Table of Contents</h3><ul><li><p>The LLM Engine</p></li><li><p>Project Scaffolding</p></li><li><p>Defining Agent Schemas</p></li><li><p>Agents Runtime Context</p></li><li><p>Defining the Task &amp; Report Agents</p></li><li><p>Adding Tools</p></li><li><p>The Agent Execution Flow</p></li><li><p>Demo</p></li></ul><blockquote><p>Please use the navigation bar on the left to go through the article sections.</p></blockquote><div><hr></div><h2>Step 1 - The LLM Engine</h2><p>We&#8217;ll use Ollama to run the local LLMs powering our PydanticAI agents. Ollama is one of the most popular solutions for running LLMs locally, and it&#8217;s straightforward to set up.</p><blockquote><p>tl;dr: Ollama is built in Go as an optimized serving layer on top of LLM models running in llama.cpp. 
You can consider Ollama a wrapper over llama.cpp that abstracts away the setup complexity of a llama.cpp server.</p><p>For a complete guide on how Ollama works, <a href="https://read.theaimerge.com/p/the-complete-guide-to-ollama-local">see this article.</a></p></blockquote><p>For this article, however, we can simply install and use Ollama in three steps:</p><p><strong>Step 1.1 - Install Ollama on your Machine</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">curl -fsSL https://ollama.com/install.sh | sh</code></pre></div><p><strong>Step 1.2 - Start the Ollama Server</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ollama serve</code></pre></div><p><strong>Step 1.3 - Pull Qwen3:4B from the Ollama Registry</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ollama pull qwen3:4b</code></pre></div><div><hr></div><h2>Step 2 - Project Scaffolding</h2><p>At this step, we set up a new Python project, install the required dependencies upfront, and prepare the project structure before implementing the components one by one.</p><p>For that, we&#8217;ll use <em><strong>astral/uv</strong></em>, a fast Python project &amp; dependency manager built in Rust that&#8217;s a drop-in replacement for tools like Conda, Poetry, and pip with requirements.txt. 
</p><blockquote><p>If you&#8217;re working with Python and still use requirements.txt/poetry/conda to manage your dependencies, I strongly recommend porting to uv.</p></blockquote><p><strong>Step 2.0 - Install uv</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">curl -LsSf https://astral.sh/uv/install.sh | sh</code></pre></div><p><strong>Step 2.1 - Creating a new Project</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">uv init --package pyai-starter</code></pre></div><p><strong>Step 2.2 - Installing the Dependencies</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">uv add pydantic httpx pydantic-ai</code></pre></div><p><strong>Step 2.3 - Activating the Environment</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">cd pyai-starter &amp;&amp; source .venv/bin/activate</code></pre></div><div><hr></div><h2>Step 3 - Defining Agent Schemas</h2><p>Before reaching the LLMs &amp; Agents part, we start with the contract schema that will keep the AI layer grounded. </p><p>In our application, the schema does more than validate data: it defines the domain. 
Once we have a clear Task model, every tool and agent can operate against the same data contract.</p><p>This way, the agent works against typed data structures that our application already understands.</p><p><strong>Step 3.1 - Data Contracts for a Task</strong></p><p>The following models define the <code>state</code> of a Task and help the agent interpret what a Task is.</p><ul><li><p><strong>Task</strong> - the core Pydantic model with typed fields (id, title, status, priority, etc.)</p></li><li><p><strong>TaskStatus</strong> - a string enum with Pending | InProgress | Completed | Cancelled.</p></li><li><p><strong>TaskPriority</strong> - a string enum with Low | Medium | High | Urgent.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class TaskStatus(str, Enum):
    """Task completion status."""
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


class TaskPriority(str, Enum):
    """Task priority levels."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"
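
# Note: because both enums subclass str, plain strings coerce straight into
# the domain types (e.g. TaskPriority("urgent") is TaskPriority.URGENT),
# which lets string-typed tool arguments map onto the model without glue code.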


class Task(BaseModel):
    """
    Domain model for a task.
    """
    id: int = Field(description="Unique task identifier")
    title: str = Field(description="Short description of what needs to be done")
    status: TaskStatus = Field(default=TaskStatus.PENDING, description="Current task status")
    priority: TaskPriority = Field(default=TaskPriority.MEDIUM, description="Task priority level")
    due_date: date | None = Field(default=None, description="Optional deadline (YYYY-MM-DD)")
    notes: str | None = Field(default=None, description="Additional details about the task")
    tags: list[str] = Field(default_factory=list, description="Labels assigned to the task")
    created_at: datetime = Field(default_factory=datetime.now, description="When the task was created")</code></pre></div><p>When defining BaseModels the agent will serialize structured output into, one useful tip is to always populate the description fields: they add context that helps the agent understand how to populate each field in its output.</p><blockquote><p>Keep in mind that these descriptions still consume part of the model&#8217;s context window. The impact may not be obvious on short turns, but over long enough conversations, every token counts.</p></blockquote><p><strong>Step 3.2 - The Results Object</strong></p><p>Here, we&#8217;ll define two additional Pydantic models: a TaskReport with overview details across all tasks, and a ReportNarrative that summarizes them. Both help the LLM format its responses as structured output.</p><ul><li><p>TaskReport - the structured summary report of task status, returned by the Task Agent.</p></li><li><p>ReportNarrative - the LLM-authored narrative grounded in the provided task stats.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class TaskReport(BaseModel):
    """Structured summary report of task status."""
    total_tasks: int = Field(description="Total number of tasks")
    pending_count: int = Field(description="Tasks not yet started")
    in_progress_count: int = Field(description="Tasks currently being worked on")
    completed_count: int = Field(description="Tasks that are done")
    urgent_pending: list[str] = Field(description="Titles of urgent tasks not yet completed")
    summary: str = Field(description="Brief 1-2 sentence status summary")
    recommendation: str = Field(description="One clear next action for the user")


class ReportNarrative(BaseModel):
    """LLM-authored narrative grounded in provided task stats."""
    summary: str = Field(description="Brief 1-2 sentence status summary based only on provided stats")
    recommendation: str = Field(description="One clear next action based only on provided stats")</code></pre></div><h2>Step 4 - Agents Runtime Context</h2><p>Agents need access to the runtime state created or modified during a session. This state shouldn&#8217;t live directly in a prompt; it should be available on demand.</p><blockquote><p>The <strong>runtime context</strong> is the <em>extra data we pass into an agent at run time</em> that is not part of the prompt or output schema, yet is still accessible to tools, validators, and logic while the agent runs.</p></blockquote><p>One example that illustrates this idea is how coding agents work. During a session, the internal state is modified multiple times (list_directory, create_file, delete_file). </p><p>The agent is aware of these actions and mentions them in its history - but it won&#8217;t keep the full log loaded in its prompt context, as that might confuse it.</p><p>All these dynamic actions are passed between agent steps as runtime state: a log the agent can query to see what has been modified during its execution.</p><p>In PydanticAI, we can define a custom object that contains the sub-objects we want to track in state, and then use dependency injection to make it available to the agent&#8217;s runtime context.</p><p><strong>Step 4.1 - Defining the TaskRepository</strong></p><p>This is our in-memory task storage, with CRUD operations that modify the state.</p><p>We&#8217;ll have actions such as:</p><ul><li><p><strong>Create</strong> - adding a new task to the repository.</p></li><li><p><strong>Get</strong> - fetching a task by ID.</p></li><li><p><strong>ListAll</strong> - listing all tasks, sorted by priority.</p></li><li><p><strong>ListByStatus</strong> - filtering tasks by status.</p></li><li><p><strong>UpdateStatus</strong> - moving a task to a new status.</p></li><li><p><strong>Delete</strong> - removing a task by ID.</p></li><li><p><strong>Search</strong> - to search tasks by 
title.</p></li><li><p><strong>GetStats</strong> - to aggregate and display a report.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1cd48f41-9297-4b45-8062-c653b64cb94a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class TaskRepository:
    def __init__(self):
        self._tasks: dict[int, Task] = {}
        self._next_id: int = 1
    
    def create(self, title: str, priority: TaskPriority = TaskPriority.MEDIUM,
               due_date: date | None = None, notes: str | None = None,
               tags: list[str] | None = None) -&gt; Task:
        task = Task(
            id=self._next_id,
            title=title,
            priority=priority,
            due_date=due_date,
            notes=notes,
            tags=tags or [],
        )
        self._tasks[task.id] = task
        self._next_id += 1
        return task
    
    def get(self, task_id: int) -&gt; Task | None:
        return self._tasks.get(task_id)
    
    def list_all(self) -&gt; list[Task]:
        priority_order = {TaskPriority.URGENT: 0, TaskPriority.HIGH: 1, 
                         TaskPriority.MEDIUM: 2, TaskPriority.LOW: 3}
        return sorted(
            self._tasks.values(),
            key=lambda t: (priority_order[t.priority], t.created_at)
        )
    
    def list_by_status(self, status: TaskStatus) -&gt; list[Task]:
        return [t for t in self.list_all() if t.status == status]
    
    def update_status(self, task_id: int, status: TaskStatus) -&gt; Task | None:
        task = self._tasks.get(task_id)
        if task:
            task.status = status
        return task
    
    def delete(self, task_id: int) -&gt; bool:
        if task_id in self._tasks:
            del self._tasks[task_id]
            return True
        return False
    
    def search(self, query: str) -&gt; list[Task]:
        query_lower = query.lower()
        return [t for t in self.list_all() if query_lower in t.title.lower()]
    
    def get_stats(self) -&gt; dict:
        all_tasks = self.list_all()
        return {
            "total": len(all_tasks),
            "pending": sum(1 for t in all_tasks if t.status == TaskStatus.PENDING),
            "in_progress": sum(1 for t in all_tasks if t.status == TaskStatus.IN_PROGRESS),
            "completed": sum(1 for t in all_tasks if t.status == TaskStatus.COMPLETED),
            "urgent_pending": [t.title for t in all_tasks 
                              if t.priority == TaskPriority.URGENT 
                              and t.status not in (TaskStatus.COMPLETED, TaskStatus.CANCELLED)],
        }</code></pre></div><p><strong>Step 4.2 - Defining the TaskDeps</strong></p><p>With TaskDeps, we&#8217;ll keep track of the runtime dependencies we want to pass to our agent via dependency injection. This can be a simple dataclass, holding:</p><ul><li><p><strong>TaskRepository</strong> - gives the agent access to add, modify, and delete tasks (or their statuses), and provides the baseline data to generate a report.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@dataclass
class TaskDeps:
    """
    Runtime dependencies for the task agent. Injected into Agent's runtime context.
    """
    task_repo: TaskRepository</code></pre></div><blockquote><p>We&#8217;re using a dataclass and not a BaseModel, since dependencies are injected at runtime and are never serialized by the agent as part of an output schema.</p></blockquote><h2>Step 5 - Defining the Task &amp; Report Agents</h2><p>With the runtime state and the Pydantic models for structured inputs/outputs already defined, we can now create the actual agents.</p><p>For that, we&#8217;ll have to do two things:</p><ol><li><p>Create an OpenAI-API-compatible connection</p></li><li><p>Compose the agents</p></li></ol><p><strong>Step 5.1 - The BuildModel Function</strong></p><p>Since Ollama exposes an OpenAI-compatible API, the setup is as simple as specifying the model provider; during PydanticAI agent instantiation, we&#8217;ll wire up the model tag and provider URL.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">OLLAMA_URL = "http://localhost:11434/v1"
MODEL_NAME = "qwen3:4b"

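# Note (assumption): Ollama ignores API keys, so if your version of the
# OpenAI client or provider insists on one, any placeholder value
# (e.g. api_key="ollama") is enough -- it is never checked locally.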
def build_model() -&gt; OpenAIChatModel:
    return OpenAIChatModel(
        model_name=MODEL_NAME,
        provider=OpenAIProvider(
            base_url=OLLAMA_URL
        ),
    )</code></pre></div><p><strong>Step 5.2 - The Task Agent</strong></p><p>The anatomy of a PydanticAI agent is composed of:</p><ul><li><p><strong>Model</strong> - here we pass the <code>OpenAIChatModel</code> created above.</p></li><li><p><strong>DepsType</strong> - the runtime state model TaskDeps we&#8217;ve defined above.</p></li><li><p><strong>Retries</strong> - a safeguard setting how many times the agent may retry a failed run, such as an output that fails to validate against the Pydantic model.</p></li><li><p><strong>SystemPrompt</strong> - the base instructions for the agent.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">task_agent = Agent[TaskDeps, str](
    model=build_model(),
    deps_type=TaskDeps,
    retries=2,
    system_prompt="""You are a task management assistant. You help users manage their to-do list. You should be concise and to the point, answers should be fast.

Your capabilities:
- Create new tasks with title, priority, due date, and notes
- List tasks (all, pending, or by status)
- Mark tasks as in-progress, completed, or cancelled
- Search tasks by keyword
- Get the current time in any timezone

Guidelines:
- Use the appropriate priority level based on user language (urgent, high, medium, low)
- Format task lists clearly with status indicators

Available tools will let you perform these operations on the user's task list.""",
)</code></pre></div><p><strong>Step 5.3 - The Report Agent</strong></p><p>Same structure as above, changing the base instructions in the system prompt and adding a structured output type.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">report_agent = Agent[TaskDeps, ReportNarrative](
    model=build_model(),
    deps_type=TaskDeps,
    output_type=ReportNarrative,
    retries=1,
    system_prompt="""You write short task status narratives from provided repository stats.

Rules:
- Use only the numbers and task titles provided in the prompt.
- Do not invent tasks, counts, percentages, or priorities.
- Keep the summary to 1-2 sentences.
- Give exactly one actionable recommendation.
- If there are no tasks, say so plainly.
""",
)</code></pre></div><p>Notice that, when defining an agent, we&#8217;ve used square brackets (generic parameters) to specify the dependency and output types the agent should consider when parsing and composing responses.</p><ul><li><p><strong>For the TaskAgent</strong> <em><strong>[TaskDeps, str]</strong></em> - we pass TaskDeps as the dependency type and expect plain str as output. That&#8217;s because this is our main agent entry point, and we&#8217;ll chat with it.</p></li><li><p><strong>For the ReportAgent</strong> <em><strong>[TaskDeps, ReportNarrative]</strong></em> - we use TaskDeps so the agent can access our TaskRepository, but set the <em><strong>ReportNarrative</strong></em> model as the output. That means the ReportAgent will always try to serialize its output following our ReportNarrative Pydantic model.</p></li></ul><h2>Step 6 - Adding Tools</h2><p>At this stage, our agents would hallucinate: they understand the schemas and what they should do, but don&#8217;t yet have the means to accomplish it. That&#8217;s why we need to define a set of tools that let the agents manipulate the Tasks in our TaskRepository.</p><p>In PydanticAI, we can define two types of tools using decorators:</p><ul><li><p>Plain tools <em><strong>(@agent.tool_plain)</strong> - </em>for when the agent doesn&#8217;t need its runtime context. For example, telling the current time.</p></li><li><p>Tools <em><strong>(@agent.tool)</strong> - </em>for when the tool depends on the runtime context. 
For example, moving a Task from the Pending to the Completed state requires access to the TaskRepository from the runtime context.</p></li></ul><p><strong>Step 6.1 - Adding tools to the Task Agent</strong></p><p>We&#8217;ll add a set of 6 tools:</p><ol><li><p><strong>GetCurrentTime</strong> - a plain tool to get the current timestamp, without accessing the runtime context.</p></li><li><p><strong>CreateTask</strong> - adds a new task to our in-memory TaskRepository.</p></li><li><p><strong>ListTasks</strong> - lists tasks in the TaskRepository.</p></li><li><p><strong>UpdateTaskStatus</strong> - moves any task between states.</p></li><li><p><strong>SearchTasks</strong> - searches tasks by title.</p></li><li><p><strong>DeleteTask</strong> - removes a task from the repository.</p></li></ol><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">@task_agent.tool_plain
async def get_current_time(timezone: str = "UTC") -&gt; str:
    """
    Get the current date and time in a specific timezone.
    
    Use this when the user asks about the time, or when you need
    to determine "today" or "tomorrow" for due dates.
    
    Args:
        timezone: IANA timezone name (e.g., "America/New_York", "Europe/London", "Asia/Tokyo")
    
    Returns:
        Current date and time as a formatted string
    """
    try:
        tz = ZoneInfo(timezone)
        now = datetime.now(tz)
        return f"Current time in {timezone}: {now.strftime('%Y-%m-%d %H:%M:%S %Z')} (weekday: {now.strftime('%A')})"
    except Exception:
        return f"Unknown timezone '{timezone}'. Use IANA names like 'America/New_York', 'Europe/London', 'Asia/Tokyo'."

@task_agent.tool
async def create_task(
    ctx: RunContext[TaskDeps],
    title: str,
    priority: Literal["low", "medium", "high", "urgent"] = "medium",
    notes: str | None = None,
) -&gt; str:
    """
    Create a new task. Today's date is provided in the response for reference.
    
    Args:
        title: What needs to be done
        priority: low, medium, high, or urgent
        notes: Optional details
    """
    # Note: this simplified tool always stamps today's date as the due date.
    today = date.today()
    task = ctx.deps.task_repo.create(
        title=title,
        priority=TaskPriority(priority),
        due_date=today,
        notes=notes,
        tags=[],
    )
    
    due_str = f", due {task.due_date}" if task.due_date else ""
    return f"Created task #{task.id}: '{task.title}' [{task.priority.value}]{due_str}. (Today is {today})"


@task_agent.tool
async def list_tasks(
    ctx: RunContext[TaskDeps],
    filter_status: Literal["all", "pending", "in_progress", "completed"] = "pending",
) -&gt; str:
    """
    List tasks from the to-do list.
    
    Use this when the user asks to see their tasks, to-do list, or what needs to be done.
    
    Args:
        filter_status: Which tasks to show
            - "pending": tasks not started (default, most useful)
            - "in_progress": tasks being worked on
            - "completed": finished tasks
            - "all": everything
    
    Returns:
        Formatted list of tasks with status, priority, and due dates
    """
    if filter_status == "all":
        tasks = ctx.deps.task_repo.list_all()
    elif filter_status == "pending":
        tasks = ctx.deps.task_repo.list_by_status(TaskStatus.PENDING)
    elif filter_status == "in_progress":
        tasks = ctx.deps.task_repo.list_by_status(TaskStatus.IN_PROGRESS)
    else:
        tasks = ctx.deps.task_repo.list_by_status(TaskStatus.COMPLETED)
    
    if not tasks:
        if filter_status == "pending":
            return "No pending tasks. You're all caught up!"
        return f"No {filter_status} tasks found."
    
    status_icons = {
        TaskStatus.PENDING: "&#9675;",
        TaskStatus.IN_PROGRESS: "&#9680;", 
        TaskStatus.COMPLETED: "&#10003;",
        TaskStatus.CANCELLED: "&#10007;",
    }
    
    lines = []
    for t in tasks:
        icon = status_icons.get(t.status, "?")
        due = f" (due: {t.due_date})" if t.due_date else ""
        lines.append(f"{icon} #{t.id}: {t.title} [{t.priority.value}]{due}")
    
    return "\n".join(lines)


@task_agent.tool
async def update_task_status(
    ctx: RunContext[TaskDeps],
    task_id: int,
    new_status: Literal["pending", "in_progress", "completed", "cancelled"],
) -&gt; str:
    """
    Update a task's status.
    
    Use this when the user wants to:
    - Start working on a task &#8594; "in_progress"
    - Mark a task as done/finished &#8594; "completed"
    - Cancel or remove a task &#8594; "cancelled"
    - Reset a task &#8594; "pending"
    
    Args:
        task_id: The task number (e.g., 1, 2, 3)
        new_status: The new status to set
    
    Returns:
        Confirmation message or error if task not found
    """
    task = ctx.deps.task_repo.update_status(task_id, TaskStatus(new_status))
    if task:
        return f"Task #{task.id} '{task.title}' is now {task.status.value}."
    return f"Task #{task_id} not found. Use list_tasks to see available tasks."


@task_agent.tool
async def search_tasks(ctx: RunContext[TaskDeps], query: str) -&gt; str:
    """
    Search tasks by keyword in the title.
    
    Use this when the user wants to find a specific task or tasks
    related to a topic.
    
    Args:
        query: Search term to look for in task titles
    
    Returns:
        List of matching tasks or message if none found
    """
    tasks = ctx.deps.task_repo.search(query)
    if not tasks:
        return f"No tasks found matching '{query}'."
    
    lines = [f"Found {len(tasks)} task(s) matching '{query}':"]
    status_icons = {
        TaskStatus.PENDING: "&#9675;",
        TaskStatus.IN_PROGRESS: "&#9680;",
        TaskStatus.COMPLETED: "&#10003;",
        TaskStatus.CANCELLED: "&#10007;",
    }
    for t in tasks:
        status = status_icons.get(t.status, "?")
        lines.append(f"  {status} #{t.id}: {t.title} [{t.priority.value}]")
    return "\n".join(lines)
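
# Optional refactor sketch (not in the original code): list_tasks and
# search_tasks above duplicate the status-icon mapping. A shared helper keeps
# task lines formatted consistently; the TaskRow stand-in below is illustrative.

```python
from typing import NamedTuple

# Shared icon map, mirroring the dicts duplicated in list_tasks and search_tasks.
STATUS_ICONS = {"pending": "○", "in_progress": "◐", "completed": "✓", "cancelled": "✗"}

class TaskRow(NamedTuple):
    id: int
    title: str
    status: str
    priority: str

def format_task_line(t: TaskRow) -> str:
    # One place to change how every listing tool renders a task.
    icon = STATUS_ICONS.get(t.status, "?")
    return f"{icon} #{t.id}: {t.title} [{t.priority}]"

print(format_task_line(TaskRow(1, "Review PR", "pending", "high")))  # → ○ #1: Review PR [high]
```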


@task_agent.tool
async def delete_task(ctx: RunContext[TaskDeps], task_id: int) -&gt; str:
    """
    Permanently delete a task.
    Use this only when the user explicitly wants to remove a task entirely.
    """
    if ctx.deps.task_repo.delete(task_id):
        return f"Task #{task_id} has been deleted."
    return f"Task #{task_id} not found."
</code></pre></div><p>With the tools in place, we can draw an important distinction. </p><p>This pattern is strikingly similar to how MCP standardizes tool access for LLMs. In any agent framework, an &#8220;<em>agent.tool</em>&#8221; decorator registers a Python method as a callable function in the agent&#8217;s execution context. </p><p>With MCP, by contrast, the state is kept inside the MCP server rather than in the agent&#8217;s runtime context. Both approaches expose schema-defined functions that an LLM can choose to invoke, so their interfaces are conceptually similar; the difference is that MCP is gated by a protocol boundary, with agents sending function names and arguments over JSON-RPC.</p><blockquote><p>Find more details on how MCP works in this <a href="https://read.theaimerge.com/p/mcp-is-just-a-fancy-api">MCP is Just A Fancy API</a> article.</p></blockquote><p><strong>Step 6.2 - Adding helpers for the Report Agent</strong></p><p>The scope of this sub-agent is narrow: take the current TaskRepository state and generate a summary report. In a production system, we&#8217;d typically use an intent classifier (a small classifier LLM) to decide from the user&#8217;s message whether a task report is needed.</p><p>Here, we&#8217;ll do that manually, using hardcoded key terms.</p><p>The Report Agent will use these two helpers:</p><ul><li><p><strong>IsReportRequest</strong> - decides whether the Report Agent should be invoked.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">
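# Aside (an illustrative sketch, not part of the app): with @task_agent.tool the
# call stays in-process, while over MCP the same invocation crosses a protocol
# boundary as JSON-RPC. "tools/call" is the MCP method for invoking a named
# tool; the id and argument values below are made up for illustration.

```python
import json

# Roughly the wire shape of an MCP tool invocation (argument values are made up).
rpc_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "create_task",
        "arguments": {"title": "Review PR", "priority": "high"},
    },
}

print(json.dumps(rpc_request, indent=2))
```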
def is_report_request(user_input: str) -&gt; bool:
    normalized = user_input.strip().lower()
    report_terms = (
        "report",
        "summary",
        "summarize",
        "status",
        "overview",
        "stats",
        "statistics",
    )
    return any(term in normalized for term in report_terms)</code></pre></div><ul><li><p><strong>DumpTasksForReport</strong> - a simple serializer that renders the current TaskRepository state as plain text.</p></li></ul><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def dump_tasks_for_report(task_repo: TaskRepository) -&gt; str:
    tasks = task_repo.list_all()
    if not tasks:
        return "Current task repository: no tasks."

    lines = ["Current task repository:"]
    for task in tasks:
        due = task.due_date.isoformat() if task.due_date else "none"
        notes = task.notes or "none"
        tags = ", ".join(task.tags) if task.tags else "none"
        lines.append(
            f"- id={task.id}; title={task.title}; status={task.status.value}; "
            f"priority={task.priority.value}; due_date={due}; tags={tags}; notes={notes}"
        )
    return "\n".join(lines)</code></pre></div><blockquote><p>We could have added this serialization to the TaskRepository BaseModel directly, or used the built-in .model_dump_json(). To keep the boundaries between agents clean, we&#8217;ll serialize outside the Pydantic model.</p></blockquote><h2>Step 7 - The Execution Flow</h2><p>At this point, we&#8217;ve already defined the core building blocks of the app: the task schema, runtime dependency object, model setup, agent definitions, and the tools the main agent can call.</p><p>The last step is wiring everything into an execution flow through a simple Gradio chat interface, where:</p><ul><li><p>The main task_agent handles task operations with streaming</p></li><li><p>The report_agent handles report-style requests</p></li><li><p>The UI decides which path to invoke</p></li></ul><blockquote><p>Gradio is an open-source Python library for rapidly creating customizable, shareable web-based user interfaces (UIs) for AI/ML use cases.</p></blockquote><p>For our project, we&#8217;ll need two functions:</p><ol><li><p>GenerateReport - executed whenever the ReportAgent is invoked.</p></li><li><p>CreateUIApp - where we wire together the agent loop, UI components, and workflows.</p></li></ol><p><strong>Step 7.1 - Add the GenerateReport Function</strong></p><p>Instead of computing a separate stats object and converting it into a report model, we pass the current repository state directly to the report agent.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">  def dump_tasks_for_report(task_repo: TaskRepository) -&gt; str:
      tasks = task_repo.list_all()
      if not tasks:
          return "Current task repository: no tasks."

      lines = ["Current task repository:"]
      for task in tasks:
          due = task.due_date.isoformat() if task.due_date else "none"
          notes = task.notes or "none"
          tags = ", ".join(task.tags) if task.tags else "none"
          lines.append(
              f"- id={task.id}; title={task.title}; status={task.status.value}; "
              f"priority={task.priority.value}; due_date={due}; tags={tags}; notes={notes}"
          )
      return "\n".join(lines)</code></pre></div><p>  What this does:</p><ul><li><p>reads the current in-memory tasks from TaskRepository</p></li><li><p>converts them into a compact plain-text dump</p></li><li><p>gives the report_agent the exact repository contents as prompt context</p></li><li><p>avoids extra report-only helpers and schemas</p></li></ul><p><strong>Step 7.2 - Step by Step Gradio UI (With Token Streaming)</strong></p><p>The main UI is still built inside create_ui_app(). This is where we store app state, define the async chat handler, and connect it to Gradio components.</p><ul><li><p><strong>RunStream</strong> (run_stream) - this is where Pydantic AI executes the agent loop.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async with task_agent.run_stream(
    user_input,
    deps=state["deps"],
    message_history=state["history"],
    usage_limits=UsageLimits(request_limit=3),
) as stream:</code></pre></div><p><em><strong>What happens inside this context manager:</strong></em></p><ul><li><p>The agent sends the user input + message history to the LLM</p></li><li><p>The LLM decides whether to respond directly or call a tool</p></li><li><p>If a tool is called, the agent executes it and loops back to the LLM</p></li><li><p>This continues until the LLM produces a final text response</p></li></ul></li></ul><p>The run_stream method gives us access to this process <strong>as it happens</strong>, rather than waiting for the entire loop to complete.</p><ul><li><p><strong>Streaming Text Tokens to the UI </strong>- we iterate over text chunks as they arrive from the LLM.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">async for message in stream.stream_text(delta=True):
    streamed_text += message
    
    display = ""
    if tool_calls_shown:
        display = "\n".join(tool_calls_shown) + "\n\n"
    display += streamed_text
    
    updated_history = history + [{"role": "assistant", "content": display}]
    yield "", updated_history</code></pre></div><p><em><strong>What&#8217;s happening:</strong></em></p><ul><li><p>stream.stream_text(delta=True) yields incremental text tokens (e.g., &#8220;I&#8221;, &#8220;&#8217;ve&#8221;, &#8220; added&#8221;, &#8220; the&#8221;, &#8220; task&#8221;)</p></li><li><p>We accumulate these into streamed_text</p></li><li><p>Each iteration, we yield the updated history to Gradio</p></li><li><p>Gradio receives the yield and immediately updates the chatbot UI</p></li><li><p>The user sees tokens appear one by one, creating a &#8220;typing&#8221; effect</p></li></ul></li></ul><p><strong>Why delta=True?</strong> Without it, each iteration would return the full text so far. With delta=True, we get only the new characters, which we accumulate ourselves.</p><ul><li><p><strong>Extracting Tool Calls After Completion </strong>- we get the tool_calls from the message history, to show what the Agent did.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">state["history"] = stream.all_messages()

for msg in state["history"]:
    if hasattr(msg, 'parts'):
        for part in msg.parts:
            if isinstance(part, ToolCallPart):
                tool_str = f"&#128295; `{part.tool_name}({part.args})`"
                if tool_str not in tool_calls_shown:
                    tool_calls_shown.append(tool_str)

final_display = ""
if tool_calls_shown:
    final_display = "\n".join(tool_calls_shown) + "\n\n"
final_display += stream.response.text or ""</code></pre></div><p><em><strong>What&#8217;s happening:</strong></em></p><ul><li><p><code>stream.all_messages()</code> returns the complete conversation history, including tool calls and responses</p></li><li><p>We iterate through messages looking for <code>ToolCallPart</code> objects</p></li><li><p>We prepend these to the final display so users can see what tools were invoked.</p></li></ul></li><li><p><strong>Report Requests </strong>- figuring out if the agent should generate a task report.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">if is_report_request(user_input):
    try:
        result = await report_agent.run(
            f"{user_input}\n\n{dump_tasks_for_report(state['deps'].task_repo)}",
            deps=state["deps"],
            usage_limits=UsageLimits(request_limit=2),
        )
        response = result.output if isinstance(result.output, str) else str(result.output)
    except Exception as e:
        response = f"Error generating report: {e}"
    history = history + [{"role": "assistant", "content": response}]
    yield "", history
    return</code></pre></div><p><em><strong>What&#8217;s happening:</strong></em></p><ul><li><p>is_report_request(...) detects report-like prompts such as &#8220;report&#8221; or &#8220;summary&#8221;</p></li><li><p>We dump the current repository into text</p></li><li><p>We append that dump to the report prompt</p></li><li><p>report_agent.run(...) returns a plain text report</p></li><li><p>The response is added to chat history and shown in the UI</p></li></ul></li><li><p><strong>The UI Wiring (Gradio)</strong> - where we define the UI Gradio blocks.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">with gr.Blocks(title="Task List Agent") as app:
    chatbot = gr.Chatbot(height=450, type="messages")
    msg = gr.Textbox(placeholder="Add a task...", label="Message")
    clear = gr.Button("Clear All")
    
    msg.submit(chat_stream, [msg, chatbot], [msg, chatbot])
    clear.click(clear_all, outputs=[chatbot, msg])</code></pre></div><p><em><strong>What&#8217;s happening:</strong></em></p><ul><li><p><code>gr.Chatbot(type=&quot;messages&quot;)</code> expects the new dict format: <code>{&quot;role&quot;: &quot;user/assistant&quot;, &quot;content&quot;: &quot;...&quot;}</code></p></li><li><p><code>msg.submit()</code> connects the textbox to our chat_stream generator</p></li><li><p>Because <code>chat_stream</code> uses yield, Gradio automatically streams updates to the chatbot</p></li></ul></li></ul><p><strong>Step 7.3 - Full UI Code</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def create_ui_app():
    import gradio as gr
    from pydantic_ai.messages import ToolCallPart
    
    # Shared state
    state = {"deps": TaskDeps(task_repo=TaskRepository()), "history": []}
    
    async def chat_stream(user_input: str, history: list):
        if not user_input.strip():
            yield "", history
            return
        
        history = history + [{"role": "user", "content": user_input}]
        yield "", history
        
        if is_report_request(user_input):
            try:
                result = await report_agent.run(
                    f"{user_input}\n\n{dump_tasks_for_report(state['deps'].task_repo)}",
                    deps=state["deps"],
                    usage_limits=UsageLimits(request_limit=2),
                )
                response = result.output if isinstance(result.output, str) else str(result.output)
            except Exception as e:
                response = f"Error generating report: {e}"
            history = history + [{"role": "assistant", "content": response}]
            yield "", history
            return
        
        tool_calls_shown = []
        streamed_text = ""
        
        try:
            async with task_agent.run_stream(
                user_input,
                deps=state["deps"],
                message_history=state["history"],
                usage_limits=UsageLimits(request_limit=3),  # Limit LLM calls for speed
            ) as stream:
                async for message in stream.stream_text(delta=True):
                    streamed_text += message
                    
                    display = ""
                    if tool_calls_shown:
                        display = "\n".join(tool_calls_shown) + "\n\n"
                    display += streamed_text
                    
                    updated_history = history + [{"role": "assistant", "content": display}]
                    yield "", updated_history
                
                state["history"] = stream.all_messages()
                
                for msg in state["history"]:
                    if hasattr(msg, 'parts'):
                        for part in msg.parts:
                            if isinstance(part, ToolCallPart):
                                tool_str = f"&#128295; `{part.tool_name}({part.args})`"
                                if tool_str not in tool_calls_shown:
                                    tool_calls_shown.append(tool_str)
                
                final_display = ""
                if tool_calls_shown:
                    final_display = "\n".join(tool_calls_shown) + "\n\n"
                final_display += stream.response.text or ""
                
                history = history + [{"role": "assistant", "content": final_display}]
                yield "", history
                
        except Exception as e:
            history = history + [{"role": "assistant", "content": f"Error: {e}"}]
            yield "", history
    
    def clear_all():
        state["deps"] = TaskDeps(task_repo=TaskRepository())
        state["history"] = []
        return [], ""
    
    with gr.Blocks(title="Task List Agent") as app:
        gr.Markdown("# PydanticAI Task List Agent\nManage tasks with AI. Type `report` for a summary.")
        
        chatbot = gr.Chatbot(height=450, type="messages")
        msg = gr.Textbox(placeholder="Add a task, list tasks, mark done...", label="Message")
        
        with gr.Row():
            clear = gr.Button("&#128465;&#65039; Clear All")
            gr.Examples(
                examples=["Add a high priority task: Review PR", "List my tasks", "Mark task 1 as done", "report"],
                inputs=msg,
            )
        
        msg.submit(chat_stream, [msg, chatbot], [msg, chatbot])
        clear.click(clear_all, outputs=[chatbot, msg])
    
    return app</code></pre></div><p><strong>Step 7.4 - Gradio Entrypoint</strong></p><p>Here we&#8217;ll wire the main entry point, set the server name, and the port on which the Gradio app will start.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">if __name__ == "__main__":
    app = create_ui_app()
    app.launch(server_name="0.0.0.0", server_port=7860, share=False)</code></pre></div><p>Then, since we&#8217;ve activated the Python virtual environment in Step 2.3, we can run our app using <code>python main.py</code>.</p><h2>Step 8 - Project Demo</h2><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;65c2f275-dbde-4509-8bc8-09a40d5a1ebe&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>Conclusion</h2><p>In this article, we&#8217;ve built a fully working local AI application: a task manager powered by <strong>PydanticAI</strong>, running on a local <strong>Ollama</strong> model, and surfaced through a simple <strong>Gradio</strong> interface.</p><p>In practice, most AI engineers won&#8217;t stick to a single framework forever. Some will use LangGraph, others CrewAI, Mastra, or ADK, and some will build their own loops in plain Python.</p><p>If you&#8217;re coming from a Python/FastAPI background, PydanticAI should feel natural. It&#8217;s the same BaseModels, Pydantic validations, typed data, and explicit structures.</p><p>In the context of the upcoming MAVS course, this is a first step toward understanding the tech stack, frameworks, and tools used to build it, and serves as a short introduction to the key concepts before diving into the full end-to-end system build.</p><p>The framework doesn&#8217;t matter as much as the architecture.</p><p>Hope you enjoyed this article!</p><div><hr></div><p>Images and media were created by the author, unless otherwise stated.</p><div><hr></div><h4><strong>References</strong></h4><p><em>[1] Pydantic AI. (2024). Pydantic.dev. <a href="https://ai.pydantic.dev/">https://ai.pydantic.dev/</a></em></p><p><em>[2] Team, G. (2020). Gradio. Gradio.app. <a href="https://www.gradio.app/">https://www.gradio.app/</a></em></p><p><em>[3] Ollama. (2026). Ollama. <a href="https://ollama.com/">https://ollama.com/</a></em></p><p><em>[4] </em>The AI Merge. (2025, October 25). 
<em>The Complete Guide to Ollama: Local LLM Inference Made Simple (VIDEO)</em>. Theaimerge.com; The AI Merge. <a href="https://read.theaimerge.com/p/the-complete-guide-to-ollama-local">https://read.theaimerge.com/p/the-complete-guide-to-ollama-local</a></p>]]></content:encoded></item><item><title><![CDATA[Win an NVIDIA DGX Spark by joining me for Virtual NVIDIA GTC 2026]]></title><description><![CDATA[How to enter the giveaway - and what to expect if you&#8217;re an AI engineer building on DGX Spark.]]></description><link>https://read.theaimerge.com/p/win-an-nvidia-dgx-spark-by-joining</link><guid isPermaLink="false">https://read.theaimerge.com/p/win-an-nvidia-dgx-spark-by-joining</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 14 Mar 2026 09:30:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fyWR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>In this article, you&#8217;ll learn about:</strong></p><ul><li><p>How the giveaway works (and what counts as an entry)</p></li><li><p>A few Virtual GTC 2026 sessions I find interesting</p></li><li><p>What the DGX Spark is good for when you&#8217;re building AI systems locally</p></li><li><p>What I&#8217;ve been working on with it (and why local iteration still matters)</p></li><li><p>Where to find my DGX Spark unboxing + capabilities breakdown</p></li></ul><p>If you&#8217;ve deployed AI systems, you already know the fastest way to make progress is to shorten the &#8220;idea &#8594; run &#8594; inspect &#8594; iterate&#8221; loop. AI requires a lot of GPU compute, even for basic experiments. </p><p>Cloud is great and gives you the compute, but it costs you every time you want to debug. 
Local compute on the other hand gives you a different kind of control, especially when you&#8217;re iterating on data pipelines, model behavior, and end-to-end latency.</p><p>That&#8217;s the main reason I&#8217;ve enjoyed building on the <strong>NVIDIA DGX Spark</strong> for the past 6 months. It&#8217;s been a very practical machine for AI, learning, prototyping, fine-tuning, and building small-to-mid AI systems without having to rent cloud compute to test small bits of my code.</p><h2>The surprise: I&#8217;m giving away 1&#215; NVIDIA DGX Spark (Europe Only)</h2><p>For NVIDIA GTC 2026 (March 16&#8211;19), I&#8217;m doing a giveaway for my European audience. If you join me for Virtual GTC, you&#8217;ll have a chance to win <strong>a DGX Spark</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fyWR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fyWR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 424w, https://substackcdn.com/image/fetch/$s_!fyWR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 848w, https://substackcdn.com/image/fetch/$s_!fyWR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fyWR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fyWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a04318bf-67ab-4a16-9566-52034083a109_6000x3375.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1624854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/190913410?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fyWR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 424w, https://substackcdn.com/image/fetch/$s_!fyWR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 848w, 
https://substackcdn.com/image/fetch/$s_!fyWR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 1272w, https://substackcdn.com/image/fetch/$s_!fyWR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa04318bf-67ab-4a16-9566-52034083a109_6000x3375.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#127873; Courtesy of NVIDIA.</p></blockquote><p>The Virtual GTC is free. 
And the entry rules are simple (details below).</p><h2>How to enter the DGX Spark giveaway</h2><ol><li><p><strong>Register for GTC 2026</strong> using my link. <strong><a href="https://nvda.ws/4qTY2Bn">https://nvda.ws/4qTY2Bn</a></strong></p></li><li><p><strong>Attend at least 1 virtual session</strong> <em>(Jensen&#8217;s Keynote does not count)</em></p></li><li><p><strong>Be a free subscriber to this newsletter. <a href="https://read.theaimerge.com/subscribe">Subscribe</a></strong></p></li><li><p><strong>Fill out my giveaway form after attending </strong><em><strong><a href="https://docs.google.com/forms/d/e/1FAIpQLScxXCbshzZ4pvPIsyjoPeqeD2TarzhSJeHdJ9ZE1nixP1er3w/viewform?usp=sharing&amp;ouid=117912100996776816274">Google Form</a></strong></em></p><ul><li><p>your <strong>name</strong></p></li><li><p>your <strong>country (Europe only)</strong></p></li><li><p>your <strong>email</strong></p></li><li><p><strong>which session</strong> you attended</p></li><li><p>a <strong>screenshot</strong> showing you attended that session</p></li><li><p>your <strong>favorite takeaway</strong> from that session</p></li></ul></li></ol><blockquote><p>Please read the form now (1min), to learn the extra steps required.</p></blockquote><div><hr></div><h3><strong>Quick FAQ:</strong></h3><ul><li><p><strong>Is Virtual GTC free?</strong> <br>Yes.</p></li><li><p><strong>Does the keynote count?</strong> <br>No. 
To qualify, you need to attend at least <strong>one virtual session</strong> (not the keynote).</p></li><li><p><strong>How do I fill in the form?</strong> <strong><br></strong>You&#8217;ll have to register with my link, attend a virtual session, take a screenshot of your attendance in the session, and fill-in the <em><a href="https://docs.google.com/forms/d/e/1FAIpQLScxXCbshzZ4pvPIsyjoPeqeD2TarzhSJeHdJ9ZE1nixP1er3w/viewform?usp=sharing&amp;ouid=117912100996776816274">Google Form</a>.</em></p></li><li><p><strong>I subscribed after attending a session, is that fine?</strong><br><strong>Yes.</strong> As long as you&#8217;re subscribed by the time you submit the giveaway form.</p></li><li><p><strong>Do I have to be a paid subscriber?</strong><br>Nope, free subscription is enough.</p></li><li><p><strong>Is this Europe Only?</strong> <br>Yes, this giveaway is for my European audience.</p></li><li><p><strong>Do I need to attend a specific session?</strong><br>No, any eligible virtual session counts (as long as it&#8217;s not the keynote).</p></li><li><p><strong>How do you verify subscription?</strong><br>The giveaway form will ask for the email you used to subscribe, so I can confirm eligibility.</p></li></ul><div><hr></div><h2>A few interesting Virtual GTC sessions</h2><p>Here are a few sessions that caught my eye, maybe you&#8217;ll find them interesting as well.</p><ol><li><p>AI Factories in Europe: Building the Foundations for Scalable Intelligence <a href="https://buff.ly/GofA3xa">S81899</a></p></li><li><p>Accelerate AI Through Open-Source Inference <a href="https://buff.ly/gMgq5fs">S81902</a></p></li><li><p>Teach AI to Code in Every Language With NVIDIA NeMo <a href="https://buff.ly/7l3i9si">S82306</a></p></li><li><p>From Data to Meaning: Vision-Language Models Shaping the Cities of Tomorrow <a href="https://buff.ly/gmFQeQs">S81867</a></p></li><li><p>Your Learning Pathway: Get Certified for Career Success <a 
href="https://buff.ly/1aczJHS">C81544</a></p></li></ol><div><hr></div><h2>Behind The Scenes and what I&#8217;ve been building on the Spark</h2><p>In short, I&#8217;ve been using the Spark as my main AI development machine for a good few months. Before it, I had a PC with an RTX 4080 (16GB VRAM) and an M1 Max for building with AI locally.</p><p>The Spark, with 128GB of unified memory, tops that, allowing me to run multiple models and heavy processing workloads without looking at <code>nvtop</code> or GPU load charts.</p><p>Below are the main threads I&#8217;ve been testing and working on with the Spark:</p><ol><li><p><strong>Local Finetuning</strong> - going through <a href="https://build.nvidia.com/spark">the 30+ Spark Playbooks</a> and <a href="https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth">Unsloth Tutorials.</a></p></li><li><p><strong>Local Inference</strong> - with Ollama, LMStudio, and llama.cpp</p></li><li><p><strong>Multi-Agent Systems</strong> - with SLMs (Qwen3.5, GPT-OSS-20B/120B, and NVIDIA Nemotron Nano 30B-A3B)</p></li><li><p><strong>AI Systems</strong> - with Inference, Application, UI &amp; Backend as Docker stacks, closely mirroring setups you&#8217;d see in real deployments.</p></li><li><p><strong>Edge AI</strong> - with Vision Models, Audio, LLMs, and pretty much any multimodal pipelines and workflows.</p></li><li><p><strong>Agentic AI</strong> - MCP, A2A, PydanticAI, LangGraph, LangSmith, Opik.</p></li><li><p><strong>RAG &amp; Multimodal RAG</strong> - a few small projects, mainly on VSS (Video Search and Summarization)</p></li><li><p><strong>Computer Vision</strong> - my old passion, running any workload from Object Detection, Multi-Camera Tracking, Instance Segmentation, etc.</p></li><li><p><strong>Image/Video Generation</strong> - ComfyUI (Spark Playbooks)</p></li></ol><p>Bottom line: you can do a lot with the Spark, from tiny PoC projects up to solid AI Systems that you can build &amp; 
validate, and then scale to real heavy workloads in the cloud.</p><blockquote><p>If you&#8217;re curious what the Spark looks like, its hardware and architecture details, and what it can do, check the post below. &#128071;</p></blockquote><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6b859867-ded0-4d2e-9771-e656cc173146&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Unboxing the NVIDIA DGX Spark: First Impressions&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | Writing The AI Merge | Helping Engineers Build AI Beyond Demos&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98b89ac-97e9-4875-88b6-2a5039668cb2_1700x1700.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-12-20T14:15:48.739Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!7urH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://read.theaimerge.com/p/unboxing-my-nvidia-dgx-spark-first&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:181503991,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:15,&quot;comment_count&quot;:5,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;The AI 
Merge&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!iQW5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1dd942e3-bbf3-4bf7-8fbd-69b07006f323_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Wrap-up</h2><ul><li><p>Register for Virtual GTC with my link</p></li><li><p>Attend at least one virtual session <em>(not the keynote)</em></p></li><li><p>Submit the giveaway form <strong>attached in this article</strong></p></li><li><p>Make sure you&#8217;re a <strong>free subscriber</strong> so I can reach you if you win</p></li></ul><p>I&#8217;ll announce the winner <strong>shortly after GTC ends</strong>. In the meantime, if you want the practical breakdown of what the Spark can do (and what I&#8217;m building on it), the unboxing/capabilities article is linked above.</p><p>Register here: <a href="https://nvda.ws/4qTY2Bn">https://nvda.ws/4qTY2Bn</a></p><p>Good luck! &#129782;<br>Alex</p>]]></content:encoded></item><item><title><![CDATA[Local LLM Inference : llama.cpp, GGUF, Quantizations and GGML Explained]]></title><description><![CDATA[Learn how the llama.cpp runtime, GGML backend concepts, and GGUF model format fit together for fast local inference across devices.]]></description><link>https://read.theaimerge.com/p/an-ai-engineers-guide-to-running</link><guid isPermaLink="false">https://read.theaimerge.com/p/an-ai-engineers-guide-to-running</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Tue, 03 Mar 2026 11:31:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9e4192af-bb65-4f7b-92a2-65fbd9a38a93_5760x3240.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to TheAIMerge. I write about practical, production-ready AI/ML Engineering. 
Join over <strong><a href="https://substack.com/redirect/e9aea8cd-aa38-4b49-af8f-389aa0118bb7?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">8500 engineers</a></strong> and build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>AI is moving closer to where data resides: the edge.</p><p>Currently, the industry seems to be pulling AI in two different directions. One is chasing AGI, a mission for which the big AI research labs are stacking up compute power: xAI&#8217;s Colossus [9] cluster, Meta AI&#8217;s 600k order for Blackwell GPUs, NScale AI&#8217;s [8] supercluster in Europe this year, and the list could go on.</p><p>The second direction, which is more interesting, is <strong>Edge AI</strong>. We&#8217;re seeing it with Meta Ray-Ban, Apple Intelligence, Figure AI Robots, Tesla Bots, or the famous Unitree G1 Robot. 
In the end, AI is going to run at the edge, or at least a big part of what we now need Cloud Compute for will sit on small, compute-efficient edge devices.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZZlP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZZlP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 424w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 848w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 1272w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZZlP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png" width="1024" height="576" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Edge AI market size by region, and growth forecast (2025-2030)&quot;,&quot;title&quot;:&quot;Edge AI market size by region, and growth forecast (2025-2030)&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Edge AI market size by region, and growth forecast (2025-2030)" title="Edge AI market size by region, and growth forecast (2025-2030)" srcset="https://substackcdn.com/image/fetch/$s_!ZZlP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 424w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 848w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 1272w, https://substackcdn.com/image/fetch/$s_!ZZlP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7331af3-02ff-440b-a433-b3fab691b23e_1024x576.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: Analysis and projection of the Edge AI Market over the next 5 years.</figcaption></figure></div><p>That&#8217;s why in this Article, we&#8217;re going to dive into the most popular library that allows developers to run LLMs on small devices, <strong>llama.cpp</strong>. 
I genuinely think AI will end up sitting there next to the camera sensor, the drone controller, or the car&#8217;s embedded chip, making predictions in real time.</p><p>Therefore, we&#8217;ll be unpacking <strong>llama.cpp, GGUF, and GGML</strong>, and seeing how everything connects into a framework for running LLMs efficiently on the Edge.</p><div><hr></div><h2><strong>Table of Contents</strong></h2><ol><li><p>What is GGUF?</p></li><li><p>What Quantizations does GGUF Support?</p></li><li><p>What is GGML? (tl;dr)</p></li><li><p>The Llama.cpp Library Workflow</p></li><li><p><strong>The High-Level Architecture</strong></p></li><li><p>Conclusion</p></li></ol><div><hr></div><h2><strong>1. What is GGUF?</strong></h2><p>GGUF [5] is a model file format that optimizes LLM checkpoints for efficient storage and quick deployment.</p><p>The GGUF format works well with specific LLM Inference Engines, especially those based on `llama.cpp`, since it is native to the framework. Lately, other inference engines have started adding support for it, with <a href="https://substack.com/redirect/7f1134a0-8854-4ca5-9cbb-e1d606b56336?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">vLLM being one example</a> [6]; although support there is still experimental, it marks GGUF&#8217;s growing maturity and adoption.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2IOP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2IOP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2IOP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2IOP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2IOP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2IOP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg" width="1100" height="801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2IOP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!2IOP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2IOP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2IOP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb738e817-0c07-43a7-a180-90b9690a408a_1100x801.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
</line>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: The GGUF binary format structure, representing how model weights are compressed and organized.</figcaption></figure></div><p>To summarize the advantages of the GGUF format:</p><ul><li><p>Smaller file sizes compared to other formats.</p></li><li><p>Faster loading times.</p></li><li><p>Improved cross-platform compatibility.</p></li><li><p>Rich built-in support for different quantization levels.</p></li></ul><p>One of the most important features in GGUF is its quantization setups. These span various settings and configurations, from high-level legacy quants that are applied equally to each weight, up to low-level customizable quants, where specific weights in a layer could be quantized differently or with mixed precision.</p><p>In the following section, I&#8217;ll go through the GGUF quantization types, providing a short description with bullet points for each type, as this is a low-level, advanced concept that an AI Engineer probably won&#8217;t work too much on.</p><blockquote><p>If you want to dive into the nits &amp; bits of quant types, check <a href="https://substack.com/redirect/730fb3e2-976c-4689-8718-a10ecaa8bf3b?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">this Table.</a></p></blockquote><h3><strong>2. What Quantizations does GGUF Support?</strong></h3><p>When working with any GGUF model, you might notice the model name carries a suffix such as Q4, Q5_K, or IQ2_K_S. Each suffix identifies the type and class of quantization the model was compressed and stored with.</p><blockquote><p><strong>Tip: </strong>The next section dives deep into Floating Point precision and maths; feel free to skip it.</p></blockquote><p>Let&#8217;s take the <em>`microsoft/Phi-3-mini-4k-instruct`</em> model as an example. 
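</p><p><em>Before unpacking its variants, here is a small, standard-library Python sketch (my own illustration, not part of llama.cpp) that decomposes a value into its FP16 bit fields, the same 1 sign / 5 exponent / 10 mantissa split described below:</em></p>

```python
import struct

def fp16_fields(x: float):
    """Decompose a value into its IEEE 754 half-precision bit fields."""
    # "=e" packs x as a 16-bit float; "=H" reads the raw bit pattern back.
    (h,) = struct.unpack("=H", struct.pack("=e", x))
    sign = h >> 15             # 1 bit for the sign (+/-)
    exponent = (h >> 10) % 32  # 5 bits, biased by 15
    mantissa = h % 1024        # 10 bits; the implicit leading 1 is the "+1"
    return sign, exponent, mantissa

print(fp16_fields(1.0))   # (0, 15, 0)
print(fp16_fields(-2.0))  # (1, 16, 0)
```

<p>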
It comes in 2 variants, an FP16 non-quant, and a Q4 GGUF quant.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lnaD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lnaD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lnaD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg" width="1100" height="224" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:224,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lnaD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lnaD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01c32d12-c1e5-4799-adbd-a05d972d9291_1100x224.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 3: A screenshot from HuggingFace of the Files &amp; Versions tab for Phi3-mini-4k-instruct-gguf model.</figcaption></figure></div><p>That means that for FP16, each `weight` is represented in 16 bits:</p><ul><li><p>1 bit for Sign (+/-).</p></li><li><p>5 bits for the <a 
href="https://substack.com/redirect/74560f8d-0f4e-419e-86a6-661d01130938?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">Exponent</a>.</p></li><li><p>10 + 1 <a href="https://substack.com/redirect/3a357095-6420-41a3-a482-b22c55c7a55a?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">Mantissa</a> bits.</p></li></ul><p>Whereas for Q4, each `weight` is represented in 4 bits, but with a trick:</p><ul><li><p>All 4 bits store an integer code (no separate sign/exponent/mantissa bits); we&#8217;ll call this <em>QuantizedWeight</em>.</p></li><li><p>Instead, a shared scale is stored per block of weights; let&#8217;s call it <em>SharedScale</em>.</p></li><li><p>The <strong>real value</strong> of each weight is reconstructed during inference as</p></li></ul><pre><code><code>Weight = SharedScale * QuantizedWeight</code></code></pre><p><em>But first, why does GGUF have so many quant variations?</em></p><p>The main reason is that some layers within the model&#8217;s architecture, such as those for EMB (embeddings), are <em>very sensitive to precision loss</em>, whereas other layers, such as some FFN (feed-forward) ones, can be <em>heavily quantized with minimal effect on outputs.</em></p><blockquote><p><strong>Tip: </strong>The key goal here is to customize and squeeze the most quality out of the model, while keeping its memory and compute footprint as low as possible.</p></blockquote><p>Let&#8217;s take the `<strong>Q4_K_S</strong>` variant as an example, and unpack what that means:</p><ul><li><p><strong>Q</strong> stands for Quantization.</p></li><li><p>The digit <strong>4</strong> stands for the number of bits the weights are quantized in.</p></li><li><p><strong>K</strong> shows that the layers of the model were quantized in a block-wise structure. 
The layer weights are split into blocks, and each block is quantized using a different factor for that block.</p></li><li><p><strong>S/M/L</strong> is a block size, and it helps determine the tradeoff between accuracy and compression. Larger blocks = more compression = lower accuracy.</p></li></ul><p>Although there are many configurations, we can generally split the quants into 3 distinct groups: Legacy, K-Quants, and I-Quants, ordered from oldest to newest.</p><p><strong>Legacy Quants </strong>(<code>Q4_0</code>, <code>Q4_1</code>, <code>Q8_0</code>)</p><ul><li><p>These use block quantization: the layer&#8217;s weights are divided into fixed-size blocks.</p></li><li><p>They store one (<code>_0</code>) or two (<code>_1</code>) extra constants per block to scale weight values back from their quantized form.</p></li><li><p>Fast and straightforward, but less precise than modern methods.</p></li></ul><p><strong>K-Quants </strong>(<code>Q3_K_S</code>, <code>Q5_K_M</code>)</p><ul><li><p>A <strong>K</strong> quant uses block-wise quantization, with weights split into blocks and scaled with per-block factors.</p></li><li><p>These also support mixed quantization, where some layers could be compressed more in order to give more bits to critical layers.</p></li></ul><p><strong>I-Quants </strong>(<code>IQ2_XXS</code>, <code>IQ3_S</code>)</p><ul><li><p>In simple terms, an <strong>I</strong> quant builds on top of a <strong>K</strong> quant and introduces an importance matrix with lookup tables to preserve critical weights.</p></li><li><p>The importance matrix identifies which weights matter most for the model&#8217;s output quality, so those weights are quantized less, whereas the &#8220;less important&#8221; weights are quantized more.</p></li><li><p>Best for memory-limited environments with sufficient compute power.</p></li></ul><p>A large majority of GGUF models are quantized with K Quants or I Quants, especially the larger models, which 
makes sense as they require more optimizations to reduce the memory footprint.</p><p>At this point, we already have an understanding of what GGUF is and how LLM checkpoints are quantized and stored in the GGUF format. To complete the core picture, we need to further understand how <strong>llama.cpp</strong> works as a framework: loading, parsing, and interpreting the GGUF model checkpoint.</p><p>Let&#8217;s get into the interesting details!</p><div><hr></div><h2><strong>3. What is GGML? (tl;dr)</strong></h2><blockquote><p><strong>tl;dr: <a href="https://substack.com/redirect/3ec4bced-a0e7-499c-9591-a4dff042b24f?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">ggml</a> </strong>[1] is a machine learning (ML) library written in C and C++ with a focus on Transformer inference. ggml is similar to ML libraries such as PyTorch and TensorFlow.</p></blockquote><p>GGML [1] is the actual backbone, the core library that implements and executes all the tensor operations and provides the optimizations necessary for high-performance computation, as can be seen in the following sequence:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HqmY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HqmY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 424w, https://substackcdn.com/image/fetch/$s_!HqmY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 848w, 
https://substackcdn.com/image/fetch/$s_!HqmY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 1272w, https://substackcdn.com/image/fetch/$s_!HqmY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HqmY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png" width="1100" height="77" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f80aa347-6545-418d-a453-6577e535357c_1100x77.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:77,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HqmY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 424w, https://substackcdn.com/image/fetch/$s_!HqmY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 848w, 
https://substackcdn.com/image/fetch/$s_!HqmY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 1272w, https://substackcdn.com/image/fetch/$s_!HqmY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80aa347-6545-418d-a453-6577e535357c_1100x77.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 4: The scope of GGML within llama.cpp is represented in yellow.</figcaption></figure></div><p>An equally important detail to note about GGML is its <strong>extensibility and portability.</strong><br>As it&#8217;s written in C and C++, GGML (and llama.cpp) can be compiled for virtually any hardware platform, from x86_64 (AMD/Intel) to ARM64 (M1, M2, M3), and other architectures.</p><p>The primary hardware target is the CPU, as GGML supports SIMD (Single Instruction Multiple Data) on newer CPUs, which in short means running the same instruction over multiple data elements at once within a single core (data-level parallelism, distinct from multi-core threading).</p><blockquote><p><strong>Note: SIMD</strong> will take the same instruction, and run it across multiple data slices at the same time.</p></blockquote><p>On the GPU side, ggml supports accelerators such as <code>ggml-cuda</code> for NVIDIA GPUs and <code>ggml-metal</code> for Apple&#8217;s hardware. 
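</p><p><em>To make the number-crunching concrete, here is a toy, standard-library Python sketch of the per-block quantize/dequantize cycle these kernels accelerate (my own simplification, not ggml code; real Q4 formats pack two 4-bit codes per byte and store block scales in reduced precision):</em></p>

```python
def quantize_block(block, bits=4):
    """Toy symmetric block quantization: one shared scale per block."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for signed 4-bit codes
    scale = max(abs(w) for w in block) / qmax or 1.0
    codes = [round(w / scale) for w in block]  # small signed integer codes
    return scale, codes

def dequantize_block(scale, codes):
    """Reconstruct weights: Weight = SharedScale * QuantizedWeight."""
    return [scale * q for q in codes]

weights = [0.12, -0.70, 0.35, 0.01]
scale, codes = quantize_block(weights)
approx = dequantize_block(scale, codes)  # close to weights, within scale/2
```

<p>A real backend vectorizes this scale-and-multiply pattern across many blocks at once, which is exactly where SIMD on the CPU and the CUDA/Metal kernels on the GPU earn their keep.</p><p>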
These accelerators enable llama.cpp to execute the GGML computation graph on GPU hardware by translating the operations into the GPU&#8217;s own &#8220;<em>language</em>&#8221;, which for CUDA is <strong>PTX</strong> and for Apple is <strong>Metal</strong>.</p><p>With those in mind, we can conclude the following about GGML:</p><ol><li><p>It can be compiled for multiple architectures (arm64, x64, x86).</p></li><li><p>It&#8217;s an ML library, similar to PyTorch or TensorFlow.</p></li><li><p>It runs fast on CPUs, but also supports GPU accelerators.</p></li><li><p>It&#8217;s the core of llama.cpp, executing all the tensor operations.</p></li></ol><p>So far, we&#8217;ve learned that GGUF is the format for storing the model weights, and GGML is the tensor library that parses, loads, and executes the model.</p><p>The third and last missing piece is llama.cpp itself and how everything fits together. We&#8217;ll cover that next.</p><div><hr></div><h2><strong>4. The Llama.cpp Library Workflow</strong></h2><p>First off, I have to say: the llama.cpp codebase is <strong>quite large</strong>. 
For the writing and diagrams in this section, I analyzed parts of the llama.cpp codebase, official discussion threads, and closed/open PRs to get a feel for the core components powering it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k2uV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k2uV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 424w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 848w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k2uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg" width="1100" height="441" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!k2uV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 424w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 848w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!k2uV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c3083cd-0b0c-42a8-881b-94877aba6999_1100x441.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: The distribution of commits on the llama.cpp codebase over the 2023-2025 period. Each candle represents one week.</figcaption></figure></div><p>We&#8217;re not aiming to fully grasp every detail; <em><strong>&#8220;we only care about how it works.&#8221;</strong></em></p><p>To get there, we&#8217;ll cover just four components at a high level that make the entire workflow click.</p><ol><li><p>Model Loading &#8594; parsing GGUF, loading config, architecture, etc.</p></li><li><p>Populating Weights &#8594; how weights are loaded into each tensor of the graph.</p></li><li><p>KV Cache Handling &#8594; how the KV cache is handled during inference.</p></li><li><p>Token Sampling &#8594; how a token gets selected from the distribution.</p></li></ol><p>Let&#8217;s go through them in order.</p><h4><strong>1. 
GGML Model Loading</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EEvq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EEvq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 424w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 848w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 1272w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EEvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png" width="1100" height="793" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!EEvq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 424w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 848w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 1272w, https://substackcdn.com/image/fetch/$s_!EEvq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaa87e14-27d4-4abe-b9af-05f128b4201a_1100x793.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6: An annotated image of the llamacpp/src/llama.cpp implementation, showing the GGUF model loading entrypoint.</figcaption></figure></div><p>After determining the model architecture type, which is usually inferred from the GGUF metadata fields, the next step is to define the model&#8217;s execution graph; once that is in place, we can load the weights for each tensor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tzho!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Tzho!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 424w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 848w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 1272w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tzho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png" width="1100" height="976" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:976,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!Tzho!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 424w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 848w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 1272w, https://substackcdn.com/image/fetch/$s_!Tzho!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb02a4c2a-65b3-46d5-bb92-47bdd48f41bf_1100x976.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7: An annotated image of the llamacpp/src/llama-graph.h header file, where the utilities for defining the Model Graph are defined.</figcaption></figure></div><h4><strong>2. GGML Populating Weights</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NL0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NL0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NL0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg" width="1100" height="710" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:710,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NL0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 424w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 848w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!NL0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e616b2b-a850-4c0b-948d-f2331c79f416_1100x710.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8: An annotated image of the llamacpp/src/llama-model-loader.cpp implementation, walking through the logic of loading tensor weights for a given layer.</figcaption></figure></div><p>For each layer in the model graph, we load the weights from the GGUF file using <code>mmap</code>, or memory mapping.</p><blockquote><p><strong>Def: </strong>mmap is <em>a POSIX-compliant Unix system call that maps files or devices into memory</em>. It is a method of memory-mapped file I/O.</p></blockquote><p>This approach allows the model to access weight data without fully loading it into RAM, enabling efficient, on-demand retrieval from disk.</p><div><hr></div><h4><strong>3. 
GGML KV Cache Handling</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8iLF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8iLF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 424w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 848w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 1272w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8iLF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png" width="1100" height="485" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47036df4-8014-4b05-a641-109470b4989e_1100x485.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:485,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8iLF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 424w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 848w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 1272w, https://substackcdn.com/image/fetch/$s_!8iLF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47036df4-8014-4b05-a641-109470b4989e_1100x485.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9: Annotated image of the llama-kv-cache.h definition, showcasing the functions used to read from and write to the KV cache.</figcaption></figure></div><h4><strong>4. 
GGML Token Sampling</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CeRA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CeRA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 424w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 848w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 1272w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CeRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png" width="1100" height="613" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CeRA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 424w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 848w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 1272w, https://substackcdn.com/image/fetch/$s_!CeRA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F662359f8-d43a-406a-a48c-5e10502bf5da_1100x613.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10: Annotated image of llama-sampling.cpp, showing the last step of a token generation iteration: sampling a token from the vocabulary using the popular sampling methods (TopK, TopP, Temperature).</figcaption></figure></div><div><hr></div><h2><strong>5. The High-Level Architecture</strong></h2><p>The succession of steps in <strong>llama.cpp [2]</strong> is to load the model architecture, define the execution graph, and populate the weights for each tensor in the graph. 
From there, the model is ready for inference, and the last two steps described above come into play each time the LLM generates a new token.</p><blockquote><p><strong>Tip:</strong> For a complete understanding of how LLMs generate tokens, <strong><a href="https://substack.com/redirect/702f35b3-2bb2-4501-8baa-5890a078d364?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">read this</a> [10]</strong>.</p></blockquote><p>Now, after going through all the core steps and their order, we can finally represent everything as a complete workflow in the sequence diagram below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CVD5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CVD5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 424w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 848w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 1272w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CVD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png" width="1100" height="691" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:691,&quot;width&quot;:1100,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CVD5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 424w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 848w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 1272w, https://substackcdn.com/image/fetch/$s_!CVD5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F488ef8c3-9ddb-42b5-9c46-ddd68cb77eaa_1100x691.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To solidify our understanding one last time, here is the end-to-end workflow of how llama.cpp, GGML, and GGUF come together:</p><ol><li><p>The GGUF model is loaded by llama.cpp.</p></li><li><p>The GGUF binary file is parsed, and GGML allocates placeholders for tensors.</p></li><li><p>Then the execution graph of the LLM is constructed.</p></li><li><p>Next, we load the weights for each tensor in the graph.</p></li><li><p>At this point, the model is <strong>ready for inference</strong>.</p></li><li><p>The user sends an input prompt.</p></li><li><p>The prompt is tokenized, and the token_ids are passed through the model.</p></li><li><p>The model graph is executed with the initial tokens.</p></li><li><p>The KV Cache state for the initial
iteration is saved.</p></li><li><p>A new token is sampled from the probability distribution over the vocabulary.</p></li><li><p>The newly sampled token is detokenized and streamed to the client.</p></li><li><p>The updated sequence (initial + new token) is passed through the graph loop.</p></li><li><p>After finishing the generation, we release the memory buffers.</p></li><li><p>The user gets the final output text.</p></li></ol><div><hr></div><h2><strong>6. Conclusion</strong></h2><p>In this article, we&#8217;ve covered the most popular stack for running LLMs on resource-limited hardware, such as CPUs and edge devices. Multiple options allow you to deploy models locally, and all of them are powered by <strong>llama.cpp</strong>; the best-known example is Ollama, which has over 150k stars.</p><p>That connects with the upcoming article, where we&#8217;ll cover <strong>Ollama</strong> with more hands-on coding and practical tips you can apply to your own projects and systems.</p><p>We&#8217;ve covered the core components of llama.cpp: GGUF and GGML, with GGUF being an actively adopted format also supported by other frameworks such as vLLM.</p><p>For AI engineers, learning about Edge AI is one of the most strategic moves you can make today. The field shows signs (open-source releases, smaller models, etc.) of increasingly moving towards processing data closer to where it&#8217;s generated.</p><p>Understanding how to optimize models for local and edge deployment will position you strongly for the coming years. This may not happen in the next year, but investing in this knowledge now will give you a significant upper hand.</p><p>Hope you enjoyed this article!</p><div><hr></div><p><em><strong>Images and media were created by the author, unless otherwise stated.</strong></em></p><div><hr></div><h4><strong>References</strong></h4><p><em>[1] Introduction to ggml</em>. (2025, February 24). Huggingface.co. 
<a href="https://substack.com/redirect/f02196ee-cd99-46a8-b309-94cbf1691e6a?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://huggingface.co/blog/introduction-to-ggml</a></p><p>[2] <em>ggml-org/llama.cpp: LLM inference in C/C++</em>. (2025). GitHub. <a href="https://substack.com/redirect/5a03ee86-ed5f-4a78-8880-52d29737d9ee?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://github.com/ggml-org/llama.cpp</a></p><p>&#8204;[3] <em>LLM Visualization</em>. (2025). Bbycroft.net. <a href="https://substack.com/redirect/9ca3ec8d-5f90-43f9-8a19-0b0f25ababc2?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://bbycroft.net/llm</a></p><p>&#8204;[4] <em>Edge AI Market Size, Share &amp; Growth | Industry Report, 2030</em>. (2024). Grandviewresearch.com. <a href="https://substack.com/redirect/2e0f0de2-f417-447a-87fc-eaf543d2e115?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://www.grandviewresearch.com/industry-analysis/edge-ai-market-report</a></p><p>&#8204;[5] <em>GGUF</em>. (2025). Huggingface.co. <a href="https://substack.com/redirect/ace8e94d-874c-41d0-9f19-9bbff9ec95f5?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://huggingface.co/docs/hub/en/gguf</a></p><p>[6] <em>GGUF - vLLM</em>. (2025). Vllm.ai. <a href="https://substack.com/redirect/7f1134a0-8854-4ca5-9cbb-e1d606b56336?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://docs.vllm.ai/en/stable/features/quantization/gguf.html</a></p><p>&#8204;[7] <em>Nscale Contracts Approximately 200,000 NVIDIA GB300 GPUs with Microsoft to Deliver NVIDIA AI Infrastructure Across Europe and the U.S. | Press Release | Nscale</em>. (2025). Nscale.com. <a href="https://substack.com/redirect/c6c63819-ae22-4f55-9aff-a5cbe2d6ab71?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://www.nscale.com/press-releases/nscale-microsoft-2025</a></p><p>&#8204;[8] <em>Colossus | xAI</em>. 
(2024). X.ai. <a href="https://substack.com/redirect/90f6f955-debb-4be9-8e62-32b2807bcdaf?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://x.ai/colossus</a></p><p>&#8204;[9] <em>Helix: A Vision-Language-Action Model for Generalist Humanoid Control</em>. (2025, February 20). FigureAI. <a href="https://substack.com/redirect/8a065e48-5ffb-4f2c-9ae2-b021e2e5c406?j=eyJ1IjoiMW90ZGRnIn0.HKJMFtOiePuvPM5kD7mJ1rGHBeRyR0QKnoPnRGvcT8E">https://www.figure.ai/news/helix</a></p><p>&#8204;[10] Razvant, A. (2025, February 20). <em>Understanding LLM Inference</em>. <a href="https://read.theaimerge.com/p/understanding-llm-inference">https://read.theaimerge.com/p/understanding-llm-inference</a></p>]]></content:encoded></item><item><title><![CDATA[The Engineer’s Guide to AI-Assisted Productivity]]></title><description><![CDATA[Six habits that keep you fast and maintainable. Cursor Rules, Claude Code Skills, Agent Plugins, PR rules, non-nitpicky reviews, hooks, and Daily memory dump with Obsidian.]]></description><link>https://read.theaimerge.com/p/the-engineers-guide-to-ai-assisted</link><guid isPermaLink="false">https://read.theaimerge.com/p/the-engineers-guide-to-ai-assisted</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Thu, 12 Feb 2026 12:01:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80773619-7143-4ce8-8248-89c1eb440e2d_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There are a few things most engineers can probably agree on:</p><ul><li><p>Code is cheap nowadays. Design and planning are not.</p></li><li><p>AI without a harness is &#8220;stupid&#8221; and dangerous.</p></li><li><p>Hype is for marketing. 
Engineering ships.</p></li></ul><p>And in this article, I&#8217;ll try to provide efficient safeguards for each of these points.</p><h2>The Problem of <em>&#8220;Developer Productivity&#8221;</em></h2><p>Across blogs and social media channels, there&#8217;s a strange obsession with output measured in volume. New libraries built in a day, <a href="https://arstechnica.com/ai/2026/02/sixteen-claude-ai-agents-working-together-created-a-new-c-compiler/">AI &#8220;building&#8221; a Compiler</a>, <a href="https://github.com/openclaw/openclaw">OpenClaw</a> hype, and CEOs claiming they don&#8217;t need software engineers anymore because everything is written by AI nowadays.</p><p>If you&#8217;ve seen claims like these, you&#8217;re not alone. A large chunk of them are either marketing, hype, or people making optimistic claims too soon.</p><p>At the core of this obsession sits a flawed idea: <strong>Lines of Code.</strong></p><p>AI can spill out 1,000 lines of functionality that could have been 200 lines of smartly planned, well-designed code. Both will work initially, but the gap will grow with time and at scale. More code means more surface area. More edge cases. More friction when requirements change.</p><blockquote><p>Software is not written to be done &#8220;once&#8221;. It&#8217;s designed to evolve; that&#8217;s why we plan roadmaps, allocate budgets, and expect systems to change long after their first release.</p></blockquote><p>This is where the real work has always been. Writing code is not the difficult part; designing, planning, and building always was.</p><h2>AI and Lines of Code</h2><p>It&#8217;s foolish to think that in a real codebase, the quality of the code is about &#8220;how many lines you can push&#8221;, yet there are people boasting they push 10k lines to prod every day. Lines of Code (LoC) have never been a meaningful productivity metric. 
At best, they&#8217;re irrelevant.</p><p>Linus Torvalds (creator and lead developer of the Linux Kernel) <a href="https://youtu.be/mfv0V1SxbNA?si=5Yfm2k2E0f1eydb-&amp;t=2215">once put it bluntly:</a></p><blockquote><p><em>&#8220;Measuring productivity in LoC is just incompetence. Anybody who thinks that&#8217;s a valid metric, is too stupid to work at a tech-company&#8221;.</em></p></blockquote><p><em>Harsh? </em>Yes. But it&#8217;s also true: good engineering is about reducing complexity, and I think a lot of engineers out there share that view.</p><p>As a concrete example, let&#8217;s take this PR.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h1oB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h1oB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 424w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 848w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!h1oB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png" width="1456" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:978334,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/185531461?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h1oB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 424w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 848w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!h1oB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f76f9df-dec6-48d6-897e-6773aefb38c8_2716x1526.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: A 13k-LoC PR raised on the <a href="https://github.com/ocaml/ocaml/pull/14369">ocaml</a> repository. AI should never write all the code; an engineer should be able to explain everything that happens in a PR.</figcaption></figure></div><p>The author, openly admitting he &#8220;<em>didn&#8217;t write a single line of code&#8221;,</em> dumped a 13k-LoC, fully AI-generated pull request, expecting the code owners to review, approve, and merge it into the codebase.</p><blockquote><p>That&#8217;s wrong and a bad practice that you should avoid. 
This applies whether you contribute to open source or work on a shared codebase as part of a team.</p></blockquote><p>As an engineer, <strong>the important metric to follow is &#8220;ship changes that are consistent and safe to release&#8221;</strong>. If you can do it fast on top of that, props to you.</p><p>Everyone uses AI to write code at this point - it&#8217;d be a lie to say otherwise. But if you don&#8217;t set constraints, AI will happily produce slop code that&#8217;s off-pattern and weirdly expensive to maintain.</p><p><strong>So I&#8217;m writing this article with that in mind.</strong></p><div><hr></div><h2>Tips, Routines and Advice</h2><p>In this article, I&#8217;ll share the routines and practices that most consistently increase my output as an engineer when using AI - across tooling, workflow, and review.</p><p>If you&#8217;re not already applying these, you can adopt them quickly - especially if you&#8217;re moving fast in a real codebase or collaborating in a larger team.</p><p>I&#8217;ll break it down into six parts:</p><ol><li><p><em>How I get the best out of the Cursor IDE</em></p></li><li><p><em>How I use CLI agents (Codex, Claude Code)</em></p></li><li><p><em>What I&#8217;ve learned about how to treat PRs</em></p></li><li><p><em>Code Review without Nitpicking</em></p></li><li><p><em>Using Squash Commits and pre-commit Hooks</em></p></li><li><p><em>My end-of-day &#8220;agent memory dump&#8221; workflow with Obsidian <strong>(big win)</strong></em></p></li></ol><div><hr></div><h3>What &#8220;using AI for coding&#8221; means in my day-to-day</h3><p>I don&#8217;t treat AI as &#8220;the engineer.&#8221; </p><p>Yes, it can generate code quite quickly, but I still don&#8217;t fully trust the outputs and have to check and iterate to make sure the code does what it&#8217;s intended to do. 
The goal is <strong>the right code, in the right shape, that will keep working as the system evolves</strong>.</p><p>In practice, I like to stay in control of the workflow, especially when juggling shared codebases and multiple moving parts.</p><p>I&#8217;ve never measured it precisely, but based on day-to-day observation, I tend to keep roughly a <strong>60/40 split</strong> between code AI writes and the code I write.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xc9m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xc9m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 424w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 848w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Xc9m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png" width="1456" height="850" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:144183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/185531461?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xc9m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 424w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 848w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!Xc9m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40d6ce25-2ae9-41c6-898b-9c023a28807b_1946x1136.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2. My Cursor usage dashboard for the past month. The rate of changes I accept is far lower than the number of generated lines that end up committed to the codebase.</figcaption></figure></div><p>For example, here&#8217;s what my Cursor analytics looked like over the past month.</p><p>It shows thousands of lines and tokens consumed, but the ratio of changes accepted into my branches is below 50%.</p><blockquote><p>A good rule of thumb is to treat AI-generated code as a &#8220;proposed&#8221; solution, not a source of truth. 
Don&#8217;t accept changes blindly; understand them first.</p></blockquote><p>AI helps me with design and planning, and writes a lot of the low &#8594; mid-low complexity code, which is simpler and more or less repetitive.</p><p>All that while I analyse, guide, or even write the hard-path logic myself. This aligns with the &#8220;plan first&#8221; advice you&#8217;ll see in most serious guidance around AI coding assistants: generate a plan in read-only mode, then execute it step by step.</p><div><hr></div><h2>Tip #1 - Using Cursor And Cursor Rules Standards</h2><p>Before switching to Cursor, I mostly used VS Code with the Copilot Pro plugin. In my current team, we mostly use Cursor Enterprise and the Claude Code CLI agent.</p><p>In this section, I&#8217;ll describe a few best practices that work for me when setting up the Cursor IDE with repo-specific rules.</p><p><strong>What are Cursor Rules?</strong></p><p>Think of Cursor Rules as targeted sections of a big system prompt for your Agent. 
</p><p>These are among the highest-leverage tools for working with AI in a real codebase, as they help the coding agent scope each iteration, mapping the list of TODOs it plans out to your actual project requirements.</p><p>Cursor supports both <code>.md</code> and <code>.mdc</code> files, but I recommend <code>.mdc</code><strong> (Markdown Components) </strong>because it supports frontmatter. </p><blockquote><p>Frontmatter is a block of YAML, JSON, or TOML metadata at the very top of a Markdown file, enclosed by triple-dashed lines (<code>---</code>).</p></blockquote><p>For instance, in MDC files, you can specify the path glob <em><strong>**/*.py</strong></em> to target and apply a rule only to Python files, or filter by folder/subfolder using <em><strong>services/*/*.py</strong></em> to apply a rule to all Python files under services.</p><p>Here is a 3-step plan to prepare your Cursor rules:</p><ol><li><p><strong>Keep rules small, or split them into multiple files.</strong></p></li><li><p><strong>Add a rule only after observed failures.</strong></p></li><li><p><strong>Treat rules as guardrails</strong> (scope with globs per language, folder, or files).</p></li></ol><p>Here&#8217;s a minimal example I&#8217;m using for my personal projects with Python codebases:</p><pre><code><code># .cursor/rules/base-python.mdc
---
description: Specific rules for handling scaffolding of new Python packages.
globs: src/**/*.py
---

You are an expert in Python, FastAPI, and scalable API development. The Python components you are building will communicate with other services, particularly Inference Engines.

## Always
- Prefer iteration and modularization over code duplication.
- Follow existing patterns in the codebase. Do not invent new styles or structures.
- If unsure about design or placement, move to Cursor Plan mode and ask before implementing.
- Use descriptive variable names with auxiliary verbs (e.g., is_active, has_permission).
- Don't write summaries, improvement plans, or refactoring plans as Markdown files unless asked.

## Formatting &amp; Style
- Follow PEP 8.
- Assume Ruff formatting by running `uv run ruff format`

## Imports
- Prefer absolute imports; use the isort plugin to keep imports sorted.
- Avoid circular dependencies.

## Types &amp; Interfaces
- Use type hints on all function signatures, use the `typing` module.
- Always use Pydantic BaseModels over dataclasses.

## Logging &amp; Errors
- Use structured logging (no prints).
- Log required context (e.g. correlation_id) in API calls.
- Always log the error when catching a specific exception; never use a bare `except`.

## Testing
- New behavior must include a test.
- Place tests in the `tests/unit` folder of the Python package.</code></code></pre><p><strong>Rules should be short, targeted, and enforceable</strong>. Keep the .mdc file light, with one rule per line; even if your coding agent has up to 200k tokens of context, you don&#8217;t want to clutter or confuse it with complex rules that span multiple lines or instructions.</p><blockquote><p>Places where you can find Cursor Rules:</p><ol><li><p><a href="https://github.com/PatrickJS/awesome-cursorrules/tree/main/rules">Awesome Cursor Rules</a></p></li><li><p><a href="https://cursor.directory/rules/fastapi">Cursor Directory</a></p></li></ol><p>Tip: Don&#8217;t copy-paste them directly from these sources; customize them based on your project&#8217;s needs. Keep the rules targeted and actionable.</p></blockquote><div><hr></div><h2>Tip #2 - Agent Skills (Claude/Codex)</h2><p>Apart from Cursor, my team also uses Claude Code, and a few colleagues use Codex on their personal subscription plans.</p><p>In this section, I&#8217;ll compare Claude Code with Cursor - how to set them up and what differs between the two - while also providing a few examples of how you can structure Agent Skills for Claude.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PywV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PywV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!PywV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PywV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PywV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PywV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg" width="1400" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PywV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!PywV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 848w, https://substackcdn.com/image/fetch/$s_!PywV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!PywV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F896e5fe3-9161-4612-afc4-921e68d5c1f4_1400x788.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. The Claude Skills Architecture, showing how the Agent interacts with the Skills on the File System. (<a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview">From Anthropic Skills Overview</a>)</figcaption></figure></div><blockquote><p><em>There are two approaches for teaching agents framework knowledge.</em></p></blockquote><p><strong>First</strong>, configuring Agent context at the project root level (AGENTS.md/CLAUDE.md).<br><strong>Second</strong>, configuring Agent context using modular domain knowledge (Skills).</p><h4>AGENTS.md/CLAUDE.md</h4><p><code>AGENTS.md</code>, targeted at general agents and also supported by Cursor, is a markdown file in your project root that provides persistent context to coding agents. Whatever you put in <code>AGENTS.md</code> is available to the agent on every turn, without the agent needing to decide to load it. Claude Code uses <code>CLAUDE.md</code> for the same purpose.</p><h4>Agent SKILLS</h4><p>Skills, on the other hand, are an open standard for packaging domain knowledge that coding agents can invoke and use. A skill bundles prompts, tools, scripts and documentation that can be invoked on demand, as the Agent sees fit.</p><blockquote><p>Think of Agent Skills similarly to how an MCP Server works; the concept is much the same.</p></blockquote><h3>Understanding CLAUDE.md vs SKILLS vs COMMANDS</h3><ol><li><p><strong>CLAUDE.md</strong></p><p>This is the simplest one: a set of files that get treated as the default prompt for Claude Code, loaded at the beginning of every conversation.</p></li><li><p><strong>Agent Skills</strong></p><p>Skills are better-structured CLAUDE.md files. They are invoked by Claude automatically when relevant (when the agent decides to) or invoked manually by the user using <code>/&lt;skill&gt;</code>. 
Compared to CLAUDE.md, Skills are more token-efficient, as they don&#8217;t clutter the Agent&#8217;s context with every new session.</p></li><li><p><strong>Slash Commands</strong> <a href="https://x.com/trq212/status/2014836841846132761">(In Jan 2026, they were merged with Skills)</a></p><p>Before the merge, these <strong>were</strong> similar to Skills in that they package instructions separately, and they could also be invoked by Claude when needed or manually by the user.</p><p>In a sense, the difference I see between a Command and a Skill is that a Slash Command is &#8220;intended&#8221; to be invoked manually by the user, whereas an Agent Skill should be invoked by the Agent on demand (not manually).</p></li><li><p><strong>Agent Plugins</strong></p><p>Plugins can be considered an interface to package skills, slash commands, agents, hooks, and MCP servers together. At any point in time, a plugin doesn&#8217;t have to use all of them. The same content can be distributed as a standalone Skill, but the Plugin format makes it easy to install.</p></li></ol><p>Let&#8217;s continue with an example of a Skill I&#8217;m using on my current project. The scope of this skill is to manage the API layer and database integration, with FastAPI and PyMongo.</p><h4>The Structure</h4><p>The standard structure of a Skill is referenced in Anthropic&#8217;s official <a href="https://github.com/anthropics/skills/tree/main/skills/mcp-builder">Skills repository</a>. 
Each skill has its own folder, with a SKILL.md file at the root.</p><p>Additionally, a Skill folder can contain helpers and references, such as scripts that can be executed or small, targeted &#8220;CLAUDE.md&#8221;-style files the Skill can look up for reference.</p><p>For instance, here&#8217;s my FastAPI-Backend Claude Skill:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!39ul!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!39ul!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 424w, https://substackcdn.com/image/fetch/$s_!39ul!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 848w, https://substackcdn.com/image/fetch/$s_!39ul!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 1272w, https://substackcdn.com/image/fetch/$s_!39ul!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!39ul!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png" width="1456" height="415" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:415,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/185531461?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!39ul!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 424w, https://substackcdn.com/image/fetch/$s_!39ul!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 848w, https://substackcdn.com/image/fetch/$s_!39ul!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 1272w, https://substackcdn.com/image/fetch/$s_!39ul!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fd8f077-5074-49e8-ad0b-1f4a81693a33_1704x486.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>The skill name <em><strong>&#8216;api-developer&#8217;</strong> </em>under the <strong>.claude/skills</strong> folder.</p></li><li><p>A <em><strong>&#8216;references&#8217;</strong> </em>folder that contains individual, reference context or &#8220;prompts&#8221; for the Agent to look-up.</p></li><li><p>A <em><strong>&#8216;scripts&#8217; </strong></em>folder where I attach a &#8216;health_check.sh&#8217; script the Agent could run, if it decides to.</p></li><li><p>The &#8216;SKILL.md&#8217; that describes the actual <em><strong>&#8216;api-developer&#8217;</strong></em> agent skill</p></li></ol><p>I won&#8217;t add the contents of each file, but I&#8217;ll unpack the core structure of how this Skill is composed, how to tie in references and scripts and its overall instructions and metadata.</p><h4>The SKILL.md</h4><pre><code>---
name: fastapi-mongo
description: Build async FastAPI APIs using MongoDB and Pydantic v2.
metadata:
  domain: backend
  role: specialist
  scope: implementation
  output-format: code
---

# FastAPI + Mongo Specialist

You are a senior Python backend engineer focused on **FastAPI**, **Pydantic v2**, and **MongoDB**.

You build clean, async, production-ready REST APIs with strong validation and automatic OpenAPI docs.

## When to Use
- Creating REST APIs with FastAPI
- Defining request/response schemas with Pydantic v2
- Performing async MongoDB CRUD operations
- Structuring small to medium backend services

## Tech Stack
- Python 3.11+
- FastAPI
- Pydantic v2
- MongoDB (Motor / async PyMongo)

## Core Workflow
1. Define Pydantic models for input/output
2. Implement async FastAPI endpoints
3. Perform MongoDB operations using async drivers
4. Return proper HTTP status codes and responses

To verify the API is up (e.g. before running tests or debugging), run `scripts/health_check.sh` from the skill directory (optionally set `BASE_URL`).


## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Pydantic V2 | `references/pydantic-v2.md` | Creating schemas, validation, model_config |
| Mongo | `references/database.md` | Async database, models, CRUD operations |
| Endpoints | `references/endpoints-routing.md` | APIRouter, dependencies, routing |
| Testing | `references/testing.md` | pytest-asyncio, httpx, fixtures |
| Health check | `scripts/health_check.sh` | Verify API is up before tests or manual checks |

## Constraints

### Must Do
- Use async/await for all I/O
- Use Pydantic v2 syntax (`model_config`, `field_validator`)
- Use type hints everywhere
- Use FastAPI dependency injection where appropriate
- Validate all incoming data with Pydantic
- Return JSON-serializable responses

### Must Not Do
- Use synchronous MongoDB drivers
- Use Pydantic v1 syntax
- Hardcode configuration values
- Skip schema validation
- Mix sync and async code

## Output Expectations

When implementing features, provide:

1. **Pydantic models**
2. **FastAPI router with endpoints**
3. **MongoDB CRUD helpers**
4. **Minimal explanation**

## Knowledge Scope

FastAPI routing, dependencies, response models,  
Pydantic v2 validation and serialization,  
Async MongoDB CRUD patterns,  
OpenAPI / Swagger auto-documentation
</code></pre><h4>Key Section #1 - Reference Guide</h4><p>Since I have a <em><strong>&#8216;references&#8217;</strong></em> folder and I want the Agent to look up specific instructions for each topic, I tell it where to find the other Markdown files it should consult, in the References section.</p><p>Upon execution, if the Agent is working on building a new Router or FastAPI endpoint, it&#8217;ll scan the <em><strong>&#8216;references/endpoints-routing.md&#8217; </strong></em>file to load additional context.</p><p>For instance, here&#8217;s the Pydantic-V2 reference file the Agent could use, which basically contains a few hard references and code examples of using Pydantic BaseModels and built-in validators.</p><p>Nothing complex; just hardening the Agent to generate code that follows the pattern.</p><pre><code># Pydantic V2 Schemas

## Schema Patterns

```python
from pydantic import BaseModel, EmailStr, Field, field_validator, model_validator
from typing import Self

class UserCreate(BaseModel):
    email: EmailStr
    password: str = Field(min_length=8)
    username: str = Field(min_length=3, max_length=50)
    age: int = Field(ge=18, le=120)

    @field_validator('password')
    @classmethod
    def validate_password(cls, v: str) -&gt; str:
        if not any(c.isupper() for c in v):
            raise ValueError('Password must contain uppercase')
        if not any(c.isdigit() for c in v):
            raise ValueError('Password must contain digit')
        return v

    @field_validator('username')
    @classmethod
    def validate_username(cls, v: str) -&gt; str:
        if not v.isalnum():
            raise ValueError('Username must be alphanumeric')
        return v.lower()

class UserUpdate(BaseModel):
    email: EmailStr | None = None
    username: str | None = Field(None, min_length=3, max_length=50)
```
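## Usage Sketch: Field Validators

A minimal, self-contained usage sketch; the `Signup` model here is hypothetical and mirrors the username rule above:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class Signup(BaseModel):
    username: str = Field(min_length=3, max_length=50)

    @field_validator('username')
    @classmethod
    def validate_username(cls, v: str) -> str:
        # Reject non-alphanumeric names, then normalize to lowercase
        if not v.isalnum():
            raise ValueError('Username must be alphanumeric')
        return v.lower()

print(Signup(username="JohnDoe").model_dump())  # {'username': 'johndoe'}

try:
    Signup(username="john doe")  # contains a space, so it is rejected
except ValidationError:
    print("invalid username")
```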
## Model Validator

```python
class OrderItem(BaseModel):
    price: float
    quantity: int

class OrderCreate(BaseModel):
    items: list[OrderItem]
    discount_code: str | None = None
    total: float

    @model_validator(mode='after')
    def validate_order(self) -&gt; Self:
        calculated = sum(item.price * item.quantity for item in self.items)
        if abs(self.total - calculated) &gt; 0.01:
            raise ValueError('Total does not match items')
        return self
```
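## Usage Sketch: Model Validator

A self-contained sketch of the total check; `Item` and `Order` are hypothetical stand-ins for `OrderItem` and `OrderCreate`:

```python
from pydantic import BaseModel, ValidationError, model_validator

class Item(BaseModel):
    price: float
    quantity: int

class Order(BaseModel):
    items: list[Item]
    total: float

    @model_validator(mode='after')
    def validate_order(self) -> 'Order':
        # Cross-field check: the declared total must match the item sum
        calculated = sum(i.price * i.quantity for i in self.items)
        if abs(self.total - calculated) > 0.01:
            raise ValueError('Total does not match items')
        return self

Order(items=[Item(price=2.0, quantity=3)], total=6.0)  # passes

try:
    Order(items=[Item(price=2.0, quantity=3)], total=10.0)
except ValidationError:
    print("total mismatch rejected")
```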
## Serialization Control

```python
class User(BaseModel):
    model_config = {
        "from_attributes": True,
        "json_schema_extra": {
            "example": {"email": "user@example.com", "username": "johndoe"}
        }
    }

    id: int
    email: EmailStr
    password: str = Field(exclude=True)  # Never serialize
    internal_id: str = Field(repr=False)  # Hide from repr
```

## Settings (Pydantic V2)

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        case_sensitive=True,
    )

    DATABASE_URL: str
    SECRET_KEY: str
    DEBUG: bool = False
    CORS_ORIGINS: list[str] = ["http://localhost:3000"]
    API_V1_PREFIX: str = "/api/v1"

settings = Settings()
```

## Quick Reference

| V1 Syntax | V2 Syntax |
|-----------|-----------|
| `@validator` | `@field_validator` |
| `@root_validator` | `@model_validator` |
| `class Config` | `model_config = {}` |
| `orm_mode = True` | `from_attributes = True` |
| `Optional[X]` | `X \| None` |
| `.dict()` | `.model_dump()` |
| `.parse_obj()` | `.model_validate()` |</code></pre><h4>Key Section #2 - Harness (DOs and DONTs)</h4><p>These are helpful in keeping the Agent from repeating the same mistakes, and in enforcing rules. Usually, under this section you&#8217;d put the pathways an Agent must follow, after you&#8217;ve noticed it drift during a coding session.</p><h4>Key Section #3 - Scope</h4><p>This is an additional small but helpful section, where you can add more details that&#8217;ll help Claude Code decide whether this skill is the appropriate one to use.</p><p>There are a lot of Claude Code Skill examples out there, from ones that target programming, to drawing design sketches or planning trips. I&#8217;ll attach a few programming-focused sources where you can find and adopt Claude Skills.</p><p>Resources for Claude Skills:</p><ol><li><p><a href="https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview">Awesome Claude Code Skills (5.5k Stars)</a></p></li><li><p><a href="https://github.com/hesreallyhim/awesome-claude-code?tab=readme-ov-file#agent-skills-">Travisvn/awesome-claude-skills (23k Stars)</a></p></li></ol><blockquote><p>My recommendation is to customize a Skill for your codebase. </p><p>Avoid integrating Skills via copy/paste, and don&#8217;t install untrusted Plugins.</p></blockquote><div><hr></div><h2>Tip #3 - How to treat PRs</h2><p>The philosophy here is simple: include enough detail in a PR without cluttering or bloating it. 
For each feature, fix, or bug I work on, I usually follow these 3 steps before raising a PR:</p><ol><li><p>Attaching the Ticket</p></li><li><p>Attaching the 1-pager Design Doc link (for major changes)</p></li><li><p>Attaching a short one-to-two-paragraph description of what the PR does.</p></li></ol><p>In my team, we follow a standard of keeping PRs light on detail, covering only the core paths one needs to know in order to review the code changes.</p><blockquote><p>I would lie if I said I didn&#8217;t have PRs with 1000+ lines of code. </p><p>That slowed us down a lot: my output speed dropped as I addressed multiple rounds of comments, and the reviewer&#8217;s time suffered, since nobody can take in and reason about how so many changes will impact the codebase.</p></blockquote><p>If there&#8217;s too much detail to cover, I double down on block or sequence diagrams that outline how the change impacts the overall behaviour, put these diagrams in a 1-2 page doc, and attach the shared link in the PR.</p><p>When a team member is doing code review, they&#8217;ll usually follow these steps:</p><ol><li><p>Read the short description of the PR</p></li><li><p>Go to the design doc for details and context</p></li><li><p>Review the code</p></li></ol><p>We also use GitHub Copilot for Code Review, with common rules set at the GitHub Organization level, to flag major issues only and specifically avoid nitpicks.</p><h4>A PR Body Example</h4><pre><code><code>// PR Title: feat(PRJ-002): Parallelize embedding generation through worker pools

## Track
Ticket: PRJ-002
Design (optional): &lt;link-to-1-page-doc&gt;

## Description
This PR implements the queue payload and parallelizes the vector embedding fill through a pool of workers. The configuration parameters are defined in configs/worker.yml.

During the API call for generating embeddings (v1/embed), the workers will spawn as background tasks, compute embeddings, log statuses per `worker_id` and basic telemetry data.

## Notes
For testing, use `make test-local` with the UV environment activated, or `make test-docker` to spawn a local docker container with mock data.</code></code></pre><p>Although there are cases where PRs turn into small discussion threads, this approach optimized the time spent reviewing and explaining design decisions.</p><p>The base idea is: provide the minimal required context (description, diagrams, decisions) upfront, so that the reviewer gets a good enough picture of what they&#8217;re supposed to review and look at.</p><h4>Three more important aspects</h4><ol><li><p>Keep your PRs light; target one fix or one feature. <strong>Don&#8217;t flex in Lines of Code.</strong></p></li><li><p>I add proper CI stages in GitHub Actions workflows (lint, test and code-coverage).</p></li><li><p>In a PR, if Copilot flags an easy fix, I raise an issue and assign it to Copilot.</p></li></ol><div><hr></div><h2>Tip #4 - Code Review without Nitpicking</h2><p>I won&#8217;t sugarcoat this.</p><p>Early in my career, I reviewed PRs like a grumpy linter.</p><p>I used to flag every optimization I could find around performance or code structure, or ask for more documentation. I used to think that having seen more systems made me an authority on keeping things clean and right.</p><p>What I actually did was slow the team down.</p><p>The good thing: I learned from it.<br>The bad one: I learned a bit late.</p><p>A good review is a lever. It nudges the team toward better outcomes <strong>without blocking delivery</strong>.</p><blockquote><p>If you&#8217;re at a point in your career where you can improve this, do it now. It compounds.</p></blockquote><p>Focus on what will cause real pain later: correctness, security, reliability, maintainability. Save style and &#8220;nice-to-haves&#8221; for follow-ups (or a separate PR). 
I could summarize the plan I&#8217;m following at the moment into three parts:</p><ol><li><p><strong>Identify the issue</strong> (what and where)</p></li><li><p><strong>Provide a solution</strong> (a concrete alternative)</p></li><li><p><strong>Explain impact and offer to help</strong> (why it matters downstream)</p></li></ol><h4>A few examples of DOs and DONTs</h4><ol><li><p><strong>Dataclass introduced when repo uses Pydantic</strong></p><ol><li><p>&#10060; Don&#8217;t: &#8220;Use Pydantic, as dataclasses aren&#8217;t used in our code&#8221;</p></li><li><p>&#9989; Do: &#8220;The &lt;name&gt; is a dataclass and it uses custom validators under tools/validations.py. Let&#8217;s try to make it a BaseModel and benefit from the model_validator hook in Pydantic. That will keep the model + validations in the same file, which is easier to navigate and reduces LoC&#8221;</p></li></ol></li><li><p><strong>Using hardcoded values or magic numbers</strong></p><ol><li><p>&#10060; Don&#8217;t: &#8220;Please remove the hardcoded ttl_backfill_sec = 10&#8221;.</p></li><li><p>&#9989; Do: &#8220;I think it&#8217;ll be better if we surface ttl_backfill_sec into the application config. Leaving it as is would imply a new release + deployment, compared to injecting it into ENV and restarting the existing service. A second option, in case it won&#8217;t change often, is to move it to common/constants.py, since we might want to re-use it in multiple places.&#8221;</p></li></ol></li><li><p><strong>PII sent to the LLM </strong></p><ul><li><p>&#10060; Don&#8217;t: &#8220;Be careful with sensitive data.&#8221;</p></li><li><p>&#9989; Do: &#8220;This call will hit the MCP and return user data (emails/phone numbers). 
Please consider adding a PII redaction step before calling the model; we could use a Pydantic BaseModel with a redact_pii() method that obfuscates the fields before the LLM call&#8221;</p></li></ul></li></ol><p>You should always focus on critical parts that will turn out badly downstream, and leave improvements, styling and guides for later. Always try to provide a solution to support your comment, be it code, a short snippet of documentation or design documents.</p><div><hr></div><h2>Tip #5 - Using Pre-commit + Squash Merges</h2><p>I use pre-commit locally and enforce the same checks in CI. Although some checks repeat, this gives me consistent code quality and fast feedback before I even raise my PR.</p><p>Here&#8217;s a real example.</p><p>On one codebase, we use <strong>Protobuf</strong> to define API contracts shared across multiple services. Some services are in Go, and any time I add a new RPC or deprecate a field, the .pb.go files have to be regenerated.</p><p>The problem I encountered: I&#8217;d update the <code>.proto</code> files, forget to run the generator, and end up with an annoying drift where the .proto files are correct but the .pb.go files are outdated.</p><p>CI would catch it, but only after I pushed and waited for a GitHub Action to fail.</p><p>Adding a <strong>pre-commit hook</strong> for proto generation caught it before reaching CI, saving a wasted CI run on GitHub.</p><p>Basic <code>.pre-commit-config.yaml</code> <a href="https://github.com/pre-commit/pre-commit-hooks">example</a>:</p><pre><code><code>repos:
  - repo: https://github.com/psf/black
    rev: 24.10.0
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9
    hooks:
      - id: ruff
        args: [--fix]
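  # Hypothetical local hook that regenerates Go protobuf stubs whenever a
  # .proto file changes; adjust `entry` to your codebase's generator command.
  - repo: local
    hooks:
      - id: proto-gen
        name: regenerate protobuf stubs
        entry: make proto-gen
        language: system
        files: \.proto$
        pass_filenames: false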
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace</code></code></pre><h4>One more important aspect</h4><p>When pre-commit reformats files and updates generated code, you often end up with a few &#8220;noise commits&#8221; like:</p><ul><li><p>&#8220;fix lint&#8221;</p></li><li><p>&#8220;regenerate protos&#8221;</p></li><li><p>&#8220;format, wip&#8221;</p></li></ul><p>Using <strong>squash merges</strong> keeps your main branch clean by turning that whole PR into one readable commit with a clear change narrative.</p><blockquote><p>A Squash commit will register as a single merge commit on your target branch (e.g. main). </p><p>Say you have 30 commits made during your work on a feature - you don&#8217;t want all that history added to your main work tree. Squash all the PR commits into a single, clear and readable commit that summarizes what the fix or feature PR does.</p></blockquote><p>In my case, I don&#8217;t dump the whole commit log into the squash message box. I write one clean message that describes the PR, similar to a <em>&#8220;manifest.json&#8221;.</em></p><p>&#10060;  Bad example:</p><pre><code><code>// By default, Git will paste all the PR commit history in the message box.
* feat: add queue payload + worker entrypoint
* feat: add vector store upsert interface
* feat: add metrics + correlation_id propagation</code></code></pre><p>&#9989;  Better example:</p><pre><code>This feature implements queuing of embedding_generation tasks, and assigns them to a pool of workers. 

Added `metrics.py` to monitor worker progress, emit logs, and keep track of the request's correlation_id.
</code></pre><blockquote><p>You could use AI to generate and/or summarize the commit history of your PR into a compact message that you add as the Squash Commit message.</p></blockquote><div><hr></div><h2>Tip #6 - EoD Memory and Context Dump</h2><p><strong>I&#8217;m giving a golden tip here.</strong></p><p>I work on multiple codebases at the same time. AI enables me to do that.</p><p>Everything runs smoothly as long as I&#8217;m inside deep focus windows. In a 1&#8211;2 hour sprint, I can code nonstop, iterate on multiple features and fixes, and steer multiple agent threads in parallel (Cursor + CLI).</p><p>However, I&#8217;ve noticed that once I take a break, I need a buffer to get back to the usual routine of understanding where I left off, what I was working on and what the next steps were. 
</p><p><strong>That &#8220;context reload&#8221; is manageable during the work day.</strong> </p><p>But I&#8217;ve found it extremely difficult to pick up and continue work the next day, as it usually takes time to regain the entire context from the day before.</p><p>So, I have a trick up my sleeve: <strong>an end-of-day memory and context dump.</strong></p><h4>My Routine</h4><p>Whenever I&#8217;m ready to call it a day, I ask every agent session I touched (across Cursor or the CLI) to produce a short, structured summary, essentially a &#8220;handoff note&#8221; to my tomorrow-self.</p><p>I store each summary locally as:</p><ul><li><p><strong>&#8220;memories/[dd-mm-yyyy]-[agent-scope].md&#8221;.</strong></p></li><li><p>Max ~1 Page</p></li><li><p>Bullet points only</p></li><li><p>No rich descriptions or narrative</p></li></ul><p>Let&#8217;s go through a real example. I might have:</p><ul><li><p>Agent session: Building FastAPI endpoints</p></li><li><p>Agent session: Infra stack + Docker Compose profiles</p></li><li><p>Agent session: Integrating Triton inference server endpoints</p></li><li><p>Agent session: Telemetry during inference</p></li></ul><p>The prompt I&#8217;m using to summarize agent sessions is a custom version of Claude&#8217;s <code>/compact</code> <a href="https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_claude_codes_compact_prompt/">prompt</a>:</p><pre><code>Your task is to create a detailed summary of the conversation so far, paying close attention to the user's explicit requests and your previous actions.
This summary should be thorough in capturing architectural decisions that would be essential for continuing development work without losing context.

Before providing your final summary, wrap your analysis in &lt;analysis&gt; tags to organize your thoughts and ensure you've covered all necessary points. In your analysis process:

1. Chronologically analyze each message and section of the conversation. For each section thoroughly identify:
   - The user's explicit requests and intents
   - Key decisions, technical concepts and code patterns
   - Specific details like file names, function signatures.

2. Double-check for technical accuracy and completeness, addressing each required element thoroughly.

Your summary should include the following sections:

1. Primary Request and Intent: Capture all of the user's explicit requests and intents in detail
2. Key Technical Concepts: List all important technical concepts, technologies
3. Files and Code Sections: Enumerate specific files and code sections examined, modified, or created. Pay special attention to the most recent messages and include full code snippets where applicable and include a summary of why this file read or edit is important.
4. Problem Solving: Document problems solved and any ongoing troubleshooting efforts.
5. Pending Tasks: Outline any pending tasks that you have explicitly been asked to work on.
6. Current Work: Describe in detail precisely what was being worked on immediately before this summary request.
7. If there is a next step, include direct quotes from the most recent conversation showing exactly what task you were working on and where you left off.

Here's an example of how your output should be structured:

&lt;example&gt;
&lt;analysis&gt;
[Your thought process, ensuring all points are covered thoroughly and accurately]
&lt;/analysis&gt;

&lt;summary&gt;
1. Primary Request and Intent:
   [Detailed description]

2. Key Technical Concepts:
   - [Concept 1]
   - [Concept 2]
   - [...]

3. Files and Code Sections:
   - [File Name 1]
      - [Summary of why this file is important]
      - [Summary of the changes made to this file, if any]
      - [Important Code Snippet]
   - [File Name 2]
      - [Important Code Snippet]
   - [...]

4. Problem Solving:
   [Description of solved problems and ongoing troubleshooting]

5. Pending Tasks:
   - [Task 1]
   - [Task 2]
   - [...]

6. Current Work:
   [Precise description of current work]

&lt;/summary&gt;
&lt;/example&gt;

This should help me gain a fresh perspective on the progress and work being done.
Please provide your summary based on the conversation so far, following this structure and ensuring precision and thoroughness in your response. 
</code></pre><p>Then, I read each of these one-pagers, and move to my Obsidian board, create a note for the next day, and manually write the things I consider important.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qn3K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qn3K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 424w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 848w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 1272w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qn3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png" width="456" height="424" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39349,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/185531461?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qn3K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 424w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 848w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 1272w, https://substackcdn.com/image/fetch/$s_!qn3K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1162db54-df26-4ea0-99da-66a6ea021a67_456x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Each note contains the &#8220;context refresh&#8221; for the upcoming day. I treat each one as a &#8220;ticket&#8221; for tomorrow&#8217;s work.</figcaption></figure></div><p>That way, the next day I&#8217;ll have a minimal, fresh view of everything done yesterday. It helps me with:</p><ul><li><p>The Daily Standup, giving a status overview of what I&#8217;ve done</p></li><li><p>Design Meetings, explaining decisions</p></li><li><p>Composing the PR details</p></li><li><p>A quick refresh to continue from where I&#8217;ve left off</p></li></ul><p>Now I have a clear image of what I need to do, and if I need more info, I go back to the specific Agent memory I saved locally in the memories folder, re-read the richer context, and start working.</p><p><strong>That&#8217;s a golden nugget! 
Helps a lot!</strong></p><div><hr></div><h2><strong>Closing Thoughts</strong></h2><p><strong>If you&#8217;re using Cursor:</strong></p><ul><li><p>Customize the rules for your own codebase.</p></li><li><p>Don&#8217;t copy/paste rules from somewhere without checking them first.</p></li><li><p>Keep each <code>rules/&lt;rule_name&gt;</code> file light, under 500 LoC.</p></li><li><p>Keep rules to one per line.</p></li><li><p>Favor using the Plan mode in Cursor first, before the Agent mode.</p></li><li><p>Switch from Agent Mode to Ask mode whenever asking questions.</p></li><li><p>When your Agent&#8217;s context is close to full, opt for a memory dump to save the most important details from that session.</p></li></ul><p><strong>If you&#8217;re using Claude Code or Codex:</strong></p><ul><li><p>Try using CLAUDE.md first, before Skills.</p></li><li><p>When building a Skill, keep it light and provide small, actionable instructions.</p></li><li><p>When creating a Skill, follow the official Anthropic template:</p><ul><li><p><code>&lt;skill_name&gt;</code></p><ul><li><p><code>SKILL.md</code></p></li><li><p><code>references/</code></p></li></ul></li></ul></li><li><p>If a Skill might require executing code, create a script for that workflow and place it under the <code>scripts/</code> folder.</p></li><li><p>If there are complex behaviours or patterns you want the Agent to follow, place them as Markdown files under <code>references/</code> and link them in your <code>SKILL.md</code>.</p></li><li><p>Adapt Skills for your own context and codebase.</p></li></ul><p><strong>Other Tips:</strong></p><ul><li><p>Keep PRs small, well-described, and review with &#8220;issue &#8594; solution &#8594; impact&#8221;.</p></li><li><p>At the end of the day, do a daily &#8220;agent memory dump&#8221; of each Agent session to quickly summarize the context of the work you&#8217;ve done.</p></li><li><p>Summarize the memories and prepare a plan for the next day of work using notes to keep momentum.</p></li></ul><p>I 
strongly believe this will help you get up to speed when using AI for coding tasks, and will make you a better engineer overall.</p><div><hr></div><p><strong>Would love to hear your thoughts &#128172;</strong></p><p><em>How are you using AI in your day-to-day developer workflow?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-engineers-guide-to-ai-assisted/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/the-engineers-guide-to-ai-assisted/comments"><span>Leave a comment</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fA6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fA6Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fA6Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 848w, https://substackcdn.com/image/fetch/$s_!fA6Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!fA6Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fA6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg" width="1456" height="372" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/185531461?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fA6Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 424w, https://substackcdn.com/image/fetch/$s_!fA6Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!fA6Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!fA6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7867f281-0fb8-4ff6-9821-52ea8442ffe0_4800x1227.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Upcoming Livestream: GPUs for AI (Shaped by 
You)]]></title><description><![CDATA[You can choose the topics for a Live Session on GPUs in AI]]></description><link>https://read.theaimerge.com/p/upcoming-livestream-gpus-for-ai-shaped</link><guid isPermaLink="false">https://read.theaimerge.com/p/upcoming-livestream-gpus-for-ai-shaped</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Tue, 27 Jan 2026 09:30:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3b39a134-bb5b-495d-ba6f-3bf3a3d70a5f_5760x3240.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey everyone,</p><p>In the upcoming weeks, I&#8217;ll be hosting a <strong>live session on GPUs for AI</strong> together with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!LZBx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;d640c86f-8f66-434d-a70f-da5826c45619&quot;}" data-component-name="MentionToDOM"></span> (from The Neural Maze), where we aim to unpack the role of GPUs in AI.</p><p>GPUs sit at the core of modern AI systems, but for many, they remain a black box. </p><p>Every AI workload is powered by some accelerator, a GPU, NPU, TPU, LPU, or another chip, ASIC or not. 
But <em>how they actually work</em>, <em>what really matters when choosing a GPU</em>, and <em>how to optimize AI workloads and models</em> are all valid questions that an AI Engineer should ask.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to get notified about the upcoming live session.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><p>Instead of guessing what to cover, I want to <strong>build this session around your interests</strong>.</p><p>In this article, I&#8217;ll be attaching two things:</p><ol><li><p>A video overview of the topics I thought about</p></li><li><p>A set of polls to gauge your interests</p></li></ol><p>Before diving into the polls, which I kindly ask you to take 2-3 minutes to complete, I want to briefly go over the diagrams and charts I&#8217;ve prepared, to <em>probe</em> the topics we could discuss related to AI and GPUs.</p><div><hr></div><h3>Starting Points</h3><p>In this video, I go through a few diagrams and sketches I&#8217;ve prepared, in the order I plan to cover them in the upcoming live session.</p><p>(Your feedback can change this)</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;6dd91765-1eb5-4778-80cd-392199e29ee5&quot;,&quot;duration&quot;:null}"></div><p>The topics already outlined:</p><ul><li><p>NVIDIA GPUs</p><ul><li><p>What the hardware looks 
like</p></li><li><p>What is PCIe, and how does it differ from SXM</p></li><li><p>Reading a datasheet, CUDA, NVCC, CUDA Compute Capability</p></li></ul></li><li><p>ASICs (<a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit">Application-specific integrated circuits</a>)</p><ul><li><p>Google TPUs</p></li><li><p>Cerebras Wafer-Scale</p></li><li><p>Groq LPU</p></li></ul></li></ul><p>(Additional) Topics that might be interesting:</p><ul><li><p>Optimizing Models on GPUs</p><ul><li><p>TensorRT, GGUF</p></li><li><p>ONNX, OpenVINO, Apple MLX</p></li></ul></li></ul><div><hr></div><h3>Your Feedback (helps greatly)</h3><p>Below are the questions I&#8217;ve shared as polls with the audience. Your picks can directly shape the structure, depth, and focus of the livestream.</p><blockquote><p>We&#8217;re aiming to transform the livestream into an open discussion, where we could answer all of your questions.</p></blockquote><p>So, feel free to ask <strong>ANY</strong> type of question during the session. <br>There are no stupid questions; we want to hear from everyone!</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/upcoming-livestream-gpus-for-ai-shaped?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading The AI Merge! 
Share this post to help it reach more people.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/upcoming-livestream-gpus-for-ai-shaped?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/upcoming-livestream-gpus-for-ai-shaped?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><h4><strong>1. Understanding your current experience with Software and/or AI.</strong></h4><div class="poll-embed" data-attrs="{&quot;id&quot;:438446}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:438447}" data-component-name="PollToDOM"></div><h4><strong>2. Your understanding of GPUs</strong></h4><p>The aim is to understand which GPU you&#8217;ve mostly been using, so we can dive into the specific details. The way CUDA works for NVIDIA GPUs, for example, is different from how MPS works on Apple M-series chips, or how ROCm works on AMD GPUs.</p><div class="poll-embed" data-attrs="{&quot;id&quot;:438448}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:438449}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:438453}" data-component-name="PollToDOM"></div><h4><strong>3. 
Topics focused on using GPUs for AI</strong></h4><p>Most AI Engineers will build applications, optimize and deploy models, or optimize already running pipelines and clusters, if they also work on the infrastructure side.</p><p>The goal here is to find out if we should touch on Inference Engines, how LLM (or AI) Inference works, and what the popular Inference Engines are, with brief details on how key components work.</p><div class="poll-embed" data-attrs="{&quot;id&quot;:439910}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:439912}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:438454}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:438460}" data-component-name="PollToDOM"></div><p></p><div><hr></div><h3>One more thing</h3><p>If you want other answer options that are not present in the polls, <em><strong>please leave a comment on this article with your thoughts, options, and impressions. 
</strong></em></p><p>I&#8217;ll read and reply to each comment individually.</p><p>I&#8217;m sure you&#8217;ll learn a lot about how GPUs work and their role and scope in AI development, both when building models and when deploying them to user-facing applications.</p><blockquote><p>Let&#8217;s prepare for an amazing learning session!</p></blockquote><p>See you soon,<br>Alex</p>]]></content:encoded></item><item><title><![CDATA[The Smartest AI Engineers Will Bet on This in 2026]]></title><description><![CDATA[A no-BS breakdown of where to invest your time, backed by real industry insights.]]></description><link>https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet</link><guid isPermaLink="false">https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Tue, 13 Jan 2026 11:03:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1cdc00e4-6f27-4011-82f1-83f6650b4c86_3840x2160.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>2025 has been the year of LLMs and reasoning models pushing toward agents and agentic workflows.</p><p><strong>That&#8217;s why</strong> I&#8217;ll spend 2026 focused more on system design, architecture, scale, inference, and core engineering fundamentals.</p><p>We now have a wide range of models to choose from, cheaper inference, and plenty of guidebooks on how to prototype with LLMs, from RAG pipelines to basic workflows and agents. 
Getting something working is no longer the hard part.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j-j2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j-j2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 424w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 848w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 1272w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j-j2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png" width="1456" height="1196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1196,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1398867,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!j-j2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 424w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 848w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 1272w, https://substackcdn.com/image/fetch/$s_!j-j2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f62fda1-07db-437f-9834-d3356928d31e_2564x2106.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In the last few months of 2025, more models have been released compared to previous months. Image taken from https://lifearchitect.ai/models-table/</figcaption></figure></div><p>What&#8217;s still hard is running these systems in production.</p><p>Across the industry, only a small number of teams have managed to move beyond pilots and demos. And when systems fail, it&#8217;s rarely because of the model itself. It&#8217;s the engineering around the model: how systems are designed, monitored, tested, and improved over time. These are the same problems software teams have always faced, but made harder this time, due to the non-deterministic behavior of AI Systems.</p><p>In 2026, the most valuable work is learning how to build AI systems that hold up under real usage. 
New abstractions will come and go, including agents, but they only create value when they sit on top of well-designed, maintainable systems.</p><p>This article is about that gap.</p><p>What you&#8217;ll get from this article:</p><ol><li><p><em>What are the 2025 MIT, BCG, and Gartner reports on AI saying?</em></p></li><li><p><em>Where should an AI Engineer invest their time?</em></p></li><li><p><em>How to avoid confusion and hype cycles?</em></p></li><li><p><em><strong>An actionable plan for engineers: AI Foundations, Engineering, and Systems.</strong></em></p></li></ol><div><hr></div><h2>AI Adoption is Moving Slower</h2><p>Reading through McKinsey&#8217;s 2025 <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai">[12]</a> survey, it turns out that even though almost 88% of companies use AI in at least one business function, nearly 60% of them are still stuck in research and experimentation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xTEI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xTEI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 424w, https://substackcdn.com/image/fetch/$s_!xTEI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 848w, 
https://substackcdn.com/image/fetch/$s_!xTEI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 1272w, https://substackcdn.com/image/fetch/$s_!xTEI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xTEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png" width="1456" height="820" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xTEI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!xTEI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 848w, https://substackcdn.com/image/fetch/$s_!xTEI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 1272w, https://substackcdn.com/image/fetch/$s_!xTEI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa70e8613-e0e0-457e-92ae-ca3f0aebc3c0_1630x918.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 1. Interest in and use of GenAI have grown a lot compared to last year. Most companies, however, are still in the experimentation &amp; piloting phases, outlining an execution gap.</figcaption></figure></div><p>Similarly, BCG&#8217;s <a href="https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf">[8]</a> research shows that only 5% of firms are &#8220;future-built&#8221; AI leaders seeing significant bottom-line value; these are the infra-owners, the frontier-AI labs, and large companies that use AI as a backbone.</p><p><em><strong>They have the talent, compute, resources, and leverage to do so.</strong></em></p><p>Three concrete examples:</p><ol><li><p><strong>Google</strong> - massive search context, default user behavior, and existing distribution to build the &#8220;AI Mode&#8221; on Google Search.</p></li><li><p><strong>Perplexity</strong> - one of the first to bake source citations directly into answers.</p></li><li><p><strong>NVIDIA</strong> - seasoned talent, vertical integration from chips &#8594; simulators &#8594; models, and large-scale real and synthetic data across multiple modalities (i.e., NVIDIA Isaac Sim, Omniverse, Nemotron).</p></li></ol><p>On the other side, 60% report minimal or no gains from AI; these are the laggards. Given that models have become cheaper per generated token and the industry has caught up on cookbooks and best practices for building with AI, you&#8217;d think building and scaling would be easier by now.</p><p><em>What does that mean?</em></p><p>Most companies won&#8217;t have an <em>&#8220;AI-first&#8221;</em> strategy, but an <em>&#8220;Integrate AI&#8221;</em> strategy.
That isn&#8217;t new; it&#8217;s been that way for the past two years.</p><p>That means we&#8217;re entering the Scaling phase, where PoCs must become resilient AI systems running in production.</p><p>In the following chart from the <em>Boston Consulting Group </em>report, we can see the median moved a bit in 2025 compared to last year, but most companies are still stuck between the Emerging and Scaling phases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8qNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8qNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 424w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 848w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8qNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:490092,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8qNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 424w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 848w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 1272w, https://substackcdn.com/image/fetch/$s_!8qNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85233b4b-4449-41eb-a984-073780f3ef0c_2812x1876.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Image 2. Most companies are still working on developing an AI strategy, which can be seen in the 3% drop in emerging trends, the 13% increase in scaling focus, and, most importantly, the 11% drop in the first column.</figcaption></figure></div><p>If we zoom out on this infographic, the 60% of laggards in the first two columns are moving from <em>no interest</em> in AI to <em>building foundational capabilities</em>.</p><p>That doesn&#8217;t cover the <em>&#8220;using AI models&#8221;</em> part, but getting up to speed on talent, system design at scale, and business functions, which can be translated as:</p><ul><li><p><em>&#8220;Understanding&#8221;</em> - people learn how AI and its components work</p></li><li><p><em>&#8220;Designing&#8221;</em> - the business functions where AI could act as leverage</p></li><li><p><em>&#8220;Building&#8221;</em> - the core components and experimenting</p></li></ul><p><em>What does that mean for an AI Engineer?</em></p><p>A note to end this section on: the best position for an engineer in this phase, and through the coming year&#8217;s advancements, is learning to scale AI systems and bring real value out of them.</p><p>Engineers who can operationalize, scale, and measure AI systems will be more valuable than those who can only prototype with new models.</p><div><hr></div><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Enjoying this article so far?
You can share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div><hr></div><h2>An AI Engineer&#8217;s Focus</h2><p>For most engineers, unless you&#8217;re working in applied research (frontier labs) or high-performance scaling (inference providers, GPU kernels), the job isn&#8217;t about pushing models forward. </p><p>It&#8217;s about making systems work reliably.</p><p>A year ago, I wrote about why <strong>MLOps fails in production</strong> and why nearly <strong>half </strong>of projects fail to scale into real value. At the time, the discussion was framed around classical Machine Learning and Deep Learning: models, engineering, pipelines, and deployment workflows.</p><p>The names have changed since then, but <strong>the problem hasn&#8217;t.</strong> <a href="https://read.theaimerge.com/p/mlops-at-big-tech-and-why-50-of-ai">(Read it Here)</a></p><p>We might&#8217;ve switched from ML Engineering to AI Engineering, from Deep Learning to Generative AI - but the problem is still there. Most of these systems don&#8217;t require AI as a backbone; we&#8217;re far from fully autonomous agents, where the autonomy slider is set to maximum. </p><p>They do require solid engineering, however.</p><p>There is a big difference between <em>&#8220;AI everywhere&#8221;</em> and <em>&#8220;AI delivering value everywhere&#8221;</em>, even if going full-agentic sounds good and trendy. </p><p>Here&#8217;s a clearer signal to reinforce the point above. 
&#8595;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ytfn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ytfn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 424w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 848w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 1272w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ytfn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png" width="1456" height="769" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125882,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ytfn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 424w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 848w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 1272w, https://substackcdn.com/image/fetch/$s_!Ytfn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f06b5e3-23cc-4bdb-8ff2-9b7587d67480_1806x954.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 3. This distribution shows a lack of execution capacity. Experimentation is quick, but moving reliably from pilot to scaling is a challenge. Boston Consulting Group 2025 Report.</figcaption></figure></div><p>There&#8217;s clearly an execution gap. Projects don&#8217;t fail because of models, but because they can&#8217;t reliably move from pilot to production. This idea goes back to 3-4 years ago, when MLOps was a hot topic of discussion in the AI community. 
</p><p>The features are there, the models are there, but the pipelines, monitoring, drift detection, golden sets, and continuous improvement and calibration are lacking or not designed properly for production.</p><p>This leads to the following question: <em>&#8220;How does an AI engineer position themselves?&#8221;</em></p><p>Before answering that, it&#8217;s worth grounding ourselves in the hype cycles - because they explain why some people are fully invested in AI, while others are exhausted by the headlines.</p><div><hr></div><h2>Avoid the Unnecessary Confusion</h2><p>It&#8217;s funny how working in AI feels like living inside contradictory headlines:</p><ul><li><p><em>&#8220;Agents are here&#8221;</em> and then <em>&#8220;This is the decade of Agents&#8221; (A. Karpathy)</em></p></li><li><p><em>&#8220;AI writes all the code&#8221;</em> and then <em>&#8220;We&#8217;re hiring Senior Engineers&#8221; (Microsoft)</em></p></li><li><p><em>&#8220;AI will replace Software Engineers in 6 months&#8221; (D. Amodei)</em></p></li></ul><p>These statements aren&#8217;t mutually exclusive, but seeing them side by side, and especially seeing them everywhere, creates a kind of fatigue. I think that&#8217;s probably one reason <em><strong>many experienced engineers don&#8217;t rush headfirst into fully AI-driven approaches</strong></em>. 
A few conflicting claims, from people in the field, are often enough to make someone pause and wait for clearer signals.</p><p>And then, there&#8217;s this chart, which was very popular last year &#8595;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ORJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ORJ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ORJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg" width="1456" height="867" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:867,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275187,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ORJ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ORJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38512bca-ae9d-4db3-98e5-8cfe6ed53198_1833x1091.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image 4. The 2025 Gartner Hype Cycle chart visually tracks the maturity and adoption of new technologies and how the public perception shifts.</figcaption></figure></div><p>This is the Gartner Hype Cycle chart, which helps businesses understand when to invest in emerging tech, mapping unrealistic hype against real-world value over time, with cycles often taking 3-10 years.</p><blockquote><p>Engineers who&#8217;ve built real systems tend to be skeptical by default, usually taking the slower path: try the tools, see where they help, notice where they fall apart, and only then decide where they fit. 
</p></blockquote><p>If you&#8217;re a Software Engineer, Data Engineer, Data Scientist, ML/MLOps/AI Engineer, or have built or worked on real projects, you&#8217;re probably tired of the hype around AI, and that&#8217;s a valid feeling that many people share. Maybe a few years ago, when a new model or something with &#8220;Agents&#8221; in its name was announced, you became curious and excited.</p><p>Nowadays, you just default to what&#8217;s proven to work, and pay far less attention to the new shiny things that end up not delivering.</p><p>Speaking of shiny things, let&#8217;s walk through a few examples, directly correlated with the Hype Cycle chart:</p><ol><li><p><strong>Rabbit R1 (Inflated Expectations)</strong> - the agentic device released in January 2024, marketed as one that could execute actions across apps thanks to its Large Action Model (LAM). In reality, it was brittle UI automation.</p></li><li><p><strong>Devin (Peak &#8594; Trough)</strong> - marketed as the first AI Software Engineer, making SW Engineers worry about their jobs. In reality, it&#8217;s expensive and loses context on real projects and complex codebases.</p></li><li><p><strong>Autonomous Agents (Trough of Disillusionment)</strong> - advertised as fully agentic AI: replacing jobs, automating everything, or saving 90% of costs. In reality, most projects rolled back to human-in-the-loop setups and workflows.</p></li></ol><p>That circles back to the points raised in the sections above. The engineering effort is moving slowly but steadily: many companies have built their first PoC and Pilot projects, but hit a wall on scaling and bringing solutions into production.</p><div><hr></div><h2>What should you focus on?</h2><p>At this point, the question isn&#8217;t whether AI will matter - it already does. 
</p><p>The real question is where to invest your time, when most tools, models, and trends either won&#8217;t make it to production or will introduce tech debt and force teams to migrate to in-house tools or other ones entirely.</p><p>That&#8217;s why I structure this section into three pillars: <strong>Foundations, Engineering, and Systems</strong>.</p><p>It closely resembles the new direction I&#8217;m taking with this publication, moving from deep walkthroughs on various, unstructured topics to structured, ladder-like paths - something I&#8217;ve been planning since October.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KsPd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KsPd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 424w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 848w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KsPd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png" width="1456" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:271900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KsPd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 424w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 848w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!KsPd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632775b1-a991-459d-b6ce-4f86557414b7_2608x1156.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">A snippet of the <em><strong>Learn</strong> </em>section on the upcoming website I&#8217;m building for The AI Merge. This is the page that will encompass the learning paths, which will also mirror the structure of this newsletter.</figcaption></figure></div><p>I&#8217;m intentionally <strong>not</strong> listing dozens of resources. People like structured lists. They bookmark them. They rarely finish them.</p><p>Instead of outlining 50+ links, I&#8217;ve limited each topic to two or three high-quality resources that I consistently read and go through. 
I don&#8217;t want to create another roadmap; I want a short, actionable path you can actually follow.</p><p>My <em><strong>recommendation is that you pick one resource</strong></em> for each pillar from those listed, and study it for a few months.</p><p>Now let&#8217;s get to the part you&#8217;ve been expecting.</p><div><hr></div><h3>#1 AI Foundations</h3><p>There&#8217;s this misconception that only AI Research engineers - the ones building efficient GPU Kernels, training Foundation Models, or working at the bleeding edge of AI - must master the foundations.</p><p>That&#8217;s not true. <strong>Foundations show up for everyone</strong> the moment you ship AI to real users.</p><p>You may not need to master distributed training, large-scale data patterns, or GPU topologies, but understanding Context Length, Sampling Parameters, Inference, Quantization, and the basic math behind AI still matters for you.</p><blockquote><p>I&#8217;m intentionally focusing less on Machine Learning, as you&#8217;ll pick up a lot of ML concepts naturally while studying deep learning.</p></blockquote><p>But in case you want a list of the <strong>ML fundamentals</strong> you should consciously cover:</p><h4>#1.0 Machine Learning</h4><ol><li><p>Supervised vs. unsupervised learning</p></li><li><p>Overfitting / underfitting / Regularization (L1, L2)</p></li><li><p>Train / validation / test splits</p></li><li><p>Gradient descent (conceptually and mathematically)</p></li></ol><blockquote><p>All of these concepts are covered in <a href="https://themlbook.com/">The Hundred-Page Machine Learning Book, A. Burkov.</a></p></blockquote><h4>#1.1 Mathematics</h4><p>You don&#8217;t need to derive gradients by hand; it&#8217;s enough to build basic intuition for linear algebra, probability, and optimization so you can reason about model behavior.</p><ol><li><p><a href="https://www.3blue1brown.com/">3Blue1Brown YouTube Channel, Grant Sanderson</a></p><p>Most probably, you&#8217;ve
already seen videos from 3B1B. In case you haven&#8217;t, the step-by-step animations help you understand the logic behind AI.</p></li><li><p><a href="https://thepalindrome.org/">The Palindrome Newsletter, Tivadar Danka</a></p><p>Frankly, one of the best and most up-to-date resources on the Math behind AI/ML. Tivadar does a great job explaining complex concepts in simple terms.</p></li></ol><blockquote><p>Don&#8217;t spend too much time on maths; understanding the core principles is enough.</p></blockquote><h4>#1.2 Deep Learning</h4><p>The term Deep Learning might have grown out of fashion, being replaced by the more general AI or GenAI term. But for someone working in the AI field, Deep Learning is the foundation on top of which Language Models, Vision Models, and Generative AI are built.</p><p>Two resources to get started with Deep Learning:</p><ol><li><p><a href="https://youtu.be/alfdI7S6wCY">Introduction to Deep Learning, MIT 2025 Playlist.</a></p><p>Neural Networks, model training, distillation, optimization, and finetuning - all of these live under the Deep Learning umbrella. This MIT Playlist is a great up-to-date resource for understanding what Deep Learning is.</p></li><li><p><a href="https://www.nvidia.com/en-us/training/">NVIDIA Deep Learning Institute</a><br>A rich collection of industry-relevant technical trainings from NVIDIA, covering the broader spectrum of DL.</p></li></ol><blockquote><p>If you want to understand Generative AI, LLMs, and Diffusion, you must understand the core components of Deep Learning first.</p></blockquote><h4>#1.3 Generative AI</h4><p>This is where most of the AI industry keeps its attention, and where most engineers operate today. At the same time, it&#8217;s where misunderstandings are common.
</p><p>Most resources listed will focus on Language Modeling:</p><ol><li><p><a href="https://www.youtube.com/playlist?list=PLoROMvodv4rObv1FMizXqumgVVdzX4_05">Large Language Models, Stanford CME 295 Playlist</a></p><p>If you&#8217;re somewhat experienced, take Lectures 5, 7, 8, and 9. For beginners, add Lectures 1 and 2 to the list. For everyone, watch the CS25 V5 lecture from Josh Batson at Anthropic.</p></li><li><p><a href="https://www.youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_">Language Modeling from Scratch, Stanford CS336 Playlist</a><br>These lectures explain everything around LLMs: the architectures, how they&#8217;re trained, why scaling works, and where it breaks down.</p></li></ol><p>A few good books on the topic (<strong>pick one</strong>):</p><ol><li><p><a href="https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167">LLMs from Scratch, Sebastian Raschka</a><br>If you want a step-by-step walkthrough on building an LLM from scratch with PyTorch.</p></li><li><p><a href="https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961">Hands-On Large Language Models: Language Understanding and Generation, J. Alammar and M. Grootendorst</a><br>A real deep dive into everything about LLMs, from NLP (Natural Language Processing) to newer architectures and training.</p></li><li><p><a href="https://www.amazon.com/-/es/Generative-Deep-Learning-Teaching-Machines/dp/1492041947">Generative Deep Learning, David Foster</a><br>I recommend this one because, apart from Language Models, it also covers the multimodal segment: models that can generate Audio, Image, and Video.</p></li></ol><blockquote><p>For minimal investment, look into understanding how LLMs work, and how to evaluate LLM-based systems.</p></blockquote><h4>#1.4 LLM-based Systems</h4><p>Agents and Agentic Workflows are becoming more and more popular. However, the field is still surrounded by hyped claims.
Fully autonomous systems still require a lot more work, or, as A. Karpathy put it, <em>&#8220;We&#8217;re in the decade of Agents, not the year of Agents&#8221;.</em></p><p>You don&#8217;t need 100 resources on Agents and Workflows; you only need this one.</p><ol><li><p><a href="https://drive.google.com/file/d/1-5ho2aSZ-z0FcW8W_jMUoFSQ5hTKvJ43/view">Agentic Design Patterns, A. Gulli</a><br>This is a free, 400-page booklet published by a Google Senior Director and Distinguished Engineer, covering everything you need to know about Agents, Workflows, Memory, and Agency.</p></li></ol><blockquote><p>I&#8217;m at Chapter 20, close to finishing it, and will surely read it again soon, as it&#8217;s information-dense - hands-down a great resource that gathers everything in one place.</p></blockquote><div><hr></div><h3>#2 Engineering</h3><p>Any engineer from the pre-2023 (ChatGPT) era will agree with this.</p><p>There&#8217;s another misconception that&#8217;s quite popular in the AI field, and I think few people actually address it to clear the hype.</p><blockquote><p><em>&#8220;AI engineers don&#8217;t really need strong software engineering skills - agents can handle most of the code.&#8221;</em></p></blockquote><p>AI-assisted coding doesn&#8217;t replace engineering - it exposes it.
When everyone can write code, quality comes from thinking clearly: understanding the problem, designing the system and its intricacies, and only then writing the implementation.</p><p>If you want to build solid AI Systems that reach production, you need to be a good software engineer first.</p><p>I&#8217;m going to split the resources into two pillars: <em>Architecture and Programming.</em></p><h4>#2.1 Architecture</h4><p>This is less about templates to structure your projects and more about building robust codebases that are easy to evolve and maintain.</p><ol><li><p><a href="https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164">Clean Architecture, R. C. Martin</a><br>Even though it was published a decade ago, it still holds up, as it addresses major software development pain points by isolating business logic from technical details (UI, database, frameworks).</p></li><li><p><a href="https://www.amazon.com/Fundamentals-Software-Architecture-Comprehensive-Characteristics/dp/1492043451">Fundamentals of Software Architecture</a>, M. Richards and N. Ford</p><p>Goes through architectural patterns (monoliths, microservices, event-driven).</p></li></ol><blockquote><p>You don&#8217;t need to know everything: DDD, Clean Architecture, or Vertical Slice. But you must understand the basics of Monoliths, Microservices, and Event-Driven Architectures.</p></blockquote><h4>#2.2 Programming</h4><p>This depends on your primary stack, but the principle is universal.</p><p>Even if you&#8217;re using AI agents like Claude or Codex to write code faster, all you&#8217;re really doing is amplifying whatever quality already exists.
If your code is hard to read, hard to test, or full of hidden side effects, AI will scale those problems - not fix them.</p><p>As this becomes the default, the work will shift away from writing code toward <strong>reading and reviewing</strong> code, where speed matters less than clarity.</p><p>You don&#8217;t need to know every language. Python is enough.</p><ol><li><p><a href="https://www.python.org/doc/">Python.org</a></p></li><li><p><a href="https://effectivepython.com/">Effective Python, Brett Slatkin</a><br>A great practical guide to writing clear, idiomatic, and maintainable Python.</p></li><li><p><a href="https://elmoukrie.com/wp-content/uploads/2022/05/luciano-ramalho-fluent-python_-clear-concise-and-effective-programming-oreilly-media-2022.pdf">Fluent Python, Luciano Ramalho</a></p><p>An extensive walkthrough of Python's core language features and libraries.</p></li></ol><blockquote><p>If you&#8217;re already programming in Python, study the Effective Python book. If you&#8217;re new to Python, skim through Fluent Python, then Effective Python.</p></blockquote><p>These are large books; the best way to get value out of them is to jump to the chapter you need, skim through the pages, and gather notes.</p><p><strong>(Bonus) Learning Go</strong> is another long-term safe bet in AI/ML.</p><ol><li><p><a href="https://go.dev/doc/effective_go">Effective Go, go.dev</a><br>If you&#8217;re already familiar with Python and want to pick up another language that&#8217;s close to it, Go is the best candidate.</p></li><li><p><a href="https://www.amazon.de/Learning-Go-Idiomatic-Real-World-Programming/dp/1492077216/">Learning Go, Jon Bodner</a><br>A deeper walkthrough of Go's idioms, and how to avoid recreating patterns that don't make sense in a Go context.</p></li></ol><blockquote><p>I&#8217;ve been programming in Go for the past six months, and I love it.
I can finally understand the <a href="https://github.com/ollama/ollama/tree/main">Ollama codebase.</a></p></blockquote><h3>#3 AI Systems</h3><p>System Design is mandatory; every tech interview will have a System Design part.</p><p>Good design translates into stable software, or at least software that&#8217;s easier to evolve. Beyond the architecture and the code that implements features, software must scale, be easy to maintain, and be resilient.</p><p>With AI, this becomes harder.</p><p>We introduce non-determinism into software, which means assumptions break more often and in less obvious ways. Outputs will vary, costs fluctuate as AI Systems require far more compute than traditional ones, and designing a system right from the get-go becomes more important than simply building it.</p><p>Best guides and books I follow and read on System Design:</p><ol><li><p><a href="https://bytebytego.com/">ByteByteGo, Alex Xu</a><br>One of the most popular and best resources for understanding how large systems running at scale in production are designed and built.</p></li><li><p><a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">Designing Data-Intensive Applications (DDIA), M. Kleppmann</a><br>Not focused on AI per se, but it&#8217;s a great resource for understanding how real systems behave under load and scale.</p></li><li><p><a href="https://newsletter.systemdesign.one/p/system-design-pdf">System Design Playbook</a><br>Massive thanks to <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Neo Kim&quot;,&quot;id&quot;:135589200,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c103940f-0d8b-47e7-9a33-013202e17bb8_389x389.jpeg&quot;,&quot;uuid&quot;:&quot;d2c57d80-2bd1-4178-a881-475bf04aaa61&quot;}" data-component-name="MentionToDOM">Neo Kim</span> for compiling it and making it free.</p></li></ol><p>For AI Systems specifically, I
recommend:</p><ol><li><p><a href="https://www.youtube.com/channel/UCLKPca3kwwd-B59HNr-_lvA">AI Engineer (YouTube)</a><br>One of the few places that discusses AI application design from a systems and production perspective; a lot of great talks were uploaded recently.</p></li><li><p><a href="https://www.oreilly.com/library/view/ai-engineering/9781098166298/">AI Engineering, Chip Huyen</a><br>This book needs no introduction. I can describe it as an extensive zero-shot introduction to AI Engineering; it&#8217;s not too technical, but packed with details.</p></li><li><p><strong>The <a href="https://www.anthropic.com/engineering">Anthropic</a> and <a href="https://developers.openai.com/blog/">OpenAI</a> engineering blogs.</strong></p></li></ol><blockquote><p>What you need is not more resources to bookmark, but one or two solid entry points into each of these pillars.</p></blockquote><p>The next step is to apply these learnings in your own work, and that&#8217;s where I&#8217;m helping with The AI Merge.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">The AI Merge is a publication for engineers building AI beyond demo pilots.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>Closing Thoughts</h2><p>My goal with this article was to give an overview of the industry's current state by analyzing the 2025 industry reports from Gartner, MIT, and Boston Consulting Group.</p><p>All of them outline
the same idea: the current gap in AI Systems is execution.</p><p>Building demos is easy; morphing them into systems and deploying them in production is a different game.</p><p>The three pillars I&#8217;ve outlined, <strong>Foundations, Engineering, and Systems</strong>, aren&#8217;t just theoretical; they form an efficient bottom-up learning plan that focuses more on the Engineering side of things than on surface-level AI features.</p><p>At the end of the day, I believe the best AI Engineers are Software Engineers at their core: people who&#8217;ve built the right mental models, who understand how software works, how to build and scale it, how systems work, and how to design one.</p><p>This knowledge ports easily to AI Systems.</p><p>That&#8217;s the direction this publication, and the learning paths behind it, will continue to follow.</p><div><hr></div><p><strong>Would love to hear your thoughts &#128172;</strong></p><p><em>What&#8217;s your take on the new Knowledge Pillars paths - <strong>Foundations, Engineering, and Systems</strong>?</em></p><p><em>If you&#8217;re an engineer, how do you see this execution gap actually closing?</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/the-smartest-ai-engineers-will-bet/comments"><span>Leave a comment</span></a></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LE6p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png" data-component-name="Image2ToDOM"><div
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LE6p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 424w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 848w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 1272w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LE6p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png" width="1456" height="372" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:372,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191801,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://read.theaimerge.com/i/184113282?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LE6p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 424w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 848w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 1272w, https://substackcdn.com/image/fetch/$s_!LE6p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b98e43c-f167-4ff6-9230-49af19b74415_4800x1227.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h3>References:</h3><p>[1] AI Engineer. (n.d.). <em>AI Engineer</em>. <a href="https://www.ai.engineer/">https://www.ai.engineer/</a></p><p>[2] AI Engineering: Building Applications with Foundation Models, Huyen Chip (2026). Amazon.com. <a href="https://www.amazon.com/AI-Engineering-Building-Applications-Foundation-ebook/dp/B0DPLNK9GN">https://www.amazon.com/AI-Engineering-Building-Applications-Foundation-ebook/dp/B0DPLNK9GN</a></p><p>[3] O&#8217;Reilly Media. (n.d.). <em>Designing Data-Intensive Applications</em>. <a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/</a></p><p>[4] NVIDIA. (n.d.). <em>Training and Certification</em>. 
<a href="https://www.nvidia.com/en-us/training/">https://www.nvidia.com/en-us/training/</a></p><p>[5] Danka, T. (n.d.). <em><a href="https://thepalindrome.org/">The Palindrome Newsletter</a></em>.</p><p>[6] Martin, R. C. (n.d.). <em>Clean Architecture: A Craftsman&#8217;s Guide to Software Structure and Design</em>. <a href="https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164">https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164</a></p><p>[7] Stanford University Human-Centered AI. (n.d.). <em>2025 AI Index Report: Technical Performance</em>. <a href="https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance">https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance</a></p><p>[8] Boston Consulting Group. (n.d.). <em>The Widening AI Value Gap</em>. <a href="https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf#:~:text=Future,skills%2C%20workflows%2C%20and%20technology%20that">https://media-publications.bcg.com/The-Widening-AI-Value-Gap-Sept-2025.pdf</a></p><p>[9] Gartner. (n.d.). <em>Gartner Hype Cycle Identifies Top AI Innovations in 2025</em>. <a href="https://www.gartner.com/en/newsroom/press-releases/2025-08-05-gartner-hype-cycle-identifies-top-ai-innovations-in-2025">https://www.gartner.com/en/newsroom/press-releases/2025-08-05-gartner-hype-cycle-identifies-top-ai-innovations-in-2025</a></p><p>[10] MIT. (n.d.). <em>State of AI in Business 2025 Report</em>. <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf</a></p><p>[11] Cleanlab. (2025). <em>AI agents in production: 2025 report</em>. <a href="https://cleanlab.ai/ai-agents-in-production-2025/">https://cleanlab.ai/ai-agents-in-production-2025/</a></p><p>[12] Thompson, A. D.
(2023, February 25). <em>Models Table (10,000+ LLM data points)</em>. Dr Alan D. Thompson &#8211; LifeArchitect.ai. <a href="https://lifearchitect.ai/models-table/">https://lifearchitect.ai/models-table/</a></p><p>[13] McKinsey &amp; Company. (2025, November 5). <em>The state of AI: Agents, innovation, and transformation</em>. McKinsey &amp; Company. <a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai?utm_source=chatgpt.com">https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai</a></p><div><hr></div><h3>Images</h3><p>Images 2, 3, and 4 were taken from the Boston Consulting Group and Gartner&#8217;s State of AI 2025 reports.</p>]]></content:encoded></item><item><title><![CDATA[Last article of 2025 - A Directional Update]]></title><description><![CDATA[Reflecting on 2025, and how this newsletter is changing going into 2026]]></description><link>https://read.theaimerge.com/p/last-article-of-2025-a-directional</link><guid isPermaLink="false">https://read.theaimerge.com/p/last-article-of-2025-a-directional</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 27 Dec 2025 14:35:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2008812a-a6b8-44c9-8470-a8807be56ca7_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article is the last of 2025. </p><p>It&#8217;s an end-of-year note - a pause to reflect on the direction I&#8217;ll be taking next.</p><p>Over the past few weeks, I&#8217;ve been rethinking how I build, what I focus on, and what kind of work is worth compounding. Until now, this newsletter has explored the inner workings of AI from many angles.
These ranged from low-level concepts to the tools and frameworks used to build AI systems in practice.</p><p>Some things are about to change.</p><p>Before getting into that, I want to start with a short story.<br>Not as inspiration, but as context.</p><p>Looking back from where I started to where I am, and connecting the dots, made me realize something about leverage.</p><div><hr></div><h2>Compounding Leverage</h2><p>When I started my university studies in 2015, it was my first real break away from home. I picked a CS major, but didn&#8217;t have a laptop to work or study on.</p><p>For the first four months, I worked three night shifts per week as a bill counter, where I unpacked cash bags, counted notes, scanned for damaged ones, and repackaged everything to be delivered to ATMs.</p><p><em>Not fulfilling or strategic work. But it solved the money problem for a bit.</em></p><p>The first thing I bought was an ASUS laptop that I spent around 1500 RON (~300 USD) on. That laptop was part reward, part requirement, as I could now learn and tinker with code beyond the CS Labs I had.</p><p>In 2017, the same pattern followed.
I was fairly convinced at the time that I needed to buy a Mac, as I wanted to study Objective-C and Swift, and tinker with iOS applications.</p><p>Opportunity came, and I went to the US to work for the summer on a student J1 visa, thinking I could make a good amount of money.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MpM0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MpM0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MpM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg" width="1024" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/182512919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MpM0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!MpM0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21f8ad9-dbc0-4c43-8d7e-8f84c8a01c6d_1024x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">New York at night. A picture I took from a ferry coming back from Liberty Island, in 2017.</figcaption></figure></div><p>Financially, it didn&#8217;t work out well, as I didn&#8217;t come back with savings. I did come back with a refurbished MacBook Pro (13in, 2017), though. The next year, I got my first full-time job as a programmer, working with Python, C++, and Computer Vision.</p><p>That&#8217;s how I entered the AI and ML world.</p><p><em>Why share this?</em></p><p>None of these steps was optimized. Night shifts made me skip classes and cut into my study time. The J1 summer didn&#8217;t pay off financially, and again, I could have spent that summer studying.
But each decision quietly compounded my ability to explore, my confidence, and my progress toward the things I cared about.</p><p>Direction mattered more than speed.</p><div><hr></div><h2>The New Direction</h2><p>The baseline of this newsletter hasn&#8217;t changed. <em><strong>AI &amp; Building AI Systems still stands solid.</strong></em></p><p>What has changed is how wide the surface has become.</p><p>Looking through the archive of everything I&#8217;ve posted, I noticed that over time, the focus of this newsletter slowly expanded: from details on architecture and hardware, to tools, to frameworks, tips &amp; tricks, and advanced low-level details.</p><p>I&#8217;ve covered GPUs, Neural Network Architectures, Programming Languages, AI Inference Frameworks, Engines, and AI Engineering concepts - but slowly diverged from the initial idea of <strong>building and shipping end-to-end</strong>, explaining along the way.</p><p>I&#8217;ve decided to do the opposite: </p><ol><li><p>Reducing the surface area</p></li><li><p>Increasing depth and practicality</p></li><li><p>Focusing on building and integrating AI </p></li></ol><p>The articles I&#8217;ve published helped many of you understand how individual pieces work, but less so how to connect the dots in a bigger system.</p><p><em>It&#8217;s about building systems around AI, rather than building AI around a system.</em></p><blockquote><p>Most real-world environments are legacy systems that could benefit from AI, not greenfield agent platforms that replace everything overnight. 
</p></blockquote><p>In practice, that often looks like small, practical additions: a retrieval layer over internal docs, a vision model to filter large image datasets, a video summarization workflow, etc.</p><p>This is the layer I want to focus on.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Building systems where AI actually fits. Subscribe to follow the next phase.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>To do that properly, I need to realign what I publish with what I actually want to build. That&#8217;s why I decided to take a fresh, ground-up look at everything I&#8217;ve shared here, and start aligning the pillars one by one.</p><p><em>Starting with the name.</em></p><div><hr></div><h2>A New Name: The AI Merge</h2><p>The name Neural Bits was tightly coupled to the idea of byte-sized insights on AI.</p><p>Looking back, that description was only half accurate. Some articles were dense and technical, others were shorter and lighter reads - but from a reader&#8217;s perspective, I think most of them were seen as deep dives into individual components: a model, a framework, a piece of infrastructure.</p><p>There was another practical detail as well. A few months after starting the newsletter, I discovered that Neural Bits was already the name of a software development company in Mumbai. 
At the time, I chose not to change it as I focused on writing and learning, not branding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3fqV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3fqV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 424w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 848w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 1272w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3fqV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png" width="1456" height="975" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325089,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/182512919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3fqV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 424w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 848w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 1272w, https://substackcdn.com/image/fetch/$s_!3fqV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0a3fdcb-b5bd-43af-9764-f9e6312dab00_6773x4537.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The AI Merge is closer to what I originally set out to do: build AI that fits into real systems, end to end. </p><p>The foundations I&#8217;ve written about - AI Engineering, Inference, Multimodal AI, LLMs, APIs, GPUs, optimization - still matter. They&#8217;re prerequisites. It&#8217;s easier to reason about monitoring when you understand inference patterns. 
It&#8217;s easier to deploy and optimize models when you understand the underlying infrastructure.</p><p>But the new focus will be more practical, on how <strong>AI merges into full systems.</strong></p><p>That&#8217;s what the new name reflects.</p><div><hr></div><h2>A New Channel: Video</h2><p>Some things are easier to understand when you can see them built.</p><p>A few months ago, I collaborated with a good friend, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!LZBx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;ad2152dc-f85c-497c-be6f-82c63f0a5f18&quot;}" data-component-name="MentionToDOM"></span> (from <em>The Neural Maze</em>), on a free course we called <em><a href="https://multimodalai.substack.com/p/kubrick-course-final-round-up">Kubrick: The Multimodal Agent</a></em>. Until then, most of my work had been text-first, writing code and explaining it through articles.</p><p>Working on that course changed how I think about video content. Walking through code and design decisions step by step is a great way of teaching. Instead of describing what to build, we could show how and why things were built the way they were.</p><p>Many people who went through it reached out with questions and follow-ups, and some even scaled it and built their own solutions, following the same principles. </p><p>That feedback was really helpful, and because of it, I&#8217;ve started a YouTube channel. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hYZi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hYZi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 424w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 848w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 1272w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hYZi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png" width="1456" height="467" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:467,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:427352,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/182512919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hYZi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 424w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 848w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 1272w, https://substackcdn.com/image/fetch/$s_!hYZi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d830233-b4bb-4813-b065-f872e6384a35_2654x852.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It will focus on longer-form walkthroughs: explaining System Design decisions, live coding and course walkthroughs, webinars, Q&amp;A sessions, and more. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.youtube.com/@theaimerge&quot;,&quot;text&quot;:&quot;YouTube Channel&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.youtube.com/@theaimerge"><span>YouTube Channel</span></a></p><div><hr></div><h2>A New Platform: The Website</h2><p>I&#8217;m working on building a strong reference point.</p><p>A place where I can present the projects and courses I&#8217;m working on, share a bit more context about my background, and keep the most important resources accessible in one place. 
It will also collect recordings, notes, and links to resources I&#8217;ve found useful along the way - things that are worth keeping an eye on, but don&#8217;t always fit into a single post or video.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hog5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hog5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 424w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 848w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 1272w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hog5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png" width="1456" height="1168" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1168,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1097229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/182512919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hog5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 424w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 848w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 1272w, https://substackcdn.com/image/fetch/$s_!Hog5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc8d388-615c-40db-9df3-c7dd6307574b_2720x2182.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A view of the best articles and resources covered in the Newsletter.</figcaption></figure></div><p>I envision it as a place you can come back to when you want an overview of what I&#8217;m working on, what&#8217;s available to learn from, how I can help you, and how the different pieces connect. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gn92!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gn92!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 424w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 848w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gn92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png" width="1456" height="1294" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1294,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:626596,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/182512919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gn92!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 424w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 848w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 1272w, https://substackcdn.com/image/fetch/$s_!Gn92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F635de06a-d94d-4c9c-a92a-47bac9a8c108_2644x2350.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A central page for courses and projects, past and in progress.</figcaption></figure></div><p>The website ties all the channels together. It provides a quick and strong intro to everything you could learn, build, and apply.</p><p>I&#8217;m rolling it out in 2026.</p><div><hr></div><h2>A note before the New Year</h2><p>Before closing this out, thank you.</p><p>At the start of 2025, this newsletter had around 300 subscribers. Today, there are more than 7,300 of you learning along. 
</p><p>This year, we went deep - from GPU programming and AI inference, to frameworks, inference engines, AI Engineering topics, and building a free course that many of you extended on your own.</p><p>I also had the opportunity to work closely with NVIDIA and received a DGX Spark, which I&#8217;ll use to build and teach in public next year.</p><p>I&#8217;m going into 2026 with a clearer direction, new channels, live sessions, and end-to-end projects. Less surface, more systems.</p><p>Wishing you a great end of the year! <br><strong>Looking forward to learning and building together in the next one.</strong></p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;id&quot;:102147316,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98b89ac-97e9-4875-88b6-2a5039668cb2_1700x1700.png&quot;,&quot;uuid&quot;:&quot;0556b8ee-cc8f-4512-9b81-ffbac3cd3514&quot;}" data-component-name="MentionToDOM"></span> </p><div><hr></div><p>This shift feels important to me. 
I'm happy to hear your thoughts.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/last-article-of-2025-a-directional/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/last-article-of-2025-a-directional/comments"><span>Leave a comment</span></a></p>]]></content:encoded></item><item><title><![CDATA[Unboxing the NVIDIA DGX Spark: First Impressions]]></title><description><![CDATA[The hardware, The Software components, benchmarks, target audience, and what it can do for AI Developers.]]></description><link>https://read.theaimerge.com/p/unboxing-my-nvidia-dgx-spark-first</link><guid isPermaLink="false">https://read.theaimerge.com/p/unboxing-my-nvidia-dgx-spark-first</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 20 Dec 2025 14:15:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7urH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to The AI Merge. Each week, I write about practical, production-ready AI/ML Engineering. Join over <strong><a href="https://multimodalai.substack.com/subscribe">7000+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>I&#8217;ve always leaned towards NVIDIA hardware. 
It started years ago when I was adding parts to my wishlist for building my dream gaming PC, and it continued as my work shifted toward AI development. Over time, the large majority of the GPUs and edge devices I&#8217;ve built on or deployed AI on have been NVIDIA-based, because that&#8217;s where the ecosystem, tooling, and performance consistently came together.</p><p>I&#8217;ve worked with the Jetson Nano and Jetson AGX Xavier at the edge; the RTX 3090, 4090, L40, A2, and more in workstations; and datacenter GPUs such as the A100 and H100. While each serves a different role - some prioritizing memory capacity, others raw compute or newer CUDA capabilities - they all reflect NVIDIA&#8217;s focus on end-to-end acceleration across hardware and software.</p><p>With <strong>DGX Spark</strong>, NVIDIA is targeting a sweet spot in local AI development: a desk-side developer kit with large unified memory, power-efficient design, and a capable GPU, built for AI engineers to develop, test, and validate models locally before scaling out to the cloud or large clusters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pida!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pida!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 424w, https://substackcdn.com/image/fetch/$s_!Pida!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 848w, 
https://substackcdn.com/image/fetch/$s_!Pida!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!Pida!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pida!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png" width="1456" height="1941" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10544576,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pida!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 424w, 
https://substackcdn.com/image/fetch/$s_!Pida!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 848w, https://substackcdn.com/image/fetch/$s_!Pida!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!Pida!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcfeeec-9719-4dd2-9793-df9b09d33134_3072x4096.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1. The NVIDIA DGX Spark Founder&#8217;s Edition.</figcaption></figure></div><p>Past the development stage, modern AI ultimately scales to massive, mostly NVIDIA-powered superclusters.</p><blockquote><p><strong>Note:</strong> AI superclusters tend to be built on NVIDIA GPUs: <a href="https://x.ai/colossus">xAI&#8217;s Colossus Supercluster</a>, <a href="https://azure.microsoft.com/en-us/blog/microsoft-azure-delivers-the-first-large-scale-cluster-with-nvidia-gb300-nvl72-for-openai-workloads/?utm_source=chatgpt.com">Microsoft + OpenAI</a>, <a href="https://www.nscale.com/press-releases/nscale-microsoft-2025">Nscale + Microsoft</a>, and the potential <a href="https://finance.yahoo.com/news/ai-boom-stargate-require-64-194135408.html">Stargate Project</a>.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RbFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RbFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 424w, https://substackcdn.com/image/fetch/$s_!RbFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 848w, https://substackcdn.com/image/fetch/$s_!RbFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!RbFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RbFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic" width="1456" height="1941" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19fe9fa7-27e2-48dc-80f9-1876e2536630.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1941,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3026451,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RbFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 424w, https://substackcdn.com/image/fetch/$s_!RbFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 848w, https://substackcdn.com/image/fetch/$s_!RbFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!RbFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19fe9fa7-27e2-48dc-80f9-1876e2536630.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2. DGX Spark is a small unit, 15x15x5 cm. 
It&#8217;s similar in size to a Mac Mini and, in my case, almost exactly the size of my CalDigit TS4 docking station.</figcaption></figure></div><p>In this article, I&#8217;ll take a hands-on look at the system, starting with the unboxing and hardware design, then moving through the software stack, intended audience, and a few benchmarks.</p><div><hr></div><h2>Starting with a Few Notes</h2><p>Lately, many early comparisons have evaluated DGX Spark against multi-GPU workstation builds (configurations like 4&#215; RTX 4090, 2&#215; RTX 5090, RTX PRO 6000 Blackwell, or even Apple&#8217;s M-series Ultra systems) and concluded that Spark is &#8220;underperforming&#8221; based purely on raw throughput metrics.</p><p>These comparisons assume that DGX Spark is intended to compete as a high-end GPU replacement. It isn&#8217;t.</p><p>DGX Spark is positioned as a <strong>local, DGX-aligned AI development system</strong>, not a benchmark-driven workstation. Its goal is to provide a coherent hardware&#8211;software environment for building, testing, and validating AI systems locally, using the same architectural assumptions that apply in NVIDIA&#8217;s datacenter platforms.</p><p>Once that framing is clear, the tradeoffs behind common benchmark comparisons become easier to interpret.</p><h4>Apples vs Oranges Comparison</h4><p>A single RTX 4090 offers strong raw performance with 24GB of GDDR6X VRAM over PCIe 4.0 x16. 
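</p><p>To put these capacity numbers in perspective, here&#8217;s a rough back-of-the-envelope sketch. It&#8217;s my own illustrative helper, not a rigorous sizing tool: it assumes ~1.2x overhead on top of raw weights for KV cache and activations.</p>

```python
import math

def model_vram_gb(params_b: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """Rough inference footprint: raw weights plus ~20% for KV cache / activations."""
    return params_b * bytes_per_param * overhead

def cards_needed(params_b: float, bytes_per_param: float, vram_per_card_gb: float) -> int:
    """How many GPUs of a given VRAM size are needed just to hold the model."""
    return math.ceil(model_vram_gb(params_b, bytes_per_param) / vram_per_card_gb)

# A 70B-parameter model at FP16 (~2 bytes/param) needs roughly 168 GB,
# i.e. seven 24 GB cards; 4-bit quantized (~0.5 bytes/param) it fits in ~42 GB.
print(cards_needed(70, 2.0, 24))  # FP16 on 24 GB cards -> 7
print(cards_needed(70, 0.5, 24))  # 4-bit on 24 GB cards -> 2
```

<p>Under those assumptions, a single 128GB pool of unified memory covers models that would otherwise span several consumer cards.</p><p>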
Scaling to larger memory capacities requires multiple cards (4x to reach 96GB), introducing fragmented memory pools, PCIe-based interconnects, and higher power consumption.</p><p>The RTX 5090 offers better bandwidth and capacity (32GB GDDR7 over PCIe 5.0 x16), but you&#8217;d still need four of them to reach 128GB of VRAM, and they&#8217;re not cheap either.</p><p>On the workstation side, the RTX PRO 6000 Blackwell has 96GB of ECC GDDR7 with large memory bandwidth, but it&#8217;s an $8,000+ card.</p><p>Apple&#8217;s M-series Ultra has large unified memory pools and excellent power efficiency, but sits outside the NVIDIA/CUDA ecosystem.</p><p>The DGX Spark is not a simple GPU; <strong>it&#8217;s an AI development platform</strong>, and, I think, a good sweet spot within all the configurations above. It condenses the NVIDIA DGX into a powerful desk-side mini PC, focusing on four important pillars:</p><ul><li><p><strong>The GB10 Blackwell Chip</strong> - with support for FP4 and NVFP4 (up to 1 PFLOP of FP4 performance), which aligns it directly with NVIDIA&#8217;s current and future AI compute roadmap.</p></li><li><p><strong>Large VRAM</strong>&nbsp;<strong>and Storage</strong>&nbsp;- Spark features 128GB of LPDDR5x unified memory and supports&nbsp;up to a 4TB SSD.</p></li><li><p><strong>Datacenter Networking</strong> - Spark ships with dual QSFP ports powered by a ConnectX-7 NIC, delivering 200 Gbps networking out of the box. 
The NIC alone is a ~$1,700&#8211;2,000 component, something you simply don&#8217;t get in standard workstations or consumer desktops.</p></li><li><p><strong>NVIDIA AI Software Stack</strong> - Spark runs DGX OS, preconfigured with NVIDIA&#8217;s full AI software stack, including core AI libraries such as CUDA, NCCL, cuDNN, TensorRT, and the rest.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Juv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Juv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 424w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 848w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 1272w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Juv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png" width="1456" height="1233" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1233,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1304843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Juv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 424w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 848w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 1272w, https://substackcdn.com/image/fetch/$s_!Juv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F491bb68b-4651-46bd-a23f-07ebb0cd50e9_1656x1402.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. How DGX Spark came to be, replicating a DGX-like AI development experience in a devkit. Size shrunk, power grew.</figcaption></figure></div><p>Judging DGX Spark purely on raw benchmark numbers misses its value proposition. It is not designed to replace multi-GPU workstations, nor to compete on inference performance alone. Instead, it provides a compact, coherent, and datacenter-aligned AI development environment that mirrors how models are ultimately trained, distributed, and deployed at scale within NVIDIA&#8217;s ecosystem.</p><p>When evaluated on those terms, the design tradeoffs of DGX Spark are deliberate and consistent with its intended role.</p><div><hr></div><h2>Hardware and Design</h2><p>The DGX Spark is a gorgeous piece of engineering. 
It has a full-metal chassis with a gold-like finish and two metal-foam front/back panels, strikingly similar to the design of the NVIDIA DGX A100 and H100.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7urH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7urH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 424w, https://substackcdn.com/image/fetch/$s_!7urH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 848w, https://substackcdn.com/image/fetch/$s_!7urH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 1272w, https://substackcdn.com/image/fetch/$s_!7urH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7urH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png" width="1280" height="685" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:685,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1377354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7urH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 424w, https://substackcdn.com/image/fetch/$s_!7urH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 848w, https://substackcdn.com/image/fetch/$s_!7urH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 1272w, https://substackcdn.com/image/fetch/$s_!7urH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa5c3b9-eb95-413b-99be-45bca760ab1c_1280x685.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4. DGX Spark unboxing.</figcaption></figure></div><p>Physically, it&#8217;s a small and elegant unit, roughly the footprint of a Mac Mini. 
On the back panel, it&#8217;s got:</p><ul><li><p>4 x USB-C ports, the leftmost handling power delivery (up to 240W)</p></li><li><p>1 x HDMI 2.1a port for your display</p></li><li><p>1 x <strong>10 GbE</strong> RJ-45 Ethernet port</p></li><li><p>2 x QSFP ports, driven by an NVIDIA ConnectX-7 200 Gb/s NIC</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2qPb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2qPb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 424w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 848w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 1272w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2qPb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png" width="666" height="328" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:328,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28671,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2qPb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 424w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 848w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 1272w, https://substackcdn.com/image/fetch/$s_!2qPb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ec4f62-45d8-4dce-89ba-88f651601646_666x328.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. A view of DGX Spark connectivity layout, with all 4 x USB-C ports, HDMI, Ethernet, and ConnectX-7 QSFP ports.</figcaption></figure></div><blockquote><p><strong>Note:</strong> QSFP (Quad Small Form-factor Pluggable) is a hot-pluggable optical or electrical interface used in high-speed networking equipment such as switches, routers, and servers.</p></blockquote><p>The QSFP ports are particularly interesting: underneath, they&#8217;re powered by a premium datacenter ConnectX-7 NIC (Network Interface Controller), allowing you to build a mini-DGX cluster locally with only two DGX Spark units and work with LLMs of over 400B parameters. </p><p>The exact NIC variant in Spark delivers 200 Gb/s across its 2 QSFP ports, 100 Gb/s per port. 
The example shown in the following image is the 2x200 Gb/s model at $2,200, placing the Spark version of this NIC at around 60-70% of that price (~$1,500).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0cjj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0cjj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 424w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 848w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 1272w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0cjj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png" width="1456" height="569" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:569,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:645601,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0cjj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 424w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 848w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 1272w, https://substackcdn.com/image/fetch/$s_!0cjj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63183d07-4ac6-42d2-b578-bed65c571527_2406x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6. Price of a single <a href="https://www.fs.com/products/242589.html">Connect-X7 InfiniBand Controller 2x200GB</a>. This adapter is almost as big as the Spark, so NVIDIA shrunk it down to fit. </figcaption></figure></div><p>At the datacenter scale, networking is what enables efficient multi-GPU and multi-node parallelisation through NCCL. With ConnectX-7, multiple DGX Spark units can be directly connected or attached to a high-end switch, and as ConnectX-7 supports RDMA and GPUDirect RDMA, Spark can also move data directly from storage or edge systems into GPU memory with minimal CPU involvement, something traditional workstations simply aren&#8217;t designed to do. 
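</p><p>To make the networking numbers concrete, here&#8217;s a small, purely illustrative estimate (my own helper, assuming ~90% effective link utilization) of how long moving a large sharded model state takes over the QSFP link versus the 10 GbE port:</p>

```python
def transfer_seconds(payload_gb: float, link_gbps: float, efficiency: float = 0.9) -> float:
    """Time to move `payload_gb` gigabytes over a `link_gbps` gigabit/s link."""
    gigabits = payload_gb * 8  # gigabytes -> gigabits
    return gigabits / (link_gbps * efficiency)

# Moving ~120 GB of model weights between two DGX Spark units:
print(round(transfer_seconds(120, 200), 1))  # ConnectX-7 QSFP link: ~5.3 s
print(round(transfer_seconds(120, 10), 1))   # standard 10 GbE: ~106.7 s
```

<p>That roughly 20x gap is what makes a two-Spark mini-cluster practical for sharding models that don&#8217;t fit in a single 128GB pool.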
</p><p>For local and edge AI, this matters when working with large datasets, streaming data, or building systems that more closely resemble production AI infrastructure.</p><div><hr></div><h2>Hardware Specifications</h2><p>The DGX Spark is powered by the GB10 Grace Blackwell Superchip. It combines a Blackwell GPU with 5th-generation Tensor Cores and a <strong>20-core</strong> Grace Arm CPU (10&#215; high-performance Cortex-X925 + 10&#215; efficiency Cortex-A725 cores). Memory-wise, it&#8217;s got 128 GB LPDDR5X unified memory, split into 8 chips around the GB10 Grace GPU chip. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GfCt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GfCt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 424w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 848w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 1272w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GfCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png" width="1456" height="1209" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1209,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3673981,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GfCt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 424w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 848w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 1272w, https://substackcdn.com/image/fetch/$s_!GfCt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51720e6b-351d-4f1e-8473-ea7b74cfa2c1_2024x1680.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7. A view of the Spark&#8217;s internal board, showcasing the GB10, 128GB (8x16) Unified Memory Slots, ConnectX7, and the other connectivity ports. Source <a href="https://www.storagereview.com/review/nvidia-dgx-spark-review-the-ai-appliance-bringing-datacenter-capabilities-to-desktops">StorageReview</a></figcaption></figure></div><p>The CPU and GPU are connected through NVLinkTM-C2C, a Chip-to-Chip<strong> </strong>interconnect, not via PCIe slots, cables, or external connectors, and that&#8217;s faster and more energy efficient. 
</p><blockquote><p><strong>Note:</strong> Although PCIe 5.0 x16 is limited to ~64 GB/s for CPU-GPU transfers, DGX Spark&#8217;s unified memory architecture provides ~273 GB/s of shared memory bandwidth accessible by both CPU and GPU.</p></blockquote><p>Both the CPU and GPU chips in GB10 share a coherent <strong>unified memory</strong> address space and behave like a single processor, yielding much lower latency and much higher bandwidth than PCIe. Over PCIe, by contrast, data has to flow H2C (host to chip) and C2H (chip to host) between the CPU memory space (RAM) and the GPU memory space (VRAM) in any machine with a discrete GPU plugged into a PCIe slot on the motherboard.</p><div><hr></div><h3>Memory Bandwidth</h3><p>To understand why memory bandwidth matters for DGX Spark, it helps to contrast it with a traditional discrete GPU setup.</p><p>In a conventional system - say, an RTX 5090 installed in a PCIe 5.0 x16 slot - the GPU is a discrete card, separate from the CPU and system memory. Communication between the CPU and GPU happens over PCIe 5.0, which provides roughly 64 GB/s of usable bandwidth per direction.
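To make these numbers concrete, here is a quick sketch (nominal bandwidths only, no real-world overheads; the 28 GB model size is a hypothetical example) of how long it takes to stream a model's weights at PCIe 5.0, unified-memory, and GDDR7-class rates:

```python
# Illustrative estimate only: nominal bandwidths, no protocol or
# software overheads; real throughput will be lower.

BANDWIDTHS_GB_S = {
    "PCIe 5.0 x16 (per direction)": 64,    # host <-> device copies
    "DGX Spark unified LPDDR5X": 273,      # shared by CPU and GPU
    "GDDR7-class VRAM (~1.5 TB/s)": 1500,  # GPU-local memory reads
}

def seconds_to_stream(payload_gb: float, bandwidth_gb_s: float) -> float:
    """Seconds to stream `payload_gb` GB at `bandwidth_gb_s` GB/s."""
    return payload_gb / bandwidth_gb_s

weights_gb = 28  # hypothetical quantized-model footprint

for name, bw in BANDWIDTHS_GB_S.items():
    print(f"{name}: {seconds_to_stream(weights_gb, bw) * 1000:.0f} ms")
```

Reading the full weights once per generated token is roughly what a bandwidth-bound decode step costs, which is why the gap between unified memory and GDDR7-class VRAM shows up directly in tokens per second.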
All data movement between host memory (RAM) and device memory (VRAM) must be explicitly orchestrated by software and is not hardware-coherent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AV08!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AV08!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 424w, https://substackcdn.com/image/fetch/$s_!AV08!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 848w, https://substackcdn.com/image/fetch/$s_!AV08!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 1272w, https://substackcdn.com/image/fetch/$s_!AV08!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AV08!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png" width="1456" height="863" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:863,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4455280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AV08!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 424w, https://substackcdn.com/image/fetch/$s_!AV08!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 848w, https://substackcdn.com/image/fetch/$s_!AV08!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 1272w, https://substackcdn.com/image/fetch/$s_!AV08!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb9a6c32-54da-4b93-9d63-c6f52cb2fc08_2862x1696.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8. A standard PCIe 5.0 x16 interface slot is present in most PC motherboards. On the right, an RTX 5090 connected through the motherboard&#8217;s PCIe x16 slot. Sources: Wikipedia, TechPowerUP</figcaption></figure></div><blockquote><p><strong>Note: </strong>In PyTorch, when you call something like <code>tensor.to(&quot;cuda:0&quot;)</code>, you&#8217;re explicitly triggering a device transfer. This operation copies the tensor&#8217;s data from CPU system memory<strong> (RAM)</strong> into GPU device memory<strong> (VRAM)</strong> so the GPU can access and process it.</p><p>If you look deeper into PyTorch&#8217;s C++ backend, you&#8217;ll find operations commonly referred to as <strong>HtoD (host-to-device)</strong> and <strong>DtoH (device-to-host)</strong>.
Backed by primitives such as cudaMalloc and cudaMemcpyAsync, these operations allocate memory in the appropriate CPU or GPU address space and then perform explicit copies between them over PCIe.</p></blockquote><p>DGX Spark takes a different approach. Instead of discrete CPU and GPU memory pools connected by PCIe, it uses a single, hardware-coherent unified memory system shared by both the CPU and GPU. This removes the need for explicit HtoD and DtoH copies and lets both processors directly access the same memory address space.</p><p>However, this design comes with a tradeoff. The unified memory pool delivers ~273 GB/s of bandwidth, as it is LPDDR5X system memory. That is still higher than what PCIe 5.0 provides per bus, but lower than the bandwidth of GPU-local memory such as GDDR7, GDDR6, or HBM.</p><blockquote><p>For instance, the RTX 5090&#8217;s GDDR7 VRAM delivers ~1.5 TB/s of memory bandwidth.<br>The DGX Spark&#8217;s LPDDR5X unified memory delivers 273 GB/s.</p><p><strong>But, to consider</strong>: the Spark is smaller, offers strong compute, and consumes far less power than a dedicated GPU.</p></blockquote><p>As a result, DGX Spark favors memory coherence and large memory capacity (128 GB) over raw memory throughput, which can limit performance for workloads that are heavily bandwidth-bound on the GPU.</p><p>Despite the lower 273 GB/s memory bandwidth, the Spark still yields decent inference performance for large MoE models, image generation models, and some dense models, and good performance for GPU-bound workloads such as prompt prefill, model training, and fine-tuning.</p><p>This makes the Spark <strong>well-suited</strong> for larger models and latency-tolerant workloads, but a <strong>slightly poorer fit</strong> for workloads that are fundamentally bandwidth-bound.</p><div><hr></div><h3>The GB10 Chip</h3><p><strong>The GB10</strong> is a multi-die single-chip solution for high-performance Arm-based workstations.
It pairs a GPU die based on the Blackwell architecture with a CPU die built by MediaTek that carries 20 Arm CPU cores. Both dies are built on TSMC&#8217;s 3nm process.</p><blockquote><p><strong>Note:</strong> A CPU/GPU die is the actual, tiny piece of silicon (semiconductor) where all the transistors, cores, and processing logic for a CPU or GPU are fabricated and etched.</p></blockquote><p>The Apple M4 Max, for example, has up to a 16-core CPU, a mix of up to 12 high-performance cores and 4 efficiency cores. The Spark comes with 20 (10 performance, 10 efficiency).</p><p>The integrated GPU delivers up to <strong>1 petaFLOP of AI compute (FP4 sparse)</strong> and supports the optimised NVFP4 precision, a data type that only Blackwell supports.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bx5x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bx5x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 424w, https://substackcdn.com/image/fetch/$s_!bx5x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 848w, https://substackcdn.com/image/fetch/$s_!bx5x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 1272w,
https://substackcdn.com/image/fetch/$s_!bx5x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bx5x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png" width="1456" height="1005" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1005,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1004264,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bx5x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 424w, https://substackcdn.com/image/fetch/$s_!bx5x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 848w, 
https://substackcdn.com/image/fetch/$s_!bx5x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 1272w, https://substackcdn.com/image/fetch/$s_!bx5x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd92d459e-41b9-4c48-886d-0c25dbb06fe1_2378x1642.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9. An annotated view of the GB10 Chip layout, the S-die and G-die of the chip, and the unified memory slots. 
Source <a href="https://www.servethehome.com/nvidia-dgx-spark-review-the-gb10-machine-is-so-freaking-cool/2/">ServeTheHome</a></figcaption></figure></div><h3>Networking and Scalability</h3><p>One major aspect that&#8217;s often overlooked in these comparisons is the networking interface built into DGX Spark. Spark includes NVIDIA ConnectX-7, a high-performance SmartNIC designed for <strong>datacenter AI and HPC workloads</strong>, not something you typically find in workstation or desktop systems.</p><p>The <strong>QSFP ports</strong> on DGX Spark support <a href="https://www.google.com/search?q=InfiniBand&amp;oq=connectX-7+&amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQRRgnGDsyCAgCEEUYJxg7MgwIAxAAGBQYhwIYgAQyBggEEEUYQTIGCAUQRRg9MgYIBhBFGDwyBggHEEUYPNIBCDIzMzdqMGo5qAIGsAIB8QXd33W62LgMRw&amp;sourceid=chrome&amp;ie=UTF-8&amp;mstk=AUtExfBdqb7FckwlCsspLj4pw0v-ffa47ibGtO9vw9X9QZpY1czG36-XeQQVvgs-1jRiexlzGh0oSjegAkSfqx82hI_ttG6flDqhEFor3k8TGVFGWV_ropvuWR5OF3Yninp8huS3pEEbwzO0B2SBpGzGKPkygKV1yhVfnQh_awDBHkKJSSMzmE-oVmq6sLmnhk1tLhkQV3MjFKWDq10tV5P-zcUJ4sAVVcyk1Cq9q0YKBjiON3nv4pXg3YXNbEisc3NaDTjh6iJpQknQLqGBlPupX7Ng&amp;csui=3&amp;ved=2ahUKEwinrfHomruRAxXu3gIHHUfJJkcQgK4QegQIARAE">InfiniBand</a>, for hardware-accelerated capabilities such as:</p><ul><li><p><a href="https://www.google.com/search?q=RDMA&amp;oq=connectX-7+&amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQRRgnGDsyCAgCEEUYJxg7MgwIAxAAGBQYhwIYgAQyBggEEEUYQTIGCAUQRRg9MgYIBhBFGDwyBggHEEUYPNIBCDIzMzdqMGo5qAIGsAIB8QXd33W62LgMRw&amp;sourceid=chrome&amp;ie=UTF-8&amp;mstk=AUtExfBdqb7FckwlCsspLj4pw0v-ffa47ibGtO9vw9X9QZpY1czG36-XeQQVvgs-1jRiexlzGh0oSjegAkSfqx82hI_ttG6flDqhEFor3k8TGVFGWV_ropvuWR5OF3Yninp8huS3pEEbwzO0B2SBpGzGKPkygKV1yhVfnQh_awDBHkKJSSMzmE-oVmq6sLmnhk1tLhkQV3MjFKWDq10tV5P-zcUJ4sAVVcyk1Cq9q0YKBjiON3nv4pXg3YXNbEisc3NaDTjh6iJpQknQLqGBlPupX7Ng&amp;csui=3&amp;ved=2ahUKEwinrfHomruRAxXu3gIHHUfJJkcQgK4QegQIARAH">RDMA and GPUDirect RDMA</a> for low-latency, high-throughput GPU-to-GPU and GPU-to-storage transfers, with minimal CPU 
involvement.</p></li><li><p><a href="https://www.google.com/search?q=RDMA&amp;oq=connectX-7+&amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIICAEQRRgnGDsyCAgCEEUYJxg7MgwIAxAAGBQYhwIYgAQyBggEEEUYQTIGCAUQRRg9MgYIBhBFGDwyBggHEEUYPNIBCDIzMzdqMGo5qAIGsAIB8QXd33W62LgMRw&amp;sourceid=chrome&amp;ie=UTF-8&amp;mstk=AUtExfBdqb7FckwlCsspLj4pw0v-ffa47ibGtO9vw9X9QZpY1czG36-XeQQVvgs-1jRiexlzGh0oSjegAkSfqx82hI_ttG6flDqhEFor3k8TGVFGWV_ropvuWR5OF3Yninp8huS3pEEbwzO0B2SBpGzGKPkygKV1yhVfnQh_awDBHkKJSSMzmE-oVmq6sLmnhk1tLhkQV3MjFKWDq10tV5P-zcUJ4sAVVcyk1Cq9q0YKBjiON3nv4pXg3YXNbEisc3NaDTjh6iJpQknQLqGBlPupX7Ng&amp;csui=3&amp;ved=2ahUKEwinrfHomruRAxXu3gIHHUfJJkcQgK4QegQIARAH">RDMA</a> <strong>over Converged Ethernet (RoCE)</strong> for efficient multi-node communication across Ethernet-based environments.</p></li></ul><p>This capability matters when you start thinking beyond single-node workloads.</p><p>For example, consider scaling LLM inference using disaggregated serving (e.g., NVIDIA Dynamo with SGLang) across multiple nodes. In this model, tokens are exchanged between multiple GPU workers selected dynamically based on their load. The system can scale workers up or down and parallelise <strong>prefill</strong> and <strong>generation</strong> phases independently, workflows that are highly sensitive to latency and interconnect performance.</p><p>With DGX Spark, this entire setup can be <strong>prototyped locally</strong>. 
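Before wiring multiple units together, a rough capacity sanity check helps. This is a sketch with illustrative numbers: 128 GB is each Spark's unified memory size, while the 10% overhead allowance for KV cache and activations is an assumption, not a measured figure.

```python
# Can a model's weights fit across N Sparks? Illustrative estimate only.

MEM_PER_SPARK_GB = 128  # unified memory per DGX Spark unit

def fits(model_gb: float, num_sparks: int, overhead_frac: float = 0.10) -> bool:
    """True if weights plus an assumed runtime overhead fit in combined unified memory."""
    usable_gb = num_sparks * MEM_PER_SPARK_GB
    return model_gb * (1 + overhead_frac) <= usable_gb

print(fits(200, 1))  # False: ~220 GB needed, 128 GB available
print(fits(200, 2))  # True: ~220 GB needed, 256 GB available
```
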
With two Sparks, one can effectively emulate a small AI cluster with 256 GB of unified memory, allowing developers to validate distributed inference and system behaviour before deploying the same architecture to DGX Cloud or a larger datacenter environment.</p><p>To round out the specs, the DGX Spark also ships with a fast 4 TB NVMe SSD (on NVIDIA&#8217;s Founders Edition) and runs a tuned Linux-based DGX OS that comes preloaded with NVIDIA&#8217;s AI software stack from the first boot.</p><div><hr></div><h2>Rounding up the Specs</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BuPD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BuPD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 424w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 848w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 1272w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!BuPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png" width="1456" height="1894" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1894,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BuPD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 424w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 848w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 1272w, https://substackcdn.com/image/fetch/$s_!BuPD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a80e40d-d99c-4957-9c74-015c1197e640_1674x2178.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10. A complete overview of the DGX Spark specifications. Source <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">NVIDIA</a></figcaption></figure></div><div><hr></div><h2>Software and Developer Experience</h2><p>I also find one of the biggest advantages of the DGX Spark to be the software ecosystem and experience it offers to AI Developers. The Spark comes  preconfigured with the full NVIDIA AI software stack through the DGX OS, so you have all the drivers, libraries, and frameworks ready to go. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Wqs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Wqs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 424w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 848w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Wqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png" width="1136" height="778" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:778,&quot;width&quot;:1136,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:305741,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Wqs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 424w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 848w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wqs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa491a7a5-2e9d-4d1e-aa78-2e54857971fe_1136x778.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 11. The DGX OS is prebuilt with all the required NVIDIA libraries and frameworks for AI Development. Third-party add-ons are simple plug-and-play.</figcaption></figure></div><p>This means the correct GPU drivers, CUDA toolkit, cuDNN, NCCL, TensorRT, Triton Inference Server, and key system-level optimizations are already installed and versioned as a single OS bundle. For AI developers, this removes a significant amount of setup friction.</p><p>Popular tooling, such as PyTorch, JAX, Hugging Face, Llama.cpp, Ollama, LMStudio, and ComfyUI, can be installed and run with minimal configuration. 
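Since much of what follows is expressed in tokens/sec, a quick rule of thumb helps interpret those numbers: decoding is memory-bandwidth-bound, so an upper bound on single-stream decode speed is roughly memory bandwidth divided by the bytes of weights streamed per token. A minimal sketch, where the 273 GB/s figure is Spark's published bandwidth and the active-parameter count and quantization width are illustrative assumptions:

```python
def decode_tok_per_s(bandwidth_gb_s: float, active_params_b: float,
                     bytes_per_param: float) -> float:
    """Rough upper bound on decode tokens/sec: each generated token must
    stream the model's active weights from memory at least once."""
    weight_gb = active_params_b * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weight_gb

# DGX Spark: ~273 GB/s unified-memory bandwidth.
# Illustrative MoE: ~3.6B active parameters at ~1 byte/param.
print(f"~{decode_tok_per_s(273, 3.6, 1.0):.0f} tok/s upper bound")
```

Real decode speeds land below this bound once KV-cache traffic and kernel overheads are accounted for, which is consistent with the tokens/sec ranges reported for MoE models in the benchmarks.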
For developers working with LLMs, multimodal models, or image generation pipelines, this helps reduce the time-to-first-experiment.</p><p>This experience is notably different from other NVIDIA Edge AI devkits such as <strong>Jetson Nano, Orin, or AGX Thor</strong>, which rely on JetPack OS. While JetPack is well-suited for embedded and edge deployments, it is more constrained, more tightly coupled to specific hardware configurations.</p><p>On top of the base platform, NVIDIA also provides a library of <strong>DGX Playbooks</strong>. These cover a broad range of real-world use cases, including multi-agent systems, multimodal pipelines, LLM fine-tuning, full pre-training workflows, image generation, model quantization, and production model serving. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2jTP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2jTP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 424w, https://substackcdn.com/image/fetch/$s_!2jTP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 848w, https://substackcdn.com/image/fetch/$s_!2jTP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2jTP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2jTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png" width="1456" height="1072" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1072,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:759769,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2jTP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 424w, https://substackcdn.com/image/fetch/$s_!2jTP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 848w, 
https://substackcdn.com/image/fetch/$s_!2jTP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 1272w, https://substackcdn.com/image/fetch/$s_!2jTP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F50c2ccd8-39c9-4e62-ba50-25bd28472333_3196x2354.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 12. 
The DGX Spark Playbooks collection, where you can test/benchmark your Spark unit on different applications and workloads.</figcaption></figure></div><div><hr></div><h2>The Benchmarks</h2><p>Below, I&#8217;ll present a set of DGX Spark inference benchmarks to illustrate Spark&#8217;s performance. While Spark delivers lower token decoding throughput than high-end discrete GPUs, it achieves strong prompt prefill performance thanks to the GB10 Blackwell chip and maintains decoding speeds that are practical for interactive use with most models.</p><blockquote><p><em><strong>Note:</strong></em> From a UX perspective, sustained generation speeds in the 60&#8211;70 tokens/sec range for a single session already feel responsive. Once requests are batched, throughput increases to 100 and even 200 tokens/sec.</p></blockquote><p>Although I mentioned above that Spark is not really an inference box, I&#8217;m still including these benchmarks to show the solid, and in places very good, performance the DGX Spark delivers in most of the use cases tested, especially with MoE models.</p><ol><li><p><strong><a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?gid=0#gid=0">DGX Spark inference on LLMs ranging from Llama 3.1 8B to GPT-OSS 120B</a></strong><br>The Spark is very fast at prefill, the compute-bound phase of LLM inference, thanks to the GB10 Superchip. Although slower in the decoding phase, which is memory-bound given Spark&#8217;s 273GB/s bandwidth, it still yields good TPS (60 tok/s for GPT-OSS 20B and 41 tok/s for the 120B variant).<br><br><em>* Plus, in these benchmarks, none of the models are quantised to NVFP4, which only Blackwell supports. 
Most models are in INT4, so some performance is still left on the table in these cases.</em></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lFlg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lFlg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 424w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 848w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 1272w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lFlg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png" width="1456" height="468" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:257693,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lFlg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 424w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 848w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 1272w, https://substackcdn.com/image/fetch/$s_!lFlg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8785921d-5ea8-4ba5-9b18-ea1d4b836bc9_1854x596.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 13. DGX Spark benchmarks using Ollama runner, across multiple models, dense and MoE GGUF variants.</figcaption></figure></div><ol start="2"><li><p><strong><a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?gid=0#gid=0">Mac Studio M1 Max and Mini M4 Pro on Llama 3.1 8B to GPT-OSS 120B</a><br></strong>The Mac Studio M1 Max and Mini M4 Pro don&#8217;t fit larger models and don&#8217;t support the CUDA ecosystem. 
Results on supported models might be similar in the decoding phase, but the DGX Spark outperforms them in the compute-bound prefill phase, <strong>by a wide margin in most cases.</strong></p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3MZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3MZh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 424w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 848w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 1272w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3MZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png" width="1456" height="604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:604,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316717,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3MZh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 424w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 848w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 1272w, https://substackcdn.com/image/fetch/$s_!3MZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F850c2191-de7d-4cda-b67f-419c0da01ba3_1852x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 14. The same model benchmarks on Mac Studio and Mac Mini M4 Pro.</figcaption></figure></div><ol start="3"><li><p><strong><a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?gid=0#gid=0">DGX Spark Inference compared to RTX5080 and 5090</a></strong></p><p>The 5080/5090 come with 16GB/32GB of VRAM, which is 8x/4x less than Spark&#8217;s 128GB of unified memory. 
Spark still yields good results in the prefill phase, 23.169 tok/s vs 30.982 tok/s on the RTX5090 and 28.927 on the RTX5080, while generation speed is capped by the memory bandwidth, making TG speeds on the Spark 2&#8211;3 times lower than on the RTX5080/5090.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KcMl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KcMl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 424w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 848w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 1272w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KcMl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323253,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KcMl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 424w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 848w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 1272w, https://substackcdn.com/image/fetch/$s_!KcMl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d7e0fc1-dc5b-458d-a5d8-56ae3b6cd04c_1850x720.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 15. 
Same subset of models benchmarked on the Spark, RTX5080, and 5090.</figcaption></figure></div><ol start="4"><li><p><strong>DGX Spark vs NVIDIA Jetson AGX Thor (VIDEO)</strong></p><ol><li><p>More verbose benchmarks on <a href="https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14912328">DGX Spark vs AGX Thor</a> using llama.cpp</p></li></ol></li></ol><div id="youtube2-PhJnZnQuuT0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;PhJnZnQuuT0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/PhJnZnQuuT0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ol start="5"><li><p><strong><a href="https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14966561">DGX Spark running Minimax-M2-230B</a><br></strong>* <em>Given this is a 230B model, 10-30 tok/s generation speeds are still pretty decent.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D0fn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D0fn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 424w, 
https://substackcdn.com/image/fetch/$s_!D0fn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 848w, https://substackcdn.com/image/fetch/$s_!D0fn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 1272w, https://substackcdn.com/image/fetch/$s_!D0fn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D0fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png" width="1456" height="1032" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1032,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:339324,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!D0fn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 424w, https://substackcdn.com/image/fetch/$s_!D0fn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 848w, https://substackcdn.com/image/fetch/$s_!D0fn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 1272w, https://substackcdn.com/image/fetch/$s_!D0fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184b1150-87b5-49ea-bd45-9d5ebaf81f20_1684x1194.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: llama.cpp</figcaption></figure></div></li><li><p><strong><a href="https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md">Extensive Benchmarking of DGX Spark on llama.cpp</a></strong></p><p>* The link above leads to a llama.cpp benchmark page where the community is benchmarking the DGX Spark on a variety of use cases. Models include GPT-OSS-20B, Gemma3-27B, GLM4.5, etc.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RmkE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RmkE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 424w, https://substackcdn.com/image/fetch/$s_!RmkE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 848w, https://substackcdn.com/image/fetch/$s_!RmkE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RmkE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RmkE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png" width="1456" height="1380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1380,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:511223,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RmkE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 424w, https://substackcdn.com/image/fetch/$s_!RmkE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 848w, 
https://substackcdn.com/image/fetch/$s_!RmkE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 1272w, https://substackcdn.com/image/fetch/$s_!RmkE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75e3adc6-93f1-4752-af38-d23a7a1378e4_1916x1816.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>My Own Benchmarks</h2><p>To ground my testing in realistic developer workloads, I benchmarked DGX Spark inference 
across four models that span from small multimodal models to large MoE models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><p>All tests were conducted using llama.cpp build 7474<strong> (commit f9ec8858e)</strong>, compiled with <code>-DGGML_CUDA=ON</code>. I used both <strong>llama-bench</strong> and <strong>llama-batched-bench</strong> to measure performance under sequential and batched inference workloads.</p><p>The models I&#8217;ve tested:</p><ul><li><p>Small multimodal (&#8804;5B): <strong><a href="https://huggingface.co/google/gemma-3-4b-it">Gemma 3 4B</a></strong></p></li><li><p>Mid MoE (10&#8211;30B): <strong><a href="https://mistral.ai/news/mistral-3">Ministral-3-14B-Instruct</a></strong></p></li><li><p>Large MoE (100B+): <strong><a href="https://huggingface.co/openai/gpt-oss-120b">GPT-OSS-120B</a></strong></p></li><li><p>Transformer Hybrid + MoE: <strong><a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16">Nemotron-Nano-3-30B-A3B</a></strong></p></li></ul><p><strong>Setup Details:</strong></p><p>The llama.cpp batching parameters I&#8217;ve used:</p><ul><li><p><strong>`-fa 1`</strong> - Enables FlashAttention kernels.</p></li><li><p><strong>`-ub 2048`</strong> - The physical (micro) batch size used for prompt processing.</p></li><li><p><strong>`-npp 4096,8192`</strong> - The number of prompt tokens to prefill; the benchmark runs once with a 4096-token prompt, then again with an 8192-token prompt.</p></li><li><p><strong>`-ntg 32`</strong> - The number of tokens to generate per request when measuring token-generation speed.
</p></li><li><p><strong>`-npl 1,2,4,8`</strong> - Concurrency/Parallel requests</p></li><li><p><strong>`--no-mmap`</strong> - Loads the model into memory, without any cached layers.</p></li></ul><h3><strong>The Results on DGX Spark</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XMRg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XMRg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 424w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 848w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 1272w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XMRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png" width="1456" height="835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:835,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1557226,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XMRg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 424w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 848w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 1272w, https://substackcdn.com/image/fetch/$s_!XMRg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cea908-ae74-422a-8e9d-aa32549f497d_2048x1175.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 18. llama.cpp benchmark using sequential requests, across Ministral-3-14B, GPT-OSS-120B, and Gemma3-4B.
All models are GGUFs with ~4BPW (Q4s)</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fSBO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fSBO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 424w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 848w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 1272w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fSBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png" width="1456" height="841" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:841,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1369240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fSBO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 424w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 848w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 1272w, https://substackcdn.com/image/fetch/$s_!fSBO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd90cd967-9c53-4fa9-bb3f-0c6b689fe26c_2061x1191.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 19. llama.cpp batched benchmark using parallel requests.
Same models.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T19Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T19Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 424w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 848w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 1272w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T19Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png" width="1085" height="982" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1085,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:578258,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/181503991?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!T19Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 424w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 848w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 1272w, https://substackcdn.com/image/fetch/$s_!T19Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b02f4f-f647-4ba3-9d52-bb6d0e260783_1085x982.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 20. 
Nemotron 3 Nano 30B-A3B MoE model benchmark on the DGX Spark.</figcaption></figure></div><h4>Other benchmark resources:</h4><ol><li><p><a href="https://sebastianraschka.com/blog/2025/dgx-impressions.html">DGX Spark and Mac Mini for Local AI Development</a> by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Sebastian Raschka, PhD&quot;,&quot;id&quot;:27393275,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61f4c017-506f-4e9b-a24f-76340dad0309_800x800.jpeg&quot;,&quot;uuid&quot;:&quot;2e3e11d9-212a-4f4d-8831-dd034efc9850&quot;}" data-component-name="MentionToDOM"></span></p></li><li><p><a href="https://www.youtube.com/watch?v=82SyOtc9flA">DGX Spark Review (Mac, DGX Spark, Strix Halo)</a> by <a href="https://www.youtube.com/@AZisk">Alex Ziskind</a></p></li></ol><div><hr></div><h2><strong>Availability and OEM systems</strong></h2><p>NVIDIA&#8217;s Founders Edition is available to order at $3,999 for the 4TB configuration. Alongside NVIDIA&#8217;s own unit, several GB10 desktops are arriving from the big OEMs. 
The core hardware is largely the same across OEMs: announced GB10-based systems include the Dell Pro Max, Lenovo ThinkStation PGX, Acer Veriton GN100, and ASUS Ascent GX10.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pxgf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pxgf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pxgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg" width="1456" height="970"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pxgf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pxgf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23e1b58-190d-4758-a0fd-5012020ea0eb_2048x1365.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 21. View of OEMs&#8217; versions of the DGX Spark from third-party providers. Source <a href="https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/?limit=15">NVIDIA</a></figcaption></figure></div><div><hr></div><h2>Conclusion</h2><p>The DGX Spark is best understood not as a faster GPU, but as a developer kit for building, validating, and testing large AI workloads locally.</p><p>It sits between discrete GPUs and full AI workstations within NVIDIA&#8217;s ecosystem, systems most AI developers rely on today - but packaged in a compact, desk-side form factor that fits naturally into a local development workflow.</p><p>Looking at the community benchmarks, most criticism of the DGX Spark comes from evaluating it against metrics that don&#8217;t match its intended use case. 
Many reviews frame it as an inference box or a replacement for high-end GPUs, but Spark was not designed to compete with large workstations purely on inference throughput.</p><p>While it doesn&#8217;t offer the memory bandwidth of high-end discrete GPUs, it still delivers practical performance on memory-bound workloads, with generation speeds that remain usable for interactive development.</p><p>More importantly, Spark is designed to align local AI development with how AI systems are actually built, whether that means working with a single large LLM, optimizing or quantizing models, fine-tuning and serving multiple smaller models, or building agentic systems that are eventually deployed at scale.</p><p>When judged on those terms, DGX Spark does exactly what it is designed to do.</p><blockquote><p><strong>Thanks NVIDIA for sending me a DGX Spark &#128154;</strong></p></blockquote><p><br>Thanks for reading Neural Bits &#128075;<br>I&#8217;ll have another update to share in the next issue, coming next week.</p><div><hr></div><h2>References</h2><p>[1] <em>NVIDIA DGX Spark</em>. (2025). NVIDIA. <a href="https://www.nvidia.com/en-us/products/workstations/dgx-spark/">https://www.nvidia.com/en-us/products/workstations/dgx-spark/</a></p><p>[2] <em>NVIDIA DGX Spark Benchmarks</em>. Google Docs. <a href="https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?gid=0#gid=0">https://docs.google.com/spreadsheets/d/1SF1u0J2vJ-ou-R_Ry1JZQ0iscOZL8UKHpdVFr85tNLU/edit?gid=0#gid=0</a></p><p>[3] Raschka, S. (2025, October 29). <em>DGX Spark and Mac Mini for Local PyTorch Development</em>. Sebastian Raschka, PhD. <a href="https://sebastianraschka.com/blog/2025/dgx-impressions.html">https://sebastianraschka.com/blog/2025/dgx-impressions.html</a></p><p>[4] Kennedy, P. (2025, October 14). <em>NVIDIA DGX Spark Review: The GB10 Machine is so Freaking Cool</em>. ServeTheHome.
<a href="https://www.servethehome.com/nvidia-dgx-spark-review-the-gb10-machine-is-so-freaking-cool/2/">https://www.servethehome.com/nvidia-dgx-spark-review-the-gb10-machine-is-so-freaking-cool/2/</a></p><p>[5] <em>llama.cpp/benches/dgx-spark/dgx-spark.md at master &#183; ggml-org/llama.cpp</em>. (2025). GitHub. <a href="https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md">https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md</a></p><p>[6] <em>NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference | LMSYS Org</em>. (2025). Lmsys.org. <a href="https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/">https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/</a></p><p>&#8204;[7] Mann, T. (2025, August 27). <em>Nvidia details its itty-bitty GB10 superchip for local AI development</em>. Theregister.com; The Register. <a href="https://www.theregister.com/2025/08/27/nvidia_blackwell_gb10/">https://www.theregister.com/2025/08/27/nvidia_blackwell_gb10/</a></p><p>[&#8204;8] <em>NVIDIA DGX Spark Review: The AI Appliance Bringing Datacenter Capabilities to Desktops</em>. (2025, November 13). StorageReview.com. <a href="https://www.storagereview.com/review/nvidia-dgx-spark-review-the-ai-appliance-bringing-datacenter-capabilities-to-desktops">https://www.storagereview.com/review/nvidia-dgx-spark-review-the-ai-appliance-bringing-datacenter-capabilities-to-desktops</a></p><p>&#8204;[9] <em>Performance of llama.cpp on NVIDIA DGX Spark &#183; ggml-org/llama.cpp &#183; Discussion #16578</em>. (2025, October 14). GitHub. <a href="https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14688238">https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14688238</a></p><p>[10] <em>NVIDIA/dgx-spark-playbooks: Collection of step-by-step playbooks for setting up AI/ML workloads on NVIDIA DGX Spark devices with Blackwell architecture.</em> (2025). GitHub. 
<a href="https://github.com/NVIDIA/dgx-spark-playbooks">https://github.com/NVIDIA/dgx-spark-playbooks</a></p>]]></content:encoded></item><item><title><![CDATA[An AI Engineer's Guide To Choosing GPUs]]></title><description><![CDATA[A deep dive on technical Hardware and Software details of NVIDIA GPUs for AI Workloads.]]></description><link>https://read.theaimerge.com/p/an-ai-engineers-guide-to-choosing</link><guid isPermaLink="false">https://read.theaimerge.com/p/an-ai-engineers-guide-to-choosing</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sun, 07 Dec 2025 14:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j6rm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. Join over <strong><a href="https://multimodalai.substack.com/subscribe">7000+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Most AI engineers use NVIDIA GPUs as their compute for AI workloads. Most know the names of the GPUs they use, but not the underlying details that matter for a deployable AI System.</p><p>The lineup runs from the RTX 3090/4090/5090 cards everyone trains their LoRA Adapters on, through the H100 that powered and still powers LLM clusters, to the new Blackwell B100+ chips entering data centers specifically for Gen AI Training and Inference at scale.</p><p>There are a lot of options and configurations. 
But knowing the name of a GPU will not tell you the most important thing: </p><blockquote><p>GPUs are not monolithic products. </p></blockquote><p>They&#8217;re systems composed of layers:</p><ul><li><p>A <strong>microarchitecture</strong> (e.g., Pascal, Ampere, Hopper, Blackwell) that defines the underlying chip features, such as which precision formats and Tensor Core features exist.</p></li><li><p>A <strong>memory subsystem</strong> that determines how fast model weights and activations can move.</p></li><li><p>A <strong>form factor and interconnect</strong> (PCIe, SXM, NVLink) that indicate whether GPUs can scale together while using their full capacity.</p></li></ul><p>This guide breaks down the internal logic of NVIDIA&#8217;s GPU lineup, as I find it useful from an AI Engineer&#8217;s lens:</p><blockquote><p><em>How does architecture map to capability? How do memory and interconnect constrain or enable AI workloads? And how do consumer GPUs differ from data-center GPUs beyond price and marketing?</em></p></blockquote><p><em><strong>To better navigate this article, please use the Table of Contents on the left side.</strong></em></p><div><hr></div><h2>Fun Personal Story - My First GPU</h2><p>My first ever GPU was an NVIDIA 7300 GT, with 256 MB of VRAM and a 128-bit bus. 
Nowadays, even a microwave is more powerful than this chip was in 2008 when I got it with my first ever Desktop PC my grandmother bought me.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wfnB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wfnB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 424w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 848w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 1272w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wfnB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png" width="1456" height="858" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:858,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5502639,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wfnB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 424w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 848w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 1272w, https://substackcdn.com/image/fetch/$s_!wfnB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd069f6c3-9bbd-4d64-9cc7-03ba08de87b3_3264x1924.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I remember trying to run Grand Theft Auto 4 on my PC back then, and the game wouldn&#8217;t even start; I think rendering the first frame of the Rockstar Games logo was too much for this little fella. I remember trying to convince my parents to buy me an NVIDIA 9500 GT, as one of my friends had one and his PC ran the game on High settings at 1280x1024. That was way beyond what they could afford at the time.</p><p>You can imagine I spent most of my time at his place whenever I got the chance. In the end, with a lot of tweaks, I managed to play it a bit on my own PC, at 340x280 resolution with everything on Very Low settings. 
</p><p>I even remember modifying the game&#8217;s internal .ini files in Windows/ProgramFiles, tweaking DirectX 9.0 and disabling every graphics feature I could find in there, guided by every tutorial available at the time, each one taking a solid few minutes for pages and videos to load over my 40 kb/s dial-up internet via the phone line.</p><p>It looked somewhat like this, but with way blurrier pixels and a 12-13 FPS max, with the GPU fan working overtime at 70-80 degrees C.</p><p>But hey, I could play it. :) </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SFQW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SFQW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 424w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 848w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 1272w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SFQW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png" width="1456" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6556459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SFQW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 424w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 848w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 1272w, https://substackcdn.com/image/fetch/$s_!SFQW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5329651-6183-4ef1-9417-98bc08a41a41_3570x2050.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1. GTA 4 running on an NVIDIA 7300 GT at 10 FPS, Very Low settings, with an Intel Core i5 and 8GB RAM. Sourced from <a href="https://www.youtube.com/watch?v=sg_VkO_8GIE">YouTube</a>.</figcaption></figure></div><p>Interestingly enough, it was back then that I started finding out about NVIDIA SLI, different GPU families, VRAM, and memory. I didn&#8217;t know, or want to understand, what these were. 
My whole goal was to run this game that everyone at school talked about on my PC, so that I could join the &#8220;group&#8221;.</p><p>Getting back to nowadays, you could easily run way better-looking games directly on your phone, at a smooth 30+ FPS, without draining your phone battery.</p><p>The picture I wanted to paint with this story is that GPUs, Graphics, Supercomputers, AI Compute and technology overall have come a very long way. Now, compute is faster, bigger, more energy-efficient and cheaper than ever.</p><div><hr></div><h2>1. Deep Learning Started on 2 x GTX 580</h2><p>On a recent <a href="https://www.youtube.com/watch?v=3hptKYix4X8">Joe Rogan podcast episode, Jensen Huang</a> brought up a moment in deep learning history that&#8217;s easy to forget now. In 2012, Alex Krizhevsky and Ilya Sutskever trained AlexNet, the Image Classification model that ended up beating all the existing Computer Vision algorithms at the time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HLEV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HLEV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 424w, https://substackcdn.com/image/fetch/$s_!HLEV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 848w, 
https://substackcdn.com/image/fetch/$s_!HLEV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 1272w, https://substackcdn.com/image/fetch/$s_!HLEV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HLEV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png" width="1456" height="1190" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1190,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2167621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HLEV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 424w, 
https://substackcdn.com/image/fetch/$s_!HLEV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 848w, https://substackcdn.com/image/fetch/$s_!HLEV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 1272w, https://substackcdn.com/image/fetch/$s_!HLEV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e5d9d8a-194e-4823-ad2f-9b666e33c2c9_1672x1366.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2. A screenshot from the recent Joe Rogan Podcast episode featuring Jensen Huang, the CEO of NVIDIA.</figcaption></figure></div><p>They did that using <strong>2 x NVIDIA GTX 580 gaming GPUs</strong>, with 3 GB of VRAM each, to run their fast convolutions. That was their whole setup.</p><p>The code, <strong><a href="https://code.google.com/archive/p/cuda-convnet/">cuda-convnet</a></strong>, was good enough that for several years it was the industry standard and powered the first couple of years of the deep learning boom. That success in 2012 hinted that AI progress was going to depend heavily on GPU hardware.</p><p>But hardware is only half of the picture. If you write or deploy modern AI models, you&#8217;re almost certainly doing it on NVIDIA hardware. That&#8217;s not just about FLOPs or how large the GPU VRAM is; equally important is the Software stack: the low-level libraries, frameworks, and SDKs that allow AI Engineers to train, optimize, and deploy their AI Models.</p><blockquote><p>As an AI engineer, your life is much easier if you understand how NVIDIA organizes its GPU stack.</p></blockquote><p>This piece is a practical map of that stack, starting Hardware First.</p><ul><li><p><strong>Software view:</strong> compute capability and CUDA features</p></li><li><p><strong>Architecture view:</strong> Ampere &#8594; Hopper &#8594; Blackwell</p></li><li><p><strong>Hardware view:</strong> PCIe vs SXM, NVLink, and when they matter</p></li></ul><div><hr></div><h2>2. Understanding Compute Capability</h2><p>Every NVIDIA GPU has a <strong>Compute Capability (CC)</strong> like <code>7.0</code>, <code>8.9</code>, <code>9.0</code>, etc. This number defines which instructions, CUDA Cores, Tensor Cores, memory ops, and features a GPU supports. 
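As a quick illustration, the way these CC numbers map to architecture families can be sketched in a few lines of Python. This is an illustrative snippet, not an NVIDIA API: the mapping values follow NVIDIA's public compute capability tables, and arch_for_cc is a hypothetical helper.

```python
# Illustrative sketch: map a compute capability "major.minor" string to
# its NVIDIA architecture family (values per NVIDIA's public CC tables).
CC_TO_ARCH = {
    "6": "Pascal",          # all 6.x chips are Pascal
    "7.0": "Volta",
    "7.5": "Turing",
    "8.0": "Ampere",
    "8.6": "Ampere",
    "8.7": "Ampere",
    "8.9": "Ada Lovelace",
    "9.0": "Hopper",
    "10.0": "Blackwell",    # B100/B200 data-center chips
    "12.0": "Blackwell",    # RTX 50-series consumer chips
}

def arch_for_cc(cc: str) -> str:
    """Return the architecture family for a compute capability string."""
    # Try the exact "major.minor" key first, then fall back to the major version.
    return CC_TO_ARCH.get(cc) or CC_TO_ARCH.get(cc.split(".")[0], "unknown")

print(arch_for_cc("8.9"))  # an RTX 4090 reports CC 8.9 -> Ada Lovelace
print(arch_for_cc("9.0"))  # an H100 reports CC 9.0 -> Hopper
```

Feeding this the compute_cap value that nvidia-smi reports (shown further below) tells you which architecture family, and therefore which feature set, your GPU belongs to.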
Simply put, the CC number defines the set of hardware features per GPU architecture.</p><p>For instance, if we analyze the table below, we&#8217;ll see the CC number associated with each family of GPU chips, from the older Tesla GPUs up to the latest Blackwell chips designed for AI.</p><blockquote><p>The 7300 GT I had in 2008 came from the generation just before the Tesla family of architectures. Interestingly enough, a slightly modified version of a GPU from that era, the 7800 GTX, called the <a href="https://en.wikipedia.org/wiki/RSX_Reality_Synthesizer">RSX (Reality Synthesizer)</a>, was used in the PlayStation 3. </p><p>That chip was developed jointly by Sony and NVIDIA.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LKjh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LKjh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 424w, https://substackcdn.com/image/fetch/$s_!LKjh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 848w, https://substackcdn.com/image/fetch/$s_!LKjh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LKjh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LKjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png" width="1456" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:651564,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LKjh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 424w, https://substackcdn.com/image/fetch/$s_!LKjh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 848w, 
https://substackcdn.com/image/fetch/$s_!LKjh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 1272w, https://substackcdn.com/image/fetch/$s_!LKjh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5e5c824-4581-4dc9-a26b-fead4f9dfe4b_1966x1166.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. 
How CC maps to GPU Architectures, indicating a range of Compute Capability scores for each CUDA SDK version. Image taken from Wikipedia, with added annotations.</figcaption></figure></div><p>If you own an NVIDIA GPU, you can see the CC by running this in your terminal:</p><pre><code>nvidia-smi --query-gpu=name,compute_cap --format=csv</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fGSt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fGSt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fGSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png" width="1432" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1432,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:371800,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fGSt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 424w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 848w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 1272w, https://substackcdn.com/image/fetch/$s_!fGSt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6904ca9d-2101-46a2-9a5c-fe66b514f424_1432x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4. 
The compute capability (CC) and other nvidia-smi details of my RTX4080 GPU, after executing the query above.</figcaption></figure></div><p>A few things are tightly coupled to compute capability:</p><ul><li><p><strong>Tensor Cores &amp; precision formats</strong></p><ul><li><p>Ampere (A100, RTX 30XX): TF32 + FP16 Tensor Cores</p></li><li><p>Hopper (H100): adds FP8 via the Transformer Engine.</p></li><li><p>Blackwell (B100/B200): pushes further to FP4/NVFP4 for inference.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yZRd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yZRd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 424w, https://substackcdn.com/image/fetch/$s_!yZRd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 848w, https://substackcdn.com/image/fetch/$s_!yZRd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 1272w, https://substackcdn.com/image/fetch/$s_!yZRd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yZRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png" width="1442" height="548" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:548,&quot;width&quot;:1442,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157634,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yZRd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 424w, https://substackcdn.com/image/fetch/$s_!yZRd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 848w, https://substackcdn.com/image/fetch/$s_!yZRd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yZRd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7f88591-b84d-448d-87a6-66efc37d00b6_1442x548.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. An example on how TensorCore composition ties to Compute Capability. For each CC, the TensorCore configuration are different, more optimized. 
Image from wikipedia.</figcaption></figure></div><ul><li><p><strong>Memory - </strong>Newer CCs support HBM2E/HBM3/HBM3e, larger memory, and faster NVLink generations.</p></li><li><p><strong>CUDA &amp; library support - </strong>At some point, new CUDA features stop backporting to older CCs.</p></li></ul><p>The rule of thumb when analyzing GPUs is the higher the CC, the more &#8220;native&#8221; support you get for modern AI features (FP8/FP4, better sparsity, bigger memory, new interconnects). The following diagram is an overview on GPU Architecture families and models, from Consumer GPUs to Data Center GPUs, and how these are tied to the Compute Capability score.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SQB1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SQB1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 424w, https://substackcdn.com/image/fetch/$s_!SQB1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 848w, https://substackcdn.com/image/fetch/$s_!SQB1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SQB1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SQB1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png" width="1190" height="1416" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1416,&quot;width&quot;:1190,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:741307,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SQB1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 424w, https://substackcdn.com/image/fetch/$s_!SQB1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 848w, 
https://substackcdn.com/image/fetch/$s_!SQB1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 1272w, https://substackcdn.com/image/fetch/$s_!SQB1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37ccf6c1-d297-47c1-ad75-4b07dd24ee80_1190x1416.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6. Showcasing the GPU Architecture and CC link in a broader view, with GPU Models included. 
Image from Wikipedia, with added annotations.</figcaption></figure></div><p>To summarize this section, Compute Capability tells you which hardware features a GPU actually supports, and whether your kernels will run at full speed. VRAM, FLOPs, and interconnect matter, but only once the capability makes those features usable.</p><p>After CC, the next layer in understanding a GPU&#8217;s performance is the technical cheatsheet, from which we extract details such as connectors, FLOPs, memory bandwidth, and more.</p><div><hr></div><h2>3. Understanding a Technical Cheatsheet</h2><p>After understanding CC, a GPU cheatsheet is another key reference for an AI engineer digging into hardware and software optimization details. In a technical cheatsheet, an engineer will find metrics on GPU performance, power usage, the number of FLOPs in different precision formats (FP32/FP16), and the GPU form factor.</p><p>The latter is important for building compute clusters, where multiple GPUs have to be connected and share the resource pool. 
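</p>

<p>The CC-to-precision groupings above can be sketched as a small lookup, handy for gating precision choices in code. This is a minimal sketch following the architecture groupings in this article, not an exhaustive feature table; in PyTorch, <code>torch.cuda.get_device_capability()</code> returns the same <code>(major, minor)</code> pair for the local GPU:</p>

```python
def supported_precisions(major: int, minor: int) -> set:
    """Map a CUDA compute capability (CC) to the precision formats it unlocks."""
    caps = {"fp32", "fp16"}            # available on all modern CCs
    if (major, minor) >= (8, 0):       # Ampere: TF32 + BF16 Tensor Cores
        caps |= {"tf32", "bf16"}
    if (major, minor) >= (8, 9):       # Ada (8.9) and Hopper (9.0): FP8
        caps.add("fp8")
    if (major, minor) >= (10, 0):      # Blackwell: FP4/NVFP4 for inference
        caps.add("fp4")
    return caps

# The RTX 4080 reports CC 8.9: FP8 is available, FP4 is not
print(supported_precisions(8, 9))
```

<p>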
A cheatsheet allows you to quickly answer some of these questions:</p><ul><li><p><em>Will this GPU support the precision modes?</em></p></li><li><p><em>Does it have enough VRAM and Bandwidth?</em></p></li><li><p><em>Is inter-GPU bandwidth high enough for model parallelism?</em></p></li><li><p><em>Will this deploy cleanly in my existing hardware stack?</em></p></li></ul><p>In the following image, let&#8217;s inspect the Technical Cheatsheet for the Hopper H200 GPU, covering a few key details on FLOPs and explain the difference between the form factors such as SXM or PCIe.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j6rm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j6rm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 424w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 848w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!j6rm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png" width="1456" height="1162" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1162,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1538182,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j6rm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 424w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 848w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!j6rm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddb54404-1235-4c3c-8910-bc1e24e33cc7_1726x1378.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7. The NVIDIA H200 GPU Technical Cheatsheet with annotated examples and images showcasing the difference between PCIe and SXM form factors.</figcaption></figure></div><p>From the cheatsheet, an AI Engineer would likely look first at the GPU Memory, Bandwidth and FLOPS for a specific Precision Type which directly impacts the speed of AI model training and inference.</p><p>For this specific GPU model, a single H200 GPU features 141GB of memory, with a 4.8TB/s bandwidth. 
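</p>

<p>Those two numbers feed a useful back-of-envelope check: LLM token generation is usually memory-bandwidth-bound, because the model weights must be streamed from VRAM for every decoded token. A minimal sketch of that lower bound (it ignores KV-cache traffic and kernel overheads, and the 70B/FP8 figures are illustrative, not from the cheatsheet):</p>

```python
def min_decode_ms(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Memory-bound lower bound on per-token decode latency:
    every weight is read from VRAM once per generated token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3

# H200: 4.8 TB/s = 4800 GB/s; a 70B-parameter model quantized to FP8 (1 byte/param)
print(min_decode_ms(70e9, 1.0, 4800))  # ~14.6 ms per token, i.e. at most ~68 tokens/s
```

<p>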
For vision-based workloads, which may involve real-time AI inference, this GPU features NVDEC, which decodes video on the GPU and delivers frames as tensor-ready structures instead of routing them through the CPU.</p><h3>MIG - Multi Instance GPU</h3><p>Another important detail is MIG (Multi-Instance GPU), which allows engineers to shard a single physical GPU into multiple virtual GPU instances, each in an isolated space. </p><blockquote><p>For instance, a single H200 could be split into 4 MIG instances, each with roughly 35GB of VRAM. That means four different AI engineers could work in separate environments, each with their own workload.</p><p>Think of a multi-agent system, with multiple LLMs, each within its own VRAM and GPU boundary, working simultaneously on different tasks.</p></blockquote><p>During the experimentation phase of model training, MIG can also come in handy for running the same experiment with multiple configurations or optimization profiles. One MIG instance could run inference quantized to FP8 at batch size 32, while another runs FP4 at batch size 64.</p><h3>Form Factor - SXM or PCIe</h3><p>Let&#8217;s focus on the form factor, as this also impacts GPU performance. In this cheatsheet, there are two form factors: PCIe and SXM. PCIe (Peripheral Component Interconnect Express) is the interface standard common for consumer GPUs.</p><p>In the attached image, there is a gaming PC motherboard featuring a PCIe 5.0 x16 slot for the GPU, think RTX 4080/4090/5090. On the other hand, SXM is a socketed module mounted onto the motherboard directly, and used in data-center clusters.</p><blockquote><p>For example, an H200 DGX server contains 8 x H200 GPUs. 
These are not connected via PCIe, but on SXM directly and connected via NVLink.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N2ce!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N2ce!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 424w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 848w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N2ce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png" width="1456" height="1098" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1098,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1747799,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180877610?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N2ce!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 424w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 848w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 1272w, https://substackcdn.com/image/fetch/$s_!N2ce!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F332d4bc9-65b0-4058-aad3-1505af68539b_1618x1220.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8. A close-up of H200 SXM form-factor GPUs (left) and PCIe form-factor GPUs (right). Below, an image of how the chips look on their boards.</figcaption></figure></div><p>With SXM, you get a higher power budget, which means higher sustained clock speeds, and direct GPU-GPU links through the NVLink switches. This is important for training or serving large models, as AI engineers can take full advantage of parallelization techniques such as Tensor Parallelism or Pipeline Parallelism with low-latency GPU-GPU communication.</p><p>For example, the H100 SXM variants can participate in NVLink/NVSwitch topologies where 16 GPUs share hundreds of GB/s of bidirectional bandwidth. 
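</p>

<p>To make "hundreds of GB/s" concrete for tensor parallelism: a bandwidth-optimal ring all-reduce moves 2*(N-1)/N of the tensor through each GPU's link. A rough sketch, assuming an illustrative ~450 GB/s per direction per GPU and ignoring link latency and switch contention:</p>

```python
def ring_allreduce_us(tensor_bytes: float, n_gpus: int, link_gb_s: float) -> float:
    """Time for a bandwidth-optimal ring all-reduce: each GPU sends and
    receives 2 * (N - 1) / N of the tensor over its own link."""
    traffic = 2 * (n_gpus - 1) / n_gpus * tensor_bytes
    return traffic / (link_gb_s * 1e9) * 1e6

# 8-way tensor parallel, one 64 MB FP16 activation all-reduce over NVLink
print(ring_allreduce_us(64e6, 8, 450))  # ~249 microseconds per all-reduce
```

<p>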
Multi-GPU clusters are generally used for training and inferencing large dense LLMs and MoE Models as token exchanges and activations in MoE networks require fast GPU-GPU communication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nGbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nGbJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 424w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 848w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 1272w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nGbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png" width="1456" height="814" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;H100/B100: Analyze of 5 Different Network Fabrics Types. Debate Between  Infiniband and Ethernet&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="H100/B100: Analyze of 5 Different Network Fabrics Types. Debate Between  Infiniband and Ethernet" title="H100/B100: Analyze of 5 Different Network Fabrics Types. Debate Between  Infiniband and Ethernet" srcset="https://substackcdn.com/image/fetch/$s_!nGbJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 424w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 848w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 1272w, https://substackcdn.com/image/fetch/$s_!nGbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a66c8da-7d32-4430-8b5c-88d05ce14e97_1704x953.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9: NVIDIA NCCL 16-GPU topology. Sourced from <a href="https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/">NVIDIA</a></figcaption></figure></div><h3>What is NVLink</h3><p>To understand NVLink and NVSwitch, we can first look at the older connector called SLI. The two GTX 580 GPUs used to train AlexNet in 2012 were linked with an SLI Bridge to enable faster computation and data sharing between the two. 
SLI came during the era of Gaming, during which NVIDIA was selling consumer-first GPUs to render graphics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cr0n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cr0n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cr0n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg" width="425" height="326" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:425,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NVIDIA SLI &#8212; Vikipediya&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NVIDIA SLI &#8212; Vikipediya" title="NVIDIA SLI &#8212; Vikipediya" srcset="https://substackcdn.com/image/fetch/$s_!Cr0n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Cr0n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7aa93043-ede0-4373-b5f4-48199f7f38cf_425x326.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10: NVIDIA GeForce GPUs connected using an SLI Bridge. Sourced from Wikipedia.</figcaption></figure></div><p>NVLink is the successor to SLI, designed for AI workloads.</p><p><strong>For Desktops (PCIe Cards):</strong> GPUs are connected using a physical external bridge called an NVLink Bridge. This is a compact, solid PCB-based connector that plugs into dedicated NVLink ports on the top edge of two adjacent GPU cards, similar to the older SLI bridge.</p><p><strong>For Servers (SXM Modules):</strong> In high-density server environments (like NVIDIA&#8217;s DGX systems), the NVLink connections are integrated directly into the multi-GPU baseboard. 
The GPU modules (SXM form factor) plug into this baseboard, making the physical NVLink connection an internal part of the server structure.</p><p>For instance, below is an image of 2 x A100 PCIe GPUs, connected with NVLink bridges.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QkDQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QkDQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QkDQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg" width="490" height="508.2291666666667" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:672,&quot;resizeWidth&quot;:490,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;NVIDIA H100 NVL - buy at Digitec&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="NVIDIA H100 NVL - buy at Digitec" title="NVIDIA H100 NVL - buy at Digitec" srcset="https://substackcdn.com/image/fetch/$s_!QkDQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 424w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 848w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!QkDQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5876a48b-9134-41c8-9e6c-65eae050c174_672x697.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 11: Two NVIDIA A100 GPUs in PCIe form-factor, connected using NVLink Bridges.</figcaption></figure></div><div><hr></div><h2>4. How to Choose a GPU as an AI Engineer</h2><p>A typical AI engineering workflow is highly dependent on specialized hardware to accelerate model training and inference. While the majority of this work runs on cloud compute platforms, many teams, especially those dealing with extremely sensitive data or specialized needs, still utilize on-premise compute clusters. 
Regardless of the deployment environment, the decision on which GPU to use should be based on a well-researched plan.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/an-ai-engineers-guide-to-choosing?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">If you&#8217;ve found this article helpful, share it on your feed!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/an-ai-engineers-guide-to-choosing?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/an-ai-engineers-guide-to-choosing?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>The common deployment environments for AI engineers are:</p><ul><li><p><strong>Cloud Compute:</strong> Services like AWS, Azure, and GCP, or NVIDIA&#8217;s own DGX Cloud, offer scalable, pay-as-you-go access to top-tier hardware (e.g., NVIDIA H100s). Niche providers like <strong>LambdaCloud</strong> or <strong>RunPod</strong> also offer compelling alternatives. </p></li><li><p><strong>On-Premise Labs:</strong> Engineers working in private data centers or dedicated labs have full control over the hardware, often using <strong>NVIDIA DGX</strong> or <strong>HGX</strong> systems. 
</p></li></ul><blockquote><p>On-premise is the current pick for most major AI labs: OpenAI, Anthropic, X, and Meta have all bought DGX clusters or placed large orders of GPUs with NVIDIA to build their own data centers.</p><p>That&#8217;s because AI research is experiment-heavy: you might run 100 experiments of which 70 fail, and paying for on-demand resources, cold starts, and provisioning on large cloud clusters gets quite expensive.</p></blockquote><p>When comparing specific GPU SKUs, whether in the cloud or on-premise, engineers often evaluate them based on three technical pillars:</p><ol><li><p><strong>Compute Capability (Hardware and Software)</strong></p></li></ol><p>For NVIDIA, the CC metric dictates the low-level <strong>features</strong> a GPU supports, covering precision types and Tensor Core or CUDA Core configurations.</p><ol start="2"><li><p><strong>Usable Memory (VRAM &amp; Bandwidth)</strong></p></li></ol><p>VRAM is the amount of memory available, and bandwidth is how fast data can be moved in and out of it. Although LLMs tend to get smaller, with 12B and 30B-parameter models now being quite capable, loading such models in their pre-training BF16 precision still requires a lot of VRAM.</p><p>Bandwidth is another key performance aspect. Training or fine-tuning LLMs involves many read/write operations across all of the GPU&#8217;s memory, not only its VRAM. 
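</p><p>To make the memory pillar concrete, here is a back-of-envelope sketch of the VRAM needed just to hold model weights. This is an illustration: the per-precision byte sizes are standard, but the 20% overhead factor is an assumption, and KV cache, activations, and optimizer states come on top.</p>

```python
# Rough VRAM estimate for holding model weights only.
# The 20% overhead factor (CUDA context, framework buffers) is an assumption;
# KV cache, activations, and optimizer states add more on top of this.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_vram_gb(params_billion: float, dtype: str, overhead: float = 0.20) -> float:
    """GB of VRAM needed to hold the weights of a model with `params_billion` parameters."""
    raw_gb = params_billion * BYTES_PER_PARAM[dtype]  # 1B params ~ 1 GB per byte/param
    return raw_gb * (1 + overhead)

# A 30B model in its pre-training BF16 precision barely fits a single 80 GB GPU:
print(round(weight_vram_gb(30, "bf16"), 1))  # -> 72.0
```

<p>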
A GPU also has SRAM and registers, which are used whenever data computed by one kernel needs to be cached for the execution of another kernel, or copied back to VRAM and made accessible to the CPU.</p><p>Most last-gen GPUs have HBM (High Bandwidth Memory), which is better tuned for AI workloads than the GDDR-X memory used in consumer-grade GPUs.</p><ol start="3"><li><p><strong>Interconnect (Communication)</strong></p></li></ol><p>This defines how fast GPUs can communicate with each other, which is crucial for distributed training, as most models are not trained or fine-tuned on a single GPU but on a multi-GPU cluster.</p><blockquote><p><strong>Note</strong>: Mixtral 8x7B (MoE) was trained from scratch on 240 x H100 GPUs, a setup similar to most LLM pre-training runs. </p></blockquote><p>The key distinction here is the connection interface: the PCIe standard versus SXM with NVLink, with the latter being the go-to for large-scale LLM training in distributed setups.</p><p>Evaluating GPU options along these three pillars, software capability, memory, and interconnect, quickly filters the candidates and lets you tune the system to the specific requirements of your workloads.</p><div><hr></div><h2>5. Closing Thoughts</h2><p>The AI world moves fast, but the underlying questions don&#8217;t change:</p><ul><li><p><em>Can my GPU run the kernels I need?</em> &#8594; <strong>CC and Architecture</strong></p></li><li><p><em>Can I fit my model and batch size?</em> &#8594; <strong>VRAM &amp; Memory Type &amp; Bandwidth</strong></p></li><li><p><em>Can my GPUs communicate fast enough?</em> &#8594; <strong>PCIe &amp; SXM</strong></p></li></ul><p>Ultimately, the right choice for an AI engineer comes down to matching these core needs to the right tool, ecosystem, and scalability requirements. 
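</p><p>To put rough numbers on the interconnect question, here is a sketch of idealized transfer times. The bandwidth figures are approximate peak numbers (per direction for PCIe, aggregate NVLink on an A100), and real-world throughput is lower.</p>

```python
# Approximate peak bandwidths in GB/s; real-world throughput is lower.
BANDWIDTH_GB_S = {"pcie_gen4_x16": 32, "pcie_gen5_x16": 64, "nvlink_a100": 600}

def transfer_seconds(payload_gb: float, link: str) -> float:
    """Idealized time to move `payload_gb` of data over the given link."""
    return payload_gb / BANDWIDTH_GB_S[link]

# Moving 140 GB of BF16 weights (roughly a 70B model) between GPUs:
for link in BANDWIDTH_GB_S:
    print(f"{link}: {transfer_seconds(140, link):.2f} s")
```

<p>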
Establishing the requirements of the AI workload you&#8217;re working on (pre-training, fine-tuning, or inference) greatly simplifies the process of selecting the right compute.</p><p>Thank you for reading. If you&#8217;ve enjoyed this article, consider subscribing for free.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[My Best Recent Guides for AI Engineers]]></title><description><![CDATA[A curated list of the most actionable guides I&#8217;ve published in the past months.]]></description><link>https://read.theaimerge.com/p/my-best-recent-guides-for-ai-engineers</link><guid isPermaLink="false">https://read.theaimerge.com/p/my-best-recent-guides-for-ai-engineers</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 29 Nov 2025 14:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bKK3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. 
Join over <strong><a href="https://multimodalai.substack.com/subscribe">7000+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>In this week&#8217;s edition, I want to share three standout articles from my recent work that cover a mix of AI topics, including optimising and serving AI models both at scale and on Edge Hardware, and a few best practices for building Python Backends for your RAG or Agentic pipelines.</p><p>For each: a summary, why it&#8217;s valuable, and what you&#8217;ll learn by reading it.</p><div><hr></div><h2>1. Use <em>These 6 Advanced Concepts with FastAPI</em></h2><p>This article introduces six advanced concepts you could use when building backends and APIs with FastAPI. It&#8217;s especially oriented at developers working with FastAPI and Pydantic, to help structure and manage AI or ML applications more robustly. 
</p><p>Topics include model lifecycle (loading/unloading), state management, dependency injection, configuration management, and best practices for production-ready code.</p><p><strong>Why it&#8217;s a good resource:</strong></p><ul><li><p>It brings engineering discipline to AI projects, showing you how to use the FastAPI app lifecycle hook, properly work with Pydantic configurations, and write robust, scalable Python code.</p></li><li><p>It covers some pitfalls around code structure, configuration, and how to efficiently use the DTO (<strong>D</strong>ata <strong>T</strong>ransfer <strong>O</strong>bject) design pattern with Pydantic to exchange data between your service layers.</p></li></ul><p><strong>What you&#8217;ll learn:</strong></p><ul><li><p>How and when to use dependency-injection in FastAPI for your ML/AI applications.</p></li><li><p>Best practices for model loading/unloading, managing state, and building maintainable code.</p></li><li><p>How to manage FastAPI&#8217;s application state.</p></li><li><p>How to properly use PydanticSettings for your secrets and environment variables.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2KwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2KwQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!2KwQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 
848w, https://substackcdn.com/image/fetch/$s_!2KwQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!2KwQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2KwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2KwQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 424w, https://substackcdn.com/image/fetch/$s_!2KwQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 848w, 
https://substackcdn.com/image/fetch/$s_!2KwQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!2KwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;49595a9a-e4a3-4a35-a420-093868fb0e43&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;6 advanced concepts to use with FastAPI &amp; Pydantic&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98b89ac-97e9-4875-88b6-2a5039668cb2_1700x1700.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-05-31T07:01:09.874Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!2KwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40ff9f5f-dc81-4e09-8e99-bd6c20f09a5e_2400x2400.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/start-using-these-6-advanced-concepts&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:163543189,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:36,&quot;comment_count&quot;:5,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!PLDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f14526-b620-40f9-9baf-78669ebe4997_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>2. 
An AI Engineer&#8217;s Guide to Inference Engines and Frameworks</h2><p>This guide describes the most popular Inference Engines and Frameworks an AI Engineer would use to port trained models into deployed, usable systems. Be it classical Deep Learning Models, LLMs, Finetuned LLMs or VLMs, this guide explains inference engines, serving frameworks, and what it takes to reliably serve AI models in production (latency, throughput, scalability, infrastructure).</p><p><strong>Why it&#8217;s a good resource:</strong></p><ul><li><p>Provides a comprehensive overview of all the libraries and frameworks for AI Inference, used within Industry.</p></li><li><p>It&#8217;s practical and engineering-oriented: useful for anyone building real-world AI applications that aim to deploy at any scale.</p></li></ul><p><strong>What you&#8217;ll learn:</strong></p><ul><li><p>The model-to-production lifecycle.</p></li><li><p>How to select the Inference Engine and Framework for your model size, scale, system architecture and latency constraints.</p></li><li><p>The scope and inner workings of each engine and framework.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bKK3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bKK3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 424w, https://substackcdn.com/image/fetch/$s_!bKK3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 
848w, https://substackcdn.com/image/fetch/$s_!bKK3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!bKK3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bKK3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1729560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/180238573?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bKK3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 424w, 
https://substackcdn.com/image/fetch/$s_!bKK3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 848w, https://substackcdn.com/image/fetch/$s_!bKK3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!bKK3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc712027e-2166-4f5b-bb07-4cf8645f4f5b_3750x3750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;855da625-ef5b-4fc6-87db-41f109ed335b&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The AI Engineer's Guide to Inference Engines and Frameworks&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98b89ac-97e9-4875-88b6-2a5039668cb2_1700x1700.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-08-21T08:01:13.718Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!1oH-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/the-ai-engineers-guide-to-inference&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:171119788,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:31,&quot;comment_count&quot;:3,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural 
Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!PLDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f14526-b620-40f9-9baf-78669ebe4997_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>3. <em>An AI Engineer&#8217;s Guide to Running LLMs on CPUs, GPUs and Edge (llama.cpp / GGML / GGUF)</em></h2><p>This video guide explains how llama.cpp, GGML and GGUF work and how you can run LLMs locally - on laptop CPUs, mobile/edge devices, or consumer hardware. Many LLMs have large parameter counts and require powerful GPUs with 20GB+ of VRAM to serve them. llama.cpp, on the other hand, can serve quantised GGUF models that maintain close performance and accuracy while drastically reducing the memory requirement, by up to 4-5x. </p><p><strong>Why it&#8217;s a good resource:</strong></p><ul><li><p>It describes how model quantisations work and how GGUF compresses model weights.</p></li><li><p>It&#8217;s a practical guide for AI Engineers aiming to build local, offline, or edge-based AI applications, including AI Agents or Agentic Systems.</p></li></ul><p><strong>What you&#8217;ll learn:</strong></p><ul><li><p>What llama.cpp, GGML, and GGUF are, how they work end-to-end, and why they enable inference on a wide variety of hardware (CPU, low-end GPU, edge devices).</p></li><li><p>How AI Engineers can inspect the layers of a GGUF model, and where to find GGUF models.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W9Wr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W9Wr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 424w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 848w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 1272w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W9Wr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png" width="1456" height="915" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!W9Wr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 424w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 848w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 1272w, https://substackcdn.com/image/fetch/$s_!W9Wr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41cda044-e1b8-4590-b2a5-7bc6a946c4d1_8193x5146.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;476c625c-0702-4c0c-a612-0ff999cb462b&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;An AI Engineer's Guide on porting LLMs to Edge Devices&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e98b89ac-97e9-4875-88b6-2a5039668cb2_1700x1700.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-10-18T13:19:46.690Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/176485447/63ff47f8-d690-481b-9f0d-88d88b8bb32d/transcoded-20808.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/an-ai-engineers-guide-to-running&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;63ff47f8-d690-481b-9f0d-88d88b8bb32d&quot;,&quot;id&quot;:176485447,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:39,&quot;comment_count&quot;:0,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural 
Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!PLDd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4f14526-b620-40f9-9baf-78669ebe4997_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><h2>Ending Notes</h2><p>Thanks for following along as I&#8217;ve been diving into the engineering side of AI these past couple of months. My goal with these articles and guides has been to help more developers build real, practical AI systems that go beyond just demos.</p><p>This newsletter has been growing steadily and organically thanks to your support - and I&#8217;m incredibly grateful for that. &#128588;</p><p>I&#8217;ve also been able to carve out more time recently, and I&#8217;ve started working on several big initiatives that I think will push this newsletter (and the whole project) into a new phase.</p><p><strong>I can&#8217;t wait to share what I&#8217;ve been building, and what&#8217;s coming next!</strong></p>]]></content:encoded></item><item><title><![CDATA[Small VLMs Will Soon Compete With Frontier AI Models 10x Their Size]]></title><description><![CDATA[What makes NVIDIA Nemotron Nano 2 VL a breakthrough in small, fast, long-context visual reasoning.]]></description><link>https://read.theaimerge.com/p/small-vlms-will-soon-compete-with</link><guid isPermaLink="false">https://read.theaimerge.com/p/small-vlms-will-soon-compete-with</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 22 Nov 2025 14:58:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6fOc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. 
Each week, I write about practical, production-ready AI/ML Engineering. Join <strong><a href="https://multimodalai.substack.com/subscribe">7,000+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>This edition is the second part of a small series titled <em><strong>&#8220;The Future of Agentic AI is Small&#8221;</strong></em>, where we unpack the technical details of NVIDIA&#8217;s Nemotron Family, a set of models, datasets, and techniques that push the boundary of Agentic AI toward smaller, open, and more efficient models.</p><p>In the first part, we covered the model family at a high level, spanning Nano (1B-15B) models for edge, Super (16B-50B) models for mid-tier hardware, and Ultra (50B+) models, which compete with frontier-level models on various benchmarks.</p><p>In the second part, we&#8217;ll focus on the multimodal star of the NVIDIA Nemotron Nano family, the Nano 2 VL 12B, released just a few weeks ago. 
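As a rough rule of thumb (my own back-of-the-envelope assumption, not a figure from NVIDIA), you can estimate the serving memory for models in each of these tiers from the parameter count and numeric precision alone:

```python
def approx_serving_memory_gb(params_b: float, bytes_per_param: float,
                             overhead: float = 1.2) -> float:
    """Weights-only estimate (params x bytes/param), padded by ~20% for
    KV cache and activations; the overhead factor is a loose assumption."""
    return params_b * bytes_per_param * overhead

# The Nano 2 VL 12B at common serving precisions:
for label, bpp in [("bf16", 2.0), ("fp8", 1.0), ("int4", 0.5)]:
    print(f"{label}: ~{approx_serving_memory_gb(12, bpp):.1f} GB")
# bf16: ~28.8 GB, fp8: ~14.4 GB, int4: ~7.2 GB
```

Even this crude estimate shows why a 12B model at bf16 overflows a 24GB consumer GPU, while the same model quantised fits comfortably.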
</p><p>The Nano 2 VL 12B is the current leading open SVLM in the OCRBenchv2 benchmark, a benchmark designed to evaluate enterprise-scale document, invoices, and complex image understanding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7MDj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7MDj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 424w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 848w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7MDj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png" width="1456" height="698" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:443156,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7MDj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 424w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 848w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 1272w, https://substackcdn.com/image/fetch/$s_!7MDj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb9b676f-4053-4545-9323-fa6846c46327_2802x1344.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Open NVIDIA Nemotron Nano 2 VL 12B model competes with Frontier Level AI Models on OCR, Multi-Image Reasoning, and Understanding.</figcaption></figure></div><p>We&#8217;ll cover the architecture, the text encoder, and the vision encoder that the Nano 2 VL uses, and explain the architectural improvements, the plug-and-play optimizers, reasoning modes, and more.</p><p>In this mini-series:<br>&#8594;<strong> &#9989; </strong><a href="https://multimodalai.substack.com/p/the-future-of-agentic-ai-is-small">Part I - The Nano 2 Family of SLMs for best-in-class Reasoning Models</a><strong><br></strong>&#8594; <strong>&#9989; </strong>Part II - The Nano 2 VL Model for Agentic AI capabilities on Vision Tasks</p><div><hr></div><h2>1. 
The Nemotron Nano 2 VL 12B</h2><p>The Nano 2 VL was designed for strong real-world document understanding, long-video comprehension, and multimodal reasoning. Compared to its predecessor, the Llama&#8209;3.1&#8209;Nemotron&#8209;Nano&#8209;VL&#8209;8B, this version packs improvements across multiple vision benchmarks, notably complex OCR-related benchmarks and video/image reasoning ones.</p><p>It achieves this through enhancements in architecture, image-processing tweaks, dataset curation, training recipes, and inference optimizations, all of which are described in the research paper and covered in this article.</p><p>To summarize the improvements Nano 2 VL brings:</p><ul><li><p><strong>Leading accuracy</strong> on document/OCR/vision-language tasks, ranking first among open SLMs and in the top 3 overall, competing with frontier models such as Gemini 2.5 Pro.</p></li><li><p><strong>Context handling</strong> extended from 16k tokens in the previous Nano VL (v1) to a 128k context window, enabling very long documents, long videos, multi-page inputs, and complex reasoning.</p></li><li><p><strong>Architecture</strong> built on a hybrid &#8220;Mamba-Transformer&#8221; backbone, bringing faster inference thanks to the linear-time compute of Mamba-style state-space model (SSM) layers, paired with a powerful Vision Encoder trained on multi-scale resolution images.</p></li><li><p><strong>Throughput</strong> gains on video inputs through &#8220;Efficient Video Sampling (EVS)&#8221;, a technique that reduces redundant tokens in long videos and multi-image setups.</p></li></ul><p>To understand the impact of these improvements, we first need to briefly cover the previous iteration of this VLM and pinpoint its performance, accuracy, architecture, and rankings.</p><h3>1.1 The Nemotron Nano VL - 8B (previous version)</h3><p>The previous Nemotron VLM was built on Meta&#8217;s Llama 3.1 architecture for the Language Decoder and NVIDIA&#8217;s C-RADIOv2-H Vision Encoder. The language decoder in the first version, <strong>Llama-3.1-Instruct-8B</strong>, is Meta&#8217;s mid-sized instruction-tuned LLM for high-quality reasoning and task-following.</p><p>As an Instruct model, it was aligned during post-training to adhere to user prompts and preferences, making it perform reliably in conversational tasks.</p><p>Architecturally, Llama-3.1-Instruct-8B uses a decoder-only Transformer with grouped-query attention, which helps inference throughput; we&#8217;ll see in a bit how that differs from the Text Decoder in Nano 2 VL.</p><p>For the Vision Encoder, the previous Nemotron VL model used the NVIDIA RADIOv2 Vision Encoder. This encoder was built under a paradigm called &#8220;agglomerative modeling&#8221;: instead of relying on a single specialized teacher model, the design distills multiple teacher vision models simultaneously into a single student backbone that learns to mimic all of them across all resolutions. 
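To make the idea concrete, here is a minimal toy sketch of a multi-teacher feature-matching loss, where a single student backbone is trained to mimic several teacher encoders at once (my own simplified illustration with random features and linear heads; the actual RADIO training recipe is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one student backbone, three "teacher" vision encoders.
# Each produces a feature vector per image patch; the student learns a
# separate linear head per teacher and must match all teachers at once.
n_patches, d_student, d_teacher = 16, 32, 24
student_feats = rng.normal(size=(n_patches, d_student))
teacher_feats = [rng.normal(size=(n_patches, d_teacher)) for _ in range(3)]
heads = [rng.normal(size=(d_student, d_teacher)) * 0.1 for _ in range(3)]

def agglomerative_loss(student, teachers, heads):
    """Sum of per-teacher feature-matching (MSE) losses."""
    return sum(
        float(np.mean((student @ W - T) ** 2))
        for W, T in zip(heads, teachers)
    )

loss = agglomerative_loss(student_feats, teacher_feats, heads)
print(f"combined distillation loss: {loss:.3f}")
```

Minimizing this summed loss is what forces the single student backbone to agree with every teacher simultaneously, rather than specializing to one.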
We&#8217;ll see more details on this one in a bit.</p><p>Now, let&#8217;s present an overview of what we&#8217;ve discussed above, in the following Figure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QZ1d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QZ1d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 424w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 848w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QZ1d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png" width="1456" height="736" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:528122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QZ1d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 424w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 848w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 1272w, https://substackcdn.com/image/fetch/$s_!QZ1d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f05e586-3e05-49b3-a54c-b396fd6cf1fd_2860x1446.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1. An overview of the previous Nemotron Nano VL model, including benchmark scores, architecture design, and notes.</figcaption></figure></div><p>Although it&#8217;s an SVLM (Small Vision Language Model), with only 8B parameters, it still ranked high on multiple Multimodal Tasks, including AI2D, OCRBenchv2, and ChartQA. 
Just to get an idea of what the data from these benchmarks looks like, let&#8217;s inspect a few samples from the AI2D Benchmark, where the Nemotron Nano VL (v1) scored 85% accuracy.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5n21!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5n21!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 424w, https://substackcdn.com/image/fetch/$s_!5n21!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 848w, https://substackcdn.com/image/fetch/$s_!5n21!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!5n21!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5n21!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png" width="1456" height="842" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:842,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5n21!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 424w, https://substackcdn.com/image/fetch/$s_!5n21!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 848w, https://substackcdn.com/image/fetch/$s_!5n21!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!5n21!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F91c5ab3a-f456-402b-9d0f-807a1adcabdd_1854x1072.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2: A Screenshot from the AI2D benchmarking datasets on HuggingFace.</figcaption></figure></div><p>Looking at a zoomed-in sample from this set (3rd row), we can see the complexity of the task: the model has to reason across all the animals in the image and also &#8220;understand&#8221; how the arrows connect them, relative to the question &amp; options presented:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tK55!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tK55!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 424w, https://substackcdn.com/image/fetch/$s_!tK55!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 848w, https://substackcdn.com/image/fetch/$s_!tK55!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 1272w, https://substackcdn.com/image/fetch/$s_!tK55!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tK55!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png" width="470" height="323.011583011583" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:712,&quot;width&quot;:1036,&quot;resizeWidth&quot;:470,&quot;bytes&quot;:601137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tK55!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 424w, https://substackcdn.com/image/fetch/$s_!tK55!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 848w, https://substackcdn.com/image/fetch/$s_!tK55!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 1272w, https://substackcdn.com/image/fetch/$s_!tK55!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e30f9d8-04b7-46f1-aa66-b19f74a3904c_1036x712.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. A zoomed-in image from the AI2D dataset, outlining the complex connections and visual context in the dataset samples.</figcaption></figure></div><p>Now that we have a complete overview of the previous Nemotron Nano VL version, let&#8217;s follow the same process for the Nemotron Nano 2 VL.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/small-vlms-will-soon-compete-with?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Neural Bits! 
This post is public, so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/small-vlms-will-soon-compete-with?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/small-vlms-will-soon-compete-with?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h3>1.2 The Nemotron Nano 2 VL - 12B (the new version)</h3><p>The Nano 2 VL is effectively rebuilt from the ground up, with changes to the Vision Encoder, the Language Decoder, the post-training methods, and the image/video processing workflow.</p><p>This time, the Nano 2 VL builds on the Nemotron Nano 2 12B Language Model, a more efficient and robust architecture than the Llama 3.1 used in the previous version. The Nano 2 12B LLM is a decoder-only stack in which standard Transformer blocks are interleaved with Mamba-2 style state-space layers to improve efficiency and long-context scaling.</p><blockquote><p><strong>Note: </strong>We&#8217;ve covered the Nemotron Nano 2 12B language model in Part I of this article.
<br>&#8594; <strong><a href="https://multimodalai.substack.com/p/the-future-of-agentic-ai-is-small">Read it here.</a></strong></p></blockquote><p>The Vision Encoder uses the same paradigm of &#8220;agglomerative models&#8221; but with a newer variant, <strong>c-RADIOv2-VLM-H</strong>, replacing the previous base and benefiting from improved multi-teacher distillation, multi-resolution consistency, and stronger dense features and robustness.</p><p>To put everything in a single Figure, we&#8217;ll have:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Vv2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Vv2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 424w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 848w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4Vv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png" width="1456" height="833" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Vv2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 424w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 848w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 1272w, https://substackcdn.com/image/fetch/$s_!4Vv2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2309b09-4b6e-455f-b947-8d298230ae8e_1762x1008.png 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 4: Overview of the Nemotron Nano 2 VL 12B model, describing the Language Decoder and Vision Encoder models used.</figcaption></figure></div><p>With this redesign, the Nano 2 VL is the leading open SVLM on OCRBenchV2.
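</p><p>To make the interleaving described above concrete, here is a minimal sketch of a hybrid layer schedule. This is illustrative only: the real Nemotron Nano 2 layer counts and block placement differ, and the <code>attention_every</code> ratio below is a made-up parameter, not NVIDIA&#8217;s configuration.</p>

```python
def hybrid_layer_schedule(n_layers: int, attention_every: int = 4) -> list[str]:
    # Most blocks are Mamba-2 style state-space layers; every
    # `attention_every`-th block is full self-attention.
    # Illustrative ratio only, not the actual Nemotron Nano 2 config.
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(n_layers)
    ]

schedule = hybrid_layer_schedule(12, attention_every=4)
print(schedule.count("mamba2"), schedule.count("attention"))  # prints "9 3"
```

The point of the pattern is that the cheap state-space layers carry most of the sequence mixing, while the sparse attention layers retain precise token-to-token lookup.<p>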
Inspecting a few samples of this benchmark, we find complex use cases across invoices, documents, charts, and banners, with tasks covering Text Recognition, Ad Placement, Math Calculation, VQA (Visual Question Answering), and Text Extraction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ycBf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ycBf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 424w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 848w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 1272w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ycBf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png" width="1456" height="1211"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ycBf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 424w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 848w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 1272w, https://substackcdn.com/image/fetch/$s_!ycBf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b73eb07-e9ed-4375-a901-2632d45ef095_1685x1401.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: A view of the samples and tasks in the OCRBenchV2 Benchmark.</figcaption></figure></div><p>Now, let&#8217;s compare them side-by-side using a bar-chart across a variety of Multimodal Benchmarks, designed for Document Understanding and Image/Video reasoning tasks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UlQX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UlQX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 424w, 
https://substackcdn.com/image/fetch/$s_!UlQX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 848w, https://substackcdn.com/image/fetch/$s_!UlQX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 1272w, https://substackcdn.com/image/fetch/$s_!UlQX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UlQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png" width="1456" height="796" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1185555,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!UlQX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 424w, https://substackcdn.com/image/fetch/$s_!UlQX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 848w, https://substackcdn.com/image/fetch/$s_!UlQX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 1272w, https://substackcdn.com/image/fetch/$s_!UlQX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6de78378-7409-4e6a-8801-a73346ba44b1_3408x1864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6: Comparing Nemotron Nano VL (previous version) with the current Nemotron Nano 2 VL across Multimodal Benchmarks.</figcaption></figure></div><p>Now that we have a clear idea of how these models are different and what improvements the Nano 2 VL brings, let&#8217;s inspect in more detail the interesting parts of the Vision Backbone (the RADIO Encoder) and the techniques used to align this model in post-training, as well as why EVS is important and how it works.</p><h2>2. Understanding the RADIO Vision Encoder</h2><p><strong>RADIOv2.5</strong> is an &#8220;agglomerative&#8221; vision encoder by NVIDIA, which is trained by distilling multiple top VFMs (Vision Foundation Models) such as OpenAI CLIP, Meta DINO, and Meta SAM into a single backbone. </p><p>Instead of inheriting the bias of one teacher, it learns the strengths of many teachers, keeping global semantics, dense features, and segmentation-friendly spatial maps. 
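</p><p>The core idea can be sketched as a per-teacher feature-matching loss that is summed over all teachers. This is a toy sketch under simplifying assumptions: real RADIO training uses learned per-teacher adaptor heads plus summary and dense feature losses against actual CLIP/DINO/SAM outputs, none of which appear here.</p>

```python
def mse(a, b):
    # Mean squared error between two equal-length feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multi_teacher_loss(student_feats, teacher_feats, weights=None):
    # Sum of weighted per-teacher feature-matching losses.
    # `student_feats[name]` stands in for the student's features after a
    # per-teacher adaptor head; `teacher_feats[name]` is that teacher's target.
    weights = weights or {name: 1.0 for name in teacher_feats}
    return sum(
        weights[name] * mse(student_feats[name], teacher_feats[name])
        for name in teacher_feats
    )

# Toy 4-dim "features" standing in for CLIP / DINO / SAM outputs.
teachers = {
    "clip": [1.0, 0.0, 0.0, 0.0],
    "dino": [0.0, 1.0, 0.0, 0.0],
    "sam":  [0.0, 0.0, 1.0, 0.0],
}
student = {name: [0.5, 0.5, 0.5, 0.5] for name in teachers}
loss = multi_teacher_loss(student, teachers)  # one scalar to optimize
```

Minimizing this summed loss is what lets a single backbone absorb complementary teachers instead of inheriting one teacher&#8217;s bias.<p>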
An overview of that process could be seen in the following Figure.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZO8O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZO8O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 424w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 848w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 1272w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZO8O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png" width="1456" height="324" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:324,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381699,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZO8O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 424w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 848w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 1272w, https://substackcdn.com/image/fetch/$s_!ZO8O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F948aefd9-cfeb-4256-82e3-1e204c84e66a_3772x840.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 7: Multi-teacher distillation of VFMs (Vision Foundational Models) into a single backbone, RADIO2.5.</figcaption></figure></div><p>Document 
understanding or encoding images with complex details is a difficult task for Vision Models. Small text, skewed images, or low-resolution images pose a challenge for pretty much any VLM. The RADIO encoder mitigates that with multi-resolution consistency. Earlier encoders produced different features at different resolutions. RADIOv2.5, on the other hand, is trained with paired low/high-res teacher signals, so its features stay consistent across scales. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><p>Let&#8217;s understand how exactly that works, using the following Figure 8:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G2Od!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G2Od!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 424w, https://substackcdn.com/image/fetch/$s_!G2Od!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 848w, https://substackcdn.com/image/fetch/$s_!G2Od!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 1272w, 
https://substackcdn.com/image/fetch/$s_!G2Od!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G2Od!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png" width="1456" height="1030" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1030,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2999285,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!G2Od!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 424w, https://substackcdn.com/image/fetch/$s_!G2Od!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 848w, 
https://substackcdn.com/image/fetch/$s_!G2Od!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 1272w, https://substackcdn.com/image/fetch/$s_!G2Od!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5b48274-c3cf-4b23-bbf6-9c66695d857a_2842x2010.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 8: How multi-teacher distillation of CLIP, SAM, and DINO works in training the RADIO Vision
Encoder.</figcaption></figure></div><p>Here we have four configurations (iterations) of a training step. Each one uses a different teacher configuration, so the student learns features at both high and low resolution simultaneously, generalizing better while maintaining accuracy.</p><p>Another feature of RADIOv2.5 is token merging, which compresses redundant regions while preserving important details and keeps token counts low, so the encoder can process more images at once or accept longer videos as input. </p><blockquote><p><strong>Note:</strong> This is particularly important for high-resolution inputs, as they&#8217;re split into multiple tiles before being passed to the model. Merging tokens keeps the information density high with fewer tokens; Vision Encoders that preserve every token will quickly fill up the context window.</p></blockquote><p>Now that we&#8217;ve covered the internals of the RADIOv2.5 encoder, we can use Figure 9 from the RADIOv2.5 research paper to summarize:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NwwY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NwwY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 424w, 
https://substackcdn.com/image/fetch/$s_!NwwY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 848w, https://substackcdn.com/image/fetch/$s_!NwwY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!NwwY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NwwY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png" width="1456" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1839216,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!NwwY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 424w, https://substackcdn.com/image/fetch/$s_!NwwY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 848w, https://substackcdn.com/image/fetch/$s_!NwwY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!NwwY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83a38e4d-d32f-4857-a40f-519d0b7ad4bc_3048x1302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9: The improvements that RADIOv2.5-H brings over previous iterations of agglomerative models (e.g., RADIO2.1, used in the previous version of this VLM)</figcaption></figure></div><p>As a last step for this section, let&#8217;s look at a few benchmark results across multiple image-feature extraction tasks:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8gy1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8gy1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 424w, https://substackcdn.com/image/fetch/$s_!8gy1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 848w, https://substackcdn.com/image/fetch/$s_!8gy1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 1272w, 
https://substackcdn.com/image/fetch/$s_!8gy1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8gy1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png" width="1456" height="405" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:405,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:891030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8gy1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 424w, https://substackcdn.com/image/fetch/$s_!8gy1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 848w, 
https://substackcdn.com/image/fetch/$s_!8gy1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 1272w, https://substackcdn.com/image/fetch/$s_!8gy1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a3f568e-0e37-4c6e-bbc7-f12d42a2a83e_3028x842.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 10: Comparison of dense task benchmarks against other agglomerative models. 
(Taken from RADIOv2.5 Research Paper)</figcaption></figure></div><p>In this last section, we&#8217;ll bring everything together, using the official diagram of the Nemotron Nano 2 VL architecture from the research paper to outline each component and how it works - building on what we&#8217;ve learned in the previous sections.</p><div><hr></div><h2>3. Painting the Complete Architecture</h2><p>The Nemotron Nano 2 VL research paper gives us this Figure to visualize the architecture of the 12B model. Starting from the bottom up, we have the Text Input (the prompt) and the Image Input (represented as a Tiled Image on the diagram).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LrHY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LrHY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 424w, https://substackcdn.com/image/fetch/$s_!LrHY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 848w, https://substackcdn.com/image/fetch/$s_!LrHY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 1272w, https://substackcdn.com/image/fetch/$s_!LrHY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LrHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png" width="1146" height="1134" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1134,&quot;width&quot;:1146,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:345982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LrHY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 424w, https://substackcdn.com/image/fetch/$s_!LrHY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 848w, https://substackcdn.com/image/fetch/$s_!LrHY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LrHY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49a2929-ffe9-4fa4-9629-acdfa72d523f_1146x1134.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 11: An overview of the Nemotron Nano 2 VL 12B model architecture. 
(Taken from the Research Paper)</figcaption></figure></div><p>Both inputs are passed through their respective encoders (i.e., text is tokenized; the image is tiled, encoded, and its tokens are projected), and all token embeddings land in a common multimodal embedding space. The Language Decoder then processes the interleaved (i.e., common-space) embeddings and produces the answer.</p><p>With this overall picture in place, let&#8217;s unpack the interesting components and provide a bit more detail on how each of these processes works. Since we&#8217;re covering the Nano 2 VL model, we&#8217;ll focus mostly on how images and videos are processed and passed through the model.</p><div><hr></div><h3>3.1 Starting with Image Tiling &amp; Patching</h3><p>Let&#8217;s first build the intuition for why we need to split images into patches and/or multiple tiles.</p><p>Transformers were originally built for language, where the input is already a sequence of discrete units (tokens). These tokens fit naturally into the Transformer architecture because they have an order and position in a sentence.</p><p>Images, however, are <strong>2D grids</strong>, not sequences. A Transformer can&#8217;t directly take a 2D array of pixels and compute attention - it needs a <strong>sequence</strong> of vectors. That&#8217;s why we convert an image into a sequence by slicing it into patches, &#8220;mimicking&#8221; the order of words in a sentence.</p><blockquote><p><strong>Note: </strong>In the end, embeddings for both words (text tokens) and visual patches (vision tokens) end up in the same latent embedding space for the Language Model to process.</p></blockquote><p>Now, in VLMs, patching is a different concept from tiling. 
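</p><p>Here&#8217;s a minimal NumPy sketch of the patching step described above - turning a 2D image into a flat sequence of patch vectors. The patch size and image shape are illustrative, not the model&#8217;s actual preprocessing values:</p>

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Slice an (H, W, C) image into a 1D sequence of flattened patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "resize/pad the image first"
    # (H, W, C) -> (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

img = np.zeros((224, 224, 3), dtype=np.float32)  # a dummy 224x224 RGB image
tokens = patchify(img)
print(tokens.shape)  # (196, 768): a 14x14 grid of 16x16x3 patches
```

<p>Each row then receives a positional encoding so the Transformer knows where the patch came from; tiling simply runs this same step once per tile.</p><p>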
These processing steps follow a strict order:</p><ol><li><p><strong>Tiling</strong> splits the entire image into multiple <strong>large tiles</strong> (e.g., 512&#215;512 or 1024&#215;1024), often with overlap. Each tile is then patched separately by the vision encoder.</p></li><li><p><strong>Patching</strong> turns a 2D image into 1D tokens with positional encodings, so the Transformer knows how patches relate to each other across the image.</p></li></ol><p>Tiling and patching are used in all multimodal Foundation Models (e.g., GPT, Gemini, Claude) and VLMs. If we have high-resolution images, images with unusual aspect ratios (e.g., very tall PDFs, panoramic shots), or images dense with text and diagrams, we <strong>split them into multiple tiles</strong> to capture all the visual context.</p><p>Then, each tile is <strong>patched individually</strong>, so the Transformer knows how image regions relate (e.g., Patch 1 from Tile 1 versus Patch 64 from Tile 2).</p><blockquote><p><strong>Note</strong>: If we were to use patching alone at full resolution, we&#8217;d generate a huge number of patches, filling up the entire context window of the LLM.</p></blockquote><p>Now that we understand the core ideas behind tiling and patching, let&#8217;s visualize them in the following Figure, which details the resolutions and tile layouts that the RADIO Encoder inside the Nano 2 VL 12B model supports:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6fOc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6fOc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 424w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 848w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6fOc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png" width="1456" height="907" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:907,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3255473,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6fOc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 424w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 848w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 1272w, https://substackcdn.com/image/fetch/$s_!6fOc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F373ea7c3-9905-40ec-9406-abbac6ed7a97_3300x2056.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next, let&#8217;s visualize how the RADIO Vision Encoder extracts and represents image features.</p><h3>3.2 How the RADIO Encoder Extracts Features</h3><p>This is a shorter section, but it builds intuition for the effects of multi-resolution training of the Vision Encoder, and how it extracts important dense features from images.</p><p>Figure 13 shows dense feature maps over different images, demonstrating how RADIOv2.5 can produce both Meta DINO-like features (at low resolution) and Meta SAM-like features (at high resolution). If we look at the yellow boxes with dashed outlines, we can see distinct clusters of features belonging to the same object. 
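</p><p>These dense-feature visualizations are typically produced by projecting each patch token onto its top-3 PCA components and rendering the result as RGB. Here&#8217;s a minimal sketch of that projection (the token shapes are illustrative, not the paper&#8217;s exact setup):</p>

```python
import numpy as np

def pca_to_rgb(tokens: np.ndarray) -> np.ndarray:
    """Project (N, D) patch tokens onto their top-3 principal
    components and rescale to [0, 1] so each token maps to an RGB color."""
    centered = tokens - tokens.mean(axis=0)
    # right singular vectors of the centered data = principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                     # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)

rng = np.random.default_rng(0)
rgb = pca_to_rgb(rng.normal(size=(196, 768)))      # a 14x14 grid of tokens
print(rgb.shape)  # (196, 3); reshape to (14, 14, 3) to display as an image
```

<p>Tokens with similar features land close together in PCA space, so they render as similar colors - which is exactly the clustering effect visible in the figure.</p><p>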
</p><p>In the first image, for example, there is a clear distinction between the Person (blue), Forest (green), and Mountain (red) regions, and the features (colors) cluster to outline fine-grained details and object boundaries.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MUrL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MUrL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 424w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 848w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 1272w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MUrL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png" width="1456" height="807" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:807,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6982253,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MUrL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 424w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 848w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 1272w, https://substackcdn.com/image/fetch/$s_!MUrL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7274162b-c0b4-4b69-9f9d-e8bc9712e8c4_3104x1720.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 13: Visualizations of model features exhibiting the mode switch issue (low res to high res). PCA is used to project patch tokens into a 3D space representing RGB colors. The visualizations illustrate how the baseline RADIO switches from producing DINO-like features at low resolution to producing SAM-like features at high resolution.</figcaption></figure></div><h3>3.3 How Nano 2 VL Handles Long Video Input</h3><p>In this section, we cover EVS (Efficient Video Sampling), the plugin the model uses to drop repetitive/redundant tokens on multi-image inputs, using the context window more efficiently.</p><blockquote><p><strong>Note: </strong>To understand EVS in one sentence: most video content has static elements - areas that don&#8217;t change from frame to frame. 
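</p></blockquote><p>This intuition can be sketched as a simple token-pruning pass: compare each frame&#8217;s patch embeddings against the previous frame&#8217;s and keep only the tokens that changed. This is an illustrative simplification, not NVIDIA&#8217;s actual EVS implementation:</p>

```python
import numpy as np

def evs_prune(frames: np.ndarray, threshold: float = 0.05) -> list:
    """frames: (T, N, D) patch embeddings for T frames of N tokens each.
    Returns, per frame, the indices of the tokens worth encoding."""
    kept = [list(range(frames.shape[1]))]   # first frame: keep every token
    for t in range(1, frames.shape[0]):
        # per-token change versus the previous frame
        diff = np.linalg.norm(frames[t] - frames[t - 1], axis=-1)
        kept.append(np.nonzero(diff > threshold)[0].tolist())
    return kept

frames = np.zeros((4, 16, 8), dtype=np.float32)  # 4 frames, 16 tokens each
frames[2, 5] = 1.0               # token 5 changes in frame 2 (and back in frame 3)
print([len(k) for k in evs_prune(frames)])       # [16, 0, 1, 1]
```

<blockquote><p>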
We don&#8217;t need to encode all of those tokens; we can encode only the &#8220;moving areas&#8221; of the video.</p></blockquote><p>Nemotron Nano 2 VL uses EVS to speed up video inference by up to 4x - a major speed-up - with minimal accuracy loss. Let&#8217;s make this concrete with a few frames from the Joe Rogan Podcast.</p><blockquote><p><strong>Note: </strong>In a podcast setting, the background rarely changes; the scene mostly alternates between two camera shots, one for each speaker.</p></blockquote><p>Here, the region marked in red rarely changes. So if we pass a 60-second portion of the podcast to Nemotron Nano 2 VL, it fills only a small slice of the model&#8217;s Context Window, leaving plenty of room for generated text tokens, longer video, reasoning tokens, etc.</p><p>That&#8217;s because EVS compresses the visual information by removing or merging areas that don&#8217;t change across the video.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O62g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O62g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 424w, https://substackcdn.com/image/fetch/$s_!O62g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 848w, 
https://substackcdn.com/image/fetch/$s_!O62g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 1272w, https://substackcdn.com/image/fetch/$s_!O62g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O62g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png" width="1456" height="1068" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1068,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3736659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O62g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 424w, 
https://substackcdn.com/image/fetch/$s_!O62g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 848w, https://substackcdn.com/image/fetch/$s_!O62g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 1272w, https://substackcdn.com/image/fetch/$s_!O62g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f5e7a3-3ed7-48b9-b8a4-15280b8bf08a_1922x1410.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 14: An example frame sequence from an episode of The Joe Rogan Experience Podcast, annotating the visual region whose content barely changes.</figcaption></figure></div><p>For a more complete overview of EVS and how it works, we can inspect Figure 15:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!blx4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!blx4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 424w, https://substackcdn.com/image/fetch/$s_!blx4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 848w, https://substackcdn.com/image/fetch/$s_!blx4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!blx4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!blx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png" 
width="1456" height="1345" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1345,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:468020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!blx4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 424w, https://substackcdn.com/image/fetch/$s_!blx4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 848w, https://substackcdn.com/image/fetch/$s_!blx4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 1272w, https://substackcdn.com/image/fetch/$s_!blx4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6326c889-2400-4f52-a66d-0374cc1c47f7_1622x1498.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 15: A complete overview of EVS (Efficient Video Sampling), an NVIDIA plugin that reduces the number of vision tokens a video sequence is compressed into, preserving the model&#8217;s context window.</figcaption></figure></div><p>In summary, EVS improves inference throughput by up to 4x while maintaining accuracy. It works outside the model, as a pre-processing step that prunes and compresses Vision Tokens before they are passed to the model.</p><p>In the next section, we&#8217;ll provide a brief overview of the Datasets and Training Recipes used to train the Nemotron Nano 2 VL model.</p><div><hr></div><h2>4. 
The Open Source Aspect</h2><p>In this section, we&#8217;ll look at the Training Recipe, which describes how the model was trained and aligned, on which datasets, and for how long. This is one of the key aspects that define a truly Open model.</p><p>Nemotron Nano V2 VL was trained via a Pretraining + <strong>4-stage SFT </strong>pipeline. SFT stands for Supervised Finetuning, and it relies on custom-engineered datasets and samples to align the model to specific behaviours. </p><h4>0. Pretraining - Warming Up (36B tokens)</h4><p>This stage focuses on aligning the RADIOv2.5 vision encoder with the frozen Nano 2 12B Language Model. From the architecture in Section 3, we had the Vision Projector layer, which is an MLP (Multi Layer Perceptron) responsible for projecting Vision Tokens into Language Space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yju9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yju9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 424w, https://substackcdn.com/image/fetch/$s_!Yju9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 848w, https://substackcdn.com/image/fetch/$s_!Yju9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Yju9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yju9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png" width="1380" height="884" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:137313,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yju9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 424w, https://substackcdn.com/image/fetch/$s_!Yju9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 848w, 
https://substackcdn.com/image/fetch/$s_!Yju9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 1272w, https://substackcdn.com/image/fetch/$s_!Yju9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F714926b4-1b1e-4079-bf9e-929ca3007d4c_1380x884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 16: Visualization of MLP Projection of Vision tokens into the Shared Latent Space of the 
LLM.</figcaption></figure></div><p>The training set included ~2.2M multimodal samples (captioning, VQA, grounding, OCR, document extraction), aiming to stabilise early multimodal alignment without affecting language capabilities (the LLM was frozen in this setup).<br></p><h4>1. Core Multimodal SFT-1 (112B Tokens, 16k Context Size)</h4><p>This was the main data stage and the largest contribution to quality. Datasets cover <strong>40B text tokens </strong>on reasoning, code, math, and multilingual dialogue, and <strong>72B multimodal tokens</strong> on captioning, TextVQA, DocVQA, ChartQA, and more.</p><p>The main outcomes of this stage were strong OCR capabilities, general VQA, and document understanding and reasoning.</p><h4>2. Video &amp; Multi-Image SFT (55B Tokens, 49k Context Size)</h4><p>This stage focused on building long-context and temporal reasoning. Datasets include Kinetics, YouCook2, ActivityNet, and VideoQA, all of which contain short clips of people performing different actions. 
Below you can find an example of an entry from the YouCook2 dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T9Pg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T9Pg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 424w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 848w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 1272w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T9Pg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png" width="1456" height="311" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:311,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:875686,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!T9Pg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 424w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 848w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 1272w, https://substackcdn.com/image/fetch/$s_!T9Pg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd57cf234-c0da-48f5-aa68-4a5427d6a464_2618x560.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 17: A sample from the YouCook2 Dataset, one of the largest task-oriented, instructional video datasets in the vision 
community</figcaption></figure></div><h4>3. Code/Text Recovery (15B Tokens, 49k Context Size)</h4><p>During the previous two stages, which focused on multimodal reasoning, the team behind Nemotron Nano 2 VL noticed that code-reasoning abilities dipped, so this stage aims to recover accuracy on the underrepresented text &amp; code reasoning tasks.</p><p>It trains on ~1M code-reasoning samples, restoring performance on math, code, and text reasoning benchmarks.</p><h4>4. Long-Context Scaling (12B Tokens, up to 300k Context Size)</h4><p>The fourth and final stage focused on building multi-image reasoning and long-video understanding, emphasizing long context. Remember, the Nano 2 VL model has a 128k context window, allowing it to process large inputs; this stage aligned the model to that capability.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CAHH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CAHH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 424w, https://substackcdn.com/image/fetch/$s_!CAHH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 848w, https://substackcdn.com/image/fetch/$s_!CAHH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CAHH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CAHH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png" width="1456" height="135" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:135,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102624,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CAHH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 424w, https://substackcdn.com/image/fetch/$s_!CAHH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 848w, 
https://substackcdn.com/image/fetch/$s_!CAHH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 1272w, https://substackcdn.com/image/fetch/$s_!CAHH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f3b6b6-1c37-4752-9e4e-88d16cdb2b08_3828x354.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 18: Overview of the Training Recipe of the Nemotron Nano 2 VL model. Total training covers <strong>~220B multimodal+text tokens</strong>. (Taken from the Research Paper).</figcaption></figure></div><p>Having these details helps developers trust the model they&#8217;re building on top of, explain and interpret its outputs, and mitigate biases.</p><blockquote><p><strong>Note: </strong>All the details in this section are extracted and summarized from the Research Paper, which I recommend reading.</p></blockquote><h2>5. The Model on HuggingFace</h2><p>You can find and get started with Nemotron Nano 2 VL models on NVIDIA&#8217;s HuggingFace Collection. 
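Before picking a variant, it helps to do quick back-of-the-envelope math on weight memory. A minimal sketch, assuming a 12B-parameter model and counting weight memory only (the KV cache, activations, and quantization metadata add more on top):

```python
# Rough weight-memory footprint of a 12B-parameter model at different precisions.
# Illustrative estimate only: real deployments also need memory for the KV cache,
# activations, and per-block quantization scales.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3

for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(12e9, precision):.1f} GiB")
```

For 12B parameters this works out to roughly 22 GiB in BF16, 11 GiB in FP8, and 6 GiB in NVFP4, before runtime overheads, which is why the lower-precision variants matter on smaller GPUs.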
The Nano 2 VL model comes in 3 precision formats: BF16, FP8, and NVFP4 (i.e., NVIDIA&#8217;s new FP4 format compatible with Blackwell Architectures).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xq2c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xq2c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 424w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 848w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 1272w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xq2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png" width="1456" height="1615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1615,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1111003,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176277074?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xq2c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 424w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 848w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 1272w, https://substackcdn.com/image/fetch/$s_!Xq2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5eeb75fa-da94-4c1c-9544-5b767e063f10_1892x2098.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 19: The NVIDIA Nemotron V2 model collection on Hugging Face.</figcaption></figure></div><p>The models are zero-day compatible with vLLM, so you can follow the tutorial on the model card on <a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16">HuggingFace</a> to start a vLLM server and build a client to interact with it. 
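As an illustration of what such a client could look like, here is a minimal sketch that builds an OpenAI-style chat request for a vLLM server. The server URL, image URL, and prompt below are placeholders of my own; check the model card for the exact serving flags and served model name:

```python
import json

# vLLM exposes an OpenAI-compatible API, so a multimodal chat request is a plain
# JSON payload with an image_url content part. The image URL and prompt below
# are illustrative placeholders, not values taken from the model card.
def build_vision_request(model: str, image_url: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one image and one question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 512,
    }

payload = build_vision_request(
    "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16",
    "https://example.com/chart.png",
    "Summarize the key trend in this chart.",
)
# To send it against a locally started server, something like:
#   requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(json.dumps(payload)[:60])
```

Keeping the payload construction separate from the transport makes it easy to swap between a local vLLM server and a hosted endpoint without touching the message format.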
Additionally, you can test the model on <a href="https://openrouter.ai/nvidia/nemotron-nano-12b-v2-vl:free">OpenRouter</a>, <a href="https://www.baseten.co/blog/high-performance-agents-for-financial-services-with-nvidia-nemotron-on-baseten/#using-vision-language-models-in-financial-services">Baseten</a>, or <a href="https://app.hyperbolic.ai/models/nvidia-nemotron-nano-12b-v2-vl-bf16">Hyperbolic</a>, as well as deploy it pre-packaged using NVIDIA NIM <a href="https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/deploy">(NVIDIA Inference Microservice)</a>.</p><h2>6. Video Demo</h2><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;4443b5b5-552d-4efd-a276-d790abea7a6c&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>7. Conclusion</h2><p>In this article, we&#8217;ve covered the Nemotron Nano 2 VL, an open-source, small Agentic AI-ready model, specifically tuned for multi-image reasoning, document understanding, and video understanding.</p><p>We started by comparing it with the previous version, the Llama-3.1-Nemotron-VL, across benchmark results and architecture components, outlining how the newer model brings improvements at each level, from accuracy to inference throughput.</p><p>The Nano 2 VL model is the leading open SVLM (Small Vision Language Model) on a complex OCR benchmark (the OCRBenchV2), mainly thanks to the Training Recipe and Architecture improvements it brings.</p><p>We&#8217;ve also unpacked the Vision Encoder (RADIO v2.5-H), the Text Decoder (Nano 2 12B), and the EVS Plugin for long-video inputs, walked through the finetuning stages and datasets used to train this model, and showcased a short demo of the model on a set of 6 pages from a PDF document.</p><p><strong>Thank you for reading!</strong> See you next week!</p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Neural Bits is supported by its readers. If you enjoy reading my work, consider subscribing.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h4>References</h4><p>[1] NVIDIA. (2025). <em>Nemotron Nano V2 VL: Model report</em>. arXiv. <a href="https://arxiv.org/abs/2511.03929?utm_source=chatgpt.com">https://arxiv.org/abs/2511.03929</a></p><p>[2] NVIDIA. (2025). <em>Llama-3.1-Nemotron-Nano-VL-8B-V1</em>. Hugging Face. <a href="https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1?utm_source=chatgpt.com">https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1</a></p><p>[3] Bagrov, N., Khvedchenia, E., Tymchenko, B., Geifman, Y., Zilberstein, R., &amp; the NVIDIA team. (2025). <em>Efficient Video Sampling (EVS): Pruning temporally redundant tokens for faster VLM inference</em>. arXiv. <a href="https://arxiv.org/abs/2510.14624">https://arxiv.org/abs/2510.14624</a></p><p>[4] Heinrich, G., Ranzinger, M., Yin, H., Molchanov, P., &amp; Pan, Y. (2025). <em>RADIOv2.5: Improved baselines for agglomerative vision foundation models</em>. arXiv. <a href="https://arxiv.org/pdf/2412.07679">https://arxiv.org/pdf/2412.07679</a></p><p>[5] Wang, H., Maaz, M., Khan, S., &amp; colleagues. (2025). <em>LongVideoBench: A benchmark for long-context video understanding in large vision-language models</em>. arXiv. 
<a href="https://arxiv.org/abs/2407.15754">https://arxiv.org/abs/2407.15754</a></p><p>[6] Wang, H., &amp; Team. (2025). <em>MMBench-Doc: Benchmarking long-context document understanding with visualizations</em>. arXiv. <a href="https://arxiv.org/abs/2407.01523">https://arxiv.org/abs/2407.01523</a></p><p>[7] ling99. (n.d.). <em>OCRBench-v2 leaderboard</em>. Hugging Face Spaces. <a href="https://huggingface.co/spaces/ling99/OCRBench-v2-leaderboard">https://huggingface.co/spaces/ling99/OCRBench-v2-leaderboard</a></p><p>[8] LMMS-Lab. (n.d.). <em>AI2D dataset</em>. Hugging Face Datasets. <a href="https://huggingface.co/datasets/lmms-lab/ai2d">https://huggingface.co/datasets/lmms-lab/ai2d</a></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[The Future of Agentic AI is Small.]]></title><description><![CDATA[How NVIDIA&#8217;s Nemotron Nano V2 SLM, built for long-context reasoning, fast inference, can compete with models 4-5x its size.]]></description><link>https://read.theaimerge.com/p/the-future-of-agentic-ai-is-small</link><guid isPermaLink="false">https://read.theaimerge.com/p/the-future-of-agentic-ai-is-small</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 15 Nov 2025 14:47:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o88x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. 
Join over <strong><a href="https://multimodalai.substack.com/subscribe">7000+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>As you already know, I love technical deep dives.</p><p>This is Part I of a small 2-part series where I unpack the new NVIDIA Nemotron models.<br>In this first piece, I&#8217;m focusing on the text-only side of the family, the Nemotron Nano v2 Transformer-Hybrid models, which outcompete other popular and powerful open models on a large set of reasoning benchmarks.</p><p>In Part II of this series, I&#8217;ll cover the latest addition, the Nemotron Nano 2 VL 12B, a powerful VLM for multi-image reasoning, long-video reasoning, and document understanding.</p><p>In this mini-series:</p><p>&#8594;<strong> &#9989; Part I - The Nano 2 Family of SLMs for best-in-class Reasoning Models.<br></strong>&#8594; <strong>&#9989; <a href="https://multimodalai.substack.com/p/small-vlms-will-soon-compete-with">Part II - The Nano 2 VL Model for Agentic AI capabilities on Vision Tasks</a></strong></p><p>If you&#8217;re building agents, research workflows, or anything that relies on &#8220;thinking steps,&#8221; Nemotron Nano v2 is one of the most interesting architectures you can study right now.</p><p>(To navigate the topics, please use the Table of Contents on the left)</p><div><hr></div><h2>The NVIDIA Nemotron Family: Open Models for Agentic AI</h2><p>NVIDIA Nemotron is not only a model architecture, although it might seem so.</p><p>It&#8217;s a family of models, curated datasets, and finetuning recipes, all open-source under NVIDIA&#8217;s license, allowing developers to 
build efficient and specialized AI Systems.</p><blockquote><p><strong>Note: </strong>A training recipe describes the step-by-step workflow of finetuning a model: for instance, starting with SFT (Supervised Finetuning), followed by iterations of Reinforcement Learning using <em>Group Relative Policy Optimization</em> (<em>GRPO</em>).</p></blockquote><p>Nemotron models are openly available and integrated across the AI ecosystem, so they can be deployed anywhere - from edge to cloud. They are the result of NVIDIA&#8217;s push to create fully transparent models, helping developers build domain-specific AI agents and own the explainability of their outputs.</p><h4>Unpacking the Nemotron Model Family</h4><p>We can summarize the key details in 3 major components, as shown in Figure 1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NqoE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NqoE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 424w, https://substackcdn.com/image/fetch/$s_!NqoE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 848w, https://substackcdn.com/image/fetch/$s_!NqoE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NqoE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NqoE!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png" width="1200" height="593.4065934065934" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:720,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:737321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NqoE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 424w, https://substackcdn.com/image/fetch/$s_!NqoE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 848w, 
https://substackcdn.com/image/fetch/$s_!NqoE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 1272w, https://substackcdn.com/image/fetch/$s_!NqoE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F118f7be5-28e4-41c6-8bfa-315da25141bc_3066x1516.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: The NVIDIA Nemotron Family of Models, Training Recipes, and Open 
Datasets</figcaption></figure></div><h4>Competing with Frontier Models</h4><p>One notable example is <strong>Llama-3_3-Nemotron-Super-49B-v1_5</strong>, which builds on top of <strong>Meta Llama 3.3</strong> and is optimized for advanced reasoning and agentic AI tasks. At only 49B parameters, it ranks higher than Qwen3-32B, and it surpasses the previous Llama-Nemotron-Ultra-253B, a model 5x its size, across a large set of reasoning and scientific benchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bBhj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bBhj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 424w, https://substackcdn.com/image/fetch/$s_!bBhj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 848w, 
https://substackcdn.com/image/fetch/$s_!bBhj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!bBhj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bBhj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png" width="1456" height="661" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:661,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image" title="image" srcset="https://substackcdn.com/image/fetch/$s_!bBhj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 424w, https://substackcdn.com/image/fetch/$s_!bBhj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 848w, 
https://substackcdn.com/image/fetch/$s_!bBhj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!bBhj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F136b4c5d-d1ad-4139-9783-cb442a1389a7_2466x1120.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2. 
Llama Nemotron Super 49B, an older version of the Nemotron Models family, yields the highest accuracy in multiple reasoning benchmarks.</figcaption></figure></div><p>To put that in perspective, the Llama-Nemotron-Ultra-253B competes with Frontier Models such as Gemini 2.5 Pro and OpenAI o3-mini, and takes the lead over the popular DeepSeek R1 (a 650B+ reasoning model), the latest Llama 4 Maverick, GPT-4o, and others on the GPQA Benchmark, as shown in Figure 3.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GOGH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GOGH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 424w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 848w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!GOGH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png" width="1456" height="606" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:606,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:658931,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GOGH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 424w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 848w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 1272w, https://substackcdn.com/image/fetch/$s_!GOGH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45951d98-d453-4972-b8a0-8c334537c3c3_2464x1026.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3. Llama 3.1 Nemotron Ultra placed in the Top-5 on the GPQA Scientific Reasoning Benchmark, close to Frontier-Level Models.</figcaption></figure></div><blockquote><p><strong>Note</strong>: GPQA is a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. 
The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are &#8220;Google-proof&#8221;).</p></blockquote><p>These are the text-only models, capable of advanced reasoning, deep research, tool calling, and overall Agentic AI-related tasks. They are designed for enterprise-grade accuracy and performance, compatible with popular Inference Frameworks (TGI, vLLM, TensorRT-LLM) or prebuilt as Inference Microservices (NVIDIA NIM) for large-scale deployment.</p><p>Companies such as CrowdStrike, ServiceNow, and DataRobot already use Nemotron in their products, and even NVIDIA uses it internally to help with new GPU Chip Designs. <a href="https://huggingface.co/blog/nvidia/nemotron-open-models-data">[3]</a></p><h4>The Nemotron Nano V2</h4><p>While the older generation of models built on known architectures, the Nemotron V2 iteration was designed from the ground up, achieving not only higher accuracy than similar models in the SLM (Small Language Model) class, but also up to 6-7x higher throughput on long sequences with reasoning mode active.</p><p>The Nano V2 versions, 12B and 9B, both feature improvements in:</p><ul><li><p><strong>Architecture:</strong> a hybrid Transformer&#8209;Mamba backbone where most layers are Mamba&#8209;2 state&#8209;space modules with a few attention ones.</p></li><li><p><strong>Thinking budgets: </strong>letting you control how many reasoning tokens the model spends. 
</p></li><li><p><strong>Context: </strong>longer context windows, up to 128K tokens.</p></li><li><p><strong>Performance:</strong> higher inference throughput, especially on long-sequence tasks.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kU25!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kU25!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 424w, https://substackcdn.com/image/fetch/$s_!kU25!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 848w, https://substackcdn.com/image/fetch/$s_!kU25!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!kU25!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kU25!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png" width="1456" height="610" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:610,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kU25!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 424w, https://substackcdn.com/image/fetch/$s_!kU25!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 848w, https://substackcdn.com/image/fetch/$s_!kU25!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 1272w, https://substackcdn.com/image/fetch/$s_!kU25!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e86d4b6-c951-4bb3-8e98-46bfe9216671_2460x1030.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4. The Nemotron Nano v2 9B Distilled model outperforms Qwen3 8B on tasks and benchmarks requiring advanced reasoning capabilities. Taken from the Nano 2 9B Model Card.</figcaption></figure></div><p>This can be seen in more detail in the following Figure, which compares Nemotron Nano v2 9B with Qwen3-8B across various reasoning benchmarks, as well as inference throughput on long input and output sequences. In the figure, ISL stands for Input Sequence Length and OSL for Output Sequence Length. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PuXu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PuXu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 424w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 848w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 1272w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PuXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png" width="1456" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PuXu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 424w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 848w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 1272w, https://substackcdn.com/image/fetch/$s_!PuXu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F73aa4f17-4c78-4245-90e7-cebb39b77bad_2304x864.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. Comparing Nemotron Nano V2 9B with Qwen3 8B across benchmarks and inference throughput, where the Nano V2 takes the lead thanks to the Transformer-H and Mamba-2 architecture improvements.</figcaption></figure></div><p>One more important feature that the Nano V2 brings is the ability to control the &#8220;Thinking Budget&#8221; at runtime. During inference, the user can specify how many tokens the model is allowed to &#8220;think&#8221;, enabling controlled execution of Agentic AI tasks, especially when building Deep Research Agents or Multi-Agent systems that involve step-by-step reasoning. 
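To make the idea concrete, here is a minimal, framework-agnostic sketch of how client-side thinking-budget control can work, assuming a token-streaming generate() call and Nemotron-style &lt;think&gt;...&lt;/think&gt; markers. The generate() function below is a hypothetical stub, not NVIDIA's actual API: the point is only the budget logic around the stream.

```python
from typing import Iterator, List

def generate(prompt: str) -> Iterator[str]:
    # Hypothetical stub standing in for a real streaming inference call.
    yield from ["<think>", "step1", "step2", "step3", "step4", "</think>", "answer"]

def run_with_budget(prompt: str, thinking_budget: int) -> List[str]:
    # Accept reasoning tokens only while the budget lasts; once it is spent,
    # force-close the thinking block and drop the remaining reasoning tokens.
    out: List[str] = []
    in_think, spent = False, 0
    for tok in generate(prompt):
        if tok == "<think>":
            in_think = True
            out.append(tok)
        elif tok == "</think>":
            in_think = False
            if out[-1] != "</think>":
                out.append(tok)
        elif in_think:
            if spent < thinking_budget:
                spent += 1
                out.append(tok)
            elif out[-1] != "</think>":
                out.append("</think>")  # budget exhausted: close the block
        else:
            out.append(tok)  # answer tokens always pass through
    return out

print(run_with_budget("q", thinking_budget=2))
# → ['<think>', 'step1', 'step2', '</think>', 'answer']
```

With a budget of 2, the model's third and fourth reasoning steps are cut and the block is closed early, while the final answer still comes through; a large enough budget leaves the stream untouched.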
</p><p>That considerable jump in inference throughput comes from the Transformer-H Hybrid architecture, which replaces a large chunk of the Attention Layers with Mamba-2 Layers - an interesting concept and a newer approach to Transformer models that we&#8217;ll unpack in the next section.</p><div><hr></div><h2>The Transformer-H Architecture and Mamba-2 Layers</h2><p>The original Transformer architecture is built around self-attention plus feed-forward sub-layers, and it has been the backbone of nearly all current large language models.</p><p>Because of the attention mechanism, each token can incorporate information from any other token in the sequence: computing attention scores involves a large MatMul between Queries and Keys (whose result weights the Values), which gives very good in-context learning, as each token &#8220;looks&#8221; at every other token.</p><p>But the attention mechanism has <strong>quadratic cost</strong> O(n&#178;) in its vanilla form, and the KV-cache can grow large during decoding. 
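The quadratic growth is easy to see numerically. A tiny NumPy sketch (illustrative only, random weights in place of learned projections):

```python
import numpy as np

def attention_scores(n: int, d: int = 64) -> np.ndarray:
    # Vanilla self-attention scores: one entry per token pair -> an (n, n) matrix.
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((n, d))
    K = rng.standard_normal((n, d))
    return (Q @ K.T) / np.sqrt(d)

# Doubling the sequence length quadruples the score matrix (and the MatMul work).
for n in (128, 256, 512):
    print(n, attention_scores(n).shape)
```

Each doubling of n multiplies the number of score entries by four, which is exactly the O(n&#178;) scaling discussed above.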
Although there are multiple methods to compute attention, as we&#8217;ll see in the following Figure, the MxN matrix will still grow with each token being added to the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HciT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HciT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 424w, https://substackcdn.com/image/fetch/$s_!HciT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 848w, https://substackcdn.com/image/fetch/$s_!HciT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 1272w, https://substackcdn.com/image/fetch/$s_!HciT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HciT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png" width="1208" height="270" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:1208,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HciT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 424w, https://substackcdn.com/image/fetch/$s_!HciT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 848w, https://substackcdn.com/image/fetch/$s_!HciT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 1272w, https://substackcdn.com/image/fetch/$s_!HciT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ab197a1-e820-41bb-a3bc-e3fdcd6cc85b_1208x270.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 6. Types of Attention Patterns, presented in the Longformer paper, that illustrate how Attention scales quadratically. 
Taken from the Longformer 2020 paper.</figcaption></figure></div><p>The Mamba-2 layers take a different approach to this problem: instead of maintaining full attention between all tokens, they keep an internal hidden state that evolves as tokens are added to the sequence.</p><p>The original Mamba, introduced in &#8220;<a href="https://arxiv.org/abs/2405.21060">Mamba: Linear-Time Sequence Modeling with Selective State Spaces</a>,&#8221; demonstrated that <strong>State Space Models (SSMs)</strong> can compete with Transformers while scaling much more efficiently, thanks to their linear computation cost.</p><p>However, Transformers are better at tasks requiring precise recall of specific input elements, a capability that SSMs can struggle with because their compressed state is fixed in size and may lose details.</p><p>A great in-depth article on how Mamba and SSMs work was written by Maarten Grootendorst in his <a href="https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state">Exploring Language Models Newsletter</a>. Below is an attached animation from that article. 
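To contrast with attention's growing score matrix, here is a toy linear recurrence in the spirit of an SSM layer. The A, B, C matrices are random placeholders, not a real Mamba parameterization; the point is that each token updates a fixed-size hidden state in one left-to-right pass.

```python
import numpy as np

def ssm_scan(x: np.ndarray, d_state: int = 16) -> np.ndarray:
    # One pass over the sequence: compute is O(n) in sequence length,
    # and the state h never grows, unlike attention's (n, n) score matrix.
    n, d_in = x.shape
    rng = np.random.default_rng(0)
    A = rng.uniform(0.5, 0.99, size=d_state)        # diagonal state decay
    B = 0.1 * rng.standard_normal((d_state, d_in))  # input projection
    C = 0.1 * rng.standard_normal((d_in, d_state))  # output projection
    h = np.zeros(d_state)
    ys = []
    for t in range(n):
        h = A * h + B @ x[t]   # h_t = A * h_{t-1} + B x_t
        ys.append(C @ h)       # y_t = C h_t
    return np.stack(ys)

y = ssm_scan(np.ones((10, 4)))
print(y.shape)  # (10, 4): one output per token, state stayed at d_state
```

Because all history is squeezed into h, memory per step is constant, which is also why precise recall of a specific earlier token can be harder than with attention.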
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cam8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cam8!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!cam8!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!cam8!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!cam8!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cam8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cam8!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!cam8!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!cam8!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!cam8!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5547a54a-5f47-41c0-a2ee-2ba5bcceefb8_1920x1080.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 
2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Mamba-2 extends this idea using a theoretical framework called <strong>State Space Duality (SSD)</strong>, which combines the two based on the deep connections between attention mechanisms and SSMs.</p><p>The Mamba-2 layer acts as a <strong>state-space operator</strong>:</p><ul><li><p>It maintains a continuous hidden state;</p></li><li><p>Updates this state token-by-token using learned transition dynamics;</p></li><li><p>Produces outputs that capture long-range dependencies without needing explicit attention maps.</p></li></ul><p>At each step, the new state = a scaled version of the previous state plus the new input. In Mamba-2, this SSM idea is refined into the Structured State Space Duality (SSD) layer, a design that shows SSMs and attention are mathematically linked. 
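The per-token update can be sketched as a toy recurrence. This is an illustration only: the scalar parameters `a`, `b`, `c` below are hypothetical stand-ins for the learned SSM parameters (which in Mamba-2 are input-dependent, and operate on a state matrix rather than a scalar):

```python
# Toy scalar SSM recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t
# 'a', 'b', 'c' stand in for the learned (A, B, C) parameters of an SSM;
# real Mamba-2 layers make these depend on the input (the "selective" part).

def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    h = 0.0              # hidden state: fixed size, regardless of sequence length
    ys = []
    for x in xs:         # one O(1) update per token -> O(N) total cost
        h = a * h + b * x    # scaled previous state plus the new input
        ys.append(c * h)     # output read from the compressed state
    return ys

# The first token's influence decays geometrically through the state:
print([round(y, 2) for y in ssm_scan([1.0, 0.0, 0.0])])  # [1.0, 0.9, 0.81]
```

Note that the hidden state never grows, so each token costs a constant amount of work, versus the pairwise scores over all previous tokens that full attention computes.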
</p><p>The SSD layer can run in two ways (thus the Duality)<br>-  A linear, recurrent mode (inference)<br>-  A matrix-based &#8220;attention&#8221; mode (efficient for training).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!iJ6R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!iJ6R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 424w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 848w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!iJ6R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png" width="1456" height="735" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1056468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!iJ6R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 424w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 848w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!iJ6R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e0a59dd-a81b-401d-b251-bf6cec18d455_2832x1430.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7. 
Illustration of Mamba-2 SSD and how a new token that joins the sequence influences the hidden state update.</figcaption></figure></div><div><hr></div><h4>The Nemotron Nano V2 Architecture</h4><p>To dive a bit more into the architecture details, this model combines Mamba-2 Layers with Attention and FFN layers, trimming most of the Attention layers, and replacing them with Mamba-2 Layers.</p><p>As presented in the &#8220;<a href="https://arxiv.org/abs/2508.14444">NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model</a>&#8221; research paper, if we extract the Model Architecture, we can see the exact configuration of layers being used.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5ZIL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5ZIL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 424w, https://substackcdn.com/image/fetch/$s_!5ZIL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 848w, https://substackcdn.com/image/fetch/$s_!5ZIL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5ZIL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5ZIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png" width="1456" height="508" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278427,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5ZIL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 424w, https://substackcdn.com/image/fetch/$s_!5ZIL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 848w, 
https://substackcdn.com/image/fetch/$s_!5ZIL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 1272w, https://substackcdn.com/image/fetch/$s_!5ZIL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d30522a-ee6b-4f02-820b-c35749717534_2692x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Specifically, in the Nano V2 architecture, there are 62 total layers, with 28/28 split between FFN and Mamba Layers and 6 Attention 
Layers. That is stated directly in the Model Architecture section of the paper, where I&#8217;ve outlined the relevant parts in yellow.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-future-of-agentic-ai-is-small?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading Neural Bits! This post is public, so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/the-future-of-agentic-ai-is-small?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/the-future-of-agentic-ai-is-small?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mkBB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mkBB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 424w, https://substackcdn.com/image/fetch/$s_!mkBB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 848w, 
https://substackcdn.com/image/fetch/$s_!mkBB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 1272w, https://substackcdn.com/image/fetch/$s_!mkBB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mkBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png" width="1456" height="498" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:738775,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mkBB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 424w, 
https://substackcdn.com/image/fetch/$s_!mkBB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 848w, https://substackcdn.com/image/fetch/$s_!mkBB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 1272w, https://substackcdn.com/image/fetch/$s_!mkBB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F642bcfb8-2493-4ae9-80ec-732568a3ec25_2726x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can&#8217;t completely remove the Attention layers, as they provide global context, something Mamba-2 layers can struggle with: as the sequence grows, each new token updates the fixed-size hidden state, so the state may lose information about earlier tokens.</p><p>Keeping a few Attention layers preserves that global context, while the Mamba-2 layers handle state transitions much more cheaply than full attention, which largely increases the throughput of the model during inference. </p><p>If we inspect the Nano V2 architecture using the HuggingFace Safetensors Explorer, we can see the exact details mentioned above. In this next Safetensors view, we look directly at the tensor structure and how the tensors are mapped by the HuggingFace Transformers library when composing the model and loading its weights.</p><p>We see 62 layers; notice the `mixer` component in the layer names, as this is specific to Mamba-2 layers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IuuP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IuuP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 424w, https://substackcdn.com/image/fetch/$s_!IuuP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 848w, 
https://substackcdn.com/image/fetch/$s_!IuuP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 1272w, https://substackcdn.com/image/fetch/$s_!IuuP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IuuP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png" width="550" height="411.36675824175825" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1089,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:923634,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IuuP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 424w, 
https://substackcdn.com/image/fetch/$s_!IuuP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 848w, https://substackcdn.com/image/fetch/$s_!IuuP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 1272w, https://substackcdn.com/image/fetch/$s_!IuuP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa67d8dc-ff44-4042-b6be-113eb2ce67d4_2362x1767.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8. Viewing the Nemotron Nano V2 Safetensors file, with the layer structure, shapes, and precision types.</figcaption></figure></div><p>In the context of Mamba / Mamba-2, the &#8220;mixer&#8221; refers to the module that processes the token sequence dimension via SSM mechanisms rather than (or alongside) standard attention. In practice, the difference between Attention and SSM layers is most noticeable at inference time.</p><h4>Comparing Attention with Mamba-SSM</h4><p>Because Mamba keeps an internal state, its cost scales linearly with the sequence length: each new token in the sequence simply acts as a signal to update the hidden state.</p><p>For transformers, that token is added to the NxN attention matrix, and its attention score is computed w.r.t. all the other tokens in the sequence.</p><p>The result is that once the sequence grows, attention&#8217;s quadratic memory use eventually yields OOM errors, whereas Mamba continues to scale linearly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o88x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o88x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 424w, https://substackcdn.com/image/fetch/$s_!o88x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 848w, 
https://substackcdn.com/image/fetch/$s_!o88x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!o88x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o88x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png" width="1456" height="519" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:657273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/178766528?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o88x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 424w, 
https://substackcdn.com/image/fetch/$s_!o88x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 848w, https://substackcdn.com/image/fetch/$s_!o88x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!o88x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F586c1afa-b2c9-4b44-866a-27d8a98e99ce_3066x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9. Throughput comparison between Mamba + SSM layers and Attention: Mamba scales linearly with sequence length, whereas the Transformer hit OOM at the same batch size.</figcaption></figure></div><div><hr></div><h2>Getting Started with the Nemotron Nano V2</h2><p>As a hands-on exercise, let&#8217;s serve the Nemotron Nano V2 model and then build a Python client to try out the Thinking Budget approach these models introduce.</p><p>As with any model on <a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2/tree/main">HuggingFace</a>, you can use and test it directly with the Transformers library, plug-and-play.</p><pre><code>import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-Nano-9B-v2")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Nano-9B-v2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "/think"},
    {"role": "user", "content": "Tell me a story about Mamba-2"},
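# Back-of-envelope VRAM check (assumption: BF16 stores 2 bytes per parameter):
# the ~9B weights alone take about 18 GB, before activations and any cache.
approx_weight_gb = 9e9 * 2 / 1e9  # ~18.0 GB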
]

tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    tokenized_chat,
    max_new_tokens=32,
    eos_token_id=tokenizer.eos_token_id
)
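# Note: outputs[0] holds the prompt tokens followed by the completion;
# slicing outputs[0][tokenized_chat.shape[-1]:] before decoding would
# print only the newly generated text.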
print(tokenizer.decode(outputs[0]))</code></pre><blockquote><p>In BF16 precision, the 9B-parameter weights alone take roughly 18 GB of VRAM (2 bytes per parameter), so plan for that plus activation and cache overhead if you want to load and run it on a GPU.</p></blockquote><p>This call will parse the repository, load the tokenizer, and unpack the Safetensors checkpoint shards, loading them into memory and constructing the model graph: the layers, tensor shapes, precision types, and so on.</p><p>This process of parsing a model, loading the weights, and preparing the model graph is broadly the same across inference engines, from vLLM and TGI to Transformers, llama.cpp, and more.</p><h4>Using a Thinking Budget Client with vLLM</h4><p>With vLLM, you start a server to host the model, then build a Python client that connects to the server and sends requests.</p><ol><li><p><strong>Starting the Server</strong> - you also need to pass `--mamba_ssm_cache_dtype`, as this model uses a hybrid Mamba-Transformer architecture with Mamba-2 layers.</p><pre><code>pip install -U "vllm&gt;=0.10.1"
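# Assumption: vLLM defaults apply, i.e. an OpenAI-compatible server on port 8000.
# --max-num-seqs caps concurrent sequences; --mamba_ssm_cache_dtype float32
# keeps the Mamba-2 SSM state cache in full precision.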
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
    --trust-remote-code \
    --max-num-seqs 64 \
    --mamba_ssm_cache_dtype float32</code></pre></li><li><p><strong>Build the Client in Python</strong></p><pre><code>from typing import Any, Dict, List

import openai
from transformers import AutoTokenizer


class ThinkingBudgetClient:
   def __init__(self, base_url: str, api_key: str, tok_path: str):
       self.base_url = base_url
       self.api_key = api_key
       self.tokenizer = AutoTokenizer.from_pretrained(tok_path)
       self.client = openai.OpenAI(base_url=self.base_url, api_key=self.api_key)


   def chat_completion(
       self,
       model: str,
       messages: List[Dict[str, Any]],
       max_thinking_budget: int = 512,
       max_tokens: int = 1024,
       **kwargs,
   ) -&gt; Dict[str, Any]:
       assert (
           max_tokens &gt; max_thinking_budget
       ), f"thinking budget must be smaller than the maximum new tokens. Given {max_tokens=} and {max_thinking_budget=}"


       # 1. first call chat completion to get reasoning content
       response = self.client.chat.completions.create(
           model=model, messages=messages, max_tokens=max_thinking_budget, **kwargs
       )
       content = response.choices[0].message.content


       reasoning_content = content
       if "&lt;/think&gt;" not in reasoning_content:
           # the budget cut the reasoning off; close it with a period and the tag
           reasoning_content = f"{reasoning_content}.\n&lt;/think&gt;\n\n"
       reasoning_tokens_len = len(
           self.tokenizer.encode(reasoning_content, add_special_tokens=False)
       )
       remaining_tokens = max_tokens - reasoning_tokens_len
       assert (
           remaining_tokens &gt; 0
       ), f"remaining tokens must be positive. Given {remaining_tokens=}. Increase the max_tokens or lower the max_thinking_budget."


       # 2. append reasoning content to messages and call completion
       messages.append({"role": "assistant", "content": reasoning_content})
       prompt = self.tokenizer.apply_chat_template(
           messages,
           tokenize=False,
           continue_final_message=True,
       )
       response = self.client.completions.create(
           model=model, prompt=prompt, max_tokens=remaining_tokens, **kwargs
       )


       response_data = {
           # str.strip("&lt;/think&gt;") would strip characters, not the tag; use removesuffix
           "reasoning_content": reasoning_content.strip().removesuffix("&lt;/think&gt;").strip(),
           "content": response.choices[0].text,
           "finish_reason": response.choices[0].finish_reason,
       }
       return response_data</code></pre></li></ol><ol start="3"><li><p><strong>Sending the Request</strong> - limiting the model to 32 thinking tokens.</p><pre><code>tok_path = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
client = ThinkingBudgetClient(
   base_url="http://localhost:8000/v1",
   api_key="EMPTY",
   tok_path=tok_path,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. /think"},
    {"role": "user", "content": "What is 2+2?"},
]

result = client.chat_completion(
   model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",
   messages=messages,
   max_thinking_budget=32,
   max_tokens=512,
   temperature=0.6,
   top_p=0.95,
)
print(result)</code></pre></li></ol><p>You should see a response similar to the one below, with the reasoning cut short by our max_thinking_budget:</p><pre><code>{'reasoning_content': "Okay, the user asked, What is 2+2? Let me think. Well, 2 plus 2 equals 4.", 'content': '2 + 2 equals **4**.\n', 'finish_reason': 'stop'}</code></pre><div><hr></div><h3>Conclusion</h3><p>In Part I of this mini-series, we started with an overview of the NVIDIA Nemotron family of models, datasets, and training recipes, covering the improvements the Nemotron models bring, the datasets and techniques they were trained with, and how they rank on popular reasoning benchmarks for agentic AI tasks.</p><p>In short, NVIDIA&#8217;s Nemotron line aims to deliver small, capable, agentic-AI-ready models that run fast, require little compute, and can be deployed at scale from cloud to edge.</p><p>We also covered the most notable architectural change, replacing most traditional Attention layers with SSM (State Space Model) blocks, specifically Mamba-2 layers, and explained the differences, benefits, and impact this has on inference throughput.</p><p>In Part II of this series, we&#8217;ll cover Nemotron Nano 2 VL 12B, a model with the same capabilities as Nano 2 plus a vision modality, which leads on OCRBench v2, a very complex benchmark on document understanding and multi-image reasoning.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://multimodalai.substack.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share&quot;,&quot;text&quot;:&quot;Share Neural Bits&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://multimodalai.substack.com/?utm_source=substack&amp;utm_medium=email&amp;utm_content=share&amp;action=share"><span>Share Neural Bits</span></a></p><p>Thanks for reading, 
don&#8217;t miss the next article!</p><div><hr></div><h3>References:</h3><p>[1] NVIDIA. (2025). <em>Nemotron Nano V2 Reasoning Benchmarks.</em> HuggingFace.<br><a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2">https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2</a></p><p>[2] NVIDIA. (2025). <em>Nemotron Nano 2 VL Model Card.</em> HuggingFace.<br><a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16">https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16</a></p><p>[3] NVIDIA. (2025). <em>Nemotron&#8217;s Open Secret: Accelerating AI Development with Open Models, Data, and Recipes.</em> HuggingFace.co.<br><a href="https://huggingface.co/blog/nvidia/nemotron-open-models-data">https://huggingface.co/blog/nvidia/nemotron-open-models-data</a></p><p>[4] Rein et al. (2023). <em>GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark.</em> ArXiv.org.<br><a href="https://arxiv.org/abs/2311.12022">https://arxiv.org/abs/2311.12022</a></p><p>[5] NVIDIA Research. (2025). <em>Nemotron Nano 2: Hybrid Mamba-Transformer Architecture.</em><br><a href="https://arxiv.org/abs/2508.14444">https://arxiv.org/abs/2508.14444</a></p><p>[6] Dao &amp; Gu. (2024). <em>Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (Mamba-2).</em><br><a href="https://arxiv.org/abs/2405.21060">https://arxiv.org/abs/2405.21060</a></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[What You Don’t See: My Engineering Process Behind Each Article]]></title><description><![CDATA[My Working Desk. 
How I capture ideas, sketch System Design Diagrams, Write and Build AI Projects for this newsletter.]]></description><link>https://read.theaimerge.com/p/what-you-dont-see-my-engineering</link><guid isPermaLink="false">https://read.theaimerge.com/p/what-you-dont-see-my-engineering</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 01 Nov 2025 14:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MaBs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. Join over <strong><a href="https://multimodalai.substack.com/subscribe">6700+ engineers</a></strong> and learn to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><strong>How does the behind-the-scenes of writing this Newsletter look?</strong></p><p>What you see and read in here is the final result of days, and sometimes 1-2 weeks of tinkering, sketching, building, editing, maybe recording, and finally publishing.</p><p>I spend a lot of time designing, coding, breaking things, fixing them again, and writing about what I learn. But the parts I rarely show are what happens before any code runs.</p><p>For this article, I initially thought to publish another deep dive on AI Systems. 
But as I didn&#8217;t feel it was complete yet, for my standards, I wanted to slow down a bit and show you the hidden side of the process.</p><p>Even though this post isn&#8217;t technical, it captures something just as important - the thinking process that drives good engineering work.</p><blockquote><p><strong>Tip: </strong>If you&#8217;re an Engineer, you&#8217;ll quickly notice that your job isn&#8217;t just to write code.</p></blockquote><p>You&#8217;ll spend just as much time, if not more, reading and writing. </p><p>Explaining how systems work. Reviewing architecture documents. Writing design and tooling proposals.</p><p>Communicating clearly becomes just as valuable as building efficiently.</p><p>On that note, let me show you the behind-the-scenes of my work, building AI Systems and then writing and explaining them.</p><div><hr></div><h3>Everything happens mostly here, at my desk</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MaBs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MaBs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 424w, https://substackcdn.com/image/fetch/$s_!MaBs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 848w, https://substackcdn.com/image/fetch/$s_!MaBs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!MaBs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MaBs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic" width="728" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d538134d-b384-456e-b228-ce99f8a1385e.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:2739203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MaBs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 424w, https://substackcdn.com/image/fetch/$s_!MaBs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 848w, https://substackcdn.com/image/fetch/$s_!MaBs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!MaBs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd538134d-b384-456e-b228-ce99f8a1385e.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">My working desk.</figcaption></figure></div><p>Every topic I write about usually begins as a single line I drop into my Content System in Notion. 
It might come from a bug I&#8217;m trying to fix, a concept I want to understand, or just a random curiosity while reading.<br><br>Whenever I find something useful, a paragraph, a code snippet, a paper, or a full blog post - I save it. </p><p>Over time, I gather a lot of notes that I could filter and group into a rough plan or a more general topic that addresses a potential reader&#8217;s purpose.</p><p>On weekends, I go through those notes again and try to group them under a bigger topic I could turn into an article. </p><blockquote><p><strong>Tip: </strong>This is a habit I&#8217;ve learned from a Staff Engineer, of keeping an internal progress log in Apple Notes, Obsidian or Markdown files.</p></blockquote><p>It&#8217;s simple but powerful. When you come back to a project later, you already have the full context waiting for you. Fun fact, it&#8217;s a bit like building your own RAG system, where your notes are the knowledge base, and your brain is the LLM model.</p><p>Back at it, after iterating the notes, I start to plan the structure, compose the diagrams and visuals, and jump to implementation.</p><div><hr></div><h3>A real example of the Writing Process</h3><p>In January of this year, I published an article on the basics of <a href="https://multimodalai.substack.com/p/the-mlai-engineers-starter-guide">GPU Programming for AI Engineers</a>, one of my most-read pieces so far, with over 12k views and 110 Likes. </p><p>It came from a pile of scattered notes: CUDA, GPU architecture, custom kernels in C++ and Python, and the new Triton language from OpenAI. 
Around that time, I was also studying Unsloth and JIT, trying to understand how it optimizes LLM finetuning using Triton kernels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eNON!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eNON!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 424w, https://substackcdn.com/image/fetch/$s_!eNON!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 848w, https://substackcdn.com/image/fetch/$s_!eNON!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 1272w, https://substackcdn.com/image/fetch/$s_!eNON!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eNON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png" width="1456" height="658" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177309,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eNON!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 424w, https://substackcdn.com/image/fetch/$s_!eNON!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 848w, https://substackcdn.com/image/fetch/$s_!eNON!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 1272w, https://substackcdn.com/image/fetch/$s_!eNON!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7acf2be8-9f22-413e-b8bc-d9d3afa8e115_1884x852.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Some Notes in my Notion Board where I capture ideas and information about a topic.</figcaption></figure></div><p>All these notes were connected. So I grouped them, built a few examples, added visuals, and turned it into a clear and easy-to-digest guide. </p><blockquote><p><strong>Tip: </strong>Good writing is clear thinking made visible. This is an insight I&#8217;ve learned from &#8220;On Writing Well&#8221; by William Zinsser [2]</p></blockquote><p>While I was also learning myself, I put my thoughts and gotchas on paper. </p><p><em>Why did it resonate so much?</em> </p><p>Because AI engineers rarely build their own GPU kernels, researchers do.</p><p>But understanding how Kernels and GPUs work is a major advantage for anyone working with AI. 
</p><p>Thus, a hands-on introduction to how GPUs work (what CUDA is, what VRAM is, what a kernel looks like, and how to build one in C++, CudaPy, and Triton) proved to be so helpful.</p><div><hr></div><h3>A real example of the Designing Process</h3><p>Writing code for simple tutorials is easy. </p><p>You could do the initial scaffolding and share it as a Jupyter Notebook or a single Python script.</p><p>Building a bigger project with multiple components requires sketches and system design first.</p><p>On top of that, explaining each component and its role is very difficult if you don&#8217;t already have that blueprint set.</p><blockquote><p><strong>Note: </strong>Just as in any software application, re-designing components on the run adds tech debt and more engineering hours for refactoring later. </p></blockquote><p>And that is something that&#8217;s often swept under the rug when shipping new features is the priority.</p><p>Before I write any code, I usually try to nail down how everything fits together. 
At first, pen and paper, most of the time nothing fancy - just quick diagrams to help me see how the pieces fit together.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SYKs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SYKs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 424w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 848w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 1272w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SYKs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2471529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SYKs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 424w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 848w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 1272w, https://substackcdn.com/image/fetch/$s_!SYKs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F853c5039-36e0-4349-ab84-ae96bb42f618_5712x4284.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This is a System Design sketch of the Data Pipeline on a course I&#8217;m working on.</figcaption></figure></div><p>However, using quick sketches doesn&#8217;t cover everything. I had multiple instances where something I thought about initially doesn&#8217;t scale, has a major flaw, or simply doesn&#8217;t work.</p><blockquote><p><strong>Tip:</strong> Chip Huyen&#8217;s book &#8216;Designing Machine Learning Systems&#8217; [5] contains really good advice on how to properly design ML Systems.</p></blockquote><p>There were times when I had a flashing idea of how to add a component or remove another one entirely from my initial plan. 
In my case, these ideas stay latent in the back of my mind and come up at unusual times.</p><p>For example, one instance I remember was when I was reading Atomic Habits [1] by James Clear, relaxing on the couch, and suddenly, I got the idea of how I could fix the Storage structure component in one of my projects.</p><p>Totally unexpected, no connection between the book I was reading and the initial diagram sketch I&#8217;ve done a few weeks prior.</p><p>Usually, those ideas don&#8217;t stay in your mind for too long, so you've got <strong>to act fast.</strong></p><p>In that case, a system design diagram can turn into something like this, which is messier, but closer to the final working version.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!18cw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!18cw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 424w, https://substackcdn.com/image/fetch/$s_!18cw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 848w, https://substackcdn.com/image/fetch/$s_!18cw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 1272w, https://substackcdn.com/image/fetch/$s_!18cw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!18cw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic" width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec4387eb-0f0f-42b7-a921-6764652b6a5f.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2974650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!18cw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 424w, https://substackcdn.com/image/fetch/$s_!18cw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 848w, https://substackcdn.com/image/fetch/$s_!18cw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 1272w, https://substackcdn.com/image/fetch/$s_!18cw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4387eb-0f0f-42b7-a921-6764652b6a5f.heic 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This is the result of a flashing idea, put on paper as quickly as possible.</figcaption></figure></div><p>Just like in real engineering work, only after a system is reviewed and iterated over can you get to a plan to start implementing and turning sketches into real working components.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!29ny!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!29ny!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 424w, https://substackcdn.com/image/fetch/$s_!29ny!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 848w, https://substackcdn.com/image/fetch/$s_!29ny!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 1272w, https://substackcdn.com/image/fetch/$s_!29ny!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!29ny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png" width="728" height="701.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1403,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1400409,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!29ny!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 424w, https://substackcdn.com/image/fetch/$s_!29ny!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 848w, https://substackcdn.com/image/fetch/$s_!29ny!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 1272w, https://substackcdn.com/image/fetch/$s_!29ny!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd3b617e2-b3b3-4ad5-b8da-96ef62ed6660_2143x2065.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A few final diagrams, results of multiple sketches and re-iterations.</figcaption></figure></div><p>Before reaching this stage, there&#8217;s a lot of invisible work, revisiting assumptions, balancing trade-offs, and running thought experiments before any code exists.</p><blockquote><p><strong>Note: </strong>In real AI/ML or SWE projects, you won&#8217;t do this alone, but as part of a team. Understanding how this process works keeps everyone on the same page.</p></blockquote><p>That&#8217;s the heart of <strong>engineering thinking</strong>. 
The cheapest bugs to fix are the ones you catch in design, not in deployment. That matters even more for AI/ML projects than for standard software and its SDLC (Software Development Lifecycle).</p><h4>The big plot twist</h4><p>In software engineering, we can design, build, test, and deploy in a fairly linear way. All the tooling and components are established, resilient, and tested in production.</p><p>In AI/ML, the cycle loops back on itself (MLOps). Standardization is still a work in progress.</p><p>We&#8217;re at the phase where we have a lot of tools and techniques, but few standards and production-tested methods.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/what-you-dont-see-my-engineering?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Found this helpful?
Share this post!</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/p/what-you-dont-see-my-engineering?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/p/what-you-dont-see-my-engineering?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><p>That&#8217;s why designing AI Systems is, at times, far more complex than designing standard applications.</p><div><hr></div><h3>A real example of the Building Process</h3><p>Once the system design feels solid enough, or at least has fewer gaps, I start building.</p><blockquote><p><strong>Note: </strong>As an example, I&#8217;ll be using a recent AI Agent-based project I&#8217;ve worked on.</p></blockquote><p>I begin by setting up the foundations, mostly starting with a single blank Python script that will get me something working as fast as possible. At this stage, I usually avoid optimizations or abstractions - just focusing on getting something to run end-to-end, in pure Python.</p><p>I don&#8217;t finetune or optimize models yet, version or track my prompts, add in databases, or throw in MCP and Agents.
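</p><p>As a sketch of what that first pass can look like: one file, plain functions, no framework code. Everything below is illustrative, not the actual project code.</p>

```python
"""A single-file first pass: no abstractions, no databases, just an end-to-end flow."""


def load_inputs() -> list[str]:
    # Stand-in for real ingestion (files, an API, a queue, ...).
    return ["fix login bug", "write release notes"]


def build_prompt(task: str) -> str:
    # The prompt lives inline for now; it can be extracted to Markdown later.
    return f"You are a task planner. Break down the task: {task}"


def run_pipeline() -> list[str]:
    # The only goal at this stage: something that runs end-to-end.
    return [build_prompt(task) for task in load_inputs()]


if __name__ == "__main__":
    for prompt in run_pipeline():
        print(prompt)
```

<p>Once this runs end-to-end, each function becomes a natural seam to split out into its own component later.</p><p>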
</p><p>If I need multiple customizable prompts, for example, I&#8217;ll extract them from Python and save them alongside code, in Markdown.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zEPC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zEPC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 424w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 848w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 1272w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zEPC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png" width="492" height="355.9764705882353" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:680,&quot;resizeWidth&quot;:492,&quot;bytes&quot;:62084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zEPC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 424w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 848w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 1272w, https://substackcdn.com/image/fetch/$s_!zEPC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9360a01d-d7dc-47df-8230-c9fc287599f9_680x492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An initial structure for storing prompts alongside code.</figcaption></figure></div><p>Next, I split the project into components to separate concerns and add strict models that will define the workflow, since I already have the system design diagrams in place. 
For example, I can define a collection of reusable <strong>Pydantic models</strong> that act as contracts between components.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9LJw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9LJw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 424w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 848w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 1272w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9LJw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png" width="482" height="299.8702290076336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/418a1144-eb63-4a65-ac59-110e64826772_524x326.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:326,&quot;width&quot;:524,&quot;resizeWidth&quot;:482,&quot;bytes&quot;:30348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9LJw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 424w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 848w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 1272w, https://substackcdn.com/image/fetch/$s_!9LJw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F418a1144-eb63-4a65-ac59-110e64826772_524x326.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of a collection of reusable Pydantic Models.</figcaption></figure></div><p>These data models help me validate input/output, keep types consistent, and debug issues in isolation. </p><p>Some of the next steps would be to build the minimal functionality on each component, as defined in the System diagram, and aim to test them in integration.</p><p><strong>As I&#8217;m reaching the email length limit, I&#8217;ll skip over the details of those steps for now</strong>, but this is usually where structuring the project is something I&#8217;ll consider.</p><p>As the codebase grows, I want it to stay easy to navigate and evolve, without different components becoming tangled in ways they shouldn&#8217;t.</p><p>The exact setup can vary. 
</p><p>I usually follow the Clean Architecture software pattern, which I&#8217;m most familiar with.</p><p>A nice template for your AI projects that follows the Clean Architecture pattern is this one by <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!LZBx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;53841b44-45c9-43bb-b8c7-3d3cf76cba58&quot;}" data-component-name="MentionToDOM"></span> (from The Neural Maze [4]). I recommend it if you want to start fresh, or if you have an AI project that has become cluttered and want to bring a scalable, readable structure to it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-pU6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-pU6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 424w, https://substackcdn.com/image/fetch/$s_!-pU6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 848w,
https://substackcdn.com/image/fetch/$s_!-pU6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!-pU6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-pU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png" width="1456" height="1133" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1133,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312891,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/176640695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-pU6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 424w, 
https://substackcdn.com/image/fetch/$s_!-pU6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 848w, https://substackcdn.com/image/fetch/$s_!-pU6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!-pU6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78cf6872-d28a-4403-8371-fecb0143f9ff_1850x1440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Agent API Cookiecutter template from Miguel at <a href="https://github.com/neural-maze/agent-api-cookiecutter/tree/main">The Neural Maze</a>.</figcaption></figure></div><p>Be it Clean Architecture [3], Clean Architecture with DDD, or Vertical Slice, the key idea remains the same: keep boundaries clear and logic isolated.</p><div><hr></div><h2>Ending Notes</h2><p>Every article, project, or idea starts the same way - rough notes and bad sketches.</p><p>If there&#8217;s one thing I&#8217;ve learned through all of this, it&#8217;s that <strong>clarity comes from iteration</strong>. The more you think, design, and rebuild, the simpler things become.</p><p>That&#8217;s when everything finally clicks.</p><p>I hope this gave you a few insights into how to think about your own engineering process.</p><p>Thank you for reading. See you next week! </p><div><hr></div><h3>References</h3><p>[1] <em>Atomic Habits: An Easy &amp; Proven Way to Build Good Habits &amp; Break Bad Ones</em>. (2025, June 24). James Clear. <a href="https://jamesclear.com/atomic-habits">https://jamesclear.com/atomic-habits</a></p><p>[2] <em>Amazon.com: On Writing Well, Zinsser, William: </em><a href="https://www.amazon.com/Writing-Well-Classic-Guide-Nonfiction/dp/0060891548">https://www.amazon.com/Writing-Well-Classic-Guide-Nonfiction/dp/0060891548</a></p><p>[3] <em>Clean Architecture: A Craftsman&#8217;s Guide to Software Structure and Design (Robert C. Martin Series)</em> <a href="https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164">https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164</a></p><p>[4] Pedrido, M. O. (2025, August 6). <em>The Neural Maze</em>. Substack.com; The Neural Maze. 
<a href="https://theneuralmaze.substack.com/">https://theneuralmaze.substack.com/</a></p><p>&#8204; [5] <em>Amazon.com: Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications: 9781098107963: Huyen, Chip: Books</em>. (2025). Amazon.com. <a href="https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969">https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969</a></p><p>&#8204;</p>]]></content:encoded></item><item><title><![CDATA[The Complete Guide to Ollama: Local LLM Inference Made Simple]]></title><description><![CDATA[A deep dive into Ollama&#8217;s architecture, going through model management, OpenAI API schema and local inference integrations with CLI, Docker and Python.]]></description><link>https://read.theaimerge.com/p/the-complete-guide-to-ollama-local</link><guid isPermaLink="false">https://read.theaimerge.com/p/the-complete-guide-to-ollama-local</guid><pubDate>Sat, 25 Oct 2025 13:02:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7m9l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. 
Join over <strong><a href="https://multimodalai.substack.com/subscribe">6600+ engineers</a></strong> and build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>It has been seven years since the original GPT architecture was developed, and roughly three years since solutions to deploy LLMs both locally and in the cloud became mainstream.</p><p>Just two years ago, we got the first releases of llama.cpp and vLLM, soon followed by GGUF, Ollama, SGLang, and many others. At first, their roles weren&#8217;t entirely clear, but over time, each found its niche and purpose within the LLM ecosystem.</p><p>A key responsibility of an AI Engineer is to design an AI System, going through all of its components.
From data ingestion to building the knowledge base, preference finetuning the LLMs, prompt engineering, building Agents, and enhancing them with tools, memory, and evaluation workflows, and the list goes on.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UKaT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UKaT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 424w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 848w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UKaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:510273,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UKaT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 424w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 848w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!UKaT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4588496-b5a7-4b83-a1c4-dc3dce5cfb0d_3750x3750.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1: The landscape of AI solutions split by layers, where Inference is a small but crucial part of the entire picture.</figcaption></figure></div><p>However, before starting work on any of the above, an AI Engineer has to think through the entire scope of a project and select the most appropriate tools for each workload across the AI development pipeline.</p><p>This time, we&#8217;re talking about Inference, specifically LLM Inference solutions for rapid prototyping of any system that relies on LLMs or Agents.</p><p>The AI landscape is <strong>large and still growing</strong>, with more tools and frameworks joining the race, but only a few deserve your time and attention, judging by the most common use cases an AI Engineer would work on.</p><p>In this article, 
we&#8217;ll focus on <strong>Ollama</strong>, which is currently one of the simplest and most efficient ways to run and develop with LLMs locally, without the burden of complex setups and configurations.</p><p>Let&#8217;s define the structure of this article.</p><h2>Table of Contents</h2><ol><li><p><em>What is Ollama, and why does it matter</em></p></li><li><p><em><strong>[VIDEO]</strong> What happens once you install Ollama</em></p></li><li><p><em>The High-Level Architecture</em></p></li><li><p><em><strong>[VIDEO]</strong> Customizing Ollama Models Locally</em></p></li><li><p><em><strong>[VIDEO]</strong> Adding Models from HuggingFace</em></p></li><li><p><em><strong>[VIDEO]</strong> Running Ollama OpenAI API Python Client</em></p></li><li><p><em>Running Ollama Server in Docker</em></p></li><li><p><em>Conclusion</em></p></li></ol><div><hr></div><h2>1. What is Ollama, and why does it matter</h2><p>Ollama is an open-source inference framework that simplifies the process of running LLMs locally. The repository has over 500 contributors and <strong><a href="https://github.com/ollama/ollama">over 150k stars</a></strong>, making it a mature, active, and well-maintained codebase with frequent releases, which enforces its position in the LLM Inference landscape.</p><blockquote><p><strong>Pro Tip:</strong> Whenever considering a tool or framework to use, always look for the liveness of the codebase, stars, contributions, PRs and releases and pick something that&#8217;s active and established within the field.</p></blockquote><p>Using <strong><a href="https://ollama.com/">Ollama</a></strong>, you can easily prototype and build applications on top of LLMs without having to connect to APIs or Cloud LLM Providers. 
Although it also provides a model registry and exposes an API just like cloud providers do, with Ollama your LLM models run entirely on your local machine.</p><p>What is the Key Value Proposition of Ollama?</p><ol><li><p>We should start with <em>Privacy</em>, running LLMs on your local hardware.</p></li><li><p>Next would be <em>Accessibility</em>, as it simplifies the setup massively. For example, to get your model running locally with `llama.cpp`, you might spend a few hours around the `Makefile`.</p></li><li><p>Third, <em>Customization</em>: bring your own models to be served with Ollama, with a few extra steps to wrap models that aren&#8217;t already in the registry in a Modelfile.</p></li><li><p>Lastly, <em>Quantization</em> allows LLM models to run on older-generation NVIDIA GPUs, lower-VRAM GPUs, AMD GPUs, Apple Silicon, traditional CPUs, or even edge hardware.</p></li></ol><p>Among developers working on LLM-powered applications, Ollama is one of the top choices for the local inference serving component.</p><p>Before diving into the interesting bits of how Ollama works, let&#8217;s first understand how it&#8217;s installed on your system, as there are a few subtle differences across macOS, Linux, and Windows.</p><h2>What happens once you install Ollama</h2><p>Ollama is built in Go (Golang), a compiled language like C, C++, or Rust. </p><p>That means that when you install it, a binary compiled for your specific CPU architecture and OS configuration is copied onto your system. On Windows, the installer comes as an `.exe`, on macOS it comes as a `.pkg`, whereas on Linux, you install it via an `install.sh` script.</p><h4>MacOS and Windows</h4><p>On Windows and Mac, the setup is quite straightforward in terms of what&#8217;s happening under the hood. 
You go to the Ollama website, select your OS, <a href="https://ollama.com/download/mac">download the installer</a>, and then a GUI guides you through the process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aWzz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aWzz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 424w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 848w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 1272w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aWzz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png" width="579" height="331.07577092511013" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1135,&quot;resizeWidth&quot;:579,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Ollama Webpage&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Ollama Webpage" title="Ollama Webpage" srcset="https://substackcdn.com/image/fetch/$s_!aWzz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 424w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 848w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 1272w, https://substackcdn.com/image/fetch/$s_!aWzz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67fccdfa-5089-4d77-bcbe-7a1412e6972d_1135x649.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2. Ollama landing page when downloading the Ollama binary.</figcaption></figure></div><p>For Linux, on the other hand, a few more steps take place, which you might be curious about, and that&#8217;s what we&#8217;re diving into in the next section.</p><h4>Ollama on Linux</h4><p>On Linux, we get an `install.sh` script, which sets up Ollama on our Linux machine. 
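</p><p><em>For reference, assuming the commands currently shown on the Ollama download page, the default route is the official one-liner; downloading the script first lets you review it before running:</em></p><pre><code># Official one-liner: fetch install.sh and run it
curl -fsSL https://ollama.com/install.sh | sh

# Or download it first to inspect or customize before running
curl -fsSL https://ollama.com/install.sh -o install.sh
sh install.sh</code></pre><p>Either way, the script performs the setup steps we walk through below.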
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_7LJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_7LJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 424w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 848w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 1272w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_7LJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png" width="627" height="433.80340264650283" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:732,&quot;width&quot;:1058,&quot;resizeWidth&quot;:627,&quot;bytes&quot;:64995,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_7LJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 424w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 848w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 1272w, https://substackcdn.com/image/fetch/$s_!_7LJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe15959e5-2842-4185-82f0-b0c5e1af4d01_1058x732.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. The Ollama install option for Linux Distributions.</figcaption></figure></div><p>Most of the time, you&#8217;d be installing it using the provided command, but for engineers who require customization or want to disable specific Flags, there&#8217;s a Manual Install workflow one could take. For instance, as an AI Engineer installing Ollama locally, you might want to:</p><ol><li><p><em><strong>Disable</strong></em> Ollama from starting automatically when your System boots up.</p></li><li><p>Test Ollama on CPU only, <em><strong>disable HW</strong></em> Acceleration.</p></li><li><p><strong>Pin Ollama</strong> to a specific CUDA version on your System.</p></li></ol><blockquote><p><strong>Tip: </strong>For all the steps mentioned above, you&#8217;ll want to manually install it. 
For a default installation, which is the common option, just use the install.sh script.</p></blockquote><h4>The System Configuration for Ollama on Linux</h4><p>Let&#8217;s take a short but deep dive into what happens underneath when you run the install.sh script to install Ollama on Linux. To summarize, the script performs three steps:</p><ol><li><p><em>Installing the System Dependencies</em></p><pre><code># "require" is a helper defined earlier in install.sh; it prints the names of any tools that are missing
NEEDS=$(require curl awk grep sed tee xargs)
if [ -n "$NEEDS" ]; then
    status "ERROR: The following tools are required but missing:"
    for NEED in $NEEDS; do
        echo "  - $NEED"
    done
    exit 1
fi

for BINDIR in /usr/local/bin /usr/bin /bin; do
    echo $PATH | grep -q $BINDIR &amp;&amp; break || continue
done
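# Note: the loop above keeps the first standard bin directory already
# present on $PATH; the install root derived below is its parent
# directory (e.g. BINDIR=/usr/local/bin gives /usr/local)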
OLLAMA_INSTALL_DIR=$(dirname ${BINDIR})</code></pre></li><li><p><em>Configuring Ollama as a SystemD Service</em></p><pre><code>cat &lt;&lt;EOF | $SUDO tee /etc/systemd/system/ollama.service &gt;/dev/null
[Unit]
Description=Ollama Service
After=network-online.target

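# Note: the install script also creates the dedicated "ollama" user and
# group referenced below, so the server runs as an unprivileged service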
[Service]
ExecStart=$BINDIR/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=$PATH"

[Install]
WantedBy=default.target</code></pre></li><li><p><em>Installing GPU Libraries (if GPU present)</em></p><pre><code>check_gpu() {
    # Look for devices based on vendor ID for NVIDIA and AMD
    case $1 in
        lspci)
            case $2 in
                nvidia) available lspci &amp;&amp; lspci -d '10de:' | grep -q 'NVIDIA' || return 1 ;;
                amdgpu) available lspci &amp;&amp; lspci -d '1002:' | grep -q 'AMD' || return 1 ;;
            esac ;;
        lshw)
            case $2 in
                nvidia) available lshw &amp;&amp; $SUDO lshw -c display -numeric -disable network | grep -q 'vendor: .* \[10DE\]' || return 1 ;;
                amdgpu) available lshw &amp;&amp; $SUDO lshw -c display -numeric -disable network | grep -q 'vendor: .* \[1002\]' || return 1 ;;
            esac ;;
        nvidia-smi) available nvidia-smi || return 1 ;;
    esac
}</code></pre></li></ol><p>Follow this <strong>live-coding</strong> tutorial, where we walk through all these components.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;be91c36f-6e92-48a3-99d5-6f49ab29504f&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>The High-Level Architecture</h2><p>When you download Ollama for Mac or Windows, you get an executable that, once started, automatically brings up an HTTP server as a process in the system&#8217;s applications. </p><p>Ollama is composed of three components: Model, Server, and Inference Engine. Starting with the model, this is your GGUF LLM Checkpoint.</p><blockquote><p><strong>Tip: </strong>To see a similar deep-dive on GGUF, GGML and llama.cpp, see <strong><a href="https://multimodalai.substack.com/p/an-ai-engineers-guide-to-running">this previous article.</a></strong></p></blockquote><p>Given that, Ollama is not actually handling the heavy processing of the AI inference. 
It acts as an orchestration layer on top, with the heavy lifting done by the llama.cpp &#8594; GGML &#8594; GGUF pipeline.</p><p>In Figure 4 below, we show a summarized diagram of how everything couples together:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jmyy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jmyy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 424w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 848w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 1272w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jmyy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png" width="271" height="472.41085840058696" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2376,&quot;width&quot;:1363,&quot;resizeWidth&quot;:271,&quot;bytes&quot;:340639,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jmyy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 424w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 848w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 1272w, https://substackcdn.com/image/fetch/$s_!Jmyy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe37b1c64-b618-4392-9b79-bf2d938961e3_1363x2376.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4. The simplified workflow of how Ollama works on top of GGML and Llama.cpp</figcaption></figure></div><p>From a workflow perspective, this is how Ollama handles things under the hood:</p><ol><li><p>You have an LLM Model Checkpoint in GGUF</p></li><li><p>When Ollama loads the model, it starts a llama.cpp inference server as a separate process, and passes the GGUF model path to it.</p></li><li><p>The llama.cpp server unpacks the GGUF file using the GGML library and creates a computation graph of the model.</p></li><li><p>When Ollama receives a Prompt, it routes it to the llama.cpp server that handles the workload.</p></li><li><p>Ollama then streams the decoded text tokens to the caller or returns the full response.</p></li></ol><p>The takeaway here is that Ollama is an easy-to-set-up, easy-to-use abstraction. 
Getting started with llama.cpp directly involves building the binary for your system architecture, running manual scripts, and learning additional commands. </p><p>We&#8217;re unpacking Ollama first, as it&#8217;s both the fastest and most familiar way to get up and running with local LLMs.</p><blockquote><p><strong>Note</strong>: In future tutorials, we might also look into low-level llama.cpp for advanced users. </p></blockquote><h4>The End-to-End Ollama Workflow</h4><p>In Figure 5 below, we map every action that takes place once you start a model using the Interactive CLI (Terminal) to chat with your model. Starting with loading using `ollama run &lt;model_name&gt;:&lt;version&gt;`, the HTTP Server will route the request and start a `llama.cpp` server to act as the Inference Engine.</p><p>Then, with each input prompt, the Ollama server routes it to llama.cpp for inference, and streams back the generated tokens.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7m9l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7m9l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 424w, https://substackcdn.com/image/fetch/$s_!7m9l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 848w, 
https://substackcdn.com/image/fetch/$s_!7m9l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 1272w, https://substackcdn.com/image/fetch/$s_!7m9l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7m9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png" width="1456" height="1216" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1216,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:462650,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!7m9l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 424w, 
https://substackcdn.com/image/fetch/$s_!7m9l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 848w, https://substackcdn.com/image/fetch/$s_!7m9l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 1272w, https://substackcdn.com/image/fetch/$s_!7m9l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a3cc8e4-87e4-48fa-bb2a-ea6d1e61d629_2861x2390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5: The Ollama model workflow, from loading the model to routing chat requests to the llama.cpp inference engine.</figcaption></figure></div><p>Running pre-defined models is pretty straightforward, with no additional steps required. However, for adding custom models, we have a few options: downloading a model from the Ollama Library, editing a model locally, or bringing our own model from a HuggingFace GGUF checkpoint.</p><h2>Ollama Models Locally</h2><p>Before showcasing how to register new models with Ollama, we first have to explain what a Modelfile is.</p><blockquote><p><strong>Def:</strong> From the documentation, a Modelfile is the blueprint to create and share models with Ollama.</p></blockquote><p>If you think of an Ollama model as a Docker image, the Modelfile is similar to the Dockerfile used to build the layers for that image. In an Ollama Modelfile, we store the source of the model, parameters such as the chat template and the TopK/TopP sampling settings, documentation, license, and other additional fields as a blueprint configuration for the model.</p><p>Here&#8217;s a basic example (<a href="https://ollama.readthedocs.io/en/modelfile/#basic-modelfile">from the docs</a>):</p><pre><code>FROM llama3.2
# sets the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1
# sets the context window size to 4096, this controls how many tokens the LLM can use as context to generate the next token
PARAMETER num_ctx 4096
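# the article also mentions TopK/TopP sampling; these are the matching
# Modelfile parameters (both documented; the values here are illustrative)
# sample only from the 40 most likely tokens
PARAMETER top_k 40
# then keep the smallest token set whose cumulative probability exceeds 0.9
PARAMETER top_p 0.9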

# sets a custom system message to specify the behavior of the chat assistant
SYSTEM You are Mario from super mario bros, acting as an assistant.</code></pre><p>Also, for downloading models locally, you could simply use the `ollama pull` command and inspect the <strong><a href="https://ollama.com/library">Ollama Model Library</a></strong> for any model.</p><pre><code>ollama pull gemma3:4b</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!meGp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!meGp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 424w, https://substackcdn.com/image/fetch/$s_!meGp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 848w, https://substackcdn.com/image/fetch/$s_!meGp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!meGp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!meGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png" width="1456" height="1181" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1181,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155577,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!meGp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 424w, https://substackcdn.com/image/fetch/$s_!meGp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 848w, https://substackcdn.com/image/fetch/$s_!meGp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!meGp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ef7ac64-0d32-49db-b635-cdadebe1fa6d_1492x1210.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 6: The Ollama Model Library with pre-configured GGUF checkpoints, alongside their Modelfiles.</figcaption></figure></div><p>In the following <strong>live-coding segment</strong>, we&#8217;ll go through:</p><ul><li><p>Viewing Model files for existing Models</p></li><li><p>Editing the Gemma3-1B Model file and registering a new Model</p></li><li><p>Comparing model versions by changing the Sampling Configurations</p></li></ul><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;135315bc-24ac-4286-9bf5-81ae80eaca9d&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>Adding Models from HuggingFace</h2><p>As a second option, we could download GGUF checkpoints for any LLM model from HuggingFace and register them as a custom model in our local Ollama 
registry.</p><p>In the following live-coding segment, we&#8217;ll go through:</p><ul><li><p>Creating a Python Environment with UV</p></li><li><p>Installing the HuggingFace Hub CLI</p></li><li><p>Authenticating using an Access Token</p></li><li><p>Downloading Phi-3-mini-4k-instruct-Q4_K.gguf</p></li><li><p>Registering and testing it in Ollama</p></li></ul><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;ca7a6831-0cd6-4fd1-aa5a-201a5dba5257&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>Running Ollama OpenAI API Python Client</h2><p>In the previous examples, we&#8217;ve showcased the interactive Ollama client, which runs in the CLI. That&#8217;s more of a playground for rapid prototyping, to test how your model responds to different queries.</p><p>For a real AI project, however, we need to define the client as part of an API or the application itself. In this section, we&#8217;ll showcase how one could create an <a href="https://ollama.com/blog/openai-compatibility">Ollama Client</a> in Python using the <a href="https://github.com/openai/openai-openapi?tab=readme-ov-file">OpenAI API </a>schema-compatible endpoints.</p><blockquote><p><strong>Info:</strong> When OpenAI launched the GPT-3 API in mid-2020, it was the first widely available commercial <strong><a href="https://github.com/openai/openai-openapi?tab=readme-ov-file">API</a></strong>, so developers had time to familiarize themselves with its schema formats and endpoints. 
Furthermore, to address the same developer base, new frameworks and LLM providers chose to &#8220;adapt&#8221; to this schema format.</p></blockquote><p>The popular endpoints that you might&#8217;ve seen elsewhere if you&#8217;ve worked with LLM providers&#8217; APIs are:</p><ul><li><p><em><strong>/v1/completions</strong></em>: Text completions for a single, free-form text prompt.</p></li><li><p><em><strong>/v1/chat/completions</strong></em>: Generates conversational responses.</p></li><li><p><em><strong>/v1/models</strong></em>: Lists the available models that the server can load from cache.</p></li><li><p><em><strong>/v1/embeddings</strong></em>: Returns the raw embeddings of a text input.</p></li></ul><p>In the following video segment, we&#8217;re going to demonstrate how one can build and connect to the Ollama Server directly from Python, following the OpenAI API schema.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a17ebd94-c4a8-43dd-8f83-358867aed9da&quot;,&quot;duration&quot;:null}"></div><h2>Running Ollama Server in Docker</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ALZm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ALZm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 424w, https://substackcdn.com/image/fetch/$s_!ALZm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png
848w, https://substackcdn.com/image/fetch/$s_!ALZm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 1272w, https://substackcdn.com/image/fetch/$s_!ALZm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ALZm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png" width="473" height="345.57077625570776" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:960,&quot;width&quot;:1314,&quot;resizeWidth&quot;:473,&quot;bytes&quot;:114604,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ALZm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 424w, 
https://substackcdn.com/image/fetch/$s_!ALZm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 848w, https://substackcdn.com/image/fetch/$s_!ALZm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 1272w, https://substackcdn.com/image/fetch/$s_!ALZm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82b3551c-4280-41c9-bceb-a230ef910f26_1314x960.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7: Ollama is available as a Docker Image. (<strong><a href="https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image">Source</a></strong>)</figcaption></figure></div><p>If you build a multi-container application, the best solution is to run both the Ollama server and your application client as separate Docker containers. Ollama has been available as an official Docker image since 2023, and you can start it up quite easily.</p><h4><strong>For CPU Only</strong></h4><pre><code>docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama</code></pre><h4><strong>For GPU Accelerated (NVIDIA)</strong></h4><p>You have to expose the GPUs using `--gpus=all` so that Docker allows the container to use the GPU interface (this requires the NVIDIA Container Toolkit on the host).</p><pre><code>docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama</code></pre><h4><strong>For starting a model</strong></h4><pre><code>docker exec -it ollama ollama run gemma3:1b</code></pre><h4><strong>Service in Docker Compose</strong></h4><p>Using this setup, you could add more services to your Docker Compose file and build a more advanced application with a robust, reproducible deployment.</p><pre><code>version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - &#8220;11434:11434&#8221;
    volumes:
      - &lt;your_cache&gt;:/root/.ollama
    networks:
      - ollama-net
    environment:
      - OLLAMA_HOST=0.0.0.0
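      # optional server tuning variables (commented out; both are documented
      # Ollama settings, the values shown are illustrative):
      # - OLLAMA_KEEP_ALIVE=5m         # how long a model stays loaded after the last request
      # - OLLAMA_MAX_LOADED_MODELS=2   # cap on models kept in memory concurrently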
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
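
  # a hypothetical client service (the "app" name and its image are
  # placeholders) showing how an application container could reach the
  # Ollama server over the shared network via its OpenAI-compatible endpoint
  app:
    image: your-app-image:latest
    depends_on:
      - ollama
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434/v1
    networks:
      - ollama-net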

networks:
  ollama-net:
    driver: bridge

volumes:
  &lt;your_cache&gt;:</code></pre><div><hr></div><h2>Ollama Cloud</h2><p>Ollama&#8217;s cloud is a new way to run open models using datacenter-grade hardware. Many new models are too large to fit on widely available GPUs or run very slowly. Ollama&#8217;s cloud provides a way to run these models fast while using Ollama&#8217;s App, CLI, and API.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ZF0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ZF0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 424w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 848w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 1272w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ZF0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png" width="1330" height="964" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:964,&quot;width&quot;:1330,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114935,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175878656?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ZF0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 424w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 848w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 1272w, https://substackcdn.com/image/fetch/$s_!3ZF0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F154e07f5-ebf7-4e44-9da2-29f18967157d_1330x964.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Conclusion</h2><p>In this article, we&#8217;ve explored Ollama in depth, how it works under the hood, and how you can install, configure, and experiment with models using the CLI, Docker, or Python.</p><p>Ollama is one of the most, if not the most, developer-friendly solutions for running LLMs locally. It builds on top of the llama.cpp runtime while abstracting away the complexity of model loading, quantization, and inference management. 
Its built-in high-performance HTTP server, written in Go, handles the multi-model orchestration and efficient request handling, all out of the box.</p><p>If you&#8217;re prototyping with small checkpoints, fine-tuning your own models, or deploying local LLMs for your Chatbots or Agentic Applications, <strong>Ollama is the best solution for that.</strong></p><p>In the first two articles, we covered llama.cpp, GGML, GGUF and Ollama. In the next series, we&#8217;ll move further towards building a more complex AI System and Agents, where we&#8217;ll be using Ollama and GGUF model checkpoints for local inference.</p><p>Stay tuned!</p><div><hr></div><p>Images and Media were generated by the author, if not otherwise stated.</p><div><hr></div><h4>References</h4><p><em>[1] Ollama</em>. (2024). Ollama. <a href="https://ollama.com/blog/openai-compatibility">https://ollama.com/blog/openai-compatibility</a></p><p><em>[2] Ollama</em>. (2023). Ollama. <a href="https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image">https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image</a></p><p><em>[3] OpenAI Platform</em>. (2025). Openai.com. <a href="https://platform.openai.com/docs/api-reference/introduction">https://platform.openai.com/docs/api-reference/introduction</a></p><p><em>[4] openai/openai-openapi: OpenAPI specification for the OpenAI API</em>. (2023, June 19). GitHub. <a href="https://github.com/openai/openai-openapi?tab=readme-ov-file">https://github.com/openai/openai-openapi?tab=readme-ov-file</a></p><p>[5] Razvant, A. (2025, October 18). <em>An AI Engineer&#8217;s Guide to Running LLMs on CPUs, GPUs, and Edge Devices</em>. Substack.com; Neural Bits. <a href="https://multimodalai.substack.com/p/an-ai-engineers-guide-to-running">https://multimodalai.substack.com/p/an-ai-engineers-guide-to-running</a></p><p>[&#8204;&#8204;6] Razvant, A. (2025, February 20). <em>Understanding LLM Inference</em>. Substack.com; Neural Bits. 
<a href="https://multimodalai.substack.com/p/understanding-llm-inference">https://multimodalai.substack.com/p/understanding-llm-inference</a></p><p>&#8204;</p>]]></content:encoded></item><item><title><![CDATA[I’ve Partnered with NVIDIA! 🔥]]></title><description><![CDATA[This means I can distill direct insights from NVIDIA experts!]]></description><link>https://read.theaimerge.com/p/ive-partnered-with-nvidia</link><guid isPermaLink="false">https://read.theaimerge.com/p/ive-partnered-with-nvidia</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 04 Oct 2025 14:09:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GhcX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I write about practical, production-ready AI/ML Engineering. Join over <strong><a href="https://multimodalai.substack.com/subscribe">6200 engineers</a></strong> and build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><strong>Hey all, awesome news!</strong></p><p>A big reason I started the Neural Bits Newsletter was to distill, showcase, and teach the real-world side of AI, avoiding hype and focusing on building real AI systems.</p><p>As a senior AI engineer, my technical writing focuses mainly on deep dives: going low-level into details, unpacking how things work, and then building systems that make it all come together. </p><p>My philosophy is simple: <strong>first deeply understand AI, then build with AI</strong>. 
<br>That&#8217;s what I&#8217;ve been doing for the past 8 years.</p><div><hr></div><h3><strong>How it started</strong></h3><p>I&#8217;ve been covering NVIDIA and AI for quite a while. One of my first articles on the newsletter was unpacking <a href="https://multimodalai.substack.com/p/deep-dive-into-nvidia-nims-for-generative">NVIDIA NIM</a>, then Triton Server, Dynamo, and more.</p><p>That caught NVIDIA&#8217;s eye. Many people working for NVIDIA engaged, reshared, and even messaged me about my content on AI/ML.</p><p>Over the past few months, I&#8217;ve had a few chats with NVIDIA, attended workshops, been invited to pre-release sessions, in-depth walkthroughs, and developer sessions. </p><p>Now that got supercharged!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oiof!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Oiof!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 424w, https://substackcdn.com/image/fetch/$s_!Oiof!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 848w, https://substackcdn.com/image/fetch/$s_!Oiof!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Oiof!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oiof!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png" width="1456" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:665284,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175269831?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oiof!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 424w, https://substackcdn.com/image/fetch/$s_!Oiof!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 848w, 
https://substackcdn.com/image/fetch/$s_!Oiof!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oiof!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png" width="1456" height="667" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:667,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:665284,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175269831?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oiof!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 424w, https://substackcdn.com/image/fetch/$s_!Oiof!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e4d1957-c3ba-46bd-bc4b-340a016c849a_1532x702.png 848w, 
RTX)</figcaption></figure></div><div><hr></div><h3><strong>A Technical Sneak Peek</strong></h3><p>Since I don&#8217;t want to keep this article purely informal, I&#8217;ve added a full walkthrough of the NVIDIA Nemotron models.</p><p>I&#8217;ve been keeping an eye on the entire landscape of AI models, lately the agentic and reasoning-focused ones. I&#8217;ve built with Llama 3, Qwen2.5-VL, Qwen3, and DeepSeek distilled versions.</p><p>Recently, the <a href="https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/">Nemotron Family</a> of models caught my eye.</p><blockquote><p>Reading the docs, I found multiple improvements, from post-training techniques to new compute kernels and even new neural network layers.</p></blockquote><p>I&#8217;m planning to build something around the Nano models to enable agentic reasoning on low-resource compute, and I&#8217;ll keep you updated on the topic. </p><p>For now, I&#8217;ve created this flowchart to cover how these models were built, fine-tuned, and improved.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GhcX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GhcX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 424w, https://substackcdn.com/image/fetch/$s_!GhcX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 848w, 
https://substackcdn.com/image/fetch/$s_!GhcX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!GhcX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GhcX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:754822,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/175269831?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GhcX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 424w, 
https://substackcdn.com/image/fetch/$s_!GhcX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 848w, https://substackcdn.com/image/fetch/$s_!GhcX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 1272w, https://substackcdn.com/image/fetch/$s_!GhcX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3327d8ad-c97d-4ccb-9f86-6bc893d72139_3750x3750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">NVIDIA Nemotron Model Family Landscape, techniques, improvements, variants, and real-world uses.</figcaption></figure></div><div><hr></div><h3><strong>What This Partnership Means</strong></h3><p>For over <strong>8 years</strong>, I&#8217;ve been working with AI and Deep Learning: building Vision AI Systems, RAG and Multimodal RAG, everything around LLMs, and recently Agents and Multi-Agent Systems. I&#8217;ve shared deep dives into the tools, frameworks, and workflows that AI engineers need to build and deploy robust, scalable AI workloads. </p><p>My content is deeply technical in the sense that I don&#8217;t stay at the surface: I go through the complete chain, from low-level concepts to best practices in building with AI.</p><p>My main goal is to bridge the gap between what&#8217;s real and what&#8217;s hype, and to talk about what an AI/ML Engineer really needs.</p><p>That means Concepts, GPUs, Data, Pipelines, Models, Architectures, AI Research, Software Engineering, Deployments, Scale, APIs, Optimizations, and more.</p><p>With this partnership, I&#8217;ll now be collaborating directly with <strong>NVIDIA experts</strong>, giving you access to exclusive insights, tutorials, and deep dives.</p><div><hr></div><h3><strong>BONUS - Free NVIDIA Webinar</strong></h3><p>From CUDA, TensorRT-LLM, and NeMo to Triton and Dynamo, each of these frameworks and libraries powers a different part of the AI development chain. 
For an AI Engineer, it can feel overwhelming at first to get through all of these.</p><p>Naturally, the next set of questions comes up:</p><ul><li><p>How can I get better at using NVIDIA&#8217;s software stack?</p></li><li><p>Which courses or certifications could I pick?</p></li></ul><p>This upcoming session will help you find those answers.</p><p>NVIDIA Training is hosting a FREE webinar to guide you through their Generative AI certification exams.</p><p><strong>Register here</strong>: https://nvda.ws/4mBCirc<br>&#128197; Tuesday, October 7, 2025 | 9:00 a.m. CET</p><p>Seats are limited. Use your corporate email for better chances.</p><div><hr></div><h3><strong>What You Can Expect</strong></h3><ul><li><p>I won&#8217;t push you to buy anything. </p></li><li><p>I&#8217;ll distill complex concepts from NVIDIA experts down to AI Engineers and aspiring AI enthusiasts.</p></li><li><p>I&#8217;ll keep the deep technical expertise in my articles.</p></li><li><p>I&#8217;ll focus on hands-on tutorials and practical projects you can follow step by step.</p></li></ul><div><hr></div><p>No fluff, no selling you anything - just real technical knowledge and actionable guidance on AI Engineering!</p><blockquote><p>Can&#8217;t wait to bring you the best out of this, straight to your inbox! 
&#129782;<br>Subscribe and stay tuned!</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Piece of advice for AI Engineers]]></title><description><![CDATA[Answering a subscriber&#8217;s question: &#8220;What should I know about NVIDIA&#8217;s AI Stack?&#8221;]]></description><link>https://read.theaimerge.com/p/piece-of-advice-for-ai-engineers</link><guid isPermaLink="false">https://read.theaimerge.com/p/piece-of-advice-for-ai-engineers</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 27 Sep 2025 13:30:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jg1-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Welcome to Neural Bits</strong>. Each week, my subscribers receive an edition on practical, production-ready AI/ML Engineering, building the skills you need in your AI/ML journey. 
Join over <strong><a href="https://multimodalai.substack.com/subscribe">6200 engineers</a></strong> and learn how to build real-world AI Systems.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>A few days ago, a subscriber preparing for an interview at NVIDIA asked me a simple but important question: </p><p><strong>&#8594; &#8220;What do I need to know about NVIDIA software to stand out?&#8221;</strong></p><p>That question made me realize something: while I&#8217;ve written about many of NVIDIA&#8217;s frameworks and libraries, I&#8217;ve never put together a clear step-by-step breakdown of what matters, where each tool fits, and why it&#8217;s important.</p><p>While most engineers focus on the application layer, building RAG and Agents or going deeper into PyTorch, very few dive into the NVIDIA ecosystem that, in many ways, powers those frameworks.</p><p>This article is my attempt to bridge that gap for you. To keep it structured, I&#8217;ve broken it into these parts:</p><ol><li><p>AI Engineer vs AI User</p></li><li><p>A Pragmatic Mindset against Hype</p></li><li><p>Future-proofing on NVIDIA&#8217;s AI tooling</p></li><li><p>What an AI Engineer should know about it</p></li></ol><p>Let&#8217;s decode it!</p><div><hr></div><h3>Setting the Ground</h3><h3>1. AI Engineer vs AI User</h3><p>First, we have to make this important distinction:</p><p>Being an <strong>AI Engineer</strong> isn&#8217;t the same as being an <strong>AI user</strong>. 
An engineer doesn&#8217;t just call models through an API, rely on black-box dashboards to track token usage, or outsource data cleaning and fine-tuning to external tools.</p><p>The title <em>engineer</em> implies ownership: you understand how each part of the AI system works, and you can actively design, optimize, and integrate those components yourself. </p><p>Making API calls is one thing - building robust AI systems is engineering.</p><h3>2. The Pragmatic Mindset</h3><p>AI is exciting, but the hype can mislead. Many roadmaps, courses, and shiny projects exist, but it&#8217;s crucial to <strong>separate what&#8217;s flashy from what&#8217;s useful</strong>.</p><p>Why? Let&#8217;s take AI Agents as an example.</p><blockquote><p>AI Agents are just a small part of what an AI System is and does. As an engineer, you should focus on the entire system, not just the shiny part. </p></blockquote><p>It&#8217;s easy to spin up a POC AI Agent today, but the real work begins when you face infrastructure, optimization, evaluation, data pipelines, security, and production deployment. 
</p><p>It&#8217;s better to keep a pragmatic mindset, master the software engineering concepts, and think in end-to-end systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jg1-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jg1-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 424w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 848w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 1272w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jg1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png" width="474" height="364.16042780748666" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1122,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:153138,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174676720?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!jg1-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 424w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 848w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 1272w, https://substackcdn.com/image/fetch/$s_!jg1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe45a3bc5-7053-4249-8923-0ae33123e45f_1122x862.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Speaking of a pragmatic mindset, here&#8217;s a snippet from <a href="https://newsletter.pragmaticengineer.com/p/the-ai-engineering-stack">The Pragmatic Engineer Newsletter article featuring Chip Huyen</a>: only a single mention of Agents, but multiple mentions of Infrastructure, Optimizations, Applications, and Security.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!REw3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!REw3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 424w, https://substackcdn.com/image/fetch/$s_!REw3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 848w, https://substackcdn.com/image/fetch/$s_!REw3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 1272w, https://substackcdn.com/image/fetch/$s_!REw3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!REw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png" width="618" height="344.2293956043956" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:165920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174676720?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!REw3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 424w, https://substackcdn.com/image/fetch/$s_!REw3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 848w, https://substackcdn.com/image/fetch/$s_!REw3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 1272w, https://substackcdn.com/image/fetch/$s_!REw3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48e11d54-218a-4899-9c64-8d457e582cde_1576x878.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2. AI Engineering Stack (feat. Chip Huyen) (<a href="https://newsletter.pragmaticengineer.com/p/the-ai-engineering-stack">source</a>)</figcaption></figure></div><div><hr></div><h3>3. Why NVIDIA Matters</h3><p>NVIDIA GPUs remain the <strong>standard for AI compute</strong>. Their ecosystem spans model training, inference, deployment, and optimization. Studying NVIDIA&#8217;s stack helps engineers understand the backbone of modern AI infrastructure. </p><blockquote><p>Why the <strong>standard</strong>? Here are three recent NVIDIA investments:</p><ol><li><p><strong>NScale raises $1.1 billion</strong> <a href="https://www.cnbc.com/2025/09/25/nvidia-backed-uk-ai-firm-nscale-raises-1point1-billion-funding-round.html#:~:text=London%2Dbased%20AI%20data%20center,from%20Nvidia%2C%20Nokia%20and%20Dell.">backed by NVIDIA for the EU AI Supercluster.</a></p></li><li><p><strong>xAI Colossus Cluster</strong> <a href="https://x.ai/colossus">(200k NVIDIA Hopper GPUs)</a></p></li><li><p><strong>Meta&#8217;s Gen AI Infrastructure</strong> <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/">(350k Hopper GPUs)</a></p></li></ol></blockquote><p>For you, studying and understanding NVIDIA&#8217;s AI ecosystem is a smart move, as it powers a large chunk of AI infrastructure.</p><p>Let&#8217;s decode it.</p><div><hr></div><h2>What an AI Engineer must know about GPUs</h2><p>On this topic, you should understand the basic hardware principles of what a GPU is, how it works, and all the key terms that describe its 
capabilities.</p><h4><strong>Learning the Hardware Components</strong></h4><ul><li><p><strong>Cores<br></strong>CUDA cores handle general parallel computation; think shaders and graphics rendering. Tensor cores are specialized for matrix multiplication, which is key to Transformer architectures.</p></li><li><p><strong>Memory Hierarchy </strong></p><ul><li><p><em><strong>VRAM</strong></em> - the global GPU memory. This is where your CPU copies the tensors, activations, and other data your GPU will use.</p></li><li><p><em><strong>Shared Memory</strong></em> - smaller but much faster memory, where GPU kernel data resides. Your GPU moves data from VRAM to shared memory and then uses it for computation, making it available across multiple threads.</p></li><li><p><strong>Registers </strong>- the smallest and fastest: thread-private memory for temporary data, used by the SMs (Streaming Multiprocessors) during computation.</p></li></ul></li><li><p><strong>Precision</strong></p><ul><li><p><strong>FP32 </strong>- the 32-bit floating-point type, the standard training format.</p></li><li><p><strong>FP16/BF16 </strong>- faster and lighter on memory, commonly used for inference.</p></li><li><p><strong>FP8</strong> - a newer precision format, supported on recent GPU architectures.</p></li></ul></li></ul><blockquote><p>To get practical, see my article on GPU Programming with code.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5ac06222-3f34-40f3-876e-d7faf2b072ee&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The AI/ML Engineer's starter guide to GPU Programming&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex 
Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-30T08:00:51.206Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b90e43-1a47-4fd6-bcfb-8a4def1765cb_1539x1536.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/the-mlai-engineers-starter-guide&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:155911551,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:106,&quot;comment_count&quot;:4,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><h4><strong>Learning to read a GPU Cheatsheet</strong></h4><p>The idea for sharing this came from Andrej Karpathy&#8217;s video on Large Language Models. He starts with a nice introduction on how LLMs work, and then, when going through model training, he shares the H100 GPU Cheatsheet to look into Precision formats. 
<a href="https://youtu.be/7xTGNNLPyMI?si=8pJS8ZNPfvTaps-R&amp;t=2441">(at 40:03 mark)</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.youtube.com/watch?v=7xTGNNLPyMI&amp;t=2441s" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FYkp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 424w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 848w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 1272w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FYkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1122773,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.youtube.com/watch?v=7xTGNNLPyMI&amp;t=2441s&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174676720?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FYkp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 424w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 848w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 1272w, https://substackcdn.com/image/fetch/$s_!FYkp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcbe5e8b2-ef1f-4fbe-adbc-35c0e3d920d2_1746x998.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 3. Andrej Karpathy, going over the NVIDIA H100&#8217;s Cheatsheet (<a href="https://youtu.be/7xTGNNLPyMI?si=8pJS8ZNPfvTaps-R&amp;t=2441">source</a>)</figcaption></figure></div><p>A cheatsheet contains the GPU&#8217;s performance details, covering bandwidth, precision formats, architecture variants, and more. It will help you understand why some GPUs are better suited to specific precisions and workloads, and how to reason about compute saturation and arithmetic intensity.</p><blockquote><p>As an AI Engineer, focus on learning, at a high level, how to extract the key details about a GPU from its cheatsheet. 
Find the A100 Cheatsheet below:</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2tpG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2tpG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 424w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 848w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 1272w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2tpG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png" width="450" height="726.5625" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1240,&quot;width&quot;:768,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:197112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174676720?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2tpG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 424w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 848w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 1272w, https://substackcdn.com/image/fetch/$s_!2tpG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0494dd7a-7f4a-4c81-8dfb-51335fc382c0_768x1240.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4. NVIDIA A100 Cheatsheet (<a href="https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf">source</a>)</figcaption></figure></div><div><hr></div><h4><strong>Optional Study on other GPU Variants</strong></h4><p>Strictly speaking, the variants covered here are DSAs (Domain-Specific Architectures), not GPUs. </p><p>During the Deep Learning boom (2012-2013), a GPU was still known as a graphics card, used to render complex graphics and shaders for video games. 
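Before moving on, here is a minimal sketch of how the cheatsheet numbers above translate into the arithmetic-intensity reasoning mentioned earlier. It assumes approximate A100 80GB SXM datasheet figures (~312 TFLOP/s FP16 Tensor Core peak, ~2 TB/s HBM bandwidth); treat the constants as assumptions to verify against the datasheet.

```python
# Roofline sketch: attainable throughput given a kernel's arithmetic intensity.
# Constants are approximate A100 80GB SXM datasheet values (assumptions).
PEAK_TFLOPS = 312.0      # FP16 Tensor Core peak, TFLOP/s
BANDWIDTH_TBPS = 2.0     # HBM2e bandwidth, TB/s

def attainable_tflops(intensity_flops_per_byte: float) -> float:
    """Performance is capped by compute or memory traffic, whichever is lower."""
    memory_ceiling = intensity_flops_per_byte * BANDWIDTH_TBPS  # TFLOP/s
    return min(PEAK_TFLOPS, memory_ceiling)

# The "ridge point" separates memory-bound from compute-bound kernels.
ridge = PEAK_TFLOPS / BANDWIDTH_TBPS  # 156 FLOPs per byte

# An element-wise op (~1 FLOP/byte) is heavily memory-bound, while a large
# FP16 matmul easily exceeds the ridge point and can saturate compute.
print(attainable_tflops(1.0), attainable_tflops(500.0))
```

Reading a cheatsheet this way tells you whether a given workload can ever saturate the GPU's compute, or whether memory bandwidth is the real ceiling.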
</p><p>Around the same time, ASICs emerged as DSAs, built specifically for crypto mining (BTC, ETH).</p><p>Manufacturers build DSAs today for reasons like these:</p><ul><li><p>A <strong>GPU</strong> has a memory hierarchy, which is general-purpose but not always optimal for AI workloads, especially the large matrix multiplications in Transformers.</p></li><li><p>A <strong>GPU</strong> is a <em>programmable, flexible accelerator</em>, but not as powerful or power-efficient as a DSA.</p></li><li><p>A <strong>DSA</strong>, on the other hand, is <strong>highly specialized hardware</strong> designed for a narrow workload, in our case, AI training and inference.</p></li></ul><p>Popular examples of DSAs for AI are Google&#8217;s <strong>TPU</strong> (Tensor Processing Unit, 2016), Cerebras, SambaNova, and Tenstorrent. The <strong>Groq LPU</strong> (Language Processing Unit), from Groq (founded in 2016), is another DSA, purpose-built for highly efficient inference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oOmt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oOmt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 424w, https://substackcdn.com/image/fetch/$s_!oOmt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!oOmt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oOmt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oOmt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg" width="562" height="421.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1200,&quot;resizeWidth&quot;:562,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Google's AI chips now can work together for faster learning - CNET&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Google's AI chips now can work together for faster learning - CNET" title="Google's AI chips now can work together for faster learning - CNET" srcset="https://substackcdn.com/image/fetch/$s_!oOmt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!oOmt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 848w, https://substackcdn.com/image/fetch/$s_!oOmt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!oOmt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9b9a37a-8ec8-4922-8f86-506d1643a625_1200x900.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. Google TPU (May 2016)</figcaption></figure></div><blockquote><p>The takeaway: even though NVIDIA powers the largest chunk of AI infrastructure, DSAs keep emerging, <a href="https://en.wikipedia.org/wiki/Domain-specific_architecture#:~:text=A%20domain%2Dspecific%20architecture%20(DSA,operate%20on%20any%20computer%20program.">targeting narrow niches within AI Compute.</a></p></blockquote><div><hr></div><h2>Must-knows about NVIDIA&#8217;s AI Stack</h2><p>When it comes to deep learning and AI, NVIDIA is the dominant player. Beyond the hardware, every AI Engineer will, directly or indirectly, use a software component from NVIDIA&#8217;s stack.</p><p>That&#8217;s mainly because NVIDIA has built an ecosystem that supports every stage of AI development, across many industries and applications.</p><p>It spans core utilities for training AI models, RL simulators, robotics, and enterprise-scale AI deployments. In this section, we&#8217;ll go over three components: Model Training, Inference, and Deployment.</p><div><hr></div><h4><strong>Learning about Model Training Utilities</strong></h4><ul><li><p><strong><a href="https://www.digitalocean.com/community/tutorials/install-cuda-cudnn-for-gpu">CUDA &amp; cuDNN</a></strong></p><p>CUDA is a parallel computing platform and API that powers the execution of kernels on NVIDIA GPUs. The cuDNN library provides a large set of prebuilt, tuned primitives for the compute operators within your Neural Network layers. The two go hand-in-hand and sit at the foundation of every AI model training and inference workflow.</p></li><li><p><strong><a href="https://developer.nvidia.com/nsight-systems">NSight</a></strong></p><p>A system-wide performance analysis tool to visualize an application&#8217;s algorithm execution flow. 
This will help you identify the largest optimization opportunities.</p></li><li><p><strong><a href="https://docs.nvidia.com/nemo/rl/0.2.1/index.html">NeMo-RL</a></strong><br>NeMo-RL is a scalable, efficient post-training library designed to run on anything from a single GPU to thousands, for models from tiny to over 100 billion parameters. It integrates seamlessly with HuggingFace, so there is no overhead on that front.</p></li></ul><div><hr></div><h4><strong>Learning about Model Inference Utilities</strong></h4><ul><li><p><strong>TensorRT &amp; TensorRT-LLM<br></strong>This is a powerful compiler for AI models: it takes the model graph, scans it, and optimizes it for specific NVIDIA GPU hardware. TensorRT-LLM is an adaptation of the compiler for Transformer-based models.<br></p><p><em>&#8505;&#65039; I&#8217;ve covered TensorRT in depth in the second section of this article.</em></p><p></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6ccfc5b0-9746-4c31-affe-546b49bd2dec&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;3 Inference Engines for optimal throughput&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI 
Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-06T13:02:50.823Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/3-inference-engines-for-optimal-throughput&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:146886554,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></li><li><p><strong>Triton Inference Server<br></strong>One of the most mature solutions for deploying general-purpose AI models in production.<br><em><br>&#8505;&#65039; I&#8217;ve covered Triton Server in depth in this article.</em><br></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6d1aed2b-1fb2-497c-b1b6-acbc044e8c82&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;NVIDIA Triton Inference Server made simple.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex 
Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-20T13:01:16.885Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4460ad-0e7e-4545-aee6-274b93dd5959_2300x2304.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/how-to-use-nvidia-triton-server-the&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147783782,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:12,&quot;comment_count&quot;:0,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></li><li><p><strong>Dynamo Inference<br></strong>Dynamo is the newest framework in NVIDIA&#8217;s Stack, specifically designed for large-scale Generative AI workloads.<br><br><em>&#8505;&#65039; I&#8217;ve unpacked the Dynamo architecture in this article.<br></em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;a94f75f8-a9e8-48e9-8860-53b6e4d0ab33&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full 
story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Unpacking NVIDIA Dynamo LLM Inference Framework&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-20T08:02:26.677Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!r0gr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/the-one-stop-to-understand-the-new&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:159462671,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:30,&quot;comment_count&quot;:5,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></li></ul><div><hr></div><h4><strong>Learning about Model Deployment Utilities</strong></h4><ul><li><p><strong><a href="https://github.com/NVIDIA/gpu-operator">NVIDIA GPU Operator</a></strong><br>By adding this operator on top 
of your K8S cluster, you&#8217;ll enable lifecycle management of GPU resources, handling driver installation, runtime libraries, and configuration for GPU-accelerated workloads.</p></li><li><p><strong><a href="https://developer.nvidia.com/nim">NVIDIA NIM</a><br></strong>NIM stands for NVIDIA Inference Microservice. It prepackages the optimal runtime setup for a GenAI model, making it deployable at scale in a robust way. Although customizable, NIMs are enterprise-first.<br><br><em>&#8505;&#65039;  I&#8217;ve covered NIM in this article.<br></em></p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;420c5b5e-9b32-4dc9-819d-d2a5a8d94705&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Deep Dive into NVIDIA NIMs for Generative AI&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI 
Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-30T14:31:09.553Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F612f9853-6f19-4d09-875e-726381ad3fcf_1818x1822.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/deep-dive-into-nvidia-nims-for-generative&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147092859,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:0,&quot;publication_id&quot;:2799726,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></li><li><p><strong><a href="https://github.com/NVIDIA/KAI-Scheduler">KAI Scheduler</a></strong><br>The KAI Scheduler was initially developed by Run:ai and is a Kubernetes-native GPU scheduling solution designed to optimize resource allocation for AI workloads. 
NVIDIA acquired Run:ai for $700 million, integrated it, and then open-sourced KAI Scheduler.</p></li></ul><div><hr></div><h2>Conclusion</h2><p>In this article, we started by comparing an <strong>AI Engineer</strong> to an <strong>AI User</strong>, outlining that an AI Engineer does more than just build applications.</p><p>They need to understand every component of an AI system.</p><p>From there, we gave examples showing why it&#8217;s important to develop end-to-end systems thinking, move past POCs, and avoid chasing hype and trends. Understanding these areas helps engineers design systems that scale efficiently and perform reliably.</p><p>One important topic was why an AI Engineer often works directly or indirectly with <strong>NVIDIA&#8217;s tools</strong>, since NVIDIA powers a huge portion of AI compute.</p><p>The article covered GPUs and the difference between GPUs and Domain-Specific Architectures (DSAs) first, then explored libraries and tools for <strong>training, inference, and deployment</strong> within NVIDIA&#8217;s AI stack.</p><p>This guide is a starting point for any AI Engineer who wants to understand the ecosystem of NVIDIA hardware and software, something they will almost certainly encounter in their work. </p><p>Even though most AI Engineers focus on building data pipelines, training models, and creating applications on top of foundation models, the concepts discussed here are essential for understanding the engineering behind those systems, including infrastructure, optimization, and efficient deployment.</p><blockquote><p>Thank you for reading, see you next week! &#128075;</p></blockquote><div><hr></div><h3>References</h3><ul><li><p><em>NVIDIA-NeMo/RL: Scalable toolkit for efficient model reinforcement</em>. (2025, August). GitHub. 
<a href="https://github.com/NVIDIA-NeMo/RL">https://github.com/NVIDIA-NeMo/RL</a></p></li><li><p><em>NVIDIA/KAI-Scheduler: KAI Scheduler is an open source Kubernetes Native scheduler for AI workloads at large scale</em>. (2025, September 17). GitHub. <a href="https://github.com/NVIDIA/KAI-Scheduler">https://github.com/NVIDIA/KAI-Scheduler</a></p></li><li><p><em>NVIDIA Dynamo</em>. (2025). NVIDIA Developer. <a href="https://developer.nvidia.com/dynamo%E2%80%8C">https://developer.nvidia.com/dynamo&#8204;</a></p></li><li><p>Orosz, G., &amp; Huyen, C. (2025, May 20). <em>The AI Engineering Stack</em>. Pragmaticengineer.com; The Pragmatic Engineer. <a href="https://newsletter.pragmaticengineer.com/p/the-ai-engineering-stack">https://newsletter.pragmaticengineer.com/p/the-ai-engineering-stack</a></p></li><li><p>Browne, R. (2025, September 25). <em>British AI firm Nscale raises $1.1 billion in Nvidia-backed funding round</em>. CNBC. <a href="https://www.cnbc.com/2025/09/25/nvidia-backed-uk-ai-firm-nscale-raises-1point1-billion-funding-round.html">https://www.cnbc.com/2025/09/25/nvidia-backed-uk-ai-firm-nscale-raises-1point1-billion-funding-round.html</a></p></li><li><p>Lee, K. (2024, March 12). <em>Building Meta&#8217;s GenAI Infrastructure</em>. Engineering at Meta. <a href="https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/">https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/</a></p></li><li><p>Kerstin. (2023, September 21). <em>FPGAs vs. GPGPUs</em>. IBE Electronics. <a href="https://www.pcbaaa.com/gpu-vs-gpgpu-vs-dsa-vs-fpga-vs-asic/">https://www.pcbaaa.com/gpu-vs-gpgpu-vs-dsa-vs-fpga-vs-asic/</a></p></li><li><p><em>Invited: The Magnificent Seven Challenges and Opportunities in Domain-Specific Accelerator Design for Autonomous Systems</em>. (2024). Arxiv.org. 
<a href="https://arxiv.org/html/2407.17311v1">https://arxiv.org/html/2407.17311v1</a></p></li></ul><div><hr></div><h4>&#8204;Images and Media</h4><p>If not stated otherwise, all images were created by the author.</p>]]></content:encoded></item><item><title><![CDATA[A New Chapter in My AI Journey]]></title><description><![CDATA[Career transition, first impressions of video content, and a sneak peek at my new course.]]></description><link>https://read.theaimerge.com/p/a-new-chapter-in-my-ai-journey</link><guid isPermaLink="false">https://read.theaimerge.com/p/a-new-chapter-in-my-ai-journey</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 20 Sep 2025 07:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!u_9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to Neural Bits. Each week, I send one article on practical, production-ready AI/ML Engineering to help you learn and upskill in your AI/ML Journey. Subscribe and join over <strong><a href="https://multimodalai.substack.com/subscribe">6000 engineers</a></strong> who learn how to build real-world AI Systems. </em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>During the past two weeks, a lot has happened. 
</p><p>I&#8217;ve completed the code development and video recordings for the <strong><a href="https://www.youtube.com/watch?v=_iYB1z1_Xgs&amp;t=24s">Kubrick Open Source Course</a></strong> (YouTube), which I developed alongside <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!LZBx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;9a06df2e-a8e0-4fd9-a3a8-8c039dd0ce93&quot;}" data-component-name="MentionToDOM"></span> (from The Neural Maze). </p><p>This Monday marked my last day at <strong><a href="https://everseen.com/">Everseen</a></strong>, where I spent the past four years training models, building Vision AI systems, and running MLOps workflows.</p><p>I&#8217;ve stepped into a new role as a Senior AI/ML Engineer, where I&#8217;ll be working on large-scale AI (GenAI) systems, something I&#8217;m genuinely excited about. I&#8217;m already diving into the codebase, workflows, and getting up to speed with projects and documentation. </p><p><em>That means, even more production-ready AI Systems insights to share with you here.</em></p><p>My schedule has taken a hit with the transition, but this post isn&#8217;t about taking a break; it&#8217;s about the next steps I&#8217;ve been planning for a while.</p><p>If you&#8217;ve been following my newsletter, you know I&#8217;m focused on <strong>production-ready AI</strong> and the components of real AI systems. 
These are areas I think too few people talk about, and I want to keep filling that gap.</p><p>So in this article, I want to touch on 3 things:</p><ol><li><p>What I&#8217;ve learned transitioning between AI roles</p></li><li><p>My Video Content First Impressions</p></li><li><p>[NEW] Course on Production-Ready AI Systems</p></li></ol><p>Let&#8217;s get started!</p><div><hr></div><h2>Three thoughts when transitioning between roles</h2><p>Leaving a role after years in it is never easy. It&#8217;s not just the projects and features you&#8217;ve worked on and shipped; it&#8217;s the failed demos, the inside jokes on calls, the `git blame` pranks on a teammate&#8217;s PR, and the real friendships you&#8217;ve built along the way.</p><p>Moving forward, I wanted to outline three things that I think are key for everyone:</p><ol><li><p><strong>Keep your inbox open</strong><br>Even after you leave, make it easy for colleagues to reach you, and offer to help when your help is needed, <em><strong>within the limits of your time</strong></em>. Be open to short chats or advice on the things you&#8217;ve worked on. </p></li><li><p><strong>Leave it better than you found it</strong><br>By this I mean clearing up the backlog, documenting everything, and doing your best to ensure that whoever picks up where you left off ramps up quickly and thanks you for it. 
This is a trait that every Engineer should aim for.</p><p>Don&#8217;t be like Joe  </p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_uCP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_uCP!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 424w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 848w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 1272w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_uCP!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif" width="360" height="360" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:576,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:8436138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174034554?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_uCP!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 424w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 848w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 1272w, https://substackcdn.com/image/fetch/$s_!_uCP!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fd3aece-52a2-461f-90ef-887f4f232eda_576x576.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="3"><li><p><strong>Comfort Zone<br></strong>After years in a role, it&#8217;s easy to fall into comfort zones: how you organize your work, how you communicate with teammates, or how you approach problem-solving can all become routines. A new environment shakes that up. I think this is beneficial.</p></li></ol><div><hr></div><h2>Video Content - First Impressions</h2><p>You might&#8217;ve seen that I&#8217;ve uploaded a few videos in my previous posts. </p><p>That&#8217;s something I was planning to do for a while, but didn&#8217;t quite manage to get the time for. The course I built with Miguel was the kickstart for that, as I recorded over <strong>1.5 hours of video code walkthrough</strong>, going over key technical components. 
</p><p><a href="https://youtu.be/_iYB1z1_Xgs?si=-NgCI37MAQbdHNTN&amp;t=7878">Here is me</a> explaining how React works for Data Scientists and AI folks.</p><blockquote><p><strong>Why React?</strong><br>Some AI tooling is increasingly moving to JavaScript, with tools like Transformers.js, n8n, MCP, and Mastra. Learning a frontend framework is becoming a valuable skill for AI engineers.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-eG4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-eG4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 424w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 848w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 1272w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-eG4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png" width="1456" height="937" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:937,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:675260,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174034554?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-eG4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 424w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 848w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 1272w, https://substackcdn.com/image/fetch/$s_!-eG4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7253aaad-8517-423e-9ebe-32d5f14c7ba3_2948x1898.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I like the video format, as it allows me to show and explain, in real-time, what I&#8217;m talking about. 
From now on, you can expect to see me posting videos more often, walking through complex AI concepts and live-coding.</p><p>Help me shape and structure video topics better:</p><div class="poll-embed" data-attrs="{&quot;id&quot;:378332}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:378328}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:378330}" data-component-name="PollToDOM"></div><blockquote><p>Also, please leave your thoughts in the comments on what else I could do to make these videos better &#128071; Thank you!</p></blockquote><div><hr></div><h2>A New Course: Production-Ready AI Systems</h2><p>I&#8217;ll keep this short, as I&#8217;ll be rolling out <em><strong>big updates in the upcoming weeks</strong></em>.</p><p>I&#8217;ve been working on a new project that combines <strong>Vision AI, Generative AI, and Agents</strong> to build a fully-fledged AI system from the ground up. This will be my most advanced project yet, and I believe the only one that covers advanced AI concepts and key AI libraries and frameworks at a low level.</p><p>&#129345;&#129345; Here&#8217;s a sneak peek!</p><blockquote><p>We&#8217;ll build an <strong>Edge Multi-Agent Vision System for Wildlife Conservation</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u_9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u_9N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 424w, 
https://substackcdn.com/image/fetch/$s_!u_9N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 848w, https://substackcdn.com/image/fetch/$s_!u_9N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 1272w, https://substackcdn.com/image/fetch/$s_!u_9N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u_9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png" width="1456" height="1144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1829932,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/174034554?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!u_9N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 424w, https://substackcdn.com/image/fetch/$s_!u_9N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 848w, https://substackcdn.com/image/fetch/$s_!u_9N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 1272w, https://substackcdn.com/image/fetch/$s_!u_9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c32e0ae-464d-417a-99a8-0f484c2b06dc_3750x2947.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What&#8217;s inside?</p><ol><li><p><strong>Building End-to-End MLOps Pipelines</strong></p></li><li><p><strong>Finetuning &amp; Evaluation</strong></p></li><li><p><strong>Model Optimization</strong> - advanced optimization techniques</p></li><li><p><strong>Perception &amp; Multimodal AI</strong> - we&#8217;ll classify actions from videos</p></li><li><p><strong>MCP Servers &amp; Agent-to-Agent</strong> - we&#8217;ll build a network of AI Agents</p></li><li><p><strong>AI System Design </strong>and <strong>a lot more</strong></p></li></ol><p>To organize this properly, I&#8217;d love your input! Please take 1 minute to answer these 3 polls. Your feedback will help me structure the course modules in the best way possible.</p><blockquote><p>Have extra details to add? Please leave them in the comments.</p></blockquote><div class="poll-embed" data-attrs="{&quot;id&quot;:378356}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:378358}" data-component-name="PollToDOM"></div><div class="poll-embed" data-attrs="{&quot;id&quot;:378355}" data-component-name="PollToDOM"></div><div><hr></div><h2>What to Do Next</h2><p>I&#8217;ll keep you updated with everything you need to know, but for now, stay tuned for the next articles and videos I&#8217;m working on.</p><p>This next project is quite complex, and I&#8217;m not sugarcoating that, but I&#8217;ll do my best to ease the way toward understanding every piece of the puzzle. </p><p>My goal is to make sure you understand how everything fits together.<br><br>Thanks for reading! 
Follow this newsletter on <strong><a href="https://www.youtube.com/@multimodalityai">YouTube</a></strong> as well, as I&#8217;ll be posting longer videos to prepare you for some of the hands-on concepts we&#8217;ll be covering in this new project.</p><p>&#128075; Cheers</p>]]></content:encoded></item><item><title><![CDATA[Video Lesson on Advanced Multimodal AI Concepts]]></title><description><![CDATA[Low-level details about the Video Format. Contrastive Learning, CLIP Model, How VLMs work, Transformers vs CNNs and Context Learning]]></description><link>https://read.theaimerge.com/p/learn-two-advanced-multimodal-ai</link><guid isPermaLink="false">https://read.theaimerge.com/p/learn-two-advanced-multimodal-ai</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sat, 13 Sep 2025 08:13:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/05b2980b-3aa2-4879-837f-44409337bbb6_4112x2314.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Welcome to Neural Bits.</strong> Each week, get one deep-dive article covering advanced, production-ready AI/ML development.</em></p><p><em>Subscribe <a href="https://multimodalai.substack.com/subscribe">to join </a><strong><a href="https://multimodalai.substack.com/subscribe">5900+</a></strong><a href="https://multimodalai.substack.com/subscribe"> AI/ML Engineers</a> learning how to build production-ready AI Systems.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Neural Bits is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><p>This article is an extra module to the <a href="https://github.com/multi-modal-ai/multimodal-agents-course">Open Source Kubrick course</a>, which I&#8217;ve built in collaboration with <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;6f547551-9e78-486d-b3cd-bab372a60459&quot;}" data-component-name="MentionToDOM"></span> (from The Neural Maze).</p><p>In this article, I&#8217;ve recorded two videos on advanced topics in Multimodal AI, which will help you understand video formats as well as Multimodal Models, such as CLIP and Vision Language Models (VLMs), alongside other insights.</p><p>Find the full Video Course on Kubrick, below:</p><div id="youtube2-_iYB1z1_Xgs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;_iYB1z1_Xgs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/_iYB1z1_Xgs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>My recommendation, if you&#8217;re not familiar with the course, is to watch the full video walkthrough, and then 
learn from these two extra deep dives, which focus on more advanced Deep Learning and Multimodal Data concepts.</p><p>Happy learning!</p><div><hr></div><h2>Introduction</h2><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;98714c3c-f904-4b00-a7f5-1398e59babe2&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h3>How the Video Format Works (7m)</h3><p>Here you&#8217;ll learn low-level details about the Video Format, and how any video player (QuickTime Player, VLC, OpenCV, etc.) reads a video file, decodes it, and displays frames and plays audio at the right time.</p><p><strong>Summary of the topics:</strong></p><ul><li><p>Opening an MP4 video in Hex Format</p></li><li><p>Reading and explaining the Video Header and Encoded Packets</p></li><li><p>Learning how Video Re-Encoding works</p></li><li><p>Learning about different codecs: H.264 (AVC) vs H.265 (HEVC)</p></li><li><p>Learning about ISO Multimedia Standards</p></li></ul><p>Corrections:</p><p><em><strong>1/ In the video, I said H.265 keeps better Lighting and Softer Shadows.<br></strong></em>- That&#8217;s true only in the sense that the H.265 codec is compatible with HDR (High Dynamic Range, e.g., Dolby Vision). The codec itself is simply a better compression method than H.264.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;960b7321-8a22-4e7f-b9ec-c19888c5223f&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h3>How the CLIP Model and VLMs Work in General</h3><p>Here you&#8217;ll learn about Contrastive Learning, the loss objective CLIP optimizes, how it was trained, and other low-level architecture and workflow details. 
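</p><p>As a rough sketch of that loss objective, here is a minimal NumPy version of the symmetric contrastive loss CLIP optimizes. This is my own illustration, not code from the video; the batch size, embedding width, and the 0.07 temperature are assumptions for the example:</p>

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, text) embedding pairs."""
    # L2-normalize so that dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix, sharpened by the temperature
    logits = img @ txt.T / temperature
    labels = np.arange(len(img))  # matching pairs sit on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 64))  # stand-ins for image-encoder outputs
texts = rng.normal(size=(8, 64))   # stand-ins for text-encoder outputs
loss = clip_contrastive_loss(images, texts)
```

<p>For matched pairs (identical image and text embeddings) the loss collapses toward zero, while random mismatched pairs keep it high; that gap is exactly the training signal the videos discuss.</p><p>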
You&#8217;ll also learn about Image Encoders, Patching, CNN Receptive Fields, the Vision Transformer, and how CLIP can be used inside VLMs for the VQA (Visual Question Answering) task, walking through a VLM architecture step-by-step.</p><p><strong>Summary of the topics:</strong></p><ul><li><p>CLIP Model Card and the Model Scope</p></li><li><p>How it was trained, Contrastive Learning, Loss Function</p></li><li><p>Vision Transformer, Image Patching, Positional Embeddings</p></li><li><p>How ViT compares to CNNs&#8217; Receptive Fields when learning Image Features</p></li><li><p>Using interactive 3D Vectors in the Desmos UI to showcase Contrastive Loss</p></li><li><p>Explaining CLIP as part of the Image Encoder in a VLM Architecture</p></li></ul><p>Corrections:</p><p><em><strong>1/ In the video, I mention that ViT is better at learning Global Image Context compared to CNNs.</strong></em><br>- That&#8217;s true when comparing networks of similar size. Vision Transformers can learn global features due to how attention works. 
CNNs, on the other hand, are still capable of that, but we need to increase network depth so that receptive fields capture more context across the stacked layers.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;de763cdc-6e5d-4631-b5a4-679b655e77d0&quot;,&quot;duration&quot;:null}"></div><h3>Ending Notes</h3><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><p>If you&#8217;ve enjoyed these and want to stay updated when I post more content like this, focused on end-to-end projects, make sure to also follow me on:</p><p>1/ Daily Content on AI Engineering</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.linkedin.com/in/arazvant/" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y_kD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 424w, https://substackcdn.com/image/fetch/$s_!Y_kD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 848w, https://substackcdn.com/image/fetch/$s_!Y_kD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Y_kD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y_kD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png" width="132" height="39.80152671755725" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:524,&quot;resizeWidth&quot;:132,&quot;bytes&quot;:6831,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.linkedin.com/in/arazvant/&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/173074450?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y_kD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 424w, https://substackcdn.com/image/fetch/$s_!Y_kD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 848w, 
https://substackcdn.com/image/fetch/$s_!Y_kD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 1272w, https://substackcdn.com/image/fetch/$s_!Y_kD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ac6d2d5-2fc4-4b10-b059-c612a8826d8c_524x158.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>2/ Code Resources &amp; OSS Courses</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://github.com/multi-modal-ai/multimodal-agents-course" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jzqv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 424w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 848w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 1272w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jzqv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png" width="140" 
height="42.213740458015266" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:524,&quot;resizeWidth&quot;:140,&quot;bytes&quot;:8798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://github.com/multi-modal-ai/multimodal-agents-course&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/173074450?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jzqv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 424w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 848w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 1272w, https://substackcdn.com/image/fetch/$s_!jzqv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf3243b-f45a-4957-b6d9-d46139377db8_524x158.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>3/ Advanced AI Deep Dives (&gt;1h, coming soon)</p><div 
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.youtube.com/@multimodalityai" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gu-H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 424w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 848w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gu-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png" width="156" height="47.33587786259542" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:159,&quot;width&quot;:524,&quot;resizeWidth&quot;:156,&quot;bytes&quot;:7393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.youtube.com/@multimodalityai&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/173074450?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gu-H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 424w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 848w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 1272w, https://substackcdn.com/image/fetch/$s_!Gu-H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff398d0cc-16ff-4d1a-94a7-32eab838223c_524x159.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[The face behind this Newsletter (First Video)]]></title><description><![CDATA[Introducing 
myself: who I am, my AI/ML expertise, and future plans for content creation.]]></description><link>https://read.theaimerge.com/p/introducing-myself-my-first-video</link><guid isPermaLink="false">https://read.theaimerge.com/p/introducing-myself-my-first-video</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Sun, 07 Sep 2025 07:02:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/81365cc7-9279-4833-9e0e-2be046a6ae35_3174x1784.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey all,</p><p>I know this is an unusual hour to send an article, but I thought it was worth taking this chance to let you all know <strong>who is behind</strong> the Neural Bits Newsletter, a bit about me, and what <strong>you can expect to find next.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><p>For the past few weeks, I&#8217;ve been considering raising the bar on the content I create, and many of you have been messaging me on LinkedIn regarding live walkthroughs and workshop-like video content.</p><p>I&#8217;ve been planning this for a while now, and this article is the first in that series.</p><h2>Who Am I</h2><div><hr></div><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;39e2b36a-e411-426a-aa84-ae2abbe35e81&quot;,&quot;duration&quot;:null}"></div><p>Key points, for those who don&#8217;t want to <strong>hear me ramble for 10 minutes</strong>:</p><ul><li><p>I&#8217;ve worked for ~1.5 years as a Software Engineer and 8 years as an AI/ML/MLOps Engineer.</p></li><li><p>Worked in AI Research and published a<a 
href="https://arxiv.org/abs/2006.02098"> Research Paper in 2020</a></p></li><li><p>Worked across the full AI landscape: MLOps, Deep Learning, CNNs, LLMs, VLMs, Diffusion Models, and built real AI Systems.</p></li><li><p><strong>I&#8217;m not an AI Influencer; I actually work full-time as an AI Engineer.</strong></p></li><li><p>I don&#8217;t create content for followers or vanity metrics; I try to help others upskill.</p></li><li><p>I&#8217;ll be mixing Text Articles with Video Content soon enough.</p></li><li><p>I&#8217;ll be adding Video Walkthroughs (Andrej Karpathy Style)</p></li><li><p>Expect more complex AI Projects, end-to-end.</p></li><li><p>Expect more hands-on content, building AI Systems.</p></li><li><p>I&#8217;ll also think of how I can repay my Paid Subscribers through exclusive content, courses, 1-1 sessions, etc.</p></li></ul><h3>Socials</h3><p><strong>Github Code Repository</strong> &#8594; <a href="https://github.com/multi-modal-ai">https://github.com/multi-modal-ai</a><br><strong>Youtube Channel</strong> &#8594; <a href="https://www.youtube.com/@multimodalityai">https://www.youtube.com/@multimodalityai</a><br><strong>LinkedIn </strong>&#8594; <a href="https://www.linkedin.com/in/arazvant/">https://www.linkedin.com/in/arazvant/</a></p><div><hr></div><h3>Adding a Patch :)</h3><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;3b9ef123-6bea-4cda-bae0-de931a85c975&quot;,&quot;duration&quot;:null}"></div><p>I recorded this bit, mainly to let you know that <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Miguel Otero Pedrido&quot;,&quot;id&quot;:89972117,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b58b1f5-4d25-4dcf-9f48-b67a6e6e1316_1200x1200.jpeg&quot;,&quot;uuid&quot;:&quot;2d3d2589-3cea-4d7d-9fb0-77db90d38cfc&quot;}" data-component-name="MentionToDOM"></span> and I are 
currently working on recording the Video Course for the <strong><a href="https://github.com/multi-modal-ai/multimodal-agents-course">Kubrick Project</a></strong>, we should publish it soon!</p><p>To stay updated, you can follow <a href="https://www.linkedin.com/in/arazvant/">me</a> or <a href="https://www.linkedin.com/in/migueloteropedrido/">Miguel</a> on LinkedIn, as we&#8217;ll be announcing it there first!</p><p>Thank you all, <br>Hope you&#8217;re excited and will learn a lot more with the new changes I&#8217;ll bring on here!</p><p>All the best &#128591;</p><blockquote><p>See you next week!</p></blockquote>]]></content:encoded></item><item><title><![CDATA[The AI Engineer's Guide to Inference Engines and Frameworks]]></title><description><![CDATA[What solutions are out there? Which one to select based on your use case and AI workload.]]></description><link>https://read.theaimerge.com/p/the-ai-engineers-guide-to-inference</link><guid isPermaLink="false">https://read.theaimerge.com/p/the-ai-engineers-guide-to-inference</guid><dc:creator><![CDATA[Alex Razvant]]></dc:creator><pubDate>Thu, 21 Aug 2025 08:01:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1oH-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Hey, welcome!</strong><br>Each week, I share insights on advanced, production-ready AI/ML development, <em>the kind of practical knowledge that rarely</em> gets covered elsewhere.</p><p><em>Join <strong><a href="https://multimodalai.substack.com/subscribe">5,400+ AI/ML engineers</a></strong> who subscribe to level up their skills with hands-on advice, insights, and lessons I&#8217;ve learned from nearly a decade working in the AI industry.</em></p><p class="button-wrapper" 
data-attrs="{&quot;url&quot;:&quot;https://read.theaimerge.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://read.theaimerge.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h3>Introduction</h3><p><strong>When</strong> deploying machine learning models into production, inference speed becomes just as important as accuracy. </p><p>To promote an AI model (LLMs included) from research into production, the deployment lifecycle can be viewed in two distinct stages.</p><p>The research and development stage focuses on model quality, involving&nbsp;pre-training and post-training&nbsp;phases. The deployment and monitoring stage focuses on the model&#8217;s performance as part of a larger AI System.</p><p><em><strong>This article covers everything AI Engineers need to know about Inference Engines and Serving Frameworks</strong>, going through every available open-source solution for optimizing, compiling, and serving AI Models.</em></p><div><hr></div><h2>Table of Contents</h2><ol><li><p>The Phases of Model Development</p></li><li><p>Inference Engines and Inference Frameworks</p><ul><li><p>ONNX and ONNX Runtime</p></li><li><p>TensorRT, TensorRT-LLM</p></li><li><p>vLLM, vLLM + LMCache</p></li><li><p>vLLM + Ray</p></li><li><p>llama.cpp</p></li><li><p>Ollama</p></li><li><p>NVIDIA Triton Inference Server</p></li><li><p>HuggingFace TGI (Text-Generation Inference)</p></li><li><p>CoreML</p></li><li><p>OpenVINO, OpenVINO GenAI for Intel Hardware</p></li></ul></li><li><p>Distributed GenAI Inference Frameworks</p><ol><li><p>NVIDIA Dynamo</p></li><li><p>vLLM + llm-D (Kubernetes)</p></li><li><p>AirBrix</p></li><li><p>Mojo and Mojo MAX Engine</p></li></ol></li><li><p>Conclusion</p></li></ol><div><hr></div><h1>Model Development (Pre/Post-Training)</h1><p>During the first stage, most AI researchers and ML Engineers could train a 
smaller model from scratch, but that rarely happens; more often, we take a pre-trained model and adapt it further to downstream tasks. Adapting a model to a new task is a form of <strong>Transfer Learning</strong>, an older term given the pace at which the AI field progresses (e.g., <a href="https://www.cse.ust.hk/~qyang/Docs/2009/tkde_transfer_learning.pdf">2010, A Survey on Transfer Learning</a>).</p><blockquote><p>Transfer learning is a technique in ML in which knowledge learned from a task is re-used in order to boost performance on a related task.</p></blockquote><p>Nowadays, we hear about `fine-tuning`, `downstream task adaptation`, `post-training`, and other terms, which all refer to the core concept of Transfer Learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBSB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TBSB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 424w, https://substackcdn.com/image/fetch/$s_!TBSB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 848w, https://substackcdn.com/image/fetch/$s_!TBSB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TBSB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TBSB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png" width="1456" height="498" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:322246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TBSB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 424w, https://substackcdn.com/image/fetch/$s_!TBSB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 848w, 
https://substackcdn.com/image/fetch/$s_!TBSB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 1272w, https://substackcdn.com/image/fetch/$s_!TBSB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb21b6cc1-e708-46e6-87c1-6fa0d65a5521_3653x1250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 1. 
In pre-training, we start with raw, unstructured data and a base model whose parameters are initialized randomly. Pre-training requires more compute, more time, and a larger budget.</figcaption></figure></div><h4>Examples</h4><p>Let&#8217;s say we want to build a model that detects objects. For that, we&#8217;d select a Computer Vision model (e.g., YOLO, R-CNN) which was pretrained on the MS COCO dataset or other detection datasets, and we&#8217;ll further `finetune` it on our custom subset of objects. </p><p>The same goes for an LLM. We&#8217;d select a pretrained model on Web text, which learned to complete the next word/token in a sentence, and further `finetune` it for summarization, chat, text entity recognition, writing poetry, or other adaptations.</p><blockquote><p>The difference is that for LLMs, the number of techniques one could use is far richer.</p></blockquote><p>For LLMs, we have SFT (Supervised Fine-Tuning), RLHF (Reinforcement Learning from Human Feedback), RLAIF (Reinforcement Learning from AI Feedback), DPO (Direct Preference Optimization), GRPO (Group Relative Policy Optimization), and many other methods.</p><p>To group all these, the term `<strong>post-training</strong>` is widely used within the field, and it makes the most sense.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q6O8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q6O8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 424w, 
https://substackcdn.com/image/fetch/$s_!Q6O8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 848w, https://substackcdn.com/image/fetch/$s_!Q6O8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6O8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q6O8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:251222,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Q6O8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 424w, https://substackcdn.com/image/fetch/$s_!Q6O8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 848w, https://substackcdn.com/image/fetch/$s_!Q6O8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!Q6O8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda2f1e02-4403-438e-b491-fc88fc735476_2806x1578.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2. During post-training, we prepare a high-quality subset of samples aligned with our target task, and then we further adapt the base model to our specific downstream task.</figcaption></figure></div><h1>Model Deployment</h1><p>Model deployment and monitoring is the phase where most AI/ML Engineers work, as it bridges the gap between research and engineering. This phase includes optimizing the model, packaging it into serving infrastructure, integrating with APIs and applications, and monitoring the end-to-end system. </p><blockquote><p>Even if a model performs well in pre-deployment tests or evals, what ultimately matters in production are its <strong>performance metrics</strong> with regard to the entire AI System it is part of.</p></blockquote><p>These performance optimizations can be grouped into two categories: </p><ul><li><p><strong>System-wide</strong>, which focuses on the infrastructure side: model parallelization across multiple GPUs or Nodes, distributed processing of model feed-forward stages, caching, or speculative decoding (LLMs).</p></li><li><p><strong>Model-wide</strong> optimizations are applied at the model level and include quantization, pruning, Knowledge Distillation via Teacher/Student, or model compilation.</p></li></ul><blockquote><p>You can learn more about these two categories of optimizations, specifically for LLMs, by reading one of my previous articles:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f26339ad-9adf-4273-9a57-b5dbc1f78025&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full 
story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Understanding LLM Optimization Techniques&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-01T08:01:22.218Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F892f824f-e4c3-4ca1-b6da-ca0e6a70dd14_2401x2531.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/understanding-llm-optimization-techniques&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158020353,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:34,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><h4>Examples</h4><p>With traditional Deep Learning models, such as the ones for vision-based tasks of tracking an object in a video or removing the background of an image, we would quantize or compile them on specific GPU 
hardware to reduce the latency and size of the model, making it run faster.</p><p><em>When running AI on video, each second, there are ~30 different frames we could send to the model for inference, and since a large majority of Vision AI has to run in real-time, we need lightweight, fast models.</em></p><p>In the case of LLMs, inference is a bit trickier as it&#8217;s composed of 2 stages: <strong>prefill and decode</strong>, the former being highly parallelizable, and the latter running sequentially.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ngLx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ngLx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 424w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 848w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 1272w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ngLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png" width="1456" height="498" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ngLx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 424w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 848w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 1272w, https://substackcdn.com/image/fetch/$s_!ngLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c7c7b3a-f8fe-4926-92ea-95da4adae442_2918x998.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 3: In this image, we can see the Cost vs Latency for the two inference stages of LLMs, clearly showing that the decoding (sequential) stage could be a real bottleneck when increasing the model size. 
Image taken from the <a href="https://arxiv.org/pdf/2211.05102">Efficiently Scaling Transformer Inference</a> (2022) paper.</figcaption></figure></div><p>Apart from this, LLMs are large: a single GPU might fit the model weights but still crash with an OOM (Out of Memory) error during inference, because layer activations also have to be stored in memory.</p><p>Optimizations for large models such as LLMs include sharding the model across multiple GPUs (TP = Tensor Parallel, PP = Pipeline Parallel), FlashAttention (fusing the attention operations into a single GPU kernel), and efficiently storing the KV-Cache for later reuse.</p><p>Having covered the optimizations we could apply, let&#8217;s get practical and study what Inference Engines are, how they work, and which one to choose.</p><div><hr></div><h1>Inference Engines and Inference Frameworks</h1><p>An <strong>inference engine</strong> is a specialized runtime designed to execute trained models efficiently on hardware. 
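</p><p>To make the OOM risk above concrete, the serving footprint can be estimated with back-of-the-envelope arithmetic. A minimal sketch, with illustrative hyperparameters rather than a specific model&#8217;s config:</p>

```python
# Back-of-the-envelope GPU memory estimate for serving a decoder-only LLM.
# All numbers are illustrative assumptions, not a specific model's config.
n_params     = 7e9   # 7B parameters
bytes_per_el = 2     # FP16
n_layers     = 32
n_kv_heads   = 32    # assumes full multi-head KV (no GQA/MQA)
head_dim     = 128
seq_len      = 4096  # tokens kept in the KV cache per sequence
batch_size   = 8     # concurrent sequences

weights_gb = n_params * bytes_per_el / 1e9

# KV cache: 2 tensors (K and V) per layer, per token, per sequence.
kv_gb = (2 * n_layers * n_kv_heads * head_dim
         * seq_len * batch_size * bytes_per_el) / 1e9

print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.0f} GB")
# -> weights ~14 GB, KV cache ~17 GB
```

<p>The 14 GB of weights fit comfortably on a 24 GB GPU, yet the KV cache alone adds roughly 17 GB at this modest batch size and sequence length, which is exactly how inference runs out of memory even when the weights fit.</p><p>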
Between a trained model, which is usually in its default PyTorch (.pt) or TensorFlow (SavedModel) format, and a serving framework such as NVIDIA Triton Inference Server, FastAPI, TorchServe, BentoML, or TensorFlow Serving, there is an <strong>inference engine</strong>.</p><p>An inference engine optimizes a model to balance <em>latency, throughput, precision, and efficiency</em> in a production scenario.</p><p>Commonly, inference engines do one or more of the following:</p><ul><li><p><strong>Graph Optimization</strong><br>The engine analyzes the computational graph of the model and applies optimizations to reduce the graph size and depth, fusing model layers or removing redundant computations.</p></li><li><p><strong>Hardware-Specific Optimization</strong><br>Models can be compiled for target hardware accelerators such as CPU, GPU, TPU, or custom accelerators, by selecting highly tuned compute kernels for each architecture.</p></li><li><p><strong>Lowering Precision</strong><br>Reduces the memory footprint by quantizing layers&#8217; precision (e.g., FP32 &#8594; FP16 &#8594; INT8/INT4).</p></li><li><p><strong>Model Pruning &amp; Sparsity</strong><br>Pruning redundant weights or exploiting sparsity in matrices.</p></li></ul><p>Since there is a major difference between traditional deep learning models and generative models, an inference engine will apply different optimizations.</p><h3>General Purpose Inference Engines</h3><p>These include the traditional compilers such as ONNX Runtime, TensorRT (NVIDIA), OpenVINO (Intel), and Core ML (Apple).</p><blockquote><p>Of the engines listed above, only ONNX Runtime is cross-compatible with multiple accelerators.</p></blockquote><p>Let&#8217;s see why.</p><div><hr></div><h2>ONNX and ONNX Runtime</h2><p>ONNX stands for Open Neural Network Exchange and is an open-source standard for representing deep learning models. 
An ONNX model is a serialized file containing the computation graph, architecture, weights, and other parameters.</p><p>When we convert a model to ONNX, the graph of operations is represented in an internal format (ONNX IR), which can be interpreted by other Inference Engines that will further transform this IR into their internal representation.</p><blockquote><p>An important distinction is that ONNX is a model format we serialize a model into, whereas <a href="https://onnxruntime.ai/docs/">ONNX Runtime</a> is an Inference Engine that can run ONNX models.</p></blockquote><p>When we run inference via ONNX Runtime, an inference session object is created, which loads the model graph, decodes the operators, and partitions the graph of operations based on the Execution Provider.</p><p>Execution providers in ONNX Runtime are the target platforms the model should be optimized for (CPU, CUDA, ROCm).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CIKZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CIKZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!CIKZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 848w, 
https://substackcdn.com/image/fetch/$s_!CIKZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!CIKZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CIKZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:619467,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CIKZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 424w, https://substackcdn.com/image/fetch/$s_!CIKZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 848w, 
https://substackcdn.com/image/fetch/$s_!CIKZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 1272w, https://substackcdn.com/image/fetch/$s_!CIKZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd835f553-6872-4f03-bb9c-a74895af435d_3600x3600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 4: Illustrating what the inference workflow looks like when using the ONNXRuntime inference 
engine.</figcaption></figure></div><p>After you&#8217;ve converted a model to the ONNX format, you can view its graph architecture using <strong><a href="https://netron.app/">Netron</a></strong>, which is a JavaScript application that parses model files. Research engineers often use Netron to analyze models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hSyl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hSyl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 424w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 848w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hSyl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png" width="696" height="519.6098901098901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1087,&quot;width&quot;:1456,&quot;resizeWidth&quot;:696,&quot;bytes&quot;:253102,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hSyl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 424w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 848w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 1272w, https://substackcdn.com/image/fetch/$s_!hSyl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5468671f-551f-46c1-a7e5-3ad67d017955_1934x1444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 5. Visualizing an ONNX model using Netron Web Viewer. 
Taken from the netron.app viewer.</figcaption></figure></div><p>From ONNX, a model could be further converted into other inference engine formats, such as OpenVINO for Intel CPUs and GPUs, Core ML for Apple devices, or TensorRT for NVIDIA GPUs.</p><div><hr></div><h2>NVIDIA TensorRT</h2><p>The TensorRT compiler is specifically designed to maximize performance on NVIDIA GPUs, due to its low-level integration with the underlying CUDA kernels and cuDNN routines.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1oH-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1oH-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 424w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 848w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 1272w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1oH-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2585869,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1oH-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 424w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 848w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 1272w, https://substackcdn.com/image/fetch/$s_!1oH-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F81ab0846-9a02-433f-9c7d-1bc23dbcd904_1940x1946.gif 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 6: How a model graph is decoded and optimized with 
TensorRT.</figcaption></figure></div><p>Before compiling a model to TensorRT, one would have to convert it to the standardized ONNX format first. From there, TensorRT applies a series of optimizations that are closely tied to the NVIDIA GPU architecture the model is built for.</p><blockquote><p>When using TensorRT, if we compile a model for the A100 GPU, we&#8217;ll have to repeat the entire process for other GPUs.</p></blockquote><p>A TensorRT engine is tightly coupled to the GPU it was built for: how many CUDA and Tensor Cores it has, how many SMs (Streaming Multiprocessors), the overall Compute Capability (CC) of the GPU, and more.</p><blockquote><p>You could learn more about GPUs, GPU Programming and CUDA by reading my previous article on <strong><a href="https://multimodalai.substack.com/p/the-mlai-engineers-starter-guide">GPU Programming</a></strong>.</p></blockquote><p>Here are two key optimizations that TensorRT brings that have a large impact on performance:</p><ul><li><p><strong>Kernel Fusion</strong> - distinct kernels can be fused and computed in a single pass over the data. For example, we can fuse a convolution kernel with the ReLU that follows it, and compute the results in a single pass through the data.</p></li><li><p><strong>Kernel Auto-Tuning</strong> - CUDA programming is based on Threads and Blocks, and kernel auto-tuning selects the optimal thread and block counts for the specific GPU architecture, based on its CUDA Core and Tensor Core counts.</p></li></ul><p>Since TensorRT itself is general-purpose, let&#8217;s walk through how an LLM in particular is converted into a TensorRT engine.</p><div><hr></div><h2>NVIDIA TensorRT-LLM</h2><p>TensorRT-LLM is a high-level Python API to build TensorRT engines that contain transformer-specific optimizations. 
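</p><p>Before going further, the kernel-fusion idea from the section above can be sketched in plain Python: the fused version applies ReLU while each output element is being produced, so the data is traversed once and no intermediate buffer is materialized. This is only an illustration of the idea, not how TensorRT implements it:</p>

```python
# Plain-Python sketch of kernel fusion: conv1d followed by ReLU,
# computed (a) as two separate passes and (b) fused into one pass.
x = [0.5, -1.0, 2.0, 3.0, -0.5, 1.5]
w = [1.0, -2.0, 1.0]  # illustrative 3-tap filter

def conv1d(x, w):
    n = len(w)
    return [sum(x[i + j] * w[j] for j in range(n)) for i in range(len(x) - n + 1)]

# (a) Unfused: the intermediate conv output is written out, then re-read by ReLU.
conv_out = conv1d(x, w)
unfused = [max(v, 0.0) for v in conv_out]

# (b) Fused: ReLU is applied while each output element is still "in registers",
# so the input is traversed once and no intermediate buffer is stored.
n = len(w)
fused = [max(sum(x[i + j] * w[j] for j in range(n)), 0.0)
         for i in range(len(x) - n + 1)]

assert fused == unfused  # same math, fewer memory round-trips
```

<p>On a GPU, the saving is not in the arithmetic but in the avoided round-trip of the intermediate tensor through global memory.</p><p>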
It provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged KV caching, transformer quantizations (FP8, INT4, SmoothQuant), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9gt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9gt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 424w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 848w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 1272w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9gt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png" width="1456" height="978" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:978,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:484653,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!h9gt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 424w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 848w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 1272w, https://substackcdn.com/image/fetch/$s_!h9gt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7d56f2d-bdad-491e-bf22-b5d77986dcc7_3000x2015.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 7: Comparison between TensorRT and TensorRT-LLM across the purpose, optimizations used, and learning curve.</figcaption></figure></div><p>To optimize an LLM using TensorRT-LLM, the team at NVIDIA published the code workflow for each LLM variant. 
For example, Llama models follow a different optimization configuration than Gemma or Mistral models.</p><p>For model compatibility and the steps to optimize and serve TensorRT-LLM engines, see the library&#8217;s <a href="https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/models">models section on GitHub.</a></p><blockquote><p>With TensorRT-LLM, after optimizing a model, we don&#8217;t get a monolithic engine file, but multiple sub-engines connected as a pipeline.</p></blockquote><p>Each sub-engine is responsible for a part of the model in the inference workflow:</p><ul><li><p><strong>Tokenizer Engine</strong> - this is the optimized model that does token preprocessing before passing the sequence to the LLM.</p></li><li><p><strong>The Transformer Model</strong> - this is the TensorRT engine of the transformer model.</p></li><li><p><strong>Detokenizer Engine</strong> - this block decodes token IDs back to text tokens, also integrating in-flight batching and optimized kernels.</p></li><li><p><strong>The BLS Model Pipeline</strong> - this is the pipeline composer, a Business Logic Scripting module chaining the blocks.</p></li></ul><p>TensorRT-LLM is designed for large-scale deployments, squeezing the most out of the GPUs. Simpler engines with comparable latency for small-to-mid-scale deployments are vLLM and SGLang.</p><div><hr></div><h2>vLLM and vLLM + Ray</h2><p>vLLM is a high-throughput and memory-efficient inference engine for serving Large Language Models. It was originally developed in the <a href="https://sky.cs.berkeley.edu/">Sky Computing Lab at UC Berkeley</a>, and it quickly evolved into a community-driven project. 
Being built mostly in Python (80+%) with a bit of CUDA and C++ (~13%) for Attention kernels and other optimizations, vLLM became popular amongst AI Engineers who need to deploy LLMs at scale.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BIl2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BIl2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 424w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 848w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 1272w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BIl2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png" width="1170" height="358" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:358,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Single sequence generation with Llama models on the ShareGPT dataset with various NVIDIA hardware&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Single sequence generation with Llama models on the ShareGPT dataset with various NVIDIA hardware" title="Single sequence generation with Llama models on the ShareGPT dataset with various NVIDIA hardware" srcset="https://substackcdn.com/image/fetch/$s_!BIl2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 424w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 848w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 1272w, https://substackcdn.com/image/fetch/$s_!BIl2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae1d900-bffd-4477-abac-ae99755f4c50_1170x358.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8: How inference throughput compares between the vLLM, HF Transformers, and HF TGI (Text Generation Inference) engines when running LLMs. Taken from the vLLM GitHub page.</figcaption></figure></div><p>Among the optimizations vLLM brings to the table, two key techniques that improve performance are Paged Attention and Continuous Batching.</p><h4>1/ Smarter Memory Management with Paged Attention</h4><p>Paged Attention is a mechanism for efficiently managing attention key and value memory, which can improve LLM serving throughput by up to ~24x while using ~50% less GPU memory compared with traditional methods. </p><p>In LLM inference, given a starting prompt, the prefill phase computes the initial attention states, the Keys (K) and Values (V). 
Next, each decoding step appends a new token to the sequence (i.e., a completion token), and the K, V states have to be extended.</p><p>To avoid recomputing these states on every iteration, vLLM uses Paged Attention to cache the previous states in fixed-size blocks and compute only the state of the newest token.</p><h4>2/ Continuous Request Batching</h4><p>Static batching waits for all sequences in a batch to finish. Since LLM inference requests have different output lengths, continuous batching instead dynamically replaces completed sequences with new ones at each iteration. </p><p>This allows new requests to fill GPU slots immediately, resulting in higher throughput, reduced latency, and more efficient GPU utilization.</p><blockquote><p>You can dive into more detail on vLLM, and learn how to get started with it, by reading my previous article.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c4e0354c-9564-49fa-9714-6467d250894d&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;How does vLLM serve LLMs at scale?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI 
Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-27T08:02:47.529Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!8Y1V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcfba543a-5764-48e8-9f14-af32e4990ea1_1200x591.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/unpacking-vllm-distributed-inference&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:159610212,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:28,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><p>One important add-on to vLLM is LMCache, which adds a different layer on how vLLM processes and reuses the KV Cache in its built-in Paged Attention mechanism.</p><p><strong><a href="https://github.com/LMCache/LMCache">LMCache</a></strong> aims to reduce TTFT (Time To First Token) and increase throughput, especially under long-context scenarios. 
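To build intuition for this block-granular caching, here is a pure-Python toy sketch (it uses no vLLM or LMCache APIs; the class, the fake KV strings, and the 4-token block size are all illustrative): KV states are stored per fixed-size block of tokens, and blocks already computed for repeated text are reused, so only new tokens pay the prefill cost.

```python
BLOCK_SIZE = 4  # tokens per KV-cache page (real systems use e.g. 16; illustrative)

class PagedKVCache:
    """Toy paged KV cache: stores a fake KV state per token block and
    reuses blocks already computed for repeated text."""

    def __init__(self):
        self.pages = {}          # block of tokens (tuple) -> fake KV state
        self.compute_calls = 0   # counts simulated prefill work

    def _compute_kv(self, block):
        self.compute_calls += 1  # stands in for an attention kernel launch
        return f"KV{hash(block) & 0xffff:04x}"

    def prefill(self, tokens):
        """Return KV pages for `tokens`, computing only uncached blocks."""
        out = []
        for i in range(0, len(tokens), BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            if block not in self.pages:
                self.pages[block] = self._compute_kv(block)
            out.append(self.pages[block])
        return out

cache = PagedKVCache()
system_prompt = list(range(16))            # 4 blocks of shared context
cache.prefill(system_prompt + [100, 101])  # first request: 5 blocks computed
first = cache.compute_calls
cache.prefill(system_prompt + [200, 201])  # second request reuses 4 blocks
print(first, cache.compute_calls - first)  # 5 1
```

In the real systems the pages hold actual K/V tensors and are managed across GPU, CPU DRAM, and disk tiers, but the bookkeeping idea is the same: repeated text costs almost nothing to prefill again.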
Unlike most serving frameworks, LMCache caches the KV Cache of reusable text across multiple storage tiers (GPU memory, CPU DRAM, and local disk), and it reuses the cache for any repeated text, not only prefixes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N-lL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N-lL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 424w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 848w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N-lL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png" width="1456" height="594" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;performance&quot;,&quot;title&quot;:&quot;performance&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="performance" title="performance" srcset="https://substackcdn.com/image/fetch/$s_!N-lL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 424w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 848w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 1272w, https://substackcdn.com/image/fetch/$s_!N-lL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97405c79-ccbd-457d-b09c-f1f936c492b9_3630x1480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 8.1. LMCache caches and reuses the KV Cache of longer repeated token spans, not only prefixes. This reduces GPU cycles, speeding up the prefill phase. Taken from the GitHub Page.</figcaption></figure></div><p>This next set of Inference Engines is more specialized, either architecture-bound, such as CoreML for Apple devices, or use-case-bound, such as llama.cpp for LLMs.</p><h4>1/ CoreML</h4><p>CoreML is Apple&#8217;s framework for on-device inference. Models in this format leverage the unified CPU &amp; GPU memory and the Neural Engine in Apple chips.  
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-8lV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-8lV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 424w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 848w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 1272w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-8lV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png" width="1314" height="875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d886eb47-01a2-409c-bf59-76a074438160_1314x875.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1314,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:1314,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="1314" title="1314" srcset="https://substackcdn.com/image/fetch/$s_!-8lV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 424w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 848w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 1272w, https://substackcdn.com/image/fetch/$s_!-8lV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd886eb47-01a2-409c-bf59-76a074438160_1314x875.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 9. The savefile format of a CoreML Model Package, to be deployed on Apple Devices (iPhone, iPad, MacBook). Taken from CoreML Docs.</figcaption></figure></div><p>On Apple devices, all memory is unified. You no longer need a separate GPU with dedicated VRAM because the entire model is loaded into the shared memory pool (RAM + GPU + CPU). 
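As a rough sketch of why this matters, you can estimate whether a model's weights fit in a unified-memory budget (the 16 GB budget, the parameter counts, and the 1.2x overhead factor below are illustrative assumptions, not measurements):

```python
def model_memory_gb(n_params_billion, bytes_per_param, overhead=1.2):
    """Approximate resident size of a model's weights in GB; `overhead`
    loosely accounts for the KV cache and activations."""
    return n_params_billion * 1e9 * bytes_per_param * overhead / 1024**3

UNIFIED_MEMORY_GB = 16  # e.g. a base MacBook; purely illustrative

for name, params_b in [("8B", 8), ("70B", 70)]:
    for fmt, nbytes in [("fp16", 2), ("int4", 0.5)]:
        need = model_memory_gb(params_b, nbytes)
        verdict = "fits" if need < UNIFIED_MEMORY_GB else "does not fit"
        print(f"{name} @ {fmt}: ~{need:.1f} GB -> {verdict} in {UNIFIED_MEMORY_GB} GB")
```

Under these assumptions, an 8B model only squeezes into a 16 GB pool once quantized, which is why quantized formats dominate on-device serving.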
</p><blockquote><p>See <strong><a href="https://machinelearning.apple.com/research/core-ml-on-device-llama">this tutorial</a></strong> on how to port Llama-3.1 to CoreML.</p></blockquote><p>The CPU, GPU, and Neural Engine can access model parameters directly from this unified memory, without transferring data back and forth, which allows developers to load larger models.</p><blockquote><p>See <strong><a href="https://dev.to/ocodista/deepseek-r1-7bs-performance-on-a-developers-macbook-3mg2">this benchmark</a></strong> on running DeepSeek R1 on a MacBook Pro M2.</p></blockquote><div><hr></div><h4>2/ Intel OpenVINO</h4><p>OpenVINO is an open-source toolkit developed by <strong>Intel</strong> for optimizing and deploying deep learning models. It enables AI inference on a variety of Intel hardware, including CPUs, GPUs, and VPUs, with a "write-once, deploy-anywhere" approach. </p><p>Initially, OpenVINO was popular for computer vision models, since most AI deployments that needed to run efficiently on smaller devices or at the edge were based on vision tasks. As a result, OpenVINO was heavily optimized for CV and CNN-based models.</p><p>Today, OpenVINO also provides a GenAI blueprint to optimize and serve LLMs using the OpenVINO runtime. 
To do that, you can either use the <em>optimum-cli</em> tool from HuggingFace to convert your LLM to the OpenVINO format, or download a pre-compiled OpenVINO profile.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nZoA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nZoA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 424w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 848w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 1272w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nZoA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nZoA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 424w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 848w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 1272w, https://substackcdn.com/image/fetch/$s_!nZoA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff80d5fd2-22dc-4635-9ca9-a70e2dda6f72.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 10. Serving an LLM from HuggingFace using the OpenVino GenAI library. This presents 2 methods to compile an OpenVino profile: using the HF optimum-cli optimizer tool or downloading a pre-compiled profile. 
Taken from the GitHub Page.</figcaption></figure></div><blockquote><p>Find more about <strong><a href="https://github.com/openvinotoolkit/openvino">OpenVINO</a></strong> and <strong><a href="https://github.com/openvinotoolkit/openvino.genai">OpenVINO GenAI</a></strong> in the official docs.</p></blockquote><div><hr></div><h4>3/ llama.cpp</h4><p><strong>llama.cpp</strong> is a high-performance C/C++ inference engine for running LLMs locally with minimal dependencies. Because it is written in low-level C/C++, which virtually every system can compile and run, it is compatible with a wide range of target architectures, making it highly portable.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IKqK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IKqK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 424w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 848w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 1272w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!IKqK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png" width="728" height="241.41902313624678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:129,&quot;width&quot;:389,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Llama.cpp - Wikipedia&quot;,&quot;title&quot;:&quot;Llama.cpp - Wikipedia&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Llama.cpp - Wikipedia" title="Llama.cpp - Wikipedia" srcset="https://substackcdn.com/image/fetch/$s_!IKqK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 424w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 848w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 1272w, https://substackcdn.com/image/fetch/$s_!IKqK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96d40b3f-6d02-4cdf-93f2-6aa2ed7d2de0_389x129.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption 
class="image-caption">Figure 11.1 The llama.cpp logo. Taken from Google Images.</figcaption></figure></div><p>Based on your system architecture, llama.cpp binaries are built with the appropriate backends, such as:</p><ul><li><p><strong>BLAS</strong> (Basic Linear Algebra Subprograms) for all architectures.</p></li><li><p><strong>CUDA</strong> for NVIDIA GPUs.</p></li><li><p><strong>HIP</strong> for AMD GPUs.</p></li><li><p><strong>MPS</strong> (Metal Performance Shaders) for Apple M-series chips.</p></li></ul><blockquote><p>In this context, a backend is nothing more than an execution engine that translates model operations into hardware-specific instructions.</p><p>See the full list of supported llama.cpp <strong><a href="https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#supported-backends">Backends</a></strong>.</p></blockquote><p>One key component of the llama.cpp ecosystem is the GGUF binary model format. On HuggingFace, the recommended model storage format is <strong><a href="https://huggingface.co/docs/safetensors/en/index">safetensors</a></strong>, which stores tensor-only data. 
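To make the "tensors plus metadata in one file" idea concrete, here is a toy self-describing binary layout in the spirit of GGUF (this is not the real GGUF spec; the magic bytes, the JSON metadata encoding, and the single tensor blob are all simplifications):

```python
import io
import json
import struct

MAGIC = b"TOYF"  # illustrative; real GGUF uses different magic and a richer layout

def write_model(buf, metadata: dict, tensor_bytes: bytes):
    """Pack metadata and raw tensor data into one self-describing blob."""
    meta = json.dumps(metadata).encode()
    buf.write(MAGIC)
    buf.write(struct.pack("<I", 1))          # format version
    buf.write(struct.pack("<I", len(meta)))  # metadata length in bytes
    buf.write(meta)
    buf.write(tensor_bytes)

def read_model(buf):
    """Recover version, metadata, and tensor bytes from the blob alone."""
    assert buf.read(4) == MAGIC
    (version,) = struct.unpack("<I", buf.read(4))
    (meta_len,) = struct.unpack("<I", buf.read(4))
    metadata = json.loads(buf.read(meta_len))
    return version, metadata, buf.read()

blob = io.BytesIO()
write_model(blob, {"arch": "llama", "tokenizer": "bpe", "quant": "Q4_K"}, b"\x00" * 8)
blob.seek(0)
version, meta, tensors = read_model(blob)
print(version, meta["arch"], len(tensors))  # 1 llama 8
```

The real format defines typed metadata key-values, per-tensor descriptors, and alignment rules, but the principle is the same: a reader can recover everything it needs (architecture, tokenizer, quantization) from the file alone.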
GGUF is llama.cpp compatible, and it stores tensors and rich metadata.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i9_A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i9_A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 424w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 848w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 1272w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i9_A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png" width="1456" height="1060" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1060,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i9_A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 424w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 848w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 1272w, https://substackcdn.com/image/fetch/$s_!i9_A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2d7e150e-5fb0-4bbc-b1e1-1ccb71a8ef9a_3072x2236.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 11.2 The GGUF compressed format to store LLM weights, parameters, and rich metadata. Taken from HuggingFace.</figcaption></figure></div><p>GGUF standardizes how weights, tokenizers, and metadata are stored, making models portable across different backends (CPU, CUDA, Metal, Vulkan, etc.) and enabling faster model loading times. </p><blockquote><p>See this <strong><a href="https://blog.steelph0enix.dev/posts/llama-cpp-guide/">extensive tutorial</a></strong> on getting started with llama.cpp.</p></blockquote><div><hr></div><h2>Serving Frameworks</h2><p>Having covered the major Inference Engines, we can move on to the next component of an AI inference stack: the Serving Framework. 
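Before looking at specific frameworks, here is a minimal pure-Python sketch of a serving framework's core job: queueing incoming requests, grouping them into batches for the engine, and mapping results back to the right caller (the engine stub and batch size are illustrative):

```python
from collections import deque

def fake_engine(batch):
    """Stand-in for an inference engine that answers a whole batch at once."""
    return [prompt.upper() for prompt in batch]

def serve(requests, max_batch=4):
    """Group queued requests into batches and map results back by request id."""
    queue = deque(enumerate(requests))  # (request_id, prompt) pairs
    responses = {}
    while queue:
        # Pull up to `max_batch` waiting requests into one engine call.
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        ids, prompts = zip(*batch)
        for rid, output in zip(ids, fake_engine(list(prompts))):
            responses[rid] = output  # request-response mapping
    return [responses[i] for i in range(len(requests))]

print(serve(["hi", "ok", "yes", "no", "go"]))  # ['HI', 'OK', 'YES', 'NO', 'GO']
```

Production frameworks do this concurrently, with streaming, timeouts, and continuous batching, but the queue-batch-map loop is the skeleton they all share.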
Besides loading the inference engine and executing models through it, a Serving Framework takes care of managing the inference server.</p><p>This includes system-wide optimizations, batching, request-response mapping, and more.</p><p>In this section, we&#8217;ll describe 4 Serving Frameworks, using LLM serving as the use case. We won&#8217;t cover TorchServe or TensorFlow Serving, as these are more general-purpose.</p><h4>1/ Hugging Face TGI - Text Generation Inference</h4><p>TGI is a toolkit for deploying and serving LLMs, enabling high-performance text generation with the most popular open-source LLMs. </p><p>As with any other Serving Framework, TGI is built on two components: a Web Server, written in Rust, that handles batching and request routing, and the LLM Executor from the HuggingFace Transformers library, which runs the token sequence through the LLM.</p><p>Notable system-wide optimizations that TGI offers are:</p><ul><li><p>Tensor Parallelism - to distribute larger models as shards across multiple GPUs.</p></li><li><p>A FlashAttention plugin that fuses the Attention operations into a single kernel, reducing the memory footprint and increasing token throughput.</p></li><li><p>Continuous Batching and Paged Attention.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u7wk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u7wk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 424w, 
https://substackcdn.com/image/fetch/$s_!u7wk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 848w, https://substackcdn.com/image/fetch/$s_!u7wk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 1272w, https://substackcdn.com/image/fetch/$s_!u7wk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u7wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png" width="1456" height="890" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Text Generation Inference&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Text Generation Inference" title="Text Generation Inference" srcset="https://substackcdn.com/image/fetch/$s_!u7wk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 424w, 
https://substackcdn.com/image/fetch/$s_!u7wk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 848w, https://substackcdn.com/image/fetch/$s_!u7wk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 1272w, https://substackcdn.com/image/fetch/$s_!u7wk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56f0b49e-9b19-46a2-a252-a15e7ce771fd_7178x4390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 12. HuggingFace TGI architecture, composed of the Rust web server and the high-level Python API from the HuggingFace Transformers library to execute the LLM feedforward. Taken from the TGI HuggingFace Page.</figcaption></figure></div><blockquote><p>Find more about <strong><a href="https://huggingface.co/docs/text-generation-inference/en/index">TGI</a></strong> and how to get started.</p></blockquote><div><hr></div><h4>2/ Ollama</h4><p>Ollama can be considered a wrapper over llama.cpp for serving LLMs locally. </p><p>Similar to TGI on the architecture side, Ollama runs GGUF models through llama.cpp as the Inference Engine, but builds an optimized Web Server in Go to handle batching, system-wide optimizations, and request-response routing.</p><blockquote><p>Getting up and running with llama.cpp is no easy feat, and Ollama aims to simplify that by providing an easy-to-use workflow for running LLMs.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZQM7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZQM7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 424w, https://substackcdn.com/image/fetch/$s_!ZQM7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZQM7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQM7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZQM7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png" width="488" height="642.2528735632184" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1374,&quot;width&quot;:1044,&quot;resizeWidth&quot;:488,&quot;bytes&quot;:178287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZQM7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZQM7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 848w, https://substackcdn.com/image/fetch/$s_!ZQM7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQM7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F653b97d5-57c5-4efc-8085-bad885718fb5_1044x1374.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 13. Downloading the Ollama installer. Taken from the ollama landing page.</figcaption></figure></div><p>With Ollama, you download the installer and launch it; it automatically starts an Ollama Server and keeps it alive. From the Ollama UI or the CLI, you can then pull models and serve them.</p><blockquote><p>In reality, Ollama models are GGUF files stored under Ollama&#8217;s own model format, the Modelfile. These GGUFs are executed by llama.cpp underneath.</p><p>Ollama provides the ecosystem around it.</p></blockquote><p>Ollama Server exposes OpenAI-compatible API endpoints that you can integrate into your applications.</p><blockquote><p>Find more about <strong><a href="https://github.com/ollama/ollama/tree/main/docs">Ollama</a></strong>, how it works, and how to get started.</p></blockquote><div><hr></div><h4>3/ Triton Inference Server</h4><p>NVIDIA&#8217;s Triton Inference Server is one of the most mature and production-ready serving frameworks out there. 
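To make the Ollama subsection above concrete, here is a minimal sketch of calling Ollama's OpenAI-compatible chat-completions endpoint using only the Python standard library. The model name is an assumption (any model you have pulled works); the base URL uses Ollama's default local port 11434, and the server must be running for the request itself to succeed.

```python
import json
import urllib.request

# Ollama's default OpenAI-compatible base URL (assumes a local server).
OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions POST request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Usage (with the server running and e.g. `ollama pull llama3.2` done beforehand):
#   with urllib.request.urlopen(build_chat_request("llama3.2", "Hi")) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, the same payload shape works against TGI or vLLM servers by swapping the base URL.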
Although it has a steep learning curve and is quite complex to master, Triton Server is strictly optimized for AI workloads in production.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y4j6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y4j6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 424w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 848w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 1272w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y4j6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif" width="1456" height="1459" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1459,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2976595,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://multimodalai.substack.com/i/171119788?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y4j6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 424w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 848w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 1272w, https://substackcdn.com/image/fetch/$s_!y4j6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d0a449d-9b23-44af-8ad2-4a020fc07c87_2300x2304.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 14. How NVIDIA&#8217;s Triton Inference Server works underneath. Models are loaded from the model repository and deployed. 
Triton Server comes with built-in telemetry via Prometheus, exposing key metrics on both hardware and model performance.</figcaption></figure></div><p>NVIDIA Triton Server supports multiple Inference Engines, such as TorchScript, TensorFlow, ONNX, TensorRT, and, more recently, it can also serve TensorRT-LLM engines.</p><blockquote><p>Learn more about NVIDIA Triton Server in this hands-on article.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d2cd8edb-7743-4236-a505-ac8b68e15115&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;NVIDIA Triton Inference Server made simple.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-08-20T13:01:16.885Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4460ad-0e7e-4545-aee6-274b93dd5959_2300x2304.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/how-to-use-nvidia-triton-server-the&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147783782,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Neural 
Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><h2>Distributed Inference Frameworks (LLMs)</h2><h4>1/ NVIDIA Dynamo</h4><p>Dynamo is NVIDIA&#8217;s newest serving framework, focused strictly on LLM inference. Referred to as the `Triton Server successor`, Dynamo focuses on distributed inference of large LLMs, particularly Reasoning LLMs.</p><p>One of the key optimizations of Dynamo is <strong><a href="https://arxiv.org/abs/2401.09670">Disaggregated Serving</a></strong>.</p><blockquote><p>If you remember, LLM inference has two stages, prefill and decode, with the decode stage predicting one token at a time - Dynamo focuses on parallelizing the decode stage across multiple GPUs or Nodes.</p></blockquote><p>Dynamo supports and can serve both TensorRT-LLM and vLLM engines, with the former achieving state-of-the-art results on NVIDIA GPUs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r0gr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r0gr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 424w, 
https://substackcdn.com/image/fetch/$s_!r0gr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 848w, https://substackcdn.com/image/fetch/$s_!r0gr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 1272w, https://substackcdn.com/image/fetch/$s_!r0gr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r0gr!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png" width="1200" height="642.8571428571429" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:780,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r0gr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 424w, 
https://substackcdn.com/image/fetch/$s_!r0gr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 848w, https://substackcdn.com/image/fetch/$s_!r0gr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 1272w, https://substackcdn.com/image/fetch/$s_!r0gr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 15. How NVIDIA Dynamo works underneath. This diagram is an edited version of the original, where I overlaid key details on how each component works.</figcaption></figure></div><p>One key component of Dynamo is NIXL (NVIDIA Inference Xfer Library), which enables high-performance point-to-point communication, allowing the KV-Cache computed on a set of distributed nodes to be accessed and shared quickly, reducing the time taken by the prefill &amp; decode inference steps.</p><blockquote><p>Check this previous article with a full, in-depth walkthrough of NVIDIA Dynamo.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5a86551f-c9bc-49a0-b91a-e75b3a975b6e&quot;,&quot;caption&quot;:&quot;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Unpacking NVIDIA Dynamo LLM Inference Framework&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:102147316,&quot;name&quot;:&quot;Alex Razvant&quot;,&quot;bio&quot;:&quot;Senior AI Engineer | I work on large-scale Vision AI &amp; MLOps | I share practical industry insights for AI/ML Engineers, on building production-ready AI 
Systems.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0862f5c4-de09-482a-a322-7cf43751d511_1624x1624.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-20T08:02:26.677Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!r0gr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e9eeaa-9673-48c6-bb02-b68a94b1ba56_5843x3130.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://multimodalai.substack.com/p/the-one-stop-to-understand-the-new&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:159462671,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:27,&quot;comment_count&quot;:5,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Neural Bits&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!onU4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5740986b-9860-4ac5-b479-dd8853ab4f2e_1280x1280.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></blockquote><div><hr></div><h4>2/ AIBrix (backed by the vLLM Team)</h4><p>Similar to Dynamo, AIBrix is an <strong><a href="https://github.com/vllm-project/aibrix">open-source</a></strong> initiative designed to provide the essential building blocks for scalable GenAI inference infrastructure, <strong>built by the same team behind vLLM. </strong></p><p>AIBrix distributes the inference infrastructure on top of a Kubernetes cluster. Its Control Plane handles the Model Metadata stores (which can include LoRA Adapters or fully fine-tuned model checkpoints), Load Balancing, the Autoscaler for Inference Endpoints, and Controllers. 
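To build some intuition for the load-balancing role of such a control plane, here is a toy sketch, independent of any real framework, of least-loaded request routing across model-serving replicas. The replica names and the in-flight counter are purely illustrative; a real gateway would track richer signals (queue depth, KV-Cache locality, token throughput).

```python
from dataclasses import dataclass

@dataclass
class Replica:
    """A model-serving replica tracked by a toy gateway (illustrative only)."""
    name: str
    in_flight: int = 0  # requests currently being processed by this replica

class LeastLoadedRouter:
    """Route each incoming request to the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def route(self) -> Replica:
        # Pick the least-loaded replica; ties go to the first one in the list.
        target = min(self.replicas, key=lambda r: r.in_flight)
        target.in_flight += 1
        return target

    def complete(self, replica: Replica) -> None:
        # Called when a request finishes, freeing capacity on that replica.
        replica.in_flight -= 1

router = LeastLoadedRouter([Replica("vllm-pod-a"), Replica("vllm-pod-b")])
first = router.route()   # both replicas idle -> picks the first one
second = router.route()  # the first replica is now busy -> picks the other
```

In this sketch the "replicas" would correspond to the vLLM-serving Pods described next, with the router sitting in the control plane in front of them.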
</p><p>The actual inference workload is distributed across a set of Kubernetes Pods, with each Pod serving a vLLM Engine with a built-in Model Loader, a WatchDog for metrics, and autoscaling triggers. To share the KV-Cache across multiple vLLM instances, AIBrix uses KV-Cache-specific pods, mounted as sidecars to the runtimes serving the models. </p><blockquote><p>For cross-pod KV-Cache, AIBrix can also use the <strong>NVIDIA NIXL</strong> Library.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D6Pt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D6Pt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 848w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D6Pt!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg" width="1200" height="978.6963434022258" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1026,&quot;width&quot;:1258,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;aibrix-architecture-v1&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="aibrix-architecture-v1" title="aibrix-architecture-v1" srcset="https://substackcdn.com/image/fetch/$s_!D6Pt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 848w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!D6Pt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87547456-47b0-4469-b0a4-ad210fca6d50_1258x1026.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 16. AIBrix architecture. Taken from the GitHub Page.</figcaption></figure></div><div><hr></div><h4>3/ vLLM + LLM-D (Kubernetes)</h4><p><strong><a href="https://github.com/llm-d/llm-d">llm-d</a></strong> follows the same principles as AIBrix and Dynamo. 
It uses vLLM as the model server and engine, the Inference Gateway as the request scheduler and load balancer, and Kubernetes as the infrastructure orchestrator and workload control plane.</p><p>To share KV-Cache across Pods and Nodes, it can also use the NVIDIA NIXL or DCN libraries, and for independent, pod-persistent KV-Cache, it can use LMCache or host memory.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EO3o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EO3o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 424w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 848w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 1272w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EO3o!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg" width="1200" height="664.2066420664206" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:150,&quot;width&quot;:271,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;llm-d Arch&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="llm-d Arch" title="llm-d Arch" srcset="https://substackcdn.com/image/fetch/$s_!EO3o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 424w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 848w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 1272w, https://substackcdn.com/image/fetch/$s_!EO3o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb7a3463-1df9-463e-837c-57f3573aadd7_859x474.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure 17. The LLM-d architecture showcases how it distributes LLM Inference workloads on top of a Kubernetes Cluster. It uses vLLM as the Engine and Serving mechanism, and routes requests via the Inference Gateway to the appropriate Pods in the Inference Pool. 
Taken from the GitHub Page.</figcaption></figure></div><h2>Other Interesting Frameworks</h2><h4>1/ Mojo and Mojo MAX Engine</h4><p>Developed by Modular, Mojo aims to combine the flexibility and usability of Python with the raw performance of languages such as C++ or Rust.</p><p>Mojo is designed to become a superset of Python, keeping familiar syntax while adding systems-level features for performance and control.</p><p>AI engineers want that performance without having to write custom kernels for every accelerator, such as CUDA for NVIDIA GPUs or ROCm for AMD. Mojo&#8217;s ambition is to make this possible.</p><p>To understand how, let&#8217;s go through a few details at the compiler level.</p><p>Python is an interpreted language: the source code you write is compiled into bytecode, which the CPython interpreter executes through its virtual machine.</p><p>Mojo, like C++ or Rust, is a compiled language, but it works differently. In C++, for instance, if you build a library for x86 CPUs, you need to rebuild it for ARM64 (e.g., Apple M-series chips).</p><p>Mojo avoids this limitation by building on LLVM. Instead of targeting one architecture directly, it translates source code into <strong><a href="https://mlir.llvm.org/">MLIR</a></strong> (Multi-Level Intermediate Representation). 
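</p><p>The &#8220;one intermediate representation, many targets&#8221; idea can be sketched in plain Python. This is a toy illustration only (the IR, op names, and backends below are invented for the example, not Mojo or MLIR code): a single tiny IR is lowered through two pretend hardware &#8220;dialects&#8221;.</p>

```python
# A tiny hardware-agnostic IR: one op per tuple -> (op, dst, lhs, rhs).
# Toy illustration of retargetable lowering; real MLIR dialects are far richer.
program = [
    ("mul", "t0", "x", "y"),
    ("add", "t1", "t0", "z"),
]

def emit_cpu(ir):
    """Lower the IR to a pretend scalar-CPU dialect."""
    sym = {"mul": "*", "add": "+"}
    return [f"{dst} = {a} {sym[op]} {b}" for op, dst, a, b in ir]

def emit_gpu(ir):
    """Lower the *same* IR to a pretend GPU-intrinsic dialect."""
    mnemonic = {"mul": "gpu.fmul", "add": "gpu.fadd"}
    return [f"{mnemonic[op]} {dst}, {a}, {b}" for op, dst, a, b in ir]

# One front end, two targets -- no per-backend rewrite of the source program.
for line in emit_cpu(program) + emit_gpu(program):
    print(line)
```

<p>Swap the backend and the front end stays untouched; this is the property MLIR industrializes with its dialect system.</p><p>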
MLIR contains a set of different dialects with primitives for multiple hardware targets (NVIDIA, AMD, TPU, CPU, etc.).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Xl9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Xl9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 424w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 848w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 1272w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Xl9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png" width="893" height="718" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:893,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8Xl9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 424w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 848w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 1272w, https://substackcdn.com/image/fetch/$s_!8Xl9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd79f54a-e9a3-4e5a-9a80-7593a27e9a47_893x718.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"></svg></button></div></div></div></a><figcaption class="image-caption">Figure 18. The Model to Hardware workflow using Mojo. 
Taken from the Modular Blog.</figcaption></figure></div><p><strong>MAX Engine </strong>is Modular&#8217;s next-generation compiler and runtime library for running AI inference and serving AI Models.</p><p>Similar to other serving frameworks, MAX Engine can take a trained model either in PyTorch (TorchScript), ONNX, or native Mojo formats, and serve it on a wide range of hardware, based on the principles discussed above.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lr_h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lr_h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 424w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 848w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 1272w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Lr_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg" width="634" height="567" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:634,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;MAX: AI Compute Platform&quot;,&quot;title&quot;:&quot;MAX: AI Compute Platform&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="MAX: AI Compute Platform" title="MAX: AI Compute Platform" srcset="https://substackcdn.com/image/fetch/$s_!Lr_h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 424w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 848w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 1272w, https://substackcdn.com/image/fetch/$s_!Lr_h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30604fd3-ee89-4561-95ab-1fe218a72338_634x567.svg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"></div></div></div></a><figcaption class="image-caption">Figure 19. Architecture of the Mojo MAX Engine. Taken from the MAX landing page.</figcaption></figure></div><h4>2/ LMDeploy</h4><p><strong><a href="https://github.com/InternLM/lmdeploy">LMDeploy</a></strong> is a toolkit for compressing, deploying, and serving LLMs. 
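</p><p>On the compression side, the core building block in such toolkits is weight quantization. The sketch below is a generic, from-scratch symmetric int8 round-trip, for intuition only; it is not LMDeploy&#8217;s actual implementation, which uses per-group scales and activation-aware schemes such as AWQ:</p>

```python
# Generic symmetric int8 weight-quantization round-trip (illustrative only;
# production toolkits use per-group scales, calibration, and fused kernels).

def quantize_int8(weights):
    """Map float weights to int8 codes plus one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the int8 codes."""
    return [c * scale for c in codes]

w = [0.12, -0.98, 0.45, 0.03]
codes, scale = quantize_int8(w)
w_hat = dequantize(codes, scale)

# The round-trip error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= scale / 2
```

<p>Each weight now occupies one byte instead of four; the serving-side speedups then come from low-bit kernels that consume these quantized weights directly. 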
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JK-l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JK-l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 424w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 848w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 1272w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JK-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png" width="1456" height="442" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;v0 1 
0-benchmark&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="v0 1 0-benchmark" title="v0 1 0-benchmark" srcset="https://substackcdn.com/image/fetch/$s_!JK-l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 424w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 848w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 1272w, https://substackcdn.com/image/fetch/$s_!JK-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cdc9d38-be9e-45ae-9cbc-11a36d1c401f_4786x1453.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button></div></div></div></a><figcaption class="image-caption">Figure 20. LMDeploy compared to vLLM: higher request throughput and better output-token saturation as the batch size increases. Taken from the GitHub Page.</figcaption></figure></div><div><hr></div><h2>Conclusion</h2><p>In this article, we started by explaining the phases of Model Development and Deployment, a workflow that almost every AI model follows. We covered pre-training, post-training, and what to consider before deploying an AI model as part of a larger system.</p><p>From there, we surveyed a wide range of AI inference engines and serving frameworks, each designed for different deployment scenarios.</p><p>Each has its strengths, and only in a few cases is one clearly better than another: the core optimizations are largely the same, with minor differences between them.</p><p>As a rule of thumb:<br>&#8594; Use <em><strong>OpenVINO</strong></em> when your deployment targets are Intel CPUs and GPUs. 
<br>&#8594; For Apple devices, focus on <em><strong>CoreML</strong></em> and <em><strong>llama.cpp</strong></em>.<br>&#8594; To test things out and run LLMs locally, go with <strong>Ollama</strong>.<br>&#8594; To debug models and store them in a standardized, widely compatible format, you might want to choose <em><strong>ONNX</strong></em>.<br>&#8594; Experimental yet promising is the <strong>Mojo MAX Engine</strong>, which aims to be cross-compatible with a wide range of hardware.<br>&#8594; For a large majority of LLM deployments, <em><strong>vLLM</strong></em> is enough.<br>&#8594; If you want more control, at the cost of more complexity, choose <em><strong>llama.cpp</strong></em>.<br>&#8594; For truly large-scale deployments, stick to the <strong>TensorRT</strong> and <strong>TensorRT-LLM</strong> inference engines, and choose NVIDIA <strong>Triton</strong>, NVIDIA <strong>Dynamo</strong>, vLLM + llm-d, or AIBrix for distributed inference.</p><p>The key takeaway: there is no one-size-fits-all runtime.</p><p>The right choice depends on your <strong>hardware</strong>, your <strong>latency</strong> vs. <strong>throughput</strong> needs, and whether your model is running in the cloud or on the edge.</p><p>-</p><blockquote><p>Thank you for reading, I hope you learned a lot!<br>See you next week!</p></blockquote><div><hr></div><h3>References</h3><p><em>What Is Transfer Learning? A Guide for Deep Learning | Built In</em>. (2022). Built In. <a href="https://builtin.com/data-science/transfer-learning">https://builtin.com/data-science/transfer-learning</a></p><p>Lambert, N. (2025, January 8). <em>The state of post-training in 2025</em>. Interconnects.ai; Interconnects. <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">https://www.interconnects.ai/p/the-state-of-post-training-2025</a></p><p>&#8204;<em>ONNX Runtime | Home</em>. (2025). Onnxruntime.ai. <a href="https://onnxruntime.ai/">https://onnxruntime.ai/</a></p><p>&#8204;Shah, A. (2024, February 21). 
<em>NVIDIA TensorRT-LLM Revs Up Inference for Google Gemma</em>. NVIDIA Technical Blog. <a href="https://developer.nvidia.com/blog/nvidia-tensorrt-llm-revs-up-inference-for-google-gemma/">https://developer.nvidia.com/blog/nvidia-tensorrt-llm-revs-up-inference-for-google-gemma/</a></p><p>&#8204;Razvant, A. (2024, August 6). <em>3 Inference Engines for optimal model throughput</em>. Substack.com; Neural Bits. <a href="https://multimodalai.substack.com/p/3-inference-engines-for-optimal-throughput">https://multimodalai.substack.com/p/3-inference-engines-for-optimal-throughput</a></p><p>&#8204;Razvant, A. (2025, March 27). <em>How does vLLM serve LLMs efficiently at scale?</em> Substack.com; Neural Bits. <a href="https://multimodalai.substack.com/p/unpacking-vllm-distributed-inference">https://multimodalai.substack.com/p/unpacking-vllm-distributed-inference</a></p><p>&#8204;<em>Core ML | Apple Developer Documentation</em>. (2025). Apple Developer Documentation. <a href="https://developer.apple.com/documentation/coreml">https://developer.apple.com/documentation/coreml</a></p><p>&#8204;<em>openvinotoolkit/openvino: OpenVINO<sup>TM</sup> is an open source toolkit for optimizing and deploying AI inference</em>. (2025, June 18). GitHub. <a href="https://github.com/openvinotoolkit/openvino">https://github.com/openvinotoolkit/openvino</a></p><p>&#8204;<em>GGUF</em>. (2025). Huggingface.co.<a href="https://huggingface.co/docs/hub/en/gguf"> https://huggingface.co/docs/hub/en/gguf</a></p><p>&#8204;<em>ggml-org/llama.cpp: LLM inference in C/C++</em>. (2025). GitHub. <a href="https://github.com/ggml-org/llama.cpp">https://github.com/ggml-org/llama.cpp</a></p><p>&#8204;<em>Text Generation Inference</em>. (2025). Huggingface.co. <a href="https://huggingface.co/docs/text-generation-inference/en/index">https://huggingface.co/docs/text-generation-inference/en/index</a></p><p>&#8204;<em>Ollama</em>. (2025). Ollama. 
<a href="https://ollama.com">https://ollama.com/</a></p><p>&#8204;<em>Reminder: You don&#8217;t need ollama, running llamacpp is as easy as ollama. Ollama i... | Hacker News</em>. (2024). Ycombinator.com. <a href="https://news.ycombinator.com/item?id=40693391#:~:text=Ollama%20is%20just%20a%20wrapper%20over%20llamacpp.&amp;text=Llamacpp%20is%20great%20but%20saying,Ollama%20just%20isn't%20true.&amp;text=It%20actually%20is%20true.,OpenAI%20compatible%20server%20using%20llama.">https://news.ycombinator.com/item</a></p><p><em>ai-dynamo/dynamo: A Datacenter Scale Distributed Inference Serving Framework</em>. (2025, August 12). GitHub. <a href="https://github.com/ai-dynamo/dynamo">https://github.com/ai-dynamo/dynamo</a></p><p>Team, Aib. (2025, February 21). <em>Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM</em>. VLLM Blog; AIBrix Team. <a href="https://blog.vllm.ai/2025/02/21/aibrix-release.html">https://blog.vllm.ai/2025/02/21/aibrix-release.html</a></p><p>&#8204;<em>LMCache/LMCache: Supercharge Your LLM with the Fastest KV Cache Layer</em>. (2025, August 3). GitHub. <a href="https://github.com/LMCache/LMCache">https://github.com/LMCache/LMCache</a></p><p><em>Modular: MAX 24.3 - Introducing MAX Engine Extensibility</em>. (2024). Modular.com. <a href="https://www.modular.com/blog/max-24-3-introducing-max-engine-extensibility">https://www.modular.com/blog/max-24-3-introducing-max-engine-extensibility</a></p><div><hr></div><h3><strong>Images and Videos</strong></h3><p>All images are created by the author, if not otherwise stated.</p>]]></content:encoded></item></channel></rss>