Docker

Achieving Test Reliability for Native E2E Testing: Beyond Fixing Broken Tests

Jin Kim — Fri, 13 Mar 2026 13:00:00 +0000

End-to-end (E2E) tests are particularly important for native applications that run on various platforms (Android/iOS), screen sizes, and OS versions. E2E testing picks up differences in behavior across this fragmented ecosystem.

But keeping E2E tests reliable is often more challenging than writing them in the first place.

The fragmented device ecosystem, gaps in test frameworks, network inconsistencies, unstable test environments, and constantly changing UI all contribute to test flakiness. Teams easily get trapped in a cycle of constantly fixing failing tests due to UI changes or environment instability rather than improving the overall reliability of their test infrastructure. They end up frustrated and hesitant to adopt E2E tests in their workflows.

Having led the native E2E testing infrastructure setup at a mid-sized company, I learned the hard way how critical it is to define and implement strategies for test ownership, observability, and notifications in ensuring long-term test stability. In this piece, I discuss the challenges I’ve seen teams face and share lessons on how to build reliable E2E systems that you actually trust.

Challenges with Reactive Test Maintenance

After setting up periodic E2E runs on the CI, our team initially focused on triaging, investigating, and fixing every failing test to improve test stability. However, even after nearly a year of patching flaky tests, the reliability of our E2E suite didn’t improve, and engineers slowly lost confidence in the usefulness and reliability of the test suite.

I learned that teams that focus primarily on fixing broken tests often end up in a cycle of chasing failures without fixing the root causes of instability. This reactive approach creates several problems:

Test suite fragility: If teams continue patching broken tests without addressing real issues with either the underlying app changes or unstable environments, the test suite becomes increasingly brittle. Over time, tests fail for reasons unrelated to real product defects, making it harder to distinguish genuine regressions from noise.
High maintenance overhead: Debugging and fixing flaky tests often requires a significant amount of developer time and resources. Unlike unit tests, which run quickly and fail in isolation, E2E tests execute against the development, staging, or pre-production environment, making failures harder to reproduce and diagnose. Adjusting E2E tests to work across devices with different screen sizes or OS versions requires additional work, making fixes a non-trivial task.
Reduces trust in the test suites: When failures are common and noisy, teams lose confidence in the E2E suite, and they often start ignoring test failures. This undermines the purpose of having automated tests in the first place. Instead, teams rely on local dev testing or manual QA cycles to validate changes. Over time, the suite becomes more of a liability than a safeguard, slowing down delivery instead of enabling it.

A reactive approach to fixing E2E tests slows down release cycles. Developers must spend significant amounts of time repeatedly fixing and rerunning failing tests, while teams rely on manual QA to catch actual regressions.

Building a Reliable E2E Infrastructure

When our test suite stability didn’t improve after more than a year of chasing failures, we took a step back to analyze historical results and look for patterns.

We discovered that a significant number of failures could be attributed to an unstable environment or an unexpected state of the test account. For example, spikes in API latencies in the test environment frequently caused false negatives, adding to the noise. Similarly, tests run against existing user accounts could become inconsistent due to a past failure or if multiple tests attempted to use the same account.

I learned that investing in improving your test infrastructure is the only way to get to a stable and reliable native E2E testing workflow. This involves stabilizing the test environment, defining clear test ownership, reducing noisy alerts, and improving observability. Let’s look at each of these in more detail.

Stabilize the Test Environment

Many flaky E2E tests can be traced back to inconsistencies in the underlying environment, such as sporadic device issues, network instability, or API downtime in a staging environment.

To avoid noisy and unreliable tests, ensure you have a stable and standardized test environment with the following test practices:

Standardize device and environment setup: Device and test environment stability issues heavily impact test stability. To reduce API downtimes, isolate the E2E testing environment from the developer or staging environment to prevent interference from unstable builds and experimental features. Teams could either build a stable pre-prod environment that uses a production-ready artifact or spin up ephemeral environments for each E2E run to ensure consistency. Running tests on standardized device images or containerized emulators with consistent OS versions, configurations, and resources further improves stability. For critical flows, you can schedule periodic runs on physical device farms to validate against real hardware while keeping day-to-day tests stable and cost-effective.
Isolating test data per session: A test that makes modifications to any data should start from a clean slate. For instance, while testing a todo application, every test session should use a new test account to avoid unexpected scenarios because of unpredictable account state. To speed up tests, execute setup scripts in `before` hooks to handle account creation, and seed any required data automatically.
Mocking certain network responses: While an E2E test is meant to test the entire user journey with real data, in some cases, it’s necessary to mock specific API responses to maintain a predictable test environment. For instance, if your application relies on A/B tests or uses feature flags, different sessions might receive different experiences based on the user allocation. This can cause unexpected failures unrelated to actual regressions. Mocking these responses in test builds ensures consistency across sessions, and it avoids building complex test cases that handle different user experiences.

Establish Clear Test Ownership

When a test fails, it’s often unclear who’s responsible for investigating and fixing it. Over time, such an absence of clear test ownership and accountability results in unreliable, unmaintained, and flaky tests.

Assigning ownership of tests based on the ownership of product features can alleviate this problem to some extent. Ideally, the owning team should be responsible for writing, maintaining, and fixing tests for their critical flows. This ownership model ensures that failures are triaged quickly and that tests are updated as the product evolves instead of becoming stale and unstable.

Test ownership becomes challenging in codebases where multiple product teams own parts of a single user flow. For example, in a shopping application, different teams might own the login, product catalog, and checkout experiences. If a checkout flow test fails at the login step, it can be confusing which team should triage the issue. Without a clear policy, the failure might be ignored, or multiple teams might end up duplicating the effort.

To handle these scenarios, set a policy that defines the first point of contact (POC) per test based on the end-user experience. This ensures a single team takes responsibility for triaging the issue, but that fixes can be handed off to upstream dependencies as needed.

Reduce Noise and Improve Alerting

A common challenge with native E2E testing is noisy alerts due to flaky or failing tests. Teams are often flooded with non-actionable alerts when flaky tests fail because of transient network or device issues. Repeated failure notifications about known bugs can also lead to alert fatigue.

The following techniques reduce this noise so that teams are only notified for actionable failures:

Mute flaky tests and known bugs: Instead of reporting and notifying teams about all test failures, allow alerts from tests that are identified as flaky or linked to known issues to be muted without a code change. You can manage muted tests through a remote configuration, environment variables, or a tool like BrowserStack. Flag them for follow-up work, but let alerts only go out for new or unexpected regressions. Muting is particularly important for E2E tests since fixing failing tests often requires significant developer time and resources. Repeated alerts can be especially distracting for developers.
Enrich notifications with failure details: Instead of generic failure messages, include details such as the failing user flow, commit details, the error message, and links to logs or dashboards in your alerts. These details help developers identify and triage issues quicker, resulting in faster fixes and higher confidence in the suite.
Track test metrics and trends: In addition to test suite level reports, track and analyze the historical results of your tests to understand failure rates, flakiness trends, and failure hotspots. For example, if you observe repeated failures in the login flow, it might indicate unstable tests or sporadic bugs in that flow. Tracking these metrics over time provides visibility into whether the E2E suite is improving or degrading, and it helps you prioritize stabilization efforts based on impact.

Hybrid Strategies for Scaling E2E with Dockerized Emulators

Running native E2E tests at scale is challenging due to cost and resource constraints. Device farms that provide access to real cloud-based devices are expensive for running a large suite of tests at high frequency. This becomes a constraint for integrating E2E tests with the CI pipeline that executes with every pull request before the changes are merged.

As mentioned earlier, a hybrid testing approach that uses Dockerized emulators for PR builds alongside real devices for periodic runs can help you overcome this challenge. When our team moved PR checks to Dockerized emulators, we got faster feedback and significantly reduced cloud device costs.

Containerized device runners can be spun up quickly in CI. For example, the docker-android image lets you run an Android emulator in a containerized Docker environment. It supports multiple device profiles, OS versions, and UI-testing frameworks such as Appium and Espresso. Teams can easily integrate these emulators into CI pipelines to run E2E tests at scale without investing in a huge testing budget

If you are building E2E tests for mobile web, you can also use containerized browser images to run tests consistently across different environments to further reduce cost and setup complexity.

There’s Hope!

If your team has been chasing native E2E test failures like we were, you’re probably also burning engineering time and resources without improving test stability. I hope that this article has encouraged you that there’s a better way: improving your test environment, device setup, alerting, and observability.

Your best first step is to analyze your historical test failures and categorize them into buckets. Use these insights to define actionable items for reducing flakiness. Use this roadmap to identify test infrastructure investments or process changes that will deliver the most impact.

After our team invested in test infrastructure improvements, we saw a clear improvement in stability. Developers had a better understanding of real failures, and the number of noisy alerts was reduced. Flakiness didn’t disappear entirely, but the improved reliability of the test suite helped us catch multiple native app regressions before the changes were released to production.

I hope this article will help you achieve similar wins.

How to Run Claude Code with Docker: Local Models, MCP Servers, and Secure Sandboxes

Yiwen Xu — Fri, 13 Mar 2026 12:17:54 +0000

Claude Code is quickly becoming a go-to AI coding assistant for developers and increasingly for non-developers who want to build with code. But to truly unlock its potential, it needs the right local infrastructure, tool access, and security boundaries.

In this blog, we’ll show you how to run Claude Code with Docker to gain full control over your models, securely connect it to real-world tools using MCP servers, and safely give it autonomy inside isolated sandboxes. Read on for practical resources to help you build a secure, private, and cost-efficient AI-powered development workflow.

Run Claude Code Locally with Docker Model Runner

This post walks through how to configure Claude Code to use Docker Model Runner, giving you full control over your data, infrastructure, and spend. Claude Code supports custom API endpoints through the ANTHROPIC_BASE_URL environment variable. Since Docker Model Runner exposes an Anthropic-compatible API, integrating the two is simple. This allows you to run models locally while maintaining the Claude Code experience.

With your model running under your control, it’s time to connect Claude Code to tools to expand its capabilities.

How to Add MCP Servers to Claude Code with Docker MCP Toolkit

MCP is becoming the de facto standard to connect coding agents like Claude Code to your real tools, databases, repositories, browsers, and APIs. With more than 300 pre-built,containerized MCP servers, one-click deployment in Docker Desktop, and automatic credential handling, developers can connect Claude Code to trusted environments in minutes — not hours. No dependency issues, no manual configuration, just a consistent, secure workflow across Mac, Windows, and Linux.

In this guide, you’ll learn how to:

Set up Claude Code and connect it to Docker MCP Toolkit.
Configure the Atlassian MCP server for Jira integration.
Configure the GitHub MCP server to access repository history and run git commands.
Configure the Filesystem MCP server to scan and read your local codebase.
Automate tech debt tracking by converting 15 TODO comments into tracked Jira tickets.
See how Claude Code can query git history, categorize issues, and create tickets — all without leaving your development environment.

Prefer a video walkthrough? Check out our tutorial on how to add MCP servers to Claude Code with Docker MCP Toolkit.

Connecting tools unlocks powerful automation but with greater capability comes greater responsibility. If you’re going to let agents take action, you need to run them safely.

Docker Sandboxes: Run Claude Code and Other Coding Agents Unsupervised (but Safely)

As Claude Code moves from suggestions to real-world actions like installing packages and modifying files, isolation becomes critical.

Sandboxes provide disposable, isolated environments purpose-built for coding agents. Each agent runs in an isolated version of your development environment, so when it installs packages, modifies configurations, deletes files, or runs Docker containers, your host machine remains untouched.

This isolation lets you run agents like Claude Code with autonomy. Since they can’t harm your computer, let them run free. Check out our announcement on more secure, easier to use, and more powerful Docker Sandboxes.

Summary

Claude Code is powerful on its own but when used with Docker, it becomes a secure, extensible, and fully controlled AI development environment.

In this post, you learned how to:

Run Claude Code locally using Docker Model Runner with an Anthropic-compatible API endpoint, giving you full control over your data, infrastructure, and cost.
Connect Claude Code to tools using the Docker MCP Toolkit, with 300+ containerized MCP servers for services like Jira, GitHub, and local filesystems — all deployable in one click.
Run Claude Code safely in Docker Sandboxes, isolated environments that allow coding agents to operate autonomously without risking your host machine.

By combining local model execution, secure tool connectivity, and isolated runtime environments, Docker enables you to run AI coding agents like Claude Code with both autonomy and control, making them practical for real-world development workflows.

Secure Agent Execution with NanoClaw and Docker Sandboxes

Jin Kim — Fri, 13 Mar 2026 12:01:00 +0000

Agents have enormous potential to power secure, personal AI assistants that automate complex tasks and workflows. Realizing that potential, however, requires strong isolation, a codebase that teams can easily inspect and understand, and clear control boundaries they can trust.

Today, NanoClaw, a lightweight agent framework, is integrating with Docker Sandboxes to deliver secure-by-design agent execution. With this integration, every NanoClaw agent runs inside a disposable, MicroVM-based Docker Sandbox that enforces strong operating system level isolation. Combined with NanoClaw’s minimal attack surface and fully auditable open-source codebase, the stack is purpose-built to meet enterprise security standards from day one.

From Powerful Agents to Trusted Agents

The timing reflects a broader shift in the agent landscape. Agents are no longer confined to answering prompts. They are becoming operational systems.

Modern agents connect to live data sources, execute code, trigger workflows, and operate directly within collaboration platforms such as Slack, Discord, WhatsApp, and Telegram. They are evolving from conversational interfaces into active participants in real work.

That shift from prototype to production introduces two critical requirements: transparency and isolation.

First, transparency.

Organizations need agents built on code they can inspect and understand, with clear visibility into dependencies, source files, and core behavior. NanoClaw delivers exactly that. Its agent behavior is powered by just 15 core source files, with lines of code up to 100 times smaller than many alternatives. That simplicity makes it dramatically easier to evaluate risk, understand system behavior, and build with confidence.

Second, isolation.

Agents must run within restricted environments, with tightly controlled filesystems and limited host access. Through the Docker Sandbox integration, each NanoClaw agent runs inside a dedicated MicroVM that mirrors your development environment, with only your project workspace mounted in. Agents can install packages, modify configurations, and even run Docker itself, while your host machine remains untouched.

In traditional environments, enabling more permissive agent modes can introduce significant risk. Inside a Docker Sandbox, that risk is contained within an isolated MicroVM that can be discarded instantly. This makes advanced modes such as –dangerously-skip-permissions practical in production because their impact is fully confined.

The result is greater autonomy without greater exposure.

Agents no longer require constant approval prompts to move forward. They can install tools, adapt their environment, and iterate independently. Because their actions are contained within secure, disposable boundaries, they can safely explore broader solution spaces while preserving enterprise-grade safeguards.

Powerful agents are easy to prototype. Trusted agents are built with isolation by design.

Together, NanoClaw and Docker make secure-by-default the standard for agent deployment.

“Infrastructure needs to catch up to the intelligence of agents. Powerful agents require isolation,” said Mark Cavage, President and Chief Operating Officer at Docker, Inc. “Running NanoClaw inside Docker Sandboxes gives the agent a secure, disposable boundary, so it can run freely, safely.”

“Teams trust agents to take on increasingly complex and valuable work, but securing agents cannot be based on trust,” said Gavriel Cohen, CEO and co-founder of NanoCo and creator of NanoClaw. “It needs to be based on a provably secure hard boundary, scoped access to data and tools, and control over the actions agents are allowed to take. The security model should not limit what agents can accomplish. It should make it safe to let them loose. NanoClaw was built on that principle, and Docker Sandboxes provides the enterprise-grade infrastructure to enforce it.”

Get Started

Ready to try it out? Deploy NanoClaw in Docker Sandboxes today:

GitHub: github.com/qwibitai/nanoclaw
Docker Sandboxes: Learn more

Flexibility Over Lock-In: The Enterprise Shift in Agent Strategy

Yiwen Xu — Thu, 12 Mar 2026 12:50:49 +0000

Building agents is now a strategic priority for 95% of respondents in our latest State of Agentic AI research, which surveyed more than 800 developers and decision-makers worldwide. The shift is happening quickly: agent adoption has moved beyond experiments and demos into early operational maturity. But the road to enterprise-scale adoption is still complex. The foundations are forming, yet far from fully integrated, production-grade platforms that teams can confidently build on.

Security continues to surface as a top blocker to agent adoption. But it’s not the only one. Technical complexity is rising fast as well. Vendor lock-in is a big concern for the vast majority of the respondents surveyed.

So how do teams cut through the complexity and prepare for a world of multi-model, multi-tool, and multi-framework agents, while avoiding vendor lock-in in their agent workflows? In this blog, we break down the key findings from our research: what teams are actually using to power their agentic workloads, and what it takes to build a more scalable, future-ready agent architecture.

Multi-model and multi-cloud are the new normal. And complexity is rising

Our recent Agent AI study found that enterprises are embracing multi-model and multi-cloud architectures to gain greater control over performance, customization, privacy, and compliance. Multi-model is now the norm. Nearly two-thirds of organizations (61%) combine cloud-hosted and local models. And complexity doesn’t stop there: 46% report using between four and six models within their agents, while just 2% rely on a single model.

Deployment environments are just as diverse. 79% of respondents operate agents across two or more environments; 51% in public clouds, 40% on-premises, and 32% on serverless platforms.

This architectural flexibility delivers control, but it also multiplies orchestration and governance efforts. Coordinating models, tools, frameworks, and environments is consistently cited as one of the hardest parts of building agents. Nearly half of respondents (48%) identify operational complexity in managing multiple components as their biggest challenge, while 43% point to increased security exposure driven by orchestration sprawl.

The strategic shift away from vendor lock-in

As organizations double down on agent investments, concerns about supply chain fragility are rising. Seventy-six percent of global respondents report active worries about vendor lock-in.

Seventy-six percent of global respondents report active concerns about vendor lock-in

Rather than consolidating, teams are responding by diversifying. They’re distributing workloads across multiple models, tools, and cloud environments to reduce dependency and maintain leverage. Among the 61% of organizations using both cloud-hosted and locally hosted models, the primary drivers are control (64%), data privacy (60%), and compliance (54%). Cost ranks significantly lower at 41%, underscoring that flexibility and governance, not cost savings are shaping architectural decisions.

Containers power the next wave of agent adoption

Containerization is already foundational to agent development. Nearly all organizations surveyed (94%) use containers in their agent development or production workflows and the remainder plan to adopt them.

Nearly all organizations surveyed (94%) use containers in their agent development or production workflows and the remainder plan to adopt them.

As agent initiatives scale, teams are extending the same cloud-native practices that power their application pipelines such as microservices architectures, CI/CD, and container orchestration to support agent workloads. Containers are not an add-on; they are the operational backbone. In fact, 94% of teams building agents rely on them.

At the same time, early signs of orchestration standardization are emerging. Among teams building agents with Docker, 40% are using Docker Compose as their orchestration layer, a signal that familiar, container-based tooling is becoming a practical coordination layer for increasingly complex agent systems.

The agentic future won’t be monolithic

The agentic future won’t be monolithic. It’s already multi-cloud, multi-model, and multi-environment. That reality makes open standards and portable infrastructure foundational for sustaining enterprise trust and long-term flexibility.

What’s needed next isn’t reinvention, but standardization around an open, interoperable and portable infrastructure: the flexibility to work across any model, tool, and agent framework, secure-by-default runtimes, consistent orchestration and integrated policy controls. Teams that invest now in a trust layer built on container principles of isolation, portability and simplicity can move beyond point productivity gains to sustainable enterprise-wide outcomes while reducing vendor lock-in risk.

Download the full Agentic AI report for more insights and recommendations on how to scale agents for enterprise.

Join us on March 25, 2026, for a webinar where we’ll walk through the key findings and the strategies that can help you prioritize what comes next.

Learn more:

Get your copy of the latest State of Agentic AI report!
Learn more about Docker’s AI solutions
Read more about why AI agents challenge existing governance approaches and explore a new framework designed for agentic AI.

Building AI Teams: How Docker Sandboxes and Docker Agent Transform Development

Jennifer Kohl — Wed, 11 Mar 2026 13:00:00 +0000

It’s 11 PM. You’ve got a JIRA ticket open, an IDE with three unsaved files, a browser tab on Stack Overflow, and another on documentation. You’re context-switching between designing UI, writing backend APIs, fixing bugs, and running tests. You’re wearing all the hats, product manager, designer, engineer, QA specialist, and it’s exhausting.

What if instead of doing it all yourself, you could describe the goal and have a team of specialized AI agents handle it for you?

One agent breaks down requirements, another designs the interface, a third builds the backend, a fourth tests it, and a fifth fixes any issues. Each agent focuses on what it does best, working together autonomously while you sip your coffee.That’s not sci-fi, it’s what Agent + Docker Sandboxes delivers today.

What is Docker Agent?

Docker Agent is an open source tool for building teams of specialized AI agents. Instead of prompting one general-purpose model to do everything, you define agents with specific roles that collaborate to solve complex problems.

Here’s a typical dev-team configuration:

agents:
root:
model: openai/gpt-5
description: Product Manager - Leads the development team and coordinates iterations
instruction: |
Break user requirements into small iterations. Coordinate designer → frontend → QA.
- Define feature and acceptance criteria
- Ensure iterations deliver complete, testable features
- Prioritize based on value and dependencies
sub_agents: [designer, awesome_engineer, qa, fixer_engineer]
toolsets:
- type: filesystem
- type: think
- type: todo
- type: memory
path: dev_memory.db

designer:
model: openai/gpt-5
description: UI/UX Designer - Creates user interface designs and wireframes
instruction: |
Create wireframes and mockups for features. Ensure responsive, accessible designs.
- Use consistent patterns and modern principles
- Specify colors, fonts, interactions, and mobile layout
toolsets:
- type: filesystem
- type: think
- type: memory
path: dev_memory.db

qa:
model: openai
description: QA Specialist - Analyzes errors, stack traces, and code to identify bugs
instruction: |
Analyze error logs, stack traces, and code to find bugs. Explain what's wrong and why it's happening.
- Review test results, error messages, and stack traces
.......

awesome_engineer:
model: openai
description: Awesome Engineer - Implements user interfaces based on designs
instruction: |
Implement responsive, accessible UI from designs. Build backend APIs and integrate.
..........

fixer_engineer:
model: openai
description: Test Integration Engineer - Fixes test failures and integration issues
instruction: |
Fix test failures and integration issues reported by QA.
- Review bug reports from QA

The root agent acts as product manager, coordinating the team. When a user requests a feature, root delegates to designer for wireframes, then awesome_engineer for implementation, qa for testing, and fixer_engineer for bug fixes. Each agent uses its own model, has its own context, and accesses tools like filesystem, shell, memory, and MCP servers.

Agent Configuration

Each agent is defined with five key attributes:

model: The AI model to use (e.g., openai/gpt-5, anthropic/claude-sonnet-4-5). Different agents can use different models optimized for their tasks.
description: A concise summary of the agent’s role. This helps Docker Agent understand when to delegate tasks to this agent.
instruction: Detailed guidance on what the agent should do. Includes workflows, constraints, and domain-specific knowledge.
sub_agents: A list of agents this agent can delegate work to. This creates the team hierarchy.
toolsets: The tools available to the agent. Built-in options include filesystem (read/write files), shell (run commands), think (reasoning), todo (task tracking), memory (persistent storage), and mcp (external tool connections).

This configuration system gives you fine-grained control over each agent’s capabilities and how they coordinate with each other.

Why Agent Teams Matter

One agent handling complex work means constant context-switching. Split the work across focused agents instead, each handles what it’s best at. Docker Agent manages the coordination.

The benefits are clear:

Specialization: Each agent is optimized for its role (design vs. coding vs. debugging)
Parallel execution: Multiple agents can work on different aspects simultaneously
Better outcomes: Focused agents produce higher quality work in their domain
Maintainability: Clear separation of concerns makes teams easier to debug and iterate

The Problem: Running AI Agents Safely

Agent teams are powerful, but they come with a serious security concern. These agents need to:

Read and write files on your system
Execute shell commands (npm install, git commit, etc.)
Access external APIs and tools
Run potentially untrusted code

Giving AI agents full access to your development machine is risky. A misconfigured agent could delete files, leak secrets, or run malicious commands. You need isolation, agents should be powerful but contained.

Traditional virtual machines are too heavy. Chroot jails are fragile. You need something that provides:

Strong isolation from your host machine
Workspace access so agents can read your project files
Familiar experience with the same paths and tools
Easy setup without complex networking or configuration

Docker Sandboxes: The Secure Foundation

Docker Sandboxes solves this by providing isolated environments for running AI agents. As of Docker Desktop 4.60+, sandboxes run inside dedicated microVMs, providing a hard security boundary beyond traditional container isolation. When you run docker sandbox run , Docker creates an isolated microVM workspace that:

Mounts your project directory at the same absolute path (on Linux and macOS)
Preserves your Git configuration for proper commit attribution
Does not inherit environment variables from your current shell session
Gives agents full autonomy without compromising your host
Provides network isolation with configurable allow/deny lists

Docker Sandboxes now natively supports six agent types: Claude Code, Gemini, Codex, Copilot, Agent, and Kiro (all experimental). Agent can be launched directly as a sandbox agent:

# Run Agent natively in a sandbox
docker sandbox create agent ~/path/to/workspace
docker sandbox run agent ~/path/to/workspace

Or, for more control, use a detached sandbox:

# Create a sandbox
docker sandbox run -d --name my-agent-sandbox claude

# Copy agent into the sandbox
docker cp /usr/bin/agent :/usr/bin/agent

# Run your agent team
docker exec -it  bash -c "cd /path/to/workspace && agent run dev-team.yaml"

Your workspace /Users/alice/projects/myapp on the host is also /Users/alice/projects/myapp inside the microVM. Error messages, scripts with hard-coded paths, and relative imports all work as expected. But the agent is contained in its own microVM, it can’t access files outside the mounted workspace, and any damage it causes is limited to the sandbox.

Why Docker Sandboxes Matter

The combination of agents and Docker Sandboxes gives you something powerful:

Full agent autonomy: Agents can install packages, run tests, make commits, and use tools without constant human oversight
Complete safety: Even if an agent makes a mistake, it’s contained within the microVM sandbox
Hard security boundary: MicroVM isolation goes beyond containers, each sandbox runs in its own virtual machine
Network control: Allow/deny lists let you restrict which external services agents can access
Familiar experience: Same paths, same tools, same workflow as working directly on your machine
Workspace persistence: Changes sync between host and microVM, so your work is always available

Here’s how the workflow looks in practice:

User requests a feature to the root agent: “Create a bank app with Gradio”
Root creates a todo list and delegates to the designer
Designer generates wireframes and UI specifications
Awesome_engineer implements the code, running pip install gradio and python app/main.py
QA runs tests, finds bugs, and reports them
Fixer_engineer resolves the issues
Root confirms all tests pass and marks the feature complete

All of this happens autonomously inside a sandboxed environment. The agents can install dependencies, modify files, and execute commands, but they’re isolated from your host machine.

Try It Yourself

Let’s walk through setting up a simple agent team in a Docker Sandbox.

Prerequisites

Docker Desktop 4.60+ with sandbox support (microVM-based isolation)
agent (included in Docker Desktop 4.49+)
API key for your model provider (Anthropic, OpenAI, or Google)

Step 1: Create Your Agent Team

Save this configuration as dev-team.yaml:

models:
 openai:
   provider: openai
   model: gpt-5

agents:
 root:
   model: openai
   description: Product Manager - Leads the development team
   instruction: |
     Break user requirements into small iterations. Coordinate designer → frontend → QA.
   sub_agents: [designer, awesome_engineer, qa]
   toolsets:
     - type: filesystem
     - type: think
     - type: todo

 designer:
   model: openai
   description: UI/UX Designer - Creates designs and wireframes
   instruction: |
     Create wireframes and mockups for features. Ensure responsive designs.
   toolsets:
     - type: filesystem
     - type: think

 awesome_engineer:
   model: openai
   description: Developer - Implements features
   instruction: |
     Build features based on designs. Write clean, tested code.
   toolsets:
     - type: filesystem
     - type: shell
     - type: think

 qa:
   model: openai
   description: QA Specialist - Tests and identifies bugs
   instruction: |
     Test features and identify bugs. Report issues to fixer.
   toolsets:
     - type: filesystem
     - type: think

Step 2: Create a Docker Sandbox

The simplest approach is to use agent as a native sandbox agent:

# Run agent directly in a sandbox (experimental)
docker sandbox run agent ~/path/to/your/workspace

Alternatively, use a detached Claude sandbox for more control:

# Start a detached sandbox
docker sandbox run -d --name my-dev-sandbox claude

# Copy agent into the sandbox
which agent  # Find the path on your host
docker cp $(which agent) $(docker sandbox ls --filter name=my-dev-sandbox -q):/usr/bin/agent

Step 3: Set Environment Variables

# Run agent with your API key (passed inline since export doesn't persist across exec calls)
docker exec -it -e OPENAI_API_KEY=your_key_here my-dev-sandbox bash

Step 4: Run Your Agent Team

# Mount your workspace and run agent
docker exec -it my-dev-sandbox bash -c "cd /path/to/your/workspace && agent run dev-team.yaml"

Now you can describe what you want to build, and your agent team will handle the rest:

User: Create a bank application using Python. The bank app should have basic functionality like account savings, show balance, withdraw, add money, etc. Build the UI using Gradio. Create a directory called app, and inside of it, create all of the files needed by the project

Agent (root): I'll break this down into iterations and coordinate with the team...

Watch as the designer creates wireframes, the engineer builds the Gradio app, and QA tests it, all autonomously in a secure sandbox.

Final result from a one shot prompt

Step 5: Clean Up

When you’re done:

# Remove the sandbox
docker sandbox rm my-dev-sandbox

Docker enforces one sandbox per workspace. Running docker sandbox run in the same directory reuses the existing container. To change configuration, remove and recreate the sandbox.

Current Limitations

Docker Sandboxes and Docker Agent are evolving rapidly. Here are a few things to know:

Docker Sandboxes now supports six agent types natively: Claude Code, Gemini, Codex, Copilot, agent, and Kiro. All are experimental and breaking changes may occur between Docker Desktop versions.
Custom Shell that doesn’t include a pre-installed agent binary. Instead, it provides a clean environment where you can install and configure any agent or tool
MicroVM sandboxes require macOS or Windows. Linux users can use legacy container-based sandboxes with Docker Desktop 4.57+
API keys may still need manual configuration depending on the agent type
Sandbox templates are optimized for certain workflows; custom setups may require additional configuration

Why This Matters Now

AI agents are becoming more capable, but they need infrastructure to run safely and effectively. The combination of agent and Docker Sandboxes addresses this by:

Feature	Traditional Approach	With agent + Docker Sandboxes
Autonomy	Limited – requires constant oversight	High – agents work independently
Security	Risky – agents have host access	Isolated – agents run in microVMs
Specialization	One model does everything	Multiple agents with focused roles
Reproducibility	Inconsistent across machines	MicroVM-isolated, version-controlled
Scalability	Manual coordination	Automated team orchestration

This isn’t just about convenience, it’s about enabling AI agents to do real work in production environments, with the safety guarantees that developers expect.

What’s Next

Explore the Docker Agent documentation to build your own agent teams
Check out Docker Sandboxes for advanced configurations
Browse example agent configurations in the agent repository
Integrate agent with your editor or use agents as tools in MCP clients

Conclusion

We’re moving from “prompting AI to write code” to “orchestrating AI teams to build software.” agent gives you the team structure; Docker Sandboxes provides the secure foundation.

The days of wearing every hat as a solo developer are numbered. With specialized AI agents working in isolated containers, you can focus on what matters, designing great software, while your AI team handles the implementation, testing, and iteration.

Try it out. Build your own agent team. Run it in a Docker Sandbox. See what happens when you have a development team at your fingertips, ready to ship features while you grab lunch.

What’s Holding Back AI Agents? It’s Still Security

Yiwen Xu — Tue, 10 Mar 2026 12:59:28 +0000

It’s hard to find a team today that isn’t talking about agents. For most organizations, this isn’t a “someday” project anymore. Building agents is a strategic priority for 95% of respondents that we surveyed across the globe with 800+ developers and decision makers in our latest State of Agentic AI research. The shift is happening fast: agent adoption has moved beyond experiments and demos into something closer to early operational maturity. 60% of organizations already report having AI agents in production, though a third of those remain in early stages.

Agent adoption today is driven by a pragmatic focus on productivity, efficiency, and operational transformation, not revenue growth or cost reduction. Early adoption is concentrated in internal, productivity-focused use cases, especially across software, infrastructure, and operations. The feedback loops are fast, and the risks are easier to control.

So what’s holding back agent scaling? Friction shows up and nearly all roads lead to the same place: AI agent security.

AI agent security isn’t one issue it’s the constraint

When teams talk about what’s holding them back, AI agent security rises to the top. In the same survey, 40% of respondents cite security as their top blocker when building agents. The reason it hits so hard is that it’s not confined to a single layer of the stack. It shows up everywhere, and it compounds as deployments grow.

For starters, when it comes to infrastructure, as organizations expand agent deployments, teams emphasize the need for secure sandboxing and runtime isolation, even for internal agents.

At the operations layer, complexity becomes a security problem. Once you have more tools, more integrations, and more orchestration logic, it gets harder to see what’s happening end-to-end and harder to control it. Our latest research data reflects that sprawl: over a third of respondents report challenges coordinating multiple tools, and a comparable share say integrations introduce security or compliance risk. That’s a classic pattern: operational complexity creates blind spots, and blind spots become exposure.

45% of organizations say the biggest challenge is ensuring tools are secure, trusted, and enterprise-ready.

And at the governance layer, enterprises want something simple: consistency. They want guardrails, policy enforcement, and auditability that work across teams and workflows. But current tooling isn’t meeting that bar yet. In fact, 45% of organizations say the biggest challenge is ensuring tools are secure, trusted, and enterprise-ready. That’s not a minor complaint: it’s the difference between “we can try this” and “we can scale this.”

MCP is popular but not ready for enterprise

Many teams are adopting Model Context Protocol (MCP) because it gives agents a standardized way to connect to tools, data, and external systems, making agents more useful and customized. Among respondents further along in their agent journey, 85% say they’re familiar with MCP and two-thirds say they actively use it across personal and professional projects.

Research data suggests that most teams are operating in what could be described as “leap-of-faith mode” when it comes to MCP, adopting the protocol without security guarantees and operational controls they would demand from mature enterprise infrastructure.

But the security story hasn’t caught up yet. Teams adopt MCP because it works, but they do so without the security guarantees and operational controls they would expect from mature enterprise infrastructure. For teams earlier in their agentic journey: 46% of them identify security and compliance as the top challenge with MCP.

Organizations are increasingly watching for threats like prompt injection and tool poisoning, along with the more foundational issues of access control, credentials, and authentication. The immaturity and security challenges of current MCP tooling make for a fragile foundation at this stage of agentic adoption.

Conclusion and recommendations

Ai agent security is what sets the speed limit for agentic AI in the enterprise. Organizations aren’t lacking interest, they’re lacking confidence that today’s tooling is enterprise-ready, that access controls can be enforced reliably, and that agents can be kept safely isolated from sensitive systems.

The path forward is clear. Unlocking agents’ full potential will require new platforms built for enterprise scale, with secure-by-default foundations, strong governance, and policy enforcement that’s integrated, not bolted on.

Download the full Agentic AI report for more insights and recommendations on how to scale agents for enterprise.

Join us on March 25, 2026, for a webinar where we’ll walk through the key findings and the strategies that can help you prioritize what comes next.

Learn more:

Get your copy of the latest State of Agentic AI report!
Learn more about Docker’s AI solutions
Read more about why AI agents challenge existing governance approaches and explore a new framework designed for agentic AI.

Celebrating Women in AI: 3 Questions with Cecilia Liu on Leading Docker’s MCP Strategy

Yiwen Xu — Fri, 06 Mar 2026 12:59:30 +0000

To celebrate International Women’s Day, we sat down with Cecilia Liu, Senior Product Manager at Docker, for three questions about the vision and strategy behind Docker’s MCP solutions. From shaping product direction to driving AI innovation, Cecilia plays a key role in defining how Docker enables secure, scalable AI tooling.

Cecilia leads product management for Docker’s MCP Catalog and Toolkit, our solution for running MCP servers securely and at scale through containerization. She drives Docker’s AI strategy across both enterprise and developer ecosystems, helping organizations deploy MCP infrastructure with confidence while empowering individual developers to seamlessly discover, integrate, and use MCP in their workflows. With a technical background in AI frameworks and an MBA from NYU Stern, Cecilia bridges the worlds of AI infrastructure and developer tools, turning complex challenges into practical, developer-first solutions.

What products are you responsible for?

I own Docker’s MCP solution. At its core, it’s about solving the problems that anyone working with MCP runs into: how do you find the right MCP servers, how do you actually use them without a steep learning curve, and how do you deploy and manage them reliably across a team or organization.

How does Docker’s MCP solution benefit developers and enterprise customers?

Dev productivity is where my heart is. I want to build something that meaningfully helps developers at every stage of their cycle — and that’s exactly how I think about Docker’s MCP solution.

For end-user developers and vibe coders, the goal is simple: you shouldn’t need to understand the underlying infrastructure to get value from MCP. As long as you’re working with AI, we make it easy to discover, configure, and start using MCP servers without any of the usual setup headaches. One thing I kept hearing in user feedback was that people couldn’t even tell if their setup was actually working. That pushed us to ship in-product setup instructions that walk you through not just configuration, but how to verify everything is running correctly. It sounds small, but it made a real difference.

For developers building MCP servers and integrating them into agents, I’m focused on giving them the right creation and testing tools so they can ship faster and with more confidence. That’s a big part of where we’re headed.

And for security and enterprise admins, we’re solving real deployment pain, making it faster and cheaper to roll out and manage MCP across an entire organization. Custom catalogs, role-based access controls, audit logging, policy enforcement. The goal is to give teams the visibility and control they need to adopt AI tooling confidently at scale.

Customers love us for all of the above, and there’s one more thing that ties it together: the security that comes built-in with Docker. That trust doesn’t happen overnight, and it’s something we take seriously across everything we ship.

What are you excited about when it comes to the future of MCP?

What excites me most is honestly the pace of change itself. The AI landscape is shifting constantly, and with every new tool that makes AI more powerful, there’s a whole new set of developers who need a way to actually use it productively. That’s a massive opportunity.

MCP is where that’s happening right now, and the adoption we’re seeing tells me the need is real. But what gets me out of bed is knowing the problems we’re solving: discoverability, usability, deployment. They are all going to matter just as much for whatever comes next. We’re not just building for today’s tools. We’re building the foundation that developers will reach for every time something new emerges.

Cecilia is speaking about scaling MCP for enterprises at the MCP Dev Summit in NYC on 3rd of April, 2026. If you’re attending, be sure to stop by Docker’s booth (D/P9).

Learn more

Explore Docker’s MCP Catalog and Toolkit on our website.
Dive into our documentation to get started quickly.
Ready to go hands-on? Open Docker Desktop or the CLI and start using MCP to streamline and automate your development workflows.

Announcing Docker Hardened System Packages

Vishrut Iyengar — Tue, 03 Mar 2026 20:30:00 +0000

Your Package Manager, Now with a Security Upgrade

Last December, we made Docker Hardened Images (DHI) free because we believe secure, minimal, production-ready images should be the default. Every developer deserves strong security at no cost. It should not be complicated or locked behind a paywall.

From the start, flexibility mattered just as much as security. Unlike opaque, proprietary hardened alternatives, DHI is built on trusted open source foundations like Alpine and Debian. That gives teams true multi-distro flexibility without forcing change. If you run Alpine, stay on Alpine. If Debian is your standard, keep it. DHI strengthens what you already use. It does not require you to replace it.

Today, we are extending that philosophy beyond images.

With Docker Hardened System Packages, we’re driving security deeper into the stack. Every package is built on the same secure supply chain foundation: source-built and patched by Docker, cryptographically attested, and backed by an SLA.

The best part? Multi-distro support by design.

The result is consistent, end-to-end hardening across environments with the production-grade reliability teams expect.

Since introducing DHI Community (our OSS tier), interest has surged. The DHI catalog has expanded from more than 1,000 to over 2,000 hardened container images. Its openness and ability to meet teams where they are have accelerated adoption across the ecosystem. Companies of all sizes, along with a growing number of open source projects, are making DHI their standard for secure containers.

Just consider this short selection of examples:

n8n.io has moved its production infrastructure to DHI, they share why and how in this recent webinar
Medplum, an open-source electronic health records platform (managing data of 20+ million patients) has now standardized to DHI
Adobe uses DHI because of great alignment with its security posture and developer tooling compatibility
Attentive co-authored this e-book with Docker on helping others move from POC to production with DHI

Docker Hardened System Packages: Going deeper into the container

From day one, Docker has built and secured the most critical operating system packages to deliver on our CVE remediation commitments. That’s how we continuously maintain near-zero CVEs in DHI images. At the same time, we recognize that many teams extend our minimal base images with additional upstream packages to meet their specific requirements. To support that reality, we are expanding our catalog with more than 8,000 hardened Alpine packages, with Debian coverage coming soon.

This expansion gives teams greater flexibility without weakening their security posture. You can start with a DHI base image and tailor it to your needs while maintaining the same hardened supply chain guarantees. There is no need to switch distros to get continuous patching, verified builds through a SLSA Build Level 3 pipeline, and enterprise-grade assurances. Your teams can continue working with the Alpine and Debian environments they know, now backed by Docker’s secure build system from base image to system package.

Why this matters for your security posture:

Complete provenance chain. Every package is built from source by Docker, attested, and cryptographically signed. From base image to final container, your provenance stays intact.

Faster vulnerability remediation. When a vulnerability is identified, we patch it at the package level and publish it to the catalog. Not image by image. That means fixes move faster and remediation scales across your entire container fleet.

Extending the near-zero CVE guarantee. DHI images maintain near-zero. Hardened System Packages extend that guarantee more broadly across the software ecosystem, covering packages you add during customization.

Use hardened packages with your containers. DHI Enterprise customers get access to the secure packages repository, making it possible to use Hardened System Packages beyond DHI images. Integrate them into your own pipelines and across Alpine and Debian workloads throughout your environment.

The work we’re doing on our users’ behalf: Maintaining thousands of packages is continuous work. We monitor upstream projects, backport patches, test compatibility, rebuild when dependencies change, and generate attestations for every release. Alpine alone accounts for more than 8,000 packages today, soon approaching 10,000, with Debian next.

Making enterprise-grade security even more accessible

We’re also simplifying how teams access DHI. The full catalog of thousands of open-source images under Apache 2.0 now has a new name: DHI Community. There are no licensing changes, this is just a name change, so all of that free goodness has an easy name to refer to.

For teams that need SLA-backed CVE remediation and customization capabilities at a more accessible price point, we’re announcing a new pricing tier today, DHI Select. This new tier brings enterprise-grade security at a price of $5,000 per repo.

For organizations with more demanding requirements, including unlimited customizations, access to the Hardened System Packages repo, and extended lifecycle coverage for up to five years after upstream EOL, DHI Enterprise and the DHI Extended Lifecycle Support add-on remain available.

More options means more teams can adopt the right level of security for where they are today.

Build with the standard that’s redefining container security

Docker’s momentum in securing the software supply chain is accelerating. We’re bringing security to more layers of the stack, making it easier for teams to build securely by default, for open source-based containers as well as your company’s internally-developed software. We’re also pushing toward a one-day (or shorter) timeline for critical CVE fixes. Each step builds on the last, moving us closer to end-to-end supply chain security for all of your critical applications.

Get started:

Join the n8n webinar to see how they’re running production workloads on DHI
Start your free trial and get access to the full DHI catalog, now with Docker Hardened System Packages

Docker Model Runner Brings vLLM to macOS with Apple Silicon

Yiwen Xu — Thu, 26 Feb 2026 14:42:57 +0000

vLLM has quickly become the go-to inference engine for developers who need high-throughput LLM serving. We brought vLLM to Docker Model Runner for NVIDIA GPUs on Linux, then extended it to Windows via WSL2.

That changes today. Docker Model Runner now supports vllm-metal, a new backend that brings vLLM inference to macOS using Apple Silicon’s Metal GPU. If you have a Mac with an M-series chip, you can now run MLX models through vLLM with the same OpenAI-compatible API, same Anthropic-compatible API for tools like Claude Code, and all in one, the same Docker workflow.

What is vllm-metal?

vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed in collaboration between Docker and the vLLM project, it unifies MLX, the Apple’s machine learning framework, and PyTorch under a single compute pathway, plugging directly into vLLM’s existing engine, scheduler, and OpenAI-compatible API server.

The architecture is layered: vLLM’s core (engine, scheduler, tokenizer, API) stays unchanged on top. A plugin layer consisting of MetalPlatform, MetalWorker, and MetalModelRunner handles the Apple Silicon specifics. Underneath, MLX drives the actual inference while PyTorch handles model loading and weight conversion. The whole stack runs on Metal, Apple’s GPU framework.

+-------------------------------------------------------------+
|                          vLLM Core                          |
|        Engine | Scheduler | API | Tokenizers                |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                   vllm_metal Plugin Layer                   |
|   +-----------+  +-----------+  +------------------------+  |
|   | Platform  |  | Worker    |  | ModelRunner            |  |
|   +-----------+  +-----------+  +------------------------+  |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                   Unified Compute Backend                   |
|   +------------------+    +----------------------------+    |
|   | MLX (Primary)    |    | PyTorch (Interop)          |    |
|   | - SDPA           |    | - HF Loading               |    |
|   | - RMSNorm        |    | - Weight Conversion        |    |
|   | - RoPE           |    | - Tensor Bridge            |    |
|   | - Cache Ops      |    |                            |    |
|   +------------------+    +----------------------------+    |
+-------------------------------------------------------------+
                             |
                             v
+-------------------------------------------------------------+
|                       Metal GPU Layer                       |
|           Apple Silicon Unified Memory Architecture         |
+-------------------------------------------------------------+

Figure 1: High-level architecture diagram of vllm-metal. Credit: vllm-metal

What makes this particularly effective on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations. Combined with paged attention for efficient KV cache management and Grouped-Query Attention support, this means you can serve longer sequences with less memory waste.

vllm-metal runs MLX models published by the mlx-community on Hugging Face. These models are built specifically for the MLX framework and take full advantage of Metal GPU acceleration. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed, falling back to the built-in MLX backend otherwise.

How vllm-metal works

vllm-metal runs natively on the host. This is necessary because Metal GPU access requires direct hardware access and there is no GPU passthrough for Metal in containers.

When you install the backend, Docker Model Runner:

Pulls a Docker image from Hub that contains a self-contained Python 3.12 environment with vllm-metal and all dependencies pre-packaged.
Extracts it to `~/.docker/model-runner/vllm-metal/`.
Verifies the installation by importing the `vllm_metal` module.

When a request comes in for a compatible model, the Docker Model Runner’s scheduler starts a vllm-metal server process that communicates over TCP, serving the standard OpenAI API. The model is loaded from Docker’s shared model store, which contains all the models you pull with `docker model pull`.

Which models work with vllm-metal?

vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some examples you can try:

vLLM everywhere with Docker Model Runner

With vllm-metal, Docker Model Runner now supports vLLM across the three major platforms:

Platform	Backend	GPU
Linux	vllm	NVIDIA (CUDA)
Windows (WSL2)	vllm	NVIDIA (CUDA)
macOS	vllm-metal	Apple Silicon (Metal)

The same docker model commands work regardless of platform. Pull a model, run it. Docker Model Runner picks the right backend for your platform.

Get started

Update to Docker Desktop 4.62 or later for Mac, and install the backend:

docker model install-runner --backend vllm

Check out the Docker Model Runner documentation to learn more. For contributions, feedback, and bug reports, visit the docker/model-runner repository on GitHub.

Giving Back: vllm-metal is Now Open Source

At Docker, we believe that the best way to accelerate AI development is to build in the open. That is why we are proud to announce that Docker has contributed the vllm-metal project to the vLLM community. Originally developed by Docker engineers to power Model Runner on macOS, this project now lives under thevLLM GitHub organization. This ensures that every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project also has had significant contributions by Lik Xun Yuan, Chao-Ju Chen and Ranran Haoran Zhang.

The $599 AI Development Rig

For a long time, high-throughput vLLM development was gated behind a significant GPU cost. To get started, you typically need a dedicated Linux box with an RTX 4090 ($1,700+) or enterprise-grade A100/H100 cards ($10,000+).

vllm-metal changes the math

Now, a base $599 Mac Mini with an M4 chip becomes a viable vLLM development environment. Because Apple Silicon uses Unified Memory, that 16GB (or upgraded 32GB/64GB) of RAM is directly accessible by the GPU. This allows you to:

Develop & Test Locally: Build your vLLM-based applications on the same machine you use for coding.
Production-Mirroring: Use the exact same OpenAI-compatible API on your Mac Mini as you would on an H100 cluster in production.
Energy Efficiency: Run inference at a fraction of the power consumption (and heat) of a discrete GPU rig.

How does vllm-metal compare to llama.cpp?

We benchmarked both backends using Llama 3.2 1B Instruct with comparable 4-bit quantization, served through Docker Model Runner on Apple Silicon.

	llama.cpp	vLLM-Metal
Model	unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_0	mlx-community/llama-3.2-1b-instruct-4bit
Format	GGUF (Q4_0)	Safetensors (MLX 4-bit)

Throughput (tokens/sec, wall-clock)

max_tokens	llama.cpp	vLLM-Metal	speedup
128	333.3	251.5	1.3x
512	345.1	279.0	1.3x
1024	338.5	275.4	1.2x
2048	339.1	279.5	1.2x

Each configuration was run 3 times across 3 different prompts (9 total requests per data point).

Throughput is measured as completion_tokens / wall_clock_time, applied consistently to both backends.

Key observations:

llama.cpp is consistently ~1.2x faster than vLLM-Metal across all output lengths.
llama.cpp throughput is remarkably stable (~333-345 tok/s regardless of max_tokens), while vLLM-Metal shows more variance between individual runs (134-343 tok/s).
Both backends scale well. Neither backend shows significant degradation as output length increases.
Quantization methods differ (GGUF Q4_0 vs MLX 4-bit), so this benchmarks the full stack, engine + quantization, rather than the engine alone.

The benchmark script used for these results is available as a GitHub Gist.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. To get involved:

Star the repository: Show your support by starring the Docker Model Runner repo.
Contribute your ideas: Create an issue or submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends and colleagues who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn More

Read the companion post: OpenCode with Docker Model Runner for Private AI Coding
Check out the Docker Model Runner General Availability announcement
Visit our Model Runner GitHub repo
Get started with a simple hello GenAI application

Open WebUI + Docker Model Runner: Self-Hosted Models, Zero Configuration

Yiwen Xu — Wed, 25 Feb 2026 14:37:33 +0000

We’re excited to share a seamless new integration between Docker Model Runner (DMR) and Open WebUI, bringing together two open source projects to make working with self-hosted models easier than ever.

With this update, Open WebUI automatically detects and connects to Docker Model Runner running at localhost:12434. If Docker Model Runner is enabled, Open WebUI uses it out of the box, no additional configuration required.

The result: a fully Docker-managed, self-hosted model experience running in minutes.

Note for Docker Desktop users:
If you are running Docker Model Runner via Docker Desktop, make sure TCP access is enabled. Open WebUI connects to Docker Model Runner over HTTP, which requires the TCP port to be exposed:

docker desktop enable model-runner --tcp

Better Together: Docker Model Runner and Open WebUI

Docker Model Runner and Open WebUI come from the same open source mindset. They’re built for developers who want control over where their models run and how their systems are put together, whether that’s on a laptop for quick experimentation or on a dedicated GPU host with more horsepower behind it.

Docker Model Runner focuses on the runtime layer: a Docker-native way to run and manage self-hosted models using the tooling developers already rely on. Open WebUI focuses on the experience: a clean, extensible interface that makes those models accessible and useful.

Now, the two connect automatically.

No manual endpoint configuration. No extra flags.

That’s the kind of integration open source does best, separate projects evolving independently, but designed well enough to fit together naturally.

Zero-Config Setup

If Docker Model Runner is enabled, getting started with Open WebUI is as simple as:

docker run -p 3000:8080 openwebui/open-webui

That’s it.

Open WebUI will automatically connect to Docker Model Runner and begin using your self-hosted models, no environment variables, no manual endpoint configuration, no extra flags.

Visit: http://localhost:3000 and create your account:

And you’re ready to interact with your models through a modern web interface:

Open by design

One of the nice things about this integration is that it didn’t require special coordination or proprietary hooks. Docker Model Runner and Open WebUI are both open source projects with clear boundaries and well-defined interfaces. They were built independently, and they still fit together cleanly.

Docker Model Runner focuses on running and managing models in a way that feels natural to anyone already using Docker.

Open WebUI focuses on making those models usable. It provides the interface layer, conversation management, and extensibility you’d expect from a modern web UI.

Because both projects are open, there’s no hidden contract between them. You can see how the connection works. You can modify it if you need to. You can deploy the pieces separately or together. The integration isn’t a black box, it’s just software speaking a clear interface.

Works with Your Setup

One of the practical benefits of this approach is flexibility.

Docker Model Runner doesn’t dictate where your models run. They might live on your laptop during development, on a more powerful remote machine, or inside a controlled internal environment. As long as Docker Model Runner is reachable, Open WebUI can connect to it.

That separation between runtime and interface is intentional. The UI doesn’t need to know how the model is provisioned. The runtime doesn’t need to know how the UI is presented. Each layer does its job.

With this integration, that boundary becomes almost invisible. Start the container, open your browser, and everything lines up.

You decide where the models run. Open WebUI simply meets them there.

Summary

Open WebUI and Docker Model Runner make self-hosted AI simple, flexible and fully under your control. Docker powers the runtime. Open WebUI delivers a modern interface on top.

With automatic detection and zero configuration, you can go from enabling Docker Model Runner to interact with your models in minutes.

Both projects are open source and built with clear boundaries, so you can run models wherever you choose and deploy the pieces together or separately. We can’t wait to see what you build next!

How You Can Get Involved

The strength of Docker Model Runner lies in its community and there’s always room to grow. We need your help to make this project the best it can be. To get involved, you can:

Star the repository: Show your support and help us gain visibility by starring the Docker Model Runner repo.
Contribute your ideas: Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We’re excited to see what ideas you have!
Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn more

Check out the Docker Model Runner General Availability announcement
Visit our Model Runner GitHub repo! Docker Model Runner is open-source, and we welcome collaboration and contributions from the community!
Get started with Docker Model Runner with a simple hello GenAI application