How Companies Are Using Generative AI Today: Real Use Cases and Adoption Stages (Part 2)
https://kubetools.io/how-companies-are-using-generative-ai-today-real-use-cases-and-adoption-stages-part-2/
Fri, 06 Mar 2026

Generative AI is no longer limited to individual productivity tools. Many companies now use it to improve internal operations, enhance customer experiences, and reduce costs. While early adoption focused on experimentation, Generative AI is increasingly becoming part of everyday business workflows.

This article is written for beginners, non-technical professionals, and early-stage teams who want a clear, practical understanding of how companies use Generative AI today. The focus is high level and business-oriented, without technical detail.


Table of Contents

  1. Introduction
  2. Problems Companies Faced Before Generative AI
  3. Why Traditional Automation Was Not Enough
  4. Individual Use vs Company Use
  5. Internal Tools vs Customer-Facing Products
  6. Adoption Stages of Generative AI in Companies
  7. Enterprise-Scale Adoption
  8. Prompt-to-Output Example in a Company Context
  9. What Actually Changes Inside a Company When Generative AI Is Introduced
  10. Conclusion
  11. Key Takeaways
  12. References

Introduction

As companies grow, operational complexity increases. More customers generate more questions. More employees produce more content. More data requires more interpretation.

Before Generative AI, scaling cognitive work meant hiring more people or increasing manual effort.

Today, organizations use Generative AI to automate parts of that cognitive workload. Instead of focusing only on individual productivity, businesses are integrating AI into workflows to improve efficiency, reduce costs, and enhance customer experiences.

This article explains the business problems that led to Generative AI adoption, how companies apply it in practice, and how adoption typically evolves from experimentation to enterprise-scale integration.

Problems Companies Faced Before Generative AI

Before Generative AI, many business tasks required significant manual effort.

Common challenges included:

  • time-consuming content creation
  • high costs for repetitive cognitive work
  • limited personalization at scale
  • slow customer response times
  • reliance on specialized roles for routine tasks

Traditional automation helped with structured processes, but struggled with open-ended or language-based work.

Why Traditional Automation Was Not Enough

Traditional automation helped businesses improve efficiency in structured, rule-based processes. For example, systems could automatically route tickets, calculate totals, or trigger predefined workflows.

However, traditional automation struggled with open-ended, language-based tasks such as:

  • Writing and summarizing content
  • Interpreting customer feedback
  • Responding to varied user questions
  • Extracting meaning from unstructured text

These tasks require flexibility and contextual understanding. Rule-based systems depend on predefined logic and cannot adapt easily to unpredictable input.

As businesses scaled, this limitation became more visible. Companies needed a way to automate cognitive work — not just structured processes.

Generative AI emerged to address that gap.

Individual Use vs Company Use

Individuals and companies use Generative AI in fundamentally different ways.

Individual Use            | Company Use
Personal productivity     | Operational efficiency
Informal and ad hoc       | Structured and repeatable
Low risk                  | Higher responsibility
No system integration     | Integrated into workflows
Output affects one user   | Output affects teams and customers

An individual might use GenAI to draft an email or learn a topic. Companies focus on consistency, scale, and outcomes such as reduced costs or faster response times.


Internal Tools vs Customer-Facing Products

To solve these operational challenges, companies apply Generative AI in two main areas: internal workflows and customer-facing products.

Internal Tools

Many organizations start with internal use cases because they are lower risk and easier to test.

Common examples include:

  • drafting internal documents and reports
  • summarizing meetings or long documents
  • assisting developers with code suggestions
  • generating marketing ideas or content outlines
  • helping customer support agents draft responses

Example:
A customer support team uses Generative AI to suggest draft replies to customer emails. Human agents review and edit the response before sending it.

How Generative AI Integrates Into Business Workflows

In practice, Generative AI does not operate in isolation. It is embedded within existing business systems.

A simplified workflow may look like this:

Customer Query → Support Platform → AI Retrieval → Draft Response → Human Review → Send

In this flow:

  • The AI retrieves relevant information from internal knowledge bases or documentation.
  • It generates a structured draft response.
  • A human agent reviews and approves the output.
  • The final message is delivered through existing systems.

This integration ensures that Generative AI enhances workflows rather than replacing them.


Customer-Facing Products

As confidence grows, companies introduce Generative AI into products and services used directly by customers.

Examples include:

  • chatbots and virtual assistants
  • AI-powered help centers or search tools
  • personalized recommendations
  • automated content generation features

Customer-facing use cases require more control and testing because outputs directly affect user trust and brand reputation.


Adoption Stages of Generative AI in Companies

Most organizations adopt Generative AI gradually rather than all at once.

Experimentation

Teams explore Generative AI through pilots, trials, or limited internal access. Employees may test external tools or early prototypes.

The goal at this stage is learning and validation, not optimization or scale.


Internal Production

Once value is demonstrated, companies integrate Generative AI into internal workflows. Access becomes more structured and usage aligns with specific tasks.

This stage focuses on efficiency, consistency, and reducing repetitive work.


Enterprise-Scale Adoption

At the enterprise level, Generative AI becomes part of core systems and processes rather than a standalone tool.

Organizations implement:

  • Role-based access controls to protect sensitive data
  • Secure connections to internal databases and knowledge systems
  • Logging and monitoring of prompts and outputs
  • Model version control to manage updates
  • Cost tracking and usage analytics

Governance becomes critical because AI outputs directly affect customers, decisions, and brand reputation.

At this stage, Generative AI is treated as managed infrastructure — not an experiment.


What Generative AI Replaces or Augments

Generative AI rarely replaces entire roles. Instead, it augments existing work.

It commonly:

  • replaces repetitive drafting and summarization
  • accelerates research and analysis
  • supports employees rather than removing them
  • enables scale without linear increases in cost

By handling routine cognitive tasks, Generative AI allows teams to focus on higher-value work.


Prompt-to-Output Example in a Company Context

Simple Example

Prompt:

Summarize this customer feedback report and highlight the top three recurring issues.

Output:

A concise summary identifying three recurring customer issues, written in clear, actionable language for internal teams.

In production environments, this process often includes retrieving relevant internal data before generating the output, ensuring responses remain aligned with company policies and real-time information.

This example shows how companies use prompts to generate structured outputs that support decision-making rather than replace human judgment.


What Actually Changes Inside a Company When Generative AI Is Introduced

Introducing Generative AI into a company does not only change tools. It changes workflows, responsibilities, and risk management practices.

As adoption moves beyond experimentation, organizations begin to adjust how work is structured and governed.

1. New Workflows

Generative AI shifts tasks from fully manual execution to human-AI collaboration.

Instead of:

Employee → Task → Output

Workflows often become:

Employee → AI Draft → Human Review → Final Output

This changes how time is allocated. Employees spend less time creating first drafts and more time reviewing, refining, and validating outputs.

Teams also redesign processes to integrate AI into existing systems such as CRM platforms, ticketing systems, content management tools, or internal dashboards.

AI becomes a step in the workflow — not a separate tool.


2. New Risks

With integration comes new types of risk.

Organizations must consider:

  • Data privacy and protection
  • Accuracy and hallucination risks
  • Bias in generated outputs
  • Compliance and regulatory exposure
  • Brand reputation impact

When AI outputs affect customers, decisions, or financial outcomes, errors carry real consequences.

This forces companies to introduce monitoring, logging, and validation layers that were not necessary during experimentation.

Generative AI becomes a managed system rather than a casual productivity tool.


3. New Ownership and Responsibilities

As usage scales, questions emerge:

Who owns the AI system?

In many companies, responsibility is shared across multiple teams:

  • Legal teams review compliance and policy implications
  • Security teams manage data access and protection
  • Platform or engineering teams handle infrastructure and integration
  • Business teams define use cases and evaluate outcomes

Generative AI introduces cross-functional coordination that did not previously exist.

It becomes part of organizational structure — not just technology adoption.


Why This Matters

These internal changes explain why moving from a demo to production is difficult.

The challenge is not only model performance. It is workflow redesign, governance, and operational ownership.

Understanding these internal shifts helps explain why successful adoption requires more than experimentation — it requires infrastructure, oversight, and clear accountability.

This naturally leads to the next discussion:
What separates a compelling demo from a reliable production system?


Conclusion

  • Generative AI represents more than a new productivity tool. It reshapes how organizations structure work, manage risk, and allocate responsibility.
  • What begins as experimentation often evolves into workflow redesign, governance frameworks, and cross-functional ownership. As adoption scales, Generative AI becomes embedded in core systems rather than existing as an isolated tool.
  • The companies that succeed are not those that experiment the most — but those that integrate thoughtfully, manage risk deliberately, and treat AI as a long-term operational capability.
  • When implemented strategically, Generative AI enhances efficiency while maintaining human oversight, accountability, and trust.

Key Takeaways

  • Companies use Generative AI differently than individuals.
  • Adoption usually starts with internal tools before reaching customers.
  • Generative AI augments work rather than replacing entire roles.
  • Business value comes from integration into workflows, not experimentation alone.
  • Adoption is gradual and evolves with organizational needs.

What’s Next?

The next article explores why many Generative AI projects fail after the demo stage, and what separates successful adoption from stalled experiments.


References

  • McKinsey & Company. The State of AI in 2023.
    https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
  • Harvard Business Review. How Generative AI Is Changing Work.
    https://hbr.org/insight-center/how-generative-ai-is-changing-work
What Is Generative AI? A Beginner-Friendly Guide to How GenAI Actually Works (Part 1)
https://kubetools.io/what-is-generative-ai-a-beginner-friendly-guide-to-how-genai-actually-works-part-1/
Tue, 24 Feb 2026

Generative AI has quickly moved from research labs into everyday tools. It helps people write, learn, create, and work more efficiently. Still, many explanations remain overly technical or abstract.

This article is written for beginners and non-technical professionals who want a clear, practical understanding of Generative AI—without math, system design, or engineering details.

The goal is clarity, not complexity.


Table of Contents

  1. Introduction
  2. What Is Generative AI?
  3. Traditional AI vs Generative AI
  4. Types of Generative AI
  5. How Generative AI Works (High Level)
  6. Prompts and Outputs
  7. Where Individuals Use Generative AI Today
  8. Conclusion
  9. Key Takeaways
  10. References

Introduction

Generative AI has moved rapidly from experimental research into tools people use every day. From drafting text and summarizing information to creating images and supporting learning, its adoption continues to grow across many domains.

At its core, Generative AI enables software to create new content in response to human input. Rather than relying on fixed rules or returning simple predictions, these systems generate outputs that adapt to the intent expressed in a prompt. This represents a meaningful shift in how people interact with software.

This guide explains what Generative AI is, how it works at a high level, and where individuals use it today.


What Is Generative AI?

Generative AI, often shortened to GenAI, is a category of artificial intelligence focused on creating new content.

Instead of only analyzing data or making predictions, Generative AI produces outputs such as text, images, audio, video, or structured responses based on user input.

For example:

  • writing a short email from a brief instruction
  • generating an image from a text description
  • explaining a topic in simple language

In each case, the system creates something new rather than selecting from predefined answers.


Traditional AI vs Generative AI

To understand why Generative AI matters, it helps to compare it with earlier forms of AI.

Traditional AI                            | Generative AI
Analyzes existing data                    | Creates new content
Focuses on prediction and classification  | Focuses on generation and response
Works within fixed rules or objectives    | Adapts to open-ended prompts
Returns labels, scores, or decisions      | Returns text, images, audio, or video
Often hidden in background systems        | Designed for direct human interaction

Traditional AI might detect fraud or recommend products. Generative AI, by contrast, interacts directly with users and produces content that feels conversational and flexible.


Types of Generative AI

Generative AI systems can work with different kinds of data.

Text

Text-based GenAI can write, summarize, explain concepts, translate languages, and assist with learning or research.

Example: generating a short explanation of a topic from a simple question.

Images

Image generation tools create visuals from text descriptions.

Example: producing an illustration based on a written scene or idea.

Audio

Audio-focused GenAI can generate speech, sound effects, or music.

Example: converting written text into natural-sounding speech.

Video

Video generation systems create short clips or animations based on prompts.

Example: generating a short visual explainer from a script.

Multimodal AI

Multimodal systems work across multiple input and output types, such as understanding text and images together or generating images from written descriptions.


How Generative AI Works (High Level)

At a high level, Generative AI operates in two phases: training and inference.

Training

During training, the system learns patterns from very large collections of data. The goal is not memorization, but learning how language, images, or sounds are structured.

Rather than storing exact answers, the model learns probabilities and relationships. This is why it can generate new responses instead of repeating existing ones.

Inference

Inference happens when a user interacts with the system. The AI uses what it learned during training to generate a response based on the prompt.

Because the output is generated dynamically, the same prompt can produce slightly different responses each time.

Tokens

Generative AI processes information in small units called tokens. Tokens may represent parts of words, symbols, or short word sequences.

By predicting tokens step by step, the system builds responses that feel natural and coherent rather than fixed or scripted.


Prompts and Outputs

A prompt is the input provided by the user. It can be a question, instruction, or short description.

The output is the content generated in response.

Simple Example

Prompt:

“Explain Generative AI in one paragraph for a beginner.”

Output:
A short explanation written in plain language, adapted to a beginner audience.

Prompts guide the system rather than control it exactly. Clear prompts usually lead to more relevant outputs, but even simple inputs can be useful.


Where Individuals Use Generative AI Today

Generative AI is increasingly used as a general-purpose assistant.

Individuals commonly use it to:

  • write and edit text
  • learn new topics
  • summarize information
  • brainstorm ideas
  • create visual content
  • support everyday productivity

These use cases show Generative AI as a tool that supports human work rather than replacing it.


Conclusion

Generative AI represents a clear shift in how people interact with technology. Instead of software that only analyzes or predicts, users now engage with systems that create and respond.

Understanding Generative AI at a high level—what it is, how it works, and where it is used—helps individuals use these tools with confidence. Concepts like training, inference, prompts, and tokens provide enough context to be effective without technical depth.

As Generative AI continues to evolve, clarity matters more than complexity.


Key Takeaways

  • Generative AI focuses on creating new content rather than analyzing existing data.
  • It enables flexible, prompt-driven interaction with software.
  • GenAI works through training and inference, not memorization.
  • Prompts guide outputs, but responses are generated dynamically.
  • Individuals use Generative AI daily for learning, creativity, and productivity.

What’s Next?

The next article explores why many Generative AI projects fail after the demo stage—and what that reveals about real-world adoption beyond first impressions.


References

  • IBM. What Is Generative AI?
    https://www.ibm.com/topics/generative-ai
  • McKinsey & Company. What Is Generative AI?
    https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-generative-ai

Docker-based Model Runner for AWS CloudWatch Log Analysis
https://kubetools.io/docker-based-model-runner-for-aws-cloudwatch-log-analysis/
Tue, 24 Feb 2026

Concise Summary

This project converts plain-English questions into concise summaries over CloudWatch-style logs.

Conversational CloudWatch converts natural-language questions into structured insights over CloudWatch-style logs. It runs locally or with LocalStack, supports LLM summarization via TinyLlama, and exposes a simple FastAPI service packaged with Docker Compose for reproducible development and future AWS integration.

1. Problem & Context

Modern cloud environments produce huge volumes of logs, which makes it hard to find critical issues quickly. Although AWS CloudWatch offers strong monitoring and alerting, spotting trends in error logs often requires complex queries or manual digging.

To address this, we built a Dockerized model runner that wraps a FastAPI + TinyLlama summarization stack, enabling conversational insights over local logs or AWS CloudWatch-style logs.

1.1 Use Case: Querying AWS CloudWatch with Natural Language

Primary Goal: Reduce incident-response time by letting engineers ask natural-language questions about logs and receive instant, summarized answers.

How It Works

  • Engineers ask plain-English questions (e.g., “Did errors spike in the auth service in the last 2 hours?”).
  • The FastAPI service (in its own Docker container) connects directly to AWS CloudWatch to fetch the raw logs.
  • Retrieved logs are passed to the Docker model runner (Ollama) for analysis and summarization.
  • The system returns a clear, concise answer with no CloudWatch syntax or Log Insights queries required.

Why This Matters

  • Eliminates slow log searches and manual filtering
  • Makes CloudWatch accessible to non-experts
  • Supports faster debugging and better on-call efficiency

2. Solution Overview: Docker Model Runner Architecture for AWS CloudWatch

Update (v1.1.0): This version adds Guardrails AI validation for safe prompt and time-range handling, a /version endpoint for runtime introspection, and a startup health confirmation log.

The entire application is managed by Docker Compose, which acts as the Docker component runner. It orchestrates all the individual services (components) needed for the app to function.

The most critical component is the Docker model runner, which is the ollama service defined in the docker-compose.yml file. This container runs the TinyLlama large language model locally. This allows our FastAPI application to send it raw logs and receive natural-language summaries back, all without relying on an external, paid API.

The solution integrates a FastAPI backend with an Ollama-based model runner (TinyLlama). Users submit a query through Swagger UI or via POST API calls. Depending on environment settings, the system either returns deterministic summaries (mock mode) or real LLM outputs.
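A minimal docker-compose.yml sketch of this layout might look like the following. The service names app and ollama match the container names shown later in the evidence section; the OLLAMA_URL variable and the volume are illustrative assumptions rather than the project's exact file:

```yaml
services:
  app:
    build: .                              # FastAPI backend with guardrails and the summarizer
    ports:
      - "8001:8001"
    environment:
      - USE_LLM=true                      # toggle between LLM and deterministic summaries
      - OLLAMA_URL=http://ollama:11434    # assumed variable name for reaching the model runner
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest           # Docker model runner hosting TinyLlama
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama       # persist pulled model weights between restarts

volumes:
  ollama_models:
```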

2.1 System Architecture (Figure 1)

  • Conversational CloudWatch v1.1.0 integrates FastAPI with Ollama’s TinyLlama model under Docker Compose. The architecture separates validation, retrieval, and summarization steps for clarity and reliability.
  • The diagram shows how the FastAPI container (port 8001) interacts with the Ollama model runner (port 11434). Requests arrive at /health_status, /recipes, or /query. Validations run before the summarizer (TinyLlama) produces structured JSON outputs.

2.2 Operating Modes

The system operates in two distinct modes for different environments:

  • Deterministic Local Mode: uses bundled CloudWatch-style sample logs and a rule-based summarizer to produce repeatable outputs; ideal for demos and tests without any AWS credentials.
  • LocalStack: Provides AWS-like local behavior for development, allowing the system to interact with a local mock of the CloudWatch API.
  • Real AWS: connects to CloudWatch with least-privilege credentials (read-only)

2.3 Guardrails

  • Limit prompts to 300 characters (prevent prompt injection)
  • Clamp time ranges between 5 minutes and 24 hours
  • Limit response size for readability
  • No secrets or PII in code; AWS access is read-only
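The project describes using Guardrails AI for this validation; as a simpler illustration of the first two rules above, equivalent checks with plain Pydantic validators might look roughly like this (class and field names are illustrative, not the project's actual code):

```python
import re
from pydantic import BaseModel, field_validator

TIME_RANGE_PATTERN = re.compile(r"^\d+[smhd]$")        # e.g. "30m", "2h"
UNIT_MINUTES = {"s": 1 / 60, "m": 1, "h": 60, "d": 1440}

class QueryRequest(BaseModel):
    # Illustrative request model; the real service may define this differently
    prompt: str
    time_range: str = "1h"

    @field_validator("prompt")
    @classmethod
    def limit_prompt_length(cls, value: str) -> str:
        # Guardrail: cap prompt length to reduce the prompt-injection surface
        if len(value) > 300:
            raise ValueError("prompt must be 300 characters or fewer")
        return value

    @field_validator("time_range")
    @classmethod
    def clamp_time_range(cls, value: str) -> str:
        # Guardrail: only simple relative ranges, clamped to 5 minutes .. 24 hours
        if not TIME_RANGE_PATTERN.match(value):
            raise ValueError("time_range must match ^\\d+[smhd]$")
        minutes = int(value[:-1]) * UNIT_MINUTES[value[-1]]
        if not 5 <= minutes <= 1440:
            raise ValueError("time_range must be between 5 minutes and 24 hours")
        return value
```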

2.4 Integration with AWS

The system integrates with AWS CloudWatch using read-only access, so developers can analyze and visualize real log data without altering production systems. This behavior can be simulated in development with LocalStack, while least-privilege IAM credentials enforce secure access in production.

This configuration lets teams move from local testing environments to live monitoring in AWS with little reconfiguration.

To take this integration further, we plan to implement direct CloudWatch API interactions using Boto3 and deploy the entire containerized application on AWS ECS or Cloud Run for full production readiness.
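As a rough sketch of that planned Boto3 integration, fetching recent log events read-only might look like this. The helper name, log group, and AWS_ENDPOINT_URL variable are illustrative; the endpoint URL applies only when pointing at LocalStack:

```python
import os
import time
import boto3

# endpoint_url points at LocalStack during development; omit it for real AWS
logs = boto3.client(
    "logs",
    region_name=os.getenv("AWS_REGION", "us-east-1"),
    endpoint_url=os.getenv("AWS_ENDPOINT_URL"),  # e.g. http://localstack:4566
)

def fetch_recent_events(log_group: str, minutes: int = 120) -> list[str]:
    """Return raw log messages from the last `minutes` minutes (read-only)."""
    now_ms = int(time.time() * 1000)
    response = logs.filter_log_events(
        logGroupName=log_group,
        startTime=now_ms - minutes * 60 * 1000,
        endTime=now_ms,
        limit=200,
    )
    return [event["message"] for event in response.get("events", [])]

# Example (hypothetical log group name):
# events = fetch_recent_events("/aws/lambda/auth-service", minutes=120)
```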

3. API Endpoint Details

The API exposes three simple endpoints under Base URL: http://localhost:8001

GET /health_status

Checks service status.

{"status":"ok"}
curl -s http://localhost:8001/health_status

POST /query

Runs a natural-language query over logs
Request schema

{
  "prompt": "string (required)",
  "log_group": "string (optional)",
  "time_range": "string (optional)",
  "mock": "boolean (optional)"
}
curl -s -X POST http://localhost:8001/query \
 -H "Content-Type: application/json" \
 -d '{"prompt":"show error spikes last 2h","mock":true}'

4. Environment & Prerequisites

Test Environment:

  • macOS 14 / Windows 11
  • Python 3.12+
  • Docker Engine 29.0.1 + Docker Compose v2.29
  • Optional tools: LocalStack & AWS CLI

Step-by-Step Implementation Flow (Figure 2)

What happens in the runtime flow:

  • Request Ingress
    • Requests arrive via /health_status, /version, /recipes/{name}, or /query.
  • Guardrails Validation
    • Checks prompt length (≤ 300 chars) and time-range pattern (^\d+[smhd]$).
    • Ensures USE_LLM and mock flags are properly configured.
    • Invalid requests are rejected with clear error messages.
  • Summarization
    • The summarizer identifies spikes, reasons, or affected users.
    • If USE_LLM=TRUE, it calls Ollama (TinyLlama) via port 11434 for natural summaries.
    • Otherwise, deterministic summaries are generated locally.
  • Response & Output
    • Returns a structured JSON response with the summary fields.
    • Logs a startup confirmation at boot:
      INFO: Health check OK — API responding normally (startup)
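When USE_LLM is enabled, the summarization step in this flow calls the Ollama API. A rough sketch of that call is shown below; the prompt template and helper name are illustrative, not the project's actual code:

```python
import httpx

OLLAMA_URL = "http://ollama:11434/api/generate"  # service name from docker-compose

def summarize_logs(question: str, log_lines: list[str]) -> str:
    """Ask TinyLlama (via Ollama) for a short summary of the retrieved log lines."""
    prompt = (
        "You are a log analysis assistant. Answer the question using only these logs.\n"
        f"Question: {question}\n"
        "Logs:\n" + "\n".join(log_lines[:100])  # keep the prompt small
    )
    response = httpx.post(
        OLLAMA_URL,
        json={"model": "tinyllama", "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```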

5. Reproduce Locally

The following commands rebuild and launch the stack, then verify key endpoints.

docker compose build

docker compose up -d

curl -s http://localhost:8001/health_status | jq .

curl -s "http://localhost:8001/recipes/error_spikes?log_group=/aws/lambda/auth-service&time_range=2h&mock=true"  | jq .

curl -s "http://localhost:8001/recipes/error_spikes?log_group=/aws/lambda/auth-service&time_range=2h&mock=false" | jq .

curl -s "http://localhost:8001/recipes/slow_queries?log_group=/aws/lambda/auth-service&time_range=4h&mock=false" | jq .

6. How to Reproduce

1. Clone the repo

git clone https://github.com/kubetoolsio/docker-model-runner-aws-cloudwatch.git

2. Start with Docker

docker compose up -d --build

3. Verify installation

curl -s http://localhost:8001/health_status

4. Send a test query (Real LLM Mode – Interacting with LocalStack/AWS):

curl -s -X POST http://localhost:8001/query \
-H "Content-Type: application/json" \
-d '{
"prompt": "Analyze error spikes in auth-service last 2h",
"log_group": "/aws/lambda/auth-service",
"time_range": "2h"
}'

7. Project Structure

8. Result Table

Query Type             | Expected Output                      | Actual Result
/version endpoint      | App metadata (version, mode, model)  | Returned (Figure 9.1)
docker containers      | The containers up and running        | Running (Figure 9.2)
/recipes/slow_queries  | Structured count + insights summary  | Real LLM summarization (Figure 9.3)

9. Evidence (Screenshots)

Figure 9.1 – Version Endpoint

Figure 9.2 – Docker Containers Running

  • Docker Desktop showing both services active and healthy:
  • conversational-cloudwatch-app-1 (FastAPI backend) and conversational-cloudwatch-ollama-1 (TinyLlama model runner), confirming proper Docker Compose orchestration.

Figure 9.3 – Slow Queries Recipe

  • Executed with /recipes/slow_queries?mock=false, showing real LLM summarization of database latency and timeout incidents.
  • It identifies Gateway timeout, DatabaseError, and Expired token issues with next-step recommendations.

10. Current Limitations & Planned Improvements

Current Limitations

  • Local sample / LocalStack logs only (currently); real AWS CloudWatch log ingestion is wired but still being validated end-to-end.
  • Base recipe coverage only (slow_queries, error_spikes, traffic_summary)
  • IAM policy exists but not fully tested
  • Limited to TinyLlama model for summarization

Planned Improvements

  • Integrate with live AWS CloudWatch using Boto3 and expand model options
  • Add additional recipes (security alerts, latency profiling)
  • Expand LLM model options for improved summarization accuracy
  • Deploy to Cloud Run for scalable public access

In the future, we plan to containerize the AWS data-fetching client (which we call the MCP server) into its own dedicated Docker application. This will improve scalability and better separate the data-fetching logic from the main API.

11. IAM and Security Considerations

Security is a core principle of this project:

  • When integrated with AWS, the system uses a least-privilege IAM policy (read-only).
  • No secrets or PII are stored in containers.
  • Sensitive configs reside in environment variables.
  • Guardrails validation prevents prompt injection attacks

The IAM policy draft (docs/IAM_DRAFT.md) ensures CloudWatch Logs access is strictly read-only with no ability to modify or delete logs.
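A read-only CloudWatch Logs policy of that kind typically looks like the sketch below; this is an illustration, not the project's actual IAM_DRAFT.md contents:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyCloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
        "logs:GetLogEvents",
        "logs:FilterLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
```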

12. Discussion

This project demonstrates how modern LLM tools can simplify complex operational tasks. The modular architecture enables flexible deployment and consistent performance across different environments.

Key architectural decisions:

  • Dockerization ensures environment parity between development and production
  • TinyLlama provides accurate summaries with low resource overhead
  • Guardrails AI validation adds a safety layer before LLM processing
  • Separation of adapters, recipes, and summarizers enables easy extension

13. Conclusion

This project demonstrates how conversational AI can ease CloudWatch log analysis with the help of FastAPI, Docker, and TinyLlama.
Version 1.1.0 provides a safe and scalable base: guardrails, complete Docker orchestration, and recipes for expanding log analysis.

The system can scale to real AWS CloudWatch and accommodate future extensions such as new models or alert recipes, bringing natural-language monitoring closer to production.

14. References & Credits

Licensing: No PII or secrets used. All code/demo is shareable under referenced open-source licenses.

15. Contributing & Getting Involved

If you’d like to explore the source code, contribute improvements, or report issues:
  • Visit the Github Repository to see the full project code
  • If you find a bug, open an issue
  • If you have an idea for improving the project, raise a feature request
  • Check the issues page to see what others have reported

Text-to-SQL Agent with Docker MCP and Model Runner | Smarter and Safer
https://kubetools.io/text-to-sql-agent-with-docker-mcp-and-model-runner-smarter-and-safer/

A containerized approach to natural language database queries with built-in safety and auditability

Author: Vishesh Sharma and Karan Singh

Date: November 12th, 2025

Core

  • The original Text-to-SQL setup worked but wasn’t scalable or secure.
  • The new Model Context Protocol (MCP) introduces modular, isolated layers.
  • Each service has a single role: the API orchestrates, the model generates, and the MCP executes safely.
  • All components run in Docker for consistent, reproducible deployment.
  • The result is a secure, scalable, and auditable Text-to-SQL system that translates natural language into reliable SQL.

Introduction

When this project started, the goal was to make it possible to ask a question in plain English and get an accurate, safe SQL query in return. The earliest version achieved exactly that: it could translate natural language into structured data retrieval using a large language model (LLM). The earlier design proved fast but fragile, efficient but limited. The introduction of the Model Context Protocol (MCP) reshaped the entire system into something modular, traceable, and secure. This post explores that update: what changed and why it matters.


The Earlier Setup

The first version of the Text-to-SQL pipeline worked beautifully on paper. It relied on a FastAPI service, a model runner, and a SQLite database, all wired together inside a double Docker setup. The user sent a question to the API, which built a structured prompt using table metadata and passed it to the model. The model returned an SQL statement, the system validated it for safety, and the database executed it in read-only mode.

Everything ran inside containers:

  • The API container handled requests, prompt construction, SQL validation, and query execution.
  • The model container ran Ollama with the mistral:latest model.
  • The database was there locally, mounted as a read-only file.

At first look, it felt efficient: one connected chain where each part did its job quickly. For smaller workloads or local demos, this setup worked perfectly. You could run everything with a single command and start asking questions instantly.

Figure 1 — Earlier System Architecture: The modular design with separate components for prompt building, LLM execution, SQL validation, query execution, and logging, all coordinated through Docker.

The Limitations of the Earlier Setup 

The architecture had no concept of concurrency, multi-user access, or proper isolation between components. Everything depended on one process staying healthy, and that process lived inside the API container. As soon as the system started handling real workloads, such as larger datasets and repeated queries, its inefficiency stood out. What was good for a demo wasn’t ready for day-to-day use.

The core issue was responsibility overload. The API container had to do everything at once: build prompts, talk to the model, validate generated SQL, run queries on the database, and then log every result. It was a single brain trying to think, act, and remember all at the same time. When traffic grew or responses slowed, that central role became a bottleneck. The API couldn’t scale independently because model calls and database access were hard-wired together.

The architecture worked fine for controlled environments but was too fragile for anything resembling a real-world deployment. What was needed was a structure that separated logic, data, and model operations into clearly defined layers. Here comes the MCP!

The Core Parts of MCP

After setting up the base system, three main parts make everything work smoothly and safely: the MCP Toolkit, the MCP Gateway, and the MCP Server. Each has a single job and they all connect like pieces of a simple chain. Together, they keep the system clean, reliable, and easy to follow.

MCP Toolkit

The Toolkit is the first stop when the main app needs to talk to the database. Instead of reaching out directly, the app sends its request to the Toolkit.
The Toolkit understands what the app is asking for and passes it along in the right way. It can ask for table details or run a read-only query. This makes the app simpler because it no longer has to deal with the database directly. The Toolkit acts like a helpful middle layer that keeps communication clear and safe.

MCP Gateway

The Gateway sits in the middle of the system. Its job is to check that every request follows the right steps and is sent in the right format.
You can think of the Gateway as a checkpoint. It doesn’t change the data or rewrite anything, but it makes sure that what passes through is correct and safe to send forward. This helps the whole setup stay organized, especially when there are many requests happening at the same time.

MCP Server

The Server is where the actual database work happens. It is the only part that talks directly to the database.
When a request reaches the Server, it carefully runs the query and sends the results back. It is designed to read data only, never to change or delete it. This rule makes the database secure and ensures that even if something goes wrong elsewhere, the stored information stays safe.

How They Work Together

  • The app sends a request to the Toolkit.
  • The Toolkit forwards the request to the Gateway.
  • The Gateway passes it along to the Server.
  • The Server runs the query on the database.
  • The results move back through the Server → Gateway → Toolkit → App, and finally reach the user.

Each part knows its own role, and nothing overlaps. This makes the setup easier to understand, easier to fix, and easier to expand when more users or databases are added later.

 The Model Context Protocol (MCP)

The next phase of the project introduces a new layer between the API and the database: the Model Context Protocol. MCP redefined how components talked to each other. Instead of having one monolithic pipeline, the system was reorganized into small, well-defined services:

  • MCP Toolkit: The API’s local interface for schema retrieval and SQL execution.
  • MCP Gateway: A proxy layer that routes all requests safely and ensures standardization.
  • MCP Server: Handles actual database communication and enforces read-only execution.
  • Database: Still local or containerized, but now fully isolated behind the MCP chain.

This structure means the model runner and the API no longer touch the database directly. Every query passes through a managed route that provides logging, schema introspection, and validation.

The MCP chain acts like a compartment between the intelligent model and the sensitive data it needs. Each step has a clear responsibility, which keeps the system modular, auditable, and secure.

How the New Workflow Operates

When a user asks a question, the journey now looks like this:

  1. The user sends a natural-language question to the FastAPI service.
  2. The API requests the schema from the MCP Toolkit, which forwards it through the Gateway to the MCP Server.
  3. The MCP Server fetches metadata from the database and returns it along the same path.
  4. The API builds a prompt for the Model Runner (Mistral through Ollama).
  5. The model generates a SQL query based on schema context.
  6. The SQL Firewall checks the query for safety, structure, and table validation.
  7. The validated query goes back through the MCP Toolkit → Gateway → Server → Database, where it executes in read-only mode.
  8. The results follow the reverse path, reaching the user through the API.
  9. Every step — input, SQL, and output — is logged by the Artifact Logger.

This creates a smooth but strongly governed flow where no single container can break isolation. Each service can restart, scale, or update independently without affecting others.

MCP Workflow

Figure 2 — Text-to-SQL Request–Response Workflow (MCP Pipeline):
The new MCP-based workflow illustrating how the API, model runner, and MCP layer (Toolkit, Gateway, Server, Database) interact to securely process and validate user queries.

Docker Compose: The Heart of the MCP Setup

With the Model Context Protocol (MCP) in place, the system is no longer a single stack. It now consists of multiple independent yet connected services, each with a clearly defined purpose. This is where Docker Compose becomes the centerpiece.

Instead of manually launching every container, Docker Compose acts as the conductor that starts, manages, and connects all these services together.
It ensures they start in the right order, share the right environment variables, and talk to one another seamlessly.

When you run

docker-compose up --build

Everything from the FastAPI layer to the database now comes alive in the right sequence. Each part knows only what it needs to know and nothing more.

How the Setup Works

Refer to the Github Repo with MCP branch

In this architecture, each service represents a single responsibility in the Text-to-SQL pipeline. They all live inside their own containers but communicate over internal Docker networks using well-defined URLs.

Let’s break it down layer by layer.

1. API Service 

Purpose:

This is the main entry point where users interact with the system. It’s a FastAPI-based service that receives natural language questions, builds prompts for the model, validates the generated SQL, and sends it to the MCP layer for execution.

How it fits in:

  • Talks to the model runner (model-runner) to generate SQL.
  • Sends validated queries to the MCP Toolkit (mcp-toolkit) for safe execution.
  • Logs all requests and responses for traceability.

Port: 8000

In the earlier setup, the API handled everything from model calls and database queries to logging. Now it just coordinates and delegates those jobs, keeping things clean and secure.

2. Model Runner 

Purpose:
This container runs the large language model (LLM) locally using Ollama and the mistral:latest model. It’s the system’s “brain,” responsible for converting human questions into SQL statements.

How it fits in:

  • Receives prompt data from the API.
  • Returns a generated SQL query.
  • Stays isolated and it doesn’t access any database directly.

Port: 11434. Image: ollama/ollama:latest

To make startup smooth, the model runner uses a health check: Docker waits until the model is fully loaded before letting the API start. That way, queries never fail due to “model not ready” errors.

3. MCP Toolkit 

Purpose:
The MCP Toolkit is the middleman between the API and the rest of the MCP chain.
It takes API requests and translates them into MCP-compliant messages for the Gateway and Server layers.

How it fits in:

  • The API never talks to the database directly; it only talks to this toolkit.
  • Fetches schema information.
  • Sends SQL queries for execution.
  • Simplifies communication by exposing simple routes like /schema and /query.

Port: 8002

Think of it as a translator: it speaks both “API language” and “MCP language.”

4. MCP Gateway 

Purpose:
The Gateway controls how the Toolkit communicates with the actual database layer.
It ensures every request passes through proper validation, logging, and protocol checks.

How it fits in:

  • Routes schema and query requests to the MCP Server.
  • Keeps the communication standardized and structured.

Port: 8001

You can imagine it as a traffic controller: nothing reaches the database unless the Gateway approves it.

5. MCP Server 

Purpose:
This is the final stop before the database. The MCP Server executes validated SQL queries in read-only mode and returns results safely.

How it fits in:

  • Talks only to the database, not the API or model directly.
  • Uses environment variables for credentials (DB_URL).
  • Ensures no unsafe SQL (like DROP, UPDATE, or DELETE) can be executed.

Port: 9000

By isolating this layer, even if something goes wrong in the API or model, the data itself remains protected.

6. Database 

Purpose:
The heart of the system’s information. In this case, it’s a PostgreSQL database initialized with a simple dataset. It can be replaced with another SQL engine.

How it fits in:

  • Only the MCP Server can talk to it.
  • Initializes with init_db.sql and runs a health check.
  • Fully containerized for consistency across environments.

Port: 5432. Image: postgres:15-alpine

The Startup Order

Docker Compose ensures everything starts in a chain of dependencies:

Figure 3 — Docker Compose Service Chain:
Overview of all containers in the MCP architecture, including startup order and interdependencies between API, model runner, MCP Toolkit, Gateway, Server, and Database

The system waits for each service to report healthy status before moving to the next one. This guarantees that when the API finally goes live, the model, MCP layers, and database are all ready.
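In docker-compose.yml, that ordering is usually expressed with depends_on and health checks. The sketch below follows the service names and ports described above; the build paths and health-check commands are illustrative, not the project's exact file:

```yaml
services:
  database:
    image: postgres:15-alpine
    healthcheck:                      # wait until Postgres accepts connections
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      retries: 10

  mcp-server:
    build: ./mcp_server               # build context is an assumption
    depends_on:
      database:
        condition: service_healthy

  mcp-gateway:
    build: ./mcp_gateway
    depends_on:
      - mcp-server

  mcp-toolkit:
    build: ./mcp_toolkit
    depends_on:
      - mcp-gateway

  model-runner:
    image: ollama/ollama:latest
    healthcheck:                      # wait until the model runner responds
      test: ["CMD-SHELL", "ollama list || exit 1"]
      interval: 10s
      retries: 30

  api:
    build: ./api
    ports:
      - "8000:8000"
    depends_on:
      model-runner:
        condition: service_healthy
      mcp-toolkit:
        condition: service_started
```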

What Makes This Setup Different

The biggest shift from the earlier setup is separation and safety. Before, the API was overloaded: it did everything. Now, every container has a single focus:

  • The API coordinates.
  • The model runner generates.
  • The MCP chain protects and executes.
  • The database stores data safely behind multiple layers.
ServicePurpose/RolePortDepends On
Service       | Purpose / Role                               | Port  | Depends On
API           | User entry, orchestration and validation     | 8000  | model-runner, mcp-toolkit
Model Runner  | Runs the LLM (currently Mistral via Ollama)  | 11434 | N/A
MCP Toolkit   | Bridges API with MCP layers                  | 8002  | mcp-gateway
MCP Gateway   | Routes and validates requests                | 8001  | mcp-server
MCP Server    | Executes SQL safely on DB                    | 9000  | database
Database      | Stores and provides structured data          | 5432  | N/A
Table 1 — Service Overview and Dependencies:
Summary of all Docker Compose services in the MCP-based Text-to-SQL architecture, outlining each component’s purpose, exposed port, and dependency chain for system orchestration.

Inside the MCP Layer 

The MCP layer is where most of the magic happens. It forms the secure bridge between the application logic (API + Model) and the actual database.
Instead of letting the model or API run SQL directly, everything now passes through three well-defined components: the Toolkit, the Gateway, and the Server.

Together, they act like a controlled relay system: the Toolkit asks, the Gateway routes, and the Server executes, always in read-only mode.

1. The MCP Toolkit 

The Toolkit is the API’s assistant inside the MCP layer. Whenever the FastAPI service needs to fetch a schema or execute a SQL query, it doesn’t reach the database directly; it sends the request here first.

Here’s how it looks in code:

# mcp_toolkit.py
import httpx
from typing import Dict, Any

class MCPToolkit:
    """Toolkit for interacting with the MCP Gateway."""

    def __init__(self, base_url: str = "http://mcp-gateway:8001"):
        self.base_url = base_url

    async def get_schema(self) -> Dict[str, Any]:
        """Fetch database schema through the MCP Gateway."""
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.base_url}/schema", json={})
            resp.raise_for_status()
            return resp.json()

    async def run_query(self, sql: str) -> Dict[str, Any]:
        """Execute SQL query through the MCP Gateway."""
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{self.base_url}/query", json={"sql": sql})
            resp.raise_for_status()
            return resp.json()

What this does:

  • The Toolkit exposes two async methods, one for schema retrieval, one for query execution.
  • Both requests go to the Gateway (http://mcp-gateway:8001), never the database.
  • It returns clean, structured JSON responses that the API can consume.

Essentially, this file gives the FastAPI layer a simple, safe interface to interact with the database world indirectly.

2. The MCP Gateway 

Unlike a typical API router, the MCP Gateway doesn’t interpret or modify SQL.
Its only job is to securely pass messages between the outer world (Toolkit or API) and the inner world (MCP Server and database). Think of it as a traffic controller: every request passes through it, but it never opens the payload or touches the data directly.

Here’s the core logic:

from fastapi import FastAPI, Request
import httpx, os

app = FastAPI(title="Local MCP Gateway")
MCP_SERVER_URL = os.getenv("MCP_SERVER_URL", "http://mcp-server:9000")

@app.post("/{path:path}")
async def proxy_post(path: str, request: Request):
    """Forward POST requests to MCP server (e.g., /query, /schema)."""
    print(f"Proxying POST to MCP server: {path}")
    try:
        data = await request.json()
    except Exception:
        data = {}

    try:
        async with httpx.AsyncClient() as client:
            resp = await client.post(f"{MCP_SERVER_URL}/{path}", json=data)
            resp.raise_for_status()
            return resp.json()
    except Exception as e:
        return {"error": str(e)}

@app.get("/health")
def health():
    """Health check endpoint."""
    return {"status": "ok"}

What’s happening here

  • The route @app.post(“/{path:path}”) dynamically captures any incoming path (/schema, /query, or even /metadata) and forwards it to the MCP Server.
  • It acts like a universal POST proxy, using httpx for async forwarding.
  • If the body isn’t valid JSON, it safely defaults to {} instead of crashing.
  • Any error raised downstream is caught and returned in a simple, consistent JSON format.
  • The /health endpoint makes it easy for Docker or Compose to check that the gateway is alive.

3. The MCP Server 

The Server is the only component that talks directly to the database. It runs in read-only mode, executes validated SQL, and returns the result as structured JSON.

import os
import psycopg2
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SQLQuery(BaseModel):
    sql: str

# Database connection (Postgres)
DB_URI = os.getenv("DB_URI", "postgresql://postgres:password@db:5432/text2sqldb")

# Create a single global connection but keep autocommit to avoid transaction issues
conn = psycopg2.connect(DB_URI)
conn.autocommit = True  # prevents "current transaction is aborted" errors

@app.post("/query")
def run_query(request: SQLQuery):
    """Execute a SQL query and return rows."""
    sql = request.sql.strip()
    if not sql:
        return {"error": "Empty SQL query."}

    try:
        with conn.cursor() as cur:
            cur.execute(sql)
            try:
                rows = cur.fetchall()
            except psycopg2.ProgrammingError:
                # e.g. for INSERT/UPDATE/DELETE that have no results
                rows = []
        return {"rows": rows}
    except Exception as e:
        conn.rollback()
        return {"error": str(e)}

What this does:

  • Accepts a SQLQuery payload and executes it against the Postgres database using psycopg2.
  • Keeps a single autocommit connection to avoid stuck transactions.
  • Converts query results into a list of JSON rows.
  • Returns structured output to the Gateway → Toolkit → API.

Combined with the upstream SQL firewall, which permits only single SELECT statements, this ensures that no matter what happens at the top layers, the database remains secure and untouched by unsafe operations.

4. The Chain in Action

Refer to the Github Repo with MCP branch

Here’s how a single query moves through the MCP layer in real time:

  1. API → Toolkit: “Fetch schema or run this SQL.”
  2. Toolkit → Gateway: “Forwarding structured request.”
  3. Gateway → Server: “Executing validated command.”
  4. Server → Gateway: “Here are your results.”
  5. Gateway → Toolkit → API → User: “Final answer in JSON.”

Each part does one job and nothing more. This keeps logs clean, failures contained, and security intact.
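From the user’s point of view, the whole chain is still triggered by one API call. Assuming the same POST /query endpoint and question parameter as the earlier setup (the MCP version does not change the public interface), a request might look like this:

curl -X POST "http://localhost:8000/query?question=Which%20customers%20placed%20the%20most%20orders%20in%202024" -H "accept: application/json"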

Comparing the Earlier Setup with the MCP Pipeline

The difference between the two architectures is not just structural; it also reflects a shift in design philosophy.

Aspect | Earlier Architecture | MCP-Based Architecture
System Design | Single linear pipeline where the API, model, and database were all directly connected. | Layered and modular setup where the API, Model Runner, and MCP components (Toolkit, Gateway, Server) work independently.
Database Access | The API had full access to the database and its credentials. | Database access is limited to the MCP Server, and the API interacts only through controlled MCP routes.
Scalability | Scaling meant duplicating the entire system, including the model and database. | Each service can scale separately, allowing independent scaling of the API or model without affecting others.
Failure Handling | A crash or timeout in one process could stop the entire system. | Failures are contained within each service, which can restart or recover without disturbing others.
Concurrency and Multi-User Support | Built for single-user or demo use, without concurrency management. | Supports multiple users and parallel queries through containerized services.
Deployment Mode | Compact and quick to launch but only suitable for local testing or demos. | Designed for production environments: secure, reproducible, and easy to deploy in the cloud.
Maintenance | Difficult to debug or update since one issue could disrupt the whole setup. | Easier to maintain with clear separation of responsibilities between components.
Data Safety | Risk of accidental data modification due to open access. | Read-only operation and strict validation ensure data safety through the MCP chain.
Table 2 — Comparison Between Earlier Architecture and MCP-Based Pipeline:
Summary of how the MCP architecture replaces a tightly connected system with a structured, secure, and scalable setup that supports multi-user access and improves reliability.

In practice, this means the system can now grow, adapt, and recover gracefully. It’s the difference between a quick demo and a real, production-ready data service.

The Future of MCP and AI-Driven Data Systems

MCP Architecture
Figure 4 — Future Evolution of MCP Architecture:
A distributed multi-chain Model Context Protocol setup connected through a unified API, with adaptive governance and feedback loops for performance, security, and auditability.



Looking ahead, this approach opens several exciting possibilities. Multiple databases can be connected through independent MCP Servers, all working under a unified API. Logs and context traces could evolve into a learning system that improves model performance automatically. Governance and compliance features, like role-based access and audit trails, could make MCP-based pipelines ready for enterprise environments without extra complexity.

As AI becomes part of critical workflows, trust will matter more than speed. The MCP framework provides that trust by turning intelligence into something accountable and traceable. It allows innovation to scale without losing control, striking a rare balance in modern AI development.

Conclusion

What began as a simple Text-to-SQL project has evolved into a model for how AI systems should think and behave. The earlier setup proved that natural-language querying could work, but the MCP-based version showed how to make it safe, traceable, and sustainable.

Each part of the system now plays a clear role: the model interprets, the API orchestrates, the firewall safeguards, and the MCP chain executes with precision. Together, they form a connected yet contained ecosystem that builds confidence into every interaction.

This evolution isn’t just technical progress; it’s a reflection of how AI infrastructure is maturing. Intelligence alone isn’t enough anymore; context and control define reliability. The Model Context Protocol gives us that framework, showing how to combine innovation with discipline.

It’s a small but important step toward the kind of systems that don’t just think for us; they think responsibly, within boundaries we can trust.

References

Text-to-SQL Service Agent with Docker and LLM model runner
https://kubetools.io/how-to-build-a-private-text-to-sql-service-with-docker-and-llm-model-runner/

A containerized approach to natural language database queries with built-in safety and auditability

Author: Vishesh Sharma

Date: October 27th, 2025

CORE IDEA

  • This project shows how natural English questions can be safely turned into SQL queries, with every step logged and validated before running.
  • Unsafe commands like DROP or DELETE never make it through, thanks to a built-in SQL firewall and a read-only database layer.
  • Everything is fully containerized with Docker, so you can reproduce the setup on any machine and trust that the behavior will be consistent.
  • In benchmark tests, the system produced correct results for most queries, while blocking unsafe SQL commands.

Problem & Context

Accessing data is critical for modern organizations, yet most employees struggle with SQL queries. Business users want quick answers to everyday questions such as “Which customers placed the most orders last quarter?” but they often have to wait for overloaded data teams. 

Large language models (LLMs) offer a way forward by converting natural language into SQL automatically, lowering the barrier to accessing data. However, giving models direct access to production databases creates serious risks. An LLM can hallucinate table names, misinterpret requests, or generate unsafe commands like DROP TABLE. Beyond these data-related technical issues, another concern arises: without logs or safeguards, there is no way to prove what queries were run or to prevent misuse. This project tackles both challenges by introducing a containerized pipeline that enables natural language access to data while enforcing strict safety and auditability.

Solution Overview & Architecture

System Architecture Design for AI-SQL pipeline
Figure 1 — System Architecture: The modular design with separate components for prompt building, LLM execution, SQL validation, query execution, and logging, all coordinated through Docker.

The system is packaged as a FastAPI service inside a Docker container, with internal components handling each stage of the Text-to-SQL pipeline:

  • When a user submits a question, the prompt builder loads the schema from metastore.yaml and constructs strict rules for SQL generation.
  • The LLM runner sends this prompt to a Hugging Face model through the OpenAI-compatible API.
  • The resulting SQL is then checked by the SQL firewall, which makes sure the query is a single SELECT, blocks dangerous commands like DROP or DELETE, and can enforce an allowlist of tables.

If the query passes validation, the read-only database executor runs it against the database and returns results as a Pandas DataFrame.

Every interaction is logged by the artifact store, which records the original question, generated SQL, results, and status in artifacts.log. The API endpoint /query exposes this full pipeline, while the docker-compose.yml file starts the container and saves the database and logs on your machine. 

This setup is easy to run and still enforces strong safety rules, so even non-technical users can query data without risking changes or data loss.
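To make the firewall stage more concrete, here is a minimal sketch of the kind of check it performs, written with sqlglot (the SQL parser listed in the project dependencies). The function name, dialect choice, and exact rule set are illustrative assumptions, not the project’s actual implementation.

import sqlglot
from sqlglot import exp

def is_safe_select(sql: str, allowed_tables: set[str]) -> bool:
    """Illustrative firewall check: one statement, SELECT only, allowlisted tables."""
    try:
        statements = sqlglot.parse(sql, read="sqlite")
    except sqlglot.errors.ParseError:
        return False

    # Exactly one statement, and it must be a SELECT
    if len(statements) != 1 or not isinstance(statements[0], exp.Select):
        return False
    query = statements[0]

    # Reject anything that writes or alters data, even when nested inside the query
    forbidden = (exp.Drop, exp.Delete, exp.Insert, exp.Update, exp.Create)
    if any(query.find(node) for node in forbidden):
        return False

    # Every referenced table must be on the allowlist
    referenced = {table.name.lower() for table in query.find_all(exp.Table)}
    return referenced <= {name.lower() for name in allowed_tables}

In the real pipeline this kind of validation sits between the LLM runner and the read-only executor, so a query that fails any check is rejected before it ever reaches the database.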

API Endpoint Details

The system exposes a single REST API endpoint for submitting natural language queries and retrieving results.

POST /query

  • Query Parameters: question (string, required) — The natural language question to be translated into SQL.
  • Response Schema: JSON object containing the generated SQL and the query results.

Example:

JSON output showing AI-SQL model runner query results for customers in Boston
Figure 2: Output from the AI-SQL model runner showing an executed SQL query and returned JSON results for customers in Boston.
  • Example curl Request:
curl -X 'POST' 'http://localhost:8000/query?question=Show%20the%20customers%20in%20Boston' \
  -H 'accept: application/json' -d ''

Environment & Prerequisites

This experiment is designed to be fully reproducible. The only requirement to run it is a working installation of Docker Desktop. All Python libraries, models, and dependencies are managed within the containerized environment.

System Configuration  

  • Tested on: Windows 11 with WSL 2, Docker Desktop 4.41.2
  • Container OS: python:3.10-slim  
  • Key Libraries: sqlglot>=23.0.0, pandas>=2.2.0, ollama==0.2.1
  • Generative Model: mistral:latest (Mistral 7B Instruct) served via the ollama/ollama Docker image

Dataset Details

The system works against a structured database schema.

  • Database: data/demo.db (SQLite) is initialised using init_db.py and includes a randomly generated dataset.
  • Schema definition is stored at data/metastore.yaml
  • Tables: Customers, Orders, Products (descriptions + allowed joins defined in YAML)
  • Mode: Read-only — no insert/update/delete operations are permitted.
  • Language: Natural language queries in English (e.g., “Which customers placed the most orders in 2024?”).

Preprocessing Steps

Before a user’s natural language query is processed, the system applies some lightweight preprocessing:

  • Schema injection: The query is combined with table/column descriptions from metastore.yaml so the model has strict boundaries.
  • Rule injection: The prompt builder enforces guardrails (e.g., “Only SELECT queries”, “No hallucinated tables”).
  • Caching check: The system looks up the question in artifacts.log — if an identical SQL was generated before, it reuses that result.
  • Normalization: SQL queries are normalized during scoring (lowercased, punctuation stripped) to make keyword matching fair across models.
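To make the schema and rule injection steps above more tangible, here is a minimal sketch of a prompt builder. It assumes metastore.yaml maps each table name to a list of column names; the function name, rule wording, and file layout are illustrative, not the project’s exact code.

import yaml

def build_prompt(question: str, metastore_path: str = "data/metastore.yaml") -> str:
    """Assemble a strict Text-to-SQL prompt from the schema file and guardrail rules."""
    with open(metastore_path) as f:
        metastore = yaml.safe_load(f)

    # Schema injection: describe every known table so the model stays inside real boundaries
    schema_lines = [
        f"- {table}: {', '.join(columns)}"
        for table, columns in metastore["tables"].items()  # assumed layout: table -> [columns]
    ]

    # Rule injection: the same guardrails the SQL firewall enforces downstream
    rules = (
        "Rules:\n"
        "1. Return exactly one SELECT statement and nothing else.\n"
        "2. Use only the tables and columns listed in the schema.\n"
        "3. Never use DROP, DELETE, INSERT, UPDATE, or ALTER."
    )

    return (
        "You translate questions into SQLite SQL.\n\n"
        "Schema:\n" + "\n".join(schema_lines) + "\n\n" + rules +
        f"\n\nQuestion: {question}\nSQL:"
    )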
Step-by-Step Implementation

The project mainly relies on our docker-compose.yml, which defines the whole application stack.


Query Workflow for the AI-SQL pipeline
Figure 3 — Query Workflow: From user input to final output. The pipeline builds a strict SQL prompt, generates candidate SQL through the LLM runner, validates it via the SQL firewall, executes only safe SELECT queries on a read-only database, and logs every step in the artifact store. Unsafe queries are blocked with an error message.

Key Points:

  • Image: ollama/ollama:latest
    • Uses the official Ollama Docker image, which comes with the runtime required to serve models.
  • Volumes
    • Mounts ollama_data to /root/.ollama inside the container.
    • This persists downloaded models between container restarts, so you don’t re-download mistral:latest every time.
  • Healthcheck
    • Runs ollama list | grep mistral:latest inside the container.
    • If the model is not available, it exits with 1, marking the container as unhealthy.
    • Docker Compose will retry until the model is ready (interval: 30s, timeout: 5s, max 5 retries).
    • This ensures the pipeline doesn’t start until the model is available.
services:
  model-runner:
    image: ollama/ollama:latest
    container_name: model-runner
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD", "bash", "-lc", "ollama list | grep -q 'mistral:latest'"]
      interval: 30s
      timeout: 5s
      retries: 5

volumes:
  ollama_data:

The entire project is launched with a single command:

docker-compose up --build

Docker Model Runner

To keep everything simple and consistent, we used a Docker model runner to handle the language model separately from the main API. It’s based on the official ollama/ollama image and runs the mistral:latest model inside its own container.

The API talks to it through a local endpoint, so all the model work happens inside Docker.

AI-SQL Docker Desktop view displaying the running ai_sql container with CPU usage and port details.

Figure 4 — AI_SQL container running successfully in Docker Desktop, confirming a healthy and active setup.

Once our containers appear in Docker Desktop, there’s no need to use the terminal at all. You just open Docker Desktop and hit the Play button on the model-runner, then do the same for the api or the cli container. Docker automatically makes sure the model runner starts first and is healthy before the API begins.

The first time it runs, the model downloads once and is stored in a persistent volume called ollama_data. After that, it’s reused automatically, so you don’t have to wait for downloads again. The setup also works fully offline once the model is ready, which makes it really handy for local testing or demos.

You can manage everything directly from Docker Desktop and view logs, restart containers, or stop them when you’re done. It’s a one-click setup that makes the whole system feel lightweight and easy to use.

AI-SQL Docker Desktop view displaying the services inside the container
Figure 5 — All containers running in Docker Desktop, with the API returning query results through the CLI.

Command Line Interface (CLI)

The CLI acts as a lightweight companion to the API, designed for users who prefer working directly in the terminal. It connects with the same Docker services, sending natural language queries to the API and returning formatted SQL results instantly.

Once the containers are running in Docker Desktop, the CLI can be launched to query without opening a browser. It’s especially handy for people who want fast feedback.

Each CLI interaction is logged in the same artifact store, keeping a consistent record of questions, SQL statements, and outputs across both the API and the web interface.

Snippet of our CLI running queries  in our pipeline
Figure 6 — CLI mode of our program: showing a user query and formatted SQL results, confirming live interaction with the API.

Results & Evaluation

We measured performance across two metrics:

  • Soft accuracy: keyword coverage of expected SQL terms.
  • Latency: average query generation time.

Summary of model accuracy and latency
Figure 7. Summary of model accuracy and latency: All models achieved 100% accuracy, with Hugging Face models responding fastest and Ollama Mistral showing the highest latency.
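As a rough illustration of how the soft accuracy metric can be computed, the sketch below normalizes a generated query the way the scoring step describes (lowercased, punctuation stripped) and counts how many expected keywords appear. The function name and the example keyword list are ours, not the project’s exact code.

import re

def soft_accuracy(generated_sql: str, expected_keywords: list[str]) -> float:
    """Fraction of expected SQL keywords found in the normalized generated query."""
    normalized = re.sub(r"[^\w\s]", " ", generated_sql.lower())
    tokens = set(normalized.split())
    if not expected_keywords:
        return 0.0
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in tokens)
    return hits / len(expected_keywords)

# Example: this query contains all three expected keywords, so it scores 1.0
print(soft_accuracy("SELECT name, COUNT(*) FROM orders GROUP BY name;",
                    ["select", "count", "group"]))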

The models achieved similar accuracy, though with different latency.

The Llama model balanced accuracy and efficiency well. In contrast, the OpenAI OSS model was the slowest of the Hugging Face backends, while Ollama was the slowest overall, averaging five seconds per query.

One additional observation concerns caching. While the system does reuse previous results by looking up identical SQL strings in the artifact log, this is not a smart caching layer. For small test runs it works fine, but when thousands of queries accumulate, string-based lookup will become inefficient. 
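For context, the lookup described above could be as simple as the sketch below, which scans the artifact log line by line for an identical question. The JSON-lines layout and the field names are assumptions for illustration; the real log format may differ.

import json

def lookup_cached_sql(question: str, log_path: str = "data/artifacts.log") -> str | None:
    """Return previously generated SQL for an identical question, or None if not cached."""
    try:
        with open(log_path) as log_file:
            for line in log_file:
                record = json.loads(line)  # assumed: one JSON record per line
                if record.get("question") == question and record.get("status") == "ok":
                    return record.get("sql")
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    return None

A linear scan like this is exactly why the approach stops scaling once the log holds thousands of entries.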

Limitations 

This project provides a modular, containerized pipeline for safe natural language to SQL. Each part of the system (prompt builder, LLM runner, SQL firewall, database executor, and artifact store) is isolated in its own module and coordinated through Docker. Still, several limitations remain.

  • First, the evaluation metrics can be made more solid. Currently, queries are scored using keyword coverage (soft accuracy), latency, and status. While this helps to highlight differences between models, it does not fully capture accuracy. Equivalent SQL constructs, such as MAX(price) versus ORDER BY price DESC LIMIT 1, may be undercounted. A better approach would compare execution results directly, ensuring accuracy is measured on outcomes rather than string matches.
  • Second, while the system is modular, it currently runs in a local Docker environment for reproducibility. In production, these components could be deployed as separate services in a cloud-native environment. For example, the API, model runner, firewall, executor, and logging system can each be containerized and scaled independently using Kubernetes or serverless runtimes. 
  • Third, while an artifact log and lookup mechanism already exists, it is not a smart caching system. Queries are matched by SQL strings, which works for small datasets but will become inefficient as logs grow into thousands of entries. A more scalable approach would be to add indexed storage, semantic query matching, or a dedicated caching service to return frequent results instantly.

Finally, the current scope is limited to basic text-to-SQL translation. There is room to improve functionality into query explanation, summarization of results in natural language, and handling of more complex schemas or multiple databases.

Next Steps

Future improvements could include:

  • Improved evaluation: Move from keyword-based scoring to execution-based metrics that compare query outputs.
  • Cloud-native deployment: Deploy each module as an independent containerized service with orchestration tools like Kubernetes.
  • Multi-database support: Extend beyond SQLite to include PostgreSQL and managed cloud databases in read-only mode.
  • Enhanced features: Add result summarization, query explanations, and cross-database joins.
  • Smart Caching: Adding a dedicated caching service to smartly cache hot queries.

Contributing and Getting Involved

  • Go to the GitHub repository to see the full project code.
  • If you find a problem or a bug, open an issue so it can be fixed.
  • If you have an idea for improving the project, suggest it by raising a feature request.
  • Check the issues page to read what others have reported or suggested.
  • If you want to help, you can contribute by sending a pull request. This can be for fixing errors, updating documentation, or adding new features.

How to Check

This experiment is fully containerized and managed by Docker Compose. The following steps detail the project structure and the commands needed to replicate the entire analysis.

Project Structure

The repository is structured with a clear separation between application code, container configurations, and data, each organized into its own directory. This layout improves maintainability, reduces complexity, and makes the environment straightforward to manage.

Repository Tree
Figure 8 — Project File Structure: Organized layout of the repository, showing modular separation of controllers, models, views, data, and configuration files for the text-to-SQL system.

Prerequisite

To recreate this project, you only need to have Docker Desktop installed.

Setup & Execution

Run the Full Experiment

To build and run everything (API + LLM + CLI), simply run:

docker-compose up --build

This will:

  • Pull/build all required images
  • Start docker model runner and ensure the mistral:latest model is available
  • Launch the FastAPI service (api)
  • Run evaluation (tests/test_evaluation.py) and save results into results/

After it finishes, shut everything down with:

docker-compose down

Run Only the Interactive API

If you want to just test queries interactively (via Swagger UI /docs or Thunder Client), run:

docker-compose up --build api

This will start:

  • docker model runner 
  • api (FastAPI service with /query endpoint)

Run Only the CLI

If you want to use the command-line interface, first build it with:

docker-compose up --build cli

This command builds the CLI container but runs it in detached mode inside Docker Desktop. To open and interact with it directly in the terminal, use:

docker-compose run --rm cli

Stop the containers when done with Ctrl+C, then clean up with docker-compose down.

Before running queries, make sure Docker is running and the model is available:

docker exec -it model-runner ollama list

Output Artifacts

After running the full experiment, the following outputs are generated and stored locally:

  • data/demo.db: The read-only SQLite database used for executing validated queries.
  • data/artifacts.log: An append-only log of all user questions, generated SQL, execution status, and results. This ensures auditability and allows repeated queries to be cached.
  • results/detailed_results.csv: A row-by-row record of each query execution in test mode (produced by running test_evaluation.py in the container), including question, model used, generated SQL, latency, and score.

Detailed results showing detailed view of the metrics
Figure 9. Detailed evaluation results: Every query was accurately translated into SQL, with all models achieving full correctness (soft score = 1). Hugging Face models showed consistently lower latency, while Ollama Mistral exhibited higher response times.
  • results/summary_results.csv: An aggregated summary of model performance across all test cases (soft accuracy, total queries, average latency).
Summary of model accuracy and latency
Figure 10. Summary of model accuracy and latency: All models achieved 100% accuracy, with Hugging Face models responding fastest and Ollama Mistral showing the highest latency.


Conclusion

This project shows how natural language queries can be converted into safe SQL statements inside a containerized setup that runs directly through Docker Desktop. Each part of the system, including the prompt builder, SQL firewall, and artifact logger, helps maintain security and traceability. The Docker model runner makes deployment simple, requiring only one click to start.

The results confirm that the approach works as intended. It generates accurate queries, blocks unsafe ones, and keeps detailed logs for review. Overall, it provides a secure and convenient way to access structured data locally while maintaining full control and reliability.

References & Credits

]]>
https://kubetools.io/how-to-build-a-private-text-to-sql-service-with-docker-and-llm-model-runner/feed/ 0
Docker Model Runner & LLMs for AI Keyword Extraction: A Reproducible Benchmark https://kubetools.io/docker-model-runner-llms-for-ai-keyword-extraction-a-reproducible-benchmark/ https://kubetools.io/docker-model-runner-llms-for-ai-keyword-extraction-a-reproducible-benchmark/#respond Wed, 15 Oct 2025 17:48:19 +0000 https://kubetools.io/?p=3651 ,

Core Idea

    • This project solves the challenge of comparing traditional NLP methods against modern LLMs by creating a fully containerized, one-command experiment using Docker Compose.
    • Quantitative analysis of the results shows the LLM’s output is consistently more semantically aligned with the source text, achieving a higher cosine similarity score than the combined baseline methods.
    • The LLM (Mistral 7B) produces a hierarchically structured, conceptual analysis, while the baseline methods (RAKE, TF-IDF, KeyBERT, and Noun Chunking) excel at identifying statistically significant, literal terms.

Table of Contents

  1. Problem & Context
  2. Solution Overview & Architecture
  3. Environment & Prerequisites
  4. Step-by-Step Implementation
  5. Evaluation Metrics
  6. Results
  7. Discussion
  8. Limitations & Next Steps
  9. How to Reproduce
  10. References

Problem & Context

Developers and data scientists face a critical choice: for foundational tasks like keyword extraction, should they use established, lightning-fast Natural Language Processing (NLP) algorithms or pivot to the more powerful, but complex, Large Language Models (LLMs)? While traditional methods like TF-IDF (Term Frequency-Inverse Document Frequency) are predictable and efficient, LLMs promise a deeper, more human-like understanding of text.

Furthermore, AI experiments are notoriously difficult to reproduce. Differences in operating systems, package versions, and model availability can lead to inconsistent results. This project tackles both problems head-on by building a fully containerized experiment to fairly compare these two approaches.

Solution Overview & Architecture

To create a fair and reproducible testing ground, a multi-service application was designed and orchestrated by Docker Compose. The system runs classic NLP methods and a generative LLM in parallel, processes a dataset of text files, and generates a final quantitative comparison.

The architecture is composed of independent but connected containerized services managed by the Docker Engine. A key feature is the use of a health check to ensure the LLM model server is fully initialized before dependent services begin processing, creating a robust, automated workflow. The system also exposes a persistent API for real-time, interactive analysis.

Figure 1 System Architecture: The layered architecture showing the Host Machine, the Docker Engine, and the containerized services it manages, along with volume mounts that link the host filesystem to the containers.

Keyword Extraction Parameters

To ensure a fair comparison, the following parameters were used across the extraction methods:

    • Top-K: The top 15 keywords or phrases were requested from each method (TOP_N_KEYWORDS=15).
    • Deduplication: Uniqueness is enforced within each method. For Noun Chunking, a set is used to store phrases, inherently removing duplicates. For rank-based methods like RAKE and KeyBERT, deduplication is a natural result of the ranking process.
    • Tie Handling: In cases of tied scores, tie-breaking is handled by the default, deterministic behavior of the underlying libraries (e.g., Scikit-learn, KeyBERT), which typically relies on the order of appearance.

API Endpoint Details

The api service provides a real-time endpoint for on-demand analysis.

    • Endpoint: POST /extract
    • Rate Limits: The service does not currently implement authentication or rate-limiting.
    • Request Schema: The endpoint expects a JSON body with a single key.
{ "text": "Your text to be analyzed goes here..." }

  • Response Schema: The endpoint returns a JSON object containing a full comparative analysis.

 

{
    "llm_analysis": {
        "primary_keywords": [...],
        "secondary_keywords": [...],
        "key_phrases": [...],
        "long_tail_phrases": [...],
        "evidence_sentences": [...],
        "confidence": 0.0,
        "must_include": []
    },
    "baseline_analysis": {
        "rake": [...],
        "tfidf": [...],
        "keybert": [...],
        "noun_chunks": [...]
    }
}

    • Example curl Request:

curl -X POST http://localhost:5001/extract \
  -H "Content-Type: application/json" \
  -d '{"text": "A cryptocurrency is a digital currency designed to work through a computer network."}'

Environment & Prerequisites

This experiment is designed to be fully reproducible. The only requirement to run it is a working installation of Docker Desktop. All Python libraries, models, and dependencies are managed within the containerized environment.

System Configuration

    • Tested on: Windows 11 with WSL 2, Docker Desktop 4.46

    • Container OS: python:3.11-slim

    • Key Libraries: scikit-learn=1.5.0, keybert=0.8.0, spacy=3.7.5, ollama=0.2.1

    • Generative Model: mistral:latest (Mistral 7B Instruct) served via the ollama/ollama Docker image.

Dataset Details

The dataset consists of 30 plain text files generated using the Wikipedia API.

    • Domain: The content is sourced from a seed list of topics primarily focused on science, technology, and economics (e.g., “Artificial intelligence,” “Blockchain,” “Quantum computing”).

    • Language: English (en).

    • Document Length: Each document is truncated to a maximum of 300 words.

    • Tokenization: Standard whitespace tokenization is used.

Preprocessing Steps

Before analysis, both the baseline and LLM scripts apply the same minimal preprocessing to the source text:

    • Lowercasing: All text is converted to lowercase.

    • Whitespace Normalization: All newline characters (\n) are replaced with a single space to create a continuous block of text.
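A minimal sketch of that shared preprocessing step might look like this (the function name is illustrative):

def preprocess(text: str) -> str:
    """Lowercase the text and collapse newlines and repeated whitespace into single spaces."""
    return " ".join(text.lower().split())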

LLM Parameters

The LLM is prompted to generate a structured JSON output. The following parameters and schema are used for the API calls:

    • Decoding Parameters:
        • temperature: Two experiments are run. A “medium-low” setting of 0.3 for factual extraction and a “high” setting of 0.9 to observe creative output.

        • top_p: Set to 1.0 for the low-temperature test and 0.9 for the high-temperature test.

        • format: Set to json to enforce structured output from the Ollama server.

    • Prompt Template: The prompt instructs the model to act as a precise text analysis engine and adhere to strict rules, including extracting terms directly from the source text.

    • JSON Schema: The model is required to return its analysis in the following JSON format:

{
  "primary_keywords": ["..."],
  "secondary_keywords": ["..."],
  "key_phrases": ["..."],
  "long_tail_phrases": ["..."],
  "evidence_sentences": ["..."],
  "confidence": 0.0,
  "must_include": []
}
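To show how these parameters come together in practice, here is a minimal sketch of a call using the ollama Python client listed in the dependencies. The host address, prompt wording, and response handling are assumptions for illustration rather than the project’s exact code.

import json
import ollama

# Assumed host: the ollama service as it is reachable on the Docker network
client = ollama.Client(host="http://ollama:11434")

def extract_keywords(text: str, temperature: float = 0.3, top_p: float = 1.0) -> dict:
    """Ask Mistral for a structured keyword analysis and parse the JSON it returns."""
    response = client.chat(
        model="mistral:latest",
        messages=[
            {"role": "system", "content": "You are a precise text analysis engine. "
                                          "Extract terms directly from the source text."},
            {"role": "user", "content": f"Analyze this text and return the agreed JSON schema:\n{text}"},
        ],
        format="json",  # enforce structured output from the Ollama server
        options={"temperature": temperature, "top_p": top_p},
    )
    return json.loads(response["message"]["content"])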

Step-by-Step Implementation

The core of this project is the docker-compose.yml file, which defines the entire application stack.

Figure 2 Workflow Diagram: A flowchart illustrating the sequence of events triggered by docker compose up, from the Ollama health check to the parallel execution of the processors and the final comparison step.

The key to making this work is the healthcheck in the ollama service. This ensures that the other services that depend on the LLM will not even start until the multi-gigabyte Mistral model is fully downloaded and ready to serve requests, which solves a critical race condition.

# docker-compose.yml snippet
services:
  ollama:
    build:
      context: .
      dockerfile: docker/ollama.Dockerfile
    volumes:
      - ollama_data:/root/.ollama
    healthcheck:
      test: ["CMD-SHELL", "ollama list | grep mistral:latest"]
      # ...
  text-processor:
    build:
      context: .
      dockerfile: docker/Dockerfile
    depends_on:
      ollama:
        condition: service_healthy # This container waits for the health check to pass
    # ...

The entire experiment is launched with a single command:

docker compose up --build

Evaluation Metrics

To quantitatively compare the two methods, we defined three metrics calculated by our comparison.py script.

Jaccard Similarity

This metric measures the overlap of unique words between the baseline and LLM keyword sets. It is calculated by dividing the size of the intersection of the two sets by the size of their union. A higher score indicates more shared vocabulary.

J(A,B) = |A ∩ B| / |A ∪ B|
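A direct translation of this formula into code (illustrative, not the exact comparison.py implementation) looks like this:

def jaccard_similarity(baseline_keywords: list[str], llm_keywords: list[str]) -> float:
    """Overlap of unique words between the two keyword sets."""
    # Split multi-word phrases into individual lowercase words before comparing
    a = {word for phrase in baseline_keywords for word in phrase.lower().split()}
    b = {word for phrase in llm_keywords for word in phrase.lower().split()}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)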

Average Phrase Length

This metric measures the complexity and descriptiveness of the keywords by calculating the average number of words per keyword/phrase for each method. A higher average suggests more conceptual and detailed output.
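Expressed as code (again illustrative):

def average_phrase_length(keywords: list[str]) -> float:
    """Average number of words per extracted keyword or phrase."""
    if not keywords:
        return 0.0
    return sum(len(phrase.split()) for phrase in keywords) / len(keywords)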

Semantic Similarity

This metric serves as a proxy for contextual relevance, simulating how a modern search engine might rank the keywords. To compute this, all keywords for a given method are first concatenated into a single string to create a “pseudo-summary.” The sentence-transformers library (all-MiniLM-L6-v2 model) then encodes this summary and the original source text into high-dimensional vectors. This encoding process uses a mean pooling strategy to create a single embedding from the individual word tokens and applies L2 normalization. Finally, the cosine similarity between the two resulting vectors is calculated, yielding a score where a higher value indicates a stronger semantic relationship.
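The computation described above can be sketched with the sentence-transformers library as follows; the function name and the pseudo-summary construction are simplified for illustration.

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 uses mean pooling; normalize_embeddings applies the L2 normalization
model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(keywords: list[str], source_text: str) -> float:
    """Cosine similarity between a 'pseudo-summary' of the keywords and the source text."""
    pseudo_summary = " ".join(keywords)
    embeddings = model.encode([pseudo_summary, source_text], normalize_embeddings=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))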

Results

The script outputs the results to a CSV file located at artifacts/semantic_comparison_results.csv.

Figure 3 Quantitative Comparison Results: A table showing the calculated metrics for a sample of documents, generated from artifacts/semantic_comparison_results.csv.

Visually, the difference in output is immediately apparent. The baseline methods produce flat lists of statistically relevant terms. The LLM produces a structured, hierarchical analysis.

Figure 4 Qualitative Output Comparison: A side-by-side screenshot comparing the raw JSON output from the baseline script and the medium-temperature LLM script for the same document.

Discussion

The results from the quantitative analysis (Figure 3) are revealing. The Jaccard Similarity is consistently low (0.2-0.4), which confirms that the LLM is not merely identifying the same statistically frequent words as the baseline methods; it’s finding different information.

The Average Phrase Length is significantly higher for the LLM, providing numerical evidence that it generates more complex and conceptual phrases. Most importantly, the Semantic Similarity score is consistently higher for the LLM’s output. This indicates that the LLM’s keywords are more contextually and semantically aligned with the overall meaning of the source text, a strong proxy for what a search engine would consider relevant.

The trade-off is performance. The multithreaded baseline script processed 30 files in under a minute, whereas the LLM script took several minutes.

Limitations & Next Steps

This experiment provides a solid foundation but has limitations. We only used one generative model (Mistral 7B) and focused on a single task (keyword extraction).

Next Steps could include:

    • Benchmarking other models: Swapping mistral:latest for other models like llama3 or gemma to compare their analytical quality.

    • Expanding the task: Modifying the prompts to perform other tasks, such as summarization or sentiment analysis.

    • Cloud Deployment: Adapting the API service for deployment to a cloud environment using Kubernetes.

How to Reproduce

This experiment is fully containerized and managed by Docker Compose. The following steps detail the project structure and the commands needed to replicate the entire analysis.

Github: https://github.com/kubetoolsio/docker-llm-runner-keyword-extraction-benchmark.git

Project Structure

The repository is organized with a clear separation between application code, container configurations, and data.

Figure 5 Project File Structure: The organized layout of the repository, showing the clear separation between application code (e.g., api), container configurations (docker/), and data folders (dataset/, output/).

Prerequisite

To replicate this experiment, you only need one piece of software installed:

    • Docker Desktop

Setup & Execution

All Python dependencies listed in the requirements/ files are automatically installed inside their respective containers when you build the images. There is no local pip install required.

To Run the Full Batch Experiment (and generate the final CSV):
This is the primary command. It builds all images, starts the services in the correct order, runs the batch processors (baseline-processor and text-processor), waits for them to finish, runs the final comparison script, and then exits.

docker compose up --build 

After the command finishes, you can run docker compose down to clean up any services that may remain (like the API).

To Run Only the Interactive API:
If you only want to start the API and its Ollama dependency for real-time testing with a tool like Thunder Client, use this command:

docker compose up --build api

To stop the API and Ollama services when you are finished, press Ctrl + C and then run docker compose down.

Output Artifacts

After running the full experiment, the following outputs (artifacts) will be created in your project directory:

    • baseline_outputs/baseline_keywords.json: A single JSON file containing the keywords generated by the classic NLP methods for every document.

    • output/: This directory will be populated with two JSON files per input document, generated by the Mistral LLM with different creativity settings.

    • ./semantic_comparison_results.csv: The final quantitative analysis, comparing the baseline and LLM outputs on key metrics.

References

    • NLTK Project. (n.d.). Natural Language Toolkit Documentation. Retrieved September 25, 2025, from https://www.nltk.org/

]]>
https://kubetools.io/docker-model-runner-llms-for-ai-keyword-extraction-a-reproducible-benchmark/feed/ 0
k0s Joins CNCF Sandbox: A Milestone for Lightweight Kubernetes https://kubetools.io/k0s-joins-cncf-sandbox-lightweight-kubernetes/ https://kubetools.io/k0s-joins-cncf-sandbox-lightweight-kubernetes/#respond Thu, 22 May 2025 15:09:43 +0000 https://kubetools.io/?p=3633 Ever wondered what makes Kubernetes simpler, lighter, and more accessible? Today, we’re thrilled to share some big news: k0s, a lightweight Kubernetes distribution, has officially joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project! This marks a major step forward for k0s and its community, opening doors to collaboration, innovation, and growth within the cloud-native ecosystem.

CNCF logo representing k0s joining the CNCF Sandbox program

What is CNCF and Why Does It Matter?

Before diving into k0s, let’s talk about CNCF. The Cloud Native Computing Foundation is home to the most innovative cloud-native projects, fostering collaboration among developers, operators, and organizations. Joining the CNCF Sandbox program means k0s is now part of this vibrant ecosystem, gaining visibility, community support, and opportunities to grow.

Meet k0s: Lightweight Kubernetes Designed for Everyone

So, what’s k0s all about? It’s a zero-friction Kubernetes distribution built with simplicity and efficiency in mind. Whether you’re a developer spinning up clusters on your laptop or an operator managing edge computing environments, k0s has something for you. Let’s break down its standout features:

  • Lightweight and Minimal: k0s is a single binary that encapsulates all necessary components, making installation and maintenance effortless.
  • Fully Open Source: Transparency and community-driven innovation are at its core.
  • Optimized for Data Center to Edge Use Cases: Its compact footprint makes it perfect for centralized data centers and resource-constrained environments.
  • Simplified Management: Despite its simplicity, k0s retains full Kubernetes functionality.

From AI inference engines at the edge to large-scale production clusters, k0s has proven itself as a versatile and reliable solution.

Why CNCF Sandbox is a Big Deal for k0s

The CNCF Sandbox is the entry point for promising early-stage projects in the cloud-native world. Here’s why this milestone matters:

  • Community Support: k0s now has access to a broad and engaged community of developers and users.
  • Visibility: Greater exposure to end users, contributors, and organizations.
  • Collaboration Opportunities: The chance to work closely with other CNCF projects, fostering interoperability and innovation.

By joining CNCF, k0s is positioned to refine its capabilities with invaluable feedback and contributions from the community.

How k0s Makes Kubernetes Simpler and More Accessible

Lightweight Design

k0s eliminates complexity with its single binary approach. This means fewer moving parts, easier installation, and streamlined maintenance. Imagine setting up Kubernetes without worrying about dependencies—it’s a game-changer.

Open Source Philosophy

Built with transparency, k0s thrives on community-driven innovation. Open governance ensures that anyone can contribute, making it a truly collaborative project.

Edge Computing Ready

k0s shines in resource-constrained environments, making it ideal for edge computing. Whether you’re deploying AI inference engines or managing IoT devices, k0s delivers efficiency without compromise.

Simplified Management

Despite its lightweight design, k0s retains full Kubernetes functionality. Operators can focus on what matters without getting bogged down by unnecessary complexity.

k0s in Action: Use Cases

Here’s where k0s truly stands out:

  • Developer-Friendly: Spin up clusters on your laptop with minimal hassle.
  • Data Centers: Run centralized, large-scale production clusters efficiently.
  • Edge Computing: Manage resource-constrained environments with ease.

Join the Journey: How You Can Get Involved

This milestone wouldn’t have been possible without the incredible k0s community. If you’re new to k0s, now’s the perfect time to dive in:

  • Try k0s: Download and get started with k0s here.
  • Contribute: Check out the GitHub repository and join the community discussions.
  • Spread the Word: Share your experiences with k0s and help grow the community.

Conclusion

This milestone marks the beginning of an exciting new chapter for k0s. By joining CNCF, k0s is set to make Kubernetes simpler, lighter, and more powerful for everyone. Whether you’re a developer, operator, or edge computing enthusiast, there’s a place for you in the k0s community. Let’s shape the future of Kubernetes together!

FAQ

What is k0s Kubernetes?

k0s is a lightweight, zero-dependencies Kubernetes distribution designed for simplicity and efficiency. It’s fully open-source and optimized for both data centers and edge computing.

Why is CNCF Sandbox important for k0s?

Joining the CNCF Sandbox gives k0s access to a vibrant community, greater visibility, and collaboration opportunities, helping it grow and refine its capabilities.

What are the key features of k0s?

k0s stands out for its lightweight design, open-source philosophy, edge computing readiness, and simplified management—all while retaining full Kubernetes functionality.

How can I contribute to the k0s project?

You can contribute by exploring the GitHub repository, joining community discussions, and sharing your experiences with k0s.

What are the benefits of using k0s for edge computing?

k0s is optimized for resource-constrained environments, making it ideal for edge computing tasks like AI inference engines and IoT device management.

]]>
https://kubetools.io/k0s-joins-cncf-sandbox-lightweight-kubernetes/feed/ 0
Running Local LLMs with Docker Model Runner: A Deep Dive with Full Observability and Sample Application https://kubetools.io/running-local-llms-with-docker-model-runner-a-deep-dive-with-full-observability-and-sample-application/ https://kubetools.io/running-local-llms-with-docker-model-runner-a-deep-dive-with-full-observability-and-sample-application/#respond Tue, 06 May 2025 05:26:25 +0000 https://kubetools.io/?p=3579

 

Introduction

 

In this blog post, we’ll explore how developers and teams can speed up development, debugging, and performance analysis of AI-powered applications by running models locally—using tools like Docker Model Runner, MCP (Model Context Protocol), and an observability stack.

 

Running everything locally not only removes the need for costly cloud calls during development, but also gives you production-like visibility into your system—so you can catch issues early, understand latency, analyze errors, and optimize performance before shipping anything.

 

One key part of this setup is MCP, a simple but powerful middleware layer that connects your frontend or APIs to local AI models. For example, in a document analysis app, the MCP server handles incoming requests, extracts content from files (like PDFs), and sends prompts to the local model running inside a Docker container. Combined with observability tools (like OpenTelemetry, Jaeger, and Prometheus), this creates a self-contained environment that feels like production—just without the cost or complexity.

 


 

Why Are Traces and Metrics Important for LLM Applications?

Challenge Explanation
Non-determinism The same input can produce different outputs due to randomness in LLMs.
Subjective Quality Quality is not just about being correct—it includes tone, relevance, and coherence, which are harder to measure.
Multiple Processing Steps LLM apps often involve several steps (e.g., input processing → model call → post-processing), making it harder to track what’s slow or broken.
Resource Usage LLMs can be very heavy on CPU, GPU, memory, and storage, especially when running locally.
Cost Token usage costs for cloud models or hardware/infrastructure costs for local models can add up quickly.
Concurrency As user volume increases, it becomes important to monitor how well the system handles multiple requests at once without degrading performance.
Observability Value Traces and metrics help developers understand performance, detect errors, control costs, and manage scalability in a reliable and informed way.

 


 

 Traces

  • Full Request Trace: Tracks the journey of a user request through different parts of the system. Helps measure total latency and identify which step (e.g., input handling, model processing) is slow.
  • Backend Processing Span: Measures time spent handling the logic in the backend service. Shows how the backend handles concurrent requests.
  • Input Processing Span: Tracks time taken for tasks like parsing, formatting, or validation before sending to the model. Useful for optimizing under high concurrency when pre-processing queues build up.
  • Model Inference Span: Measures how long it takes the model to respond to a given prompt or input. Useful for tuning batching or managing queueing when concurrency is high.
  • Output Handling Span: Measures time for post-processing (e.g., formatting output). Ensures that final steps are efficient.
  • Input/Output Attributes: Stores prompt, response, token count, etc. for each request span. Useful for correlating long inputs or outputs with performance drops.
  • Error Traces: Captures when and where errors occur (e.g., failed model call, input error). Helps diagnose issues that might only occur under concurrency stress (e.g., timeouts, rate limits).

 


 

 Metrics

  • Request Latency (p50/p90/p99): Time taken to complete a request at different percentiles. Tracks how fast the system is, and how speed degrades under load.
  • Throughput (Requests/sec): Number of requests the system can handle per second. Critical for understanding how concurrency affects system load.
  • Error Rate (%): Percentage of requests that fail or return errors. Helps detect instability or bugs.
  • Resource Usage (CPU, GPU, Memory): How much system resources are being consumed. Helps in scaling decisions and resource optimization.
  • Token Usage: Number of tokens processed in requests (input and output). Useful for cost tracking and understanding load.
  • Quality Scores: Metrics that measure relevance, accuracy, or usefulness of responses. Helps ensure output quality stays high under different loads.
  • User Feedback: Ratings or other direct user opinions. Detects satisfaction trends and also helps in understanding production datasets for training or fine-tuning.
  • Safety/Compliance Scores: Measures sensitive data exposure or policy violations. Ensures safe operation.

 

Concurrency 

  • In Traces: Concurrency issues can show up as overlapping spans, delayed model responses, or backend queuing delays.

  • In Metrics: Look for increased p99 latency, rising error rates, or CPU/GPU spikes when traffic increases.

  • Why It Matters: As LLM apps scale, tracking how multiple simultaneous users affect performance, quality, and stability becomes critical.

 


 

Let’s Build !!

 

We’re creating a Todo web application that:

  • Allows users to upload PDF documents

  • Extracts text from these documents

  • Uses a locally running LLM to analyze the content

  • Provides insights and summaries about the document

  • Enables chat-based interaction with the document content

  • Includes comprehensive monitoring and observability

 

The Technology Stack

 

Local AI Development Workflow (with MCP + Docker + Observability)

 

[ User Frontend / API ]
|
v
┌────────────────────┐
│ MCP Server │ ◄── Observability: Traces, Logs, Metrics
└────────────────────┘
|
v
┌────────────────────────────┐
│ Local Docker Model Runner │ ◄── LLM (e.g., LLaMA, Mistral)
└────────────────────────────┘
|
v
┌────────────────────────────┐
│ Observability Tools │ ◄── Jaeger, Prometheus, Grafana
└────────────────────────────┘

Key components:

  • Frontend/API: Triggers an analysis request.

  • MCP Server: Extracts, formats, and sends data to the model; acts as a smart controller.

  • Docker Model Runner: Hosts the LLM locally and responds to prompts.

  • Observability Layer: Collects performance data, traces, and error logs across all steps.

 

Setting Up the Environment

 

1. Docker Model Runner

 

Docker Desktop now includes Model Runner, which allows you to run AI models locally without depending on external API services.

 

# Enable Docker Model Runner
$ docker desktop enable model-runner

# Pull the Llama 3 model
$ docker model pull ai/llama3.2:1B-Q8_0

# Verify the model is available
$ docker model list

2. Project Structure

 

Project Github Skeleton:

https://github.com/kubetoolsca/docker-model-runner-observability

├── backend/                # Express backend
│   ├── routes/             # API routes
│   │   └── document.js     # Document processing endpoints
│   ├── observability.js    # OpenTelemetry setup
│   ├── server.js           # Express server setup
│   └── Dockerfile          # Backend container config
├── src/                    # React frontend
│   ├── components/         # UI components
│   └── App.tsx             # Main application component
├── observability/          # Observability configuration
│   ├── otel-collector-config.yaml  # OpenTelemetry Collector config
│   ├── prometheus.yml      # Prometheus config
│   └── grafana/            # Grafana dashboards
└── docker-compose.yml      # Multi-container orchestration


 

Backend

 

The backend service handles document uploads, text extraction, and communication with the local LLM via Docker Model Runner.

 

Document Routes

 

The document.js routes file handles two main operations:

  • /analyze – Upload and analyze a document

  • /chat – Chat with a document that has already been analyzed

The document analysis flow works like this:

  • Receive the uploaded PDF file

  • Store it temporarily

  • Extract text using pdf-parse

  • Send extracted text to the local LLM for analysis

  • Return results to the user


 

Docker Model Runner Integration

The challenging part of this implementation was connecting to Docker Model Runner correctly.

Model Runner exposes an OpenAI-compatible API, which means we need to format our requests accordingly:

// Example of calling the Model Runner API
const response = await axios.post(
  `${baseUrl}/chat/completions`,
  {
    model: targetModel,
    messages: [
      {
        role: "system",
        content: "You are a helpful document analysis assistant."
      },
      {
        role: "user",
        content: `Analyze this document: ${extractedText}`
      }
    ],
    temperature: 0.7,
    max_tokens: 1024
  },
  {
    headers: { 'Content-Type': 'application/json' }
  }
);

 

The key to making this work is:

  • Using the correct hostname: model-runner.docker.internal

  • Using the correct endpoint path: /engines/v1/chat/completions

  • Formatting the request body to match the OpenAI chat completions API

  • Properly handling the response structure

 

Additionally, we implemented multiple fallback mechanisms to ensure our application stays responsive even if the Model Runner service is unavailable:

  • Multiple endpoint URLs to try (for different Docker configurations)

  • Graceful error handling with useful feedback

  • Extraction-only mode when AI services are unavailable

Setting Up the Observability Stack

 

A key aspect of our sample application is the observability stack, which helps monitor the system’s performance and identify issues.

 

OpenTelemetry Configuration

 

The observability.js file sets up OpenTelemetry in the Node.js backend:

function setupObservability(serviceName = 'document-analysis-service') {
  const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: serviceName,
  });

  // Configure OTel exporter
  const traceExporter = process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ? new OTLPTraceExporter({
        url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces`,
      })
    : undefined;

  const sdk = new NodeSDK({
    resource,
    traceExporter,
    instrumentations: [
      getNodeAutoInstrumentations({
        '@opentelemetry/instrumentation-http': { enabled: true },
        '@opentelemetry/instrumentation-express': { enabled: true },
      }),
    ],
  });

  sdk.start();

  // Setup process exit handlers
}


 

OpenTelemetry Collector

 

The OpenTelemetry Collector acts as a central aggregation point for our observability data. Our configuration in otel-collector-config.yaml defines:

  • Receivers: How data enters the collector (OTLP over HTTP)

  • Processors: How data is processed (batching)

  • Exporters: Where data is sent (Prometheus and Jaeger)

  • Pipelines: How data flows through the collector

receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: document_analysis
  logging:
    verbosity: detailed
  otlp:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, prometheus]


 

Visualization with Grafana

Grafana provides dashboards to visualize the metrics and traces.

The pre-configured dashboard includes:

  • Document analysis request metrics

  • Response time statistics

This gives visibility into:

  • How many documents are being processed

  • How long analysis takes

  • Success and failure rates

Containerizing the Application

The docker-compose.yml file orchestrates all services:

services:
  # Frontend Application
  frontend:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8080:8080"

  # MCP Server (Backend)
  mcp-server:
    build: ./backend
    ports:
      - "3000:3000"
    environment:
      - DMR_API_ENDPOINT=http://model-runner.docker.internal/engines/v1
      - TARGET_MODEL=ai/llama3.2:1B-Q8_0
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml

  # Jaeger for trace visualization
  jaeger:
    image: jaegertracing/all-in-one

  # Prometheus for metrics storage
  prometheus:
    image: prom/prometheus

  # Grafana for dashboards
  grafana:
    image: grafana/grafana

The Frontend Components

 

 Key components:

  • DocumentUploader: Handles file uploads with drag-and-drop

  • DocumentAnalysisResult: Displays analysis results

  • ChatInterface: Allows users to chat about the document

 

Challenges and Solutions

 

1. Docker Model Runner Connectivity

 

Challenge: The backend container couldn’t connect to the Docker Model Runner service.

Solution: Use the hostname that Docker Desktop exposes for the Model Runner:

  • model-runner.docker.internal

2. Error Handling and Fallbacks

 

Challenge: AI services can be unreliable or unavailable.

Solution: Implement graceful degradation:

  • Always provide basic text extraction even if analysis fails
  • Clear error messages for debugging
  • Multiple endpoint fallbacks

3. Observability Integration

 

Challenge: Tracking performance across multiple services.

Solution: OpenTelemetry integration with:

  • Automatic instrumentation for HTTP and Express

  • Custom span creation for important operations

  • Centralized collection and visualization

 

Performance Observations

 

With this implementation, users can observe for example:

  • Processing Speed: PDFs under 10MB typically process in 1-3 seconds for text extraction

  • AI Analysis Time: The local Llama 3 model analysis takes 3-8 seconds depending on document length

  • Memory: The backend uses approximately 150-250MB RAM

  • Response Size: Analysis results average 1-5KB of text


 

Conclusion

 

This implementation demonstrates how an AI application can be built and tested using locally run models via Docker Model Runner and MCP. The addition of OpenTelemetry, Jaeger, Prometheus, and Grafana provides comprehensive visibility into the application’s performance and the behavior of the local models.

 

By running LLMs locally with Docker Model Runner, we get:

  • Privacy: Document data never leaves your infrastructure

  • Cost efficiency: No per-token API charges

  • Reliability: No dependency on external API availability

  • Control: Choose appropriate models for your use case

 

The observability stack also provides valuable insights into:

  • Performance bottlenecks: Identify slow components

  • Error patterns: Detect recurring issues

  • Resource utilization: Optimize container resources

  • User behavior: Understand how the application is used


 

Next Steps

To extend this setup, consider:

  • Supporting more file formats (DOC, DOCX, TXT)
  • Fine-tuning the LLM for specific document types
  • Use a benchmarking tool like locust / K6
  • Sky is the limit 🙂

]]>
https://kubetools.io/running-local-llms-with-docker-model-runner-a-deep-dive-with-full-observability-and-sample-application/feed/ 0
Demystifying Kubernetes Logs https://kubetools.io/demystifying-kubernetes-logs/ https://kubetools.io/demystifying-kubernetes-logs/#respond Thu, 05 Dec 2024 00:33:22 +0000 https://kubetools.io/?p=3437

Introduction

Kubernetes has become the cornerstone of containerized application orchestration, but with great power comes the challenge of managing the logs it generates. These logs hold critical information for debugging, monitoring, and improving application performance. In this blog, we will demystify Kubernetes logs, explore their types, collection methods, and best practices.


What are Kubernetes Logs?

Logs in Kubernetes capture events and messages generated by various components, such as containers, Pods, nodes, and the Kubernetes control plane. They provide insights into:

  1. Application Behavior: Errors, warnings, and performance metrics.
  2. System Events: Node and Pod status updates.
  3. Debugging Information: Diagnosing issues at the application or system level.

Types of Kubernetes Logs

  1. Application Logs:
    • Generated by containers running inside Pods.
    • Includes stdout and stderr streams.
    • Useful for tracking app-specific issues.
  2. Cluster Logs:
    • Captures events at the cluster level, such as resource scheduling and state changes.
    • Includes logs from components like the API server, kubelet, and scheduler.
  3. Node Logs:
    • Includes system-level logs for monitoring hardware and operating system events.

How Kubernetes Logs Work

Each container in Kubernetes writes logs to its local filesystem or streams them to stdout and stderr. Kubernetes captures these logs, making them available through kubectl commands:

kubectl logs <pod-name>

However, logs are ephemeral—they disappear when a container is terminated or a Pod is deleted. This is where log collection systems come into play.


Log Collection and Management

  1. Manual Collection:
    Access logs using kubectl logs. While simple, this method isn’t scalable for large clusters.
  2. Centralized Log Aggregation:
    Tools like Fluentd, Elasticsearch, and Loki are commonly used for aggregating logs.
    • Fluentd collects logs from Pods and forwards them to a storage backend.
    • Elastic Stack (ELK) provides powerful querying and visualization capabilities.
  3. Cloud-Based Logging Solutions:
    Platforms like Google Cloud Logging and AWS CloudWatch integrate seamlessly with Kubernetes clusters.

Best Practices for Managing Kubernetes Logs

  1. Standardize Logging Formats:
    Use structured formats like JSON for easier parsing and analysis.
  2. Set Retention Policies:
    Configure log storage with appropriate retention periods to balance storage costs and compliance requirements.
  3. Leverage Labels and Metadata:
    Use Kubernetes labels (e.g., app, environment) to filter logs effectively.
  4. Monitor Log Volume:
    Keep an eye on log sizes and set limits to avoid overwhelming your infrastructure.

Debugging with Kubernetes Logs

Logs are invaluable for diagnosing issues in your cluster. Common commands include:

  • Viewing logs from all containers in a Pod: kubectl logs <pod-name> --all-containers
  • Debugging with specific timestamps: kubectl logs <pod-name> --since=1h


Conclusion

Understanding and managing Kubernetes logs is essential for maintaining robust and reliable applications. By implementing best practices and leveraging log aggregation tools, you can simplify troubleshooting, enhance observability, and ensure your Kubernetes environment runs smoothly.

]]>
https://kubetools.io/demystifying-kubernetes-logs/feed/ 0
What are Kubernetes Pods? https://kubetools.io/what-are-kubernetes-pods/ https://kubetools.io/what-are-kubernetes-pods/#respond Thu, 05 Dec 2024 00:33:18 +0000 https://kubetools.io/?p=3430

Introduction

In the world of Kubernetes, Pods are the smallest and simplest deployable units. They play a crucial role in running and managing containerized applications effectively. Whether you’re new to Kubernetes or a seasoned developer, understanding Pods is essential to leveraging the full power of this container orchestration platform. This blog will take you through everything you need to know about Pods, their structure, and why they are so critical.

What is a Kubernetes Pod?

A Pod is a collection of one or more containers that are deployed together on the same host. They share the same network namespace, storage volumes, and can communicate with each other using localhost. While Kubernetes can manage individual containers, Pods provide an abstraction that simplifies the deployment and scaling of applications.

Key Characteristics of Pods

Multi-Container Deployment:
A Pod can run a single container or multiple tightly coupled containers. For example, a Pod might host a web application container alongside a logging or monitoring container.

Shared Network:
Containers within a Pod share the same IP address and port space, which allows seamless inter-container communication.

Shared Storage:
Pods can share storage volumes, enabling data persistence and collaboration between containers.

Ephemeral by Design:
Pods are designed to be temporary. Kubernetes may terminate and replace Pods as part of scaling, updating, or recovery processes.

Pod Use Cases

Single-Container Pods:
Most Pods run a single container. This approach simplifies deployment and isolates application functionality.

Multi-Container Pods:
When containers need to work closely, like a sidecar pattern, they are deployed in the same Pod.

For example:

  • A web server container and a caching container.
  • A main application and a helper container for logging.

Components of a Kubernetes Pod

  1. Containers: The main applications running in the pod.
  2. Shared Storage: Volumes mounted for data persistence and sharing.
  3. Networking: Each pod gets a unique IP, and its containers share the same port space.
  4. Pod Specifications: Defined in a YAML file, the pod spec includes details like container images, ports, and resource requests.

How Pods Work in Kubernetes

Creation
Pods are created using YAML manifests, which define the container image, resources, and networking settings. For example:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: nginx-container
    image: nginx

To create the Pod shown above, save the manifest to a file (for example, my-pod.yaml) and apply it with the following command:

kubectl apply -f my-pod.yaml

Lifecycle Management:
Kubernetes ensures that the desired state of Pods matches the actual state. It uses controllers like Deployments, StatefulSets, and ReplicaSets to manage Pods.

Communication:
Pods can communicate with each other through services. Each Pod is assigned a unique IP address within the cluster, and services enable communication across Pods.

Scaling:
Pods can be scaled horizontally by creating multiple replicas through a Deployment.

Pod Lifecycle

  1. Pending: The pod is created but not yet scheduled on a node.
  2. Running: The pod is successfully scheduled, and containers are running.
  3. Succeeded: The pod completed execution (for pods with jobs).
  4. Failed: The pod failed to run.
  5. Unknown: The state of the pod is unclear due to communication issues.

Why are Pods Important?

Simplicity and Modularity:
Pods abstract the complexity of managing individual containers, making deployments more manageable.

Flexibility:
They support multi-container patterns, allowing developers to package and deploy applications with dependencies.

Integration with Kubernetes Features:
Pods integrate seamlessly with Kubernetes features like auto-scaling, rolling updates, and service discovery.

Common Challenges with Pods

Ephemeral Nature: Pods are short-lived. To handle persistence, developers must configure volumes and use StatefulSets where necessary.

Networking Overhead: Pods rely on Kubernetes networking, which can introduce latency and require proper configuration.

Conclusion

Kubernetes Pods are the fundamental units that enable efficient management of containerized workloads. By grouping containers together with shared resources and networking, Pods simplify application deployment, scaling, and orchestration. Whether you’re building a simple application or deploying complex microservices, mastering Pods is the first step toward Kubernetes expertise.

]]>
https://kubetools.io/what-are-kubernetes-pods/feed/ 0