DevOps Conference & Camps | https://devopscon.io/ | Mon, 16 Feb 2026 15:59:53 +0000

How Developer Platforms Fail (And How Yours Won’t)
https://devopscon.io/blog/how-developer-platforms-fail/ | Mon, 16 Feb 2026 15:23:01 +0000 | https://devopscon.io/?p=209936

Platform engineering promises developer productivity and operational efficiency, yet many teams find themselves sliding into the Trough of Disillusionment. If your company is growing past 50 people, you’re likely already building a platform to scale, but are you doing it right?

The post How Developer Platforms Fail (And How Yours Won’t) appeared first on DevOps Conference & Camps.


Watch the full keynote below:


Russel unpacks the myths surrounding IDPs. Using real-world examples and a philosophical lens, he explores the critical choices that separate successful platforms from the ones developers refuse to touch.

Here’s what you’ll learn:

  • Myth vs. Reality: Which platform engineering promises actually hold water?
  • The Franken-platform: How to avoid building a tool no one wants to use.
  • Scaling Safely: Strategies for cost-effective growth in teams of 50+.



Building Scalable CI/CD for a Go Monorepo Using GitLab & Go Templates
https://devopscon.io/blog/simplifying-your-devops-lifecycle/ | Thu, 12 Feb 2026 13:30:27 +0000 | https://devopscon.io/?p=209879

This article showcases a step-by-step guide to an extensible and production-ready approach to CI/CD pipelines. Utilizing a Go monorepo, GitLab CI, and Go templating has several key advantages, including seamless integration with GitLab, a flexible template engine, and declarative pipeline logic.

The post Building Scalable CI/CD for a Go Monorepo Using GitLab & Go Templates appeared first on DevOps Conference & Camps.


1. Introduction

In modern software development, the adoption of monorepos has become increasingly common, especially in organizations practicing microservices architecture. A monorepo simplifies dependency management, enforces consistency across services, and improves developer velocity. However, the operational complexity of setting up robust CI/CD pipelines for each service within the monorepo can be daunting.

This article presents an advanced CI/CD architecture built with Go and GitLab CI, leveraging the power of Go templates for dynamic pipeline generation. It examines two codebases:

  • monorepo-go-main: A monorepo housing multiple Go services.
  • ci-pipeline-main: A utility written in Go that reads configuration and uses templates to generate .gitlab-ci.yml files dynamically.

The approach combines the flexibility of Go with the power of GitLab CI to deliver a scalable, maintainable, and DRY CI/CD pipeline, making it a strong candidate for large teams and enterprise-grade systems.

2. Background and Motivation

Monorepos consolidate multiple services into a single repository, which makes code sharing, refactoring, and dependency upgrades much easier. However, the CI/CD implications of this approach are non-trivial. Traditionally, every service might come with its own .gitlab-ci.yml or Jenkinsfile, leading to excessive duplication, inconsistencies, and increased maintenance overhead.

Moreover, teams often face challenges like:

  • Keeping pipeline definitions in sync across services
  • Managing build dependencies across a shared module
  • Avoiding unnecessary rebuilds when only one service changes
  • Onboarding new services without duplicating configuration

The motivation behind this setup is to address these challenges by introducing a highly modular, configuration-driven pipeline generation mechanism.


3. Architecture Overview

The entire system revolves around a configuration-driven template engine that generates a unified CI/CD pipeline per commit. Each microservice in the monorepo is treated as an autonomous unit that can be built, tested, and deployed independently within the GitLab ecosystem.

Architectural Goals

  • Declarative Configuration: Services are defined in pipeline_config.yaml.
  • Templated Pipelines: A Go template is used to render the CI file.
  • Service Autonomy: Each service has its own isolated testing block.
  • Code Generation: The CI pipeline is generated programmatically.
  • Dockerization: A Dockerfile encapsulates the CI generation process for portability.

This architecture ensures that onboarding a new service is a one-line change in the config file, without needing to manually edit YAML logic.

4. Deep Dive: monorepo-go-main

Folder Structure

monorepo-go-main/
├── .gitlab-ci.yml
├── go.mod
├── pipeline_config.yaml
├── README.md
├── service-1/
│   ├── main.go
│   └── addition_test.go
├── service-2/
│   ├── main.go
│   └── subtraction_test.go
└── service-3/
    ├── main.go
    └── multiplication_test.go

Module Definition (go.mod)

module monorepo-go-main

go 1.20

This ensures all services share the same Go toolchain and dependency tree, promoting consistency in build behavior across the repo.

pipeline_config.yaml

services:
  - name: service-1
  - name: service-2
  - name: service-3

This configuration file becomes the single source of truth. You can imagine a scenario where this config file is automatically generated or updated by a higher-level orchestration tool or service discovery mechanism.
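
As a sketch of that idea, a small Go program could derive the services list from the repository layout itself. The "service-*" directory-naming convention and the exact output shape below are assumptions for illustration, not part of the original setup:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// serviceConfig renders a pipeline_config.yaml body for the given
// service names.
func serviceConfig(names []string) string {
	var b strings.Builder
	b.WriteString("services:\n")
	for _, n := range names {
		fmt.Fprintf(&b, "  - name: %s\n", n)
	}
	return b.String()
}

func main() {
	// Hypothetical convention: every top-level "service-*" directory
	// in the monorepo is treated as a service.
	entries, err := os.ReadDir(".")
	if err != nil {
		panic(err)
	}
	var names []string
	for _, e := range entries {
		if e.IsDir() && strings.HasPrefix(e.Name(), "service-") {
			names = append(names, e.Name())
		}
	}
	fmt.Print(serviceConfig(names))
}
```

Run from the monorepo root, this would regenerate the config file on demand, keeping it in lockstep with the directory layout.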

Example Service: service-1

package main

import "fmt"

func add(a, b int) int {
	return a + b
}

func main() {
	fmt.Println("Addition Result:", add(5, 3))
}

Each service is a minimal but functional Go program with unit tests. This structure makes testing and deployment granular and independent.
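
As an illustration, the accompanying addition_test.go could be as small as the following. Only the file name appears in the original layout, so the specific assertions are assumptions:

```go
package main

import "testing"

// add is duplicated here so the sketch compiles stand-alone; in the
// repo it lives in service-1/main.go.
func add(a, b int) int {
	return a + b
}

// TestAdd is the kind of check service-1/addition_test.go would hold.
func TestAdd(t *testing.T) {
	if got := add(5, 3); got != 8 {
		t.Fatalf("add(5, 3) = %d, want 8", got)
	}
	if got := add(-2, 2); got != 0 {
		t.Fatalf("add(-2, 2) = %d, want 0", got)
	}
}
```

Because each service directory carries its own tests, `go test ./...` run from inside a service only exercises that service, which is what makes the per-service CI jobs below meaningful.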

5. Dynamic Pipeline Generator: ci-pipeline-main

Folder Layout

ci-pipeline-main/
├── .gitlab-ci.yml (optional example)
├── Dockerfile
├── gitlab-ci.tmpl
└── gitlab.go

This utility acts as a CLI tool to be run locally or inside CI, generating the full .gitlab-ci.yml file based on service declarations.

Understanding gitlab.go

package main

import (
	"log"
	"os"
	"text/template"

	"gopkg.in/yaml.v2"
)

type Config struct {
	Services []struct {
		Name string `yaml:"name"`
	} `yaml:"services"`
}

func main() {
	// Read the service declarations maintained in the monorepo.
	data, err := os.ReadFile("../monorepo-go-main/pipeline_config.yaml")
	if err != nil {
		log.Fatalf("reading config: %v", err)
	}
	var config Config
	if err := yaml.Unmarshal(data, &config); err != nil {
		log.Fatalf("parsing config: %v", err)
	}
	// Render the CI template and write the result to stdout, where it
	// can be redirected into .gitlab-ci.yml.
	tmpl, err := template.ParseFiles("gitlab-ci.tmpl")
	if err != nil {
		log.Fatalf("parsing template: %v", err)
	}
	if err := tmpl.Execute(os.Stdout, config); err != nil {
		log.Fatalf("rendering pipeline: %v", err)
	}
}

This program does three things:

  1. Reads the YAML config for services
  2. Parses the template file
  3. Outputs the populated CI pipeline

Why Go Templates?

Go’s text/template package provides a safe, readable, and powerful tool for creating infrastructure as code. It supports loops, conditions, and functions. Unlike Jinja2 or other alternatives, it integrates naturally into Go-based workflows.


6. Template Language in Action

Template: gitlab-ci.tmpl

stages:
  - test

{{ range .Services }}
{{ .Name }}:
  stage: test
  script:
    - cd {{ .Name }}
    - go test ./...
{{ end }}

This loops over all services and creates individual test jobs.

Example Output

stages:
  - test

service-1:
  stage: test
  script:
    - cd service-1
    - go test ./...

service-2:
  stage: test
  script:
    - cd service-2
    - go test ./...

This concise pattern removes redundancy and encourages consistency.

7. Dockerfile: Portable Generator

Dockerfile

FROM golang:1.20

WORKDIR /app
COPY . .
# Assumes a go.mod/go.sum in the build context declaring gopkg.in/yaml.v2;
# module-mode builds on modern Go need the dependencies resolved first.
RUN go mod download && go build -o generator gitlab.go
ENTRYPOINT ["./generator"]

The Dockerfile ensures that the generator can run anywhere, which is crucial for reproducible builds in CI environments.

8. Workflow Summary

End-to-End Usage

  1. Clone both repositories.
  2. Add new services to monorepo-go-main/service-X.
  3. Update pipeline_config.yaml with the new service.
  4. Run the Go generator locally or in your CI system.
  5. GitLab executes the generated .gitlab-ci.yml.

This allows pipelines to evolve alongside the code without introducing errors or regressions.

Advantages

  • 🚀 No YAML duplication.
  • ⚙ Configuration is code.
  • 🔒 Easier to audit and validate.
  • 🧪 Unit testable pipeline logic.

9. Scaling the Pattern

Real-World Scenario

Imagine scaling from 3 to 50 microservices. Traditional approaches would involve maintaining 50 .gitlab-ci.yml blocks — one per service. With this generator:

  • The pipeline_config.yaml grows by 1 line per service.
  • No extra CI logic is added manually.
  • The Go code and template remain unchanged.

Adding New Capabilities

You can easily extend the template:

  • Add linting stages (golangci-lint)
  • Build docker images per service
  • Deploy to separate Kubernetes namespaces
  • Cache modules or test results
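
For instance, a per-service lint job could be templated the same way. This is a sketch, not part of the original template: it assumes a lint stage is added to the stages list and that golangci-lint is available in the runner image.

```
stages:
  - lint
  - test

{{ range .Services }}
lint-{{ .Name }}:
  stage: lint
  script:
    - cd {{ .Name }}
    - golangci-lint run
{{ end }}
```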

10. Observability & Auditing

Logging Pipeline Execution

You can enhance the script with logging outputs to file or console.

Slack Notifications

Add notification blocks to the template:

  after_script:
    - curl -X POST --data "job $CI_JOB_NAME finished" $SLACK_HOOK

Static Analysis

You can write a linter that validates the final .gitlab-ci.yml for syntax and best practices.
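
As a sketch of such a check using only the standard library, the rules enforced here are illustrative, not an established best-practice set; a real linter would parse the YAML properly:

```go
package main

import (
	"fmt"
	"strings"
)

// validatePipeline is an illustrative lint pass over generated CI text:
// it requires at least one job and a script section per job.
func validatePipeline(ci string) error {
	jobs, scripts := 0, 0
	for _, line := range strings.Split(ci, "\n") {
		trimmed := strings.TrimSpace(line)
		switch {
		// A top-level key other than "stages:" is treated as a job.
		case strings.HasSuffix(trimmed, ":") && !strings.HasPrefix(line, " ") && trimmed != "stages:" && trimmed != ":":
			jobs++
		case trimmed == "script:":
			scripts++
		}
	}
	if jobs == 0 {
		return fmt.Errorf("no jobs defined")
	}
	if scripts < jobs {
		return fmt.Errorf("%d of %d jobs missing a script section", jobs-scripts, jobs)
	}
	return nil
}

func main() {
	ci := "stages:\n  - test\n\nservice-1:\n  stage: test\n  script:\n    - go test ./...\n"
	if err := validatePipeline(ci); err != nil {
		panic(err)
	}
	fmt.Println("pipeline OK")
}
```

Such a check can run as a final CI step, failing the build before a malformed pipeline ever reaches GitLab.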

11. Testing the Template Engine

Unit Testing in Go

Write test cases for the generator:

// In gitlab_test.go, alongside gitlab.go (imports: bytes, strings,
// testing, text/template):
func TestPipelineGeneration(t *testing.T) {
	config := Config{
		Services: []struct {
			Name string `yaml:"name"`
		}{{Name: "svc1"}, {Name: "svc2"}},
	}
	tmpl, err := template.ParseFiles("gitlab-ci.tmpl")
	if err != nil {
		t.Fatalf("parsing template: %v", err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, config); err != nil {
		t.Fatalf("rendering template: %v", err)
	}
	if !strings.Contains(buf.String(), "svc1") {
		t.Fatal("expected svc1 in output")
	}
}

This ensures the integrity of CI logic as templates evolve.

12. Conclusion

This article showcases a robust, extensible, and production-ready approach to managing CI/CD pipelines in a Go monorepo using GitLab CI and Go templating. It combines configuration as code with portable, repeatable pipeline generation.

The key advantages:

  • Declarative and DRY pipeline logic
  • Portable via Docker
  • Flexible and extensible template engine
  • Seamlessly integrates with GitLab

Whether you’re in a startup or an enterprise, this design can dramatically simplify your DevOps lifecycle.


AI Assistants and Tools for efficient Requirements Engineering
https://devopscon.io/blog/ai-assistants-and-tools-for-efficient-requirements-engineering/ | Mon, 12 Jan 2026 08:28:47 +0000 | https://devopscon.io/?p=209476

Requirements Engineering (RE) serves as the cornerstone of successful software projects, ensuring that stakeholders’ needs are understood, documented, and translated into actionable development goals. As software systems become more complex and development cycles shorter, the field of RE faces growing pressure to adapt. Artificial Intelligence (AI) is emerging as a game-changer, offering new methods and tools to revolutionize the traditional RE process.

The post AI Assistants and Tools for efficient Requirements Engineering appeared first on DevOps Conference & Camps.

Early and precise requirement definitions improve efficiency by reducing costly revisions and enabling better project planning. It also plays a key role in quality assurance by providing a foundation for thorough testing and controlled change management. Additionally, it helps organizations comply with regulatory and legal requirements, minimizing risks.

Requirements engineering is fundamentally a human, collaborative activity – it involves multiple stakeholders working together to define what a system should do. AI is now being used to enhance collaboration, especially in an era where teams are often distributed or working remotely. AI-supported workshops and meetings are a growing trend. For example, imagine a requirements workshop conducted via a video conference: an AI meeting assistant can join the call to transcribe the conversation in real time and highlight key points. This is already feasible with some tools which use AI to generate meeting transcripts and summaries. The benefit is that stakeholders can focus on the discussion, knowing that the AI will capture details and action items. The transcript can be shared and searched later, ensuring nothing said is forgotten. This also helps team members who could not attend or who have hearing/language barriers. Automated meeting transcription and summarization makes the content accessible to everyone [1]. Some AI assistants will even automatically generate a list of requirements or user stories discussed in the meeting and identify who agreed to what, making follow-ups much easier.

AI can also facilitate stakeholder alignment by analyzing inputs from various people to find common ground or highlight differences. For instance, in large projects stakeholders might submit their requirements or feedback in writing. AI clustering algorithms can group similar pieces of feedback, helping identify themes. During discussions, sentiment analysis can gauge stakeholders’ emotional responses to requirements proposals. An AI might detect that during a meeting, comments about a certain requirement were largely negative or contentious in tone – signaling the team to address that point more deeply. By doing a real-time “pulse check” on stakeholder sentiment, AI can alert the facilitator if one group of stakeholders seems dissatisfied or if there is an unresolved disagreement. This is incredibly useful in remote settings where “reading the room” is harder. AI essentially reads the virtual room. Some project management platforms, for example ClickUp with its ClickUp AI add-on, are leveraging this to keep everyone’s expectations aligned [2].

Remote collaboration tools for RE are also being enhanced by AI. Whiteboarding and brainstorming applications like Miro [3], Mural [4], or Microsoft Whiteboard [5] now include AI features that can organize ideas, suggest templates, or even generate diagrams. In a remote brainstorming session for requirements, participants might throw in ideas as text notes; an AI could automatically group related ideas or create an affinity diagram on the fly. For distributed teams speaking different languages, AI translation can break down communication barriers. Modern video conferencing can provide live subtitles and translation in dozens of languages using AI. This means a stakeholder in Spain can speak in Spanish and an English-speaking analyst sees the comment in English nearly instantly – and vice versa. Such capabilities ensure everyone can contribute to requirements discussions, not just those fluent in a single language. AI-driven translation and localization can also be applied to requirements documents, enabling collaborative editing by global teams without waiting for human translators.


AI is also being used to support asynchronous collaboration in RE. Not all stakeholders can meet at the same time. Some feedback might come in through emails, ticket systems, or chat over weeks. AI agents can monitor these channels and aggregate requirements-related information. For example, an AI could monitor a project’s chatroom and whenever a user story is mentioned with a new suggestion or a potential change, it logs it to the backlog or raises a flag for the business analyst to formally capture it. This kind of background assistance means the “voice of the stakeholder” is continually collected, even outside formal meetings. Moreover, AI can proactively engage stakeholders by, for instance, sending out automated questionnaires or polls, generated by an AI based on current project questions, to gather input on requirements decisions.

In collaborative modelling sessions an AI assistant could help by quickly drawing draft diagrams from the discussion. If stakeholders are mapping a process, the AI starts forming a flowchart that everyone can refine. This speeds up the convergence on a shared vision. Ultimately, AI-supported collaboration is about making sure that distance, time, and volume of information are no longer barriers for stakeholder engagement. Everyone sees the same information, concerns are identified via sentiment or analysis, and routine facilitation tasks are handled by AI. This leaves the human collaborators free to focus on decision-making and creative problem solving. The result is often more inclusive and efficient requirements workshops, where AI quietly handles the logistics and analysis in the background.

Tools and Platforms for AI-Enhanced RE

The growing interest in AI for requirements engineering has led to a variety of tools and platforms that practitioners can leverage. These range from AI-augmented features in established requirements management suites to innovative new products from startups. Below is an overview of notable tools and platforms that support AI-enhanced RE workflows:

  • Copilot4DevOps [6]: This AI assistant, powered by OpenAI GPT models, helps authors write and refine requirements directly within Azure DevOps. For instance, a business analyst can ask it to draft a user story given a short title, and the Copilot will produce a first draft with acceptance criteria. It can also analyze existing requirements and suggest improvements, or even convert a set of requirements into behavior-driven development scenarios in Gherkin syntax. Copilot4DevOps keeps the human in control: the analyst can accept or reject its suggestions. Notably, it also has features to rank requirements quality using the “6 Cs”.
  • Aqua AI by Aqua Cloud [7] is an Application Lifecycle Management tool that has integrated a robust AI “copilot” for requirements and testing. One standout feature is AI-powered requirements narration: a user can press a button and simply describe a requirement verbally for ~15 seconds, and aqua’s AI will convert that speech into a structured requirement in the system. This is useful for quickly capturing ideas on the fly. Aqua’s AI also performs duplicate detection across the requirements database, highlighting requirements that are very similar so that the team can consolidate them. Additionally, aqua AI can generate entire test cases from requirements and even prioritize tests based on requirement criticality. Essentially, aqua is embedding AI throughout the RE and QA workflow – from creation of requirements to ensuring each requirement has corresponding tests – in a seamless way.
  • Innoslate (Spec Innovations) [8] is a requirements and model-based systems engineering tool. Spec Innovations has developed custom GPT-based assistants trained specifically on requirements engineering knowledge. One of their AI assistants, the Requirements GPT, was trained on the INCOSE Requirements Writing Guide [9] and can generate comprehensive, well-structured, and testable requirements from a user prompt. For example, an engineer could input a high-level need, and the AI will produce a set of detailed requirements following best practices. They also created a Test Cases GPT that generates test case descriptions from requirements. These specialized AI models are integrated into the Innoslate platform, demonstrating how domain-specific LLMs can augment RE tasks with expert-level guidance built-in.

Many teams are also simply using general AI chatbots as ad-hoc RE tools. ChatGPT, for instance, can be prompted to act as a business analyst and generate a list of requirements given a project description, or to brainstorm edge cases for a feature. While not tied to any RE software, these AI systems can greatly assist in the early stages of requirements elicitation and analysis. They can also serve as interactive rubber ducks – a requirements engineer can explain a requirement to ChatGPT and ask, “what might I be missing?” or “can you rephrase this in a clearer way?” and get useful feedback. Some organizations fine-tune these models on their internal wiki or past projects, creating a custom AI that knows their domain. One must be cautious with confidentiality, but they offer a flexible, powerful platform for AI-assisted RE. For example, OpenAI’s code interpreter can even generate simple UML diagrams or perform data analysis on feedback spreadsheets to identify requirement patterns.

Documentation tools like Notion [10], Coda [11], or even Microsoft Word [12] with its coming AI features are integrating AI that can help summarize and organize information. In an RE context, Notion’s AI can take a large notes page from a stakeholder workshop and summarize the key needs expressed or turn a list of raw ideas into a structured list of requirements or user stories. It can also assist in writing documentation sections. These AI features act like a smart assistant for the requirements document itself, ensuring consistency in tone and filling gaps. While not RE-specific, they significantly speed up the creation of high-quality documentation.

There are also specialized tools focusing on the analysis of requirements text. For instance, Visure [13] has announced AI features like an assistant that suggests test cases for each requirement or assesses risk based on requirement complexity. Another example is Jama Connect [14] leveraging AI for impact analysis. By learning from projects, it might highlight which downstream work could be affected if a particular requirement changes. These platform-specific AIs might not be as famous as ChatGPT, but they are quietly improving the day-to-day work of requirements engineers by catching problems early.

A number of startups are targeting specific pain points in requirements engineering with AI. For example, WriteMyPRD [15] focuses on Product Requirements Documents: a product manager can input a product idea and the tool will produce a draft PRD complete with user personas, a feature list, and even non-functional requirements. This can save a lot of initial drafting time.

There are also AI tools for UX requirements that gather user feedback from App Store reviews or support tickets and distil new requirements from them. We also see AI being used in prototyping tools that translate a requirement directly into a low-fidelity UI mock-up – effectively prototyping from requirements to validate understanding. While each of these niche tools addresses a slice of the RE process, together they indicate an ecosystem flourishing around AI for requirements. Teams can choose the tools that fill their specific gaps.

The landscape of AI tools for RE is rapidly evolving. New integrations and features are announced frequently. The good news is that many of these AI capabilities can be tried in free tiers or demos. Even niche tools often have trial periods, so RE professionals can experiment and see what fits their workflow. By embracing these tools, requirements engineers and product managers can reduce tedious work and invest more time in creative and high-level thinking – with AI handling a lot of the heavy lifting behind the scenes.

Ethical Considerations in AI-Driven Requirements Engineering

While AI promises significant improvements in RE, it also introduces ethical considerations that teams must address. One major concern is bias. AI systems learn from historical data, which may contain human biases. If an AI tool is used to generate or analyze requirements, it could inadvertently reinforce existing biases – for example, prioritizing requirements for one user group over another due to skewed training data.

An unchecked AI could even suggest solutions that unfairly disadvantage or exclude certain populations. It’s important to recognize these risks so that AI does not undermine the fairness of the requirements process. On the positive side, AI can also be harnessed to detect and reduce bias. Language models can be used to scan requirement documents for potentially biased or discriminatory language. AI can flag if requirements consistently use “he” instead of gender-neutral terms, or if they assume certain cultural norms. By implementing bias detection checks, organizations can turn AI into a tool for improving fairness, catching unconscious biases that human reviewers might miss. Ensuring diversity in the data and using bias-mitigation techniques are essential steps when deploying AI in RE.

Another critical consideration is data privacy and governance in AI-driven RE. Requirements engineering often involves handling sensitive information about users and business processes. If AI tools are employed, they may ingest stakeholder interview transcripts, user stories, or usage data – potentially including personal data. Teams must ensure compliance with regulations like the GDPR (General Data Protection Regulation) [16] for any personal data processed. GDPR mandates principles such as data minimization, purpose limitation, and user consent, which apply to AI as well.

Any AI used should be assessed for how it stores and uses data: for example, sending confidential requirements data to a third-party AI service could be a violation if not properly controlled. Organizations should establish clear data governance policies for AI usage – specifying what data can be used to train AI, anonymizing or encrypting sensitive details, and controlling access to the AI outputs. Fortunately, AI itself can assist with compliance. AI-driven tools exist that automate GDPR checklist compliance by tracking consent, identifying personal data in requirements, and ensuring privacy considerations are documented. Still, ultimate responsibility lies with the humans deploying the AI. Regular audits, like checking the AI’s training data and outputs for privacy and bias issues, are a must.

Transparency and accountability are also paramount. When AI participates in requirements engineering, stakeholders should be made aware of its role. If an AI assistant suggests a requirement or a priority, the decision-making process shouldn’t be a black box. Explainable AI techniques can help – for example, an AI tool could highlight which input data or rule led to a given suggestion. Transparency builds trust: stakeholders are more likely to accept an AI-generated requirement if they understand the reasoning behind it.

Moreover, maintaining a clear record of when and how AI was used is important for accountability. If a mistake is later found in the requirements, the team should be able to trace whether it came from an AI suggestion and examine why. Many organizations are establishing AI ethics guidelines that include having a human-in-the-loop. In AI-assisted RE, this means AI is a supportive tool, but human experts make the final decisions. It’s wise to follow the principle that AI provides recommendations, not decisions.

Teams should also consider accountability if AI errors occur: for example, if an AI misses a regulatory requirement that leads to a compliance issue, who is responsible? Setting clear governance for AI can mitigate this. In summary, adopting AI in RE requires not just technical implementation but also ethical foresight: addressing bias, safeguarding data, and preserving transparency and human oversight. By proactively incorporating these considerations, e.g. through bias audits, data governance frameworks, and explainability features, organizations can harness AI’s benefits while upholding trust and ethics.


Requirements Engineering Outlook

AI is reshaping the landscape of Requirements Engineering, empowering teams to work faster, smarter, and more collaboratively. The case studies and tools discussed above show that AI can assist across all RE phases: from automating tedious tasks in elicitation and analysis, to improving the quality of specifications, aiding in validation with intelligent test generation, and simplifying requirements management with smart traceability and change impact analysis. The benefits of adopting AI in RE are multifold. Firstly, there are efficiency gains – tasks that once took hours can now be done in seconds or minutes. This can shorten development cycles and reduce costs. Secondly, AI provides a consistency and quality boost – it acts like an ever-vigilant reviewer that never tires of checking for mistakes or improvements, leading to more complete and clear requirements. Thirdly, AI can unlock insights from large datasets (like user feedback or operational data) that humans might miss, ensuring the requirements are data-driven and relevant. Finally, by handling grunt work, AI frees human requirements engineers to spend more time on strategy, stakeholder communication, and creativity. This augmentation of human work with AI strength is why many see embracing AI-driven RE as not just an opportunity but a necessity to thrive in an increasingly competitive software industry.

Sources

[1] The Benefits of Automated Meeting Transcription for Remote Teams

[2] How to Use AI for Stakeholder Kickoffs (Use Cases & Tools)

[3] Miro | The Innovation Workspace

[4] Work better together with Mural’s visual work platform | Mural

[5] Microsoft Whiteboard – Free Download and Installation on Windows | Microsoft Store

[6] Copilot4DevOps – Azure DevOps AI Assistant – Modern Requirements

[7] AI Integrations in Software Testing Tools | aqua cloud

[8] Requirements Management Software

[9] INCOSE Guide to Writing Requirements (incose_rwg_gtwr_v4_040423_final_drafts.pdf)

[10] Meet the new Notion AI

[11] Coda: Your all-in-one collaborative workspace.

[12] Digital Online Whiteboard App | Microsoft Whiteboard

[13] Requirements Management Software and Tool | Visure Solutions

[14] Jama Connect | Collaboration Tool | SaaS Requirements Management

[15] WriteMyPrd | Make Writing PRDs a Breeze with ChatGPT

[16] General Data Protection Regulation (GDPR) – Legal Text


Redefining the Platform
https://devopscon.io/blog/redefining-the-platform/ | Mon, 13 Oct 2025 12:48:35 +0000 | https://devopscon.io/?p=209099

We can take a nice, modern definition of Platform Engineering from Luca Galante – “Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application”. So how is this influenced by the move towards cloud native architectures?

The post Redefining the Platform appeared first on DevOps Conference & Camps.

The landscape of building and deploying applications has long been too complex for one team alone to manage. With the advent of Continuous Delivery, all steps in the build and deployment process must be automated to allow speed and avoid errors. Plus, of course, each process step must be secure. Given the repeatable nature of the work, a number of highly successful open source and proprietary tools and languages have evolved in this space, allowing configuration of Infrastructure as Code and workflow specifications for build and deploy that automate repetitive steps such as testing. Each of these tools and languages is now a separate skillset in itself. This additional complexity created the need for a platform team who possess these skill sets and can create the deployment environment for application developers to use.


As platform engineering has evolved into an art form, the DevOps “one team” drive has somewhat dissolved again, as it’s very difficult to not only understand your own complex code base, but also its complex deployment process! This means that some of the hard work to remove the Dev/Ops boundary has fallen by the wayside. The cultural clash between Dev and Ops, whereby Dev need constant change, but Ops need stability, was resolved by getting the Dev teams to automate the Ops processes. But it has now become a clash between developers and platform engineering teams. The problems usually arise over access. The developer asks, “Why can’t I have access to my Docker logs?” It’s often the case that the two teams drift apart, don’t meet often enough, don’t work together, don’t collaborate in an Agile manner and as such, do not produce a fit-for-purpose developer platform.

Cloud native architectures

I want to examine the platform team versus developer team challenge in the context of modern cloud-native architectures. By “cloud native”, I am referring to architecture designed to be cloud-hosted, designed to run with maximum efficiency on a given cloud platform. And by “efficiency” I mean compute efficiency – which should in turn lead to cost efficiency. For example, if I build a scalable microservice architecture that will be hosted on Google Cloud Platform (GCP), I should not rent Linux boxes from Google and install my own Kubernetes instance on the boxes. Instead, I would evaluate serverless offerings, and if they weren’t sufficient, I’d rent Google’s Kubernetes engine (GKE) as a service and configure it to host my containers as needed. This allows Google to efficiently manage the underlying infrastructure as they see fit. Additionally, my dev and platform teams don’t need to know how to install, network, and configure Kubernetes onto bare tin. They only need to know how to customize it for their containers.


The same principle applies for peripheral applications used for observability. For example, logging. If I run my application on Azure, I can choose to use Azure application insights to consolidate, view, and search my logs instead of installing my own flavour of ELK stack.

This creates an interesting problem for platform engineers. A number of the areas they are used to controlling now belong to the cloud service. Skillsets such as Puppet, Chef, Ansible, NGINX, or Kerberos configuration, or tasks such as JIRA/Confluence/Git and ELK stack installation, are no longer necessary. Complex tasks like creating VPNs and network routes become a simple drag-and-drop using a web interface that even developers can manage. The same is true for authentication, authorization, and role management. For testing and deploying serverless functions, I can just use AWS CodeDeploy via the GUI. You could argue that the development landscape niche that the concept of platform engineering expanded to fill no longer exists for cloud native development.

Control freaks

Let’s focus on a common platform engineering task: provisioning different environments for different users. Developers have a dev environment that they can release to at will. QAs want a QA environment that they can control versions in and that doesn’t fall apart mid-test. Users want a production-like environment where they can trial new features, and of course we need a live version. Plus, we need live-like environments to run load tests or penetration tests in. A decade or two ago, this would cause all sorts of pain and cost. For example, let’s say you have a Java app deployed as a WAR file to an application server. Installing the application server on load-balanced boxes, making sure it’s internet-accessible but secure and can communicate with the database used to be a highly skilled job. Each environment was created manually. There was the danger that human error would bring differences, and make the tests invalid. Even worse, application server licenses were WAY too expensive to be wasted on developers. You might have used Apache Tomcat for developers and Websphere for production. Then the applications the developers wrote wouldn’t run in other environments due to wonderful things like different versions of the Java SAX XML parsing library.

With the advent of scripting languages like Puppet, Chef, Ansible, and Terraform, it became possible to create an environment “stamp” that could be configured and re-used. This made the job of creating environments much more stable and allowed much more successful testing. However, particularly in the case of Puppet and Chef, the language paradigm was extremely unintuitive for functional or object-oriented developers. It was rare to get someone who could do both. Platform engineers ruled the world of scripting and provided ephemeral environments for everyone.

Jump forward to today and the concept of an environment in the cloud is nothing but a name. The only real difference between “test” and “production” is potentially the number of Kubernetes pods available for scaling, and the URIs the service talks to. Plus, the major hyperscalers provide some really nice GUIs and how-to guides. Take Microsoft Azure for example. Using their ARM templates platform, you can create a template from your development environment once you are happy with it, store the template in version control, and use it to “stamp out” as many copies of the environment as you wish – with provisions for configuration of both secure and less secure variables that change between environments. AWS CloudFormation is the Amazon equivalent.

Figure 1: AWS CloudFormation for provisioning environments
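To make this concrete, here is a heavily trimmed sketch of such an environment “stamp” as a CloudFormation template. The parameter, mapping, and resource names are illustrative, not taken from a real deployment (a deployable template would also need an ImageId, networking, and so on):

```yaml
# One template, many environments: only the Environment parameter changes per "stamp".
AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, test, production]
Mappings:
  EnvConfig:
    dev:        { InstanceType: t3.small }
    test:       { InstanceType: t3.small }
    production: { InstanceType: m5.large }
Resources:
  AppInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !FindInMap [EnvConfig, !Ref Environment, InstanceType]
      Tags:
        - Key: environment
          Value: !Ref Environment
```

Stored in version control, the same file can then be stamped out per environment, for example with aws cloudformation deploy --stack-name myapp-test --parameter-overrides Environment=test.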

This is no longer a skillset that a developer team can lack. If it’s the development team’s task to create the environments, it creates a sense of ownership for the running application. Yet, many platform engineers still consider environment provisioning to be their remit and theirs alone. This allows the barrier between devs and their deployments to stand. In this case, is the team really providing value? A self-service platform like backstage.io that gives developers access to the things they need creates a much better DevOps mindset. But, more on Backstage later.

The same principle applies with the build pipeline. If a platform engineering team creates the build pipeline but doesn’t give enough access to developers, the development team will go to all sorts of lengths to unblock themselves and get around it – with unpleasant consequences. A classic example is preventing developers from accessing their branch databases. I know a development team who dealt with this problem by deploying their own Postgres server onto Kubernetes pods and accessed the database through an illegal “dev” back door! That isn’t what you want in your environment.

A dead end?

Does this mean that we don’t need platform engineers anymore? Of course not! The “unicorn problem” still exists; it’s cognitive overload to try and understand both the application stack and deployment stack. Plus, there are a whole new set of security considerations to manage when hosting your environments in a public cloud. But it does mean that if we work with cloud platform providers, we can simplify the concept of a platform to “everything above Kubernetes”.

I have always been interested in how evolutionary concepts apply to all aspects of software engineering – from applications to architectures to ways of working. Constant change is not always necessary. For example, a lot of the Linux / UNIX kernel code is well over 40 years old and still going strong. Evolution simply means adapting to better fit your environment when your environment changes. Most companies need to evolve, as most companies’ operating landscape changes massively, and only a very small number of companies have thrived over time. According to research by McKinsey, fewer than 10 percent of the non-financial S&P 500 companies in 1983 remained in the S&P 500 in 2013. That’s certainly where we are with the cloud native landscape. Hardware changes such as cheaper and better optic fibres, faster chips, longer-lasting batteries, and 5G networks have combined with software changes such as new security algorithms. This means that cloud architectures have evolved, and we need to move with them.

As the cloud native landscape has evolved, platform engineers need to evolve too and keep pace with it. With the simplification of application provisioning due to improved tools, the biggest switch in this evolution is moving towards a customer mindset. The thing that the big cloud providers do well (and some do REALLY well) is making their developer services extremely customer-friendly, where the developer is the customer. Open source libraries and frameworks have also taken this step. I’ve already mentioned Backstage.io and how it really focuses on the developer experience. The move to self-service provisioning and away from a JIRA service-desk-like approach to provisioning is a great help.

For a modern, cloud-native platform engineering team to be successful, they need to treat the developer as a customer and work with them in an Agile way. This means focusing on user needs, researching the user experience, creating an MVP, and iteratively improving on it using customer feedback. These are things that come naturally to Agile development teams, but are rarer in platform engineering teams with their inherited operations-style background of being shut away in a locked server room, requiring stability and shying away from constant change.

As a developer, I’ve never worked with a team like that. In fact, I’ve never worked with a platform engineering team. They tend to be in before development starts with a default set of requirements (“Create a pipeline, source code and artifact repositories, with minimal permissions”) and then they’re gone. Requests to change what’s in place tend to be via a service desk ticket with a multi-day SLA. It fills me with joy to think of having weekly sprint demos to show how our platform environment has improved in line with our requests!

Because DORA

A common question arises when discussing platforms to support development teams: Is investing in the developer experience worthwhile? Having the whole Agile team working constantly on non-business functionality sounds expensive, right? Devs are not the end of the chain. They should be working on improving the customer experience of the application users. So isn’t it just an unnecessary bottleneck in our team’s end goal to spend all this time focusing on whether or not our devs are happy? The DevOps Research and Assessment (DORA) is a long-running research program answering just that question. The outcomes are quite clear. If you want successful product delivery, you need a stable and successful pipeline. In other words, delays and security holes in getting software functionality to customers aren’t a Dev problem or an Ops problem. They are a business problem. And as such, they have to be explained in business terms – which usually means giving them a monetary value.

This is possible, although there are so many variables that I will not attempt to give any sort of figures. The Google 2020 whitepaper “Return on Investment of DevOps Transformation” offers a way to calculate this by putting a number on unnecessary work saved and the retention of valuable skilled developers. The time saved can also be given a value in terms of new features that could be built using it.

Other metrics that can be given a value are security incidents (or lack of), and growth compared to competitors. The State of DevOps report can give ideas on the metrics to collect in order to figure out a value for missing security incidents, and to put a value on stability.

The final metric that is fairly easy to monetise and is tightly linked to a positive developer experience is attrition. Skilled developers are in high demand and command top salaries; the industry rule of thumb is that replacing a developer costs approximately six months’ salary. Having an Agile, customer-focused platform engineering team that provides self-service Continuous Delivery functionality to the developers is a strong factor in whether or not a developer is satisfied in their job and likely to stay, with all the knowledge retention and cost saving this implies. Research shows that high-performing teams have up to 30% less attrition than those who do not focus on the developer experience.

The post Redefining the Platform appeared first on DevOps Conference & Camps.

How to Automate Security With GitOps
https://devopscon.io/blog/observability-monitoring/autmoate-security-with-gitops/ – Tue, 30 Sep 2025 12:16:24 +0000

What methods can teams use to address risky software delivery, siloed teams, unreliable software, lack of visibility, and difficulty scaling systems? Let’s take a look at how to properly implement GitOps and automate security with tactics like Tekton, Argo CD, and Sealed Secrets in Kubernetes.

Intro

Nowadays, software has a profound impact on organizations, not only digital ones but also traditional ones, such as banks, insurance companies, and airlines. For example, most (if not all) banks today are digital. The importance of software means we must deploy and release more often, securely, and reproducibly.

If we take a look at the most common problems enterprises find when developing applications, they are:

  • Slow and Risky Software Delivery: Traditional software development often involves long release cycles, with infrequent deployments that are risky and painful. The more content you deploy at once, the higher the chance of introducing a regression – and the harder it is to identify the cause of the problem.
  • Siloed Teams and Poor Collaboration: Historically, developers and operations teams have worked in silos, leading to miscommunication, finger-pointing, and inefficiencies. Moreover, as complexity grows, the application requires more teams (security, testers, etc.) working on it, making collaboration between teams even more critical.
  • Unreliable Software and Production Failures: Inconsistent environments and manual configurations frequently result in bugs in production. Manual deployments are the enemy of stability and reproducibility. Something that worked last time might not work this time because a human executed the process differently.
  • Lack of Visibility and Slow Feedback Loops: Teams often don’t know how the software behaves in production until a customer complains.
  • Difficulty Scaling Systems and Teams: As systems grow, managing deployments, environments, and team coordination becomes more challenging. In the past, applications were monolithic, relying on a single database; now, things are more complex, with multiple elements to manage.

So, with this complexity, we need methodologies to deploy correctly and adapt to the quickly changing world.


What is DevOps?

DevOps is about breaking down the wall between all the actors in a software development cycle (developers, testers, operations) to work together to build, test, deploy, and monitor software more efficiently and effectively.

  • DevOps promotes continuous integration and delivery (CI/CD), enabling smaller, more frequent, and automated releases, thereby reducing risk and increasing speed.
  • DevOps promotes a shared responsibility and collaboration culture through cross-functional teams and unified workflows.
  • By utilizing Infrastructure as Code (IaC) and automated testing, DevOps ensures consistent environments across development and production.
  • DevOps emphasizes monitoring, logging, and real-time feedback, helping teams detect issues early and respond quickly.
  • DevOps practices support automation, standardization, and scalability, making it easier to manage growth efficiently.

Fig. 1: The structure of the DevOps methodology

DevOps is a methodology that covers a broad spectrum of the software development lifecycle and implies a cultural shift in enterprises. However, in this article, we’ll focus only on the CI/CD part in a practical way.

DevOps is not a tool, but we ultimately need to rely on tools and specific implementations. Nowadays, GitOps is one of the most used DevOps practices for CI/CD.


What is GitOps?

GitOps uses Git as the single source of truth for infrastructure and application deployments, typically YAML files. You define elements such as a continuous integration pipeline, deployment files, and tools like Argo CD or Flux, which automatically apply the Git state to the actual infrastructure and continuously reconcile any drift.

The significant advantage of using GitOps is that all changes are tracked through Git workflows, including pull requests and reviews, which improves auditability and collaboration.

Git becomes a central piece in developing, building, and deploying processes; you need a Git server such as GitHub or GitLab and a project organization to store all the source code, the deployment manifests, scripts to build the project, or infrastructure files. There are several ways to organize a project to meet GitOps expectations, one of which is to use a single repository with multiple folders. For example, one folder can be dedicated to the application’s source code, another to deployment files, and a third to the build pipeline, and so on.

However, the best approach is to split the content into two repositories. One repository contains the application source code, and another contains all the manifests and scripts necessary to build, deploy, and release the application, as well as the manifests required to prepare the environment.

Figure 2 shows a real example of a project with two Git repositories: one containing the source code and another containing the manifests.

Fig. 2: Two Git repositories – one with the source code, one with the manifests

In this article, we’ll focus on the latter, which includes the pipeline definition for building the application, the Kubernetes files for deploying the application, and the reconciliation manifest (ArgoCD).

The repository layout can contain different folders:

  • pipeline: In this folder, you’ll store all files related to building the application, typically YAML files for any CI/CD engine, such as Jenkins, GitHub Actions, GitLab Pipelines, or a native Kubernetes solution like Tekton.
  • manifests: In this folder, you’ll store the files to deploy the applications. Nowadays, Kubernetes deployment files are standard, but they can also be other types, such as shell scripts or Ansible files.
  • infrastructure: This is an optional folder for placing files related to the construction and configuration of the environment, such as the installation of Kubernetes Operators, Service Accounts, or Volumes, which are operations that you’ll run only once or rarely during the application’s lifecycle.
  • continuous-delivery: Folder containing automation and reconciliation files using tools like ArgoCD, Flux, or Ansible Event-Driven, to automatically apply the Git state to the actual infrastructure and continuously reconcile any change in the Git files.

Of course, this may vary depending on the specific use case.
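Putting these folders together, the layout of the manifests repository might look like this (the file names are illustrative):

```
myapp-manifests/
├── pipeline/
│   ├── tasks.yaml             # git-clone, maven-build, buildah-push
│   └── pipeline.yaml          # java-build-and-push
├── manifests/
│   ├── deployment.yaml
│   └── secret.yaml
├── infrastructure/
│   └── operators.yaml         # one-off environment setup
└── continuous-delivery/
    └── application.yaml       # Argo CD Application
```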

Let’s explore some examples in each of these categories.

Continuous Integration

The continuous integration pipeline defines the steps or stages that the system executes to obtain the application sources, compile them, run tests, create a delivery package (typically a container image), and push it to an artifact repository.

The following snippets show an example of YAML files defining these steps in Tekton (https://tekton.dev/).

Tekton is an open-source framework for building continuous integration/continuous delivery (CI/CD) systems on Kubernetes.

Clone Task

Kubernetes executes the clone task by instantiating a Pod with an Alpine Git container and executing the git clone command. It puts the cloned repository into a workspace (named source here) so all other Tekton tasks can refer to these files.

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: git-clone
spec:
  params:
    - name: url
      type: string
    - name: revision
      type: string
      default: "main"
  workspaces:
    - name: source
  steps:
    - name: clone
      image: alpine/git
      script: |
        git clone $(params.url) --branch $(params.revision) $(workspaces.source.path)

Compile and Package

The other step is compiling, running the tests, and finally creating a package. For a Java project, it might look like:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: maven-build
spec:
  workspaces:
    - name: source
  steps:
    - name: mvn-package
      image: maven:3.8.6-openjdk-17
      workingDir: $(workspaces.source.path)
      script: |
        mvn clean package

Notice that the workingDir option refers to the workspace defined in the previous task.

Build and Push Container Image

The final step is to build a container image and push it to a container registry. Kubernetes installations usually have no Docker host, so you cannot run docker build. However, several projects can build a container image without a Docker daemon, such as Jib, Kaniko, or Buildah (https://buildah.io/).

Here, we’ll use Buildah:

apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: buildah-push
spec:
  params:
    - name: image
      type: string
  workspaces:
    - name: source
  steps:
    - name: build-push
      image: quay.io/buildah/stable
      securityContext:
        privileged: true
      script: |
        buildah bud -f $(workspaces.source.path)/Dockerfile -t $(params.image) $(workspaces.source.path)
        buildah push $(params.image)

Kubernetes starts the Buildah container and executes buildah bud to build the container image and buildah push to push it to the registry.

Pushing to an external registry requires a ServiceAccount with the container registry credentials configured via ServiceAccount secrets.

apiVersion: v1
kind: Secret
metadata:
  name: container-registry-secret
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-auth-of-docker-config-object>
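Rather than hand-crafting the base64-encoded Docker config, the same Secret can be generated with kubectl (the server, username, and password values are placeholders to fill in):

```sh
kubectl create secret docker-registry container-registry-secret \
  --docker-server=quay.io \
  --docker-username=<user> \
  --docker-password=<password>
```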

Create a service account and link it to this secret:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-bot
secrets:
  - name: container-registry-secret

The last step is defining a pipeline to orchestrate all these tasks:

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: java-build-and-push
spec:
  params:
    - name: git-url
    - name: git-revision
      default: main
    - name: image
  workspaces:
    - name: shared-workspace
  tasks:
    - name: clone
      taskRef:
        name: git-clone
      params:
        - name: url
          value: $(params.git-url)
        - name: revision
          value: $(params.git-revision)
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: build
      taskRef:
        name: maven-build
      runAfter: [clone]
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: image-build-push
      taskRef:
        name: buildah-push
      runAfter: [build]
      params:
        - name: image
          value: $(params.image)
      workspaces:
        - name: source
          workspace: shared-workspace

Tekton is a big project and a real pipeline will need more consideration, but this gives you an overview of Tekton and how to define a continuous integration pipeline as code.
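To actually execute the pipeline, you create a PipelineRun that binds the parameters and provides a concrete volume for the shared workspace. This is a sketch: the Git URL is illustrative, and it assumes the build-bot ServiceAccount defined above:

```yaml
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: java-build-and-push-run-
spec:
  pipelineRef:
    name: java-build-and-push
  taskRunTemplate:
    serviceAccountName: build-bot   # carries the registry credentials
  params:
    - name: git-url
      value: https://github.com/myorg/myapp   # illustrative repository
    - name: image
      value: quay.io/myorg/myapp:1.0.0
  workspaces:
    - name: shared-workspace
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Gi
```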

Let’s jump to the manifest folder.

Manifests

In this folder, place the manifest files required to deploy the application in any environment. Depending on the environment and the application, you may need more or fewer manifests, but let’s keep things simple. An application deployed to Kubernetes that connects to a database requires two manifests: one for the database secrets and another for the application deployment.

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:
  DB_USER: myuser
  DB_PASSWORD: mypassword

And the deployment file is injecting the secrets:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: quay.io/myorg/myapp:1.0.0
          env:
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: app-secret
                  key: DB_USER
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-secret
                  key: DB_PASSWORD

Applying these files using kubectl apply will deploy the application to Kubernetes, which is valid. However, GitOps also promotes the idea of automating and reconciling applications.

Continuous Delivery

One of the most important aspects of GitOps is the automation and reconciliation of the Git repository with the environment where the application is deployed. To implement this feature, you need an external tool such as Argo CD, Flux, or Ansible Event-Driven.

These tools implement the following four steps:

  1. Monitor the Git repository to detect any changes that may occur in any of the resources.
  2. Detect the drift of any resource or manifest placed there, usually a change in a Kubernetes manifest.
  3. Take action by applying the changed manifest to the Kubernetes cluster.
  4. Wait till the synchronization of the resources succeeds.

Any change in the system may be tracked through Git, and a process like Argo CD will detect the change and apply it.
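Stripped of all Kubernetes detail, the reconciliation loop boils down to “compare desired with live, apply on drift”. The following is purely illustrative – Argo CD diffs whole manifests, not a single image tag:

```shell
# Desired state comes from Git, live state from the cluster (both faked here).
desired="quay.io/myorg/myapp:1.0.1"
live="quay.io/myorg/myapp:1.0.0"

if [ "$desired" != "$live" ]; then
  echo "drift detected: $live -> $desired"
  live="$desired"   # stands in for applying the changed manifest
fi
echo "reconciled, live=$live"
```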

For example, the previous Kubernetes deployment snippet deploys version 1.0.0 of the application (image: quay.io/myorg/myapp:1.0.0). After running the continuous integration pipeline to generate the 1.0.1 version, you may need to deploy this newer version. In GitOps, this is done by updating the deployment file container image tag to 1.0.1 (image: quay.io/myorg/myapp:1.0.1). When the change is committed and pushed to the repository, the GitOps tool detects the change and applies the manifest, triggering a rolling update from version 1.0.0 to 1.0.1.
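The “deployment” itself is then nothing more than an edit plus a commit. A minimal sketch – the file layout here is recreated for illustration, and teams often use yq or Kustomize instead of sed:

```shell
# Recreate a tiny manifests/ tree so the example is self-contained.
mkdir -p manifests
cat > manifests/deployment.yaml <<'EOF'
          image: quay.io/myorg/myapp:1.0.0
EOF

# Bump the image tag in the tracked manifest...
sed -i 's|myapp:1.0.0|myapp:1.0.1|' manifests/deployment.yaml
grep image: manifests/deployment.yaml

# ...then hand it to Git; the GitOps tool does the rest:
# git commit -am "Deploy myapp 1.0.1" && git push
```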

The following figure summarizes the whole pipeline explained above.

Fig. 3: Git pipeline

In the continuous-delivery directory, you’ll place the configuration file – in this example an Argo CD (https://argoproj.github.io/cd/) Application – for reacting to any change made in the manifests directory of the Git repository:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  destination:
    namespace: myapp
    server: https://kubernetes.default.svc 
  project: default 
  source: 
    path: manifests
    repoURL: https://github.com/myorg/myapp-manifests
    targetRevision: main
  syncPolicy: 
    automated:
      prune: true
      selfHeal: false
    syncOptions:
    - CreateNamespace=true

Applying this manifest in a Kubernetes cluster with Argo CD installed will configure Argo CD to monitor the manifests folder of the https://github.com/myorg/myapp-manifests repository. Any change made to this directory will trigger an update to the myapp namespace, applying the changed files.
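After applying the Application (for example with kubectl apply -f continuous-delivery/), you can inspect the sync state with the Argo CD CLI; the app name matches the manifest above:

```sh
argocd app get myapp    # shows sync and health status
argocd app sync myapp   # trigger an immediate sync instead of waiting
```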

We have the tools to implement GitOps, and we can see the significant advantages of having everything automated, from updating an application to recovering when something goes wrong.

But what about the database secrets defined in this example? They are in plain text in the YAML file, which means that anyone with access to the Git repository (even an attacker) can extract this sensitive data.

One option is to leave the Secrets outside of the GitOps workflow. It is an option, but it is never a good idea to have different workflows, as it creates two systems to maintain. To solve this problem, several projects are available to help you protect and store these files correctly. In this article, we’ll look at the Sealed Secrets project (https://github.com/bitnami-labs/sealed-secrets).

Secrets

Sealed Secrets is a Kubernetes project that enables secure management of Kubernetes Secrets in Git repositories, utilizing encryption to keep secrets safe within GitOps workflows.

How does it work? First, install the Sealed Secrets controller in the Kubernetes cluster and the kubeseal CLI tool on your local machine (or whichever machine creates the secret).

On the local machine, create a standard Kubernetes Secret. For example:

apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:
  DB_USER: myuser
  DB_PASSWORD: mypassword

During installation, the controller generates a private and public key pair for encrypting and decrypting content. Using the kubeseal CLI tool, you automatically fetch the public key from the cluster and encrypt the Kubernetes Secret locally:

kubeseal --format=yaml < secret.yaml > sealedsecret.yaml

The output of the command is a new Kubernetes resource file of kind SealedSecret equivalent to a Kubernetes Secret, but encrypted.

apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: app-secret
  namespace: myapp
spec:
  encryptedData:
    DB_USER: AgC2z9zN...truncated...==  # encrypted secret value
    ...
  template:
    metadata:
      name: app-secret
      namespace: myapp
    type: Opaque

You can safely commit this file to the Git repository as its values are encrypted. You can also delete the original Kubernetes Secret as you don’t need it anymore.

When you apply the SealedSecret object (manually, or automatically via Argo CD), the Sealed Secrets controller uses the private key to decrypt it and recreates the plain Kubernetes Secret object inside the Kubernetes cluster.

As you can see, you are protecting sensitive data from creation until the Kubernetes cluster consumes it.
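If you have cluster access, you can check that the controller has recreated the plain Secret (the namespace and key names match the example above):

```sh
kubectl -n myapp get secret app-secret -o jsonpath='{.data.DB_USER}' | base64 -d
```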

Conclusion

In this article, we covered various technologies to implement GitOps effectively, including Tekton, Argo CD, and Sealed Secrets. Although they are widely used and battle-tested in large corporations, other solutions work perfectly well. What is important is to adopt a methodology that makes the process of building, deploying, and releasing an application safe, easy, and fast.

If you want to learn more about GitOps, you can download my book, GitOps Cookbook: Kubernetes Automation in Practice free from: https://developers.redhat.com/e-books/gitops-cookbook.

The post How to Automate Security With GitOps appeared first on DevOps Conference & Camps.

Adventures in Observability with Fluent Bit, OTEL, and Kubernetes
https://devopscon.io/blog/observability-monitoring/fluentbit-otel-k8s/ – Tue, 02 Sep 2025 12:42:53 +0000

Do you have questions about Fluent Bit? This article goes in-depth into how observability works with Fluent Bit, how to integrate OTEL, how to use Kubernetes with Fluent Bit, and some pointers on how to debug any potential issues.

The post Adventures in Observability with Fluent Bit, OTEL, and Kubernetes appeared first on DevOps Conference & Camps.

As a core maintainer of Fluent Bit for a few years now, I wanted to pull together some notes on some questions I see come up a lot:

  1. How does OTEL work/can I use OTEL with Fluent Bit? TLDR; Yes!
  2. Kubernetes logging details and associated pitfalls: How does Fluent Bit work with Kubernetes, what things are required, what goes wrong, and how can I fix it?

This article will give you a taste of Fluent Bit with OTEL along with some detailed walkthroughs of both that and using Fluent Bit for Kubernetes observability. In each case, I will explain how things work with some pointers on usage and how to debug any issues you might encounter. I’ll provide some general advice at the end for resolving any problems you may see.

I have provided a repo via my OSS fluent.do consultancy with all the examples: https://github.com/FluentDo/fluent-bit-examples

If you want some very detailed walkthroughs of different observability deployments and use cases then my friend Eric Schabell also provides a full set of examples here: https://gitlab.com/o11y-workshops

The article will introduce some terms then dive into various examples which you can jump to directly using the links below.

Clarifying terminology

First, some definitions and introductions for those who may not be familiar with the terms and tools I will be discussing. These are just an introduction and a lot more detail is easy to find across the internet, so feel free to skip to the examples if you already know what everything is.

Observability

Observability is more than just traditional monitoring of a system, which just provides a health snapshot. It is intended to provide a single investigative source of truth for the information about the system’s various components and how they are used. The goal is to be able to investigate issues or improve functionality by diving into what the system is doing across various data points.

Typically, the three “pillars” of observability are referred to as:

  1. Metrics: A snapshot of the current state of the system.
  2. Logs: A record of all events that happen in the system.
  3. Traces: A track of an application request as it flows through the system including the time taken for and the result of each component.

The traditional monitoring role within an observability system may alert you to a problem (e.g., CPU load metric is too high), which you can cross-reference with data from other sources (e.g., logs and traces) to determine the underlying cause.

To provide an observability system, you will need to deploy various components to first get the data, then process the data, and finally display it to the user or generate alerts for the user. This article primarily focuses on the agents at the edge that collect and process the data to send to the rest of the stack which deals with storage, querying, alerting, and visualisation.

OpenTelemetry (OTEL)

For years, vendors provided various observability solutions, each tending to be proprietary or at least hard to integrate easily. There were existing standards like syslog, but generally only for distributing log data rather than handling the full set of observability requirements. Attempts were made to standardise, with tools like Prometheus being developed and standards like OpenMetrics emerging, before the various industry incumbents, standards, and tools united into the OpenTelemetry standard (or OTEL for short).

OTEL as the standard is definitely a good idea. The concern I see with OTEL is the implementation – everyone has their own OTEL collector with different plugins/configurations/etc. There may still be some custom exporters (vendor code) used to talk to the observability backends so whilst the data may be received in OTLP format, it can use a custom exporter to send it out.

Fluent Bit

Fluent Bit started as an embedded OS variant (compiled from C code) of the larger Fluentd Ruby-based agent, with Fluent Bit focusing on lower resource usage and including every plugin in the core (rather than having to load the appropriate plugins from Ruby Gems to use them at runtime). Due to its focus on low resource usage, it has been adopted widely by almost every cloud provider and their users – running at those scales means any saving on resource usage is multiplied massively. Fluent Bit is part of the CNCF Graduated Fluentd project.

Fig. 1: Fluent Bit’s evolution from creation until v4


There are three main drivers for Fluent Bit:

  1. High performance and low resource usage.
  2. Vendor neutral and flexible integration – open-source with integration across ecosystems like Prometheus, OpenTelemetry and more.
  3. Broad ecosystem support – suitable for cloud, on-premise and edge deployments with an extensive plugin support for different data sources, destinations and processing.

Fluent Bit provides plugins for various types of inputs and outputs, as well as being standards friendly and vendor agnostic. It is not limited to working with just OTEL data, but supports a wide variety of sources and sinks including S3, generic HTTP, syslog and many more.

Fig. 2: Fluent Bit sources and destinations

Fluent Bit also provides various filter plugins that allow you to update, remove or add data in your pipeline as it flows through – a good example is the Kubernetes filter, which uses the K8S API to add the pod’s metadata (annotations, labels, etc.) to the actual pod logs as they flow through the system. Other powerful filters include the ability to run any Lua script or WASM code directly on the data.

The basic Fluent Bit pipeline follows 6 primary stages:

  1. Input: Ingest data from a variety of sources.
  2. Parsing: Convert unstructured data into structured data.
  3. Filtering: Modify, enrich or delete any of the data.
  4. Buffering: Retain the now immutable data either in-memory or persist it to a filesystem.
  5. Routing: Match data to the relevant outputs required with no duplication/copying.
  6. Output: Convert the internal representation to the required output format and send to the relevant destination.

The simplified diagram below shows this basic pipeline.

Fig. 3: Telemetry pipelines
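Putting these stages together, a minimal configuration sketch might look like the following (the dummy input, the grep filter, and the buffer path are purely illustrative choices):

```yaml
service:
  # Buffering: persist chunks to the filesystem rather than memory only
  storage.path: /var/lib/fluent-bit/buffer
pipeline:
  inputs:
    # Input: ingest data (dummy just generates test records)
    - name: dummy
      tag: app.logs
  filters:
    # Filtering: keep only records whose message matches the regex
    - name: grep
      match: app.*
      regex: message dummy
  outputs:
    # Routing + Output: the match rule routes tagged data to stdout
    - name: stdout
      match: app.*
```

Parsing happens inside inputs that support it (e.g. the tail input’s parser options), which is why it does not appear as a separate section in this sketch.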


In Fluent Bit, we use a msgpack internal structure for all data – that means all the records follow a common structured format based on JSON key-value pairs: msgpack.org

Some metadata is associated with each event (log/metric/trace), including a timestamp and a tag.

  1. A tag is a specific keyword for this record that can then be selected by a match rule on filters to do some processing or outputs to select data to send to some output. Matching can be done by regex or wildcard as well to set up a full telemetry pipeline with individual filters/outputs working on a subset of the data or all of it.
  2. Timestamps can be extracted (parsed) from the incoming event data or be allocated by Fluent Bit as the local time the event was created. They record the time of the specific event for other components to then work with.

Therefore, all events in Fluent Bit have a common internal structure which every plugin can work with:

  • Always structured into key-value pairs using msgpack
  • Always has a tag
  • Always has a timestamp
  • Additional optional metadata

Fluent Bit uses the match keyword to connect inputs to outputs (and filters) in a telemetry pipeline. This allows you to multiplex both inputs and outputs for routing data along with more complex setups like partial regex or wildcard matching. You can have a filter or output only select a subset of data to work with. More details can be found here: https://docs.fluentbit.io/manual/concepts/data-pipeline/router
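As a small sketch of this routing, the following tags two dummy inputs differently and uses match rules to send all of the data to one output and only a subset to another (the tags and messages are illustrative):

```yaml
pipeline:
  inputs:
    - name: dummy
      tag: app.frontend
      dummy: '{"message": "frontend"}'
    - name: dummy
      tag: app.backend
      dummy: '{"message": "backend"}'
  outputs:
    # Wildcard match: receives both frontend and backend records
    - name: stdout
      match: 'app.*'
    # Exact match: only the backend records are routed here
    - name: stdout
      match: 'app.backend'
```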

Like many other tools, initially Fluent Bit only supported log data. But now it also supports metrics and trace data both as inputs and outputs. One area of active improvement is around the processing/filtering of the metric/trace data in the pipeline to make it as rich a set of options as we have for log data.

Fig. 4: Inputs and outputs

Later, I will cover some tips and tricks for debugging Fluent Bit. The official Slack channel is also very active (and predates the CNCF channel): https://slack.fluentd.org/

The official documentation covers everything in a lot more detail: https://docs.fluentbit.io/manual

Fluent Bit with OTEL

OpenTelemetry provides an open source standard for logs, metrics & traces. Fluent Bit and the OpenTelemetry collector are both powerful telemetry collectors within the CNCF ecosystem.

  • Both aim to collect, process, and route telemetry data and support all telemetry types.
  • They each emerged from different projects with different strengths: FB started with logs and OTEL started with traces.
  • The common narrative suggests you must choose one or the other but these projects can and should coexist.
  • Many teams are successfully using both by leveraging each for what it does best or for other non-functional requirements like experience with Golang vs C, ease of maintenance, etc.

The OpenTelemetry collector also has a Receiver and Exporter that enable you to ingest telemetry via the Fluent Forward protocol.

Fig. 5: Fluent Bit and OTEL

Now, I will show you various examples of how to use Fluent Bit in different deployment scenarios. We’ll demonstrate full working stacks using simple containers to make it easy to reuse the examples and pick out the bits you want to test/modify for your own use.

A repo is provided here with all examples: https://github.com/FluentDo/fluent-bit-examples

These examples are quite simple, primarily to walk you through basic scenarios explaining what is going on. There are also some other examples provided by others like https://github.com/isItObservable/fluentbit-vs-collector which may also be useful.

Fluent Bit YAML config

In each case, I will use the new (since v2.0 anyway!) YAML format rather than the old “classic” format to hopefully future proof this article while allowing you to start using the processors functionality only available with YAML configuration. The official documentation provides full details on this configuration format: https://docs.fluentbit.io/manual/administration/configuring-fluent-bit/yaml

Fluent Bit processors

A processor is essentially a filter bound specifically to the input or output plugin it is associated with: it only runs on the data routed from that input or to that output.

Filters, by contrast, are part of the overall pipeline, so they can match data coming from any input or other filter, and they run on the main thread to process their data. Running as a processor instead has two benefits:

  1. Processors run on the thread(s) associated with their input or output plugin. This can prevent the “noisy neighbours” problem of certain input data starving out more important processing.
  2. Processors do not have to pay the usual cost of unpacking and repacking data into the generic internal msgpack format.

All existing filters can be used as processors, but there are some new processors added which cannot be used as filters. Processors are provided that work across the various logs, metrics, and trace data types, whereas filters are only provided for log type data.

Simple usage

We will run up the OTEL collector as a simple container, with a Fluent Bit container feeding it dummy OTEL data, as a simple test to show everything working and to walk through the configuration before moving on to more interesting and complex deployments. The OTEL collector acts as a simple OTEL receiver that is trivial to run and proves Fluent Bit is feeding it OTEL data.

OTEL collector

Start up the OTEL receiver to handle receiving OTEL data and printing it out. We use the following configuration:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]
    logs:
      receivers: [otlp]
      exporters: [debug]

Run up the container using the configuration above:

docker run -p 127.0.0.1:4317:4317 -p 127.0.0.1:4318:4318 -v $PWD/otel-config.yaml:/etc/otelcol-contrib/config.yaml otel/opentelemetry-collector-contrib:0.128.0

We open the relevant ports to receive OTEL data and tell it to mount the configuration file into the default location.

Fluent Bit sending OTEL

Now we can run up the Fluent Bit container that generates some dummy log data for now to show it all working together. We will use the following configuration:

service:
  log_level: info
pipeline:
  inputs:
    - name: dummy
      tag: test
      processors:
        logs:
          - name: opentelemetry_envelope
  outputs:
    - name: opentelemetry
      match: "*"
      host: 127.0.0.1
      port: 4318
    - name: stdout
      match: "*"

To run it up with YAML we have to mount the configuration file in and override the default command to use the YAML file rather than the classic configuration:

docker run --rm -it --network=host -v $PWD/fluent-bit.yaml:/fluent-bit/etc/fluent-bit.yaml:ro fluent/fluent-bit -c /fluent-bit/etc/fluent-bit.yaml

We’re using host networking here to simplify sending from our container to the already open localhost ports. In a real deployment you should connect the ports properly using dedicated networks or host/IP addresses.

There is a full compose stack here as well to simplify things: https://github.com/FluentDo/fluent-bit-examples/tree/main/otel-collector

Let’s walk through the Fluent Bit configuration to explain the various components:

service:
  log_level: info

This just sets up the top-level Fluent Bit configuration. Specifically, I added this as an example to help with debugging if you need to increase the log level.

pipeline:
  inputs:
    - name: dummy
      tag: test
      processors:
        logs:
          - name: opentelemetry_envelope
  outputs:
    - name: opentelemetry
      match: "*"
      host: 127.0.0.1
      port: 4318
      tls: off
      metrics_uri: /v1/metrics
      logs_uri: /v1/logs
      traces_uri: /v1/traces
      log_response_payload: true
    - name: stdout
      match: "*"

Here, we show a simple telemetry pipeline using the dummy input to generate sample log messages that are then routed to both an opentelemetry output (using the appropriate port and localhost address along with the URIs that the collector wants) and a local stdout output. This allows us to see the generated data both on the Fluent Bit side and what’s being sent to the OTEL collector we started previously.

Opentelemetry-envelope processor

The opentelemetry-envelope processor is used to ensure that the OTEL metadata is properly set up – this should be done for non-OTEL inputs that are going to OTEL outputs: https://docs.fluentbit.io/manual/pipeline/processors/opentelemetry-envelope.

Fig. 6: Opentelemetry-envelope

Essentially, it provides the OTLP-relevant information in the schema at the metadata level as attributes (rather than within the actual log data in the record), which other filters can then work with or the output plugin can use, e.g.

processors:
        logs:
          - name: opentelemetry_envelope

          - name: content_modifier
            context: otel_resource_attributes
            action: upsert
            key: service.name
            value: YOUR_SERVICE_NAME

It is usable for metrics or log type data as well.

Fig. 7: Metrics

Output

You should see the Fluent Bit container reporting the generated dummy data like so:

[0] test: [[1749725103.415777685, {}], {"message"=>"dummy"}]
[0] test: [[1749725104.415246054, {}], {"message"=>"dummy"}]

The stdout output shows first the tag we are matching (test), followed by the timestamp (in UNIX epoch format) and any other metadata, and finally the actual log payload, which in this case is the message key with the value dummy. The [0] prefix indicates the first event in a batch – if there were multiple events for stdout to print, it would increment for each one until the next output.

Now, on the OTEL collector side, we should see the log messages coming in like so:

2025-06-12T10:45:04.958Z	info	ResourceLog #0
Resource SchemaURL: 
ScopeLogs #0
ScopeLogs SchemaURL: 
InstrumentationScope  
LogRecord #0
ObservedTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 10:45:04.415246054 +0000 UTC
SeverityText: 
SeverityNumber: Unspecified(0)
Body: Str(dummy)
Trace ID: 
Span ID: 
Flags: 0

You can see the body is reported as just dummy, i.e. the message key because we only have a single top-level one. If you look at the documentation, you can see that by default the opentelemetry output looks to send the message key which is useful for demoing with dummy.

We can tweak Fluent Bit to generate a multi-key input and then pick the relevant key to send via a configuration like so:

service:
  log_level: info
pipeline:
  inputs:
    - name: dummy
      tag: test
      dummy: '{"key1": "value1", "key2": "value2"}'
      processors:
        logs:
          - name: opentelemetry_envelope
  outputs:
    - name: opentelemetry
      match: "*"
      logs_body_key: key2
      host: 127.0.0.1
      port: 4318
      tls: off
      metrics_uri: /v1/metrics
      logs_uri: /v1/logs
      traces_uri: /v1/traces
      log_response_payload: true

Using this configuration, you can see Fluent Bit reporting output like this:

[0] test: [[1749726031.415125943, {}], {"key1"=>"value1", "key2"=>"value2"}]

With the OTEL collector then receiving the key2 value:

2025-06-12T11:00:31.846Z	info	ResourceLog #0
Resource SchemaURL: 
ScopeLogs #0
ScopeLogs SchemaURL: 
InstrumentationScope  
LogRecord #0
ObservedTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 11:00:31.415125943 +0000 UTC
SeverityText: 
SeverityNumber: Unspecified(0)
Body: Str(value2)
Trace ID: 
Span ID: 
Flags: 0

The documentation shows how to configure some of the other OTEL fields appropriately: https://docs.fluentbit.io/manual/pipeline/outputs/opentelemetry

In the basic example, we aren’t populating other useful information like SeverityText and everything else. These can be set up from the data using the various configuration options available in the documentation.

Note the configuration options let you distinguish between data in the actual log message body and data found in the metadata:

  • xxx_metadata_key: Looks for the key in the record metadata and not in the log message body.
  • xxx_message_key: Looks for the key in the log message body/record content.
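As a hedged sketch of the difference, the following assumes the records carry illustrative sev_text/sev_num fields in the message body and uses the _message_key variants to map them onto the OTEL severity fields (check the output plugin documentation for the exact option names):

```yaml
service:
  log_level: info
pipeline:
  inputs:
    - name: dummy
      tag: test
      # Illustrative severity fields alongside the actual message
      dummy: '{"msg": "something broke", "sev_text": "ERROR", "sev_num": 17}'
  outputs:
    - name: opentelemetry
      match: "*"
      host: 127.0.0.1
      port: 4318
      tls: off
      logs_body_key: msg
      # Pull SeverityText/SeverityNumber from the log message body
      logs_severity_text_message_key: sev_text
      logs_severity_number_message_key: sev_num
```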

Fluent Bit with gRPC

Fluent Bit also supports using gRPC (and HTTP/2), but these need to be explicitly enabled via the grpc or http2 configuration options. Using our previous OTEL collector listening on 4317 for gRPC data, we can do the following:

service:
  log_level: warn
pipeline:
  inputs:
    - name: dummy
      tag: test
      processors:
        logs:
          - name: opentelemetry_envelope
  outputs:
    - name: opentelemetry
      match: "*"
      host: 127.0.0.1
      port: 4317
      grpc: on
      tls: off
      metrics_uri: /v1/metrics
      logs_uri: /v1/logs
      traces_uri: /v1/traces

Now, it should send data over gRPC to the OTEL collector, which reports similar output as before. I raised the log level to warn because the current version of Fluent Bit was very “chatty” about success reporting for gRPC.

Metrics and traces

As previously discussed, Fluent Bit can handle metric and trace style data now. It can scrape metrics from Prometheus endpoints, handle the Prometheus remote write protocol, or handle OTLP metric data directly.

For a simple demonstration, we can use the fluentbit_metrics input which provides metrics about Fluent Bit itself: https://docs.fluentbit.io/manual/pipeline/inputs/fluentbit-metrics

service:
  log_level: info
pipeline:
  inputs:
    - name: fluentbit_metrics
      tag: metrics
  outputs:
    - name: opentelemetry
      match: "*"
      host: 127.0.0.1
      port: 4318
      tls: off
      metrics_uri: /v1/metrics
      logs_uri: /v1/logs
      traces_uri: /v1/traces
      log_response_payload: true
    - name: stdout
      match: "*"
    - name: prometheus_exporter
      match: metrics
      host: 0.0.0.0
      port: 2021

We provide a stdout output which will report the data in the log and an endpoint that you can scrape for Prometheus format data at port 2021 via the prometheus_exporter. The metrics are also sent to the OTEL collector we are running, which should report output like so:

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 12:51:52.418887897 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> name: Str(stdout.1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 12:51:48.339675316 +0000 UTC
Value: 0.000000
Metric #30
Descriptor:
     -> Name: fluentbit_output_chunk_available_capacity_percent
     -> Description: Available chunk capacity (percent)
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: Str(opentelemetry.0)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 12:51:52.418913562 +0000 UTC
Value: 100.000000
NumberDataPoints #1
Data point attributes:
     -> name: Str(stdout.1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2025-06-12 12:51:52.416704619 +0000 UTC
Value: 100.000000

Traces require something that generates OpenTelemetry format traces – the only supported trace input for the moment is from OTEL. The opentelemetry input plugin (not output) shows how to configure this, including even converting traces to log style data via the raw_traces option (e.g. to send to an endpoint that only supports log data like S3, etc. rather than OTLP trace data): https://docs.fluentbit.io/manual/pipeline/inputs/opentelemetry

There is also a useful log_to_metrics filter, which can be used to convert log messages into metrics. Quite a common pattern found in a lot of existing applications is to log various buffer sizes, etc., which can be exposed better as metrics: https://docs.fluentbit.io/manual/pipeline/filters/log_to_metrics
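A minimal sketch of that pattern could look like this (the tags, metric name, and dummy input are illustrative; consult the filter documentation for the full option set):

```yaml
pipeline:
  inputs:
    - name: dummy
      tag: app.log
      dummy: '{"message": "request done"}'
  filters:
    # Count every matching log record as a counter-style metric
    - name: log_to_metrics
      match: app.log
      tag: app.metrics
      metric_mode: counter
      metric_name: requests_total
      metric_description: Number of request log lines seen
  outputs:
    - name: stdout
      match: app.metrics
```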

Using Fluent Forward with OTEL collector

The OTEL collector can talk directly to Fluent Bit with the Fluent Forward protocol via a receiver (if sending from Fluent Bit) or an exporter (if sending to Fluent Bit). This may be a better option in some cases and is easy to configure.

The Fluent Forward protocol is also implemented by Fluentd and essentially is a msgpack based implementation that includes the tag of the data. It’s an optimal way to transfer data between Fluentd/Fluent Bit instances, since it uses the internal data structure for it all.

Sending from OTEL collector to Fluent Bit

Fluent Bit needs to receive data using a forward input plugin.

pipeline:
    inputs:
        - name: forward
          listen: 0.0.0.0
          port: 24224
    outputs:
        - name: stdout
          match: '*'

We configure the OTEL collector to have a Fluent Forward exporter to send this data.

exporters:
  fluentforward:
    endpoint:
      tcp_addr: 127.0.0.1:24224
    tag: otelcollector

Remember that the Fluent Forward protocol includes the tag, so no tag is configured on the input plugin.

Sending to OTEL collector from Fluent Bit

We configure the OTEL collector to have a Fluent Forward receiver to get this data.

receivers:
  fluentforward:
    endpoint: 0.0.0.0:24224

Fluent Bit needs to send data using a forward output plugin.

pipeline:
    inputs:
        - name: dummy
          tag: test
    outputs:
        - name: forward
          match: '*'
          host: 127.0.0.1
          port: 24224

Kubernetes observability

I will show a “normal” (most widely deployed) example of using Fluent Bit to collect container logs from a K8S cluster. We can extend this to send to any output supported by Fluent Bit, as well as to include metrics and traces.

For the examples below, I am using Kubernetes-in-docker (KIND). This is a vanilla K8S distribution, so it should be applicable to all others. We also use the helm tooling, since this is the only officially supported approach: https://docs.fluentbit.io/manual/installation/kubernetes

Mounting container logs

To read the logs we have to mount them from the host so typically we deploy Fluent Bit as a daemonset with a hostPath mount to the local log files. One important thing to watch out for is dangling symlinks being mounted: make sure you mount the links and their targets if required so they can be resolved. Using the official helm chart will automatically create a daemonset with these files all mounted for you: https://github.com/fluent/helm-charts/blob/main/charts/fluent-bit

The logs being ingested should follow the K8S standard and container runtime format. Fluent Bit provides two default parsers that handle this file format automatically and deal with the various edge cases when lines are split by the kubelet. The documentation shows the recommended usage for containers and in general I always say to follow this – do not define your own custom parsers for these logs unless you know what you’re doing.

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri

In the above example, we are assuming there is a mounted set of container logs at /var/log/containers which is the default location used by the helm chart and most distributions. Then we attempt to parse with the built-in multiline parsers for the docker and cri container runtime log formats.

Previously, Fluent Bit also provided Multiline or Docker_Mode configuration options but these are deprecated now and only included for legacy usage – do not mix them with the new multiline.parser options. Instead, just use the new options.

Application-specific parsing

The parsers used above are mutually exclusive: the first one that matches will be used. They are not applied in order, so you cannot first do CRI format parsing and then another application-specific parse. If you want to first parse the default kubelet format and then attempt some application-specific parsing, you should add a processor or filter:

parsers:
  - name: my-custom-parser
    format: json

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri
      processors:
        logs:
          - name: parser
            parser: my-custom-parser
            key_name: log

After we have finished processing the data in the input file, we pass it to a custom parser that operates on the log key. You can use any of the other filters/processors in the same way and apply multiple as needed.

If a parser does not apply, the data is left alone and unchanged – there is no data loss from a parser that fails to match. This means you can chain a series of different parsers and only those that apply will affect the data. For example, with two wildly different log formats, try one parser then the other and whichever matches first will be applied.
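For example, here is a sketch of such a chain with two hypothetical application formats, a JSON one and an Apache-style one (the parser names and the regex are illustrative):

```yaml
parsers:
  - name: app-json
    format: json
  - name: app-apache
    format: regex
    regex: '^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\]'

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri
      processors:
        logs:
          # Try JSON first; if it does not apply, the record is unchanged
          - name: parser
            parser: app-json
            key_name: log
          # Then try the Apache-style format on whatever is left
          - name: parser
            parser: app-apache
            key_name: log
```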

Preventing duplicate or missing data

One other thing you may want to consider is the fact that your pods may be evicted or have delays in scheduling for various reasons, so you should make sure that when a new pod starts it continues from wherever the last one left off. Otherwise, you may miss data since the pod started or send duplicate data that another pod has already sent. This can also be true when running an agent outside of K8S, e.g. the Fluent Bit service starts up later than something you want to track logs from.

Fluent Bit supports this by persisting the offset it last read up to in each input file to a simple sqlite database, enabled via the optional db parameter. The db file tracks which files have been read and how much of each file, so that when Fluent Bit is restarted/rescheduled/etc. it will continue from where it left off.

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri
      db: /var/log/fluent-bit.db

The simple example above shows how to use the same volume as the container logs to write the database file. If you want to use read only mounts for the log volume, you can use a separate volume with write access and set the path to it for the db option. This example would also work for an agent deployed directly on the node.

The sqlite database tracks files by inode value so it handles log file rotation automatically: when the file is rotated the old inode is read until completion then we carry on with the new inode for the next file. The database file can also be looked at via any sqlite tooling you may want to access it with.

The database file is not intended to be shared across processes or nodes – for example, inode values are not unique across different nodes. Make sure it is linked to only one Fluent Bit process at a time. You need a writable location for this database that automatically matches to the right pod each time it is started. A simple way is to use a hostPath mount so the same config is shared across all pods, but the actual filesystem is then specific to each node.

If a pod that is persisting its file offsets to the database is evicted and a new pod starts, the database must be linked to the new pod automatically. A hostPath mount is a simple way to do this when running as a daemonset, since it is always specific to that node. Similarly, only one pod should write to a given database file at a time. For other deployment options (e.g. running as a deployment instead of a daemonset), you can figure out an alternative such as named directories or files in a persistent volume shared across all pods.
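A sketch of the corresponding daemonset pod spec fragment might look like this (the volume names and state path are illustrative, and would pair with db: /var/fluent-bit/state/fluent-bit.db in the tail input):

```yaml
spec:
  containers:
    - name: fluent-bit
      image: fluent/fluent-bit
      volumeMounts:
        # Read-only mount of the node's container logs
        - name: varlog
          mountPath: /var/log
          readOnly: true
        # Writable, node-local directory for the tail db
        - name: fb-state
          mountPath: /var/fluent-bit/state
  volumes:
    - name: varlog
      hostPath:
        path: /var/log
    - name: fb-state
      hostPath:
        path: /var/fluent-bit/state
        type: DirectoryOrCreate
```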

Kubernetes meta-data

Fluent Bit provides a simple Kubernetes filter you can use to automatically query the K8S API to get pod meta-data (labels and annotations) to inject into the records you are sending to your outputs. This filter will also allow you to do some additional custom parsing and other behaviour (e.g. you can ignore logs by label) on the log records it receives.

pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri
      processors:
        logs:
          - name: kubernetes
            kube_tag_prefix: kube.var.log.containers.

This relies on the K8S standard for kubelet log filenames, which includes enough information to extract and then query the API server with. The log filename will include the namespace, pod, and container names. From this, we can make a query to the K8S API to get the rest of the metadata for that specific container in that specific pod.

To ensure that the K8S filter in Fluent Bit has this information, it must be provided the log filename in the tag. The tail input filter will do this if you provide a wildcard in the tag name, i.e. tag: kube.* will be automatically expanded to the full filename for the tag (with special characters replaced) so something like kube.var.log.containers.namespace_pod_container. The K8S filter has two configuration parameters relevant here: https://docs.fluentbit.io/manual/pipeline/filters/kubernetes#workflow-of-tail-and-kubernetes-filter

  • Kube_tag_prefix: defaults to kube.var.log.containers. and is stripped off the tag to give you just the filename. This must be correct, otherwise you will extract nonsense information and the API query will fail. If you change the default tag to something other than kube.* or the files are mounted at a different path, you must make sure this is updated to match.
  • Regex_Parser: this is the parser used to extract the information from the filename after it is stripped, i.e. it gets the namespace and other information. You likely do not need to change this.

Make sure you have correctly configured RBAC to allow your Fluent Bit pods to query this information from the K8S API.
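As a sketch (the official helm chart normally creates the RBAC objects for you), the filter needs read access to pods and namespaces, along the lines of:

```yaml
# Illustrative RBAC for the kubernetes filter; the ServiceAccount name
# and namespace are assumptions and must match your deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-read
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-read
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-read
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: logging   # illustrative namespace
```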

If you’re seeing missing information from the kubernetes filter, the first step is to set log_level debug. This will provide the HTTP requests you’re making to the K8S API (check the pod information is correct – it’s usually down to mismatches in the tag) and the HTTP responses (which can show you invalid RBAC configuration).

Helm chart deployment

We’ll use the official helm chart to deploy Fluent Bit with the following configuration:

service:
  # Required for health checks in the chart
  http_server: on
pipeline:
  inputs:
    - name: tail
      tag: kube.*
      path: /var/log/containers/*.log
      multiline.parser: docker, cri
      processors:
        logs:
          - name: kubernetes
            kube_tag_prefix: kube.var.log.containers.
            merge_log: on
  outputs:
    - name: stdout
      match: "*"

This is a very simple standalone configuration that assumes a daemonset with a hostPath mount of /var/log, i.e. the helm chart defaults. We use the previously discussed kubernetes filter to retrieve additional information about each container log from the K8S API.

The merge_log parameter is a powerful tool that looks at the log data and extracts JSON key-value pairs, or applies custom parsers that you can specify via annotations on the pods: https://docs.fluentbit.io/manual/pipeline/filters/kubernetes#kubernetes-pod-annotations
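For example, annotations like these on a pod are picked up by the filter (a sketch; the pod name, image, and parser name are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    # Apply a pre-defined parser to this pod's log records
    fluentbit.io/parser: json
    # Or exclude this pod's logs from collection entirely:
    # fluentbit.io/exclude: "true"
spec:
  containers:
    - name: my-app
      image: my-app:latest   # illustrative image
```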

We specify http_server because the helm chart defaults to enabling K8S health checks on the pods, which hit that endpoint. If it isn’t present, the pods will never be marked healthy. You can disable those checks, in which case you don’t need http_server set, which avoids running an HTTP endpoint on port 2020: https://github.com/fluent/helm-charts/blob/54f30bd0c98d7ef7b7100c14d6cbd52236cb34e4/charts/fluent-bit/values.yaml#L201-L209

You can include files in an overall configuration. Include files are a good way to reuse common configuration or isolate specific configuration into separate files (e.g. for separate teams to control, or to make large configurations simpler with well-named includes). Each configuration file is read independently and its data loaded into memory, which means you can also include “classic” configuration files in a top-level YAML configuration file, or even mix and match:

includes:
  - yaml-include-1.yaml
  - classic-include-2.conf

This can be useful if you want to move things piecemeal a bit at a time.

One thing to note in YAML is to always quote wildcards, as they can be treated as special characters: a good tip is to quote things if you start seeing configuration format errors, just in case this is the problem.

YAML format with Helm chart

Currently the helm chart is defaulting to the old format configuration, but we can use YAML configuration with a simple values file:

config:
  extraFiles:
    fluent-bit.yaml: |
<YAML config here>

args:
  - --workdir=/fluent-bit/etc
  - --config=/fluent-bit/etc/conf/fluent-bit.yaml

We override the default configuration file to the YAML configuration we have added to the configmap used by the helm chart. This is a slight workaround in that it leaves all the legacy configuration alone and adds a new YAML one to use.

An example is provided here: https://github.com/FluentDo/fluent-bit-examples/tree/main/helm-yaml-config

We can run up a cluster with KIND then deploy the helm chart like so:

kind create cluster
helm repo add fluent https://fluent.github.io/helm-charts --force-update
helm repo update
helm upgrade --install fluent-bit fluent/fluent-bit --values ./values.yaml

Remember that with helm you can use helm template to generate the actual YAML output (similar to what many GitOps/IaC tools like Argo use to manage helm deployments) and verify it or use it directly.

Looking at the logs from the Fluent Bit pods should show you container logs with K8S metadata added: kubectl logs -l "app.kubernetes.io/name=fluent-bit,app.kubernetes.io/instance=fluent-bit"

[0] kube.var.log.containers.kindnet-vdwzr_kube-system_kindnet-cni-6c3fd58a5ca253428cbc7de0c54cb107bfac4c5b8977f29107afab415d376a4c.log: [[1749731282.036662627, {}], {"time"=>"2025-06-12T12:28:02.036662627Z", "stream"=>"stderr", "_p"=>"F", "log"=>"I0612 12:28:02.036155       1 main.go:297] Handling node with IPs: map[172.18.0.2:{}]", "kubernetes"=>{"pod_name"=>"kindnet-vdwzr", "namespace_name"=>"kube-system", "pod_id"=>"4837efec-2287-4880-8e05-ed51cc678783", "labels"=>{"app"=>"kindnet", "controller-revision-hash"=>"6cd6f98bf8", "k8s-app"=>"kindnet", "pod-template-generation"=>"1", "tier"=>"node"}, "host"=>"kind-control-plane", "pod_ip"=>"172.18.0.2", "container_name"=>"kindnet-cni", "docker_id"=>"6c3fd58a5ca253428cbc7de0c54cb107bfac4c5b8977f29107afab415d376a4c", "container_hash"=>"sha256:409467f978b4a30fe717012736557d637f66371452c3b279c02b943b367a141c", "container_image"=>"docker.io/kindest/kindnetd:v20250512-df8de77b"}}]
[1] kube.var.log.containers.kindnet-vdwzr_kube-system_kindnet-cni-6c3fd58a5ca253428cbc7de0c54cb107bfac4c5b8977f29107afab415d376a4c.log: [[1749731282.036770275, {}], {"time"=>"2025-06-12T12:28:02.036770275Z", "stream"=>"stderr", "_p"=>"F", "log"=>"I0612 12:28:02.036253       1 main.go:301] handling current node", "kubernetes"=>{"pod_name"=>"kindnet-vdwzr", "namespace_name"=>"kube-system", "pod_id"=>"4837efec-2287-4880-8e05-ed51cc678783", "labels"=>{"app"=>"kindnet", "controller-revision-hash"=>"6cd6f98bf8", "k8s-app"=>"kindnet", "pod-template-generation"=>"1", "tier"=>"node"}, "host"=>"kind-control-plane", "pod_ip"=>"172.18.0.2", "container_name"=>"kindnet-cni", "docker_id"=>"6c3fd58a5ca253428cbc7de0c54cb107bfac4c5b8977f29107afab415d376a4c", "container_hash"=>"sha256:409467f978b4a30fe717012736557d637f66371452c3b279c02b943b367a141c", "container_image"=>"docker.io/kindest/kindnetd:v20250512-df8de77b"}}]

You can see from these example logs that K8S metadata is nested under a kubernetes key.

Fluent Bit debugging tips

Even with the best tooling in the world, occasionally things go wrong and you’ll have to figure out why. My tips for debugging boil down to the usual software engineering idioms:

  1. Simplify your stack.
  2. Reproduce your issue minimally.
  3. Increase your log level.
  4. Treat warnings as errors.

Simplify

Always attempt to simplify any issues – do not debug things with a massive observability stack with components all along the way, mangling or affecting the final data you’re looking at.

Do not attempt to debug by looking at the output in another tool. Instead, use the stdout output or filter to see what the raw data looks like to the Fluent Bit deployment. This will help identify if the problem is with Fluent Bit or the component after it.

Once the data looks right on the Fluent Bit side, the issue may be resolved. But if not, then we know it’s either a problem with sending the data to the next component or something that component is doing.

Quite often, it’s easy to make incorrect assumptions about what your data looks like. A good example is parsing kubelet logs: people may assume their data looks like the output of kubectl logs … when actually they need to parse the raw file on disk. In that case, you typically want the default cri or docker multiline parsers to handle the kubelet format first in the tail input, then a separate parser filter (or attached processor) to parse the logs after they have been reconstructed from the kubelet format (which is what kubectl logs does first).

Local reproducer

Simplifying the problem also helps you set up a simple local reproducer with no (or minimal) external dependencies. Too often, we get issues raised with “random” failures seen using multiple inputs, filters, and outputs. If you can provide a simple reproducer, others can help more easily, and it can serve as a regression test (by yourself when adopting new versions, and/or by the Fluent project) if it turns out to be an issue.

A local reproducer also lets you iterate quickly to test possible changes/tweaks/updates. I like to do this with a simple container-based stack using the dummy input (or even tail if mounting sample data into the container). For example, you can easily have a local fluent-bit.yaml and test it like so:

vi fluent-bit.yaml
…
docker run --rm -it -v $PWD/fluent-bit.yaml:/fluent-bit/etc/fluent-bit.yaml:ro fluent/fluent-bit -c /fluent-bit/etc/fluent-bit.yaml

This will mount the YAML file in and provide it as the configuration file to use (with the -c parameter). You can even just mount the whole local directory if you’re passing in things like parser configuration, test files, etc.
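A minimal fluent-bit.yaml for such a reproducer might look like this (a sketch using the dummy input to generate test records; the record content is illustrative):

```yaml
service:
  log_level: debug
pipeline:
  inputs:
    - name: dummy
      # Emit a fixed JSON record periodically so you can exercise parsers/filters
      dummy: '{"message": "hello", "level": "info"}'
  outputs:
    - name: stdout
      match: "*"
```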

More debugging

Once you’ve simplified and have a reproducible version, then you can start adding more logging and investigating the remaining issues.

Increase your log_level to debug to check for problems and help debug things. This is especially useful when using HTTP APIs where it will show you the request and response. For misconfigured K8S filters, you may see an RBAC failure or incorrect queries showing why you’re not getting any K8S metadata that’s being added.

Always treat warnings as errors, at least while debugging. A good example is a warning about an unknown parser that is simply ignored, leaving your data unparsed. Quite often, that’s down to using relative paths with an incorrect working directory; switch to absolute paths to rule that out.

Missing Kubernetes metadata is usually down to mismatched tag configuration or invalid RBAC configuration. Using log_level debug will give you the HTTP requests being made and the responses from the K8S API server which usually helps figure out the problem.

The tail input functions like the Linux tail -f command by default, so it will only ingest new log lines appended to the file (with a newline) after Fluent Bit is started. You can use read_from_head: true (not recommended) to read data already in the file, or the previously mentioned state file via the db: xxx parameter, which reads a new file completely and then maintains where it is up to from then on.
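A sketch of a tail input using a state file (the db path is illustrative):

```yaml
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      # State file: new files are read completely and offsets are
      # tracked across restarts
      db: /var/log/flb_kube.db
```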

Use the --help flag to get information about any specific plugin and all its configuration options in the specific version you’re using:

docker run --rm -it fluent/fluent-bit -o opentelemetry --help

Unfortunately, documentation can sometimes be out of date or waiting on a pull request, so please send any updates to https://github.com/fluent/fluent-bit-docs.

Wrapping up

In summary, you can use Fluent Bit with OTEL either directly or by interfacing with the OTEL collector. The benefits are a very mature OSS solution with the low resource usage of Fluent Bit, combined with plenty of options for non-OTEL usage. You may already have Fluent Bit deployed, either explicitly or via your cloud provider, so leveraging it rather than deploying and managing another component can also make sense.

The Kubernetes observability section above aims to explain how Fluent Bit works to extract K8S metadata from the log files and query the API server for more information. If things go wrong, this should provide pointers as to why or what is required.

The final section gives you some tools to help you investigate any issues you may have with incorrect configuration, missing data or the like in your telemetry pipelines following the golden rules of:

  1. Simplify
  2. Reproduce
  3. Debug

The post Adventures in Observability with Fluent Bit, OTEL, and Kubernetes appeared first on DevOps Conference & Camps.

]]>
Streamlining Kubernetes Deployments with Kustomize: A CI/CD Perspective https://devopscon.io/blog/kubernetes/streamlining-kubernetes-deployments-with-kustomize/ Tue, 15 Jul 2025 12:50:42 +0000 https://devopscon.io/?p=208595 In modern software development, Kubernetes has become a leading solution for container orchestration, enabling developers to deploy, manage, and scale containerized applications efficiently. Despite its capabilities, handling configurations can be challenging, especially at large scales. Kustomize addresses this issue by offering a tool specifically designed to simplify and maintain Kubernetes configurations. This article examines the advantages of utilizing Kustomize to enhance Kubernetes deployments, particularly from a Continuous Integration/Continuous Deployment (CI/CD) standpoint.

The post Streamlining Kubernetes Deployments with Kustomize: A CI/CD Perspective appeared first on DevOps Conference & Camps.

]]>
1. Understanding Kustomize

Kustomize is a configuration management tool specifically designed for Kubernetes. Unlike traditional templating solutions, Kustomize works by layering patches on Kubernetes resource files, thus preserving the original structure and ensuring that configurations remain declarative and clean. It helps with managing variants of Kubernetes resources without the need for templates. This is perhaps its greatest strength and its greatest weakness at the same time: some features that Helm offers, such as control structures like loops or conditional blocks, cannot be provided by Kustomize. Nevertheless, Kustomize keeps customization simple by using fully valid YAML structures.

1.1 Key Features of Kustomize

Certain characteristics and constraints include:

  • No Templating Language: Kustomize uses plain YAML, eliminating the need for any additional templating language.
  • Overlay System: Overlays allow you to define variations of your base configurations without duplicating files.
  • Native Integration: Kustomize is natively integrated into kubectl, the Kubernetes command-line tool, making it seamless to use.
  • The kubectl CLI includes an integrated version, though it is typically not up-to-date. Therefore, the standalone binary must be used to benefit from the latest features.
  • It manages variants of resources by overlaying and merging YAML files in a structured way.
  • It provides convenient built-in features to generate common resources like ConfigMaps and Secrets.
  • It has built-in transformers to modify resources.
  • It can be extended via a plug-in mechanism.
  • It is possible to dynamically change resources, but this is restricted to particular fields of a resource.
  • It only manages the YAML files and does not actively manage the resources in the cluster.


1.2 Core Components

  • Base: A set of common configurations that can be reused across multiple environments. It can be reused in all environments and patches can be added for each of these environments.
  • Overlay: A set of environment-specific configurations that modify or extend the base configurations.
  • kustomization.yaml: The main configuration file that defines how resources should be customized.

A base manifest and an overlay manifest create customized files. Each environment needs a kustomization.yaml file.

The kustomization.yaml file is the primary file used by the Kustomize tool. When the Kustomize build command is executed, Kustomize searches for this file. It contains a list of all Kubernetes manifests (YAML files) to be managed by Kustomize, along with any custom configurations for the manifests that will be generated.

Here is an example of a kustomization.yaml file:
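A minimal version might look like this (the resource file names and label are illustrative):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Manifests managed by Kustomize
resources:
  - deployment.yaml
  - service.yaml
# A label applied to every listed resource
commonLabels:
  app: my-app
```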

 

  • Patches: They enable modifications to particular resources or workloads in a Kubernetes cluster without impacting other fields. Three parameters are important for a patch:
    • Operation Type: This can be add, remove, or replace.
    • Target: Represents the Kubernetes resource intended for modification.
    • Value: Specifies the value to be added or changed. This field remains blank when performing a removal operation.

There are two methods for patches: JSON 6902, which is driven by a target plus a list of patch operations, and the Strategic Merge Patching method, where the configuration provided follows the standard Kubernetes YAML format and the required changes are specified in the relevant sections.

To clarify, we will provide an example. Consider a simple Kubernetes deployment; a patch here might change the deployment’s replicas to one and change the container name.

With JSON 6902, the patch is expressed as a list of operations targeting exact paths; with strategic merge patching, you instead supply a partial manifest that is merged into the target resource.
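As a hedged sketch, assuming a deployment named my-app whose container is renamed to new-container (both names are illustrative), the two styles could be written as:

```yaml
# kustomization.yaml excerpt using a JSON 6902 patch
patches:
  - target:
      group: apps
      version: v1
      kind: Deployment
      name: my-app
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
      - op: replace
        path: /spec/template/spec/containers/0/name
        value: new-container
---
# Strategic merge style: a partial Deployment merged into the target.
# Note: containers merge by name, so "renaming" this way effectively
# introduces a new container entry rather than editing the old one.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: new-container
```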

 

Kubernetes Training (German only)

Discover the Kubernetes trainings for beginners and advanced learners with DevOps pro Erkan Yanar

  • Transformers: Transformers modify or enhance the values specified in a configuration. By utilizing transformers, we can adjust our basic Kubernetes YAML configurations according to the desired specifications. The following are some transformers and their applications:
    • commonLabels: Adds common labels to all Kubernetes manifests.
    • namePrefix: Adds a common prefix to all manifest names.
    • nameSuffix: Adds a common suffix to all manifest names.
    • namespace: Sets a common namespace on all manifests.
    • commonAnnotations: Adds common annotations to all manifests.
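A sketch of how these transformers might appear together in a kustomization.yaml (all values are illustrative):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
namePrefix: dev-           # dev-<name>
nameSuffix: -v1            # <name>-v1
namespace: my-namespace    # applied to every manifest
commonLabels:
  environment: development
commonAnnotations:
  owner: platform-team
```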

2. Setting Up Kustomize

2.1 Installation

To get started with Kustomize, you need to install it. Fortunately, Kustomize is bundled with kubectl, so if you have kubectl installed, you’re ready to go. However, you can also install Kustomize as a standalone tool if needed.

To install Kustomize, the following commands can be used:

# Install
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

# permission
sudo install -o root -g root -m 0755 kustomize /usr/local/bin/kustomize

 

2.2 Deployment of an application using Kustomize

To understand how Kustomize simplifies Kubernetes deployments, let’s walk through a basic example.

Imagine we want to deploy a simple NGINX application. Instead of manually editing YAML files for each environment (like development or production), we can structure our resources cleanly with Kustomize.

Step 1: Create the Base Configuration

First, create a directory structure: mkdir -p kustomize-example/base

Inside base/, create the following files:
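A minimal base might contain the following (file contents are illustrative sketches, separated here by document markers):

```yaml
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27   # illustrative tag
          ports:
            - containerPort: 80
---
# base/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
---
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
```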

Step 2: Create an Overlay for Development

Now let’s create a custom configuration for a development environment: mkdir -p kustomize-example/overlays/dev

Inside overlays/dev/, create:

  • We add a dev- prefix to resource names.
  • We label all resources with environment: development.
  • We patch the deployment to run 2 replicas instead of 1.
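The steps above can be sketched as a single overlay kustomization (a hedged example; it assumes the base files shown earlier):

```yaml
# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
namePrefix: dev-               # dev- prefix on resource names
commonLabels:
  environment: development     # label all resources
patches:
  - target:
      kind: Deployment
      name: nginx
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 2               # 2 replicas instead of 1
```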

Step 3: Build the Final Manifests

Now, use Kustomize to build the manifests:

cd kustomize-example/overlays/dev
kustomize build .

The output would show:

  • A Deployment named dev-nginx with 2 replicas.
  • A Service named dev-nginx.
  • Labels added for the development environment.

3. CI/CD with Kustomize

Thus far, we have examined the utilization of Kustomize to generate Kubernetes manifest files that are prepared for deployment. Kustomize proves particularly effective when integrated into a CI/CD pipeline. By leveraging Kustomize, you can ensure that your Kubernetes configurations remain consistent and easily manageable across various environments. In this section, we will present a brief example illustrating how Kustomize can be incorporated into a CI/CD pipeline using a build tool such as Jenkins.

Jenkins is a widely-used open-source automation server that facilitates the automation of the CI/CD process. Here is a procedure for integrating Kustomize with Jenkins:

  1. Checkout the code (including Kustomize bases and overlays).
  2. Build the Docker image (optional, if you’re deploying a new app version).
  3. Update the Kustomize overlay (for example, to set a new image tag).
  4. Build the Kubernetes manifests with Kustomize.
  5. Apply the manifests using kubectl to the Kubernetes cluster.

Here is an example Jenkinsfile:

pipeline {
    agent any
    environment {
        REGISTRY = "your-docker-registry.io"
        IMAGE_NAME = "your-app"
        K8S_OVERLAY_DIR = "k8s/overlays/staging"
        KUSTOMIZE_VERSION = "5.0.1" // adjust version if needed
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Build Docker Image') {
            steps {
                script {
                    sh "docker build -t ${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER} ."
                    sh "docker push ${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}"
                }
            }
        }
        stage('Install Kustomize') {
            steps {
                sh '''
                    curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
                    mv kustomize /usr/local/bin/
                '''
            }
        }
        stage('Update Kustomize Image') {
            steps {
                dir("${K8S_OVERLAY_DIR}") {
                    sh "kustomize edit set image ${IMAGE_NAME}=${REGISTRY}/${IMAGE_NAME}:${BUILD_NUMBER}"
                }
            }
        }
        stage('Deploy to Kubernetes') {
            steps {
                dir("${K8S_OVERLAY_DIR}") {
                    sh '''
                        kubectl apply -k .
                    '''
                }
            }
        }
    }
    post {
        always {
            echo 'Cleaning up...'
        }
    }
}

4. Best Practices

Outlined below are several best practices to consider when working with Kustomize:

  • Keep Your Base Configuration Simple: The base configuration should be as simple as possible. It should contain the common configurations that apply to all environments.
  • Use Overlays for Environment-Specific Configurations: Overlays should contain only the configurations that are specific to a particular environment. This keeps your configurations clean and manageable.
  • Version Control Your Configurations: Store your configurations in a version control system like Git. This allows you to track changes and collaborate with your team effectively.
  • Automate Your CI/CD Pipeline: Automate your CI/CD pipeline to ensure that your deployments are consistent and repeatable. Use tools like Jenkins, GitLab CI, or GitHub Actions to automate your pipeline.

Conclusion

Kustomize is a powerful tool that can simplify your Kubernetes configurations and make them more manageable. By integrating Kustomize into your CI/CD pipeline, you can ensure that your deployments are consistent across all environments. With the examples and best practices provided in this article, you should be well-equipped to streamline your Kubernetes deployments using Kustomize.


Kustomize FAQ

What is Kustomize and why use it with Kubernetes?

Kustomize is a configuration management tool for Kubernetes that lets you customize YAML manifests without a templating language. It layers patches over base resources so configurations stay declarative and easy to maintain across environments.

How is Kustomize different from Helm?

Unlike Helm, Kustomize uses plain YAML and avoids control structures like loops and conditionals. This simplifies customizing resources, but it also means some of Helm’s template-driven features aren’t available.

What are “bases” and “overlays” in Kustomize?

Base: the common, reusable set of manifests shared across environments.
Overlay: environment-specific changes (e.g., dev, staging, prod) that modify or extend the base via patches and transformers.

What does kustomization.yaml do?

It’s the entrypoint that lists resources, patches, generators, and transformers. When you run kustomize build (or kubectl apply -k), Kustomize discovers and assembles everything defined in kustomization.yaml.

How do patches work (JSON 6902 vs. Strategic Merge)?

JSON 6902: precise, operation-based edits (add, remove, replace) targeting exact paths.
Strategic Merge: supply partial YAML that’s merged into the target resource based on Kubernetes’ schema-aware rules.

What are transformers and when should I use them?

Transformers programmatically adjust fields across resources (e.g., namePrefix, nameSuffix, namespace, commonLabels, commonAnnotations) so you can keep bases clean and apply broad changes in overlays.

Can Kustomize generate ConfigMaps and Secrets?

Yes. Use configMapGenerator and secretGenerator in kustomization.yaml to create them from literals or files, so you don’t have to hand-write these manifests.
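A sketch of both generators in kustomization.yaml (the names, literals, and file are illustrative):

```yaml
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
    files:
      - config.properties   # read key/values from this file
secretGenerator:
  - name: app-secret
    literals:
      - API_KEY=changeme    # stored base64-encoded in the Secret
```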

Is Kustomize built into kubectl?

Yes, kubectl includes an integrated Kustomize, but it’s often behind the standalone release. Use the standalone binary to get the latest features, and use kubectl apply -k <dir> to apply.

How should I structure directories for multiple environments?

Keep a minimal base/ (shared manifests) and create overlays/<env>/ for each environment. Each overlay has its own kustomization.yaml that references the base and applies environment-specific patches and transformers.

How do I set different replica counts or image tags per environment?

Place a patch in the overlay (e.g., Strategic Merge to change spec.replicas) and/or run kustomize edit set image in CI to pin an environment-specific image tag before building/applying the manifests.

How do I integrate Kustomize into a CI/CD pipeline?

Typical steps: checkout code (bases/overlays), build & push the container image, update overlay image tags (e.g., kustomize edit set image), run kustomize build, and deploy with kubectl apply -k. This ensures consistent, repeatable deployments.

What are recommended best practices?

  • Keep the base simple and generic.
  • Use overlays only for environment-specific changes.
  • Version your configuration in Git.
  • Automate your pipeline with Jenkins, GitLab CI, or GitHub Actions.

Are there limitations I should know about?

Kustomize manages YAML—not the live cluster—and it purposefully avoids a template language. That reduces complexity but means advanced templating logic (loops/conditionals) isn’t part of the model.

What’s a quick command recap?

# Build manifests from an overlay
kustomize build overlays/dev

# Apply directly with kubectl
kubectl apply -k overlays/dev

# Update an image in CI
kustomize edit set image app=my.registry/app:BUILD_NUMBER

The post Streamlining Kubernetes Deployments with Kustomize: A CI/CD Perspective appeared first on DevOps Conference & Camps.

]]>
Reducing Developer Cognitive Load with Platform Engineering https://devopscon.io/blog/developer-cognitive-load-problem/ Tue, 27 May 2025 06:52:19 +0000 https://devopscon.io/?p=208257 Platform Engineering is emerging as a critical discipline to address one of today’s biggest challenges in software development: the overwhelming cognitive load on developers. As modern development teams take on more responsibility across the application lifecycle—building, deploying, securing, and operating software—the complexity of managing cloud infrastructure has become a significant burden. Platform Engineering offers a way forward by abstracting and automating infrastructure concerns, enabling developers to focus on delivering business value through code. In this article, we’ll explore how Platform Engineering reduces cognitive load, improves developer experience, and supports high-performing software teams.

The post Reducing Developer Cognitive Load with Platform Engineering appeared first on DevOps Conference & Camps.

]]>
Over the past few years, we have seen a significant shift in the way software is developed and deployed. We have moved beyond the ‘just hand it to Ops and they will install it’ mentality to a more collaborative approach where developers are responsible for the entire lifecycle of their applications. This, coupled with the adoption of cloud computing and its associated platforms, has introduced a new set of challenges for development teams. To answer this, we looked towards DevOps, a set of practices that combines development and operations through a culture of collaboration and shared responsibility. But in reality, this has often only shifted the burden from operations to developers, who are now responsible for managing the complexity of cloud platforms while still delivering high-quality software.

The Developer Cognitive Load Problem

Cognitive load is the way that we describe the amount of mental effort and information required to complete a task. In terms of software development, this can be the amount of information a developer needs to keep in their head to understand and work on a particular piece of code.

With the shift in DevOps to move the responsibility of managing infrastructure towards the developers, we are seeing an increase in the tasks and technologies that developers need to understand and manage.

The extra load comes from needing to understand not just the code, but how the code builds, where it runs, how it is monitored, how it is scaled, how it is secured, and how it is deployed. This can lead to developers spending more time managing the infrastructure than actually writing code, which can lead to burnout and a decrease in productivity.

So, what do we do about this and how can we help shift back the balance towards developers doing what they do best, writing code?


Platform Engineering: A Strategic Response

Before we dive in too deeply, let’s first define what we mean by Platform Engineering.

Platform Engineering is the practice of empowering organisations to uplift their engineering teams by providing a platform for automating software delivery within their environment.

The key pillars of a platform are:

  • Standardisation of tooling – Integrated Development Environments (IDEs), CI/CD pipelines, IaC tools, etc.
  • Implementation of standardised IaC patterns – By providing standardised IaC patterns, the platform can ensure that the infrastructure is consistent and repeatable.
  • Repeatable automation of deployments – By providing a standardised deployment pipeline, the platform can ensure that deployments are repeatable and reliable.
  • Monitoring and Observability – Making monitoring and observability an integral part of the platform and open to not just the platform team but the development teams as well.
  • Integration with Security tooling and frameworks – Secure by default should be the aim, integrating security tooling and frameworks into the platform can help ensure that the applications are secure.

Platform engineering seeks to support all these pillars to enable the development teams to focus on what they do best while keeping the operational environments manageable, secure and scalable.

So, now that we have defined the goals of Platform Engineering, how do we go about actually delivering these goals?

Building Developer-Centric Platforms

As a starting point, we need to understand the needs of the developers and the business they are providing value to. The goal of Platform Engineering is to focus on the developers as the primary customer and the platform as the means to deliver value to the business.

So, this means we need to ensure that the platform is scoped to the technologies and tools that the developers are using, and that the platform is designed to be easy to use and understand.

But this can get complicated quickly, with a large array of systems and tools to try and integrate into a single platform. As an example, let’s have a look at what might be required for a fairly simple web development team:

    • Source Control
    • Continuous Integration and Deployment (CI/CD)
    • Infrastructure as Code (IaC)
    • Monitoring and Observability
    • Security Scanning and Compliance
    • Developer tooling
    • Cloud Platform
    • Feature Management


Minimal Internal Development Platform

So, let’s start out with an example minimal internal development platform.

  • Source Control: GitHub
  • CI/CD: GitHub Actions
  • IaC: Bicep
  • Monitoring and Observability: Azure Monitor

With these tools, you can start to build your platform.

Starting with source control and CI/CD, you can create template repositories that developers can use to get started quickly. Include bare bones CI/CD pipelines that give guidelines to developers but still give them the flexibility to customise.

For IaC, again start with a simple template, perhaps for one team initially; work with them to understand how they use it and what can live in a central repository versus what needs to be customized per team.

Monitoring and Observability can be a bit harder, but focus on the simple things first, like making sure the apps are up, monitoring the load and response times, and then getting both the platform team and development teams to work together to consume the data and make decisions.
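To make "are the apps up, and how fast do they respond?" concrete, here is a minimal Python sketch of the kind of probe a platform team might start with. The function names, thresholds, and status labels are illustrative, not a prescription for any particular monitoring stack:

```python
import time
import urllib.request

def check_endpoint(url, timeout=5.0):
    """Probe a URL; return (is_up, response_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # urlopen raises on HTTP errors, so reaching here means success
            return True, time.monotonic() - start
    except Exception:
        return False, time.monotonic() - start

def classify(is_up, elapsed, slow_threshold=1.0):
    """Turn a raw probe result into a simple health status."""
    if not is_up:
        return "down"
    return "slow" if elapsed > slow_threshold else "healthy"
```

A real platform would delegate this to a tool like Azure Monitor rather than a hand-rolled script, but the point stands: answer the simple questions first, then share the results with the development teams.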

Developer Experience on the Platform

So, now that we have a basic platform in place, how does that work towards improving the Developer Experience? At its core, the platform should focus on reducing the time to first deployment.

This means providing developers with template repositories that include pre-configured CI/CD pipelines, infrastructure definitions, and monitoring setup. Instead of spending days setting up their development environment, developers can be deploying their first application within hours.

From there, we focus on reducing cognitive load through self-service capabilities. This means developers can provision resources and deploy applications without needing to understand the complexities of Kubernetes configurations or cloud infrastructure specifics.

For example, a developer should be able to deploy a new web application by simply pushing their code to a repository, with the platform handling the creation of the necessary cloud resources, security configurations, and deployment pipelines.

The platform should also provide rapid feedback loops to developers. This includes not just basic application metrics like response times and error rates, but also deployment success rates, security scan results, and cost implications of their infrastructure choices.

Making this information readily available through dashboards or integrated into their development tools helps developers make informed decisions quickly.

To measure the platform’s impact on developer productivity, we need to look at concrete metrics:

  • Performance metrics (deployment frequency, success rates)
  • Team efficiency metrics (time spent on tasks)
  • User satisfaction metrics (surveys, support tickets)

We should regularly review these metrics with development teams to identify pain points and areas for improvement.
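As a hedged illustration of what such a review might look at, the snapshot below computes a deployment success rate and a before/after comparison of time to first deployment. All the figures and names here are invented for the example:

```python
# Hypothetical platform-impact snapshot; all figures are invented.
deployments = {"succeeded": 47, "failed": 3}   # last month's deploys
setup_hours_before = [16, 20, 12, 18]  # per-team hours to first deploy, pre-platform
setup_hours_after = [2, 3, 1, 2]       # same teams, using the template repos

success_rate = deployments["succeeded"] / sum(deployments.values())
avg_before = sum(setup_hours_before) / len(setup_hours_before)
avg_after = sum(setup_hours_after) / len(setup_hours_after)

print(f"Deployment success rate: {success_rate:.0%}")
print(f"Time to first deploy: {avg_before:.1f}h -> {avg_after:.1f}h")
```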

Are developers still struggling with certain aspects of the platform? Are there common requests that could be automated? This continuous feedback loop is crucial for platform evolution.

Remember, not one size will fit all teams or organizations. Start small with a core set of features, work collaboratively with development teams to understand their needs, and build out the platform incrementally based on real usage patterns and feedback.

So What Does This Mean For DevOps Teams?

Throughout this article we have been talking about Platform Engineering and Developer Experience, but what does this mean for the DevOps teams?

Essentially, DevOps was never meant to be a team, but a set of practices and principles that should be adopted by all teams.

The engineering roles that have been brought together under DevOps such as Site Reliability Engineers (SRE) and Cloud Engineers will still be needed, but their focus will be on managing and maintaining the infrastructure that the platform is built on.

As I see it, the scope of each team is as follows:

Cloud Engineers – Responsible for the underlying cloud infrastructure and services that the platform is built on. This includes the networking and security that serve not just the platform but the entire business.

SREs – Responsible for the reliability and availability of the platform. This includes monitoring and alerting, incident response, and capacity planning. They are the first-line responders, able to quickly grasp a situation and empowered to reach out to the teams required to resolve the issue.

Platform Engineers – Responsible for the platform itself. This includes the CI/CD pipelines, IaC templates, monitoring and observability, and security tooling. They work with the development teams to understand their needs and build out the platform to meet them.

While they all might still be brought together under the name of a DevOps team, I think it is important that we recognize the different roles and responsibilities that these engineering disciplines have.

Conclusion and Future Outlook

While we have been leveraging and using DevOps for a while I think it is now time we look to its next evolution.

Platform Engineering lets us focus on providing efficiency and value to the development teams, which in turn provides value to the business.

We should also be looking at platforms as ways to help us measure the developer experience of our teams, by providing metrics and feedback loops we can start to understand where the platform is working and where it is not.

Overall, just like with DevOps there will be lots of ways to look at Platform Engineering, but the key is to keep the developers at the centre of the conversation and ensure that the platform is providing value to them.

So, I encourage you all to start thinking about platforms, even from just the simplest templates to how you could build out an internal development platform, but remember, start small and iterate often.

Platform Engineering & Developer Cognitive Load FAQ

What is the developer cognitive load problem?

Developer cognitive load is the mental effort and volume of information developers must juggle—code, builds, infrastructure, monitoring, scaling, security, and deployment. DevOps practices often shifted infrastructure complexity onto developers, which can lead to burnout and reduced productivity.

What is Platform Engineering?

Platform Engineering is the practice of creating and maintaining a self-service platform that automates software delivery. It includes standardized tooling, Infrastructure as Code patterns, CI/CD pipelines, observability, and security integrations.

How does Platform Engineering reduce cognitive load?

A platform streamlines workflows by providing template repositories, pre-configured CI/CD pipelines, infrastructure and monitoring templates, and self-service deployment tools. This reduces the need for developers to understand every detail of the infrastructure stack.

What are the key pillars of a developer-centric platform?

  • Standardized tooling (IDEs, CI/CD, IaC)
  • Repeatable automation of deployments
  • Built-in monitoring and observability
  • Secure-by-default integrations

How should organizations start building an internal development platform?

Start with a minimal viable stack (e.g., GitHub, GitHub Actions, IaC templates, and monitoring). Provide template repositories to help developers deploy their first applications quickly. Expand incrementally based on developer feedback.

How does Platform Engineering improve developer experience?

It shortens time-to-first-deployment, automates infrastructure setup, enables resource provisioning at the push of a button, and provides rapid feedback loops via integrated dashboards. This frees developers to focus on coding rather than operations.

How can teams measure the success of Platform Engineering?

  • Deployment frequency and success rates
  • Time spent on tasks before vs. after platform adoption
  • Developer satisfaction metrics (e.g., surveys, support tickets)

How does Platform Engineering impact DevOps and related roles?

Platform Engineering doesn’t replace DevOps—it complements it. Cloud Engineers continue managing infrastructure, SREs focus on reliability, and Platform Engineers build and evolve the platform while collaborating closely with developers.

What is the main takeaway from this approach?

The next evolution after DevOps is treating platforms as products. Start small, iterate often, keep developers at the center, and incorporate feedback continuously to ensure the platform delivers real value.

The post Reducing Developer Cognitive Load with Platform Engineering appeared first on DevOps Conference & Camps.

]]>
Shipping Daily: From Sprints to Continuous Releases https://devopscon.io/blog/daily-release-planning-execution-monitoring/ Wed, 23 Apr 2025 07:59:44 +0000 https://devopscon.io/?p=208110 DevOps Teams that achieve daily releases have mastered a unique set of skills and practices to ship software faster and more frequently, with higher confidence. This high frequency release model differs significantly from the traditional Scrum framework with 2-week sprints (or longer). In this article, we’ll dive into the daily routines, processes and tools that support these teams, while contrasting them with the more familiar cadence of traditional Scrum teams. For teams and organizations looking to move towards daily releases, we’ll also cover the key adjustments required to turn this vision into reality.

The post Shipping Daily: From Sprints to Continuous Releases appeared first on DevOps Conference & Camps.

]]>
The Daily Rhythm: Planning, Execution, and Monitoring

1. Planning

For teams delivering code daily, the rhythm of planning, execution and monitoring does not follow the two-week sprint cycle but happens continuously. Here is what this daily rhythm looks like:

  • Frequent Prioritization: Daily release teams prioritize their work each day, selecting high impact tasks that can be completed and shipped within a single day.
  • Dynamic Backlogs: Instead of working with a static sprint backlog derived from a mammoth product backlog, these teams operate with highly flexible backlogs, adding to them every day. They are ready to pivot quickly in response to customer feedback, urgent issues, or new business opportunities.
  • Smaller Targeted Tasks: Work items are broken into small, manageable pieces – each designed to be completed within hours. User stories and tasks are refined to be achievable in less than a day, keeping workloads manageable and ensuring that work completed aligns with daily release goals.

2. Execution

Unlike Scrum teams that often release at the end of a sprint, daily release teams execute work with a focus on immediate delivery.

  • Incremental Work: Instead of waiting until the end of a sprint, developers push small, frequent changes every day. Every code change is designed to be testable and deployable at the end of the day.
  • Automated Testing: Automation is critical to daily releases. CI pipelines run tests on every code change, ensuring stability, reliability, and production readiness.
  • Seamless Deployment: CD pipelines are in place, so that the code – once tested – is deployed automatically to production. With daily releases, teams cannot afford to spend hours on deployment activity every single day – so it is imperative to automate it.


3. Monitoring

  • Automated Monitoring: Monitoring tools like Datadog, New Relic, or Prometheus track deployment success, application performance, error rates, and system health in real time. These tools are crucial for catching issues early and preventing them from impacting users.
  • Daily Retrospective Feedback Loops: Instead of waiting until the end of a sprint, the team reviews their daily progress and identifies immediate improvements – leading to quick adjustments.

Normal Scrum vs Daily Release: Key Differences

While both Scrum and daily release teams follow agile principles, there are notable differences in the processes, timelines and focus areas. Here is a closer look at how the two approaches differ:

Aspect | Traditional Scrum | Daily Releases
Planning Frequency | Sprint planning at the beginning of each 2-week sprint cycle | Planning and prioritization happen daily
Work Cadence | Features delivered at the end of each sprint | Small, incremental features or fixes delivered every day
Testing | Manual testing followed by automated tests | Fully automated testing integrated with CI/CD pipelines
Deployment | Deployed at the end of the sprint or later | Automated deployments happen multiple times a day
Feedback Loops | Feedback gathered after every sprint | Continuous feedback integrated with daily review
Responsiveness to Change | Responds to change every 2 weeks in planning | Responds to change daily

Practices to Support Daily Releases

Here are a few practices that help support the goal of daily releases:

  • Real-Time Code Reviews: Code reviews are conducted in real time. Teams use tools like GitHub, GitLab, or Bitbucket to review code in small increments, making it easier to spot errors and push fixes promptly.
  • Cross-Functional Collaboration: Developers, testers and DevOps professionals work together throughout the day, minimizing dependencies and addressing blockers as they arise.
  • Minimally Viable Features: Work is broken down into the smallest possible increments. Rather than delivering a complete feature, teams focus on delivering MVPs that are functional and add value, with enhancements to follow in future releases.
  • Emphasis on Automation: It goes without saying that daily releases are nearly impossible to achieve without automation. Automated testing – unit tests, integration tests and end-to-end tests – ensure that the new code does not break existing functionality. Tools like Selenium, Cypress and Jest are useful in this regard.
  • CI/CD Pipelines: A robust continuous integration and delivery pipeline is essential to ensure that each code change is tested and ready for deployment within hours – sometimes minutes! Jenkins, GitLab CI/CD, CircleCI, or Travis CI enable automated code integration, testing & deployment.
  • Real Time Communication & Collaboration: Teams cannot resort to one catch-up a day like in traditional scrum. They have to be constantly in sync and keep work moving ahead without glitches. Teams often rely on chat tools like Slack, or Microsoft Teams, where real-time discussions and quick problem-solving can occur. Continuous collaboration helps tackle blockers in the moment and keeps everyone aligned.
  • Daily Stand-Ups with Action Items: Stand-ups are action-oriented, focusing on quick issue resolution rather than long discussions.
  • Continuous Improvement Mindset: While daily release teams may not hold formal sprint retrospectives, they keep a close eye on process improvements. Adjustments to processes, tools or work structures are made whenever necessary.

Key Metrics to Track

Several metrics are needed to ensure that the daily release process remains effective and high-quality. Here are a few examples of such metrics:

  • Deployment Frequency: Tracks how often changes are deployed in production.
  • Lead Time for Changes: Measures the time it takes from code commit to release.
  • Change Failure Rate: Tracks the percentage of changes that lead to incidents or rollback.
  • Mean Time to Restore (MTTR): Measures how quickly the team can recover from an issue.
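These four metrics can be computed from nothing more than a log of deployment events. The sketch below shows one way to derive them; the records and field layout are illustrative, not a specific tool's schema:

```python
from datetime import datetime, timedelta

# Each record: (commit_time, deploy_time, caused_incident, restored_time)
# -- illustrative data, not from a real pipeline.
deploys = [
    (datetime(2025, 4, 1, 9),  datetime(2025, 4, 1, 11), False, None),
    (datetime(2025, 4, 1, 13), datetime(2025, 4, 1, 14), True,  datetime(2025, 4, 1, 15)),
    (datetime(2025, 4, 2, 10), datetime(2025, 4, 2, 12), False, None),
    (datetime(2025, 4, 3, 9),  datetime(2025, 4, 3, 10), False, None),
]
days_observed = 3

deployment_frequency = len(deploys) / days_observed              # deploys per day
lead_times = [deployed - committed for committed, deployed, _, _ in deploys]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)   # commit -> release
change_failure_rate = sum(1 for _, _, failed, _ in deploys if failed) / len(deploys)
restores = [restored - deployed for _, deployed, failed, restored in deploys
            if failed and restored]
mttr = sum(restores, timedelta()) / len(restores)                # mean time to restore
```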

When are Daily Releases Appropriate?

Having explained the processes for daily releases, it is important to note that daily releases may not be suitable for every team or product.

Here are some scenarios where a daily release model can be highly effective and beneficial:

Fast-Paced Startups:

For startups or small product teams operating in dynamic environments, daily releases allow rapid iterations based on customer feedback, enabling swift pivots without waiting for lengthy development cycles.

Customer Facing Cloud Applications:

In products where user experience is crucial, like e-commerce or social platforms, daily releases can help deliver new features, bug fixes, and improvements quickly, ensuring a competitive edge.

Products Requiring High Competitiveness:

For applications that need frequent adjustments, such as marketing analytics or SaaS tools, daily releases allow agile responses to emerging needs, bugs, and requests.

Benefits of Daily Releases

Adopting daily releases comes with significant benefits that can transform software delivery and enhance customer experience. Here are a few key advantages:

  • Increased Customer Satisfaction: By releasing updates & fixes daily, teams can respond to user feedback immediately. This responsiveness creates a customer-centric environment, showing users that their needs are heard and met quickly.
  • Reduced Risk in Deployments: Smaller, more frequent deployments reduce the risk of introducing major errors. Each change is smaller, more isolated, and therefore easier to monitor, fix or roll back in case of issues.
  • Faster Time to Market: Daily releases allow new features, improvements and fixes to reach users almost immediately. This rapid time-to-market ensures teams stay competitive by keeping up with or even outpacing industry trends.
  • Enhanced Team Productivity: Daily releases encourage a culture of constant delivery and iterative progress. This model promotes focus and encourages efficiency as teams align around the goal of shipping daily.
  • Agility in Adapting to Changes: Teams that release daily become more adept at responding to shifts in user expectations, industry standards or internal priorities, fostering a truly agile mindset.
  • Increased Transparency & Continuous Feedback: With daily releases, teams receive real-time feedback enabling quick adjustments. This continuous feedback loop supports agile processes, helping teams remain in sync with user needs.

Conclusion

Daily releases offer agile teams a powerful way to deliver continuous value to customers. By shifting their approach to planning, execution and monitoring, those teams can maintain high quality and speed. Though it requires a high degree of automation, collaboration and flexibility, transitioning to daily releases can bring impressive gains in responsiveness and customer satisfaction.

For organizations ready to embrace this approach, adapting their Scrum practices to support daily releases could be the next step in their agile journey.

Daily Release — Planning, Execution & Monitoring FAQ

What defines a daily release rhythm?

Instead of two-week sprint cycles, daily release teams continuously plan, execute, and monitor their work. The cycle repeats every day with rapid feedback and iteration.

How do planning practices differ for daily release teams?

  • Frequent prioritization: Teams pick high-impact tasks that can be completed and released within a single day.
  • Dynamic backlogs: Items are added, reprioritized, and adjusted daily based on feedback or new requirements.
  • Smaller, targeted tasks: Work is broken into granular tasks small enough to finish within hours—and suitable for a daily cadence.

What does execution look like under a daily release model?

  • Incremental work: Developers push small, deployable changes every day—no waiting for a sprint-end release.
  • Automated testing: CI pipelines automatically run tests on each change to maintain stability and production readiness.
  • Seamless deployment: CD pipelines handle automatic deployment, minimizing manual effort and accelerating delivery.

How are releases monitored on a daily schedule?

  • Automated monitoring: Tools like Datadog, New Relic, or Prometheus provide real-time visibility into deployment success, performance, and error rates.
  • Daily retrospective feedback: Teams review progress daily, quickly identifying improvements and adapting the process as needed.

How does daily release differ from traditional Scrum?

Aspect | Traditional Scrum | Daily Releases
Planning Frequency | Sprint planning every 2 weeks | Daily prioritization and planning
Work Cadence | Features delivered at sprint end | Small, incremental improvements every day
Testing | Manual followed by automated tests | Fully integrated automated testing via CI
Deployment | At sprint end or in batches | Automated, daily deployments via CD
Feedback Loop | After sprint completion | Continuous, daily feedback integration

What practices support daily release success?

  • Real-time code reviews using GitHub, GitLab, or Bitbucket
  • Cross-functional collaboration (Dev, QA, DevOps working closely throughout the day)
  • Delivering minimally viable features for quick value delivery
  • Heavy reliance on automation—testing, CI/CD pipelines, deployments
  • Continuous communication via Slack, Teams, or similar tools
  • Short, action-oriented daily stand-ups and process improvement sessions

Which metrics should teams track to measure effectiveness?

  • Deployment frequency (how often code reaches production)
  • Lead time for changes (from commit to release)
  • Change failure rate (how often deployments cause incidents or rollbacks)
  • Mean time to restore (MTTR) following failures

When are daily releases most appropriate?

  • Fast-paced startups needing rapid iteration and responsiveness
  • Customer-facing cloud applications—e.g., e-commerce or social platforms—requiring frequent updates
  • Competitive SaaS or analytics products needing agility to react to user feedback or market shifts

What are the benefits of adopting daily releases?

  • Increased customer satisfaction via immediate updates and responsiveness
  • Reduced deployment risk thanks to smaller, isolated changes
  • Faster time to market for features and fixes
  • Enhanced team productivity through consistent delivery rhythm
  • Greater agility and adaptability to change
  • Improved transparency and feedback loops for continuous improvement

The post Shipping Daily: From Sprints to Continuous Releases appeared first on DevOps Conference & Camps.

]]>
Quality Assurance: Preparing for a Seamless Future https://devopscon.io/blog/quality-assurance-preparing-for-a-seamless-future/ Wed, 05 Mar 2025 12:07:25 +0000 https://devopscon.io/?p=207774 How can quality assurance teams prepare for complexities of the future? Let us explore advanced testing strategies, practical automation adoption, and agile collaborative practices that can future-proof your QA process, including risk-based testing, AI-powered testing, and benefits of continuous learning and upskilling.

The post Quality Assurance: Preparing for a Seamless Future appeared first on DevOps Conference & Camps.

]]>
As technology accelerates, software systems are becoming more complex, interconnected, and integral to daily life. With this evolution, the role of Quality Assurance (QA) has shifted from gatekeeping to a collaborative, strategic, and deeply integrated discipline. Quality Assurance teams need to be more agile, tech-savvy, and forward-thinking to ensure that software meets current expectations and anticipates future challenges.


1. Adopt Risk-Based Testing

Risk-based testing is all about focusing on what matters most in modern software systems. Not all features or functionalities are equally critical. Some areas, such as payment gateways or user authentication systems, have a higher risk of failure or greater impact on users compared to other areas like image search.

Risk-based testing is a well-known concept, but teams need to adopt it now more than ever. By identifying and prioritizing high risk areas, Quality Assurance teams can allocate their time and resources more efficiently, ensuring that critical features receive the attention they deserve without wasting effort on low-risk elements.

Practical Tips for Quality Assurance Teams:

  • Conduct a risk analysis at the start of each sprint by collaborating with developers and product owners.
  • Maintain a risk matrix to identify features or modules prone to defects.
  • Look at historical data to find risk and defect-prone components in the system.
  • Prioritize testing efforts on crucial workflows, recently changed code, and areas with high user impact.
  • Refer to tools like TestRail for managing risk-based test plans and Xray for tracking risk alongside test cases.
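To illustrate the risk-matrix idea, the sketch below scores each feature by likelihood times impact and ranks the result for testing. The features and scores are invented for the example; in practice the likelihood column would come from your historical defect data:

```python
# Hypothetical risk matrix: likelihood and impact scored 1 (low) to 5 (high).
features = {
    "payment gateway":     (4, 5),
    "user authentication": (3, 5),
    "image search":        (2, 2),
    "profile settings":    (1, 3),
}

def risk_score(likelihood, impact):
    return likelihood * impact

# Test the highest-risk areas first.
ranked = sorted(features, key=lambda f: risk_score(*features[f]), reverse=True)
for name in ranked:
    print(name, risk_score(*features[name]))
```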

2. Leverage AI-Powered Testing in Quality Assurance

AI-powered testing uses artificial intelligence and machine learning to revolutionize how QA teams operate. By automating routine tasks, predicting areas of risk, and analyzing vast amounts of test data, AI can help teams work smarter, not harder.

For example – AI can automatically identify flaky tests, suggest optimized test paths, and help simulate user behavior more realistically.
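Flaky-test detection in particular needs no machine learning to get started. A simple heuristic, flagging any test that both passed and failed on the same commit, already catches many cases. The sketch below is a hypothetical illustration (test names and data are invented):

```python
from collections import defaultdict

# Test runs recorded as (test_name, commit_sha, passed) -- invented data.
runs = [
    ("test_checkout", "abc1", True),
    ("test_checkout", "abc1", False),  # same commit, different outcome: flaky
    ("test_checkout", "abc1", True),
    ("test_login",    "abc1", True),
    ("test_login",    "def2", False),  # failed, but the code changed: not flaky
]

def find_flaky(runs):
    """Flag tests that both passed and failed on a single commit."""
    outcomes = defaultdict(set)
    for name, sha, passed in runs:
        outcomes[(name, sha)].add(passed)
    return sorted({name for (name, _), seen in outcomes.items() if len(seen) == 2})
```

Commercial AI tools go further, but running a heuristic like this against your CI history is a cheap way to see whether the investment is worth it.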

Practical Tips for Quality Assurance Teams:

  • Use AI tools to identify redundant or low-value test cases and focus on high priority scenarios.
  • Implement AI-based visual testing to validate UI across multiple devices and resolutions.
  • Analyze logs using AI tools to detect patterns and predict potential failures.
  • Implement tools like Applitools for visual testing, Sahi-Pro for AI-driven test automation, and Testim for dynamic and scalable test creation.

3. Integrate Quality Assurance into the Development Lifecycle

In a fast-paced agile environment, QA is no longer a standalone phase at the end of the development cycle. Integrating QA into every step of the lifecycle ensures that quality is baked into the product from the start. This proactive approach reduces the cost and time of fixing defects late in the cycle and fosters a culture of shared responsibility.

Practical Tips for Quality Assurance Teams:

  • Explore ways to shift-left testing earlier in the lifecycle in various ways.
  • Implement Test Driven Development (TDD) or Behavior Driven Development (BDD) to ensure tests are written alongside code (and not after).
  • Automate unit, API, and integration tests within your CI/CD pipelines.
  • Encourage developers and testers to collaborate on code reviews and test case reviews, identifying potential issues early.
  • Explore tools like Jenkins for CI/CD integration, Cucumber for BDD, and SonarQube for static code analysis.

4. Foster Collaboration Across Teams – With Testers at the Forefront

Quality Assurance is a collective responsibility and the entire team needs to understand this principle. Developers, product owners, and even business stakeholders must collaborate to deliver high quality software. This collaboration will ensure that everyone has a shared understanding of quality goals.

Testers can no longer afford to work in silos or restrict themselves to a single specialization or way of working. They need a multifaceted skill set: technical skills such as functional, automation, and performance testing; organizational skills such as managing work in Jira or Asana; and soft skills such as clear communication, good writing, and the ability to explain concepts.

With these skills, testers become an inherent and irreplaceable part of the teams and help design quality into the product.

Practical Tips for Quality Assurance Teams:

  • Testers must master more than one skill and multiple tools for testing.
  • Implement Jira for collaboration, backlog, and testing management.
  • Use tools like Confluence and Notion for sharing test strategies and documentation.
  • Host cross-functional bug-bash sessions to identify issues as a team.
  • Use dashboards to maintain visibility of test plans, results, and metrics.

5. Prioritize Continuous Learning and Upskilling

The QA landscape is continuously evolving with new tools, technologies, and methodologies emerging every year. To remain relevant and effective, QA teams must commit to lifelong learning. This isn’t just about keeping up – it is about staying ahead, whether that means mastering new automation frameworks and tools, understanding advanced cloud environments or exploring emerging fields like AI-driven testing.

Practical Tips for Quality Assurance Teams:

  • Invest in training your QA teams in modern tools as per your needs and context.
  • Encourage certifications in areas like cloud-testing, AI-based testing, and/or performance and security testing.
  • Set up internal knowledge-sharing sessions for QA guilds to discuss challenges and solutions.
  • Send your QA teams to industry events such as online webinars, in-person meetups, and conferences for networking and exposure.
  • Utilize resources like Udemy or Coursera for online courses and the Ministry of Testing for QA-specific training and events.

Conclusion

Preparing your Quality Assurance teams for the future isn’t just about adopting new tools – it is about a mindset shift towards proactive quality management. By focusing on risk, leveraging AI, integrating QA into development, fostering collaboration and upskilling regularly, QA teams can stay ahead of the curve and deliver robust, reliable software in an ever-evolving landscape.

The future of Quality Assurance is here – let’s embrace it together!

Quality Assurance — Preparing for a Seamless Future FAQ

What is the evolving role of QA in modern software delivery?

As systems become more complex and interconnected, QA has shifted from gatekeeping to a collaborative, strategic discipline. QA now plays a proactive, integrated role throughout the development lifecycle.

What is risk-based testing?

Risk-based testing involves focusing QA efforts on the most critical system areas—such as authentication or payment flows—rather than spreading effort equally. It’s about prioritizing testing where failures would have the biggest impact. Teams should maintain a risk matrix, use sprint-level risk analysis, and leverage historical defect data to guide testing priorities. Tools like TestRail or Xray can help manage and track risk-based test plans.

How can AI-powered testing enhance QA?

AI-driven tools can automate routine tasks, predict risk areas, identify flaky or redundant tests, simulate realistic user behaviors, and provide visual UI validation across devices. Tools like Applitools (visual testing), Sahi-Pro (AI-driven automation), and Testim (dynamic test creation) illustrate this capability.

Why integrate QA into the development lifecycle (shift-left)?

Embedding QA earlier—via TDD/BDD, integrating tests into CI/CD, and collaborating during code reviews—improves quality and reduces the cost and effort of late-stage fixes. Tools like Jenkins, Cucumber, and SonarQube can support these practices.

How does cross-functional collaboration empower QA?

QA must work closely—with developers, product owners, and stakeholders—to share quality goals. Testers should be versatile, mastering both technical (automation, performance testing) and soft skills. Tools like Jira, Confluence, and Notion help in collaboration, documentation, and visibility. Cross-functional bug bashes and dashboards enhance team awareness.

Why is continuous learning important for QA teams?

Technology evolves rapidly, and QA must keep pace. Teams should invest in training for modern tools (especially cloud- and AI-centered testing), encourage certifications, conduct internal knowledge-sharing, and attend industry events or courses (e.g., Udemy, Coursera, Ministry of Testing).

What is the core recommendation from the article?

Future-ready QA requires a mindset shift toward proactive, integrated quality. Focus on risk-based testing, AI-powered tools, integration into development, strong collaboration, and continuous upskilling to stay effective.

The post Quality Assurance: Preparing for a Seamless Future appeared first on DevOps Conference & Camps.

]]>