dbt Core vs dbt Platform: Modern Data Transformation with dbt (Part 1)
https://cevo.com.au/post/dbt-core-vs-dbt-platform-modern-data-transformation-with-dbt-part-1/ – Thu, 12 Feb 2026

dbt Core and the dbt Platform use the same transformation engine but differ in deployment, governance, and operational approach. This blog explores where dbt fits in the modern data stack, the strengths of each option, and how teams can choose the right solution based on size, maturity, and workflow requirements.

The post dbt Core vs dbt Platform: Modern Data Transformation with dbt (Part 1) appeared first on Cevo.


TL;DR:  

dbt Core and the dbt Platform use the same transformation engine, but differ significantly in how they are deployed, governed, and operated, making the right choice dependent on organisational maturity rather than technical capability. 


Introduction 

As cloud data platforms mature, many organisations are adopting architectures where data is loaded first and transformed directly within the warehouse, rather than relying on external ETL engines. dbt fits naturally into this shift by providing a structured, SQL‑based framework for managing analytics transformations. 

When teams adopt dbt, one of the earliest decisions is whether to use dbt Core, the open‑source command‑line framework, or the dbt Platform, a managed service that adds operational and collaboration capabilities. While both use the same transformation engine, they differ in how they are deployed, governed, and operated. 

This blog compares dbt Core and the dbt Platform in the context of modern cloud data transformation, covering where dbt fits in the data stack, how each option is typically run, and the types of teams and environments where each approach is most effective. 

 

dbt in the Context of the Modern Analytics Platform 

In ELT architectures, data is loaded first and transformed directly within the warehouse rather than through external ETL engines. This approach leverages the scalability, performance, and security of modern warehouses such as Snowflake, BigQuery, and Databricks, and has driven broad adoption of warehouse‑native transformation patterns (see dbt’s official case studies and customer list). 

Within this architecture, dbt operates in the transformation layer. While the warehouse executes the SQL, dbt provides the structure required to manage transformation logic at scale, ensuring that raw data can be reliably shaped into analytics‑ready datasets. 


dbt operating within the transformation layer of a modern data platform – Image Credits: dbt

Specifically, dbt enables teams to: 

  • Define explicit dependencies between models using a directed acyclic graph (DAG) 
  • Execute transformations natively inside the warehouse, leveraging its compute and security 
  • Apply automated tests (e.g. not‑null, uniqueness, relationships) to validate data quality 
  • Generate documentation and lineage directly from transformation metadata 
  • Adopt software engineering practices such as modular code, version control, code review, and CI/CD for analytics workloads 
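The first capability above, ordering models via a DAG, can be illustrated with a short sketch. This is not dbt's actual implementation, and the model names are purely hypothetical; it simply shows how declared dependencies (the edges dbt infers from `ref()` calls) determine a valid execution order, the way `dbt run` builds upstream models before downstream ones:

```python
from graphlib import TopologicalSorter

# Hypothetical model dependencies: each model maps to the set of models it
# selects from, much like dbt infers edges from ref() calls.
model_deps = {
    "stg_orders": set(),                             # staging model, no upstream models
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},   # fact model depends on both staging models
    "rpt_revenue": {"fct_orders"},                   # report depends on the fact model
}

# static_order() yields an order in which every model's dependencies are
# built before the model itself -- the essence of a dbt DAG run.
run_order = list(TopologicalSorter(model_deps).static_order())
print(run_order)
```

Both staging models appear before `fct_orders`, which appears before `rpt_revenue`, regardless of how the dictionary is written down.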

 

dbt works alongside, rather than replacing, other tools in the data platform ecosystem: 


dbt in the end‑to‑end analytics workflow – Image Credits: dbt

  • Ingestion tools load raw data into the warehouse 
  • The warehouse stores and processes raw and transformed data 
  • dbt performs in‑platform transformation, modelling, and testing 
  • Downstream BI, ML, and AI tools consume curated, analytics‑ready models 

 

In this way, dbt acts as the connective layer between upstream ingestion and downstream consumption, helping teams operate from a consistent, well‑governed analytics foundation. 

 

Running dbt with dbt Core 

dbt Core is the open-source, command-line distribution of dbt. It is typically used as a developer-driven framework, where teams install dbt locally, write models in their preferred editor, and execute dbt commands via the CLI. 

In organisations using dbt Core, dbt is usually integrated into an existing engineering ecosystem rather than operated as a standalone service. Execution is commonly triggered through external schedulers or orchestration tools, and CI workflows are implemented using the organisation’s standard automation platforms. In secure or tightly governed environments, dbt Core is sometimes run as an on‑demand, containerised workload. 

See our On‑Demand dbt Execution blog for a more detailed discussion. 

dbt Core provides the transformation engine itself — compiling SQL models, managing dependencies, and executing transformations inside the data warehouse. Everything surrounding that engine is implemented and maintained by the team, including: 

  • Job scheduling and orchestration 
  • Secrets and credentials management 
  • Environment separation and deployment conventions 
  • CI workflows and validation checks 
  • Documentation hosting and operational monitoring 

 

As a result, dbt Core is often embedded into broader platform or data engineering workflows, giving teams full control over how dbt is executed, deployed, and observed.  

 

What Is the dbt Platform? 


High-level view of the dbt Platform and its integration points – Image Credits: dbt 

The dbt Platform is a hosted environment designed to manage dbt projects at scale. It provides a central place to develop, test, schedule, document, and observe dbt runs, with built‑in support for CI/CD, environment management, and team collaboration. 

Importantly, the dbt Platform does not change how data is transformed—that remains the responsibility of dbt Core. Instead, it changes how dbt is adopted, operated, and governed as it becomes a shared, production‑critical capability across teams. 

 

Key Capabilities of the dbt Platform 

The dbt Platform adds managed capabilities around execution, collaboration, and observability, including: 

 

  • Browser‑based development environments – A web‑based IDE (Studio IDE) that allows teams to develop, test, and run dbt models without installing dbt locally. 
  • Managed job scheduling and execution – Built‑in scheduling and execution of dbt runs, supporting time‑based schedules and API or webhook triggers. 
  • Integrated CI for analytics engineering – Native CI that automatically runs dbt builds and tests on pull requests using temporary schemas. 
  • First‑class environment management – Explicit support for separate development, staging, and production environments. 
  • Hosted documentation and lineage – Automatic generation and hosting of dbt documentation and lineage with access control. 
  • Run history, monitoring, and alerts – Visibility into past runs, execution timings, and configurable success or failure notifications. 
  • Native Git integrations – Built‑in integration with GitHub, GitLab, and Azure DevOps to support standard branching and pull‑request workflows. 

These capabilities significantly reduce the operational effort required to run dbt reliably in production, particularly in organisations where analytics engineering spans multiple teams or includes non‑engineering contributors. 
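As one concrete example of the API-trigger capability, the dbt Platform exposes a jobs API that CI or orchestration tooling can call to start a scheduled job. The sketch below only constructs the request; the host, v2 endpoint path, payload keys, and the account and job IDs are assumptions based on the commonly documented pattern, so verify them against the current dbt API reference before use:

```python
# Sketch: building (not sending) a dbt Platform job-trigger request.
# ASSUMPTIONS: host, endpoint path, and payload shape follow the commonly
# documented v2 jobs API pattern; account and job IDs are hypothetical.
ACCOUNT_ID = 12345
JOB_ID = 67890

url = (
    f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}"
    f"/jobs/{JOB_ID}/run/"
)
payload = {"cause": "Triggered via API from CI"}       # human-readable trigger reason
headers = {"Authorization": "Token <service-token>"}   # placeholder service token

print(url)
```

In practice the `url`, `payload`, and `headers` would be passed to an HTTP client as a POST request from a webhook handler or CI step.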

 

Choosing Between dbt Core and the dbt Platform 

Teams typically move from dbt Core to the dbt Platform for operational reasons rather than functional gaps. The underlying transformation logic remains the same; the difference lies in how dbt is deployed, governed, and operated as it becomes a shared, production‑critical capability across teams. 

At a high level, dbt Core offers maximum flexibility and control but requires teams to build and maintain supporting infrastructure. The dbt Platform reduces this operational overhead by providing managed capabilities for execution, environments, CI/CD, documentation, and observability. 

Operational Fit: dbt Core vs dbt Platform 

  • Team size & structure – dbt Core: small, engineering‑led teams; dbt Platform: multiple teams contributing to models. 
  • Operational maturity – dbt Core: strong in‑house DevOps and platform capability; dbt Platform: limited appetite to build and maintain infrastructure. 
  • CI/CD & execution – dbt Core: custom CI, orchestration, and monitoring; dbt Platform: built‑in CI, managed scheduling, run history. 
  • Environment management – dbt Core: scripted or convention‑based; dbt Platform: first‑class dev, staging, and prod. 
  • Governance & access – dbt Core: team‑managed; dbt Platform: centralised governance and access control. 
  • Documentation & lineage – dbt Core: generated and hosted internally; dbt Platform: automatically hosted and broadly accessible. 
  • Control vs convenience – dbt Core: full control prioritised; dbt Platform: consistency and ease of operation prioritised. 

In practice, many teams start with dbt Core and adopt the dbt Platform as onboarding friction increases, CI/CD becomes harder to manage, or operational risk grows. The choice should be treated as an architectural decision, guided by team structure, governance needs, and long‑term maintainability rather than by transformation features alone. 

 

Final Thoughts: Making dbt an Architectural Capability 

dbt has become a cornerstone of modern data transformation because it brings engineering discipline to analytics workflows — modular code, testing, documentation, versioning, and reproducibility. Whether you choose dbt Core or the dbt Platform depends on how you want to operate that transformation layer. 

In our consulting engagements, we see the dbt Platform become most valuable when transformation is a shared responsibility across teams and when operational reliability is a requirement rather than an aspiration. Conversely, dbt Core remains a strong option when teams already maintain mature platform capabilities or want maximum control. 

Our recommendation is to treat this as an architectural decision, not a tooling decision. Evaluate it based on team structure, governance needs, operational maturity, and long-term maintainability. 

 

What’s Coming Next 

In my next blog – Modern Data Transformation with dbt (Part 2): How to Structure Your dbt Project for Scale – I’ll explore practical patterns for modelling, naming conventions, folder structures, and development workflows that help teams collaborate effectively and avoid common pitfalls as their dbt projects grow. 

Future posts in this series will also cover: 

  • Tests, documentation, and lineage in depth, 
  • CI/CD workflows tailored for analytics engineering, 
  • Orchestration patterns with Airflow, Dagster, and other tools, 
  • Common anti-patterns and how to avoid them. 

 

Explore more of our blogs, where we share practical perspectives on analytics engineering, data platforms, cloud architecture, and secure execution models for modern enterprises.

On-Demand dbt Execution: Rethinking Analytics Engineering in Secure Cloud Environments
https://cevo.com.au/post/on-demand-dbt-execution-rethinking-analytics-engineering-in-secure-cloud-environments/ – Mon, 09 Feb 2026

In secure enterprise cloud environments, traditional dbt deployment models can introduce unnecessary cost, security risk, and operational friction. This blog explores an on-demand, containerised dbt execution model that treats dbt as an ephemeral workload rather than a long-running service. Orchestrated with AWS MWAA and backed by ECS Fargate, the approach enables scalable, secure analytics transformations while improving cost efficiency, data quality observability, and CI/CD integration in modern enterprise data lakes.

The post On-Demand dbt Execution: Rethinking Analytics Engineering in Secure Cloud Environments  appeared first on Cevo.


TL;DR 

In secure enterprise cloud environments, traditional dbt deployment models often introduce unnecessary cost, security risk, or operational friction. This blog describes an on-demand, containerised dbt execution model, orchestrated via AWS MWAA and backed by ephemeral ECS tasks. By treating dbt as a workload rather than a service, analytics transformations scale efficiently without always-on infrastructure, improving cost control, security posture, data quality observability, and CI/CD integration, while supporting medallion architectures in modern enterprise data lakes. 


Introduction

Over the last few years, data platforms have shifted away from monolithic enterprise data warehouses toward modular, cloud-native architectures built on object storage, distributed compute, and declarative tooling. As these platforms mature, data transformation is increasingly recognised not as a purely operational concern, but as a form of engineering that benefits from the same discipline applied to software development. 

Within this landscape, dbt (data build tool) has become a foundational component of modern analytics engineering. While often described as a SQL transformation tool, its broader impact lies in how it reframes data modelling, testing, and documentation as versioned, testable, and observable engineering artefacts. 

In this blog, I explore an on-demand dbt execution model for secure cloud environments, one designed specifically for enterprise contexts where always-on services, long-lived credentials, and externally hosted SaaS platforms are either infeasible or undesirable.  

  

dbt as an Analytics Engineering Framework 

dbt provides a thin but powerful abstraction over data warehouse and data lake execution engines such as Spark, Trino, Athena, Snowflake, and BigQuery. Rather than introducing its own runtime, dbt focuses on enabling: 

  • Declarative data modelling using SQL 
  • Explicit dependency management between models 
  • Automated lineage generation 
  • Embedded data quality testing 
  • Self-documenting data assets 
     

This design aligns naturally with how analytics teams already work, while introducing engineering discipline without imposing significant operational overhead. 

Crucially, dbt allows transformation logic to move closer to analytics and domain teams, while still enabling platform teams to enforce governance, security, and deployment standards. 


The Problem Space: Secure Data Lake Migrations 

The architecture described here emerged from the construction of a migration data lake built on a medallion architecture: 

  • Bronze: raw ingested data 
  • Silver: standardised, validated, and conformed entities 
  • Gold: analytics-ready models aligned with business semantics 

 

In practice, this migration was constrained by a set of non-negotiable enterprise requirements. These constraints fundamentally shaped how transformation tooling could be deployed. 

Key constraints included: 

  • Network isolation 
    All workloads run inside a sealed VPN with no public internet exposure. 
  • Strict security and compliance controls 
    Persistent compute resources and unmanaged access paths were strongly discouraged. 
  • Multiple transformation domains 
    Independent dbt projects were required to support different subject areas and release cycles. 
  • Operational efficiency 
    Idle infrastructure and always-on transformation services were considered wasteful. 

 

Under these conditions, traditional dbt deployment patterns, such as dbt Cloud or dbt hosted on permanently running compute, were either infeasible or misaligned with the platform’s security and cost objectives. 

 

Rethinking dbt Execution: Why dbt Works Better as an On-Demand Workload 

A key design decision was to treat dbt not as a long-running service, but as an ephemeral workload.

Instead of asking: 

“Where should dbt live?” 

The question became: 

“When should dbt exist?” 

 

This shift reframes dbt execution as something that is instantiated only when required, executes a well-defined scope of transformations, and is then fully torn down. dbt environments become transient by design, created for execution, not permanence. 

This conceptual change underpins the architecture that follows. 

 

Architecture Overview

on-demand dbt high-level architecture: development, execution, and data visualisation phases

At a high level, the architecture operates as follows: 

  • An orchestrator (AWS MWAA) launches an ephemeral ECS task, which runs a custom dbt container image stored in Amazon ECR. 
  • The dbt container loads a dbt project from S3. 
  • The container executes dbt run, dbt test, or documentation generation. 
  • Execution outputs are persisted to the data lake. 
  • The container is terminated immediately after execution. 
     

The dbt container image is built around: 

  • A cloud-native execution adapter (for example, dbt-spark backed by AWS Glue) 
  • Organisation-specific dependencies and configuration standards 
     

Execution roles are tightly scoped and ephemeral, credentials are injected using cloud-native secret management mechanisms, and logs are streamed to centralised logging services for auditability and observability. Each run starts from a known, immutable image and exits cleanly, ensuring reproducibility and a minimal attack surface. 
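The "instantiate, execute, tear down" flow above can be sketched as the argument set an orchestrator would hand to a Fargate launch. The function below only builds the keyword arguments that would be passed to boto3's `ecs.run_task`; the cluster, task-definition, container, subnet, and security-group names are illustrative placeholders, not the names used in the actual platform:

```python
def build_dbt_run_task_args(dbt_command, subnets, security_groups):
    """Build the kwargs an orchestrator (e.g. an MWAA task) could pass to
    boto3's ecs.run_task to launch one ephemeral dbt container run.
    All resource names here are hypothetical placeholders."""
    return {
        "cluster": "data-platform",           # hypothetical ECS cluster name
        "taskDefinition": "dbt-runner",       # points at the custom dbt image in ECR
        "launchType": "FARGATE",              # no long-lived hosts to manage
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,           # private subnets only
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED", # no public internet exposure
            }
        },
        "overrides": {
            "containerOverrides": [
                # e.g. ["dbt", "run"] or ["dbt", "test"] for this one execution
                {"name": "dbt", "command": dbt_command}
            ]
        },
    }

args = build_dbt_run_task_args(["dbt", "test"], ["subnet-aaa"], ["sg-bbb"])
print(args["launchType"])
```

Because each run receives a fresh task built from the same immutable image, the container's lifetime is exactly the lifetime of the dbt command it executes.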

 

 

Why On-Demand dbt Execution Improves Cost, Security, and Scale 

1. Cost Efficiency 

Provisioning compute only for the duration of a dbt run ensures that infrastructure costs scale linearly with actual usage. There is no idle transformation environment consuming resources between schedules or deployments. 

2. Security and Compliance 

Ephemeral execution significantly reduces risk: 

  • No long-lived credentials 
  • No persistent access paths 
  • No configuration drift over time 
     

Each execution operates within tightly controlled IAM boundaries, inside private network segments, and terminates immediately after completion. 

3. Operational Simplicity 

This model eliminates several operational burdens associated with long-running hosts: 

  • Patch management 
  • Dependency drift between environments 
  • Manual intervention during failure recovery 
     

Failures are isolated to individual executions rather than shared infrastructure. 

4. Scalability and Isolation 

Multiple dbt projects can execute in parallel without resource contention. Isolation at the container and execution level simplifies troubleshooting, capacity planning, and platform governance. 

 

Data Quality as a First-Class Output

dbt’s testing framework is frequently under-leveraged, with test outcomes treated as ephemeral execution artefacts rather than durable analytical assets. In this architecture, data quality is elevated to a first-class output of the platform. 

Rather than relying on transient logs, all dbt test failures are materialised as Iceberg tables in Amazon S3, using an append-only design. Each test emits structured, queryable records enriched with execution metadata, including test name, model and column context, invocation identifiers, execution timestamps, and failure cardinality. This design ensures that every data quality signal is preserved immutably and can be analysed longitudinally. 
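A sketch of what one such structured, append-only failure record might look like. The field names are illustrative of the metadata described above (test name, model and column context, invocation identifier, timestamp, failure cardinality), not the exact schema used in the platform:

```python
from datetime import datetime, timezone
import uuid

def build_test_failure_record(test_name, model, column, failed_rows):
    """Shape one dbt test failure as a structured, queryable record of the
    kind that could be appended to an Iceberg audit table in S3.
    Field names are illustrative, not a fixed schema."""
    return {
        "invocation_id": str(uuid.uuid4()),   # ties the record to one dbt run
        "test_name": test_name,               # e.g. "not_null"
        "model": model,
        "column": column,
        "failure_count": failed_rows,         # cardinality only, not the failing rows
        "executed_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_test_failure_record("not_null", "silver.customers", "customer_id", 4)
print(sorted(record))
```

Because every record carries the same fields regardless of which test produced it, results from heterogeneous models can be unioned into a single audit view without per-test reshaping.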

By standardising test schemas across auditable test macros (e.g. not-null, accepted values, column expressions, and uniqueness constraints), results from heterogeneous models can be consolidated into unified audit views without post-hoc transformations. These audit datasets are then exposed via the AWS Glue Catalog and queried directly from Amazon Athena, enabling seamless downstream consumption. 

Historical trends in data quality are visualised using analytics tooling such as AWS QuickSight, allowing teams to identify systemic issues, unstable models, and regressions introduced by schema or logic changes. Crucially, this shifts data quality from a reactive, failure-driven concern to an observable and measurable platform capability, supporting quantitative assessment of reliability and continuous improvement over time. 

CI/CD Integration and Continuous Validation 

The same containerised dbt environment used in production is also executed within CI/CD pipelines. 

A typical workflow includes: 

  • A pull request triggering a CI pipeline 
  • Execution of the dbt container with dbt test 
  • Persistence of results to a non-production schema 
  • Automated surfacing of regressions prior to merge 
     

This approach ensures environmental parity between CI and production, faster feedback loops for engineers, and a reduced risk of introducing data regressions.  
dbt functions not only as a transformation engine, but also as a continuous validation framework. 
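One common mechanic behind "persistence of results to a non-production schema" is deriving an isolated, disposable schema per pull request. A small sketch, where the naming convention is an assumption for illustration rather than a dbt requirement:

```python
import re

def ci_schema_for_pr(pr_number: int, prefix: str = "ci_pr") -> str:
    """Derive an isolated, disposable schema name for a CI dbt run.
    The prefix and pattern are an assumed convention, not prescribed by dbt."""
    if pr_number <= 0:
        raise ValueError("PR number must be positive")
    schema = f"{prefix}_{pr_number}"
    # Defensive check: warehouse identifiers are safest as lowercase snake_case.
    assert re.fullmatch(r"[a-z][a-z0-9_]*", schema)
    return schema

print(ci_schema_for_pr(412))   # -> ci_pr_412
```

The CI pipeline would pass this schema to the containerised dbt run, then drop it after the pull request is merged or closed, keeping CI artefacts out of production datasets.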

 

 

Implications for Platform and Executive Stakeholders

For platform teams, this execution model provides: 

  • Clear governance boundaries 
  • Predictable operational behaviour 
  • A clean separation of concerns 
     

For executives and data leaders, it delivers: 

  • Improved trust in analytical outputs 
  • Transparent, measurable data quality signals 
  • Lower total cost of ownership 
  • Faster delivery of analytics capabilities 
     

Most importantly, it demonstrates that modern data platforms can be both agile and controlled, without compromising one for the other. 

  

  

Closing Thoughts

dbt is often discussed in terms of features and tooling, but its deeper impact lies in how it encourages teams to think differently about data transformation. 

By running dbt as an on-demand, containerised workload, analytics engineering can align more closely with modern cloud execution models, meet stringent enterprise security requirements, and scale transformation efforts without scaling operational burden. 

This approach does not replace existing dbt deployment patterns. Instead, it extends dbt into environments where traditional models fall short. 

As data platforms continue to evolve, architectures like this suggest that the future of analytics engineering is shaped not only by better tools, but by better execution models. 

 

 

Interested in how on-demand dbt execution and modern analytics engineering patterns can improve security, scalability, and cost efficiency in enterprise cloud environments?

Explore more of our blogs, where we share practical perspectives on analytics engineering, data platforms, cloud architecture, and secure execution models for modern enterprises.

Cevo’s adoption of the AWS EBA Framework for our clients
https://cevo.com.au/post/cevos-adoption-of-the-aws-eba-framework-for-our-clients/ – Fri, 23 Jan 2026

Digital transformation often stalls between strategy and execution. AWS Experience-Based Accelerators give organisations a safe, hands-on way to test cloud, modernisation and AI initiatives before committing to scale. Learn how Cevo delivers partner-led EBAs that turn experience into confidence, and confidence into action.

The post Cevo’s adoption of the AWS EBA Framework for our clients  appeared first on Cevo.


Digital transformation has never been short on strategy decks, maturity models, or future-state diagrams, yet countless customer stories describe the struggle to move from strategy to real-world execution. 

What has been harder to find is safe, structured ways for organisations to actually experience modern cloud, data and AI platforms in practice — before committing to large-scale change. 

That’s exactly why Experience-Based Accelerators (EBAs) were created by Amazon Web Services — and why we’re excited that partners like Cevo can now lead and deliver them directly for our clients (AWS Experience-Based Acceleration – Amazon Web Services) 

At Cevo, we’ve adopted the AWS EBA framework as a core part of how we help clients move from intent to impact across Migration, Modernisation and AI. Since adopting and launching Cevo’s partner-led EBAs, customer demand has been overwhelming, with independent software vendors, public sector transport management and energy clients all signing up for the experience. 

 

What Is an AWS Experience-Based Accelerator (EBA)? 

An EBA is a time-boxed, hands-on engagement that gives teams the opportunity to: 

  • Work directly in AWS environments 
  • Use real workloads, real data, and real tooling 
  • Learn by doing, not observing 
  • Build confidence before committing to scale 

 

Rather than abstract conversations about “what’s possible,” EBAs are designed to create practical learning moments that shorten decision cycles and de-risk transformation. 

 

Why Experience-Based Accelerators Reduce Risk for Clients 

Across government and enterprise, we consistently see the same challenges: 

  • Migration programs stall due to uncertainty and risk 
  • Modernisation decisions are delayed by legacy complexity and a fear factor that it’s ‘all too hard’ and that a ‘SaaS’ replacement will solve the complexity 
  • AI ambition outpaces organisational readiness, with too much time (and lost opportunity) of defining a wave of ‘use cases’ 

 

EBAs address these challenges by replacing theory with tangible experience. 

Clients don’t just hear about cloud, data or AI — they use it, test it, and see outcomes firsthand. 

 

How Cevo Delivers Partner-Led AWS EBAs 

Historically, AWS delivered EBAs exclusively themselves. Today, Cevo is proud to be delivering these accelerators as a trusted AWS partner, deeply tailored to our clients’ environments, constraints and ambitions. 

While AWS defines the EBA structure, Cevo’s role is to operationalise it. We tailor each accelerator beyond standard patterns, designing sessions around customer-specific architectures, workloads and organisational realities. This ensures the experience reflects what production delivery will actually look like, not an idealised version of it. 

Local Cevo engineers and consultants deliver EBAs hands-on with customers every day. That means navigating existing technical debt, security requirements, operating models and timelines, and designing solutions that can move forward after the accelerator ends. 

Our approach is grounded in three principles: 

Real Workloads, Not Demos

We design EBAs around your systems, your data and your challenges — not generic samples. 

Engineering-Led, Outcome-Focused

Every accelerator is facilitated by Cevo engineers and consultants who live and breathe delivery — ensuring outcomes are practical, scalable and production-ready. 

Confidence Before Commitment

The goal isn’t just learning — it’s decision enablement. Clients leave with clarity, evidence and momentum. 

 

The EBA Journey

Experience-Based Accelerators in Practice 

Migration: From Uncertainty to Momentum 

Our migration EBAs help teams: 

  • Assess real workloads 
  • Execute hands-on migrations 
  • Understand security, cost and operational impacts early 

 

Teams gain confidence before entering larger migration programs such as AWS MAP. 

 

Modernisation: Seeing Legacy Become Cloud-Native 

For modernisation, EBAs allow teams to: 

  • Refactor or re-platform a real application 
  • Explore modern architectures, DevOps and observability 
  • Experience performance, resilience and scalability gains 

 

The result is faster buy-in from both technical and business stakeholders. 

 

AI: Turning Possibility into Practice 

AI is where EBAs are having the biggest impact. 

Through AI-focused accelerators, clients: 

  • Build and test real AI use cases 
  • Work with modern data platforms and foundation models 
  • Understand governance, security and responsible AI considerations 
  • Move from “AI strategy” to production-deployed proof of value 

 

Organisations early in their AI journey gain powerful insights from this safe, guided experimentation.

 

Why Experience-Led Transformation Works 

EBAs change the transformation dynamic: 

  • Decisions are based on evidence, not assumptions 
  • Teams gain hands-on skills, not just awareness 
  • Executives see working outcomes, not hypothetical benefits 

 

In short, experience creates confidence — and confidence accelerates change. 

 

Continuous Evolution, Not One-Off Transformation 

At Cevo, we believe transformation isn’t a single project — it’s a continuous evolution. 

EBAs fit perfectly into this philosophy: 

  • Start small 
  • Learn fast 
  • Prove value 
  • Scale with confidence 

 

Whether you’re exploring cloud migration, modernising legacy platforms, or unlocking AI opportunities, AWS Experience-Based Accelerators provide a practical, low-risk way to get started — and a powerful catalyst for what comes next. 

See how an EBA can help your team test, learn, and make decisions with confidence — without committing to a full-scale transformation. Book your discovery today.

Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 3
https://cevo.com.au/post/enhance-kubernetes-cluster-performance-and-optimise-cost-with-karpenter-part-3/ – Wed, 21 Jan 2026

Discover how Karpenter enhances Kubernetes cluster performance and optimises costs in EKS. Learn how to monitor Karpenter using logs, metrics, and dashboards to gain full observability, identify scaling bottlenecks, and fine-tune your workloads for reliable, cost-efficient operations.

The post Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 3  appeared first on Cevo.


TL;DR:
Karpenter simplifies Kubernetes cluster management by automatically provisioning the right nodes at the right time, helping optimise costs and performance. Unlike EKS Auto Mode, self-managed Karpenter offers full observability—logs, metrics, and dashboards—to monitor scaling decisions, detect bottlenecks, and fine-tune cluster behaviour. Setting up robust monitoring with tools like Prometheus, Grafana, or commercial platforms ensures reliable workloads and prevents unexpected outages.


Introduction 

As Dr Werner Vogels famously says: 

Everything fails, all the time.

Karpenter simplifies Kubernetes infrastructure by provisioning the right nodes at the right time. However, proactive monitoring of Karpenter helps teams optimise costs, identify scheduling bottlenecks, and ensure workload reliability. 

In Part 1 and Part 2 of “Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter”, we discussed: 

  • What is Karpenter and its components?  
  • How it works and advantages over Cluster Auto Scaler. 
  • How to deploy Karpenter in a Kubernetes cluster? 
  • What is EKS Auto Mode? 
  • EKS with Karpenter vs EKS Auto Mode, and how to choose one over the other. 

In this blog, let’s deep dive into the Observability side of things. 

How to monitor Karpenter? 

Though Karpenter is a mature and stable open-source project widely adopted for production workloads, it is critical to “scrape metrics from the Karpenter controller” into a monitoring solution (e.g., Prometheus & Grafana) to gain visibility into provisioning decisions, scaling performance, and potential inefficiencies.  

Unlike EKS Auto Mode, where AWS manages scaling logic and hides these details, customer-managed Karpenter offers transparency and fine-grained insight into its scaling behaviour. Capturing metrics and setting up monitoring is crucial for continuous fine-tuning. 

Karpenter Architecture 

Karpenter Architecture
  1. Cluster Operator: Someone who installs and configures Karpenter in a Kubernetes cluster, and sets up Karpenter’s configuration and permissions. 
  2. Cluster Developer: Developer or Consumer who can create pods, typically through Deployments, Stateful Sets, DaemonSets, or other pod-controller types. 
  3. Karpenter Controller: The Karpenter application pod that operates inside a cluster. 

 

When Karpenter is deployed in an EKS cluster using Helm or EKS add-ons, it creates various Kubernetes resources; the Karpenter controller pods are one of them. 

Note: Karpenter Controller Pods should not run in the infrastructure managed by Karpenter to avoid circular dependency. 

The Karpenter controller pods make decisions around provisioning and removing nodes, and moving application pods between nodes, based on NodePool, NodeClass and Disruption Budget configuration. They generate both logs and metrics that can be used for monitoring: 

  • Karpenter Controller Logs or Karpenter logs 
  • Karpenter Metrics 

 

Karpenter Logs 

Karpenter logs are generated by the Karpenter pods and can be inspected directly or forwarded for forensic analysis. Logs can be forwarded to: 

  • Amazon CloudWatch Logs  
  • Custom forwarders such as Fluent Bit, Fluentd, or Loki 
  • Commercial platforms such as Datadog or Splunk 
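
As an illustrative sketch of the custom-forwarder option, the Fluent Bit Helm chart can tail the Karpenter controller container logs and ship them to CloudWatch Logs. The log group name, region, and log file path below are assumptions for this example – adjust them for your cluster. 

```yaml
# Hypothetical Fluent Bit Helm values: tail Karpenter controller logs and
# forward them to CloudWatch Logs. Log group, region and path are examples.
config:
  inputs: |
    [INPUT]
        Name              tail
        Path              /var/log/containers/karpenter-*.log
        Tag               karpenter.*
        multiline.parser  docker, cri
  outputs: |
    [OUTPUT]
        Name              cloudwatch_logs
        Match             karpenter.*
        region            ap-southeast-2
        log_group_name    /eks/karpenter
        log_stream_prefix karpenter-
        auto_create_group true
```

Because Karpenter logs are structured JSON, they remain queryable (for example with CloudWatch Logs Insights) once forwarded. 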

 

Below is an example of Karpenter controller logs you would typically see in an EKS cluster, covering provisioning, scheduling, and consolidation events. These logs are useful for understanding what Karpenter is doing at any given point in time. Unless they are forwarded and stored externally, these logs are not retained permanently by the EKS cluster and will be lost when the Karpenter pods restart. 

Example Karpenter logs: 

				
					{"level":"INFO","ts":"2026-01-19T01:12:03.421Z","logger":"controller","msg":"discovered nodepool","commit":"a1b2c3d","nodepool":"default"} 

{"level":"INFO","ts":"2026-01-19T01:12:05.118Z","logger":"provisioner","msg":"found unschedulable pods","pods":3} 

{"level":"INFO","ts":"2026-01-19T01:12:05.456Z","logger":"provisioner","msg":"computing instance types","nodepool":"default","requirements":{"cpu":"2","memory":"4Gi"}} 

{"level":"INFO","ts":"2026-01-19T01:12:06.031Z","logger":"provisioner","msg":"launching node","instance-type":"m6i.large","zone":"ap-southeast-2a","capacity-type":"on-demand"} 

{"level":"INFO","ts":"2026-01-19T01:12:36.882Z","logger":"controller","msg":"node registered","node":"ip-10-0-23-114.ap-southeast-2.compute.internal"} 

{"level":"INFO","ts":"2026-01-19T01:12:38.214Z","logger":"controller","msg":"initialized node","node":"ip-10-0-23-114.ap-southeast-2.compute.internal"} 

{"level":"INFO","ts":"2026-01-19T01:18:12.902Z","logger":"consolidation","msg":"candidate for consolidation","node":"ip-10-0-23-114.ap-southeast-2.compute.internal","reason":"underutilized"} 

{"level":"INFO","ts":"2026-01-19T01:18:15.447Z","logger":"consolidation","msg":"terminating node","node":"ip-10-0-23-114.ap-southeast-2.compute.internal"} 

{"level":"INFO","ts":"2026-01-19T01:18:42.019Z","logger":"controller","msg":"node deleted","node":"ip-10-0-23-114.ap-southeast-2.compute.internal"} 

{"level":"WARN","ts":"2026-01-19T01:22:10.633Z","logger":"provisioner","msg":"unable to schedule pod","pod":"payments-api-7d9f6","reason":"no instance type satisfies requirements"} 
				
			

Karpenter Metrics 

Karpenter also exposes several metrics in Prometheus format to allow monitoring of cluster provisioning status. These metrics are available by default at karpenter.kube-system.svc.cluster.local:8080/metrics (the port is configurable via the METRICS_PORT environment variable). Karpenter generates metrics under various categories, including: 

  • Nodeclaims Metrics 
  • Nodes Metrics 
  • Pods Metrics 
  • Termination Metrics 
  • Voluntary Disruption Metrics  
  • Scheduler Metrics 
  • Nodepools Metrics 
  • EC2NodeClass Metrics 
  • Interruption Metrics 
  • Cluster Metrics 
  • Cloudprovider Metrics 
  • Controller Runtime Metrics 
  • Workqueue Metrics 

As mentioned above, Karpenter metrics are available in Prometheus format and can be scraped using one of the following options: 

  • Prometheus [most preferred] 
    • Self-Managed  
    • Amazon Managed Prometheus 
  • Grafana Agent, Telegraf, Fluent Bit/Fluentd 
  • OpenTelemetry 
  • Datadog, NewRelic, Splunk and other managed commercial offerings. 
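
For instance, a minimal Prometheus static scrape configuration targeting the default Karpenter metrics endpoint might look like the following. This is a sketch – if you changed the install namespace or METRICS_PORT, update the target accordingly. 

```yaml
# Hypothetical Prometheus scrape job for the Karpenter metrics endpoint.
scrape_configs:
  - job_name: karpenter
    metrics_path: /metrics
    static_configs:
      - targets:
          - karpenter.kube-system.svc.cluster.local:8080
```

With the Prometheus Operator, the equivalent approach is a ServiceMonitor selecting the Karpenter service. 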

 

Karpenter Metrics Visualisation  

Managed commercial platforms such as Datadog and New Relic provide out-of-the-box community dashboards for monitoring Karpenter, along with support for flexible dashboards and highly customisable alerts. 

For customers who do not use these commercial offerings, the scraped Karpenter metrics can be forwarded to Grafana. Open-source Grafana provides visualisation and alerting capabilities comparable to commercial managed offerings. Below are a few of the free, out-of-the-box dashboards available for Karpenter monitoring; these can be imported directly from the Grafana dashboard library. 

 

Dashboard Example 1: Kubernetes/Autoscaling/Karpenter/Activity – Provides insights into autoscaling activity and the reasons for node termination. 

Grafana Dashboard 2

Dashboard Example 2: Kubernetes/Autoscaling/Karpenter/Overview – Provides insights into instance CPU and memory usage and instance capacity type. 

Grafana Dashboard 1
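
Dashboards cover visualisation; alerting closes the loop. As a minimal illustration (the job label is an assumption that must match your scrape configuration), a Prometheus rule can fire if the Karpenter metrics endpoint stops responding altogether: 

```yaml
# Hypothetical Prometheus alerting rule: fire if no Karpenter metrics have
# been scraped for five minutes (e.g., controller pods crash-looping).
groups:
  - name: karpenter
    rules:
      - alert: KarpenterDown
        expr: up{job="karpenter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Karpenter metrics endpoint is unreachable
```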

Conclusion 

In EKS Auto Mode, the Karpenter controller is fully managed by AWS, which limits the level of observability customers can build around it. In contrast, with a self-managed Karpenter deployment, it is essential to capture logs and metrics and to establish monitoring in order to continuously fine-tune its behaviour. 

“Trust, but verify.” While Karpenter is a powerful and generally stable solution, it should not be treated as a black box. Everything fails, all the time. A robust observability and monitoring setup enables forensic analysis during production incidents and helps teams learn and improve over time to prevent recurring issues. The last thing any team wants is a production outage caused by Karpenter consolidating or terminating nodes during business hours. 

If you have an EKS cluster and are looking to improve Kubernetes performance while optimising costs, reach out to Cevo. 

The post Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 3  appeared first on Cevo.

]]>
Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 2  https://cevo.com.au/post/enhance-kubernetes-cluster-performance-and-optimise-cost-with-karpenter-part-2/ Wed, 21 Jan 2026 01:28:41 +0000 https://cevo.com.au/?p=24804 Learn how to migrate from Kubernetes Cluster Autoscaler to Karpenter on Amazon EKS to improve scaling speed, reduce infrastructure costs, and increase resource efficiency. This deep dive also explores EKS Auto Mode, showing how cloud-managed autoscaling can further reduce operational overhead while improving security and reliability.

The post Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 2  appeared first on Cevo.

]]>

TL;DR

This blog explains how migrating from Cluster Autoscaler to Karpenter on Amazon EKS improves scaling speed, resource utilisation, and cost efficiency. It also introduces EKS Auto Mode, which builds on Karpenter to reduce operational overhead through automated scaling, patching, and upgrades—making EKS clusters more efficient and easier to manage.

Table of Contents

Introduction 

In the previous blog “Enhance Kubernetes cluster performance and optimise costs with Karpenter”, we discussed what Karpenter is, its components and advantages, how it works, and how to deploy it in an AWS EKS cluster. 

In this blog, let’s dive deeper into the following items: 

  1. How to migrate from Cluster AutoScaler to Karpenter 
  2. What is EKS Auto Mode (Advanced Karpenter Mode)? 
  3. How to migrate from Karpenter to EKS Auto Mode and reap full benefits of cloud-managed services 

To explain these points, let’s use a hypothetical example. Imagine two groups, Dogs and Cats, at war with each other. Each group has a leader who must securely and efficiently distribute war plans to their generals stationed across the globe. They decide to use a containerised application deployed in EKS for communication. 

They share the same Amazon EKS cluster named “PetsCluster” for cost efficiency. Within this cluster, each group has its own Kubernetes namespace to host its applications and isolate its data: 

  • fordogs – to deploy apps for dogs 
  • forcats – to deploy apps for cats 

Even though both groups are focused on defeating each other, they still value AWS’s principle of frugality. Instead of duplicating infrastructure, they share a common namespace/infrastructure named platform to host logging, monitoring, and other shared services. This allows logical separation of workloads (forcats and fordogs) without compromising isolation between the two groups. 

Full codebase: https://github.com/cevoaustralia/aws-eks-karpenter-and-automode  

Cluster Auto-Scaler and its limitations 

Currently, “PetsCluster” is using Cluster AutoScaler, which relies on AWS Auto Scaling Groups (ASGs). While it handles basic scaling, it has key limitations: 

  • One size must fit all – A Node Group can only use one instance type and size (e.g., t3.large). Workloads requiring other instance types cannot benefit without creating or modifying a separate Node Group. 
  • Overprovisioning – Cluster Autoscaler scales at the node group level, so GPU workloads force scaling of GPU node groups, causing non-GPU pods to land on GPU instances. Additionally, even small pods trigger provisioning of the full, often large, node group instance type, resulting in overprovisioning and higher infrastructure costs. 
  • Underutilisation – If a tiny app runs on a t3.8xlarge, the server stays active until the app stops, leading to wasted capacity. There’s no optimisation or consolidation. 
  • Operational overhead – New instance types require manual configuration changes, preventing users from quickly leveraging AWS innovations. 
  • Delayed provisioning – Cluster AutoScaler provisions via ASGs, which can be slow, leading to poor user experience under volatile load. 

EKS Cluster Config: (PetsCluster) 

				
					module "eks" { 

  source  = "terraform-aws-modules/eks/aws" 

  version = "~> 20.0" 

 

  cluster_name                             = var.cluster_name #PetsCluster 

  cluster_version                          = "1.32" 

  cluster_endpoint_public_access           = true 

  enable_cluster_creator_admin_permissions = true 

 

  vpc_id     = module.vpc.vpc_id 

  subnet_ids = module.vpc.private_subnets 

 

  enable_irsa = true 

 

} 
				
			

EKS Managed Node Group Configs: (Platform, Forcats and ForDogs) 

				
					eks_managed_node_groups = { 

  platform = { 

    min_size       = 2 

    max_size       = 2 

    desired_size   = 2 

    instance_types = ["t3.medium"] 

 

    labels = { 

      "nodegroup/type" = "platform" 

    } 

 

    taints = { 

      dedicated = { 

        key    = "dedicated" 

        value  = "platform" 

        effect = "NO_SCHEDULE" 

      } 

    } 

  } 

 

  forcats = { 

    min_size       = 1 

    max_size       = 3 

    desired_size   = 1 

    instance_types = ["t3.small"] 

 

    labels = { 

      "nodegroup/type" = "forcats" 

    } 

 

    taints = { 

      dedicated = { 

        key    = "dedicated" 

        value  = "forcats" 

        effect = "NO_SCHEDULE" 

      } 

    } 

  } 

 

  fordogs = { 

    min_size       = 1 

    max_size       = 3 

    desired_size   = 1 

    instance_types = ["t3.small"] 

 

    labels = { 

      "nodegroup/type" = "fordogs" 

    } 

 

    taints = { 

      dedicated = { 

        key    = "dedicated" 

        value  = "fordogs" 

        effect = "NO_SCHEDULE" 

      } 

    } 

  } 

} 
				
			

EKS Cluster Auto Scaling Config: 

				
					resource "helm_release" "cluster_autoscaler" { 

  name       = "cluster-autoscaler" 

  repository = "https://kubernetes.github.io/autoscaler" 

  chart      = "cluster-autoscaler" 

  namespace  = "kube-system" 

  version    = "9.29.0" 

  timeout    = 600 

 

  set { 

    name  = "autoDiscovery.clusterName" 

    value = var.cluster_name 

  } 

 

  set { 

    name  = "awsRegion" 

    value = var.region 

  } 

 

  set { 

    name  = "rbac.serviceAccount.create" 

    value = "true" 

  } 

 

  set { 

    name  = "rbac.serviceAccount.name" 

    value = "cluster-autoscaler" 

  } 

 

  set { 

    name  = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn" 

    value = module.cluster_autoscaler_irsa.iam_role_arn 

  } 

 

  set { 

    name  = "image.tag" 

    value = "v1.30.0" 

  } 

 

  set { 

    name  = "nodeSelector.nodegroup\\/type" 

    value = "platform" 

  } 

 

  set { 

    name  = "tolerations[0].key" 

    value = "dedicated" 

  } 

 

  set { 

    name  = "tolerations[0].value" 

    value = "platform" 

  } 

 

  set { 

    name  = "tolerations[0].operator" 

    value = "Equal" 

  } 

 

  set { 

    name  = "tolerations[0].effect" 

    value = "NoSchedule" 

  } 

 

  depends_on = [module.eks] 

} 
				
			

Scenario 1: Generals from Cats & Dogs demand frugality & agility 

EKS Karpenter Arch Diag 1 - Karpenter on Amazon EKS

Modern apps come in different shapes and sizes. Auto scaling should provision the right instance types dynamically, rather than enforcing “one size fits all.” 

Also, if 9–5, Mon–Fri is considered business hours, that’s only 23.8% of the week. The remaining 76.2% are non-business hours when non-production apps don’t need full capacity. By configuring appropriate disruption budgets, scale-down policies, or shutting down non-production workloads during off-business hours using AWS Instance Scheduler, customers can save up to ~75% in infrastructure costs. 

Karpenter enables this by consolidating nodes: it moves workloads from underutilised or empty servers to other nodes in the cluster to keep server utilisation at around 80%, then terminates the idle servers, saving cost. 
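
Consolidation behaviour is controlled through the NodePool disruption block. As a sketch (the schedule and percentage are illustrative, not from the PetsCluster configuration), budgets can cap voluntary disruption and block it entirely during business hours: 

```yaml
# Hypothetical NodePool disruption settings (Karpenter v1 API).
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 30s
  budgets:
    - nodes: "10%"                 # at most 10% of nodes disrupted at once
    - nodes: "0"                   # no voluntary disruption...
      schedule: "0 9 * * mon-fri"  # ...from 9am on weekdays
      duration: 8h                 # ...for eight hours
```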

 

Note: A dedicated platform Node Group (1 node) must host the Karpenter controller. Deploying Karpenter on Karpenter-managed nodes creates a chicken-and-egg problem.  

 

The reason is that the Karpenter controller is responsible for dynamically provisioning and deprovisioning worker nodes based on pending pods and scheduling requirements. Because of this role, it cannot depend on itself to exist: if the controller pod were scheduled onto a Karpenter-managed node, that circular dependency could prevent the cluster from recovering once the node is disrupted. 

Let’s deploy Karpenter to “PetsCluster.” 

 

Karpenter Setup Config:  

				
					resource "helm_release" "karpenter" { 

  name             = "karpenter" 

  namespace        = "karpenter" 

  create_namespace = true 

 

  repository = "oci://public.ecr.aws/karpenter" 

  chart      = "karpenter" 

  version    = "1.6.1" 

  timeout    = 600 

 

  # Core settings 

  set { 

    name  = "settings.clusterName" 

    value = var.cluster_name 

  } 

 

  set { 

    name  = "settings.clusterEndpoint" 

    value = module.eks.cluster_endpoint 

  } 

 

  set { 

    name  = "serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn" 

    value = module.karpenter_irsa.iam_role_arn 

  } 

 

  # Controller configuration 

  set { 

    name  = "controller.resources.requests.cpu" 

    value = "1" 

  } 

 

  set { 

    name  = "controller.resources.requests.memory" 

    value = "1Gi" 

  } 

 

  set { 

    name  = "controller.resources.limits.cpu" 

    value = "1" 

  } 

 

  set { 

    name  = "controller.resources.limits.memory" 

    value = "1Gi" 

  } 

 

  # Node selector for platform nodes 

  set { 

    name  = "nodeSelector.nodegroup\\/type" 

    value = "platform" 

  } 

 

  # Tolerations for platform nodes 

  set { 

    name  = "tolerations[0].key" 

    value = "dedicated" 

  } 

 

  set { 

    name  = "tolerations[0].value" 

    value = "platform" 

  } 

 

  set { 

    name  = "tolerations[0].operator" 

    value = "Equal" 

  } 

 

  set { 

    name  = "tolerations[0].effect" 

    value = "NoSchedule" 

  } 

 

  depends_on = [module.eks, module.karpenter_irsa] 

} 
				
			

Karpenter NodePool Config:  

				
					--- 

apiVersion: karpenter.sh/v1 

kind: NodePool 

metadata: 

  name: platform 

spec: 

  template: 

    metadata: 

      labels: 

        "nodegroup/type": "platform" 

    spec: 

      nodeClassRef: 

        group: karpenter.k8s.aws 

        kind: EC2NodeClass 

        name: platform 

      requirements: 

        - key: "karpenter.k8s.aws/instance-category" 

          operator: In 

          values: ["t", "m"] 

        - key: "karpenter.k8s.aws/instance-cpu" 

          operator: In 

          values: ["2", "4", "8"] 

      taints: 

        - key: "dedicated" 

          value: "platform" 

          effect: "NoSchedule" 

  limits: 

    cpu: 100 

  disruption: 

    consolidationPolicy: WhenEmptyOrUnderutilized 

    consolidateAfter: 30s 

 

--- 

apiVersion: karpenter.sh/v1 

kind: NodePool 

metadata: 

  name: forcats 

spec: 

  template: 

    metadata: 

      labels: 

        "nodegroup/type": "forcats" 

    spec: 

      nodeClassRef: 

        group: karpenter.k8s.aws 

        kind: EC2NodeClass 

        name: forcats 

      requirements: 

        - key: "karpenter.k8s.aws/instance-category" 

          operator: In 

          values: ["c", "m", "r"] 

        - key: "karpenter.k8s.aws/instance-cpu" 

          operator: In 

          values: ["4", "8", "16"] 

      taints: 

        - key: "dedicated" 

          value: "forcats" 

          effect: "NoSchedule" 

  limits: 

    cpu: 500 

  disruption: 

    consolidationPolicy: WhenEmptyOrUnderutilized 

    consolidateAfter: 30s 

 

--- 

apiVersion: karpenter.sh/v1 

kind: NodePool 

metadata: 

  name: fordogs 

spec: 

  template: 

    metadata: 

      labels: 

        "nodegroup/type": "fordogs" 

    spec: 

      nodeClassRef: 

        group: karpenter.k8s.aws 

        kind: EC2NodeClass 

        name: fordogs 

      requirements: 

        - key: "karpenter.k8s.aws/instance-category" 

          operator: In 

          values: ["c", "m", "r"] 

        - key: "karpenter.k8s.aws/instance-cpu" 

          operator: In 

          values: ["4", "8", "16"] 

      taints: 

        - key: "dedicated" 

          value: "fordogs" 

          effect: "NoSchedule" 

  limits: 

    cpu: 500 

  disruption: 

    consolidationPolicy: WhenEmptyOrUnderutilized 

    consolidateAfter: 30s 
				
			

Karpenter NodeClass Config: 

				
					--- 

apiVersion: karpenter.k8s.aws/v1 

kind: EC2NodeClass 

metadata: 

  name: platform 

spec: 

  amiFamily: AL2 

  amiSelectorTerms: 

    - alias: al2@latest 

  role: ${instance_profile_name} 

  subnetSelectorTerms: 

    - tags: 

        "kubernetes.io/role/internal-elb": "1" 

  securityGroupSelectorTerms: 

    - name: "${cluster_name}-node-*" 

  tags: 

    "karpenter.sh/discovery": "${cluster_name}" 

    "nodegroup/type": "platform" 

    "Name": "${cluster_name}-platform-karpenter" 

  userData: | 

    #!/bin/bash 

    /etc/eks/bootstrap.sh ${cluster_name} 

 

--- 

apiVersion: karpenter.k8s.aws/v1 

kind: EC2NodeClass 

metadata: 

  name: forcats 

spec: 

  amiFamily: AL2 

  amiSelectorTerms: 

    - alias: al2@latest 

  role: ${instance_profile_name} 

  subnetSelectorTerms: 

    - tags: 

        "kubernetes.io/role/internal-elb": "1" 

  securityGroupSelectorTerms: 

    - name: "${cluster_name}-node-*" 

  tags: 

    "karpenter.sh/discovery": "${cluster_name}" 

    "nodegroup/type": "forcats" 

    "Name": "${cluster_name}-forcats-karpenter" 

  userData: | 

    #!/bin/bash 

    /etc/eks/bootstrap.sh ${cluster_name} 

 

--- 

apiVersion: karpenter.k8s.aws/v1 

kind: EC2NodeClass 

metadata: 

  name: fordogs 

spec: 

  amiFamily: AL2 

  amiSelectorTerms: 

    - alias: al2@latest 

  role: ${instance_profile_name} 

  subnetSelectorTerms: 

    - tags: 

        "kubernetes.io/role/internal-elb": "1" 

  securityGroupSelectorTerms: 

    - name: "${cluster_name}-node-*" 

  tags: 

    "karpenter.sh/discovery": "${cluster_name}" 

    "nodegroup/type": "fordogs" 

    "Name": "${cluster_name}-fordogs-karpenter" 

  userData: | 

    #!/bin/bash 

    /etc/eks/bootstrap.sh ${cluster_name} 
				
			

Scenario 2: Generals demand “All hands on deck” using EKS Auto Mode 

EKS Auto Mode Arch Diag 2 - Karpenter on Amazon EKS

Karpenter improves frugality and agility, but operational overhead remains—security patching, upgrades, and maintenance are still needed for better security posture. This is where EKS Auto Mode comes in. It uses Karpenter behind the scenes, but with additional automation: 

  • Streamlined cluster management – Production-ready EKS clusters with minimal overhead, like Elastic Beanstalk. 
  • Application availability – Dynamically adds/removes nodes per workload demand. 
  • Efficiency – Consolidates workloads, removes idle nodes, and reduces cost. 
  • Security – Nodes are recycled every 21 days (can be reduced), aligning with security best practices. 
  • Automated upgrades – Keeps clusters/nodes up to date while respecting PDBs and NDBs. 
  • Managed components – Includes DNS, Pod networking, GPU plug-ins, health checks, and EBS CSI out-of-the-box. 
  • Customizable NodePools – Still allows defining custom storage, compute, or networking requirements. 
  • EKS Auto Mode is enabled by setting `compute_config.enabled = true`. 

 

EKS Cluster Config with Auto Mode: 

				
					module "eks" { 

  source  = "terraform-aws-modules/eks/aws" 

  version = "~> 20.0" 

 

  cluster_name                             = var.cluster_name 

  cluster_version                          = "1.32" 

  cluster_endpoint_public_access           = true 

  enable_cluster_creator_admin_permissions = true 

 

  vpc_id     = module.vpc.vpc_id 

  subnet_ids = module.vpc.private_subnets 

 

  enable_irsa = true 

 

  compute_config = { 

    enabled = true 

  } 

} 
				
			

EKS with Karpenter vs EKS Auto Mode 

Both managed EKS with Karpenter and Amazon EKS Auto Mode offer powerful solutions for managing Kubernetes clusters. 

Auto Mode vs Karpenter - Karpenter on Amazon EKS

Choose the managed EKS with Karpenter option if you need: 

  • Control over Data Plane 
  • Custom AMI 
  • Install specific Agents or software requiring DaemonSet 
  • Custom Networking 
  • Granular control over patching & upgrades 

Choose EKS Auto Mode if you: 

  • Want to reduce operational overhead for upgrades and patching 
  • Don’t require granular control over the AMI, custom networking, upgrades, or patching 

Conclusion  

In conclusion, both the Dogs and Cats generals now have EKS clusters with Auto Mode enabled, which automatically scale, patch, and provide enhanced security to workloads without manual intervention. 

Migrating from Cluster AutoScaler to Karpenter was the tectonic shift that optimised cluster efficiency. Karpenter, originally built by AWS, is now an open-source project maintained by the community. 

As the proverb goes: “Trust, but verify.” While Karpenter is powerful, it shouldn’t be treated as a black box. The last thing anyone wants is a production outage because Karpenter decided to consolidate or terminate nodes during business hours. 

So, in the next blog we’ll explore observability for Karpenter: forwarding Karpenter controller logs to Grafana and building dashboards to monitor its actions. 

The post Enhance Kubernetes Cluster Performance and Optimise Cost with Karpenter – Part 2  appeared first on Cevo.

]]>
Kafka ZooKeeper to KRaft: The Next Chapter in Apache Kafka (MSK and Kafka 3.9 Support) https://cevo.com.au/post/kafka-zookeeper-to-kraft-the-next-chapter-in-apache-kafka/ Fri, 19 Dec 2025 02:07:42 +0000 https://cevo.com.au/?p=24772 Apache Kafka’s move from ZooKeeper to KRaft simplifies metadata management, boosts scalability, and accelerates failover. Learn how Kafka 3.9 on Amazon MSK leverages Raft consensus to streamline operations and support large-scale event streaming.

The post Kafka ZooKeeper to KRaft: The Next Chapter in Apache Kafka (MSK and Kafka 3.9 Support) appeared first on Cevo.

]]>

TL;DR  

Kafka consensus ensures coordination, leader election, and metadata consistency. ZooKeeper handled this but added complexity and scaling limits. KRaft embeds Raft directly in Kafka, offering simpler operations, faster failover, and higher scalability. For MSK, migrate by creating a new KRaft cluster, replicating data, updating clients, testing, cutting over, and decommissioning the old cluster. 

Table of Contents

Why Kafka Needs Consensus (and Why It’s Hard)  

For years, ZooKeeper has been a necessary but operationally heavy dependency for Kafka. With Kafka 3.9, that finally changes, especially for Amazon MSK users. 

At the heart of any distributed system lies the problem of agreement: multiple independent components must coordinate and agree on the system’s state even in the face of failures, network partitions, or latency.  

This is critical for tasks such as: 

  • Determining leader nodes for managing shared responsibilities (diagram below). 
  • Ensuring that all nodes agree on cluster membership. 
  • Maintaining consistent metadata about topics, partitions, and configuration. 
Kafka ZooKeeper to KRaft - Consensus Diagram

Leader Node Selection

Without a reliable consensus algorithm, systems can suffer from: 

  • Split-brain scenarios 
  • Inconsistent state 
  • Data loss 


Protocols such as Raft and Paxos were introduced to guarantee safety, liveness, and fault tolerance in distributed environments, ensuring a single source of truth despite failures and concurrent updates. 

In Apache Kafka, metadata such as topic configurations, partition placements, and ACLs must be shared and agreed upon by all brokers. Achieving this efficiently and reliably is what a consensus layer enables. 

 

What ZooKeeper Did for Kafka (and Where It Fell Short) 

From Kafka’s early versions up through 3.9, Apache ZooKeeper served as the authoritative consensus and coordination layer.  

ZooKeeper’s responsibilities included: 

  • Maintaining cluster membership (which brokers are alive). 
  • Performing controller election to select a leader for cluster coordination. 
  • Storing critical metadata about topics, partitions, and replicas. 

ZooKeeper uses its own consensus algorithm (Zab) to replicate state across an ensemble. For many years, Kafka depended on this external ensemble to bootstrap and coordinate distributed state, which worked exceptionally well, but introduced operational complexity and scaling limitations as Kafka clusters grew. 

Key operational challenges of ZooKeeper in Kafka included: 

  • Separate system to manage: Operators needed to configure, monitor, and maintain a ZooKeeper cluster in addition to Kafka. 
  • Metadata bottleneck: ZooKeeper’s ability to serve metadata was a limiting factor for cluster scaling as partition counts grew. 
  • Failover complexity: Controller election and metadata propagation relied on cross-system interactions, potentially slowing down recovery. 
“KRaft embeds Raft consensus directly within Kafka brokers—no ZooKeeper, faster failover, higher scalability.”

What Is KRaft and How It Replaces ZooKeeper 

KRaft stands for Kafka Raft, an internal consensus protocol that replaces ZooKeeper within Kafka’s architecture. 

Key Features of KRaft 

Integrated Consensus Within Kafka 

KRaft embeds the consensus mechanism directly within Kafka brokers. A group of controller nodes forms a Raft quorum, responsible for metadata storage and replication. Metadata is stored as a special Kafka topic (e.g., __cluster_metadata), and changes are replicated using Raft instead of external ZooKeeper ensembles. 

Simplified Architecture 

Kafka no longer requires a separate ZooKeeper ensemble. This reduces operational complexity and lowers the maintenance burden associated with a separate distributed coordination service. 

Improved Scalability 

With Raft, Kafka clusters can scale beyond the partition and broker limits imposed by ZooKeeper-based metadata bottlenecks. In Amazon MSK, for example, KRaft mode enables up to 60 brokers per cluster, compared to 30 brokers with ZooKeeper mode by default. 

Faster Failover and Recovery 

Raft’s built-in leader election and metadata replication provide quicker controller failover and metadata propagation, improving availability during broker restarts and topology changes. 

Unified Metadata Handling 

Metadata is treated as just another Kafka log, leveraging Kafka’s replication, partitioning, and log management semantics. This improves consistency and throughput for metadata operations. 

 

ZooKeeper vs. KRaft in Apache Kafka 

The differences between ZooKeeper and KRaft become clear when compared side by side. 

Aspect | ZooKeeper Mode | KRaft Mode
Architecture | External ZooKeeper cluster manages metadata and leadership. | Integrated Raft quorum inside Kafka brokers manages metadata.
Consensus Protocol | ZooKeeper’s Zab protocol. | Raft consensus tailored for Kafka.
Operational Complexity | Requires managing ZooKeeper nodes separately. | No separate ZooKeeper → simpler operations.
Metadata Storage | Stored in ZooKeeper znodes. | Stored as a Kafka log (__cluster_metadata).
Scaling (Broker Count) | Limited (e.g., 30 brokers default on MSK). | Higher scalability (e.g., 60 brokers on MSK).
Performance | Requires cross-system coordination → more latency. | Metadata local to Kafka brokers → lower latency and faster failure recovery.
Lifecycle | Dependent on external system health. | Internal and unified Kafka lifecycle.

“Kafka’s move to KRaft simplifies operations while unlocking future innovations for real-time workloads.”

Why KRaft Is the Better Approach 

In summary, Kafka’s move to KRaft offers: 

  • Lower operational overhead: Single system to manage, replacing two. 
  • Improved metadata performance and failover times. 
  • Higher scalability limits—important for enterprises with intense streaming workloads. 
  • Simplified client connections: Modern Kafka clients now use bootstrap.servers exclusively, with the older ZooKeeper connection string deprecated. 

This evolution simplifies both development and operations while positioning Kafka for future innovations that depend on a robust internal consensus.

 

Practical Migration Plan for Amazon MSK 

Because ZooKeeper mode and KRaft mode represent fundamentally different metadata architectures, you cannot convert an existing MSK ZooKeeper cluster in place. Instead, follow a practical migration strategy: 

1. Prepare a New MSK KRaft Cluster 

  • Create a new MSK cluster with Kafka version 3.9 (or later), specifying KRaft mode at provisioning. 
  • Configure your brokers and controller count based on expected throughput and partition count.

2. Synchronise Data

  • Use MirrorMaker 2 (or similar replication tools) to mirror topics and consumer group state from the old ZooKeeper cluster to the new KRaft cluster. 
  • Validate topic configurations, security settings, and ACLs in the target cluster. 

3. Update Clients

  • Ensure producer and consumer applications are updated to use bootstrap.servers for connection. 

4. Test and Validate

  • Perform extensive integration testing. 
  • Validate lag, throughput, and end-to-end behaviour under load. 

5. Switch Production Traffic

  • Once validation is complete, switch clients to point to the new cluster.
  • Monitor key health and performance metrics during the cut-over window.

6. Decommission Old Cluster

  • After successful migration and validation, safely decommission the ZooKeeper-based MSK cluster. 

Note: Because you are provisioning a new cluster, expect some planning around capacity, testing, and potential cut-over coordination. This phased approach minimises risk and maximises confidence in the migration. 

 

Closing Thoughts 

Apache Kafka’s shift from ZooKeeper to KRaft represents a pivotal moment in the evolution of distributed streaming platforms. With simplified architecture, better scalability, and faster metadata operations, Kafka becomes easier to operate and more capable of supporting tomorrow’s real-time workloads. 

For AWS MSK users, Kafka 3.9 with KRaft means you can now harness these benefits in a fully managed environment, provided you plan and execute a thoughtful migration strategy. 

Should you choose to adopt KRaft today? Yes, especially for new clusters.  

Cevo can help assess your Kafka readiness for KRaft and design a low-risk MSK migration plan.  Get in touch with our team to find out how.

The post Kafka ZooKeeper to KRaft: The Next Chapter in Apache Kafka (MSK and Kafka 3.9 Support) appeared first on Cevo.

]]>
AWS re:Invent 2025 Recap https://cevo.com.au/aws-news/aws-reinvent-2025-recap/ Mon, 08 Dec 2025 23:31:53 +0000 https://cevo.com.au/?p=24750 AWS re:Invent 2025 marked a major shift in Agentic AI, with tangible solutions like Bedrock AgentCore, AWS Security Agent, and DevOps Agent transforming how AI agents are orchestrated and deployed. Security Hub went GA, multimodal AI capabilities advanced, and hands-on workshops highlighted real-world applications. This recap covers the key announcements, innovations, and human connections that made re:Invent 2025 a landmark event for AWS builders and partners.

The post AWS re:Invent 2025 Recap appeared first on Cevo.

]]>

TL;DR:
This year’s AWS re:Invent was the most mature step forward in Agentic AI we’ve seen yet. AWS moved beyond concepts and delivered tangible, operational agent tooling, including Bedrock AgentCore, AWS Security Agent, and the new DevOps Agent. Security Hub finally went GA, multimodal capabilities took centre stage, and the community experience delivered as always. 

Table of Contents

Before attending re:Invent this year, it was obvious that the themes surrounding AI would have a different flavour. In 2023, the resounding theme was Generative AI. New tooling was announced, with a lot of focus on Bedrock, PartyRock and SageMaker enhancements. The trouble was that there was not much structure around the use of the tooling, and many were left puzzled about AWS’s direction and strategy. Last year, we saw some large announcements and a deep dive into tooling with a renewed focus on Security. It felt much more balanced and nuanced than the previous year. The resounding difference this year was the theme of Agentic AI, only this time we had real and tangible solutions revolving around the theme. 

This was the year that AWS really operationalised AI, centred on the release and orchestration of agents, and it represented a big shift from previous years. 

A common theme of service releases has traditionally been that each one solves a very specific problem; in the past, we have then needed to cobble all these different services together into a cohesive solution. Bedrock has always faced this problem when creating and orchestrating agents to carry out a given task. Getting hands-on with Bedrock AgentCore as part of the security workshop delivered this year at re:Invent was really cool. 

 

Bedrock AgentCore 

The point of AgentCore is to create all the integration points into a single cohesive solution that brings together all the core functions of the agent. 

Bedrock AgentCore - AWS re:Invent

Some key capabilities included: 

Gateways 

Gateways replace the need to create separate API Gateways as an entry point. Prior to AgentCore, a RESTful API was often used to create an entry point for the integration to the agent. This is demonstrated in my own version of an agent-based solution for ingesting, analysing and resolving AWS Security Hub findings using Bedrock Agents instead of AgentCore: 

https://github.com/greg-luxford/security-hub-ai-workshop 

Gateways use open source protocols such as the Model Context Protocol (MCP) to call tools through OpenAPI schemas, REST API schemas or as pre-configured integrations without requiring you to write additional code or manage infrastructure. You can also configure identity settings for your gateways to securely manage access to downstream resources. 

Memory 

Memory allows for continuance and conversational context – allowing for true conversational interaction with agents. Memory enables agents to retain knowledge and learn continuously by leveraging built-in and/or custom strategies for automatically extracting and storing key types of memory from every interaction. This allows agents to be context-aware across sessions.  

Short-term memory maintains context within the current session, something you may have previously experienced with Q CLI (sorry, ahem: Kiro CLI). Once the session is closed, however, any context is lost. Long-term memory is used to maintain context even between sessions. This is a really powerful feature of AgentCore! 

Agent Runtime 

Agent Runtime offers a secure, serverless and purpose-built way to deploy and scale AI agents and tools using any agent framework and any model. Runtime unlocks fast cold starts, industry-leading long-running execution, true session isolation, built-in identity and support for multi-modal payloads. Simply host your existing agent or tool code in Runtime to get started. 

Speaking of this, multi-modal support was a standout feature of many services this year. Multimodal AI is a type of artificial intelligence that can process and integrate information from multiple data types, or “modalities”, such as text, images, audio, and video. This powerful feature allows you to correlate and build solutions for multiple capabilities in the one place – pretty awesome. 

Identity 

The identity capability is really cool as it allows you to embed trusted identity into the agent without the need to cobble together other services like Cognito. Identity enables agents to securely manage access to resources by integrating with leading identity providers. You can add Inbound Identity to resources to enable caller authentication and authorisation, and you can enable Outbound Identity to provide downstream resource access. 

Policy 

Providing guardrails to your agent is really important. Being able to define these as a policy is a very smart feature that drastically improves the security posture of your agents.  

Policy offers deterministic control to ensure agents operate within defined boundaries and business rules without slowing them down. Easily author fine-grained rules using natural language or Cedar (AWS’s open-source policy language). It integrates with Gateway, controlling who can perform what actions under what conditions. 

Built-In Tools 

There are built-in tools including a Browser for simulation and a Code Interpreter. 

Browser 

Augment your agent to securely interact with web applications, fill forms, navigate websites, and extract information in a fully managed environment. View live browser session or session replay. Monitor key metrics and traces for your browser session in AgentCore Observability. This enables you to test and validate your use cases without needing to publish publicly. 

Code Interpreter 

Enable your AI agent to write and execute code securely in a sandbox environment to solve complex end-to-end tasks. It supports Python, JavaScript, and TypeScript to run data analysis, calculations and code validations securely. You can also use it to monitor traces and metrics. 

Combined, it is easy to see what the Bedrock team were going for: a one-stop service that integrates several features and capabilities into the one tool. Further, we had some amazing announcements centred around Security and DevOps capabilities. 

Security Agent 

It is clear that AWS is taking security very seriously, in a big way, releasing tools that significantly reduce the effort required to secure code and containers. The focus spans the end-to-end development lifecycle, from inception to release. 

AWS Security Agent is a frontier agent that proactively secures your applications throughout the development lifecycle across all your environments. It conducts automated security reviews customised to your requirements, with security teams centrally defining standards that are automatically validated during reviews. The agent performs on-demand penetration testing customised to your application, discovering and reporting verified security risks. This approach scales security expertise across your applications to match development velocity while providing comprehensive security coverage. By integrating security from design to deployment, it helps prevent vulnerabilities early and at scale. 

DevOps Agent 

In public preview is the AWS DevOps Agent, a frontier agent that helps you respond to incidents, identify root causes, and prevent future issues through systematic analysis of past incidents and operational patterns. DevOps Agent is equivalent to an always-on, autonomous on-call engineer. When issues arise, it automatically correlates data across your operational toolchain, from metrics and logs to recent code deployments in GitHub or GitLab. It identifies probable root causes and recommends targeted mitigations, helping reduce mean time to resolution. 

There is a great blog already released for it and the capabilities are impressive on first look. It seems things are getting serious now about leveraging AI to help reduce operational effort in more ways than one.  

https://aws.amazon.com/blogs/aws/aws-devops-agent-helps-you-accelerate-incident-response-and-improve-system-reliability-preview/ 

This added capability does not replace humans: they still need to be part of the loop, and it does require specific knowledge of the findings and their context. It does, however, empower teams to reduce the impact of outages and incidents, and it is great at producing the root cause analysis post-incident. 

Security Hub is Now Generally Available 

I’ve been using the new Security Hub capabilities under the free preview for quite a while now, as features were released to AWS Community Builders and AWS Heroes under NDA, and it has been a true pleasure watching the capabilities grow with what seemed like week-by-week improvements leading up to re:Invent. The push to go GA as an announcement at re:Invent clearly put a lot of the development teams under significant pressure; however, it has been awesome to see this go live. The legacy version of Security Hub is now relegated to Security Hub CSPM (Cloud Security Posture Management). While the CSPM version remains useful for high-level findings, the new Security Hub really allows you to peel back the curtain and trace findings, and their associated resources, end-to-end. 

Human Connections 

The human connections made at re:Invent cannot be overstated. This year I met many old and new faces. The highlight for me was connecting with the worldwide community of AWS Ambassadors, AWS Community Builders, AWS Heroes and AWS User Group Leaders. Collaborating, sharing ideas and knowledge, demonstrating tooling and swapping experiences was amazing. Getting to meet product teams, AWS business leaders and product heads to discuss all things AWS, from products and releases to a few things we partners need help with, was also a highlight. Helping other partners with growing competencies, service delivery programs and foundational technical reviews was just icing on the cake. Human connections are what re:Invent is about, as much as it is a learning experience. 

Another big announcement was from Werner Vogels, presenting his last keynote for re:Invent after 14 years. Getting to see this in person and hear the insights and story of development from “back to the beginning” and the Renaissance Developer was a truly memorable highlight. 

Finally, running the PEX315 – Intelligent Security Operations with AWS: AI-Powered Incident Response workshop with my fellow workshop team from the AWS Security and ProServe teams was an amazing experience. A packed ballroom and an excited crowd said it all – people love getting hands-on and ideating on what comes next. There was a lot of enthusiasm from participants, and several commented on the great user experience of the workshop. Our workshop team have agreed to keep collaborating on the workshop and make continual improvements for future runs. 

 

Conclusion

re:Invent 2025 showcased AWS’s significant shift toward operationalising Agentic AI, with Bedrock AgentCore, Security and DevOps Agents, and the new Security Hub GA leading the way. The conference reinforced the importance of hands-on experience, human connections, and continuous innovation. For builders, security practitioners, and partners, the announcements provide clear pathways to build more intelligent, secure, and integrated AI solutions for the future.

The post AWS re:Invent 2025 Recap appeared first on Cevo.

]]>
AWS re:Invent 2025 Key Announcements https://cevo.com.au/post/aws-reinvent-2025-key-announcements/ Thu, 04 Dec 2025 22:31:00 +0000 https://cevo.com.au/?p=24691 AWS re:Invent 2025 delivered one of the most transformative years of innovation yet, signalling a major shift toward AI-native cloud platforms, frontier-scale infrastructure and streamlined developer experiences. From AgentCore and Nova 2 to Trainium3 UltraServers, S3 Vectors and multicloud networking with Google Cloud, AWS introduced capabilities that will fundamentally change how engineering teams build, automate and scale. In this blog, we break down the biggest announcements and what they mean for organisations looking to modernise, innovate and move faster with confidence.

The post AWS re:Invent 2025 Key Announcements appeared first on Cevo.

]]>

AWS re:Invent 2025 delivered one of the most transformative waves of innovation we’ve seen in years. These weren’t just incremental updates, but foundational leaps that reshape how we build, automate, and scale on the AWS cloud.  

From agent-driven development and next-generation AI infrastructure to major breakthroughs in data, security and serverless, AWS has clearly shifted into a new era of builders’ tooling. Whether you’re leading platforms, shipping products, running ML workloads or modernising operations, this year’s announcements redefine what’s possible and what will soon become the new normal.  

Below is a curated breakdown of the biggest releases you need to know and why they matter. 

Table of Contents

AI platforms, Agents & the Developer Experience 

AWS leaned hard into an agent-first developer experience this year. Amazon Bedrock AgentCore was introduced as a fully managed runtime for production AI agents, alongside new “frontier agents” such as Kiro (virtual developer), plus AWS Security Agent and AWS DevOps Agent designed to work through common SDLC and operational tasks. To support custom agent development, AWS expanded Strands Agents with TypeScript, CDK integration and edge deployment, making it easier than ever to build reliable, domain-specific agents. 

This is great because AgentCore and Frontier Agents finally take over all the repetitive coding and ops tasks we normally burn hours on, so we can ship features faster and give customers improvements in days instead of weeks. 

 

Models & Customisation: Nova, Bedrock & SageMaker AI

AWS made model choice and customisation a lot more practical this year. The new Nova 2 model family and Nova Forge introduce “open training” style customisation using pre-trained checkpoints and curated datasets. Nova Act is specialised for UI and browser automation.

On top of that, Bedrock Reinforcement Fine-Tuning (RFT) enables alignment-style tuning for domain accuracy, while SageMaker AI now supports serverless customisation, cutting experimentation cycles from months to days. 

The takeaway is simple: it’s now easier to tailor AI to real business context without a heavy ML ops lift, so customers can roll out smarter, more relevant experiences faster.

Trainium3 UltraServers: Next-Generation AWS AI Silicon 

AWS unveiled Trainium3 UltraServers, a new 3nm AI accelerator platform enabling frontier-scale model training and high-throughput inference. With massive performance and efficiency improvements over Trainium2, these systems are positioned to power the next generation of cost-efficient LLM workloads both inside AWS and for early adopters like Anthropic. 

This will:

  • Lower cost to train and run big models, making advanced AI use cases more commercially viable

  • Higher performance per watt, which helps at scale when workloads run continuously

  • More headroom for ambitious features, unlocking AI capabilities that were previously hard to justify on budget

New NVIDIA GB300 UltraServers on EC2 

AWS introduced P6e-GB300 UltraServers, based on NVIDIA’s GB300 NVL72 architecture, enabling ultra-large model training and high-bandwidth reasoning at scale. These instances bring frontier GPU performance into the EC2 ecosystem with deep integration into Nitro, EKS, and AWS’s ML stack. 

These servers are amazing as they will:

  • Unlock extreme GPU scale on demand, without specialised hardware investment

  • Accelerate training and high-throughput inference, especially for large reasoning workloads

  • Simplify production deployment, thanks to tighter integration with EKS, networking and the broader AWS ML toolchain

AWS AI Factories for Sovereign & Regulated Workloads

AWS AI Factories allow organisations to deploy managed, hyperscale AI infrastructure inside their own data centres. These combine Trainium, NVIDIA GPUs, AWS networking, storage, databases, Bedrock, and SageMaker into a fully managed “private AWS region”, ideal for sovereign workloads and regulated industries. 

AI Factories are a game-changer for our regulated customers as they will:

  • Enable modern AI adoption while keeping data on-prem, supporting sovereignty and stricter regulatory needs

  • Bring AWS-managed AI infrastructure behind the firewall, reducing operational burden for customers

  • Accelerate time-to-value for regulated use cases, without compromising control, governance, or residency requirements

Amazon S3 Updates: Vectors, 50 TB Objects and Data Lake Enhancements 

Amazon S3 received major AI-era upgrades: S3 Vectors (high-scale vector search built into S3), 50 TB object support for massive datasets, 10× faster S3 Batch Operations, and enhancements to S3 Tables for Apache Iceberg like Intelligent-Tiering and cross-account replication.  

We love this, because S3 is basically becoming an AI-native database. This will:

  • Make vector search more native and scalable, reducing the need to stitch together extra services

  • Support much larger datasets more simply, especially for AI and analytics workloads

  • Speed up large-scale data operations, improving throughput for batch processing and lakehouse management

Lambda Upgrades: Durable Functions and Managed Instances

AWS Lambda now supports durable functions, enabling long-running workflows that persist for up to a year. Lambda Managed Instances give developers EC2-level control (hardware, networking, pricing) with the ergonomics of Lambda, which is ideal for mission-critical pipelines and predictable workloads. 

This will:

  • Enable long-running, stateful workflows in serverless, without redesigning architectures

  • Give better cost and performance control, while keeping the simplicity of Lambda

  • Speed up delivery for event-driven and AI workflows, especially where orchestration and reliability matter most

Creation of a Lambda function with the Durable Function section highlighted. AWS re:Invent 2025

Networking & Multicloud: AWS Interconnect with Google Cloud 

AWS Interconnect – multicloud, built in partnership with Google Cloud, delivers private, high-bandwidth links between clouds using an open spec and coordinated operations, pushing multicloud from DIY networking to a fully managed experience. 

This will:

  • Finally make multicloud sane, eliminating the historical networking pain between AWS and GCP

  • Allow teams to combine services from both clouds without complex workarounds or fragile architectures

  • Enable customers to adopt true best-of-both-worlds solutions that maximise performance, resilience and choice

 

Security Enhancements: GuardDuty ETD, Security Hub Analytics and IAM Policy Autopilot

GuardDuty now provides Extended Threat Detection for EC2 and ECS, correlating host- and container-level signals into actionable, MITRE-mapped findings. Security Hub adds real-time risk analytics across AWS security services. IAM Policy Autopilot, an open-source MCP server, generates least-privilege IAM policies directly from your code. 

This will:

  • Reduce noisy alerts and cut through security signal overload

  • Automate least-privilege IAM, replacing manual policy tuning with intelligent defaults

  • Strengthen overall security posture while significantly reducing operational burden

  • Help teams avoid late-night fire drills by surfacing only the issues that actually matter

 

Customer Experience: Amazon Connect Agentic AI 

Amazon Connect now includes agentic AI for both self-service and human-assisted workflows. With Nova-powered speech models, Connect can automate tasks, generate documentation, analyse sentiment, and assist human agents with context-aware recommendations. 

This will:

  • Deliver quicker, more natural support for customers through agentic AI

  • Provide human agents with real-time guidance and next-best actions

  • Reduce wait times and boost service quality immediately

 

AI-Native Observability: CloudWatch and Related Services

CloudWatch gained AI-native observability for agents and LLMs, automatic application topology discovery, AI-generated incident reports, and expanded cross-account/Region telemetry via Logs Centralisation and Database Insights. 

This will:

  • Automate much of the detective work normally done by engineers

  • Spot issues earlier, identify root causes and analyse incidents faster

  • Improve system reliability and reduce time to recovery for customers

 

Dashboard view of a CloudWatch instance with GenAI observability enabled. AWS re:Invent 2025

Data Collaboration & Core Services: Clean Rooms, EKS, Route 53 Global Resolver, Partner Central 

AWS Clean Rooms introduced Clean Rooms ML synthetic data, a new capability that lets multiple parties collaborate on machine learning without sharing raw data. It generates noisy, de-identified synthetic datasets that preserve the statistical patterns of joint datasets (e.g., retailers + brands), enabling compliant ML training while reducing re-identification risk. Beyond data collaboration, AWS added new EKS capabilities for managed workload orchestration and cloud resource governance, launched Route 53 Global Resolver to unify DNS resolution across hybrid and multicloud environments, and integrated AWS Partner Central directly into the AWS Console for easier partner and marketplace management. 

This will:

  • Allow teams to collaborate on shared datasets without exposing sensitive information

  • Enable new joint modelling and analytics opportunities with partners

  • Unlock data-driven projects that were previously impossible due to privacy constraints

 

Updated AWS Support Experience

AWS reorganised support into Business Support+, Enterprise Support, and Unified Operations, with AI-powered assistance, faster response times, deeper security integration, and alignment with the new AI-driven operational tooling. 

This will:

  • Provide faster SLAs across incidents and operational requests

  • Use AI-powered troubleshooting to accelerate diagnostics and resolution

  • Improve uptime and reduce delays, delivering a smoother customer experience

 

Conclusion 

AWS re:Invent 2025 marks a turning point for builders on AWS, with breakthroughs across AI, compute, serverless, data, security, and operations that directly improve how teams design, deploy, and scale applications: 

  • AI becomes first-class with AgentCore, frontier agents, and expanded Bedrock/Nova models, giving developers production-ready AI building blocks instead of DIY glue. 
  • Model customisation gets easier thanks to Nova Forge, RFT, and SageMaker AI serverless customisation, making domain-specific AI accessible without ML ops complexity. 
  • Lambda steps up with durable functions and Managed Instances, enabling long-running AI workflows and cost-optimised compute without losing serverless ergonomics. 
  • Multicloud networking with AWS Interconnect makes cross-cloud architectures reliable, high-bandwidth, and finally simple. 
  • Security shifts left and gets smarter with GuardDuty ETD, Security Hub analytics, and IAM Policy Autopilot, thus reducing noise and tightening least-privilege by default. 
  • CloudWatch’s AI-first observability and improved support tiers give teams faster insights, better incident response, and more intelligent operations. 

Across the board, these releases remove friction, automate complexity, and give builders more powerful primitives. Developing on AWS becomes faster, safer, more scalable, and far more AI-native than ever before, setting the stage for how modern applications will be built in the years ahead. 

The AWS re:Invent 2025 announcements open the door to a new era of AI-native, high-performance cloud engineering. If you’re ready to explore how agentic AI, Trainium3, S3 Vectors, multicloud networking or the latest serverless and security capabilities can accelerate your delivery, Cevo can help.

Whether you’re scaling platforms, modernising workloads, or building out new AI-powered capabilities, our consultants partner with your team to design solutions that are robust, secure and future-ready.

Move faster with confidence. Reach out to Cevo to start the conversation.

The post AWS re:Invent 2025 Key Announcements appeared first on Cevo.

]]>
Continuous Evolution: Building Adaptability in the Age Of AI https://cevo.com.au/post/continuous-evolution-building-adaptability-in-the-age-of-ai/ Thu, 13 Nov 2025 23:08:39 +0000 https://cevo.com.au/?p=24662 In the age of AI, standing still is not an option. Continuous evolution and adaptability are essential for organisations to thrive, harness technology, and turn rapid change into a competitive advantage.

The post Continuous Evolution: Building Adaptability in the Age Of AI appeared first on Cevo.

]]>

TL;DR 
Continuous evolution and adaptability are essential for organisations to thrive in the age of AI. Businesses that embrace change by optimising people, processes and technology can leverage AI to enhance unique capabilities, drive efficiency, and stay competitive. Those that stagnate risk falling behind as technology and customer expectations evolve rapidly. 

Table of Contents

Why Continuous Evolution and Adaptability Matter in the Age of AI

Treading water keeps you afloat, but companies can’t survive for long if that is how they operate. Treading water is effectively standing still, or stagnating: no vision, no investment, minimal growth, and spending as little as possible. A company in this situation does not show its customers why they should do business with it. Yet some think this is a low-risk strategy (albeit with low returns) when in fact it is high risk, and any returns are short-lived. 

Without an active growth strategy, you will go backwards due to erosion of market perception, market share, reducing brand value and technical debt. In addition to this you’ll lose the muscle-memory to evolve and grow. 

McKinsey’s “The State of AI: Global Survey 2025” shows that high-performing organisations are those that invest in transformative innovation and redesigning of workflows using technology such as AI to achieve adaptability. The results are greater efficiency and reduced technical debt, increased competitive advantage and sustained growth.

Here at Cevo, “Continuous EVOlution” embodies our most fundamental core principles. It’s the idea that to remain relevant, deliver value to customers, and make the most of emerging technology, organisations must constantly adapt. 

Adaptability is something we see in nature. Species survive because they can refine how they use their resources and respond to environmental challenges. The same principle applies to businesses. Those that can evolve in response to rapid change come out on top. Those that don’t risk falling behind, and they may not be able to recover by the time they realise what has happened around them. 

 

The Evolution of Technology Adoption 

Over the past few decades, we’ve seen this play out in technology. Prior to the cloud era, many organisations managed risk by minimising change. Systems were designed to be stable and predictable but that also meant they were slow to adapt. Cloud brought elasticity and standardisation, making extensive automation more possible. Today, the pace of change is accelerating. AI is fast-tracking new ways of working, including analysis, design, implementation and operations, making technology environments more dynamic than ever. 

To thrive, organisations must build adaptability into every layer of their business: systems, processes, and people. That’s a tall order, but it’s essential. If your organisation or even just your team isn’t evolving, it’s a signal that it’s time to catch up. 

 

AI Adoption: A Case Study in Continuous Evolution 

AI exemplifies this challenge, and opportunity. To make the most of AI, organisations need to get familiar with AI becoming a first-class citizen within their organisation, and embrace evolution at each stage of maturity: 

  • Human augmentation – Start by using AI to assist employees in their daily work. AI Assistants are appearing everywhere from productivity applications to CRM/Sales & Service and emerging in ERP. Development of policy and guidelines will help to build the required safety around the adoption of these tools. 
  • Task completion – Move toward AI performing entire tasks, with humans in the loop for oversight. This is the next step in maturity in the adoption of AI. It’s very powerful for determining a return on investment, as AI now participates in the outputs of your organisation. 
  • Orchestration – Eventually, AI can coordinate and manage multiple tasks, freeing humans to focus on higher-value work. For most, this is too far to reach for directly, but it will come soon enough. Progressing through each of these stages will prepare your people and organisation. 

 

Each of these stages of maturity in adoption delivers a significantly different return on investment. The biggest gains come when AI is applied to tasks that are unique to your organisation, those differentiators that set you apart from competitors. This is reflected in Gartner’s “Innovation Guide for Generative AI Consulting and Innovation Services”. It notes that successful companies focus on pragmatic, enterprise-scale solutions rather than hype, and adopt standardised evaluation criteria to assure market viability and innovation resilience. Automating generic tasks is a good start, but real advantage comes from using AI to enhance your unique capabilities. 

Continuous Evolution Building Adaptability in the Age Of AI

Learning from the Evolution of Middleware and Process Automation 

Understanding the evolution of technology adoption helps contextualise AI’s potential. In the 2000s, middleware and process automation filled a critical gap: connecting disparate systems and enabling organisations to execute their unique processes. Those solutions were revolutionary at the time, combining system-to-system integration and human interaction, and supporting long-running processes with enhanced visibility and control. Many of the interactions, such as email and SMS, were hard-coded. As work evolved, these complex new middle layers needed to change rapidly, which meant processes already running had to be adapted to the new way, or processes had to be version controlled. 

Eventually these layers became brittle and complex, and most organisations have been reluctant to change them since, even though they typically represent an organisation's highest-value processes. These systems codify processes that are often unique and cannot simply be purchased in an off-the-shelf application. 

 

Why People and Processes are Critical for Successful AI Adoption 

AI represents a new inflection point. It can read, understand, and act on information in ways traditional systems could not. It can fill gaps that have existed for decades, interpreting old processes and data into something new, efficient, and scalable. More importantly, AI allows organisations to leverage what makes them unique and amplify it. Ultimately, AI makes systems more adaptable and enables them to evolve in real-time. 

We’ve seen steps in this evolution of technology and services for decades. Cloud was the most significant step forward through the provision of elastic capacity and capability. Now AI allows us to adapt to changes in ways of working, data processing, integration and how we want to change the way we connect with our customers and business partners. 

Technology alone isn't enough. To maximise the benefit we can extract from AI, we can't treat it like a shrink-wrapped capability. We've explored the need for people to change their mindset and understanding. This includes thinking through how processes will change, but also getting comfortable with even more change happening faster. Organisations must cultivate a culture where experimentation is encouraged, risk is managed, and continuous learning is embedded into everyday work. 

At Cevo, the principle of continuous evolution is intrinsically linked to how we see organisations gaining advantage from technology. It shapes our own internal perspective and ways of working, how we deliver results, and how we help our customers navigate AI adoption. Adaptability starts with a mindset. 

 

The Opportunity Ahead: Continuous Evolution and Adaptability in AI

We're at an exciting moment. Technology has reached a point where it can significantly augment human capabilities, drive efficiency, and unlock differentiation in ways that were unimaginable only a few years ago. To seize this opportunity, organisations must embrace evolution, not just of the technology itself, but of the people and processes that make it work. 

The future belongs to those who actively prepare to adapt, experiment, and evolve continuously. In a world defined by rapid technological change, adaptability isn't optional; it's essential. 

Ready to turn AI ambition into practical evolution? Reach out to our team of expert consultants and we'll help you move from ideas to impact.

The post Continuous Evolution: Building Adaptability in the Age Of AI appeared first on Cevo.

]]>
Transforming Data Engineering with DevOps on the Databricks Platform  https://cevo.com.au/post/transforming-data-engineering-with-devops-on-the-databricks-platform/ Tue, 11 Nov 2025 05:30:17 +0000 https://cevo.com.au/?p=24615 The role of the Data Engineer is rapidly changing, from writing ETL scripts to engineering production-grade data products. On the Databricks Lakehouse Platform, this shift demands more than technical know-how; it requires a DevOps mindset. By embracing software engineering best practices, automated testing, and CI/CD pipelines, data teams can deliver scalable, reliable, and secure solutions. This blog explores how DevOps principles and tools like Git Folders and Databricks Asset Bundles are transforming data engineering into a discipline of continuous innovation and delivery.

The post Transforming Data Engineering with DevOps on the Databricks Platform  appeared first on Cevo.

]]>

TL;DR 
Data engineering is evolving from manual ETL scripting to delivering robust, production-grade data products. By adopting DevOps principles on the Databricks Lakehouse Platform (modular code, rigorous automated testing, CI/CD pipelines with Git Folders, and Databricks Asset Bundles), teams can automate deployment, ensure reliability, and accelerate delivery. This approach transforms data work into scalable, maintainable, production-ready solutions. 

Table of Contents

Introduction 

The role of the Data Engineer has fundamentally shifted. It is no longer enough to write functional ETL (Extract, Transform, Load) scripts; modern data delivery requires building robust, scalable, and secure data products. This transformation mandates the adoption of DevOps principles: a convergence of culture, automation, and tooling applied directly to the Databricks Lakehouse Platform.

By integrating software engineering best practices, rigorous testing, and automated Continuous Integration and Continuous Delivery (CI/CD), organisations can transition from manual, error-prone data workflows to automated, production-grade pipelines, underpinned by a solid data strategy. In this blog I explore the essential components needed to achieve true DataOps maturity and master DevOps for Data Engineering on Databricks. 

 

1. The Foundation: Software Engineering Best Practices (SWE)

The foundation of any successful DevOps strategy is high-quality code. Before automation can be effective, the assets being automated must be reliable and maintainable. This requires every Data Engineer to adopt the mindset of a Software Engineer. 

 

Modular Design and Code Quality 

The first step away from “notebook scripting” is embracing modularity. Rather than writing sprawling, single-purpose notebooks, your logic should be decomposed: 

 

  • Reusable Components: Core data transformation logic (e.g., standardising addresses or calculating metrics) must be factored out into reusable Python or Scala modules. These modules should be stored outside the notebook environment, versioned separately, and imported when needed. This practice ensures your logic stays DRY (Don't Repeat Yourself). 
  • Style and Readability: Adherence to standards, formats, and consistent naming conventions is essential. Automated tools such as linters and formatters should run during the development cycle to ensure code is consistently formatted and easy for any team member to read and debug. 
  • Documentation: Documentation moves beyond simple comments. Every function and class should include detailed docstrings explaining its purpose, parameters, and return values. This intrinsic documentation significantly lowers the barrier to entry for new developers and simplifies maintenance. 
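To make the modular approach concrete, here is a minimal sketch of what a reusable transformation module might look like; the module path, function names, and state mapping are illustrative, not taken from a real project:

```python
# Sketch of a reusable module, e.g. src/transforms/addresses.py, kept
# outside the notebook environment, versioned separately, and imported
# into any pipeline that needs it.

STATE_ABBREVIATIONS = {
    "victoria": "VIC",
    "new south wales": "NSW",
    "queensland": "QLD",
}


def standardise_state(raw: str) -> str:
    """Normalise a free-text Australian state name to its abbreviation.

    Args:
        raw: The state value as captured at the source, in any casing.

    Returns:
        The standard abbreviation, or the trimmed, upper-cased input
        when the value is not recognised.
    """
    cleaned = raw.strip().lower()
    return STATE_ABBREVIATIONS.get(cleaned, raw.strip().upper())


def standardise_address(record: dict) -> dict:
    """Return a copy of an address record with a standardised state field."""
    return {**record, "state": standardise_state(record["state"])}
```

Because the logic lives in a plain module rather than a notebook cell, it can be reused across pipelines and unit-tested without a Spark session.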

 

Rigorous Automated Testing 

The only way to guarantee a pipeline works reliably is through automated testing. Testing provides the confidence required to implement Continuous Delivery safely. 

 

  • Unit Testing: This is the most granular level of testing. Unit tests should verify individual functions and modules without needing a live Spark session or access to Databricks. This ensures the core business logic remains sound, irrespective of the underlying data platform. 
  • Integration Testing: This is where the Databricks environment becomes critical. Integration tests check how different parts of the pipeline interact, for instance, loading data from an external source, applying Spark transformations, and writing the result to a Delta Lake table. These tests ensure the end-to-end data flow works as expected within the target environment. 
  • Data Quality Checks: Beyond code functionality, testing in data engineering requires validating the data itself. Features within Delta Live Tables (DLT) such as Expectations are a powerful form of testing, allowing engineers to define data quality rules that can warn, quarantine, or fail a pipeline if constraints are violated. 
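As a sketch of what the unit-test level looks like in practice, the example below tests a metric calculation in the pytest style. The conversion_rate function is hypothetical, used purely for illustration, and nothing here needs a live Spark session or a Databricks workspace:

```python
# Unit tests for pure business logic, runnable with pytest (or directly),
# with no dependency on Spark or the Databricks platform.

def conversion_rate(orders: int, visits: int) -> float:
    """Return orders per visit, treating zero visits as a rate of 0.0."""
    if visits == 0:
        return 0.0
    return orders / visits


def test_conversion_rate_happy_path():
    assert conversion_rate(25, 100) == 0.25


def test_conversion_rate_handles_zero_visits():
    # A regression test like this is what Test-Driven Development
    # recommends adding when a divide-by-zero bug surfaces in production.
    assert conversion_rate(10, 0) == 0.0
```

Keeping these tests free of platform dependencies means they run in seconds on any CI runner, which is what makes the automation described below practical.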

 

Best Practice Tip: Always write your unit tests before or while writing the actual transformation code (Test-Driven Development). If you find a bug in production, write a test that reproduces it before fixing the bug. 


2. DevOps for Data Engineering on Databricks: Building CI/CD Pipelines 

Implementing DevOps for Data Engineering on Databricks provides the framework for how we deliver this high-quality code continuously. In the data world, this is often called DataOps, and applying these principles ensures pipelines evolve safely and predictably as data changes over time.

 

Continuous Integration (CI) 

CI is the foundation of modern delivery. It is a development practice where developers merge code changes frequently (at least once a day) into a shared central repository, where automated checks enforce code quality. 

The CI process, typically executed by a CI service (e.g., GitHub Actions, GitLab CI/CD), performs several critical, automated checks: 

 

  • Build Validation: Checking out the code and ensuring all library dependencies resolve correctly. 
  • Linting and Formatting: Enforcing code quality standards and style. 
  • Unit Test Execution: Running all unit tests on the code modules. 
  • Security Scanning: Checking for dependencies with known vulnerabilities. 

 

For Databricks, CI ensures that every committed change has passed all software checks before it can even be considered a viable candidate for deployment. 
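Wired together, the checks above become a short CI workflow. The sketch below uses GitHub Actions syntax, and the specific tools (ruff for linting, pytest for unit tests, pip-audit for vulnerability scanning) are assumptions; substitute whatever your team has standardised on.

```yaml
# Hypothetical CI workflow: runs the full battery of checks on every
# pull request targeting the main branch.
name: ci
on:
  pull_request:
    branches: [main]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt  # build validation
      - run: ruff check src tests             # linting and formatting
      - run: pytest tests/unit                # unit test execution
      - run: pip-audit                        # dependency vulnerability scan
```

A failing step blocks the merge, which is exactly the gate that keeps unvalidated code out of the deployable artefacts described next.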

Version control & branching strategy - DevOps for Data Engineering on Databricks

Continuous Delivery (CD) 

CD extends CI by automatically preparing and deploying the validated code to staging or testing environments. The core principle of CD is deployable artefacts: the jobs, libraries, and notebooks that passed CI are packaged and ready to be deployed to any downstream environment at a moment’s notice. 

CD transforms data engineering delivery from a slow, manual ticketing process into an automated, low-risk release cadence. 

See Databricks references on CI/CD  


3. Continuous Integration in Databricks: Leveraging Git Folders 

The biggest obstacle for CI in Databricks used to be synchronising the external Git repository with the internal notebook environment. Databricks Git Folders (or Repos) solved this problem, making CI a seamless reality. 

 

Git Folders as the Workspace Bridge 

Git Folders connect a specific folder in a Databricks workspace directly to a remote Git repository (e.g., Azure DevOps, GitHub). This integration achieves several crucial objectives: 

 

  • Enforced Source of Truth: The remote Git repo, not the Databricks workspace, becomes the authoritative source of your code. If the workspace is accidentally deleted, your code persists safely in Git. 
  • Branch-Based Development: Developers can check out a specific feature branch within the Git Folder, develop and test their changes, and then use standard Git operations (commit, push, pull) to manage their work, all without leaving the Databricks UI. 
  • CI Triggering: When a developer pushes a change or opens a Pull Request (PR), the external Git service automatically triggers the CI pipeline. This pipeline fetches the new code, runs tests, and validates quality. Only after the PR is merged into the main branch does the code become a validated artefact, ready for CD. 

 

See Databricks reference on Git Folders 

 

4. The Path to CD: Databricks Deployment Methods

Once the code is validated in the CI phase, the final step is Continuous Deployment. This involves promoting the verified assets (jobs, DLT pipelines, models) from the version control system into the target Databricks workspace (Staging or Production). 

 

Traditional Automation Interfaces 

For many years, automation relied on the programmatic interfaces provided by Databricks: 

 

  • REST API and CLI: The Databricks REST API is the underlying engine for all automation. The CLI provides a convenient shell wrapper to execute API calls (e.g., uploading notebooks, creating clusters, configuring jobs). These tools are versatile but often require complex scripting to manage environment-specific variables and asset dependencies. 
  • Databricks SDKs: Available in languages like Python, these SDKs offer a cleaner, object-oriented way to interact with the Databricks control plane. They allow engineers to manage resources directly within their automation code. 

 

The Modern Paradigm: Databricks Asset Bundles (DABs) 

Databricks Asset Bundles (DABs) represent the current best practice for deploying complex solutions. They introduce Infrastructure-as-Code (IaC) principles to the Databricks environment. 

A DAB is essentially a declarative YAML configuration file that defines all the resources required for an application or pipeline. 

 

  • Artifacts: Specifies the source notebooks and libraries to be deployed. 
  • Resources: Defines the operational assets, such as Databricks Jobs, Delta Live Tables (DLT) pipelines, and clusters. 
  • Environments: Allows easy switching between configuration parameters for different deployment targets (e.g., Staging vs. Production), handling variations in cluster size, secrets, and data paths. 
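A minimal bundle definition might look like the sketch below; this is a hypothetical databricks.yml, and the bundle name, job, notebook path, and workspace hosts are all placeholders:

```yaml
# Hypothetical databricks.yml defining one job and two deployment targets.
bundle:
  name: customer_pipeline

resources:
  jobs:
    nightly_customer_load:
      name: nightly-customer-load
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./notebooks/transform.py

targets:
  staging:
    workspace:
      host: https://adb-staging-example.azuredatabricks.net
  production:
    workspace:
      host: https://adb-prod-example.azuredatabricks.net
```

With a file like this in place, promoting the same assets to a different environment is a single command (e.g. databricks bundle deploy -t staging), which is what keeps external CD pipeline scripts so simple.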

 

The Key Benefits of Databricks Asset Bundles (DABs): 

  1. Atomic Deployment: DABs ensure that related assets (a job, its cluster, its pipeline, and its configuration) are deployed and managed as a single unit, eliminating configuration drift. 
  2. Simplicity of CD: A single CLI command (databricks bundle deploy) can handle the entire deployment process to a specified environment, drastically simplifying external CD pipeline scripts. 
  3. Cross-Domain Support: DABs are versatile and can deploy traditional data engineering pipelines, MLOps stacks, and any other solution composed of Databricks assets. 

See Databricks reference on DAB 


Conclusion: Embracing Engineering Excellence 

Adopting DevOps for Data Engineering on Databricks is a journey that moves the organisation closer to true business agility. It is a necessary shift from viewing data work as individual tasks to seeing it as a cohesive, software-driven product line, and mastering this discipline ensures the transformation is sustainable, scalable, and future-ready.

By mandating modular code, enforcing automated testing, building CI/CD pipelines with Git Folders, and utilising the declarative power of Databricks Asset Bundles, Data Engineers can achieve unparalleled levels of pipeline reliability, consistency, and speed. This solidifies their role as critical enablers of the modern data-driven solution. 

 

Key Takeaways 

  • Software Engineering Best Practices: The code must be modular, tested, and documented before deployment can be reliable. 
  • DevOps & CI/CD: The engine that automates quality checks and deployment, making releases faster and safer. 
  • Databricks Git Folders: The bridge that connects the code repository (Git) directly to the Databricks workspace, enabling Continuous Integration (CI). 
  • Databricks Asset Bundles (DABs): The modern, declarative way to package and deploy everything (jobs, clusters, and pipelines) as a single, consistent unit across environments. 
  • The Result: A move from manual scripting to fully automated, production-grade data product delivery. 

 

What Next?  

Ready to start your DevOps journey on Databricks? 

  1. Start Small: Choose one existing notebook and refactor its core logic into a reusable Python module. Write a unit test for that module. 
  2. Connect Git: Set up your first Databricks Git Folder, linking your workspace to a remote repository. 
  3. Explore DABs: Experiment with creating a simple Databricks Asset Bundle to deploy a basic job. 

 

Take that first step today and transform your data engineering workflow from reactive scripting to proactive engineering excellence. 

The post Transforming Data Engineering with DevOps on the Databricks Platform  appeared first on Cevo.

]]>