Codefresh (https://codefresh.io/)

Pipeline Performance Profiling: Making CI/CD Performance, Cost, and Bottlenecks Visible
https://codefresh.io/blog/pipeline-performance-profiling/
Mon, 26 Jan 2026 15:56:34 +0000

Modern CI/CD pipelines are no longer just about whether builds succeed; they’re about how fast, how efficiently, and at what cost they run.

One theme has come up consistently in customer conversations:  

 “My builds are slow, expensive, and I don’t know where to start fixing that.”

Pipeline Performance Profiling is designed to close this gap by making pipeline behavior observable, measurable, and explainable. Instead of treating build time as a single opaque number, it breaks pipeline execution down into clear phases, steps, and resource signals. Built on OpenTelemetry and compatible with Prometheus, it exposes these insights as open, industry-standard metrics, giving teams the flexibility to analyze pipeline cost and performance with the same monitoring tools they already use. This allows teams to understand where time and resources are spent, why bottlenecks occur, and how to make informed trade-offs between speed, cost, and reliability.

Why We Built Pipeline Performance Profiling

Our customers asked us clear, practical questions:

  • How do I choose the right machine sizes for my builds without wasting money?
  • Where exactly are my pipeline bottlenecks?
  • Why do some steps feel slow even when nothing has changed?
  • How much time are we losing pulling images instead of executing code?
  • Can I analyze Codefresh builds with the same monitoring tools I already use?

Until now, answering these questions required guesswork, manual timing, or support escalations with limited data.

Pipeline Performance Profiling changes that by instrumenting the pipeline runtime itself and exposing step-level time and resource metrics in a way that’s easy to analyze, trend, and correlate.

What Is Pipeline Performance Profiling?

Pipeline Performance Profiling is the first phase of Codefresh’s broader Pipeline Observability initiative.

It provides:

  • Step-level execution timing (active vs. idle)
  • Initialization vs. execution breakdown
  • CPU and memory metrics
  • Cache usage visibility
  • Prometheus-compatible metrics, visualized through Grafana dashboards
  • OpenTelemetry-native instrumentation, so your data fits into existing observability pipelines

This foundation supports developers, DevOps engineers, and platform teams who need evidence-based answers, not assumptions.
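Because the metrics are exposed in an open format, wiring them into an existing monitoring setup is a standard Prometheus exercise. The sketch below shows what a scrape job could look like; the job name, target address, and port are hypothetical placeholders rather than actual Codefresh endpoints, so check the Codefresh documentation for the real values.

```yaml
# Hypothetical Prometheus scrape job for pipeline runtime metrics.
# The job name, target address, and port are illustrative placeholders.
scrape_configs:
  - job_name: codefresh-pipeline-metrics      # placeholder job name
    scrape_interval: 30s
    static_configs:
      - targets:
          - codefresh-runtime.example.internal:9090   # placeholder target
```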

Performance: Find Time Sinks and Bottlenecks

One of the hardest performance problems to solve is unexplained slowdown. Builds still succeed, configurations haven’t changed, yet pipelines take longer to complete, leaving teams guessing where the time is going.

“Our builds feel slower, but nothing obvious changed.”

With Pipeline Performance Profiling, teams no longer have to rely on intuition or one-off comparisons. By breaking pipeline execution into measurable phases and steps, the dashboards make it clear where time is actually spent and how that changes over time.

With this visibility, teams can answer questions such as:

Build and Step Duration Trends

Understanding whether performance is improving or regressing requires looking beyond individual builds.

  • How does build duration change over time?
  • Are pipelines getting faster, or are there gradual regressions?
  • Which steps consistently dominate execution time across builds?

These trends help teams spot slowdowns early and focus optimization efforts where they will have the biggest impact.
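Since the data is Prometheus-compatible, trend questions like these map to ordinary PromQL. As a hedged sketch, a recording rule could precompute the P95 build duration per pipeline; the metric name build_duration_seconds_bucket is a hypothetical placeholder, not necessarily the name Codefresh exposes.

```yaml
# Hypothetical Prometheus recording rule for P95 build duration per pipeline.
# build_duration_seconds_bucket is a placeholder metric name.
groups:
  - name: pipeline-duration-trends
    rules:
      - record: pipeline:build_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum by (pipeline, le) (
              rate(build_duration_seconds_bucket[1d])))
```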

Initialization Time: Understanding Build Preparation Delays

Many pipeline slowdowns happen before any build step runs. The initialization phase includes several setup stages that can often be optimized to speed things up. Pipeline Performance Profiling makes these stages visible through build preparation duration metrics (such as P95 trends), helping teams quickly see when and why startup time increases.

Common contributors include:

  • Request account clusters: Attaching many clusters can slow build startup as each one is contacted. Limiting pipelines to only the required clusters can significantly reduce initialization time.
  • Validate Docker daemon: Slow validation often indicates delays in provisioning Docker-in-Docker pods, which depend on cluster configuration and capacity.
  • Start composition services: Pipelines with many Docker Compose services may start more slowly due to image pulls and container startup.

By identifying which part of initialization is responsible for delays, teams can make targeted configuration changes and reduce overall pipeline startup time.

Cost: Optimize Resource Usage Without Guesswork

Slow pipelines are frustrating, but expensive pipelines are even worse. Too often, CI/CD resource decisions are made conservatively: teams over-provision CPU and memory to avoid failures, without clear visibility into whether those resources are truly needed. Pipeline Performance Profiling addresses this by exposing how resources are actually consumed over time, enabling teams to use historical data instead of assumptions to make deliberate, data-driven decisions about sizing and efficiency.

Pipeline Performance Profiling helps teams answer questions such as:

Resource Utilization per Pipeline and Step

  • What is the average and peak CPU usage per pipeline and per step?
  • How much memory is actually consumed during builds?
  • Are there steps that briefly spike resource usage while the rest of the build remains under-utilized?

Right-Sizing and Cost Optimization

  • Which pipelines are consistently under-utilizing allocated resources?
  • Where are CPU or memory requests clearly higher than necessary?
  • Can pod sizes or machine types be safely reduced without impacting build stability?

With this visibility, right-sizing becomes a controlled, low-risk process rather than a guessing game. Teams can adjust resource allocations confidently, validate changes over time, and reduce CI infrastructure costs without sacrificing performance or developer productivity.
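As an illustration of how such a check could be automated on top of Prometheus-style metrics, the rule below flags pipelines whose average CPU usage over a week stays well below the requested amount. Both metric names are hypothetical placeholders, and the 0.3 threshold is an arbitrary example.

```yaml
# Hypothetical alerting rule flagging over-provisioned pipelines.
# cpu_usage_cores and cpu_request_cores are placeholder metric names.
groups:
  - name: pipeline-right-sizing
    rules:
      - alert: PipelineCPUOverProvisioned
        expr: |
          avg_over_time(cpu_usage_cores[7d]) / on (pipeline) cpu_request_cores < 0.3
        for: 1d
        labels:
          severity: info
        annotations:
          summary: "Average CPU usage is under 30% of the request; consider right-sizing."
```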

Cache Utilization: Stop Paying for Repeated Work

Repeated image pulls and dependency downloads are a quiet but persistent drain on pipeline performance. When caching is ineffective or inconsistently used, teams end up paying the cost on every build: longer startup times, wasted network bandwidth, and slower feedback loops for developers.

Pipeline Performance Profiling makes cache behavior visible, so teams can understand whether caching is actually helping and where it falls short. Instead of guessing, they can see the real impact cache usage has on build duration and step execution.

With this visibility, teams can answer questions such as:

  • How often are cache volumes reused across builds?
  • How does overall build time differ when a cache volume is reused versus not reused?
  • How does step duration change when cache hits occur?

This makes it easier to identify pipelines that rarely benefit from caching, steps that could be restructured to improve reuse, and opportunities to reconfigure cache volume usage to improve hit rates, all while reducing build startup time. Even modest improvements in cache utilization can translate into noticeable gains in speed and efficiency at scale.

Bring Your Own Observability Stack

A core design goal for Pipeline Performance Profiling was avoiding lock-in to proprietary tooling such as Datadog. The metrics it exposes are fully compatible with OpenTelemetry and Prometheus and are visualized through Grafana, making pipeline performance a first-class part of your existing observability ecosystem rather than a separate, siloed view.

This approach allows teams to:

  • Analyze Codefresh pipeline metrics alongside application and infrastructure data
  • Correlate slow pipeline steps with cluster-level CPU or memory pressure
  • Feed build performance data into existing dashboards, alerts, and cost-analysis workflows

For hybrid and on-prem environments in particular, this brings first-class pipeline observability using tools teams already trust, without requiring new platforms or specialized integrations.

Grafana Dashboards

Metrics are only useful if teams can easily explore, understand, and act on them. To make Pipeline Performance Profiling immediately practical, Codefresh provides two ready-to-use Grafana dashboards that turn raw pipeline metrics into clear, actionable insights, helping teams move quickly from “something feels slow” to “here’s where the problem is.”

1. Pipeline Overview

The Pipeline Overview dashboard is designed to help teams understand how their pipelines behave over time, rather than focusing on a single build in isolation. It provides a high-level view of performance trends, making it easier to spot gradual slowdowns, sudden regressions, or improvements introduced by recent changes.

Using this dashboard, teams can answer questions such as:

  • How is build duration changing over time?
  • What is causing delays during build initialization?
  • Are resources allocated effectively for this pipeline?

2. Build Details

While the Pipeline Overview focuses on trends, the Build Details dashboard zooms in on individual builds to support faster troubleshooting and optimization. It is designed for moments when something looks off and teams need to understand exactly what happened during a specific execution.

With this level of detail, teams can answer questions such as:

  • Which step consumes the most time or resources?
  • Can the step structure be redesigned for faster execution?
  • What caused delays during build bootstrap or startup?

Together, these dashboards provide both the big picture and the fine-grained detail needed to move from detection to diagnosis. Teams can track long-term performance trends, investigate anomalies when they occur, and make informed changes with confidence.

We also encourage teams to extend these dashboards with additional graphs and views tailored to their specific workflows and business goals. As you customize and optimize your dashboards, we’d love to hear about your experience; your feedback helps us continue improving Pipeline Performance Profiling for you and for other customers.

What’s Next

Pipeline Performance Profiling is an important first step, and we’re actively iterating on it based on real-world usage and customer feedback. We’re already working closely with early customers, using real builds and environments to validate metrics, dashboards, and workflows — and the feedback so far has been invaluable.

Over the coming months, we’ll continue evolving this capability with a focus on deeper insights and improved usability. Key areas of investment include:

  • Expanded support for SaaS customers: enabling environments where Codefresh hosts both the control plane and runtime to benefit from the same metrics and to analyze them with existing observability tools.
  • Richer Grafana dashboards: making it easier to spot regressions, identify anomalous builds, and understand performance patterns.
  • Foundations for performance management: laying the groundwork for future capabilities such as actionable insights, smarter analysis, and performance regression detection.

As always, this work is driven by real customer needs. We encourage you to start exploring Pipeline Performance Profiling with your own pipelines, extend the dashboards to match your workflows, and let the data guide your optimization efforts. Your usage and feedback help shape what comes next and allow us to continue improving the experience for all Codefresh customers.

The post Pipeline Performance Profiling: Making CI/CD Performance, Cost, and Bottlenecks Visible appeared first on Codefresh.

Anatomy of a Pull Request Generator
https://codefresh.io/blog/anatomy-of-a-pull-request-generator/
Thu, 02 Oct 2025 14:48:57 +0000

Argo CD provides a number of generators to support various scenarios that developers need when using Argo CD and Kubernetes. In this post, I’ll be discussing the Pull Request Generator. A Pull Request Generator is an Argo CD ApplicationSet generator that is configured to “watch” a Git repository for Pull Requests (PRs). Whenever a new PR that matches the specified filter is submitted, Argo CD applies the manifests from the referenced repository and path. This allows you to test the PR changes in an ephemeral environment. In addition, the Pull Request Generator cleans up after itself when the PR is closed, removing the resources that it deployed.

The manifest for a Pull Request Generator can look quite daunting, as there is some additional configuration required to make it work. In this post, I’ll break down the manifest so it is a little more consumable.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: octopub-pullrequestgenerator
  namespace: argocd
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul
        repo: pullrequestgenerator
        # URL of the Bitbucket Server. Required.
        api: https://bitbucket.octopusdemos.app
        # Either basicAuth or bearerToken authentication is required
        # to access private repositories.
        # Credentials for Bearer Token (App Token) authentication.
        bearerToken:
          # Reference to a Secret containing the bearer token.
          tokenRef:
            secretName: bitbucket-token
            key: token
      # Labels are not supported by Bitbucket Server, so filtering by label is not possible.
      # Filter PRs using the source branch name. (optional)
      filters:
      - branchMatch: ".*-argocd"
  template:
    metadata:
      name: 'octopub-{{.branch}}-{{.number}}'
    spec:
      source:
        repoURL: 'https://bitbucket.octopusdemos.app/scm/pul/pullrequestgenerator.git'
        targetRevision: '{{.head_sha}}'
        path: manifests
      project: "default"
      destination:
        server: https://kubernetes.default.svc
        namespace: 'octopub-{{.branch}}-{{.number}}'
      syncPolicy:
        automated: 
          prune: true
          selfHeal: true
        syncOptions:
        - Validate=false
        - CreateNamespace=true

Kind

To support multiple PRs existing at the same time, the Pull Request Generator must use an ApplicationSet.  An ApplicationSet acts as an “application factory” to automatically generate applications from a single manifest file.

kind: ApplicationSet

Spec: Generators

Within the manifest specification (spec), you can define one or more generators.  This post focuses on the Pull Request Generator specifically, which is denoted by pullRequest.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - pullRequest:

By default, Argo CD checks for pull requests every 30 minutes. The manifest provides a field to override this value called requeueAfterSeconds. In my example, I’ve configured Argo CD to check for PRs every two minutes (120 seconds).

Note: Exercise caution when configuring requeueAfterSeconds, as it could lead to API rate limiting for cloud-based source control managers.

  - pullRequest:
      requeueAfterSeconds: 120

Git Provider

Git providers implement core Git commands such as pull, push, and fetch in the same way, but pull requests are not part of Git itself; each provider exposes them through its own API. The Pull Request Generator therefore needs to know which provider it is talking to so that it can make the appropriate API calls. In my example, I configured the Pull Request Generator to work with my Bitbucket Server instance (see here for a list of supported Git providers and the specifics for configuring them).

  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:

For the Bitbucket Server configuration, you’ll need to define the following:

  • Project
  • Repo
  • Api
  • Authentication

Project

The Argo CD example in their documentation is misleading when it comes to this value. Their example makes it look like this is the Bitbucket Project name; however, the field actually expects the Project key. (Bitbucket Server displays Project keys in upper-case, but manifests require the value in lower-case.)

  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul

Repo

Projects within Bitbucket Server may have multiple repositories configured. This is the name of the repository you would like Argo CD to monitor.

  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul
        repo: pullrequestgenerator

API

This value is simply the URL to the Bitbucket Server instance.  In my case, it is https://bitbucket.octopusdemos.app

  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul
        repo: pullrequestgenerator
        # URL of the Bitbucket Server. Required.
        api: https://bitbucket.octopusdemos.app

Authentication

Argo CD needs to be able to authenticate to the Bitbucket Server so it can monitor the requested repositories.  The Bitbucket Server implementation offers two authentication mechanisms:

  • BasicAuth
  • BearerToken

My example uses the BearerToken method. This value is a Personal Access Token (PAT) for Bitbucket Server.

Argo CD uses Kubernetes resources for this authentication, so the PAT is stored as a Secret within your cluster.  This can be created using something similar to this:

apiVersion: v1
kind: Secret
metadata:
  name: bitbucket-token
  labels:
    argocd.argoproj.io/secret-type: repository
  namespace: argocd
type: Opaque
stringData:
  token: <Personal Access Token value>

This secret is then referenced in the bearerToken section:

  - pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul
        repo: pullrequestgenerator
        # URL of the Bitbucket Server. Required.
        api: https://bitbucket.octopusdemos.app
        # Credentials for Bearer Token (App Token) authentication. Either basicAuth or bearerToken
        bearerToken:
          # Reference to a Secret containing the bearer token.
          tokenRef:
            secretName: bitbucket-token
            key: token

Filters

Filters are how you tell the Pull Request Generator what to match on. In my example, I’m telling Argo CD to create resources only when a PR is created from branches that end in “-argocd”. The match is done with a regular expression, which is why the pattern contains a period (".*-argocd") even though no period appears in the branch name itself.

- pullRequest:
      requeueAfterSeconds: 120
      bitbucketServer:
        project: pul
        repo: pullrequestgenerator
        # URL of the Bitbucket Server. Required.
        api: https://bitbucket.octopusdemos.app
        # Credentials for Bearer Token (App Token) authentication. Either basicAuth or bearerToken
        bearerToken:
          # Reference to a Secret containing the bearer token.
          tokenRef:
            secretName: bitbucket-token
            key: token
        # authentication is required to access private repositories
      # Labels are not supported by Bitbucket Server, so filtering by label is not possible.
      # Filter PRs using the source branch name. (optional)
      filters:
      - branchMatch: ".*-argocd"

Template

This section defines the template to use when creating the Kubernetes resources. For the Pull Request Generator, we can use variables such as the branch name ({{.branch}}) and the numerical value of the PR ({{.number}}).

  template:
    metadata:
      name: 'octopub-{{.branch}}-{{.number}}'

Spec

The spec section of a template follows the same pattern as the standard ApplicationSet specification. The biggest differences are in the targetRevision and namespace fields. You are able to make use of the variables previously mentioned to create unique namespaces so that each PR has its own ephemeral environment. The targetRevision needs to match the PR’s commit; this is one of the few cases where you can deviate from the recommended GitOps practice of pointing at HEAD.

spec:
      source:
        repoURL: 'https://bitbucket.octopusdemos.app/scm/pul/pullrequestgenerator.git'
        targetRevision: '{{.head_sha}}'
        path: manifests
      project: "default"
      destination:
        server: https://kubernetes.default.svc
        namespace: 'octopub-{{.branch}}-{{.number}}'
      syncPolicy:
        automated: 
          prune: true
          selfHeal: true
        syncOptions:
        - Validate=false
        - CreateNamespace=true

Seeing it in action

If everything is configured correctly, whenever a new PR with the specified filter is created, you will see a new application created within Argo CD!

(Screenshots: the pull request in the Bitbucket UI and the resulting application in the Argo CD UI.)

Conclusion

The Pull Request Generator is a powerful tool that can help reduce bugs by providing a mechanism for testing PRs before they are merged. In this post, I broke down the Pull Request Generator to help you understand what it does and how to construct one.

The post Anatomy of a Pull Request Generator appeared first on Codefresh.

Top 30 Argo CD Anti-Patterns to Avoid When Adopting Gitops
https://codefresh.io/blog/argo-cd-anti-patterns-for-gitops/
Tue, 19 Aug 2025 16:36:21 +0000

The time has finally come! After the massive success of our Docker and Kubernetes guides, we are now ready to see several anti-patterns for Argo CD. Anti-patterns are questionable practices that people adopt because they seem like a good idea at first glance, but in the long run, they make processes more complicated than necessary. 

Several times, we have spoken with enthusiastic teams that recognize the benefits of GitOps and want to adopt Argo CD as quickly as possible. The initial adoption phase seems to go very smoothly, and more and more teams get onboarded to Argo CD. However, after a certain point, things start slowing down and developers start complaining about the new process.

Like several open source projects, Argo CD has several capabilities that can be abused if you don’t have the full picture in your mind. The end result almost always makes life really difficult for developers. 

So keep your developers happy and don’t fall into the same traps. Here is the full list of the anti-patterns we will see:

  1. Not understanding the declarative setup of Argo CD (Adopting Gitops)
  2. Creating Argo CD applications in a dynamic way (Adopting Gitops)
  3. Using Argo CD parameter overrides (Adopting Gitops)
  4. Adopting Argo CD without understanding Helm (Prerequisite knowledge)
  5. Adopting Argo CD without understanding Kustomize (Prerequisite knowledge)
  6. Assuming that developers need to know about Argo CD (Developer Experience)
  7. Grouping applications at the wrong abstraction level (Application Organization)
  8. Abusing the multi-source feature of Argo CD (Application Organization)
  9. Not splitting the different Git repositories (Application Organization)
  10. Disabling auto-sync and self-heal (Adopting Gitops)
  11. Abusing the targetRevision field (Adopting Gitops)
  12. Misunderstanding immutability for container/git tags and Helm charts (Adopting Gitops)
  13. Giving too much power (or no power at all) to developers (Developer Experience)
  14. Referencing dynamic information from Argo CD/Kubernetes manifests (Application Organization)
  15. Writing applications instead of Application Sets (Application Organization)
  16. Using Helm to package Application CRDs (Application Organization)
  17. Hardcoding Helm data inside Argo CD applications (Developer Experience)
  18. Hardcoding Kustomize data inside Argo CD applications (Developer Experience)
  19. Attempting to version Applications and Application Sets (Application Organization)
  20. Not understanding what changes are applied to a cluster (Developer Experience)
  21. Using ad-hoc clusters instead of cluster labels (Cluster management)
  22. Attempting to use a single application set for everything (Cluster management)
  23. Using pre-sync hooks for db migrations (Developer Experience)
  24. Mixing infrastructure apps with developer workloads (Developer Experience)
  25. Misusing Argo CD finalizers (Cluster management)
  26. Not understanding resource tracking (Cluster management)
  27. Creating “active-active” installations of Argo CD (Cluster management)
  28. Recreating Argo Rollouts with Argo CD and duct tape (Adopting Gitops)
  29. Recreating Argo Workflows with Argo CD, sync-waves and duct tape (Adopting Gitops)
  30. Abusing Argo CD as a full SDLC platform (Adopting Gitops)


The order of anti-patterns follows the timeline of an organization that starts with minimal Argo CD knowledge and slowly migrates several applications to the GitOps paradigm.

Anti-pattern 1 –  Not understanding the declarative setup of Argo CD

Following the GitOps principles means that Argo CD can take your Kubernetes manifests (or Helm charts or Kustomize overlays) from Git and sync them on your Kubernetes cluster. This process is well understood by teams and most people are familiar with saving Kubernetes manifests in Git.

It is important to note however that even this link between a cluster and a Git repository is a Kubernetes resource itself.

Argo CD introduces its own Custom Resource Definitions (CRDs) for several of its central concepts, such as applications and projects, and also reuses existing Kubernetes resources for clusters, secrets, etc.

These files should also be stored in Git. That is the whole point of following GitOps. It doesn’t make sense to use Git only for some files and not store the Applications themselves similarly.

We see several teams that use the Argo CD UI or CLI to create applications that are not stored anywhere and then have difficulties understanding what is deployed where or how to recreate their Argo CD configuration from scratch.

It should be noted that the “create new app” button in the Argo CD UI is only for experiments and quick tests. You should not use it at all in a production environment as the created application is not saved anywhere.

Everything that Argo CD needs should be stored in Git. Recreating an Argo CD instance should be a simple process with minimal steps:

  1. Create a new cluster with Terraform/Pulumi/Crossplane etc.
  2. Install Argo CD itself using Terraform/Autopilot/Codefresh etc.
  3. Point Argo CD to your ApplicationSets or root app-of-apps file
  4. Finished.

Recreating your Argo CD instance is a repeatable process that can be performed in less than 5 minutes (explained in anti-pattern 27).

If you want a comprehensive guide on how to organize your Kubernetes manifests in Git see our Application Set guide.
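To make the declarative approach concrete, here is a minimal sketch of an Application manifest that would live in Git instead of being created through the UI. All names, URLs, and paths are hypothetical placeholders:

```yaml
# Example of a declarative Argo CD Application stored in Git.
# All names, URLs, and paths are illustrative placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service                 # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/git/my-configs.git   # placeholder repo
    targetRevision: HEAD
    path: manifests/my-service
  destination:
    server: https://kubernetes.default.svc
    namespace: my-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```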

Anti-pattern 2 – Creating Argo CD applications in a dynamic way

A related anti-pattern occurs when organizations have an existing process for creating applications and when they adopt Argo CD they simply call the Argo CD CLI or API to pass the information they already have.

Essentially there is an existing database or application configuration somewhere else and a custom CLI or other imperative tool does the following:

  1. Extracts application configuration from the existing database
  2. Creates an Argo CD application or Kubernetes manifest on the fly
  3. Applies this file to an Argo CD instance without storing anything in Git.

You fall into this trap if the “official” way of creating Argo CD applications in your organization is something like this:

my-app-cli new-app-name | argocd app create -f - 

Or several times, envsubst is used like this

envsubst < my-app-template.yaml | kubectl apply -n argocd -f -

The end result is always the same. You have custom Argo CD applications that are not saved anywhere in Git. You lose all the main benefits of GitOps:

  • You don’t have a declarative file for what is deployed right now 
  • You don’t have a history of what was deployed in the past
  • Recreating the Argo CD instance is not a single-step process any more (see anti-pattern 27)

To overcome this anti-pattern you need to make sure that you use Argo CD the way GitOps works.

You either need to convert your existing database settings and make them read/write to Git or discard them completely and start using Git as the single source of truth for everything.

Then, all day 2 operations should be handled by Git. 

The same is true for updating existing applications. If you want to update the configuration of an application, the process should always be the same:

  1. You (or an external system) change a file in Git
  2. Argo CD notices the change in Git
  3. Argo CD syncs the changes to the cluster.

If you use the Argo CD API or the Kubernetes API to manually patch resources, then you are not following GitOps.

Updating applications in production with any of the following commands goes against the GitOps principles.

kubectl set image deployment/my-deployment my-container=nginx:1.27.0
kubectl patch deployment <deployment-name> [.....patch here…]

Use the Kubernetes API only for experiments and local tests, never for production upgrades (see also anti-pattern 10 – disabling auto-sync).
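As an example of the Git-driven flow described above, bumping an image version becomes a one-line change committed to Git rather than a kubectl command against the cluster. With Kustomize, that commit could touch nothing but the images field (the resource and image names are placeholders):

```yaml
# kustomization.yaml: the image tag is changed with a Git commit,
# which Argo CD then notices and syncs to the cluster.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml       # placeholder resource
images:
  - name: nginx
    newTag: 1.27.0        # edited in Git instead of running kubectl set image
```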

Anti-pattern 3 – Using Argo CD parameter overrides

Yet another way of updating an Argo CD application in a manner we do NOT recommend is the following:

argocd app set guestbook -p image=example/guestbook:v2.3

The guestbook application was just updated to version v2.3. Argo CD syncs the version and everything looks good. But where was this action saved? Nowhere.

This command uses the parameters feature of Argo CD, which allows you to override any Argo CD application with your own custom properties. Even the official documentation has a huge warning about not using this feature, as it goes against the GitOps principles.

Even if you save the parameter information in the Application manifest, you have now completely destroyed local testing for developers (see anti-patterns 6 and 17).

Anti-pattern 4 – Adopting Argo CD without understanding Helm

Helm is the package manager for Kubernetes. In its original form it offers several essential features in a single platform:

  • A package format for Kubernetes manifests
  • A repository specification for storing Helm packages in artifact managers
  • A templating system
  • A comprehensive CLI
  • A lifecycle definition (upgrade, install, rollback, test)

Argo CD renders all Helm charts using the helm template command and discards most other lifecycle features. Even though in theory this is a good thing, as Argo CD can replace the default Helm lifecycle (e.g. Argo CD comes with its own rollback command), Argo CD assumes that you already know how Helm templates work.

If you are adopting Argo CD and want to use Helm, then make sure that you know how Helm works on its own and how all your applications can be deployed in all your different environments WITHOUT Argo CD. Trying to learn Argo CD and Helm together at the same time is a recipe for failure.

At the very least you should know how to create Helm hierarchies of values

common-values.yaml
+-----all-prod-envs.yaml
   +----specific-prod-cluster.yaml

And how Helm merges the hierarchy with the correct overrides, with later -f files overriding earlier ones:

helm install ./my-chart/ --generate-name -f common.yaml -f more.yaml -f some-more.yaml

If you use Helm umbrella charts, understand how to override child values from the top chart.

In particular, pay attention to the following:

restaurant:
  menu: vegetarian

This can be a simple value setting for a Helm chart that sets the restaurant.menu value to “vegetarian”. It can also be an umbrella chart which has a dependency subchart called “restaurant” which itself has a property called “menu”. Understand why those two approaches are different and the advantages and disadvantages of each.
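In the umbrella-chart interpretation, the snippet above would correspond to a chart definition roughly like the following (names and versions are hypothetical):

```yaml
# Chart.yaml of a hypothetical umbrella chart with a "restaurant" subchart.
# In this setup, values under the "restaurant:" key in values.yaml are
# forwarded to the subchart, instead of being plain values of the top chart.
apiVersion: v2
name: food-court
version: 0.1.0
dependencies:
  - name: restaurant
    version: 1.2.3
    repository: https://charts.example.com
```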
We have already written a complete guide on how to use Argo CD with Helm value hierarchies.

Anti-pattern 5 – Adopting Argo CD without understanding Kustomize

This is the same anti-pattern as the previous one, but for Kustomize users. Kustomize is a powerful tool with several features of its own (bases, overlays, patches, and generators).

Argo CD can reuse all Kustomize features, provided that you have structured your Kustomize files correctly first.

Again make sure that your Kustomize files work on their own BEFORE bringing Argo CD into the picture. Well structured Kustomize applications are self-contained. Any developer should be able to use the kustomize (or kubectl) command to extract the configuration for an existing environment without the need for Argo CD.
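A minimal sketch of such a self-contained overlay (folder names are hypothetical), renderable with plain `kustomize build overlays/prod` and no Argo CD at all:

```yaml
# overlays/prod/kustomization.yaml - a hypothetical self-contained overlay.
# Any developer can render it without Argo CD.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # the shared base manifests
namePrefix: prod-
images:
  - name: docker.io/example/my-app
    newTag: "0.2"
```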
We presented a full example for Kustomize configurations in our Argo CD promotion guide.

Anti-pattern 6 – Assuming that developers need to know about Argo CD

This is the corollary to the previous two anti-patterns. Several teams mix Argo CD configuration data with Kubernetes configuration, making life extremely difficult for developers.

This is a big problem for developers as one of their most important tasks is to run an application locally both during development and when trying to pinpoint difficult bugs in isolation. Creating Argo CD configurations and coupling them with Kubernetes manifests prevents them from understanding how an application runs independently.

We will see more specific anti-patterns later that also contribute to this problem but it is best to know about this trap in advance when you start working with Argo CD.

Remember also that developers don’t care about Kubernetes manifests. They only care about source code features. It is one thing to ask them to learn the basics (i.e. Helm values) and a completely different thing to require Argo CD knowledge just to be able to recreate the configuration of an existing environment.

When you design your Argo CD repository, you should always consider a developer persona who is an expert on Kubernetes but knows nothing about Argo CD. Can they recreate any configuration of any application on their laptop without using Argo CD? If the answer is no, it means that you are the victim of this anti-pattern. 

Find out where you have hardcoded Argo CD configurations with Kubernetes configurations and remove the tight coupling. For specific advice see anti-patterns 17 for Helm and 18 for Kustomize.

We have seen how to split Kubernetes configuration from Argo CD manifests in our Application Set guide

Anti-pattern 7 – Grouping applications at the wrong abstraction level

As we explained in the first anti-pattern (store everything in Git), an Argo CD application is just a link between a Git repository and a Kubernetes cluster. It is not a deployment artifact or a packaging format.

We have seen teams that abuse an Argo CD application as a generic grouping mechanism, using it for microservices or even completely unrelated applications.

You need to spend some time understanding what your applications do and how tightly coupled they are. If you have a set of “micro-services” that are always deployed together and upgraded together, you might want to use an umbrella chart for them. 

Argo CD applications should generally model something that requires individual deployments and updates. If you have several Argo CD applications that you always want to be managed together (but still want to deploy and update individually), then a better choice might be the app-of-apps pattern.

If several applications need to be deployed to different or similar configurations, then Application Sets are the proper recommendation.

So any time you want to group several applications, ask the following questions:

  1. Are those applications always deployed and upgraded as a single unit?
  2. Are those applications related in a business or technical manner?
  3. Do you want to use different configurations for different clusters for these applications?
  4. Is this combination of applications always the same? Do you sometimes wish to deploy a subset of them or a superset?
  5. Are these applications managed by a single team or multiple teams?

If you are unsure where to start, looking at Application Sets is always the best choice. Check also anti-pattern 19 (attempting to version application CRDs).

Anti-pattern 8 – Abusing the multi-source feature of Argo CD

This is a close relative of the previous anti-pattern. The multi-source feature of Argo CD was one of the most requested features in the project’s history. The feature was created to solve a single scenario:

  1. You wish to use an external Helm chart that is not hosted by your organization
  2. You want to use your own Helm values and still store them in your own Git repository
  3. You need a way to instruct Argo CD to combine the external Helm chart with your own values.

The feature is now implemented in Argo CD, and you can finally do the following:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  sources:
  - repoURL: 'https://prometheus-community.github.io/helm-charts'
    chart: prometheus
    targetRevision: 15.7.1
    helm:
      valueFiles:
      - $values/charts/prometheus/values.yaml
  - repoURL: 'https://git.example.com/org/value-files.git'
    targetRevision: dev
    ref: values

Unfortunately, Argo CD does not limit the number of items you can place in the “sources” array. Several people have misunderstood this, abusing the feature to group multiple (often unrelated) applications.

Don’t fall into this trap. The multi-source feature was never designed to work this way and several standard Argo CD capabilities will either be broken or not work at all if you use multi-sources as a generic application grouping mechanism.
The correct way to group applications is with Application Sets. If you want to use multi-source applications with Helm hierarchies, we have also written an extensive guide.

Anti-pattern 9 – Not splitting the different Git repositories

If you look at any Kubernetes application from a high level, it consists of 3 distinct types of files:

  1. The source code
  2. Kubernetes manifests (deployment, service, ingress etc)
  3. Argo CD application manifests (or application sets)

We have already explained that as a best practice the source code should be separate from the manifests. If you are in a big organization it might also make sense to split Kubernetes manifests from Argo CD manifests.

If you keep all manifests in a single Git repository you will have issues with both the CI (Continuous Integration) and CD (Continuous Deployment) phases. Your CI system will try to auto-build application code when a manifest changes, and Argo CD will try to sync applications when a developer changes the source code.

There are several workarounds for these scenarios, but why try to solve a problem that shouldn’t exist in the first place?

We have also seen several variations of the same pattern that make the deployment process even more complicated. A classic scenario to avoid is the following:

  1. The source code of the application is in Git repository A
  2. The Helm manifest is in Git repository B
  3. Only the Helm values for the different environments are stored in Git repository A

The assumption here is that Helm values are close to the source code that developers need to change. In reality, it never makes sense to have access to Helm values without also having access to the Helm chart that uses them. Either assume that developers know about Helm and show them everything, or assume that they don’t care about Kubernetes at all and offer them a different abstraction that hides Argo CD from them completely.

Yet another problem is mixing Kubernetes manifests and Argo CD manifests in the same repository but instead of using different folders, you hardcode Kubernetes information into Argo CD information. This is described in detail in anti-patterns 17 and 18.

Anti-pattern 10 – Disabling auto-sync and self-heal

Migrating to Argo CD is often a big undertaking, especially for organizations that have invested a significant amount of effort in traditional pipelines. After all, it can be argued that Argo CD simply replicates what is already possible with an existing Continuous Integration system:

  1. A change happens in Git in a manifest
  2. A separate process picks up the Git event
  3. The process uses kubectl (or a custom script) to apply the changes in the cluster.

This could not be further from the truth as Argo CD also works the other way around. It monitors changes in the cluster and compares them against what is in Git. Then you can make a choice and either review those changes or discard them altogether.

This means that Argo CD solves the configuration drift problem once and for all, something that is not possible with traditional CI solutions.

But this advantage only exists if you let Argo CD do its job. Some organizations disable the auto-sync/self-heal behavior in Argo CD in an effort to “lock down” or fully control production systems.

This is a bad choice because production systems are exactly the kind of systems where you want to avoid configuration drift. Manual changes that happen in production (during hotfixes or other debugging sessions) are one of the biggest factors affecting failed deployments.

We recommend that you have auto-sync/self-heal for all your systems both production and non-production. 

Locking down a system should not be done in Argo CD itself, but enforced on the cluster and Git level. The most obvious solution is to reject any direct commits in a Git repository that controls your production system and only allow developers to create Pull Requests which must pass several manual and automated checks before landing in the mainline branch that Argo CD monitors.

If you disable auto-sync/self-heal you are missing the number one advantage of moving to Argo CD from traditional pipelines (eliminating configuration drift).

Anti-pattern 11 – Abusing the targetRevision field

Promoting applications between environments is one of the biggest challenges for teams adopting Argo CD. People see the targetRevision field in the Argo CD application and assume it is a promotion mechanism.
The first issue is when teams use semantic version ranges to force Argo CD to update an application automatically to a newer version. The second is when they continuously update the targetRevision field to different branch names, or attempt to implement “preview environments” by pointing an Argo CD application at a temporary developer/feature branch.

We have written a complete guide about the issues of abusing targetRevision.

In general, we recommend you always use HEAD in the targetRevision field which is also the default value.

Anti-pattern 12 – Misunderstanding immutability for container/git tags and Helm charts

This is not an anti-pattern with Argo CD per se, but it is closely related to the targetRevision choices, as explained in the previous anti-pattern.

We have seen several cases of people adopting Argo CD without first understanding the foundations (Helm, container registries, git tags). Several times, people use a specific git tag or Helm version in Argo CD without realizing that:

  1. Git tags seem to be immutable, but can be deleted and recreated with the same name
  2. Helm chart versions are mutable. This is how Helm was designed
  3. Container tags are mutable by default.

Let’s take these points one by one.

Container tags are mutable by default. You can push an image called my-app:v1.2, change something, and push a different image under the exact same tag. So just because you see the same container tag doesn’t mean it is actually the same application. Some registries can be configured to reject tag overwrites, but this is not always the default setting.

Helm chart versions are also mutable. You can change the contents of a Chart version and use the exact same version. Again, this is how Helm charts are created. 

In fact, Helm offers an additional property, appVersion, which can store the “application” version. So a Helm chart effectively has 3 “version” fields:

  • The container image (mutable by default)
  • The appVersion field (mutable)
  • The Chart version (mutable)

So unless you control how developers work with code and manifests and also configure your Helm chart repository correctly, you don’t really know if a Helm chart version contains the same thing as another chart with the same version.
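The chart-level fields above map onto a chart definition roughly like this (a hypothetical example):

```yaml
# Chart.yaml of a hypothetical chart. Both fields below are mutable in most
# Helm repositories; the container tag lives separately in values.yaml.
apiVersion: v2
name: my-app
version: 1.4.0      # the chart version
appVersion: "2.3.1" # the "application" version (informational only)
```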

Git tags can also be overwritten. You can see this very easily with any GitHub repo with default settings.

git tag -a v1.2 -m "first tag"
git push --tags
echo "A change" >> README.md
git commit -am "A change"
git tag -d v1.2
git push origin :refs/tags/v1.2
git tag -a v1.2 -m "first tag"
git push --tags

Here we just pushed the same tag (v1.2) twice but with different contents. So if you were using this tag in the targetRevision field of Argo CD, you now have the “same” application without actually having the same contents.

The end result is that using tags and Helm chart versions in Argo CD doesn’t really restrict your developers unless you actively set up the rest of the ecosystem (Git repositories, Helm repos and binary artifact managers) to also work with immutable data.

Never assume you have a “locked-down” Argo CD system when the rest of the ecosystem allows developers and operators to create stuff with the same container/Helm/Git version.

Anti-pattern 13 – Giving too much power (or no power at all) to developers

When adopting Argo CD you need to make a decision about how much power and exposure you want to give to developers. On one end of the spectrum we see installations where developers have full access to the Argo CD UI and can sync/deploy their applications at will. On the other end we see locked down installations where developers are given very little power or Argo CD is completely hidden from them.

It is important to understand that despite these two extremes, there are several choices in the middle. First, you can configure the Argo CD access (and UI) to show only applications specific to each team. You can even do advanced scenarios where you show applications from other teams but in a read-only mode.

In our Argo CD RBAC guide, we have explained how the RBAC for Argo CD works and how you can show content specific only to a developer team. 

There is no right or wrong answer here, but you need to balance the flexibility versus the security that you want to offer to developers. 

On a related note, we have already explained that developers don’t really care about Argo CD manifests, and they shouldn’t be forced to install Argo CD for local testing (see anti-pattern 6).

So a recommended workflow would be the following:

  1. Developers can test and deploy their applications locally without using Argo CD at all
  2. When they create a feature branch, it should be converted to a preview/temporary environment using the Pull Request generator, without any human intervention
  3. Once the feature is ready, it will be deployed to production simply by merging the Pull request
  4. A system designed for promotion should propagate the changes to the next environment (check anti-pattern 30).

Ideally developers should only come in contact with Argo CD in the last phase and only in the case of a failure. In the happy path scenario where everything works as planned, developers shouldn’t have to debug anything with Argo CD.
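Step 2 of the workflow above can be sketched with the Pull Request generator. Repository names, paths, and namespaces here are hypothetical:

```yaml
# Sketch of a Pull Request generator: every open PR in the (hypothetical)
# repository gets a temporary preview environment, with no human involved.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-environments
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: example-org
          repo: example-repo
        requeueAfterSeconds: 300   # how often to poll for new/closed PRs
  template:
    metadata:
      name: 'preview-{{branch}}-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/example-repo.git
        targetRevision: '{{head_sha}}'   # deploy the PR's latest commit
        path: manifests
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{number}}'
```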

Anti-pattern 14 – Referencing dynamic information from Argo CD/ Kubernetes manifests

This is a more specialized anti-pattern related to number 2 (creating applications in a dynamic way). The second GitOps principle explains that the desired state of your system should be immutable, versionable and auditable.

This is only true if you store EVERYTHING that the application needs in Git (or your chosen storage method). In the case of Kubernetes/Argo CD manifests, this means that all values used should be static and known in advance.

It is ok if you want to post-process your manifests in some way as long as this happens in a repeatable manner OR you also save the result itself in Git.

The problem starts when your configuration is not known in advance but requires real-time access to something else. 

The best example to illustrate this is the Helm lookup function, which queries the live cluster at render time and injects a value that is not known in advance. This is problematic with Argo CD because having access to just the application manifests is no longer enough to run the application.

Like anti-patterns 17 and 20 this also makes lives difficult for developers as they cannot run the application locally anymore (anti-pattern 6).

Note that the only exception to this rule is secrets. Even though you can store encrypted secrets in Git, it is also ok to reference them from an external source.
But make sure to understand the difference between referencing secrets from manifests versus injecting secrets into manifests.
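As an illustration of referencing (rather than injecting) a secret, a manifest can point at a Secret object that is created outside of Git; the names below are hypothetical:

```yaml
# Referencing a secret from a manifest: the manifest stays static and
# auditable, while the secret value itself never enters Git.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: docker.io/example/my-app:1.0
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials  # Secret created out-of-band (e.g. by a secrets operator)
              key: password
```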

Anti-pattern 15 – Writing applications instead of Application Sets

The Application CRD is the main entity in Argo CD that links a cluster and a Git repository. If you have a small number of applications or work in a homelab environment, it is OK to write these files by hand.

But for any production installation of Argo CD (used in a company) you generally wouldn’t need to write Application files by hand. In fact you should not even deal with Application files at all.

The recommendation is to use Application Sets directly.

In a big organization you will rarely have to deal with a single application on its own. Almost always you want to work with a group of applications. Some examples are:

  • A set of applications that go to the same cluster
  • A set of applications that are managed by the same team
  • A set of applications that share a configuration
  • A set of applications that should be deployed/updated as a unit.

Application Sets implement this grouping and also automate the tedious YAML needed. For example if you have 4 applications and 10 clusters you don’t really want to create 40 YAML files by hand.

  1. You create a single cluster generator that iterates over your clusters
  2. You create 4 folders for the 4 applications
  3. The Application set runs and creates automatically all 40 combinations for you.
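The three steps above can be sketched with a matrix generator that combines a cluster generator with a Git directory generator (repository and paths are hypothetical):

```yaml
# Sketch of the cluster x application fan-out: a matrix generator pairs every
# registered cluster with every application folder, producing all 40
# Applications automatically.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: all-apps
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - clusters: {}        # one element per cluster registered in Argo CD
          - git:
              repoURL: https://github.com/example-org/apps.git
              revision: HEAD
              directories:
                - path: apps/*  # one element per application folder
  template:
    metadata:
      name: '{{path.basename}}-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/apps.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: '{{server}}'
        namespace: '{{path.basename}}'
```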

Some people resist using Application Sets because they think it is yet another abstraction that hides their real Application manifests.

This used to be true, but in the latest Argo CD releases the CLI allows you to render an application set and see exactly what applications will be created. You can run this command either manually or in a CI pipeline (when a pull request is created) so that you have instant visibility on what will change.

In general though, as we will see later, all Application Set files should be created only once. They are not a deployment format (anti-pattern 19), and you shouldn’t have to edit Application Sets for simple operations.
We have written a dedicated guide on how to use Application Sets with Argo CD.

Anti-pattern 16 – Using Helm to package Applications instead of Application Sets

This anti-pattern is the so-called “Helm sandwich”. This is the case where:

  1. A set of Kubernetes manifests is packaged as a Helm chart
  2. The Helm chart is referenced from an Argo CD application Manifest
  3. The Argo CD application manifest itself is packaged in another Helm chart

Essentially, there are two instances of Helm templating on two different levels.

Teams that adopt this anti-pattern are familiar with Helm and assume that if it works great for plain Kubernetes manifests, it would also work great for Argo CD application manifests. 

So why is this approach an anti-pattern? Using Helm for applications is not an issue on its own, but it is the start for several other anti-patterns:

  1. Because Helm charts have a version people try to version Applications -> Anti-pattern 19
  2. Chart version numbers often result in abusing targetRevision for promotions -> Anti-pattern 11
  3. If you use Helm everywhere it is super easy to hard-code Helm values in Application manifests -> Anti-pattern 17
  4. People miss all the Argo CD specific features of application sets -> Anti-pattern 15
  5. Distributing applications to different clusters happens with snowflake servers instead of the cluster generator -> Anti-pattern 21

The biggest problem, however, is that it again completely ruins the developer experience: you now have two levels of Helm values, which is a recipe for disaster.

Even speaking just for operators/admins, packaging Applications in a Helm chart creates an extra level of indirection that is not only unnecessary but actively makes your life harder, as Argo CD was never designed for this “Helm sandwich”.

Our recommendation is obviously to use Application Sets for configuring Applications, as already explained in the previous anti-pattern.


Application Sets are the preferred way to generate Applications, and they offer powerful templating of their own (including Go templates). Basically, if you can template it with Helm, you should be able to template it with Application Sets.

Using the “Helm sandwich” pattern makes the process more complex for everybody involved in the software lifecycle.
For more information, see our Application Set guide.

Anti-pattern 17 – Hardcoding Helm data inside Argo CD applications

If you follow the advice we outlined in Antipattern 4 you should have a clear separation between your Helm charts and your Argo CD manifests.

The Helm charts can be used independently (even by developers) and contain all application settings. The Argo CD application manifest simply defines where this application runs, and only operators need to change this file.

Those two kinds of files also have a different lifecycle

  • Helm values are expected to change all the time (even by developers)
  • Argo CD application manifests are created once and never changed again.

With Argo CD it is possible to hardcode Helm information inside an Application CRD. But just because you can do this, doesn’t mean it is a good idea to use this feature.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-helm-override
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: HEAD
    path: my-chart
    helm:
      # DONT DO THIS
      parameters:
      - name: "my-example-setting-1"
        value: my-value1
      - name: "my-example-setting-2"
        value: "my-value2"
        forceString: true # ensures that value is treated as a string

      # DONT DO THIS
      values: |
        ingress:
          enabled: true
          path: /
          hosts:
            - mydomain.example.com

      # DONT DO THIS
      valuesObject:
        image:
          repository: docker.io/example/my-app
          tag: 0.1
          pullPolicy: IfNotPresent
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app

The big problem with this file is that you are mixing 2 different kinds of information with 2 different lifecycles. The application CRD is something that is interesting mostly to operators/administrators while Helm information is interesting to developers and is also expected to change a lot. 

By mixing this information you make manifests harder to understand for everybody.

Now you have Helm information in two places (the Helm values files and the helm property inside the Argo CD Application manifest). It is very hard to understand which settings exist where and how to audit deployment history. Argo CD application manifests must now also change all the time, especially if they define container images.

This manifest mixing is also the root of several other anti-patterns such as

  • Using Applications as a unit of work (anti-pattern 19)
  • Abusing the targetRevision field for promotions (anti-pattern 11)
  • Not understanding how Helm hierarchy works (anti-pattern 4)
  • Assuming that developers need Argo CD (anti-pattern 6)

The last point is especially important for developers. If you follow this practice, you have completely destroyed local testing for developers as they cannot run the application on its own anymore.

Even though we speak only about developers and operators here, this approach causes difficulties in several other scenarios:

  1. Security teams will have a hard time understanding the settings for each application
  2. Your CI system cannot upgrade images just on Kubernetes manifests anymore. It also needs to look at Argo CD manifests and check if image definitions exist there as well
  3. It couples your Applications to specific Argo CD features

The correct solution is, of course, to store all Helm information in Helm values:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-helm-override
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: HEAD
    path: my-chart
    helm:
      ## DO THIS (values in Git on their own)
      valueFiles:
      - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app

Now there is a clear separation of concern between developers and operators.

People who want to know an application’s settings can look at the Helm values, while people who want to know where applications are deployed can look at the Argo CD manifests. It is also possible to run a Helm application without Argo CD at all.

For more information about the different types of manifests and how to split them, see our Application Set guide.

This anti-pattern is even worse if coupled with the previous anti-pattern (the Helm sandwich).

Now you have 3 places where configuration settings can be stored:

  1. The Helm values of the chart that gets deployed
  2. The helm property inside the Application Manifest that references the Helm chart
  3. The Helm template that renders the Application manifest before it is passed to Argo CD

The result is a nightmare for anybody who wants to understand how an application gets deployed.

Anti-pattern 18 – Hardcoding Kustomize data inside Argo CD applications

This is the same anti-pattern as the previous one but for Kustomize. Again for convenience Argo CD allows you to hardcode Kustomize information inside an Application YAML:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-kustomize-override
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: HEAD
    path: my-app

    # DONT DO THIS
    kustomize:
      namePrefix: prod-
      images:
      - docker.io/example/my-app:0.2
      namespace: custom-namespace

  destination:
    server: https://kubernetes.default.svc
    namespace: my-app

It is the same problem as before where you are mixing different concerns in a single file. Kustomize information should only exist in Kustomize overlays:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-proper-kustomize-app
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: HEAD
    ## DO THIS. Save all values in the Kustomize overlay itself
    path: my-app/overlays/prod

  destination:
    server: https://kubernetes.default.svc
    namespace: my-app

As explained before this helps developers during local testing. A developer can simply run “kustomize build my-app/overlays/prod” and get the full configuration of how my-app runs in production. No knowledge of Argo CD is required and no local installation of Argo CD is needed.

Developers can define how an application runs (its settings), while operators can decide where (which cluster) the application is deployed.

At the same time, several supporting functions are very easy:

  • Git history of the overlays is the same as the deployment history
  • There is only a single source of truth for configuration (the overlays)
  • Developers don’t need to know how to use Argo CD at all
  • It is very easy to diff settings between environments.

A detailed example with Kustomize overlays is available in our GitOps promotion guide.

Anti-pattern 19 – Attempting to version and promote Applications/Application Sets

An Argo CD application is just a link between a cluster and a Git repository. There is NO version field in the Application CRD. An application manifest is neither a packaging format nor a deployment artifact. The same is also true for Application Sets. You never deploy application sets. You just use them to auto-generate application manifests. Application Sets have no version on their own.

The lack of a version field is not a big problem because the expectation for both Applications and Application Sets is that you create them once and then never update them again. So, a version field is unnecessary as all Argo CD Applications are considered static.

However, we see several teams that try to use Applications as the unit of work (see anti-pattern 7) or continuously update those files in the targetRevision field (see anti-pattern 11). 

At this point teams try to create their own versioning on top of the Argo CD manifests and of course they fail because Argo CD was never designed this way.

The same is true for promotions. You cannot really “promote” an Argo CD application from one cluster to the next. It doesn’t work this way. You can only promote values that are referenced from one application to the next (essentially copying them).

The way promotions work in Argo CD is the following:

  1. There is an Argo CD application manifest in QA that points to a Helm chart or Kustomize overlay
  2. There is an Argo CD application manifest in Staging that points to different Helm values or Kustomize overlays
  3. When it is time to promote you copy the Helm values or Kustomize overlay from the QA files to the Staging files. 
  4. The Argo CD application manifests are not affected in any way. They are exactly the same as they were before.
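A minimal sketch of step 3, assuming a hypothetical folder-per-environment layout; note that only the values are copied, never the Application manifests:

```shell
# Promotion = copying environment configuration, not Argo CD manifests.
cd "$(mktemp -d)"
mkdir -p envs/qa envs/staging
echo 'image: {tag: "2.0"}' > envs/qa/values.yaml       # QA already runs 2.0
echo 'image: {tag: "1.9"}' > envs/staging/values.yaml  # staging is behind

cp envs/qa/values.yaml envs/staging/values.yaml        # the "promotion"
# A git commit/push would follow; Argo CD syncs staging while its
# Application manifest stays untouched.
```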

See also the related anti-patterns of hardcoding Helm (anti-pattern 17) or Kustomize data (anti-pattern 18)  inside applications.

If you find yourself constantly updating Argo CD application manifests (or Application Sets) you have fallen into this trap. Your Argo CD manifests should be created only once and never touched again. No process is simpler than not needing a process at all (to update Application manifests).

On a related note, we have created GitOps Cloud to solve this problem with promotions and allow you to promote applications from one cluster to another.

Anti-pattern 20 – Not understanding what changes are applied to a cluster

One of the main benefits of using Argo CD is that all your Git tools work out of the box and you can reuse your code review process for your Kubernetes manifests.

The most basic capability of storing anything in GitHub is creating a Pull Request before merging any changes. This allows humans to review what will change and also run any automated tools to verify and validate the changes.

Unfortunately, this review process will not really work on files such as Helm charts, Application Sets, or Kustomize overlays. Let’s assume you need to review the following change:

You need to run Helm manually in your head to understand what is happening here. No human wants to do that. One approach would be to pre-render all your manifests so that reviews happen only on the final content. There are, however, several alternatives you can consider, such as having your CI system render the manifests on the fly and show you what will actually happen.

Here is the exact same change as before, but this time on the final rendered chart.

We have written a full guide on how to preview and diff your Kubernetes manifests with several other approaches.
Specifically for Argo CD you should also check https://github.com/dag-andersen/argocd-diff-preview and also understand that you can use the Argo CD CLI to render your Application Sets to Application CRDs.

Anti-pattern 21 – Using ad-hoc clusters instead of cluster labels

If you have a large number of clusters that you wish to manage with Argo CD, your first question is always whether to use a single Argo CD instance or multiple ones.

Once you answer this question the next step is to understand how you can distribute different applications to different clusters. The answer for this question is Application Sets.

Unfortunately, we see a lot of teams that don’t understand how the cluster generator works and instead try to create ad-hoc cluster configurations (the pets vs. cattle philosophy).

Examples to avoid often appear like this:

## DO NOT DO THIS
- merge:
    mergeKeys:
      - app
    generators:
      - list:
          elements:
            - app: external-dns
              appPath: infra/helm-charts/external-dns
              namespace: dns
            - app: argocd
              appPath: infra/helm-charts/argocd
              namespace: argocd
            - app: external-secrets
              appPath: infra/helm-charts/external-secrets
              namespace: external-secrets
            - app: kyverno
              appPath: infra/helm-charts/kyverno
              namespace: kyverno
      - list:
          elements:
            - app: external-dns
              enabled: "true"
            - app: argocd
              enabled: "true"
            - app: external-secrets
              enabled: "false"
            - app: kyverno
              enabled: "true"
  selector:
    matchLabels:
      enabled: "true"

Or this

## DO NOT DO THIS
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-staging-cluster
  namespace: argocd
spec:
  goTemplate: true
  generators:
    - matrix:
        generators:
          - git:
              repoURL: '<url>'
              revision: HEAD
              files:
                - path: customConfig/base.yaml
                - path: customConfig/{{ .Values.domainId }}/*.yaml
          - list:
              elements:
                - appName: 'auth'
                - appName: 'search'
                - appName: 'billing'
                - appName: 'payments'

These kinds of configurations create snowflake servers that need constant maintenance. If you have fallen into this trap, try to understand how much time you will need to spend in the following scenarios:

  • Creating a brand new server
  • Migrating an application from one server to another
  • Applying a global setting to all your servers
  • Making a different configuration change for a subset of your servers.

This kind of cluster management is even more problematic for developers, as simply understanding what applications are deployed where is not easy (already explained in anti-pattern 6).

Our recommendation is to create a production-ready setup using cluster labels. We have described all the details in a dedicated guide for the cluster generator and Argo CD application sets.
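As a minimal sketch of the label-based approach (names, labels, and repository URL are hypothetical), a cluster generator picks up every cluster registered in Argo CD with a matching label, so adding a new cluster requires no changes to the application set:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: external-dns
  namespace: argocd
spec:
  generators:
    ## Matches every cluster registered in Argo CD with this label.
    - clusters:
        selector:
          matchLabels:
            environment: staging
  template:
    metadata:
      name: 'external-dns-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/example-repo.git
        targetRevision: HEAD
        path: infra/helm-charts/external-dns
      destination:
        server: '{{server}}'
        namespace: dns
```

Registering a new cluster with the `environment: staging` label is then enough for it to receive the application automatically.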

Anti-pattern 22 – Attempting to use a single application set for everything

A related anti-pattern is when teams discover application sets and, for some unknown reason, try to cram all their applications into a single application set. This often results in a complex mix of different generators that is hard to understand and hard to debug.

The recommendation is to have many different application sets in your Argo CD setup. Ideally, you should have different application sets per “type” of application. This type can be anything that makes sense to you. For example:

  • An application set for all staging apps
  • An application set for all AWS clusters
  • An application set for all infra apps
  • An application set for the billing team
  • An application set for the payments team

You can slice and dice your applications in different dimensions. But in the end, you will have many application sets, and any time you make a change, you should instantly know where it should happen.

  • A new requirement is that all AWS clusters need to get sealed secrets -> Change the AWS application set.
  • A new requirement is that the billing team add a new microservice to their setup -> Change the “billing” application set.
  • All new clusters must upgrade Prometheus -> Change the “common” application set.

There is no technical limitation on the number of application sets you can have on a single Argo CD installation. So, spend some time organizing your application sets accordingly.

As always, our Argo CD application guide is the best starting point. 

Anti-pattern 23 – Using Pre-sync hooks for db migrations

Argo CD phases/waves allow you to define the order of resources synced within a single Argo CD application. The pre-sync phase, in particular, can be used to check deployment requirements or perform other checks that need to happen before the main sync phase.

We often see organizations attempting to use pre-sync hooks for database migrations. The assumption here is that the database schema must be updated just before the new application version. Unfortunately, most organizations use legacy DB migration tools, and almost always, they package a DB migration CLI tool in the pre-sync phase, which is not the proper approach for Kubernetes applications.
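The pattern usually looks like the following sketch (the migrator image is hypothetical): a Job annotated as a PreSync hook that wraps a CLI tool Argo CD knows nothing about:

```yaml
## DO NOT DO THIS
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
  annotations:
    ## Runs before the main sync phase on every sync event.
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: example.com/my-db-migrator:1.0.0  # hypothetical CLI image
          args: ["up"]
```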

Using pre-sync hooks for database migrations has several issues. The most important one is that the DB CLI tool is just a black box for Argo CD. The CLI runs and never reports back to Argo CD what really happened with the database. This can leave the DB in an inconsistent state where the main sync phase will always fail.

The second problem is that Argo CD is based on continuous reconciliation, where an application might be synced for several reasons and in different time frames. Unfortunately, traditional DB CLI tools are rarely created with this scenario in mind. Most of the time, they assume they run only once, inside a typical CI pipeline.

At this point, Argo CD users are looking for several hacks to force the pre-sync hook to run ONLY in the initial application deployment and not in any subsequent sync events, as this either slows down the deployment or breaks the database completely.

The correct approach is to use a database migration operator built specifically for Kubernetes. We have written a full guide using the AtlasGo DB operator.

Anti-pattern 24 – Mixing Infrastructure apps with developer workloads

We have seen in anti-pattern 7 several ways to group applications with GitOps (applicationsets, apps-of-apps, Helm umbrella charts).

These grouping methods should always be used for the same types of applications. They should group either infra applications (core-dns, nginx, prometheus) OR the applications that developers create. 

You should never mix applications of different types as you force developers to deal with infrastructure errors.

Infra before apps

When you create a new cluster and hand it over to developers, it should already have everything they need. The respective application sets should have already installed any infrastructure applications.

Mixing infrastructure and developer applications might be easier for you (to directly control the deployment order), but it always results in a bad user experience for developers.

If developers have access to the Argo CD UI and see a deployment error, they should know immediately that it is something they can fix themselves.

Anti-pattern 25 – Misusing Argo CD finalizers

Argo CD finalizers allow you to define what happens when an Argo CD application (or application set) is removed. You must understand how finalizers work and the impact of adding/removing a finalizer from a resource.

Several teams have accidentally deleted one or more Argo CD applications because they never understood how finalizers work. Other times several resources are “stuck” and never recreated because of a misconfiguration with finalizers.
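For reference, the most common finalizer is the cascading-deletion one. With it present, deleting the Application also deletes every Kubernetes resource it manages; without it, only the Application record is removed and the workloads keep running:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
  finalizers:
    ## Deleting this Application will also delete (cascade) all
    ## the Kubernetes resources it manages. Remove the finalizer
    ## if you want the workloads to survive the deletion.
    - resources-finalizer.argocd.argoproj.io
```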

Finalizers are also very useful when you want to migrate applications from one Argo CD instance to another.
We have written a comprehensive guide about Argo CD finalizers and how to use them.

Anti-pattern 26 – Not understanding resource tracking

This anti-pattern is related to the previous one. First of all, it is important to understand how Argo CD tracks and “adopts” Kubernetes resources. You can have Kubernetes resources that are not managed by Argo CD, or Argo CD resources that are “owned” by other Kubernetes controllers.

It is vital to know that the relationship between a Kubernetes resource and the Argo CD application that owns it is not always 1-1. You can have

  1. Argo CD applications that no longer contain any Kubernetes resources
  2. Kubernetes resources that are no longer owned by an Argo CD application

The second scenario is achieved with finalizers. This pattern is very useful when you want to migrate applications from one Argo CD instance to another without downtime. The full process is the following:

  1. Argo CD instance A owns all Kubernetes resources
  2. You remove all finalizers for all applications (and application Sets)
  3. You delete all Argo CD applications
  4. The Kubernetes resources are still running just fine. There is no downtime
  5. You apply the same Argo CD applications to Argo CD instance B
  6. Argo CD instance B will adopt the same Kubernetes resources as before (with no downtime)

You can try this exact scenario of moving Argo CD applications between instances without downtime in our Gitops Certification (level 3) course.

The exact same process can be used to upgrade an existing Argo CD instance to a new version in the safest way possible.

Anti-pattern 27 – Creating “active-active” installations of Argo CD

This is the corollary to the previous two anti-patterns. We see several teams that try to create “active-active” installations of Argo CD with the following requirement:

  1. The main Argo CD instance is controlling all applications and deployments
  2. There is a secondary Argo CD instance that is also pointed to the same cluster
  3. If the main Argo CD instance “fails”, the secondary instance “jumps in”
  4. When the main Argo CD instance is restored it “adopts” again all applications.

These teams are disappointed to learn that Argo CD doesn’t support this and even the centralized mode is for controlling other Kubernetes clusters and not other Argo CD instances.

This requirement doesn’t really make sense for Argo CD and teams that look for this “active-active” configuration haven’t really understood resource tracking.

First of all it is important to understand that Argo CD only deploys applications. It doesn’t really control them. If Argo CD fails, new deployments will stop but existing applications will continue to work just fine.  And even if those fail for some reason, their pods will be rescheduled/restarted by the Kubernetes cluster (even if Argo CD is no longer operational).

The disaster recovery scenario for Argo CD is straightforward if your team has everything in Git (see anti-pattern 1). You can launch a second Argo CD instance and point it to the same application manifests. Argo CD will then adopt the existing Kubernetes resources.

This is even possible if the central Argo CD instance has issues but still runs, as you can use finalizers (as explained in the previous section) to migrate applications to the second Argo CD instance without downtime.

Keeping a second Argo CD instance in “active-active” mode only wastes resources. 

Anti-pattern 28 – Recreating Argo Rollouts with Argo CD and duct tape

Bad deployments always happen regardless of whether you are using Argo CD or not. That is a fact for any software team. So how do failed deployments work if you have adopted GitOps?

The simplest way to fix a failed deployment is to roll “forward”. Make a new release or fix the Kubernetes manifests and once you commit, Argo CD will deploy the new changes and hopefully bring back the application to a good state.

Argo CD also includes a “rollback” command which simply points the application back to a previous Git hash. This sounds great in theory but it has 2 major problems:

  1. It works only if auto-sync is disabled (anti-pattern 10)
  2. It breaks GitOps as your cluster doesn’t represent what is in Git anymore

At this point, teams start creating custom solutions to overcome these limitations. The most common approaches we see are:

  1. Trying to detect a failed deployment, and then disable auto-sync on the fly while rolling back
  2. Using notifications with external metric providers who will automatically try to revert a commit on their own, so Argo CD will sync as usual to the previous version

These custom solutions are always clunky and create more problems than they solve. You know that your team is falling into this trap if you hear people always asking the question, “How can I disable auto-sync temporarily?”

The recommended solution is to use Argo Rollouts.

Argo Rollouts is a progressive delivery controller designed for this exact scenario: automated rollbacks when things go wrong. It also comes with its own resource (Analysis) that allows you to look at your metrics during a deployment and roll back without any human intervention.
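As a rough sketch (assuming an AnalysisTemplate named success-rate that queries your metrics provider), a canary strategy with an analysis step looks like this; if the analysis fails, Argo Rollouts aborts the rollout and reverts to the stable version automatically:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  strategy:
    canary:
      steps:
        - setWeight: 20            # shift 20% of traffic to the new version
        - analysis:
            templates:
              ## Hypothetical AnalysisTemplate that checks metrics
              ## (e.g. HTTP success rate) during the rollout.
              - templateName: success-rate
        - setWeight: 100
```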


Argo Rollouts will handle production deployments while non-production environments can still use plain Argo CD.

Anti-pattern 29 – Recreating Argo Workflows with Argo CD, sync-waves and duct tape

The sync wave feature of Argo CD allows you to execute tasks before and after the main sync phase. These tasks should ideally be idempotent and quick to finish. Some examples are:

  • Sending a notification to another system
  • Performing a quick smoke test
  • Verifying that a dependency exists

We see teams that misuse the sync waves in Argo CD with long-running tasks that are part of a bigger process with strict requirements such as:

  • Automatic retries 
  • If/else control flows
  • Dependency graphs and fan-in/fan-out configuration
  • Artifact storage and retrieval

Sync waves were NEVER designed for this kind of requirement. If you try to do this, you will soon resort to custom scripts that nobody wants to maintain. Adopting Argo CD for deployments and then trying to incorporate custom scripts in the sync process is a huge step backwards.

If you have this kind of process you should use Argo Workflows which handle exactly these kinds of requirements.

Argo Workflows are Kubernetes native workflows that offer you all these features out of the box.

Therefore, the whole sync process should be:

  1. Run an Argo Workflow before the sync process
  2. Perform the main Sync phase
  3. Run another Argo Workflow after the sync process.

Argo Workflows will then handle all the heavy lifting using declarative Kubernetes resources instead of custom scripts.
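Since Argo CD accepts Argo Workflows as hook resources, a sketch of step 1 could look like this (the check itself is a placeholder):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pre-sync-checks-
  annotations:
    ## Argo CD runs this Workflow before the main sync phase
    ## and waits for it to complete.
    argocd.argoproj.io/hook: PreSync
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["echo 'running pre-sync checks'"]  # placeholder step
```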

Anti-pattern 30 – Abusing Argo CD as a full SDLC platform

Despite all the features and the developer-friendly UI, Argo CD is very simple at its core. It is a powerful sync engine that continuously watches what you have in Git and applies the change to your cluster. All the features are centered around this use case.

However, deploying an application is only part of the software development life cycle (SDLC). Several other requirements must be met before (the CI process) and after (observability)  the main deployment.

We have seen several teams trying to expand Argo CD’s scope and make it something it never was. Most importantly, Argo CD has no visibility in your Continuous Integration (CI)  process. Argo CD doesn’t know:

  1. What new features are in the container being deployed
  2. Who built the container
  3. Whether the application has passed your unit and integration tests
  4. Whether your new container has passed your security scans
  5. Who approved the pull request of the source code change

In fact, Argo CD doesn’t even know that it deploys a new container that includes commits from a source code repo. All it knows is diffing and applying Kubernetes manifests, without any insight into the business features behind them.

Attempting to integrate this information into Argo CD either through custom plugins or custom YAML segments is always a clunky process. We understand the need for a unified interface. Developer teams love the Argo CD UI and think they can use it as a central dashboard for everything. 

That is not the role of Argo CD. You can create a developer portal that uses Argo CD behind the scenes, but Argo CD is not a developer portal itself.

If you want to use a central platform for all your Argo CD instances that also combines deployment information with the CI world, check out Codefresh GitOps Cloud.

Conclusion

We hope this comprehensive guide is useful and has provided several good and bad practices when adopting GitOps. Argo CD is a great tool, but it offers several knobs and switches that can be used with undesirable results.

Some features can be abused in several ways, simply because no good documentation exists about the history of the feature, what the intended use is, and what to avoid. 

Using this guide you can start your Argo CD journey in the best way possible as you now have the knowledge of what to avoid before investing a significant amount of effort into your Application manifests.

Here is a summary of all the anti-patterns and our recommendation:

  1. Not understanding the declarative setup of Argo CD -> Store Application CRDs in Git.
  2. Creating dynamic Argo CD applications -> Use Git as the single source of truth for application configuration.
  3. Using Argo CD parameters -> Avoid using the parameters feature as it goes against GitOps principles.
  4. Adopting Argo CD without understanding Helm -> Understand how Helm works independently before adopting Argo CD.
  5. Adopting Argo CD without understanding Kustomize -> Ensure your Kustomize files work on their own before integrating with Argo CD.
  6. Assuming that developers need to know about Argo CD -> Design your Argo CD applications so developers can recreate configurations without Argo CD knowledge.
  7. Grouping applications at the wrong abstraction level -> Use Application Sets or app-of-apps pattern for proper application grouping.
  8. Abusing the multi-source feature of Argo CD -> Use multi-source as a last resort and only for edge case scenarios.
  9. Not splitting the different Git repositories -> Separate source code, Kubernetes manifests, and Argo CD application manifests into different Git repositories.
  10. Disabling auto-sync and self-heal -> Keep auto-sync/self-heal enabled for all systems, including production.
  11. Abusing the targetRevision field -> Always use HEAD in the targetRevision field.
  12. Misunderstanding immutability for container/git tags and Helm charts -> Actively set up the ecosystem (Git, Helm repos, artifact managers) to work with immutable data.
  13. Giving too much power (or no power at all) to developers -> Balance flexibility and security with Argo CD RBAC, and enable local testing without Argo CD.
  14. Referencing dynamic information from Argo CD/ Kubernetes manifests -> Store all values used in manifests statically in Git.
  15. Writing applications instead of Application Sets -> Use Application Sets to automate the creation of Application files.
  16. Using Helm to package Applications instead of Application Sets -> Learn how Application Sets work and their features.
  17. Hardcoding Helm data inside Argo CD applications -> Store all Helm information in Helm values, separate from Argo CD manifests.
  18. Hardcoding Kustomize data inside Argo CD applications -> Store Kustomize information only in Kustomize overlays separate from Argo CD manifests
  19. Attempting to version and promote Applications/Application Sets -> Promote values or overlays, not Application manifests themselves.
  20. Not understanding what changes are applied to a cluster -> Use tools or CI systems to preview and diff rendered Kubernetes manifests.
  21. Using ad-hoc clusters instead of cluster labels -> Use cluster labels and Application Sets to distribute applications to different clusters.
  22. Attempting to use a single application set for everything -> Have many different Application Sets, each with a different purpose/scope.
  23. Using Pre-sync hooks for db migrations -> Use a Database migration operator explicitly built for Kubernetes.
  24. Mixing Infrastructure apps with developer workloads -> Separate infrastructure applications from developer workloads.
  25. Misusing Argo CD finalizers -> Understand how finalizers work and use them correctly for application deletion and migration.
  26. Not understanding resource tracking -> Understand how Argo CD tracks and adopts Kubernetes resources.
  27. Creating “active-active” installations of Argo CD -> Avoid active-active setups, rely on Git and resource tracking for disaster recovery.
  28. Recreating Argo Rollouts with Argo CD and duct tape -> Use Argo Rollouts for progressive delivery and automated rollbacks.
  29. Recreating Argo Workflows with Argo CD, sync-waves and duct tape -> Use Argo Workflows for long-running tasks and complex process orchestration.
  30. Abusing Argo CD as a full SDLC platform -> Use a different system as a developer portal or promotion orchestrator.

Let us know in the comment section if we have missed any other questionable practices!

Happy deployments!

The post Top 30 Argo CD Anti-Patterns to Avoid When Adopting Gitops appeared first on Codefresh.

Abusing the Target Revision Field for Argo CD Promotions https://codefresh.io/blog/argocd-application-target-revision-field/ Fri, 01 Aug 2025 13:57:26 +0000

In our big guide on how to use ApplicationSets for Argo CD applications, we explained the best practice of having a 3-level structure for all manifests with a clear distinction between Argo CD Application files and Kubernetes resource files.

In that article, we also outlined several anti-patterns that we have seen in the wild, meaning questionable practices that might seem ok at first glance but are problematic in the long run both for developers and for Argo CD operators.
In this guide, we want to expand on “Antipattern 2 – Working at the wrong abstraction level” and focus on the targetRevision field of the Argo CD application manifest.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  ## DONT DO THIS
  name: my-ever-changing-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-repo.git  
    targetRevision: dev
    ## earlier it was "targetRevision: staging" and before that it was "targetRevision: 1.0.0",
    ## and even earlier it was "targetRevision: 1.0.0-rc"
    path: my-staging-app
    ## Previously it was "path: my-qa-app"
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app

Using the targetRevision field as a poor-man’s promotion mechanism is a big trap that impacts both usability and auditability for your Argo CD applications.

How the targetRevision field works

The targetRevision field is part of the Application specification, which is the central Argo CD construct for deploying your GitOps applications. 

An Argo CD application describes a link between a Git repository and a Kubernetes cluster. At its most basic form you point Argo CD to the HEAD of a Git repository that contains all your Kubernetes manifests. For convenience, the targetRevision field allows you to define other values apart from HEAD and even select specific Helm versions if you point Argo CD to a Helm chart instead of a Git repository.

You can see all the possible options for tracking strategies at the official documentation page.

Our recommendation is to have “targetRevision: HEAD” in all your application sets and very sparingly use “targetRevision: v1.2.3” for Helm charts that define infrastructure applications that remain mostly static (unlike applications created by developers).
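The recommended counterpart to the manifest shown earlier is therefore a static Application that simply tracks HEAD (repository URL and path are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/example-repo.git
    targetRevision: HEAD   # always track the latest commit
    path: my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
```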

Here is a list of all the possible options and our recommendation:

Application Target            | Value                  | Recommended
Folder in Git                 | targetRevision: HEAD   | Yes
Helm chart stored in Git      | targetRevision: HEAD   | Yes
Branch/environment name       | targetRevision: dev    | No
Semantic version (Git tag)    | targetRevision: 3.4.*  | No
Semantic version (Helm chart) | targetRevision: 3.4.*  | No
Git hash of a commit          | targetRevision: 8aefce | No
Git tag                       | targetRevision: v2.4   | Only in special cases
Helm chart in Helm repository | targetRevision: v2.4   | Only for infra charts

As always, our recommendation is for production usage of Argo CD in large organizations. On a small scale (e.g. homelabs) or with a small team you can obviously get away with any approach you choose.

The main problem we see with several teams is abusing the target revision field as a promotion mechanism for developer applications. We will explain the shortcomings of this approach and the advantages of our recommendation.

Recommendation: Setting TargetRevision to Git HEAD

Before exploring all the alternative options let’s set the baseline and see why our recommendation of using the default HEAD value is the proper one. This scenario is where environments are based on Helm values and Kustomize overlays (or Git folders).

We will compare the following aspects of each approach:

  • Direct changes to code and simplifying day-to-day operations
  • Helping developers understand how and when an application is deployed
  • Enable easy auditing (one of the main benefits of GitOps)
  • Addressing break-glass scenarios and urgent production hotfixes

Setting the value to HEAD and instructing Argo CD to look at the latest version of the manifests/charts/overlays contained in that folder is the most direct and most straightforward approach for developers. It makes any deployment a single step. A developer can update the Kubernetes files and immediately see the change in any affected environment.

The same is true for hotfixes or rollbacks. Developers can change the status of an environment with the familiar git revert and git reset commands.

In fact this is the only setup where developers don’t need to know what Argo CD does at all. We have already explained that developers don’t really care about Argo CD manifests, and keeping them happy by allowing deployments without tampering with Argo CD manifests is the optimal process for them.

Auditing is also as simple as possible. Since Argo CD continuously tracks what is committed in the Git repository of the Kubernetes resources, the deployment history is the SAME as Git history.

The advantages of using Git history as deployment history cannot be overstated. In a large organization and with large numbers of environments, when something breaks the first questions everybody asks are always the same:

  1. What did we change in this environment?
  2. When did the change happen and who did it?
  3. What was the previous version that worked correctly?

Answering these questions quickly is a considerable advantage, especially during an active incident when timing is critical.

In summary, using HEAD for targetRevision is the solution that is fully GitOps compliant (as far as auditing is concerned), easy for developers to use, and flexible enough to cover any possible edge case scenarios and urgent hotfixes. 

Let’s compare it with the additional tracking options that Argo CD offers.

Avoid using environment names for targetRevision

The first approach that we can dismiss right away is using branch names for environments (dev, qa, staging, prod etc).

This is the practice where teams either point specific Argo CD applications to long-running Git branches or they use the targetRevision field to “mimic” promotion in the following way:

  1. An application is first pointed to the “dev” branch of the manifests using targetRevision
  2. Then the Application manifest is updated to point to the “qa” branch of manifests
  3. Then, finally, the targetRevision field is set to “staging” or whichever branch is the one before production.
  4. The cycle starts again

First, if you use the targetRevision field for long-running static branches, you have bigger problems than environment promotions. We have written a complete guide with all the details about the problems with the branch-per-environment approach.

The biggest problem with this approach, however, is that it completely misses all the benefits of auditing that come from GitOps.

If you constantly update the targetRevision field to different branch names it is tough (if not impossible) to reason about the history of your deployments.

In our baseline scenario of using HEAD, if you want to find what was running in a specific application last Thursday, you can go to your Git history and see which commit was active on the respective Git repository. This is a single-step process that anyone can complete in less than a minute.

If you use branch names in the targetRevision field, looking at history now becomes a multi-step process:

  1. First, you need to go to the Git repository that holds the application manifest and find the git commit that was done last Thursday
  2. Read the application file and see which branch was inserted in the targetRevision (let’s say it was “dev”).
  3. Then you need to go to the Git repository of the Kubernetes manifests and the dev branch, and see what was committed last Thursday

This process is more complex and prone to errors, as you manually need to correlate different git repos and commits for different functionalities (application manifests vs. Kubernetes manifests). Humans should not have to do this under pressure at 3 a.m. (when an incident often happens). 

Auditing deployments becomes even more chaotic if you use Application Sets. First, you would have to check out the Git repos (at the correct revisions) and use the argocd CLI to recreate how your application set looked last Thursday, before you even reach the appropriate Kubernetes resources that existed in the cluster.

What completely breaks down this process is all the “temporary fixes” that developers will make if you allow them. A widespread pattern we see is the temporary change of the targetRevision field to another branch “just for testing purposes.”

For example, a team that typically deploys its QA environment from the “qa” branch will often point the Argo CD application at the “staging” branch to debug an issue that occurred in staging, using the QA environment’s resources.

Or, several times, teams create ad-hoc preview environments by pointing different environments to feature branches from individual developers. This is even worse as developer branches can be removed at any time.

In summary, using branch names for the targetRevision field is a very complex process that presents many issues regarding auditing and deployment history. If you work in an organization that has specific financial and legal requirements, you will spend a lot of time trying to keep the auditors happy and manually reconstructing what and where was deployed in the past.

Avoid using semantic ranges in targetRevision

The targetRevision field can also work with version ranges for both Git tags and Helm versions. This sounds great in theory. You enter a value like 3.4.x and then Argo CD will automatically deploy 3.4.0, 3.4.1, 3.4.2 and so on.
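In the Application manifest this looks like the following (values are illustrative):

```yaml
spec:
  source:
    repoURL: https://example.com/my-org/kubernetes-manifests.git
    path: my-app
    # semantic version range over Git tags: any 3.4.x tag gets deployed
    targetRevision: 3.4.x
```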

You think you have solved the promotion problem with Argo CD but in reality you just introduced two major issues that are especially important to developers.

The first problem is that you have now lost all of GitOps’s auditing capabilities. If you use this pattern, you don’t really have a deployment history in the Git repository with the Argo CD manifests.

If somebody asks what you had deployed in a past moment, you cannot see this information in the Git repository anymore. You need to correlate information between versions from your artifact manager that holds Helm charts or locate the dates where each Git tag was created. This is a cumbersome process and again it is not something that you want to do during an incident.

If your organization has strict legal requirements, using version ranges in targetRevision will complicate auditing by an order of magnitude.

But the biggest problem is that you can no longer roll back to a previous version. As Argo CD honors semantic version rules, it will only deploy newer versions of an application. This becomes a big problem when critical issues are found in production.

  1. You have 3.4.x as a value in targetRevision
  2. Version 3.4.2 is deployed right now in production
  3. You create a new Git tag with version 3.4.3
  4. Argo CD deploys it and it has a critical issue
  5. You cannot really go back to 3.4.2 anymore in an automated way

You would have to manually edit the application file and change the targetRevision to 3.4.2 yourself. Then, remember to switch it back to 3.4.x when the issue with 3.4.3 is fixed and 3.4.4 is released.

With this requirement you just forced 2 extra commits that humans have to do during incidents (where you usually want to avoid manual steps).

This process also leaves room for human error. People might forget to switch the targetRevision field back to semantic versioning and wonder why new releases are not getting deployed anymore.

In summary, while version ranges look like an easy way to gain “free promotions”, the problems they create are more impactful than their benefits. The same issues apply if you use semantic versioning for Helm charts.

Avoid using Git hashes in targetRevision

This is the case when an organization wants the “safest” process possible and forces all environments to point to a specific Git hash.

Git hashes

This is a truly locked system, as Argo CD will not deploy anything new anymore. We see this pattern often in financial companies and other companies that want to restrict developers to the greatest extent.
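A locked-down Application of this kind looks like the following (the commit SHA is an illustrative value):

```yaml
spec:
  source:
    repoURL: https://example.com/my-org/kubernetes-manifests.git
    path: my-app
    # a specific, immutable commit SHA (illustrative)
    targetRevision: 7f3a9c1e2b4d5f6a7b8c9d0e1f2a3b4c5d6e7f8a
```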

First of all, let’s clarify the perceived “safety” of this approach. As the Argo CD documentation clearly states, even if you set up an Argo CD application with a specific Git Hash, parameter overrides will still take effect. We don’t recommend using parameter overrides, but this means that a person can still affect this application configuration (either by mistake or on purpose). 

So don’t assume that using a specific hash in the targetRevision field is a bulletproof method for “securing” your Argo CD applications.

On the other hand, you have ruined the developer experience for all your teams. Nothing gets deployed unless somebody also changes the targetRevision. Developer self-service is not possible at all. Every time developers create a new release, another human or system must update the targetRevision field.

The experience is even worse during incidents. We already know that developers don’t care about Git hashes, so understanding what is deployed where becomes a lengthy and error-prone process. Rolling back is also super difficult as the only way to do it is by switching the targetRevision.

Git hashes are immutable and only known after a commit is created. This means an external system (e.g. CI) cannot update your manifests and point targetRevision at the result in a single step. Instead, the external system (or human) must always work in 2 steps:

  1. First, somebody needs to commit the new version of the manifests with application updates (i.e. bumping the container image)
  2. Then you also need to obtain the new Git Hash to put into targetRevision and do a separate commit

It is impossible to do both tasks in a single step as the Git hash from the first action is needed in the second one.

If you assign this responsibility to humans, you just introduced manual steps into your deployment process. If you use an external system, you just added complexity for the sake of complexity.

In summary, using Git hashes in the targetRevision field goes against all DevOps principles and significantly slows down deployments. Using this approach in non-production environments is always a sign that the organization doesn’t really trust its own developer teams.

Use (if needed) Git tags in targetRevision

The last choice for the targetRevision field is to use numbered Git Tags.

This is an acceptable practice, but we recommend it only for locking down production environments. It is better than using plain hashes, as with named Git tags developers can understand where each version is deployed. However, it still suffers from all the issues mentioned in the previous section. Developers cannot deploy or roll back on their own, and extra effort is required during incidents. Like Git hashes, Git tags are treated as fixed references, meaning every new release requires an explicit update of the targetRevision field.

Using Git versions as the target revision in production environments is a good practice if you want to fully control what goes into production. But don’t use this technique for non-production environments. Even if your organization is under legal restrictions, there is no point in making the lives of developers (especially for their QA/staging/dev environments) difficult. 

Therefore, our recommendation is:

  1. Use a specific Git Tag in the targetRevision field ONLY in production environments
  2. Use the simple HEAD tracking method in every other environment.

This gives you the best of both worlds. Developers can deploy fast in non-production environments and can change versions at will. But when it comes to production, they need a human (or external system) to actually update the targetRevision field to a new version.
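Sketching the recommendation with two source stanzas (repository URL, paths, and the tag name are illustrative):

```yaml
# Non-production (e.g. QA/staging): track the tip of the default branch
spec:
  source:
    repoURL: https://example.com/my-org/kubernetes-manifests.git
    path: my-app/envs/staging
    targetRevision: HEAD
---
# Production: pin to a specific, explicitly promoted Git tag
spec:
  source:
    repoURL: https://example.com/my-org/kubernetes-manifests.git
    path: my-app/envs/production
    targetRevision: v1.4.2   # illustrative tag name
```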

Note, however, that Git tags can be deleted and recreated with different contents. So, unless you set proper permissions in your Git repository, using a specific Git tag does not guarantee that your environment is locked down. If the same tag gets associated with a different Git hash, Argo CD will happily redeploy the application.

Use specific chart versions for Infrastructure charts

The targetRevision field can also accept specific versions for Helm charts. This is a good pattern to follow, but only for infrastructure Helm charts (CoreDNS, sealed-secrets, Prometheus, etc.). Basically, you should pin your Helm charts only if all the following apply:

  • The chart is stored in a Helm repository (and not in Git)
  • The chart represents off-the-shelf software and not something your developers create
  • You never “promote” these charts from one environment to another. They just represent infrastructure applications.
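For such infrastructure charts, pinning looks like this (the chart version shown is illustrative; pick and upgrade versions deliberately):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sealed-secrets
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://bitnami-labs.github.io/sealed-secrets
    chart: sealed-secrets
    # a pinned chart version, upgraded only on purpose
    targetRevision: 2.17.3
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
```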

If you use Helm charts for your own applications (the ones your developers create) then follow the advice of the first section of this guide. Put them in Git and use HEAD in the targetRevision field. This will give you all the benefits of easy updating, history as auditing, and following the GitOps principles.

It is worth mentioning that Helm chart versions are mutable by default. Unless your artifact manager specifically prevents it, a developer can push a different Helm chart with the same version and override the contents of the previous one. So even if you set targetRevision for a Helm chart to version 2.3.4, it doesn’t mean the chart has the same contents that version 2.3.4 had last week, unless you configure Helm chart versions as immutable. Note also that some developers bump only appVersion, without bumping the chart version, when nothing has changed in the chart itself.

Summary

We have now seen all the choices for the targetRevision field and examined the advantages and disadvantages.

The HEAD tracking method is the simplest, most direct, and flexible for developers. It makes incident response as painless as possible, as rolling back can be performed by anyone with simple Git commands. It also allows developers to self-serve their needs. It makes auditing straightforward. We recommend using HEAD as the value in the targetRevision field whenever possible.

For production environments, we understand if you are using a specific Git tag/version. However, employ this approach sparingly and only for systems where you want to “restrict” developers. Make sure you understand all the limitations of this approach. Tags are mutable by default, and you have introduced two extra steps in all your deployment processes.

We strongly recommend against using branch/environment names in the targetRevision field. It completely breaks GitOps history and makes auditing a nightmare.

We recommend against using version ranges. Again, you lose all the benefits of GitOps auditing.

We also recommend against using Git hashes. Especially in non-production environments, it slows down your developers.

Happy deployments!

The post Abusing the Target Revision Field for Argo CD Promotions appeared first on Codefresh.

]]>
https://codefresh.io/blog/argocd-application-target-revision-field/feed/ 6
GoodRx Releases Lifecycle Solution for Ephemeral Developer Environments with Built-in Support for Codefresh Pipelines https://codefresh.io/blog/goodrx-releases-lifecycle-solution-ephemeral-environments/ https://codefresh.io/blog/goodrx-releases-lifecycle-solution-ephemeral-environments/#respond Tue, 08 Jul 2025 11:48:28 +0000 https://codefresh.io/?p=17115 GoodRx, a digital healthcare platform, has released the Lifecycle project as open-source code. Lifecycle is a complete solution for temporary/ephemeral environments. The project’s build process includes built-in support for Codefresh pipelines. Creating preview environments from a Pull Request  with Lifecycle Lifecycle was conceived as an internal project back in 2019, and today it is released […]

The post GoodRx Releases Lifecycle Solution for Ephemeral Developer Environments with Built-in Support for Codefresh Pipelines appeared first on Codefresh.

]]>
GoodRx, a digital healthcare platform, has released the Lifecycle project as open-source code. Lifecycle is a complete solution for temporary/ephemeral environments. The project’s build process includes built-in support for Codefresh pipelines.

Creating preview environments from a Pull Request with Lifecycle

Lifecycle was conceived as an internal project back in 2019, and today it is released to the world as a fully open-source project available at https://github.com/GoodRxOSS/lifecycle 

The project covers two very popular scenarios for medium-sized developer teams.

  1. Creating a complete temporary environment with the contents of a pull request
  2. Creating some services with the contents of a pull request while still using several dependencies from a shared/staging environment

Lifecycle comes with its own abstraction for service definitions. You can see the full syntax in the documentation page.

This file (lifecycle.yaml) allows developers to define several microservices that take part in the application and their dependencies.

When a developer creates a Pull Request, Lifecycle understands all the dependencies and their changes and launches preview/ephemeral environments either for all services or only those selected by the developer.

Even though Lifecycle includes a simple Graphical User Interface, developers can use the Pull Request itself to see what is happening.

When a GitHub project is augmented with Lifecycle, a smart comment is added on each Pull request that shows the state of the preview environment. From the same comment, developers can enable or disable specific microservices and even redeploy the entire environment by clicking on checkboxes.

Once the Pull request is merged (or closed), the temporary environment shuts down on its own. You can see a full demo of the developer experience in a YouTube recording.

How Lifecycle works

Lifecycle itself is a self-hosted application available as a Helm chart. Currently, it supports Google Cloud and Amazon Web Services. You need to install Lifecycle in a Kubernetes cluster. If you are using Terraform/OpenTofu you can easily bootstrap everything required with an example Git repository.

Once you install the Lifecycle GitHub app your developers are ready!

There are many ways to define how environments are created. You can choose to auto-create an environment for each Pull request or require a specific label before a deployment happens.

Developers follow their usual workflow.

  1. First they create a feature locally
  2. Once ready, they commit and push to a Pull Request
  3. They can view their feature in isolation in an environment specific to a Pull request/branch
  4. They can choose to accept or discard a Pull request at any point in time.

Lifecycle essentially supercharges your Pull requests because, in addition to the usual checks (unit tests, code coverage, security scans), they now show a URL with the application running for live verification. Apart from developers, other teams (testers or database administrators) will find this functionality very useful for quick manual tests or other checks that require deploying the end result.

Why Lifecycle is different

Preview environments are a well-accepted practice in the software industry and several approaches exist for solving this problem. We have actually offered our own advice both for plain Helm applications as well as GitOps workflows with Argo CD.

What distinguishes Lifecycle from the competition is the “fallback” mechanism it offers. Sometimes creating a full replica of the whole application is either too costly or too complex. Especially for teams that have adopted microservices, the usual scenario is that changes exist only in a subset of the services while the rest are still on the latest version.

Lifecycle allows you to define a fallback/static environment that will take effect for services that the developer does not select. This means that the developer now has the full power to select only a subset of services to participate in the Pull request, while still using the latest stable version for everything else.

This static/shared environment is also handled from a specific Pull request again by Lifecycle. This means that developers can choose to examine new features in individual branches and when they feel confident they can move them to the shared static environment with a simple merge. 

We have seen several tools that excel in one scenario (launching everything) or the other (keeping a shared testing environment), but Lifecycle is a unique tool that focuses equally on both use cases.

If you want to try Lifecycle get started at the official documentation. 

The post GoodRx Releases Lifecycle Solution for Ephemeral Developer Environments with Built-in Support for Codefresh Pipelines appeared first on Codefresh.

]]>
https://codefresh.io/blog/goodrx-releases-lifecycle-solution-ephemeral-environments/feed/ 0
How we replaced the default K8s scheduler to optimize our Continuous Integration builds https://codefresh.io/blog/custom-k8s-scheduler-continuous-integration/ https://codefresh.io/blog/custom-k8s-scheduler-continuous-integration/#comments Mon, 07 Jul 2025 11:57:20 +0000 https://codefresh.io/?p=17099 The default Kubernetes scheduler works great when your cluster is destined for long running applications. At Codefresh we use our Kubernetes clusters for running Continuous Integration pipelines which means our workloads are ephemeral (they are discarded when a pipeline has finished). This allowed us to look at the Kubernetes scheduler from a different perspective and […]

The post How we replaced the default K8s scheduler to optimize our Continuous Integration builds appeared first on Codefresh.

]]>
The default Kubernetes scheduler works great when your cluster is destined for long-running applications. At Codefresh we use our Kubernetes clusters for running Continuous Integration pipelines, which means our workloads are ephemeral (they are discarded when a pipeline has finished).

This allowed us to look at the Kubernetes scheduler from a different perspective and forced us to think about how Kubernetes can work for short-running workloads. After trying to fine-tune the default scheduler for running CI pipelines, we decided that it was best to write our own scheduler designed specifically for our needs.

In this post, we will describe why the default scheduler is not a good choice for ephemeral workloads and how we replaced it with a custom scheduler that meets our needs.

Codefresh pipelines – build your code on Kubernetes

Codefresh pipelines are powerful tools with a simple yet powerful syntax and many capabilities that can optimize your container or application builds. Beyond pipeline syntax, there are less discussed operational aspects of running Codefresh pipelines in hybrid or on-prem scenarios.

Every day we run lots of builds, both for ourselves and for SaaS customers. Build start time is one of the most immediately noticeable aspects of the user experience. Nothing spoils the first impression of our platform quite like builds that are stuck in the initialization stage for a prolonged period of time. There is a psychological threshold where “it’s normal” turns into “it’s kinda slow” and then into “is it broken?”. Since we want to provide a pleasant experience to our customers, we want to avoid slowness as much as possible.

There are naive ways to brute force this issue, but we want to stay cost-efficient in our solutions, so we had to be a little bit more inventive than throwing resources at the problem. A big part of our solution is realizing that the default behavior of the Kubernetes scheduler is sensible for regular web applications, but is completely suboptimal for CI/CD workloads.

At Codefresh, we implement measures to address both latency and cost considerations. Some of those are useful if you run builds on your infrastructure. In this article, we will cover those topics by explaining the problem, designing the solution, and implementing it all.

We focused on two main areas

  1. The time it takes to start a build
  2. Reducing the cost of the infrastructure that runs our builds

Let’s see these in order.

Build start latency

As a starting point we want to minimize the time it takes to start a pipeline run in order to offer a better user experience.

Behind the scenes Codefresh builds are mapped to Kubernetes Pods, so naturally the question of scheduling comes up. Both in terms of “how to start builds as fast as possible?” and “how to minimize the costs associated with running builds?”.

The build cannot start until the pipeline Pod(s) are up and running. Typically, there is spare capacity on the Nodes dedicated to running the build, but it’s not always the case: if we keep creating new builds faster than they are completed, we are bound to reach cluster capacity limits.

Sooner or later builds would be delayed by build Pods being stuck in the Pending status. It might take several minutes for the autoscaler to react and provision new Node(s) for a surge of new builds. Then new Nodes must pull Codefresh images and start all required containers inside the Pod. During all that time the user sees their build in the “Init” phase, probably getting increasingly frustrated.

The solution – avoid waiting for empty nodes completely

To combat this bad user experience, we’ve implemented a concept of ballast: Pods that mimic build Pods in terms of scheduling preferences but with greatly reduced priority. Pod priority affects the K8s scheduler’s behavior. Usually, if there is not enough capacity in a cluster to run a Pod, it will enter a Pending state, and then it’s up to autoscaler or human operators to provision additional capacity for it.

In this illustration a Pod waits for a second node to be created to finally land on it.

But that is not the complete picture. In fact, before transitioning a Pod into the Pending state, the scheduler compares the new Pod’s priority to existing ones, and if there are any with lower priority, the scheduler will evict those to make space for the new Pod.

Typically this mechanism is used to ensure critical components like various DaemonSets (marked red in the illustration above) are always up and running. There are 2 built-in PriorityClasses: system-cluster-critical and system-node-critical to facilitate that. They have a very large positive priority value that should be enough to kick out any regular pod.

It turns out that we can employ this basic scheduling mechanism to our advantage. 

We can use the same mechanism in reverse by creating a PriorityClass with a very large negative value to ensure that Pods with this class will be “total pushovers” and concede their spot to any other Pod if needed. 

We call those Pods a “ballast”. These are placeholder pods with the sole purpose of getting discarded by the scheduler when a real pod appears.
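A minimal sketch of such a ballast, assuming illustrative names and resource values (the actual cf-runtime chart generates equivalent objects for you, as shown later):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ballast-priority
# a very large negative value: any regular Pod can preempt these Pods
value: -1000000000
globalDefault: false
# ballast Pods must never preempt anything themselves
preemptionPolicy: Never
description: "Placeholder Pods that concede their spot to real build Pods"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: build-ballast
spec:
  replicas: 3
  selector:
    matchLabels:
      app: build-ballast
  template:
    metadata:
      labels:
        app: build-ballast
    spec:
      priorityClassName: ballast-priority
      containers:
        - name: pause
          # the pause container does nothing but reserve the resources
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 3500m
              memory: 7800Mi
```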

We can create ballast Pods (marked blue in the illustration above) in a Node pool dedicated to running builds. When the node eventually fills up, the next build Pod will evict a ballast Pod and land in its place. The evicted ballast Pod will become Pending and trigger new Node creation. 

From the cluster’s view the picture is the same: A Pod is created, the Pod enters the Pending state, and a new Node is provisioned. 

But from the end user perspective there is a key difference. The “real” build pod starts working immediately and the placeholder/ballast pod is the one that has to wait and enters the Pending state.

Users see their pipeline start right away!

Implementation

In release 7.6.0 of our cf-runtime Helm chart, we’ve added a ballast section that allows users to enable ballast for both dind and engine Pods. Under the hood those are Deployments that copy the nodeSelector, affinity, tolerations and schedulerName of the respective build Pods to perfectly mimic their scheduling behavior.

All you need to do is enable them, and set the amount of replicas and the resources of an individual replica:

ballast:
 dind:
   enabled: true
   replicaCount: 3
   resources:
     requests:
       cpu: 3500m
       memory: 7800Mi
     limits:
       cpu: 3500m
       memory: 7800Mi
 engine:
   enabled: true
   replicaCount: 3
   resources:
     requests:
       cpu: 100m
       memory: 128Mi
     limits:
       cpu: 100m
       memory: 128Mi

In this example we create a ballast setup that can handle a build spike of up to 3 builds. Ballast resources are set to the same values as runtime.dind.resources (targeting the xlarge EC2 instance size) and runtime.engine.resources respectively.

There are some considerations and rules of thumb for picking those values:

  1. Pick ballast Pod size equal to the default build Pod size
  2. Pick replica count not greater than the typical build spike size

If the ballast Pod size is equal to the default build Pod size, then most of the time there will be one eviction per new build, which makes it easier to reason about the number of replicas.

At the same time the ballast count should not be higher than the expected spikes we want to accommodate: if we create at most 10 builds at a time, but have 30 ballast pods then those 20 remaining replicas will never be of use to us and just idly burn through our infra/cloud budget.

The maximum value from this setup would be achieved by setting the replica count to the most common spike size: all spikes up to the common size will be fully accommodated while bigger spikes will still benefit from the setup.

Another factor to consider is cluster autoscaler reactivity. If it can provision a node in 10 minutes, then in general you want to have a bit bigger ballast compared to a more reactive autoscaler that provisions a new node in under a minute.

Ultimately your ballast setup is a tradeoff between convenience and cost: we host idle Pods so that our builds start faster. If ballast causes cost concerns, you might want to scale the ballast setup dynamically. It’s possible to set the number of replicas to zero in the chart values and add an external HPA; for example, use KEDA with the Cron scaler to effectively remove ballast outside of office hours, when build start times matter less (since those builds are most likely created by some form of automation and not humans).
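As a sketch, scaling ballast down outside office hours with KEDA’s Cron scaler could look like this (the target Deployment name, timezone, and schedule are illustrative):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ballast-office-hours
spec:
  scaleTargetRef:
    name: cf-runtime-ballast-dind   # hypothetical ballast Deployment name
  minReplicaCount: 0
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: 0 8 * * 1-5          # scale up at 08:00 on weekdays
        end: 0 18 * * 1-5           # scale back down at 18:00
        desiredReplicas: "3"
```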

Cost-efficient scheduling

After we optimized the start time of our builds, we wanted to look at cost efficiency.

Running pipelines requires considerable computing resources, so being as efficient as possible when scheduling build Pods is a major concern for operators.

There is a significant difference between running something like a web application expressed in Kubernetes terms as Deployment and a CI/CD pipeline, which is more like a Job object. 

Deployments are scalable and can afford disruption, which actually happens during every release that updates image tags in a rolling fashion. That allows operators to run those workloads on Spot instances or aggressively scale down underutilized nodes, pushing replicas to other nodes. 

On the other hand, CI/CD pipelines do not work in this manner. They are not stateless and disrupting them while they are running is a scenario we want to avoid.

With Job-like workloads that implement pipelines we need to patiently wait for their completion, and only then can we scale down the node that hosted this Job.

This trait creates a harmful dynamic that we’ve observed in our clusters that run customer builds. Below you can see a snapshot of a node pool from one of our clusters (made with eks-node-viewer):

Those clusters tend to run half-empty, costing us extra money.

This problem is especially exaggerated right after a spike in the amount of submitted builds. During a build spike, the cluster autoscaler will create new nodes to accommodate workloads.

In the illustration below, you can see a cluster right after the build spike, where the majority of those builds have finished and the cluster is half-empty. After those builds finish, there will always be a small stream of builds that trickle in and are evenly spread across the nodes.

As you can see in the illustration below, overall resource utilization is very small, but no single node can be scaled down.

Solution – fine tune the scheduler for job-like workloads

The root cause of this problem is twofold:

  1. Job-like workloads are non-disruptible
  2. The default Kubernetes scheduler strives for even load spread

We cannot do anything about the first aspect, but the second one is entirely in our control.

The default scheduler’s behavior is reasonable for most applications but suboptimal for CI/CD pipelines. To prevent builds from taking nodes hostage, we want to pack them tightly, filling nodes one by one instead of spreading builds evenly across the nodes. Any solution that fills small nodes first gets bonus points.

We want  our scheduling algorithm to look like this:

This way, we give big and mostly empty nodes a chance to complete a few remaining jobs and retire from the cluster. In this particular example, nodes 2 and 3 will most likely be scaled down very soon.

Implementation

To change the scheduler’s behavior we need to implement a custom scheduler. Kubernetes allows multiple schedulers to exist in a cluster, and Pods can specify the desired scheduler via the “schedulerName” field, which defaults to the boringly named “default-scheduler”.

WARNING: It’s important to make sure that all Pods on a given node pool are managed by a single scheduler to avoid conflicts and evict/schedule loops. 

For the purpose of running Codefresh builds the recommendation is to have a dedicated tainted node pool and add matching tolerations to Codefresh pods.
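A sketch of that setup, with illustrative taint key and label names: taint the dedicated nodes, then add a matching toleration (plus a nodeSelector) to the build Pods.

```yaml
# Applied to every node in the dedicated pool, e.g.:
#   kubectl taint nodes <node-name> codefresh-builds=true:NoSchedule
# Pod-side scheduling settings for the build Pods:
nodeSelector:
  node-pool: codefresh-builds      # illustrative node label
tolerations:
  - key: codefresh-builds          # must match the taint key above
    operator: Equal
    value: "true"
    effect: NoSchedule
```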

We need to create an umbrella chart over the scheduler-plugins Helm chart, since the upstream chart doesn’t provide the required level of flexibility. You can find the Helm chart for our scheduler plugin at https://github.com/codefresh-contrib/dind-scheduler/tree/main/dind-scheduler

Here is the important part of values.yaml

schedulerConfig:
  score:
    enabled:
      # pick smallest node fitting pod
      - name: NodeResourcesAllocatable
        weight: 100
      # in case of multiple nodes of the same size,
      # resolve a tie in favor of most allocated one
      - name: NodeResourcesFit
        weight: 1
    disabled:
      - name: "*"

  pluginConfig:
    - name: NodeResourcesAllocatable
      args:
        mode: Least
        resources:
          - name: cpu
            weight: 1
    - name: NodeResourcesFit
      args:
        scoringStrategy:
          type: MostAllocated

If we focus on the core of scheduler logic in plain English it sounds like:

  1. Pick the smallest eligible node possible
  2. If there are multiple nodes of the same size, use the fullest node in terms of allocated CPUs

This way we minimize the time big half-empty nodes are held hostage by trickling builds.

The only thing left is to set the scheduler name in runtime values:

runtime:
 dind:
   schedulerName: dind-scheduler

Now you can describe any DinD pod to validate that the correct scheduler is used:

Events:                                                
  Type    Reason                  Age    From          
  ----    ------                  ----   ----          
  Normal  Scheduled               39s    dind-scheduler

After rolling out this custom scheduler this is how the same node pool looks after we’ve changed the scheduler’s behavior: nodes became as tightly packed as possible.

This means that now we pay for nodes we actually use to their fullest capacity keeping our cloud costs down.

Conclusion

In this article we’ve learned how to start Codefresh builds fast and run them cheaply, using nothing but built-in Kubernetes concepts revolving around the scheduler: Pod priorities and preemption (for the ballast) and custom scheduler plugins with tuned scoring (for tight bin-packing).

Feel free to use those resources to further tailor this solution to your needs. If you are a Codefresh customer, you now also know why your builds start much faster!

The post How we replaced the default K8s scheduler to optimize our Continuous Integration builds appeared first on Codefresh.

]]>
https://codefresh.io/blog/custom-k8s-scheduler-continuous-integration/feed/ 2
Configuring Slack notifications with Argo Workflow – a learning experience https://codefresh.io/blog/configuring-slack-notifications-argo-workflow/ https://codefresh.io/blog/configuring-slack-notifications-argo-workflow/#respond Tue, 24 Jun 2025 18:06:52 +0000 https://codefresh.io/?p=17061 The acquisition of Codefresh gave me an exciting opportunity to learn new tech. Initially, I thought Argo was just Argo CD. I didn’t realize that Argo consists of 4 distinct projects: A key feature of the Codefresh product is Promotion Flows, which makes heavy use of Argo Workflows.  Promotion Flows add the ability to assign […]

The post Configuring Slack notifications with Argo Workflow – a learning experience appeared first on Codefresh.

]]>
The acquisition of Codefresh gave me an exciting opportunity to learn new tech. Initially, I thought Argo was just Argo CD. I didn’t realize that Argo consists of 4 distinct projects: Argo CD, Argo Workflows, Argo Rollouts, and Argo Events.

A key feature of the Codefresh product is Promotion Flows, which make heavy use of Argo Workflows. Promotion Flows add the ability to assign Pre and/or Post Actions to the process via Promotion Workflows, which are Argo Workflows with some annotations added. To better understand Promotion Flow capabilities, I decided to create a workflow so I could see how it works and put it into action. In this post, I go over the project I undertook.

The problem to solve

To get this project underway, I needed a problem to solve. I had read that Kubernetes works with external secret providers, but I had never used one myself. Since I had a local instance of HashiCorp Vault running, I decided to include a Slack notification in a Codefresh Promotion Flow that pulls the secrets for Slack from HashiCorp Vault in a just-in-time (JIT) fashion. It’s worth noting that while I created this as a Promotion Workflow, it also works as a standard Argo Workflow.

Prep work

To prepare, I first configured:

  • A Slack channel to work with
  • Vault to work with Kubernetes authentication/authorization

Slack channel

To post a message to Slack, I created an App. I used the Slack API bot token guide to create an App quickly.

Installing the App generated a Bot User OAuth Token. This token is what the workflow later retrieves from Vault to authenticate with Slack.

Vault

For the Vault instance, I needed 2 things for this project:

  • Some secrets to retrieve
  • To allow a Kubernetes Service Account to access my Vault instance

Vault Secrets

You can create secrets in Vault via the UI, CLI, or through an API call.  My Vault instance runs in a container without any data persistence, so I created a quick PowerShell script to automate it.

# Get variable values
$slackChannel = "<Slack channel name>"
$slackToken = "<Slack OAuth token>"
$vaultToken = "<HashiCorp Vault token>"
$header = @{ "X-Vault-Token" = $vaultToken } 

# Create Hashtable
$jsonPayload = @{
	data = @{  
		SLACK_CHANNEL = $slackChannel
        SLACK_TOKEN = $slackToken
    }
}

$jsonPayload | ConvertTo-Json -Depth 10

Invoke-RestMethod -Method Post -Uri "http://<Vault URL>:8200/v1/secret/data/slack" -Body ($jsonPayload | ConvertTo-Json -Depth 10) -Headers $header

Kubernetes Service Account authentication and authorization

There are a few files I had to create to configure the integration between my Kubernetes cluster and my HashiCorp vault instance:

  • vault-auth-service-account.yaml
  • vault-auth-secret.yaml
  • configmap-json.yaml
  • vault-policy.hcl (this one is generated in the script)

vault-auth-service-account.yaml

This file creates a ClusterRoleBinding that grants the Service Account token-review permissions via the system:auth-delegator ClusterRole. If you are using vanilla Argo Workflows, uncomment the top section to also create the vault-auth Service Account.

# Uncomment if using vanilla Argo Workflows
#apiVersion: v1
#kind: ServiceAccount
#metadata:
#  name: vault-auth
#  namespace: argo
#---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: role-tokenreview-binding
  namespace: codefresh-gitops-runtime # Change namespace to either your GitOps Runtime namespace or argo if you're using vanilla Argo Workflows
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: cf-default-promotion-workflows-sa # Change to vault-auth if using vanilla Argo Workflows
  namespace: codefresh-gitops-runtime  # Change namespace to either your GitOps Runtime namespace or argo if you're using vanilla Argo Workflows

vault-auth-secret.yaml

The Service Account being used needs a token to perform the authentication operations. This token is handed to Vault as the token reviewer JWT so that Vault can authenticate Kubernetes clients and allow them to retrieve secrets.

apiVersion: v1
kind: Secret
metadata:
  name: vault-auth-secret
  namespace: codefresh-gitops-runtime  # Change namespace to either your GitOps Runtime namespace or argo if you're using vanilla Argo Workflows
  annotations:
    #kubernetes.io/service-account.name: vault-auth # Uncomment this line and comment out the next if using vanilla Argo Workflows
    kubernetes.io/service-account.name: cf-default-promotion-workflows-sa
type: kubernetes.io/service-account-token

configmap-json.yaml

This file creates a ConfigMap resource that the Vault agent container uses to connect to your Vault instance. The template section defines what you want the agent to do; in this case, it writes the retrieved secrets to a JSON file.

apiVersion: v1
data:
  vault-agent-config.hcl: |
    # Comment this out if running as sidecar instead of initContainer
    exit_after_auth = true

    pid_file = "/home/vault/pidfile"

    auto_auth {
        method "kubernetes" {
            mount_path = "auth/kubernetes"
            config = {
                role = "argo"
            }
        }

        sink "file" {
            config = {
                path = "/home/vault/.vault-token"
            }
        }
    }

    template {
    destination = "/etc/secrets/slack.json"
    contents = <<EOT
    {{- with secret "secret/data/slack" }}
    {
        "SLACK_CHANNEL": "{{ .Data.data.SLACK_CHANNEL }}",
        "SLACK_TOKEN": "{{ .Data.data.SLACK_TOKEN }}"
    }
    {{ end }}
    EOT
    }
kind: ConfigMap
metadata:
  name: vault-agent-config
  namespace: codefresh-gitops-runtime # Change namespace to either your GitOps Runtime namespace or argo if you're using vanilla Argo Workflows

vault-policy.hcl

This file grants the Service Account permissions in Vault so it can read secrets.  This example grants the Service Account read and list permissions to any secret. In a real-world situation, I’d limit what the Service Account has access to. This is just a simple example to get started.

path "secret/data/*" {
  capabilities = ["read", "list"]
}
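For instance, to scope the policy down to just the Slack secret used in this project, you could write something like the following sketch (adjust the path if your KV secrets engine is mounted elsewhere):

```hcl
# Restrict the role to the single secret this workflow needs (KV v2 data path)
path "secret/data/slack" {
  capabilities = ["read"]
}
```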

Helper script

Since I was doing this over and over while testing, I wrote a PowerShell script to automate the process.

  # Reference: https://developer.hashicorp.com/vault/tutorials/kubernetes/kubernetes-external-vault
  #            https://developer.hashicorp.com/vault/tutorials/kubernetes/agent-kubernetes

# Declare working variables
$vaultUrl = "http://<Vault URL>:8200"
$vaultToken = "<Vault Token>"
$namespaceName = "<Namespace>"
$serviceAccountName = "<Service Account Name>"

# Set environment variables
$env:VAULT_ADDR = $vaultUrl
$env:VAULT_TOKEN = $vaultToken

# Create the Kubernetes service account and secret
kubectl apply -f vault-auth-service-account.yaml
kubectl apply -f vault-auth-secret.yaml

# Get the secret
$secret = (kubectl get secrets -n $namespaceName --output json | ConvertFrom-Json)
$secret = ($secret.Items | Where-Object {$_.metadata.name -eq "vault-auth-secret"})

# Get JWT token
$jwtToken = (kubectl get secret $secret.metadata.name --output 'go-template={{ .data.token }}' -n $namespaceName)
$jwtToken = [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($jwtToken))

# Get CA certificate 
$saCaCRT = (kubectl config view --raw --minify --flatten --output 'jsonpath={.clusters[].cluster.certificate-authority-data}' -n $namespaceName)
$saCaCRT = [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($saCaCRT))

# Get cluster hostname
$k8sHost = (kubectl config view --raw --minify --flatten --output 'jsonpath={.clusters[].cluster.server}')

# Create read-only policy for Kubernetes
$vaultPolicy = @"
path "secret/data/*" {
  capabilities = ["read", "list"]
}
"@

Set-Content -Path .\vault-policy.hcl -Value $vaultPolicy

.\vault policy write k8s-ro $PWD/vault-policy.hcl

# Enable Kubernetes authentication in Vault
.\vault auth enable kubernetes

# Configure the Kubernetes authentication
.\vault write auth/kubernetes/config `
token_reviewer_jwt="$jwtToken" `
kubernetes_host="$k8sHost" `
kubernetes_ca_cert="$saCaCRT" `
issuer="https://kubernetes.default.svc.cluster.local"

# Create a role for the Kubernetes authentication
.\vault write auth/kubernetes/role/argo `
bound_service_account_names=$serviceAccountName `
bound_service_account_namespaces=$namespaceName `
token_policies=k8s-ro `
ttl=24h

# Create config map for agent
kubectl apply -f configmap-json.yaml

This tutorial from HashiCorp contains the same commands used in this post, but in bash format.

Workflow template

If you haven’t worked with Argo Workflows before, this template may look intimidating. I’ll break it down by section to make it more digestible.

# DO NOT REMOVE the following attributes:
# annotations.codefresh.io/workflow-origin (identifies type of Workflow Template as Promotion Workflow)
# annotations.version (identifies version of Promotion Workflow used)
# annotations.description (identifies intended use of the Promotion Workflow)
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: slack-notification
  annotations:
    codefresh.io/workflow-origin: promotion
    version: 0.0.1
    description: promotion workflow template
spec:
  arguments:
    parameters:
        - name: APP_NAME
        - name: RUNTIME
  serviceAccountName: cf-default-promotion-workflows-sa
  entrypoint: vault-auth
  volumes:
    - configMap:
        items:
          - key: vault-agent-config.hcl
            path: vault-agent-config.hcl
        name: vault-agent-config
      name: config
    - emptyDir: {}
      name: shared-data
  templates:
    - name: vault-auth
      steps:
        - - name: get-slack-data
            template: call-vault
        - - name: post-slack-message
            template: post-message
            arguments:
              parameters:
                - name: SLACK_CHANNEL
                  value: >-
                    {{=jsonpath(steps['get-slack-data'].outputs.parameters['slack-data'],
                    '$.SLACK_CHANNEL')}}
                - name: SLACK_TOKEN 
                  value: "{{=jsonpath(steps['get-slack-data'].outputs.parameters['slack-data'], '$.SLACK_TOKEN')}}"
                - name: SLACK_MESSAGE
                  value: "{{workflow.parameters.APP_NAME}} promotion has started on runtime {{workflow.parameters.RUNTIME}}"

    - name: call-vault
      container:
        command:
          - vault
        args:
          - agent
          - '-config=/etc/vault/vault-agent-config.hcl'
          - '-log-level=debug'
        env:
          - name: VAULT_ADDR
            value: http://<Vault URL>:8200
        image: hashicorp/vault
        name: vault-agent
        volumeMounts:
          - mountPath: /etc/vault
            name: config
          - mountPath: /etc/secrets
            name: shared-data
      outputs:
        parameters:
          - name: slack-data
            valueFrom:
              path: /etc/secrets/slack.json

    - name: post-message	# we also have an existing plugin at https://github.com/codefresh-io/argo-hub/blob/main/workflows/slack/versions/0.0.2/docs/post-to-channel.md
      inputs:
        parameters:
          - name: SLACK_CHANNEL
          - name: SLACK_TOKEN
          - name: SLACK_MESSAGE
      script:
        image: curlimages/curl
        command:
          - sh
        source: |
          curl -vvv -X POST -H "Authorization: Bearer {{inputs.parameters.SLACK_TOKEN}}" \
          -H "Content-type: application/json" \
          --url https://slack.com/api/chat.postMessage \
          --data "{ 'token': '{{inputs.parameters.SLACK_TOKEN}}', 'channel': '{{inputs.parameters.SLACK_CHANNEL}}', 'text' : 'Workflow beginning:star:', 'attachments': [{'color': '#ADD8E6','blocks': [ { 'type': 'section', 'fields': [{ 'type': 'mrkdwn', 'text': '{{inputs.parameters.SLACK_MESSAGE}}'}] } ] }]  }" 

Kind

The Kind of the Workflow manifest is WorkflowTemplate, using the argoproj.io/v1alpha1 API. The annotations applied to this template designate it as a Promotion Workflow in Codefresh, which allows it to display on the correct dashboards. If you’re using vanilla Argo Workflows, you can remove the annotations section.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: slack-notification # Name of the workflow
  annotations: # Codefresh annotations
    codefresh.io/workflow-origin: promotion
    version: 0.0.1
    description: promotion workflow template

Spec

The spec section is where we define the parameters, volumes, and entrypoint. Additional components are explained in the inline comments.

spec:
  arguments:
    parameters: # Define parameters for the workflow
      - name: APP_NAME # APP_NAME and RUNTIME are automatically set when used within a promotion
      - name: RUNTIME
  serviceAccountName: cf-default-promotion-workflows-sa # The service account used during workflow execution
  entrypoint: vault-auth # Name of the first template to call
  volumes: # Defines workflow wide volume usable by all templates and steps
    - configMap: # This config map is specific to the Vault work we'll be doing, it references the config map we created in the HashiCorp Vault configuration steps
        items: # This section specifies that we're going to write the config map contents to a file which the Vault container will use as a configuration
          - key: vault-agent-config.hcl
            path: vault-agent-config.hcl 
        name: vault-agent-config
      name: config # Name of the volume
    - emptyDir: {} # Creates an empty directory
      name: shared-data # Name of the volume

Templates

The templates section defines the different templates used during the execution of the workflow. This example contains 3 templates:

  • vault-auth: This is the name we specified in the entrypoint of the spec section, so execution begins here. This template calls the other two.
  • call-vault: This is the template that will perform the call from the cluster to Vault to retrieve the secrets and produce output parameters.
  • post-message: This template posts the message to Slack and will receive the secret as an input parameter.

The "- -" syntax indicates that a step executes sequentially (each "- -" starts a new step group); a single "-" means the step executes in parallel with the previous one.
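As a minimal illustration (step and template names are hypothetical), steps in the same group run in parallel, while each new "- -" group waits for the previous group to finish:

```yaml
steps:
  - - name: step-a        # first group starts
      template: task-a
    - name: step-b        # single '-': runs in parallel with step-a
      template: task-b
  - - name: step-c        # new '- -' group: runs only after step-a AND step-b complete
      template: task-c
```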

templates:
    - name: vault-auth
      steps:
        - - name: get-slack-data
            template: call-vault
        - - name: post-slack-message
            template: post-message
            arguments:
              parameters:
                - name: SLACK_CHANNEL
                  value: >-
                    {{=jsonpath(steps['get-slack-data'].outputs.parameters['slack-data'],
                    '$.SLACK_CHANNEL')}}
                - name: SLACK_TOKEN 
                  value: "{{=jsonpath(steps['get-slack-data'].outputs.parameters['slack-data'], '$.SLACK_TOKEN')}}"
                - name: SLACK_MESSAGE
                  value: "Test message"


    - name: call-vault
      container:
        command:
          - vault
        args:
          - agent
          - '-config=/etc/vault/vault-agent-config.hcl'
          - '-log-level=debug'
        env:
          - name: VAULT_ADDR
            value: http://<Vault URL>:8200
        image: hashicorp/vault
        name: vault-agent
        volumeMounts:
          - mountPath: /etc/vault
            name: config
          - mountPath: /etc/secrets
            name: shared-data
      outputs:
        parameters:
          - name: slack-data
            valueFrom:
              path: /etc/secrets/slack.json


    - name: post-message
      inputs:
        parameters:
          - name: SLACK_CHANNEL
          - name: SLACK_TOKEN
          - name: SLACK_MESSAGE
      script:
        image: curlimages/curl
        command:
          - sh
        source: |
          curl -vvv -X POST -H "Authorization: Bearer {{inputs.parameters.SLACK_TOKEN}}" \
          -H "Content-type: application/json" \
          --url https://slack.com/api/chat.postMessage \
          --data "{ 'token': '{{inputs.parameters.SLACK_TOKEN}}', 'channel': '{{inputs.parameters.SLACK_CHANNEL}}', 'text' : 'Workflow beginning:star:', 'attachments': [{'color': '#ADD8E6','blocks': [ { 'type': 'section', 'fields': [{ 'type': 'mrkdwn', 'text': '{{inputs.parameters.SLACK_MESSAGE}}'}] } ] }]  }" 

The result

After going through some trial and error learning how everything functions, I successfully posted a message to my designated channel!

Slack notification showing the workflow successfully beginning.

Conclusion

I needed something to help me understand how Argo Workflows and Codefresh Promotions Workflows worked. Setting up my own project with a specific purpose demystified not only how they worked, but also how to construct a Promotion Workflow myself. I hope this post helps you in the same way it helped me.

Happy deployments!

The post Configuring Slack notifications with Argo Workflow – a learning experience appeared first on Codefresh.

]]>
https://codefresh.io/blog/configuring-slack-notifications-argo-workflow/feed/ 0
Laser Focused Kubernetes Deployments Using Argo Rollouts and Header Based Routing https://codefresh.io/blog/argo-rollouts-header-based-routing/ https://codefresh.io/blog/argo-rollouts-header-based-routing/#respond Mon, 23 Jun 2025 13:39:25 +0000 https://codefresh.io/?p=17067 A Kubernetes cluster with default configuration has access to only two deployment strategies: To get access to more advanced deployment strategies such as blue/green and canaries you need to use a dedicated Progressive Delivery controller such as Argo Rollouts.  We have previously covered several basic and advanced scenarios for Argo Rollouts in our blog. Today […]

The post Laser Focused Kubernetes Deployments Using Argo Rollouts and Header Based Routing appeared first on Codefresh.

]]>
A Kubernetes cluster with default configuration has access to only two deployment strategies:

  • Recreate (causes downtime)
  • Rolling Update (avoids downtime but you cannot preview or validate the next version in advance)

To get access to more advanced deployment strategies such as blue/green and canaries you need to use a dedicated Progressive Delivery controller such as Argo Rollouts

We have previously covered several basic and advanced scenarios for Argo Rollouts on our blog. Today we answer another common question: how can you select which of your live users will have access to the canary deployment?

As a reminder, with a canary deployment you gradually shift live user traffic to the new version’s pods. The canary is finished when 100% of live users see the new pods, or when something goes wrong and you revert all of them to the previous/stable version.

In the example above, we start the canary by shifting 20% of network requests to the v2 container, then 50%, and finally 100%. The key point is that unless you do something special, the network requests that go to the new version are random. Some users might even see both application versions if you are not careful.
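The traffic shift described above maps to canary steps like the following sketch (the pause durations are hypothetical and not part of the original example):

```yaml
strategy:
  canary:
    steps:
      - setWeight: 20          # send 20% of requests to the new version
      - pause: {duration: 10m}
      - setWeight: 50          # then 50%
      - pause: {duration: 10m}
      - setWeight: 100         # finally shift all traffic to the new version
```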

In the real world you almost always want specific groups of people to be part of the canary process. Some examples are:

  • “Only our internal users should see this new version”
  • “Only French users must be part of the canary”
  • “Asia should stay in the old version, the US will see the canary only”
  • “Only users who have checked the ‘preview checkbox’ must see the canary”
  • “The payment gateway should stay in the old version. The intranet should see the new version”

So can we still use Argo Rollouts to cover these use cases? The answer is yes. In this guide we explain two approaches, one basic and one advanced, and compare their advantages and disadvantages.

The methods we are going to use to decide which users see the canary are:

  1. Static routing with extra URLs (limited but simple to implement)
  2. Header-based routing (more powerful but also more complex to implement)

If you want to try the examples on your own, all resources are available at https://github.com/kostis-codefresh/rollouts-header-routing-example 

Understanding the blast radius of your deployments

The central promise of Argo Rollouts is automatic rollbacks. You deploy a new version and then within 1-2 hours (ideally 15 minutes) either the new version is promoted as stable or it is automatically reverted. 

This sounds great in theory, but in practice, you need to understand who will be affected if a deployment fails. Let’s say you are doing canary deployments and need 1 hour to get good metrics to decide about the new version’s health. If the metrics fail, some users will have issues for 1 hour. Is this acceptable? Could you control which users face the disruption and which never participate in the canary?

If you read the official Argo Rollouts documentation, the assumption is that the Rollout controller only focuses on a single application.

In a big organization most services have dependencies. This is especially true for companies that have adopted microservices. So instead of looking at a single service independently, you need to understand how the application works inside the whole cluster. 

A more realistic example would be the following:

Here we have an e-shop application with different kinds of users

  • External partners can interact with the inventory of the application
  • Internal/Intranet users provide customer support and handle the store management
  • The general public accesses the public website to order/buy items.

If we choose Progressive Delivery for the “auth” service shown in the middle, we see that even though it is a single service, it is a runtime requirement for 3 other services (portal, admin, store). So even if we apply a canary approach, a failed deployment will affect all users of our application.

Therefore, if you need 2 hours for a canary deployment and that deployment fails, ALL your users will be affected for 2 hours. Wouldn’t it be nice if you could control which user groups are affected and which are not?

Isolating specific users instead of random network requests

Making a decision about which users see the canary process and which do not is only one aspect of the deployment process. The other aspect is verifying whether a user is part of the canary. Then, all network requests should always be directed to the preview/canary version of the application.

A widespread misconception about Argo Rollouts is that integrating with a traffic provider allows you to send specific users to the canary version. Unfortunately, this is not true in the default configuration.

Even if you use a traffic provider, the percentage of requests that go to the canary application is completely RANDOM. If you set up Argo Rollouts with a canary step of 30%, Argo Rollouts will only guarantee that 30% of all network requests will go to the canary process. But there is no guarantee that these are from the same users.

This leads to a very common problem for several organizations: Multiple requests from the same user result in different application versions (both old/stable and preview/canary).

In the example of 30%, Argo Rollouts will indeed send 30% of the total network requests to the canary version, but if you look at the network requests for a single user, you might have the case where the first request is not part of the 30%, the next one is, the next is not and so on. This limitation can be catastrophic for applications with a Graphical Interface, as the user might see different components on screen with each subsequent network request (if the canary version also affects the application’s UI).

In the real world, companies don’t want a random percentage of requests to go to the canary version. You want to apply the percentage to individual users.

The expectation is that if you set up a canary of 30%, you expect 30% of users to see the canary and 70% of them are still on the old/stable version. If you log the network requests of a single user however, you want all of them to go to EITHER the canary version OR the stable version and never both.

So can Argo Rollout support this use case of user segmentation instead of network request segmentation?

Example application – Visualize your canary

Our example Rollout can be found at https://github.com/kostis-codefresh/rollouts-header-routing-example/ 

This repository includes the example application and all the manifests used in this guide.

The highlight of the example application is that you can see visually which requests hit one version or the other.

In the screenshot above, a canary is in progress between the application’s v1 and v2. The dashboard performs multiple requests (one for each box shown), allowing you to examine your canary networking in a very simple way.

Approach 1 – Static URL routing

Let’s start with the first use case—reducing the blast radius from a failed deployment. The solution is to create a different URL for each group of users who participate in the deployment process.

We have 3 URLs:

  1. The canary/default URL where requests are routed to the canary. Users of this URL will follow the canary traffic as it increases
  2. A URL that ALWAYS sends requests to the canary/preview version regardless of the defined percentage
  3. A URL that ALWAYS sends requests to the stable/old version regardless of the defined percentage.

Instead of having a single URL for the canary, we can give each user group a different URL according to their risk acceptance.

In the previous example of the e-shop application we can easily accommodate the following imaginary requirements:

  1. We want our external partner never to see the canary at all. They will be shown the stable version until the last possible moment
  2. We want our public users to be part of the canary process as normal
  3. We want our own employees to “see” the new version right away so that they can detect problems as early as possible

In this example we use different URL paths, but we could do the same thing with hostnames (e.g. canary.auth.com, preview.auth.com, stable.auth.com).

Now when a canary process is started:

  • Users who follow the /stable endpoint will always see the old/stable application version
  • Users who follow the /preview endpoint go to the new version straight away
  • Users who follow the /canary endpoint participate in the canary as usual.

Here is a timeline for each user group. Blue indicates that network requests go to the old/stable version, and green indicates that they go to the new canary version.

The end result for all groups is precisely the same: they see the new version of the application. The big difference is in failed deployments. If a deployment fails and the canary reverts, users that follow /stable (external partners in our example above) see no impact at all.

Instead of affecting everybody, we have completely isolated our external partners and applied a different risk acceptance to the general public and our own internal users.

Implementing this approach with Argo Rollouts is straightforward. Instead of using just one network endpoint (for the canary), you create an additional one pointing at the stable service and one more for the preview service.
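As a sketch of what the three endpoints could look like with the Gateway API (the gateway name and port are hypothetical; the service names match the Rollout used later in this guide):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: static-routing-demo
spec:
  parentRefs:
    - name: demo-gateway               # hypothetical gateway
  rules:
    - matches:
        - path: {type: PathPrefix, value: /stable}
      backendRefs:
        - name: smart-stable-service   # always the old/stable version
          port: 8080
    - matches:
        - path: {type: PathPrefix, value: /preview}
      backendRefs:
        - name: smart-canary-service   # always the new/preview version
          port: 8080
    - matches:
        - path: {type: PathPrefix, value: /canary}
      backendRefs:                     # weights on this route are managed by Argo Rollouts
        - name: smart-stable-service
          port: 8080
        - name: smart-canary-service
          port: 8080
```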

If you run our example, you can now access 3 URLs (/canary, /stable, /preview). If you start a canary process, only /canary will gradually move to the new version.

Users who visit /preview will see the new version right away:

Users who visit /stable will always view the stable version regardless of the state of the canary process:

This approach needs no source code changes and can be implemented quickly. It has however three significant limitations:

  • It is static in the sense that you need to decide in advance which user groups will visit which URL
  • You need to notify all dependent services about the new URLs if they don’t want to follow the default behaviour
  • It still works at the level of network requests instead of actual users

Approach 2 – Dynamic URL routing

The main limitation of static routing is that you need to identify which user group will use which service in advance. Once you make this selection, you cannot change it after the canary has started.

We still haven’t addressed the problem of users versus requests: with the canary endpoint, a random subset of network requests sees the canary, rather than a consistent set of real users.

Using HTTP headers instead of simple endpoints can improve network isolation. Argo Rollouts can detect optional HTTP headers and make decisions accordingly.

In the example above Argo Rollouts will send to the canary all requests that have an HTTP header “X-Canary:true”.

Now we have the capability to run canaries for users instead of just network requests. We can modify our application source code to enable this header on the fly.

All requests from users with this header present will be redirected to the canary version. This user group will always see the canary, so this approach works well even for graphical applications.

HTTP headers are fully dynamic. You can change them on the fly. There are several networking products, such as load balancers, API gateways, and service meshes, that allow you to inject or modify headers in a network request.

We can activate this pattern by creating a standard HTTP route and then instructing the canary to create a second one on the fly only if a specific header exists.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: smart-rollouts-demo
spec:
  replicas: 5
  strategy:
    canary:
      canaryService: smart-canary-service
      stableService: smart-stable-service
      trafficRouting:
        managedRoutes:
        - name: always-preview
        plugins:
          argoproj-labs/gatewayAPI:
            httpRoutes:
              - name: my-smart-route
                useHeaderRoutes: true
            namespace: default
      steps:
        - setHeaderRoute:
            name: always-preview
            match:
              - headerName: X-Canary
                headerValue:
                  exact: "yes"  
        - setWeight: 25                        
        - pause: {}
        - setWeight: 100

If you launch the application and this HTTP header is not present, you will see a standard canary with both versions.

If you activate the header then all requests of this user go to the canary!

This setting is now per user. If you open another browser (to simulate a different user), you will see the standard canary behavior again.

In this simple example, the application itself controls the HTTP header. In a real application, a networking component might do this for you (for example, adding this header only to French users).
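As a hypothetical illustration, a service mesh such as Istio could inject the header for a subset of users. In this sketch, the "x-user-country" match header, the host names, and the backing service are assumptions, not part of the example repository; the actual canary routing is still performed by the Rollout-managed route that matches on X-Canary:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: inject-canary-header
spec:
  hosts:
    - demo.example.com                  # hypothetical host
  http:
    - match:
        - headers:
            x-user-country:             # hypothetical header set by a geo-IP filter
              exact: "FR"
      headers:
        request:
          set:
            X-Canary: "yes"             # the header the Rollout's setHeaderRoute matches
      route:
        - destination:
            host: smart-rollouts-demo   # hypothetical backing service
    - route:                            # everyone else: no header injected
        - destination:
            host: smart-rollouts-demo
```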

Conclusion

In this guide we have seen two approaches for deciding which users see the new version during a canary, instead of relying on a random percentage of requests (the default behavior for Argo Rollouts).

With these approaches

  • We have complete control over the impact of a failed deployment. We can choose user groups that will always be redirected to the stable version even when a canary is in progress
  • Splitting user groups according to their risk acceptance can be decided in advance or updated on the fly
  • Canary behavior is now per user instead of per network request

In both cases, you might need to make source code changes to use a different endpoint or enable/disable a specific HTTP header.

The post Laser Focused Kubernetes Deployments Using Argo Rollouts and Header Based Routing appeared first on Codefresh.

]]>
https://codefresh.io/blog/argo-rollouts-header-based-routing/feed/ 0
Distribute Your Argo CD Applications to Different Kubernetes Clusters Using Application Sets https://codefresh.io/blog/argocd-clusters-labels-with-apps/ https://codefresh.io/blog/argocd-clusters-labels-with-apps/#comments Tue, 17 Jun 2025 10:44:02 +0000 https://codefresh.io/?p=17027 In the previous article in this series, we explained how Argo CD application Sets work and how to use them for organizing your applications in different environments or groups. We received a lot of positive feedback from our readers, and many teams now use the associated Git repository as a starting point for their own […]

The post Distribute Your Argo CD Applications to Different Kubernetes Clusters Using Application Sets appeared first on Codefresh.

]]>
In the previous article in this series, we explained how Argo CD Application Sets work and how to use them for organizing your applications in different environments or groups. We received a lot of positive feedback from our readers, and many teams now use the associated Git repository as a starting point for their own Argo CD setup.

Even though we covered Application Sets, and more specifically the Git generator, we never explained how to assign different applications to different clusters. This is a common question from teams managing multiple clusters with different application settings per environment.

In this article, we complete the Application Set puzzle and analyze:

  • How to decide which application goes to which cluster
  • How to have different application settings per environment
  • How to split your clusters into different groups with cluster labels
  • How to combine the Argo CD Git Generator with the Cluster generator
  • How you can simplify your day-to-day operations using cluster labels. 

For more details, we’ve again included an example Git repository.

Managing multiple Kubernetes clusters with Argo CD

Argo CD ApplicationSets let you automate your Application manifests in Argo CD. If you adopt ApplicationSets, you no longer need to deal with individual Argo CD applications’ YAML. You can simply point Argo CD to your clusters and folders, and all the possible combinations get created on the fly for you.  

We’ve already seen that you can use ApplicationSets to deploy multiple applications on a single cluster.

We’ve also seen the other dimension—how to deploy the same application to different clusters:

In this guide, we cover the most complex scenario where we have multiple applications and multiple clusters.

To achieve this scenario, we need to use the Cluster Generator of Argo CD. This means you need to connect all your clusters to a single Argo CD instance. This is the hub-and-spoke setup of Argo CD. See our Argo CD architecture guide for different configurations and the advantages and disadvantages of each one.

Using a combination of the cluster and the Git generator, we can create a 2-dimensional matrix of all the pairs (cluster-app) and have Argo CD deploy everything with a single file.

This approach is a great starting point, but in a real organization, we need 2 more capabilities:

  1. The ability to enable/disable some applications for some clusters
  2. The ability to have different configurations (for example, Helm values) according to the cluster the application belongs to.

The final result is not a full 2-D matrix because some applications won’t exist in all environments. We want to achieve this:

In the example above, Sealed Secrets is NOT present in Cluster C, and Cert Manager is not present in Cluster A. In addition, the “Billing Application” needs a different configuration for each cluster.

So, can we achieve these requirements with Application Sets?

Anti-pattern – Creating Snowflake servers with ad-hoc combinations

When faced with the problem of distributing different applications to different clusters, many teams jump straight into very complex solutions that combine multiple Application Set generators. Unfortunately, most hard-code custom combinations in the Application Set files.

A classic example of this approach is trying to individually enable/disable a specific application for a particular cluster. We advise AGAINST using such Application Set structures.

 ## DO NOT DO THIS 
- merge:
      mergeKeys:
        - app
      generators:
        - list:
            elements:
              - app: external-dns
                appPath: infra/helm-charts/external-dns
                namespace: dns
              - app: argocd
                appPath: infra/helm-charts/argocd
                namespace: argocd
              - app: external-secrets
                appPath: infra/helm-charts/external-secrets
                namespace: external-secrets
              - app: kyverno
                appPath: infra/helm-charts/kyverno
                namespace: kyverno
        - list:
            elements:
              - app: external-dns
                enabled: "true"
              - app: argocd
                enabled: "true"
              - app: external-secrets
                enabled: "false"
              - app: kyverno
                enabled: "true"
    selector:
      matchLabels:
        enabled: "true"

This file creates snowflake/pet servers where you need to define exactly what they contain. The final result is brittle, requiring significant effort when any major change happens. There are several challenges with this setup:

  • It works directly on individual clusters (instead of cluster groups, as we’ll see later in the guide), so it never scales as your requirements change.
  • It forces you to hardcode application combinations inside Application Sets. This makes the generators your new unit of work instead of your Kubernetes manifests.
  • It makes all day-2 operations lengthy and cumbersome procedures.
  • It makes reasoning about your clusters super difficult. Understanding what’s deployed where is no longer trivial.

The final two points cannot be overstated. This approach might look ok at first glance, but the more clusters you have, the more complex it will become.

  1. If somebody asks which clusters contain kyverno, you need to scan all individual files for the “enabled” property of the “kyverno” line.
  2. Every time you add a new cluster to your setup, you need to copy/paste the list of components from another cluster and start enabling/disabling each individual component. If you have many components and many clusters, this is an error-prone process that you should avoid at all costs.
  3. If you add a new component, you need to go to all your existing files and add it to all the enabled/disabled lists.
  4. It only addresses the first requirement (enabling/disabling applications for clusters) but not the second one (having different configurations per cluster for the same application).

There is a better way to distribute applications to Argo CD clusters. The approach we DO recommend is using cluster generator labels.

Working with cluster groups instead of individual clusters

In a large organization, you don’t really care about individual clusters. You care about cluster groups. Argo CD doesn’t model the concept of a cluster group on its own, but you can replicate it using cluster labels.

You need to spend some time thinking about the different types of clusters you have and then assign labels to them.

The labels can be anything that makes sense to your organization:

  • Environment types (for example, QA/staging/prod)
  • Regions or countries
  • Department or teams
  • Cloud provider or other technical difference
  • Any other special configuration that distinguishes one or more clusters from the rest

After you have those labels, you can slice and dice your clusters across any dimension and start thinking about cluster groups instead of individual clusters.

Ultimately, 99% of use cases revolve around cluster groups rather than individual clusters.

  • “I want all my production clusters to have application X with Y settings.”
  • “I want all my AWS clusters to have X authentication enabled.”
  • “Team X will control this environment while team Y will control that environment.”
  • “All European clusters need this application.”
  • “Application X is installed on both US-East and US-West regions, but with different configurations.”
  • “Just for our QA environment, we need this load testing app deployed.”

We’ll see in detail all the advantages when using cluster labels, but one of the easiest ways to understand the flexibility of this approach is to examine what happens for a very common scenario—adding a brand new cluster.

In most cases, a new cluster is “similar” to another cluster. A human operator needs to “clone” an existing cluster, or at the very least, define the new properties of the new cluster in the configuration file.

If you use cluster labels (as we suggest), the whole process requires zero modifications to your application set files.

  1. You create the cluster with your favorite infra tool (Terraform/Pulumi/Crossplane, etc)
  2. You assign the labels on this cluster (for example, it’s a new QA cluster in US East)
  3. Finished!

Argo CD automatically detects this new cluster when it collects all its clusters, and deploys everything that needs to be deployed in it according to its labels. There’s no configuration file to edit to “enable/disable” your apps. The process cannot get any easier than this.

Notice that this setup also helps with communication between developers and operators/infrastructure people. Opening a ticket for a new cluster and having several discussions about the contents of the new cluster significantly slows down development time.

Manual cluster creation

In most cases, developers want a cluster that either mimics an existing one or has similar configuration to another cluster group. This makes your job very easy, as you can map directly to cluster labels what developers need.

Creating a new cluster can be a hectic process because you need to validate that it matches the expected workloads and is “similar” to your other clusters. If you use cluster labels, then Argo CD takes care of everything in minutes instead of hours.

Organizing your Argo CD clusters with different labels

Let’s see how all our use cases can work together with a semi-realistic example. You can find all Argo CD manifests at https://github.com/kostis-codefresh/multi-app-multi-value-argocd if you want to follow along.

The repository contains:

Here are the 7 clusters that we define with K3d. In a real organization, these clusters would be created with Terraform or another similar tool.

All clusters

We’ve assigned several example labels on those clusters. Notice that even before talking about applications, the clusters themselves exist in 2 dimensions:

  • A promotion flow (QA -> staging -> production) on the horizontal axis
  • A region setting (US/EU/Asia) on the vertical axis

The “hub” cluster contains the Argo CD instance that manages all the other clusters. In our example, this cluster only has Argo CD and no end-user applications, so it doesn’t take part in our application sets (it has a label type=hub instead of type=workload).


You can verify or change the labels of each cluster by inspecting the corresponding Cluster Secret in the main Argo CD instance. Here’s an example of a QA cluster that shows the assigned labels as created by our example GitHub repository.

apiVersion: v1
data:
  [...snip..]
kind: Secret
metadata:
  annotations:
    managed-by: argocd.argoproj.io
  labels:
    argocd.argoproj.io/secret-type: cluster
    cloud: gcp
    department: billing
    env: qa
    region: eu
    type: workload
  name: cluster-k3d-qa-eu-serverlb-1347542961
  namespace: argocd

We’re now ready to look at some typical scenarios. It’s impossible to cover all possible use cases, so we’ll see some representative scenarios for each use case.

The major question that you need to ask yourself is whether you want to deploy an application across different environments with the exact same configuration OR you want a different configuration per environment. The latter is obviously more complex and requires a good understanding of your Kustomize Overlays and Helm value hierarchies, but it’s closer to how a real organization works:


Here are the scenarios we’ll see:

Scenario                        Type             Configuration
1 – “workload clusters”         Plain Manifests  Same across all environments
2 – “GCP only”                  Plain Manifests  Same across all environments
3 – “Europe only”               Plain Manifests  Same across all environments
4 – “Production/Asia”           Plain Manifests  Same across all environments
5 – “QA US and EU”              Kustomize        Same across all environments
6 – “Production EU/US”          Kustomize        Different per environment
7 – “QA US and EU”              Helm             Same across all environments
8 – “Europe Only”               Helm             Different per environment
9 – “Production EU/US/Asia”     Helm             Different per environment

Notice that in our example repository, our applications are grouped in folders by type: manifests, Kustomize, or Helm apps.

In a real organization, you might have different sub-folders for each type, but it’s simpler if you only have to manage one type of application (for example, Kustomize for your own developers and Helm charts for external applications).

Scenario 1 – Run some applications on all workload clusters

Let’s see a very simple use case. We want to deploy a set of common applications to all our workload clusters, excluding the Argo CD “hub” cluster. We can take advantage of the “workload” label and point Argo CD to a folder that has all our common applications.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: simple-apps/*
        - clusters:    
            selector:
              matchLabels:
                type: "workload"

You can see the full Application Set at 01-common-apps.yml. This file instructs Argo CD to:

  1. Gather all connected clusters that have the “type=workload” label
  2. Gather all the Kubernetes manifests found under “simple-apps”
  3. Create all the combinations between those clusters and those apps
  4. Deploy the resulting Argo CD applications.

If you’re not familiar with generators, please read our Application Set Guide. If you deploy this file, you’ll see the following:

We got 18 applications (6 clusters multiplied by 3 apps) in a single step. Isn’t this cool? 
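The cartesian product that the matrix generator performs can be sketched in plain Python (a simplified model, not Argo CD code; the app and cluster names below are hypothetical):

```python
from itertools import product

# Hypothetical app directory names found by the Git generator under simple-apps/
apps = ["guestbook", "invoices", "billing"]
# Hypothetical names for the 6 connected clusters labeled type=workload (the hub is excluded)
clusters = ["qa-us", "qa-eu", "staging-us", "staging-asia", "prod-eu", "prod-asia"]

# The matrix generator emits one Application per (app, cluster) combination
applications = [f"{app}-{cluster}" for app, cluster in product(apps, clusters)]
print(len(applications))  # 3 apps x 6 clusters = 18 Applications
```

Adding a new workload cluster or a new app folder grows this product automatically, with no change to the Application Set itself.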

Scenario 2 – Choose only GCP clusters and exclude those in AWS

In the next example, we want to install all the applications under the `simple-apps` folder only in our Google Cloud clusters; those applications should not exist in our Amazon clusters. Again, we have created the appropriate labels in advance. In our imaginary organization, all non-production clusters run in GCP.

Choose GCP only clusters

The admin server also runs in AWS, and this is why it won’t get picked up by our application set. You can find the full manifest at 02-gcp-only.yml.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: simple-apps/*
        - clusters:    
            selector:
              matchLabels:
                type: "workload"    
                cloud: "gcp"

This Application Set is similar to the previous one, but now we’re matching 2 labels—one for Google Cloud and one for all our “workload” clusters.

If you apply it, you get several applications, deployed only to the non-prod environments.

Argo CD created a list of applications for only the QA and Staging cluster groups, as they contain clusters that run on Google Cloud.

Scenario 3 – Choose only European Clusters

The real power of labels becomes clear when you get requirements that cut across your clusters in an unusual or non-linear way. Let’s imagine a scenario where you need to do something specific to all European clusters because of GDPR regulations.

At this point, most teams realize that the primary way of organizing their clusters was by type (qa/staging/prod), and they modelled the region as a secondary parameter. This creates several challenges and makes people ask the same question, “Does product X support deployments in regions?”.

But when using the cluster generator, all labels are first-level constructs, allowing you to make any selection possible. We can focus on European clusters by just defining our region the same way as any other scenario.

Today, only the QA and Production environments have a European cluster. But tomorrow, you might add one in the Staging environment WITHOUT any modifications in your Application Set.


We select all European servers by region with file 03-eu-only.yml

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: simple-apps/*
        - clusters:    
            selector:
              matchLabels:
                type: "workload"    
                region: "eu"  

Deploying this application set instructs Argo CD to place all the applications under the simple-apps folder only on the European clusters:

If you add a new European cluster in the Staging environment, then on the next Argo CD sync, that cluster also gets all applications defined for Europe, with zero effort from the administrator.

Scenario 4 – Choose a specific cluster among a cluster group

If it wasn’t clear from the previous examples, the label selector for clusters works in an “AND” manner by default. So the more labels you add in the selector, the more specific the application set becomes.

This means that even if you really want to select a single cluster among a group, you can just define all the labels that correctly identify it.
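This AND behavior can be modeled with a few lines of Python (a sketch of the selector semantics, not the actual Argo CD implementation; the cluster names and labels are hypothetical):

```python
def match_labels(cluster_labels, selector):
    # matchLabels semantics: every key/value pair in the selector
    # must be present on the cluster (logical AND)
    return all(cluster_labels.get(k) == v for k, v in selector.items())

# Hypothetical cluster names and labels, mirroring the example setup
clusters = {
    "prod-asia": {"type": "workload", "region": "asia", "env": "prod"},
    "prod-eu":   {"type": "workload", "region": "eu",   "env": "prod"},
    "qa-asia":   {"type": "workload", "region": "asia", "env": "qa"},
}

selector = {"type": "workload", "region": "asia", "env": "prod"}
matched = [name for name, labels in clusters.items() if match_labels(labels, selector)]
print(matched)  # only the production cluster in Asia matches all three labels
```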

We want to select the Asian Environment for Production (which is a specific Kubernetes cluster).

Choose prod asia only

The application set that selects this cluster is at 04-specific-cluster.yml.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: simple-apps/*
        - clusters:    
            selector:
              matchLabels:
                type: "workload"  
                region: "asia"    
                env: "prod"    

The labels we have defined in the application set map only to one cluster. Argo CD will look at this application set and find all clusters that have type=workload AND region=asia AND env=prod. 

Applying the file, you will see the following:

As expected, Argo CD deployed all the applications under the simple-apps folder only to the production cluster in Asia.

Scenario 5 – Different Kustomize overlays for the QA clusters

For simplicity, in all the previous examples, all our applications use the same configuration across all clusters. So even if our cluster generator selected multiple clusters, they all used the plain manifests we defined. 

While this approach can work for some trivial applications, you almost certainly want to use a different configuration per cluster. This can take the form of DNS names, database credentials, security controls, rate limiting settings, etc.


For our next example, we’ll use Kustomize overlays. For each application, we have the base configuration plus extra settings in overlays or Kustomize components.

We have covered Kustomize overlays in detail in the promotion article and explained how they work with Argo CD in our Application Set guide, so make sure you read those first if you’re not familiar with overlays.

For the cluster selector, we’ll choose the QA environment this time (which corresponds to 2 clusters).

The application set that selects the QA clusters and deploys applications with the respective configuration is at 05-my-qa-appset.yml.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: kustomize-apps/*/envs/qa  
        - clusters:    
            selector:
              matchLabels:
                type: "workload"    
                env: "qa"  

The Matrix generator selects all clusters that match the QA/Workload labels and applies only the applications that have a QA overlay.

Apply the file, and you see all QA deployments:

The important point here is that for each application, only the QA overlay is selected.

QA overlay selected

You can see in the Git repository that the “Invoices” application comes with configurations for all environments, but we appropriately employ only the QA one in our application set.

Scenario 6 – Different Kustomize settings for US and EU in production

There are many more examples we can show with this setup. Be sure to read the documentation of the cluster generator. One important point is that you can use the output of this generator as input to another generator.

As a final example with Kustomize, let’s see a scenario where we want to deploy our applications to Production Europe and Production US, but not in Asia.

Remember that, by default, cluster labels work in “AND” mode. So if we simply list “us” and “eu” as labels, Argo CD will try to find all clusters that have both labels at the same time. We don’t want this, as no cluster matches this description.

Also, unlike the previous example where we specifically asked for the “QA” overlay, now we want to choose the overlays that match whatever the cluster type/region is (either prod-us or prod-eu).


You can find the full application set at 06-my-prod-appset.yml.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - clusters:    
            selector:
              matchLabels:
                type: "workload"      
                env: "prod"
              matchExpressions:
              - key: region
                operator: In
                values:
                  - "eu"
                  - "us"        
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: 'kustomize-apps/*/envs/{{.name}}'

The first thing to show here is the matchExpressions block. This lets you choose clusters in an “OR” manner. We want all clusters that are either EU or US AND in production.

The second point is using the output of the cluster generator as input to the Git generator. The “{{.name}}” variable will render to the name of the cluster matched, forcing the Git generator to load the respective Kustomize overlay for each environment.
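The combined matchLabels/matchExpressions filtering can be modeled like this (again a hypothetical sketch of the selector semantics, not Argo CD code):

```python
def selects(labels, match_labels, match_expressions):
    # All matchLabels pairs must hold (AND)...
    if any(labels.get(k) != v for k, v in match_labels.items()):
        return False
    # ...and every matchExpressions entry must also hold; the "In" operator
    # accepts any one of its listed values (OR within a single expression)
    for expr in match_expressions:
        if expr["operator"] == "In" and labels.get(expr["key"]) not in expr["values"]:
            return False
    return True

# Hypothetical production clusters, one per region
clusters = {
    "prod-us":   {"type": "workload", "env": "prod", "region": "us"},
    "prod-eu":   {"type": "workload", "env": "prod", "region": "eu"},
    "prod-asia": {"type": "workload", "env": "prod", "region": "asia"},
}

exprs = [{"key": "region", "operator": "In", "values": ["eu", "us"]}]
matched = [name for name, labels in clusters.items()
           if selects(labels, {"type": "workload", "env": "prod"}, exprs)]
print(matched)  # prod-us and prod-eu match; prod-asia is filtered out
```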

Apply the file and you will see production deployment in EU and US but not in Asia:

And most importantly, you see that each server loads the configuration for its own region:

You should now understand how to select any combination of clusters and apply your exact choice of Kustomize overlays according to the “type” of each cluster.

Scenario 7 – A Helm hierarchy of values for the QA environment

Cluster labels can also work with your Helm charts and values.

As a starting example, let’s deploy our Helm charts to the two QA clusters using the same configuration for both.

You can find the full application set at 07-helm-qa-only.yml.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: charts/*
        - clusters:    
            selector:
              matchLabels:
                type: "workload"    
                env: "qa"  

The generator part of the file selects our charts and applies them to all clusters with the QA/workload label.

We have 2 example charts in the Git repository, so Argo CD created 4 applications for us (one per chart for each of the two QA clusters).

Scenario 8 – Different Helm values for the European environments

Like the Kustomize example, we want to make our examples more advanced and have different value files per environment. 

The same Git repository also contains a set of Helm values for each environment.

We have covered Helm value hierarchies and Argo CD applications in our Helm guide, so please read that guide first if you don’t know how to create your own value hierarchies.

Let’s deploy our Helm charts to all European clusters:

This time, however, we want to specifically load the European values only instead of all values.


You can find the full application set at 08-helm-eu.yml

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - clusters:    
            selector:
              matchLabels:
                type: "workload"      
                region: "eu"                        
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: charts/*  

The generator part is straightforward. It applies all charts to clusters with the EU/Workload labels. The smart selection of values happens in the “sources” section of the generated application:

sources:
  - repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
    path: '{{.path.path}}'
    targetRevision: HEAD
    helm:
      valueFiles:
      - '$my-values/values/{{index .path.segments 1}}/common-values.yaml'  
      - '$my-values/values/{{index .path.segments 1}}/app-version/{{index .metadata.labels "env"}}-values.yaml'                
      - '$my-values/values/{{index .path.segments 1}}/regions/eu-values.yaml'              
      - '$my-values/values/{{index .path.segments 1}}/envs/{{index .metadata.labels "env"}}-eu-values.yaml'                  
  - repoURL: 'https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git'
    targetRevision: HEAD
    ref: my-values

Here we apply the appropriate values according to: 

  1. The chart name (index .path.segments 1)
  2. The environment label that exists on the cluster (index .metadata.labels “env”)

In this example, you see how you can query the cluster itself for its own metadata.
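How these Go-template expressions resolve can be illustrated with a small Python model (the chart name “billing” and the cluster labels are hypothetical, and the `$my-values` ref prefix is omitted for brevity):

```python
# {{.path.segments}} is the matched Git directory split on "/";
# index 1 is therefore the chart name.
path_segments = "charts/billing".split("/")   # "billing" is a hypothetical chart name
chart = path_segments[1]
labels = {"env": "qa", "region": "eu"}        # labels read from the matched cluster secret

# The valueFiles templates render into concrete file paths for this cluster
value_files = [
    f"values/{chart}/common-values.yaml",
    f"values/{chart}/app-version/{labels['env']}-values.yaml",
    f"values/{chart}/regions/eu-values.yaml",
    f"values/{chart}/envs/{labels['env']}-eu-values.yaml",
]
print(value_files[-1])  # values/billing/envs/qa-eu-values.yaml
```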

If you apply this file, you see both charts deployed on the European clusters.

But most importantly, you see that each environment gets the correct values according to its type:

Notice that in both cases, we still have some common values that apply to both environments.

Scenario 9 – Different Helm values for all 3 Production regions

As a final example with Helm, let’s deploy to all production regions with the appropriate settings for each one.

We choose all 3 regions in our cluster generator.

You can find the full application set at 09-helm-prod.yml

Like before, we select all 3 regions in an “OR” manner and apply our charts.

spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  generators:
  - matrix:
      generators:
        - clusters:    
            selector:
              matchLabels:
                type: "workload"      
                env: "prod"
              matchExpressions:
              - key: region
                operator: In
                values:
                  - "eu"
                  - "us"  
                  - "asia"                      
        - git:
            repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
            revision: HEAD
            directories:
            - path: charts/* 

For each application, we query each cluster for its environment and region.

sources:
  - repoURL: https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git
    path: '{{.path.path}}'
    targetRevision: HEAD
    helm:
      valueFiles:
      - '$my-values/values/{{index .path.segments 1}}/common-values.yaml'  
      - '$my-values/values/{{index .path.segments 1}}/app-version/{{index .metadata.labels "env"}}-values.yaml'              
      - '$my-values/values/{{index .path.segments 1}}/env-type/{{index .metadata.labels "env"}}-values.yaml'  
      - '$my-values/values/{{index .path.segments 1}}/regions/{{index .metadata.labels "region"}}-values.yaml'              
      - '$my-values/values/{{index .path.segments 1}}/envs/{{index .metadata.labels "env"}}-{{index .metadata.labels "region"}}-values.yaml'                  
  - repoURL: 'https://github.com/kostis-codefresh/multi-app-multi-value-argocd.git'
    targetRevision: HEAD
    ref: my-values

All charts are now deployed in all regions:

You can also verify that each environment picks the correct settings from the value hierarchy:

You have now seen how to apply value hierarchies with Application Sets and cluster labels.

Day 2 operations

We now reach the most important point of this guide. We’ve seen how cluster labels let you define exactly what goes into which cluster. You might be wondering why this is the recommended solution and how it’s better than other approaches you’ve seen.

The answer is that with cluster labels, you treat your Application Sets as “create-and-forget” resources. After the initial setup, you shouldn’t need to touch your Application Sets at all. Maintenance effort drops to zero, which is the OPTIMAL outcome when evaluating any architecture decision.

Let’s see some semi-realistic scenarios of using our recommendation in a real organization.

Imagine you just organized all your application sets with cluster labels. All files are committed in Git, and all applications are successfully deployed. Everything runs smoothly.

Scenario A – Removing a server

On Monday, you need to decommission the US/prod server. You remove the “us” and “prod” labels from the cluster. On the next sync, the cluster generator of all related application sets doesn’t pick it up, and nothing gets deployed there. You don’t really care how many application sets touched this cluster. They will all stop deploying there automatically.


Changes you had to do in your Application Sets: ZERO

Scenario B – Deploying a new application

On Tuesday, a developer wants a new application in the QA environment. You commit a new overlay for QA configuration for that app. All QA application sets pick it up in the next sync and deploy it to any/all clusters that deal with QA. You don’t really care how many application sets affect QA or how many clusters are contained in QA. They will all get the new application in the same manner.


Changes you had to do in your Application Sets: ZERO

Scenario C – Adding a new Cluster

On Wednesday, you need to add a new cluster to replace the decommissioned one. You create the new cluster with Terraform/Pulumi/Crossplane/whatever and just assign it the appropriate labels (“us”, “prod”, “workload”, “aws”). All respective Application Sets see the new cluster in the next sync and deploy whatever needs to be deployed there. You don’t really care how many Application Sets touch this cluster. The cluster will get the exact same applications as it had before.


Changes you had to do in your Application Sets: ZERO

Scenario D – Copying an application

On Thursday, a developer says that a specific application that exists in staging also needs to go to QA.

You copy the staging overlay to a QA overlay for this application and ask the developer about the correct settings. In the next sync, all the QA Application Sets pick it up and deploy it. The developer doesn’t need to know anything about application sets or cluster labels. In fact, they could just do this deployment on their own if they had access to the Kustomize overlays.


Changes you had to do in your Application Sets: ZERO

Scenario E – Central cluster change

On Friday, you’re told that ALL your clusters now need sealed-secrets installed. You add a new configuration for sealed-secrets in your “common” folder and commit it to Git. The “common” Application Set (which applies to all clusters) then picks it up and applies it everywhere.


Changes you had to do in your Application Sets: ZERO

Essentially, the Application Sets only need to change when you add another dimension to your clusters (i.e., new labels) for something that was not anticipated. If you did a proper evaluation in the beginning and communicated to all parties how the clusters are going to be used, this scenario won’t happen very often. For daily operations, the Application Sets just sit in the Git repository without anybody (operators or developers) having to change them at all.

The other big advantage of cluster labels is that they work the same regardless of how many clusters you have. Application Sets that work with labels update automatically on their own, whether they manage 1, 10, or 100s of clusters connected to the central Argo CD instance.
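Such a label-based Application Set looks roughly like this (the repository URL, paths, label keys, and app name are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: us-prod-my-app
  namespace: argocd
spec:
  generators:
    - clusters:                 # selects every cluster Secret carrying these labels
        selector:
          matchLabels:
            env: prod
            region: us
  template:
    metadata:
      name: 'my-app-{{name}}'   # {{name}} is the matched cluster's name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deployments.git
        targetRevision: HEAD
        path: apps/my-app/overlays/prod
      destination:
        server: '{{server}}'    # {{server}} is the matched cluster's API endpoint
        namespace: my-app
```

Note that `matchLabels` gives “AND” semantics (all listed labels must match), while `matchExpressions` with the `In` operator gives “OR” semantics for the values of a single key.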
Let’s compare those same scenarios with the approach we do NOT recommend, where Application Sets explicitly enable/disable components/apps on each specific cluster.

Do not do this
- merge:
      mergeKeys:
        - app
      generators:
        - list:
            elements:
              - app: external-dns
                appPath: infra/helm-charts/external-dns
                namespace: dns
              - app: argocd
                appPath: infra/helm-charts/argocd
                namespace: argocd
              - app: external-secrets
                appPath: infra/helm-charts/external-secrets
                namespace: external-secrets
              - app: kyverno
                appPath: infra/helm-charts/kyverno
                namespace: kyverno
        - list:
            elements:
              - app: external-dns
                enabled: "true"
              - app: argocd
                enabled: "true"
              - app: external-secrets
                enabled: "false"
              - app: kyverno
                enabled: "true"
  selector:
    matchLabels:
      enabled: "true"

What actions do you need for each scenario?

  • Scenario A – Removing a server
    1. You first need to locate all Application Sets that “choose” this cluster.
    2. You then need to edit all those Application Sets and “disable” all the components they contain.
    3. You need to commit and sync all changes.
    4. There’s a risk that you either forgot an Application Set or forgot to “disable” a line.
    5. The more clusters you have, the more complex the process.
  • Scenario B – Deploying a new application
    1. You first need to locate all Application Sets that choose the clusters that need this application.
    2. You need to edit all those Application Sets and add a new line for this application.
    3. You need to commit and sync all changes.
    4. There’s a risk that you either forgot an Application Set or forgot to add a line for the new application.
    5. The more clusters you have, the more complex the process.
  • Scenario C – Adding a new Cluster
    1. You need to understand how this cluster is “similar” to other clusters.
    2. You either need to create a new Application Set for this cluster or locate all Application Sets that touch it.
    3. You need to add all the new lines of enabled/disabled components for this cluster.
    4. There’s a risk that you either forgot an Application Set or forgot a line for one of the components.
  • Scenario D – Copying an application
    1. You first need to locate all Application Sets that choose the new clusters for this application.
    2. You need to edit all those Application Sets, locate the line for this application, and change it to “enabled”.
    3. You need to commit and sync all changes.
    4. There’s a risk that you either forgot an Application Set or forgot to “enable” the component.
    5. The more clusters you have, the more complex the process.
  • Scenario E – Central cluster change
    1. You first need to locate all the Application Sets that you manage.
    2. You need to edit all those Application Sets and add a new line for this common application.
    3. You need to commit and sync all changes.
    4. There’s a risk that you either forgot an Application Set or forgot to add a line for the new application.
    5. The more clusters you have, the more complex the process.

It shouldn’t be a surprise that having snowflake clusters where you must enable/disable each application individually is a much more complex process than working with cluster groups identified by labels.

Developers and self-service

At the start of this guide, we talked about effective communication with developers. Another major reason that makes cluster labels the optimal solution is that they’re fully automated. At each sync, the Argo CD cluster generator detects which clusters have the appropriate labels and does whatever needs to be done (deploy or undeploy an application).

Your developers don’t need to know anything about cluster labels. In fact, they don’t even need to know about Application Sets. Developers can work with a Git repository that holds standard Helm charts/Kustomize overlays/plain manifests, and their instructions are super simple:

  • If they add a new overlay in the QA folder, that application will be deployed to the QA environments, regardless of the number of clusters.
  • If they delete an overlay, that application will be undeployed.
  • If they want a brand new application, they can just commit the new overlays or Helm values in a specific folder, and Argo CD will pick it up.

Developers can work on their own without opening any tickets or waiting for you to do something for them. 
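One way to wire up this self-service flow (a sketch, assuming a repository layout with per-environment overlay folders; all names and URLs are illustrative) is a matrix generator that pairs the QA clusters with the Git directories developers commit to:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: qa-apps
  namespace: argocd
spec:
  generators:
    - matrix:
        generators:
          - clusters:                     # every cluster labeled as QA
              selector:
                matchLabels:
                  env: qa
          - git:                          # every app that has a QA overlay
              repoURL: https://github.com/example/deployments.git
              revision: HEAD
              directories:
                - path: 'apps/*/overlays/qa'
  template:
    metadata:
      name: '{{path[1]}}-{{name}}'        # app name + cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deployments.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: '{{server}}'
        namespace: '{{path[1]}}'
```

With a setup like this, adding an overlay under `apps/<app>/overlays/qa` deploys the app to all QA clusters on the next sync, and deleting the folder undeploys it. Developers never touch the Application Set itself.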

This is in complete contrast to the anti-pattern we explained above, where you manually enable/disable applications in each specific cluster. If you follow that approach, all actions become a two-step process:

  1. The developer adds their overlay or Helm values in a Git repository.
  2. Then you MUST go to all your Application Sets and manually “enable” the new application.

Forcing developers to wait for you before they can deploy their applications is the fastest way to create bottlenecks in your organization.

There is no reason for this complexity when cluster labels offer a much better alternative.

Conclusion

In this guide, we explained in detail how to create cluster groups with Argo CD using custom labels. We have also seen:

  • How to use the cluster generator to select clusters in an “AND” or “OR” fashion.
  • How to deploy applications to multiple clusters using the same configuration.
  • How to deploy Kustomize applications with different overlays per cluster.
  • How to deploy Helm applications with different value sets per cluster.
  • How to perform common day-2 operations with many Argo CD clusters.
  • Why our recommended approach is the optimal one, as the number of clusters and developers grows in your organization.

You can find all Application Sets and manifests at https://github.com/kostis-codefresh/multi-app-multi-value-argocd.

Happy labeling 🙂

The post Distribute Your Argo CD Applications to Different Kubernetes Clusters Using Application Sets appeared first on Codefresh.

]]>
https://codefresh.io/blog/argocd-clusters-labels-with-apps/feed/ 9
Why Environments Beat Clusters For Dev Experience https://codefresh.io/blog/why-environments-beat-clusters-for-dev-experience/ https://codefresh.io/blog/why-environments-beat-clusters-for-dev-experience/#respond Mon, 02 Jun 2025 08:41:54 +0000 https://codefresh.io/?p=17000 The cloud ecosystem has reached a turning point. Tools for operators/administrators are now mature and can handle most day-to-day operations that deal with Kubernetes clusters. Finally, we can turn our focus to application developers and their needs. If you look at all the Kubernetes tools available, you’ll understand that most of them treat Kubernetes as […]

The post Why Environments Beat Clusters For Dev Experience appeared first on Codefresh.

]]>
The cloud ecosystem has reached a turning point. Tools for operators/administrators are now mature and can handle most day-to-day operations that deal with Kubernetes clusters. Finally, we can turn our focus to application developers and their needs.

If you look at all the Kubernetes tools available, you’ll understand that most of them treat Kubernetes as another form of infrastructure. You can easily find tools that install Kubernetes, monitor Kubernetes, secure Kubernetes, do cost estimations for Kubernetes, etc. But how many Kubernetes tools can you find that target application developers and their day-to-day responsibilities?

Several companies even try to hide Kubernetes completely from developers by using leaky abstractions or so-called developer portals. These adoption efforts almost always fail simply because nobody asked the developers what they really need. Don’t fall into this trap. 

In this article, we see some common examples of what companies “think” about developers’ needs versus what developers need in practice, in the context of application development for Kubernetes.

Confusing clusters with environments

When designing a deployment workflow, most teams center the discussion around individual clusters. You often hear people talk about direct mappings between clusters and “environments”.

  • “This is our production cluster.”
  • “Our staging cluster is down.”
  • “We need a new cluster for QA.”

This forces all Kubernetes tools to expose clusters as first-level constructs in their user interface (UI). If your team has created any kind of dashboard for Kubernetes, I can bet that the left-panel navigation contains a “cluster” entry where people can look at individual clusters.

In reality, developers never care about individual clusters. Most times, they care about: 

  • Cluster groups that behave the same (e.g., prod-us, prod-asia, prod-eu).
  • A set of namespaces within a big cluster (most common in shared qa/staging clusters).
  • A combination of the above.

Developers have a different mindset:

  • “I am ready to ship this feature to production.”
  • “Oh, my new feature is failing in the QA environment.”
  • “That is strange, application 1.23 works ok in QA, but presents an error in the staging environment.” 

So, how many clusters are in production, QA, or staging? It does NOT matter to developers. Developers only care about environment settings and, more specifically, what differs between environment configurations.

This means that if your internal portal/developer platform looks like this:

You need to redesign it like this:

Let me repeat that again. Developers do NOT care about individual Kubernetes clusters. They mostly care about the different settings between the cluster groups that represent each “environment”.

Promotions are more important than deployments

We’ve established that developers prefer thinking about environments and not individual clusters. Let’s see another common misconception with tools that target developers. If you look at the most typical scenario of how a feature reaches production, this is the process:

  1. A developer performs an initial deployment in the first environment—let’s call it QA environment.
  2. After passing several tests and reviews, the feature gets promoted to the staging environment.
  3. After passing several tests, the feature gets promoted to production.
  4. Depending on the company, there might be several other intermediate environments where promotions happen (e.g., load testing).

Several tools promise to simplify deployments for Kubernetes developers. It turns out that developers are actually interested in promotions. There are 3 main reasons for that:

  1. A deployment (where code is packaged in a brand new artifact) happens only once, in the first environment. In all subsequent environments, developers want to promote an existing image/configuration/release/artifact, not deploy anything from scratch.
  2. Problems with promotions have more impact. Promoting to production is always a risky process. Deploying to QA is not.
  3. Promotions can often fail due to external configuration not directly controlled by developers.

The last point cannot be overstated. One of the most common cases for failed deployments in production is the difference in environment configuration. Developers dutifully tested their application in QA and staging, and everything worked fine. Then the application failed to deploy in production because of an unexpected change in production settings completely outside the scope of the application container.

Continuing our wireframe from an imaginary developer portal, most teams think that developers need this:

The UI is problematic for many reasons:

  1. Developers can deploy an older version to any environment by mistake.
  2. Developers need to manually correlate versions between environments and understand what was in the next/previous environment.
  3. Most times, versions correspond to Docker image versions that exclude external configuration.

But as explained already, developers just want to promote. So they would prefer this:

Notice that this makes a developer’s job very easy. There are also several guardrails in place. At a minimum, the production drop-down can ONLY promote what is currently in staging and nothing else (or maybe the last 3 versions that are available in staging). You can also perform checks and present warnings in several scenarios (for example, if a developer tries to promote something to production that is clearly failing in staging).

Developers don’t care about Git hashes

Git hashes are great when it comes to source code operations. When developers perform basic merges, cherry-picks, and rebase scenarios, they do care about Git hashes. But when it comes to deployments, Git hashes mean nothing:

  • Git hashes have no inherent order. You cannot look at 2 hashes and tell which one is newer or older.
  • Git hashes can only capture the state of a source code snapshot or the specific commit for Kubernetes manifests, but never both at the same time.
  • They allow for mistakes when it comes to deployments, especially when developers have to copy/paste hashes between different tools.

What developers care about instead is software versions. Version numbers are simple to read, simple to reason, and simple to understand.

Products and dashboards that expose Git hashes as a central concept are cumbersome and difficult to use. It’s ok if Git hashes are an additional piece of information for a deployment, but they should never be used for direct promotions or other day-to-day operations.

What developers really want to see are versions. Or, at least some kind of numbering where ordering is easy to understand.

So, where do these versions come from?

Here we reach one of the biggest misconceptions about container images for Kubernetes. Several tools (and dashboards) simply use a special tag with a version number on a container that gets “promoted” from each environment to the next. Using a container image version for the promotion process might make sense at first glance.

However, this approach misses 2 facts completely:

  • Docker tags are mutable by default. Just because you see a container called my-image:v1.0 in one environment and another container my-image:v1.0 in another environment, that doesn’t mean they’re the same application. They might be completely different.
  • Used as an application version, the container tag only works in 80% of cases. Developers sometimes need to promote a container image AND associated configuration (like configmaps and secrets). In those cases, the container tag is NOT enough to understand what gets promoted and where.

You can partially solve the first problem with tooling support. You need to instruct ALL tools that take part in the software lifecycle to treat container tags as immutable. This is also a very basic requirement for any package registry that your organization is using.
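Enforcement is easiest at the registry itself when the registry supports it. For example, AWS ECR exposes a repository-level setting, shown here as a CloudFormation sketch (repository name illustrative):

```yaml
Resources:
  AppRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: my-app
      ImageTagMutability: IMMUTABLE   # pushing an existing tag again is rejected
```

With this in place, a tag like my-image:v1.0 can only ever point to one image digest.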

You can solve the second problem with Helm charts. Helm charts give you access to 2 additional “version” properties (in addition to your container tag). You can annotate a Helm chart with an “application” version and also have a different version for the chart itself. This way, when you promote a Helm chart, you can promote a container image PLUS additional configuration in a single step.
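For reference, the two extra version fields live in the chart’s `Chart.yaml` (values illustrative):

```yaml
apiVersion: v2
name: my-app
description: An example application chart
version: 1.4.0        # chart version: bumped when templates or bundled config change
appVersion: "2.3.1"   # application version: typically the container image release
```

Promoting chart version 1.4.0 therefore moves the container image and its configuration together as one unit.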

However, not all organizations use Helm charts. Several teams prefer to use Kustomize or even plain manifests. In this case, the versioning problem still exists.

Unfortunately, most tools assume that developers only care about container images and center their whole interface around image tags.

Developers would prefer a system that lets them promote configuration and container tags at the same time. This would cover 100% of their needs and cater even to edge case scenarios where they only promote configuration, while the container image stays the same.

Tools that expose Git hashes and assume developers only work with container tags completely miss the way Kubernetes applications work.

Don’t abuse pipelines as promotion mechanisms

Many developers see Continuous Delivery (CD) as the next evolution of Continuous Integration (CI). After all, before you deploy a container image, you need to build and test it first.

Several teams that switch to cloud-native development make their first step in CD by abusing their existing CI pipelines. Developers love to see a single pipeline that shows the whole picture for a specific feature, from the initial code commit all the way through to production.

The main problem here is that the typical CI pipeline only knows what is happening WHILE it’s running. After it finishes, it has no visibility into the actual cluster.

This leads to the classic problem of failed deployments in the following manner:

  1. A developer commits a new feature (or merges something in a branch).
  2. The CI pipeline starts and then builds/tests a container image with success.
  3. The CI pipeline deploys the image to a Kubernetes cluster and optionally runs some checks.
  4. Everything looks good, and the pipeline shows its status as “green”.
  5. Ten minutes later, the application has issues (memory leaks, wrong dependencies, missing DB).
  6. Developers get paged about a failed deployment, even though the pipeline STILL shows as green.

Your organization probably falls into this trap if developer teams always talk about “lack of deployment visibility”, “wasting time to troubleshoot deployments”, “not enough production access”, and similar complaints. 

In reality, developers look at the CI system for the basic build. Then they need to go to another system (usually a metrics/monitoring solution) to understand what’s happening with their application.

The key takeaway here is that instead of a basic CI pipeline, you need a system that gives developers real-time information about deployments and promotions. In that system, the “green” status means that the application is healthy RIGHT NOW, not 5 minutes before.

Now developers can use a single interface for deploying/promoting, and understanding if the application is healthy. They can go to their metrics solution when things go wrong, but in the happy path scenario, a single system can tell them if the application is successfully running in a Kubernetes cluster (or a specific environment, as we saw earlier in the article).

Stop deploying and start promoting

You should now understand what developers actually need and why existing solutions aren’t designed with Kubernetes/GitOps in mind. There are several initiatives right now for investing in developer portals in big organizations, and, unfortunately, developers don’t always have a say about exactly what they need from an internal platform.

The next question is whether you actually need to create a platform like this from scratch. You might think that Argo CD is a solution that helps developers with Kubernetes deployments and that simply adopting Argo CD will make developers happy. In reality, Argo CD is a great sync engine, but doesn’t try to solve changes between applications, promotions, or environments. For example, Argo CD doesn’t have the concept of environments, instead only operating with individual clusters.
This is why we extended Argo CD to create Codefresh GitOps Cloud. We looked at existing solutions and understood that developer experience is always an afterthought, even in newer platforms that are supposedly designed with Kubernetes in mind.

Codefresh GitOps Cloud implements all the best patterns we explained that developers need:

  1. It works with environments instead of individual clusters. Each environment can be one cluster, a set of clusters, a set of namespaces, application labels, or any combination of those.
  2. It’s based around promotions. Developers can easily understand what’s different between 2 environments and how to move an application from one environment to the next.
  3. Git hashes are there if you need them. The central construct, however, is products and their versions. A product is a new entity that includes an application along with its configuration and its container images. When you promote from one environment to another, you promote the whole application and not individual container images.
  4. The graphical dashboard always shows real-time information. When you see a “green” checkmark, it means that an application is running successfully in an environment right now. Developers can detect right away what deployment was successful and what failed without going to another system.
  5. Argo CD and its amazing sync engine power everything behind the scenes.

You can use Codefresh GitOps Cloud today along with your existing CI system (like Jenkins or GitHub Actions). GitOps Cloud doesn’t replace your CI solution. It makes developers happy by giving them a dedicated platform for application promotions, which is what developers really need (instead of plain deployments).

Oh, one more thing. If you already have your own Argo CD instance, you can bring it along.

Ready to start your GitOps journey with Codefresh? Try GitOps Cloud free for 45 days now.

The post Why Environments Beat Clusters For Dev Experience appeared first on Codefresh.

]]>
https://codefresh.io/blog/why-environments-beat-clusters-for-dev-experience/feed/ 0
Combine the Codefresh GitOps Cloud with your existing Argo CD instance https://codefresh.io/blog/bring-your-own-argocd/ https://codefresh.io/blog/bring-your-own-argocd/#respond Wed, 30 Apr 2025 08:16:16 +0000 https://codefresh.io/?p=16968 We recently announced the new Codefresh GitOps Cloud, the easiest way to promote changes across Argo CD applications–even across different clusters. With Codefresh GitOps Cloud, you can model your own promotion flow with a graphical editor (although YAML is still available). You define exactly how an application reaches production, including all the requirements and approval […]

The post Combine the Codefresh GitOps Cloud with your existing Argo CD instance appeared first on Codefresh.

]]>
We recently announced the new Codefresh GitOps Cloud, the easiest way to promote changes across Argo CD applications–even across different clusters.

With Codefresh GitOps Cloud, you can model your own promotion flow with a graphical editor (although YAML is still available). You define exactly how an application reaches production, including all the requirements and approval gates your organization needs.

With Codefresh, environment information is modelled in the platform itself. An environment can be any cluster, any namespace in a cluster, or any combination of the two. This means that for Codefresh, environments are first-level constructs, not just naming conventions.

At the same time, you can enrich your applications with extra information, like source code features, JIRA tickets, unit testing results, and other aspects of the software lifecycle that aren’t normally known to Argo CD.

After speaking with several teams about our new offering, we realized the most exciting feature for many folks is the new capability of connecting your existing Argo CD instance to the Codefresh platform.

We briefly described in the announcement blog post that you can now keep your existing Argo CD installation and still get all the benefits of the GitOps Cloud. We didn’t explain, however, how the integration works under the hood and why it helps teams that have already progressed in their GitOps journey. In this post, we share the technical details on how Codefresh interacts with your own Argo CD instance.

The user experience

It all starts with the installation instructions.

Here, you verify that you have all the requirements for installing the Codefresh GitOps runtime. The wizard also provides you with a full Helm command that you can run in your own Kubernetes cluster to get everything up and running. It’s also possible to use Terraform or any other similar method to install your Codefresh runtime.

Note that the traditional method (installing all Argo projects from scratch) is still available. Our documentation page has more information, including all the requirements and limitations for both options.

So, how does this work? What does Codefresh install in your cluster?

Architecture of the GitOps runtime

The image below shows all the components of the GitOps runtime. These include the other 3 Argo projects (Rollouts, Events, Workflows) and a set of custom components that contain part of the Codefresh control plane.

You can find a full description of what all these components do on our runtime documentation page.

Notice that Argo Rollouts is special. You also need to run it on each cluster that you wish to deploy applications to (if you follow progressive delivery). 

Because our users operate in a wide variety of network environments, we support 2 communication modes for the runtime: 

  • Outbound-only communication is more secure and easier to install, but might be less performant (because the runtime polls the Codefresh control plane).
  • Inbound communication is more complex to set up as you need to open firewall ports in your infrastructure, but can offer performance benefits. 

The Codefresh runtime supports both ways! You can install the Codefresh runtime and expose it via standard Kubernetes ingress or set up a tunnel-based solution that needs no exposed ports.

This means you can decide what’s best for your organization based on your risk requirements versus ease of installation.

Below is the updated architecture diagram with a tunnel-based solution.

Now you can clearly see the direction of traffic. All network streams start from your own cluster and point to the Codefresh control plane. This means you can easily install the GitOps runtime in clusters that are behind a firewall and not accessible on the public internet.

Here’s the architecture diagram for an ingress-based installation.

Notice again the direction of traffic and, more specifically, the arrow that joins the “GitOps Client(UI)” with the GitOps runtime.

Communication with your own Argo CD instance

We also need to explain how the GitOps runtime retrieves info from your Argo CD instance. As we explained in the announcement, Codefresh GitOps is not just a wrapper over Argo CD. It adds more features and defines all the missing pieces (like environments and promotions) that teams require when they move to GitOps-based CD.

The Codefresh platform needs to know what your Argo CD instance is doing. This information is retrieved by standard Kubernetes events.

When you want to deploy a new version of your application (or promote it via the Codefresh UI), a commit happens in Git. Argo CD then syncs the changes and publishes the details in the Kubernetes API. 

The event reporter component of the GitOps runtime subscribes to the same Kubernetes API and retrieves all changes that happened to the Argo CD application. The reporter also asks the Argo CD server for the live manifests, since both reside on the same cluster.

All this information is then forwarded to the Codefresh control plane and is accessible to the Codefresh dashboards.

Now you know exactly how that GitOps runtime interacts with your own Argo CD instance. As you can see, there’s nothing you need to change in your existing Argo CD installation.

Conclusion

We hope you now understand better how Codefresh GitOps Cloud works, and, more specifically:

  • The requirements
  • What gets installed in your cluster
  • The role of the Codefresh control plane
  • How network traffic flows between all components
  • What data stays within your premises, and what’s accessed by the Codefresh platform

Ready to start your GitOps journey with Codefresh? Try GitOps Cloud free for 45 days now.

The post Combine the Codefresh GitOps Cloud with your existing Argo CD instance appeared first on Codefresh.

]]>
https://codefresh.io/blog/bring-your-own-argocd/feed/ 0
Introducing Codefresh GitOps Cloud – Seamless environment promotions across clusters using your existing Argo CD https://codefresh.io/blog/introducing-codefresh-gitops-cloud/ https://codefresh.io/blog/introducing-codefresh-gitops-cloud/#respond Sun, 30 Mar 2025 23:48:40 +0000 https://codefresh.io/?p=16872 We’re excited to announce our new Codefresh GitOps Cloud offering that lets you bring your GitOps deployments to the next level. You get:  In this blog post, we’ll dive into some of the key features of Codefresh GitOps Cloud. Multi-environment application promotions with Argo CD As more organizations adopt Argo CD, platform engineers need to […]

The post Introducing Codefresh GitOps Cloud – Seamless environment promotions across clusters using your existing Argo CD appeared first on Codefresh.

]]>
We’re excited to announce our new Codefresh GitOps Cloud offering that lets you bring your GitOps deployments to the next level. You get: 

  • Multi-environment application promotions with Argo CD, including:
    • Full visibility for application engineers
    • GitHub checks to keep everyone informed 
    • Smart concurrency settings to keep changes flowing
    • Complex promotion modeling with workflow hooks and actions
  • The ability to bring your own Argo CD instance or install our Codefresh runtime
  • Straightforward pricing
  • A roadmap of features to come

In this blog post, we’ll dive into some of the key features of Codefresh GitOps Cloud.

Multi-environment application promotions with Argo CD

As more organizations adopt Argo CD, platform engineers need to work out how to implement a promotion workflow between different environments. Developers need to gradually move their application between several pre-production environments to safely test it and confidently deploy it to production. 

If you’re already familiar with Argo CD, you’ll know that Argo CD doesn’t have concepts for environments or promotions. While Argo CD is a great sync engine for deploying application manifests to Kubernetes clusters, it doesn’t understand how a single application gets promoted between environments. This forces many teams to abuse CI tools for environment promotions and resort to custom scripts.

These custom promotion scripts are hard to maintain, difficult to debug, and, most importantly, require time and effort that your organization could spend creating features instead. The problem is even more evident if you have a large number of applications and have adopted microservices across several different Kubernetes clusters. Adding a new environment often needs updates in several places in these scripts. Application engineers can never really self-serve.

But there is a better way: with GitOps Cloud. Connect your existing Argo CD instances to start managing environment promotion without all the custom scripting. With GitOps Cloud, you can compare and promote changes between Argo CD apps, Argo CD instances, and clusters. Promotion workflows can take into account the different requirements of different environments while providing a simple dashboard for at-a-glance visibility, no matter where your apps and environments are deployed.

Codefresh GitOps Cloud makes it easy for application teams to gain full visibility, keeps everyone informed, and keeps changes flowing. You can also model complex workflows with pre/post hooks. Here are some of the features of environment promotion with Codefresh.

Full visibility for application engineers

Unlike using Argo CD on its own, Codefresh GitOps Cloud understands your environments, including different clusters, different namespaces, or any other combination of them. This gives you full visibility on how the same application advances through different environments, even across several Kubernetes clusters and Argo CD installations.

Codefresh makes environment promotion so easy that you can even promote with simple drag-and-drop actions. 

We know every organization has different promotion workflows. You can model your exact promotion process with the workflow editor and then enforce it across your applications as a golden path for your developers with a Kubernetes CRD. Create any kind of serial or parallel flow with a friendly graphical editor, or directly in YAML if you prefer. 
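To make the idea concrete, here is a sketch of what a promotion process expressed as a Kubernetes CRD could look like. The `kind`, API group, and field names below are hypothetical and chosen only for readability; they are not the actual Codefresh schema, so consult the product documentation for the real resource definitions.

```yaml
# Hypothetical sketch of a promotion flow as a Kubernetes custom resource.
# Field names and kinds are illustrative, not the actual Codefresh schema.
apiVersion: codefresh.io/v1alpha1
kind: PromotionFlow
metadata:
  name: my-app-flow
spec:
  trigger:
    environment: dev          # promotions start when the dev environment changes
  environments:
    - name: qa
      dependsOn: [dev]
    - name: staging
      dependsOn: [qa]
    - name: production
      dependsOn: [staging]    # a serial flow; parallel branches are also possible
```

Because the flow lives in a declarative resource like this, it can be enforced as a golden path across many applications instead of being re-implemented in per-team scripts.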

Following a single application as it moves through environments is not enough if you want to scale your Argo CD adoption. You also need a way to trace your applications back to individual changes.

When something breaks down, the first question is always, “What was the last change deployed?” You can answer this question in seconds using the timeline view in Codefresh GitOps Cloud.

The timeline view connects your Argo CD applications with the source code features they contain. This lets you understand exactly what’s deployed in each environment on a feature level. This dashboard solves a major limitation of Argo CD: the lack of visibility into the CI process. Argo CD only knows the version of a container image and nothing else.

For more information on how Codefresh models GitOps environments, please read our dedicated blog posts about:

  • The grouping of different applications into products

Keep everyone informed with GitHub checks

We know application engineers want to avoid context-switching between multiple tools to stay in the flow. With GitHub checks, developers can see what happened with their promotion right where the action happens—on a pull request (PR).

This means that in simple scenarios, developers don’t even need to visit the Codefresh promotion dashboard. Assuming the promotion finishes successfully, they can use the familiar GitHub UI to understand what happened to their deployment.

Keep changes flowing with smart concurrency settings

Frequent deployments are a key characteristic of fast-moving organizations. Several customers have told us that having frequent commits on a big monorepo that holds Argo CD manifests becomes challenging after scaling to a certain size.

The main issue is that Argo CD only synchronizes Kubernetes resources according to Git contents, without any insight into how important or “fresh” a commit is.

Now, in Codefresh GitOps Cloud, you can explicitly define what happens when too many commits land at the same time and a promotion starts while another is already underway.

You choose whether to terminate the previous promotion deployments (keeping only the latest one) or force everything into a queue for a more gradual promotion process without gaps.
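The two behaviors described above could be captured as a concurrency policy on the promotion flow. The field names in this fragment are illustrative only, not the actual Codefresh configuration keys:

```yaml
# Hypothetical concurrency settings on a promotion flow (illustrative fields).
spec:
  concurrency:
    # "terminate": cancel the in-flight promotion and keep only the latest commit
    # "queue":     run promotions one after another so no change is skipped
    policy: terminate
```

Which policy fits depends on your environments: terminating is useful for fast-moving pre-production clusters, while queuing preserves every change for environments that need a full audit trail.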

This feature is a game changer for teams that have adopted GitOps in all their environments and currently have to “gate” developers in the CI system instead of handling this at the CD level.

Model complex promotions with workflow hooks and actions

In several scenarios, you need to accompany your promotions with other supporting tasks that aren’t directly modeled as Kubernetes resources. Some examples are:

  1. Sending a Slack notification when a promotion starts or finishes
  2. Updating a ticket when a feature reaches a specific environment
  3. Waiting for an approval before moving to the next step of the promotion workflow
  4. Performing a verification check before a lengthy migration action

Codefresh promotion workflows let you define several checks before and after a promotion happens. These checks can be anything you want, from simple smoke tests to preflight verifications to comprehensive load testing suites.

A new feature available in Codefresh GitOps Cloud is promotion hooks.

Like traditional pipelines, promotion hooks let you define extra events or requirements that happen when a promotion starts, ends, or fails. For example, you could set up a hook to send a Slack message if a promotion fails, or a webhook call to your Grafana instance when a deployment starts.

Promotion hooks are fully integrated with the promotion flow dashboards. This means they’re saved as YAML (the GitOps way) and you can edit them using a friendly graphical interface.
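Since hooks are saved as YAML, a failure notification like the Slack example above might be sketched roughly as follows. The resource kind, fields, and templating syntax here are hypothetical, and the webhook URL is a placeholder:

```yaml
# Hypothetical promotion hook (illustrative, not the actual Codefresh schema).
apiVersion: codefresh.io/v1alpha1
kind: PromotionHook
metadata:
  name: notify-on-failure
spec:
  on: failure                  # fire only when a promotion fails
  action:
    webhook:
      url: https://hooks.slack.com/services/EXAMPLE   # placeholder endpoint
      body: "Promotion of {{ .app }} to {{ .environment }} failed"
```

Storing hooks this way keeps them versioned alongside the rest of your GitOps configuration, while the graphical editor remains available for teams that prefer not to write YAML by hand.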

Runs on your existing Argo CD infrastructure

For several years, we’ve shipped the Codefresh runtime, which packages all four Argo projects. But some customers have struggled with the migration effort required to get full value from the Codefresh platform: migrating from custom Argo configurations demanded extensive testing to ensure nothing broke.

We’ve been working to overcome this challenge, and we’re excited to provide the ability to connect existing Argo CD infrastructure to Codefresh GitOps Cloud. In this new mode, you can connect your existing Argo CD instance to the Codefresh platform and still get access to all the groundbreaking features, such as environment promotions and workflow hooks. The agent lets you bring your own Argo CD instance and works in a true plug-and-play mode. You install it in minutes, and if you change your mind, you can remove it without affecting your existing Argo CD instance.

You can still choose the Codefresh runtime to manage all Argo services in one bundle—Argo CD, Rollouts, Workflows, and Events. We’ve made it even easier to deploy with a new installation process that reduces the effort required.

These options let you decide what’s best for you—simplified infrastructure management, or easier adoption on top of existing Argo CD infrastructure.

Straightforward pricing

We believe your team shouldn’t have to maintain custom scripts that are difficult to update and debug. You can start using Codefresh GitOps Cloud today with your existing Argo CD instances, starting at $4,170/year. If you purchase by June 1, 2025, you also get 3 bundles for our GitOps training and certification to further enhance your Argo CD experience.

You can try GitOps Cloud for free for 45 days to see how easy it is to solve environment promotion once and for all, while keeping your existing Argo CD investment.

What’s coming next

We know environment promotion is one of the biggest challenges when adopting GitOps. We’re directly involved with the GitOps working group and are active maintainers of Argo CD, leading releases, security patches, and more. This means we see firsthand how teams struggle with promotions and, more specifically, with applications that span different clusters. It’s why we’ve implemented a number of features to help make environment promotion easier for teams.

But environment promotion is just the beginning. We want to make adopting GitOps easy for teams of all sizes. We’re already working on the next major feature for Codefresh GitOps Cloud.

Ready to start your GitOps journey with Codefresh? Try GitOps Cloud free for 45 days now.

You can also join us for a live webinar on April 9 to learn how Codefresh GitOps Cloud simplifies and accelerates multi-environment application promotions using Argo CD. You’ll see how teams can connect multiple Argo CD instances to a single control plane—no extra software required—and get a sneak peek at what’s coming next. Don’t miss this chance to see GitOps Cloud in action and get your questions answered by the product team. Register for the webinar.
