
fix(kubernetes): use full FQDN for kubeadm join apiServerEndpoint#2373

Closed
medampudi wants to merge 1 commit into cozystack:main from medampudi:fix/kubernetes-kubeadm-join-fqdn

Conversation

@medampudi

@medampudi medampudi commented Apr 11, 2026

Summary

  • Use full FQDN (<name>.<namespace>.svc.<cluster-domain>:6443) instead of partial (<name>.<namespace>.svc:6443) for the kubeadm join apiServerEndpoint
  • Read cluster domain from _cluster.cluster-domain (same pattern as mongodb, nats, clickhouse charts)
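For reference, the `_cluster` values the template reads would have roughly this shape (the `cluster-domain` key name comes from the PR; the surrounding structure is assumed for illustration):

```yaml
# Platform-injected values consumed by the chart template (shape assumed).
_cluster:
  cluster-domain: cluster.local   # template falls back to this default when unset
```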

Problem

KubeVirt worker VMs fail to join managed Kubernetes clusters with NodeStartupTimeout. The kubeadm join config uses a partial FQDN:

apiServerEndpoint: kubernetes-botz-infra.tenant-simbotix.svc:6443

This requires DNS search domain expansion (appending .cluster.local) to resolve. However, KubeVirt worker VMs get their DNS via DHCP and do not have the cluster's search domains in /etc/resolv.conf.

Even from a pod with correct search domains, the partial FQDN fails to resolve:

$ nslookup kubernetes-botz-infra.tenant-simbotix.svc
** server can't find kubernetes-botz-infra.tenant-simbotix.svc: NXDOMAIN

$ nslookup kubernetes-botz-infra.tenant-simbotix.svc.cluster.local
Address: 10.96.80.135   ✅
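The resolver behavior behind this can be modeled in a few lines: a name with fewer dots than the resolver's `ndots` threshold is expanded through the search list first, while a host with no search domains only ever tries the bare name. The sketch below is illustrative (the function name and the simplified ordering logic are not part of the PR):

```python
def candidate_fqdns(name: str, search_domains: list[str], ndots: int = 1) -> list[str]:
    """Model the order in which a glibc-style resolver tries names.

    A trailing dot marks the name as absolute; otherwise names with fewer
    than `ndots` dots are expanded through the search list before being
    tried as-is.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{d}" for d in search_domains]
    if name.count(".") >= ndots:
        return [name] + expanded   # enough dots: try the name as-is first
    return expanded + [name]       # too few dots: expand through search list first

# Inside a pod (typically ndots:5 plus cluster search domains), the partial
# name is expanded, so the full FQDN is at least attempted:
print(candidate_fqdns("kubernetes-botz-infra.tenant-simbotix.svc",
                      ["tenant-simbotix.svc.cluster.local",
                       "svc.cluster.local", "cluster.local"], ndots=5))

# On a KubeVirt VM with DHCP-provided DNS and no cluster search domains,
# only the partial name is ever tried, which fails with NXDOMAIN:
print(candidate_fqdns("kubernetes-botz-infra.tenant-simbotix.svc", []))
```

A full FQDN such as `kubernetes-botz-infra.tenant-simbotix.svc.cluster.local` resolves without depending on any search list, which is exactly what the fix relies on.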

Fix

Append the cluster domain to produce a full FQDN:

apiServerEndpoint: kubernetes-botz-infra.tenant-simbotix.svc.cluster.local:6443
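In Helm-template terms, the change amounts to reading the domain with a fallback and appending it when building the endpoint. A sketch of the relevant lines (the exact `apiServerEndpoint` templating in cluster.yaml may differ; the service-name and namespace references here are illustrative):

```yaml
{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
# ...
joinConfiguration:
  discovery:
    bootstrapToken:
      apiServerEndpoint: {{ .Release.Name }}.{{ .Release.Namespace }}.svc.{{ $clusterDomain }}:6443
```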

Test plan

  • Verified partial FQDN returns NXDOMAIN from within cluster
  • Verified full FQDN resolves correctly
  • Deploy managed K8s cluster with fix and verify workers join

Fixes #2372

Summary by CodeRabbit

  • Chores
    • Made Kubernetes cluster domain configurable in cluster setup, with a default value maintained for standard deployments. Updated API server endpoint discovery to incorporate the configurable cluster domain during node bootstrap.

The kubeadm join configuration uses a partial FQDN
(`<name>.<namespace>.svc:6443`) for the API server endpoint. This
requires DNS search domain expansion to resolve, but KubeVirt worker
VMs don't have the cluster's search domains in their /etc/resolv.conf.

This causes worker nodes to fail joining the managed Kubernetes cluster
with NodeStartupTimeout, as kubelet cannot resolve the API server
address.

Fix by appending the cluster domain (from `_cluster.cluster-domain`)
to produce a full FQDN like
`<name>.<namespace>.svc.cluster.local:6443` that resolves without
search domain expansion.

Follows the same pattern used by mongodb, nats, and clickhouse charts.

Fixes cozystack#2372
@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 11, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

Modified the Kubernetes Helm template to support custom cluster DNS domains by introducing a $clusterDomain variable and updating the kubeadm bootstrap API server endpoint to use the fully qualified domain name instead of the partial FQDN.

Changes

Cohort / File(s) Summary
Kubernetes Bootstrap Configuration
packages/apps/kubernetes/templates/cluster.yaml
Added Helm variable for cluster domain (defaults to cluster.local) and updated apiServerEndpoint to include the domain suffix in the FQDN, enabling worker nodes to properly resolve the API server regardless of cluster DNS configuration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Poem

🐰 A DNS tale so fine and true,
With cluster domains, old and new,
The workers hop to join with glee,
When FQDNs match perfectly!
No more lost names in digital space,
Every domain finds its place. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: using full FQDN for kubeadm join apiServerEndpoint, which directly addresses the DNS resolution issue in the linked issue #2372.
Linked Issues check ✅ Passed The pull request fully addresses the linked issue #2372 by implementing the suggested fix to use full FQDN (including cluster domain) in the apiServerEndpoint, enabling worker node joins regardless of custom DNS domains.
Out of Scope Changes check ✅ Passed All changes are in-scope: adding a cluster domain variable and updating the apiServerEndpoint to use full FQDN directly addresses the DNS resolution requirement from issue #2372.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@dosubot dosubot Bot added the bug Something isn't working label Apr 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a configurable $clusterDomain variable in the Kubernetes cluster template, which is now used to construct the apiServerEndpoint FQDN. A review comment suggests adding a safety check when accessing .Values._cluster to prevent potential template rendering errors if the value is null.

{{- $etcd := .Values._namespace.etcd }}
{{- $ingress := .Values._namespace.ingress }}
{{- $host := .Values._namespace.host }}
{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
Contributor

medium

Accessing .Values._cluster directly with index can cause a template rendering error if _cluster is nil (e.g., during linting or if the secret is missing). Using default dict ensures the template renders safely. Additionally, since this pattern is used across multiple charts, consider adding a helper like cozy-lib.cluster-domain to the library in a future PR to centralize this configuration.

{{- $clusterDomain := (index (.Values._cluster | default dict) "cluster-domain") | default "cluster.local" }}

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
packages/apps/kubernetes/templates/cluster.yaml (1)

4-4: Align fallback domain default with existing templates in the same directory.

Line 4 uses "cluster.local" as fallback, while helmreleases/monitoring-agents.yaml and helmreleases/vertical-pod-autoscaler.yaml (in the same directory) both use "cozy.local". The project convention in packages/core/platform/values.yaml also specifies cozy.local. If _cluster.cluster-domain is unset, this mismatch will produce inconsistent service FQDNs.

Suggested fix
-{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
+{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cozy.local" }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/apps/kubernetes/templates/cluster.yaml` at line 4, The fallback
cluster domain currently set in the template variable $clusterDomain uses
"cluster.local" which conflicts with other templates; update the default in the
expression that sets $clusterDomain (the line using (index .Values._cluster
"cluster-domain") | default "cluster.local") to use "cozy.local" so it matches
helmreleases/monitoring-agents.yaml, helmreleases/vertical-pod-autoscaler.yaml,
and the platform default in packages/core/platform/values.yaml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e5a0840-976d-454c-b4e2-21cd9d0f47e3

📥 Commits

Reviewing files that changed from the base of the PR and between 78ca1f0 and 179c51c.

📒 Files selected for processing (1)
  • packages/apps/kubernetes/templates/cluster.yaml

@medampudi
Author

Update: After further investigation, the primary reason workers weren't joining our cluster was actually issue #2143 (conntrack missing in the container disk image). Switching from v1.31 to v1.32 (where kubeadm made the conntrack check optional) resolved the join failure.

However, the FQDN fix in this PR is still valid and important:

  1. The partial FQDN *.svc:6443 returns NXDOMAIN even from pods with search domains (verified with nslookup from inside the cluster)
  2. With v1.32, kubeadm eventually resolves it through retry/fallback, but it's fragile
  3. For clusters with non-standard cluster domains (e.g. cozy.local), the partial FQDN will never resolve from KubeVirt VMs

The fix follows the same pattern used by mongodb, nats, and clickhouse charts. Recommend merging to make the hosted K8s more robust.

@medampudi
Author

Closing — the root cause on my cluster turned out to be issue #2143 (conntrack missing in the container disk image), not the apiServerEndpoint format. With #2143 addressed, plain-hostname endpoints work correctly for me, so this patch is no longer needed on my side.

If a maintainer thinks the FQDN form is still a worthwhile safety default they're welcome to reopen or cherry-pick; otherwise it's fine to leave closed.

Thanks for the eyes on the CodeRabbit walkthrough earlier.

@medampudi medampudi closed this Apr 22, 2026

Labels

bug Something isn't working size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kubernetes app: kubeadm join fails when cluster DNS domain is not cluster.local

1 participant