
fix(kubernetes): use full FQDN for kubeadm join apiServerEndpoint#2373

Closed
medampudi wants to merge 1 commit into cozystack:main from medampudi:fix/kubernetes-kubeadm-join-fqdn

Conversation

@medampudi

@medampudi medampudi commented Apr 11, 2026

Summary

  • Use full FQDN (<name>.<namespace>.svc.<cluster-domain>:6443) instead of partial (<name>.<namespace>.svc:6443) for the kubeadm join apiServerEndpoint
  • Read cluster domain from _cluster.cluster-domain (same pattern as mongodb, nats, clickhouse charts)
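For reference, the `_cluster` values the template reads would have roughly this shape (the `cluster-domain` key name comes from the PR; the surrounding structure is assumed for illustration):

```yaml
# Platform-injected values consumed by the chart template (shape assumed).
_cluster:
  cluster-domain: cluster.local   # template falls back to this default when unset
```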

Problem

KubeVirt worker VMs fail to join managed Kubernetes clusters with NodeStartupTimeout. The kubeadm join config uses a partial FQDN:

apiServerEndpoint: kubernetes-botz-infra.tenant-simbotix.svc:6443

This requires DNS search domain expansion (appending .cluster.local) to resolve. However, KubeVirt worker VMs get their DNS via DHCP and do not have the cluster's search domains in /etc/resolv.conf.

Even from a pod with correct search domains, the partial FQDN fails to resolve:

$ nslookup kubernetes-botz-infra.tenant-simbotix.svc
** server can't find kubernetes-botz-infra.tenant-simbotix.svc: NXDOMAIN

$ nslookup kubernetes-botz-infra.tenant-simbotix.svc.cluster.local
Address: 10.96.80.135   ✅
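The resolver behavior behind this can be modeled in a few lines: a name with fewer dots than the resolver's `ndots` threshold is expanded through the search list first, while a host with no search domains only ever tries the bare name. The sketch below is illustrative (the function name and the simplified ordering logic are not part of the PR):

```python
def candidate_fqdns(name: str, search_domains: list[str], ndots: int = 1) -> list[str]:
    """Model the order in which a glibc-style resolver tries names.

    A trailing dot marks the name as absolute; otherwise names with fewer
    than `ndots` dots are expanded through the search list before being
    tried as-is.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{d}" for d in search_domains]
    if name.count(".") >= ndots:
        return [name] + expanded   # enough dots: try the name as-is first
    return expanded + [name]       # too few dots: expand through search list first

# Inside a pod (typically ndots:5 plus cluster search domains), the partial
# name is expanded, so the full FQDN is at least attempted:
print(candidate_fqdns("kubernetes-botz-infra.tenant-simbotix.svc",
                      ["tenant-simbotix.svc.cluster.local",
                       "svc.cluster.local", "cluster.local"], ndots=5))

# On a KubeVirt VM with DHCP-provided DNS and no cluster search domains,
# only the partial name is ever tried, which fails with NXDOMAIN:
print(candidate_fqdns("kubernetes-botz-infra.tenant-simbotix.svc", []))
```

A full FQDN such as `kubernetes-botz-infra.tenant-simbotix.svc.cluster.local` resolves without depending on any search list, which is exactly what the fix relies on.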

Fix

Append the cluster domain to produce a full FQDN:

apiServerEndpoint: kubernetes-botz-infra.tenant-simbotix.svc.cluster.local:6443
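In Helm-template terms, the change amounts to reading the domain with a fallback and appending it when building the endpoint. A sketch of the relevant lines (the exact `apiServerEndpoint` templating in cluster.yaml may differ; the service-name and namespace references here are illustrative):

```yaml
{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
# ...
joinConfiguration:
  discovery:
    bootstrapToken:
      apiServerEndpoint: {{ .Release.Name }}.{{ .Release.Namespace }}.svc.{{ $clusterDomain }}:6443
```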

Test plan

  • Verified partial FQDN returns NXDOMAIN from within cluster
  • Verified full FQDN resolves correctly
  • Deploy managed K8s cluster with fix and verify workers join

Fixes #2372

Summary by CodeRabbit

  • Chores
    • Made Kubernetes cluster domain configurable in cluster setup, with a default value maintained for standard deployments. Updated API server endpoint discovery to incorporate the configurable cluster domain during node bootstrap.

The kubeadm join configuration uses a partial FQDN
(`<name>.<namespace>.svc:6443`) for the API server endpoint. This
requires DNS search domain expansion to resolve, but KubeVirt worker
VMs don't have the cluster's search domains in their /etc/resolv.conf.

This causes worker nodes to fail joining the managed Kubernetes cluster
with NodeStartupTimeout, as kubelet cannot resolve the API server
address.

Fix by appending the cluster domain (from `_cluster.cluster-domain`)
to produce a full FQDN like
`<name>.<namespace>.svc.cluster.local:6443` that resolves without
search domain expansion.

Follows the same pattern used by mongodb, nats, and clickhouse charts.

Fixes cozystack#2372
@dosubot dosubot Bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Apr 11, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 11, 2026

📝 Walkthrough

Walkthrough

Modified the Kubernetes Helm template to support custom cluster DNS domains by introducing a $clusterDomain variable and updating the kubeadm bootstrap API server endpoint to use the fully qualified domain name instead of the partial FQDN.

Changes

Cohort / File(s) Summary
Kubernetes Bootstrap Configuration
packages/apps/kubernetes/templates/cluster.yaml
Added Helm variable for cluster domain (defaults to cluster.local) and updated apiServerEndpoint to include the domain suffix in the FQDN, enabling worker nodes to properly resolve the API server regardless of cluster DNS configuration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~4 minutes

Poem

🐰 A DNS tale so fine and true,
With cluster domains, old and new,
The workers hop to join with glee,
When FQDNs match perfectly!
No more lost names in digital space,
Every domain finds its place. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: using full FQDN for kubeadm join apiServerEndpoint, which directly addresses the DNS resolution issue in the linked issue #2372.
Linked Issues check ✅ Passed The pull request fully addresses the linked issue #2372 by implementing the suggested fix to use full FQDN (including cluster domain) in the apiServerEndpoint, enabling worker node joins regardless of custom DNS domains.
Out of Scope Changes check ✅ Passed All changes are in-scope: adding a cluster domain variable and updating the apiServerEndpoint to use full FQDN directly addresses the DNS resolution requirement from issue #2372.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


@dosubot dosubot Bot added the bug Something isn't working label Apr 11, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a configurable $clusterDomain variable in the Kubernetes cluster template, which is now used to construct the apiServerEndpoint FQDN. A review comment suggests adding a safety check when accessing .Values._cluster to prevent potential template rendering errors if the value is null.

{{- $etcd := .Values._namespace.etcd }}
{{- $ingress := .Values._namespace.ingress }}
{{- $host := .Values._namespace.host }}
{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
Contributor

medium

Accessing .Values._cluster directly with index can cause a template rendering error if _cluster is nil (e.g., during linting or if the secret is missing). Using default dict ensures the template renders safely. Additionally, since this pattern is used across multiple charts, consider adding a helper like cozy-lib.cluster-domain to the library in a future PR to centralize this configuration.

{{- $clusterDomain := (index (.Values._cluster | default dict) "cluster-domain") | default "cluster.local" }}

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
packages/apps/kubernetes/templates/cluster.yaml (1)

4-4: Align fallback domain default with existing templates in the same directory.

Line 4 uses "cluster.local" as fallback, while helmreleases/monitoring-agents.yaml and helmreleases/vertical-pod-autoscaler.yaml (in the same directory) both use "cozy.local". The project convention in packages/core/platform/values.yaml also specifies cozy.local. If _cluster.cluster-domain is unset, this mismatch will produce inconsistent service FQDNs.

Suggested fix
-{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }}
+{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cozy.local" }}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/apps/kubernetes/templates/cluster.yaml` at line 4, The fallback
cluster domain currently set in the template variable $clusterDomain uses
"cluster.local" which conflicts with other templates; update the default in the
expression that sets $clusterDomain (the line using (index .Values._cluster
"cluster-domain") | default "cluster.local") to use "cozy.local" so it matches
helmreleases/monitoring-agents.yaml, helmreleases/vertical-pod-autoscaler.yaml,
and the platform default in packages/core/platform/values.yaml.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e5a0840-976d-454c-b4e2-21cd9d0f47e3

📥 Commits

Reviewing files that changed from the base of the PR and between 78ca1f0 and 179c51c.

📒 Files selected for processing (1)
  • packages/apps/kubernetes/templates/cluster.yaml

@medampudi
Author

Update: After further investigation, the primary reason workers weren't joining our cluster was actually issue #2143 (conntrack missing in the container disk image). Switching from v1.31 to v1.32 (where kubeadm made the conntrack check optional) resolved the join failure.

However, the FQDN fix in this PR is still valid and important:

  1. The partial FQDN *.svc:6443 returns NXDOMAIN even from pods with search domains (verified with nslookup from inside the cluster)
  2. With v1.32, kubeadm eventually resolves it through retry/fallback, but it's fragile
  3. For clusters with non-standard cluster domains (e.g. cozy.local), the partial FQDN will never resolve from KubeVirt VMs

The fix follows the same pattern used by mongodb, nats, and clickhouse charts. Recommend merging to make the hosted K8s more robust.

@medampudi
Author

Closing — the root cause on my cluster turned out to be issue #2143 (conntrack missing in the container disk image), not the apiServerEndpoint format. With #2143 addressed, plain-hostname endpoints work correctly for me, so this patch is no longer needed on my side.

If a maintainer thinks the FQDN form is still a worthwhile safety default they're welcome to reopen or cherry-pick; otherwise it's fine to leave closed.

Thanks for the eyes on the CodeRabbit walkthrough earlier.

@medampudi medampudi closed this Apr 22, 2026

Labels

bug Something isn't working size:XS This PR changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Kubernetes app: kubeadm join fails when cluster DNS domain is not cluster.local

1 participant