fix(kubernetes): use full FQDN for kubeadm join apiServerEndpoint#2373
fix(kubernetes): use full FQDN for kubeadm join apiServerEndpoint#2373medampudi wants to merge 1 commit intocozystack:mainfrom
Conversation
The kubeadm join configuration uses a partial FQDN (`<name>.<namespace>.svc:6443`) for the API server endpoint. This requires DNS search domain expansion to resolve, but KubeVirt worker VMs don't have the cluster's search domains in their /etc/resolv.conf. This causes worker nodes to fail joining the managed Kubernetes cluster with NodeStartupTimeout, as kubelet cannot resolve the API server address. Fix by appending the cluster domain (from `_cluster.cluster-domain`) to produce a full FQDN like `<name>.<namespace>.svc.cluster.local:6443` that resolves without search domain expansion. Follows the same pattern used by mongodb, nats, and clickhouse charts. Fixes cozystack#2372
📝 WalkthroughWalkthroughModified the Kubernetes Helm template to support custom cluster DNS domains by introducing a Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~4 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a configurable $clusterDomain variable in the Kubernetes cluster template, which is now used to construct the apiServerEndpoint FQDN. A review comment suggests adding a safety check when accessing .Values._cluster to prevent potential template rendering errors if the value is null.
| {{- $etcd := .Values._namespace.etcd }} | ||
| {{- $ingress := .Values._namespace.ingress }} | ||
| {{- $host := .Values._namespace.host }} | ||
| {{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }} |
There was a problem hiding this comment.
Accessing .Values._cluster directly with index can cause a template rendering error if _cluster is nil (e.g., during linting or if the secret is missing). Using default dict ensures the template renders safely. Additionally, since this pattern is used across multiple charts, consider adding a helper like cozy-lib.cluster-domain to the library in a future PR to centralize this configuration.
{{- $clusterDomain := (index (.Values._cluster | default dict) "cluster-domain") | default "cluster.local" }}There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/apps/kubernetes/templates/cluster.yaml (1)
4-4: Align fallback domain default with existing templates in the same directory.Line 4 uses
"cluster.local"as fallback, whilehelmreleases/monitoring-agents.yamlandhelmreleases/vertical-pod-autoscaler.yaml(in the same directory) both use"cozy.local". The project convention inpackages/core/platform/values.yamlalso specifiescozy.local. If_cluster.cluster-domainis unset, this mismatch will produce inconsistent service FQDNs.Suggested fix
-{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cluster.local" }} +{{- $clusterDomain := (index .Values._cluster "cluster-domain") | default "cozy.local" }}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/apps/kubernetes/templates/cluster.yaml` at line 4, The fallback cluster domain currently set in the template variable $clusterDomain uses "cluster.local" which conflicts with other templates; update the default in the expression that sets $clusterDomain (the line using (index .Values._cluster "cluster-domain") | default "cluster.local") to use "cozy.local" so it matches helmreleases/monitoring-agents.yaml, helmreleases/vertical-pod-autoscaler.yaml, and the platform default in packages/core/platform/values.yaml.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/apps/kubernetes/templates/cluster.yaml`:
- Line 4: The fallback cluster domain currently set in the template variable
$clusterDomain uses "cluster.local" which conflicts with other templates; update
the default in the expression that sets $clusterDomain (the line using (index
.Values._cluster "cluster-domain") | default "cluster.local") to use
"cozy.local" so it matches helmreleases/monitoring-agents.yaml,
helmreleases/vertical-pod-autoscaler.yaml, and the platform default in
packages/core/platform/values.yaml.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3e5a0840-976d-454c-b4e2-21cd9d0f47e3
📒 Files selected for processing (1)
packages/apps/kubernetes/templates/cluster.yaml
|
Update: After further investigation, the primary reason workers weren't joining on our cluster was actually issue #2143 — However, the FQDN fix in this PR is still valid and important:
The fix follows the same pattern used by mongodb, nats, and clickhouse charts. Recommend merging to make the hosted K8s more robust. |
|
Closing — the root cause on my cluster turned out to be issue #2143 (conntrack missing in the containerd image), not the If a maintainer thinks the FQDN form is still a worthwhile safety default they're welcome to reopen or cherry-pick; otherwise it's fine to leave closed. Thanks for the eyes on the CodeRabbit walkthrough earlier. |
Summary
<name>.<namespace>.svc.<cluster-domain>:6443) instead of partial (<name>.<namespace>.svc:6443) for the kubeadm joinapiServerEndpoint_cluster.cluster-domain(same pattern as mongodb, nats, clickhouse charts)Problem
KubeVirt worker VMs fail to join managed Kubernetes clusters with
NodeStartupTimeout. The kubeadm join config uses a partial FQDN:This requires DNS search domain expansion (appending
.cluster.local) to resolve. However, KubeVirt worker VMs get their DNS via DHCP and do not have the cluster's search domains in/etc/resolv.conf.Even from a pod with correct search domains, the partial FQDN fails to resolve:
Fix
Append the cluster domain to produce a full FQDN:
Test plan
Fixes #2372
Summary by CodeRabbit