Description
Observed Behavior
When deploying multiple clusters of the same product (Airflow, NiFi, ...) into one namespace and then deleting one of them, it can happen that the roleBinding and serviceAccount objects that are shared by all these clusters accidentally get deleted as well.
Root Cause
The reason for this is shown in the following diagram
All cluster objects share the same roleBinding and serviceAccount objects. In principle this works fine, because the content of these objects is the same for every cluster.
Stackable operators track the resources they deploy via labels, so that they can delete "orphaned" objects that are no longer needed. This is done via the labels app.kubernetes.io/managed-by and app.kubernetes.io/instance, as can be seen in the snippet below.
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2024-11-22T13:23:34Z"
  labels:
    app.kubernetes.io/instance: simple-nifi2
    app.kubernetes.io/managed-by: nifi.stackable.tech_nificluster
    app.kubernetes.io/name: nifi
  name: simple-nifi2-serviceaccount
Whenever the object whose type matches the managed-by label and whose name matches the instance label is deleted, this serviceAccount is deleted as well, because it is now considered "orphaned".
In the scenario shown in the diagram above, the value of the instance label keeps switching between "simple-nifi" and "simple-nifi2", depending on which cluster was last changed. If that cluster is then deleted, the roleBinding and serviceAccount are cleaned up as well, and they stay missing until the other cluster is next reconciled and the objects are recreated.
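The sequence described above can be reproduced with a small toy model. This is only a sketch of the labelling logic, not the operators' actual code: the `Namespace` class and its methods are hypothetical, and the cleanup is reduced to a single comparison on the instance label.

```python
from dataclasses import dataclass, field
from typing import Optional

# managed-by value taken from the ServiceAccount snippet in this issue.
MANAGED_BY = "nifi.stackable.tech_nificluster"
INSTANCE = "app.kubernetes.io/instance"


@dataclass
class Namespace:
    """Toy model: the live clusters plus the one shared ServiceAccount."""
    clusters: set = field(default_factory=set)
    # Labels of the shared ServiceAccount, or None if it does not exist.
    service_account: Optional[dict] = None

    def reconcile(self, cluster: str) -> None:
        """Apply the shared object; the instance label is overwritten each time."""
        self.clusters.add(cluster)
        self.service_account = {
            "app.kubernetes.io/managed-by": MANAGED_BY,
            INSTANCE: cluster,
        }

    def delete_cluster(self, cluster: str) -> None:
        """Delete a cluster and remove objects labelled as belonging to it."""
        self.clusters.discard(cluster)
        sa = self.service_account
        if sa is not None and sa[INSTANCE] == cluster:
            self.service_account = None  # treated as "orphaned"


ns = Namespace()
ns.reconcile("simple-nifi")
ns.reconcile("simple-nifi2")       # instance label now says simple-nifi2
ns.delete_cluster("simple-nifi2")
print(ns.clusters)                 # {'simple-nifi'} is still running ...
print(ns.service_account)          # None: ... but its ServiceAccount is gone
ns.reconcile("simple-nifi")        # the next reconcile recreates the object
```

In this model the surviving cluster has no serviceAccount between `delete_cluster` and the following `reconcile`, which is the window in which the real cluster is broken.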
Workaround
The only currently known workaround is to trigger a manual reconciliation of the now broken cluster object by making some change to it, for example adding an annotation (the command below uses Nushell syntax):
kubectl patch zookeepercluster/simple-zk --type=merge --patch=({metadata:{annotations:{touch: (date now)}}} | to json)
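For shells that lack Nushell's `to json`, the same merge patch can be assembled with a few lines of Python. This is a minimal sketch: the helper name `touch_patch_command` is hypothetical, and the annotation key `touch` and the resource `zookeepercluster/simple-zk` are simply copied from the command above. The script only prints the command; it does not run it.

```python
import json
import shlex
from datetime import datetime, timezone


def touch_patch_command(kind: str, name: str) -> list:
    """Build a kubectl command that sets a 'touch' annotation to the current time."""
    patch = {
        "metadata": {
            "annotations": {
                # Annotation values must be strings, so use an ISO 8601 timestamp.
                "touch": datetime.now(timezone.utc).isoformat(),
            }
        }
    }
    return [
        "kubectl", "patch", f"{kind}/{name}",
        "--type=merge",
        f"--patch={json.dumps(patch)}",
    ]


cmd = touch_patch_command("zookeepercluster", "simple-zk")
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

Any change to the cluster object triggers a reconciliation, so the annotation key itself is arbitrary; a timestamp value just guarantees that the object actually changes on every run.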
Fix
The best fix for this seems to be to move to individual roleBinding and serviceAccount objects for every cluster, instead of sharing them.
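The effect of per-cluster objects can be illustrated with the same kind of toy model. This is a sketch under assumptions, not the actual operator-rs change: the helper names are hypothetical, the `-serviceaccount` suffix follows the snippet above, and the `-rolebinding` suffix is made up for illustration.

```python
INSTANCE = "app.kubernetes.io/instance"


def rbac_object_names(cluster: str) -> dict:
    """Derive one set of object names per cluster (naming scheme is an assumption)."""
    return {
        "ServiceAccount": f"{cluster}-serviceaccount",
        "RoleBinding": f"{cluster}-rolebinding",
    }


def reconcile(objects: dict, cluster: str) -> None:
    """Create or update the objects owned by exactly this cluster."""
    for kind, name in rbac_object_names(cluster).items():
        objects[name] = {"kind": kind, INSTANCE: cluster}


def delete_cluster(objects: dict, cluster: str) -> None:
    """Remove every object whose instance label points at the deleted cluster."""
    orphaned = [n for n, o in objects.items() if o[INSTANCE] == cluster]
    for name in orphaned:
        del objects[name]


objects = {}
reconcile(objects, "simple-nifi")
reconcile(objects, "simple-nifi2")
delete_cluster(objects, "simple-nifi2")
print(sorted(objects))
# ['simple-nifi-rolebinding', 'simple-nifi-serviceaccount']
```

Because no object is written by more than one cluster, the instance label never changes owner, and the orphan cleanup for one cluster cannot touch the objects of another.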
Todos
- Patch op-rs version to use one that fixes SUP-148. airflow-operator#545
- Update release branch airflow-operator#546
- Fix the changelog airflow-operator#548
- Patch op-rs version to use one that fixes SUP-148. druid-operator#657
- Update release branch druid-operator#658
- Fix the changelog druid-operator#659
- Patch op-rs version to use one that fixes SUP-148. hbase-operator#594
- Update release branch hbase-operator#595
- Fix the changelog hbase-operator#597
- Use a patched version of op-rs to hopefully fix SUP-148 hdfs-operator#616
- Update release branch hdfs-operator#617
- Fix the changelog hdfs-operator#618
- Use patched version of op-rs that hopefully addresses SUP-148. hive-operator#544
- Update release branch hive-operator#545
- Fix the changelog hive-operator#547
- Patch op-rs version to use one that fixes SUP-148. kafka-operator#793
- Update release branch kafka-operator#794
- Fix the changelog kafka-operator#795
- Use patched version of op-rs that hopefully addresses SUP-148. nifi-operator#717
- Update release branch nifi-operator#719
- Fix the changelog nifi-operator#720
- Patch op-rs version to use one that fixes SUP-148. opa-operator#656
- Update release branch opa-operator#657
- Fix the changelog opa-operator#658
- Patch op-rs version to use one that fixes SUP-148. spark-k8s-operator#498
- Update release branch spark-k8s-operator#500
- Use patched version of op-rs that addresses SUP-148. superset-operator#568
- Update release branch superset-operator#570
- Fix the changelog superset-operator#571
- Use patched version of op-rs that hopefully addresses SUP-148. trino-operator#672
- Update release branch trino-operator#674
- Fix the changelog trino-operator#675
- Patch op-rs version to use one that fixes SUP-148. zookeeper-operator#889
- Update release branch zookeeper-operator#890
- Fix the changelog zookeeper-operator#891
Integration tests Kind & Managed
- Airflow - Kind 1.29.8 ✅ - OKD 4.15 ✅ - @siegfriedweber
- Druid - Kind 1.29.8 ✅ - OKD 4.15 ✅ - @siegfriedweber
- HBase - kind 1.29.2 ✅ - OKD 4.14 ✅ - 2 tests failed, succeeded manually - @maltesander
- Hadoop HDFS - AKS 1.29.9 ✅ - OKD 4.15 ✅ - @soenkeliebau
- Hive - kind 1.29.2 ✅ - OKD 4.14 ✅ - @maltesander
- Kafka - Kind 1.29.8 ✅ - OKD 4.15 ✅ - @siegfriedweber
- NiFi - AKS 1.29.9 ✅ - OKD 4.15 ✅ - @soenkeliebau
- Spark
- Superset - Kind 1.29.8 ✅ - OKD 4.15 ✅ - @siegfriedweber
- Trino - AKS 1.29.9 ✅ - OKD 4.14 ✅ - @soenkeliebau
- ZooKeeper - kind 1.29.2 ✅ - OKD 4.14 ✅ - @maltesander
- OPA - kind 1.29.2 ✅ - OKD 4.14 ✅ - @maltesander
