Essentially, a flag change in flagd is not actually picked up right away by the corresponding services (which is why we restarted some of them, IIRC).
From a debugging session with an AI, whose analysis seems correct:
No, it only restarts flagd, not the ad service. The injection flow (sketched below) is:

1. Update the flagd-config ConfigMap: set adFailure.defaultVariant to "on".
2. `kubectl rollout restart deployment flagd`, so flagd picks up the new config.
3. That's it; the ad deployment is never restarted.
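For reference, that flow amounts to roughly the following. This is a minimal sketch: the namespace (`otel-demo`) and the ConfigMap data key (`demo.flagd.json`) are assumptions and may differ in your deployment.

```sh
NS=otel-demo  # assumed namespace

# 1. Flip adFailure's defaultVariant to "on" in the flagd config
#    (assumes the config lives under the "demo.flagd.json" key)
kubectl -n "$NS" get configmap flagd-config \
  -o jsonpath='{.data.demo\.flagd\.json}' \
  | jq '.flags.adFailure.defaultVariant = "on"' > /tmp/demo.flagd.json
kubectl -n "$NS" create configmap flagd-config \
  --from-file=demo.flagd.json=/tmp/demo.flagd.json \
  --dry-run=client -o yaml | kubectl apply -f -

# 2. Restart flagd so it reads the updated ConfigMap
kubectl -n "$NS" rollout restart deployment flagd

# 3. That's it: the ad deployment is never restarted
```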
The assumption is that the ad service's OpenFeature SDK is connected to flagd via a gRPC EventStream and will pick up the flag change in real time. But in practice there's a race:

1. flagd restarts → the ad service's EventStream connection drops.
2. The OpenFeature SDK falls back to the code-level default (false, i.e. no failure) while reconnecting.
3. Only after the SDK re-establishes the EventStream and syncs the flag state does adFailure=true take effect.

That reconnection can take minutes, which explains the ~6–30 minute delay before errors appear. You can check which side is lagging, as shown below.
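One way to confirm this is to ask flagd directly what it resolves while the ad service is still behaving as if the flag were off. A minimal sketch, assuming flagd's stock Connect/HTTP evaluation endpoint on its default port 8013 (path and port are flagd defaults; adjust if your deployment customizes them):

```sh
# Forward flagd's evaluation port locally (8013 is flagd's default)
kubectl -n otel-demo port-forward deploy/flagd 8013:8013 &

# flagd's Connect endpoint accepts plain JSON over HTTP
curl -s -X POST \
  http://localhost:8013/flagd.evaluation.v1.Service/ResolveBoolean \
  -H 'Content-Type: application/json' \
  -d '{"flagKey": "adFailure", "context": {}}'
# Once the new config is live, this should return something like:
#   {"value": true, "variant": "on", "reason": "STATIC"}
```

If flagd already returns the new variant while the ad service keeps serving ads normally, the delay is on the SDK side (the EventStream reconnect), not in flagd itself.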
This seems to be the root cause.
Thanks @tianyi-tz for finding this and reaching out!