flagd faults take time to propagate in astronomy shop #724

@HacksonClark

Description

Essentially, a flag change in flagd is not picked up right away by the consuming services (which is why we restarted some of them, IIRC).

From a debugging session with an AI, which seems correct:

No, it only restarts flagd, not the ad service. The injection flow is:

1. Update the flagd-config ConfigMap: set adFailure.defaultVariant to "on"
2. kubectl rollout restart deployment flagd, so flagd picks up the new config
3. That's it: no restart of the ad deployment
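The injection flow above can be sketched as follows. This is a sketch only: the ConfigMap name, the deployment names, and the use of an interactive edit are assumptions based on this thread, and may differ in your setup.

```shell
# Sketch of the injection flow described above. The names (flagd-config,
# deployment "flagd", deployment "ad") are assumptions from this thread.

# 1. Flip the flag's defaultVariant to "on" in the flagd ConfigMap.
#    (Edited interactively here; a scripted patch would need the exact
#    key layout of your ConfigMap.)
kubectl edit configmap flagd-config

# 2. Restart flagd so it loads the updated config.
kubectl rollout restart deployment flagd

# 3. Note: the ad deployment is NOT restarted -- its OpenFeature SDK is
#    expected to pick up the change over the gRPC EventStream.
```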

The assumption is that the ad service’s OpenFeature SDK is connected to flagd via gRPC EventStream and will pick up the flag change in real-time. But in practice, there’s a race:

1. flagd restarts → the ad service's EventStream connection drops
2. The OpenFeature SDK falls back to the code-level default (false = no failure) while reconnecting
3. Only after the SDK re-establishes the EventStream and syncs the flag state does adFailure=true take effect
4. That reconnection can take minutes, explaining the ~6–30 minute delay before errors appear

This seems to be the root cause.
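If this diagnosis is right, one possible mitigation (a sketch, not something confirmed in this thread; the deployment name "ad" and the timeout value are assumptions) is to restart the consuming deployment as well, so its SDK re-syncs immediately instead of serving the code-level default while the EventStream reconnects:

```shell
# Hypothetical mitigation: restart the flag consumer after flagd so the
# OpenFeature SDK re-syncs right away instead of racing the EventStream
# reconnect. Deployment name "ad" is an assumption.
kubectl rollout restart deployment ad
kubectl rollout status deployment ad --timeout=120s
```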

Thanks @tianyi-tz for finding this and reaching out!

Metadata

Labels

benchmark (Benchmark platform-related issues), bug (Something isn't working)
