
bug: consumer groups with more than 1 consumer lead to GroupCoordinator split brain #121

@klaudworks


Status Quo Example

Setup: 3 brokers, 2 consumers (A and B) in the same group, topic with 4 partitions.

What should happen: A coordinator assigns partitions fairly -- consumer A gets partitions 0, 1 and consumer B gets partitions 2, 3. Each message is processed exactly once.

What actually happens in 2 out of 3 cases, i.e. whenever consumer B lands on a different broker than consumer A:

  1. Consumer A connects and lands on broker 0. Broker 0 becomes the coordinator for this group, assigns all 4 partitions to consumer A, and saves this to etcd.
  2. Consumer B connects and lands on broker 1. Broker 1 loads the group state from etcd, sees consumer A exists, adds consumer B, and starts reassigning partitions. But consumer A doesn't know this is happening -- it's still talking to broker 0, which has no idea broker 1 is doing anything with this group.
  3. Consumer A never responds to broker 1's reassignment (it doesn't know about it), so after 30 seconds broker 1 gives up waiting and assigns all 4 partitions to consumer B alone.
  4. Now both consumers think they own all 4 partitions. Both process every message. Both save their progress. Neither knows the other exists.

Result: Every message gets processed twice, forever, with no error or warning anywhere.
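
The sequence above can be reproduced with a small in-memory model. This is only an illustrative sketch, not the project's code: `Broker`, `split`, and `simulate` are hypothetical names, a plain dict stands in for etcd, and the 30 second rebalance timeout is modelled as happening instantly.

```python
from dataclasses import dataclass, field

PARTITIONS = [0, 1, 2, 3]


def split(partitions, owners):
    # Contiguous (range-style) split; earlier owners take any remainder.
    base, extra = divmod(len(partitions), len(owners))
    out, start = {}, 0
    for i, owner in enumerate(owners):
        end = start + base + (1 if i < extra else 0)
        out[owner] = partitions[start:end]
        start = end
    return out


@dataclass
class Broker:
    broker_id: int
    store: dict                                   # shared dict standing in for etcd
    connected: set = field(default_factory=set)   # consumers talking to this broker
    local: dict = field(default_factory=dict)     # this broker's view of assignments

    def join(self, group, consumer):
        self.connected.add(consumer)
        # Load persisted membership and add the new consumer.
        members = set(self.store.get(group, {})) | {consumer}
        # Members connected to another broker never see this rebalance,
        # so they are dropped once the timeout expires (instant here).
        responders = sorted(m for m in members if m in self.connected)
        assignment = split(PARTITIONS, responders)
        self.store[group] = assignment
        self.local[group] = assignment
        return assignment[consumer]


def simulate(b_broker_index):
    etcd = {}
    brokers = [Broker(i, etcd) for i in range(3)]
    brokers[0].join("g", "A")
    brokers[b_broker_index].join("g", "B")
    # What each consumer believes it owns, according to the broker it talks to.
    return brokers[0].local["g"]["A"], brokers[b_broker_index].local["g"]["B"]


same_broker = simulate(0)    # both consumers land on broker 0
split_brain = simulate(1)    # consumer B lands on broker 1
```

In this model `same_broker` gives each consumer two partitions, while `split_brain` leaves both consumers believing they own all four, which matches the duplicate processing described above.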

Proposal

Fix this using the same lease-based mechanism as proposed in #120, i.e. route all requests for a given group to a single broker.
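
A minimal sketch of what such routing could look like, under stated assumptions: the details of #120 are not part of this issue, so `Lease`, `LeaseTable`, `owner_for`, and the 10 second TTL are all hypothetical. The dict stands in for etcd, and a real implementation would need the check-and-acquire step to be atomic (for example via an etcd transaction).

```python
from dataclasses import dataclass


@dataclass
class Lease:
    owner: int          # broker id currently coordinating the group
    expires_at: float   # time after which the lease may be taken over


class LeaseTable:
    """In-memory stand-in for a lease store; not atomic like a real one must be."""

    def __init__(self, ttl=10.0):
        self.ttl = ttl
        self.leases = {}

    def owner_for(self, group, broker_id, now):
        lease = self.leases.get(group)
        if lease is None or lease.expires_at <= now:
            # No live lease: the broker that asked becomes the coordinator.
            lease = Lease(broker_id, now + self.ttl)
            self.leases[group] = lease
        elif lease.owner == broker_id:
            # The current owner renews its lease by handling a request.
            lease.expires_at = now + self.ttl
        # Any other broker must forward or redirect to the returned owner.
        return lease.owner


table = LeaseTable(ttl=10.0)
first = table.owner_for("g", broker_id=0, now=0.0)       # consumer A lands on broker 0
second = table.owner_for("g", broker_id=1, now=1.0)      # consumer B lands on broker 1
after_expiry = table.owner_for("g", broker_id=1, now=20.0)
```

Here broker 1 learns at `now=1.0` that broker 0 holds the lease and would hand consumer B's request over to it, so only one broker ever computes assignments for the group. Once the lease has expired, the next broker to ask takes over.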
