Skip to content

fix: handle orphaned S3 segments gracefully in RestoreFromS3 and Read#114

Merged
novatechflow merged 5 commits intoKafScale:mainfrom
klaudworks:fix-restore-missing-index
Feb 27, 2026
Merged

fix: handle orphaned S3 segments gracefully in RestoreFromS3 and Read#114
novatechflow merged 5 commits intoKafScale:mainfrom
klaudworks:fix-restore-missing-index

Conversation

@klaudworks
Copy link
Collaborator

@klaudworks klaudworks commented Feb 26, 2026

Motivation

Addresses #112

When a partition is first accessed, RestoreFromS3 rebuilds the in-memory segment index from S3. It lists all .kfs segment files, then downloads each corresponding .index file. If any .index download fails, the entire partition becomes permanently unavailable until someone manually deletes the orphan from S3.
Orphaned .kfs files are a normal byproduct of partial flush failures. uploadFlush uploads .kfs and .index in parallel — when .kfs succeeds but .index fails, the .kfs persists in S3 with no matching index.

Changes

No error classification → typed ErrNotFound errorDownloadIndex returned a plain fmt.Errorf string. The restore loop treated "key doesn't exist" and "S3 is temporarily down" identically — both killed the partition. Both S3 client implementations now return a wrapped ErrNotFound when the index key doesn't exist, so callers can distinguish the two cases.

Single-failure partition kill → metadata-aware orphan skip — One .index upload failure (S3 503, network blip) made the partition permanently unavailable. Now, when an .index is confirmed missing (ErrNotFound) and the segment was never acknowledged to a producer, it's skipped with a warning log. We know a segment was never acknowledged because the metadata store (etcd) tracks the committed offset boundary, and it's only updated after a fully successful flush — if the .index upload failed, the flush failed, and the committed offset was never advanced. Transient errors always propagate regardless, because we can't tell if the .index is truly missing or if S3 is just temporarily unreachable.

Read() returned ErrOffsetOutOfRange for gaps → snap forward — If an intermediate segment was skipped during restore, consumers got an error instead of data from the next available segment. Now Read snaps forward to the next available segment, matching what Kafka does.

last derived from all listed entries, not valid segments — After skipping orphans, the last offset could reference a segment whose .index was never loaded. Now derived from valid segments only, with an empty-segments guard.

MemoryS3Client and awsS3Client now wrap ErrNotFound via %w when the
index key does not exist. The AWS client detects NoSuchKey/404 via
isNotFoundErr() helper. This lets callers distinguish 'key missing'
from transient S3 errors.
Adds an optional *slog.Logger field and a logger() helper that falls
back to slog.Default(). Wired in cmd/broker/main.go.
When DownloadIndex returns ErrNotFound and the segment's base offset
is >= l.nextOffset (the committed-offset boundary from etcd), skip
the segment with a warning log. This handles orphaned .kfs files left
by partial flush failures where .index upload failed.

Transient errors (non-ErrNotFound) always propagate — we cannot
distinguish an orphan from S3 being temporarily down.

Also: remove redundant entry struct (reuse segmentRange), derive last
offset from valid segments only, guard against len(segments) == 0.
When an orphaned segment is skipped during restore, consumers
requesting offsets in the gap get ErrOffsetOutOfRange. Instead,
snap forward to the next available segment — matching Kafka's
LocalLog.read() behavior.
All tests use MaxBytes: 1 with 70-byte batches, so ShouldFlush
triggers synchronously inside AppendBatch. The 2ms sleeps before
Flush() were no-ops. Verified with 50x race-detector runs.
Copy link
Collaborator

@novatechflow novatechflow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks again @klaudworks!

@novatechflow novatechflow merged commit 16f6140 into KafScale:main Feb 27, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants