fix: handle orphaned S3 segments gracefully in RestoreFromS3 and Read#114
Merged
novatechflow merged 5 commits intoKafScale:mainfrom Feb 27, 2026
Merged
Conversation
MemoryS3Client and awsS3Client now wrap ErrNotFound via %w when the index key does not exist. The AWS client detects NoSuchKey/404 via isNotFoundErr() helper. This lets callers distinguish 'key missing' from transient S3 errors.
Adds an optional *slog.Logger field and a logger() helper that falls back to slog.Default(). Wired in cmd/broker/main.go.
When DownloadIndex returns ErrNotFound and the segment's base offset is >= l.nextOffset (the committed-offset boundary from etcd), skip the segment with a warning log. This handles orphaned .kfs files left by partial flush failures where .index upload failed. Transient errors (non-ErrNotFound) always propagate — we cannot distinguish an orphan from S3 being temporarily down. Also: remove redundant entry struct (reuse segmentRange), derive last offset from valid segments only, guard against len(segments) == 0.
When an orphaned segment is skipped during restore, consumers requesting offsets in the gap get ErrOffsetOutOfRange. Instead, snap forward to the next available segment — matching Kafka's LocalLog.read() behavior.
All tests use MaxBytes: 1 with 70-byte batches, so ShouldFlush triggers synchronously inside AppendBatch. The 2ms sleeps before Flush() were no-ops. Verified with 50x race-detector runs.
novatechflow
approved these changes
Feb 27, 2026
Collaborator
novatechflow
left a comment
There was a problem hiding this comment.
thanks again @klaudworks!
kamir
pushed a commit
to kamir/kafscale
that referenced
this pull request
Mar 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Addresses #112
When a partition is first accessed,
RestoreFromS3rebuilds the in-memory segment index from S3. It lists all.kfssegment files, then downloads each corresponding.indexfile. If any.indexdownload fails, the entire partition becomes permanently unavailable until someone manually deletes the orphan from S3.Orphaned
.kfsfiles are a normal byproduct of partial flush failures.uploadFlushuploads.kfsand.indexin parallel — when.kfssucceeds but.indexfails, the.kfspersists in S3 with no matching index.Changes
No error classification → typed
ErrNotFounderror —DownloadIndexreturned a plainfmt.Errorfstring. The restore loop treated "key doesn't exist" and "S3 is temporarily down" identically — both killed the partition. Both S3 client implementations now return a wrappedErrNotFoundwhen the index key doesn't exist, so callers can distinguish the two cases.Single-failure partition kill → metadata-aware orphan skip — One
.indexupload failure (S3 503, network blip) made the partition permanently unavailable. Now, when an.indexis confirmed missing (ErrNotFound) and the segment was never acknowledged to a producer, it's skipped with a warning log. We know a segment was never acknowledged because the metadata store (etcd) tracks the committed offset boundary, and it's only updated after a fully successful flush — if the.indexupload failed, the flush failed, and the committed offset was never advanced. Transient errors always propagate regardless, because we can't tell if the.indexis truly missing or if S3 is just temporarily unreachable.Read()returnedErrOffsetOutOfRangefor gaps → snap forward — If an intermediate segment was skipped during restore, consumers got an error instead of data from the next available segment. NowReadsnaps forward to the next available segment, matching what Kafka does.lastderived from all listed entries, not valid segments — After skipping orphans, the last offset could reference a segment whose.indexwas never loaded. Now derived from valid segments only, with an empty-segments guard.