
tsdb: Find the last series ID on startup from the last series id file and WAL scan #18333

Merged
codesome merged 7 commits into prometheus:main from RushabhMehta2005:read-state-file on Apr 1, 2026


@RushabhMehta2005 (Contributor) commented Mar 22, 2026

  • This PR builds directly on top of the state file created and maintained in PR #18303.
  • On startup, the Head.Init(...) method now checks for this file and reads it. If the previous shutdown was clean, it trusts the last series ID stored in the file. If the shutdown was unclean, it scans the WAL segments backwards, from the newest segment down to the last WAL segment recorded in the file.
  • Added unit tests for this.

Which issue(s) does the PR fix:

NA

Does this PR introduce a user-facing change?

NONE

Comment thread tsdb/head_test.go
Signed-off-by: Rushabh Mehta <[email protected]>
@RushabhMehta2005 (Contributor, Author)

The failing CI lint check is caused by a recent commit from someone else to a file I have not touched.

Signed-off-by: Rushabh Mehta <[email protected]>
@codesome (Member) left a comment

Good work! Most of the comments are just styling nits.

Comment thread tsdb/head.go Outdated
state, err := h.readSeriesState()
if err != nil {
h.logger.Warn("Failed to read series state file, falling back to slow ID initialization", "err", err)
} else if state != nil {
Member

The else isn't needed here.

Suggested change
} else if state != nil {
}
if state != nil {

Comment thread tsdb/head.go Outdated
if h.opts.EnableFastStartup {
state, err := h.readSeriesState()
if err != nil {
h.logger.Warn("Failed to read series state file, falling back to slow ID initialization", "err", err)
Member

Let's keep it simple and only do fast startup if this file exists. For the initial implementation of this feature, let's not add an option to scan all WAL files for the series ID.

Suggested change
h.logger.Warn("Failed to read series state file, falling back to slow ID initialization", "err", err)
h.logger.Warn("Failed to read series state file, skipping the fast startup", "err", err)

Contributor Author

I'm assuming that once this feature is ready, we will change this so that a new instance can in fact make use of it.

Comment thread tsdb/head.go Outdated
h.lastSeriesID.Store(state.LastSeriesID)
h.logger.Info("Fast startup: clean shutdown detected, restored last series ID", "id", state.LastSeriesID)
} else {
h.logger.Info("Fast startup: unclean shutdown detected, performing bounded reverse scan", "from_segment", endAt, "to_segment", state.LastWALSegment)
Member

Suggested change
h.logger.Info("Fast startup: unclean shutdown detected, performing bounded reverse scan", "from_segment", endAt, "to_segment", state.LastWALSegment)
h.logger.Info("Fast startup: unclean shutdown detected, performing WAL scan", "from_segment", endAt, "to_segment", state.LastWALSegment)

Comment thread tsdb/head.go Outdated
} else {
h.logger.Info("Fast startup: unclean shutdown detected, performing bounded reverse scan", "from_segment", endAt, "to_segment", state.LastWALSegment)
if err := h.findLastSeriesID(state, endAt); err != nil {
h.logger.Warn("Bounded reverse scan failed, falling back to slow ID initialization", "err", err)
Member

Suggested change
h.logger.Warn("Bounded reverse scan failed, falling back to slow ID initialization", "err", err)
h.logger.Error("Fast startup: WAL scan failed, skipping fast startup", "err", err)

Comment thread tsdb/head.go Outdated
if err := h.findLastSeriesID(state, endAt); err != nil {
h.logger.Warn("Bounded reverse scan failed, falling back to slow ID initialization", "err", err)
} else {
h.logger.Info("Fast startup: bounded reverse scan completed", "id", h.lastSeriesID.Load())
Member

Suggested change
h.logger.Info("Fast startup: bounded reverse scan completed", "id", h.lastSeriesID.Load())
h.logger.Info("Fast startup: WAL scan completed", "last_series_id", h.lastSeriesID.Load())

Comment thread tsdb/head_test.go Outdated
require.Equal(t, uint64(2), head.lastSeriesID.Load(), "Bounded scan should find the highest ID (2) from segment 1")
}

func TestHead_FastStartup_CleanShutdown(t *testing.T) {
Member

The WAL replay still runs, so we can't really test whether the new mechanism loads the right series ID. We can drop this test for now and come back to it later to test the mechanism end-to-end.

Contributor Author

That's true; this test is better left for later.

Comment thread tsdb/head_test.go
Comment thread tsdb/head_test.go Outdated
// Get the current max segment number.
endSegment, _, err := head.wal.LastSegmentAndOffset()
require.NoError(t, err)
require.Equal(t, 1, endSegment)
Member

I'm not sure whether the first file is 0 or 1. How about using wlog.Segments to get the first and last segment?

Comment thread tsdb/head_test.go Outdated
require.Equal(t, expectedState, *state, "read state should match written state")
}

func TestHead_FindLastSeriesIDBounded(t *testing.T) {
Member

I'd add another case here: before adding metric="B", add another sample for metric="A" in the second segment; findLastSeriesID() over both files should then return 1. After adding metric="B", the same call (scanning both files) should return 2 as the last ID.

Suggested change
func TestHead_FindLastSeriesIDBounded(t *testing.T) {
func TestHead_FindLastSeriesID(t *testing.T) {
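The extra case suggested above can be reasoned about with a toy model in which a segment is a list of records and only series records allocate IDs, so another sample for an existing series leaves the highest ID unchanged. The `record` type, the in-memory segments, and this `findLastSeriesID` signature are hypothetical simplifications, not the real WAL decoding in `tsdb/head_wal.go`.

```go
package main

import "fmt"

type recordKind int

const (
	seriesRecord recordKind = iota // creates a new series with a new ID
	sampleRecord                   // appends a sample to an existing series
)

type record struct {
	kind recordKind
	ref  uint64 // series ID the record refers to
}

// findLastSeriesID returns the highest series ID created in segments
// [from, to], both inclusive, and whether any series record was found.
// from must be >= 0; indexes past the last segment are ignored.
func findLastSeriesID(segments [][]record, from, to int) (uint64, bool) {
	var highest uint64
	found := false
	for seg := from; seg <= to && seg < len(segments); seg++ {
		for _, r := range segments[seg] {
			if r.kind == seriesRecord && (!found || r.ref > highest) {
				highest, found = r.ref, true
			}
		}
	}
	return highest, found
}

func main() {
	segments := [][]record{
		{{seriesRecord, 1}, {sampleRecord, 1}}, // metric="A" created in the first segment
		{{sampleRecord, 1}},                    // another sample for metric="A" in the second
	}
	id, _ := findLastSeriesID(segments, 0, 1)
	fmt.Println(id) // 1: a new sample does not allocate an ID

	// metric="B" is created in the second segment.
	segments[1] = append(segments[1], record{seriesRecord, 2}, record{sampleRecord, 2})
	id, _ = findLastSeriesID(segments, 0, 1)
	fmt.Println(id) // 2
}
```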

Comment thread tsdb/head_test.go Outdated
require.Equal(t, 0, state.LastWALSegment, "LastWALSegment should remain 0")
}

func TestHead_ReadSeriesState(t *testing.T) {
Member

Suggested change
func TestHead_ReadSeriesState(t *testing.T) {
func TestHead_ReadSeriesStateFile(t *testing.T) {

@RushabhMehta2005 (Contributor, Author)

@codesome I think I have fixed all the nits and made the changes you requested; please have another look. This is looking good so far, thanks for your detailed reviews.

@codesome (Member) left a comment

LGTM! Just a couple of nits and one more test case and we are ready to merge.

Comment thread tsdb/head.go Outdated
h.lastSeriesID.Store(state.LastSeriesID)
h.logger.Info("Fast startup: clean shutdown detected, restored last series ID", "last_series_id", state.LastSeriesID)
} else {
h.logger.Info("Fast startup: unclean shutdown detected, performing WAL scan", "from_segment", endAt, "to_segment", state.LastWALSegment)
Member

Suggested change
h.logger.Info("Fast startup: unclean shutdown detected, performing WAL scan", "from_segment", endAt, "to_segment", state.LastWALSegment)
h.logger.Info("Fast startup: unclean shutdown detected, performing WAL scan", "from_segment", state.LastWALSegment, "to_segment", endAt)

Comment thread tsdb/head_test.go
Comment on lines +7988 to +7991
// Scanning both files should now return 2
id, err = head.findLastSeriesID(mockState, last)
require.NoError(t, err)
require.Equal(t, uint64(2), id, "Should find ID 2 after new series was created in segment 2")
Member

Let's add one more case that duplicates these lines but with mockState.LastWALSegment = last. It should still return 2, because on an unclean shutdown the last file must still be scanned.
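A small sketch of the boundary this case pins down. It assumes, following the PR description, that segments are visited from the newest one back to the segment recorded in the state file; `scanOrder` is a hypothetical helper, not a function from the PR. Both ends are inclusive, so when `LastWALSegment == last` the newest segment is still read.

```go
package main

import "fmt"

// scanOrder lists the WAL segments an unclean-shutdown scan would visit,
// newest first, from last down to lastRecorded, both inclusive. Using
// seg >= lastRecorded (not >) keeps the newest segment in the scan when
// lastRecorded == last.
func scanOrder(lastRecorded, last int) []int {
	var order []int
	for seg := last; seg >= lastRecorded; seg-- {
		order = append(order, seg)
	}
	return order
}

func main() {
	fmt.Println(scanOrder(1, 3)) // [3 2 1]
	fmt.Println(scanOrder(3, 3)) // [3]: the last segment is still scanned
}
```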

Comment thread tsdb/head_wal.go Outdated
var highestID chunks.HeadSeriesRef
var found bool

// Read the segment forwards
Member

Suggested change
// Read the segment forwards
// Read the segment forwards.

@codesome changed the title from "tsdb: Fast startup via bounded WAL reverse scan" to "tsdb: Find the last series ID on startup from the last series id file and WAL scan" on Apr 1, 2026
@codesome merged commit a2172f9 into prometheus:main on Apr 1, 2026
37 checks passed
@RushabhMehta2005 (Contributor, Author)

Thanks for the merge! Onto the final PR of this idea.


2 participants