Skip to content

tsdb: Refactor Head.Init(...) for fast startup#18442

Open
RushabhMehta2005 wants to merge 6 commits intoprometheus:mainfrom
RushabhMehta2005:refactor-head-fast-startup
Open

tsdb: Refactor Head.Init(...) for fast startup#18442
RushabhMehta2005 wants to merge 6 commits intoprometheus:mainfrom
RushabhMehta2005:refactor-head-fast-startup

Conversation

@RushabhMehta2005
Copy link
Copy Markdown
Contributor

@RushabhMehta2005 RushabhMehta2005 commented Apr 3, 2026

  • This PR builds on top of the 2 previous PRs: Adding the series_state.json file, Implementing the fast startup algorithm.
  • Head.Init(...) has to be refactored to make sure it does not block on Head.loadWAL(...). The latter is launched in its own goroutine, there is a check added in Head.Close(...) as well, which makes sure the WAL loading is finished before attempting a graceful shutdown, if it hasn't, we wait for it. This makes the WAL replay run in parallel with scraping.
  • Until the WAL replay is done, we have to block queries as we have incomplete data.

Which issue(s) does the PR fix:

NA

Release notes for end users (ALL commits must be considered).

Reviewers should verify clarity and quality.

NONE

Copy link
Copy Markdown
Member

@bboreham bboreham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a plausible direction; take a look at the CI failures and see if there is anything you can clean up there.

Comment thread tsdb/head.go Outdated
Comment thread tsdb/head.go
Comment on lines +1879 to +1880
// Wait for background WAL replay to finish before shutdown.
h.walReplayWg.Wait()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a good idea? WAL replay can take ten minutes; if someone requested to shut down do we want to wait that long?

Could it be cancelable?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am waiting here for it to complete as it is the simplest thing to do. What if we kill that goroutine when Close is called? Could it lead to wal file corruption or mess up some other state? I am not sure of the answer to these questions.

Comment thread tsdb/head_wal.go Outdated
// loadChunkSnapshot replays the chunk snapshot and restores the Head state from it. If there was any error returned,
// it is the responsibility of the caller to clear the contents of the Head.
func (h *Head) loadChunkSnapshot() (int, int, map[chunks.HeadSeriesRef]*memSeries, error) {
func (h *Head) loadChunkSnapshot(multiRef map[chunks.HeadSeriesRef]chunks.HeadSeriesRef, multiRefMtx *sync.Mutex) (int, int, map[chunks.HeadSeriesRef]*memSeries, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding the new parameters to the docstring comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I am not sure if this is the right approach to make loadChunkSnapshot access multiRef (and that too, safely), I think it would be better to update the docstrings once its set in stone.

Comment thread tsdb/head.go Outdated
Comment on lines +930 to +931
if h.opts.EnableFastStartup {
state, err := h.readSeriesStateFile()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style point: this if-statement is nearly the entire function; could it be two different functions with and without fast startup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ganesh said once we are done with this we would want to remove this if entirely, making fast startup the default behaviour, hence it is kept as so for now.

@RushabhMehta2005
Copy link
Copy Markdown
Contributor Author

Thanks for the review Bryan, will fix these.

@RushabhMehta2005
Copy link
Copy Markdown
Contributor Author

It seems to be hard to make sense of these CI failures, still will look into it.

@RushabhMehta2005 RushabhMehta2005 marked this pull request as ready for review April 9, 2026 05:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants