remote_write: implement restart from segment-based savepoint by x1unix · Pull Request #18485 · prometheus/prometheus

x1unix · 2026-04-08T17:24:26Z

Which issue(s) does the PR fix:

This PR implements a mechanism for remote write to track and resume from the last WAL segment.

This PR is based on the proposal prometheus/proposals#72.

Release notes for end users (ALL commits must be considered).

Reviewers should verify clarity and quality.

[ENHANCEMENT] Remote write: Add segment-based savepoint support behind the `remote-write-savepoint` feature flag. When enabled, remote write periodically persists the current WAL segment for each destination to a savepoint file, allowing replay from the last saved segment on restart instead of skipping undelivered samples.
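
A minimal sketch of what persisting and reloading such a savepoint could look like. The file name, the JSON layout, and `LoadSavepoint` are assumptions made for illustration; only the `Savepoint` type and its `Save(dir)` method appear in this PR's diff, and their real implementation may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// savepointFile is a hypothetical file name, not taken from the PR.
const savepointFile = "remote_write_savepoint.json"

// Savepoint records, per remote write destination, the WAL segment to
// resume from. The map layout is an assumption.
type Savepoint struct {
	Segments map[string]int `json:"segments"`
}

// Save writes the savepoint to a temporary file and renames it into place,
// so a crash in the middle of a write does not leave a truncated file.
func (sp Savepoint) Save(dir string) error {
	data, err := json.Marshal(sp)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, savepointFile+".tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, savepointFile))
}

// LoadSavepoint reads the savepoint from dir. A missing file is not an
// error: it yields an empty savepoint, i.e. "no saved position".
func LoadSavepoint(dir string) (Savepoint, error) {
	sp := Savepoint{Segments: map[string]int{}}
	data, err := os.ReadFile(filepath.Join(dir, savepointFile))
	if errors.Is(err, os.ErrNotExist) {
		return sp, nil
	}
	if err != nil {
		return sp, err
	}
	err = json.Unmarshal(data, &sp)
	return sp, err
}

func main() {
	dir, err := os.MkdirTemp("", "savepoint-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	sp := Savepoint{Segments: map[string]int{"remote-a": 42}}
	if err := sp.Save(dir); err != nil {
		panic(err)
	}
	loaded, err := LoadSavepoint(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println("resume remote-a from segment", loaded.Segments["remote-a"])
}
```

Writing through a temporary file and `os.Rename` is one common way to avoid a half-written file after a crash; whether the PR does the same is not visible in the hunks quoted here.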

Signed-off-by: x1unix <[email protected]>

kgeckhart

You're on the right track, but there are a few things to consider with the implementation.

kgeckhart · 2026-04-08T19:16:11Z

+	if startSegment >= 0 {
+		t.watcher.SetStartSegment(startSegment)
+	}



Is it necessary to do this as a separate check vs passing the param and letting the watcher do it?

kgeckhart · 2026-04-08T19:35:39Z

@@ -513,6 +539,7 @@ func (w *Watcher) garbageCollectSeries(segmentNum int) error {
 // Also used with readCheckpoint - implements segmentReadFn.
 // TODO(bwplotka): Rename tail to !onlySeries; extremely confusing and easy to miss.


I think we should do this TODO. I was also burned by this when reading this code in the past. I renamed it during my hacking (grafana/prometheus@staleness_disabling_v3.4.2...kgeckhart:prometheus:kgeckhart/replay-hacking#diff-0ab108a20060eb76d034c4fd1a9c5112cf981141f870d0211b1e73faead7f888L335) for very similar reasons.

kgeckhart · 2026-04-08T19:38:43Z

 // Also used with readCheckpoint - implements segmentReadFn.
 // TODO(bwplotka): Rename tail to !onlySeries; extremely confusing and easy to miss.
 func (w *Watcher) readSegment(r *LiveReader, segmentNum int, tail bool) error {
+	replay := w.startSegment >= 0


Is this true? We would only be replaying if `startSegment >= 0` and `currentSegment < startSegment`.

kgeckhart · 2026-04-08T19:40:38Z

+	var currentSegment int
+	if w.startSegment < 0 {
+		currentSegment, err = w.findSegmentForIndex(checkpointIndex)
+	} else {
+		// Respect checkpoint if it's ahead of the savepoint
+		// (segments before the checkpoint have been compacted away).
+		startIdx := max(w.startSegment, checkpointIndex)
+		currentSegment, err = w.findSegmentForIndex(startIdx)
+	}


IIUC this is going to move the current segment ahead based on the starting segment, which we don't want to do. When we start up we need to read everything to load the queue manager caches. We need to control the cut-over point between reading to load the cache and reading to send data.
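
One way to picture the cut-over described above, as a sketch only: every segment after the checkpoint is still read, and the savepoint only decides from which segment on samples are sent instead of just series being loaded. `planStartup` and `segmentRead` are hypothetical names, not part of the watcher, and the no-savepoint branch assumes that only the newest segment is tailed for samples.

```go
package main

import "fmt"

// segmentRead describes how one WAL segment would be read at startup.
type segmentRead struct {
	Segment    int
	OnlySeries bool // true: only series records are used, to fill caches
}

// planStartup is a hypothetical helper. first is the first segment after
// the checkpoint, last is the newest segment, and startSegment is the
// savepoint (-1 when there is none).
func planStartup(first, last, startSegment int) []segmentRead {
	var plan []segmentRead
	for seg := first; seg <= last; seg++ {
		// Without a savepoint, assume only the newest segment is tailed.
		onlySeries := seg < last
		if startSegment >= 0 {
			// With a savepoint, segments before it only load caches;
			// the savepoint segment and everything after it send data.
			onlySeries = seg < startSegment
		}
		plan = append(plan, segmentRead{Segment: seg, OnlySeries: onlySeries})
	}
	return plan
}

func main() {
	for _, r := range planStartup(3, 6, 5) {
		fmt.Printf("segment %d onlySeries=%v\n", r.Segment, r.OnlySeries)
	}
}
```

Note that the read always starts at `first`, so a savepoint never moves the starting segment forward; it only changes the flag per segment.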

kgeckhart · 2026-04-08T19:50:02Z

+func (rws *WriteStorage) persistSavepoint() {
+	rws.mtx.Lock()
+	sp := rws.collectSavepoint()
+	rws.mtx.Unlock()
+
+	if err := sp.Save(rws.dir); err != nil {
+		rws.logger.Error("Failed to persist remote write savepoint", "err", err)
+	}
+}
+
+// persistSavepointLocked persists the savepoint while the mutex is already held.
+func (rws *WriteStorage) persistSavepointLocked() {
+	sp := rws.collectSavepoint()
+	if err := sp.Save(rws.dir); err != nil {
+		rws.logger.Error("Failed to persist remote write savepoint", "err", err)
+	}
+}


Probably worth a comment about why both of these exist. IIUC persistSavepointLocked is used for shutdown, where the mutex is already held for the length of shutdown, and persistSavepoint is more narrowly scoped lock-wise.

x1unix · 2026-04-10

Simplified this part a bit in 555ef94.
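
For readers unfamiliar with the pattern discussed in this thread, here is a small sketch of a locking and a "Locked" variant of the same operation, with the kind of comments the review asks for. The `store` type and its fields are hypothetical stand-ins, not the PR's `WriteStorage`.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a type guarded by a mutex.
type store struct {
	mtx      sync.Mutex
	segments map[string]int // guarded by mtx
	// write stands in for writing the savepoint file to disk.
	write func(snapshot map[string]int)
}

// snapshot copies the current positions. Callers must hold s.mtx.
func (s *store) snapshot() map[string]int {
	out := make(map[string]int, len(s.segments))
	for k, v := range s.segments {
		out[k] = v
	}
	return out
}

// persist holds the lock only while copying, so the slower write does not
// block other users of the mutex.
func (s *store) persist() {
	s.mtx.Lock()
	snap := s.snapshot()
	s.mtx.Unlock()
	s.write(snap)
}

// persistLocked is for callers that already hold s.mtx, such as close.
// Calling persist from there would deadlock, because sync.Mutex is not
// reentrant.
func (s *store) persistLocked() {
	s.write(s.snapshot())
}

// close holds the lock for the whole shutdown and persists one last time.
func (s *store) close() {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.persistLocked()
}

func main() {
	s := &store{
		segments: map[string]int{"remote-a": 12},
		write:    func(snap map[string]int) { fmt.Println("saved:", snap) },
	}
	s.persist()
	s.close()
}
```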

kgeckhart · 2026-04-08T20:01:53Z

+
+// collectSavepoint updates the savepoint from current queue positions and returns a copy.
+// Must be called with rws.mtx held.
+func (rws *WriteStorage) collectSavepoint() Savepoint {


I'm not a fan of the side effect where we update rws and also return the new Savepoint. I think we should decouple this: persistSavepointLocked doesn't need to update the rws value because rws is being closed. persistSavepoint does need to update it, but if that is the only place it happens, it can be done without holding the lock, because the value is only touched by persistSavepoint, which won't be called in parallel.

x1unix · 2026-04-10

Addressed together with the previous item.
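
A sketch of the decoupling suggested in this thread, under stated assumptions: collecting the savepoint is a pure function of the queue positions, and only the periodic path stores the result. All names below (`position`, `writeStorage`, `persistPeriodic`, `persistOnClose`) are hypothetical and do not reflect the code in 555ef94. Reading the positions would still need whatever locking protects the queues; that part is left out here.

```go
package main

import "fmt"

// Savepoint maps a destination to its WAL segment (layout assumed).
type Savepoint map[string]int

// position is a hypothetical view of one queue's current segment.
type position struct {
	destination string
	segment     int
}

// collectSavepoint builds a savepoint and has no side effects.
func collectSavepoint(positions []position) Savepoint {
	sp := make(Savepoint, len(positions))
	for _, p := range positions {
		sp[p.destination] = p.segment
	}
	return sp
}

type writeStorage struct {
	positions []position
	last      Savepoint // only touched by persistPeriodic
}

// persistPeriodic saves the savepoint and remembers it afterwards.
func (ws *writeStorage) persistPeriodic(save func(Savepoint) error) error {
	sp := collectSavepoint(ws.positions)
	if err := save(sp); err != nil {
		return err
	}
	ws.last = sp
	return nil
}

// persistOnClose saves the savepoint without updating ws.last, since the
// storage is going away.
func (ws *writeStorage) persistOnClose(save func(Savepoint) error) error {
	return save(collectSavepoint(ws.positions))
}

func main() {
	ws := &writeStorage{positions: []position{{destination: "remote-a", segment: 4}}}
	save := func(sp Savepoint) error {
		fmt.Println("saving", sp)
		return nil
	}
	if err := ws.persistPeriodic(save); err != nil {
		panic(err)
	}
	if err := ws.persistOnClose(save); err != nil {
		panic(err)
	}
}
```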


x1unix added 3 commits April 8, 2026 13:19

feat: add savepoint definition

c783494

Signed-off-by: x1unix <[email protected]>

feat: accept start segment

7de2c33

Signed-off-by: x1unix <[email protected]>

feat: load and persist savepoint

dab0df6

Signed-off-by: x1unix <[email protected]>

kgeckhart reviewed Apr 8, 2026


x1unix added 6 commits April 9, 2026 16:25

fix: unecessary guard

1330341

Signed-off-by: x1unix <[email protected]>

fix: restructure checkpoint save

555ef94

Signed-off-by: x1unix <[email protected]>

fix: always load all segments

8c41af5

Signed-off-by: x1unix <[email protected]>

feat: rename tail to onlySeries

3e7cd06

Signed-off-by: x1unix <[email protected]>

feat: hide savepoint behind feature-flag

7861ead

Signed-off-by: x1unix <[email protected]>

feat: add test helpers

230149a

Signed-off-by: x1unix <[email protected]>

x1unix force-pushed the x1unix/feat/rw-savepoint branch from e7f9a72 to eabd9de on April 13, 2026 14:08

feat: savepoint unit tests

fe0e5e1

Signed-off-by: x1unix <[email protected]>

x1unix force-pushed the x1unix/feat/rw-savepoint branch from eabd9de to fe0e5e1 on April 13, 2026 15:43

x1unix added 8 commits April 13, 2026 14:39

feat: savepoint e2e tests

883fb0e

Signed-off-by: x1unix <[email protected]>

feat: add watcher start segment test

55efe6b

Signed-off-by: x1unix <[email protected]>

feat: add optional parameter to ApplyConfig to be able to override client

4843417

Signed-off-by: x1unix <[email protected]>

feat: add e2e tests

3596ec5

Signed-off-by: x1unix <[email protected]>

fix: golangci-lint

96abe6e

Signed-off-by: x1unix <[email protected]>

chore: restart CI

124055b

Signed-off-by: x1unix <[email protected]>

feat: add savepoint feature flag

f823b96

Signed-off-by: x1unix <[email protected]>

feat: add feature flag docs

645a55f

Signed-off-by: x1unix <[email protected]>

x1unix changed the title ~~feat(remove_write): implement restart from segment-based savepoint~~ feat(remote_write): implement restart from segment-based savepoint Apr 15, 2026

x1unix changed the title ~~feat(remote_write): implement restart from segment-based savepoint~~ remote_write: implement restart from segment-based savepoint Apr 15, 2026

x1unix wants to merge 18 commits into prometheus:main from x1unix:x1unix/feat/rw-savepoint
