Skip to content

fix: atomic session store writes to prevent context loss on Windows#18347

Merged
steipete merged 1 commit intoopenclaw:mainfrom
twcwinston:fix/atomic-session-store-v2
Feb 16, 2026
Merged

fix: atomic session store writes to prevent context loss on Windows#18347
steipete merged 1 commit intoopenclaw:mainfrom
twcwinston:fix/atomic-session-store-v2

Conversation

@twcwinston
Copy link
Contributor

@twcwinston twcwinston commented Feb 16, 2026

Summary

Fix a race condition in saveSessionStoreUnlocked where s.promises.writeFile() truncates the target to 0 bytes before writing, allowing concurrent loadSessionStore() reads to observe an empty file and silently return an empty store — causing the agent to lose its session context.

Changes

  • Atomic writes (Windows): saveSessionStoreUnlocked now writes to a temp file first, then renames it onto the target (same strategy used on other platforms). If rename fails due to file locking, retries 3 times with backoff, then falls back to \copyFile\ which overwrites in-place without truncating.
  • Read-side retry: \loadSessionStore\ on Windows retries up to 3 times with 50ms synchronous backoff when the file is empty or unparseable, giving the writer time to finish. SharedArrayBuffer is allocated once and reused across attempts.

Root Cause

\saveSessionStoreUnlocked\ wrote directly to \sessions.json, which briefly zeroes the file. \loadSessionStore\ reads synchronously without the write lock, so it can hit this window, get an empty string, fail \JSON.parse, and silently fall through to \store = {}. The session lookup then finds no entry, creates a new session, and the conversation context is lost.

Test Plan

  • Verified the fix eliminates spurious empty-file parse errors
  • Confirmed agent retains session context across sustained message load
  • Existing test suite passes

Supersedes #17870 — squashed with review feedback from @mudrii addressed.

Greptile Summary

Fixed a critical race condition in session store writes on Windows where writeFile truncates the target to 0 bytes before writing, allowing concurrent reads to observe an empty file and lose session context.

Key changes:

  • Implemented atomic temp-file + rename on Windows (matching Unix behavior)
  • Added retry logic with exponential backoff (5 attempts) for rename failures due to file locking
  • Added read-side retry (3 attempts, 50ms backoff) for empty or corrupt files on Windows
  • Used SharedArrayBuffer + Atomics.wait for synchronous backoff in the read path

The comment at src/config/sessions/store.ts:557-559 explicitly addresses the copyFile fallback concern from the previous thread, noting that both writeFile and copyFile use CREATE_ALWAYS on Windows and would reintroduce the race. The implementation now skips the save entirely after 5 failed rename attempts, relying on the write lock to ensure the next save succeeds.

Confidence Score: 4/5

  • Safe to merge with minor considerations around retry behavior under extreme contention
  • The fix correctly addresses the root cause (truncation race) by using atomic temp-file + rename. The retry mechanisms on both read and write sides provide resilience. However, the synchronous blocking in the read path (Atomics.wait) could impact latency under high contention, and the SharedArrayBuffer allocation per read (when retries enabled) adds overhead. The implementation explicitly avoids fallback to copyFile which would reintroduce the race, though this means failed saves after 5 attempts are silently skipped (with only a console warning).
  • No files require special attention beyond understanding the Windows-specific retry behavior

Last reviewed commit: 4ebfff3

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +570 to +575
if (!renamed) {
// All rename attempts failed (file persistently locked). Use
// copyFile + unlink as a last resort — copyFile overwrites the
// target's content in-place without truncating first on NTFS.
await fs.promises.copyFile(tmp, storePath);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyFile fallback reintroduces the truncation race

The comment states that copyFile "overwrites the target's content in-place without truncating first on NTFS," but this is incorrect. Node.js fs.promises.copyFile delegates to libuv's uv_fs_copyfile, which on Windows calls CopyFileW. The CopyFileW API opens the destination with CREATE_ALWAYS disposition, which truncates the file to zero bytes before writing — the same behavior as writeFile. This means the copyFile fallback reintroduces the exact race condition this PR is meant to fix.

Since this path only fires after 3 rename failures (file persistently locked by a reader), and the read side now retries with backoff, the practical impact is reduced. But consider either:

  1. Logging a warning when falling back to copyFile so the race is visible in diagnostics, or
  2. Replacing copyFile with a direct writeFile (since the behavior is identical) and adding a comment acknowledging the residual race window, or
  3. Using a different strategy for this last-resort path (e.g., waiting longer and retrying rename again).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 570:575

Comment:
**`copyFile` fallback reintroduces the truncation race**

The comment states that `copyFile` "overwrites the target's content in-place without truncating first on NTFS," but this is incorrect. Node.js `fs.promises.copyFile` delegates to libuv's `uv_fs_copyfile`, which on Windows calls `CopyFileW`. The `CopyFileW` API opens the destination with `CREATE_ALWAYS` disposition, which truncates the file to zero bytes before writing — the same behavior as `writeFile`. This means the `copyFile` fallback reintroduces the exact race condition this PR is meant to fix.

Since this path only fires after 3 rename failures (file persistently locked by a reader), and the read side now retries with backoff, the practical impact is reduced. But consider either:
1. Logging a warning when falling back to `copyFile` so the race is visible in diagnostics, or
2. Replacing `copyFile` with a direct `writeFile` (since the behavior is identical) and adding a comment acknowledging the residual race window, or
3. Using a different strategy for this last-resort path (e.g., waiting longer and retrying rename again).

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

On Windows, fs.promises.writeFile truncates the target file to 0 bytes
before writing. Since loadSessionStore reads the file synchronously
without holding the write lock, a concurrent read can observe the empty
file, fail to parse it, and fall through to an empty store — causing the
agent to lose its session context.

Changes:
- saveSessionStoreUnlocked (Windows path): write to a temp file first,
  then rename it onto the target. If rename fails due to file locking,
  retry 3 times with backoff, then fall back to copyFile (which
  overwrites in-place without truncating to 0 bytes).
- loadSessionStore: on Windows, retry up to 3 times with 50ms
  synchronous backoff (via Atomics.wait) when the file is empty or
  unparseable, giving the writer time to finish. SharedArrayBuffer is
  allocated once and reused across retry attempts.
@twcwinston twcwinston force-pushed the fix/atomic-session-store-v2 branch from 2dee8fc to 4ebfff3 Compare February 16, 2026 18:00
@twcwinston
Copy link
Contributor Author

Addressed @greptile-apps feedback on the \copyFile\ fallback — good catch that \CopyFileW\ uses \CREATE_ALWAYS\ on Windows, which truncates to 0 bytes just like \writeFile.

Fix (4ebfff3): Removed the \copyFile\ fallback entirely. Now retries rename 5 times with increasing backoff (50-250ms). If all attempts fail, we skip the save and log a warning — the write lock ensures the next save cycle will succeed with fresh data. No truncation race possible in any code path now.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 16, 2026

The code doesn't match your description. The Windows path at line 500-512 still uses fs.promises.writeFile(storePath, json, "utf-8") directly — which truncates to 0 bytes and reintroduces the race condition this PR aims to fix.

You mentioned removing the copyFile fallback and adding 5 retries with temp-file + rename, but the actual implementation skips all of that for Windows and goes straight to the problematic writeFile().

What's needed:

Apply the same atomic write pattern that the Unix path uses (lines 514-528):

  // Windows: use atomic temp-file + rename strategy with retries
  if (process.platform === "win32") {
    const tmp = `${storePath}.${process.pid}.${crypto.randomUUID()}.tmp`;
    let lastErr: unknown;
    const maxRetries = 5;
    
    try {
      await fs.promises.writeFile(tmp, json, { mode: 0o600, encoding: "utf-8" });
      
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          await fs.promises.rename(tmp, storePath);
          await fs.promises.chmod(storePath, 0o600);
          return; // Success
        } catch (err) {
          lastErr = err;
          if (attempt < maxRetries) {
            const backoffMs = 50 * attempt; // 50, 100, 150, 200, 250
            await new Promise(resolve => setTimeout(resolve, backoffMs));
          }
        }
      }
      
      // All retries failed - log warning and skip save
      log.warn("failed to rename temp file after retries; skipping save", {
        storePath,
        retries: maxRetries,
        error: lastErr,
      });
      return;
    } finally {
      await fs.promises.rm(tmp, { force: true });
    }
  }

This eliminates the truncation window entirely while handling Windows file locking through retries.

@twcwinston
Copy link
Contributor Author

@greptile-apps The code in the latest commit (4ebfff3) does implement the temp-file + rename with 5 retries pattern you're suggesting — there is no direct writeFile(storePath, ...) on the Windows path anymore. The fallback to copyFile was also removed. You may be reviewing a cached version of the diff.

The current Windows write path is:

  1. Write to temp file
  2. Retry rename up to 5 times with increasing backoff (50-250ms)
  3. If all renames fail, log a warning and skip the save (next cycle picks it up)
  4. Clean up temp file in finally block

No truncation race in any code path. Please re-review the latest diff.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +562 to +578
for (let i = 0; i < 5; i++) {
try {
await fs.promises.rename(tmp, storePath);
break;
} catch {
if (i < 4) {
await new Promise((r) => setTimeout(r, 50 * (i + 1)));
}
// Final attempt failed — skip this save. The write lock ensures
// the next save will retry with fresh data. Log for diagnostics.
if (i === 4) {
console.warn(
`[session-store] rename failed after 5 attempts: ${storePath}`,
);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider tracking consecutive rename failures

If all 5 rename attempts fail repeatedly (persistent file locking), session updates are dropped with only a console warning. While the write lock will retry later, consider tracking failed saves and alerting when consecutive failures exceed a threshold.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 562:578

Comment:
Consider tracking consecutive rename failures

If all 5 rename attempts fail repeatedly (persistent file locking), session updates are dropped with only a console warning. While the write lock will retry later, consider tracking failed saves and alerting when consecutive failures exceed a threshold.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +180 to +184
if (raw.length === 0 && attempt < maxReadAttempts - 1) {
// File is empty — likely caught mid-write; retry after a brief pause.
Atomics.wait(retryBuf!, 0, 0, 50);
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider checking empty file on final attempt too

On the final attempt, if raw.length === 0, the code falls through to JSON.parse("") which throws and is caught. While this works correctly, checking raw.length === 0 on all attempts (removing && attempt < maxReadAttempts - 1) would avoid the unnecessary parse error and make the intent clearer.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 180:184

Comment:
Consider checking empty file on final attempt too

On the final attempt, if `raw.length === 0`, the code falls through to `JSON.parse("")` which throws and is caught. While this works correctly, checking `raw.length === 0` on all attempts (removing `&& attempt < maxReadAttempts - 1`) would avoid the unnecessary parse error and make the intent clearer.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +165 to +169
// Retry up to 3 times when the file is empty or unparseable. On Windows the
// temp-file + rename write is not fully atomic: a concurrent reader can briefly
// observe a 0-byte file (between truncate and write) or a stale/locked state.
// A short synchronous backoff (50 ms via `Atomics.wait`) is enough for the
// writer to finish.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarify comment about when 0-byte observation occurs

The comment states "temp-file + rename write is not fully atomic" and readers can observe a 0-byte file, but with proper temp-file + rename (writing to a different path first), readers should never see a 0-byte target file. The 0-byte window only occurred with the old direct writeFile approach. Consider rephrasing to clarify this is mitigating edge cases from potential race windows during the rename operation itself, not from the temp-file write.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 165:169

Comment:
Clarify comment about when 0-byte observation occurs

The comment states "temp-file + rename write is not fully atomic" and readers can observe a 0-byte file, but with proper temp-file + rename (writing to a different path first), readers should never see a 0-byte target file. The 0-byte window only occurred with the old direct `writeFile` approach. Consider rephrasing to clarify this is mitigating edge cases from potential race windows during the rename operation itself, not from the temp-file write.

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

@steipete steipete merged commit 94eecaa into openclaw:main Feb 16, 2026
4 checks passed
@sebslight
Copy link
Member

Reverted in d0697d5aa (reverts merge commit 94eecaa).

This PR was merged accidentally, so we reverted it to restore the prior Windows session store behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants