fix: atomic session store writes to prevent context loss on Windows#18347
fix: atomic session store writes to prevent context loss on Windows#18347steipete merged 1 commit intoopenclaw:mainfrom
Conversation
src/config/sessions/store.ts
Outdated
| if (!renamed) { | ||
| // All rename attempts failed (file persistently locked). Use | ||
| // copyFile + unlink as a last resort — copyFile overwrites the | ||
| // target's content in-place without truncating first on NTFS. | ||
| await fs.promises.copyFile(tmp, storePath); | ||
| } |
There was a problem hiding this comment.
copyFile fallback reintroduces the truncation race
The comment states that copyFile "overwrites the target's content in-place without truncating first on NTFS," but this is incorrect. Node.js fs.promises.copyFile delegates to libuv's uv_fs_copyfile, which on Windows calls CopyFileW. The CopyFileW API opens the destination with CREATE_ALWAYS disposition, which truncates the file to zero bytes before writing — the same behavior as writeFile. This means the copyFile fallback reintroduces the exact race condition this PR is meant to fix.
Since this path only fires after 3 rename failures (file persistently locked by a reader), and the read side now retries with backoff, the practical impact is reduced. But consider either:
- Logging a warning when falling back to
copyFileso the race is visible in diagnostics, or - Replacing
copyFilewith a directwriteFile(since the behavior is identical) and adding a comment acknowledging the residual race window, or - Using a different strategy for this last-resort path (e.g., waiting longer and retrying rename again).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 570:575
Comment:
**`copyFile` fallback reintroduces the truncation race**
The comment states that `copyFile` "overwrites the target's content in-place without truncating first on NTFS," but this is incorrect. Node.js `fs.promises.copyFile` delegates to libuv's `uv_fs_copyfile`, which on Windows calls `CopyFileW`. The `CopyFileW` API opens the destination with `CREATE_ALWAYS` disposition, which truncates the file to zero bytes before writing — the same behavior as `writeFile`. This means the `copyFile` fallback reintroduces the exact race condition this PR is meant to fix.
Since this path only fires after 3 rename failures (file persistently locked by a reader), and the read side now retries with backoff, the practical impact is reduced. But consider either:
1. Logging a warning when falling back to `copyFile` so the race is visible in diagnostics, or
2. Replacing `copyFile` with a direct `writeFile` (since the behavior is identical) and adding a comment acknowledging the residual race window, or
3. Using a different strategy for this last-resort path (e.g., waiting longer and retrying rename again).
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.On Windows, fs.promises.writeFile truncates the target file to 0 bytes before writing. Since loadSessionStore reads the file synchronously without holding the write lock, a concurrent read can observe the empty file, fail to parse it, and fall through to an empty store — causing the agent to lose its session context. Changes: - saveSessionStoreUnlocked (Windows path): write to a temp file first, then rename it onto the target. If rename fails due to file locking, retry 3 times with backoff, then fall back to copyFile (which overwrites in-place without truncating to 0 bytes). - loadSessionStore: on Windows, retry up to 3 times with 50ms synchronous backoff (via Atomics.wait) when the file is empty or unparseable, giving the writer time to finish. SharedArrayBuffer is allocated once and reused across retry attempts.
2dee8fc to
4ebfff3
Compare
|
Addressed @greptile-apps feedback on the \copyFile\ fallback — good catch that \CopyFileW\ uses \CREATE_ALWAYS\ on Windows, which truncates to 0 bytes just like \writeFile. Fix (4ebfff3): Removed the \copyFile\ fallback entirely. Now retries rename 5 times with increasing backoff (50-250ms). If all attempts fail, we skip the save and log a warning — the write lock ensures the next save cycle will succeed with fresh data. No truncation race possible in any code path now. |
|
The code doesn't match your description. The Windows path at line 500-512 still uses You mentioned removing the copyFile fallback and adding 5 retries with temp-file + rename, but the actual implementation skips all of that for Windows and goes straight to the problematic What's needed: Apply the same atomic write pattern that the Unix path uses (lines 514-528): This eliminates the truncation window entirely while handling Windows file locking through retries. |
|
@greptile-apps The code in the latest commit (4ebfff3) does implement the temp-file + rename with 5 retries pattern you're suggesting — there is no direct The current Windows write path is:
No truncation race in any code path. Please re-review the latest diff. |
| for (let i = 0; i < 5; i++) { | ||
| try { | ||
| await fs.promises.rename(tmp, storePath); | ||
| break; | ||
| } catch { | ||
| if (i < 4) { | ||
| await new Promise((r) => setTimeout(r, 50 * (i + 1))); | ||
| } | ||
| // Final attempt failed — skip this save. The write lock ensures | ||
| // the next save will retry with fresh data. Log for diagnostics. | ||
| if (i === 4) { | ||
| console.warn( | ||
| `[session-store] rename failed after 5 attempts: ${storePath}`, | ||
| ); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Consider tracking consecutive rename failures
If all 5 rename attempts fail repeatedly (persistent file locking), session updates are dropped with only a console warning. While the write lock will retry later, consider tracking failed saves and alerting when consecutive failures exceed a threshold.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 562:578
Comment:
Consider tracking consecutive rename failures
If all 5 rename attempts fail repeatedly (persistent file locking), session updates are dropped with only a console warning. While the write lock will retry later, consider tracking failed saves and alerting when consecutive failures exceed a threshold.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| if (raw.length === 0 && attempt < maxReadAttempts - 1) { | ||
| // File is empty — likely caught mid-write; retry after a brief pause. | ||
| Atomics.wait(retryBuf!, 0, 0, 50); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Consider checking empty file on final attempt too
On the final attempt, if raw.length === 0, the code falls through to JSON.parse("") which throws and is caught. While this works correctly, checking raw.length === 0 on all attempts (removing && attempt < maxReadAttempts - 1) would avoid the unnecessary parse error and make the intent clearer.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 180:184
Comment:
Consider checking empty file on final attempt too
On the final attempt, if `raw.length === 0`, the code falls through to `JSON.parse("")` which throws and is caught. While this works correctly, checking `raw.length === 0` on all attempts (removing `&& attempt < maxReadAttempts - 1`) would avoid the unnecessary parse error and make the intent clearer.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| // Retry up to 3 times when the file is empty or unparseable. On Windows the | ||
| // temp-file + rename write is not fully atomic: a concurrent reader can briefly | ||
| // observe a 0-byte file (between truncate and write) or a stale/locked state. | ||
| // A short synchronous backoff (50 ms via `Atomics.wait`) is enough for the | ||
| // writer to finish. |
There was a problem hiding this comment.
Clarify comment about when 0-byte observation occurs
The comment states "temp-file + rename write is not fully atomic" and readers can observe a 0-byte file, but with proper temp-file + rename (writing to a different path first), readers should never see a 0-byte target file. The 0-byte window only occurred with the old direct writeFile approach. Consider rephrasing to clarify this is mitigating edge cases from potential race windows during the rename operation itself, not from the temp-file write.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/sessions/store.ts
Line: 165:169
Comment:
Clarify comment about when 0-byte observation occurs
The comment states "temp-file + rename write is not fully atomic" and readers can observe a 0-byte file, but with proper temp-file + rename (writing to a different path first), readers should never see a 0-byte target file. The 0-byte window only occurred with the old direct `writeFile` approach. Consider rephrasing to clarify this is mitigating edge cases from potential race windows during the rename operation itself, not from the temp-file write.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.|
Reverted in d0697d5aa (reverts merge commit 94eecaa). This PR was merged accidentally, so we reverted it to restore the prior Windows session store behavior. |
Summary
Fix a race condition in saveSessionStoreUnlocked where s.promises.writeFile() truncates the target to 0 bytes before writing, allowing concurrent loadSessionStore() reads to observe an empty file and silently return an empty store — causing the agent to lose its session context.
Changes
Root Cause
\saveSessionStoreUnlocked\ wrote directly to \sessions.json, which briefly zeroes the file. \loadSessionStore\ reads synchronously without the write lock, so it can hit this window, get an empty string, fail \JSON.parse, and silently fall through to \store = {}. The session lookup then finds no entry, creates a new session, and the conversation context is lost.
Test Plan
Supersedes #17870 — squashed with review feedback from @mudrii addressed.
Greptile Summary
Fixed a critical race condition in session store writes on Windows where
writeFiletruncates the target to 0 bytes before writing, allowing concurrent reads to observe an empty file and lose session context.Key changes:
SharedArrayBuffer+Atomics.waitfor synchronous backoff in the read pathThe comment at src/config/sessions/store.ts:557-559 explicitly addresses the copyFile fallback concern from the previous thread, noting that both
writeFileandcopyFileuseCREATE_ALWAYSon Windows and would reintroduce the race. The implementation now skips the save entirely after 5 failed rename attempts, relying on the write lock to ensure the next save succeeds.Confidence Score: 4/5
Atomics.wait) could impact latency under high contention, and theSharedArrayBufferallocation per read (when retries enabled) adds overhead. The implementation explicitly avoids fallback tocopyFilewhich would reintroduce the race, though this means failed saves after 5 attempts are silently skipped (with only a console warning).Last reviewed commit: 4ebfff3