Skip to content

tsdb: use defer to release locks in mmapHeadChunks#18422

Draft
roidelapluie wants to merge 9 commits intoprometheus:mainfrom
roidelapluie:roidelapluie/tsdb-mmap-defer-locks
Draft

tsdb: use defer to release locks in mmapHeadChunks#18422
roidelapluie wants to merge 9 commits intoprometheus:mainfrom
roidelapluie:roidelapluie/tsdb-mmap-defer-locks

Conversation

@roidelapluie
Copy link
Copy Markdown
Member

@roidelapluie roidelapluie commented Apr 1, 2026

handleChunkWriteError panics on chunk disk mapper errors, which occurs in TestDiskFillingUpAfterDisablingOOO. Without defer, the series mutex held inside mmapHeadChunks was not released during panic unwinding. A subsequent Close() call (e.g. from t.Cleanup) also calls mmapHeadChunks and deadlocks trying to re-acquire the same lock.

Extract the per-stripe and per-series iterations into dedicated helpers so that defer can release both the stripe read lock and the series mutex during panic unwinding.

Update the test comment to reflect that the deadlock is now resolved.

See #17941

Which issue(s) does the PR fix:

Release notes for end users (ALL commits must be considered).

Reviewers should verify clarity and quality.

NONE

handleChunkWriteError panics on chunk disk mapper errors, which
occurs under i386 in TestDiskFillingUpAfterDisablingOOO. Without
defer, the series mutex held inside mmapHeadChunks was not released
during panic unwinding. A subsequent Close() call (e.g. from
t.Cleanup) also calls mmapHeadChunks and deadlocks trying to
re-acquire the same lock.

Extract the per-stripe and per-series iterations into dedicated
helpers so that defer can release both the stripe read lock and
the series mutex during panic unwinding.

Update the test comment to reflect that the deadlock is now resolved.

See prometheus#17941

Signed-off-by: Julien Pivotto <[email protected]>
Signed-off-by: Julien Pivotto <[email protected]>
Signed-off-by: Julien Pivotto <[email protected]>
The test only writes ~80 samples, so the default 512MB chunk segment
pre-allocation during compaction is unnecessary. Use 1MB instead to
avoid large file allocations on constrained CI environments.

Signed-off-by: Julien Pivotto <[email protected]>
Running count=500 sequentially on Windows causes SetFileInformationByHandle
and FlushFileBuffers to hang, due to I/O pressure from the 128MB head chunk
file preallocation on Windows after hundreds of iterations. Use 3 parallel
matrix workers with count=50 each to distribute the load while preserving
coverage.

Signed-off-by: Julien Pivotto <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant