Conversation
ryzhyk
left a comment
There was a problem hiding this comment.
Looks like a no-brainer. Thank you!
|
The merge failed running the unit tests: I was able to reproduce this locally with The issue appears to be related to merger threads and in particular the variables set up by fn scope_tokio_merger_locals<F>(
worker_index: usize,
buffer_cache: Arc<BufferCache>,
slab_allocator: Arc<FBufSlabs>,
future: F,
) -> impl Future<Output = F::Output>
where
F: Future,
{
TOKIO_WORKER_INDEX.scope(
worker_index,
TOKIO_BUFFER_CACHE.scope(buffer_cache, TOKIO_FBUF_SLABS.scope(slab_allocator, future)),
)
} |
|
The panic comes inside Drop for ImmutableFileRef, which atempts to evict the file being dropped from the buffer cache. But ImmutableFileRef doesn't have a reference to the BufferCache, it only has a I guess what's happening is that a merger thread is opening a Reader and taking an Arc to it, and that occasionally the last reference to that Arc is somehow dropped outside the merger task but still in a thread created by the merger (and therefore marked as ThreadType::Background). That would naturally produce the panic in question. |
|
One option would be to drop the attempt to evict data in the ImmutableFileRef drop implementation. I added that because I guessed it would help, but I do not have any measurements demonstrating it. |
|
Another solution would be to give ImmutableFileRef a direct handle to its buffer cache rather than a function to call to get one. The status quo is there because buffer caches were (and often are) worker- or thread-specific. But then, at least for the thread-specific case, we'd have to figure out how or when to switch buffer caches. If we had only a single buffer cache for the runtime, there would not be a problem. |
|
I guess the question is how can this happen outside of one of the tasks we spawn? We terminate the tokio runtime when killing the DBSP runtime like this: // Terminate the tokio merger runtime.
if let Some(tokio_merger_runtime) = self
.runtime
.inner()
.tokio_merger_runtime
.lock()
.unwrap()
.take()
{
// Block until all running merger tasks have yielded. At this point, the tokio runtime will have shut down automatically.
drop(tokio_merger_runtime);
}This is probably when this happens. But why? |
Since we don't know if it helps we could just drop this 'feature'. I was questioning this choice when we worked on FBuf allocation. If we trust the eviction policy to be "good enough" it should be fine to just skip the step. |
|
Destructors are always trouble. |
`RefCell` is relatively expensive because it maintains a borrow counter and associated lock-like logic. `Cell` is just a transparent wrapper around the inner type, so that it uses less time and space. Signed-off-by: Ben Pfaff <[email protected]>
Sometimes, running the unit tests failed with: ``` panicked at crates/dbsp/src/circuit/runtime.rs:972:64: cannot access a task-local storage value without setting it first thread panicked while processing panic. aborting. Aborted (core dumped) ``` The panic comes inside Drop for ImmutableFileRef, which attempts to evict the file being dropped from the buffer cache. But ImmutableFileRef doesn't have a reference to the BufferCache, it only has a fn() -> Arc<BufferCache> that it can call to get a buffer cache. This fn is actually Runtime::buffer_cache, which is what's panicking. The problem is that tokio task shutdown does not provide values for task-local variables, which is what stores the buffer cache. This commit avoids the problem by allowing the buffer cache to be unavailable for Drop in ImmutableFileRef. This requires some code refactoring so that the cache fn, and Runtime::buffer_cache(), return an Option. Signed-off-by: Ben Pfaff <[email protected]>

RefCellis relatively expensive because it maintains a borrow counter and associated lock-like logic.Cellis just a transparent wrapper around the inner type, so that it uses less time and space.Describe Manual Test Plan
I ran the unit tests.