Conversation
I have a general question about this code, since I was looking at it a few weeks ago: why do we always need to use this general form of the serializer with the scratch space? For example, this shows up in macOS profiles a lot, because TLS is more expensive there, and all the manipulation of taking the serializer out of the TLS and putting it back in seems to cause some copying overhead too. From what I can tell, if the type is relatively simple (like an integer), the relative overhead of doing that is big, because the function that uses it gets invoked a lot. For background: we tried to take the TLS variable out of the inner loops and instead keep, e.g., a serializer per writer that we pass down into the relevant functions, but that is also very painful to deal with.
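The take-and-put-back pattern being described might look roughly like this sketch (the names are hypothetical, and a plain `Vec<u8>` stands in for the real rkyv serializer state):

```rust
use std::cell::Cell;

thread_local! {
    // Hypothetical stand-in for the real per-thread serializer cache.
    static SCRATCH: Cell<Option<Vec<u8>>> =
        Cell::new(Some(Vec::with_capacity(64 * 1024)));
}

fn serialize_one(value: u32) -> Vec<u8> {
    // Move the cached scratch out of the TLS slot...
    let mut scratch = SCRATCH.take().expect("scratch already in use");
    scratch.clear();
    scratch.extend_from_slice(&value.to_le_bytes());
    let out = scratch.clone(); // pretend this is the serialized output
    // ...and move it back when done. Both moves copy the handle, and on
    // platforms with slow TLS the two slot accesses themselves cost.
    SCRATCH.set(Some(scratch));
    out
}
```

For a tiny payload like a single `u32`, the TLS accesses and the two moves dominate the actual serialization work, which is the overhead being discussed.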
I didn't understand this part. Wouldn't we be able to check our own serializers? How many do we have?
```rust
fn cleared(mut self) -> Self {
    self.main.clear();
```
does clear reset to SCRATCH_SIZE?
It resets the position:

```rust
/// Resets the scratch space to its initial state.
pub fn clear(&mut self) {
    self.pos = 0;
}
```
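To answer the question concretely: no, `clear` does not reallocate `SCRATCH_SIZE` bytes. A minimal, hypothetical bump-style scratch (not the DBSP type) shows why clearing is O(1): it only rewinds the position, leaving the buffer allocated for reuse.

```rust
// Hypothetical sketch of a fixed-size bump scratch space.
struct BumpScratch {
    buf: Vec<u8>,
    pos: usize,
}

impl BumpScratch {
    fn with_capacity(cap: usize) -> Self {
        Self { buf: vec![0; cap], pos: 0 }
    }

    /// Hands out `len` bytes from the buffer, or None if it would overflow.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        if self.pos + len > self.buf.len() {
            return None; // a real impl would fall back to heap allocation
        }
        let start = self.pos;
        self.pos += len;
        Some(&mut self.buf[start..start + len])
    }

    /// Resets the scratch space to its initial state. O(1): no memory is
    /// freed or zeroed, only the position moves back to the start.
    fn clear(&mut self) {
        self.pos = 0;
    }
}
```

After `clear`, the full capacity is available again without touching the allocator.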
```rust
/// This is the amount of space we allocate as base scratch space for rkyv
/// serialization. If more is needed for a particular serialization, then we
/// fall back to [AllocScratch].
pub const SCRATCH_SIZE: usize = 65536;
```
should we move this to the top of the file?
Suggested change:

```diff
-pub const SCRATCH_SIZE: usize = 65536;
+pub const SCRATCH_SIZE: usize = 65_536;
```
crates/dbsp/src/storage/file.rs (outdated)
```diff
-let mut serializer = Serializer::new(serializer, SCRATCH.take().unwrap(), Default::default());
+let mut serializer = Serializer::new(
+    serializer,
+    SCRATCH.take().unwrap().cleared(),
```
is `cleared()` expensive to call? This function is quite hot in general
It should not be; it really just sets a few variables.
We can check every serializer, obviously, but bugs can creep in. This is a kind of manual memory management, which is always problematic.
About 150 (not all of them handwritten).

I'll see what I can do for performance. I was focusing on correctness here.
This might have been rhetorical, but I am not sure, so here is the answer. We need the scratch space and the `SharedSerializeMap` because:
OK, I've got a much bigger change that I'll submit tomorrow that should eliminate a lot of overhead. You like higher-order lifetimes, right? It's got higher-order lifetimes.

Oh boy, a new Rust feature!

Hm, I think we already use these in a few places; they're not really new.

They're not all that rare, but I do notice whenever I add them.
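For readers who haven't met them: "higher-order lifetimes" here means Rust's higher-ranked trait bounds (`for<'a> ...`), which are indeed not a new feature. A minimal sketch, unrelated to the DBSP code:

```rust
// A higher-ranked trait bound: `f` must accept a borrow of *any*
// lifetime, because `local` only exists inside this function and its
// lifetime cannot be named by the caller.
fn with_local_buffer<F>(f: F) -> usize
where
    F: for<'a> Fn(&'a [u8]) -> usize,
{
    let local = [1u8, 2, 3];
    f(&local)
}
```

Ordinary closures like `|bytes| bytes.len()` satisfy the bound; this is the same machinery that passing around a serializer with a borrowed-scratch lifetime parameter leans on.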
```rust
impl ScratchSpace for DbspScratch {
    #[inline]
    unsafe fn push_scratch(&mut self, layout: Layout) -> Result<NonNull<[u8]>, Self::Error> {
        unsafe {
```
Missing // SAFETY: comment on this unsafe block. Hard rule: every unsafe block needs a comment explaining what invariant makes it safe. Here: explain why it is valid to call push_scratch on self.main and self.fallback without further preconditions — e.g., that the caller holds the invariants required by the ScratchSpace contract.
```rust
    #[inline]
    unsafe fn pop_scratch(&mut self, ptr: NonNull<u8>, layout: Layout) -> Result<(), Self::Error> {
        unsafe {
```
Missing // SAFETY: comment. For pop_scratch, the key invariant is that ptr must have been returned by a prior push_scratch call on the same scratch space (or, in the fallback case, that we are routing to the right scratch implementation). Please document that here.
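A sketch of what the requested `// SAFETY:` comments might look like. The types below are simplified stand-ins (raw heap allocation replaces the real bump/fallback logic), and the invariants named are assumptions about the `ScratchSpace` contract, not verified against this codebase:

```rust
use std::alloc::Layout;
use std::ptr::NonNull;

/// Hypothetical inner scratch, standing in for `self.main` / `self.fallback`.
struct Inner;

impl Inner {
    /// # Safety
    /// `layout` must have non-zero size.
    unsafe fn push_scratch(&mut self, layout: Layout) -> Result<NonNull<[u8]>, ()> {
        let ptr = unsafe { std::alloc::alloc(layout) };
        let slice = std::ptr::slice_from_raw_parts_mut(ptr, layout.size());
        NonNull::new(slice).ok_or(())
    }

    /// # Safety
    /// `ptr` and `layout` must come from a prior `push_scratch` on this value.
    unsafe fn pop_scratch(&mut self, ptr: NonNull<u8>, layout: Layout) {
        unsafe { std::alloc::dealloc(ptr.as_ptr(), layout) }
    }
}

struct Outer {
    main: Inner,
}

impl Outer {
    unsafe fn push_scratch(&mut self, layout: Layout) -> Result<NonNull<[u8]>, ()> {
        // SAFETY: we forward the caller's preconditions unchanged; this
        // wrapper adds no requirements of its own, so the caller's
        // obligation (non-zero-size layout) is exactly `Inner`'s.
        unsafe { self.main.push_scratch(layout) }
    }

    unsafe fn pop_scratch(&mut self, ptr: NonNull<u8>, layout: Layout) {
        // SAFETY: `ptr` was returned by `self.main.push_scratch` with this
        // `layout` (our `push_scratch` only ever delegates to `main`), so
        // `Inner`'s contract is met.
        unsafe { self.main.pop_scratch(ptr, layout) }
    }
}
```

The point is that each comment names the specific invariant being relied on, rather than restating that the code is unsafe.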
@gz I asked for a re-review because I added a big commit that should fix the performance issues you mentioned.
Signed-off-by: Ben Pfaff <[email protected]>
A customer reported large and growing memory use that showed up in heap profiles attributed to rkyv serialization in the storage file writer. Only some of this made sense, in particular the part written to FBufs, which is data blocks that will end up in the cache. The rest was not more specifically attributed.

Some investigation showed the possibility that data is accumulating in our per-thread scratch space cache. This cache should get emptied for every use, but that depends on the rkyv serialization implementations being correct both in the common case and the error case. DBSP hits the error case frequently in practice (once per data block), because it uses errors to avoid going over block size limits. Perhaps the rkyv built-in implementations handle errors correctly regarding scratch space; I don't know whether DBSP implementations of serializers do.

If this is the problem, this commit will avoid it, by clearing the scratch space every time that we use it. It is larger than otherwise necessary because rkyv's `HeapScratch` and `FallbackScratch` don't provide any way to clear themselves, so this has to copy in their implementations.

Signed-off-by: Ben Pfaff <[email protected]>
gz
left a comment
awesome, it makes sense... I tried something similar two weeks ago but the way you did it is much more elegant :)
```diff
-fn serialize_with_flush<B, K, V, R>((batch, flush): (B, bool)) -> Vec<u8>
+fn serialize_with_flush<B, K, V, R>(
+    (batch, flush): (B, bool),
 let mut serializer_inner = None;
```
I didn't know this syntax, cool
Yeah, you can pattern match in function arguments and it's occasionally useful!
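A minimal illustration of the syntax (hypothetical function, not the DBSP one): any irrefutable pattern works in a parameter position, so a tuple can be destructured directly in the argument list.

```rust
// `(value, flush)` destructures the tuple parameter without a separate
// `let (value, flush) = pair;` binding inside the body.
fn describe_pair((value, flush): (u32, bool)) -> String {
    format!("value={value} flush={flush}")
}
```

The same trick works for struct and reference patterns, as long as the pattern can never fail to match.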
rkyv serializers can be lightweight, but serializing some types requires additional data, in the form of a ScratchSpace and SharedSerializeRegistry. These types are somewhat big (about 100 bytes), and the ScratchSpace implementation we use also allocates 64 kiB on the heap, so it's very wasteful to create and destroy them for serializing, say, a single i32. So, until now, we mainly kept one of them in a thread-local cache. This has at least two problems:

- As implemented, we were still copying the 100-byte structure in and out of the cache whenever we used it, which is wasteful.

- I'm told that Mac OS has slow thread-locals.

This commit changes the implementation in a couple of ways. The big change is that it changes our serializer from an rkyv CompositeSerializer that includes the Serializer, the ScratchSpace, and the SharedSerializeRegistry inline, to a custom struct that merely references them. Thus, there is no need to copy the 100-byte structure in and out of the cache; we just use a reference to it. This does mean that our serializer now has a generic lifetime parameter, which we have to pass around a bit, but it's not too bad.

The other change is to use the cache less often. Instead of using the cache almost everywhere, this commit changes many of the important code paths, such as the storage file writer, to allocate their own serializer data and use it directly instead of through the cache. The hope is that this will reduce overhead further, especially on OSes where thread-locals are slow.

Signed-off-by: Ben Pfaff <[email protected]>
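A hypothetical sketch of the borrow-instead-of-own design described in this commit message (all names and methods here are invented for illustration; the real serializer wraps rkyv types):

```rust
// The expensive state lives in one place, owned by the writer.
struct Scratch {
    buf: Vec<u8>, // stands in for the 64 kiB heap scratch + registry
}

// The serializer merely borrows it, which adds a lifetime parameter but
// makes construction copy-free: nothing moves in or out of a cache.
struct BorrowingSerializer<'a> {
    scratch: &'a mut Scratch,
    out: Vec<u8>,
}

impl<'a> BorrowingSerializer<'a> {
    fn new(scratch: &'a mut Scratch) -> Self {
        Self { scratch, out: Vec::new() }
    }

    fn serialize_u32(&mut self, v: u32) {
        // Use the borrowed scratch as staging, then emit.
        self.scratch.buf.clear();
        self.scratch.buf.extend_from_slice(&v.to_le_bytes());
        self.out.extend_from_slice(&self.scratch.buf);
    }

    fn into_bytes(self) -> Vec<u8> {
        self.out
    }
}
```

A writer would allocate one `Scratch` up front and construct a cheap `BorrowingSerializer` per call, so the 100-byte structure and the 64 kiB buffer are never copied or re-created on the hot path.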
Signed-off-by: feldera-bot <[email protected]>
mythical-fred
left a comment
All my blockers addressed. LGTM.
I don't see how this change could prevent a pipeline from starting, so I filed https://github.com/feldera/cloud/issues/1580 for the merge failure. Requeuing.

I requeued this because the CI failure was due to:
Describe Manual Test Plan
I tested this with the unit tests.