[dbsp] Clear rkyv scratch space from one serialization to the next. #5919

Merged

blp merged 4 commits into main from scratch on Mar 27, 2026
Conversation

@blp
Member

@blp blp commented Mar 25, 2026

A customer reported large and growing memory use that showed up in heap profiles attributed to rkyv serialization in the storage file writer. Only some of this made sense, in particular the part written to FBufs, which consists of data blocks that will end up in the cache. The rest was not more specifically attributed.

Some investigation showed the possibility that data is accumulating in our per-thread scratch space cache. This cache should get emptied for every use, but that depends on the rkyv serialization implementations being correct both in the common case and the error case. DBSP hits the error case frequently in practice (once per data block), because it uses errors to avoid going over block size limits. Perhaps the rkyv built-in implementations handle errors correctly regarding scratch space; I don't know whether DBSP implementations of serializers do.

If this is the problem, this commit will avoid it, by clearing the scratch space every time that we use it. The commit is larger than otherwise necessary because rkyv's HeapScratch and FallbackScratch don't provide any way to clear themselves, so this has to copy in their implementations.
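The idea of the fix can be sketched without rkyv. Below is a hypothetical, minimal bump-style scratch space (the names, size, and API are illustrative stand-ins, not DBSP's or rkyv's actual types) whose `clear` is O(1): it just resets the allocation position, so the backing buffer is reused without reallocating and without leaking space from a failed serialization:

```rust
// Illustrative sketch, not the actual DBSP code: a bump-allocated scratch
// buffer. A failed serialization can leave `pos` advanced; `clear` resets
// it before the next use, which is the behavior this PR adds.
struct HeapScratch {
    buf: Vec<u8>, // backing storage, allocated once
    pos: usize,   // bump pointer into `buf`
}

impl HeapScratch {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![0; capacity], pos: 0 }
    }

    /// Reserve `len` bytes, or fail if the buffer is exhausted.
    /// (A fallback scratch would route the failure to a heap allocator.)
    fn push(&mut self, len: usize) -> Option<usize> {
        if self.pos + len > self.buf.len() {
            return None;
        }
        let start = self.pos;
        self.pos += len;
        Some(start)
    }

    /// Resets the scratch space to its initial state; O(1), no deallocation.
    fn clear(&mut self) {
        self.pos = 0;
    }
}

fn main() {
    let mut s = HeapScratch::new(16);
    assert_eq!(s.push(10), Some(0));
    assert_eq!(s.push(10), None); // exhausted mid-serialization
    s.clear();                    // reset before the next serialization
    assert_eq!(s.push(10), Some(0));
    println!("ok");
}
```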

Describe Manual Test Plan

I tested this with the unit tests.

@blp blp requested a review from ryzhyk March 25, 2026 19:00
@blp blp self-assigned this Mar 25, 2026
@blp blp added bug Something isn't working DBSP core Related to the core DBSP library storage Persistence for internal state in DBSP operators rust Pull requests that update Rust code user-reported Reported by a user or customer labels Mar 25, 2026
@ryzhyk ryzhyk requested a review from gz March 25, 2026 19:06
@gz
Contributor

gz commented Mar 25, 2026

I have a general question about this code, because I was looking at it a few weeks ago:

why do we always need to use this (general form of the) serializer with the scratch space? E.g., we see this thing show up in profiles a lot on macOS, because TLS is more expensive there, and it does all this manipulation of taking it out of the TLS and putting it back in (which seems to cause some copying overhead too). From what I can tell, if the type is relatively simple (like an integer), the relative overhead of doing that is big, because the function that uses it gets invoked a lot.

Background is that we tried to take this out of the inner loops (the TLS variable) and have, e.g., a serializer per writer that we pass down into the relevant functions, but it's also very painful to deal with.

@gz
Contributor

gz commented Mar 25, 2026

because it uses errors to avoid going over block size limits. Perhaps the rkyv built-in implementations handle errors correctly regarding scratch space; I don't know whether DBSP implementations of serializers do.

I didn't understand this part; wouldn't we be able to check our own serializers? How many do we have?

}

fn cleared(mut self) -> Self {
    self.main.clear();
Contributor

does clear reset to SCRATCH_SIZE?

Member Author

It resets the position:

    /// Resets the scratch space to its initial state.
    pub fn clear(&mut self) {
        self.pos = 0;
    }

/// This is the amount of space we allocate as base scratch space for rkyv
/// serialization. If more is needed for a particular serialization, then we
/// fall back to [AllocScratch].
pub const SCRATCH_SIZE: usize = 65536;
Contributor

should we move this to the top of the file?

/// This is the amount of space we allocate as base scratch space for rkyv
/// serialization. If more is needed for a particular serialization, then we
/// fall back to [AllocScratch].
pub const SCRATCH_SIZE: usize = 65536;
Contributor

Suggested change
pub const SCRATCH_SIZE: usize = 65536;
pub const SCRATCH_SIZE: usize = 65_536;

let mut serializer = Serializer::new(serializer, SCRATCH.take().unwrap(), Default::default());
let mut serializer = Serializer::new(
    serializer,
    SCRATCH.take().unwrap().cleared(),
Contributor

is cleared() expensive to call? this function is quite hot in general

Member Author

It should not be; it really just sets a few variables.

@blp
Member Author

blp commented Mar 25, 2026

because it uses errors to avoid going over block size limits. Perhaps the rkyv built-in implementations handle errors correctly regarding scratch space; I don't know whether DBSP implementations of serializers do.

I didn't understand this part; wouldn't we be able to check our own serializers?

We can check every serializer, obviously, but bugs can creep in.

This is a kind of manual memory management, which is always problematic.

How many do we have?

About 150 (not all of them handwritten).

@blp
Member Author

blp commented Mar 25, 2026

I have a general question about this code, because I was looking at it a few weeks ago:

why do we always need to use this (general form of the) serializer with the scratch space? E.g., we see this thing show up in profiles a lot on macOS, because TLS is more expensive there, and it does all this manipulation of taking it out of the TLS and putting it back in (which seems to cause some copying overhead too). From what I can tell, if the type is relatively simple (like an integer), the relative overhead of doing that is big, because the function that uses it gets invoked a lot.

Background is that we tried to take this out of the inner loops (the TLS variable) and have, e.g., a serializer per writer that we pass down into the relevant functions, but it's also very painful to deal with.

I'll see what I can do for performance. I was focusing on correctness here.

@blp
Member Author

blp commented Mar 25, 2026

I have a general question about this code because I was looking at it a few weeks ago:
why do we always need to use this (general form) of the serializer with the scratch space.

This might have been rhetorical, but I am not sure. Here is the answer. We need the scratch space and the SharedSerializeMap because:

  • Many common types need scratch space to implement Serialize, e.g. see Vec.
  • Some types we use need a SharedSerializeMap, e.g. Arc.
  • If we choose a type for our Serializer that doesn't implement those traits, then our SerializeDyn trait won't be implemented for them, because it is auto-implemented for types that implement ArchivedDBData, which in turn requires the type to implement Serialize<Serializer>.
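The trait-bound chain above can be illustrated with a dependency-free sketch. The trait and type names below are stand-ins, not rkyv's real API; the point is that Vec-like types only implement serialization for serializers that provide scratch space, Arc-like types only for serializers that provide a shared-pointer registry, so a stripped-down serializer type would lose those impls:

```rust
// Illustrative sketch of the bound structure, with made-up trait names.
use std::sync::Arc;

trait ScratchSpace {
    fn scratch(&mut self, len: usize);
}
trait SharedRegistry {
    fn register(&mut self, addr: usize);
}

trait SerializeWith<S> {
    fn serialize(&self, s: &mut S);
}

// A Vec-like type demands scratch space from any serializer it works with.
impl<S: ScratchSpace> SerializeWith<S> for Vec<u32> {
    fn serialize(&self, s: &mut S) {
        s.scratch(self.len() * 4);
    }
}

// An Arc-like type demands a registry, so each shared target is written once.
impl<S: SharedRegistry> SerializeWith<S> for Arc<u32> {
    fn serialize(&self, s: &mut S) {
        s.register(Arc::as_ptr(self) as usize);
    }
}

// Only a serializer implementing *both* traits can serialize both types.
#[derive(Default)]
struct FullSerializer {
    scratch_used: usize,
    seen: Vec<usize>,
}
impl ScratchSpace for FullSerializer {
    fn scratch(&mut self, len: usize) {
        self.scratch_used += len;
    }
}
impl SharedRegistry for FullSerializer {
    fn register(&mut self, addr: usize) {
        if !self.seen.contains(&addr) {
            self.seen.push(addr);
        }
    }
}

fn main() {
    let mut s = FullSerializer::default();
    vec![1u32, 2, 3].serialize(&mut s);
    let shared = Arc::new(7u32);
    shared.serialize(&mut s);
    Arc::clone(&shared).serialize(&mut s); // same target, registered once
    assert_eq!(s.scratch_used, 12);
    assert_eq!(s.seen.len(), 1);
}
```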

@blp
Member Author

blp commented Mar 26, 2026

OK, I've got a much bigger change that I'll submit tomorrow that should eliminate a lot of overhead.

You like higher-order lifetimes, right? It's got higher-order lifetimes.

@mihaibudiu
Contributor

Oh boy, a new Rust feature!

@mihaibudiu
Contributor

Hm, I think we already use this in a few places, it's not really new.

@blp
Member Author

blp commented Mar 26, 2026

They're not all that rare, but I do notice whenever I add them.
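For readers unfamiliar with the term: a higher-order (higher-ranked) lifetime bound says a closure or function must work for every lifetime, not one fixed lifetime chosen by the caller. A minimal standalone illustration (not code from this PR):

```rust
// `F` must accept a `&str` of *any* lifetime (`for<'a>`), which lets us
// call it on a reference to a local that `apply_to_local` itself owns.
fn apply_to_local<F>(f: F) -> usize
where
    F: for<'a> Fn(&'a str) -> usize,
{
    let local = String::from("scratch");
    f(&local) // valid because `f` is polymorphic over the borrow's lifetime
}

fn main() {
    assert_eq!(apply_to_local(|s| s.len()), 7);
    println!("ok");
}
```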

impl ScratchSpace for DbspScratch {
    #[inline]
    unsafe fn push_scratch(&mut self, layout: Layout) -> Result<NonNull<[u8]>, Self::Error> {
        unsafe {

Missing // SAFETY: comment on this unsafe block. Hard rule: every unsafe block needs a comment explaining what invariant makes it safe. Here: explain why it is valid to call push_scratch on self.main and self.fallback without further preconditions — e.g., that the caller holds the invariants required by the ScratchSpace contract.


    #[inline]
    unsafe fn pop_scratch(&mut self, ptr: NonNull<u8>, layout: Layout) -> Result<(), Self::Error> {
        unsafe {

Missing // SAFETY: comment. For pop_scratch, the key invariant is that ptr must have been returned by a prior push_scratch call on the same scratch space (or, in the fallback case, that we are routing to the right scratch implementation). Please document that here.
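The shape of comment the reviewer is asking for looks like this. The example is illustrative only (a made-up function on slices, not the PR's code); the pattern is that the comment states the invariant that makes the `unsafe` operation sound:

```rust
// Illustrative only: a `// SAFETY:` comment documenting why an unchecked
// operation cannot misbehave at this call site.
fn first_byte(v: &[u8]) -> Option<u8> {
    if v.is_empty() {
        return None;
    }
    // SAFETY: we just checked that `v` is non-empty, so index 0 is in
    // bounds and `get_unchecked(0)` cannot read past the slice.
    Some(unsafe { *v.get_unchecked(0) })
}

fn main() {
    assert_eq!(first_byte(b"abc"), Some(b'a'));
    assert_eq!(first_byte(&[]), None);
    println!("ok");
}
```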

@blp blp requested a review from gz March 26, 2026 18:02
@blp
Member Author

blp commented Mar 26, 2026

@gz I asked for a re-review because I added a big commit that should fix the performance issues you mentioned.

blp added 2 commits March 26, 2026 15:52
A customer reported large and growing memory use that showed up in heap
profiles attributed to rkyv serialization in the storage file writer.  Only
some of this made sense, in particular the part written to FBufs, which
consists of data blocks that will end up in the cache.  The rest was not
more specifically attributed.
specifically attributed.

Some investigation showed the possibility that data is accumulating in
our per-thread scratch space cache.  This cache should get emptied for
every use, but that depends on the rkyv serialization implementations
being correct both in the common case and the error case.  DBSP hits the
error case frequently in practice (once per data block), because it uses
errors to avoid going over block size limits.  Perhaps the rkyv built-in
implementations handle errors correctly regarding scratch space; I don't
know whether DBSP implementations of serializers do.

If this is the problem, this commit will avoid it, by clearing the scratch
space every time that we use it.  It is larger than otherwise necessary
because rkyv's `HeapScratch` and `FallbackScratch` don't provide any way
to clear themselves, so this has to copy in their implementations.

Signed-off-by: Ben Pfaff <[email protected]>
Contributor

@gz gz left a comment

awesome, it makes sense... I tried something similar two weeks ago but the way you did it is much more elegant :)

fn serialize_with_flush<B, K, V, R>((batch, flush): (B, bool)) -> Vec<u8>
let mut serializer_inner = None;
fn serialize_with_flush<B, K, V, R>(
    (batch, flush): (B, bool),
Contributor

I didn't know this syntax, cool

Member Author

Yeah, you can pattern match in function arguments and it's occasionally useful!
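A tiny standalone example of the syntax discussed here (names are made up, not the PR's actual function): a tuple pattern in the argument list destructures the parameter directly, so the body never names the tuple itself.

```rust
// Destructuring a tuple directly in a function's argument list.
fn describe_pair((key, flush): (u32, bool)) -> String {
    format!("{key}:{flush}")
}

fn main() {
    assert_eq!(describe_pair((7, true)), "7:true");
    println!("ok");
}
```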

rkyv serializers can be lightweight but serializing some types requires
additional data, in the form of a ScratchSpace and SharedSerializeRegistry.
These types are somewhat big (about 100 bytes) and the ScratchSpace
implementation we use also allocates 64 KiB on the heap, so it's very
wasteful to create and destroy them for serializing, say, a single i32.
So, until now, we mainly kept one of them in a thread-local cache.

This has at least two problems:

- As implemented, we were still copying the 100-byte structure in and
  out of the cache whenever we used it, which is wasteful.

- I'm told that Mac OS has slow thread-locals.

This commit changes the implementation in a couple of ways.  The big
change is that it changes our serializer from an rkyv
CompositeSerializer that includes the Serializer, the ScratchSpace, and
the SharedSerializeRegistry inline, to a custom struct that merely
references them.  Thus, there is no need to copy the 100-byte structure
in and out of the cache; we just use a reference to it.  This does
mean that our serializer now has a generic lifetime parameter, which we
have to pass around a bit, but it's not too bad.

The other change is to use the cache less often.  Instead of using the
cache almost everywhere, this commit changes many of the important code
paths, such as the storage file writer, to allocate their own serializer
data and use it directly instead of through the cache.  The hope is that
this will reduce overhead further, especially on OSes where
thread-locals are slow.

Signed-off-by: Ben Pfaff <[email protected]>
Signed-off-by: feldera-bot <[email protected]>
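The design change the commit message describes can be sketched in miniature. Everything below is a hypothetical stand-in (the real types live in DBSP): instead of moving a ~100-byte serializer state in and out of a thread-local cache, the serializer borrows it, so only a reference is passed around, at the cost of the lifetime parameter mentioned above:

```rust
// Hypothetical sketch: a serializer that borrows its scratch state
// instead of owning it, so nothing is copied in or out of a cache.
struct ScratchState {
    buf: Vec<u8>, // stand-in for the 64 KiB scratch allocation
    pos: usize,
}

// `'a` is the generic lifetime parameter that now has to be passed around.
struct BorrowingSerializer<'a> {
    scratch: &'a mut ScratchState,
    out: Vec<u8>,
}

impl<'a> BorrowingSerializer<'a> {
    fn new(scratch: &'a mut ScratchState) -> Self {
        scratch.pos = 0; // cleared on every use, per the first commit
        Self { scratch, out: Vec::new() }
    }

    fn write(&mut self, bytes: &[u8]) {
        self.out.extend_from_slice(bytes);
    }
}

fn main() {
    let mut state = ScratchState { buf: vec![0; 65_536], pos: 0 };
    // Each serialization borrows the same state; the 64 KiB buffer is
    // allocated once and never moved.
    for _ in 0..2 {
        let mut s = BorrowingSerializer::new(&mut state);
        s.write(b"row");
        assert_eq!(s.out, b"row".to_vec());
    }
    assert_eq!(state.buf.len(), 65_536);
    assert_eq!(state.pos, 0);
    println!("ok");
}
```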
@blp blp added this pull request to the merge queue Mar 26, 2026
@mythical-fred mythical-fred left a comment

All my blockers addressed. LGTM.

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026
@blp
Member Author

blp commented Mar 27, 2026

I don't see how this change could prevent a pipeline from starting so I filed https://github.com/feldera/cloud/issues/1580 for the merge failure. Requeuing.

@blp blp added this pull request to the merge queue Mar 27, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 27, 2026
@blp
Member Author

blp commented Mar 27, 2026

I requeued this because the CI failure was due to:

/usr/bin/docker pull thekevjames/gcloud-pubsub-emulator:e852273e07
  Error response from daemon: Head "https://registry-1.docker.io/v2/thekevjames/gcloud-pubsub-emulator/manifests/e852273e07": unauthorized: incorrect username or password
  Warning: Docker pull failed with exit code 1, back off 6.504 seconds before retry.
  /usr/bin/docker pull thekevjames/gcloud-pubsub-emulator:e852273e07
  Error response from daemon: Head "https://registry-1.docker.io/v2/thekevjames/gcloud-pubsub-emulator/manifests/e852273e07": unauthorized: incorrect username or password
  Warning: Docker pull failed with exit code 1, back off 6.817 seconds before retry.
  /usr/bin/docker pull thekevjames/gcloud-pubsub-emulator:e852273e07
  Error response from daemon: Head "https://registry-1.docker.io/v2/thekevjames/gcloud-pubsub-emulator/manifests/e852273e07": unauthorized: incorrect username or password
  Error: Docker pull failed with exit code 1

@blp blp added this pull request to the merge queue Mar 27, 2026
Merged via the queue into main with commit 51a642a Mar 27, 2026
1 check passed
@blp blp deleted the scratch branch March 27, 2026 19:54