
[improve][common] Optimize TopicName.get() to reduce lock contention on cache lookup#25367

Merged
liangyepianzhou merged 7 commits into apache:master from liangyepianzhou:optimize/TopicName-get
Apr 21, 2026

Conversation

liangyepianzhou (Contributor) commented Mar 20, 2026

Motivation

TopicName.get() previously used ConcurrentHashMap.computeIfAbsent() to populate the topic-name cache. Although computeIfAbsent is atomic, it holds the internal bin-lock for the entire duration of the mapping function, which includes the non-trivial TopicName construction (string splitting, validation, etc.).

Under high-concurrency workloads where many threads simultaneously encounter the same uncached topic name, this causes unnecessary lock contention and can degrade throughput.

Modifications

Replace computeIfAbsent with an explicit two-step pattern:

  1. Fast path: call cache.get(topic) first — a single volatile read with no locking — and return immediately on a cache hit (steady-state case).
  2. Slow path (cache miss): construct TopicName outside the lock, then use cache.putIfAbsent() to insert. If two threads race on the same key, one wins the putIfAbsent and the other's instance is discarded; this is safe because TopicName is immutable.
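The two steps above can be sketched as a small self-contained class. This is an illustrative model only: `TopicName` here is a stand-in record rather than the actual Pulsar class, and the real implementation includes validation logic and a bounded cache.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the fast-path/slow-path lookup; the "TopicName"
// record is a stand-in for the real (immutable) Pulsar class.
class TopicNameCacheSketch {
    record TopicName(String topic) { }   // immutable, so losing the race is harmless

    private static final ConcurrentHashMap<String, TopicName> cache = new ConcurrentHashMap<>();

    static TopicName get(String topic) {
        // Fast path: a plain volatile read, no locking.
        TopicName cached = cache.get(topic);
        if (cached != null) {
            return cached;
        }
        // Slow path: construct outside any map lock, then race via putIfAbsent.
        TopicName created = new TopicName(topic);
        TopicName existing = cache.putIfAbsent(topic, created);
        return existing != null ? existing : created;
    }

    public static void main(String[] args) {
        TopicName a = get("persistent://tenant/ns/topic-1");
        TopicName b = get("persistent://tenant/ns/topic-1");
        // The second call hits the fast path and returns the identical instance.
        System.out.println(a == b);  // prints "true"
    }
}
```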

Verifying this change

  • Make sure that the change passes the CI checks.

This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

…on cache lookup

### Motivation

`TopicName.get()` previously used `ConcurrentHashMap.computeIfAbsent()` to populate
the topic-name cache. Although `computeIfAbsent` is atomic, it holds the internal
bin-lock for the entire duration of the mapping function, which includes the
non-trivial `TopicName` construction (string splitting, validation, etc.).

Under high-concurrency workloads where many threads simultaneously encounter the same
uncached topic name, this causes unnecessary lock contention and can degrade throughput.

### Modifications

Replace `computeIfAbsent` with an explicit two-step pattern:

1. **Fast path**: call `cache.get(topic)` first — a single volatile read with no
   locking — and return immediately on a cache hit (steady-state case).
2. **Slow path** (cache miss): construct `TopicName` *outside* the lock, then use
   `cache.putIfAbsent()` to insert. If two threads race on the same key, one wins
   the `putIfAbsent` and the other's instance is discarded; this is safe because
   `TopicName` is immutable.

Add a Javadoc comment on `get()` explaining the rationale.
liangyepianzhou (Contributor, Author) commented Mar 20, 2026

[flame graph screenshot]

Background

While load-testing Pulsar with a single topic containing 1,000,000 partitions, we observed that TopicName.get() was consuming a disproportionate amount of CPU, showing up prominently in flame graphs.

Root Cause Analysis

Two compounding issues were identified:

  1. topicNameCacheMaxCapacity too small — With 1M partitions, the default cache capacity is inevitably exceeded, triggering repeated clearIfReachedMaxCapacity calls and causing cache stampedes where all partitions miss simultaneously.

  2. new TopicName() executed inside the lock — The current implementation uses computeIfAbsent(key, TopicName::new), which holds the ConcurrentHashMap bin-lock for the entire duration of object construction. Under concurrent cache misses, threads serialize on the same lock, degrading throughput significantly.

Fix

Replace computeIfAbsent with a lock-free get + out-of-lock construction + putIfAbsent pattern:

```java
// Before
return cache.computeIfAbsent(topic, TopicName::new);

// After
TopicName cached = cache.get(topic);
if (cached != null) {
    return cached;
}
TopicName created = new TopicName(topic);           // constructed outside lock
TopicName existing = cache.putIfAbsent(topic, created);
return existing != null ? existing : created;
```

Under concurrent misses on the same key, each thread constructs its own instance independently; putIfAbsent elects the winner and the losers are simply GC'd — eliminating bin-lock contention entirely.

Benchmark Results (JMH · SingleShotTime · 8 threads · 1M partitions · full cold cache)

| Benchmark | Avg (ms) | Median (ms) | p99 (ms) | Min (ms) |
| --- | --- | --- | --- | --- |
| miss_A: computeIfAbsent | 427.1 | 395.3 | 644.5 | 316.5 |
| miss_B: get + putIfAbsent | 287.1 | 278.9 | 368.0 | 236.1 |
| Improvement | ~1.49× | ~1.42× | ~1.75× | ~1.34× |

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).
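For readers who want to reproduce the contention effect without the JMH setup, a rough stdlib-only sketch follows. This is not the benchmark quoted above; the thread and key counts are illustrative, and the printed timings will vary by machine.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough contention demo: all threads miss on the same key set at once.
// Not the JMH benchmark from this PR; numbers are illustrative only.
class ContentionSketch {
    // Stand-in for the non-trivial TopicName parsing done in the constructor.
    static String slowConstruct(String s) {
        return String.join("/", s.split("/"));
    }

    static long runMillis(boolean useComputeIfAbsent, int threads, int keys) throws InterruptedException {
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await();
                    for (int i = 0; i < keys; i++) {
                        String k = "persistent://tenant/ns/topic-" + i;
                        if (useComputeIfAbsent) {
                            // Mapping function runs while holding the bin lock.
                            cache.computeIfAbsent(k, ContentionSketch::slowConstruct);
                        } else {
                            String v = cache.get(k);                // lock-free fast path
                            if (v == null) {
                                String created = slowConstruct(k);  // outside the lock
                                cache.putIfAbsent(k, created);
                            }
                        }
                    }
                } catch (InterruptedException ignored) {
                } finally {
                    done.countDown();
                }
            });
        }
        long t0 = System.nanoTime();
        start.countDown();
        done.await();
        pool.shutdown();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("computeIfAbsent: " + runMillis(true, 8, 100_000) + " ms");
        System.out.println("get+putIfAbsent: " + runMillis(false, 8, 100_000) + " ms");
    }
}
```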


The github-actions bot added the doc-not-needed (Your PR changes do not impact docs) label on Mar 20, 2026.
merlimat (Contributor) left a comment:

Change looks good.

Can you try comparing with just calling cache.put() instead of putIfAbsent()?

It would be good if you could add the JMH code to the microbenchmarks module.

lhotari (Member) commented Mar 20, 2026

Related PR from the past #24457 (change was rejected so I closed it) with a lot of interesting comments.
One of the comments came from Caffeine author: #24457 (comment).
Besides performance another detail to consider is the duplication of java.lang.String instances in memory.

lhotari (Member) left a comment:

This doesn't address one of the key problems which is that the cache is cleared if the size exceeds the max size setting (topicNameCacheMaxCapacity):

```java
public static void clearIfReachedMaxCapacity(int maxCapacity) {
    if (maxCapacity < 0) {
        // Unlimited cache.
        return;
    }
    if (cache.size() > maxCapacity) {
        cache.clear();
    }
}
```

We'd be better off by switching to use Caffeine again. The Caffeine author commented in #24457 (comment) that the bottleneck in the earlier version has been addressed. Since we have upgraded to Java 17 for the client, we can use the newer Caffeine version.

lhotari (Member) commented Mar 20, 2026

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1).
@liangyepianzhou What version of Pulsar are you load testing?

lhotari (Member) commented Mar 20, 2026

The tail latency improvement (1.75× at p99) is especially significant: computeIfAbsent causes severe jitter under lock contention (316–644 ms range), while get+putIfAbsent stays stable (236–368 ms).

The benchmark seems to measure a miss of all 1M cache entries at once. It's not a very realistic scenario that there would be such an amount of misses at the same time.

I think it would be more useful to have a solution that ensures that the memory use of the cache is bounded.
That is something that I have added quite recently to AbstractMetadataStore children cache:

```java
long childrenCacheMaxSizeBytes = getChildrenCacheMaxSizeBytes();
Caffeine<Object, Object> childrenCacheBuilder = Caffeine.newBuilder()
        .recordStats()
        .refreshAfterWrite(CACHE_REFRESH_TIME_MILLIS, TimeUnit.MILLISECONDS)
        .expireAfterWrite(CACHE_REFRESH_TIME_MILLIS * 2, TimeUnit.MILLISECONDS);
if (childrenCacheMaxSizeBytes > 0) {
    childrenCacheBuilder.maximumWeight(childrenCacheMaxSizeBytes)
            .weigher((String key, List<String> children) -> {
                // calculate the total byte size of the key and entries in the children list
                // to get some estimation of the heap memory required for the entry.
                // add 16 bytes overhead for Java object header and 16 bytes for java.lang.String fields.
                int totalSize = ByteBufUtil.utf8Bytes(key) + 32;
                for (String child : children) {
                    totalSize += ByteBufUtil.utf8Bytes(child) + 32;
                }
                return totalSize;
            });
}
```

in #24868
The max size is limited to 20% of the max heap size.

For TopicName cache there could be a bytesize limit too. The StringInterner solution would be useful at least for namespace and tenant Strings to ensure that those don't cause heap duplication if the namespace or tenant caches expire or overflow.
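The interning idea for tenant/namespace strings can be illustrated with the stdlib alone. StringInterner is Pulsar-internal; this sketch uses a plain ConcurrentHashMap, whereas a production interner would need weak references or a size bound to stay leak-free.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of deduplicating namespace/tenant strings so that repeated
// parses share one String instance. An unbounded strong map is used
// here for brevity; a real interner would bound it or use weak refs.
class InternSketch {
    private static final ConcurrentHashMap<String, String> interned = new ConcurrentHashMap<>();

    static String intern(String s) {
        String existing = interned.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal content...
        String a = intern(new String("my-tenant/my-namespace"));
        String b = intern(new String("my-tenant/my-namespace"));
        // ...collapse to one shared instance after interning.
        System.out.println(a == b);  // prints "true"
    }
}
```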

lhotari (Member) commented Mar 20, 2026

The benchmark doesn't use actual code from the Pulsar code base.
I believe that this is a more representative way: https://github.com/lhotari/pulsar/blob/lh-fix-topicname-memory-leak/microbench/src/main/java/org/apache/pulsar/common/naming/TopicNameBenchmark.java
That was done in #25367, which wasn't merged.

liangyepianzhou (Contributor, Author):

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1). @liangyepianzhou What version of Pulsar are you load testing?

The flamegraph comes from 3.0.5. I changed the version to 3.0.13 and set topicNameCacheMaxCapacity = -1. CPU usage dropped from 90% to 9%, but TopicName.get() still accounts for a very high percentage of the flame graph.

The benchmark seems to measure the miss of 1M cache entries. That's not a very realistic scenario that there would be such amount of misses at once.

In my scenario, one topic has 1 million partitions, so there are 1 million cache misses during the initial startup. Topic loading is currently very slow, and I suspect this may be one of the reasons.

liangyepianzhou (Contributor, Author):

The flamegraph in #25367 (comment) looks like it was from an old version of Pulsar that uses Guava Cache (before #23052 <3.0.6, <3.3.1). @liangyepianzhou What version of Pulsar are you load testing?

Thanks for pointing out the historical context and Ben Manes' insights!

Just to clarify the scope of this PR: the get + putIfAbsent change is intentionally a minimal, targeted fix to reduce bin-lock contention during cache misses, independent of the broader cache design questions (capacity bounding, Caffeine version, soft references, etc.).

That said, I'd like to confirm whether this small optimization is still worthwhile on its own:

  1. Even with a properly bounded cache, concurrent cache misses (e.g. during a cold start or after a cache clear) will still happen. In those cases, putIfAbsent avoids holding the bin-lock during TopicName construction, which seems beneficial regardless of the eviction strategy.
  2. The fast path (cache.get() first) is a strict improvement for the steady-state hit case with zero synchronization overhead.

So the question is: do you see any reason this low-risk change should be blocked on the larger cache redesign? Happy to fold it into a bigger PR if that's preferred, but wanted to check if it can stand alone first.

Review thread on pulsar-common/src/main/java/org/apache/pulsar/common/naming/TopicName.java (outdated)
lhotari (Member) left a comment:

LGTM, great solution @liangyepianzhou. Just one comment about documenting implementation details in javadoc.

liangyepianzhou (Contributor, Author):

Change looks good.

Can you try comparing with just calling cache.put() instead of putIfAbsent()?

It would be good if you could add the JMH code to the microbenchmarks module.

Thanks for the suggestion. Fixed.

BewareMyPower (Contributor):

I think we should not depend too much on the cache itself. TopicName construction is inefficient and used nearly everywhere. Could you try my previous patch here (#24463) to see if there is any improvement? If so, I can resolve the conflicts again.

lhotari (Member) commented Apr 20, 2026

I think we should not depend too much on the cache itself. TopicName construction is inefficient and used nearly everywhere. Could you try my previous patch here (#24463) to see if there is any improvement? If so, I can resolve the conflicts again.

@BewareMyPower Could you rebase #24463? I think we can handle that optimization separately.

On the caching: one benefit worth preserving is that the TopicName and NamespaceName caches reduce java.lang.String instance duplication, which is valuable beyond just avoiding the construction cost itself. So I'd lean toward keeping the caches rather than removing them.

liangyepianzhou (Contributor, Author) commented Apr 21, 2026

I think we should not depend too much on the cache itself. TopicName construction is inefficient and used nearly everywhere. Could you try my previous patch here (#24463) to see if there is any improvement? If so, I can resolve the conflicts again.

@BewareMyPower After applying your optimization, the benchmark results are as follows: jmh2 uses computeIfAbsent, jmh1 uses put.

```
jmh1 (put):
Benchmark                           Mode  Cnt  Score   Error  Units
TopicNameGetBenchmark.coldStartGet    ss   50  5.600 ± 1.722  us/op

jmh2 (computeIfAbsent):
Benchmark                           Mode  Cnt  Score   Error  Units
TopicNameGetBenchmark.coldStartGet    ss   50  6.446 ± 2.168  us/op
```

BewareMyPower (Contributor) commented Apr 21, 2026

So the result now with this PR (replacing computeIfAbsent with put) is…

Then I believe both PRs are valuable. I will rebase my PR soon; for this PR, just go ahead and fix the CI.

BewareMyPower (Contributor):

I'm rebasing my PR currently. But your test result is a bit confusing: it seems that switching to put would be slower with my improvement. Is there anything I misunderstood?

liangyepianzhou (Contributor, Author):

I'm rebasing my PR currently. But your test result is a bit confusing: it seems that switching to put would be slower with my improvement. Is there anything I misunderstood?

Are you testing the cold start / cache-miss scenario?

My understanding is that the main latency comes from the cache-miss startup phase, where the optimization is most noticeable. You can use my JMH program to test it.

BewareMyPower (Contributor) commented Apr 21, 2026

No. I didn't run any test for now. I'm just analyzing your test report here: #25367 (comment)

jmh2 uses computeIfAbsent, jmh1 uses put.

  • jmh1: 5.600 ± 1.722 us/op
  • jmh2: 6.446 ± 2.168 us/op

Oh, my bad, I misunderstood the unit (us/op). I thought it was ops/us.

@liangyepianzhou liangyepianzhou merged commit 8c4e83d into apache:master Apr 21, 2026
79 of 82 checks passed
@liangyepianzhou liangyepianzhou deleted the optimize/TopicName-get branch April 21, 2026 08:56
lhotari pushed a commit that referenced this pull request Apr 21, 2026
…on cache lookup (#25367)

### Motivation

`TopicName.get()` previously used `ConcurrentHashMap.computeIfAbsent()` to populate the topic-name cache. Although `computeIfAbsent` is atomic, it holds the internal bin-lock for the entire duration of the mapping function, which includes the non-trivial `TopicName` construction (string splitting, validation, etc.).

Under high-concurrency workloads where many threads simultaneously encounter the same uncached topic name, this causes unnecessary lock contention and can degrade throughput.

### Modifications

Replace `computeIfAbsent` with an explicit two-step pattern:

1. **Fast path**: call `cache.get(topic)` first — a single volatile read with no locking — and return immediately on a cache hit (steady-state case).
2. **Slow path** (cache miss): construct `TopicName` *outside* the lock, then use `cache.put()` to insert.

(cherry picked from commit 8c4e83d)
lhotari pushed a commit that referenced this pull request Apr 21, 2026
…on cache lookup (#25367) (cherry picked from commit 8c4e83d)

lhotari pushed a commit that referenced this pull request Apr 21, 2026
…on cache lookup (#25367) (cherry picked from commit 8c4e83d)

lhotari pushed a commit that referenced this pull request Apr 21, 2026
…on cache lookup (#25367) (cherry picked from commit 8c4e83d)
priyanshu-ctds pushed a commit to datastax/pulsar that referenced this pull request Apr 22, 2026
…on cache lookup (apache#25367) (cherry picked from commit 8c4e83d) (cherry picked from commit 069d1a4)