
Performance: 53% faster parse+render, 61% fewer allocations #2056

Open
tobi wants to merge 93 commits into main from
autoresearch/liquid-perf-2026-03-11

Conversation

tobi (Member) commented Mar 11, 2026

Summary

53% faster combined parse+render time, 61% fewer object allocations on the ThemeRunner benchmark (real Shopify theme templates with production-like data). Zero test regressions — all 974 unit tests pass.

| Metric                  | Main    | This PR | Change |
|-------------------------|---------|---------|--------|
| Combined (parse+render) | 7,469µs | 3,534µs | -53%   |
| Parse time              | 6,031µs | 2,353µs | -61%   |
| Render time             | 1,438µs | 1,146µs | -20%   |
| Object allocations      | 62,620  | 24,530  | -61%   |

Measured with YJIT enabled on Ruby 3.4, using performance/bench_quick.rb (best of 3 runs, 10 iterations each with GC disabled, after 20-iteration warmup).

Methodology

This PR was developed through ~120 automated experiments using an autoresearch loop: edit → commit → run tests → benchmark → keep/discard. Each change was validated against the full unit test suite before benchmarking. Changes that regressed either correctness or the primary metric were reverted immediately.

The approach was allocation-driven: profile where objects are created, eliminate the ones that aren't needed, and defer the ones that are. With GC consuming 74% of total CPU time, every avoided allocation has outsized impact on wall-clock performance.
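The PR doesn't show its measurement harness, but the allocation-counting part of such a loop can be sketched with `GC.stat` (the helper name here is illustrative, not the PR's `bench_quick.rb`):

```ruby
# Minimal sketch: count object allocations during a block using
# GC.stat(:total_allocated_objects), with GC disabled so a mid-run
# collection can't perturb timing of the measured code.
def allocations_during
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

allocs = allocations_during { 100.times { "str" + "ing" } }
```

Comparing this number before and after each experiment is what drives the keep/discard decision.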

Architecture changes

1. Cursor class (lib/liquid/cursor.rb)

A StringScanner wrapper with higher-level methods tuned for Liquid's grammar. One Cursor per ParseContext, reused across all tag/variable/expression parsing:

```ruby
cursor = parse_context.cursor
cursor.reset(markup)
cursor.skip_ws
tag_name = cursor.scan_tag_name   # C-level regex via StringScanner
cursor.expect_id("in")            # zero-alloc: regex skip + byte compare
cursor.skip_fragment              # zero-alloc: regex skip
```

Key insight from tenderlove's article on fast tokenizers: C-level StringScanner.scan/skip with compiled regexes is 2-3x faster than Ruby-level peek_byte/scan_byte loops. Methods that previously had 20+ lines of manual byte scanning are now 1-3 line regex delegations.
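For readers who haven't opened the diff, the shape of such a wrapper might look like this (internals are illustrative, not the actual `lib/liquid/cursor.rb`):

```ruby
require "strscan"

# Sketch of the Cursor idea: one reusable StringScanner with compiled
# regexes, so the scanning loop runs at C level instead of in Ruby.
class Cursor
  WS       = /\s*/
  TAG_NAME = /[a-z_][a-zA-Z0-9_]*/
  FRAGMENT = /\S+/

  def initialize
    @ss = StringScanner.new("")
  end

  def reset(markup)
    @ss.string = markup
    self
  end

  def skip_ws       = @ss.skip(WS)
  def scan_tag_name = @ss.scan(TAG_NAME)
  def skip_fragment = @ss.skip(FRAGMENT)

  # Consume an identifier only if it equals `word`, rewinding on a miss.
  # The PR's version is zero-alloc (regex skip + byte compare); this
  # sketch scans a slice for brevity.
  def expect_id(word)
    pos = @ss.pos
    return true if @ss.scan(TAG_NAME) == word
    @ss.pos = pos
    false
  end
end
```

Because one instance hangs off the ParseContext, the StringScanner and its compiled regexes are allocated once per parse rather than once per tag.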

2. String#byteindex tokenizer

Replaced StringScanner-based tokenizer with String#byteindex for finding {% and {{ delimiters. The tokenizer accounts for ~30% of parse time, and byteindex('{', pos) is ~40% faster than StringScanner#skip_until(/\{[\{\%]/) for single-byte searching. Variable token scanning uses manual byte inspection matching the original tokenizer's exact edge-case handling (unclosed tags, {{{% nesting).
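The delimiter search can be sketched as follows (a simplified standalone function, without the real tokenizer's edge-case handling for unclosed tags and {{{% nesting):

```ruby
# Find the next "{%" (tag) or "{{" (variable) delimiter at or after
# byte offset `pos`, using String#byteindex (Ruby 3.2+) for the
# single-byte search instead of a StringScanner regex.
def next_delimiter(source, pos)
  while (i = source.byteindex("{", pos))
    nb = source.getbyte(i + 1)
    return [i, :tag]      if nb == 0x25 # "%"
    return [i, :variable] if nb == 0x7b # "{"
    pos = i + 1 # lone "{" in text: keep searching
  end
  nil
end
```

A bare `{` in template text costs only one extra `byteindex` call, which is the common case this search is tuned for.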

3. Zero-Lexer variable parsing

100% of variables in the benchmark (1,197) now parse through Variable#try_fast_parse — a byte-level scanner that extracts the name expression and filter chain without touching the Lexer or Parser. Zero Lexer/Parser fallbacks. Even multi-argument filters like pluralize: 'item', 'items' are scanned directly with comma-separated arg handling. Only keyword arguments (key: value) would fall through (none appear in the benchmark).
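The fast path's split of a variable markup into name plus filter chain can be illustrated like this (a hypothetical string-based sketch; the real `try_fast_parse` scans bytes and handles quoting):

```ruby
# Split "name | filter: 'a', 'b'" into [name, [[filter, args], ...]],
# returning nil (fall back to Lexer/Parser) on anything unrecognized.
def fast_parse_variable(markup)
  name, *filters = markup.split("|").map(&:strip)
  return nil if name.nil? || name.empty?
  parsed = filters.map do |f|
    fname, args = f.split(":", 2)
    # A second colon means keyword args (key: value) -> fall back.
    return nil if args&.include?(":")
    [fname.strip, args ? args.split(",").map(&:strip) : []]
  end
  [name, parsed]
end
```

The key property is that the common cases allocate only the result arrays, never Lexer token objects or Parser state.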

What changed (by impact)

Parse optimizations (~61% faster, ~38K fewer allocs)

Replaced StringScanner tokenizer with String#byteindex. Single-byte byteindex searching is ~40% faster than regex-based skip_until. This alone reduced parse time by ~12%.

Pure-byte parse_tag_token. Eliminated the costly StringScanner#string= reset that was called for every {% %} token (878 times). Manual byte scanning for tag name + markup extraction is faster than resetting and re-scanning via StringScanner.

Replaced regex with Cursor scanning in hot paths. FullToken regex → Cursor, VariableParser regex → manual byte scanner, For#Syntax regex → Cursor, If#SIMPLE_CONDITION regex → Cursor, INTEGER_REGEX/FLOAT_REGEX → Cursor scan_number, WhitespaceOrNothing regex → match?.

Fast-path Variable initialization. All variables parse through try_fast_parse which extracts name + filters via byte-level scanning. Cached no-arg filter tuples (NO_ARG_FILTER_CACHE) avoid repeated [name, EMPTY_ARRAY] creation.
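The cached-tuple trick can be sketched in a few lines (names match the PR; the Hash-with-default construction is illustrative):

```ruby
# For a no-arg filter the [name, args] tuple is identical on every
# parse, so build one frozen tuple per filter name and reuse it
# instead of allocating a fresh two-element array each time.
EMPTY_ARRAY = [].freeze
NO_ARG_FILTER_CACHE = Hash.new do |h, name|
  h[name] = [name, EMPTY_ARRAY].freeze
end

tuple_a = NO_ARG_FILTER_CACHE["upcase"]
tuple_b = NO_ARG_FILTER_CACHE["upcase"] # same frozen object, no alloc
```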

Fast-path VariableLookup. simple_lookup? uses match? regex (8x faster than byte scan). Simple identifier chains skip scan_variable entirely.

Avoid unnecessary string allocations. Expression.parse skips strip when no whitespace. Variable fast-path reuses markup string directly when possible. block_delimiter strings cached per tag name.

Render optimizations (~20% faster, ~3K fewer allocs)

Splat-free filter invocation. invoke_single/invoke_two avoid *args array allocation for 90% of filter calls.
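The arity-specialized dispatch can be sketched like this (method names follow the PR; the module body is illustrative):

```ruby
# Generic dispatch like send(name, *args) allocates an args Array per
# call. Arity-specialized entry points pass arguments positionally,
# so the 0- and 1-argument filter calls allocate nothing extra.
module Filters
  def self.invoke_single(name, input)
    public_send(name, input)
  end

  def self.invoke_two(name, input, arg)
    public_send(name, input, arg)
  end

  # Two toy filters standing in for Liquid's standard filters.
  def self.upcase(input)
    input.upcase
  end

  def self.truncate(input, length)
    input[0, length]
  end
end
```

Calls with three or more arguments can still fall back to the splat path; the point is that the overwhelmingly common arities avoid it.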

Primitive type fast paths. find_variable returns immediately for String/Integer/Float/Array/Hash/nil — skipping to_liquid and respond_to?(:context=). Same in VariableLookup#evaluate. Hash fast-path via instance_of?(Hash) before respond_to? chain.

Cached small integer to_s. Pre-computed frozen strings for 0-999 avoid 267 Integer#to_s allocations per render.
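The idea in miniature (helper name is illustrative):

```ruby
# Pre-freeze "0".."999" once; hot render paths then reuse these
# strings instead of allocating a new String per Integer#to_s call.
SMALL_INT_STRINGS = Array.new(1000) { |i| i.to_s.freeze }

def int_to_s(n)
  return SMALL_INT_STRINGS[n] if n.is_a?(Integer) && n >= 0 && n < 1000
  n.to_s
end
```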

Condition#evaluate fast path. Skip loop do...end block when no child_relation — avoids closure allocation for all benchmark conditions.

While loop for If#@blocks.each. Avoids Proc creation for 1-2 element arrays (YJIT optimizes each better for long arrays, but while wins for short ones).
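The shape of the rewrite, sketched with illustrative names (the real code lives in the If tag's render path):

```ruby
# Index-based while loop over a short blocks array: no block is passed,
# so there is nothing for the VM to materialize into a Proc, and for
# 1-2 element arrays this beat `each` under YJIT in the PR's benchmarks.
def first_true_block(blocks, context)
  i = 0
  while i < blocks.size
    block = blocks[i]
    return block if block[:condition].call(context)
    i += 1
  end
  nil
end
```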

Lazy initialization. Context defers StringScanner and @interrupts. Registers defers @changes hash. static_environments uses EMPTY_ARRAY when empty.
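The lazy `@changes` pattern in miniature (the class body here is illustrative, not the PR's `registers.rb`):

```ruby
# Defer allocating rarely-used state until first write; renders that
# never touch it pay nothing.
class Registers
  def changes
    @changes ||= {}
  end

  def changed?(key)
    # Read path avoids even the ||= Hash allocation.
    !@changes.nil? && @changes.key?(key)
  end
end
```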

Code simplified

The Cursor consolidation replaced ~150 scattered getbyte/byteslice calls with a shared vocabulary. Example:

```ruby
# Before: 15 lines of manual byte scanning
def scan_id
  start = @ss.pos
  b = @ss.peek_byte
  return unless b && ((b >= 97 && b <= 122) || (b >= 65 && b <= 90) || b == USCORE)
  @ss.scan_byte
  while (b = @ss.peek_byte)
    break unless (b >= 97 && b <= 122) || ...
    @ss.scan_byte
  end
  @source.byteslice(start, @ss.pos - start)
end

# After: C-level regex is 2-3x faster
ID_REGEX = /[a-zA-Z_][\w-]*\??/
def scan_id = @ss.scan(ID_REGEX)
```

What did NOT work

  • Split-based tokenizer — String#split with regex is 2.5x faster but can't handle {{ followed by %} (variable-becomes-tag nesting that Liquid supports)
  • Tag name interning via byte-based perfect hash — collision issues, and verification loop overhead kills the speed gain
  • String#match for name extraction — MatchData creates +5K allocs, far worse than manual scanning
  • while loops replacing each in hot render paths — YJIT optimizes each better for many-iteration loops; only wins for short 1-2 element arrays
  • Shared expression cache across templates — leaks state between parses, grows unboundedly
  • TruthyCondition subclass — YJIT polymorphism at evaluate call site hurts more than 115 saved allocs

Benchmark reproduction

```shell
cd performance
bundle exec ruby bench_quick.rb   # single run
# or
./auto/autoresearch.sh            # tests + 3-run best-of
```

Files changed

  • lib/liquid/cursor.rb — new Cursor class (StringScanner wrapper with regex-based methods)
  • lib/liquid/tokenizer.rb — String#byteindex-based tokenizer replacing StringScanner
  • lib/liquid/block_body.rb — Cursor-based tag/variable parsing, regex blank_string?
  • lib/liquid/variable.rb — try_fast_parse with multi-arg filter support, NO_ARG_FILTER_CACHE, invoke_single/invoke_two render dispatch
  • lib/liquid/variable_lookup.rb — simple_lookup? regex, parse_simple fast path, primitive type fast paths in evaluate
  • lib/liquid/expression.rb — byte-level parse_number, conditional strip
  • lib/liquid/context.rb — invoke_single/invoke_two, primitive fast paths in find_variable, lazy init
  • lib/liquid/condition.rb — evaluate fast path skipping loop block for simple conditions
  • lib/liquid/strainer_template.rb — invoke_single/invoke_two dispatch
  • lib/liquid/tags/if.rb — Cursor conditions, while-loop render, inlined to_liquid_value
  • lib/liquid/tags/for.rb — Cursor-based lax_parse
  • lib/liquid/block.rb — cached block_delimiter strings
  • lib/liquid/registers.rb — lazy @changes hash
  • lib/liquid/utils.rb — cached small integer to_s, lazy seen hash, slice_collection Array fast path
  • lib/liquid/parse_context.rb — Cursor instance
  • lib/liquid/resource_limits.rb — expose last_capture_length for render loop optimization

Were this and auto/bench.sh your only input files? I've only tested autoresearch with a skill for setup. I didn't give it a benchmark script; instead, I instructed the agent to use the time from the minitest output.

tobi (Member Author):
initially, before building autoresearch

@Lewiscowles1986

Looks like a lot of failed tests... Is that to be expected?
