# frozen_string_literal: true at the top of most of your Ruby
source code files, or at the very least, that you’ve seen it in some other projects.
Based on informal discussions at conferences and online, it seems that what this magic comment really is about is not always well understood, so I figured it would be worth talking about why it’s there, what it does exactly, and what its future might look like.
Before we can delve into what makes frozen string literals special, we first need to talk about the Ruby String type, because it’s quite different from the equivalent type in other popular languages.
In the overwhelming majority of popular languages, strings are immutable. That’s the case in Java, JavaScript, Python, Go, etc.
There are a few exceptions, though, like Perl, PHP[1], C/C++ (except for literals), and of course Ruby:
>> str = String.new
=> ""
>> str.object_id
=> 24952
>> str << "foo"
=> "foo"
>> str
=> "foo"
>> str.capitalize!
=> "Foo"
>> str.upcase!
=> "FOO"
>> str
=> "FOO"
>> str.object_id
=> 24952
Implementation-wise, they’re just an array of bytes, with an associated encoding to know how these bytes should be interpreted:
class String
  attr_reader :encoding

  def initialize
    @bytes = []
    @encoding = Encoding::UTF_8
  end
end
That too is quite unusual.
Most languages, especially the ones I listed above, have instead chosen a specific internal encoding, and all strings are encoded that way. For instance, in Java and JavaScript, strings are encoded in UTF-16 because they were created at roughly the same time as the first Unicode specification. At that time, many people thought that surely 16 bits should be enough to encode all possible characters, but that later turned out to be wrong. Most newer languages use UTF-8, or a limited set of internal encodings.
For instance, in Python, strings can be encoded in either ISO-8859-1 (AKA Latin 1), UTF-16 or UTF-32.
But from a user perspective, it’s an implementation detail, and you can’t really tell what encoding a particular string is using.
Semantically, strings are Unicode sequences; how that sequence is encoded in memory is abstracted away.
In these languages, whenever you have to handle text in another encoding, you start by re-encoding it into the internal representation. In Ruby however, strings with different internal encodings can exist in the same program, and Ruby supports over a hundred different encodings:
>> Encoding.list.size
=> 103
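As a small illustration of strings with different encodings living side by side, here is a sketch using only core String and Encoding methods. The variable names are mine, and the sample text is just a convenient three-character Japanese word.

```ruby
# "\u" escapes always produce a UTF-8 string, whatever the source file encoding.
utf8 = "\u65E5\u672C\u8A9E" # the word "Japanese (language)", 3 characters
sjis = utf8.encode(Encoding::Shift_JIS)

puts utf8.encoding.name # UTF-8
puts sjis.encoding.name # Shift_JIS

# Same three characters, different bytes.
puts utf8.length   # 3
puts sjis.length   # 3
puts utf8.bytesize # 9 (3 bytes per character)
puts sjis.bytesize # 6 (2 bytes per character)

# The two strings are not equal until one of them is transcoded.
puts utf8 == sjis                         # false
puts utf8 == sjis.encode(Encoding::UTF_8) # true

# force_encoding only relabels the bytes; it does not convert them.
relabeled = utf8.dup.force_encoding(Encoding::ISO_8859_1)
puts relabeled.bytes == utf8.bytes # true
puts relabeled.length              # 9: every byte now counts as one character
```

The last two lines are the "array of bytes plus an encoding" model in action: the bytes never change, only the way they are interpreted.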
While I’m not 100% certain of why Ruby went that way, I highly suspect it is in large part due to Ruby’s Japanese origin. In the early days of the Unicode specification, there was an attempt at unifying some of the “common” Chinese, Korean, and Japanese characters, in what is now called the Han unification. Because of that unification attempt, Unicode had lots of problems with Japanese text, hence the Japanese IT industry didn’t adopt Unicode as fast as the Western IT industry did, and for a very long time, Japanese-specific encodings such as Shift JIS remained widespread.
As such, being able to work with Japanese text without going through a forced Unicode conversion was an important feature for a large part of Ruby’s core contributors.
But let’s go back to mutability.
Like most things in engineering, both immutable and mutable strings have pros and cons, so it’s not like one choice is inherently superior to the other.
One of the advantages of immutable strings is that you can more easily share them, for instance:
sliced_string = very_long_string[1..-1]
In the above case, if strings are mutable, you need to copy all but one of the bytes of very_long_string into sliced_string, which can be costly.
But if strings are immutable, you can instead have sliced_string internally be pointing at the content of very_long_string with just an offset.
That is what some languages call String Views, or String slices.
Another advantage of immutable strings is that they allow for interning. The idea is simple: if strings can’t be mutated, whenever you have multiple instances of strings with identical content, you can coalesce them into a single instance. This deduplication can be done more or less aggressively, as it’s always a tradeoff in how much CPU time you want to spend searching for duplicates in the hope of saving some memory.
Some other advantages include not having to worry about mutation in multi-threaded code, as well as dictionary keys. Strings are used a lot as dictionary keys. If you mutate a string, you change its hash code, and that basically breaks hash tables.
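To make the hash code problem concrete, here is a minimal sketch. It only looks at String#hash, since Ruby protects its own Hash keys. The variable names are illustrative.

```ruby
key = String.new("foo")
hash_before = key.hash

key << "bar" # in-place mutation, same object
hash_after = key.hash

# The hash code follows the content, so it changed along with it.
puts hash_before == hash_after   # false
puts hash_after == "foobar".hash # true

# A hash table that filed this object under hash_before would now look for it
# in the wrong bucket.
```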
On the other hand, mutable strings are very handy in some scenarios, like to iteratively build a final string:
buffer = ""
10.times do
  buffer << "hello"
end
Whereas in a language with immutable strings like Java, concatenating strings in a loop is known as a classic performance gotcha:
String buffer = "";
for (int i = 0; i < 10; i++) {
    buffer += "hello";
}
In the above example, on every iteration, the += operator allocates a new string and copies the existing content into it, so each iteration costs more than the last, and the total work grows quadratically with the number of iterations.
Instead, you are supposed to use a different object as a buffer: StringBuilder:
StringBuilder buffer = new StringBuilder();
for (int i = 0; i < 10; i++) {
    buffer.append("hello");
}
buffer.toString();
That’s the Java equivalent of appending strings to an array and then calling array.join("").
It’s a common enough mistake that at some point the Java compiler gained the ability to detect that pattern and automatically replace it with the equivalent code using StringBuilder.
While having to use a different buffer type isn’t the end of the world, I do very much like that it’s not necessary in Ruby.
But more generally, the advantage of mutable strings is that for some algorithms, being able to modify the string in place saves a lot of memory allocations and copying.
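The same difference can be observed from Ruby itself, since String#+ returns a new object while String#<< mutates the receiver. This is a rough sketch; the snapshots array only exists to keep every intermediate string alive so their identities can be compared.

```ruby
# += builds a brand new String on every iteration...
copied = String.new
snapshots = []
3.times do
  copied += "hello"
  snapshots << copied
end

# ...whereas << keeps growing the very same object in place.
appended = String.new
original_id = appended.object_id
3.times { appended << "hello" }

puts snapshots.map(&:object_id).uniq.size # 3: one new object per iteration
puts appended.object_id == original_id    # true: still the same object
puts copied == appended                   # true: same final content
```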
Earlier in this post, I said Ruby had mutable strings, but that’s not quite true. Because every mutable object in Ruby can be frozen, Ruby actually has both mutable and immutable strings, and it takes advantage of this.
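For readers who haven’t used it much, here is what freezing looks like in practice. This is a minimal sketch with made-up variable names. It assumes Ruby 2.5 or later, where the error class is FrozenError.

```ruby
greeting = String.new("hello").freeze
puts greeting.frozen? # true

begin
  greeting << " world" # any in-place mutation is rejected
rescue FrozenError => error
  puts "mutation rejected: #{error.class}"
end

# dup returns an unfrozen copy that can be mutated freely.
copy = greeting.dup
copy << " world"

puts copy.frozen? # false
puts copy         # hello world
puts greeting     # hello
```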
A fun way to poke at Ruby internals is through the ObjectSpace.dump method.
require "json"
require "objspace"
def dump(obj)
  JSON.pretty_generate(JSON.parse(ObjectSpace.dump(obj)))
end
str = "Hello World" * 80
puts dump(str)
The above script will output something like:
{
  "address": "0x105068e10",
  "type": "STRING",
  "slot_size": 40,
  "bytesize": 880,
  "memsize": 921,
  ...
}
It tells us the string content is 880B (bytesize) and that Ruby allocated a 40B wide slot (slot_size), hence the string content is stored in an external buffer for a total of 921B (memsize).
Now, look what happens if we slice that string:
require "json"
require "objspace"
def dump(obj)
  JSON.pretty_generate(JSON.parse(ObjectSpace.dump(obj)))
end
str = "Hello World" * 80
puts "initial str: #{dump(str)}\n"
slice = str[40..-1]
puts "str after:\n#{dump(str)}\n"
puts "slice:\n#{dump(slice)}\n"
str after:
{
  "address": "0x105178e18",
  "type": "STRING",
  "slot_size": 40,
  "shared": true,
  "references": [ "0x1051786c0" ],
  "memsize": 40,
  ...
}
slice:
{
  "address": "0x1051786e8",
  "type": "STRING",
  "slot_size": 40,
  "shared": true,
  "references": [ "0x1051786c0" ],
  "memsize": 40,
  ...
}
Now, both str and slice have the shared: true attribute, which indicates that they don’t actually own their content: they are pointing inside another String object.
You can also see that both str and slice have a reference to the same object at address: 0x1051786c0.
So even though it has mutable strings, Ruby is still able to optimize some operations using “string views” like languages with immutable strings.
However, since str is mutable, Ruby couldn’t directly create a string view that references str. It first had to transfer the buffer ownership to a third String object, and that one is immutable.
But if str was frozen, Ruby would have been able to directly create slice as a view inside str.
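If you want to check this on your own Ruby, the same script can be rerun with a frozen str. This is only a sketch for experimentation: the exact fields printed, and whether slice references str directly, depend on the Ruby version and on how the string is stored, so no expected output is shown.

```ruby
require "json"
require "objspace"

def dump(obj)
  JSON.pretty_generate(JSON.parse(ObjectSpace.dump(obj)))
end

# Same string as before, but frozen before it is sliced.
str = ("Hello World" * 80).freeze
slice = str[40..-1]

puts "frozen str:\n#{dump(str)}\n"
puts "slice:\n#{dump(slice)}\n"

# Compare the "address" of str with the "references" of slice, if present.
```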
Similarly, when I was listing some of the pros and cons of mutable strings, I mentioned how mutable strings are a problem when used as hash table keys. Perhaps you’ve never noticed it, but to avoid this problem, Ruby automatically freezes string keys in Hash:
>> str = "test"
=> "test"
>> str.frozen?
=> false
>> hash = { str => 1 }
=> {"test" => 1}
>> hash.keys.first
=> "test"
>> hash.keys.first.frozen?
=> true
>> [str.object_id, hash.keys.first.object_id]
=> [16, 24]
As you can see, Ruby couldn’t directly use the str string as a Hash key here; it first had to make a frozen copy of it.
Here, too, if str was frozen, Ruby could have saved the extra work of duplicating this string.
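Both cases can be compared side by side with object identity checks. This is a sketch of the behavior I’d expect on CRuby; the variable names are mine, and String.new is used so that the first key is mutable regardless of how literals are treated.

```ruby
mutable_key = String.new("mutable")
frozen_key = String.new("frozen").freeze

hash = {}
hash[mutable_key] = 1
hash[frozen_key] = 2

stored_mutable, stored_frozen = hash.keys

# The mutable string was replaced by a frozen copy...
puts stored_mutable.equal?(mutable_key) # false
puts stored_mutable.frozen?             # true
puts mutable_key.frozen?                # false: the original is left untouched

# ...while the already frozen string was stored as is, with no copy.
puts stored_frozen.equal?(frozen_key)   # true
```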
I believe that illustrates the common tradeoffs at play with mutable strings. On one hand, they can be much more efficient, allowing for in-place modifications, but on the other hand, they impose extra allocations and copying to protect yourself from mutations.
To avoid this extra copying overhead, it used to be a fairly common optimization technique to store string literals in constants. For instance, you can see this idiom in a 17-year-old patch to rack:
module Rack
  class MethodOverride
    METHOD_OVERRIDE_PARAM_KEY = "_method".freeze
    HTTP_METHOD_OVERRIDE_HEADER = "HTTP_X_HTTP_METHOD_OVERRIDE".freeze

    def call(env)
      # ...
      method = req.POST[METHOD_OVERRIDE_PARAM_KEY] ||
        env[HTTP_METHOD_OVERRIDE_HEADER]
      # ...
    end
  end
end
It’s this pattern that led Hailey Somerville from GitHub to open a feature request to propose a new syntax for frozen string literals: %f.
req.POST[%f(_method)] || env[%f(HTTP_X_HTTP_METHOD_OVERRIDE)]
This syntax wasn’t accepted, but as a counter proposal, Yusuke Endoh (mame) suggested an “f suffix”:
req.POST["_method"f] || env["HTTP_X_HTTP_METHOD_OVERRIDE"f]
This one was accepted and implemented in Ruby 2.1.0dev.
However, many core developers didn’t like this new syntax, so even after its implementation, multiple counterproposals were made.
Notably, Akira Tanaka (akr) proposed a file-based directive: # freeze_string: true, but it didn’t catch on.
However, before the final 2.1.0 release, Charles Nutter opened another feature request and suggested instead implementing a compiler optimization for String#freeze, so as to provide the same feature without introducing a new syntax.
If you aren’t familiar with how the Ruby virtual machine works, or virtual machines in general, you may be surprised to hear that Ruby has a compiler, but it absolutely does.
Prior to Ruby 2.1, the program "Hello World".freeze would be compiled by Ruby into a sequence of two instructions:
>> puts RubyVM::InstructionSequence.compile(%{"Hello World".freeze}).disasm
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,19)>
0000 putstring "Hello World" ( 1)[Li]
0002 opt_send_without_block <calldata!mid:freeze, argc:0, ARGS_SIMPLE>
0004 leave
First, a putstring instruction to put "Hello World" on the VM stack, followed by an opt_send_without_block to call the #freeze method on it.
def putstring(frozen_string)
  @stack.push(frozen_string.dup)
end
When invoked, the instruction receives a reference to a frozen String object that has been created by the Ruby compiler.
But since the semantics require that the string #freeze is called on be mutable, the instruction has to duplicate it, and it’s the mutable copy that is put on the stack.
In my opinion, the putstring instruction isn’t correctly named, because its name suggests it just puts the frozen string directly on the stack.
This isn’t consistent with other put* instructions like putobject, which directly puts an object on the stack without duping it:
def putobject(object)
  @stack.push(object)
end
It is also inconsistent with some other instructions like duparray and duphash, which actually behave like putstring does:
def duparray(array)
  @stack.push(array.dup)
end
So it would be much clearer if it had been named dupstring instead of putstring.
But anyway, Charles’ suggestion was to have the compiler generate a different set of VM instructions when the #freeze method is called on a string literal:
>> puts RubyVM::InstructionSequence.compile(%{"Hello World".freeze}).disasm
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,20)>
0000 opt_str_freeze "Hello World", <calldata!mid:freeze, argc:0, ARGS_SIMPLE>( 1)[Li]
0003 leave
As you can see, on more recent rubies, the putstring and opt_send_without_block instructions have been replaced by a single opt_str_freeze.
Its implementation in pseudo-Ruby would be something like:
def opt_str_freeze(frozen_string)
  if RubyVM.string_freeze_was_redefined?
    @stack.push(frozen_string.dup.freeze)
  else
    @stack.push(frozen_string)
  end
end
As you can see, to not break semantics, the instruction has to check that String#freeze hasn’t been redefined, but apart from that cheap precondition, the instruction does strictly less work than before.
This is the feature Ruby 2.1.0 ultimately shipped with in December 2013.
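The effect of that optimization is observable from plain Ruby through object identity. This is a small sketch assuming CRuby 2.1 or later; the method names are made up for the example.

```ruby
# A literal followed by .freeze: compiled to opt_str_freeze on CRuby.
def frozen_literal
  "Hello World".freeze
end

# Freezing a string built at runtime: a new object is allocated on every call.
def frozen_at_runtime
  String.new("Hello World").freeze
end

puts frozen_literal.frozen?                        # true
puts frozen_literal.equal?(frozen_literal)         # true: the same object every time
puts frozen_at_runtime.equal?(frozen_at_runtime)   # false: one allocation per call
```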
To further reduce string allocations, in 2014, Aman Karmani (tmm1) and Hailey Somerville (haileys) from GitHub submitted a patch to add two more optimized instructions, opt_aref_with and opt_aset_with.
Before their patch, accessing a hash with a string key would cause a string allocation:
>> puts RubyVM::InstructionSequence.compile(%{some_hash["str"]}).disasm
...
0003 putstring "str"
0005 opt_aref <calldata!mid:[], argc:1, ARGS_SIMPLE>[CcCr]
0007 leave
After the patch, these two instructions were replaced by a single opt_aref_with:
>> puts RubyVM::InstructionSequence.compile(%{some_hash["str"]}).disasm
...
0003 opt_aref_with "str", <calldata!mid:[], argc:1, ARGS_SIMPLE>
0006 leave
Similar to opt_str_freeze, these instructions would check if the method is being called on a Hash, and if Hash#[] hadn’t been redefined.
When both conditions are true, the instruction would be able to look up in the hash without first copying the string.
def opt_aref_with(frozen_string)
  if RubyVM.hash_aref_was_redefined? || !@stack.last.is_a?(Hash)
    # fallback
    @stack.push(frozen_string.dup)
    value = RubyVM.call_method(:[], 1)
    @stack.push(value)
  else
    # fast path
    hash = @stack.pop
    value = hash[frozen_string]
    @stack.push(value)
  end
end
According to Aman Karmani, this reduced allocations in GitHub by 3%, which is quite massive for what is a relatively small patch.
As a sidenote, this optimized instruction has just been removed by Aaron Patterson on the Ruby trunk, because given most performance-sensitive code already uses the magic comment, this optimization no longer yields much benefit.
Perhaps in part because of that new feature, or perhaps for other reasons, knowledge of the performance impact of all these useless string duplications in Ruby applications started to spread around 2014. Some community members, notably Richard Schneeman, started to submit pull requests to Rails, rack and a bunch of other gems, with some pretty significant results, such as an 11.9% latency reduction on codetriage.com.
These performance gains were generally too good to pass up, but regardless, many people felt that the resulting code was much uglier. So the question of freezing strings by default came back regularly, but was always rejected.
That changed when Akira Matsuda (amatsuda) brought the issue up again at the Ruby core developer meeting in August 2015, and there Matz decided that Ruby string literals would be frozen in Ruby 3.0.
A number of other features to ease the transition were also decided.
First, the # frozen_string_literal: true magic comment was introduced to help gems prepare for Ruby 3.0.
Then, to ensure that any code that wouldn’t have been made compatible with Ruby 3.0 would remain usable, two Ruby command line options were added: --enable-frozen-string-literal and --disable-frozen-string-literal.
This way, once Ruby 3.0 was released, if your code or one of your dependencies wasn’t compatible yet, you could just set RUBYOPT="--disable-frozen-string-literal" and keep going.
A --debug-frozen-string-literal command line option was also added, to help developers track down where a frozen literal was created.
All these new features were released with Ruby 2.3 in December 2015.
What happens when you run Ruby with --enable-frozen-string-literal or with the # frozen_string_literal: true magic comment is that the compiler generates a different bytecode:
>> puts RubyVM::InstructionSequence.compile(%{# frozen_string_literal: true\n"Hello World"}).disasm
== disasm: #<ISeq:<compiled>@<compiled>:2 (2,0)-(2,13)>
0000 putobject "Hello World" ( 2)[Li]
0002 leave
Now, instead of the putstring instruction, the compiler generates a putobject instruction.
As I mentioned above, this instruction directly puts the frozen string that was created during compilation on the stack, with no extra duplication.
So it’s important to understand that frozen string literals are strictly less work for Ruby than mutable string literals.
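The difference between the two instructions can also be seen without reading bytecode, by evaluating the same literal twice under each setting. This sketch relies on the CRuby-specific RubyVM::InstructionSequence API already used above; the variable names are mine.

```ruby
frozen_src  = %{# frozen_string_literal: true\n"Hello World"}
mutable_src = %{# frozen_string_literal: false\n"Hello World"}

frozen_iseq  = RubyVM::InstructionSequence.compile(frozen_src)
mutable_iseq = RubyVM::InstructionSequence.compile(mutable_src)

# With the magic comment set to true, every evaluation returns the same frozen object.
a = frozen_iseq.eval
b = frozen_iseq.eval
puts a.frozen?   # true
puts a.equal?(b) # true: no allocation, no copy

# With it set to false, every evaluation duplicates the literal.
c = mutable_iseq.eval
d = mutable_iseq.eval
puts c.frozen?   # false
puts c.equal?(d) # false: a new String each time
```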
Following the release of Ruby 2.3, the Rubocop project added a new cop to enforce the use of the # frozen_string_literal: true comment, with the intent of helping projects be ready for Ruby 3.0 in the future.
Over the following years, many projects migrated to frozen string literals, including Rails and rake in 2017, Rack in 2018, and of course a long tail of other projects.
It’s always hard to say with certainty how much a feature is used, but I think it’s safe to say that, aside from a few projects that deliberately chose not to follow suit, a large majority of the actively developed gems did migrate to frozen string literals. However, many of the more stable and less actively developed gems didn’t.
There was no indication of when Ruby 3.0 would be released, and the lack of compatibility with it wasn’t advertised by warnings or any other methods, hence, few people even knew whether any of their dependencies needed to be updated.
Over time, the magic comment slowly became an incantation most Rubyists follow, in big part because of rubocop, but as far as I know, basically no one was trying to run their application with --enable-frozen-string-literal, and few even knew about it.
Then, ahead of Ruby 3.0, Matz abandoned the plan:
“I consider this for years. I REALLY like the idea but I am sure introducing this could cause HUGE compatibility issue, even bigger than Ruby 1.9. So I officially abandon making frozen-string-literals default (for Ruby3).”
– Matz
I must say this decision did surprise me at the time.
I definitely understand not wanting to cause a Python 3 sort of moment, but I don’t think frozen string literals would have caused it,
because ultimately you could always have set RUBYOPT="--disable-frozen-string-literal" and kept running your applications unchanged if necessary.
I’m pretty sure if Python 3 had a way of running Python 2 code, the migration would have been much less of a big deal.
It was even more surprising to me because Ruby 2.7 also introduced new deprecation warnings in preparation for the keyword argument change in Ruby 3.0, and from my point of view, this breaking change was way bigger than frozen string literals would ever have been.
It caused so many deprecations that a Ruby 2.7.2 was later released specifically to turn deprecation warnings off.
And arguably, updating code to support the new keyword argument logic was way more involved than for frozen string literals.
If you have a look at the migration guide, it’s fairly long and complex, whereas frozen string literals only need a few strategically placed .dup calls here and there.
As a datapoint, I personally handled the migration of Shopify’s monolith and roughly 700 gem dependencies for both the Ruby 3.0 keyword arguments and for --enable-frozen-string-literal.
For keyword arguments, I had to send pull requests to almost a hundred gems, as well as change a lot of code in the monolith itself, and some of them were really non-trivial to fix.
For frozen string literals, I only had to send pull requests to 12 gems, and it was just a matter of adding a few .dup calls.
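To give an idea of what such a fix typically looks like, here is an illustrative sketch, not code taken from any of those gems. The method and its name are hypothetical; the point is only that a buffer initialized from a literal gets replaced by an explicitly mutable string.

```ruby
# Before: `buffer = ""` raises FrozenError on the first << once literals are frozen.
# After: start from an explicitly mutable string. `"".dup` and `+""` work too.
def render_list(items)
  buffer = String.new
  items.each do |item|
    buffer << "- " << item << "\n"
  end
  buffer
end

puts render_list(["apples", "pears"])
```

This version behaves the same whether or not the file or the process enables frozen string literals, which is what makes the migration mostly mechanical.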
But anyway, by the time of the Ruby 3.0 release, it had been almost 5 years since the initial plan had been laid out, and most of the performance-sensitive code had migrated to use the magic comment, so this abandonment didn’t spark much discussion, and few people noticed.
Then, four years later, in January 2024, I started hearing about standardrb and how it doesn’t enforce the presence of the frozen string literal magic comment.
I also saw a few projects starting to remove them, or new projects deliberately not adding them, because this extra comment at the top is seen as cruft.
And I must say I agree. I hate that comment.
Back when I started with Ruby, in version 1.8, the default encoding of source files was ASCII, so we frequently had to add a magic comment at the top of the file to tell Ruby they were encoded in UTF-8.
# encoding: utf-8
I hated that comment back then, because what I always loved about Ruby is that the source code is almost entirely free of boilerplate. So when Ruby 2.0 made UTF-8 the default encoding, and we could finally get rid of all this cruft, it made me extremely happy.
I would love to do the same with the frozen string literal comment, but once you are aware of all these useless allocations and copies, it’s really hard to unsee.
I’m now familiar enough with the VM that when I look at code without the magic comment, I pretty much visualize the implicit dup calls.
When I look at code like this:
env["HTTPS"] == "on" ? "https" : "http"
I can’t help but see this:
env["HTTPS".dup] == "on".dup ? "https".dup : "http".dup
Which drives me nuts. And yes, these are small strings, and the GC got faster in the last few years, but still, string literals are everywhere, so these allocations add up and cause a death by a thousand cuts.
So seeing that the community was slowly unlearning this lesson pained me, and I decided I’d try to revive the initiative.
In my opinion, what the initial plan lacked was a proper deprecation path. Many Ruby users had heard the default would change with Ruby 3.0, but Ruby itself never emitted any deprecation to warn users that code would need to be updated, so very little work happened to prepare for it.
Hence, if I wanted to convince Matz to try again, I needed to come up with a way to emit useful deprecation warnings whenever some code would mutate a literal string. That’s where I came up with the concept of chilled strings.
Starting from Ruby 3.4, when a source file has no frozen_string_literal comment (either true or false), instead of generating putstring instructions, the compiler now generates putchilledstring instructions:
>> puts RubyVM::InstructionSequence.compile(%{puts "Hello World"}).disasm
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,18)>
0000 putself ( 1)[Li]
0001 putchilledstring "Hello World"
0003 opt_send_without_block <calldata!mid:puts, argc:1, FCALL|ARGS_SIMPLE>
0005 leave
This new instruction is identical to putstring, except it additionally marks the newly allocated string with the STR_CHILLED flag.
Then I modified the rb_check_frozen function, which is responsible for raising FrozenError when a frozen object is mutated, to also check for that flag.
When a chilled string is mutated, a deprecation warning is emitted, and the flag is removed so that only the very first mutation emits a warning:
>> Warning[:deprecated] = true
=> true
>> "test" << "a" << "b"
(irb):3: warning: literal string will be frozen in the future (run with --debug-frozen-string-literal for more information)
=> "testab"
The migration plan is that in a yet to be defined future version, these deprecation warnings would be visible by default, and then in a further version, frozen string literals would become the default.
Just like in the previous discussions back in 2014, Yusuke Endoh (mame) objected to the change, arguing that the performance benefits of frozen string literals were never properly measured because back in 2014, lots of code wasn’t compatible so it wasn’t possible to measure.
“how much would the performance degrade if we removed # frozen_string_literal: true from all code used in yjit-bench?”
So I went ahead and built a modified Ruby interpreter on which the magic comment had no effect, and benchmarked it against mainline Ruby.
The results were that frozen string literals make Lobsters, an open source discussion board in Rails, 8-9% faster.
It also made railsbench, a synthetic Rails application, 4-6% faster, and liquid-render 11% faster.
And one thing to note is that the benchmarked codebase and its dependencies, like Rack, still contain lots of code that was hand-optimized from the pre-frozen string literal days to avoid allocations. So the difference would certainly be larger if mutable string literals weren’t already worked around.
Similarly, back then I was surprised to only see a meager 1-2% gain on the erubi-rails benchmark, given it’s quite string-heavy.
But in retrospect, it’s very much expected because one of the biggest performance tricks of erubi is that it works around mutable string literals in its code generation by leveraging opt_str_freeze instructions:
>> puts Erubi::Engine.new("Hello <% name%>!").src
_buf = ::String.new; _buf << 'Hello '.freeze; name; _buf << '!'.freeze;
_buf.to_s
All this makes it hard to come up with a clear measure of the performance benefits of freezing string literals. At this point, making them the default is more about allowing Rubyists to write nicer and less contrived code than about improving performance.
After some more rounds of discussion, Matz accepted the proposal but without committing to any specific timeline, and I implemented the feature with Étienne Barrié, which shipped with Ruby 3.4.0.
So at this point, it may look like a done deal. The deprecations are in place, it’s just a matter of deciding when to flip the switch.
But as we’ve seen in the past, that doesn’t mean much. Matz may still change his mind at any point, and there are still a few Ruby core members actively campaigning against frozen string literals.
Personally, I’m quite tired of arguing about it. It might be a personal bias, given the overwhelming majority of the code I interact with has been frozen string literal compatible for a decade, but it seems to me that the Ruby community very largely adopted frozen string literals, so for me it seems obvious to make it the default.
But not everyone in Ruby core has the same view of the community. Some members like Mame are very involved in quines and other forms of artistic programming like TRICK, in which mutable string literals are used a lot. So I understand that for him, switching the default means breaking a number of historical programs he cares about.
Ultimately, as always with Ruby’s direction, it will come down to what Matz decides. For now, he has publicly accepted the migration plan, but not yet committed to any timeline, and I’m not sure Matz really has a vision of what the community at large desires on this topic. With Ruby 4.0 being likely released this year, it’s very possible this migration stays in limbo for years and is ultimately abandoned again.
At the end of the day, I don’t care so much about frozen string literals being the default. I just want to be able to stop adding this ugly comment at the top of my files, without losing the performance benefit and without having to explicitly freeze my constants.
An alternative to changing the default could be to allow setting compiler options for entire directories.
This would allow Rubyists to enable frozen string literals in a single place, typically the gemspec or Rails config.
However, this would fragment Ruby more, because it means a given code snippet may or may not work based on where it is located. This was already a concern with the magic comment, it would be an even bigger one with directory-based compiler options. So I’m not sure Matz would be ok with that.
I can’t predict what the future of string literals in Ruby will be. I do hope they’ll be frozen a few years from now, but I’m not holding my breath.
In the meantime, I do encourage gem authors to test their gems with --enable-frozen-string-literal.
What is certain, however, is that performance-wise, they only have upsides, as they’re strictly less work for the Ruby VM, but your performance-sensitive dependencies likely already use them, or at least work around mutable string literals in the hot paths.
Hence, you are unlikely to notice a big difference if you were to run your application with RUBYOPT="--enable-frozen-string-literal".
However, if you do measure a negative performance impact, there is no doubt you are measuring incorrectly.
[1] A previous version of the post wrongly listed PHP as a language with immutable strings.
So here it is: I am deeply convinced that, contrary to what has been alleged recently, Shopify has nothing but good intentions toward Ruby and its community.
It is healthy to be skeptical toward corporations, I certainly am, but I believe Shopify is currently receiving undue distrust considering their track record of massive investment in the Ruby ecosystem. And some of that may be due to a lack of understanding of how they engage with Open Source communities.
So I’ll try to explain what they do, how they do it, and why we need more companies like Shopify, not less.
As is customary in this sort of situation, I first need to disclose the nature of my relationship with Shopify.
I could try to brush it off by just saying that I was employed by them from November 2013 to August 2025, but in my opinion, that would be a cop-out. Knowing that someone has been previously employed by someone else doesn’t tell you anything about where they’re speaking from. Worse, instead of enlightening you on which biases the author might have, it might let you think they have insider knowledge, hence are even more reliable.
What is important to disclose is how the relationship ended.
In my case, I left Shopify for several reasons, but mainly because of my constant friction with the CEO. Ever since my first interaction with him twelve years ago, I knew he was someone I couldn’t see eye to eye with on almost every subject. Even when I’d occasionally happen to agree on a specific topic, his overly maximalist position and lack of nuance would drive me away. The only reason I managed to stick this long at the company is that I made sure to pick projects and teams so as to minimize my interactions with him.
And the reason why I ended up quitting is that it was no longer possible to avoid him. Since I consider him directly responsible for my burnout last year, I couldn’t possibly stay any longer.
I could go on for hours about all the hard feelings, but this is not really the place; I only mean to share enough to explain where I’m speaking from. What is important to know is that I have absolutely zero reasons to give a pass to Shopify over anything.
But despite my personal feelings and history, it has to be said that Shopify’s CEO is a Rubyist at heart, almost to a fault.
Contrary to what you might think, Ruby isn’t all that popular at Shopify. Even when I started back in 2013, only a small fraction of new hires had any prior experience with Ruby, and a decade later, there aren’t so many proud Rubyists in the Shopify ranks. Most developers, and even many executives, would rather use something else.
Yet, Ruby and Rails remain the default stack at Shopify, and the only reason for that is the CEO. Every Shopify employee knows that suggesting straying away from Ruby wouldn’t fly there. And I’m convinced that if it were anyone else at the helm, Shopify would have joined the long list of companies that attempted to migrate to something else and are now stuck with both a Ruby monolith and a ton of half-migrated micro-services in Java or Go.
Hence, it’s important to recognize that people are multidimensional. Just because you can’t see eye to eye on some topic doesn’t mean you can’t be allies (even if only by circumstances) on another.
But Shopify isn’t only its CEO.
As Rubyists, the side of Shopify you are the most likely to interact with, or at least be familiar with, is the Ruby and Rails Infrastructure team (R&RI).
It’s a team of 40ish people. They’re the ones you see on countless GitHub issues and pull requests, maintaining countless projects, and speaking at conferences. I know all of them very well, and I can attest that, barring a couple of rare exceptions, they’re all long-time proud Rubyists, not mercenaries nor zealous “company men”.
I believe, without the shadow of a doubt, that if Shopify ever started to have ill intentions toward the community, many people in the R&RI team would either resign or call it out or both. At the very least, they would confide in other members of the community, and that would inevitably be public rather quickly.
You may think I’m exaggerating, and surely with their cushy salaries, many of them would have second thoughts. But I honestly don’t think so. Shopify isn’t even paying that well (depending on the market). Based on my discussions with the people who left the team over the years, the most common cause of voluntary departures, by far, was compensation. And most of the team could find another job rather quickly anyway, even in this market.
What makes them stay at Shopify, and why it took me so long to finally decide to quit, is that right now, it is hands down the best place in the world to contribute to the Ruby ecosystem. Nowhere else comes close, and that’s all due to Shopify’s philosophy toward Open Source.
Whether you realized it already or not, all the code you depend on, all the code that runs on your servers, is your code. It doesn’t matter if it was written by someone you never met in Nebraska, or by a multi-billion-dollar corporation.
You run it, you own it.
If it has a bug, if it is missing a feature, or if it has any other needs, that’s on you to figure out the solution for yourself. There’s no relying on the original author to get that responsibility off your plate.
To illustrate this, I remember back in 2014 or 2015, when MySQL servers started segfaulting in production at a regular interval. IIRC, Shopify had a support contract with a MySQL consultancy, and they probably were notified of it. But we didn’t sit there waiting for the “owners” or experts to figure it out.
It was a colleague who went knee-deep in core dumps to figure out this was caused by an alloca call in a non-leaf function, causing a stack overflow, produced a patch, patched our MySQL servers, and then sent the patch upstream1.
This philosophy is at the heart of Shopify’s Ruby & Rails Infrastructure team. It is determined not just to be a user of the open source ecosystem, but to proactively engage with it, contribute, and make it better through engineering time and contributions. Not by delegating the responsibility to a third party nor exploiting maintainers’ goodwill.
I also sometimes hear people saying that Shopify is snatching all the super senior Ruby developers, but I’d argue that’s mostly untrue. The reality is that in most cases, Shopify is growing these developers internally.
Take Kevin Newton, for instance. He started as a product developer at Shopify, but after a few years, he pitched his vision of a universal parser for Ruby, managed to get transferred to the R&RI team, worked on the project that became Prism, became a Ruby core committer, won the Ruby Prize award, etc. Since then, he left Shopify to work on a Python JIT at Meta, yet he is still maintaining Prism, because he is a Rubyist at heart. And Kevin is far from the only example of that; I am one as well, and so are dozens of my former teammates. Some, like Peter Zhu, even started as interns.
The reason I’m explaining this is that I feel there is a part of the community that is naturally distrustful of Shopify or corporations in general. I don’t blame them: there have been countless examples of nefarious behavior from companies, so it’s logical and healthy to at least be skeptical. But it’s also important to recognize and salute positive behavior when it happens.
In this specific case, I believe that recently, Shopify has been giving the community something that is priceless: a large number of proficient and deeply committed contributors to Ruby itself and the whole ecosystem. And I’d argue that is way more valuable for the future and sustainability of Ruby than any amount of money.
Usually, when the topic of Open Source sustainability comes up, it ends up revolving around how to make companies pay for developers’ time. There is this idealized image of Open Source being an amalgamation of lone developers tirelessly maintaining projects for free, eating ramen while big bad companies make huge profits out of their work. There is definitely some truth to it, it is far from uncommon, but it’s also a bit of a tired cliché.
The Open Source ecosystem is also a lot of projects that are contributed to by people on various companies’ payrolls. Linux is the poster child of healthy corporate involvement, with the overwhelming majority of contributions coming from employees of companies with a vested interest in the kernel. That’s just one example, but when you look at big and complex open source projects, most of the time you’ll see big companies involved in one way or another. That’s how most of the sustainable open source happens today, way more than through donations.
Hence, I’d argue that if an open source community wants to be sustainable, it needs to be welcoming of corporate contributions. I don’t mean trust them blindly, it’s important to keep them in check just in case, but you have to let them play ball.
Ruby has successfully done that. Back in 2019, Rafael França and Matz met in Bristol. Rafael asked Matz what he needed, and Matz answered: “I need people”. That’s how the Ruby and Rails Infrastructure team started getting involved in Ruby development, that’s what ultimately led to YJIT, now ZJIT, numerous GC improvements like Variable Width Allocation, modular GC, Prism, tons of Ractors improvements, etc. But more importantly, almost a dozen new Ruby core committers.
I would wager that if that day Matz had asked for money, we’d have much worse results to show for it.
And aside from worse results, I’d argue it would have created perverse incentives.
I have nothing but respect for people who try to find ways to fund open source development in alternative ways. However, it’s important to look at it through the lens of structures and incentives.
Whenever you design a system that involves people, you need to consider how a person who tries to maximize their personal benefits is incentivized to behave.
A typical example is ticket inspectors on trains and buses. You may be tempted to give them a cut of the fines they hand out, so as to incentivize them to work harder. But by doing so, you incentivize them to be inflexible with commuters, causing a lot of conflicts instead of resolving situations peacefully. Some of them might even be incentivized to give bullshit fines to earn a little extra money2.
If a system requires all the people involved to be perfect and act selflessly, then I’d argue it’s a flawed system.
Now, if Shopify had instead poured millions in cash into the Ruby Association or Matz himself, how would you, I, or anyone be able to trust that the project’s direction and decisions are free of influence? How to trust that a given feature was accepted solely on its own merit and not just because it came from a big sponsor? Inversely, when a feature is declined, how do you trust it wasn’t because it didn’t come from a sponsor?
That’s the thing with money, once you have it, it’s very difficult to do without. When a big sponsor pulls out, you have to lay off staff, stop some initiatives, etc. So even if you publicly declare that there are no strings attached, even if you never explicitly say anything about it, entities and people who receive funding are naturally incentivized to keep the donor happy so that the funding keeps coming.
Whereas with corporate contributors, sure, their employer may decide to assign them to another project, but there are no hard consequences, and most of them will stick around regardless. Most will even remain contributors if they quit or are laid off.
You can actually witness that dynamic between Shopify and Ruby publicly, for instance, in how Prism is now the default parser, but isn’t yet the only official parser. I can tell you that this has ruffled quite a few feathers at Shopify, but that’s the thing, Matz and Ruby don’t feel indebted to Shopify, they feel entirely free to say no. And I think that’s how it should be.
To be clear, I’m not saying open source should be free of any monetary exchanges, just that it’s crucial to do it in a way that doesn’t let these sorts of suspicions arise.
I know some people will object to the above, arguing that this is all open source, so if you are not happy with the direction of the project, you can always fork, ergo: shut up! And while this is true in most cases, in practice, there are some projects that aren’t as easily forked because of their position.
For instance, if you look at Sidekiq, it’s making loads of money with its Pro and Enterprise offerings, and quite openly declines some features in the open source project so as not to cannibalize sales. As far as I am aware, pretty much everyone is fine with it. Sure, you’ll find a few people complaining about it, but that’s just background noise.
This is because Sidekiq isn’t on any critical path, there are plenty of alternatives you can go for if you aren’t satisfied with it, and if you wish to fork it and add such a feature for yourself, it’s pretty trivial, you don’t need to convince anyone. Hence, everyone sees it as fair.
However, some projects have a moat. A dominant position granted by another project. Imagine if, instead of allowing you to use any job processor you want through Active Job, Rails had instead decided to make Sidekiq the only option. In such a world, then I believe a whole lot more people would be upset or suspicious, because the bar to clear to use an alternative would be way higher. A lot of Rails users would feel captive.
Well, I would argue that rubygems is in such a situation.
It is distributed with Ruby, required early during the Ruby boot process, is coupled with all distributed gems via the gemspec format, etc.
Because of this, it has a massive moat.
Forking it to build and use your own alternative to it is hardly viable, even for a big team like Shopify’s Ruby and Rails Infrastructure team.
As such, while it’s still nothing but commendable to try to fund its maintenance work, you have to be careful to avoid any perverse incentives and conflicts of interest. Otherwise, even if you are exceptionally selfless and well-intentioned, you will inevitably spur suspicion whenever you refuse contributions or ask for sponsorship on a GitHub issue.
Unfortunately, it did happen.
Over the past decade, people in the community, not just Shopify employees, started to conclude that rubygems and bundler were being monetized by some key maintainers. To be clear, I’m not trying to convince anyone that this was actually the case. Some of that dirty laundry that has been an open secret among the Ruby maintainers’ community for a long time has recently been aired out, and I suspect there’s more to come. You are free to form your own opinion on the topic if you so wish.
But my point is that it doesn’t actually matter whether rubygems was actually being unduly taken advantage of or not. Ultimately, it’s down to who and what you consider legitimate.
My point is that the economic model chosen to fund rubygems’ maintenance, combined with its critical position in the ecosystem, has allowed for these suspicions to exist and persist, creating tensions and driving potential sources of funding away.
Again, I believe the problem is with structures and incentives, as well as optics, not specific people being imperfect or ill-intentioned.
Because of this, the relationship between Shopify and the various entities overseeing rubygems development has been quite rocky for a long time.
As you are probably aware, supply chain security has been a hot topic in the corporate world, hence, around 2021, Shopify started trying to contribute more to rubygems, and an entire team of developers was assembled with the goal of helping the upstream projects.
I no longer have access to all the history, and some details are now blurry. But from what I recall, there were various goals, such as requiring multi-factor authentication to publish the most popular packages, making code signing easier, and a few other topics.
However, that initiative didn’t exactly receive a warm welcome from upstream. It’s not that these features weren’t desired, but the understanding on Shopify’s side was that maintainers preferred to be paid to do it, rather than just accept contributions.
This is what ultimately led to Shopify funding Ruby Central directly (other than being a recurring major sponsor at their conferences for years). The deal was for one million dollars over 4 years, under the name Ruby Shield.
But even after that, the feeling on the Shopify side was that upstream was still uncooperative, until ultimately they decided to cut their losses and re-assigned engineers elsewhere. The 4-year funding deal remained, but not much was expected of it.
Shopify could have threatened to pull funding at that time to try to coerce Ruby Central, yet they didn’t.
As I said earlier, ever since this controversy started, I’ve been unconvinced by the theory that all this would have been orchestrated by Shopify or through Shopify. That simply would have required involving too many people, and I absolutely can’t imagine that none of them would have objected in one way or another.
But anyway, since then, I did contact two former coworkers, and they both assured me that Shopify never threatened to pull Ruby Central’s funding, nor threatened not to renew it.
Now, as I tried to explain earlier, even if you loudly claim money comes with no strings attached, people and entities are naturally incentivized to do what they think is necessary to keep it coming. As such, it’s entirely possible that despite the absence of threats, Ruby Central’s moves may have been motivated by the need to secure the existing funding and/or find additional sources of funding.
My former coworkers also told me their side of the story, and it’s absolutely nothing like what has been alleged so far. I deeply trust these two people, and I can’t possibly imagine they’d be lying to me, but I’d understand if you don’t want to take my word for it.
I don’t know when their side of the story will come out, nor if it will come out at all, but I do hope it comes out soon and with receipts. Seeing so many good-natured and well-intentioned people get demonized like they have been over the last few weeks is depressing.
It is undeniable that, regardless of what Ruby Central’s intentions were, the communication and execution have been abysmal. It is also true that there is a deep disagreement about what they rightfully or legitimately owned that won’t easily be resolved. However, I can’t believe the entire organisation was ill-intentioned, here again, that would involve too many people to be conceivable.
Similarly, the claim that Aaron sending patches to rubygems is a clue that there was a conspiracy at play drives me nuts.
I’ve seen these pull requests being made with my own eyes, and I can tell you that the reason is way more mundane than that.
We were at Rails World, someone mentioned rv, the question of why you’d need to write something in Rust to speed up gem installation was raised, and Aaron and a few others started to profile Bundler to see if it could be made faster.
That’s it, that’s all there is. Aaron got nerd sniped into making Bundler faster, and now he’s being called out for supposedly being part of a hostile takeover? Give me a break.
I think it’s healthy to be wary of Shopify’s huge footprint on the ecosystem. Companies are fickle beings, and even if I’m not particularly concerned about them ever having ill intent toward the Ruby ecosystem, it’s not impossible that in the future they may decide to invest less.
But the response shouldn’t be to try to cast Shopify and its employees aside. It would be silly to punish them for helping too much. What we need is more companies doing their part. Both to reduce Shopify’s relative influence, but also to have more diverse perspectives, use cases, and priorities.
I’m not saying every company should have a team as big as Shopify’s R&RI, but there are numerous Ruby-based companies with valuations in the billions and several hundred developers on their payroll, yet they contribute very little upstream. If you work at one of these companies, you should really consider how you could do more.
That’s what I intend to do at my next job, to get one more Ruby company to pull its weight.
I also explained how I removed two of these contention points, the object_id method, and class instance variables.
Since then, the situation has improved quite drastically, as numerous other contention points have been either eliminated or reduced by me and my former teammates. I’m not going to make a post for each of them, as in most cases it boils down to the same RCU technique I explained in the post about class instance variables.
But there’s one such contention point I find interesting and that I’d like to write about: the generic instance variables table.
As a Ruby user, you are likely familiar with the idea that everything is an object, and that is somewhat true, but that doesn’t mean all objects are equal. I already touched on that subject in some of my previous posts, so I’ll do it quickly.
In the context of instance variables, in the Ruby VM you essentially have 3 or 4 types of objects, depending on how you count.
First, you have the “immediates”, small integers (1), booleans (true, false), static symbols (:foo, but not dynamic symbols like "bar".to_sym), etc.
These are called immediates because they don’t actually exist in memory; they don’t have an allocated object slot on the heap. Their reference is their value.
In other words, they’re just tagged pointers.
Hence, they can’t have instance variables, and Ruby will treat them as if they were frozen to maintain the illusion of parity with other objects:
>> 42.instance_variable_set(:@test, 1)
(irb):2:in 'Kernel#instance_variable_set': can't modify frozen Integer: 42 (FrozenError)
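The same behavior can be checked from a script. This is only a small illustrative sketch; the exact error message differs between Ruby versions, so it only looks at the exception class:

```ruby
# Immediates have no heap slot, so Ruby reports them as frozen.
immediates = [42, true, false, nil, :foo]
all_frozen = immediates.all?(&:frozen?)
puts all_frozen # → true

# Trying to set an instance variable on one raises FrozenError.
error = begin
  42.instance_variable_set(:@test, 1)
  nil
rescue FrozenError => e
  e
end
puts error.class # → FrozenError
```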
Then you have the more regular T_OBJECT, for your user-defined classes.
In the case of T_OBJECT, instance variables are stored inside the object’s slot like an array.
Consider the following object with 3 instance variables:
class Foo
  def initialize
    @a = 1
    @b = 2
    @c = 3
  end
end
It will fit in the base 40B object slot.
16B is being used for the object’s flags and a pointer to its class, and the remaining 24B is used for the three instance variable references:
| flags | klass | @a | @b | @c |
|---|---|---|---|---|
| T_OBJECT | 0xffeff | 1 | 2 | 3 |
In some cases, if an instance variable is added later and the slot is full, the Ruby VM may have to allocate a separate memory region and “spill” the instance variables there, but this is actually fairly rare. The VM keeps track of how many variables the instances of each class have, so if Ruby ever has to spill, every future instance of that class will be allocated in a larger slot.
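If you want to observe slot sizes yourself, the objspace extension can report how much memory an object uses. The sketch below is only an illustration: the class names Small and Large are made up for the example, and the exact numbers depend on the Ruby version and platform, so no output is shown.

```ruby
require "objspace"

# Hypothetical class with few enough ivars to fit in a small slot.
class Small
  def initialize
    @a = 1
    @b = 2
    @c = 3
  end
end

# Hypothetical class with more ivars than a small slot can embed.
class Large
  def initialize
    10.times { |i| instance_variable_set(:"@ivar_#{i}", i) }
  end
end

small_size = ObjectSpace.memsize_of(Small.new)
large_size = ObjectSpace.memsize_of(Large.new)

puts "3 ivars:  #{small_size} bytes"
puts "10 ivars: #{large_size} bytes"
```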
The third type of objects are T_CLASS and T_MODULE. Since that was the topic of my previous post, I’ll be quick.
Class instance variables are laid out like in a T_OBJECT, except they live in a “companion” slot:
class Foo
  @a = 1
  @b = 2
  @c = 3
end
The layout of the class itself stores a reference to that “companion” slot:
| flags | klass | obj_fields | … | … |
|---|---|---|---|---|
| T_CLASS | 0xffeaa | 0xffdddd | … | … |
And that other slot is laid out exactly like a T_OBJECT, except its type is T_IMEMO for “Internal Memory”:
| flags | klass | @a | @b | @c |
|---|---|---|---|---|
| T_IMEMO/fields | 0xffeaa | 1 | 2 | 3 |
That’s a type of object that, as a Ruby user, you can’t directly interact with, nor even get a reference to; they’re basically invisible.
But they are used internally by the VM to store various data in memory managed by the GC instead of using manual memory management with malloc and free.
And then you have all the other objects. Hash, Array, String, etc.
For these, the space inside the object slot is already used.
For example, a String slot is used to store the string length, capacity, and if it’s small enough, the bytes that compose the string itself, otherwise a pointer to a manually allocated buffer.
Yet, Ruby allows you to define any instance variables you want on a string:
>> s = "test"
>> s.instance_variable_set(:@test, 1)
>> s.instance_variable_get(:@test)
=> 1
To allow this, the VM has an internal hash table, which used to be called the genivar_tbl, for Generic Instance Variables Hash-Table, and which I renamed to generic_fields_tbl_ as part of my work on object_id.
I previously explained how this works in my post about the object_id method, but I’ll reexplain here with a bit more detail, as it’s really the core topic.
Once again, I’ll use Ruby pseudo-code to make it easier:
module GenericIvarObject
  GENERIC_FIELDS_TBL = Hash.new.compare_by_identity

  def instance_variable_get(ivar_name)
    if ivar_shape = self.shape.find(ivar_name)
      RubyVM.synchronize do
        if buffer = GENERIC_FIELDS_TBL[self]
          buffer[ivar_shape.index]
        end
      end
    end
  end
end
In that global hash, the keys are the reference to the objects, and the values are pointers to manually allocated buffers.
Inside the buffer, there is an array of references just like in a T_OBJECT or a T_IMEMO/fields.
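To make the mechanism concrete, here is a runnable toy model of such a side table. It is not the VM’s code: ToyGenericFieldsTable is a made-up name, a Mutex stands in for the VM lock, and fields are stored in a Hash keyed by name rather than in a buffer indexed through shapes.

```ruby
# Toy model of a global "generic fields" side table.
class ToyGenericFieldsTable
  def initialize
    # Keys are compared by identity, so two equal strings get separate entries.
    @table = {}.compare_by_identity
    @lock = Mutex.new # stand-in for the VM lock
  end

  def set(obj, name, value)
    @lock.synchronize do
      (@table[obj] ||= {})[name] = value
    end
  end

  def get(obj, name)
    # Every read takes the lock too, which is the contention point.
    @lock.synchronize do
      fields = @table[obj]
      fields && fields[name]
    end
  end
end

table = ToyGenericFieldsTable.new
a = String.new("test")
b = String.new("test")

table.set(a, :@test, 1)
p table.get(a, :@test) # → 1
p table.get(b, :@test) # → nil, b is equal to a but is a different object
```

Even in this simplified form, every access has to hash the object and take a global lock, which is the cost described next.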
This isn’t ideal for multiple reasons.
First, having to do a hash lookup is way more expensive than reading at an offset like we do for T_OBJECT, or even chasing a reference like we do for T_CLASS and T_MODULE.
But worse, if we’re in a multi-ractor scenario, we have to acquire the VM lock for the whole operation. First, because that hash-table is global and not thread-safe, then because we must ensure that another Ractor can’t free that manually allocated buffer while we’re reading it1.
So now, you probably understand the problem.
Any code that reads or writes an instance variable in an object that isn’t a direct descendant of Object (actually BasicObject) nor Module is a contention point for Ractors.
Before I dig into what can be changed, you may wonder if it even matters.
And it’s a very fair question. Developer time isn’t unlimited, hence the question of whether it is worth removing a contention point boils down to how hot a code path it is, and how hard it is to fix it.
When I started looking at this, it was from the angle of T_STRUCT.
I wanted the instance variables of Struct and Data objects not to be contention points, since it’s not that rare to see Struct being used as some sort of code generator:
Address = Struct.new(:street, :city) do
  def something_else
    @something_else ||= compute_something
  end
end
That matters because Struct.new and Data.define don’t create T_OBJECT but T_STRUCT objects.
In these, the space inside the slot is used for the declared fields, not for the ivars.
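One way to peek at the internal type of an object is ObjectSpace.dump, which reports it in its "type" field. The helper internal_type and the Plain and Location names below are made up for this sketch:

```ruby
require "objspace"
require "json"

# Returns the VM-internal type name reported by ObjectSpace.dump,
# e.g. "OBJECT" for T_OBJECT or "STRUCT" for T_STRUCT.
def internal_type(obj)
  JSON.parse(ObjectSpace.dump(obj)).fetch("type")
end

Plain = Class.new
Location = Struct.new(:street, :city)

plain_type = internal_type(Plain.new)
struct_type = internal_type(Location.new("Main St", "Springfield"))
string_type = internal_type(String.new("test"))

puts plain_type  # → OBJECT
puts struct_type # → STRUCT
puts string_type # → STRING
```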
Another pattern I expected was C extensions. When a Ruby C extension needs to expose an API, it uses the TypedData API, which allows it to create T_DATA objects.
But it’s not rare for extensions to do as little as possible in C, and to extend that C class with some Ruby.
An example of that is the trilogy gem, which defines a bunch of C methods:
RUBY_FUNC_EXPORTED void Init_cext(void)
{
    VALUE Trilogy = rb_const_get(rb_cObject, rb_intern("Trilogy"));
    rb_define_alloc_func(Trilogy, allocate_trilogy);
    rb_define_private_method(Trilogy, "_connect", rb_trilogy_connect, 3);
    rb_define_method(Trilogy, "change_db", rb_trilogy_change_db, 1);
    rb_define_alias(Trilogy, "select_db", "change_db");
    rb_define_method(Trilogy, "query", rb_trilogy_query, 1);
    //...
}
But then augment that C class with Ruby code:
class Trilogy
  def initialize(options = {})
    options[:port] = options[:port].to_i if options[:port]
    mysql_encoding = options[:encoding] || "utf8mb4"
    encoding = Trilogy::Encoding.find(mysql_encoding)
    charset = Trilogy::Encoding.charset(mysql_encoding)
    @connection_options = options
    @connected_host = nil
    _connect(encoding, charset, options)
  end
end
That’s a pattern I really like, as it allows writing less C and more Ruby, so I would have hated having to complexify some C extensions so that they’d perform better under Ractors.
Then you have a few classics, like ActiveSupport::SafeBuffer,
which is a subclass of String with a @html_safe instance variable:
module ActiveSupport
  class SafeBuffer < String
    def initialize(str = "")
      @html_safe = true
      super
    end
    # ...snip
  end
end
So it’s not that rare for code to inherit from core types, and it can end up in hot spots. Even though I would recommend avoiding it as much as possible, for reasons other than performance, sometimes it’s the pragmatic thing to do, so users do it.
Regardless, I was quite convinced that improving this code path would be useful and started working on it. But later on, I was asked to provide some data, so while I’m breaking the chronology here, let me share it with you.
I started by doing my favorite hack in the VM, a good old print gated by an environment variable:
if (getenv("DEB")) {
    fprintf(stderr, "%s\n", rb_obj_info(obj));
}
Then I modified the yjit-bench suite to set ENV["DEB"] = "1" at the start of the benchmark loops, as I’m more interested in runtime code paths than in boot-time ones.
I then ran the shipit benchmark while redirecting STDERR to a file:
$ bundle exec ruby benchmark.rb 2> /tmp/ivar-stats.txt
And did some quick number crunching with irb:
File.readlines("/tmp/ivar-stats.txt", chomp: true).tally.sort_by(&:last).reverse
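For readers unfamiliar with that one-liner, here is the same tally and sort applied to a tiny made-up sample instead of the real log file:

```ruby
# Made-up sample standing in for the lines of the log file.
lines = ["T_HASH", "VM/thread", "T_HASH", "VM/thread", "VM/thread", "T_STRING"]

# tally counts occurrences, sort_by(&:last) orders by count, reverse puts the biggest first.
counts = lines.tally.sort_by(&:last).reverse
p counts # → [["VM/thread", 3], ["T_HASH", 2], ["T_STRING", 1]]
```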
Here are some results. It’s a very vanilla Rails 8 application, nothing fancy:
[
["VM/thread", 4886969],
["T_HASH", 229501],
["SQLite3::Backup", 122531],
["T_STRING", 70597],
["xmlDoc", 23625],
["T_ARRAY", 9039],
["OpenSSL/Cipher", 2800],
["xmlNode", 2025],
["encoding", 358],
["time", 199],
["proc", 68],
["T_STRUCT", 38],
["OpenSSL/X509/STORE", 3],
["Psych/parser", 2],
["set", 1],
]
T_STRUCT was there as I expected, but entirely dwarfed by other types.
For the ones that aren’t obvious:
- "VM/thread" is literally Thread instances.
- xmlNode and xmlDoc are nokogiri objects.
- Generally speaking, anything that doesn’t start with T_ is a T_DATA.
The T_HASH I definitely didn’t expect, and it wasn’t clear where it was coming from. So I did another hack:
if (getenv("DEB") && TYPE_P(obj, T_HASH) && (rand() % 1000) == 0) {
    rb_bug("here");
}
The rb_bug function causes the Ruby VM to abort and print its crash report, which does contain the Ruby-level backtrace.
With that, I figured these were Rack::Utils::HeaderHash instances.
As for the T_ARRAY, it seems like it was mostly from ActiveSupport::Inflector::Inflections::Uncountables.
And for "VM/thread" it comes from ActiveSupport::IsolatedExecutionState.
All the rest was various T_DATA defined by C extensions, like the trilogy example I shared.
I ran a few other benchmarks from the yjit-bench repo, and often found similar generic instance variable usages.
So to answer the question, while it’s not that big of a hotspot, I believe it’s used enough to be worth optimizing, especially for T_DATA,
and not just because of Ractors.
But as I said, before I got all that data, my sight was set on T_STRUCT.
Struct objects are laid out very similarly to T_OBJECT except that the space is used for “members” instead of instance variables.
For instance, the following struct:
struct = Struct.new(:field_1, :field_2).new(1, 2)
Would be laid out like this:
| flags | klass | field_1 | field_2 | - |
|---|---|---|---|---|
| T_STRUCT | 0xbbeaa | 1 | 2 | - |
Hence, my initial idea was that if we were to encode the struct’s layout using shapes like we do for instance variables, we’d be able to collocate members and variables together so that:
MyStruct = Struct.new(:field_1, :field_2) do
  def initialize(...)
    super
    @c = 1
  end
end
Could be laid out as:
| flags | klass | field_1 | field_2 | @c |
|---|---|---|---|---|
| T_STRUCT | 0xffeaa | 1 | 2 | 1 |
Which would be perfect. Everything would be embedded in the object slot, so we’d have minimal memory usage and access times.
Unfortunately, after putting some more thought into it, I realized there was a major problem with it: complex shapes. I previously wrote at length on what complex shapes are, so very quickly, in the Ruby VM, shapes aren’t garbage collected, so if some code generates a lot of different shapes, Ruby will deoptimize the object and use a hash table to store its instance variables. It also does the same if the program uses all the possible shape slots.
So if Struct members were encoded with shapes, we’d need to have many fallback code paths to handle complex structs, and for some of the struct APIs, that is outright impossible, because Struct objects can be treated like arrays:
>> Struct.new(:a, :b).new(1, 2)[1]
=> 2
In such a case, all we have is the member offset, so if the struct was deoptimized into a hash, we wouldn’t be able to look up members by index anymore, short of keeping a reverse index, but that’s really a lot of extra complexity. So I abandoned this idea.
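The array-like side of Struct is easy to see from Ruby. In this short sketch, Point and the @label variable are made-up names; it shows positional access next to an instance variable, which is not part of the members:

```ruby
Point = Struct.new(:x, :y)
point = Point.new(1, 2)

by_index = point[1]   # → 2, positional access, like an array
by_name = point[:x]   # → 1, access by member name
as_array = point.to_a # → [1, 2]

# Instance variables live alongside the members, but are not members.
point.instance_variable_set(:@label, "home")
members_after = point.to_a       # → [1, 2]
ivars = point.instance_variables # → [:@label]
```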
A few days later, I was brainstorming with Étienne Barrié, and we thought of a simpler solution. Instead of encoding struct members in shapes, we could introduce a new type of shape to encode at which offset the instance variables start.
As often mentioned, shapes are a tree, so for an object with variables @a -> @b -> @c -> @d, the shape tree would look like:
ROOT_SHAPE
  \- Ivar(name: :@a, index: 0, capacity: 3)
    \- Ivar(name: :@b, index: 1, capacity: 3)
      \- Ivar(name: :@c, index: 2, capacity: 3)
        \- Ivar(name: :@d, index: 3, capacity: 8)
With offset shapes, the same instance variable list, but for a struct with two members, would look like:
ROOT_SHAPE
  \- Offset(index: 1, capacity: 3)
    \- Ivar(name: :@a, index: 2, capacity: 3)
      \- Ivar(name: :@b, index: 3, capacity: 8)
        \- Ivar(name: :@c, index: 4, capacity: 8)
          \- Ivar(name: :@d, index: 5, capacity: 8)
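The two trees above can be mimicked with a few lines of Ruby. This is a toy model, not how the VM implements shapes: ToyShape, add_ivar and add_offset are made-up names, and capacity is ignored entirely.

```ruby
# Toy model of a shape tree; capacity is left out for simplicity.
class ToyShape
  attr_reader :parent, :name, :index

  def initialize(parent: nil, name: nil, index: -1)
    @parent = parent
    @name = name
    @index = index
    @children = {}
  end

  # Adding the same ivar from the same shape always returns the same child,
  # which is what lets many objects share a single shape.
  def add_ivar(name)
    @children[name] ||= ToyShape.new(parent: self, name: name, index: index + 1)
  end

  # Offset shape: reserves the first `member_count` indexes for struct members.
  def add_offset(member_count)
    @children[[:offset, member_count]] ||=
      ToyShape.new(parent: self, index: index + member_count)
  end

  # Walk up toward the root looking for the shape that introduced `name`.
  def find(name)
    shape = self
    shape = shape.parent while shape && shape.name != name
    shape
  end
end

root = ToyShape.new
plain = root.add_ivar(:@a).add_ivar(:@b)
struct_like = root.add_offset(2).add_ivar(:@a).add_ivar(:@b)

plain_b_index = plain.find(:@b).index        # → 1
struct_b_index = struct_like.find(:@b).index # → 3
shared = root.add_ivar(:@a).add_ivar(:@b).equal?(plain) # → true
```

With the offset shape in place, @b lands at index 3 instead of 1, leaving indexes 0 and 1 for the two struct members.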
Here again, we’d need to handle the case where the Ruby VM ran out of shapes, but at least only the instance variables would be deoptimized into a hash table, the struct members would still be laid out like an array, saving a ton of complexity.
That being said, while I still think this is a good idea, it’s a fairly big project with some uncertainties. So when I brought up this solution with Peter Zhu, he suggested something much simpler.
The annoying thing with generic instance variables isn’t so much that they aren’t embedded inside the object’s slot, but that to find the companion slot, you need to go through that global hash table.
Of course, if they were embedded, it would mean better data locality, which is good for performance, but that really isn’t much compared to the hash-lookup, so a single pointer chase would already be a major win.
Hence, Peter’s suggestion was to just use empty space in struct slots to keep a direct reference to the buffer that holds the instance variables, and since structs are basically fixed-size arrays, we can store that reference right after the last struct member.
In pseudo-code, it would be more or less:
class Struct
  def instance_variable_get(ivar)
    if __slot_capacity__ > size
      self[size].instance_variable_get(ivar)
    else
      # use the generic instance variables table
    end
  end
end
That’s essentially the same strategy as with classes and modules.
At least on paper, that was quite easy because a few weeks prior, I had refactored the generic instance variables to use the same underlying managed object as classes: T_IMEMO/fields.
Once again, I paired with Étienne Barrié to implement that idea, but the resulting PR was way larger and more complex than I had hoped for, because of a lack of encapsulation.
In many places across the VM, when dealing with instance variables, you have a similar big switch/case statement with a branch for each of the 3 or 4 possible types of object layouts.
So making T_STRUCT different would mean adding one more code path in all these places, which would leave me with a bad taste in my mouth.
That’s why I backtracked a bit and decided to start by refactoring the generic instance variables table, so that all accesses go through a very small number of functions. After that, all reads and writes to the table went through mostly just two functions, making it the perfect place to specialize the behavior for struct objects.
As a bit of a sidenote, the more I work on the Ruby VM, the more I realize the challenging part isn’t to come up with a brilliant idea, or a clever algorithm, but the sheer effort required to refactor code without breaking everything. The C language doesn’t have a lot of features for abstractions and encapsulation, so coupling is absolutely everywhere.
Anyways, with that refactoring done, I was able to re-implement the same pull request we did with Étienne, but half the size, most of it being just tests, documentation, and benchmarking code.
Now the generic instance variable lookup function looks like this:
VALUE
rb_obj_fields(VALUE obj, ID field_name)
{
    RUBY_ASSERT(!RB_TYPE_P(obj, T_IMEMO));
    ivar_ractor_check(obj, field_name);
    VALUE fields_obj = 0;
    if (rb_shape_obj_has_fields(obj)) {
        switch (BUILTIN_TYPE(obj)) {
          case T_STRUCT:
            if (LIKELY(!FL_TEST_RAW(obj, RSTRUCT_GEN_FIELDS))) {
                fields_obj = RSTRUCT_FIELDS_OBJ(obj);
                break;
            }
            // fall through
          default:
            RB_VM_LOCKING() {
                if (!st_lookup(generic_fields_tbl_, (st_data_t)obj, (st_data_t *)&fields_obj)) {
                    rb_bug("Object is missing entry in generic_fields_tbl");
                }
            }
        }
    }
    return fields_obj;
}
When dealing with a T_STRUCT and if there’s some unused space in the slot, we entirely bypass the generic_fields_tbl and RB_VM_LOCKING.
And to ensure we don’t end up in the fallback path too often, we modified the Struct allocator to allocate large enough slots for structs that have instance variables:
static VALUE
struct_alloc(VALUE klass)
{
    long n = num_members(klass);
    size_t embedded_size = offsetof(struct RStruct, as.ary) + (sizeof(VALUE) * n);
    if (RCLASS_MAX_IV_COUNT(klass) > 0) {
        embedded_size += sizeof(VALUE);
    }
    // snip...
}
As a result, instance variable accesses in structs are now noticeably faster, even when no ractor is involved:
compare-ruby: ruby 3.5.0dev (2025-08-06T12:50:36Z struct-ivar-fields-2 9a30d141a1) +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-08-06T12:57:59Z struct-ivar-fields-2 2ff3ec237f) +PRISM [arm64-darwin24]
warming up.....
| |compare-ruby|built-ruby|
|:---------------------|-----------:|---------:|
|member_reader | 590.317k| 579.246k|
| | 1.02x| -|
|member_writer | 543.963k| 527.104k|
| | 1.03x| -|
|member_reader_method | 213.540k| 213.004k|
| | 1.00x| -|
|member_writer_method | 192.657k| 191.491k|
| | 1.01x| -|
|ivar_reader | 403.993k| 569.915k|
| | -| 1.41x|
That was a satisfying change.
Now that we had a working pattern, the question was where else could we apply it.
I definitely knew instance variables on T_STRING are rather common, given I’m very familiar with ActiveSupport::SafeBuffer, so I thought about pulling a similar trick for them.
Unfortunately, what made this possible with T_STRUCT is that they are essentially fixed-size arrays, which means we know that whatever free space is left in the slot won’t ever be needed in the future.
Other types like T_STRING and T_ARRAY, on the other hand, are variable size.
If you start storing a reference in free space at the end of the slot, you then need to be very careful that if the user appends
to the string or array, it won’t overwrite that reference. That’s much harder to do and probably not worth the extra complexity.
But one of my favorite things with Ruby and Rails is to be able to optimize from both ends. If some pattern Rails uses isn’t very performant, I can try to optimize Ruby, but I can also just change what Rails does.
In the case of ActiveSupport::SafeBuffer, all we’re storing is just a boolean: @html_safe = true, and eventually, if something is appended to the buffer, the flag will be flipped.
But appends into safe buffers are very rare.
Most of the time, String#html_safe is only used as a way to tag the string, to indicate that it doesn’t need to be escaped when it’s later appended into another buffer. In other words, the overwhelming majority of instances never flip that flag.
Based on that knowledge, I changed that variable to be a negative.
Instead of starting with @html_safe = true, we can start with @html_unsafe = false, and since referencing an instance
variable that doesn’t exist evaluates to nil, which is also falsy, we can simply not set the variable at all.
The result made String#html_safe twice as fast, even when no Ractor is started:
ruby 3.5.0dev (2025-07-17T14:01:57Z master a46309d19a) +YJIT +PRISM [arm64-darwin24]
Calculating -------------------------------------
String#html_safe (old) 6.421M (± 1.6%) i/s (155.75 ns/i) - 32.241M in 5.022802s
String#html_safe 12.470M (± 0.8%) i/s (80.19 ns/i) - 63.140M in 5.063698s
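The inverted flag described above can be sketched as follows. This is a minimal model, not the actual Rails code: `TaggedBuffer` and `mark_unsafe!` are hypothetical names, and only the flag handling is reproduced.

```ruby
# Hypothetical stand-in for ActiveSupport::SafeBuffer; only the flag is modeled.
class TaggedBuffer < String
  # Safe unless the negative flag was explicitly set.
  # An instance variable that was never assigned reads as nil, which is falsy.
  def html_safe?
    !@html_unsafe
  end

  # Rare path: the only place where the instance variable gets created.
  def mark_unsafe!
    @html_unsafe = true
    self
  end
end

buf = TaggedBuffer.new("<b>hi</b>")
buf.html_safe?          # => true
buf.instance_variables  # => [] (the common case stores no instance variable)
buf.mark_unsafe!
buf.html_safe?          # => false
buf.instance_variables  # => [:@html_unsafe]
```

The point of the sketch is that the common case never creates an instance variable on the string at all, so it never needs the generic instance variables table.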
I guess this is a good example of mechanical sympathy [2]: the more you know about how the tools you are using work, the more effectively you can use them.
And now that I learned about ActiveSupport::Inflector::Inflections::Uncountables, I should probably change it in a similar way.
But the one other type that I thought was worth paying attention to was T_DATA.
Until just a few months ago, T_DATA slots were fully used; here’s the RTypedData C struct in Ruby 3.4,
I added some annotations with the size of each field:
struct RTypedData {
    /** The part that all ruby objects have in common. */
    struct RBasic basic; // 16B
    /**
     * This field stores various information about how Ruby should handle a
     * data. This roughly resembles a Ruby level class (apart from method
     * definition etc.)
     */
    const rb_data_type_t *const type; // 8B
    /**
     * This has to be always 1.
     *
     * @internal
     */
    const VALUE typed_flag; // 8B
    /** Pointer to the actual C level struct that you want to wrap. */
    void *data; // 8B
};
Just quickly, the first 16B was used for the common header all Ruby objects share, 8B was used to store a pointer
to another struct that gives information to Ruby on what to do with this object, for instance, how to garbage collect it.
And then two other 8B values, one pointing to arbitrary memory a C extension might have allocated, and then typed_flag.
If you read the comment associated with typed_flag, you may wonder what purpose it can possibly serve.
It’s there because RTypedData is the newer API for C extensions that was introduced in 2009 by Koichi Sasada.
Historically, when you needed to wrap a piece of native memory in a Ruby object, you’d use the RData API, and you had to
supply a mark function, a free function, and a pointer to the memory being wrapped.
That older, deprecated API is still there today, and you can see the struct that backs it up:
/**
 * @deprecated
 *
 * Old "untyped" user data. It has roughly the same usage as struct
 * ::RTypedData, but lacked several features such as support for compaction GC.
 * Use of this struct is not recommended any longer. If it is dead necessary,
 * please inform the core devs about your usage.
 *
 * @internal
 *
 * @shyouhei tried to add RBIMPL_ATTR_DEPRECATED for this type but that yielded
 * too many warnings in the core. Maybe we want to retry later... Just add
 * deprecated document for now.
 */
struct RData {
    /** Basic part, including flags and class. */
    struct RBasic basic;
    /**
     * This function is called when the object is experiencing GC marks. If it
     * contains references to other Ruby objects, you need to mark them also.
     * Otherwise GC will smash your data.
     *
     * @see rb_gc_mark()
     * @warning This is called during GC runs. Object allocations are
     * impossible at that moment (that is why GC runs).
     */
    RUBY_DATA_FUNC dmark;
    /**
     * This function is called when the object is no longer used. You need to
     * do whatever necessary to avoid memory leaks.
     *
     * @warning This is called during GC runs. Object allocations are
     * impossible at that moment (that is why GC runs).
     */
    RUBY_DATA_FUNC dfree;
    /** Pointer to the actual C level struct that you want to wrap. */
    void *data;
};
So in various places in the Ruby VM, when you interact with a T_DATA object, you need to know if it’s an RTypedData or an RData
before you can do much of anything with it.
That’s where typed_flag comes in. It’s at the same offset in the RTypedData struct as the dfree pointer in the RData struct, and for various reasons, it’s impossible for a legitimate C function pointer to be strictly equal to 1.
That’s why typed_flag is always 1: it allows us to check if a T_DATA is typed by checking rdata->dfree == 1.
Now you might wonder why I’m telling you all of this.
Well, it’s because that typed_flag field is using 8B of space to store exactly 1 bit of information, and that has bugged me for several years.
Even though truth be told, the comment is outdated, and the field can also sometimes be 3 as we piggy-backed on it with Peter Zhu last year to implement embedded TypedData objects.
But that’s still 32 times more than needed, so if someone could think of a better place to store these two bits, that would free an entire 8B to store a direct reference to the T_IMEMO/fields.
Well, it turns out that someone did earlier this year.
Just before the RubyKaigi developer meeting, Jeremy Evans proposed to turn Set into a core class, and to reimplement it in C, and that was accepted.
Later during the conference, he asked me to review his usage of the RTypedData API, and I suggested a bunch of improvements to make Set
objects smaller and reduce pointer chasing by leveraging embedded RTypedData objects.
But it turns out that there was a bit of an annoying tradeoff here. The RTypedData struct is 40B large, but when used embedded, we recycle the data pointer, so it’s only 32B large,
and the set_table struct Jeremy needed to store is 56B, for a total of 88B, which is a particularly annoying number.
Not because of the meaning some distasteful people attribute to it, but because it is just 8B too large to fit in a standard 80B GC slot, hence if we marked it as embedded, the footprint would grow from 40 + 56 = 96B to 160B with lots of wasted space.
In all honesty, it wasn’t a massive problem unless your application is using a massive amount of sets, but it seems that it really bothered Jeremy.
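The slot arithmetic above can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: the list of slot sizes is an assumption about a typical 64-bit build, `slot_for` is a hypothetical helper, and malloc bookkeeping overhead is ignored.

```ruby
# Assumed GC slot sizes in bytes for a 64-bit build (an assumption of this sketch).
SLOT_SIZES = [40, 80, 160, 320, 640].freeze
SET_TABLE_SIZE = 56 # size of struct set_table, as stated in the text

# Smallest slot able to hold `bytes` (hypothetical helper).
def slot_for(bytes)
  SLOT_SIZES.find { |size| size >= bytes } or
    raise ArgumentError, "#{bytes}B does not fit in any slot"
end

# Not embedded: a 40B slot plus a separately malloc'ed set_table.
separate = slot_for(40) + SET_TABLE_SIZE # => 96

# Embedded: the 32B header and the set_table share one slot.
embedded = slot_for(32 + SET_TABLE_SIZE) # => 160, because 88B misses the 80B slot by 8B

puts "separate: #{separate}B, embedded: #{embedded}B"
```

Shaving 8B anywhere in the embedded layout brings the total to exactly 80B, which is why a single pointer-sized field matters so much here.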
What he came up with a couple of weeks later was that he moved these two bits of memory into the low bits of RTypedData.type and RData.dmark,
freeing 8B per embedded TypedData object and allowing Set objects to fit in 80B.
Here again, the assumption was that because of alignment rules, the three lower bits of pointers can’t ever be set, so we can store our own information in there.
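The low-bit trick can be illustrated with plain integers. This is a toy model in Ruby rather than the VM’s C code: `pack_tagged` and `unpack_tagged` are hypothetical names, and the address is a made-up value that happens to be 8-byte aligned.

```ruby
ALIGNMENT = 8
TAG_MASK  = ALIGNMENT - 1 # 0b111: the bits that are always zero in an aligned address

# Store small flags in the unused low bits of an aligned address.
def pack_tagged(address, flags)
  raise ArgumentError, "address is not #{ALIGNMENT}-byte aligned" unless (address & TAG_MASK).zero?
  raise ArgumentError, "flags must fit in the low bits" unless (0..TAG_MASK).cover?(flags)
  address | flags
end

# Recover the original address and the flags from a tagged word.
def unpack_tagged(word)
  [word & ~TAG_MASK, word & TAG_MASK]
end

word = pack_tagged(0x7f00_dead_bee8, 0b11)
address, flags = unpack_tagged(word)
address == 0x7f00_dead_bee8 # => true
flags                       # => 3
```

Every read of the pointer now has to mask the tag off first, which is the price paid for the 8B that are saved.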
But now, I think this space could be put to better use to store a reference to a companion T_IMEMO/fields, so we could skip the global instance variables table.
The problem is that here again it’s a matter of tradeoff. We can waste some memory to save some CPU cycles; which is better is really just a judgment call.
Just like this issue bothered Jeremy a few months back, it now bothered me, and I went searching for a way to save another 8B in Set objects.
Hence, I started to stare at the struct set_table while furrowing my brow in the hope of spotting some redundant or superfluous member I could eliminate:
struct set_table {
    /* Cached features of the table -- see st.c for more details. */
    unsigned char entry_power, bin_power, size_ind;
    /* How many times the table was rebuilt. */
    unsigned int rebuilds_num;
    const struct st_hash_type *type;
    /* Number of entries currently in the table. */
    st_index_t num_entries;
    /* Array of bins used for access by keys. */
    st_index_t *bins;
    /* Start and bound index of entries in array entries.
       entries_starts and entries_bound are in interval
       [0,allocated_entries]. */
    st_index_t entries_start, entries_bound;
    /* Array of size 2^entry_power. */
    set_table_entry *entries;
};
I was first attracted to the trio of num_entries, entries_start, and entries_bound. All of these are 8B integers, so if I could eliminate just one of them, I’d be set [3].
Without being really intimate with the set implementation, I guessed that surely, if you know how many entries you have, you don’t need both the offset of the start and end of the entries list.
So in theory, I could just replace every reference to entries_bound by entries_start + num_entries.
What I do when I experiment with code I’m not fully familiar with, is that I try to prove my assumptions. Here I wrote a small helper function:
static inline st_index_t
set_entries_bound(const struct set_table *set)
{
    RUBY_ASSERT(set->entries_start + set->num_entries == set->entries_bound);
    return set->entries_bound;
}
And then went over the code to replace all the direct accesses to set->entries_bound by my helper, and tried to run the test suite to see if that RUBY_ASSERT would trip or not.
Well, turns out it wasn’t that simple… After seeing the test suite light up like a Christmas tree, I dug into the code,
helped by the backtraces in the crash reports, and realized that entries_bound doesn’t always match the entries’ size.
There is even a comment about it in the code:
/* Do not update entries_bound here. Otherwise, we can fill all
bins by deleted entry value before rebuilding the table. */
So that was a bust, and I went back to the drawing board.
After some more staring and brow furrowing, I got another idea.
Ruby’s hash-tables (Ruby sets are hash-sets) are ordered. Hence, you can see them as the combination of a regular unordered hash table and a classic array. The hash-table values are just offsets into that array.
Here, the hash-table part is the st_index_t *bins, and the array part is set_table_entry *entries.
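To make that layout concrete, here is a toy model of an ordered set built from those two parts. It is a sketch, not how set.c is written: `TinyOrderedSet` is a hypothetical class, and a Ruby Hash stands in for the unordered bins, used purely as an element-to-offset lookup.

```ruby
# Toy ordered set: an array keeps insertion order, a lookup table maps each
# element to its offset in that array.
class TinyOrderedSet
  def initialize
    @entries = [] # the "array part": elements in insertion order
    @bins = {}    # the "hash-table part": element => offset into @entries
  end

  def add(value)
    return self if @bins.key?(value) # already present, order is unchanged
    @bins[value] = @entries.size     # offset the new element will occupy
    @entries << value
    self
  end

  def include?(value)
    @bins.key?(value)
  end

  # Offset of the element in the entries array, or nil when absent.
  def offset_of(value)
    @bins[value]
  end

  def to_a
    @entries.dup
  end
end

set = TinyOrderedSet.new
set.add(:b).add(:a).add(:b)
set.to_a          # => [:b, :a]
set.offset_of(:a) # => 1
```

Lookups go through the bins, while iteration only walks the entries array, which is what preserves insertion order.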
Both of these are memory regions allocated with malloc, and they are grown and shrunk at the same time when you add or remove elements from the set.
Hence, if we can know how large one of them is, we could allocate both with a single malloc, and then access the other by simply skipping over the first one.
In this case, the size of set_table.bins is indicated by set_table.bin_power:
/* Return size of the allocated bins of table TAB. */
static inline st_index_t
set_bins_size(const set_table *tab)
{
    return features[tab->entry_power].bins_words * sizeof (st_index_t);
}
That’s how with a relatively small patch, I was able to save 8B from struct set_table,
which could allow us to keep Set objects in 80B slots even if we make embedded RTypedData 32B again.
However, I still need to run some benchmarks to make sure this patch wouldn’t degrade set performance significantly.
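The single-allocation idea can be modeled in Ruby with one backing array holding both regions. This is only an illustration of the addressing scheme: `PackedTable` is hypothetical, and placing the bins first is an assumption of this sketch rather than a description of the actual patch.

```ruby
# Toy model: one Array stands in for a single malloc'ed region that holds the
# bins first and the entries right after them.
class PackedTable
  def initialize(bins_size, entries_size)
    @bins_size = bins_size
    @entries_size = entries_size
    @buffer = Array.new(bins_size + entries_size) # one allocation for both regions
  end

  def bin(index)
    @buffer[check(index, @bins_size)]
  end

  def set_bin(index, value)
    @buffer[check(index, @bins_size)] = value
  end

  # Entries are reached by skipping over the bins region.
  def entry(index)
    @buffer[@bins_size + check(index, @entries_size)]
  end

  def set_entry(index, value)
    @buffer[@bins_size + check(index, @entries_size)] = value
  end

  private

  def check(index, limit)
    raise IndexError, "index #{index} out of range" unless (0...limit).cover?(index)
    index
  end
end

table = PackedTable.new(4, 2)
table.set_bin(3, 0)        # bin 3 points at entry 0
table.set_entry(0, :first)
table.entry(table.bin(3))  # => :first
```

Because the size of the first region can be recomputed at any time, only one pointer has to be stored, which is where the 8B saving comes from.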
For some remaining types like T_STRING, T_ARRAY, or T_HASH, it’s unlikely we’ll ever find space in their slots for an extra reference.
So I had another idea to speed up accesses and reduce contention.
The core of the assumption is that whenever we look up the instance variables of an object, there is a high chance that the next lookup will be for the same object.
So what if we kept a cache of the last object we looked up, and its associated T_IMEMO/fields?
In pseudo-ruby:
module GenericIvarObject
  GENERIC_FIELDS_TBL = Hash.new.compare_by_identity
  def instance_variable_get(ivar_name)
    if ivar_shape = self.shape.find(ivar_name)
      fields_obj = if Fiber[:__last_obj__] == self
        # cache hit: no lock, no table lookup
        Fiber[:__last_fields__]
      else
        # cache miss: remember both the object and its fields
        Fiber[:__last_obj__] = self
        Fiber[:__last_fields__] = RubyVM.synchronize do
          GENERIC_FIELDS_TBL[self]
        end
      end
      fields_obj.instance_variable_get(ivar_name)
    end
  end
end
Given that the cache is in fiber local storage, we don’t need to protect it with a lock.
I have a draft patch for that idea that I need to polish and benchmark, but I like that it’s quite simple.
Ultimately, for the remaining cases, it would be good if the Ruby VM had a proper concurrent-map implementation to allow lock-free lookups into the generic instance variables table. However, concurrent maps are hard, so it might not happen any time soon.
In the meantime, for the more important types like T_STRUCT and T_DATA, we now have solutions, either already merged or potentially soon to be, and for others, we have a way to reduce how often we look up the table.
And all that improves performance for both single-threaded and multi-ractor applications, so it’s a win-win.
My biggest concern with Ractors is that at some point we’d significantly impact single-threaded performance for the benefit of Ractors, so when we find optimizations that improve both use-cases, I’m particularly happy.
[1] You might think that since an object can’t be visible by more than one ractor unless it is frozen, then this isn’t a concern. But actually, since object_id is now essentially a memoized instance variable, it can happen.
[2] There was a pretty good talk on that subject at Euruko 2024.
[3] Pun intended.
The actual reason is that the gem has many APIs that I think aren’t very good, and some that are outright dangerous.
As a gem user, it’s easy to be annoyed at deprecations and breaking changes. It’s noisy and creates extra work, so I entirely understand that people may suffer from deprecation fatigue. But while you occasionally run into mostly cosmetic deprecations that aren’t really worth the churn they cause (and that annoys me a lot too), most of the time there’s a good reason for them. It just is very rarely conveyed to the users, and even more rarely discussed, so let’s do that for once.
So I’d like to go over some of the API changes and deprecations I already implemented or will likely implement soon, given it’s a good occasion to explain why the change is valuable, and to talk about API design more broadly.
But before I delve into deprecated API, I’d like to mention how to effectively deal with deprecations in modern Ruby.
Since Ruby 2.7, warning messages emitted with Kernel#warn are categorized, and one of the available categories is :deprecated.
By default, deprecation warnings are silenced; to display them, you must enable the :deprecated category like so:
Warning[:deprecated] = true
It is very highly recommended to do so in your test suite, so much so that Rails and Minitest will do it by default.
However, if you are using RSpec, you’ll have to do it yourself in your spec_helper.rb file, because we’ve tried to get
RSpec to do it too for over four years now, but without success.
But I’m still hopeful it will eventually happen.
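For library authors, emitting a categorized warning looks roughly like the sketch below. `old_api` and `new_api` are hypothetical method names, and the `category:` keyword of Kernel#warn is assumed to be available, which requires Ruby 3.0 or newer.

```ruby
Warning[:deprecated] = true

# Hypothetical library method, kept only for backward compatibility.
def old_api
  # uplevel: 1 makes the warning point at the caller rather than at this line.
  warn("old_api is deprecated, use new_api instead", category: :deprecated, uplevel: 1)
  new_api
end

# Hypothetical replacement.
def new_api
  :result
end

old_api # prints the deprecation warning on $stderr and returns :result
```

When `Warning[:deprecated]` is false, the same call prints nothing, which is why enabling the category in the test suite matters.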
Another useful thing to know about Ruby’s Kernel#warn method is that under the hood, it calls the Warning.warn method,
allowing you to redefine it and customize its behavior.
For instance, you could turn warnings into errors like this:
module Warning
  def warn(message, ...)
    raise message
  end
end
Doing so both ensures warnings aren’t missed, and helps track them down, as you’ll get an exception with a full backtrace rather than a warning that points at a single call-site that may not necessarily help you find the problem.
This is a pattern I use in most of my own projects, and that I also included into Rails’ own test suite.
For larger projects, where being deprecation-free all the time may be complicated, there’s also the more sophisticated deprecation_toolkit gem.
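A middle ground between raising on everything and adopting a dedicated gem is to raise only for the :deprecated category. The sketch below assumes Ruby 3.0 or newer, where Warning.warn receives a `category:` keyword; `DeprecationError` and `RaiseOnDeprecation` are hypothetical names.

```ruby
class DeprecationError < StandardError; end

module RaiseOnDeprecation
  def warn(message, category: nil, **kwargs)
    # Only deprecations become exceptions; the message arrives with a trailing newline.
    raise DeprecationError, message.chomp if category == :deprecated

    super # every other warning keeps the default behavior (printed on $stderr)
  end
end

# Extending, rather than reopening Warning, keeps the original method reachable via super.
Warning.extend(RaiseOnDeprecation)
Warning[:deprecated] = true

begin
  warn("foo is deprecated", category: :deprecated)
rescue DeprecationError => error
  puts "caught: #{error.message}"
end
```

Regular warnings are left untouched, so this can be enabled in a test suite without turning every unrelated warning into a failure.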
Now, let’s start with the API that convinced me to request maintainership.
Do you know the difference between JSON.load and JSON.parse?
There’s more than one, but the main difference is that JSON.load has a different set of options enabled by default, and notably
one that is a massive footgun: create_additions: true.
This option is so bad that Rubocop’s default set of rules bans JSON.load outright for security reasons,
and it has been involved in more than one security vulnerability.
Let’s dig into what it does:
require "json"
class Point
  class << self
    def json_create(data)
      new(data["x"], data["y"])
    end
  end
  def initialize(x, y)
    @x = x
    @y = y
  end
end
document = <<~'JSON'
  {
    "json_class": "Point",
    "x": 123.456,
    "y": 789.321
  }
JSON
p JSON.parse(document)
# => {"json_class" => "Point", "x" => 123.456, "y" => 789.321}
p JSON.load(document)
# => #<Point:0x00000001007f6d08 @x=123.456, @y=789.321>
So what the create_additions: true parsing option does is that when it notices an object with the special key "json_class",
it resolves the constant and calls .json_create on it with the object.
By itself, this isn’t really a security vulnerability, as only classes with a .json_create method can be instantiated this way.
But if you’ve been using Ruby for a long time, this may remind you of similar issues with gems like YAML where similar capabilities
were exploited.
That’s the problem with these sorts of duck-typed APIs: they are way too global.
You can have a piece of code using JSON.load that is perfectly safe on its own, but then if it’s embedded in an application
that also loads some other piece of code that defines some .json_create methods you weren’t expecting, you may end up with
an unforeseen vulnerability.
But even if you don’t define any json_create methods, the gem will always define one on String:
>> require "json"
>> JSON.load('{"json_class": "String", "raw": [112, 119, 110, 101, 100]}')
=> "pwned"
Here again, you probably need to find some specific circumstances to exploit that, but you can probably see how this trick can be used to bypass a validation check of some sort.
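If you have to read untrusted input with a version of the gem where the implicit behavior is still active, two options sidestep it. This is a sketch relying only on the documented JSON.parse default and on the `JSON.load(source, proc = nil, options = {})` signature; the `untrusted` variable name is illustrative.

```ruby
require "json"

untrusted = '{"json_class": "String", "raw": [112, 119, 110, 101, 100]}'

# JSON.parse does not enable create_additions by default, so the document
# comes back as a plain Hash.
parsed = JSON.parse(untrusted)

# JSON.load takes its options as third argument, so the feature can be
# switched off explicitly.
loaded = JSON.load(untrusted, nil, create_additions: false)

p parsed["json_class"] # => "String"
p loaded.class         # => Hash
```

In both cases the "json_class" key is treated as ordinary data and no constant is ever resolved.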
So what do I plan to do about it? Several things.
First, I deprecated the implicit create_additions: true option. If you use JSON.load for that feature, a deprecation
warning will be emitted, asking to use JSON.unsafe_load instead:
require "json"
Warning[:deprecated] = true
JSON.load('{"json_class": "String", "raw": [112, 119, 110, 101, 100]}')
# /tmp/j.rb:3: warning: JSON.load implicit support for `create_additions: true`
# is deprecated and will be removed in 3.0,
# use JSON.unsafe_load or explicitly pass `create_additions: true`
That being said, considering how wonky this feature is, I’m also considering extracting it into another gem.
This used to be impossible, as it was baked deep into both the C and the Java parsers, but I recently refactored it to be pure Ruby code using a callback exposed by the parsers.
Now you can provide a Proc to JSON.load, and the parser will invoke it for every parsed value, allowing you to substitute
one value for another:
cb = ->(obj) do
  case obj
  when String
    obj.upcase
  else
    obj
  end
end
p JSON.load('["a", {"b": 1}]', cb)
# => ["A", {"B" => 1}]
Prior to that change, JSON.load already accepted a Proc, but its return value was ignored.
The nice thing is that this callback also now serves as a much safer and flexible way to handle the serialization of rich objects. For instance, you could implement something like this:
types = {
  "range" => MyRangeType
}
cb = ->(obj) do
  case obj
  when Hash
    if type = types[obj["__type"]]
      type.load(obj)
    else
      obj
    end
  else
    obj
  end
end
While this requires more code from the user, it gives much tighter control over the deserialization,
but more importantly, it isn’t global anymore.
If a library uses this feature to deserialize trusted data, its callback is never going to be invoked by another library,
as is the case with the old Class#json_create API.
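To see the whole round trip in one runnable piece, here is a sketch that applies such a callback by hand after a plain JSON.parse. The manual walk is a stand-in for passing the Proc to JSON.load, so that the example does not depend on the installed gem version; `MyRangeType`’s body, the `revive` helper, and the "begin"/"end" keys are all assumptions made for illustration.

```ruby
require "json"

# Hypothetical rich type: knows how to rebuild itself from a parsed Hash.
class MyRangeType
  def self.load(hash)
    Range.new(hash.fetch("begin"), hash.fetch("end"))
  end
end

TYPES = { "range" => MyRangeType }.freeze

REVIVER = ->(obj) do
  if obj.is_a?(Hash) && (type = TYPES[obj["__type"]])
    type.load(obj)
  else
    obj
  end
end

# Applies the callback bottom-up to every parsed value (hypothetical helper).
def revive(value, callback)
  case value
  when Hash  then callback.call(value.transform_values { |v| revive(v, callback) })
  when Array then callback.call(value.map { |v| revive(v, callback) })
  else            callback.call(value)
  end
end

doc = '{"window": {"__type": "range", "begin": 1, "end": 5}, "tags": ["a"]}'
result = revive(JSON.parse(doc), REVIVER)
p result["window"] # => 1..5
```

The set of revivable types lives in a constant owned by the caller, so loading another library can never widen it.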
The obvious solution would have been to follow the same route as YAML, with its permitted_classes argument, but
in my opinion, it wouldn’t have addressed the root of the problem, and it makes for a very unpleasant API to use.
Instead, I believe this Proc interface provides the same functionality as before, but in a way that is both more flexible and safer.
I think this is a clear case for deprecation, given it is very rarely needed, has security implications, and surprises users.
Another behavior of the parser I recently deprecated is the treatment of duplicate keys. Consider the following code:
p JSON.parse('{"a": 1, "a": 2}')["a"]
What do you think it should return? You could argue that the first key or the last key should win, or that this should result in a parse error.
Unfortunately, JSON is a bit of a “post-specified” format, as in it started as an extremely simple document. All it says about “objects” is:
An object is an unordered set of name/value pairs. An object begins with { and ends with }.
Each name is followed by : and the name/value pairs are separated by , .
That’s it, that’s the extent of the specification. As you can see, there is no mention of what a parser should do if it encounters a duplicate key.
Later on, various standardisation bodies tried to specify JSON based on the implementations out there.
Hence, we now have IETF’s STD 90, also known as RFC 8259, which states:
Many implementations report the last name/value pair only. Other implementations report an error or fail to parse the object, and some implementations report all of the name/value pairs, including duplicates.
In other words, it acknowledges most implementations return the last seen pair, but doesn’t prescribe any particular behavior.
There’s also the ECMA-404 standard, which states:
The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs. These are all semantic considerations that may be defined by JSON processors or in specifications defining specific uses of JSON for data interchange.
Which is pretty much the specification language equivalent of: 🤷.
The problem with under-specified formats is that they can sometimes be exploited, the classic example being HTTP request smuggling.
And while it wasn’t an exploitation per se, a security issue happened to HackerOne, in part because of that behavior. Technically, the bug was on the JSON generation side, but if the json gem’s parser didn’t silently accept duplicated keys, they would have caught it early in development.
That’s why starting from version 2.13.0, JSON.parse now accepts a new allow_duplicate_key: keyword argument,
and if not explicitly allowed, a deprecation warning is emitted if a duplicate key is encountered:
require "json"
Warning[:deprecated] = true
p JSON.parse('{"a": 1, "a": 2}')
# => {"a" => 2}
# /tmp/j.rb:4: warning: detected duplicate key "a" in JSON object.
# This will raise an error in json 3.0 unless enabled via `allow_duplicate_key: true`
# at line 1 column 1
As mentioned in the warning message, I plan to change the default behavior to be an error in the next major version, but of course it will always be possible to explicitly allow for duplicate keys, for the rare cases where it’s needed.
Here again, I think this deprecation is justified because duplicated keys are rare, but also almost always a mistake, hence I expect few people to need to change anything, and the ones who do will likely learn about a previously unnoticed mistake in their application.
Before you gasp in horror, don’t worry, I don’t plan on deprecating the Object#to_json method, ever.
It is way too widespread for this to ever be acceptable.
But that doesn’t mean this API is good, nor that nothing should be done about it.
At the center of the json gem API, there’s the notion that objects can define themselves how they should be
serialized into JSON by responding to the to_json method.
At first sight, it seems like a perfectly fine API, it’s an interface that objects can implement, fairly classic object-oriented design.
Here’s an example that changes how Time objects are serialized.
By default, json will call #to_s on objects it doesn’t know how to handle:
>> puts JSON.generate({ created_at: Time.now })
{"created_at":"2025-08-02 13:03:32 +0200"}
But we can instruct it to instead serialize Time using the ISO8601 / RFC 3339
format:
class Time
  def to_json(...)
    iso8601(3).to_json(...)
  end
end
>> puts JSON.generate({ created_at: Time.now })
{"created_at":"2025-08-02T13:05:04.160+02:00"}
This seems all well and good, but the problem, like for the .json_create method, is that this is a global behavior.
An application may very well need to serialize dates in different ways in different contexts.
Worse, in the context of a library, say an API client that needs to serialize Time in a specific way, it’s not really
possible to use this API, you can’t assume it’s acceptable to change such a global behavior, given you know nothing about the application in which you’ll run.
So to me, there are two problems here. First, using #to_s as a fallback works for a few types, like date, but it is really not helpful
for the overwhelming majority of other objects:
>> puts JSON.generate(Object.new)
"#<Object:0x000000011ce214a0>"
I really can’t think of a situation in which this is the behavior that you want. If JSON.generate ends up calling to_s on an object, I’m willing to bet that 99% of the time, the developer didn’t intend for that object to be serialized, or forgot to implement a #to_json on it.
Either way, it would be way more useful to raise an error, and require that an explicit method to serialize that unknown object be provided.
The second is that it should be possible to customize a given type serialization locally, instead of globally.
In addition, returning a String as a JSON fragment is also not great, because it means recursively calling generators, and makes it possible to generate invalid documents:
class Broken
  def to_json
    to_s
  end
end
>> Broken.new.to_json
=> "#<Broken:0x0000000123054050>"
>> JSON.parse(Broken.new.to_json)
#> JSON::ParserError: unexpected character: '#<Broken:0x000000011c9377a0>'
# > at line 1 column 1
Those are the problems the new JSON::Coder API is meant to solve.
By default, JSON::Coder only serializes types that have a direct JSON equivalent, so Hash, Array, String / Symbol,
Integer, Float, true, false and nil. Any type that doesn’t have a direct JSON equivalent produces an error:
>> MY_JSON = JSON::Coder.new
>> MY_JSON.dump({a: 1})
=> "{\"a\":1}"
>> MY_JSON.dump({a: Time.new})
#> JSON::GeneratorError: Time not allowed in JSON
But it does allow you to provide a Proc to define the serialization of all other types:
MY_JSON = JSON::Coder.new do |obj|
  case obj
  when Time
    obj.iso8601(3)
  else
    obj # return `obj` to fail serialization
  end
end
>> MY_JSON.dump({a: Time.new})
=> "{\"a\":\"2025-08-02T14:03:15.091+02:00\"}"
Contrary to the #to_json method, here the Proc is expected to return a JSON primitive object, so you don’t have to
concern yourself with JSON escaping rules and such, which is much safer.
But if for some reason you do need to, you still can using JSON::Fragment:
MY_JSON = JSON::Coder.new do |obj|
  case obj
  when SomeRecord
    JSON::Fragment.new(obj.json_blob)
  else
    obj # return `obj` to fail serialization
  end
end
With this new API, it’s now much easier for a gem to customize JSON generation in a local way.
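The two properties discussed here, local customization and a hard failure on unknown types, can also be approximated by hand, which may help clarify what the coder does. The sketch below is not JSON::Coder and does not use it: `ApiClient::Serializer` and `to_primitive` are hypothetical names, and the conversion happens before a plain JSON.generate.

```ruby
require "json"
require "time" # provides Time#iso8601 on Ruby versions where it is not in core

module ApiClient
  module Serializer
    module_function

    # Converts a value into JSON primitives, or raises for unknown types.
    def to_primitive(obj)
      case obj
      when Hash  then obj.transform_values { |value| to_primitive(value) }
      when Array then obj.map { |value| to_primitive(value) }
      when String, Symbol, Integer, Float, true, false, nil then obj
      when Time  then obj.iso8601(3) # this library’s own choice, invisible to others
      else raise TypeError, "#{obj.class} not allowed in JSON"
      end
    end

    def dump(obj)
      JSON.generate(to_primitive(obj))
    end
  end
end

puts ApiClient::Serializer.dump({ at: Time.utc(2025, 8, 2, 12, 0, 0) })
# prints {"at":"2025-08-02T12:00:00.000Z"}
```

Nothing here touches Time#to_json, so the host application and other gems keep whatever serialization they chose for themselves.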
Now, as I said before, I absolutely don’t plan to deprecate #to_json, nor even the behavior that calls #to_s on unknown objects.
Even though I think it’s a bad API, and that its replacement is way superior, the #to_json method has been at the center of the json
gem from the beginning and would require a massive amount of work from the community to migrate out of.
The decision to deprecate an API should always weigh the benefits against the costs. Here, the cost is so massive that it is unimaginable for me to even consider it.
Another set of APIs I’ve marked as deprecated are the various _default_options accessors.
>> puts JSON.dump("http://example.com")
"http://example.com"
>> JSON.dump_default_options[:script_safe] = true
>> puts JSON.dump("http://example.com")
"http:\/\/example.com"
The concept is simple: you can globally change the default options received by certain methods.
At first sight, this might seem like a convenience: it allows you to set some option without having to pass it around at potentially dozens of different call sites.
But just like #to_json and other APIs, this change applies to the entire application, including some dependencies that may
not expect standard JSON methods to behave differently.
And that’s not a hypothetical, I personally ran into a gem that was using JSON to fingerprint some object graphs, e.g.
def fingerprint
  Digest::SHA1.hexdigest(JSON.dump(some_object_graph))
end
That fingerprinting method was well tested in the gem, and was working well in a few dozen applications until one
day someone reported a bug in the gem. After some investigation, I figured the host application in question
had modified JSON.dump_default_options, causing the fingerprints to be different.
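To make the failure mode concrete, here is a minimal sketch of how a global default can silently change a fingerprint. The fingerprint helper and the sample data are hypothetical, and I use the long-standing :space generator option purely for illustration, since I'm not stating which option the host application had actually changed:

```ruby
require "json"
require "digest"

# Hypothetical library helper: fingerprints a data structure via its JSON form.
def fingerprint(object_graph)
  Digest::SHA1.hexdigest(JSON.dump(object_graph))
end

graph = { "url" => "http://example.com", "tags" => ["a", "b"] }

before = fingerprint(graph)

# Somewhere else, the host application tweaks a global default.
# `:space` is only an illustrative choice of option.
had_space = JSON.dump_default_options.key?(:space)
previous = JSON.dump_default_options[:space]
JSON.dump_default_options[:space] = " "

begin
  # Same data, same library code, but the generated JSON now differs.
  after = fingerprint(graph)
ensure
  # Put the global state back the way we found it.
  if had_space
    JSON.dump_default_options[:space] = previous
  else
    JSON.dump_default_options.delete(:space)
  end
end

restored = fingerprint(graph)
puts "changed by global option: #{before != after}"
```

The library did nothing wrong here: its output changed only because some unrelated code mutated shared state.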
If you think about it, these sorts of global settings aren’t very different from monkey patching:
JSON.singleton_class.prepend(Module.new {
def dump(obj, proc = nil, opts = {})
opts = opts.merge(script_safe: true)
super
end
})
The overwhelming majority of Rubyists are very aware of the potential pitfalls of monkey patching, and some absolutely loathe it, yet, these sorts of global configuration APIs don’t get frowned upon as much for some reason.
In some cases, they make sense, e.g. if the configuration is for an application, or a framework (a framework essentially being an application skeleton), there’s not really a need for local configuration, and a global one is simpler and easier to reason about. But in a library, that may in turn be used by multiple other libraries with different configuration needs, they’re a problem.
Amusingly, this sort of API was one of the justifications for the currently experimental namespace feature in Ruby 3.5.0dev,
which shows the json gem is not the only one with this problem.
Here again, a better solution is the JSON::Coder API. If you want to centralize your JSON generation configuration across
your codebase, you can allocate a singleton with your desired options:
module MyLibrary
JSON_CODER = JSON::Coder.new(script_safe: true)
def do_things
JSON_CODER.dump(...)
end
end
As a library author, you can even allow your users to substitute the configuration for one of their choosing:
module MyLibrary
class << self
attr_accessor :json_coder
end
@json_coder = JSON::Coder.new(script_safe: true)
def do_things
MyLibrary.json_coder.dump(...)
end
end
Thankfully, from what I can see of the gem’s usage, these APIs were very rarely used, so while they’re not a major hindrance, I figured the cost vs benefit is positive. And if someone really needs to set an option globally, they can monkey-patch JSON: the effect is the same, and at least it’s more honest.
As mentioned previously, the decision to deprecate shouldn’t be taken lightly. It’s important to have empathy for the users who will have to deal with the fallout, and there are few things more annoying than cosmetic deprecations.
Yet it is also important to recognize when an API is error-prone or even outright dangerous, and deprecations are sometimes a necessary evil to correct course.
Also, as you probably noticed, a common theme in most of the APIs I don’t like in the json gem is global behavior and configuration.
I’m not certain why that is. A part of it might be that as Rubyists we value simplicity and conciseness, and that historically
the community has built its ethos as a reaction against overly verbose and ceremonial enterprise Java APIs, with their dependency injection frameworks and whatnot.
A bit of global state or behavior can sometimes bring a lot of simplicity, but it’s a very sharp tool that needs to be handled with extreme care.
But as I mentioned, this is unfortunately not yet viable because there are many known implementation bugs that can lead to interpreter crashes, and that while they are supposed to execute in parallel, the Ruby VM still has one true global lock that Ractors need to acquire to perform certain operations, making them often perform worse than the equivalent single-threaded code.
One of these remaining contention points is class instance variables and class variables. Given that it’s quite frequent for code to check a class or module instance variable as some sort of configuration, this contention point can have a very sizeable impact on Ractor performance. Let me show you with a simple benchmark:
module Mod
@a = @b = @c = 1
def self.compute(count)
count.times do
@a + @b + @c
end
end
end
ITERATIONS = 1_000_000
PARALLELISM = 8
if ARGV.first == "ractor"
ractors = PARALLELISM.times.map do
Ractor.new do
Mod.compute(ITERATIONS)
end
end
ractors.each(&:take)
else
Mod.compute(ITERATIONS * PARALLELISM)
end
This simplistic micro-benchmark just adds three module instance variables together repeatedly.
In one mode it does so serially in the main thread, and if the ractor argument is passed, it performs the same total number of iterations, but split across 8
parallel ractors.
Hence in a perfect world, using the Ractors branch should be close to 8 times faster.
However, if you run this benchmark on Ruby’s master branch, this isn’t the result you’ll get:
$ hyperfine -w 1 './miniruby --yjit ../test.rb' './miniruby --yjit ../test.rb ractor'
Benchmark 1: ./miniruby --yjit --disable-all ../test.rb
Time (mean ± σ): 252.4 ms ± 1.2 ms [User: 250.2 ms, System: 1.6 ms]
Range (min … max): 249.9 ms … 253.8 ms 11 runs
Benchmark 2: ./miniruby --yjit --disable-all ../test.rb ractor
Time (mean ± σ): 2.005 s ± 0.013 s [User: 2.098 s, System: 6.963 s]
Range (min … max): 1.992 s … 2.027 s 10 runs
Summary
./miniruby --yjit ../test.rb ran
7.94 ± 0.06 times faster than ./miniruby --yjit ../test.rb ractor
That’s right, instead of being 8 times faster, the branch that uses Ractors ended up being 8 times slower. This is because to read module or class instance variables, secondary ractors have to acquire the VM lock, which is a costly operation in itself, and worse, they end up waiting a lot to obtain the lock.
So what can we do about it?
Before we delve into how this lock could be removed or reduced, let’s review how class instance variables behave with ractors.
Given that classes are global, so are their instance variables. Because of this, Ractors can’t let you do everything with them, otherwise it would be a way to work around Ractor isolation.
The first rule is that only the main Ractor is allowed to set class instance variables:
class Test
class << self
attr_accessor :var
end
end
Test.var = 1 # works
Ractor.new do
# works
p Test.var
# raises Ractor::IsolationError: can not set instance variables
# of classes/modules by non-main Ractors
Test.var = 2
end.take
So secondary ractors can read instance variables on classes and modules, but can’t write them.
The second rule is that they can only read instance variables on classes if the object stored in that variable is shareable:
class Test
class << self
attr_accessor :var1, :var2
end
@var1 = {}.freeze
@var2 = {}
end
Ractor.new do
# works:
p Test.var1
# raises Ractor::IsolationError: can not get unshareable values from
# instance variables of classes/modules from non-main Ractors
p Test.var2
end.take
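If you want to check ahead of time whether a value could be read from a secondary ractor, a small sketch like the following can help. It only relies on Ractor.shareable? and Ractor.make_shareable (available since Ruby 3.0) and doesn't spawn any ractor; the variable names are made up for the example:

```ruby
frozen_empty   = {}.freeze
mutable_empty  = {}
# Frozen container, but the String inside is still mutable.
shallow_frozen = { name: String.new("config") }.freeze

checks = {
  frozen_empty: Ractor.shareable?(frozen_empty),
  mutable_empty: Ractor.shareable?(mutable_empty),
  shallow_frozen: Ractor.shareable?(shallow_frozen),
}

# Ractor.make_shareable freezes the whole object graph, not just the container.
deep_frozen = Ractor.make_shareable({ name: String.new("config") })
checks[:deep_frozen] = Ractor.shareable?(deep_frozen)

checks.each { |name, shareable| puts "#{name}: #{shareable}" }
```

Note that freezing only the outer Hash isn't enough: shareability is a property of the whole object graph.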
Usually when dealing with lock contention issues, the first solution is to turn one big lock into multiple finer-grained locks. In our simplistic benchmark, all ractors are accessing variables on the same module, so that wouldn’t help, but we could assume that in more realistic scenarios, they’d access the variables of many different modules and hence wouldn’t fight as much for the same one.
But the way I envision Ractors being used in real-world cases, at least initially, is for running small pieces of code in parallel, with an API approaching futures:
futures = []
futures << Ractor.new { fetch_and_compute_prices }
futures << Ractor.new { fetch_and_compute_order_history }
...
futures.map(&:take)
As such I actually expect Ractors to commonly access the same module or class variables over and over, so introducing more finely grained locks isn’t very enticing.
Another possibility would be to use a read-write lock: given that only the main ractor can “write” variables, all secondary ractors could acquire the read lock concurrently. But from previous experience, while read-write locks do allow concurrent read threads not to stall, they’re still quite costly when contended because all threads have to atomically increment and decrement the same value and that isn’t good for the CPU cache. It’s a fine solution when the operation you are protecting is a relatively slow one, but in our case, reading an instance variable is extremely cheap, so any kind of lock, even an uncontended one, will be disproportionately costly and ruin performance.
That’s why the only reasonable solution is to find a way to not use a lock at all.
To understand how we could make instance variables lock-free, we must first understand how they work. As is now tradition, I’ll try to explain it using Ruby pseudo code, starting with instance variable reads:
class Module
def instance_variable_get(variable_name)
if RubyVM.main_ractor?
# The main ractor is the only one allowed to write instance variables
# hence it doesn't need to lock because we know no one else could be
# concurrently modifying `@shape` or `@fields`
if field_index = @shape.field_index_for(variable_name)
@fields[field_index]
end
else
# Secondary ractors must lock the VM even for reads because the main Ractor
# could be modifying `@shape` or `@fields` concurrently.
RubyVM.synchronize do
if field_index = @shape.field_index_for(variable_name)
value = @fields[field_index]
raise Ractor::IsolationError unless Ractor.shareable?(value)
value
end
end
end
end
end
I’m not going to explain how shapes work here, as I already explained it in multiple previous posts. The only thing you really need to know is that instance variables are stored in a contiguous array, and shapes keep track of the offset at which each variable is stored. They also are immutable, so you can query them concurrently.
As a result, reading an instance variable only amounts to querying the shape tree to figure out if that particular variable exists,
and if it does, what its index is. After that, we read the variable at the specified offset in the @fields array of the
object.
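For readers who haven't seen the previous posts, here is a deliberately tiny model of that lookup. ToyShape and ToyObject are hypothetical names and this is not how CRuby implements shapes (the real tree is immutable to readers, this toy mutates a transition cache); it only illustrates that a shape maps a variable name to an offset in the fields array:

```ruby
# Toy model of a shape tree; hypothetical names, not CRuby internals.
class ToyShape
  attr_reader :parent, :name, :index

  def initialize(parent = nil, name = nil)
    @parent = parent
    @name = name
    # The root shape holds no variable; each child records the slot of the
    # variable it adds.
    @index = parent ? parent.fields_count : nil
    @children = {}
  end

  def fields_count
    @index ? @index + 1 : 0
  end

  # Transitions are cached, so objects that define the same variables in the
  # same order end up sharing the same shape.
  def add_instance_variable(name)
    @children[name] ||= ToyShape.new(self, name)
  end

  # Walk up the tree to find the offset at which a variable is stored.
  def field_index_for(name)
    shape = self
    while shape
      return shape.index if shape.name == name
      shape = shape.parent
    end
    nil
  end
end

ROOT_SHAPE = ToyShape.new

class ToyObject
  attr_reader :shape

  def initialize
    @shape = ROOT_SHAPE
    @fields = []
  end

  def ivar_get(name)
    index = @shape.field_index_for(name)
    @fields[index] if index
  end

  def ivar_set(name, value)
    index = @shape.field_index_for(name)
    unless index
      # Unknown variable: transition to a child shape that has a slot for it.
      @shape = @shape.add_instance_variable(name)
      index = @shape.index
    end
    @fields[index] = value
  end
end

a = ToyObject.new
a.ivar_set(:@x, 1)
a.ivar_set(:@y, 2)

b = ToyObject.new
b.ivar_set(:@x, 10)
b.ivar_set(:@y, 20)

puts a.shape.equal?(b.shape)
```

Both objects define @x then @y, so they share one shape while each keeps its own fields array.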
However, on secondary Ractors, we additionally need to lock the VM to ensure the shape and the fields are consistent, but that will be clearer once I explain how writing instance variables works.
class Module
def instance_variable_set(variable_name, value)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
RubyVM.synchronize do
if field_index = @shape.field_index_for(variable_name)
# The variable already exists, we replace its value
@fields[field_index] = value
else
# The variable doesn't exist, we have to make a shape transition
next_shape = @shape.add_instance_variable(variable_name)
if next_shape.capacity > @shape.capacity
# @fields is full, we need to allocate a larger one
new_fields = Memory.allocate(size: next_shape.capacity)
new_fields.replace(@fields) # copy content
@fields, old_fields = new_fields, @fields
# The fields array is manually managed memory, so it needs to be freed explicitly
Memory.free(old_fields)
end
@fields[next_shape.field_index] = value
@shape = next_shape
end
end
end
end
As you can see, the fields array has a given size, so if we’re adding a new instance variable, we may need to allocate a larger one and swap the two, as well as change the object’s shape.
That is why we need to lock the VM. We can’t let another ractor read an instance variable while we’re doing this, because it would run into all sorts of race conditions:
1. It could read old_fields while we’re freeing it, causing a use-after-free bug.
2. It could read old_fields using the new shape, causing an out-of-bounds read.
3. It could read new_fields using the new shape, but before we’ve written the new value, causing an uninitialized memory read.
Now, if you are not familiar with C, or another low-level programming language, you might be thinking that I’m exaggerating. After all, updating the shape is the last operation, so surely cases 2 and 3 aren’t possible.
Well, I got some bad news…
Multithreaded programming is tricky, but even more so when allowing multiple threads to read and write the same memory, because processors have all sorts of caches, hence a variable doesn’t only reside in one place in your RAM.
It can also be copied in the CPU L1/L2/etc caches, or even in the CPU registers. When one thread writes into a variable, it’s not immediately visible to all other threads: the write will take a while to propagate back to the RAM. Worse, if you write into multiple variables in a specific order, it’s not even guaranteed other threads will witness these changes in the same order.
Let’s consider a simple multi-threaded program:
Point = Struct.new(:x, :y)
treasure = nil
thread = Thread.new do
while true
if treasure
puts "Treasure is at #{treasure.x.inspect} / #{treasure.y.inspect}"
break
end
end
end
point = Point.new
point.x = 12
point.y = 24
treasure = point
thread.join
As a Ruby programmer, you likely expect this program to print Treasure is at 12 / 24, and you’d be correct.
After all, we fully initialize the Point instance before updating the shared treasure variable to point to it.
But if we were to write a similar program in C, the output could be any of:
Treasure is at 12 / 24
Treasure is at nil / 24
Treasure is at 12 / nil
Treasure is at nil / nil
Why? Well, this has to do with memory models. In order to optimize your code, compilers sometimes may have to change the order of memory reads and writes. So for programmers to be able to write correct programs, they need to know what the compiler can and cannot do, and that’s what a language memory model defines. In the case of C, the memory model is very lax, and compilers are allowed to reorder reads and writes very extensively.
And it’s not only about the compilers. CPUs too can reorder read and write operations.
The x86 (AKA Intel) memory model is quite strict, so it doesn’t reorder much, but the arm64 memory model is much more lax,
so even if your compiler generated the native code in the same order, your CPU could execute them out of order,
giving you unpredictable results.
To work around this problem, C compilers and CPUs provide “barriers”. You can insert them in your code to enforce that reads and writes can’t be reordered across such barriers, allowing you to ensure that all threads will observe memory in a consistent way.
From a programmer’s perspective, it’s generally exposed as “atomic” read and write operations, and it’s understood by the compiler and CPU that memory operations cannot be reordered across atomic operations.
So going back to our instance_variable_set implementation, we can fix two of the three race conditions by using an atomic
write:
class Module
def instance_variable_set(variable_name, value)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
if field_index = @shape.field_index_for(variable_name)
# The variable already exists, we replace its value
@fields[field_index] = value
else
# The variable doesn't exist, we have to make a shape transition
next_shape = @shape.add_instance_variable(variable_name)
if next_shape.capacity > @shape.capacity
# @fields is full, we need to allocate a larger one
new_fields = Memory.allocate(size: next_shape.capacity)
new_fields.replace(@fields) # copy content
old_fields = @fields
# Ensure `@fields` isn't updated before its content has been filled
Atomic.write { @fields = new_fields }
# The fields array is manually managed memory, so it needs to be freed explicitly
Memory.free(old_fields)
end
@fields[next_shape.field_index] = value
@shape = next_shape
end
end
end
With this simple change, we now guarantee that the new @fields will be visible to other threads before the new @shape is.
They may still see the old @shape with the new @fields, but that’s acceptable because all the offsets @shape may point to
contain the same values. Pretty neat. Now we only need to find a solution for the use-after-free problem.
So our problem is that after we swap the old @fields array for the new one, we must free the old array to not leak memory.
But if there is no synchronization, we can’t guarantee that another thread doesn’t have a reference to the old array in its
registers or caches, so it may try to read from it after it was freed, and that might lead to a segmentation fault.
Hence, we must wait until there’s no longer any reference to the old array before freeing it, and if you think about it that’s exactly what a garbage collector does, and lucky for us, Ruby already has one.
So the solution to avoid use-after-free is to use an actual Ruby Array instead of manually allocated memory.
This way we no longer have to free it explicitly; the garbage collector will take care of it later:
class Module
def instance_variable_set(variable_name, value)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
if field_index = @shape.field_index_for(variable_name)
# The variable already exists, we replace its value
@fields[field_index] = value
else
# The variable doesn't exist, we have to make a shape transition
next_shape = @shape.add_instance_variable(variable_name)
if next_shape.capacity > @shape.capacity
# @fields is full, we need to allocate a larger one
new_fields = Array.new(next_shape.capacity)
new_fields.replace(@fields) # copy content
old_fields = @fields
# Ensure `@fields` isn't updated before its content has been filled
Atomic.write { @fields = new_fields }
end
@fields[next_shape.field_index] = value
@shape = next_shape
end
end
end
Now, if another thread is currently reading inside the old @fields, it doesn’t matter because it will remain valid
memory until the garbage collector notices it’s no longer referenced by anyone.
And just like that, we now have fully lock-free class instance variable reads and writes!
Well… no. Because we overlooked two complications.
Perhaps you don’t know about it, because it’s quite a rare thing to do, but in Ruby, you can remove an object’s instance variables:
class Test
p instance_variable_defined?(:@foo) # => false
@foo = 1
p instance_variable_defined?(:@foo) # => true
remove_instance_variable(:@foo)
p instance_variable_defined?(:@foo) # => false
end
And while this is an extremely rare operation, it can happen, hence we must handle it in a thread safe way.
Let’s look at its pseudo-implementation:
class Module
def remove_instance_variable(variable_name)
removed_index = @shape.field_index_for(variable_name)
# The variable didn't exist in the first place
return unless removed_index
next_shape = @shape.remove_instance_variable(variable_name)
# Shift fields left
removed_index.upto(next_shape.fields_count) do |index|
@fields[index] = @fields[index + 1]
end
@shape = next_shape
end
end
So when removing an instance variable, we get a new shape that is shorter than the previous one, which means that all the variables stored after the one we removed now have a lower index, so we need to shift all the fields.
To better illustrate, consider the following code:
@a = 1
@b = 2
@c = 3
remove_instance_variable(:@b)
In the snippet above, @fields will change from [1, 2, 3] to [1, 3], and it’s not really possible to do this in a thread-safe way.
We could, of course, do this shifting in a copy of @fields, and then swap @fields atomically, but one major problem would remain: the old shape and the new
shape are fundamentally incompatible.
If you are accessing @c using the old fields with the new shape, you will get 2 which is incorrect.
If you are accessing @c using new fields with the old shape, you will get whatever is outside the array, or perhaps a segmentation fault.
So in this case, we can’t rely on clever ordering of writes to keep a consistent view of the instance variables for all ractors.
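The two mismatches can be reproduced with plain Hashes and Arrays standing in for shapes and fields. This is only a simulation of the bookkeeping, with made-up variable names; in the real VM the last case would be an out-of-bounds read rather than a harmless nil:

```ruby
# Shapes are modeled as name => offset tables, fields as plain Arrays.
old_shape  = { :@a => 0, :@b => 1, :@c => 2 }
new_shape  = { :@a => 0, :@c => 1 }
old_fields = [1, 2, 3]
new_fields = [1, 3]

# Consistent pairs both find the right value for @c.
consistent_before = old_fields[old_shape[:@c]]
consistent_after  = new_fields[new_shape[:@c]]

# Old fields with the new shape: we read the slot that used to hold @b.
old_fields_new_shape = old_fields[new_shape[:@c]]

# New fields with the old shape: the offset is past the end of the Array.
# Ruby returns nil here; C would read whatever memory happens to be there.
new_fields_old_shape = new_fields[old_shape[:@c]]

p [consistent_before, consistent_after, old_fields_new_shape, new_fields_old_shape]
```

Whichever of the two gets published first, a reader can catch the object in between, and no ordering of the writes makes both intermediate states safe.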
As an aside, this isn’t how the initial implementation of object shapes in Ruby worked.
Early in Ruby 3.2 development, #remove_instance_variable wouldn’t produce a shorter shape, but instead
a child shape of type UNDEF that would record that the variable at offset 1 needs to be considered not defined.
However, it was found that this could cause an unbounded number of shapes to be created by misbehaving code:
obj = Object.new
loop do
obj.instance_variable_set(:@foo, 1)
obj.remove_instance_variable(:@foo)
end
So instead the implementation was changed to rebuild the shape tree.
That previous implementation would have been useful in this case, as it would have prevented this race condition. But ultimately it doesn’t matter, because there is another complication I didn’t mention.
The other major complication I deliberately overlooked in my explanation thus far, is the existence of complex shapes.
Since shapes are append-only, Ruby code that defines instance variables in random order or often removes instance variables can potentially generate an infinite combination of shapes, and each shape uses some amount of memory.
That’s why Ruby keeps track of how many shape variations a given class causes, and after a specific threshold (currently 8), Ruby gives up and marks the class as “too complex”.
If you run this script on a recent Ruby, you will see a performance warning:
Warning[:performance] = true
class TooComplex
def initialize
10.times do |i|
instance_variable_set("@iv_#{i}", i)
remove_instance_variable("@iv_#{i}")
end
end
end
TooComplex.new
/tmp/complex.rb:6: warning: The class TooComplex reached 8 shape variations,
instance variables accesses will be slower and memory usage increased.
It is recommended to define instance variables in a consistent order,
for instance by eagerly defining them all in the #initialize method.
When this happens, any operation on an instance of that class that would result in a new shape being created instead results in some sort of “singleton” shape, known as the complex shape, and in that case instance variables are stored in a Hash instead of being stored in an array. It’s slower and uses more memory, but limits the creation of new shapes.
So the real #instance_variable_get and #instance_variable_set implementations are more complicated than what I described at the start of the post.
In reality, they look more like this:
class Module
def instance_variable_get(variable_name)
if @shape.too_complex?
@fields[variable_name] # @fields is a Hash
elsif field_index = @shape.field_index_for(variable_name)
@fields[field_index] # @fields is an Array
end
end
def instance_variable_set(variable_name, value)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
if @shape.too_complex?
return @fields[variable_name] = value
end
if field_index = @shape.field_index_for(variable_name)
# The variable already exists, we replace its value
@fields[field_index] = value
else
# The variable doesn't exist, we have to make a shape transition
next_shape = @shape.add_instance_variable(variable_name)
if next_shape.too_complex?
new_fields = {}
@shape.each_ancestor do |shape|
new_fields[shape.variable_name] = @fields[shape.field_index]
end
@fields = new_fields
@shape = next_shape
return @fields[variable_name] = value
end
if next_shape.capacity > @shape.capacity
# @fields is full, we need to allocate a larger one
new_fields = Array.new(next_shape.capacity)
new_fields.replace(@fields) # copy content
old_fields = @fields
# Ensure `@fields` isn't updated before its content has been filled
Atomic.write { @fields = new_fields }
end
@fields[next_shape.field_index] = value
@shape = next_shape
end
end
end
And this code is now riddled with race conditions because regular and complex shapes are radically different.
Even in the happy path case where we’re adding a new instance variable, we might turn @fields from an Array into
a Hash.
So if @shape and @fields aren’t perfectly synchronized together, we might end up trying to access a Hash
like an Array, and vice-versa, which will likely end up in a VM crash.
One solution could have been to ensure @shape and @fields are written atomically together, but unfortunately in this case
it isn’t really possible.
First, because it would require writing two pointer-sized (64-bit) values in a single atomic operation, which is possible on some modern CPUs using SIMD instructions, but Ruby supports many different platforms, and there is no way all of them would have support for it.
And second, because the constraint with this is that both fields need to be contiguous.
You can’t atomically write two pointer-sized values that are distant from each other.
Semantically you are treating two contiguous 64-bit values as a single 128-bit one, and for reasons I won’t get into here,
@shape and @fields can’t be made contiguous.
That’s where it came to me that we could instead bundle the @shape and @fields in their own GC-managed object,
so that when we have to update both atomically, we can work on a copy and then swap the pointer:
class Module
def instance_variable_get(variable_name)
@fields_object&.instance_variable_get(variable_name)
end
def instance_variable_set(variable_name, value)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
new_fields_object = @fields_object ? @fields_object.dup : Object.new
new_fields_object.instance_variable_set(variable_name, value)
Atomic.write { @fields_object = new_fields_object }
end
def remove_instance_variable(variable_name)
raise FrozenError if frozen?
# The main ractor is the only one allowed to write instance variables
raise Ractor::IsolationError unless RubyVM.main_ractor?
new_fields_object = @fields_object ? @fields_object.dup : Object.new
new_fields_object.remove_instance_variable(variable_name)
Atomic.write { @fields_object = new_fields_object }
end
end
It really is that trivial. Instead of storing instance variables in the class or module, we store them in a regular Object,
and on mutation, we first clone the current state, do our unsafe mutation, and finally atomically swap the @fields_object reference.
Of course, doing it exactly like this would cause a huge increase in object allocation, so in the actual code I added lots of special cases to directly mutate the existing object rather than to copy it when it is safe to do so, but conceptually this is exactly what my current patch is doing.
That patch is mostly a proof of concept. In the end, I don’t think we should use an actual T_OBJECT for various reasons,
but I already have a follow-up patch that replaces it with a T_IMEMO, which is an internal type invisible to Ruby users.
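The copy-then-swap idea can be tried out in plain Ruby. In this sketch, CowVariables is a hypothetical class, a frozen Hash stands in for the bundled shape and fields, and an ordinary assignment stands in for the atomic pointer write; it is an illustration of the pattern, not the actual patch:

```ruby
# Copy-on-write container: every mutation builds a new frozen snapshot and
# publishes it with a single reference assignment.
class CowVariables
  def initialize
    @snapshot = {}.freeze
  end

  # Readers grab the current snapshot once, then only read from it.
  attr_reader :snapshot

  def get(name)
    @snapshot[name]
  end

  def set(name, value)
    copy = @snapshot.dup    # dup of a frozen Hash is mutable again
    copy[name] = value      # unsafe mutation, but nobody else can see `copy`
    @snapshot = copy.freeze # publish: one reference swap
  end

  def remove(name)
    copy = @snapshot.dup
    copy.delete(name)
    @snapshot = copy.freeze
  end
end

vars = CowVariables.new
vars.set(:@a, 1)
vars.set(:@b, 2)

# A reader that fetched the snapshot before the next writes...
seen_by_reader = vars.snapshot

vars.remove(:@a)
vars.set(:@c, 3)

# ...still holds a complete, consistent view of the old state.
p seen_by_reader
p vars.snapshot
```

The old snapshot stays valid for as long as someone references it, which is exactly the job we delegate to the garbage collector.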
With this solution I was able to remove the locks around class instance variables, and now the ractor version of the micro-benchmark runs almost 3 times faster than the single-threaded version:
$ hyperfine -w 1 './miniruby --yjit ../test.rb' './miniruby --yjit ../test.rb ractor'
Benchmark 1: ./miniruby --yjit ../test.rb
Time (mean ± σ): 166.3 ms ± 1.1 ms [User: 164.4 ms, System: 1.5 ms]
Range (min … max): 164.0 ms … 168.5 ms 18 runs
Benchmark 2: ./miniruby --yjit ../test.rb ractor
Time (mean ± σ): 59.3 ms ± 2.6 ms [User: 211.4 ms, System: 1.5 ms]
Range (min … max): 57.9 ms … 67.7 ms 48 runs
Summary
./miniruby --yjit ../test.rb ractor ran
2.80 ± 0.12 times faster than ./miniruby --yjit ../test.rb
That’s still far from the 8 times faster you might expect, but profiling indicates that it’s now a scheduling problem, which we’ll eventually fix too, and it’s still over 13 times faster than on Ruby 3.4:
$ hyperfine -w 1 'ruby --disable-all --yjit ../test.rb ractor' './ruby --disable-all --yjit ../test.rb ractor'
Benchmark 1: ruby --disable-all --yjit ../test.rb ractor
Time (mean ± σ): 772.3 ms ± 9.0 ms [User: 1023.8 ms, System: 1325.6 ms]
Range (min … max): 759.3 ms … 790.5 ms 10 runs
Benchmark 2: ./ruby --disable-all --yjit ../test.rb ractor
Time (mean ± σ): 56.8 ms ± 1.4 ms [User: 205.7 ms, System: 1.6 ms]
Range (min … max): 55.8 ms … 65.6 ms 50 runs
Summary
./ruby --disable-all --yjit ../test.rb ractor ran
13.59 ± 0.36 times faster than ruby --disable-all --yjit ../test.rb ractor
Hopefully, I’ll get this merged in the next couple of weeks.
You may be thinking that this is all well and good, but that using another object to store the instance variables of classes and modules will increase Ruby’s memory usage.
Well, probably not. Previously the @fields memory was managed by malloc, and while it depends on which implementation
of malloc you are using, most of them will have an overhead of 16B per allocated pointer, which is exactly the overhead
of a Ruby object.
So overall it shouldn’t cause memory usage to increase.
This solution has another incidental benefit, which is that it fixes both a bug and a performance regression recently introduced when the new Namespace feature was merged.
Under namespaces, core classes are supposed to have a different set of instance variables, and frozen status, in each namespace, but this doesn’t work well at all with shapes because right now the shape is stored in the object header, hence all objects including classes and modules, only have a single shape.
By delegating instance variable management to another object, classes can now have one @fields_object per namespace,
encompassing both the shape and the fields, hence properly namespace class instance variables.
It wasn’t at all a motivation for this change, but it’s a nice side effect.
But as I mentioned, this is unfortunately not yet viable because there are many known implementation bugs that can lead to interpreter crashes, and that while they are supposed to execute in parallel, the Ruby VM still has one true global lock that Ractors need to acquire to perform certain operations, making them often perform worse than the equivalent single-threaded code.
But things are evolving rapidly. Since then, there is now a team of people working on fixing exactly that: tackling known bugs and eliminating or reducing the remaining contention points.
The one example I gave to illustrate this remaining contention, was the fstring_table, which in short is a big internal
hash table used to deduplicate strings, which Ruby does whenever you use a String as a key in a Hash.
Because looking into that table while another Ractor is inserting a new entry would result in a crash (or worse),
until last week Ruby had to acquire the remaining VM lock whenever it touched that table.
But John Hawthorn recently replaced it with a lock-free Hash-Set, and now this contention point is gone. If you re-run the JSON benchmarks from the previous post using the latest Ruby master, the Ractor version is now twice as fast as the single-threaded version, instead of being 3 times slower.
This still isn’t perfect though, as the benchmark uses 5 ractors, hence in an ideal world should be almost 5 times faster than the single-threaded example, so we still have a lot of work to do to eliminate or reduce the remaining contention points.
One such remaining contention point, which you likely didn’t suspect would be one, is
the #object_id method.
And on my way back from RubyKaigi, I started working on tackling it.
But before we delve into what I plan to do about it, let’s talk about how this method came to be a contention point.
Up until Ruby 2.6, the #object_id implementation used to be quite trivial:
VALUE
rb_obj_id(VALUE obj)
{
if (STATIC_SYM_P(obj)) {
return (SYM2ID(obj) * sizeof(RVALUE) + (4 << 2)) | FIXNUM_FLAG;
}
else if (FLONUM_P(obj)) {
return LL2NUM((SIGNED_VALUE)obj);
}
else if (SPECIAL_CONST_P(obj)) {
return LONG2NUM((SIGNED_VALUE)obj);
}
return LL2NUM((SIGNED_VALUE)(obj) / 2);
}
Of course, it’s C so it might be a bit cryptic to the uninitiated, but in short, for the common case of a heap allocated
object, its object_id would be the address where the object is stored, divided by two.
So in a way, #object_id used to return you an actual pointer to the object.
This made implementing the lesser-known counterpart of #object_id, ObjectSpace._id2ref,
just as trivial: multiply the object_id by two, and there you go, you now have a pointer to the corresponding object.
s = "I am a string"
ObjectSpace._id2ref(s.object_id).equal?(s) # => true
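The round trip is simple enough to check by hand. The address below is made up for illustration; the only property that matters is that heap slots are aligned, so a real object address is always even and halving it loses no information:

```ruby
# A made-up, slot-aligned heap address, for illustration only.
address = 0x0000_7f8a_1c04_2a48

# What the old #object_id did for heap-allocated objects.
legacy_object_id = address / 2

# What the old _id2ref did to get the pointer back.
recovered_address = legacy_object_id * 2

puts format("%#x -> %d -> %#x", address, legacy_object_id, recovered_address)
```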
But there was actually a major problem with that implementation, which is that the Ruby heap is composed of standard-size slots. When an object is no longer referenced, the GC reclaims the object slot and will most likely re-use it for a future object.
Hence if you were to hold onto an object_id, and use ObjectSpace._id2ref, it’s not actually certain the object you get
back is the one you got the object_id from, it might be a totally different object.
It also meant that if you are holding onto an object_id as a way to know if you’ve already seen a given object,
you may run into some false positives.
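To make the slot reuse problem concrete, here is a toy model of address-derived IDs. ToyHeap, its slot size and its base address are made up for this sketch and don't mirror CRuby's actual heap; only the "divide by two, multiply by two" rule comes from the C code above.

```ruby
# Toy model of address-derived IDs. ToyHeap, SLOT_SIZE and BASE_ADDRESS are
# hypothetical and only exist to illustrate slot reuse.
class ToyHeap
  SLOT_SIZE = 40
  BASE_ADDRESS = 0x1000

  def initialize(slot_count)
    @slots = Array.new(slot_count)
  end

  # Store `object` in the first free slot and return its fake address.
  def allocate(object)
    index = @slots.index(nil) or raise "heap is full"
    @slots[index] = object
    BASE_ADDRESS + index * SLOT_SIZE
  end

  # What the GC does once nothing references the object anymore.
  def free(address)
    @slots[(address - BASE_ADDRESS) / SLOT_SIZE] = nil
  end

  # Pre-2.7 style: the ID is just the address divided by two...
  def object_id_for(address)
    address / 2
  end

  # ...so going back from an ID to an object is a multiplication.
  def id2ref(id)
    @slots[(id * 2 - BASE_ADDRESS) / SLOT_SIZE]
  end
end

heap = ToyHeap.new(4)
address = heap.allocate("first string")
stale_id = heap.object_id_for(address)

heap.free(address)                 # "first string" gets collected
heap.allocate("totally different") # and its slot is reused

p heap.id2ref(stale_id) # => "totally different"
```

The same reuse is what makes a remembered object_id an unreliable "already seen" marker.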
That’s why in 2018 there was already a feature request to deprecate both #object_id and _id2ref.
Back then Matz agreed to deprecate _id2ref for Ruby 2.7, but pointed out that removing #object_id would be too much of a breaking change,
and that it is a useful API.
However, this somehow fell through the cracks, and _id2ref was never formally deprecated, which is something I’d like to
do for Ruby 3.5.
I’m not certain why _id2ref was added initially, given that git blame points to a commit from 1999 that was generated by cvs2svn.
But if I had to guess, I’d say it was added for drb which today remains the only significant user of that API in the stdlib, but even that is about to change.
Regardless of why _id2ref was added, that major flaw in its design became a blocker for Aaron Patterson when he implemented
GC compaction in Ruby 2.7.
Since GC compaction implies that objects can be moved from one slot to another, #object_id could no longer be derived from
the object address, otherwise, it wouldn’t remain stable.
What Aaron did is conceptually simple:
module Kernel
def object_id
unless id = ObjectSpace::OBJ_TO_ID_TABLE[self]
id = ObjectSpace.next_obj_id
ObjectSpace.next_obj_id += 8
ObjectSpace::OBJ_TO_ID_TABLE[self] = id
ObjectSpace::ID_TO_OBJ_TABLE[id] = self
end
id
end
end
module ObjectSpace
def self._id2ref(id)
ObjectSpace::ID_TO_OBJ_TABLE[id]
end
end
In short, Ruby added two internal Hash tables. One of them with objects as keys and IDs as values, and the inverse for the other. Whenever you access an object’s ID for the first time, a unique ID is created by incrementing an internal counter, and the relation between the object and its ID is stored in the two hash tables.
As a Ruby user, you can observe this change easily by printing some object_id:
p Object.new.object_id
p Object.new.object_id
Up to Ruby 2.6, the above code will print some large and seemingly random integers such as 50666405449360, whereas on
Ruby 2.7 onwards, it will print small integers, likely 8 and 16.
This change both solved the historical issue with _id2ref and allowed the GC to keep stable IDs when moving objects from one
address to the other, but made object_id way more costly than it used to be.
Ruby’s hash-table implementation stores 3 pointer-sized numbers per entry. One for the key, one for the value, and one for the hashcode:
struct st_table_entry {
st_hash_t hash;
st_data_t key;
st_data_t record;
};
And given every object_id is stored in two hash-tables, that makes for a total of 48B (plus some change) per object_id.
That’s quite a lot of memory for just a small number.
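The arithmetic behind that number, as a quick back-of-the-envelope script. It assumes each of the three fields is exactly one pointer wide and ignores the tables' own bookkeeping (the "plus some change" part).

```ruby
# Rough estimate only: assumes every field of st_table_entry is one pointer
# wide and ignores the rest of the hash tables' overhead.
POINTER_SIZE = [0].pack("J").bytesize # "J" packs a pointer-width integer
WORDS_PER_ENTRY = 3                   # hash, key and record
TABLES = 2                            # object -> id and id -> object

bytes_per_object_id = POINTER_SIZE * WORDS_PER_ENTRY * TABLES
puts "#{bytes_per_object_id} bytes per object_id" # 48 on a 64-bit platform
```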
In addition, accessing the object_id now requires doing a hash lookup, when before it was a simple division, and whenever
the GC frees or moves an object that has an ID, it needs to update these two hash-tables.
To be clear, I don’t have any evidence that these two tables cause significant memory or CPU overhead in real-world Ruby applications.
I’m just saying that #object_id is way more expensive than one might expect.
Then later on, when Koichi Sasada implemented Ractors, multiple ractors could attempt to access these two hash-tables
concurrently, so he had to add a lock around them in #object_id, turning
#object_id into a contention point:
module Kernel
def object_id
RubyVM.synchronize do
unless id = ObjectSpace::OBJ_TO_ID_TABLE[self]
id = ObjectSpace.next_obj_id
ObjectSpace.next_obj_id += 8
ObjectSpace::OBJ_TO_ID_TABLE[self] = id
ObjectSpace::ID_TO_OBJ_TABLE[id] = self
end
id
end
end
end
module ObjectSpace
def self._id2ref(id)
RubyVM.synchronize do
ObjectSpace::ID_TO_OBJ_TABLE[id]
end
end
end
At this point, you may wonder if it’s really a big deal.
After all, #object_id is used a bit for debugging, but not so much in actual production code.
And this is mostly true, but it does come up in real-world code, e.g. in the mail gem,
in rubocop,
and of course quite a bit in Rails.
But calling Kernel#object_id isn’t the only way you might rely on an object ID.
The Object#hash method for example relies on it:
static st_index_t
objid_hash(VALUE obj)
{
VALUE object_id = rb_obj_id(obj);
if (!FIXNUM_P(object_id))
object_id = rb_big_hash(object_id);
return (st_index_t)st_index_hash((st_index_t)NUM2LL(object_id));
}
VALUE
rb_obj_hash(VALUE obj)
{
long hnum = any_hash(obj, objid_hash);
return ST2FIX(hnum);
}
Common value classes such as String, Array etc, do define their own #hash method that doesn’t rely on the object ID,
but all other objects that are compared by identity by default will end up using Object#hash, hence accessing the object_id.
For instance, here’s a quite classic #hash implementation from one of Rails’ classes:
# activerecord/lib/arel/nodes/delete_statement.rb
def hash
[self.class, @relation, @wheres, @orders, @limit, @offset, @key].hash
end
It absolutely isn’t obvious, but here we’re hashing a Class object, and classes are indexed by identity like a default object:
>> Class.new.method(:hash).owner
=> Kernel
>> Object.new.method(:hash).owner
=> Kernel
Hence the above code currently requires locking the entire virtual machine, just to produce a hashcode.
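If you want to see that delegation happen, here's a small sketch. CountingNode is a hypothetical class that only exists to count calls; it shows that hashing an Array asks each element for its #hash, and that a class with no #hash of its own ends up in the identity-based Kernel#hash.

```ruby
# CountingNode is a hypothetical class, used only to observe #hash calls.
class CountingNode
  @hash_calls = 0

  class << self
    attr_accessor :hash_calls
  end

  def hash
    self.class.hash_calls += 1
    super # falls back to the identity-based Kernel#hash
  end
end

node = CountingNode.new
[node.class, node, :limit].hash # same pattern as the Arel example above

p CountingNode.hash_calls.positive?  # => true, Array#hash called node.hash
p CountingNode.method(:hash).owner   # => Kernel
```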
So what could we do to remove or reduce the need to synchronize the entire virtual machine when accessing object IDs?
Well first, given that ObjectSpace._id2ref is very rarely used, and will likely be marked as deprecated soon,
we can start by optimistically not creating nor updating the id -> object table until someone needs it, which hopefully
won’t be the case in the vast majority of programs:
module Kernel
def object_id
RubyVM.synchronize do
unless id = ObjectSpace::OBJ_TO_ID_TABLE[self]
id = ObjectSpace.next_obj_id
ObjectSpace.next_obj_id += 8
ObjectSpace::OBJ_TO_ID_TABLE[self] = id
if defined?(ObjectSpace::ID_TO_OBJ_TABLE)
ObjectSpace::ID_TO_OBJ_TABLE[id] = self
end
end
id
end
end
end
module ObjectSpace
def self._id2ref(id)
RubyVM.synchronize do
unless defined?(ObjectSpace::ID_TO_OBJ_TABLE)
ObjectSpace::ID_TO_OBJ_TABLE = ObjectSpace::OBJ_TO_ID_TABLE.invert
end
ObjectSpace::ID_TO_OBJ_TABLE[id]
end
end
end
This doesn’t remove the lock yet, but assuming your program never calls ObjectSpace._id2ref it removes some work
from inside the lock, hence it shouldn’t be held as long.
And even if you don’t use Ractors, it should slightly reduce memory usage as well as remove work for the GC,
as demonstrated by a micro-benchmark:
benchmark:
baseline: "Object.new"
object_id: "Object.new.object_id"
compare-ruby: ruby 3.5.0dev (2025-04-10T09:44:40Z master 684cfa42d7) +YJIT +PRISM [arm64-darwin24]
built-ruby: ruby 3.5.0dev (2025-04-10T10:13:43Z lazy-id-to-obj d3aa9626cc) +YJIT +PRISM [arm64-darwin24]
warming up..
| |compare-ruby|built-ruby|
|:----------|-----------:|---------:|
|baseline | 26.364M| 25.974M|
| | 1.01x| -|
|object_id | 10.293M| 14.202M|
| | -| 1.38x|
As always, when possible, the most efficient way to speed up some code is to not call it if you can avoid it.
If you’re curious to see the actual implementation, you can have a look at the pull request.
But while saving a bit of memory and CPU is nice, we’re still not significantly reducing contention, so what else could we do?
The crux of the issue here is that the object_id is stored in a centralized hash table, and as long as that’s the case,
synchronization will be required, short of implementing a lock-free hash table, but this is quite tricky to do.
Much trickier than the hash-set John used for the fstring_table.
But more importantly, a centralized data structure to store all the IDs of all objects isn’t great for locality anyway. Moreover, needing to do a hash lookup to access an object’s property is quite costly, when conceptually it should be stored directly inside the object.
If you think about it, object_id isn’t very different from an instance variable:
module Kernel
def object_id
@__object_id ||= ObjectSpace.generate_next_obj_id
end
end
You’d need the id generation to be thread-safe, which is easily done using an atomic increment operation, but other than that,
assuming the object isn’t one of the special objects that is accessible from multiple ractors, you can mutate it to store the
object_id without having to lock the entire VM.
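Here is a runnable approximation of that idea in plain Ruby. IdGenerator, LazyId and lazy_object_id are hypothetical names, and since core Ruby has no atomic integer, a Mutex around the counter stands in for the atomic increment. It's a sketch of the concept, not what the VM does.

```ruby
# IdGenerator and LazyId are hypothetical. The Mutex stands in for an atomic
# increment, which plain Ruby doesn't expose.
class IdGenerator
  def initialize(step: 8)
    @step = step
    @last_id = 0
    @lock = Mutex.new
  end

  def next_id
    @lock.synchronize { @last_id += @step }
  end
end

module LazyId
  GENERATOR = IdGenerator.new

  # Only the counter is shared; the ID itself lives inside the object.
  def lazy_object_id
    @__lazy_object_id ||= GENERATOR.next_id
  end
end

class Widget
  include LazyId
end

widgets = Array.new(1_000) { Widget.new }
threads = widgets.each_slice(250).map do |slice|
  Thread.new { slice.map(&:lazy_object_id) }
end
ids = threads.flat_map(&:value)

p ids.uniq.size == ids.size # => true, no ID was handed out twice
p widgets.first.lazy_object_id == widgets.first.lazy_object_id # => true

begin
  Widget.new.freeze.lazy_object_id
rescue FrozenError => error
  puts "frozen objects can't take this path: #{error.class}"
end
```

Note the last part: in plain Ruby, a frozen object can't receive a new instance variable, so this naive version can't give it an ID.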
However, as is tradition, nothing is ever that simple.
Since Ruby 3.2, objects use shapes to define how their instance variables are stored.
Here again, let’s use some pseudo-Ruby code to illustrate the basics of how they work.
To start, shapes are a tree-like structure. Every shape has a parent (except the root one) and 0-N children:
class Shape
attr_reader :parent, :type, :next_ivar_index
def initialize(parent, type, edge_name, next_ivar_index)
@parent = parent
@type = type
@edge_name = edge_name
@next_ivar_index = next_ivar_index
@edges = {}
end
def add_ivar(ivar_name)
@edges[ivar_name] ||= Shape.new(self, :ivar, ivar_name, next_ivar_index + 1)
end
end
With this, when the Ruby VM has to execute code such as:
class User
def initialize(name, role)
@name = name
@role = role
end
end
It can compute the object shape on the fly such as:
# Allocate the object
object = new_object
object.shape = ROOT_SHAPE
# add @name
next_shape = object.shape.add_ivar(:@name)
object.shape = next_shape
object.ivars[next_shape.next_ivar_index - 1] = name
# add @role
next_shape = object.shape.add_ivar(:@role)
object.shape = next_shape
object.ivars[next_shape.next_ivar_index - 1] = role
This method may seem surprising, but it’s actually very efficient for various reasons I won’t get into here, because I wrote another post about it a bit over a year ago. Go read it if you are curious to know more.
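For those who prefer something they can run, here is a toy version of the pseudo-code above. ToyShape, ToyObject, index_of, ivar_set and ivar_get are illustrative names, not CRuby internals, and the lookup is a naive walk up the tree. It does show the key property: two objects that set the same instance variables in the same order end up sharing one shape.

```ruby
# Toy shapes. All names are hypothetical; the real implementation is in C.
class ToyShape
  attr_reader :parent, :type, :edge_name, :next_ivar_index

  def initialize(parent, type, edge_name, next_ivar_index)
    @parent = parent
    @type = type
    @edge_name = edge_name
    @next_ivar_index = next_ivar_index
    @edges = {}
  end

  # Find or create the child shape reached by adding `ivar_name`.
  def add_ivar(ivar_name)
    @edges[ivar_name] ||= ToyShape.new(self, :ivar, ivar_name, next_ivar_index + 1)
  end

  # Walk up the ancestors to find the slot index of `ivar_name`, if any.
  def index_of(ivar_name)
    shape = self
    while shape
      return shape.next_ivar_index - 1 if shape.type == :ivar && shape.edge_name == ivar_name
      shape = shape.parent
    end
    nil
  end
end

ROOT_SHAPE = ToyShape.new(nil, :root, nil, 0)

class ToyObject
  attr_reader :shape

  def initialize
    @shape = ROOT_SHAPE
    @ivars = []
  end

  def ivar_set(name, value)
    index = @shape.index_of(name)
    unless index
      @shape = @shape.add_ivar(name)
      index = @shape.next_ivar_index - 1
    end
    @ivars[index] = value
  end

  def ivar_get(name)
    index = @shape.index_of(name)
    index && @ivars[index]
  end
end

alice = ToyObject.new
alice.ivar_set(:@name, "Alice")
alice.ivar_set(:@role, "admin")

bob = ToyObject.new
bob.ivar_set(:@name, "Bob")
bob.ivar_set(:@role, "guest")

p alice.shape.equal?(bob.shape) # => true, same ivars in the same order
p bob.ivar_get(:@role)          # => "guest"
```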
But how instance variables are laid out isn’t the only thing that shapes record. They also keep track of how large an object is, hence how many instance variables it can store, as well as whether it has been frozen.
Still in pseudo-Ruby code, it looks like this:
class Shape
def add_ivar(ivar_name)
if @type == :frozen
raise "Can't modify frozen object"
end
@edges[ivar_name] ||= Shape.new(self, :ivar, ivar_name, next_ivar_index + 1)
end
def freeze
@edges[:__frozen] ||= Shape.new(self, :frozen, nil, next_ivar_index)
end
end
So frozen shapes are final. It is expected that a shape of type frozen won’t ever have any children.
But in the case of object_id, we want to be able to store the id on any object, regardless of whether they are frozen
or not. So the first step is to modify shapes to allow that, which I did in a relatively simple commit.
But here too there was a bit of a complication. In a few cases, for instance when calling Object#dup, Ruby needs to find
the unfrozen version of a shape. Previously, since frozen shapes couldn’t possibly have children, it was quite simple:
class Object
def dup
new_object = self.class.allocate
if self.shape.type == :frozen
new_object.shape = self.shape.parent
else
new_object.shape = self.shape
end
# ...
end
end
Once you allow frozen shapes to have children, this operation becomes more involved, as you now need to go up the tree to find the last non-frozen shape, then reapply all the child shapes you wish to carry over.
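A sketch of what that walk can look like, on a stripped-down toy shape. ToyShape, child, unfrozen_copy_shape and path are hypothetical names, and for simplicity this version replays the kept transitions from the root rather than from the last non-frozen shape; since children are found or created, it lands on the same shape.

```ruby
# ToyShape is a hypothetical stand-in, reduced to what is needed to show the
# "walk up, then replay" idea.
class ToyShape
  attr_reader :parent, :type, :edge_name

  def initialize(parent, type, edge_name)
    @parent = parent
    @type = type
    @edge_name = edge_name
    @edges = {}
  end

  # Find or create a child shape for the given transition.
  def child(type, edge_name)
    @edges[[type, edge_name]] ||= ToyShape.new(self, type, edge_name)
  end

  # Shape for a copy that must be unfrozen and must not inherit the ID:
  # collect the instance variable transitions, then replay them from the root.
  def unfrozen_copy_shape
    kept = []
    shape = self
    while shape.parent
      kept.unshift(shape) if shape.type == :ivar
      shape = shape.parent
    end
    kept.reduce(shape) { |current, edge| current.child(edge.type, edge.edge_name) }
  end

  # List of transitions from the root, for display purposes.
  def path
    shape = self
    names = []
    while shape.parent
      names.unshift(shape.edge_name || shape.type)
      shape = shape.parent
    end
    names
  end
end

root = ToyShape.new(nil, :root, nil)
frozen_with_id = root
  .child(:ivar, :@name)
  .child(:frozen, nil)
  .child(:obj_id, nil)

p frozen_with_id.path                     # => [:@name, :frozen, :obj_id]
p frozen_with_id.unfrozen_copy_shape.path # => [:@name]
```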
After this small refactoring was done, I could introduce a new type of shape: SHAPE_OBJ_ID, which behaves very similarly
to instance variable shapes:
class Shape
def object_id
# First check if there is an OBJ_ID shape in ancestors
shape = self
while shape.parent
return shape if shape.type == :obj_id
shape = shape.parent
end
# Otherwise create one.
@edges[:__object_id] ||= Shape.new(self, :obj_id, nil, next_ivar_index + 1)
end
end
And just like this, we’re now able to reserve some inline space inside any object to store the object_id,
and in some cases we’re able to access an object’s ID fully lock-free.
The reason I’m saying “in some cases” is that there are still a number of limitations.
First, since shapes are mostly immutable, we can access an object’s shape, and all its ancestors without taking a lock. However, finding or creating a shape’s child currently still requires synchronizing the VM. So even if my patch were to be applied, Ruby would still lock when accessing an object’s ID for the very first time, it would only be lock-free on subsequent accesses.
Being able to find or create child shapes in a lock-free way would be useful way beyond the object_id use case, so
hopefully we’ll get to it in the future. I haven’t yet dedicated much thought to it, but I’m hopeful we can find
a solution. But even if we can’t do it lock-free, I think we could at least use a dedicated lock for it, so we wouldn’t
contend with all the other code paths that synchronize the entire VM, only paths that do the same operation.
Then, if the object is potentially shared between ractors, we also still need to acquire the lock before storing the ID,
as otherwise, concurrent writes may cause a race condition. Given we need to both update the object’s shape and write
the object_id inside the object, we can’t do it all in an atomic manner.
Finally, not all objects store their instance variables in the same way.
As a Rubyist, you likely know that in Ruby everything is an object, but that doesn’t mean all objects are equal.
In the context of instance variables, there are essentially three types of objects: T_OBJECT, T_CLASS/T_MODULE and
then all the rest.
T_OBJECT are your classic objects that inherit from the BasicObject class. Their instance variables are stored
inline directly inside the object slot, as long as it’s large enough. If it ends up overflowing, then a separate memory
location is allocated, and instance variables are moved there; the object slot then only contains a pointer to that auxiliary memory.
T_CLASS and T_MODULE as their name suggests are all instances of the Class and Module classes. These are much
larger than regular objects, as they need to keep track of a lot of things, such as their method table, a pointer to the
parent class, etc:
>> ObjectSpace.memsize_of(Object.new)
=> 40
>> ObjectSpace.memsize_of(Class.new)
=> 192
As such, they never store their instance variables inline, they always store them in auxiliary memory, and they have dedicated space in their object slot to store the auxiliary memory pointer:
# internal/class.h
struct rb_classext_struct {
VALUE *iv_ptr; // iv = instance variable
// ...
}
And finally, there are all the other objects, such as T_STRING, T_ARRAY, T_HASH, T_REGEXP, etc.
None of these have free space in their slot to store inline variables, and not even space to store the auxiliary memory
pointer.
So what does Ruby do when you do add an instance variable to such objects? Well, it stores it in a Hash-table of course!
In pseudo-Ruby, it would look like this:
module GenericIvarObject
class GenericStorage
attr_accessor :shape
attr_reader :ivars
def initialize
@ivars = []
end
end
def instance_variable_get(ivar_name)
store = RubyVM.synchronize do
GENERIC_STORAGE[self] ||= GenericStorage.new
end
if ivar_shape = store.shape.find(ivar_name)
store.ivars[ivar_shape.next_ivar_index - 1]
end
end
end
As you probably have noticed or even guessed, since this is yet another global hash table, any access needs to be synchronized,
which means that for objects other than T_OBJECT, T_CLASS and T_MODULE,
my patch replaces one global synchronized hash with another…
So perhaps for these, keeping the original object -> id table would be preferable, that’s something I still need to figure out.
My patch isn’t finished. I still have to figure out how to best deal with “generic” objects, and probably refine the implementation some more, and perhaps it won’t even be merged at all in the end.
But I wanted to share it because explaining something helps me think about the problem,
and also because while I don’t think object_id is currently the biggest Ractor bottleneck,
it’s a good showcase of the type of work that needs to be done to make Ractors more parallel.
If you are curious about the patch, here’s what it currently looks like as of this writing.
Similar work will have to be done for other internal tables, such as the symbol table and the various method tables.
However one database-adjacent topic I don’t think I’ve ever seen any discussions about, and that I think could be improved, is the protocols exposed by these databases to execute queries. Relational databases are very impressive pieces of technology, but their client protocol makes me wonder if they ever considered being used by anything other than a human typing commands in a CLI interface.
I also happen to maintain the Redis client for Ruby, and while the Redis protocol is far from perfect, I think there are some things it does better than PostgreSQL and MySQL protocols, which are the two I am somewhat familiar with.
You’ve probably never seen them, because they’re not logged by default, but when Active Record connects to your database it starts by executing several database-specific queries, which I generally call the “prelude”.
Which queries are sent exactly depends on how you configured Active Record, but for most people, it will be the default.
In the case of MySQL it will look like this:
SET @@SESSION.sql_mode = CONCAT(@@sql_mode, ',STRICT_ALL_TABLES,NO_AUTO_VALUE_ON_ZERO'),
@@SESSION.wait_timeout = 2147483
For PostgreSQL, there’s a bit more:
SET client_min_messages TO 'warning';
SET standard_conforming_strings = on;
SET intervalstyle = iso_8601;
SET SESSION timezone TO 'UTC'
In both cases the idea is the same, we’re configuring the connection, making it behave differently. And there’s nothing wrong with the general idea of that, as a database gets older, new modes and features get introduced so for backward compatibility reasons you have to opt-in to them.
My issue with this however is that you can set these at any point. They’re not restricted to an initial authentication and configuration step, so when as a framework or library you hand over a connection to user code and later get it back, you can’t know for sure they haven’t changed any of these settings. Similarly, it means you have both configured and unconfigured connections and must be careful to never use an unconfigured one. It’s not the end of the world but noticeably complexifies the connection management code.
This statefulness also makes it hard if not impossible to recover from errors. If for some reason a query fails, it’s hard to tell which state the connection is in, and the only reasonable thing to do is to close it and start from scratch with a new connection.
If these protocols had an explicit initial configuration phase, it would make it easier to have some sort of “reset state” message you could send after an error (or after letting user code run unknown queries) to get the connection back to a known clean state.
From a Ruby client perspective, it would look like this:
connection = MyDB.new_connection
connection.authenticate(user, password)
connection.configure("SET ...")
connection.query("INSERT INTO ...")
connection.reset
You could even cheaply reset the state whenever a connection is checked back into a connection pool.
I’m not particularly knowledgeable about all the constraints database servers face, but I can’t think of a reason why such a protocol feature would be particularly tricky to implement.
One of the most important jobs of a database client, or network clients in general, is to deal with network errors.
Under the hood, most if not all clients will look like this:
def query(command)
packet = serialize(command)
@socket.write(packet)
response = @socket.read
deserialize(response)
end
It’s fairly trivial, you send the query to the server and read the server response.
The difficulty however is that both the write and the read operations can fail in dozens of different ways.
Perhaps the server is temporarily unreachable and will work again in a second or two. Or perhaps it’s reachable but was temporarily overloaded and didn’t answer fast enough so the client timeout was reached.
These errors should hopefully be rare, but can’t be fully avoided. Whenever you are sending something through the network, there is a chance it might not work, it’s a fact of life. Hence a client should try to gracefully handle such errors as much as possible, and there aren’t many ways to do so.
The most obvious way to handle such an error is to retry the query, the problem is that most of the time, from the point of view of the database client, it isn’t clear whether it is safe to retry or not.
In my view, the best feature of HTTP by far is its explicit verb specification.
The HTTP spec clearly states that clients, and even proxies, are allowed to retry some specific verbs such as GET or DELETE
because they are idempotent.
The reason this is important is that whenever the write or the read fails, in the overwhelming majority of cases,
you don’t know whether the query was executed on the server or not.
That is why idempotency is such a valuable property, by definition an idempotent operation can safely be executed twice,
hence when you are in doubt whether it was executed, you can retry.
But knowing whether a query is idempotent or not with SQL isn’t easy.
For instance, a simple DELETE query is idempotent:
DELETE
FROM articles
WHERE id = 42;
But one can perfectly write a DELETE query that isn’t:
DELETE
FROM articles
WHERE id IN (
SELECT id
FROM articles
LIMIT 10
);
So in practice, database clients can’t safely retry on errors, unless the caller instructs them that it is safe to do so. You could attempt to write a client that parses the queries to figure out whether they are idempotent, but it is fraught with peril, hence it’s generally preferable to rely on the caller to tell us.
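To illustrate what "relying on the caller" can look like, here is a minimal sketch. TransientNetworkError, FlakyClient and its idempotent: flag are all hypothetical; this isn't the API of Active Record or of any real driver, and the network failure is simulated.

```ruby
# Everything here is hypothetical and only illustrates the retry policy.
class TransientNetworkError < StandardError; end

class FlakyClient
  attr_reader :attempts

  def initialize(failures:, max_retries: 2)
    @failures = failures # how many calls fail before one succeeds
    @max_retries = max_retries
    @attempts = 0
  end

  # Only retry when the caller vouches that running the query twice is safe.
  def query(sql, idempotent: false)
    retries = 0
    begin
      perform(sql)
    rescue TransientNetworkError
      raise unless idempotent && retries < @max_retries
      retries += 1
      retry
    end
  end

  private

  # Stand-in for "write to the socket, then read the response".
  def perform(sql)
    @attempts += 1
    if @failures > 0
      @failures -= 1
      raise TransientNetworkError, "connection reset"
    end
    "OK: #{sql}"
  end
end

client = FlakyClient.new(failures: 1)
p client.query("DELETE FROM articles WHERE id = 42", idempotent: true)
# => "OK: DELETE FROM articles WHERE id = 42"

client = FlakyClient.new(failures: 1)
begin
  client.query("UPDATE counters SET value = value + 1")
rescue TransientNetworkError => error
  puts "not retried: #{error.message}"
end
```

The client never guesses; the decision stays with the caller, who knows what the query does.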
That’s one of the reasons why I’ve been slowly refactoring Active Record lately, to progressively make it easier to retry more queries in case of network errors. But even once I’ll be done with that refactoring, numerous non-idempotent queries will remain, and whenever they fail, there is still nothing Active Record will be able to do about it.
However, there are solutions to turn non-idempotent operations into idempotent ones, using what is sometimes called “Idempotency Keys”. If you’ve used the Stripe API, perhaps you are already familiar with them. I suspect they’re not the first ones to come up with such a solution, but that’s where I was first exposed to it.
Conceptually it’s rather simple, when performing a non-idempotent operation, say creating a new customer record, you can
add an Idempotency-Key HTTP header containing a randomly generated string.
If for some reason you need to retry that request, you do it with the same idempotency key, allowing the Stripe API to
check if the initial request succeeded or not, and either perform or discard the retry.
They even go a bit further, when a request with an idempotency key succeeds, they record the response so that in case of a retry, they return you exactly the original response. Thanks to this feature, it is safe to retry all API calls to their API, regardless of whether they are idempotent or not.
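Conceptually, the server side of this can be sketched in a few lines. IdempotentEndpoint is a hypothetical in-memory model; I don't know how Stripe actually stores keys and responses, so take it as an illustration of the contract, not of their implementation.

```ruby
require "securerandom"

# Hypothetical in-memory model of an endpoint that honors idempotency keys.
class IdempotentEndpoint
  attr_reader :customers

  def initialize
    @customers = []
    @responses = {} # idempotency key => recorded response
  end

  def create_customer(name, idempotency_key:)
    # A retry with a known key gets the recorded response back,
    # and the side effect is not performed a second time.
    @responses[idempotency_key] ||= begin
      @customers << name
      { id: @customers.size, name: name }
    end
  end
end

api = IdempotentEndpoint.new
key = SecureRandom.uuid

first = api.create_customer("Jane", idempotency_key: key)
retried = api.create_customer("Jane", idempotency_key: key) # e.g. after a timeout

p first.equal?(retried) # => true, the original response is replayed
p api.customers.size    # => 1
```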
This is such a great feature that last year, at Rails World 2024, when I saw there was a ValKey booth, hosted by Kyle Davis, I decided to go have a chat with him, to see if perhaps ValKey was interested in tackling this fairly common problem.
Because everything I said about SQL and idempotency also applies to Redis (hence to ValKey). It is also hard for a Redis client to know if a query can safely be retried, and for decades, long before I became the maintainer, the Redis client would retry all queries by default.
At first, it would only do so in case of ECONNRESET errors, but over time more errors were added to the retry list.
I must admit I’m not the most knowledgeable person about TCP, so perhaps it is indeed safe to assume the server never
received the query when such an error is returned, but over time more and more errors were added to the list, and I highly
doubt all of them are safe to retry.
That’s why when I later wrote redis-client, a much simpler and lower-level client for Redis, I made sure not to retry by
default, and to provide a way to distinguish idempotent queries by having both a call and a call_once method.
But from the feedback I got when Mike Perham replaced the redis gem with redis-client in Sidekiq, lots of users
started noticing reports of errors they wouldn’t experience before, showing how unreliable remote data stores can be in practice,
especially in cloud environments.
So even though these retries were potentially unsafe, and may have occasionally caused data loss, they were desired by users.
That’s why I tried to pitch an idempotency key kind of feature to Kyle, and he encouraged me to open a feature request in the ValKey repo. After a few rounds of discussion, the ValKey core team accepted the feature, and while as far as I know it hasn’t been implemented yet, the next version of ValKey will likely have it.
It is again pretty simple conceptually:
MULTISTORE 699accd1-c7fa-4c40-bc85-5cfcd4d3d344 EX 10
INCR counter
LPOP queue
EXEC
Just like with Stripe’s API, you start a transaction with a randomly generated key, in this case, a UUID, as well as an expiry.
In the example above we ask ValKey to remember this transaction for the next 10 seconds, that’s for how long we can safely retry, after that ValKey can discard the response.
Assuming the next version of ValKey ships with the feature, that should finally offer a solution to safely retry all possible queries.
I fully understand that relational databases are much bigger beasts than an in-memory key-value store, hence it likely is harder to implement, but if I was ever asked what feature MySQL or PostgreSQL could add to make them nicer to work with, it certainly would be this one.
In the case of ValKey, given it’s a text protocol that meant introducing a new command, but MySQL and PostgreSQL both have binary protocols, with distinct packet types, so I think it would be possible to introduce at the protocol level with no change to their respective SQL syntax, and no backward compatibility concerns.
Another essential part of database protocols that I think isn’t pleasant to work with is prepared statements.
Prepared statements mostly serve two functions. The most important one is to provide a query and its parameters separately, so as to eliminate the risk of SQL injections. In addition to that, they can in some cases help with performance, because they save on having to parse the query every time, as well as having to send it down the wire. Some databases will also cache the associated query plan.
Here’s how you use prepared statements using the MySQL protocol:
1. Send a COM_STMT_PREPARE packet with the parameterized query (SELECT * FROM users WHERE id = ?).
2. Read the COM_STMT_PREPARE_OK packet and extract the statement_id.
3. Send a COM_STMT_EXECUTE with the statement_id and the parameters.
4. Read the OK_Packet response.
5. Once the statement is no longer needed, send a COM_STMT_CLOSE packet with the statement_id.
Now ideally, you execute the same statements relatively often, so you keep track of them, and in the happy path you
can perform a parameterized query in a single roundtrip by directly sending a COM_STMT_EXECUTE with the known statement_id.
But one major annoyance is that these statement_id are session-scoped, meaning they’re only valid with the connection
that was used to create them.
In a modern web application, you don’t just have one connection, but a pool of them, and that’s per process, so you need
to keep track of the same thing many times.
Worse, as explained previously, since closing and reopening the connection is often the only safe way to recover from errors, whenever that happens, all prepared statements are lost.
These statements also have a cost on the server side. Each statement requires some amount of memory in the database server. So you have to be careful not to create an unbounded amount of them, which for an ORM isn’t easy to enforce.
It’s not rare for applications to dynamically generate queries based on user input, typically some advanced search or filtering form.
In addition, Active Record allows you to provide SQL fragments, and it can’t know whether they are static strings or dynamically generated ones. For example, it’s not good practice, but users can perfectly do something like this:
Article.where("published_at > '#{Time.now.to_fs(:db)}'")
Also, if you have Active Record query logs, then most queries will be unique.
All this means that a library like Active Record has to have lots of logic to keep track of prepared statements and their lifetime. You might even need some form of Least Recently Used logic to prune unused statements and free resources on the server.
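To give an idea of the kind of bookkeeping involved, here is a minimal LRU sketch. StatementCache and FakeConnection, along with their prepare and close_statement methods, are hypothetical; real drivers expose different APIs, and Active Record's actual statement pool is more involved than this.

```ruby
# Hypothetical per-connection statement cache with LRU eviction.
class StatementCache
  def initialize(connection, max_size:)
    @connection = connection
    @max_size = max_size
    @statements = {} # sql => statement_id, least recently used first
  end

  def statement_id_for(sql)
    if (id = @statements.delete(sql))
      @statements[sql] = id # re-insert to mark as most recently used
      return id
    end

    evict_oldest while @statements.size >= @max_size
    @statements[sql] = @connection.prepare(sql)
  end

  private

  def evict_oldest
    _sql, id = @statements.shift # Hash#shift removes the oldest entry
    @connection.close_statement(id)
  end
end

# Fake connection that hands out incrementing IDs and records closed ones.
class FakeConnection
  attr_reader :closed

  def initialize
    @next_id = 0
    @closed = []
  end

  def prepare(_sql)
    @next_id += 1
  end

  def close_statement(id)
    @closed << id
  end
end

connection = FakeConnection.new
cache = StatementCache.new(connection, max_size: 2)

cache.statement_id_for("SELECT * FROM users WHERE id = ?") # prepared as 1
cache.statement_id_for("SELECT * FROM posts WHERE id = ?") # prepared as 2
cache.statement_id_for("SELECT * FROM users WHERE id = ?") # cache hit
cache.statement_id_for("SELECT * FROM tags WHERE id = ?")  # evicts the posts query

p connection.closed # => [2]
```

And remember this has to exist once per connection, in every process.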
In many cases, when you have no reason to believe a particular query will be executed again soon, it is actually advantageous not to use prepared statements. Ideally, you’d still use a parameterized query, but then it means doing 2-3 roundtrips[1] to the database instead of just one.
So for MySQL at least, when you use Active Record with a SQL fragment provided as a string, Active Record falls back to not using prepared statements, and instead interpolates the parameters inside the query.
Ideally, we’d still use a parameterized query, just not a prepared one, but the MySQL protocol doesn’t offer such functionality. If you want to use parameterized queries, you have to use prepared statements and in many cases, that will mean an extra roundtrip.
I’m much less familiar with the PostgreSQL protocol, but from glancing at its specification I believe it works largely in the same way.
So how could it be improved?
First I think it should be possible to perform parameterized queries without a prepared statement, I can’t think of a reason why this isn’t a possibility yet.
Then I think that here again, some inspiration could be taken from Redis.
Redis doesn’t have prepared statements, that wouldn’t make much sense, but it does have something rather similar in the form of Lua scripts.
> EVAL "return ARGV[1] .. ARGV[2]" 0 "hello" "world!"
"helloworld!"
But just like SQL queries, Lua code needs to be parsed and can be relatively large, so caching that operation is preferable for
performance.
But rather than a PREPARE command that returns you a connection-specific identifier for your given script, Redis
instead uses SHA1 digests.
You can first load a script with the SCRIPT LOAD command:
> SCRIPT LOAD "return ARGV[1] .. ARGV[2]"
"702b19e4aa19aaa9858b9343630276d13af5822e"
Then you can execute the script as many times as desired by only referring to its digest:
> EVALSHA "702b19e4aa19aaa9858b9343630276d13af5822e" 0 "hello" "world!"
"helloworld!"
And that script registry is global, so even if you have 5000 connections, they can all share the same script, and you can even assume scripts have been loaded already, and load them on a retry if they weren’t:
require "redis-client"
require "digest/sha1"
class RedisScript
def initialize(src)
@src = src
@digest = Digest::SHA1.hexdigest(src)
end
def execute(connection, *args)
connection.call("EVALSHA", @digest, *args)
rescue RedisClient::CommandError
connection.call("SCRIPT", "LOAD", @src)
connection.call("EVALSHA", @digest, *args)
end
end
CONCAT_SCRIPT = RedisScript.new(<<~LUA)
return ARGV[1] .. " " .. ARGV[2]
LUA
redis = RedisClient.new
p CONCAT_SCRIPT.execute(redis, 0, "Hello", "World!")
I’m not a database engineer, so perhaps there’s some big constraint I’m missing, but I think it would make a lot of sense for prepared statement identifiers to be some sort of predictable digests, so that they are much more easily shared across connections, and let the server deal with garbage-collecting prepared statements that haven’t been seen in a long time, or use some sort of reference counting strategy.
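As a thought experiment, here is what that could look like from the client's perspective. ToyServer and ToyConnection are entirely hypothetical: neither MySQL nor PostgreSQL works this way today, and the "execution" just echoes the statement and its parameters.

```ruby
require "digest/sha1"

# Hypothetical server with a statement registry shared by all connections.
class ToyServer
  class UnknownStatement < StandardError; end

  attr_reader :prepare_count

  def initialize
    @statements = {} # digest => sql
    @prepare_count = 0
  end

  def prepare(sql)
    @prepare_count += 1
    digest = Digest::SHA1.hexdigest(sql)
    @statements[digest] = sql
    digest
  end

  def execute(digest, *params)
    sql = @statements.fetch(digest) { raise UnknownStatement, digest }
    [sql, params] # a real server would run the query here
  end
end

class ToyConnection
  def initialize(server)
    @server = server
  end

  # Optimistically assume the statement is known, prepare only on a miss.
  def query(sql, *params)
    digest = Digest::SHA1.hexdigest(sql)
    @server.execute(digest, *params)
  rescue ToyServer::UnknownStatement
    @server.prepare(sql)
    @server.execute(digest, *params)
  end
end

server = ToyServer.new
sql = "SELECT * FROM users WHERE id = ?"

first = ToyConnection.new(server)
second = ToyConnection.new(server)

first.query(sql, 42)    # miss: prepares, then executes
p second.query(sql, 43) # hit, even though `second` never prepared anything
# => ["SELECT * FROM users WHERE id = ?", [43]]
p server.prepare_count  # => 1
```

Because the identifier is derived from the query itself, the client needs no per-connection bookkeeping at all.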
I could probably find a few more examples of things that are impractical in MySQL and PostgreSQL protocols, but I think I’ve shown enough to share my feelings about them.
Relational databases are extremely impressive projects, clearly built by very smart people, but it feels like the developer experience isn’t very high on their priority list, if it’s even considered. And that perhaps explains part of the NoSQL appeal in the early 2010s. However, I think it would be possible to significantly improve their usability without changing the query language, just by improving the query protocol.
3 roundtrips in total, but you theoretically can do the COM_STMT_CLOSE asynchronously. ↩
It has a bit of an unusual design and makes hard tradeoffs, so I’d like to explain the thought process behind these decisions and how I see the future of that project.
Ever since I joined Shopify over 11 years ago, the main monolith application has been using Unicorn as its application server in production. I know that Unicorn is seen as legacy software by many if not most Rubyists, including Unicorn’s own maintainer, but I very strongly disagree with this opinion.
A major argument against Unicorn is that Rails apps are mostly IO-bound, so despite the existence of the GVL, you can use a threaded server to increase throughput. I explained in a previous post why I don’t believe most Rails applications are IO-bound, but regardless of how true it is in general, it certainly isn’t the case for Shopify’s monolith, hence using a threaded server wasn’t a viable option.
In addition, back in 2014, before the existence of the Ruby and Rails Infrastructure team at Shopify, I worked on the Resiliency team, where we were in charge of reducing the likelihood of outages, as well as reducing the blast radius of any outage we failed to prevent. That’s the team where we developed tools such as Toxiproxy and Semian.
During my stint on the Resiliency team, I’ve witnessed some pretty catastrophic failures. Some C extensions segfaulting, or worse, deadlocking the Ruby VM, some datastores becoming unresponsive, and more.
What I learned from that experience is that while you should certainly strive to catch as many bugs as possible upfront in CI, you have to accept that you can’t possibly catch them all. So ultimately, it becomes a numbers game. If an application is developed by half a dozen people, this kind of event may only happen once in a blue moon. But when dealing with a monolith on which hundreds if not thousands of developers are actively making changes every day, bugs are a fact of life.
As such, it’s important to adopt a defense-in-depth strategy: if you cannot possibly abolish all bugs, you can at least limit their blast radius with various techniques. And Unicorn’s process-based execution model contributed greatly to the resiliency of the system.
But while I’ll never cease to defend Unicorn’s design, I’m also perfectly able to recognize that it also has its downsides.
One is that Unicorn doesn’t attempt to protect against common attacks such as slowloris, so it’s mandatory to put it behind a buffering reverse proxy such as NGINX. You may consider this to be extra complexity, but to me, it’s the opposite. Yes, it’s one more “moving piece”, but from my point of view, it’s less complex to defer many classic concerns to a battle-tested software used across the world, with lots of documentation, rather than to trust my application server can safely be exposed directly to the internet. I’d much rather trust the NGINX community to keep up with whatever novel attack was engineered last week than rely on the part of the Ruby community that uses my app server of choice. Not that I distrust the Ruby community, but my assumption is that the larger community is more likely to quickly get the security fixes in.
And if a reverse proxy will be involved anyway, you can let it take care of many standard concerns such as terminating SSL, allowing newer versions of HTTP, serving static assets, etc. I don’t think that an extra moving piece brings extra complexity when it is such a standard part of so many stacks and removes a ton of complexity from the next piece in the chain. But that’s just me. I suppose, especially after reading some of the reactions to my previous posts, that not everybody agrees on what is complex and what is simple.
Another shortcoming of the multi-process design that’s often mentioned, is its inability to do efficient connection pooling. Since connections aren’t easily shared across processes, each unicorn worker will maintain a separate pool of connections, that will be idle most of the time.
But here too, there aren’t many alternatives. Even if you accept the tradeoff of using a threaded server, you will still need to run at least one process per core, hence you won’t be able to cut the number of idle connections significantly compared to Unicorn. You may be able to buy a bit of time that way, but sooner or later it won’t be enough.
Ultimately, once you scale past a certain size you kinda have to accept that external connection pooling is a necessity. The only alternative I can think of would be to implement cross-process connection pooling by passing file descriptors via IPC. It’s technically doable, but I can’t imagine myself arguing that it’s less complex than setting up ProxySQL, mcrouter / twemproxy etc.
Yet another complaint I heard, was that the multi-process design made it impossible to cache data in memory. But here too I’m going to sound like a broken record, as long as Ruby doesn’t have a viable way to do in-process parallelism, you will have to run at least one process per core, so trying to cache data in-process is never going to work well.
But even without that limitation, I’d still argue you’d be better off not using the heap as a cache, because by doing so you are creating extra work for the garbage collector, and anyway, all the caches would be wiped on every deploy, which may be quite frequent. So I’d much rather run a small local Memcached instance on every web node, or use something like SQLite or whatever. It’s a bit slower than in-memory caching, in part because it requires serialization, but it persists across deploys and is shared across all the processes on the server, so it has a much better hit ratio.
And finally, by far the most common complaint against the Unicorn model is the extra memory usage induced by processes, and that’s exactly what Pitchfork was designed to solve.
Whenever I’m asked what my day-to-day job is like, I have a very hard time explaining it, because I kind of do an amalgamation of lots of small things that aren’t necessarily all logically related. So it’s almost impossible for me to come up with an answer that makes sense, and I don’t think I ever gave the same answer twice. I also probably made a fool of myself more than once.
But among the many hats I occasionally wear, there’s one I call the “Heap Janitor”. When you task hundreds if not thousands of developers to add features to a monolith, its memory usage will keep growing. Some of that growth will be legitimate because every line of code has to reside somewhere in memory as VM bytecode, but some of it can be reduced or eliminated by using better data structures, deduplicating some data, etc.
Most of the time when the Shopify monolith would experience a memory leak, or simply would have increased its memory usage enough to be problematic, I’d get involved in the investigation. Over time I developed some expertise on how to analyse a Ruby application’s heap and find leaks or opportunities for memory usage reduction.
I even developed some dedicated tools to help with that task, and integrated them into CI so every morning I’d get a nightly report of what Shopify’s monolith heap is made of, to better see historical trends and proactively fix newly introduced problems.
Once, by deduplicating the schema information Active Record keeps, I managed to reduce each process’s memory usage by 114MB. By now I have probably sent over a hundred patches to many gems to reduce their memory usage; most of these patches revolve around interning some strings.
But while you can often find more compact ways to represent some data in memory, that can’t possibly compensate for the new features being added constantly.
So by far, the most effective way to reduce an application’s memory usage is to allow more memory to be shared between processes via Copy-on-Write, which in the case of Puma or Unicorn means ensuring that memory is allocated during boot and is never mutated after that.
Since the Shopify monolith runs in pretty large containers with 36 workers, if you load 1GiB of extra data in memory,
as long as you do it during boot and it is never mutated, thanks to Copy-on-Write that will only account for an extra
28MiB (1024 / 36) of actual memory usage per worker, which is perfectly reasonable.
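To make that arithmetic concrete, here is a small sketch of the accounting. The method names are mine, and it assumes the simplified model used above, where a page shared by N workers is charged 1/N to each of them (the parent process is ignored):

```ruby
# Memory charged to each worker for data loaded before fork and never mutated:
# the pages are shared, so each worker only pays its proportional share.
def shared_cost_per_worker_mib(data_mib, workers)
  data_mib.fdiv(workers)
end

# Total memory if every worker lazily loads its own private copy instead.
def private_total_mib(data_mib, workers)
  data_mib * workers
end

puts shared_cost_per_worker_mib(1024, 36).round(1) # => 28.4
puts private_total_mib(1024, 36)                   # => 36864
```

In other words, under this model the same 1GiB of data costs 1GiB in total when it is shared, versus 36GiB if each of the 36 workers ends up with its own copy.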
Unfortunately, the lazy loading pattern is extremely common in Ruby code. I’m sure you’ve seen plenty of code like this:
module SomeNamespace
  class << self
    def config
      @config ||= YAML.load_file("path/to/config.yml")
    end
  end
end
Here I used a YAML config file as an example, but sometimes it’s fetching or computing data from somewhere else.
The key point is @ivar ||= being done in a class or module method.
This pattern is good in development because it means that if you don’t need that data, you won’t waste time computing it. In production, however, it’s bad: not only will that memory not be in shared pages, it will also cause the first request that needs this data to do some extra work, causing latency to spike around deploys.
A very simple way to improve this code is to just use a constant:
module SomeNamespace
  CONFIG = YAML.load_file("path/to/config.yml")
end
But if for some reason you really want this to be lazily loaded in development, Rails offers a not-so-well-known API to help with that:
module SomeNamespace
  class << self
    def eager_load!
      config
    end

    def config
      @config ||= YAML.load_file("path/to/config.yml")
    end
  end
end

# in: config/application.rb
config.eager_load_namespaces << SomeNamespace
In the above example, Rails takes care of calling eager_load! on all objects you add to config.eager_load_namespaces
when it’s booted in production mode. This way you keep lazy loading in development environments, but get eager loading
in production.
I spent a lot of time improving Shopify’s monolith and its open-source dependencies to make it eager-load more.
To help me track down the offending call sites, I configured our profiling middleware
so that it would automatically trigger profiling of the very first request processed by a worker.
And similarly, I configured our Unicorn so that a few workers would dump their heap with ObjectSpace.dump_all
before and after their very first request.
On paper, every object allocated as part of a Rails request is supposed to no longer be referenced once the request has been completed. So by taking a heap snapshot before and after a request, and making a diff of them, you can locate any object that should have been eager loaded during boot.
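As a rough illustration of that diffing step, here is a minimal sketch. It is not the tooling I actually used: the method names are hypothetical, and it assumes that comparing object addresses between two dumps is good enough, even though an address can be reused by a new object after a GC:

```ruby
require "objspace"
require "json"
require "set"

# Write one JSON document per live object to +path+.
def dump_heap(path)
  GC.start
  File.open(path, "w") { |file| ObjectSpace.dump_all(output: file) }
end

# Yield every object of a heap dump as a Hash, one line at a time.
def each_heap_object(path)
  File.foreach(path) do |line|
    yield JSON.parse(line)
  rescue JSON::ParserError
    next # skip any line that isn't valid JSON
  end
end

# Return the objects present in the second dump but not in the first one.
def new_objects(before_path, after_path)
  known = Set.new
  each_heap_object(before_path) do |object|
    known << object["address"] if object["address"]
  end

  found = []
  each_heap_object(after_path) do |object|
    address = object["address"] # ROOT entries have no address
    found << object if address && !known.include?(address)
  end
  found
end
```

If ObjectSpace.trace_object_allocations_start was called early during boot, each entry also carries the "file" and "line" where the object was allocated, which is what points you to the call site that should have been eager loaded.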
Over time this data helped me increase the amount of shared memory, from something around 45% up to about 60% of the total,
hence significantly reduced the memory usage of individual workers, but I was hitting diminishing returns.
60% is good, but I was hoping for more. In theory, only the memory allocated as part of the request cycle can’t be shared,
the overwhelming majority of the rest of the objects should be shareable, so I was expecting the ratio of shared memory
to be closer to 80%, which raised the question: which memory still wasn’t shared?
For a while I tried to answer this question using eBPF probes, but after reading man pages for multiple days, I had to accept that these sorts of things fly over my head1, so I gave up.
But one day I had a revelation: It must be the inline caches!
A very large portion of the Shopify monolith heap is made up of VM bytecode: as mentioned previously, all the code written by all these developers has to end up somewhere. That bytecode is largely immutable, but very close to it there are inline caches2, and they are mutable, at least early on.
And if they are close together in the heap, mutating an inline cache would invalidate the entire 4KiB page, including lots of immutable objects on the same page.
To validate my assumption, I wrote a test application:
module App
  CONST_NUM = Integer(ENV.fetch("NUM", 100_000))

  CONST_NUM.times do |i|
    class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      Const#{i} = Module.new
      def self.lookup_#{i}
        Const#{i}
      end
    RUBY
  end

  class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
    def self.warmup
      #{CONST_NUM.times.map { |i| "lookup_#{i}"}.join("\n")}
    end
  RUBY
end
It uses meta-programming, but is rather simple: it defines 100k methods, each referencing a unique constant. If I removed the meta-programming it would look like this:
module App
  Const0 = Module.new
  def self.lookup_0
    Const0
  end

  Const1 = Module.new
  def self.lookup_1
    Const1
  end

  def self.warmup
    lookup_0
    lookup_1
    # snip...
  end
end
Why this pattern? Because it’s a good way to generate a lot of inline caches, constant caches in this case, and to trigger their warmup.
>> puts RubyVM::InstructionSequence.compile('Const0').disasm
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,6)>
0000 opt_getconstant_path <ic:0 Const0> ( 1)[Li]
0002 leave
Here the <ic:0> tells us this instruction has an associated inline cache.
These constant caches start uninitialized, and the first time this codepath is executed, the Ruby VM goes through
the slow process of finding the object that’s pointed to by that constant, and stores it in the cache.
On further execution, it just needs to check the cache wasn’t invalidated, which for constants is extremely rare unless
you are doing some really nasty meta-programming during runtime.
Now, using this app, we can demonstrate the effect of inline caches on Copy-on-Write effectiveness:
def show_pss(title)
  # Easy way to get PSS on Linux
  print title.ljust(30, " ")
  puts File.read("/proc/self/smaps_rollup").scan(/^Pss: (.*)$/)
end

show_pss("initial")
pid = fork do
  show_pss("after fork")
  App.warmup
  show_pss("after fork after warmup")
end
Process.wait(pid)
If you run the above script on Linux, you should get something like:
initial 246380 kB
after fork 121590 kB
after fork after warmup 205688 kB
So our synthetic App made our initial Ruby process grow to 246MB, and once we forked a child, its
proportionate memory usage was immediately cut in half as expected.
However once App.warmup is called in the child, all these inline caches end up initialized, and most of the Copy-on-Write
pages get invalidated, making the proportionate memory usage grow back to 205MB.
So you probably guessed the next step, if you can call App.warmup before forking, you stand to save a ton of memory:
def show_pss(title)
  # Easy way to get PSS on Linux
  print title.ljust(30, " ")
  puts File.read("/proc/self/smaps_rollup").scan(/^Pss: (.*)$/)
end

show_pss("initial")
App.warmup
show_pss("after warmup")
pid = fork do
  show_pss("after fork")
  App.warmup
  show_pss("after fork after warmup")
end
Process.wait(pid)
initial 246404 kB
after warmup 251140 kB
after fork 123944 kB
after fork after warmup 124240 kB
My theory was somewhat validated. If I found a way to fill inline caches before fork, I’d stand to achieve massive memory savings. Some would for sure continue to flip-flop like inline method caches in polymorphic code paths, but the vast majority of them would essentially be static memory.
However, that was easier said than done.
Generally, when I mentioned that problem, the suggestion was to exercise these code paths as part of boot, but it already isn’t easy to get good coverage in the test environment, it would be even harder during boot in the production environment. Even worse, many of these code paths have side effects, you can’t just run them like that out of context. Anyway, with something like this in place, the application would take ages to boot, and it would be painful to maintain.
Another idea was to attempt to precompute these caches statically, which for constant caches is relatively easy. But it’s only part of the picture, method caches, and instance variable caches are much harder, if not impossible to predict statically, so perhaps it would help a bit, but it wouldn’t solve the issue once and for all.
Given that all these types of caches are stored right next to each other, as soon as a single one changes, the entire 4KiB memory page is invalidated.
Yet another suggestion was to serve traffic for a while from the Unicorn master process, but I didn’t like this idea because that process is in charge of overseeing and coordinating all the workers, it can’t afford to render requests, as it can’t be timed out.
That idea lived in my head for quite some time, not too sure how long but certainly months, until one day I noticed
an experimental feature in Puma: fork_worker.
Someone had identified the same issue, or at least a very similar one, and came up with an interesting idea.
It would initially start Puma in a normal way, with the cluster process overseeing its workers, but after a while you could trigger a mechanism that would cause all workers except the first one to shut down, and be replaced not by forking from the cluster process, but from the remaining worker.
So in terms of process hierarchy, you’d go from:
10000   \_ puma 4.3.3 (tcp://0.0.0.0:9292) [puma]
10001       \_ puma: cluster worker 0: 10000 [puma]
10002       \_ puma: cluster worker 1: 10000 [puma]
10003       \_ puma: cluster worker 2: 10000 [puma]
10004       \_ puma: cluster worker 3: 10000 [puma]
To:
10000   \_ puma 4.3.3 (tcp://0.0.0.0:9292) [puma]
10001       \_ puma: cluster worker 0: 10000 [puma]
10005           \_ puma: cluster worker 1: 10000 [puma]
10006           \_ puma: cluster worker 2: 10000 [puma]
10007           \_ puma: cluster worker 3: 10000 [puma]
I found the solution quite brilliant, rather than trying to exercise code paths in some automated way, just let live traffic do it and then share that state with other workers. Simple.
But I had a major reservation about that feature: if you use it, you end up with 3 levels of processes, and as I explained in my post about how guardrails are important, if anything goes wrong, I want to be able to terminate any worker safely.
In this case, what happens if worker 0 is terminated or crashes by itself? Other workers end up orphaned, which in POSIX
means that they’ll be adopted by the PID 1, AKA the init process, not the Puma cluster process and that’s a major resiliency issue,
as Puma needs the workers to be its direct children for various things.
For this to be resilient, you’d need to fork these workers as siblings, not children, and that’s just not possible.
I really couldn’t reasonably consider deploying Shopify’s monolith this way, it would for sure bite us hard soon enough. Yet, I was really curious about how effective it could be, so I set up an experiment to have a single container in the canary environment use Puma with this feature enabled for a while, and it performed both fantastically and horribly.
Fantastically because the memory gains were absolutely massive, and horribly because the newly spawned workers started
raising errors from the grpc gem.
Errors that I knew relatively well because they came from a safety check added a few years prior in the grpc gem by one of my coworkers
to prevent grpc from deadlocking in the presence of fork.
In addition to my reservations about process parenting, it was also clear that making the grpc gem fork-safe would
be almost impossible.
So I shoved that idea in the drawer with all the other good ideas that will never be and moved on.
Until one day, I’m not too sure how long after, I was searching for a solution to a different problem, in the
prctl(2) manpage, and I stumbled upon the PR_SET_CHILD_SUBREAPER
constant.
If set is nonzero, set the “child subreaper” attribute of the calling process; if set is zero, unset the attribute.
A subreaper fulfills the role of init(1) for its descendant processes. When a process becomes orphaned (i.e., its immediate parent terminates), then that process will be reparented to the nearest still living ancestor subreaper.
This was exactly the feature I didn’t know existed and didn’t know I wanted, to make Puma’s experimental feature more robust.
If you’d enable PR_SET_CHILD_SUBREAPER on the Puma cluster process, the worker 0 would be able to spawn siblings
by doing the classic daemonization procedure: forking a grandchild, and orphaning it.
This would cause the new worker to be reparented to the Puma cluster process, effectively allowing you to fork a sibling.
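Here is a minimal sketch of that trick in plain Ruby. It is an illustration under assumptions rather than how Puma or any other server implements it: it is Linux-only, it reaches prctl(2) through Fiddle since Ruby has no built-in binding for it, the constant value 36 comes from <linux/prctl.h>, and the method names are hypothetical:

```ruby
require "fiddle"

# Value of PR_SET_CHILD_SUBREAPER in <linux/prctl.h>.
PR_SET_CHILD_SUBREAPER = 36

# Mark the current process as a child subreaper (Linux only).
# Returns true on success.
def enable_child_subreaper
  prctl = Fiddle::Function.new(
    Fiddle::Handle::DEFAULT["prctl"],
    [Fiddle::TYPE_INT] + [Fiddle::TYPE_LONG] * 4,
    Fiddle::TYPE_INT,
  )
  prctl.call(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0).zero?
end

# Classic double fork: the intermediate process exits right away, which
# orphans the grandchild. Returns [grandchild_pid, grandchild_parent_pid]
# as observed by the grandchild once it has been reparented.
def fork_sibling
  reader, writer = IO.pipe
  middle_pid = fork do
    reader.close
    middle = Process.pid
    fork do
      # Wait until the intermediate process is gone and we got a new parent.
      sleep(0.01) while Process.ppid == middle
      writer.puts("#{Process.pid} #{Process.ppid}")
      writer.close
      exit!(0)
    end
    exit!(0)
  end
  writer.close
  Process.wait(middle_pid)
  pids = reader.gets.split.map { |number| Integer(number) }
  reader.close
  pids
end
```

If enable_child_subreaper succeeded in the calling process, the parent PID reported by fork_sibling is the caller’s own PID rather than 1, and the caller can Process.wait on the new process like on any other direct child.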
Additionally, at that point, we were running YJIT in production, which made our memory usage situation noticeably worse, so we had to use tricks to enable it only on a subset of workers.
By definition, JIT compilers generate code at runtime, that is a lot of memory that can’t be in shared pages. If I could make this idea work in production, that would allow JITed code to be shared, making the potential savings even bigger.
So I then proceeded to spend the next couple weeks prototyping.
I both tried to improve Puma’s feature and also to add the feature to Unicorn to see which would be the simplest.
It is probably in big part due to my higher familiarity with Unicorn, but I found it easier to do in Unicorn, and proceeded to send a patch to the mailing list.
The first version of the patch actually didn’t use PR_SET_CHILD_SUBREAPER because it is a Linux-only feature, and Unicorn
supports all POSIX systems.
Instead, I built on Unicorn’s zero-downtime restart functionality: I’d fork a new master process, proceed to shut down
the old one, and replace the pidfile.
To help you picture it better, starting from a classic Unicorn process tree:
PID   Proctitle
1000  \_ unicorn master
1001      \_ unicorn worker 0
1002      \_ unicorn worker 1
1003      \_ unicorn worker 2
1004      \_ unicorn worker 3
Once you trigger reforking, the worker starts to behave like a new master:
PID   Proctitle
1000  \_ unicorn master
1001      \_ unicorn master, generation 2
1002      \_ unicorn worker 1
1003      \_ unicorn worker 2
1004      \_ unicorn worker 3
Then the old and new master processes would progressively shut down and spawn their workers respectively:
PID   Proctitle
1000  \_ unicorn master
1001      \_ unicorn master, generation 2
1005          \_ unicorn worker 0, generation 2
1006          \_ unicorn worker 1, generation 2
1003      \_ unicorn worker 2
1004      \_ unicorn worker 3
Until the old master has no workers left, at which point it exits.
This approach had the benefit of working on all POSIX systems, however, it was very brittle and required launching Unicorn in daemonized mode, which isn’t what you want in containers and most modern deployment systems.
I was also relying on creating named pipes in the file system to allow the master process and workers to have a communication pipe, which really wasn’t elegant at all.
But that was enough to send a patch and get some feedback on whether such a feature was desired upstream, as well as feedback on the implementation.
In Unicorn, the master process has to be able to communicate with its workers, for instance, to ask them to shut down, this sort of thing.
The easiest way to do inter-process communication is to send a signal, but it limits you to just a few predefined signals, many of which already have a meaning. In addition, signals are handled asynchronously, so they tend to interrupt system calls and can generally conflict with the running application.
So what Unicorn does is implement “soft signals”. Instead of sending real signals, before spawning each worker, it creates a pipe, and the children look for messages from the master process in between processing two requests.
Here’s a simplified example of how it works.
def spawn_worker
  read_pipe, write_pipe = IO.pipe
  child_pid = fork do
    write_pipe.close
    loop do
      # IO.select returns [readable, writable, errored], we only need the first
      readable, = IO.select([read_pipe, @server_socket])
      readable.each do |io|
        if io == read_pipe
          # handle commands sent by the parent process in the pipe
        else
          # handle HTTP request
        end
      end
    end
  end
  read_pipe.close
  [child_pid, write_pipe]
end
The master process keeps the writing end of the pipe, and the worker the reading end.
Whenever it is idle, a worker waits for either the command pipe or the HTTP socket to have something to read using
either epoll, kqueue or select. In this example, I just use Ruby’s provided IO.select, which is functionally equivalent.
With this in place, the Unicorn master always has both the PID and a communication pipe to all its workers.
But in my case, I wanted the master to be able to know about workers it didn’t spawn itself. For the PID, it wasn’t that hard, I could just create a second pipe, but in the opposite direction, so that workers would be able to send a message to the master to let it know about the new worker PID. But how to establish the communication pipe with the grandparent?
That’s why my first prototype used named pipes, also known as FIFOs, which are exactly like regular pipes, except they are
exposed as files on the file system tree. This way the master could look for a named pipe at an agreed-upon location, and
have a way to send messages to its grandchildren. It worked, but as Unicorn’s maintainer pointed out in his feedback, there
was a much cleaner solution: socketpair(2) and
UNIXSocket#send_io.
First, socketpair(2) as its name implies creates two sockets that are connected to each other, so it’s very similar
to pipes but is bidirectional. Since I needed two-way communication between processes, that was simpler and cleaner than
creating two pipes each time.
But then, a little-known capability of UNIX domain sockets (at least I didn’t know about it), is that they allow you to pass file descriptors to another process. Here’s a quick demo in Ruby:
require 'socket'
require 'tempfile'

parent_socket, child_socket = UNIXSocket.socketpair
child_pid = fork do
  parent_socket.close
  # Create a file that doesn't exist on the file system
  # (the anonymous: option requires Ruby 3.4+)
  file = Tempfile.create(anonymous: true)
  file.write("Hello")
  file.rewind
  child_socket.send_io(file)
  file.close
end
child_socket.close

child_io = parent_socket.recv_io
puts child_io.read
Process.wait(child_pid)
In the above example, we have the child process create an anonymous file and share it with its parent through a UNIX domain socket.
With this new capability, I could make the design much less brittle. Now when a new worker was spawned, it could send a message to the master process with all the necessary metadata as well as an attached socket for direct communication with the new worker.
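To give an idea of what such a registration message could look like, here is a simplified sketch. It is not the actual protocol: the method names and the JSON payload are made up for the illustration, and the worker is forked only once rather than double forked, to keep the focus on the file descriptor passing:

```ruby
require "socket"
require "json"

# Runs in the process that forks workers: spawn a worker, then hand the
# monitor both the worker's metadata and a socket connected to that worker.
def spawn_and_register_worker(link_to_monitor, number)
  monitor_side, worker_side = UNIXSocket.socketpair
  pid = fork do
    monitor_side.close
    link_to_monitor.close
    # For the demo, the worker answers a single command, then exits.
    command = worker_side.gets(chomp: true)
    worker_side.puts("worker #{number} got #{command}")
    exit!(0)
  end
  worker_side.close
  # Pass the descriptor first, then the metadata describing it.
  link_to_monitor.send_io(monitor_side)
  link_to_monitor.puts(JSON.generate(pid: pid, number: number))
  monitor_side.close
  pid
end

# Runs in the monitor: receive the worker's socket and its metadata.
def receive_worker(link_to_spawner)
  socket = link_to_spawner.recv_io(UNIXSocket)
  metadata = JSON.parse(link_to_spawner.gets)
  [socket, metadata]
end
```

The monitor ends up with a direct, bidirectional channel to a process it never forked itself, on top of knowing its PID.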
Thanks to Eric Wong’s suggestions, I started to have a much neater design based around PR_SET_CHILD_SUBREAPER but at that
point rather than continue to attempt to upstream that new feature in Unicorn, I chose to instead fork the project under
a different name for multiple reasons.
First, it became clear that several Unicorn features were hard to make work in conjunction with reforking. Not impossible, but it would have required quite a lot of effort, and ultimately it would induce a risk that I’d break Unicorn for some of its users.
Unicorn also isn’t the easiest project to contribute to. It has a policy of supporting very old versions of Ruby, many of them lacking features I wanted to use, and hard to install on modern systems, making debugging extra hard. It also uses neither Bundler nor most of the modern Ruby tooling, which makes it hard to contribute to for many people, has its own bash-based unit test framework, and accepts patches over a mailing list rather than some forge.
I wouldn’t go as far as to say Unicorn is hostile to outside contributions, as it’s not the intent, but in practice it kinda is.
So if I had to make large changes to support that new feature, it was preferable to do it as a different project, one that wouldn’t impact the existing user base in case of mistakes, and one I’d be in control of, allowing me to iterate and release quickly based on production experience.
That’s why I decided to fork. I started by removing many of Unicorn’s features that I believe aren’t useful in a modern
container-based world, and removing the dependency on kgio in favor of using the non-blocking IO APIs introduced in newer
versions of Ruby.
From that simplified Unicorn base I could more easily do a clean and robust implementation of the feature I wanted without having the constraint of not breaking features I didn’t need.
The nice thing when you start a new project is that you get to choose a name for it.
Initially, I wanted to continue the trend of naming Ruby web servers after animals and possibly marking the lineage with
Unicorn by naming it after another mythical animal.
So for a while, I considered naming the new project Dahu,
but ultimately I figured something with fork in the name would be more catchy.
Unfortunately, it’s very hard to find names on Rubygems that haven’t been taken yet, but I decided to send a mail to
the person who owned the pitchfork gem, which was long abandoned, and they very gracefully transferred the gem to me.
That’s how pitchfork was born.
Now that I could more significantly change the server, I decided to move the responsibility of spawning new workers out of the master process, which I renamed “monitor process” for the occasion.
In Unicorn, assuming you use the preload_app option to better benefit from Copy-on-Write, new workers are forked from
the master process, but that master process never serves any request, so all the application code it loaded is never called.
In addition, if you are running in a container, you can’t reasonably replace the initial process.
What I did instead is that Pitchfork’s monitor process never loads the application code, instead it gives that responsibility to the first child it spawns: the “mold”. That mold process is responsible for loading the application, and spawning new workers when ordered to do so by the “monitor” process. The process tree initially looks like this:
PID   Proctitle
1000  \_ pitchfork monitor
1001      \_ pitchfork mold
Then, once the mold is fully booted, the monitor sends requests to spawn workers, which the mold does using the classic double fork:
PID   Proctitle
1000  \_ pitchfork monitor
1001      \_ pitchfork mold
1002          \_ pitchfork init-worker
1003              \_ pitchfork worker 0
Once the init-worker process exits, worker 0 becomes an orphan and is automatically reparented to the monitor:
PID   Proctitle
1000  \_ pitchfork monitor
1001      \_ pitchfork mold
1003      \_ pitchfork worker 0
Since all workers and the mold are at the same level, whenever we decide to do so, we can declare that a worker is now the new mold, and respawn all other workers from it:
PID   Proctitle
1000  \_ pitchfork monitor
1001      \_ pitchfork mold <exiting>
1003      \_ pitchfork mold, generation 2
1005      \_ pitchfork worker 0, generation 2
1007      \_ pitchfork worker 1, generation 2
All of this of course being done progressively, one worker at a time, to avoid significantly reducing the capacity of the server.
After that, I turned my constant cache demo into a memory usage benchmark for Rack servers, and that early version of Pitchfork performed as well as I hoped.
Compared to Puma with 2 workers and 2 threads, Pitchfork configured with 4 processes would use half the memory:
$ PORT=9292 bundle exec benchmark/cow_benchmark.rb puma -w 2 -t 2 --preload
Booting server...
Warming the app with ab...
Memory Usage:
Single Worker Memory Usage: 207.5 MiB
Total Cluster Memory Usage: 601.6 MiB
$ PORT=8080 bundle exec benchmark/cow_benchmark.rb pitchfork -c examples/pitchfork.conf.minimal.rb
Booting server...
Warming the app with ab...
Memory Usage:
Single Worker Memory Usage: 62.6 MiB
Total Cluster Memory Usage: 320.3 MiB
Of course, this is an extreme micro-benchmark for demonstration purposes, and not indicative of the effect on any given real application in production, but it was very encouraging.
Writing a new server, and benchmarking it, is the fun and easy part, and you can probably spend months ironing it out if you so wish.
But it’s only once you attempt to put it in production that you’ll learn of all the mistakes you made and all the problems you didn’t think of.
In this particular case though, there was one major blocker I did know of, and that I did know I had to solve
before even attempting to put Pitchfork in production: my old nemesis, the grpc gem.
I have a very long history of banging my head against my desk trying to fix compilation issues in that gem, or figuring out leaks and other issues, so I knew making it fork-safe wouldn’t be an easy task.
To give you an idea of how much of a juggernaut it is, here’s a cloc report from the
source package, hence excluding tests, etc:
$ cloc --include-lang='C,C++,C/C++ Header' .
-----------------------------------------------------------------
Language          files      blank    comment       code
-----------------------------------------------------------------
C/C++ Header       1797      43802      96161     309150
C++                 983      35199      53621     261047
C                   463       9020       8835      81831
-----------------------------------------------------------------
SUM:               3243      88021     158617     652028
-----------------------------------------------------------------
Depending on whether you consider that headers are code or not, that is either significantly bigger than Ruby’s own source code, or about as big.
Here’s the same line count for ruby/ruby, excluding tests and default gems, for comparison:
$ cloc --include-lang='C,C++,C/C++ Header' --exclude-dir=test,spec,-test-,gems,trans,build .
------------------------------------------------------------------
Language          files      blank    comment       code
------------------------------------------------------------------
C                   304      51562      83404     315614
C/C++ Header        406       8588      32604      84751
------------------------------------------------------------------
SUM:                710      60150     116008     400365
------------------------------------------------------------------
And to that, you’d also need to add the google-protobuf gem that works hand in hand with grpc and is also quite
sizeable.
Because of that, rather than try to make grpc fork-safe, I first tried to see if I could instead eliminate that
problematic dependency, given that after all, it was barely used in the monolith. It was only used to call a single service.
Unfortunately, I wasn’t capable of convincing the team using that gem to move to something else.
I later attempted to find a way to make the library fork-safe, but I was forced to admit I wasn’t capable of it. All I managed to do was figure out that the Python bindings had optional support for fork safety behind an environment variable. That confirmed it was theoretically possible, but still beyond my capacities.
So I wasn’t happy about it, but I had to abandon the Pitchfork project. It just wasn’t viable as long as grpc remained
a dependency.
A few months later, a colleague who probably heard me cursing across the Atlantic Ocean asked if he could help.
Given that fork-safety was supported by the Python version of grpc, and that Shopify is a big Google Cloud customer
with a very high tier of support, he thought he could pull a few strings and get Google to implement it.
And he was right, it took a long time, probably something like six months, but
the grpc gem did end up gaining fork support.
And just like that, after being derailed for half a year, the Pitchfork project was back on track, so a big thanks to
Alexander Polcyn for improving grpc.
At that point, it was clear there were other issues than grpc, but I had some confidence I’d be able to
tackle them. Even without enabling reforking, it was advantageous to replace Unicorn with Pitchfork in production,
both to confirm that no bugs had been introduced in the HTTP and IO layers, and because it allowed us to remove
our dependency on kgio, unlocked compatibility with rack 3, and brought a few other small improvements.
So that was the first step.
Then, fixing the fork safety issues other than grpc took approximately another month.
The first thing I did was to simulate reforking on CI.
Every 100 tests or so, CI workers would refork the same way Pitchfork does. This uncovered fork-safety issues
in other gems, notably ruby-vips.
Luckily this gem wasn’t used much by web workers, so I devised a new strategy to deal with it.
Pitchfork doesn’t actually need all workers to be fork-safe, only the ones that will be promoted into the next mold.
So if some libraries cause workers to become fork unsafe once they’ve been used, like ruby-vips, but are very rarely called,
what we can do is mark the worker as no longer being allowed to be promoted.
If you are abusing this feature, you may end up with all workers marked as fork-unsafe, and no longer able to refork ever. But once I shipped Pitchfork in production, I did put some instrumentation in place to keep an eye on how often workers would be marked unsafe and it was very rare, so we were fine.
Once I managed to get a green CI with reforking on, I was still a bit worried about the application being fork-safe, because simulating reforking on CI was good for catching issues with dead threads, but didn’t do much for catching issues with inherited file descriptors.
In production, the problem with inheriting file descriptors mostly comes from multiple processes using the same file descriptor concurrently. But on CI, even with that reforking simulation, we’re always running a single process.
So I had to think of another strategy to ensure no file descriptors were leaking.
This led me to develop another Pitchfork helper: close_all_ios!.
The idea is relatively simple: after a refork happens, you can use ObjectSpace.each_object
to find all instances of IO and close them unless they’ve been explicitly marked as fork-safe with Pitchfork::Info.keep_io.
This isn’t fully reliable, as it can only catch Ruby-level IOs, and can’t catch file descriptors held in C extensions, but it still helped find numerous issues in gems and private code.
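To make the idea more concrete, here is a minimal sketch of such a sweep. It is not Pitchfork’s actual implementation: `KEPT_IOS`, `keep_io`, `close_unkept_ios!` and `sweep_in_child` are hypothetical names made up for this illustration, and the standard streams are skipped so that the demo can still report its result. The sweep runs in a forked child, as it would after a refork, so the parent’s own IOs are left alone.

```ruby
# Hypothetical registry of IOs that are known to be safe to inherit.
KEPT_IOS = {}.compare_by_identity

def keep_io(io)
  KEPT_IOS[io] = true
  io
end

# Close every Ruby-level IO that wasn't explicitly kept.
# Returns the number of IOs that were closed.
def close_unkept_ios!
  closed = 0
  ObjectSpace.each_object(IO) do |io|
    next if KEPT_IOS.key?(io)
    next if io.equal?(STDIN) || io.equal?(STDOUT) || io.equal?(STDERR)

    begin
      next if io.closed?
      io.close
      closed += 1
    rescue IOError, SystemCallError
      # Uninitialized or already broken IO: nothing to close.
    end
  end
  closed
end

# Run the sweep in a forked child and report whether only the
# unmarked pipe ended up closed.
def sweep_in_child
  pid = fork do
    kept_reader, kept_writer = IO.pipe.map { |io| keep_io(io) }
    leaked_reader, leaked_writer = IO.pipe
    close_unkept_ios!
    ok = !kept_reader.closed? && !kept_writer.closed? &&
         leaked_reader.closed? && leaked_writer.closed?
    exit!(ok ? 0 : 1)
  end
  _, status = Process.wait2(pid)
  status.success?
end

puts sweep_in_child
```

Like the real helper, this only sees IO objects that live on the Ruby heap, so a descriptor opened directly by a C extension would go unnoticed.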
Here’s one example in the mini_mime gem.
The gem is a small wrapper that allows querying flat files that contain information about mime types,
and to do that it would keep a read-only file, and seek into it:
def resolve(row)
  @file.seek(row * @row_length)
  Info.new(@file.readline)
end
Since seek and readline aren’t thread-safe, the gem would wrap all that in a global mutex.
The problem here is that on fork file descriptors are inherited, and file descriptors aren’t just a pointer to a file
or socket. File descriptors also include a cursor that is incremented when you call seek or read.
To make this fork safe you could detect that a fork happened, and reopen the file, but there’s actually a much better solution.
Rather than relying on seek + read, you can instead rely on pread(2),
which Ruby conveniently exposes in the IO class.
Instead of advancing the cursor like read, pread takes absolute offsets from the start of the file, which makes it
ideal to use in multi-threaded and multi-process scenarios:
def resolve(row)
  Info.new(@file.pread(@row_length, row * @row_length))
end
In addition to fixing the fork-safety issue in that gem, using pread also made it possible to remove the global mutex, making the gem faster.
Win-win.
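If you want to see the shared cursor for yourself, here is a small sketch that should run on any POSIX system. It is only a demonstration, not code from mini_mime: the file content, `ROW_LENGTH` and all the helper names are made up for the example. It uses sysseek rather than seek so that Ruby’s own buffering doesn’t get in the way.

```ruby
require "tempfile"

ROW_LENGTH = 6 # hypothetical fixed row size, newline included

def build_rows_file
  file = Tempfile.new("rows")
  file.write("row-0\nrow-1\nrow-2\n")
  file.flush
  file.sysseek(0) # unbuffered seek back to the start
  file
end

# Current position of the kernel-level cursor.
def cursor(file)
  file.sysseek(0, IO::SEEK_CUR)
end

# Let a forked child move the cursor, then read it back from the parent.
def cursor_after_child_seek(file, offset)
  pid = fork do
    file.sysseek(offset)
    exit!(0) # skip at_exit hooks in the child
  end
  Process.wait(pid)
  cursor(file)
end

# Same shape as the fixed `resolve`: absolute offset, no cursor movement.
def read_row(file, row)
  file.pread(ROW_LENGTH, row * ROW_LENGTH)
end

file = build_rows_file
p cursor_after_child_seek(file, 2 * ROW_LENGTH)
p read_row(file, 1)
p cursor(file)
```

The parent never seeks, yet its cursor ends up where the child left it, while the pread call reads the second row without moving that cursor at all.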
After a few more rounds of grepping the codebase and its dependencies for patterns that may be problematic, I started being confident enough to start manually triggering reforking in a single canary container.
To be clear, I was expecting some issues to be left, but I was out of ideas on how to catch any more of them and confident the most critical problems such as data corruption were out of the picture.
These manual reforks didn’t reveal any issues, except that I forgot to also prevent manual reforking once a worker had been marked as fork-unsafe, 🤦.
Since other than that it went well, I progressively enabled automatic reforking on more and more servers over the span of a few days, first 1%, then 10%, etc, with seemingly no problems. While doing that I was also trying multiple different reforking frequencies, to try to identify a good tradeoff between memory usage reduction and latency impact.
But one of the characteristics of the Shopify monolith, with so many engineers shipping changes every day, is that it’s deployed extremely frequently, as often as every 30 minutes, and with teams across the world, this never really stops except for a couple of hours at night, and a couple of days during weekends.
For the same reason that rebooting your computer will generally make whatever issue you had go away, redeploying a web application will generally hide various bugs that take time to manifest themselves. So over the years, doing this sort of infrastructure change, I learned that even when you think you succeeded, you might discover problems over the next weekend.
And in this case, that is what happened. On the night of Friday to Saturday, Site Reliability Engineers got paged because some application servers became unresponsive, with very high CPU usage.
Luckily I had a ton of instrumentation in place to help me tune reforking, so I was able to investigate this immediately on Saturday morning, and quickly identified some smoking guns.
The first thing I noticed is that on these nodes, the after_fork callbacks were taking close to a minute on average,
while they’d normally take less than a second. In that callback, we were mostly doing two things,
calling Pitchfork::Info.close_all_ios!, and eagerly reconnecting to datastores. So a good explanation for these spikes
would be an IO “leak”.
Hence I immediately jumped on a canary container to confirm my suspicion. The worker processes were fine, but the mold processes were indeed “leaking” file descriptors. I still have the logs from that investigation:
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:46 UTC 2023
155
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:47 UTC 2023
156
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:47 UTC 2023
157
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:48 UTC 2023
157
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:49 UTC 2023
158
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:49 UTC 2023
158
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:50 UTC 2023
159
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:51 UTC 2023
160
appuser@web-59bccbbd79-sgfph:~$ date; ls /proc/135229/fd | wc -l
Sat Sep 23 07:52:51 UTC 2023
160
I could see that the mold process was creating file descriptors at the rate of roughly one per second.
So I snapshotted the result of ls -lh /proc/<pid>/fd twice a few seconds apart, and used diff to see
which ones were new:
$ diff tmp/fds-1.txt tmp/fds-2.txt
130a131,135
> lrwx------ 1 64 Sep 23 07:54 215 -> 'socket:[10443548]'
> lrwx------ 1 64 Sep 23 07:54 216 -> 'socket:[10443561]'
> lrwx------ 1 64 Sep 23 07:54 217 -> 'socket:[10443568]'
> lrwx------ 1 64 Sep 23 07:54 218 -> 'socket:[10443577]'
> lrwx------ 1 64 Sep 23 07:54 219 -> 'socket:[10443605]'
> lrwx------ 1 64 Sep 23 07:54 220 -> 'socket:[10465514]'
> lrwx------ 1 64 Sep 23 07:54 221 -> 'socket:[10443625]'
> lrwx------ 1 64 Sep 23 07:54 222 -> 'socket:[10443637]'
> lrwx------ 1 64 Sep 23 07:54 223 -> 'socket:[10477738]'
> lrwx------ 1 64 Sep 23 07:54 224 -> 'socket:[10477759]'
> lrwx------ 1 64 Sep 23 07:54 225 -> 'socket:[10477764]'
> lrwx------ 1 64 Sep 23 07:54 226 -> 'socket:[10445634]'
...
These file descriptors were sockets. I went on and took a heap dump using rbtrace,
to see what the leak looked like from Ruby’s point of view:
...
5130070:{"address":"0x7f5d11bfff48", "type":"FILE", "class":"0x7f5d8bc9eec0", "fd":11, "memsize":248}
7857847:{"address":"0x7f5cd9950668", "type":"FILE", "class":"0x7f5d8bc9eec0", "fd":-1, "memsize":8440}
7857868:{"address":"0x7f5cd99511d0", "type":"FILE", "class":"0x7f5d81597280", "fd":4855, "memsize":248}
7857933:{"address":"0x7f5cd9951fb8", "type":"FILE", "class":"0x7f5d8bc9eec0", "fd":-1, "memsize":8440}
7857953:{"address":"0x7f5cd99523c8", "type":"FILE", "class":"0x7f5d81597280", "fd":4854, "memsize":248}
7858016:{"address":"0x7f5cd9952fd0", "type":"FILE", "class":"0x7f5d8bc9eec0", "fd":-1, "memsize":8440}
7858036:{"address":"0x7f5cd9953390", "type":"FILE", "class":"0x7f5d81597280", "fd":4853, "memsize":248}
...
Here "type":"FILE" corresponds to Ruby’s T_FILE base type, which encompasses all IO objects.
I then used harb to get some more context on these IO objects
and quickly got my answer:
harb> print 0x7f5cd9950668
0x7f5cd9950668: "FILE"
memsize: 8,440
retained memsize: 8,440
references to: [
0x7f5cc9c59158 (FILE: (null))
0x7f5cd71d8540 (STRING: "/tmp/raindrop_monitor_84")
0x7f5cc9c590e0 (DATA: mutex)
]
referenced from: [
0x7f5cc9c59158 (FILE: (null))
]
The /tmp/raindrop_monitor path hinted at one of our utility threads, which used to run in the Unicorn master process
and that I had moved into the Pitchfork mold process.
It uses the raindrops gem to connect to the server port and extract TCP statistics to estimate how many requests
are queued, hence producing a utilization metric of the application server.
Basically, it executes the following code in a loop, and makes the result accessible to all workers:
Raindrops::Linux.tcp_listener_stats("localhost:$PORT")
The problem here is that tcp_listener_stats opens a socket to get the TCP stats, but doesn’t close the socket, nor even return it to you. It leaves to the Ruby GC the responsibility of closing the file descriptor.
Normally, this isn’t a big deal, because GC should trigger somewhat frequently. But the Pitchfork mold process, or even the Unicorn master process, doesn’t do all that much work, hence allocates rarely. As a result, GC may only very rarely trigger, if at all, letting these objects, hence file descriptors, accumulate over time.
Then once a new worker had to be spawned, it would inherit all these file descriptors, and have to close them all, causing a lot of work for the kernel. That perfectly explained the observed issue and also explained why it would get worse over time. The reforking frequency wasn’t fixed: it was configured to be relatively frequent at first, and then less and less so, leaving increasingly more time for file descriptors to accumulate.
To fix that problem, I submitted a patch to Raindrops, to make it eagerly close these sockets, and applied the patch immediately on our systems, and the problem was gone.
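The difference between leaving descriptors to the GC and closing them eagerly can be sketched with plain pipes. This is not the Raindrops patch, just an illustration of the pattern: `leaky_probe`, `eager_probe`, `open_io_count` and `leaked_after` are hypothetical helpers, and GC is disabled during the measurement to mimic a process that almost never allocates.

```ruby
# Count the Ruby-level IO objects that are still open.
def open_io_count
  ObjectSpace.each_object(IO).count do |io|
    !io.closed?
  rescue IOError
    false # uninitialized IO
  end
end

# Leaves both ends of the pipe open until GC finalizes them.
def leaky_probe
  reader, writer = IO.pipe
  writer.write("stats")
  reader.readpartial(5)
end

# Same work, but the descriptors are released before returning.
def eager_probe
  reader, writer = IO.pipe
  writer.write("stats")
  reader.readpartial(5)
ensure
  reader&.close
  writer&.close
end

# How many open IOs pile up after `iterations` calls when GC never runs.
def leaked_after(iterations)
  GC.disable
  before = open_io_count
  iterations.times { yield }
  open_io_count - before
ensure
  GC.enable
end

p leaked_after(10) { leaky_probe }
p leaked_after(10) { eager_probe }
```

The leaky version keeps two descriptors alive per call until a GC cycle eventually finalizes them, whereas the version with ensure leaves nothing behind regardless of how often GC runs.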
What I find interesting here is that, in a way, this bug predated the Pitchfork migration. Sockets were already accumulating in Unicorn’s master process, it just didn’t have enough of an impact there for us to notice.
This wasn’t the only issue found in production, but it was the most impactful and is a good illustration of how reforking can go wrong.
Concurrently to ironing out reforking bugs, I spent a lot of time deploying various reforking settings, as it’s a bit of a balancing act.
Reforking and Copy-on-Write aren’t free. It sounds a bit magical when described, but this is a lot of work for the kernel.
Forking a process with which you share memory isn’t terribly costly, but after that, whenever a shared page has to be invalidated because either the child or the parent has mutated it, the kernel has to pause the process and copy the page over. So after you trigger a refork, you can expect some negative impact on the process latency, at least for a little while.
That’s why it can be hard to find the sweet spot. If you refork too often you’ll degrade the service latency, if you refork too infrequently, you’re not going to save as much memory.
For this sort of configuration, with lots of variables, I just tend to deploy multiple configurations concurrently, and graph the results to try to locate the sweet spot, which is exactly what I did here.
Ultimately I settled on a setting with fairly linear growth:
PITCHFORK_REFORK_AFTER="500,750,1000,1200,1400,1800,2000,2200,2400,2600,2800,...
The idea is that young containers are likely triggering various lazy initializations at a relatively fast rate, but that over time, as more and more of these have been warmed, invalidations become less frequent.
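As an illustration of how such a schedule can be consumed, here is a minimal sketch. The helper names are hypothetical, the default list is a short made-up one rather than our production value, and reusing the last threshold once the list is exhausted is an assumption of this sketch. The resulting array is the kind of value that would presumably be handed to the server’s reforking setting.

```ruby
# Parse a comma-separated list of request counts into integers.
def parse_refork_schedule(raw)
  raw.split(",").map { |value| Integer(value.strip) }
end

# Threshold to apply for a given generation (0-based).
# Assumption: once the list is exhausted, the last value keeps being reused.
def refork_threshold(schedule, generation)
  schedule.fetch(generation) { schedule.last }
end

# Hypothetical default, only used when the variable isn't set.
schedule = parse_refork_schedule(ENV.fetch("PITCHFORK_REFORK_AFTER", "500,750,1000"))
p refork_threshold(schedule, 0)
```

Using Integer() rather than to_i means a malformed entry raises at boot instead of silently turning into a threshold of zero.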
Back in 2023 I wrote a post that shared quite a few details on the results of reforking on Shopify’s monolith,
you can read it if you want more details, but in short, memory usage was reduced by 30%, and latency by 9%.
The memory usage reduction was largely expected, but the latency reduction was a bit of a nice surprise at first. If anything, I was hoping latency wouldn’t be degraded too much.
I had to investigate to understand how it was even possible.
One thing to know about how Unicorn and Pitchfork work is that, on Linux, they wait for incoming requests using the epoll system call.
Once a request comes in, the worker is woken up by the kernel and immediately calls accept to, well, accept the request.
This is a very classic pattern, that many servers use, but historically it suffered from a problem called the
“thundering herd problem”.
Assuming a fully idle server with 32 workers, all waiting on epoll, whenever a request would come in,
all 32 workers would be woken up, and all try to call accept, but only one of them would succeed.
This was a pretty big waste of resources, so in 2016, with the release of Linux 4.5, epoll gained a new flag: EPOLLEXCLUSIVE.
If this flag is set, the Linux kernel will only wake up a single worker when a request comes in. However, the feature doesn’t try to be fair or anything, it just wakes up the first worker it finds. And because of how the feature is implemented, it behaves a bit like a Last In First Out queue, in other words, a stack.
As a result, unless most workers are busy most of the time, what you’ll observe is that some workers will serve
disproportionately more requests than others. In some cases, I witnessed that worker 0 had processed over a thousand
requests while worker 47 had only seen a dozen requests.
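To get an intuition for how lopsided this gets, here is a toy model. It says nothing about how the kernel actually implements EPOLLEXCLUSIVE: it simply assumes idle workers sit on a stack, that a burst of k concurrent requests wakes the k workers closest to the top, and that they return to the stack in the same order. The traffic pattern is made up.

```ruby
# Toy model: idle workers sit on a stack, worker 0 on top. Each burst of
# `k` concurrent requests wakes the `k` workers closest to the top, and
# they go back to the stack in the same order once done.
def simulate_lifo(worker_count, bursts)
  counts = Array.new(worker_count, 0)
  idle = (0...worker_count).to_a.reverse # last element is the top of the stack

  bursts.each do |burst|
    woken = idle.pop([burst, worker_count].min)
    woken.each { |worker| counts[worker] += 1 }
    idle.concat(woken)
  end
  counts
end

# For comparison: a fair round-robin dispatcher over the same traffic.
def simulate_round_robin(worker_count, bursts)
  counts = Array.new(worker_count, 0)
  cursor = 0
  bursts.each do |burst|
    [burst, worker_count].min.times do
      counts[cursor] += 1
      cursor = (cursor + 1) % worker_count
    end
  end
  counts
end

# Hypothetical traffic: mostly single requests, with occasional small bursts.
bursts = [1, 1, 1, 2, 1, 1, 3] * 100
p simulate_lifo(4, bursts)
p simulate_round_robin(4, bursts)
```

With this traffic, the stack model has worker 0 serving 700 of the 1,000 requests while worker 3 serves none, whereas round-robin gives each worker exactly 250.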
Unicorn isn’t the only server impacted by that. Cloudflare engineers wrote a much more detailed post on how NGINX behaves the same way.
In Ruby’s case, this imbalance means that all these inline caches in the VM, all the lazy initialized code in the application, as well as YJIT, are much more warmed up in some workers than in others.
Because of all these caches, JIT, etc, a “cold” worker is measurably slower than a warmed-up one, and because of the balancing bias, workers are very unevenly warmed up.
However since the criteria for promoting a worker into the new mold is the number of requests it has handled, it’s almost always the most warmed-up worker that ends up being used as a template for the next generation of workers.
As a result, with reforking enabled, workers are much more warmed up on average, hence running faster. In my initial post about Pitchfork, I illustrated this by showing how much more JITed code workers had in containers where reforking was enabled compared to the ones without:

And more JITed code translates into faster execution and less time spent compiling hot methods.
As explained previously, the motivator for working on Pitchfork was reducing memory usage. Especially with the advent of YJIT, we were hitting some limits, and I wanted to solve that once and for all. But in reality, it would have been much less effort to just ask for more RAM on servers. RAM is quite cheap these days, and most hosting services will give you about 4GiB of RAM per core, which even for Ruby is plenty.
It’s only when working with very large monoliths that this becomes a bit tight. But even then, we could have relatively easily used servers with more RAM per core, and while it would have incurred extra cost, it probably wouldn’t have been too bad in the grand scheme of things.
It’s only after reforking fully shipped to production, that I started to understand its real benefits. Beyond the memory savings, the way the warmest worker is essentially “checkpointed” and used as a template means that whenever a small spike of traffic comes in, and workers that are normally mostly idle respond to that traffic, they do it noticeably faster than they used to.
In addition, when we were running Unicorn, we were keeping a close eye on worker terminations caused by request timeouts or OOM, because killing a Unicorn worker meant replacing a warm worker with a cold worker, hence it had a noticeable performance impact.
But since reforking was enabled, not only does this happen less often because OOM events are less common, but also the killed worker is now replaced with a fairly well-warmed-up one, with already a lot of JITed code and such.
And I now believe this is the true killer feature of Pitchfork, before the memory usage reduction.
This realization of how powerful checkpointing is, later led me to further optimize the monolith.
YJIT has this nice characteristic that it warms up quite fast and for relatively cheap. By that, I mean that it reaches its peak performance quickly, and doesn’t slow down normal Ruby execution too much while doing so.
However last summer, when I started testing Ruby 3.4.0-preview1 in production, I discovered a pretty major regression in YJIT compile time. The compiled code was still as fast if not faster, but YJIT was suddenly requiring 4 times as much CPU to do its compilation, which was causing large spikes of CPU utilization on our servers, negatively impacting the overall latency.
What happened is that the YJIT team had recently rewritten the register allocator to be smarter, but it also ended up being noticeably slower. This is a common tradeoff in JIT design, if you complexify the compiler, it may generate faster code, but degrade performance more while it is compiling.
I of course reported the issue to the YJIT team, but it was clear that this performance would not be reclaimed quickly, so it was complicated to keep the Ruby preview in production with such regression in it.
Until it hit me: why are we even bothering to compile this much?
If you think about it, we were deploying Pitchfork with 36 workers, and all 36 of them have YJIT enabled, so all of them compile new code when they discover new hot methods. So most methods, especially the hottest ones, are compiled 36 times.
But once one worker has served the 500 requests required to be promoted, all the code compiled by other workers is just thrown out of the window, it’s a huge waste.
Which gave me the idea, what if we only enabled YJIT in the worker 0? Thanks to the balancing bias induced by
EPOLLEXCLUSIVE, we already know it will most likely be the one to be promoted, and for the others, we can just
mark them as not fork-safe.
This is quite trivially done from the Pitchfork config:
after_worker_fork do |server, worker|
  if worker.nr == 0
    RubyVM::YJIT.enable
  else
    ::Pitchfork::Info.no_longer_fork_safe!
  end
end
Of course, once the first generation is promoted, YJIT is then enabled in all workers, but this helped tremendously to reduce the YJIT overhead soon after a deploy.
Here’s a graph that shows the distribution of system time around deploys. YJIT tends to make the system time spike
when warming up, because it calls mprotect frequently to mark pages as either executable or writable.
This causes quite a lot of load on the kernel.
The first spike is a deploy before I enabled this configuration, on the second spike the yellow line has the configuration enabled, while the green one still doesn’t have it.

While there is currently no way to turn YJIT back off once it has been enabled, we did experiment with such a feature for other reasons a few years ago. So there may be a case for bringing that feature back, as it would make it possible to keep YJIT compilation disabled in all workers but one, further reducing the overhead caused by YJIT’s warmup.
There are also a few other advanced optimizations that aren’t exclusive to Pitchfork but are facilitated by it, such as Out of Band Garbage Collection, but I can’t mention everything.
I never really intended Pitchfork to be more than a very opinionated fork of Unicorn, for very specific needs. I even wrote a long document essentially explaining why you probably don’t want to migrate to Pitchfork.
But based on issues open on the repo, some conference chatter, and a few DMs I got, it seems that a handful of companies either migrated to it or are currently working on doing so.
Unsurprisingly, these are mostly companies that used to run Unicorn and have relatively large monoliths.
However, the only public article about such migration I know of is in Japanese.
But it’s probably for the better, because while reforking is very powerful, as I tried to demonstrate in this post, fork-safety issues can lead to pretty catastrophic bugs that can be very hard to debug, hence it’s probably better left to teams with the resources and expertise needed to handle that sort of thing.
So I prefer to avoid any sort of Pitchfork hype.
That being said, I’ve also noticed some people simply interested in a modernized Unicorn, not intending to ever enable reforking, which I guess is a good enough reason to migrate.
At this point, after seeing all the performance improvements I mentioned, you may be thinking that Shopify must be pretty happy with its brand-new application server.
Well.
While Pitchfork was well received by my immediate team, my manager, my director, and many of my peers, the feedback I got from upper management wasn’t exactly as positive:
reforking is a hack that I think is borderline abdication of engineering responsibilities, so this won’t do
Brushing aside the offensiveness of the phrasing, it may surprise you to hear that I do happen to, at least partially, agree with this statement.
This is why, before writing this post, I wrote a whole series on how IO-bound Rails applications really are, the current state of parallelism in Ruby, and a few other adjacent subjects, to better explain the tradeoffs currently at play when designing a Ruby web server.
I truly believe that today, Pitchfork’s design is what best answers the needs of a large Rails monolith, I wouldn’t have developed it otherwise. It offers true parallelism and faster JIT warmup, absurdly little time spent in GC, while keeping memory usage low and does so with a decent level of resiliency.
That being said, I also truly hope that tomorrow, Pitchfork’s design will be obsolete.
I do hope that in the future Ruby will be capable of true parallelism in a single process, be it via improved Ractors, or by progressively removing the GVL, I’m not picky.
But this is a hypothetical future. The very second it happens, I’ll happily work on Pitchfork’s successor, and slap a deprecation notice on Pitchfork.
That being said, I know I’m rarely the most optimistic person in the room, it’s in my nature, but I honestly can’t see this future happening in the short term. Maybe in 2 or 3 years, certainly not before.
Because it’s not just about Ruby itself, it’s also about the ecosystem. Even if Ractors were perfectly usable tomorrow morning, tons of gems would need to be adapted to work in a Ractor world. This would be the mother of all yak-shaves.
Trust me, I’ve done my fair share of yak-shaves in the past. When Ruby 2.7 started throwing keyword deprecation warnings I took it upon myself to fix all these issues in Shopify’s monolith and all its dependencies, which led me to open over a hundred pull requests on open-source gems, trying to reach maintainers, etc. And again recently with frozen string literal, I submitted tons of PRs to fix lots of gems ahead of Ruby 3.4’s release.
All this to say, I’m not scared of yak-shaves, but making an application like Shopify’s monolith, including its dependencies, Ractor compatible requires an amount of work that is largely beyond what you may imagine. And more than work, an ecosystem like Ruby’s needs time to adapt to new features like Ractors. It’s not just a matter of throwing more engineers at the problem.
In the meantime, reforking may or may not be a hack, I don’t really care. What is important to me is that it solves some real problems, and it does so today.
Of course, it’s not perfect, there are several common complaints it doesn’t solve, such as still requiring more
database connections than what would be possible with in-process parallelism.
But I don’t believe it’s a problem that can be reasonably solved today with a different server design that doesn’t mostly
rely on fork, and trying to do so now would be putting the cart before the horse.
An engineer’s responsibility is to solve problems while considering the limitations imposed by practicality.
As such, I believe Pitchfork will continue to do fine for at least a few more years.
Years later, John Hawthorn figured out how to do it with perf to great effect. ↩
Since I explained what inline caches are multiple times in the past, I’ll just refer you to Optimizing JSON, Part 2. ↩
When Ractors were announced 4 or 5 years ago, many people expected we’d quickly see a Ractor-based web server, some sort of Puma but with Ractors instead of threads. Yet this still hasn’t happened, except for a few toy projects and experiments.
Since this post series is about giving context to Ruby HTTP servers design constraints, I think it makes sense to share my view on Ractors viability.
The core idea of Ractors is relatively simple, the goal is to provide a primitive that allows true in-process parallelism, while still not fully removing the GVL.
As I mentioned in depth in a previous post, operating without a GVL would require synchronization (mutexes) on every mutable object that is shared between threads. Ractors’ solution to that problem is not to allow sharing of mutable objects between Ractors. Instead, they can send each other copies of objects, or in some cases “move” an object to another Ractor, which means they can no longer access it themselves.
This isn’t unique to Ruby, it’s largely inspired by the Actor model, like the Ractor name suggests, and many languages in the same category as Ruby have a similar construct or are working on one. For instance, JavaScript has Web Workers, and Python has been working on subinterpreters for a while.
And it’s no surprise because it makes total sense from a language evolution perspective. If you have a language that has prevented in-process parallelism for a long time, a Ractor-like API allows you to introduce (constrained) parallelism in a way that isn’t going to break existing code, without having to add mutexes everywhere.
But even in languages that have free threading, shared mutable state parallelism is seen as a major foot gun by many, and message-passing parallelism is often deemed safer, for instance, channels in Go, etc.
Applied to Ruby, this means that instead of having a single Global VM Lock that synchronizes all threads, you’d instead have many Ractor Locks, that each synchronize all threads that belong to a given Ractor. So in a way, since the Ruby 3.0 release that introduced Ractors, on paper the GVL is somewhat already gone, even though as we’ll see later, it’s more subtle than that.
And this can easily be confirmed experimentally with a simple test script:
require "benchmark"
Warning[:experimental] = false

def fibonacci(n)
  if n == 0 || n == 1
    n
  else
    fibonacci(n - 1) + fibonacci(n - 2)
  end
end

def synchronous_fib(concurrency, n)
  concurrency.times.map do
    fibonacci(n)
  end
end

def threaded_fib(concurrency, n)
  concurrency.times.map do
    Thread.new { fibonacci(n) }
  end.map(&:value)
end

def ractor_fib(concurrency, n)
  concurrency.times.map do
    Ractor.new(n) { |num| fibonacci(num) }
  end.map(&:take)
end

p [:sync, Benchmark.realtime { synchronous_fib(5, 38) }.round(2)]
p [:thread, Benchmark.realtime { threaded_fib(5, 38) }.round(2)]
p [:ractor, Benchmark.realtime { ractor_fib(5, 38) }.round(2)]
Here we use the Fibonacci function as a classic CPU-bound workload and benchmark it in 3 different ways. First without any concurrency, just serially, then concurrently using 5 threads, and finally concurrently using 5 Ractors.
If I run this script on my machine, I get these results:
[:sync, 2.26]
[:thread, 2.29]
[:ractor, 0.68]
As we already knew, using threads for CPU-bound workloads doesn’t make anything faster because of the GVL, however using Ractors we can benefit from some parallelism. So this script proves that, at least to some extent, the Ruby VM can execute code in parallel, hence the GVL is not so global anymore.
But as always, the devil is in the details.
Running a pure function like fibonacci, that only deals with immutable integers, in parallel is one thing; running
a full-on web application, with hundreds of gems and a lot of global state, in parallel is another.
Where Ruby’s Ractors differ significantly from most similar features in other languages is that they share the global namespace with one another.
To create a WebWorker in JavaScript, you have to provide an entry script:
myWorker = new Worker("worker.js")
WebWorkers are created from a blank slate and have their own namespace, they don’t automatically inherit all the constants defined by the caller.
Similarly, Python’s sub-interpreters as defined in PEP 734, start with a clean slate.
So both JavaScript’s WebWorker and Python’s sub-interpreters have very limited sharing capabilities and are more akin to light subprocesses, but with an API that allows passing objects to each other without needing to serialize them.
Ruby’s Ractors are more ambitious than that. From a secondary Ractor, you have visibility on all the constants and methods defined by the main Ractor:
INT = 1
Ractor.new do
p INT # prints 1
end.take
But since Ruby cannot allow concurrent access to mutable objects, it has to limit this in some way:
HASH = {}
Ractor.new do
p HASH # Ractor::IsolationError
# can not access non-shareable objects in constant Object::HASH by non-main Ractor.
end.take
So all objects are divided into shareable and unshareable objects, and only shareable ones can be accessed by secondary ractors. In general, objects that are frozen, or inherently immutable are shareable as long as they don’t reference a non-shareable object.
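The rule above can be checked without spawning any Ractor, using Ractor.shareable? and Ractor.make_shareable. This is only a small sketch; the sample values are arbitrary:

```ruby
# Inherently immutable values are shareable out of the box.
int_shareable = Ractor.shareable?(1)
sym_shareable = Ractor.shareable?(:foo)

# A mutable Hash is not.
hash_shareable = Ractor.shareable?({})

# Freezing the container is not enough if it still references
# a mutable object (here, an unfrozen String).
shallow = [String.new("a")].freeze
shallow_shareable = Ractor.shareable?(shallow)

# Ractor.make_shareable freezes the whole object graph.
deep = Ractor.make_shareable([String.new("a")])
deep_shareable = Ractor.shareable?(deep)

p int_shareable, sym_shareable, hash_shareable, shallow_shareable, deep_shareable
```

Only the first two and the last check are expected to report a shareable object.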
In addition, some other operations, such as assigning class instance variables aren’t allowed from any ractor other than the main one:
Ractor.new do
  class Foo
    class << self
      attr_accessor :bar
    end
  end
  Foo.bar = 1 # Ractor::IsolationError
  # can not set instance variables of classes/modules by non-main Ractors
end.take
So Ractors’ design is a bit of a double-edged sword.
On one hand, by having access to all the loaded constants and methods, you don’t have to load the same code multiple
times, and it’s easier to pass complex objects from one ractor to the other, but it also means that not all code may be
able to run from a secondary ractor.
Actually, a lot of, if not most, existing Ruby code can’t run from a secondary Ractor.
Something as mundane as accessing a constant that is technically mutable, like a String or Hash, will raise an IsolationError, even if you never attempt to mutate it.
Something as mundane and idiomatic as having a constant with some defaults is enough to make your code not Ractor compatible, e.g.:
class Something
  DEFAULTS = { config: 1 } # You'd need to explicitly freeze that Hash.
  def initialize(options = {})
    @options = DEFAULTS.merge(options) # => Ractor::IsolationError
  end
end
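For comparison, here is a sketch of the Ractor-compatible variant of that class. It assumes the defaults only contain immutable values (symbols and integers here); with nested mutable values, Ractor.make_shareable would be needed instead of a plain freeze. The attr_reader is only added so the result can be inspected:

```ruby
class Something
  # Frozen Hash with only immutable keys and values: shareable.
  DEFAULTS = { config: 1 }.freeze

  attr_reader :options

  def initialize(options = {})
    # Hash#merge returns a new, unfrozen Hash, so the instance
    # can still hold its own mutable options.
    @options = DEFAULTS.merge(options)
  end
end

p Ractor.shareable?(Something::DEFAULTS)
p Something.new(extra: 2).options
```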
That’s one of the main reasons why a Ractor-based web server isn’t really practical for anything more than a trivial application.
If you take Rails as an example, there is quite a lot of legitimate global state, such as the routes, the database schema cache, or the logger. Some of it could probably be frozen to be accessible by secondary ractors, but for things like the logger, the Active Record connection pool, and various caches, it’s tricky.
To be honest, I’m not even sure how you could implement a Ractor-safe connection pool with the current API, but I may be missing something. Actually, that’s probably a good illustration of the problem, so let’s try to implement a Ractor-compatible connection pool.
The first challenge is that you’d need to be able to move connections from one ractor to another, something like:
require "trilogy"
db_client = Trilogy.new
ractor = Ractor.new { receive.query("SELECT 1") }
ractor.send(db_client, move: true)
p ractor.take
If you try that you’ll get a can not move Trilogy object. (Ractor::Error).
This is because as far as I’m aware, there is no way for classes implemented in C to define that they can be moved to
another ractor. Even the ones defined in Ruby’s core, like Time can’t:
Ractor.new{}.send(Time.now, move: true) # can not move Time object. (Ractor::Error)
The only thing C extensions can do is define that a type can be shared between Ractors once it is frozen, using the
RUBY_TYPED_FROZEN_SHAREABLE flag, but that wouldn’t make sense for a database connection.
A way around this is to encapsulate that object inside its own Ractor:
require "trilogy"
class RactorConnection
  def initialize
    @ractor = Ractor.new do
      client = Trilogy.new
      while args = Ractor.receive
        ractor, method, *args = args
        ractor.send client.public_send(method, *args)
      end
    end
  end
  def query(sql)
    @ractor.send([Ractor.current, :query, sql], move: true)
    Ractor.receive
  end
end
When we need to perform an operation on the object, we send a message telling it what to do, and give it our own ractor so it can send the result back.
It really is a huge hack, and perhaps there is a proper way to do this, but I don’t know of any.
Now that we have a “way” to pass database connections across ractors, we need to implement a pool. Here again, it is tricky, because by definition a pool is a mutable data structure, hence it can’t be referenced by multiple ractors.
So we somewhat need to use the same hack again:
class RactorConnectionPool
  def initialize
    @ractor = Ractor.new do
      pool = []
      while args = Ractor.receive
        ractor, method, *args = args
        case method
        when :checkout
          ractor.send(pool.pop || RactorConnection.new)
        when :checkin
          pool << args.first
        end
      end
    end
    freeze # so we're shareable
  end
  def checkout
    @ractor.send([Ractor.current, :checkout], move: true)
    Ractor.receive
  end
  def checkin(connection)
    @ractor.send([Ractor.current, :checkin, connection], move: true)
  end
end
CONNECTION_POOL = RactorConnectionPool.new
ractor = Ractor.new do
  db_client = CONNECTION_POOL.checkout
  result = db_client.query("SELECT 1")
  CONNECTION_POOL.checkin(db_client)
  result
end
p ractor.take.to_a # => [[1]]
I’m not going to go further, as this implementation is quite ridiculous, and I think this is enough to make my point.
For Ractors to be viable to run a full-on application in, Ruby would need to provide at least a few basic data structures that would be shareable across ractors, so that we can implement useful constructs like connection pools.
Perhaps some Ractor::Queue, maybe even some Ractor::ConcurrentMap, and more importantly, C extensions
would need to be able to make their types movable.
So while I don’t believe it makes sense to try to run a full application inside Ractors, I still think Ractors could be very useful even with their current limitations.
For instance, in my previous post about the GVL, I mentioned how some gems do have background threads, one example being statsd-instrument, but there are others like OpenTelemetry and such.
These gems all have a similar pattern, they collect information in memory, and periodically serialize and send it down the wire. Currently, this is done using a thread, which is sometimes problematic because the serialization part holds the GVL, hence can slow down the threads that are responding to incoming traffic.
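To make that pattern concrete, here is a minimal thread-based sketch. It is not how statsd-instrument or any other gem is actually implemented: BackgroundEmitter, its sink argument (anything responding to <<, standing in for a socket) and the flush interval are all hypothetical.

```ruby
require "json"

# Hypothetical emitter: events are buffered in memory, and a background
# thread periodically serializes and "sends" them in batches.
class BackgroundEmitter
  def initialize(sink, flush_interval: 0.1)
    @sink = sink
    @queue = Queue.new # thread-safe buffer
    @flush_interval = flush_interval
    @thread = Thread.new { run }
  end

  def increment(name, value = 1)
    @queue << { name: name, value: value }
  end

  # Stop accepting events, flush what is left, and wait for the thread.
  def close
    @queue.close
    @thread.join
  end

  private

  def run
    until @queue.closed? && @queue.empty?
      sleep(@flush_interval)
      batch = []
      batch << @queue.pop until @queue.empty?
      # Serialization is pure Ruby work: this is the part that holds the GVL.
      @sink << JSON.generate(batch) unless batch.empty?
    end
  end
end

sink = []
emitter = BackgroundEmitter.new(sink, flush_interval: 0.01)
emitter.increment("requests")
emitter.increment("errors", 2)
emitter.close
events = sink.flat_map { |payload| JSON.parse(payload) }
p events
```

The only shared state is the queue, and the caller never waits for a reply, which is what makes this shape a natural fit for message passing.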
This would be an excellent pattern for Ractors, as they’d be able to do the same thing without holding the main Ractor’s GVL and it’s mostly fire and forget.
I only mean this as an example I know well, I’m sure there’s more. The key point is that while Ractors in their current form can hardly be used as the main execution primitive, they can certainly be used for parallelizing lower-level functions inside libraries.
But unfortunately, in practice, it’s not really a good idea to do that today.
If you attempt to use Ractors, Ruby will display a warning:
warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
And that’s not an overstatement. As I’m writing this article, there are 74 open issues about Ractors. A handful are feature requests or minor things, but a significant part are really critical bugs such as segmentation faults, or deadlocks. As such, one cannot reasonably use Ractors for anything more than small experiments.
Another major reason not to use them even in these cases that are perfect for them, is that quite often, they’re not really running in parallel as they’re supposed to.
As mentioned previously, on paper, the true Global VM Lock is supposedly gone since the introduction of Ractors in Ruby 3.0 and instead, each ractor has its own “GVL”. But this isn’t actually true.
There are still a significant number of routines in the Ruby virtual machine that do lock all Ractors. Let me show you an example.
Imagine you have 5 million small JSON documents to parse:
# frozen_string_literal: true
require 'json'
document = <<~JSON
{"a": 1, "b": 2, "c": 3, "d": 4}
JSON
5_000_000.times do
JSON.parse(document)
end
Doing so serially takes about 1.3 seconds on my machine:
$ time ruby --yjit /tmp/j.rb
real 0m1.292s
user 0m1.251s
sys 0m0.018s
As unrealistic as this script may look, it should be a perfect use case for Ractor. In theory, we could spawn 5 Ractors, have each of them parse 1 million documents, and be done in 1/5th of the time:
# frozen_string_literal: true
require 'json'
DOCUMENT = <<~JSON
{"a": 1, "b": 2, "c": 3, "d": 4}
JSON
ractors = 5.times.map do
  Ractor.new do
    1_000_000.times do
      JSON.parse(DOCUMENT)
    end
  end
end
ractors.each(&:take)
But somehow, it’s over twice as slow as doing it serially:
/tmp/jr.rb:9: warning: Ractor is experimental, and the behavior may change in future versions of Ruby! Also there are many implementation issues.
real 0m3.191s
user 0m3.055s
sys 0m6.755s
What’s happening is that in this particular example, JSON has to acquire the true remaining VM lock for each key in the JSON document. With 4 keys, a million times, it means each Ractor has to acquire and release a lock 4 million times. It’s almost surprising it only takes 3 seconds to do so.
For the keys, it needs to acquire the GVL because it inserts string keys into a Hash, and as I explained in Optimizing Ruby’s JSON, Part 6, when you do that Ruby will look inside the interned string table to search for an equivalent string that is already interned.
I used the following Ruby pseudo-code to explain how it works:
class Hash
  def []=(key, value)
    if entry = find_entry(key)
      entry.value = value
    else
      if key.is_a?(String) && !key.interned?
        if interned_str = ::RubyVM::INTERNED_STRING_TABLE[key]
          key = interned_str
        elsif !key.frozen?
          key = key.dup.freeze
        end
      end
      self << Entry.new(key, value)
    end
  end
end
In the above example ::RubyVM::INTERNED_STRING_TABLE is a regular hash that could cause a crash if it was accessed
concurrently, so Ruby still acquires the GVL to look it up.
If you look at register_fstring in string.c
(fstring is the internal name for interned strings), you can see the very obvious RB_VM_LOCK_ENTER() and
RB_VM_LOCK_LEAVE() calls.
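The visible side of that behavior can be observed from plain Ruby. This is only a sketch: the last check relies on an implementation detail of CRuby, so I wouldn’t depend on it.

```ruby
key = String.new("name") # a mutable, non-interned string
hash = {}
hash[key] = 1
stored_key = hash.keys.first

# The caller's string is left untouched...
key_still_mutable = !key.frozen?
# ...while the Hash holds a frozen copy of it.
stored_is_frozen = stored_key.frozen?
stored_is_copy = !stored_key.equal?(key)

# A second Hash using an equal key goes through the same lookup.
# On recent CRuby versions both hashes are expected to end up
# referencing the very same interned String object.
other = {}
other[String.new("name")] = 2
same_object = stored_key.equal?(other.keys.first)

p key_still_mutable, stored_is_frozen, stored_is_copy, same_object
```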
As I’m writing this, there are 42 remaining calls to RB_VM_LOCK_ENTER() in the Ruby VM, many are very rarely hit and not
much of a problem, but this one demonstrates how even when you have what is a perfect use case for Ractors, besides their constraints,
it may still not be advantageous to use them yet.
In his RubyKaigi 2023 talk about the state of Ractors, Koichi Sasada who’s the main driving force behind them, mentioned that Ractors suffered from some sort of a chicken and egg problem. By his own admission, Ractors suffer from many bugs, and often don’t actually deliver the performance they’re supposed to, hence very few people use them enough to be able to provide feedback on the API, and I’m afraid that almost two years later, my assessment is the same on bugs and performance.
If Ractors’ bugs and performance problems were fixed, it’s likely that some of the provided feedback would lead to some of their restrictions being lifted over time. I personally don’t think they’ll ever have few enough restrictions for it to be practical to run a full application inside a Ractor, and hence for a Ractor-based web server to make sense, but who knows, I’d be happy to be proven wrong.
Ultimately, even if you are among the people who believe that Ruby should just try to remove its GVL for real rather than spend resources on Ractors, let me say that a large part of the work needed to make Ractors perform well, like a concurrent hash map for interned strings, is work that would be needed to enable free threading anyway, so it’s not wasted.
From time to time, either online or at conferences, I hear people complain about the lack of support for HTTP/2 in Ruby HTTP servers, generally Puma. And every time I do the same thing: I ask them why they want that feature, and so far nobody has had an actual use case for it.
Personally, this lack of support doesn’t bother me much, because the only use case I can see for it is wanting to expose your Ruby HTTP server directly to the internet without any sort of load balancer or reverse proxy, which I understand may seem tempting, as it’s “one less moving piece”, but is not really worth the trouble in my opinion.
If you are not familiar with the HTTP protocol and what’s different in version 2 (and even 3 nowadays), you might be surprised by this take, so let me try to explain what it is all about.
HTTP/2 started under the name SPDY in 2009, with multiple goals, but mainly to reduce page load latency by allowing browsers to download more resources faster. A major factor in page load time is that a page isn’t just a single HTTP request. Once your browser has downloaded the HTML page and starts parsing it, it will find other resources it needs to also download to render the page, be it stylesheets, scripts, or images.
So a page isn’t one HTTP request, but a cascade of them, and in the late 2000s, the number of resources on the average page kept going up. This bloat was in part offset by broadband getting better, but still, HTTP/1.1 wasn’t really adequate to download many small files quickly for a few reasons.
The first one is that RFC 2616, which specified HTTP/1.1, recommended that browsers maintain no more than two concurrent connections to a given domain:
8.1.4 Practical Considerations
Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy.
So if you can only request a single resource per connection, and are limited to two connections, even if you have a very large bandwidth, the latency to the server will have a massive impact on performance whenever you need to download more than a couple of resources.
Imagine you have an excellent 100Gb connection, but are trying to load a webpage hosted across the Atlantic Ocean.
The roundtrip time to that server (your ping) will probably be around 60ms. If you need to download 100 small resources through just two connections, it will take at least ping * (resources / connections), that is 60ms * (100 / 2), so 3 seconds, which isn’t great.
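The same back-of-the-envelope computation as a small Ruby helper. The method name and keyword arguments are made up for the illustration, and the model ignores bandwidth, connection setup and server time entirely:

```ruby
# Lower bound on load time when each connection can only fetch
# one resource per roundtrip.
def min_load_time_ms(rtt_ms:, resources:, connections:)
  rounds = (resources.to_f / connections).ceil
  rtt_ms * rounds
end

two_connections = min_load_time_ms(rtt_ms: 60, resources: 100, connections: 2)
six_connections = min_load_time_ms(rtt_ms: 60, resources: 100, connections: 6)
p two_connections # => 3000
p six_connections # => 1020
```

Even with three times as many connections, latency alone still costs about a second.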
That’s what made many frontend optimization techniques like asset bundling absolutely essential back then: they made a major difference in load time1. Similarly, some websites were using a technique called domain sharding, splitting assets across multiple domains to allow more concurrency.
In theory, even these two connections could have been used much more effectively by pipelining requests. RFC 2616 has an entire section about it, and that was one of the big features added in HTTP/1.1 compared to 1.0. The idea is simple: after sending your request, you don’t have to wait for the response before sending more requests. You can send 10 requests immediately before having received a single response, and the server will send the responses one by one, in order.
But in practice most browsers ended up disabling that feature by default because they ran into misbehaving servers, dooming the feature. It also wasn’t perfect, as you could experience head-of-line blocking. Since responses don’t have an identifier to map them to the request they’re the answer to, they have to be sent in order. If one resource is slow to generate, all the subsequent resources can’t be sent yet.
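Head-of-line blocking is easy to see with a toy model. The sketch below assumes the server works on all pipelined requests at once, and ignores transfer time; the method name and the timings are invented for the example:

```ruby
# ready_at: for each pipelined request, in request order, the time (in ms)
# at which its response is fully generated.
# Returns the time at which each response can actually be sent, given that
# responses must go out in request order.
def in_order_delivery(ready_at)
  delivered = 0
  ready_at.map { |ready| delivered = [delivered, ready].max }
end

ready_at = [500, 10, 10] # one slow resource, requested first
pipelined = in_order_delivery(ready_at)
p pipelined # => [500, 500, 500]
```

If responses carried an identifier and could be sent as soon as they are ready, the last two would have been delivered at 10ms instead of waiting behind the slow one.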
That’s why as early as 2008, browsers stopped respecting the two concurrent connection rule. Firefox 3 started raising the connection limit to 6 per domain, and most other browsers followed suit shortly after.
However, more concurrent connections isn’t an ideal solution, because TCP connections have a slow start. When you connect to a remote address, your computer doesn’t know if the link to that other machine can support 10 Gbit/s or only 56 kbit/s. Hence, to avoid flooding the network with tons of packets that will be dropped on the floor, it starts relatively slow and periodically increases the throughput until it receives packet loss notifications, at which point it knows it has more or less reached the maximum throughput the link can sustain.
That’s why persistent connections are a big deal, a freshly established connection has a much lower throughput than one that has seen some use.
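As a rough illustration of slow start, here is a deliberately simplified model: the window starts at 10 segments of 1460 bytes (common defaults, but assumptions here) and doubles every roundtrip, with no packet loss and no upper bound. Real TCP stacks are far more nuanced than this.

```ruby
# Number of roundtrips needed to push `bytes` through a fresh connection,
# if the amount sent per roundtrip doubles each time.
def round_trips_to_send(bytes, initial_segments: 10, mss: 1460)
  window = initial_segments * mss
  sent = 0
  trips = 0
  while sent < bytes
    sent += window
    window *= 2
    trips += 1
  end
  trips
end

small = round_trips_to_send(14_600)
medium = round_trips_to_send(100_000)
large = round_trips_to_send(1_000_000)
p small  # => 1
p medium # => 3
p large  # => 7
```

In this model, a fresh connection needs 7 roundtrips to deliver 1MB, regardless of how much bandwidth is actually available.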
So by multiplying the number of connections, you can download more resources faster, but it would be preferable if they were all downloaded from the same connection to not suffer as much from TCP slow start.
And that’s exactly the main thing HTTP/2 solved, by allowing multiplexing of requests inside a single TCP connection, solving the head-of-line blocking issue2.
It also did a few other things, such as mandating the use of encryption3, compressing request and response headers with HPACK, and “server push”, but multiplexing is really the big one.
So the main motivation for HTTP/2 is multiplexing, and over the Internet, especially mobile Internet with somewhat more spotty connections, it can have a massive impact.
But in the data center, not so much. If you think about it, the dominant factor in the computation we did above was the roundtrip time (ping) with the client. Unless your infrastructure is terribly designed, that roundtrip time between your server (say Puma) and its client (your load balancer or reverse proxy) should be extremely small, way under one millisecond, and totally dwarfed by the actual request render time.
When you are serving mostly static assets over the Internet, latency may be high and HTTP/2 multiplexing is a huge deal. But when you are serving application-generated responses over LAN (or even a UNIX socket), it won’t make a measurable difference.
In addition to the low roundtrip time, the connections between your load balancer and application server likely have a very long lifetime, hence don’t suffer from TCP slow start as much, and that’s assuming your operating system hasn’t been tuned to disable slow start entirely, which is very common on servers.
Another reason people may have wanted HTTP/2 all the way to the Ruby application server at one point was the “server push” capability.
The idea was relatively simple, servers were allowed to send HTTP resources to the client without being prompted for it. This way, when you request the landing page of a website, the server can send you all the associated resources up front so your browser doesn’t have to parse the HTML to realize it needs them and start to ask for it.
However, that capability was actually removed from the spec and nowadays all browsers have removed it because it was actually doing more harm than good. It turns out that if the browser already had these resources in its cache, then pushing them again would slow down the page load time.
People tried to find smart heuristics to know which resources may be in the cache or not, but in the end, none worked and the feature was abandoned.
Today it has been superseded by 103 Early Hints, which is a much simpler and more elegant spec, and is backward compatible with HTTP/1.1.
So there isn’t any semantic difference left between HTTP/1.1 and HTTP/2.
From a Rack application point of view, whether the request was
issued through an HTTP/2 or HTTP/1.1 connection makes no difference.
You can tunnel one into the other just fine.
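To illustrate, here is a bare-bones Rack-style application called by hand, without any server. The env hashes are hypothetical and heavily trimmed, and the exact SERVER_PROTOCOL string a given server reports for HTTP/2 is an assumption:

```ruby
# A Rack application is just an object responding to #call(env).
app = lambda do |env|
  body = "Hello from #{env["PATH_INFO"]}"
  [200, { "content-type" => "text/plain" }, [body]]
end

http1_env = {
  "REQUEST_METHOD" => "GET",
  "PATH_INFO" => "/",
  "SERVER_PROTOCOL" => "HTTP/1.1",
}
http2_env = http1_env.merge("SERVER_PROTOCOL" => "HTTP/2")

http1_response = app.call(http1_env)
http2_response = app.call(http2_env)
p http1_response == http2_response
```

The application only sees a method, a path, headers and a body; how they were framed on the wire has already been abstracted away by the time it is called.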
In addition to not providing much if any benefit over LAN, HTTP/2 adds some extra complexity.
First, the complexity of implementation: HTTP/2, while not crazy complicated, is still a largely binary protocol, so it’s much harder to debug.
But also the complexity of deployment. HTTP/2 is fully encrypted, so you need all your application servers to have a key and certificate. That’s not insurmountable, but it is an extra hassle compared to just using HTTP/1.1, unless of course for some reason you are required to use only encrypted connections even over LAN. Edit: The HTTP/2 spec doesn’t actually require encryption, only browsers and some libraries do, so you can do unencrypted HTTP/2 inside your datacenter.
So unless you are deploying to a single machine, hence don’t have a load balancer, bringing HTTP/2 all the way to the Ruby app server significantly complicates your infrastructure for little benefit.
And even if you are on a single machine, it’s probably better to leave that concern to a reverse proxy, which will also take care of serving static assets, normalizing inbound requests, and probably fending off at least some malicious actors.
There are numerous battle-tested reverse proxies such as Nginx, Caddy, etc., and they’re pretty simple to set up, so you might as well use these common middlewares rather than try to do everything in a single Ruby application.
But if you think a reverse proxy is too much complexity and you’d rather do without, there are now zero config solutions such as thruster. I haven’t tried it so I can’t vouch for it, but at least on paper it solves that need.
I think HTTP/2 is better thought of not as an upgrade over HTTP/1.1, but as an alternative protocol to more efficiently transport the same HTTP resources over the Internet. In a way, it’s similar to how HTTPS doesn’t change the semantics of the HTTP protocol, it only changes how it’s serialized over the wire.
So I believe handling HTTP/2 is better left to your infrastructure entry point, typically the load balancer or reverse proxy, for the same reason that TLS has been left to the load balancer or reverse proxy for ages. They have to decrypt and decompress the request to know what to do with it, why re-encrypt and re-compress it to forward it to the app server?
Hence, in my opinion, HTTP/2 support in Ruby HTTP servers isn’t a critically important feature. It would be nice to have for a few niche use cases, but overall, the lack of it isn’t hindering much of anything.
Note that I haven’t mentioned HTTP/3, but while the protocol is very different, its goals are largely the same as HTTP/2, so I’d apply the same conclusion to it.
Minifying and bundling still improve load time with HTTP/2, fewer requests and fewer bytes transferred are still positive, so they’re still useful, but it’s no longer critical to achieve a decent experience. ↩
At the HTTP layer at least, HTTP/2 still suffers from some forms of head-of-line blocking in lower layers, but it is beyond the scope of this post. ↩
The RFC doesn’t actually require encryption, but all browser implementations do. ↩