Low Level Bits 🇺🇦 on Low Level Bits 🇺🇦

Different ways to build LLVM/MLIR tools

Fri, 02 May 2025

This is a mirror of the Substack article “Different ways to build LLVM/MLIR tools”. The most recent version is there.

LLVM and MLIR frameworks are typically used to build compilers for various use cases, but I’m using the word “tools” here to cover a broader set of possibilities (compilers, language plugins, analyzers, etc.).

If you want to build such a tool, then you obviously need to somehow “connect” your code to LLVM or MLIR libraries.

In this article I’m not going to cover how to do the build itself (I believe there are plenty of great resources out there already), but rather focus on various ways to actually obtain those LLVM libraries and what kinds of features those options bring with them.

I’m also considering the simplest integration: CMake and C++, no fancy build systems, no fancy languages. Different build systems and languages would require different considerations.

Effectively, this article is organized as a table with different ways to get LLVM/MLIR on one axis, and various available features on another.

The actual table is at the very end.

Features

Here is a non-exhaustive list of different features that I consider important.
If you believe something is missing, please leave a comment.

(Fast) Build Times

Obviously, everyone wants to have fast build times. There are two slightly different angles to this story: if you decide to build LLVM from scratch, it will obviously take a long time. But even if you don’t build LLVM from scratch, you may still have to wait way too long because of static linking.

Also, building LLVM/MLIR from scratch without caching is going to be a huge bottleneck on the CI.

Debugging Experience

Once in a while things go south, so you need to debug not only your code, but also look into what’s “wrong” inside of LLVM.
What I mean here is not just having debug info and assertions enabled, but also facilities like -debug-only=.
One example from MLIR is debugging long conversion pipelines/pattern matching, when things don’t quite work the way you’d expect.

Testing Infrastructure

Both LLVM and MLIR heavily rely on integration testing using lit and FileCheck.
Unfortunately, neither of them is part of the “official distribution”. While the official lit can be installed as a separate Python package, for FileCheck your best bet is third-party solutions, which are actually pretty good starting points if you don’t need very advanced FileCheck features (e.g. mull-project/filecheck.py or AntonLydike/filecheck).

Bleeding Edge

This is also an important factor. As a starting point, you can just use whatever is available from your default OS package manager (e.g. apt or homebrew), but at some point you may need to pick something much newer due to bugfixes or new features.

Dynamic Linking

This is more of a niche feature, but it is very important if you are working on any kind of plugins, or if you don’t want to deal with long static linking time during development.

Different LLVM distributions

Here I’m considering more or less cross-platform solutions, so I’m not covering the Debian/Ubuntu-specific repo. That leaves us with three options: (semi-)official versions from an OS package manager, precompiled binaries (submitted by volunteers), and the BYOB (“bring your own build”) story.

(Semi-)official OS packages

These are the packages maintained by the OS maintainers and not necessarily by LLVM maintainers. These packages are the easiest way to start: just call apt/brew install llvm and you are done.

The packages come with dynamic libraries, which enables both fast build times and plugin support. The packages usually contain everything that is needed for testing, but they of course lack the debugging story.

The other inconvenience might be the age of the package: depending on the OS and its stability guarantees, the package might be way too old for your use case.
For LLVM it’s probably fine, but it gets trickier for MLIR as the APIs are less stable across the recent versions.

Precompiled packages

These packages are available as the release artifacts, for example 20.1.4 or 18.1.8.

On one hand, this is the most convenient way to get those binaries: the most recent binaries appear there just a few days after the official release.
On the other hand, some packages are prepared by volunteers, so some releases might be missing the build for your specific OS/version, and the presence of e.g. LLVM-20.1.4-Linux-X64.tar.xz build doesn’t guarantee compatibility with e.g. Ubuntu 20.04 due to the “old” glibc.

Just as with the official OS packages, the debugging story is not there: the packages are built in the release mode.

In general, these packages are kinda the “best effort”: if it works - great, if not - well, you are out of luck.

Build your own LLVM

This is obviously the most flexible approach: you can build any version/commit on any supported OS, you get the debugging facilities if you wish so, all the testing infrastructure is there, it’s your choice whether to use dynamic or static linking.

But of course the price is long build times, especially if you want to get more than just LLVM (e.g., MLIR or clang libraries).

Summary

As a conclusion, the exact option depends on your use case.
Just to start with, you can pick the official package available on your OS and then decide whether you need more.

If you need the newest version, then the precompiled packages from the LLVM releases page are your best bet, especially when it comes to CI integration.

However, at least at some point, you may consider building your own version of LLVM/MLIR libraries for local development, but still stick to the precompiled packages for CI checks.

To wrap it up, here is a table that sums it all up.

Building LLVM plugins with Bazel

Tue, 01 Apr 2025

This is a mirror of the Substack article “Building LLVM plugins with Bazel”. The most recent version is there.

One of the premises of Bazel is to provide reproducible, hermetic builds, thus you shouldn’t depend on whatever is installed on the host OS, and all the dependencies are typically managed by Bazel directly.

However, if you want to build plugins for LLVM (or any other project really), then you should link against the specific versions installed on the user’s system.

As I’m working on such a plugin, it’s been a long “dream” of mine to migrate to Bazel for the many benefits it provides. Over time, the existing build system (CMake) has grown its capabilities and I have certain requirements for how the builds should work. Namely:

the plugin must work on different versions of OS (Ubuntu 20.xx-24.xx, macOS)
the plugin must support different versions of LLVM, which are different on each OS (e.g., LLVM 12 on Ubuntu 20.04, LLVM 16, 17, 18 on Ubuntu 24.04 etc)
the plugin must be linking against the system libraries due to the ABI requirements
the build system should support multiple versions at the same time

None of these are necessarily hard or impossible with Bazel, but the devil is always in the details.

What follows is my take on solving this problem.

Source code is available here https://github.com/AlexDenisov/bazel-llvm-plugin.

Following Cunningham’s Law, I claim that there is no better way to do it.

Detecting available LLVM versions

Third-party dependencies in Bazel typically come in the form of external repositories, thus all supported LLVM versions must be defined in MODULE.bazel upfront. However, what happens if a version is not supported or not installed on the host OS? To handle this case, the repositories must be defined dynamically.

To do so, first we need to define a custom dynamic repository which will check which versions are installed on the host OS and store this information in a global variable available for later use by different parts of the build system:

# available_llvm_versions.bzl
def _is_macos(ctx):
    return ctx.os.name.find("mac") != -1

def llvm_path(ctx, version):
    if _is_macos(ctx):
        return "/opt/homebrew/opt/llvm@" + version
    return "/usr/lib/llvm-" + version

def _is_supported(repository_ctx, version):
    return repository_ctx.path(llvm_path(repository_ctx, version)).exists

def _llvm_versions_repo_impl(repository_ctx):
    available_versions = []
    for version in repository_ctx.attr.versions:
        if _is_supported(repository_ctx, version):
            available_versions.append(version)
    repository_ctx.file("llvm_versions.bzl",
        content = "AVAILABLE_LLVM_VERSIONS = " + str(available_versions),
    )
    repository_ctx.file(
        "BUILD",
        content = "",
    )

available_llvm_versions_repo = repository_rule(
    local = True,
    implementation = _llvm_versions_repo_impl,
    attrs = {
        "versions": attr.string_list(),
    },
)

def _available_llvm_versions_impl(module_ctx):
    versions = []
    for mod in module_ctx.modules:
        for data in mod.tags.detect_available:
            for version in data.versions:
                versions.append(version)
    available_llvm_versions_repo(name = "available_llvm_versions", versions = versions)

available_llvm_versions = module_extension(
    implementation = _available_llvm_versions_impl,
    tag_classes = {
        "detect_available": tag_class(attrs = {"versions": attr.string_list(allow_empty = False)}),
    },
)

The extension must then be registered in MODULE.bazel:

# MODULE.bazel
SUPPORTED_LLVM_VERSIONS = ["17", "18"]

available_llvm_versions = use_extension("//:bazel/available_llvm_versions.bzl", "available_llvm_versions")
available_llvm_versions.detect_available(versions = SUPPORTED_LLVM_VERSIONS)
use_repo(available_llvm_versions, "available_llvm_versions")

Defining LLVM repositories

Now that we know which versions are installed on the host system, we can define LLVM repositories which will expose libLLVM.so and all the needed headers.

This part requires a dynamic module extension which will either define a real repository, or will define a “fake” empty repo. This is needed so that all the repositories can be later defined in MODULE.bazel safely.

# llvm_repos.bzl
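load("@available_llvm_versions//:llvm_versions.bzl", "AVAILABLE_LLVM_VERSIONS")
load("@bazel_skylib//lib:modules.bzl", "modules")  # provides modules.use_all_repos; assumes a bazel_dep on bazel_skylib
load("@bazel_tools//tools/build_defs/repo:local.bzl", "new_local_repository")
load("//:bazel/available_llvm_versions.bzl", "llvm_path")  # helper from available_llvm_versions.bzl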

def _empty_repo_impl(repository_ctx):
    repository_ctx.file(
        "BUILD",
        content = "",
    )

empty_repo = repository_rule(
    local = True,
    implementation = _empty_repo_impl,
)

def _llvm_repos_extension(module_ctx):
    """Module extension to dynamically declare local LLVM repositories."""
    for mod in module_ctx.modules:
        for data in mod.tags.configure:
            for version in data.versions:
                llvm_repo_name = "llvm_" + version
                if version not in AVAILABLE_LLVM_VERSIONS:
                    empty_repo(name = llvm_repo_name)
                    continue

                path = llvm_path(module_ctx, version)
                new_local_repository(
                    name = llvm_repo_name,
                    path = path,
                    build_file = ":third_party/LLVM/llvm.BUILD"
                )

    return modules.use_all_repos(module_ctx)

llvm_repos = module_extension(
    implementation = _llvm_repos_extension,
    tag_classes = {"configure": tag_class(attrs = {"versions": attr.string_list()})},
)

Now we can tell Bazel that these repos are available for consumption:

# MODULE.bazel
SUPPORTED_LLVM_VERSIONS = ["17", "18"]

available_llvm_versions = use_extension("//:bazel/available_llvm_versions.bzl", "available_llvm_versions")
available_llvm_versions.detect_available(versions = SUPPORTED_LLVM_VERSIONS)
use_repo(available_llvm_versions, "available_llvm_versions")

llvm_repos = use_extension("//:bazel/llvm_repos.bzl", "llvm_repos")
llvm_repos.configure(versions = SUPPORTED_LLVM_VERSIONS)

[use_repo(llvm_repos, "llvm_%s" % v) for v in SUPPORTED_LLVM_VERSIONS]

Defining plugin targets

Now, the rest is rather trivial. We can define all the plugin libraries depending on the LLVM versions available on the host OS:

# src/BUILD
load("@available_llvm_versions//:llvm_versions.bzl", "AVAILABLE_LLVM_VERSIONS")
load("@rules_cc//cc:defs.bzl", "cc_binary")

[
    cc_binary(
        name = "llvm_plugin_%s" % llvm_version,
        srcs = [
            "plugin.cpp",
        ],
        linkshared = True,
        visibility = ["//visibility:public"],
        deps = [
            "@llvm_%s//:libllvm" % llvm_version,
        ],
    )
    for llvm_version in AVAILABLE_LLVM_VERSIONS
]

Defining test targets

Obviously, we must have tests for the plugin. This is also relatively trivial: we need to define a test case for each available LLVM version as well, thus producing N×M tests, where N is the number of tests and M is the number of LLVM versions.

# tests/BUILD
load("@available_llvm_versions//:llvm_versions.bzl", "AVAILABLE_LLVM_VERSIONS")
load("@bazel_itertools//lib:itertools.bzl", "itertools")
load("@pypi//:requirements.bzl", "requirement")
load("@rules_python//python:defs.bzl", "py_test")

[
    py_test(
        name = "%s_%s_test" % (test, llvm_version),
        srcs = ["lit_runner.py"],
        args = [ "-v", test],
        data = [
            requirement("lit"),
            ":lit.cfg.py",
            "@llvm_%s//:clang" % llvm_version,
            "@llvm_%s//:FileCheck" % llvm_version,
            "//src:llvm_plugin_%s" % llvm_version,
            test,
        ],
        main = "lit_runner.py",
        deps = [requirement("lit"), "@rules_python//python/runfiles"],
    )
    for (test, llvm_version) in itertools.product(
        glob(["*.c"]),
        AVAILABLE_LLVM_VERSIONS,
    )
]

Conclusion

With all the little pieces above, the builds are now completely transparent and smooth for the end user.

Full working example can be found here: https://github.com/AlexDenisov/bazel-llvm-plugin

Compiling Ruby. Part 5: exceptions

Fri, 22 Dec 2023

Call Stack, Stack Frames, and Program Counter

During program execution, a machine maintains a pointer to the instruction being executed. It’s called the Program Counter (or Instruction Pointer).

When you call a method (or send a message if we are speaking of Ruby), the program counter is set to the first instruction of the called function (callee). The program somehow needs to know how to get back to the call site once the “child” method has completed its execution.

This information is typically maintained using the concept of a Call Stack.

Consider the following program and its call stack on the right.

The call stack consists of Stack Frames. Whenever a function is called, a new stack frame is created and pushed onto the stack. When the called function returns, the stack frame is popped.

At every point, the call stack represents the actual Stack Trace.

The very top of the call stack represents the scope of the whole file, followed by the stack frame of the first function, followed by the second function, and so forth. In Ruby, the top function/file scope is referred to as simply top.

Now, imagine that we want to pass some information from the second function to the top. Some error or something exceptional happened, and this specific program state needs some special handling.

There are several limited ways to handle such a case: either return some special value up (thus, each function on the call stack should be aware of this), or use some global variable to communicate with the callers (e.g., errno in C), which again “pollutes” the business logic through the call stack.

One way to handle this problem more elegantly is to use particular language constructs - exceptions.

Instead of polluting the whole call stack, we can throw/raise an exception and then add special handling at the top, like in this picture:
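As a rough sketch of this idea (the method names first_function and second_function are hypothetical and only mirror the "first" and "second" functions from the text), the intermediate function needs no error handling at all:

```ruby
# Hypothetical names; only `top` knows how to deal with the failure.
def second_function
  raise "error"            # starts stack unwinding
end

def first_function
  second_function          # no error handling here: the exception passes through
  :not_reached
end

def top
  first_function
  "no error"
rescue RuntimeError => e   # special handling lives only at the top
  "rescued: #{e.message}"
end

puts top
```

The call to top evaluates to the string built in the rescue clause, because the exception skips the rest of first_function on its way up.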

Stack Unwinding

Now, the question is: How do we implement this feature? To answer it, let’s understand what needs to happen!

The program was in some specific state before it called the first function at the top. Now, the program is in another specific state around the raise "error" line in the second function.

We need to restore the state somehow as it was right before the first call and continue execution right after the rescue in top (by changing the program counter accordingly).

Conceptually, we can save the machine state before calling the first method and restore it later. The problem is that storing the state of the whole machine is too expensive and adds overhead by saving more than needed.

Instead, we can put the responsibility for maintaining the program state on the actual program developers.

Most languages provide useful features for dealing with this:

Ruby has explicit ensure blocks
Java has explicit finally statements
C++ has RAII and implicit destructors
(C has setjmp/longjmp, but we are only talking about useful features)

Here is how it works in the case of Ruby.

Whenever an exception is thrown, the program climbs up through the call stack and executes code from those finalizers until it reaches the exception handler.

This process is called Stack Unwinding.

I’m not a native speaker, but I’d say it should be called “Stack Winding”, but oh well

Here is an updated example with explicit state restoration during the stack unwinding.

Without executing code from the ensure block, the hypothetical lock would never be released, thus breaking the program in terrible ways.
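A minimal sketch of such state restoration, assuming the "lock" is just a Hash flag (a stand-in for a real lock) and using hypothetical method names:

```ruby
def second_function
  raise "error"
end

def first_function(lock)
  lock[:held] = true       # acquire the hypothetical lock
  second_function
ensure
  lock[:held] = false      # runs during unwinding, before the handler in top
end

def top(lock)
  first_function(lock)
  "no error"
rescue RuntimeError => e
  "rescued #{e.message}, lock held: #{lock[:held]}"
end

puts top({ held: false })
```

By the time the rescue clause in top runs, the ensure block in first_function has already released the lock.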

Exceptions in Ruby

Now, I can talk about different kinds of exceptions in Ruby. From my perspective, there are three different kinds:

actual raised exceptions
break statements
return statements

Both break and return statements have special meaning when used in the context of Procs.

Let me elaborate on all the three with the examples.

Normal Exceptions

Actual exceptions climb up the stack, calling finalizers until an exception handler is found.

These are the normal exceptions you are all familiar with.

`return`s from a block

return statements behave differently depending on the lexical scope they are part of.

Here is a little puzzle for you.

What will be printed on the screen:

return is called from within a block. You may expect the x * 4 to be returned from the block, but it’s returned from the enclosing function (lexical scope).

As you can see, return x * 4 would return from f instead of from the block.

The code prints

2: 8

instead of

1: 8
2: 42
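The following sketch reproduces this behavior. It is an illustration and not necessarily the original puzzle; call_block is a hypothetical helper that simply yields to its block:

```ruby
def call_block
  yield
end

def f(x)
  value = call_block { return x * 4 }  # returns from f, not just from the block
  puts "1: #{value}"                   # never reached
  42
end

puts "2: #{f(2)}"   # prints "2: 8"
```

The return unwinds through call_block and leaves f itself, so the line starting with "1:" is never printed and 42 is never returned.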

`break`s

Almost like returns, breaks allow returning from the enclosing function, but in a slightly different way.

This is the most complex example here. Let me write down the steps explicitly. You may want to open the picture in a separate tab to read it.

top calls the loop function and passes the block to it. The block is just another function under the hood; it’s presented separately here as the __anonymous_block.
Runtime creates a new stack frame for loop and puts it on the call stack.
loop calls the passed block (__anonymous_block).
Runtime creates a new stack frame for __anonymous_block and puts it on the stack.
The __anonymous_block increments i, checks for equality, and returns to loop, nothing special.
Runtime removes the __anonymous_block stack frame from the call stack.
loop’s stack frame is kept on the call stack, and the next iteration of while true calls the __anonymous_block again.
Runtime creates a new stack frame for __anonymous_block and puts it on the stack.
The __anonymous_block increments i, checks for equality, and invokes break.
The break initiates stack unwinding and returns from the enclosing function (loop). See the dashed line.
loop returns, thus bypassing the endless loop while true.

The break construct is effectively equivalent to the following code:
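One way to sketch this equivalence in plain Ruby is with catch/throw, which also unwinds the stack up to a well-defined point and carries a value. This is an approximation chosen for illustration, not the mruby implementation; my_loop and the :stop_my_loop tag are hypothetical names (my_loop is used to avoid redefining Kernel#loop):

```ruby
def my_loop
  while true
    yield
  end
end

def with_break
  i = 0
  result = my_loop do
    i += 1
    break i * 10 if i == 2   # leaves my_loop; the my_loop call evaluates to 20
  end
  [i, result]
end

def with_throw
  i = 0
  result = catch(:stop_my_loop) do   # the tag marks where unwinding stops
    my_loop do
      i += 1
      throw :stop_my_loop, i * 10 if i == 2
    end
  end
  [i, result]
end

p with_break
p with_throw
```

Both variants stop on the second iteration and produce the same value, which matches the idea that break is an unwinding with a destination and a payload.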

Implementation

All the language constructs described above (exceptions, returns and breaks within a block) behave similarly: they unwind the stack (calling the finalizers on the way up) and stop at some well-defined point.

They are implemented slightly differently in the original mruby runtime. Still, I implemented them all as exceptions, with returns and breaks being special exceptions: they need to carry a value and store information on where to stop the unwinding process.

The implementation from the LLVM perspective is covered in my recent talk at LLVM Social Berlin: Stack unwinding, landing pads, and other catches.

Here, I’ll mainly focus on the details from the mruby runtime perspective.

Consider the following example:

The blocks following rescue and ensure are called Landing Pads.

This example has two kinds of landing pads: catch (rescue) and cleanup (ensure). Catches are “conditional” landing pads: they will be executed only if the exception type matches their type. Note the last rescue: it doesn’t have any type attached, so it will just catch any exception.

Conversely, cleanups are “unconditional” - they will always run, but they will also forward the exception up to the next function on the call stack.

Another important detail in this example is the second rescue: it uses function argument as its type. That is, the landing pad type is only known at run time, and it could be anything.

In C++, for example, all the catch types must be known upfront, and the compiler emits special Runtime Type Information (RTTI). Again, IMO, it should be Compile Time Type Information, but it’s C++…

For this reason, Ruby VM always enters each landing pad. For catches, it first checks (at run time!) if the exception type matches the landing pad’s type, and if so, the exception is marked as caught, and the landing pad’s execution continues.

If the exception type doesn’t match - the exception is immediately re-thrown so the next landing pad can try to catch it.
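A small sketch of these landing pads in plain Ruby, with a hypothetical method name and a log array added only to make the order of execution visible. Note that the class in the second rescue is an ordinary method argument, so it is only known at run time:

```ruby
def run_with_landing_pads(error_class)
  log = []
  begin
    yield
  rescue ArgumentError      # catch: type written directly in the source
    log << :argument_error
  rescue error_class        # catch: type is a run-time value
    log << :dynamic
  rescue                    # catch: no type attached (StandardError and subclasses in standard Ruby)
    log << :any
  ensure                    # cleanup: always runs
    log << :cleanup
  end
  log
end

p run_with_landing_pads(KeyError) { raise ArgumentError, "a" }
p run_with_landing_pads(KeyError) { raise KeyError, "b" }
p run_with_landing_pads(KeyError) { raise "c" }
```

Each call takes exactly one catch and then the cleanup, and which catch is taken depends on a comparison made while the program runs.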

MLIR

I’d love to describe how I modeled exceptions at the MLIR level, but it will take more time to do it for several reasons:

my original approach to constructing SSA right away didn’t work due to the way exceptions work (namely, some registers must spill on the stack), so the dialects have changed a bit, and I need to clean them up a bit
the way I model them currently is more of a hack and only works because I have certain conventions, so it’s not a solid model yet
I added JIT support (for Kernel.eval) and need to do some tweaking there to make exceptions work during just-in-time evaluation

I’ll write down all the low-level details at some point, but I don’t have an ETA, so I’ll stop here.

Thank you so much for reaching this far!

The following articles will focus on JIT compilation and debug information.

Don’t miss those details!

Compiling Ruby. Part 4: progress update

Thu, 30 Nov 2023

It’s been a while since I wrote the last blog post. One of the reasons is that so far, I had to change a lot of things in the implementation due to the exception support.

I’m writing a short progress update on where we are and what’s coming next.

What Happened

During this year, I gave two short talks related to this project:

a high-level overview of the project (EuroLLVM dev meeting)
intro into exception handling in LLVM (LLVM Social Berlin)

The state as of EuroLLVM (May 2023) was as follows:

compiler supported 104 out of 107 bytecode operations
it could compile ~150 out of ~180 files
it could compile ~15KLoC out of ~20KLoC
~72% of tests were passing (1033 out of 1416 it could compile)

Current Status

The three missing opcodes were all about exception handling, and this is what (so far) took the most time to implement. I have some drafts on the details, and I plan to publish them before the end of the year.

With the proper exception handling in place, things are finally starting to take the right shape. There is still much work to do, but it’s more predictable now.

Some new stats:

all bytecode operations are implemented 🎉
all the ruby code in the repo is now compiled (stdlib, gems, tests) 🎉
~95% of the tests are passing (1378 out of 1450) 🎉

Next Steps

The test suite now drives the next steps:

the majority of the failing tests (42 out of 71) are due to the missing fibers implementation
the second biggest group is various proc/methods metadata for runtime reflection
the next big part is related to JIT/runtime evaluation (i.e., when you can execute arbitrary Ruby code not known/visible at compile time)
and there is a long tail of more minor things

Besides that, I need to figure out a better build system for all of it. Currently, it’s a mess glued together by CMake scripts and CMake templates. It works perfectly for development and testing, but I’d hate to use such a system as an end user.

Ideally, I want a one-click solution that would take Ruby files as input and produce a native executable.

What is the state of the art when it comes to build systems/orchestration of compilation? Please let me know if you have any pointers 🙌

Thank you so much for reaching this far!

The next article is about exceptions: “Compiling Ruby. Part 5: exceptions”.

Compiling Ruby. Part 3: MLIR and compilation

Fri, 06 Jan 2023

Now that we have a decent understanding of how RiteVM works, we can tackle the compilation. The question I had around two years ago: how do I even do this?

A note of warning: so far, this is the longest article on this blog. And I’m afraid the most cryptic one.

The topics covered here:

MLIR
Control-Flow Graphs (CFG)
Static Single Assignment (SSA)
Dataflow Analysis

Compilation

mruby is written in C, so the logic behind each opcode is implemented in C. To compile a Ruby program from bytecode, we can emit an equivalent C program that uses mruby C API.

Some opcodes have direct API counterparts, e.g., OP_LOADI is equivalent to mrb_value mrb_fixnum_value(mrb_int i);. Yet, most opcodes are inlined in the giant dispatch loop in vm.c. However, we can extract these implementations into separate functions and call them from C.

Consider the following Ruby program:

puts 42

and its bytecode:

OP_LOADSELF R1
OP_LOADI    R2  42
OP_SEND     R1  :puts 1
OP_RETURN   R1
OP_STOP

An equivalent C program looks like this:

mrb_state *mrb = mrb_open();
mrb_value receiver = fs_load_self();
mrb_value number = mrb_fixnum_value(42);
mrb_funcall(mrb, receiver, "puts", 1, number); /* variadic arguments are passed as mrb_value, not as pointers */
mrb_close(mrb);

fs_load_self is a custom runtime function as OP_LOADSELF doesn’t have a C API counterpart.

OP_RETURN is ignored in this small example.

To compile a Ruby program from its bytecode, we “just” need to generate the equivalent C program. In fact, this is what I did to start two years ago. It worked well and had some nice debugging capabilities - in the end, it’s just a C program.

Yet, at some point, the implementation became daunting. As I was generating a C program, it was pretty hard to do some custom analysis or optimizations on the C code. I started adding my auxiliary data structures (really, just arrays of hashmaps of hashmaps of pairs and tuples) before I generated the C code.

I realized I was about to invent my intermediate representation of questionable quality.

I needed a better solution.

MLIR

I remember watching the MLIR talk by Tatiana Shpeisman and Chris Lattner live at EuroLLVM in Brussels. It went over my head back then, as there was a lot of talk about machine learning, tensors, heterogeneous accelerators, and some other dark magic.

Yet, I also remember some mentions of custom intermediate representations. So I decided to give it a try and dig into it more. It turned out to be great.

One of the key features of MLIR is the ability to define custom intermediate representations called dialects. MLIR provides an infrastructure to mix and match different dialects and run analyses or transformations against them. Further, the dialects can be lowered to machine code (e.g., for CPU or GPU).

Here is a slide from my LLVM Social talk to illustrate the idea:

MLIR Rite Dialect

I need to define a custom dialect to make MLIR work for my use case. I called it “Rite.” The dialect needs an operation for each RiteVM opcode and some RiteVM types.

Here is the minimum required to compile the code sample from above (puts 42).

def Rite_Dialect : Dialect {
  let name = "rite";
  let summary = "A one-to-one mapping from mruby RITE VM bytecode to MLIR";

  let cppNamespace = "rite";
}

class RiteType<string name> : TypeDef<Rite_Dialect, name> {
  let summary = name;
  let mnemonic = name;
}

def ValueType : RiteType<"value"> {}
def StateType : RiteType<"state"> {}

class Rite_Op<string mnemonic, list<Trait> traits = []> :
    Op<Rite_Dialect, mnemonic, traits>;

// OPCODE(LOADSELF, B) /* R(a) = self */
def LoadSelfOp : Rite_Op<"OP_LOADSELF"> {
  let summary = "OP_LOADSELF";
  let results = (outs ValueType);
}

// OPCODE(LOADI, BB) /* R(a) = mrb_int(b) */
def LoadIOp : Rite_Op<"OP_LOADI"> {
  let summary = "OP_LOADI";
  let arguments = (ins SI64Attr:$value);
  let results = (outs ValueType);
}

// OPCODE(SEND, BBB) /* R(a) = call(R(a),Syms(b),R(a+1),...,R(a+c)) */
def SendOp : Rite_Op<"OP_SEND"> {
  let summary = "OP_SEND";
  let arguments = (ins ValueType:$receiver, StringAttr:$symbol, UI32Attr:$argc, Variadic<ValueType>:$argv);
  let results = (outs ValueType);
}

// OPCODE(RETURN, B) /* return R(a) (normal) */
def ReturnOp : Rite_Op<"OP_RETURN", [Terminator]> {
  let summary = "OP_RETURN";
  let arguments = (ins ValueType:$src);
  let results = (outs ValueType);
}

It defines the dialect, the types needed, and the operations. Some entities come from the MLIR’s predefined dialects (StringAttr, UI32Attr, Variadic<...>, Terminator). We define the rest.

Each operation may take zero or more arguments, but it also may produce zero or more results. Unlike a “typical” programming language, MLIR dialects define a graph (as ins and outs hint at). The dialects also have some other properties, but one step at a time.

With the dialect in place, I can generate an “MLIR program” which is roughly equivalent to the C program above:

Note: I omit some details for brevity.

module @"test.rb" {
  func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
    %0 = rite.OP_LOADSELF() : () -> !rite.value
    %1 = rite.OP_LOADI() {value = 42 : si64} : () -> !rite.value
    %2 = rite.OP_SEND(%0, %1) {argc = 1 : ui32, symbol = "puts"} : (!rite.value, !rite.value) -> !rite.value
    %3 = rite.OP_RETURN(%2) : (!rite.value) -> !rite.value
  }
}

Here, I generated an MLIR module containing a function (top) with four operations corresponding to each bytecode operation.

Let’s take a detailed look at one operation:

%2 = rite.OP_SEND(%0, %1) {argc = 1 : ui32, symbol = "puts"} : (!rite.value, !rite.value) -> !rite.value

This piece defines a value named %2, which takes two other values (%0 and %1). In MLIR, constants are defined as “attributes,” which are argc = 1 : ui32 and symbol = "puts" in this case. What follows is the operation signature (!rite.value, !rite.value) -> !rite.value. The operation returns rite.value and takes several arguments: %0 is the receiver, and %1 is part of the Variadic<ValueType>:$argv.

MLIR takes the declarative dialect definition and generates C++ code out of it. The C++ code serves as a programmatic API to generate the MLIR module.

Once the module is generated, I can analyze and transform it. The next step is directly converting the Rite Dialect into LLVM Dialect and lowering it into LLVM IR.

From there on, I can emit an object file (machine code) and link it with mruby runtime.

Static Single Assignment (SSA)

In the previous article, I mentioned that the virtual stack is essential, yet here in both C and MLIR programs, I use “local variables” instead of the stack. What’s going on here?

The answer is simple - MLIR uses a Static Single-Assignment form for all its representations.

As a reminder, SSA means that each variable can only be defined once.

Pedantic note: the “variables” should be referred to as “values” as they cannot vary.

Here is an “invalid” SSA form:

int x = 42;
x = 55; // redefinition not allowed in SSA
print(x);

And here is the same code in the SSA form:

int x = 42;
int x1 = 55; // "redefinition" generates a new value
print(x1);

We must convert the registers into SSA values to satisfy the MLIR requirement to be in SSA form.

At first glance, the problem is trivial. We can maintain a map of definitions for each register at each point in time. For example, for the following bytecode:

OP_LOADSELF R1    // #1
OP_LOADI    R2 10 // #2
OP_LOADI    R3 20 // #3
OP_LOADI    R3 30 // #4
OP_ADD      R2 R3 // #5
OP_RETURN   R2    // #6

The map changes as follows:

Step #1: { empty }
Step #2: {
  R1 defined by #1
}
Step #3: {
  R1 defined by #1
  R2 defined by #2
}
Step #4: {
  R1 defined by #1
  R2 defined by #2
  R3 defined by #3
}
Step #5: {
  R1 defined by #1
  R2 defined by #2
  R3 defined by #4 // R3 redefined at #4
}
Step #6: {
  R1 defined by #1
  R2 defined by #5 // OP_ADD stores the result in the first operand
  R3 defined by #4
}

With this map, we know precisely where a register was defined when an operation uses the register.

So the MLIR version looks like this:

// OP_LOADSELF R1
%0 = rite.OP_LOADSELF() : () -> !rite.value
// OP_LOADI    R2 10
%1 = rite.OP_LOADI() {value = 10 : si64} : () -> !rite.value
// OP_LOADI    R3 20
%2 = rite.OP_LOADI() {value = 20 : si64} : () -> !rite.value
// OP_LOADI    R3 30
%3 = rite.OP_LOADI() {value = 30 : si64} : () -> !rite.value
// OP_ADD      R2 R3
%4 = rite.OP_ADD(%1, %3) : (!rite.value, !rite.value) -> !rite.value
// OP_RETURN   R2
%5 = rite.OP_RETURN(%4) : (!rite.value) -> !rite.value

Side note: %0 and %2 are never used and can be eliminated (if OP_LOADSELF/OP_LOADI don’t have side effects).
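The map-based bookkeeping above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual compiler code: the tuple layout (opcode, defined registers, used registers) and the name resolve_definitions are made up for this example.

```python
def resolve_definitions(bytecode):
    """For every operation, find which earlier operation defined each register it uses."""
    defined_by = {}  # register name -> number of the operation that last defined it
    resolved = []    # (opcode, [defining operation number for each used register])
    for number, (opcode, defs, uses) in enumerate(bytecode, start=1):
        # Look up the uses first: an operation reads its operands before it writes.
        resolved.append((opcode, [defined_by[reg] for reg in uses]))
        for reg in defs:
            defined_by[reg] = number
    return resolved, defined_by


# Hypothetical encoding of the bytecode above: (opcode, defines, uses)
bytecode = [
    ("OP_LOADSELF", ["R1"], []),            # #1
    ("OP_LOADI",    ["R2"], []),            # #2
    ("OP_LOADI",    ["R3"], []),            # #3
    ("OP_LOADI",    ["R3"], []),            # #4
    ("OP_ADD",      ["R2"], ["R2", "R3"]),  # #5
    ("OP_RETURN",   [],     ["R2"]),        # #6
]

resolved, final_map = resolve_definitions(bytecode)
for opcode, sources in resolved:
    print(opcode, sources)
```

In this sketch OP_ADD resolves to the definitions #2 and #4, which correspond to %1 and %3 in the listing above (the MLIR value numbers start at zero).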

This solution is pleasant until the code has branching such as if/else, loops, or exceptions.

Consider the following non-SSA example:

x = 10;
if (something) {
  x = 20;
} else {
  x = 30;
}
print(x); // Where is x defined?

Classical SSA solves this problem with artificial phi-nodes:

x1 = 10;
if (something) {
  x2 = 20;
} else {
  x3 = 30;
}
x4 = phi(x2, x3); // Will magically resolve to the right x depending on where it comes from
print(x4);

MLIR approaches this differently and elegantly - via “block arguments.”

But first, let’s talk about Control-Flow Graphs.

Control-Flow Graph (CFG)

A control-flow graph is a form of intermediate representation that maintains the program in the form of a graph where operations are connected to each other based on the execution (or control) flow.

Consider the following bytecode (the number on the left is an operation address):

001: OP_LOADT R1      // puts "true" in R1
002: OP_LOADI R2 42
003: OP_JMPIF R1 006  // jump to 006 if R1 contains "true"
                      // otherwise implicitly falls through to 004
004: OP_LOADI R3 20
005: OP_JMP 007       // jump to 007 unconditionally
006: OP_LOADI R3 30
007: OP_ADD R2 R3     // R3 may be either 20 or 30, depending on the branching

The same program in the form of a graph:

This CFG can be further optimized: we can merge all the subsequent nodes unless the node has more than one incoming or more than one outgoing edge.

The merged nodes are called basic blocks:

Some more terms for completeness:

the “first” basic block where the execution of a function starts is called “entry.”
similarly, the “last” basic block is called “exit.”
preceding (incoming, previous) basic blocks are called predecessors. The entry block doesn’t have predecessors.
succeeding (outgoing, next) basic blocks are called successors. Exit blocks don’t have successors.
the last operation in a basic block is called a terminator

Based on the last picture:

B1: entry block
B4: single exit block. There could be several exit blocks, yet we can always add one “empty” block as a successor for the exit blocks to have only one exit block.
B1: predecessors: [], successors: [B2, B3], terminator: OP_JMPIF
B2: predecessors: [B1], successors: [B4], terminator: OP_JMP
B3: predecessors: [B1], successors: [B4], terminator: OP_LOADI
B4: predecessors: [B2, B3], successors: [], terminator: OP_ADD
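For illustration, here is a minimal Python sketch of how the basic blocks and their successors could be derived from the bytecode above. It assumes a hypothetical tuple layout (address, opcode, jump target) and treats only OP_JMP and OP_JMPIF as jumps; the name build_cfg is made up for this example.

```python
JUMPS = ("OP_JMP", "OP_JMPIF")


def build_cfg(code):
    """code: list of (address, opcode, jump_target_or_None), sorted by address."""
    addresses = [addr for addr, _, _ in code]
    # A "leader" starts a new basic block: the first operation,
    # every jump target, and every operation that follows a jump.
    leaders = {addresses[0]}
    for i, (_, opcode, target) in enumerate(code):
        if opcode in JUMPS:
            leaders.add(target)
            if i + 1 < len(code):
                leaders.add(addresses[i + 1])

    blocks = []
    for op in code:
        if op[0] in leaders:
            blocks.append([])
        blocks[-1].append(op)

    # Block numbers start at 1 to match B1..B4 in the text.
    block_of = {block[0][0]: n for n, block in enumerate(blocks, start=1)}
    successors = {}
    for n, block in enumerate(blocks, start=1):
        _, opcode, target = block[-1]
        succ = set()
        if opcode in JUMPS:
            succ.add(block_of[target])
        if opcode != "OP_JMP" and n < len(blocks):
            succ.add(n + 1)  # implicit fallthrough into the next block
        successors[n] = sorted(succ)
    return blocks, successors


code = [
    (1, "OP_LOADT", None),
    (2, "OP_LOADI", None),
    (3, "OP_JMPIF", 6),
    (4, "OP_LOADI", None),
    (5, "OP_JMP", 7),
    (6, "OP_LOADI", None),
    (7, "OP_ADD", None),
]

blocks, successors = build_cfg(code)
block_addresses = [[addr for addr, _, _ in block] for block in blocks]
print(block_addresses)
print(successors)
```

The sketch yields four blocks covering addresses 1-3, 4-5, 6, and 7, with the same successor lists as B1 through B4 above.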

CFGs in MLIR

Now we can take a look at CFGs from the MLIR perspective. If you are familiar with CFGs in LLVM, then the important difference is that in MLIR, all the basic blocks may have arguments. Function arguments are, in fact, the block arguments from the entry block. For example, this is a more accurate representation of a function:

func @top() -> !rite.value {
^bb0(%arg0: !rite.state, %arg1: !rite.value):
  %0 = rite.OP_LOADSELF() : () -> !rite.value
  %1 = rite.OP_LOADI() {value = 42 : si64} : () -> !rite.value
  %2 = rite.OP_SEND(%0, %1) {argc = 1 : ui32, symbol = "puts"} : (!rite.value, !rite.value) -> !rite.value
  %3 = rite.OP_RETURN(%2) : (!rite.value) -> !rite.value
}

Note, ^bbX represents the basic blocks.

To convert the following bytecode:

001: OP_LOADT R1      // puts "true" in R1
002: OP_LOADI R2 42
003: OP_JMPIF R1 006  // jump to 006 if R1 contains "true"
                      // otherwise implicitly falls through to 004
004: OP_LOADI R3 20
005: OP_JMP 007       // jump to 007 unconditionally
006: OP_LOADI R3 30
007: OP_ADD R2 R3     // R3 may be either 20 or 30, depending on the branching

we need to take several steps:

add an address attribute to all addressable operations (they could be jump targets)
add “targets” attribute to all the jumps, including implicit fallthrough jumps
add an explicit jump in place of the implicit jumps
add the successor blocks for all jump instructions
put all the operations in a single, entry basic block

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  %0 = rite.PhonyValue() : () -> !rite.value
  %1 = rite.OP_LOADT() { address = 001 } : () -> !rite.value
  %2 = rite.OP_LOADI() { address = 002, value = 42 } : () -> !rite.value
  rite.OP_JMPIF(%0)[^bb1, ^bb1] { address = 003, targets = [006, 004] }
  %3 = rite.OP_LOADI() { address = 004, value = 20 } : () -> !rite.value
  rite.OP_JMP()[^bb1] { address = 005, targets = [007] }
  %4 = rite.OP_LOADI() { address = 006, value = 30 } : () -> !rite.value
  rite.FallthroughJump()[^bb1]
  %5 = rite.OP_ADD(%0, %0) { address = 007 } : () -> !rite.value
^bb1:
}

Note: I’m omitting some details from the textual representation for brevity.

Notice, here, I added a “phony value” as a placeholder for SSA values as we cannot yet construct the proper SSA. We will remove them in the next section.

Additionally, I added a phony basic block to serve as a placeholder successor for the jump targets.

Now, the last steps are:

split the entry basic block by cutting it right before each jump target operation
rewire the jumps to point to the right target basic blocks
delete the phony basic block used as a placeholder

The final CFG looks like this:

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  %0 = rite.PhonyValue() : () -> !rite.value
  %1 = rite.OP_LOADT() { address = 001 } : () -> !rite.value
  %2 = rite.OP_LOADI() { address = 002, value = 42 } : () -> !rite.value
  rite.OP_JMPIF(%0)[^bb1, ^bb2] { address = 003, targets = [006, 004] }
^bb1: // pred: ^bb0
  %3 = rite.OP_LOADI() { address = 004, value = 20 } : () -> !rite.value
  rite.OP_JMP()[^bb3] { address = 005, targets = [007] }
^bb2: // pred: ^bb0
  %4 = rite.OP_LOADI() { address = 006, value = 30 } : () -> !rite.value
  rite.FallthroughJump()[^bb3]
^bb3: // pred: ^bb1, ^bb2
  %5 = rite.OP_ADD(%0, %0) { address = 007 } : () -> !rite.value
}

It corresponds to the last picture above, except that we now have an explicit rite.FallthroughJump().

With the CFG in place, we can solve the SSA problem and eliminate the rite.PhonyValue() placeholder.

SSA in MLIR

As a reminder, here is the CFG of the “problematic” program:

In the MLIR form, we no longer have registers from the virtual stack. We only have values such as %2, %3, %4, and so on. The tricky part is the 007: OP_ADD R2 R3 operation - where is R3 coming from? Is it %3 or %4?

To answer this question, we can use dataflow analysis.

Dataflow analysis is used to derive specific facts about the program. The analysis is an iterative process: first, collect the base facts for each basic block, then for each basic block, update the facts combining them with the facts from successors or predecessors. As the facts updated for a basic block may affect the facts from successors/predecessors, the process should run iteratively until no new facts are derived.

A critical requirement for the facts - they should be monotonic. Once the fact is known, it cannot “disappear.” This way, the iterative process eventually stops as, in the worst case, the analysis will derive “all” the facts about the program and won’t be able to derive any more.

My favorite resource about dataflow analysis is Adrian Sampson’s lectures on the subject - The Data Flow Framework. I highly recommend it.

In our case, the facts we need to derive are: which values/registers are required for each operation.

Here is an algorithm briefly:

at every point in time, there is a map of the values defined so far
if an operation is using a value that is not defined, then this value is required
the required values become the block arguments and must be coming from the predecessors
the terminators of the “required” predecessors now use the values required by the successors
at the next iteration, the block arguments define the previously required values

The process runs iteratively until no new required values appear.

An important detail for the entry basic block is that, as it doesn’t have a predecessor, all the required values must come from the virtual stack.
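The iterative process can be sketched as a small fixed-point loop in Python. This is a simplified illustration rather than the actual implementation: the data layout (each block is a list of (defines, uses) pairs) and the name required_values are assumptions made up for this example. The required sets only ever grow, which is the monotonicity that guarantees termination.

```python
def required_values(blocks, successors):
    """blocks: name -> list of (defines, uses); successors: name -> list of block names."""
    required = {name: set() for name in blocks}
    changed = True
    while changed:  # iterate until no new facts are derived
        changed = False
        for name, ops in blocks.items():
            defined = set()
            needed = set()
            for defs, uses in ops:
                needed |= set(uses) - defined  # used before being defined in this block
                defined |= set(defs)
            # The terminator must pass along whatever the successors require.
            for succ in successors[name]:
                needed |= required[succ] - defined
            if not needed <= required[name]:
                required[name] |= needed  # facts are only added, never removed
                changed = True
    return required


# The example bytecode, already split into basic blocks.
blocks = {
    "bb0": [(["R1"], []), (["R2"], []), ([], ["R1"])],  # LOADT, LOADI, JMPIF
    "bb1": [(["R3"], []), ([], [])],                    # LOADI, JMP
    "bb2": [(["R3"], []), ([], [])],                    # LOADI, fallthrough jump
    "bb3": [(["R2"], ["R2", "R3"])],                    # ADD
}
successors = {"bb0": ["bb1", "bb2"], "bb1": ["bb3"], "bb2": ["bb3"], "bb3": []}

required = required_values(blocks, successors)
print(required)
```

Each non-empty set in the result becomes the block argument list of the corresponding block. In this sketch, an entry block with a non-empty set would have to take those values from the virtual stack.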

Let’s look at the example bytecode once again:

001: OP_LOADT R1
002: OP_LOADI R2 42
003: OP_JMPIF R1 006
004: OP_LOADI R3 20
005: OP_JMP   007
006: OP_LOADI R3 30
007: OP_ADD   R2 R3

This is the initial state for the dataflow analysis. The comment above each operation lists the values defined at that point in time. The comment to the right of each operation describes the operation itself, that is, which values it defines and which it uses:

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  // defined: []
  %0 = rite.PhonyValue() : () -> !rite.value   // defines: [], uses: []
  // defined: []
  %1 = rite.OP_LOADT() : () -> !rite.value     // defines: [R1], uses: []
  // defined: [R1]
  %2 = rite.OP_LOADI(42) : () -> !rite.value   // defines: [R2], uses: []
  // defined: [R1, R2]
  rite.OP_JMPIF(%0)[^bb1, ^bb2]                // defines: [], uses: [R1]

^bb1: // pred: ^bb0                            // defines: [], uses: []
  // defined: []
  %3 = rite.OP_LOADI(20) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R3]
  rite.OP_JMP()[^bb3]                          // defines: [], uses: []

^bb2: // pred: ^bb0                            // defines: [], uses: []
  // defined: []
  %4 = rite.OP_LOADI(30) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R3]
  rite.FallthroughJump()[^bb3]                 // defines: [], uses: []

^bb3: // pred: ^bb1, ^bb2                      // defines: [], uses: []
  // defined: []
  %5 = rite.OP_ADD(%0, %0) : () -> !rite.value // defines: [R2], uses: [R2, R3]
}

The last operation uses values that are not defined. Therefore R2 and R3 are required and must come from the predecessors.

Update predecessors and rerun the analysis.

Note: I am using %RX_Y names to distinguish them from the original numerical value names. X is the register number, and Y is the basic block number.

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  // defined: []
  %0 = rite.PhonyValue() : () -> !rite.value   // defines: [], uses: []
  // defined: []
  %1 = rite.OP_LOADT() : () -> !rite.value     // defines: [R1], uses: []
  // defined: [R1]
  %2 = rite.OP_LOADI(42) : () -> !rite.value   // defines: [R2], uses: []
  // defined: [R1, R2]
  rite.OP_JMPIF(%0)[^bb1, ^bb2]                // defines: [], uses: [R1]

^bb1: // pred: ^bb0                            // defines: [], uses: []
  // defined: []
  %3 = rite.OP_LOADI(20) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R3]
  rite.OP_JMP(%0, %0)[^bb3]                    // defines: [], uses: [R2, R3]

^bb2: // pred: ^bb0                            // defines: [], uses: []
  // defined: []
  %4 = rite.OP_LOADI(30) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R3]
  rite.FallthroughJump(%0, %0)[^bb3]           // defines: [], uses: [R2, R3]

^bb3(%R2_3, %R3_3): // pred: ^bb1, ^bb2        // defines: [R2, R3], uses: []
  // defined: [R2, R3]
  %5 = rite.OP_ADD(%0, %0) : () -> !rite.value // defines: [R2], uses: [R2, R3]
}

Basic block ^bb3 now has two block arguments. The terminators from its predecessors (^bb1 and ^bb2) now use an undefined value, R2. R2 is now required. We must add it as a block argument and propagate it to the predecessors’ terminators.

Rerun the analysis:

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  // defined: []
  %0 = rite.PhonyValue() : () -> !rite.value   // defines: [], uses: []
  // defined: []
  %1 = rite.OP_LOADT() : () -> !rite.value     // defines: [R1], uses: []
  // defined: [R1]
  %2 = rite.OP_LOADI(42) : () -> !rite.value   // defines: [R2], uses: []
  // defined: [R1, R2]
  rite.OP_JMPIF(%0, %0, %0)[^bb1, ^bb2]        // defines: [], uses: [R1, R2, R2]

^bb1(%R2_1): // pred: ^bb0                     // defines: [R2], uses: []
  // defined: [R2]
  %3 = rite.OP_LOADI(20) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R2, R3]
  rite.OP_JMP(%0, %0)[^bb3]                    // defines: [], uses: [R2, R3]

^bb2(%R2_2): // pred: ^bb0                     // defines: [R2], uses: []
  // defined: [R2]
  %4 = rite.OP_LOADI(30) : () -> !rite.value   // defines: [R3], uses: []
  // defined: [R2, R3]
  rite.FallthroughJump(%0, %0)[^bb3]           // defines: [], uses: [R2, R3]

^bb3(%R2_3, %R3_3): // pred: ^bb1, ^bb2        // defines: [R2, R3], uses: []
  // defined: [R2, R3]
  %5 = rite.OP_ADD(%0, %0) : () -> !rite.value // defines: [R2], uses: [R2, R3]
}

We can run the analysis one more time, but it won’t change anything, which concludes the analysis. We now have all the information we need to replace the phony value with the correct values.

Additionally, now we can replace our custom jump operations with the builtin ones from MLIR, so the final function looks like this:

func @top(%arg0: !rite.state, %arg1: !rite.value) -> !rite.value {
  %1 = rite.OP_LOADT() : () -> !rite.value
  %2 = rite.OP_LOADI(42) : () -> !rite.value
  cond_br %1, ^bb1(%2), ^bb2(%2)
^bb1(%R2_1): // pred: ^bb0
  %3 = rite.OP_LOADI(20) : () -> !rite.value
  br ^bb3(%R2_1, %3)
^bb2(%R2_2): // pred: ^bb0
  %4 = rite.OP_LOADI(30) : () -> !rite.value
  br ^bb3(%R2_2, %4)
^bb3(%R2_3, %R3_3): // pred: ^bb1, ^bb2
  %5 = rite.OP_ADD(%R2_3, %R3_3) : () -> !rite.value
}

Now, onto drawing the rest of the fu**ing owl.

Thank you so much for reaching this far!

The next article gives a short progress update.

Compiling Ruby. Part 2: RiteVM

Wed, 04 Jan 2023

mruby (so-called “embedded” Ruby) is a relatively small Ruby implementation.

mruby is based on a register-based virtual machine. In the previous article, I mentioned the difference between stack- and register-based VMs, but what is a Virtual Machine? As obvious as it gets, a Virtual Machine is a piece of software that mimics specific behavior(s) of a Real Machine.

Depending on the kind of virtual machine, the capabilities may vary. A VM can mimic a typical computer’s complete behavior, allowing us to run any software we’d run on a regular machine (think VirtualBox or VMware). Or it can implement a behavior of an imaginary, artificial machine that doesn’t have a counterpart in the real physical world (think JVM or CLR).

The mruby RiteVM is of the latter kind. It defines a set of “CPU” operations and provides a runtime to run them. The operations are referred to as bytecode. The bytecode consists of an operation kind (opcode) and its corresponding metadata (registers, flags, etc.).

Bytecode

Here is a tiny snippet of various RiteVM operations (coming from mruby/ops.h):

OPCODE(NOP,   Z)  /* no operation */
OPCODE(MOVE,  BB) /* R(a) = R(b) */
OPCODE(ADD,   B)  /* R(a) = R(a)+R(a+1) */
OPCODE(ENTER, W)  /* arg setup according to flags (23=m5:o5:r1:m5:k5:d1:b1) */
OPCODE(JMP,   S)  /* pc+=a */

All the opcodes follow the same form:

OPCODE(name, operands) /* comment */

The name is self-explanatory. The comment describes (or hints at) an operation’s semantics. The operands part is a bit trickier: it is directly related to the bytecode encoding.

Each letter in the operands describes the size of the operand. Z means that the operand’s size is zero bytes (i.e., there is no operand). B, S, and W all mean one operand, but their sizes are 1, 2, and 3 bytes, respectively. These definitions can be mixed and matched as needed, but in practice, only the following combinations are used (from mruby/ops.h):

/* operand types:
 + BB: 8+8bit
 + BBB: 8+8+8bit
 + BS: 8+16bit
 + BSS: 8+16+16bit
*/

as an operation may have at most three operands.

The operands are called a, b, and c. The following bytecode string will be decoded differently depending on the operand definition (the 42 will be mapped to a corresponding opcode):

42 1 2 3

BBB -> a = 1, b = 2, c = 3
B -> a = 1, b = undefined, c = undefined, 2 is treated as the next opcode
BS -> a = 1, b = 2 << 8 | 3, c = undefined
W -> a = 1 << 16 | 2 << 8 | 3, b = undefined, c = undefined
and so on.
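The decoding rules above can be expressed as a short Python sketch. It is an illustration only: the function name decode_operands and the list-of-integers input are assumptions for this example, and multi-byte operands are read most significant byte first, following the shifts shown above.

```python
# Operand sizes in bytes, as described in the text.
OPERAND_SIZES = {"Z": 0, "B": 1, "S": 2, "W": 3}


def decode_operands(stream, operand_format):
    """Decode the operands that follow the opcode byte at stream[0].

    Returns the decoded operands and the offset of the next opcode.
    """
    operands = []
    offset = 1  # skip the opcode itself
    for letter in operand_format:
        size = OPERAND_SIZES[letter]
        if size == 0:
            continue  # Z: no operand
        raw = bytes(stream[offset:offset + size])
        operands.append(int.from_bytes(raw, "big"))
        offset += size
    return operands, offset


stream = [42, 1, 2, 3]
for fmt in ("BBB", "B", "BS", "W"):
    print(fmt, decode_operands(stream, fmt))
```

For the B format the returned offset is 2, so the byte 2 would be treated as the next opcode, as described above.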

Now the comments from the snippet above make more sense:

NOP does nothing with all its zero operands
MOVE copies value from register b to register a
ADD adds the values in registers a and a+1 and stores the result in register a
ENTER maps the operand a to the flags needed for its logic
JMP advances the program counter by the offset a

With all this information, we now understand what the operations do. The next question is how do they do it?

Bytecode Execution

The bytecode doesn’t live in a vacuum. Each bytecode sequence is part of a method. Consider the following example:

def sum(a, b)
 a + b
end
puts sum(10, 32)

We can look into its bytecode:

> mruby --verbose sum.rb
<skipped>
irep 0x600001390000 nregs=6 nlocals=1 pools=0 syms=2 reps=1 ilen=25
file: sum.rb
 1 000 TCLASS R1
 1 002 METHOD R2 I(0:0x600001390050)
 1 005 DEF    R1 :sum
 4 008 LOADI  R3 10
 4 011 LOADI  R4 32
 4 014 SSEND  R2 :sum n=2 (0x02)
 4 018 SSEND  R1 :puts n=1 (0x01)
 4 022 RETURN R1
 4 024 STOP

irep 0x600001390050 nregs=7 nlocals=4 pools=0 syms=0 reps=0 ilen=14
local variable names:
 R1:a
 R2:b
 R3:&
file: sum.rb
 1 000 ENTER  2:0:0:0:0:0:0 (0x80000)
 2 004 MOVE   R4 R1 ; R1:a
 2 007 MOVE   R5 R2 ; R2:b
 2 010 ADD    R4 R5
 2 012 RETURN R4

The bytecode sequence is part of the mrb_irep struct, which, in turn, is part of the RProc struct, which corresponds to a Ruby method (procedure?) object.

The distinction is necessary as RProc is a higher-level abstraction over executable code, which might be either RiteVM bytecode or a C function. Additionally, there is a distinction between a lambda, a block, and a method. Yet, we will only focus on the bytecode parts and ignore all the lambda/block/method shenanigans.

In the previous article, I briefly described the dispatch loop and how a VM interacts with the virtual stack. That description is not precise, but it is accurate enough and captures the essential details.

Execution of each RProc requires a virtual stack to operate on the data, but it also requires some additional metadata. The “metadata” is part of the so-called mrb_callinfo struct. This concept is known as stack frame or activation record. The virtual stack is stored separately but is part of the mrb_callinfo (sort of). The virtual stack is essential as it is the only way to communicate between different operations and different RProcs.

Here is what happens during bytecode execution:

1. mrb_callinfo is created from an RProc and is put onto the “call info” stack or simply a call stack. The new mrb_callinfo points to a new location of the shared virtual stack (see the first picture below).
2. Each operation in RProc’s mrb_irep is executed in the context of the top mrb_callinfo on the call stack. The virtual stack and state of the VM are updated accordingly.
3. When any “sendable” (OP_SEND, OP_SSEND, OP_SENDBV, etc.) operation is encountered, we move to step 1.
4. When any “returnable” (OP_RETURN, OP_RETURN_BLK) operation is encountered, then the operand is put into the “return register” (for consumption by the caller), and the call stack is popped, effectively removing mrb_callinfo created at step 1.

Here is how it looks in memory:

mrb_state (the state of the whole VM) has a stack of mrb_contexts (more on them in a later article). Each mrb_context maintains the stack of mrb_callinfo (the call stack). Each mrb_context owns a virtual stack, which is shared among several mrb_callinfo.

This way, the caller prepares the stack for the callee.

As a reminder, here is the bytecode from the example above:

top:
TCLASS R1
METHOD R2 I(0:0x600001390050)
DEF    R1 :sum
LOADI  R3 10
LOADI  R4 32
SSEND  R2 :sum n=2 (0x02)
SSEND  R1 :puts n=1 (0x01)
RETURN R1
STOP

sum:
ENTER  2:0:0:0:0:0:0 (0x80000)
MOVE   R4 R1 ; R1:a
MOVE   R5 R2 ; R2:b
ADD    R4 R5
RETURN R4

This is how the shared stack looks from the perspective of both the top-level method top and the method sum: by the time the first SSEND operation (“send to self”) is executed, all the values are ready for consumption by the callee.
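To make the sharing more tangible, here is a toy Python model of the call above. It is a sketch under explicit assumptions and is not taken from the mruby sources: it assumes that the callee’s frame starts at the caller’s register named in SSEND (R2 here), so that the caller’s R3 and R4 become the callee’s R1 and R2, and that the return value ends up in that same caller register. The CallInfo class and the helper names are hypothetical.

```python
class CallInfo:
    """Toy stand-in for mrb_callinfo: remembers where a frame starts on the shared stack."""

    def __init__(self, name, base):
        self.name = name
        self.base = base  # index of this frame's R0 on the shared stack


shared_stack = [None] * 16         # one virtual stack shared by all frames
call_stack = [CallInfo("top", 0)]  # the "call info" stack


def reg(ci, n):
    return shared_stack[ci.base + n]


def set_reg(ci, n, value):
    shared_stack[ci.base + n] = value


top = call_stack[-1]
set_reg(top, 3, 10)  # LOADI R3 10
set_reg(top, 4, 32)  # LOADI R4 32

# SSEND R2 :sum n=2 (assumption: the new frame starts at the caller's R2)
callee = CallInfo("sum", top.base + 2)
call_stack.append(callee)

set_reg(callee, 4, reg(callee, 1))                   # MOVE R4 R1 ; R1:a
set_reg(callee, 5, reg(callee, 2))                   # MOVE R5 R2 ; R2:b
set_reg(callee, 4, reg(callee, 4) + reg(callee, 5))  # ADD R4 R5
result = reg(callee, 4)                              # RETURN R4

call_stack.pop()         # the callee's call info is removed
set_reg(top, 2, result)  # assumption: the result lands in the caller's R2
print(shared_stack[:8])
```

Under these assumptions the caller never copies the arguments: writing 10 and 32 into its own R3 and R4 is what places them into the callee’s R1 and R2.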

Hopefully, now you better understand how RiteVM uses bytecode, and we are one step closer to the actual fun part - compilation!

The following article covers MLIR and the way I modeled dialects - MLIR and compilation

Compiling Ruby. Part 1: Compilers vs. Interpreters

Fri, 02 Dec 2022

With the (hopefully) convincing motivation out of the way, we can get to the technical details.

Compiling Interpreter, Interpreting Compiler

As mentioned in the motivation, I want to build an ahead-of-time compiler for Ruby. I want it to be compatible with the existing Ruby implementation to fit it naturally into the existing system.

So the first question I had to answer is - how do I even do it?

Compilers vs. Interpreters

The execution model of compiled and interpreted languages is slightly different:

a compiler takes the source program and outputs another program that can be run on any other machine even when the compiler is not on that target machine
an interpreter also takes the source program as an input but does not output anything and runs the program right away

Unlike the compiler, the interpreter must be present on the machine where you want to run the program. To build the compiler, I have to somehow combine the interpreter with the program it runs.

Let’s take a high-level schematic view of a typical compiler and interpreter.

The compiler is a straightforward one-way process: the source code is parsed, then the machine code is generated, and the executable is produced. The executable also depends on a runtime. The runtime can be embedded into the executable, be an external entity, or, most commonly, both.

The interpreter is more complex in this regard. It contains everything in one place: parser, runtime, and a virtual machine. Also, note the two-way arrows Parser <-> VM and Runtime <-> VM. The reason is that Ruby is a dynamic language. During the regular program execution, a program can read more code from the disk or network and execute it, thus the interconnection between these components.

Parser + VM + Runtime

Arguably, the triple VM + Parser + Runtime can be called “a runtime,” but I prefer to have some separation of concerns. Here is where I draw the boundaries:

Parser: only does the parsing of the source code and converts it into a form suitable for execution via the Virtual Machine (“bytecode”)
Virtual Machine: the primary “computational device,” it operates on the bytecode and actually “runs” the program
Runtime: machinery required by the parser and VM (e.g., VM state manipulation, resource management, etc.)

A naïve approach to building the compiler is to tear the interpreter apart: replace VM and runtime with codegen and embed the runtime into the resulting executable. However, the runtime extraction won’t work due to the dynamism mentioned above - the resulting executable should be able to parse and run any arbitrary Ruby code.

Side note: an alternative approach is to build a JIT compiler and embed the whole compiler into the executable, but it adds more complexity than I am ready to deal with.

In the end, the solution is simpler - the compiler and the final executable include the whole interpreter. So the final “compiling interpreter” (or “interpreting compiler”) looks like this:

Ruby and its many Virtual Machines

Now it’s time to discuss the Virtual Machine component.

The most widely used Ruby implementation is CRuby, also known as MRI (as in “Matz’ Ruby Interpreter”). It is an interpreter built on top of a custom virtual machine (YARV).

Another widely used implementation is mruby (so-called “embedded” Ruby). It is also an interpreter and built on top of another custom VM (RiteVM).

YARV and RiteVM are rather lightweight virtual machines. Unlike full-fledged system or process-level VMs (e.g., VirtualBox, JVM, CLR, etc.), they only provide a “computational device” - there is no resource control, sandboxing, etc.

Stack vs. Registers

The “computational device” executes certain operations on certain data. The operations are encoded in the form of “bytecode,” and the data is stored on a “virtual stack.” The way the stack is accessed differs, though.

YARV accesses the stack implicitly (this is also known as a “stack-based VM”). RiteVM accesses the stack explicitly via registers (you got it, “register-based VM”).

To illustrate the bytecode and the difference between YARV and RiteVM, consider the following artificial examples.

Stack-based bytecode:

load 10
load 32
plus
print

Register-based bytecode:

load R1 10
load R2 32
plus R1 R1 R2
print R1

The stack-based version uses the stack implicitly, while the register-based version specifies the storage explicitly.

Let’s “run” both examples to see them in action.

At every step, the VM does something according to the currently running instruction/opcode (underscored lines) and updates the virtual stack.

Stack-based VM only reads/writes data from/to the place where an arrow points to - this is the top of the virtual stack.

While the underlying machinery is very similar, there are good reasons for picking one or the other form of a VM. Yet, these reasons are out of the scope of this series. Please consult other resources if you want to learn more. The topic of VMs is huge but fascinating.

Dispatch loop

Let’s consider how the VM works and deals with the bytecode. YARV and RiteVM use the so-called “dispatch loop,” which is effectively a for-loop + a huge switch-statement. Typical pseudocode looks like this:

// Iterate through each opcode in the bytecode stream
for (opcode in bytecode) {
  switch (opcode) {
  // Take a corresponding action for each separate opcode
  // (break prevents falling through into the next case)
  case OP_CODE_1: /* do something */; break;
  case OP_CODE_2: /* do something */; break;
  // ... more opcodes
  case OP_CODE_N: /* do something */; break;
  }
}

And then, the bodies for the actual opcodes may look as follows. Stack-based VM:

/*
Example program:
 load 10
 load 32
 plus
 print
*/
case OP_LOAD:
  val = pool[0] // pool is some abstract additional storage
  stack.push(val)
case OP_PLUS:
  rhs = stack.pop() // the right operand was pushed last, so it is popped first
  lhs = stack.pop()
  res = lhs + rhs
  stack.push(res)
case OP_PRINT:
  val = stack.pop()
  print(val)

And the register-based version for completeness:

/*
Example program:
 load R1 10
 load R2 32
 plus R1 R1 R2
 print R1
*/

// md is some additional opcode metadata
case OP_LOAD:
  registers[md.reg1] = pool[0]
case OP_PLUS:
  lhs = registers[md.reg1]
  rhs = registers[md.reg2]
  res = lhs + rhs
  registers[md.reg1] = res
case OP_PRINT:
  val = registers[md.reg1]
  print(val)

In this case, if we know the values behind pool[0] and the actual values of md.regN, then we can compile the example program to something like this:

/*
 load R1 10
 load R2 32
 plus R1 R1 R2
 print R1
*/
R1 = 10
R2 = 32
R1 = R1 + R2
print(R1)

and avoid the whole dispatch loop, but I digress :)
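For completeness, here is a runnable Python sketch of both dispatch loops executing the example programs. It is a toy model, not how YARV or RiteVM are implemented: the tuple-based program encoding and the function names are assumptions for this example, and the constants are carried inline instead of coming from a pool.

```python
def run_stack_vm(program):
    """Stack-based VM: operands are taken implicitly from the top of the stack."""
    stack, output = [], []
    for opcode, *operands in program:  # the dispatch loop
        if opcode == "load":
            stack.append(operands[0])
        elif opcode == "plus":
            rhs = stack.pop()  # pushed last, popped first
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif opcode == "print":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return output


def run_register_vm(program):
    """Register-based VM: every operation names its storage explicitly."""
    registers, output = {}, []
    for opcode, *operands in program:  # the dispatch loop
        if opcode == "load":
            dst, value = operands
            registers[dst] = value
        elif opcode == "plus":
            dst, lhs, rhs = operands
            registers[dst] = registers[lhs] + registers[rhs]
        elif opcode == "print":
            output.append(registers[operands[0]])
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return output


stack_program = [("load", 10), ("load", 32), ("plus",), ("print",)]
register_program = [
    ("load", "R1", 10),
    ("load", "R2", 32),
    ("plus", "R1", "R1", "R2"),
    ("print", "R1"),
]

print(run_stack_vm(stack_program))
print(run_register_vm(register_program))
```

Both functions collect the printed values in a list instead of writing to the console, which keeps the sketch easy to check.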

In the following article, we look into mruby’s implementation and virtual machine in more detail - Compiling Ruby. Part 2: RiteVM.

Compiling Ruby. Part 0: Motivation

Fri, 02 Dec 2022

For the last couple of years, I’ve been working on a fun side project called DragonRuby Game Toolkit, or GTK for short.

GTK is a professional-grade 2D game engine. Among the many incredible features:

you can build games in Ruby
it targets many (like, many!) platforms (Windows, Linux, macOS, iOS, Android, WASM, Nintendo Switch, Xbox, PlayStation, Oculus VR, Steam Deck)
super lightweight (~3.5 megabytes)
and many more really

GTK is built on top of a slightly customized mruby runtime and allows you to write games purely in Ruby. It comes with all the batteries included, but if you need more in a specific case, you can always fall back to C via the C extensions mechanism.

From a user perspective, the end product (the game) looks like this:

While the engine itself is pretty fast, what annoys me personally (from the aesthetic point of view) is that we cannot fully optimize the C extensions as they are compiled separately from the rest of the engine.

Looking at the picture, we have four components of the game:

the engine’s runtime (Ruby)
the engine’s runtime (C)
the game code (Ruby)
the game code (C)

Suppose we want to optimize all the C code together. In that case, we’d have to ship the runtime in some ‘common’ denominator form (e.g., LLVM Bitcode), then compile the C extension into the same form, optimize it all together and then link into an executable.

This is doable, but while I was thinking about this problem I found an even bigger (and much more interesting) ‘problem’ - what about all that Ruby code? Can we also compile it to some common form and then optimize it with the rest of the C code out there?

The answer is - definitely yes! We just need to build a compiler that would do that job.

At the time of writing, the compiler is far from being done, but it works reasonably well, and I can successfully compile and run more than half of the mruby test suite.

As a sneak peek, here is an output from the test suite:

/opt/DragonRuby/FireStorm/cmake-build-llvm-14-asan/tests/MrbTests/firestorm_mrbtest
mrbtest - Embeddable Ruby Test

............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................?.........................................................................................................................
Skip: File.expand_path (with ENV)
 Total: 934
 OK: 933
 KO: 0
 Crash: 0
 Warning: 0
 Skip: 1
 Time: 0.45 seconds

Process finished with exit code 0

I hope this motivation gives you enough information on why someone would do what I am doing!

Let’s take a look at the approach I am taking to solve this problem - Compilers vs. Interpreters

How to learn compilers: LLVM Edition

Thu, 04 Nov 2021

This is a mirror of the Substack article
How to learn Compilers (LLVM Edition)
The most recent version is there.

Compilers and Programming Languages is a huge topic. You cannot just take a learning path and finish it at some point. There are many different areas, each of which is endless.

Here, I want to share some links that would help to learn compilers. The list cannot be exhaustive - everyone is busy, and no one has time to read the Dragon Book.

The main criteria behind each link:

I can personally recommend the material as I went through it
each entry should be relatively short and can be consumed in a reasonable time

I’m a big fan of learning through practicing. Thus the main focus is on LLVM, as you can go and do something cool with real-world programs!

The list consists of four groups: general theory, front-end, middle-end, and back-end.

At the first run, you can take the first item from each group, and it should put you on solid ground.

Disclaimer

There are a lot of excellent resources out there! Some of them are not on the list because of my subjective judgment, and the others are not here because I’ve never seen them!

Please share your favorite resource either via email or on Twitter.

General Theory / Introduction

AOSA book: LLVM. This is a chapter from the Architecture of Open Source Applications book. It is written by Chris Lattner and covers high-level LLVM design.
Compilers. The course is taught by Alex Aiken. In this course, you build a compiler for a real programming language from scratch. It covers the whole compilation pipeline: parsing, type-checking, optimizations, code generation. Besides practical parts, it also dives into the theory.
Automata Theory. The course is taught by Jeffrey Ullman. This one is pretty heavy on theory. It starts with relatively simple topics like state machines and finite automata (deterministic and otherwise). It gradually moves on to more complex things like Turing machines, computational complexity, the famous P vs. NP problem, etc.

Theory of Computation. This course is taught by Michael Sipser. It is similar to the one above but delivered in a different style. It goes into more detail on specific topics.

Front-end

The compiler front-end is where the interaction with the actual source code happens. The compiler parses the source code into an Abstract Syntax Tree (AST), does semantic analysis and type-checking, and converts it into the intermediate representation (IR).

The Compilers course from the above covers the general parts. Here are some links specific to Clang:

Understanding the Clang AST. This article is written by Jonas Devlieghere. It goes into detail and touches implementation details of Clang’s AST. It also has a lot of excellent links to dive deeper into the subject.
clang-tutor. This repository is maintained by Andrzej Warzyński. It contains several Clang plugins covering various topics, from simple AST traversals to more involved subjects such as automatic refactoring and obfuscation.

Middle-end

The middle-end is a place where various optimizations happen. Typically, the middle-ends use some intermediate representation. The intermediate representation of LLVM is usually referred to as LLVM IR or LLVM Bitcode. In a nutshell, it is a human-readable assembly language for a pseudo-machine (i.e., the IR does not target any specific CPU). The LLVM IR maintains certain properties: it is in a Static Single Assignment (SSA) form organized as a Control-Flow Graph (CFG).

LLVM IR Tutorial - Phis, GEPs and other things, oh my!. This is a great talk by Vince Bridgers and Felipe de Azevedo Piovezan.
Introduction to LLVM. A one-hour-long talk/tutorial from the LLVM Developers' Meeting given by Eric Christopher and Johannes Doerfert. Another great tutorial that builds on top of the previous video.
CS 6120: Advanced Compilers. The course is taught by Adrian Sampson. The title says “advanced,” but it covers what one would expect in a modern production-grade compiler: SSA, CFG, optimizations, various analyses.
Bitcode Demystified(🔌). This one is from me. It gives a high-level description of what LLVM Bitcode is.
llvm-tutor. This one is also from Andrzej Warzyński. It covers LLVM plugins (so-called passes) that allow one to analyze and transform the programs in the LLVM IR form.

Back-end

The last phase of the compilation is the back-end. This phase aims to convert the intermediate representation into machine code (zeros and ones). The zeros and ones can later be run on the CPU. Therefore, to understand the back-end, you need to understand the machine code and how CPUs work.

Build a Modern Computer from First Principles: From Nand to Tetris. Taught by Shimon Schocken and Noam Nisan. This course starts backward: first, you build the logic gates (and, or, xor, etc.), then use the logic gates to construct Arithmetic-Logic Unit (ALU), and then use the ALU to build the CPU. Then you learn how to control the CPU with zeros and ones (machine code), and eventually, you develop your assembler to convert the human-readable assembly into the machine code.
Parsing Mach-O files(🔌). This is a short article written by me. It shows how to parse object files on macOS (Mach-O). If you are on Linux or Windows, search for similar articles on ELF and PE/COFF files, respectively.
Performance Analysis and Tuning on Modern CPUs. The book by Denis Bakhvalov. While it is about performance, it gives an excellent introduction to how CPUs work.

Bonus points

Here are some more LLVM related channels I recommend looking at:

LLVM’s YouTube channel. Here you can find a lot of talks from developer meetings.
LLVM Weekly. A weekly newsletter run by Alex Bradbury. This is the single newsletter I am aware of that doesn’t have ads!
LLVM Blog. This is, well, LLVM’s blog.
LLVM Tutorials. Good starting points, even if you know nothing about compilers.
Embedded in academia. John Regehr’s blog has lots of goodies when it comes to LLVM and compilers!

Strings attached

As I mentioned in the beginning, Compilers is a huge field! If you go through the material above, you will learn a lot, but you will still have a few knowledge gaps in the whole compilation pipeline (I certainly do). But the good thing is - you’d know what the gaps are and how to address them!

Good luck!

LLVM meets Code Property Graphs

Tue, 23 Feb 2021

This is a cross-post from LLVM’s blog post LLVM meets Code Property Graphs

The code property graph (CPG) is a data structure designed to mine large codebases for instances of programming patterns via a domain-specific query language. It was first introduced in the proceedings of the IEEE Security and Privacy conference in 2014 (publication, PDF) in the context of vulnerability discovery in C system code and the Linux kernel in particular. The core ideas of the approach are the following:

the CPG combines several program representations into one
the CPG is stored in a graph database
the graph database comes with a DSL allowing to traverse and query the CPG

Currently, the CPG infrastructure is supported by several tools:

Ocular - a proprietary code analysis tool supporting Java, Scala, C#, Go, Python, and JavaScript languages
Joern - an open-source counterpart of Ocular supporting C and C++
Plume - an open-source tool supporting Java Bytecode

This article presents ShiftLeft’s open-source implementation of llvm2cpg - a standalone tool that brings LLVM Bitcode support to Joern. But before we dive into details, let us say a few more words about CPG and Joern.

Code Property Graph

The core idea of the CPG is that different classic program representations are merged into a property graph, a single data structure that holds information about the program’s syntax, control- and intra-procedural data-flow.

Graphically speaking, the following piece of code:

void foo() {
  int x = source();
  if (x < MAX) {
    int y = 2 * x;
    sink(y);
  }
}

combines these three different representations (an abstract syntax tree, a control-flow graph, and a program dependence graph):

into a single representation - Code Property Graph:

Joern

The property graph is stored in a graph database and made accessible via a domain-specific language (DSL) for graph traversals, which is used to identify programming patterns. The query language allows a seamless transition between the original code representations, making it possible to combine aspects of the code from the different views these representations offer.

One of the primary interfaces to the code property graphs is a tool called Joern. It provides the mentioned DSL and allows querying the CPG to discover specific properties of a program. Here are some examples of Joern’s DSL:

joern> cpg.typeDecl.name.p
List[String] = List("ANY", "int", "void")

joern> cpg.method.name.p
List[String] = List(
  "foo",
  "<operator>.multiplication",
  "source",
  "<operator>.lessThan",
  "<operator>.assignment",
  "sink"
)

joern> cpg.method("foo").ast.isControlStructure.code.p
List[String] = List("if (x < MAX)")

joern> cpg.method("foo").ast.isCall.map(c => c.file.name.head + ":" + c.lineNumber.get + "  " + c.name + ": " + c.code).p
List[String] = List(
  "main.c:2  <operator>.assignment: x = source()",
  "main.c:2  source: source()",
  "main.c:3  <operator>.lessThan: x < MAX",
  "main.c:4  <operator>.assignment: y = 2 * x",
  "main.c:4  <operator>.multiplication: 2 * x",
  "main.c:5  sink: sink(y)"
)

Besides the DSL, Joern comes with a data-flow tracker enabling more sophisticated queries, such as “is there a user controlled malloc in the program?”

The DSL is much more powerful than in the example, but that is out of scope of this article. Please, refer to the documentation to learn more.

LLVM and CPG

This part is split into two smaller parts: the first one covers a few implementation details, the second one shows an example of how to use llvm2cpg. If you are not interested in the implementation - scroll down :)

Implementation Details

When we decided to add LLVM support for CPG, one of the first questions was: how do we map bitcode representation onto CPG?

We took a simple approach - let’s pretend the SSA representation is just a flat source program. In other words, the following bitcode

define i32 @sum(i32 %a, i32 %b) {
  %r = add nsw i32 %a, %b
  ret i32 %r
}

can be seen as a C program:

i32 sum(i32 a, i32 b) {
  i32 r = add(a, b);
  return r;
}

From the high-level perspective, the approach is simple, but there are some tiny details we had to overcome.

Instruction semantics

We can map some of the LLVM instructions back onto the internal CPG operations. Here are some examples:

add, fadd -> <operator>.addition
bitcast -> <operator>.cast
fcmp eq, icmp eq -> <operator>.equals
urem, srem, frem -> <operator>.modulo
getelementptr -> a combination of <operator>.pointerShift, <operator>.indexAccess, and <operator>.memberAccess depending on the underlying types of the GEP operand

Most of these <operator>.*s have special semantics, which plays a crucial role in the Joern and Ocular built-in data-flow trackers.

Unfortunately, not every LLVM instruction has a corresponding operator in the CPG. In those cases, we had to fall back to function calls. For example:

select i1 %cond, i32 %v1, i32 %v2 turns into select(cond, v1, v2)
atomicrmw add i32* %ptr, i32 1 turns into atomicrmwAdd(ptr, 1) (same for any other atomicrmw operator)
fneg float %val turns into fneg(val)
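The two lists above can be read as a lookup table with a fallback. Here is a minimal Python sketch of that idea; it is not the llvm2cpg implementation. The table contains only a few of the mappings listed above, the function name `cpg_name` is hypothetical, and special cases such as getelementptr or atomicrmw (whose call name depends on the operation) are deliberately left out.

```python
# Illustrative only: a subset of the mappings listed in the article.
OPERATORS = {
    "add": "<operator>.addition",
    "fadd": "<operator>.addition",
    "bitcast": "<operator>.cast",
    "urem": "<operator>.modulo",
    "srem": "<operator>.modulo",
    "frem": "<operator>.modulo",
}


def cpg_name(opcode: str) -> str:
    """Return the CPG operator for an opcode, or fall back to a plain call name.

    `cpg_name` is a hypothetical helper, not part of llvm2cpg.
    """
    return OPERATORS.get(opcode, opcode)


for opcode in ("add", "srem", "select", "fneg"):
    print(f"{opcode:8} -> {cpg_name(opcode)}")
```

Opcodes without an entry (select, fneg) simply keep their own name, which mirrors the "fall back to function calls" rule.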

The only instruction we could not map to the CPG is the phi: CPG doesn’t have a Phi node concept. We had to eliminate phi instructions using reg2mem machinery.

Redundancy

For a small C program

int sum(int a, int b) {
  return a + b;
}

Clang emits a lot of redundant instructions by default

define i32 @sum(i32 %0, i32 %1) {
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  store i32 %0, i32* %3, align 4
  store i32 %1, i32* %4, align 4
  %5 = load i32, i32* %3, align 4
  %6 = load i32, i32* %4, align 4
  %7 = add nsw i32 %5, %6
  ret i32 %7
}

instead of a more concise version

define i32 @sum(i32 %0, i32 %1) {
  %3 = add nsw i32 %1, %0
  ret i32 %3
}

In general, this is not a problem, but it adds more complexity for the data-flow tracker and needlessly increases the graph’s size. One of the considerations was to run optimizations before emitting CPG for the bitcode. Still, in the end, we decided to offload this work to an end-user: if you want fewer instructions, then apply the optimizations manually before emitting the CPG.

Type Equality

The other issue is related to the way LLVM handles types. If two modules in the same context use the same struct with the same name, LLVM renames the other struct to prevent name collisions. For example

; Module 1
%struct.Point = type { i32, i32 }

and

; Module 2
%struct.Point = type { i32, i32 }

when loaded into the same context yield two types

%struct.Point = type { i32, i32 }
%struct.Point.1 = type { i32, i32 }

We wanted to deduplicate these types for a better user experience and only emit Point in the final graph.

The obvious solution was to consider two structs with “similar” names and the same layout to be the same. However, we could not rely on the llvm::StructType::isLayoutIdentical because, despite the name, it produces misleading results.

According to llvm::StructType::isLayoutIdentical, the structs Point and Pair have identical layouts, but PointWrap and PairWrap do not.

; these two have identical layout
%Point = type { i32, i32 }
%Pair = type { i32, i32 }

; these two DO NOT have identical layout
%PointWrap = type { %Point }
%PairWrap = type { %Pair }

This happens because llvm::StructType::isLayoutIdentical determines equality based on the pointers. That is, the layout is considered identical only if all the struct elements are the very same type objects (pointer-equal). It also meant we could not use this approach to compare types from different LLVM contexts. We had to roll out our custom solution based on Tree Automata to solve this issue.

There are a few more details, but the article is getting longer than it needs to be. So let’s look at how to use llvm2cpg with Joern.

Example

Once you have Joern and llvm2cpg installed, the usage is straightforward:

Convert a program into LLVM Bitcode
Emit CPG
Load the CPG into Joern and start the analysis

Here are the steps codified:

$ cat main.c
extern int MAX;
extern int source();
extern void sink(int);
void foo() {
  int x = source();
  if (x < MAX) {
    int y = 2 * x;
    sink(y);
  }
}
$ clang -S -emit-llvm -g -O1 main.c -o main.ll
$ llvm2cpg -output=/tmp/cpg.bin.zip main.ll

Now you get the CPG saved at /tmp/cpg.bin.zip which you can load into Joern and find if there is a flow from the source function to the sink:

$ joern
joern> importCpg("/tmp/cpg.bin.zip")
joern> run.ossdataflow
joern> def source = cpg.call("source")
joern> def sink = cpg.call("sink").argument
joern> sink.reachableByFlows(source).p
List[String] = List(
  """_____________________________________________________
| tracked               | lineNumber| method| file   |
|====================================================|
| source                | 5         | foo   | main.c |
| <operator>.assignment | 5         | foo   | main.c |
| <operator>.lessThan   | 6         | foo   | main.c |
| <operator>.shiftLeft  | 7         | foo   | main.c |
| <operator>.shiftLeft  | 7         | foo   | main.c |
| <operator>.assignment | 7         | foo   | main.c |
| sink                  | 8         | foo   | main.c |
"""
)

Which indeed exists!

Conclusion

To conclude, let us outline some of the advantages and constraints implied by LLVM Bitcode:

the “surface” of the LLVM language is smaller than that of C and C++
many high-level details do not exist at the IR level
the program must be compiled, thus limiting the range of programs that one can analyze with Joern

Here you can find more tutorials and information.

If you have any questions, feel free to ping Fabs or Alex on Twitter, or better, come over to the Joern chat.

Exploring LLVM Bitcode interactively

Fri, 28 Feb 2020

While working on a tool for software analysis, I find myself looking into the bitcode quite often. It works OK when there is one small file, but it’s incredibly annoying when it comes to real-world projects which have tens or hundreds of files.

To simplify my life, I built a tool that converts LLVM Bitcode into the GraphML format: llvm2graphml.

What is GraphML

GraphML is an XML-based file format for storing graphs. The beautiful part is that it is supported by many tools: you can use Neo4J, Cassandra, or TinkerPop to mine data or things like yEd or Gephi to visualize it.

My use-case is graph databases.

What is Graph Database

To understand what a graph database is, think of SQLite, but for property graphs. And a property graph is simply a graph where each vertex (or node) and edge may have several key-value properties.

The classical example: there is a number of people in the graph and they have some relationship, e.g.: ‘Alice -> knows -> Bob’, ‘Bob -> friends-with -> Eve’, etc. In this case, we can model a query like “Find friends of people whom Alice knows” in the form of a query language:

graph.vertex('person').has('name', 'Alice').edge('knows').edge('friends-with')

Each step narrows down the search space:

from a graph get all the vertices labeled ‘person’
among those select the ones that have the property ‘name’ with the value ‘Alice’
from the vertices select nodes through edges labeled ‘knows’
and from what’s left pick all the nodes reachable through the edges labeled ‘friends-with’

Note: this is an imaginary, simplified query language, but you’ve got the idea.
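To make the imaginary query a bit more concrete, here is a small Python sketch of a property graph with the same four steps. The classes `Graph` and `Traversal` and the helper `names` are made up for this illustration and do not correspond to any real graph database API.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    # vertex id -> (label, properties)
    vertices: dict = field(default_factory=dict)
    # list of (source id, edge label, target id)
    edges: list = field(default_factory=list)

    def add_vertex(self, vid, label, **props):
        self.vertices[vid] = (label, props)

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def vertex(self, label):
        # Step 1: all the vertices with the given label.
        ids = [v for v, (lbl, _) in self.vertices.items() if lbl == label]
        return Traversal(self, ids)


@dataclass
class Traversal:
    graph: Graph
    ids: list

    def has(self, key, value):
        # Step 2: keep only the vertices whose property matches.
        kept = [v for v in self.ids if self.graph.vertices[v][1].get(key) == value]
        return Traversal(self.graph, kept)

    def edge(self, label):
        # Steps 3 and 4: follow outgoing edges with the given label.
        nxt = [d for (s, lbl, d) in self.graph.edges if lbl == label and s in self.ids]
        return Traversal(self.graph, nxt)

    def names(self):
        return [self.graph.vertices[v][1]["name"] for v in self.ids]


graph = Graph()
for vid, name in enumerate(["Alice", "Bob", "Eve"]):
    graph.add_vertex(vid, "person", name=name)
graph.add_edge(0, "knows", 1)          # Alice -> knows -> Bob
graph.add_edge(1, "friends-with", 2)   # Bob -> friends-with -> Eve

result = graph.vertex("person").has("name", "Alice").edge("knows").edge("friends-with")
print(result.names())  # ['Eve']
```

Each call returns a new, smaller set of vertices, which is the "narrowing down" described in the steps above.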

llvm2graphml

Let me walk you through an example of how to use llvm2graphml. To follow along, you need to install llvm2graphml itself (prebuilt packages are available for macOS and Ubuntu) and the Gremlin Console from the Apache TinkerPop project.

There are essentially three steps:

1. Create main.ll file with the following content:

; main.ll
define i32 @increment(i32 %x) {
  %result = add i32 %x, 1
  ret i32 %result
}

2. Run llvm2graphml to emit the GraphML file:

> llvm2graphml --output-dir=/tmp main.ll
[info] More details: /tmp/llvm2graphml-38dfea.log
[info] Loading main.ll
[info] Saved result into /tmp/llvm.graphml.xml
[info] Shutting down

3. Create the database from the GraphML file

Start console:

> gremlin-console/bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin>

Create the database:

gremlin> graph = TinkerGraph.open()
gremlin> g = graph.traversal()
gremlin> g.io("/tmp/llvm.graphml.xml").read()
gremlin> g
==>graphtraversalsource[tinkergraph[vertices:12 edges:27], standard]

Now go and run some queries!

Example queries

List all modules:

gremlin> g.V().hasLabel('module').valueMap().unfold()
==>moduleIdentifier=[main.ll]

List all functions:

gremlin> g.V().hasLabel('function').valueMap().unfold()
==>argSize=[1]
==>basicBlockCount=[1]
==>name=[increment]
==>isDeclaration=[false]
==>isVarArg=[false]
==>isIntrinsic=[false]
==>numOperands=[0]
==>instructionCount=[2]

Count all the instructions:

gremlin> g.V().hasLabel('instruction').groupCount().by('opcode').unfold()
==>ret=1
==>add=1

Explore the types:

gremlin> g.V().hasLabel('type').valueMap('typeID').unfold()
==>typeID=[label]
==>typeID=[pointer]
==>typeID=[function]
==>typeID=[integer]
==>typeID=[void]

Find a function with an argument called ‘x’:

gremlin> g.V().has('argument', 'name', 'x').out('function').valueMap('name')
==>[name:[increment]]

Et cetera, et cetera, et cetera…

Some numbers

These are just some numbers mined from the libLLVMCore.a.

How many

Number of functions	71 019
Number of basic blocks	172 621
Number of instructions	1 212 322
Number of types	122 220

Top 10 instructions:

call	290 495
load	214 769
store	167 640
alloca	154 922
br	96 848
getelementptr	78 622
ret	67 729
bitcast	62 760
icmp	20 624
phi	9 716

Top 10 biggest functions:

llvm::UpgradeIntrinsicCall(llvm::CallInst*, llvm::Function*)	14033
llvm::Intrinsic::getAttributes(llvm::LLVMContext&, unsigned int)	8420
ShouldUpgradeX86Intrinsic(llvm::Function*, llvm::StringRef)	3635
llvm::LLVMContextImpl::~LLVMContextImpl()	2181
UpgradeIntrinsicFunction1(llvm::Function*, llvm::Function*&)	2006
(anonymous namespace)::Verifier::visitIntrinsicCall(unsigned int, llvm::CallBase&)	1887
(anonymous namespace)::AssemblyWriter::printInstruction(llvm::Instruction const&)	1869
llvm::ConstantFoldBinaryInstruction(unsigned int, llvm::Constant*, llvm::Constant*)	1244
upgradeAVX512MaskToSelect(llvm::StringRef, llvm::IRBuilder&, llvm::CallInst&, llvm::Value*&)	1073
llvm::ConstantFoldGetElementPtr(llvm::Type, llvm::Constant, bool, llvm::Optional, llvm::ArrayRef)	1055

Resources

Here are some links if you want to learn more about Gremlin Queries and what’s possible:

Next steps

Currently, the project is in its very early days, and many features are missing, to name a few: specific properties on instructions and values, def-use chains and other connections, complex constants (such as vectors of structs), and many more.

With that being said - contributions are welcome!

Type Equality in LLVM

Tue, 28 Jan 2020

Some months ago, I joined ShiftLeft Security to work on the LLVM support for the custom code analysis platform Ocular. During these months, we have faced and overcome several challenges.

Here I want to share one of them: Type Equality in LLVM.

Intro

LLVM’s type system is a complicated topic. It attempts to solve problems that are not so obvious when you look at them from a high level. Recently, I had a chance to dive deeper into the subject and discovered that while the current implementation makes some things more straightforward, some other parts are counter-intuitive and may not meet your expectations.

In this article, I want to describe some limitations of the LLVM type system and share how we solved one particular problem: detecting equivalent types in LLVM. The article is organized as follows: I start with the recap of the LLVM type system, followed by the problem statement, then describe how we attempted to solve the issue using existing LLVM features, and finally conclude with the solution we came up with.

LLVM Type System recap

It is highly recommended to read this post from Chris Lattner explaining some of the considerations that were taken into account when the type system was revised around LLVM 3.0: LLVM 3.0 Type System Rewrite.

Just a few random words on the current type system (if you didn’t read the linked article):

types belong to an LLVMContext
instances of each type are allocated on the heap (e.g., llvm::Type *type = new llvm::Type;)
type comparison is done via pointer comparison
types in LLVM go into three groups: primitive types (integers, floats, etc.), derived types (structs, arrays, pointers, etc.), forward-declared types (opaque structs)

Problem Statement

Consider the following example:

// Point.h
struct Point {
  int x;
  int y;
};

// foo.c
#include "Point.h"

// use struct Point

// bar.c
#include "Point.h"

// use struct Point

When foo.c and bar.c are compiled down to the LLVM IR (foo.ll and bar.ll), they both have the struct Point defined as follows:

%struct.Point = type { i32, i32 }

However, when both IR files are loaded into one context, the type names are changed to prevent name collisions, so they end up being defined as

%struct.Point = type { i32, i32 }
%struct.Point.0 = type { i32, i32 }

We want to deduplicate such types.

Our (failed) attempts

We made several attempts to solve the problem using simple heuristics and built-in LLVM features.

It went wrong in many ways.

‘Types with the same name are the same type’ (false)

This is a very simple heuristic:

%struct.Point = type { i32, i32 }
%struct.Point.0 = type { i32, i32 }

If we strip the numeric suffix that is added by LLVM, then the types have the same name, and therefore they are the same. It sounds like a good idea, but it does not work. The following is perfectly valid LLVM bitcode:

%struct.Point = type { i32, i32 }
%struct.Point.0 = type { float, float, float }

for which our heuristic gives the wrong answer: the names match after stripping the suffix, but the layouts differ.

Primitive Types Equality

In LLVM, types belong to the LLVMContext. Primitive types such as int32, float, or double are pre-allocated and then reused. In the context of LLVMContext (pun intended), you can only create one instance of a primitive type. With this solution, it is easy to check if types are the same - simply compare the pointers.

However, this solution cannot work if you want to compare types from different contexts. According to LLVM, int32 from one LLVMContext differs from int32 from another LLVMContext, even though they are the same type according to intuition.
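The per-context uniqueness can be illustrated with a toy model in Python. The classes `Context` and `IntType` below are hypothetical stand-ins for LLVMContext and an integer type, not LLVM APIs; Python’s `is` operator plays the role of pointer comparison.

```python
class IntType:
    """Toy integer type. It defines no __eq__, so it is compared by identity."""

    def __init__(self, bits):
        self.bits = bits


class Context:
    """Toy stand-in for LLVMContext: owns and reuses its type objects."""

    def __init__(self):
        self._int_types = {}

    def int_type(self, bits):
        # Create the type object once per context, then hand out the same one.
        if bits not in self._int_types:
            self._int_types[bits] = IntType(bits)
        return self._int_types[bits]


ctx_a, ctx_b = Context(), Context()

# Same context: the very same object, so the "pointer comparison" succeeds.
same_context = ctx_a.int_type(32) is ctx_a.int_type(32)

# Different contexts: two distinct objects, even though both describe i32.
cross_context = ctx_a.int_type(32) is ctx_b.int_type(32)

print(same_context, cross_context)  # True False
```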

Struct Types Equality

This situation gets even more complicated when it comes to identified (named) structs.

Consider the same example I gave initially.

// Point.h
struct Point {
  int x;
  int y;
};

// foo.c
#include "Point.h"

// use struct Point

// bar.c
#include "Point.h"

// use struct Point

So far so good, but as mentioned previously, LLVM keeps both types and renames one of them to prevent name collisions:

%struct.Point = type { i32, i32 }
%struct.Point.0 = type { i32, i32 }

Even though these are the same types from a user perspective, they are very different from LLVM’s point of view. Therefore, we cannot use pointer comparison: the types are distinct and point to different memory regions. In this case, the best we can do is to compare the layout of the types and consider them equal if the layouts are identical.

The good part is that LLVM has a function for that: llvm::StructType::isLayoutIdentical.

The bad part is that this function is broken. Consider the following example:

%struct.Point = type { i32, i32 }
%struct.Point.0 = type { i32, i32 }

%struct.wrapper = type { %struct.Point }
%struct.wrapper.0 = type { %struct.Point.0 }

According to LLVM, the layouts of struct.Point and struct.Point.0 are identical, while the layouts of struct.wrapper and struct.wrapper.0 are not: isLayoutIdentical returns true only when all the type elements of the struct are equal. And this equality is checked via pointer comparison.
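The behaviour described above can be reproduced with a toy model. The Python sketch below is not LLVM code: the class `Type` and the function `is_layout_identical` are hypothetical, and the check only imitates the "compare elements by pointer" rule, with `is` standing in for pointer comparison.

```python
class Type:
    """Toy type object; like LLVM types, it is compared by identity."""

    def __init__(self, name, elements=()):
        self.name = name
        self.elements = list(elements)


def is_layout_identical(a, b):
    # Shallow check: same element count and pairwise *identical* element objects.
    return len(a.elements) == len(b.elements) and all(
        x is y for x, y in zip(a.elements, b.elements)
    )


i32 = Type("i32")
point = Type("struct.Point", [i32, i32])
point0 = Type("struct.Point.0", [i32, i32])
wrapper = Type("struct.wrapper", [point])
wrapper0 = Type("struct.wrapper.0", [point0])

# Both structs hold the same i32 object twice, so the check succeeds.
print(is_layout_identical(point, point0))      # True

# The wrappers hold two distinct struct objects, so the check fails,
# even though the nested layouts are the same.
print(is_layout_identical(wrapper, wrapper0))  # False
```

The check never looks inside the elements, which is why one level of nesting is enough to defeat it.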

`IRLinker`/`llvm-link`

LLVM has a class that merges two modules into one: IRLinker. LLVM also comes with a CLI tool, llvm-link, which does the same. The IRLinker works, but it is far from ideal for our purposes: it drops important information.

The following IR after running through IRLinker

%struct.Point = type { i32, i32 }
%struct.Tuple = type { i32, i32 }

becomes

%struct.Point = type { i32, i32 }

dropping the other struct since both have the same layout. We don’t want to lose this information.

Moreover, IRLinker does another kind of magic that may introduce types that never existed at the source code level. This is what I’ve seen after running llvm-link on the XNU kernel bitcode:

%struct.tree_desc_s = type {
  %struct.ct_data_s*,
  i32,
  %struct.mach_msg_body_t*
}
%struct.tree_desc_s.79312 = type {
  %struct.ct_data_s*,
  i32,
  %struct.static_tree_desc_s*
}

Notice the different types of the third element: struct.mach_msg_body_t* vs struct.static_tree_desc_s*, even though there is only one definition of tree_desc_s at the source code level:

struct tree_desc_s {
  ct_data *dyn_tree;
  int     max_code;
  static_tree_desc *stat_desc;
};

So the IRLinker did something odd, at which point I gave up all the attempts to understand how it works and what it does.

Our solution to this problem

I could not find any other solution to the problem, so we decided to roll out our own.

A bit of background

Our implementation is inspired by Tree Automata and Ranked Alphabets.

Here is a short description: a ranked alphabet consists of a finite set of symbols F, and a function Arity(f), where f belongs to the set F. The Arity tells how many arguments a symbol f has. Symbols can be constant, unary, binary, ternary, or n-ary.

Here is an example of the notation: a, b, f(,), g(), h(,,,,). a and b are constants, f(,) is binary, g() is unary, and h(,,,,) is n-ary. The arity of each symbol is 0, 0, 2, 1, and 5, respectively.

Given the alphabet a, b, f(,), g() we can construct a number of trees:

f(a, b)
g(b)
g(f(b, b))
f(g(a), f(f(a, a), b))
f(g(a), g(f(a, a)))

etc.

If we know the arity of each symbol, then we can omit parentheses and commas and write the tree as a string. The tree is traversed in depth-first order. Here are the same examples as above, but in the string notation:

fab
gb
gfbb
fgaffaab
fgagfaa
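The conversion from trees to strings can be sketched in a few lines of Python. The tuple-based tree representation and the function name `to_string` are choices made for this illustration only.

```python
# The ranked alphabet from the example: symbol -> arity.
ARITY = {"a": 0, "b": 0, "f": 2, "g": 1}


def to_string(tree):
    """Serialize a tree given as (symbol, child, child, ...) in depth-first order."""
    symbol, *children = tree
    # Knowing the arity is what makes parentheses and commas redundant.
    assert len(children) == ARITY[symbol], f"wrong arity for {symbol}"
    return symbol + "".join(to_string(child) for child in children)


a, b = ("a",), ("b",)
trees = [
    ("f", a, b),
    ("g", b),
    ("g", ("f", b, b)),
    ("f", ("g", a), ("f", ("f", a, a), b)),
    ("f", ("g", a), ("g", ("f", a, a))),
]
strings = [to_string(t) for t in trees]
print(strings)  # ['fab', 'gb', 'gfbb', 'fgaffaab', 'fgagfaa']
```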

Here is a more comprehensive example:

The arrows show the depth-first order.

We can map our type equivalence problem on the ranked alphabet/tree automaton concepts.

Type Equality

We consider each type to be a symbol, and its arity is the number of properties we want to compare. Then, we build a tree of the type and convert it to the string representation. If two types have the same string representation, then they are equal.

Some examples:

i32, i64, i156: symbol I, arity is 1 since we only care about bitwidth (e.g., 32, 64, 156)
float: symbol F, arity is 0, all float types are the same
[16 x i32]: symbol A, arity is 2, we care only about the length of the array and its element type
i8*: symbol P, arity is 1, we care only about the pointee type
{ i32, [16 x i8], i8* }: symbol S, arity is number of elements + 2. We want to store the struct ID and number of its elements.

If we care about more or fewer values, then we can simply change the arity for a given symbol. Examples of types represented as a tree:

i32 -> I(32) -> I32
i177 -> I(177) -> I177
[16 x i8*] -> A(16, P(I(8))) -> A16PI8
{ i32, i8*, float } -> S(3, S0, I(32), P(I(8)), F) -> S3S0I32PI8F

Note: the values in S are the number of elements (3), struct ID (S0), and all its contained types defined recursively.
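Here is a minimal Python sketch of this encoding for non-recursive types. It is not the original implementation: the tuple-based type descriptions and the function `encode` are assumptions made for the example, struct IDs are simply handed out in the order the structs are visited, and recursive structs are not handled here.

```python
def encode(ty, counter=None):
    """Encode a type description as a string, depth-first.

    Types are plain tuples: ("int", bits), ("float",), ("array", length, element),
    ("ptr", pointee), ("struct", [elements]).
    """
    counter = [0] if counter is None else counter  # next free struct ID
    kind = ty[0]
    if kind == "int":
        return f"I{ty[1]}"
    if kind == "float":
        return "F"
    if kind == "array":
        return f"A{ty[1]}" + encode(ty[2], counter)
    if kind == "ptr":
        return "P" + encode(ty[1], counter)
    if kind == "struct":
        struct_id = counter[0]
        counter[0] += 1
        body = "".join(encode(element, counter) for element in ty[1])
        # Number of elements, struct ID, then the elements themselves.
        return f"S{len(ty[1])}S{struct_id}{body}"
    raise ValueError(f"unknown type kind: {kind}")


i8_ptr = ("ptr", ("int", 8))
print(encode(("int", 32)))                                   # I32
print(encode(("int", 177)))                                  # I177
print(encode(("array", 16, i8_ptr)))                         # A16PI8
print(encode(("struct", [("int", 32), i8_ptr, ("float",)])))  # S3S0I32PI8F
```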

Same types, but represented graphically:

Structural Equality

Above, I mentioned the struct ID. We need it to define the structural equality for recursive types. Consider the following example:

%list = type { %list*, i32 }
%node = type { %node*, i32 }
%root = type { %node*, i32 }

All of the above structs have the same layout: a pointer + an integer. But we do not consider them all to be equal. By our definition of equality the following holds:

list == node
root != node
root != list

The reasoning is simple: list and node have the same layout and the same structure (recursive), while root has another structure.

Here is a graphical representation to highlight the idea. If we discard the struct titles, then it’s clear the first two are equal while the third one is distinct.

To take the structure into account and to make the equality hold, we do not use the names of the structures; instead, before building the tree, we assign them symbolic names or IDs. So both list and node are encoded as follows: S(2, S0, P(S(2, S0, x, x)), I(32)), where S0 is the struct ID. To terminate the recursion, we do not re-emit types for a structure that has already been emitted; we emit the symbol x in place of each element instead (otherwise we would not respect the arity of the struct).

The root is encoded as follows: S(2, S0, P(S(2, S1, P(S(2, S1, x, x)), I(32))), I(32)). Note the nesting and the S0 and S1 struct IDs.

Given these two encodings, the comparison above holds.
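The handling of recursive structs can be sketched as follows. This is a Python illustration under my own assumptions, not the original implementation: structs are described in a dictionary, IDs are assigned in visiting order, and the helper names `encode` and `encode_struct` are hypothetical. The parenthesized notation is kept for readability, with brackets placed according to the arity rules above.

```python
# Struct definitions from the example: name -> element types.
STRUCTS = {
    "list": [("ptr", ("struct", "list")), ("int", 32)],
    "node": [("ptr", ("struct", "node")), ("int", 32)],
    "root": [("ptr", ("struct", "node")), ("int", 32)],
}


def encode(ty, ids):
    kind = ty[0]
    if kind == "int":
        return f"I({ty[1]})"
    if kind == "ptr":
        return f"P({encode(ty[1], ids)})"
    if kind == "struct":
        name = ty[1]
        elements = STRUCTS[name]
        if name in ids:
            # Already emitted: stop the recursion, but keep the arity with x's.
            body = ", ".join("x" for _ in elements)
        else:
            # First visit: assign the next symbolic ID, then emit the elements.
            ids[name] = len(ids)
            body = ", ".join(encode(element, ids) for element in elements)
        return f"S({len(elements)}, S{ids[name]}, {body})"
    raise ValueError(f"unknown type kind: {kind}")


def encode_struct(name):
    # Every top-level encoding starts with a fresh set of IDs.
    return encode(("struct", name), {})


print(encode_struct("list"))
print(encode_struct("node"))
print(encode_struct("root"))
```

Because the IDs replace the names, list and node produce the same string, while root produces a longer one with two IDs.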

Opaque Struct Equality

Comparing opaque structs is as easy as the comparison of infinities. It’s totally up to us how we define this property.

The right and sound approach is to say that an opaque struct is equal only to itself, but we need to do better than this.

For opaque structs, we also use symbolic names. But different opaque structs get the same symbolic name as soon as they have the same canonical name.

Example:

%struct.A = type opaque
%struct.A.0 = type opaque
%struct.B = type opaque

%foo = type { %struct.A* }
%bar = type { %struct.A.0* }
%buzz = type { %struct.B* }

Here, the canonical names for the opaque structs are A (%struct.A, %struct.A.0) and B (%struct.B). Therefore, we treat %struct.A and %struct.A.0 as equal, while %struct.B is not equal to either of the As, even though all three structs can point to the same type or to completely different types.
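One possible way to derive such a canonical name is sketched below in Python. The exact rule is an assumption on my side (drop the "struct." prefix and a trailing numeric suffix), and the function `canonical_name` is hypothetical. Note that this rule would also merge a struct whose real name ends in a number.

```python
import re


def canonical_name(name):
    """Strip a trailing numeric suffix such as '.0' and the 'struct.' prefix.

    Assumed rule for illustration; not taken from the original implementation.
    """
    name = re.sub(r"\.\d+$", "", name)
    prefix = "struct."
    return name[len(prefix):] if name.startswith(prefix) else name


names = ["struct.A", "struct.A.0", "struct.B"]
print([canonical_name(n) for n in names])  # ['A', 'A', 'B']
```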

Letters, symbols, and IDs

While IMO, letters and symbols are easier to work with for a human being, I implemented all the encodings as vectors of numbers. It is then easy to get a hash of such vector and add some memoization for better performance, even though I didn’t spend any time measuring and looking for bottlenecks.

Conclusion

To conclude, I’d say that one should not rely on the built-in capabilities of LLVM to compare types. In fact, IRLinker uses a very different algorithm.

The algorithm I described has drawbacks, and I probably missed some edge cases. But anyway, I would love to get some feedback on it, and I hope it may help someone who gets into a similar situation.

Building an LLVM-based tool. Lessons learned

Thu, 04 Jul 2019

This article is a text version of my recent EuroLLVM talk called Building an LLVM-based tool: lessons learned.

Intro

For the last three years, I have been working on a tool for mutation testing: Mull. It is based on LLVM and targets C and C++ primarily. What makes it interesting?

it works on Linux, macOS, and FreeBSD
it supports any version of LLVM starting from 3.9
it is fast because of JIT and parallelization
packaging and distribution are done in one click

Keep reading if you want to know how it works and how to apply it to your project.

The Build System

llvm-config

The most famous way to connect LLVM as a library is to use llvm-config. The simplest llvm-config-based build system:

> clang++ -c `llvm-config --cxxflags` foo.cpp -o foo.o
> clang++ -c `llvm-config --cxxflags` bar.cpp -o bar.o

> clang++ bar.o foo.o `llvm-config --ldflags` `llvm-config --libs core support` -o foobar.bin

It works quite well in the very beginning, but there are some issues with it.

The compiler flags: llvm-config --cxxflags gives you the flags LLVM was compiled with, which are not necessarily the flags you want for your project. Let’s look at the example:
```
-I/opt/llvm/6.0.0/include
-Werror=unguarded-availability-new
-O3 -DNDEBUG
...
```
The first flag is correct, and you need it. The second one is specific to Clang: it may not work with gcc, and it may not work with an older version of Clang itself. The rest (-O3 -DNDEBUG) will force you to compile your project in release mode. It’s fine, but not always desirable.
The linker flags. llvm-config --ldflags does the right job. It tells where to look for the libraries and tweaks some other linker settings. llvm-config --libs <components> also does the right job. It prints the set of libraries you need to link against to use the specified components (you can see the whole list of components via llvm-config --components). However, there is a weird edge case. If, on your system, you have installed several versions of LLVM, and they come with a dynamic library, e.g.:
```
/usr/lib/llvm-4.0/lib/libLLVM.dylib
/usr/lib/llvm-6.0/lib/libLLVM.dylib
```
Then, you may get a runtime error after successful linking:
```
> clang++ foo.o bar.o -lLLVMSupport -o foobar.bin
> ./foobar.bin
LLVM ERROR: inconsistency in registered CommandLine options
```
To prevent this from happening, you should instead link against the dynamic library:
```
> clang++ foo.o bar.o -lLLVM -o foobar.bin
> ./foobar.bin
Yay! We are good to go now!
```
To handle this case properly, you need to check the presence of the libLLVM.dylib on your system somehow. Alternatively, use CMake (see the next part).
The linking order. As I said, llvm-config --libs does the right job, but it only applies to the LLVM libraries. If you also want to use Clang libraries with llvm-config, then you are in trouble: the libraries should be placed in the right order. It may work, or may not. The problem arises only on Linux. Either you manually re-order the Clang libraries until it compiles, or you wrap the libraries list into the --start-group/--end-group. That’s a reasonable solution, but it does not work on macOS. Before migrating to CMake we ended up with something like this:
```
if macOS
LDFLAGS=-lLLVM -lclangEdit
else
LDFLAGS=-Wl,--start-group -lLLVM -lclangEdit -Wl,--end-group
endif
clang++ foo.o bar.o $LDFLAGS -o foobar.bin
```

Quite frankly, llvm-config is rather a suboptimal solution for the long run…

CMake

LLVM itself uses CMake as its primary build system. LLVM engineers put an enormous amount of work into making it very friendly to the LLVM users.

Note: I assume that you understand CMake, otherwise I suggest you build the mental model through this short article: Bottom-up CMake introduction.

Adding LLVM and Clang as a dependency through CMake is reasonably straightforward:

find_package(LLVM REQUIRED CONFIG
             PATHS ${search_paths}
             NO_DEFAULT_PATH)
find_package(Clang REQUIRED CONFIG
             PATHS ${search_paths}
             NO_DEFAULT_PATH)

Please, note the ${search_paths} and the NO_DEFAULT_PATH.

This is the ${search_paths} in our case:

set (search_paths
  ${PATH_TO_LLVM}
  ${PATH_TO_LLVM}/lib/cmake
  ${PATH_TO_LLVM}/lib/cmake/llvm
  ${PATH_TO_LLVM}/lib/cmake/clang
  ${PATH_TO_LLVM}/share/clang/cmake/
  ${PATH_TO_LLVM}/share/llvm/cmake/
)

The PATH_TO_LLVM is provided to CMake externally by the user.

Bold statement: You should not rely on the ‘use whatever is installed on the machine,’ but explicitly provide the path to the LLVM installation.

Bold statement: For development, you should not use LLVM/Clang provided by your Linux distro, but instead, install it manually using official precompiled binaries.

You can ignore the above statements if you only use LLVM libraries. If you also need Clang libraries, then you may get into trouble. On Ubuntu, some versions of Clang shipped with broken CMake support:

CMake Error at /usr/share/llvm-6.0/cmake/ClangConfig.cmake:18 (include):
  include could not find load file:

    /usr/lib/cmake/clang/ClangTargets.cmake
Call Stack (most recent call first):
  CMakeLists.txt:8 (find_package)

Search on the Internets for “CMake cannot find ClangConfig” to see how many projects and users suffered from this.

Once find_package succeeds, you get the LLVM_INCLUDE_DIRS variable and a bunch of LLVM targets you can use:

target_include_directories(mull PUBLIC ${LLVM_INCLUDE_DIRS})
target_link_libraries(mull LLVMSupport clangTooling)

Except there is the

LLVM ERROR: inconsistency in registered CommandLine options

runtime error. To handle it with CMake, consider using the following snippet:

if (LLVM IN_LIST LLVM_AVAILABLE_LIBS)
  target_link_libraries(mull LLVM clangTooling)
else()
  target_link_libraries(mull LLVMSupport clangTooling)
endif()

That should do the trick.

Supporting multiple LLVM versions

There are at least two ways to support several versions of LLVM. You can add a bunch of #ifdefs to the source code. This is how Klee does it, and it seems to work pretty well for them.

Example #1:

#if LLVM_VERSION_CODE >= LLVM_VERSION(4, 0)
#include <llvm/Bitcode/BitcodeReader.h>
#else
#include <llvm/Bitcode/ReaderWriter.h>
#endif

Example #2:

#if LLVM_VERSION_CODE >= LLVM_VERSION(5, 0)
  assert(ii->getNumOperands() == 3 && "wrong number of arguments");
#else
  assert(ii->getNumOperands() == 2 && "wrong number of arguments");
#endif

The other way, the one Mull uses, is to provide a façade library. Mull has several libraries with the same interface, but with slightly different implementations. They are simply pairs of a header and .cpp file:

> tree LLVMCompatibility/
LLVMCompatibility/
├── 3.9.x
│   ├── CMakeLists.txt
│   ├── LLVMCompatibility.cpp
│   └── LLVMCompatibility.h
├── 4.x.x
│   ├── CMakeLists.txt
│   ├── LLVMCompatibility.cpp
│   └── LLVMCompatibility.h
...
├── 8.x.x
│   ├── CMakeLists.txt
│   ├── LLVMCompatibility.cpp
│   └── LLVMCompatibility.h

Then, we can use CMake to decide which version to use:

set (llvm_patch_version "${LLVM_VERSION_MAJOR}.${LLVM_VERSION_MINOR}.${LLVM_VERSION_PATCH}")
set (llvm_minor_version "${LLVM_VERSION_MAJOR}.${LLVM_VERSION_MINOR}.x")
set (llvm_major_version "${LLVM_VERSION_MAJOR}.x.x")

set (full_llvm_version ${llvm_patch_version})

if (EXISTS ${CMAKE_CURRENT_LIST_DIR}/LLVMCompatibility/${llvm_patch_version})
  set (LLVM_COMPATIBILITY_DIR ${llvm_patch_version})

elseif(EXISTS ${CMAKE_CURRENT_LIST_DIR}/LLVMCompatibility/${llvm_minor_version})
  set (LLVM_COMPATIBILITY_DIR ${llvm_minor_version})

elseif(EXISTS ${CMAKE_CURRENT_LIST_DIR}/LLVMCompatibility/${llvm_major_version})
  set (LLVM_COMPATIBILITY_DIR ${llvm_major_version})

else()
  message(FATAL_ERROR "LLVM-${full_llvm_version} is not supported")
endif()

add_subdirectory(LLVMCompatibility/${LLVM_COMPATIBILITY_DIR})

What happens here: CMake is looking for a directory with the compatibility layer for the given LLVM version in a special order. For example, for the version 8.0.1 it will do the following:

Use LLVMCompatibility/8.0.1 if it exists
Use LLVMCompatibility/8.0.x if it exists
Use LLVMCompatibility/8.x.x if it exists
Give up and fail
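The same lookup order, expressed as a small C++ sketch for illustration. The function pickCompatibilityDir and its exists callback are hypothetical and only mirror what the CMake snippet above does.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Returns the most specific compatibility directory that exists for the
// given LLVM version, or an empty string if none of the candidates exist.
// `exists` is a stand-in for the EXISTS check in the CMake script.
std::string pickCompatibilityDir(
    unsigned major, unsigned minor, unsigned patch,
    const std::function<bool(const std::string &)> &exists) {
  const std::string ma = std::to_string(major);
  const std::string mi = std::to_string(minor);
  const std::string pa = std::to_string(patch);
  const std::vector<std::string> candidates = {
      ma + "." + mi + "." + pa,  // e.g. 8.0.1
      ma + "." + mi + ".x",      // e.g. 8.0.x
      ma + ".x.x",               // e.g. 8.x.x
  };
  for (const std::string &candidate : candidates)
    if (exists(candidate))
      return candidate;
  return "";  // the CMake version fails with FATAL_ERROR here
}
```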

As soon as it finds the right folder, it will include it in the build process. So far we have used only <number>.x.x, but the idea is that we can provide a particular library for any version of LLVM if we need to. Here is what two header files look like:

Then, in the source code we simply use the compatibility layer instead of a bunch of ifdefs:

auto module = llvm_compat::parseBitcode(buffer.getMemBufferRef(),
                                        context);

Sources VS Binaries

So far I only covered builds against precompiled binary versions of LLVM. However, there are reasons you should also build against the source code. Look at the table:

Build time against precompiled versions is much faster, but you give up the ability to debug LLVM itself, which is needed when you hit a bug or some weird behavior. Another significant drawback: asserts. They are disabled in the release builds you get from http://releases.llvm.org. In fact, we did violate some of the LLVM constraints but didn’t realize it until somebody tried to build Mull against the source code.

You can easily teach CMake to build against source code and against precompiled libraries at the same time.

Here is the trick:

if (EXISTS ${PATH_TO_LLVM}/CMakeLists.txt)
  add_subdirectory(${PATH_TO_LLVM} llvm-build-dir)

  # LLVM_INCLUDE_DIRS ???
  # LLVM_VERSION ???
else()
  ...
endif()

If the PATH_TO_LLVM contains CMakeLists.txt, then we are building against the source code. Otherwise, the behavior is the same as written in the previous paragraphs.

However, LLVM_INCLUDE_DIRS and LLVM_VERSION are not available in this case. We can fix that with these tricks:

get_target_property(LLVM_INCLUDE_DIRS
                    LLVMSupport
                    INCLUDE_DIRECTORIES)

It will fill in the LLVM_INCLUDE_DIRS with the right header search paths.

The LLVM_VERSION is a bit less trivial: we need to parse the CMakeLists.txt:

macro(get_llvm_version_component input component)
  string(REGEX MATCH "${component} ([0-9]+)" match ${input})
  if (NOT match)
    message(FATAL_ERROR "Cannot find LLVM version component '${component}'")
  endif()
  set (${component} ${CMAKE_MATCH_1})
endmacro()

file(READ ${PATH_TO_LLVM}/CMakeLists.txt LLVM_CMAKELISTS)
get_llvm_version_component("${LLVM_CMAKELISTS}" LLVM_VERSION_MAJOR)
get_llvm_version_component("${LLVM_CMAKELISTS}" LLVM_VERSION_MINOR)
get_llvm_version_component("${LLVM_CMAKELISTS}" LLVM_VERSION_PATCH)
set (LLVM_VERSION ${LLVM_VERSION_MAJOR}.${LLVM_VERSION_MINOR}.${LLVM_VERSION_PATCH})

The macro will extract all the information we need from this piece of text (llvm/CMakeLists.txt):

if(NOT DEFINED LLVM_VERSION_MAJOR)
  set(LLVM_VERSION_MAJOR 6)
endif()
if(NOT DEFINED LLVM_VERSION_MINOR)
  set(LLVM_VERSION_MINOR 0)
endif()
if(NOT DEFINED LLVM_VERSION_PATCH)
  set(LLVM_VERSION_PATCH 1)
endif()
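For readers who are more at home in C++ than in CMake, here is roughly the same extraction logic as a sketch. The helper versionComponent is hypothetical; it uses the same pattern as the macro above ("<component> <number>") and returns -1 where the macro would raise a fatal error.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns the number that follows `component` and a single space in `text`,
// or -1 if there is no such match.
int versionComponent(const std::string &text, const std::string &component) {
  const std::regex pattern(component + " ([0-9]+)");
  std::smatch match;
  if (!std::regex_search(text, match, pattern))
    return -1;
  return std::stoi(match[1].str());
}
```

Note that the line if(NOT DEFINED LLVM_VERSION_MAJOR) does not match, because the component name is followed by a closing parenthesis there, not by a space and a number.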

That’s it. We are ready to build against LLVM’s source code.

Parallelization

Bold statement: Avoid using LLVM Passes for better parallelization (explanation follows).

Any LLVM-based tool is an excellent candidate for parallelization: if you have 20 tasks and 4 cores, then you can run 5 tasks on each core and then merge the results. However, LLVM is not very friendly when it comes to parallelization: lots of classes are not thread-safe.

Let’s consider this picture:

There are three phases: loading, analysis, and transformation:

We load two modules (#1, #2) within Thread 1, and the third module (#3) within Thread 2. What’s important is that each thread should have its own LLVMContext!
The next phase is the analysis. At this point we only read information from LLVM IR, so we can distribute all 8 functions (F1-F8) evenly across two threads: Thread 1 analyzes F1-F4, and Thread 2 deals with F5-F8.
Transformation. It is essential to ensure that any transformation of a module does not escape the module’s thread boundaries: even such ‘minor’ changes as renaming an instruction are not thread-safe.

Note: of course you can put there lots of locks, but what’s the point of parallelization then?
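The analysis phase is the easy one to sketch, since it only reads data. The snippet below is a simplified illustration, not Mull’s code: Function and analyze are hypothetical stand-ins for an llvm::Function and a read-only analysis, and the work is split into contiguous slices so that each thread writes only to its own bucket and no locks are needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for a function and a read-only analysis of it.
struct Function {
  int instructions;
};
int analyze(const Function &function) { return function.instructions; }

// Analyzes all functions on `workers` threads and merges the results in the
// original order. Shared input is only read; each thread owns one bucket.
std::vector<int> analyzeInParallel(const std::vector<Function> &functions,
                                   std::size_t workers) {
  assert(workers > 0);
  std::vector<std::vector<int>> buckets(workers);
  std::vector<std::thread> threads;
  const std::size_t chunk = (functions.size() + workers - 1) / workers;
  for (std::size_t w = 0; w < workers; ++w) {
    threads.emplace_back([&functions, &buckets, chunk, w] {
      const std::size_t begin = std::min(w * chunk, functions.size());
      const std::size_t end = std::min(begin + chunk, functions.size());
      for (std::size_t i = begin; i < end; ++i)
        buckets[w].push_back(analyze(functions[i]));
    });
  }
  for (std::thread &thread : threads)
    thread.join();

  std::vector<int> merged;
  for (const std::vector<int> &bucket : buckets)
    merged.insert(merged.end(), bucket.begin(), bucket.end());
  return merged;
}
```

With 8 functions and 2 workers, the first thread handles the first four functions and the second thread the remaining four, just like F1-F4 and F5-F8 above. Depending on the platform, you may need to compile with -pthread.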

Now I can tell why you should avoid LLVM Passes: this approach incentivizes you to merge analysis and transformation into one phase, and therefore lose the ability to parallelize efficiently. (There are other issues with LLVM Passes, but it’s a different topic).

Also, LLVM’s PassManagers are not (yet?) parallelization-friendly.

My advice here is to start with separate analysis & transformation phases. It’s easier to implement and easier to test. You can wrap these phases into an LLVM pass later if needed.

And of course, you should always measure the performance. Here is one of our measurements:

You may get the opposite results.

Getting Bitcode

Once every 2-3 months, there is a question on the mailing lists: “How do I compile my program to bitcode?” Clearly, there is a demand for that.

The most common answer I’ve seen is the whole-program-llvm. It’s a great tool, and I can also recommend using it, but keep in mind that it produces one large bitcode file as output. Therefore, you cannot get the benefits of your multicore machine.

There are a few other ways to get the bitcode:

-emit-llvm: passing this flag to the compiler will give you an LLVM Bitcode/IR file as an output. It will break the linking phase of your build system, though.
-flto: with this flag all the intermediate object files will, in fact, be LLVM Bitcode files. The program will compile just fine. It won’t work though if you don’t have any intermediate object files in the pipeline (e.g. clang foo.c bar.c -o foobar)
-fembed-bitcode: this should be your choice! Clang will compile your program just fine, but it will also include a special section into the binary containing all the Bitcode files (Learn More). You can extract the Bitcode from the binary programmatically using my fork of the awesome LibEBC tool.

Multi-OS Support

For more straightforward support of several operating systems, I highly recommend these two tools: Vagrant and Ansible.

Vagrant allows you to manage virtual machines easily:

config.vm.define "debian" do |cfg|
  cfg.vm.box = "debian/stretch64"
  cfg.vm.provision "ansible" do |ansible|
    ansible.verbose = "v"
    ansible.playbook = "debian-playbook.yaml"
  end
end

config.vm.define "ubuntu" do |cfg|
  cfg.vm.box = "ubuntu/xenial64"
  cfg.vm.provision "ansible" do |ansible|
    ansible.verbose = "v"
    ansible.playbook = "ubuntu-playbook.yaml"
  end
end

With this config you can create a VM ready for use:

vagrant up debian
vagrant up ubuntu

Vagrant also allows you to provision the machine using various provisioners: from old-school shell scripts to modern tools such as Chef and Ansible.

I prefer Ansible as it is the most straightforward tool, in my opinion. Basically, an Ansible playbook is a shell script on steroids. Here is what a part of it looks like:

packages:
  - fish
  - vim
  - wget
  - git
  - cmake
  - ninja-build
  - libz-dev
  - libsqlite3-dev
  - ncurses-dev
  - libstdc++-6-dev
  - pkg-config
  - libxml2-dev
  - uuid-dev
  - liblzma-dev

tasks:
  - name: Install Required Packages
    apt:
      name: "{{ packages }}"
      state: present
    become: true

This small snippet will make sure that all the packages are installed (present) in the VM. You can use Ansible to automate lots of things. In our case, we automate the following processes:

install packages
download LLVM
build & run Mull’s unit tests
create an OS dependent package (pkg, deb, rpm, sh)
run integration tests

Another great thing about Ansible: you can run it locally, not necessarily in the VM. We use this feature on CI: executing each mentioned step for every pull request.

It saves me lots of time and simplifies the release process. Here is the whole release script:

mkdir -p packages

function prepare_package () {
  printf "Preparing package for $1... "
  export LLVM_VERSION=$2
  vagrant up $1 --provision 2> ./packages/$1.err.log > ./packages/$1.out.log
  vagrant destroy -f $1 2>> ./packages/$1.err.log >> ./packages/$1.out.log
  printf "Done.\n"
}

prepare_package debian 6.0.0
prepare_package freebsd 8.0.0
prepare_package ubuntu 8.0.0

In the end, I have packages ready in the packages folder for Debian, FreeBSD, and Ubuntu. Doing so for macOS is not as straightforward, but we will get there soon as well.

Summary

Just reiterating all those bold statements one more time:

don’t use llvm-config as part of the build system
don’t use LLVM/Clang from your distro for development
don’t use LLVM passes
don’t use whole-program-llvm
use Vagrant & Ansible for multi-OS support
use different versions of LLVM for development

There is another big topic: Testing, but I will leave it for the next article.

Bottom-up CMake introduction

Fri, 24 May 2019

If you want to learn CMake, but do not have time to go through all the resources on the internet, then this article is for you. I will cover essentials you’ll need to start:

targets
commands
variables
functions
macros

In the next few minutes, we will reimplement some of CMake’s built-in functionality using CMake itself.

Disclaimer: there are several very inaccurate statements about CMake in this article. Most of them are here on purpose: the goal is to build an intuition of how CMake works, not to be 100% correct.

What is CMake?

CMake is not a build system, as many think of it. CMake is a build system generator. Basically, you can see it as a compiler that compiles CMake scripts into Makefiles, or into one of several other formats, including Ninja build files and Xcode, Eclipse, and Visual Studio projects.

The typical workflow is as follows: you create a CMakeLists.txt (can be empty), you generate the build system, you build something. Here is the code:

> touch CMakeLists.txt
> cmake .
> make help

This is the bare minimum you need to start. Now let’s learn some CMake concepts.

Targets

All the work in CMake is organized around targets. A target is something you can build or call.

Target calls

Create a CMakeLists.txt with the following content:

add_custom_target(hello-target
                  COMMAND cmake -E echo "Hello, CMake World")

And run the following commands:

> cmake .
< truncated >
-- Configuring done
-- Generating done

> make hello-target
Scanning dependencies of target hello-target
Hello, CMake World
Built target hello-target

Here, make tells us that it has built the target hello-target, even though it just called the echo command and did not produce any artifacts.

Build targets

Let’s fix that and actually build some simple program. Create the following files:

// main.c

extern void hello_world();

int main() {
  hello_world();
  return 0;
}

// hello.c
extern int printf(const char *, ...);

void hello_world() {
  printf("Hello, CMake world\n");
}

Replace the custom target COMMAND:

add_custom_target(hello-target
                  COMMAND gcc main.c hello.c -o hello)

And re-run make:

> make hello-target
< truncated >
-- Configuring done
-- Generating done
Built target hello-target

As you can see, make detected the change and reconfigured CMake. If everything is right, you should have the hello executable:

> ./hello
Hello, CMake world

Great success!!!

Commands

In fact, you can describe the whole build process using a custom target as we did above. The problem is, however, that the command will re-run every time you run make hello-target: the hello program will be re-compiled completely even when nothing has changed.

Let’s use separate commands to solve this problem. The new version of CMakeLists.txt:

add_custom_command(OUTPUT hello.o
                   COMMAND gcc -c hello.c
                   DEPENDS hello.c)
add_custom_command(OUTPUT main.o
                   COMMAND gcc -c main.c
                   DEPENDS main.c)

add_custom_target(hello-target
                  COMMAND gcc main.o hello.o -o hello
                  DEPENDS main.o hello.o)

The important point here is the DEPENDS. This construct describes the build process in the form of a (directed acyclic) graph: A depends on B, B depends on C and D, and so forth. Then, a change of D or C means that B is changed, which means that A is also changed, and therefore, all the changed items should be re-created.
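If you prefer to see this reasoning as code, here is a tiny C++ model of it. It is only a sketch of the idea, not of how CMake or make actually track changes (they compare timestamps, among other things); Graph and isOutdated are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each item to the items it DEPENDS on. Items without an entry are
// leaves, e.g. source files.
using Graph = std::map<std::string, std::vector<std::string>>;

// An item must be re-created if it has changed itself, or if anything it
// depends on (directly or transitively) has changed.
bool isOutdated(const Graph &graph, const std::string &item,
                const std::set<std::string> &changed) {
  if (changed.count(item))
    return true;
  const auto node = graph.find(item);
  if (node == graph.end())
    return false;
  for (const std::string &dependency : node->second)
    if (isOutdated(graph, dependency, changed))
      return true;
  return false;
}
```

For the graph above, a change in hello.c marks hello.o and hello-target as outdated, but leaves main.o alone.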

Now try the following: build the program, make a change to one of the files, and re-run the build twice. You should see something like this:

> make hello-target
[ 50%] Generating hello.o
[100%] Generating main.o
[100%] Built target hello-target
# Add small change to hello.c
> make hello-target
[ 50%] Generating hello.o
[100%] Built target hello-target
> make hello-target
[100%] Built target hello-target

We’ve just got incremental compilation, yay!

Variables

It is time to do some refactoring: I’m more of a clang person than gcc, and therefore I want an easier way to change the compiler. Let’s extract it into a separate variable.

Defining a variable is as easy as calling set (FOO bar), which defines a variable FOO with the value bar. The usage is also straightforward: ${FOO} becomes bar when executed.

Here is what a better version of CMakeLists.txt looks like:

set (C_COMPILER gcc)

add_custom_command(OUTPUT hello.o
                   COMMAND ${C_COMPILER} -c hello.c
                   DEPENDS hello.c)
add_custom_command(OUTPUT main.o
                   COMMAND ${C_COMPILER} -c main.c
                   DEPENDS main.c)

add_custom_target(hello-target
                  COMMAND ${C_COMPILER} main.o hello.o -o hello
                  DEPENDS main.o hello.o)

A variable definition can be recursive, that is, it can refer to the variable’s own previous value. This is how lists are built up. Try to add the following code to the CMake script:

set (NUMBERS 1)
message(${NUMBERS})
set (NUMBERS ${NUMBERS} 2)
message(${NUMBERS})
set (NUMBERS ${NUMBERS} 3)
message(${NUMBERS})
set (NUMBERS 0 ${NUMBERS})
message(${NUMBERS})

And re-run cmake . to see this in action (message concatenates its arguments, so the list elements are printed without separators):

> cmake .
1
12
123
0123

Functions

In CMake, everything is a function!

set (FOO bar)

set is a function that takes two arguments.

if(CMAKE_SYSTEM_NAME STREQUAL Linux)
  # ...
else()
  # ...
endif()

if, else, and endif are functions.

add_custom_target and add_custom_command are also functions.

Let’s create our own function and hide all the intricacies of our CMake script:

set (C_COMPILER gcc)

function(create_executable name)
  add_custom_command(OUTPUT hello.o
                     COMMAND ${C_COMPILER} -c hello.c
                     DEPENDS hello.c)
  add_custom_command(OUTPUT main.o
                     COMMAND ${C_COMPILER} -c main.c
                     DEPENDS main.c)

  add_custom_target(${name}
                    COMMAND ${C_COMPILER} main.o hello.o -o hello
                    DEPENDS main.o hello.o)
endfunction()

create_executable(hello-target)

Fun fact: function and endfunction are also functions.

The function is now reusable, but quite useless since the source files are hardcoded. Let’s go a bit deeper and fix this issue.

Macros

In CMake, everything is a function! Except for macros.

Macros are like functions, with one exception: they are inlined whenever they are called. We can extract compilation into a macro:

macro(compile source_file)
  get_filename_component(output_file ${source_file} NAME_WE)
  set (output_file ${output_file}.o)
  add_custom_command(OUTPUT ${output_file}
                     COMMAND ${C_COMPILER} -c ${source_file}
                     DEPENDS ${source_file})
endmacro()

The macro uses get_filename_component to cut the extension from the input source file and constructs the output file name: main.c -> main.o.

Now we can use this macro:

function(create_executable name)
  compile(hello.c)
  set (output_files ${output_file})

  compile(main.c)
  set (output_files ${output_files} ${output_file})

  add_custom_target(${name}
                    COMMAND ${C_COMPILER} ${output_files} -o hello
                    DEPENDS ${output_files})
endfunction()

The code looks a bit cleaner now, but there is at least one part that may look confusing: set (output_files ${output_file}). Since the body of a macro is inlined, we can rewrite this function like this (just for illustration):

function(create_executable name)
  get_filename_component(output_file hello.c NAME_WE)
  set (output_file ${output_file}.o)
  add_custom_command(OUTPUT ${output_file}
                     COMMAND ${C_COMPILER} -c hello.c
                     DEPENDS hello.c)
  set (output_files ${output_file})

  get_filename_component(output_file main.c NAME_WE)
  set (output_file ${output_file}.o)
  add_custom_command(OUTPUT ${output_file}
                     COMMAND ${C_COMPILER} -c main.c
                     DEPENDS main.c)
  set (output_files ${output_files} ${output_file})

  add_custom_target(${name}
                    COMMAND ${C_COMPILER} ${output_files} -o hello
                    DEPENDS ${output_files})
endfunction()

So basically, we reuse the variable output_file. We can use it to construct the list of object files for the custom target. I hope it is clearer now.

Loops

It obviously follows (c) that we can use a loop to handle a variable number of source files passed to this function:

function(create_executable name)
  foreach(file ${ARGN})
    compile(${file})
    set (output_files ${output_files} ${output_file})
  endforeach()

  add_custom_target(${name}
                    COMMAND ${C_COMPILER} ${output_files} -o hello
                    DEPENDS ${output_files})
endfunction()

create_executable(hello-target main.c hello.c)

Here we iterate over passed source files (main.c, hello.c) stored in the ARGN variable, and accumulate all the intermediate files in output_files.

Final touches

I added three more things to the final version:

another variable, C_FLAGS, stores additional compile flags one may need
the name of the executable is passed as a separate argument
the linking phase is extracted into a separate command

set (C_COMPILER gcc)
set (C_FLAGS -g -O0)

macro(compile source_file)
  get_filename_component(output_file ${source_file} NAME_WE)
  set (output_file ${output_file}.o)
  add_custom_command(OUTPUT ${output_file}
                     COMMAND ${C_COMPILER} ${C_FLAGS} -c ${source_file}
                     DEPENDS ${source_file})
endmacro()

function(create_executable name exe)
  foreach(file ${ARGN})
    compile(${file})
    set (output_files ${output_files} ${output_file})
  endforeach()

  add_custom_command(OUTPUT ${exe}
                     COMMAND ${C_COMPILER} ${output_files} -o ${exe}
                     DEPENDS ${output_files})

  add_custom_target(${name} DEPENDS ${exe})
endfunction()

create_executable(hello-target hello main.c hello.c)

Give it another try:

> cmake .
> make hello-target
> ./hello
Hello, CMake world

Conclusion

We’ve just replicated a (limited) version of CMake’s add_executable functionality.

Here is the version you would use if you didn’t know how to build the thing on your own:

set (CMAKE_C_COMPILER gcc)
set (CMAKE_C_FLAGS "-g -O0")

add_executable(hello main.c hello.c)

What’s next?

Go and learn about other CMake functions (that are confusingly called commands), you are ready now!

I would highly recommend learning about the following concepts:
