Peter Bosch's website

Intel LDAT reference

Fri, 22 May 2020 00:00:00 +0200

General info:
  LDAT is the Intel array test port. It is used for manufacturing
  validation of the arrays in Intel's products. Examples of these arrays
  are the caches, the microcode stores and the various buffers inside
  the CPU core.

  Procedure to read:
    Load SDAT
    Load PDAT with command A1 set to READ, others set to NOP
    Read DatOut

  Procedure to write:
    Write DatIn
    Load SDAT
    Load PDAT with command A1 set to WRITE, others set to NOP

  Undefined bits are set to 0.
  Mode (Mod) defaults to 1 for both READ and WRITE
  Other fields default to 0/NOP.
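
  A minimal sketch of the two procedures above in C. It only builds the
  ordered list of port accesses. How a port is actually read or written
  is not covered by this reference, so that part is left to the caller.
  The struct and function names are made up for illustration, the port
  offsets are the SNB ones listed further down this page, and the SDAT
  and PDAT words are passed in already encoded.

```c
#include <stddef.h>
#include <stdint.h>

enum ldat_op_kind { LDAT_OP_WRITE, LDAT_OP_READ };

/* One port access; names are illustrative, not Intel's. */
struct ldat_op {
    enum ldat_op_kind kind;
    uint32_t addr;   /* port address */
    uint32_t value;  /* value to write; unused for reads */
};

/* Port offsets relative to PDAT (SNB layout from this page). */
enum { LDAT_PDAT = 0, LDAT_SDAT = 1, LDAT_DATOUT = 2, LDAT_DATIN = 3 };

/* Read: load SDAT, load PDAT (A1 = READ), then read DatOut. */
size_t ldat_plan_read(struct ldat_op ops[3], uint32_t pdat_addr,
                      uint32_t sdat, uint32_t pdat)
{
    ops[0] = (struct ldat_op){ LDAT_OP_WRITE, pdat_addr + LDAT_SDAT, sdat };
    ops[1] = (struct ldat_op){ LDAT_OP_WRITE, pdat_addr + LDAT_PDAT, pdat };
    ops[2] = (struct ldat_op){ LDAT_OP_READ,  pdat_addr + LDAT_DATOUT, 0 };
    return 3;
}

/* Write: write DatIn, load SDAT, then load PDAT (A1 = WRITE). */
size_t ldat_plan_write(struct ldat_op ops[3], uint32_t pdat_addr,
                       uint32_t sdat, uint32_t pdat,
                       uint32_t index, uint32_t data)
{
    ops[0] = (struct ldat_op){ LDAT_OP_WRITE, pdat_addr + LDAT_DATIN + index, data };
    ops[1] = (struct ldat_op){ LDAT_OP_WRITE, pdat_addr + LDAT_SDAT, sdat };
    ops[2] = (struct ldat_op){ LDAT_OP_WRITE, pdat_addr + LDAT_PDAT, pdat };
    return 3;
}
```

  Keeping the plan separate from the access mechanism means the same
  sequence can be replayed through whatever debug interface is at hand.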

BDX Broadwell-X:
  From Intel System Studio 2014 XML Database
  Port offsets (Normal):
    PDAT   +0
    DatOut +2
    DatIn  +3 + Index
    SDAT   +4 #Not sure if 1 or 4

    SDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
                  +-----------+---+---------+---+-----------+-------+
                  |    Port   |Mod|  DWord  |   |  ArraySel |BankSel|    
                  +-----------+---+---------+---+-----------+-------+

    PDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
    +---+---+---+---+---+---+---------------+-----------------------+
    | C1| C0| B1| B0| A1| A0|               |       FastAddr        |
    +---+---+---+---+---+---+---------------+-----------------------+

  Port offsets (Legacy):
    PDAT   0
    DatOut 8
    SDAT   4 #Not sure if 1 or 4

    PDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
                    +-----------+---+-------+---+---------+-+-------+
                    |   Port    |Mod| DWord |   |ArraySel | |BankSel|    
                    +-----------+---+-------+---+---------+-+-------+

    SDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
    +---+---+---+---+---+---+---------------+-----------------------+
    | C1| C0| B1| B0| A1| A0|               |       FastAddr        |
    +---+---+---+---+---+---+---------------+-----------------------+

Command fields:
  | Encoding |          Name   |
  +----------+-----------------+
  |        0 | NOP             |
  |        1 | RDIGN           |
  |        2 | WRITE           |
  |        3 | READ/WRITEBAR   |

SNB Sandy Bridge:
  From Intel System Studio 2014 XML Database
  Port offsets:
    PDAT   +0
    SDAT   +1
    DatOut +2
    DatIn  +3 + Index

    SDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
                    +-----------+---+-------+---+---------+-+-------+
                    |   Port    |Mod| DWord |   |ArraySel | |BankSel|    
                    +-----------+---+-------+---+---------+-+-------+

    PDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
    +---+---+---+---+---+-------------------+-----------------------+
    | C1|   | B1|   | A1|                   |       FastAddr        |
    +---+---+---+---+---+-------------------+-----------------------+
    

  Command fields:
    | Encoding | Name   |
    +----------+--------+
    |        0 | NOP    |
    |        1 | RDIGN  |
    |        2 | WRITE  |
    |        3 | READ   |
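
  The SNB bitfields above can be packed with a few shifts. This is a
  sketch, not Intel's definition: the bit positions are read off the
  ASCII diagrams (SDAT: Port 23:18, Mod 17:16, DWord 15:12, ArraySel
  9:5, BankSel 3:0; PDAT: C1 31:30, B1 27:26, A1 23:22, FastAddr 11:0),
  so an off-by-one in a diagram would carry over. Function and enum
  names are made up for illustration.

```c
#include <stdint.h>

/* Command encodings from the SNB "Command fields" table. */
enum ldat_cmd { LDAT_NOP = 0, LDAT_RDIGN = 1, LDAT_WRITE = 2, LDAT_READ = 3 };

/* Pack an SNB SDAT word; undefined bits stay 0. */
uint32_t snb_sdat(uint32_t port, uint32_t mod, uint32_t dword,
                  uint32_t array_sel, uint32_t bank_sel)
{
    return ((port      & 0x3Fu) << 18) |
           ((mod       & 0x03u) << 16) |
           ((dword     & 0x0Fu) << 12) |
           ((array_sel & 0x1Fu) <<  5) |
            (bank_sel  & 0x0Fu);
}

/* Pack an SNB PDAT word; SNB only has the C1, B1 and A1 command slots. */
uint32_t snb_pdat(enum ldat_cmd c1, enum ldat_cmd b1, enum ldat_cmd a1,
                  uint32_t fast_addr)
{
    return (((uint32_t)c1 & 0x3u) << 30) |
           (((uint32_t)b1 & 0x3u) << 26) |
           (((uint32_t)a1 & 0x3u) << 22) |
            (fast_addr & 0xFFFu);
}
```

  With the defaults from the general info (Mod = 1, everything else 0),
  snb_sdat(0, 1, 0, 3, 0) gives 0x00010060 and
  snb_pdat(LDAT_NOP, LDAT_NOP, LDAT_READ, 0) gives 0x00C00000. Port 0 is
  an assumption here; this page does not say which port value to use.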

  Known arrays:
    | PDAT CR | ArraySel  | Name             | Description                     |
    +---------+-----------+------------------+---------------------------------|
    |   0x359 |         0 | dsbofset         | DSB FE OFFSET array             |
    |   0x359 |         1 | dsbnata          | DSB FE NATA array               |
    |   0x359 |         2 | dsbnata          | DSB FE PFRQ array               |
    |   0x359 |         6 | dsbofset         | DSB FE TAG array                |
    |   0x361 | 2+2*Table | jws_pred_global  | The BPU global predictors       |
    |   0x361 |         8 | bp_l2bpu set0    | The level 2 BPU predictor,set 0 |
    |   0x361 |         9 | bp_l2bpu set1    | The level 2 BPU predictor,set 1 |
    |   0x361 |        10 | bp_target(tag)   | BIT TargetTAG array (even)      |
    |   0x361 |        11 | bp_target(tag)   | BIT TargetTAG array (odd)       |
    |   0x361 |        12 | bp_target(tag)   | BIT TargetTAG array (indirect)  |
    |   0x361 |        13 | bp_target(addr)  | BIT TargetAddr array (even)     |
    |   0x361 |        14 | bp_target(addr)  | BIT TargetAddr array (odd)      |
    |   0x361 |        15 | bp_target(addr)  | BIT TargetAddr array (indirect) |
    |   0x361 |        17 | bp_bit           | The main BPU target&tag array   |
    |   0x361 |        18 | bp_baq           | BAQBRD?                         |
    |   0x361 |        19 | bpq              | Branch Predict Queue            |
    |   0x377 |         2 | dl1_cache(data)  | L1D Cache Data array            |
    |   0x377 |         3 | dl1_cache(mesi)  | L1D Cache ?MESI? array          |
    |   0x377 |         5 | dl1_cache(tag)   | L1D Cache Tag array             |
    |   0x377 |         6 | dl1_cache(lru)   | L1D Cache LRU array             |
    |   0x387 |         0 | il1_cache(data)  | L1I Cache Data array            |
    |   0x387 |         1 | il1_victim(data) | L1I Victim Cache Data array     |
    |   0x387 |         2 | il1_victim(tag)  | L1I Victim Cache Tag array      |
    |   0x387 |         3 | il1_cache(tag)   | L1I Cache Tag array             |
    |   0x387 |         4 | il1_cache(flags) | L1I Cache Flags array           |
    |   0x387 |         5 | itlb_sm_st       | ITLB Small Page, SingleThread   |
    |   0x3CE |         4 | mob_disambig     | Memory disambiguation predict   |
    |   0x3CE |       0,1 | mob_lb           | MOB Load Buffer                 |
    |   0x3CE |     2,3,5 | mob_sab          | MOB Store Address Buffer        |
    |   0x3CE |         6 | mob_phy          | MOB Physical Address Buffer     |
    |   0x3D3 |         0 | id_esp_data      | Instruction ESP data Q          |
    |   0x3D3 |         3 | MS RAM           | The MS patch RAM                |
    |   0x3E4 |         0 | rob_wbac         | ROB Ready bits and flags        |
    |   0x3E4 |         1 | rob_al           | ROB ALLOC Array                 |
    |   0x3E4 |         2 | bob_wbac         | Branch order buffer Writeback   |
    |   0x3E4 |         3 | non_renamed      | Non-renamed retirement array    |
    |   0x3EC |         5 | dtlb_sm_tag      | DTLB Small Page Tag             |

    I will add more of these later ( no later than 24/05/2020 )

HSW Haswell:
  PDAT Names:
    | CRB Addr| Name                |
    +---------+---------------------|
    |   0x361 | BPU1_CR_PDAT        |
    |   0x359 | DSBFE_CR_PDAT       |
    |   0x366 | CORE_CR_PDAT        |
    |   0x377 | DCU_CR_PDAT         |
    |   0x382 | IESLOW_CR_PDAT      |
    |   0x393 | MI_CR_PDAT          |
    |   0x3A9 | ML2_CR_PDAT         |
    |   0x3CE | MOB_CR_PDAT         |
    |   0x3D3 | MS_CR_PDAT          |
    |   0x3EC | PMH_CR_PDAT         |
    |   0x3F6 | RAT_CR_PDAT         |
    |   0x3FA | AL_CR_PDAT          |

GLM Goldmont: 

  Reverse engineered from https://github.com/chip-red-pill/crbus_scripts

  Port offsets:
    PDAT   +0
    SDAT   +1
    DatOut +2
    DatIn  +4,5? + Index

    SDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
                    +-----------+---+-------+-------+-------+-------+
                    |   Port    |Mod| DWord |ArrySel|       |BankSel|    
                    +-----------+---+-------+-------+-------+-------+

    PDAT Bitfield:
       3                   2                   1                   0
     1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 
    +---------------+---+-----------+-------------------------------+
    |               | A1|           |          FastAddr             |
    +---------------+---+-----------+-------------------------------+
   
  Command fields:
    | Encoding | Name   |
    +----------+--------+
    |        0 | NOP    |
    |        2 | WRITE  |
    |        3 | READ   |

  Known arrays:
    | PDAT CR | ArraySel  | Name             | Description                     |
    +---------+-----------+------------------+---------------------------------|
    |   0x6A0 |         0 | ms_rom           | Microcode ROM                   |
    |   0x6A0 |         1 | ms_irom          | Microcode constant ROM          |
    |   0x6A0 |         2 | ms_iram          | Microcode update constant RAM   |
    |   0x6A0 |         3 | ms_match_patch   | Microcode update match/patch    |
    |   0x6A0 |         4 | ms_ram           | Microcode update RAM            |
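
  For GLM the layout differs slightly from SNB: ArraySel sits in bits
  11:8 and FastAddr is 16 bits wide. A sketch of the packing, again with
  bit positions read off the diagrams above and with illustrative names
  that do not come from Intel or from crbus_scripts:

```c
#include <stdint.h>

/* PDAT CR and ArraySel values from the GLM "Known arrays" table. */
enum { GLM_MS_PDAT = 0x6A0 };
enum glm_ms_array {
    GLM_MS_ROM = 0, GLM_MS_IROM = 1, GLM_MS_IRAM = 2,
    GLM_MS_MATCH_PATCH = 3, GLM_MS_RAM = 4
};

/* Command encodings from the GLM "Command fields" table. */
enum glm_cmd { GLM_NOP = 0, GLM_WRITE = 2, GLM_READ = 3 };

/* SDAT: Port 23:18, Mod 17:16, DWord 15:12, ArraySel 11:8, BankSel 3:0. */
uint32_t glm_sdat(uint32_t port, uint32_t mod, uint32_t dword,
                  uint32_t array_sel, uint32_t bank_sel)
{
    return ((port      & 0x3Fu) << 18) |
           ((mod       & 0x03u) << 16) |
           ((dword     & 0x0Fu) << 12) |
           ((array_sel & 0x0Fu) <<  8) |
            (bank_sel  & 0x0Fu);
}

/* PDAT: A1 command in 23:22, FastAddr in 15:0. */
uint32_t glm_pdat(enum glm_cmd a1, uint32_t fast_addr)
{
    return (((uint32_t)a1 & 0x3u) << 22) | (fast_addr & 0xFFFFu);
}
```

  To read ms_ram at FastAddr 0x10 with Mod = 1, the words would be
  glm_sdat(0, 1, 0, GLM_MS_RAM, 0) = 0x00010400 and
  glm_pdat(GLM_READ, 0x10) = 0x00C00010, written to GLM_MS_PDAT + 1 and
  GLM_MS_PDAT + 0 respectively. Port 0 is an assumption.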

Introduction to the Intel Management Engine OS (Part 2)

Fri, 11 Oct 2019 00:00:00 +0200

If we wanted to analyze an ME module, we might start by taking its base address from the decoded metadata file, as per (Goryachy, Ermolov, & Sklyarov, 2017). That would look something like this:

To better understand this disassembly, we need to look at the memory map for a userland ME module. Figuring out the specifics took significant amounts of my time earlier this year, as the kernel is even tougher to understand than the modules themselves. It was already known that 0x1000-0x2000 contained the entry points for the ROM library and that 0x9000-0xa000 was the system library dispatch table (Goryachy, Ermolov, & Sklyarov, 2017).

The way these ranges work is simple: the ME uses shared libraries, but has no real dynamic linker. Instead, symbols are resolved at compile time by having a range of fixed-address vectors at the start of the libraries. This allows the size of library code to vary without having to recompile the modules using the library. The lack of import and export tables means that we have no symbol names to depend on for understanding the library calls.

In order to make sense of them we need to look at the library implementation, which for syslib is possible as it is contained in the flash ROM. However, the ROM library at 0x1000 is contained in on-die memory which we can only read once we already have code execution on the ME core. This was initially a significant problem for me; I had to resort to guessing functions by their usage, and the ROM library contains most of the OS-independent C library code so ignoring calls is not feasible. There is, however, a way around this dependency loop: for internal testing, Intel allows pre-production firmware to override the on-die ROM. Some of those firmware images have leaked, containing a partition called ROMB (ROM bypass) which provides a way to look at ROM code without having to first break ME security (Skochinsky, 2014; Goryachy, Ermolov, & Sklyarov, 2017).

This leaves the read-write memory areas, which are located above the image text. We would expect a UNIX program to have .text, .data and .bss segments, but the image we have seems to be a flat binary, with metadata that does not directly reference the section addresses. We can guess there is some read-write memory by looking at the accesses, but to better understand what is going on we will need to pick apart the module entry point code.

Comparison of ME module with minix3 code at https://github.com/jncraton/minix3/blob/master/lib/i386/rts/crtso.s

We can see that the crtso (C RunTime StartOff) is mostly taken from Minix, but that before any of the normal startup code runs, another function is called. This function is critical because it not only runs all the syslib initialization code (.init on Linux) but also sets up the initialized data section.

This is the solution to using a mostly flat image format for code targeting a POSIX environment: have the kernel prepare a large .bss area and copy a small portion of the read-only data section to it. This data section is always at the start of the .bss area and can thus be used to infer the .bss base address.
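
As a rough illustration of that scheme, the sketch below zero-fills a read-write area and then copies the initial values of the initialized data to its start. On the ME the work is split between the kernel and the module's startup function; the sketch folds both steps into one function for brevity. The struct, its fields and the function name are hypothetical and are not taken from the ME firmware or from meloader.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of a module's read-write area. */
struct rw_layout {
    uint8_t       *bss_base;   /* start of the read-write area */
    size_t         bss_size;   /* total size of that area */
    const uint8_t *init_image; /* initial values kept in the read-only image */
    size_t         init_size;  /* size of the initialized-data portion */
};

/* Zero the whole area, then place the initialized data at its start.
 * Returns 0 on success, -1 if the initialized data does not fit. */
int rw_area_init(const struct rw_layout *l)
{
    if (l->init_size > l->bss_size)
        return -1;
    memset(l->bss_base, 0, l->bss_size);
    memcpy(l->bss_base, l->init_image, l->init_size);
    return 0;
}
```

Because the copy always lands at offset 0 of the area, seeing where the startup code writes the initialized data is enough to recover the .bss base address.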

We now have a sufficiently good understanding of the memory map to start picking apart a module. When doing so we will run into the second surprise: one of the first steps we often take when analysing code is to look at the string literals present in it, and most of these modules contain only a handful of them.

A few do contain numerous strings, such as old revisions of evtdisp and busdrv, but most use Intel SVEN for debugging. SVEN is a technology that moves the format strings from the software under test to the debugger. It replaces printf() calls with sven_trace_<par_count>() calls that use a numeric ID to refer to a format string in a dictionary. Trimmed down versions of such dictionaries have leaked as part of some Intel System Studio releases, but no full versions are known to exist.

If we want to start attacking vulnerabilities in modules, we also need to know the location of the stack. This cannot trivially be deduced from the modules, so we either need to guess it from known memory sections, or painstakingly reverse engineer the microkernel.

The rough memory layout for a normal ME process - the orange areas are read only, the green area is read-write and no areas above .text are executable.

To better understand how this layout is derived from the metadata file, look at user/loader/map.c in the meloader source, which is part of an emulator I wrote to be able to do dynamic analysis of ME modules.

Due to the microkernel design used for the ME operating system, device drivers are also user-mode processes. This means there has to be a mechanism to provide access to memory-mapped IO (MMIO) address ranges from user space.

The ME mostly depends on x86 segmentation for its memory protection model, allowing very fine-grained access control to memory ranges. I will cover the specifics of this memory protection model in a future article about the ME kernel, but will cover the bits relevant to understanding MMIO in driver modules here.

The module metadata contains a list of MMIO ranges, which are exposed as the first few segments in the LDT. To construct a segment selector (segment register value) we can refer to the Intel Architecture Manual, which yields sel = (mmio_index << 3) | 7.
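
The constant 7 in that formula is the combination of two standard x86 selector fields: bit 2 is the table indicator (1 selects the LDT rather than the GDT) and bits 1:0 are the requested privilege level (3 for user mode). A small sketch in C, with function and macro names chosen only for illustration:

```c
#include <stdint.h>

#define SEL_TI_LDT 0x4u  /* table indicator: descriptor lives in the LDT */
#define SEL_RPL3   0x3u  /* requested privilege level 3 (user mode) */

/* Build the selector for the n-th MMIO range of a module. */
uint16_t mmio_selector(unsigned mmio_index)
{
    return (uint16_t)((mmio_index << 3) | SEL_TI_LDT | SEL_RPL3);
}

/* Recover the descriptor index from a selector value. */
unsigned selector_index(uint16_t sel)
{
    return sel >> 3;
}
```

The first three MMIO ranges thus end up at selectors 0x07, 0x0F and 0x17, which are handy values to look for in a disassembly.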

The selectors are usually not directly accessed by user code, but only by the ROM library. In order to be able to emulate ME code, I had to hook these.

Their addresses for Sunrise Point-LP chipsets are listed in cfg/spt_lp.cfg in the meloader source.

Another important segment used by the ME is the thread-local segment (TLS), which is pointed to by [gs:0]. It is located at the top of the stack and contains pointers to the C library context as well as the errno value and thread ID (Goryachy & Ermolov, 2017).

The TLS structure.

The next article in this series will describe how to use a buffer overflow vulnerability in the bup module to gain almost unlimited access to the ME system.

References

Goryachy, M., Ermolov, M., & Sklyarov, D. (2017). Intel ME: The Way of the Static Analysis. Troopers. Retrieved from https://www.troopers.de/downloads/troopers17/TR17_ME11_Static.pdf
Skochinsky, I. (2014). Intel ME secrets. CODE BLUE. Retrieved from https://www.slideshare.net/codeblue_jp/igor-skochinsky-enpub
Goryachy, M., & Ermolov, M. (2017). How to Hack a Turned-Off Computer, or Running Unsigned Code in Intel Management Engine. Black Hat Europe. Retrieved from https://www.blackhat.com/docs/eu-17/materials/eu-17-Goryachy-How-To-Hack-A-Turned-Off-Computer-Or-Running-Unsigned-Code-In-Intel-Management-Engine.pdf

Introduction to the Intel Management Engine OS (Part 1)

Fri, 11 Oct 2019 00:00:00 +0200

The Intel Management Engine is a secondary processor present in all modern Intel systems. This series will cover version 11.0 of the Management Engine, which is present on 6th and 7th generation Intel systems. Later versions are derived from this version.

The ME is often portrayed as a backdoor that we users have no control or insight over, and there are a number of misunderstandings perpetuated by opponents of the technology.

It is widely believed that the ME serves no legitimate purpose; this is not true. The ME originated as a separate component used for management of enterprise systems. As systems became more integrated, it was incorporated into the PCH (southbridge), rather than the CPU. Later, Intel must have decided that it needed more sophisticated logic for initializing the clock controllers, because they moved this functionality into the ME.

Since then, the ME can no longer be disabled by removing its firmware, as it is a critical part of system bringup and thus the system will not boot without it.

So what exactly is this ME? It consists of a CPU with associated ROM, RAM and peripherals located in the PCH. It runs its own operating system, with a number of applications and drivers providing services to the host firmware and user.

Image: Intel (c) (Hasarfaty & Moyal, 2019).

For a long time, the only real resource available for researching the ME was the firmware stored in the SPI flash on the motherboard. This was especially difficult before ME version 11.0, as the earlier implementations used variants of the ARC architecture, which were not well-supported in popular reverse engineering toolchains. Despite this, some research was done and yielded a basic understanding of those versions (Skochinsky, 2014).

Research really took off in 2015 when Intel released their Sunrise Point chipset, the first to use ME 11.0. ME 11 and onwards are based on a derivative of the Intel 486 processor, which makes analysis of the code much easier.

The ME 11.0 images consist of various layers stored in the ME region of the SPI ROM (Goryachy, Ermolov, & Sklyarov, 2017). I will describe these layers in more detail later on in this series. There are easy-to-use tools available for parsing and extracting these images, such as unME11 by Dmitry Sklyarov.

The files contained in the firmware images can be divided into roughly two categories: metadata files and module files. The metadata files are parsed by the tools into human-readable text files, and the module files are flat binaries containing the module code.

The ME runs a microkernel operating system which (since (Goryachy, Ermolov, & Sklyarov, 2017)) is commonly believed to be Minix. In fact, Intel use a custom microkernel (replacing the ThreadX system used before ME 11.0), which takes the POSIX VFS, system call interface and parts of the userland from Minix 3.

The boot flow for the ME, Image: (Goryachy, Ermolov, & Sklyarov, 2017).

Microkernels have an inherent bootstrap problem, where functionality needed to load and start server processes is implemented by those same processes. The ME operating system solves this by having a bringup (bup) server, which contains a barebones implementation of all services and drivers it would otherwise depend on. This module needs access to almost all core peripherals and system calls, and is thus a very interesting target when attacking the ME (Goryachy & Ermolov, 2017).

In the next article I will look into loading and analyzing an ME module.

References

Hasarfaty, S., & Moyal, Y. (2019). Behind the Scenes of Intel Security and Manageability Engine. BlackHat. Retrieved from https://i.blackhat.com/USA-19/Wednesday/us-19-Hasarfaty-Behind-The-Scenes-Of-Intel-Security-And-Manageability-Engine.pdf
Skochinsky, I. (2014). Intel ME secrets. CODE BLUE. Retrieved from https://www.slideshare.net/codeblue_jp/igor-skochinsky-enpub
Goryachy, M., Ermolov, M., & Sklyarov, D. (2017). Intel ME: The Way of the Static Analysis. Troopers. Retrieved from https://www.troopers.de/downloads/troopers17/TR17_ME11_Static.pdf
Goryachy, M., Ermolov, M., & Sklyarov, D. (2017). Intel ME: Детали устройства файловой системы в Flash-памяти. RUCTF. Retrieved from https://live.ructf.org/intel_me.pdf
Goryachy, M., & Ermolov, M. (2017). How to Hack a Turned-Off Computer, or Running Unsigned Code in Intel Management Engine. Black Hat Europe. Retrieved from https://www.blackhat.com/docs/eu-17/materials/eu-17-Goryachy-How-To-Hack-A-Turned-Off-Computer-Or-Running-Unsigned-Code-In-Intel-Management-Engine.pdf

Boot Guard TOCTOU CVE-2019-11098

Tue, 17 Sep 2019 00:00:00 +0200

In order to trust a computer application it is necessary to be able to trust all the hidden infrastructure supporting it as well. Some of this is obvious: if your operating system contains a rootkit, application-level security becomes almost useless. To fix this we try to ensure that all privileged code is under our control, which, if we want to prevent offline attacks, involves signing the binaries and verifying those signatures at load time.

As with all security, the weakest link determines the strength of such a system: as soon as one stage of the boot process fails to verify the next, an attacker can install malicious code without the later stages knowing. This is why code signing applications while running an unsigned operating system offers very little protection against sophisticated attacks, and similarly why code signing the operating system and boot loader does not work without signed firmware.

Intel’s solution to this problem is Boot Guard, where the chain of trust starts in the CPU microcode and Management Engine firmware. There have been various attacks on this technology (Ermolov, 2016; “Bypassing Intel Boot Guard,” 2015), most of which target configuration mistakes by vendors or vendor code running after verification is done.

The first element of Boot Guard that is executed as normal code on the host CPU is the ACM, which is an Intel binary blob verifying the vendor firmware. It gets the security policy and keys from the Management Engine and verifies the vendor firmware, after which it sets up a safe environment for the vendor Initial Boot Block to execute in.

At this point in the boot process there is no DRAM available yet and as such the ACM will configure the CPU cache not to evict any lines, which allows the firmware to use the cache as RAM. This cache-as-RAM (CAR) is used to store state for the ACM and to provide a secure copy of the IBB to execute.

Usually any form of verification will look somewhat like this:

	memcpy( safe_buffer, data, data_size );
	if ( !verify_signature( safe_buffer, data_size ) )
		goto error;
	use_data( safe_buffer, data_size );

in order to prevent time-of-check/time-of-use (TOCTOU) attacks. The ACM code does no such thing: it simply verifies the data in place and proceeds to run it! However, things are not what they seem: the cache is in no-evict mode and thus the verification itself implicitly copied the data. When the code is executed the fetches will hit the cache and the TOCTOU risk is averted.

This clever strategy does have some downsides: the implicit protection is easily overlooked and to make matters worse it is not failsafe. Any issue that forces the code out of the cache or makes fetches bypass it will silently open up the system to a TOCTOU attack on the data source.

In the case of Boot Guard that data source is usually a flash ROM on the motherboard attached via the relatively simple SPI bus. This bus is easy to monitor as it has a low clock speed (below 100 MHz) and a low pin count. Capturing this bus allows watching ROM fetches (and thus, cache misses) from outside the system.

To find a vulnerable region of data it suffices to look for an address being read multiple times. One such address was 0xffcc40:

This address turns out to be part of the SecCore EFI module, which is responsible for early initialization and security during EFI bringup. The code at those addresses turned out to be:

# SecCore::PeiTemporaryRamDone
FFFFCC42    mov ecx, IA32_MTRR_DEF_TYPE
FFFFCC47    rdmsr
FFFFCC49    and eax, ~IA32_MTRR_ENABLE
FFFFCC5A    mov ecx, IA32_MTRR_DEF_TYPE
FFFFCC5F    wrmsr

This shows precisely the risk of having implicit protection: it is easy to forget it is being used and to accidentally break it. In this case the system has initialized DRAM and starts to tear down CAR, which amongst other things consists of disabling the caches. Having disabled the caches, the code is now executing in place from memory-mapped ROM.

A simple attack against this flaw is to have a circuit monitor the SPI bus for the end of the verification and then switch the bus from the real ROM to a second one that contains malicious code. This is easy to do because SPI by design shares all but one of its signals between multiple targets on the bus; only the chip select (CS) signal is not shared, and it selects which device is being addressed. The proof of concept attack used an FPGA to intercept that signal and route it to a second chip when needed.

The trigger for the override was an address range chosen from the SPI capture, which was seen only to be accessed after verifying the code.

The proof of concept setup is shown below, with an inset showing the modifications to the Lenovo T460 motherboard.

This allowed code execution, but not yet booting the system. In order to boot the system and not alert any security mechanism in the vendor firmware, the device would have to hide itself again. This was implemented by having a second range of addresses that would deactivate the device. As execution is under attacker control this address can be one that is otherwise never read, to prevent accidentally disabling the device.
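
The activate/deactivate logic amounts to a small state machine driven by the addresses seen on the SPI bus. The proof of concept implemented this in an FPGA; the C model below is only a sketch of the idea, with hypothetical names and no real address ranges. Latching a final state so the device cannot re-arm is an assumption of this sketch, as is serving the deactivating read itself from the real ROM.

```c
#include <stdbool.h>
#include <stdint.h>

struct addr_range { uint32_t start, end; };  /* start inclusive, end exclusive */

enum spi_state { SPI_IDLE, SPI_ACTIVE, SPI_DONE };

struct spi_override {
    struct addr_range arm;     /* only read after verification has finished */
    struct addr_range disarm;  /* never read by the unmodified firmware */
    enum spi_state state;
};

static bool in_range(const struct addr_range *r, uint32_t addr)
{
    return addr >= r->start && addr < r->end;
}

/* Call once per observed SPI read address. Returns true if this read
 * should be answered by the second ROM instead of the real one. */
bool spi_override_on_read(struct spi_override *s, uint32_t addr)
{
    if (s->state == SPI_IDLE && in_range(&s->arm, addr))
        s->state = SPI_ACTIVE;
    else if (s->state == SPI_ACTIVE && in_range(&s->disarm, addr))
        s->state = SPI_DONE;   /* hide again and stay hidden */
    return s->state == SPI_ACTIVE;
}
```

Choosing the arm range from addresses that only appear after verification is what keeps the override invisible to the ACM.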

By doing this, the system can be made to boot as normal and will not show any signs of tampering: The TPM measurement registers are not affected and the system is thus entirely unaware of the attack.

This specific vulnerability yields control before most of the sensitive MSRs are locked and thus allows installing a persistent backdoor before transferring control back to the vendor firmware.

The POC may seem awfully impractical, but a much simpler approach is possible: instead of intercepting the CS signal, one can simply override it with a stronger output; a series resistor on the motherboard protects the chipset from being damaged by this. This allows simply clipping on the exploit device.

Another improvement is to not switch to a second ROM but instead serve the data directly from the FPGA.

These improvements were suggested and implemented by Trammell Hudson, who independently discovered the TOCTOU while working on his SPI flash emulator (Hudson, 2019) and helped me turn my POC into a much more realistic attack and report the issue to Intel Corp.

Intel Corporation recognized that even though the vulnerable code was in the UEFI reference implementation and not the ACM, it still impacted all Boot Guard implementations and thus considered the report to be in scope.

They not only fixed this specific issue, but also addressed the general class of ROM TOCTOU vulnerabilities by:

requiring EFI firmware to migrate all code and data to RAM
enabling paging after DRAM init and marking the IBB flash area not present

The mitigation code can be found at the EDK2 staging repository.

Trammell Hudson and I presented this work at Hack in the Box Amsterdam; our slides are available at the conference site and the talk itself is on YouTube.

I would like to thank

Trammell Hudson for guiding me through the vulnerability disclosure
RevSpace for providing me with a workshop and tools to develop the POC and test my findings
Intel Corp for recognizing the vulnerability, and going to great lengths to fix it while also allowing me and Trammell to present our work at HITB during the disclosure process.

References

Bypassing Intel Boot Guard. (2015). Embedi. Retrieved from https://embedi.org/blog/bypassing-intel-boot-guard/
Ermolov, A. (2016). Safeguarding rootkits: Intel BootGuard. ZeroNights. Retrieved from https://2016.zeronights.ru/wp-content/uploads/2017/03/Intel-BootGuard.pdf
Hudson, T. (2019). Spispy. Retrieved from https://trmm.net/Spispy

Toy operating system

Sun, 29 Apr 2018 00:00:00 +0200

Ever since I learned programming I had wanted to write my own operating system. At first, I was simply too young to understand the sheer difficulty of this task and did not have a clue as to what this would have taken. So back in 2014, after I had gained a few years of experience writing C I decided to finally take on this project.

As a disclaimer: the goal of this project was to understand operating systems on a technical level and to learn systems programming. This means that I have not chosen to spend much time on choosing the proper algorithms and that the quality of the code and design choices varies wildly.

I started out by writing some of the basic memory management code around March 2014 and by May I had implemented enough infrastructure to run a trimmed down version of busybox and GCC using newlib. I had stubbed out a lot of the less critical syscalls used by these programs and as such not all of the busybox applets worked, so I first wrote a bunch of utilities such as login and getty.

Once I was satisfied with this limited userland I decided to work on hardware support. Up to this point I had been using a tmpfs-like filesystem to test my kernel and hadn’t yet gotten around to implementing a disk driver or even PCI support. The ATA driver I initially wrote did not use DMA and was quite slow. As a first real filesystem I chose to use ext2 as it was relatively simple, but this did pose some issues as my filesystem layer was not quite compatible with ext2 yet. Adding this second filesystem driver helped me spot quite a few bugs in the VFS, block device and ELF loader code.

By July 20th 2014, the ext2 driver was mostly working and I had also fixed a load of other bugs in the kernel. I still did not support booting from anything other than ramfs, so the ext2 support was mostly used for large files such as the GCC port. All the work on the kernel was getting a bit annoying and I decided that adding a GUI would be fun, so I implemented a framebuffer driver and a framebuffer console, and added the mmap syscall to expose the framebuffer to userland.

I had not implemented networking so in order to provide IPC for the display server I chose SystemV IPC, which had a pretty simple interface and implementation.

At first I wrote a very simple compositor from scratch, but it simply could not keep up with the volume of data it had to move around. I did not feel like writing a SIMD version of this by hand so I rewrote it to use libcairo and libpixman for the heavy lifting. This was reasonably simple and by September 1st 2014 I had a simple desktop environment with a working terminal emulator running.

At this point, I had to put in some more work for high school and development slowed down. I did some more work on the ext2 driver, added some error handling and refactored the codebase.

By March 2015 I was still having a bunch of memory corruption issues and was hoping that a port to a different architecture might help track these down, so I started working on an ARMv7 port targeting my old Nokia N900 phone. This took me about a month: by March 29th I had it booting to /sbin/init on ARM:

Unfortunately, I never got around to running this build on a real device, but I do have a photo of an earlier build partially booting on real hardware:

After this I did some refactoring and implemented real POSIX signal support as well as allowing arbitrary root file systems instead of the ramfs based boot used earlier.

This and some other improvements allowed running real applications such as VIM and Python:

The corrupted directory listing seen in the Python screenshot is caused by running a busybox binary linked against a far too old libc as the ABI had some minor changes since I last built busybox.

Some more improvements to the ext2 driver meant that it was now stable enough to run in read/write mode and test things that required persistent files, such as rebuilding the kernel and booting it:

After this I wanted to add networking and as such added multithreading and replaced a large part of the scheduling code but other projects started taking up my time and development stopped there.

The toolchain is stuck at an old GCC version and does not build on recent distros.

The source code for the kernel is available at its GitHub repository.

The source code for the applications, libraries and build system is at the GitHub organization for this project.

DEC LK201 Emulator

Wed, 02 Aug 2017 00:00:00 +0200

I recently acquired a DEC VT220 video terminal which was fully functional aside from a little dirt on its case. However, it was not usable yet, as I did not have the proper keyboard for the terminal.

Because I did not have the money to buy the keyboard I started thinking about adapting a PC keyboard to work with it. Fortunately, the service manual for the terminal was available, which provided me with a description of both the hard- and software interface for the keyboard. The descriptions given in this manual were written almost like a specification and made implementing these features very easy.

The interface used by the LK201 (the type number of the keyboard the VT220 uses) is a binary protocol over RS-232 at 4800 baud, 8N1. This meant that a run-of-the-mill microcontroller should easily be able to implement it. I chose an Arduino Nano because my hackerspace had these on hand.

At first I tried to translate the PS2 scancodes as they came in from the keyboard and forward them to the terminal. This approach made implementing the more advanced features such as auto repeat and make/break codes more difficult. In the end I decided on having the PS2 code update a virtual key array in the interrupt handler and having the keyboard emulator code treat that as a real key array which it periodically scans. This made it a lot easier to implement the behaviour specified by the manual.
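The scanning approach can be modelled on a host machine with a short Python sketch. This is not the firmware itself: the class name, the timing constants and the event names ("make", "repeat", "break") are placeholders chosen for illustration and are not LK201 protocol codes. The interrupt handler is assumed to only flip entries in the key array, while the scanner derives all events from periodic snapshots of it.

```python
REPEAT_DELAY_TICKS = 5   # hypothetical: scans a key must be held before repeating
REPEAT_PERIOD_TICKS = 2  # hypothetical: scans between two repeats


class KeyScanner:
    """Turns periodic snapshots of a virtual key array into key events."""

    def __init__(self, num_keys):
        # Number of consecutive scans in which each key was seen down.
        self.held_ticks = [0] * num_keys

    def scan(self, key_down):
        """Process one snapshot (a list of booleans) and return the events."""
        events = []
        for key, down in enumerate(key_down):
            if not down:
                if self.held_ticks[key] > 0:
                    events.append(("break", key))
                self.held_ticks[key] = 0
                continue
            self.held_ticks[key] += 1
            ticks = self.held_ticks[key]
            if ticks == 1:
                events.append(("make", key))
            elif ticks > REPEAT_DELAY_TICKS:
                # Repeat on the first scan after the delay, then every period.
                if (ticks - REPEAT_DELAY_TICKS - 1) % REPEAT_PERIOD_TICKS == 0:
                    events.append(("repeat", key))
        return events
```

Because the scanner only looks at the current state of the array, a corrupted or missed scancode can at worst produce one wrong key state, and auto repeat becomes a matter of counting scans.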

This approach worked and I was now able to operate the set-up menu of the terminal. However, it was not stable and would often start spamming fictitious keystrokes or auto repeats of keys that had never been pressed. After a lot of debugging I discovered that the software serial port I was using for talking to the terminal was interfering with the timing of the PS2 interrupts and thereby corrupting data whenever a PS2 receive coincided with a serial receive. Switching to the hardware UART fixed this. The reason I did not immediately choose the hardware port was that I wanted to have a debugging port available during firmware development. Not having this debug port would have made development a whole lot harder, but when the software serial port eventually turned out to be causing bugs of its own I had no other choice than to stop using it.

The LK201 is more than just a simple keyboard translating switch press or release information into serial data: it also contains the speaker for the terminal and four status LEDs which can be individually controlled by the terminal. In order to make the keyboard easier to use I decided not to put these LEDs and the buzzer into the converter box, but on the keyboard itself. I opened the PS2 keyboard and used an old soldering iron to melt holes for the LEDs. I also used a labelwriter to produce the correct keycap markings for the new keyboard.

The source code for this project is available at its GitHub repository. The schematics were available on this site, but due to some unfortunate data loss I no longer have them.

The original version of this article was written using VIM on my VT220 using the keyboard emulator. It is thus not only “working” but usable as well.

Motorola 68008 computer with Arduino as ROM

Thu, 14 Aug 2014 00:00:00 +0200

After a fellow TkkrLab member donated some old chips (including some 68008 and Z80 CPUs) to the hackerspace I wanted to try to build a computer with the 68000 processor.

The 68008 is the 8-bit bus version of the 68000, the first processor of the 68k architecture. It is a 16-bit processor that was designed with forward compatibility in mind, and to achieve that goal all of its internal registers (except for the status register) are 32 bits wide. The 68008 is interesting because it has a relatively low pin count and is therefore quite easy to use on a breadboard.

I looked up the datasheet for the processor and noticed that it allowed a device on the bus to add wait states for as long as needed to complete a transfer. This sparked the idea to use an Arduino as “boot ROM”. I came up with a system where the Arduino would dynamically generate 32 byte chunks of 68k machine code to load data into the RAM, then disable the Arduino bus interface and reset the processor, booting the new code from RAM.

The idea was simple: make the glue logic map the first 64K of address space to the Arduino until it signalled boot complete. When that happened, it would map the RAM to the first 64K and reset, executing the code that was uploaded.

The implementation yielded some interesting problems. First, the Arduino turned out to be too slow at releasing ~DTACK, which caused the processor to assume every read was immediately completed until the Arduino released the line. I fixed this by adding a D flip-flop with the D input tied high, the reset input connected to NOT ~AS and the clock connected to the Arduino. The inverting output of the flip-flop is connected to ~DTACK so that the Arduino can assert ~DTACK by setting the clock input high. Once the cycle is complete the processor will release ~AS and the flip-flop will clear, causing ~DTACK to be released until the Arduino pulses the clock input again.
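The behaviour of this latch can be checked with a small Python model. It is only a sketch under assumptions: the reset input is taken to be active low (as on a 74-series '74 flip-flop), so the latch is held cleared whenever ~AS is high, and signal levels are represented as booleans with True meaning electrically high. The class and method names are invented for this example.

```python
class DtackLatch:
    """D flip-flop with D tied high, reset from NOT ~AS and ~Q wired to ~DTACK."""

    def __init__(self):
        self.q = False      # flip-flop output Q
        self.clock = False  # last level the Arduino put on the clock input
        self.as_n = True    # ~AS as driven by the CPU, high when idle

    def set_as_n(self, level):
        """CPU drives ~AS; a high level means no bus cycle is in progress."""
        self.as_n = level
        if self.as_n:
            # NOT ~AS is low here, which holds the (assumed active-low) reset.
            self.q = False

    def set_clock(self, level):
        """Arduino drives the clock input; only a rising edge latches D."""
        rising = level and not self.clock
        self.clock = level
        if rising and not self.as_n:
            self.q = True   # D is tied high

    @property
    def dtack_n(self):
        """Level on ~DTACK, taken from the inverting output ~Q."""
        return not self.q
```

The point of the edge-triggered clock shows up when the Arduino is slow to lower its output: a new bus cycle that starts while the clock line is still high does not see ~DTACK asserted, because no new rising edge has occurred.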

When this was fixed the setup would still not work: although setting the reset vectors appeared to work (I have the Arduino log the addresses requested from it), I could not get it to run a simple “loop: jmp loop”. After a lot of trial and error I discovered I had not connected D7 from the Arduino. As soon as I fixed this it worked: the logs showed the processor repeatedly fetching addresses 8 through 11, indicating it was actually looping.

I could, however, not write data anywhere, as the Arduino was still way too slow at releasing the bus. I had to add a bus transceiver between the Arduino and the bus to prevent the Arduino and the 68008 from driving it at the same time.

While doing this I realized that I could never write to the RAM while the bootstrap mode was enabled: Temporarily disabling that mode as I had intended to do would have caused the ~DTACK line to be asserted immediately by the RAM on any cycle after that so the Arduino would not have been able to (reliably) disable it again. To solve this I decided to map the third 64K block of address space to the RAM without routing it through the bootstrap logic so that I can write to the RAM while the Arduino is still mapped on the first 64K.
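The resulting memory map can be summarised as a small address decoder. The Python sketch below is an interpretation of the description, not a transcription of the glue logic: the function name and the device labels are made up, and the digital output register in the second 64K block is inferred from the $010000 address used by the test programs in this article.

```python
BLOCK_SIZE = 0x10000  # the glue logic decodes in 64K blocks


def decode(address, boot_complete):
    """Return (device, offset) for a bus address.

    boot_complete models the signal with which the Arduino hands the
    first 64K block over to the RAM.
    """
    block, offset = divmod(address, BLOCK_SIZE)
    if block == 0:
        # Bootstrap mode: the Arduino answers; afterwards the RAM does.
        return ("ram" if boot_complete else "arduino", offset)
    if block == 1:
        # Assumed location of the digital output register.
        return ("digital_output", offset)
    if block == 2:
        # RAM is always reachable here, bypassing the bootstrap logic.
        return ("ram", offset)
    return ("unmapped", offset)
```

With this layout, code served by the Arduino from block 0 can store the real program through block 2, and the same bytes appear at block 0 once the boot complete signal is raised.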

Now that I finally had the design flaws ironed out I decided to write a simple program for the LED attached to the digital output:

M68000
DL 0
DL 8
MOVE.B #$FF, D1
MOVE.B D1, $010000
LOOP:
JMP LOOP

When I tried to run this code the processor would lock up after fetching the MOVE.B D1, $010000 instruction. It turned out I had not made the digital output register assert ~DTACK, causing the processor to wait for it forever. This was a simple fix: I added an inverter between the write enable input of the register and the ~DTACK output, causing accesses to the digital output register to always complete immediately. I had now arrived at the (hopefully) correct glue logic combination to make this thing work.
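For reference, the first test program can be hand-assembled into the bytes the Arduino would have to serve. The Python sketch below uses the standard 68000 opcode words for these instructions. Picking the absolute short form for JMP is an assumption, since an assembler could just as well emit the long form, and the function name is invented for this example.

```python
import struct


def assemble_first_test():
    """Build the boot image for the LED test as big-endian 68000 code."""
    image = b""
    # Reset vectors: the CPU loads the stack pointer from address 0
    # and the program counter from address 4.
    image += struct.pack(">I", 0x00000000)            # DL 0
    image += struct.pack(">I", 0x00000008)            # DL 8
    # MOVE.B #$FF, D1: opcode word followed by the immediate byte in a word.
    image += struct.pack(">HH", 0x123C, 0x00FF)
    # MOVE.B D1, $010000: opcode word followed by a 32-bit absolute address.
    image += struct.pack(">HI", 0x13C1, 0x00010000)
    # LOOP: JMP LOOP, assumed to use a 16-bit absolute address.
    loop_address = len(image)
    image += struct.pack(">HH", 0x4EF8, loop_address)
    return image


if __name__ == "__main__":
    print(assemble_first_test().hex(" "))
```

The store to $010000 is the first bus cycle in this image that does not target the Arduino, which is why a missing ~DTACK on the output register shows up at exactly that instruction.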

Having done so much simple testing I wanted to finally see something happen on the breadboard and wrote this simple blink program:

M68000
DL 0
DL 8
MOVE.B #$FF, D1

LOOP:
MOVE.L #$FF, D0

LOOPW1:
SUBQ #1, D0
BNE LOOPW1

MOVE.B D1, $010000
NOT D1
BRA LOOP

Against all expectations, this worked right away. The LED started blinking nicely, albeit very slowly, because the Arduino fetches take 1 millisecond per byte.
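A rough estimate of the blink rate follows from counting the bytes fetched per pass through the program. The Python sketch below is a back-of-the-envelope model only: it assumes standard 68000 instruction sizes with short branches, charges 1 ms for every instruction byte, and ignores both the prefetch of words that are discarded on a taken branch and the time of the write cycle itself.

```python
BYTE_FETCH_MS = 1     # from the article: about 1 ms per byte fetched
DELAY_COUNT = 0xFF    # loop count loaded by MOVE.L #$FF, D0

# Assumed instruction sizes in bytes.
SIZE_MOVE_L_IMM = 6   # MOVE.L #$FF, D0 (opcode + 32-bit immediate)
SIZE_SUBQ = 2         # SUBQ #1, D0
SIZE_BNE = 2          # BNE LOOPW1 (short branch)
SIZE_MOVE_B_ABS = 6   # MOVE.B D1, $010000 (opcode + 32-bit address)
SIZE_NOT = 2          # NOT D1
SIZE_BRA = 2          # BRA LOOP (short branch)


def half_period_ms():
    """Estimated time between two toggles of the LED, in milliseconds."""
    inner = DELAY_COUNT * (SIZE_SUBQ + SIZE_BNE)
    outer = SIZE_MOVE_L_IMM + SIZE_MOVE_B_ABS + SIZE_NOT + SIZE_BRA
    return (inner + outer) * BYTE_FETCH_MS


if __name__ == "__main__":
    print(half_period_ms(), "ms between toggles")
```

Under these assumptions the LED toggles about once per second, so the delay loop of only 255 iterations is enough to make the blinking visible.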

I have not yet tried writing to the RAM or disabling the Arduino but the most tricky part of the build works: The slow 8-bit micro acting as ROM.

I have uploaded the Arduino code here and the schematics are available online here.

Java OpenGL based graphics engine

Fri, 16 Aug 2013 00:00:00 +0200

Two years ago I wanted to replace the engine of an old game I had reverse engineered with a more modern one, but writing the whole new engine into the game seemed like a waste to me so I wrote this graphics engine. It has APIs for low level OpenGL features and is based on LWJGL, but also includes very high level interfaces which allow you to construct an application with very little OpenGL or computer graphics knowledge.

It consists of several layers of abstraction. The lowest layer wraps basic GL operations such as managing and encoding buffers of geometric data, and integrates their deallocation into the Java GC to prevent OpenGL server memory leaks.

The layer above that manages objects in a scene graph. Every object in it (rendering objects, cameras, effects etc.) is based on a Node superclass which provides methods for managing its position in the scene graph. This layer also contains two special kinds of Node implementations: Camera and GeometryNode.

Camera provides a customizable camera which by default uses perspective projection and allows for extra rendering configuration to be done using RenderControl instances. Changing properties of the camera that are not exposed through setter methods is possible by overriding methods like updateProjectionMatrix().

GeometryNode is a base class for objects that render geometry or do other operations affected by the modelview matrix. It sets up the modelview matrix in such a way that any operations done within renderGeometry(Camera) will have their results positioned at that Node's position in the world.
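The relationship between Node and GeometryNode can be illustrated with a simplified model. The Python sketch below is not the engine's Java API: it only handles translation, it assumes that a node's world position is the sum of its own position and those of its ancestors, and all method names other than the class names are invented for this example.

```python
class Node:
    """Scene graph element with a position relative to its parent."""

    def __init__(self, position=(0.0, 0.0, 0.0)):
        self.position = position
        self.parent = None
        self.children = []

    def add_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def world_position(self):
        """Accumulate positions from this node up to the root."""
        x, y, z = self.position
        if self.parent is None:
            return (x, y, z)
        px, py, pz = self.parent.world_position()
        return (x + px, y + py, z + pz)


class GeometryNode(Node):
    """Node whose geometry is drawn at its position in the world."""

    def render(self, draw_calls):
        # Stand-in for setting up the modelview matrix before drawing.
        self.render_geometry(self.world_position(), draw_calls)

    def render_geometry(self, origin, draw_calls):
        # Subclasses would emit geometry here; this model records the origin.
        draw_calls.append(origin)
```

A subclass only has to describe its geometry around its own origin; where that origin ends up in the world is decided by the chain of parents.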

The next level above that is mainly for simple programs that do not really need to utilise the power of the lower level APIs and simply need a way to show some nice looking 3D objects. It provides a base class for the application, SimpleApplication, which takes care of all the complicated things like initializing the engine and implementing a main loop that does all the calls necessary to get the image rendered. The CameraControl framework is also part of this layer. It provides a simple way of adding, for example, a first person camera to your application.

The code is available on GitHub.