Junpeng Wan /dʒuːn.pɛŋ wɑːn/ (joon-peng wahn) 万俊鹏 Computer Security & Systems stefan1wan.github.io/about/ Sat, 18 Apr 2026 16:31:17 +0000 Jekyll v3.10.0 ARM BTB reverse engineering <h2 id="update">Update</h2> <p>I reproduced this work on a Raspberry Pi 4B. The report can be found on <a href="https://arxiv.org/abs/2412.05413">arXiv</a>, and it is much clearer than this blog post. The corresponding code is available <a href="https://github.com/stefan1wan/BTB_ARM_RE">here</a>.</p> <h2 id="intro">Intro</h2> <p>A year and a half ago, I needed to figure out the <a href="http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html">BTB (branch target buffer)</a> capacity of an ARM server, but no public documentation could be found at the time. The good news was that previous work had reverse engineered BTB capacity on x86 architectures [1][2]. I was lucky enough to reproduce it, and according to my results on a Kunpeng 920, the BTB capacity is 4K entries. My code can be found <a href="https://github.com/stefan1wan/BTB_ARM_RE">here</a>.</p> <h2 id="details">Details</h2> <p>My method is to count a PMU event called ARM_PMU_BR_MIS_PRED while executing a bunch of branches. This event counts branches that are mispredicted or not predicted [3].</p> <p>The following pseudocode is our test gadget, which consists of fall-through unconditional indirect branches. We control the branch count <strong>B</strong> and the alignment distance <strong>N</strong>, where N is the gap in bytes between two labels (or basic blocks).</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adr x0, next1
br x0
nop
nop
.......(some other nops)
nop
next1:
adr x0, next2
br x0
nop
nop
.......(some other nops)
nop
next2:
adr x0, next3
br x0
nop
nop
.......(some other nops)
nop
next3:
ret
</code></pre></div></div> <p>We count branch misprediction rates for different values of B and N.
For each test, we first execute the test gadget 10 times to warm up the BTB; then we run it once more and log the change in the PMU event counter (C). The branch misprediction rate is then C/B. The result is shown in the following diagram. <img src="/images/posts/BTB/ARM-capacity.png" style="zoom:100%" /></p> <p>From the diagram, we can conclude:</p> <ul> <li>The BTB set index starts from bit 5: for B from 3K to 7K, the misprediction rate stays almost the same when N goes from 16 to 32 (32=2^5), but there is a leap from 32 to 64.</li> <li>The BTB holds 4096 entries: when N is 32, the rate is 0.01 for B=4K but 0.39 for B=5K. However, 40-50 branches are still not buffered by the BTB, which may be caused by timer interrupts or other noise.</li> </ul> <h2 id="a-little-bit-more">A little bit more</h2> <p><em>I am not sure about the following content; corrections are welcome.</em></p> <h3 id="ways">Ways</h3> <p>We already know that the set index starts from bit 5. Since the BTB capacity is 4K, there are at most 12 set index bits (2^12=4096). Therefore, if we let N &gt;= 2^17, all branches fall into the same set. By counting branch mispredictions, we can determine the number of ways in each set. Here is our result. <img src="/images/posts/BTB/ARM-setindex.png" style="zoom:70%" /> From the picture, we may conclude that each set has 8 ways, because when B=8 there is always a low miss rate for log(N) &gt;= 17. However, if the BTB has an eviction buffer, the number of ways may be smaller (like 4 or 6). If there are 8 ways, then there are 512 sets (8*512=4096) and the set index should be bits 5-13. However, we observe that when log(N)=14 and B=16, the miss rate is 0, which implies that the 16 branches are held by different sets, so bit 14 is also used for indexing. As the picture shows, each bit from 5 to 14 can influence the set distribution and thus the results. There should be a hash function that maps bits 5-14 to the 512 BTB sets. If there are 4 ways and a BTB eviction buffer exists, then there are 1024 sets (4*1024=4096) and the set index should be bits 5-14.</p> <p>According to Ockham’s razor, I tend to believe the BTB is 4-way and some eviction buffer exists, but more exploration is needed to confirm that.</p> <h2 id="reference">Reference</h2> <ul> <li>[1] <a href="https://ieeexplore.ieee.org/document/4919652">Experiment Flows and Microbenchmarks for Reverse Engineering of Branch Predictor Structures</a></li> <li>[2] <a href="https://xania.org/201602/bpu-part-three">The BTB in contemporary Intel chips</a></li> <li>[3] Armv8-M Architecture Reference Manual</li> </ul> Tue, 14 May 2024 00:00:00 +0000 stefan1wan.github.io/about/2024/05/BTB/ KnowledgeShare Change SSD and Battery for my old MBP <h2 id="intro">Intro</h2> <p>I bought my MacBook Pro five years ago. Now, in the winter of 2021, its battery life is very short, about one hour of typing, and its SSD offers only 256GB of space, so I have to store my virtual machines and some data on an external SSD. Fortunately, I found that for my model, Retina, 15-inch, Mid 2015 (A1398), both the battery and the SSD can be replaced.</p> <h2 id="my-choices">My choices</h2> <p>I learned the SSD strategy from [1] and the battery choice from [2]. It cost me about 1000 RMB altogether, which I think is worth it because this way I do not need to buy a new computer.</p> <h3 id="ssd">SSD</h3> <p>The SSD interface in the Mac is mSATA. However, an mSATA SSD is very expensive, so I chose an M.2 SSD with an <a href="https://item.m.jd.com/product/100017938802.html">M.2-to-mSATA converter</a>, which is much cheaper. Since my model supports PCIe 3.0 x4, whose theoretical maximum bandwidth is 4GB/s, it was better for me to buy a high-speed SSD.
My choice was a Samsung 970 EVO Plus, 512GB, which can achieve sequential read/write speeds of 3500MB/s and 3300MB/s. (This is not an advertisement.) Note that the Samsung 980 cannot be used as the system disk in macOS. (I wasted an afternoon on that.)</p> <h3 id="battery">Battery</h3> <p>I bought a battery kit from <a href="https://item.jd.com/4494203.html">Jingdong</a> that contains a full set of tools, like screwdrivers and anti-static gloves.</p> <h2 id="procedures">Procedures</h2> <p>The following are my procedures:</p> <ul> <li>back up the system using Time Machine</li> <li>remove the screws on the back of the computer and take off the back lid (there is so much dirt!)</li> <li>unplug the battery connector to prevent accidents</li> <li>change the SSD <ul> <li>unscrew the screws of the SSD, like this <img src="/images/posts/SSD&amp;Battery/mSATA_SSD.jpg" style="zoom:20%" /></li> <li>remove the old SSD</li> <li>insert the M.2-to-mSATA converter into the mSATA interface</li> <li>put the new M.2 SSD on the M.2-to-mSATA converter</li> <li>tighten the screws</li> </ul> </li> <li>change the battery (there is a detailed guide in the battery case) <ul> <li>release the wires of the touchpad</li> <li>use a pry tool and ethanol to pry off the old battery (it may take you a while)</li> <li>put the new battery in and plug in the battery connector</li> <li>put back the wires of the touchpad</li> </ul> </li> <li>re-tighten the screws</li> <li>start the system with “cmd+option+r+power”, and it will recover from the Internet</li> <li>erase the disk as APFS with a GUID partition map, then restore the system from Time Machine</li> </ul> <h2 id="ends">Ends</h2> <p>My old battery looks like this: <img src="/images/posts/SSD&amp;Battery/Old_battery.jpg" style="zoom:20%" /> If I had known this earlier, I would not have carried it around every day; it looks like it could explode at any time. Now I have a much better battery life. Most importantly, I no longer need to worry much about disk space.
<img src="/images/posts/SSD&amp;Battery/newstorage.png" style="zoom:50%" /></p> <h2 id="reference">Reference</h2> <ul> <li>[1] <a href="https://post.smzdm.com/p/a783vk9g/">SSD</a></li> <li>[2] <a href="https://post.smzdm.com/p/a78zn859/">Battery</a></li> </ul> Sat, 20 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/SSD&Battery/ Tutorial Drivers <h1 id="drivers">Drivers</h1> <h2 id="intro">Intro</h2> <p>I wrote some <a href="/files/Drivers.key">slides</a> to share basic knowledge about drivers. My main reference is <a href="http://gauss.ececs.uc.edu/Courses/e4022/code/drivers/Kernel/docs.html">Writing Network Device Drivers for Linux</a>.</p> Sun, 14 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/Drivers/ KnowledgeShare RSS <h1 id="rss-you-decide-what-you-read">RSS: You decide what you read</h1> <h3 id="introduction">Introduction</h3> <p>Sometimes I feel bored and want to read and learn something new, but I don’t know what to read: the bookmarks in my Chrome are chaotic, and news apps send me a lot of things I don’t care about. As a result, I end up spending a lot of time on Moments and Weibo. Recently, I found that <a href="https://en.wikipedia.org/wiki/RSS">RSS</a> could solve this problem by making it easy to access new updates from the websites I am interested in. In fact, I have collected a lot of information sources from the Internet, like blogs, tutorials, and some official websites. But in general, they just sit in my bookmark folders in Chrome.
If I use RSS to follow their new content, I will always have something to read.</p> <h3 id="solution">Solution</h3> <p>My solution is quite simple, divided into three steps:</p> <ul> <li>register an account on <a href="https://www.inoreader.com/">Inoreader</a>, which is an RSS reader.</li> <li>add an extension to Chrome: <em><a href="https://chrome.google.com/webstore/detail/rss-reader-extension-by-i/kfimphpokifbjgmjflanmfeppcjimgah">RSS Reader Extension (by Inoreader)</a></em>. If a website has a web feed (supports RSS), you can subscribe to it by clicking this extension.</li> <li>read my subscriptions on the Inoreader website or in the Inoreader application on my phone.</li> </ul> <p>In this way, I can subscribe to anything that attracts me and read the new content when I am bored.</p> <p>By the way, feel free to subscribe to my blog via RSS!</p> Tue, 02 Nov 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/11/RSS/ Misc Mesh Side-Channel Attack <h1 id="mesh-side-channel-attack">Mesh Side-Channel Attack</h1> <h2 id="introduction">Introduction</h2> <p>In this blog, I will briefly introduce a research project done by our group, which we have submitted to a top conference. You can find further details in <a href="/files/MeshUp.pdf">our paper</a>.</p> <p>By accessing cachelines to create directed data flows, we can congest a router of the mesh interconnect on a server-grade CPU, where we obtain a stable delay. If another program accesses memory at the same time and its cachelines are transferred through the router we congested, we observe a higher delay. By recording all the delays, our attack is able to deduce the victim program’s secret information, for example, an RSA private key.
We attacked a Java program running on the JVM and captured its square-and-multiply sequences.</p> <p>Our attack consists of three parts:</p> <ul> <li>Reverse engineering of the mesh NoC topology.</li> <li>Implementing point-to-point mesh accesses to congest the interconnect.</li> <li>Recording the logs and deducing the secrets.</li> </ul> <h2 id="reverse-engineering">Reverse Engineering</h2> <p>Take the Xeon 8260 (our experimental environment) as an example: there are 28 tiles on the CPU chip, and each tile has core and uncore components. The CHA in the uncore is responsible for serving LLC accesses from the cores and managing the LLC slice on its tile. <!-- ![](/images/posts/Mesh_Attack/Xeon_layout.png) --> <img src="/images/posts/Mesh_Attack/Xeon_layout.png" style="zoom:80%" /> To create point-to-point congestion, we need to learn the mapping relationships among tiles, CHAs, and cores. First, 4 tiles of the 8260 were disabled after production. We can confirm which 4 tiles are disabled by reading the MSR register CAPID6. On our machine, bits 2, 3, 21, and 27 are 0, which means these four tiles are disabled. As a result, the layout is as follows: <img src="/images/posts/Mesh_Attack/Tile.png" alt="" /></p> <p>By the way, tile and CHA IDs grow from top to bottom and left to right, so CHA 2 is in tile 4. In this way, we can map the CHA ID to the tile ID for all CHAs. We also need the relationship between core IDs (the physical core ID in the OS) and CHA IDs. We found this information can be obtained from a PMU event, <em>LCORE_PMA GV</em> (Core Power Management Agent Global system state Value). First, we bind a process to a core (ID=X) and perform a lot of operations in that process (e.g., accessing a large volume of memory). At the same time, we monitor the <em>LCORE_PMA GV</em> counter of every CHA. We observe that the counter on one CHA (ID=Y) is higher than on the others.
So we can confirm that core X and CHA Y lie on the same tile, because the activity of core X changes the power management state of CHA Y. Repeating the above procedure for cores 0 through 23, we learn the mapping between cores and CHAs shown in the following picture. <img src="/images/posts/Mesh_Attack/CHA_CORE.png" alt="" /></p> <h2 id="point-to-point-mesh-interconnect-congestion">Point to point mesh interconnect congestion</h2> <!-- ![](/images/posts/Mesh_Attack/Cache_access.png) --> <p><img src="/images/posts/Mesh_Attack/Cache_access.png" style="zoom:60%" /> As the picture shows, the LLC is non-inclusive and shared, while the L1/L2 caches are private to each core. So a core can access LLC slices on any tile. Each LLC slice is managed by the CHA on its tile, and a hash algorithm determines the CHA ID that manages a specific cacheline. By the way, the input of this hash algorithm is bits 6 to 63 of the physical address. <img src="/images/posts/Mesh_Attack/Associative.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Associative.png) --></p> <p>We devise an eviction-based method, L2-evict, to generate the memory access flow, which is similar to the concurrent work <a href="https://arxiv.org/abs/2103.03443">Lord of the Ring(s)</a>. Suppose we want to congest the interconnect between core R and CHA T. First, we find an EV (eviction set): cachelines in one EV will map to one set of core R’s L2 cache and will be managed by CHA T. To find the EV for a specific LLC slice, we use the <em>check_conflict</em> and <em>find_EV</em> functions from <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=8835325&amp;tag=1">Attack Directory</a>. Such information can also be obtained from PMU events.
To get cachelines that map to one specific L2 cache set, we use bits 6-15 of the physical address.</p> <p>If we access the cachelines in the above EV, they will be evicted from L2 to the LLC and then reloaded from the LLC to L2 (the L2’s eviction policy is pseudo-LRU). On our machine, an L2 set has 16 ways and an LLC set has 11 ways. To avoid cachelines in the LLC being evicted to memory, which would introduce a higher delay, we set bit 16 to 0 for half of the EV’s cachelines and to 1 for the other half. In this way, the EV is spread over 2 LLC sets. According to our tests, setting the number of cachelines in the EV to 24 maximizes the congestion of the mesh interconnect.</p> <p><img src="/images/posts/Mesh_Attack/Mapping.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Mapping.png) --></p> <h2 id="recording-and-analysis">Recording and Analysis</h2> <p>We access 20 EVs and record an <em>rdtscp</em> timestamp each time (in fact, we access 10 EVs, each of them twice). From the gaps between <em>rdtscp</em> timestamps, we can infer the program’s secret information.</p> <p>For example, when a Java program running on the JVM decrypts an RSA-encrypted message with a private key, it calls the <em>modPow()</em> method of the JDK’s BigInteger class, which adopts a sliding-window algorithm. Our attack is able to capture the square-and-multiply sequences of the sliding-window algorithm. As the following picture shows, we can capture 3 kinds of memory access patterns. For instance, in pattern <em>B</em> we observe square operations directly, and we can then deduce the multiply operations from the gaps between the captured square operations. By applying the <a href="https://eprint.iacr.org/2017/627.pdf">SRID algorithm</a>, we can recover about 30% of the private key bits.
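</p>
<p>The EV selection described above can be sketched as follows (a simplified illustration under the stated bit positions; <code>build_ev</code> and the candidate pool are hypothetical, and a real implementation must also check the CHA slice hash so that all lines are managed by the target CHA):</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EV_SIZE 24

/* L2 set index: bits 6-15 of the physical address (as described above). */
static unsigned l2_set(uint64_t pa) { return (pa >> 6) & 0x3ff; }

/* Pick EV_SIZE candidate lines that map to `set`, half with bit 16
 * clear and half with it set, so the EV spreads across two LLC sets. */
size_t build_ev(const uint64_t *cand, size_t n, unsigned set, uint64_t *ev) {
    size_t got = 0, low = 0, high = 0;
    for (size_t i = 0; i < n && got < EV_SIZE; i++) {
        if (l2_set(cand[i]) != set) continue;
        int b16 = (cand[i] >> 16) & 1;
        if (b16 == 0 && low < EV_SIZE / 2)       { ev[got++] = cand[i]; low++; }
        else if (b16 == 1 && high < EV_SIZE / 2) { ev[got++] = cand[i]; high++; }
    }
    return got;   /* number of lines collected; EV_SIZE on success */
}
```

<p>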
<img src="/images/posts/Mesh_Attack/Pattern.png" style="zoom:60%" /> <!-- ![](/images/posts/Mesh_Attack/Pattern.png) --></p> Fri, 08 Oct 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/10/Mesh_attack/ Research The Network-On-Chip Structure of Skylake and Congestion Monitoring <h1 id="the-network-on-chip-structure-of-skylake-and-congestion-monitoring">The Network-On-Chip Structure of Skylake and Congestion Monitoring</h1> <p>If we want to understand the functions and behaviors of the mesh network in Skylake, one way is via <a href="http://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf">PMON</a>. We can read the counters of specific events through PMON to infer the inner state of the CPU. For example, if we monitor the event HORZ_RING_BL_IN_USE and read the corresponding counters, it tells us for how many uncore cycles the horizontal BL (block) ring is in use. One of our aims is to characterize the degree of congestion by counting PMON events, many of which are related to congestion, but unfortunately Intel did not explain these events clearly. However, if we know some of the design ideas behind the Skylake Network-On-Chip (NoC) structure, especially the design of the routers and the flow control functions, we are able to learn more from these events.</p> <p><img src="/images/posts/Skylake_NOC/mesh.png" alt="" /></p> <h2 id="the-router">The Router</h2> <p>From a macro perspective, the NoC of Skylake is a mesh network. The routing algorithm is Y-X routing, which avoids deadlock and is easy to implement: data travels along the vertical ring first and then along the horizontal ring. The Common Mesh Stop (CMS) is effectively the router of the mesh and connects the rings in four directions. The picture below shows the CMS in the PMON document (Ref. 1). It has two agents with partly different functions, which can transfer data from the AD, AK, BL, and IV rings.
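</p>
<p>The Y-X routing rule mentioned above is simple enough to sketch (a toy model with hypothetical (row, column) tile coordinates, ignoring ring arbitration and buffering):</p>

```c
#include <assert.h>

/* Toy Y-X routing: a packet first travels along the vertical ring to the
 * destination row, then along the horizontal ring to the destination
 * column. Returns the total hop count. */
int yx_route(int src_row, int src_col, int dst_row, int dst_col) {
    int hops = 0;
    while (src_row != dst_row) {                  /* vertical ring first */
        src_row += (dst_row > src_row) ? 1 : -1;
        hops++;
    }
    while (src_col != dst_col) {                  /* then horizontal ring */
        src_col += (dst_col > src_col) ? 1 : -1;
        hops++;
    }
    return hops;
}
```

<p>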
The AD, AK, and BL rings each have two directions, but the IV ring has only one. <img src="/images/posts/Skylake_NOC/cms.png" alt="" /> We can see this from the descriptions of the unit masks for TxR_VERT_CYCLES_FULL. <img src="/images/posts/Skylake_NOC/cycles_full.png" alt="" /> By the way, among the PMON events, Egress has both vertical and horizontal descriptions while Ingress has only the horizontal one, and the EGR area is about twice that of IGR. So we guess that the Egress buffers have a dedicated buffer for each direction, while the Ingress buffer stores all incoming packets.</p> <p>To understand the microarchitecture, one possible source is a paper written by Intel (Ref. 3), published around the time Skylake was designed. The router’s design is as follows. <img src="/images/posts/Skylake_NOC/modex.png" alt="" /></p> <h2 id="flow-control">Flow Control</h2> <p>According to the descriptions of the PMON events, the flow control of Skylake is lossless and, specifically, credit-based.</p> <h2 id="congestion-moniter">Congestion Monitor</h2> <p>We can infer the congestion state from many CMS events, including the following:</p> <ul> <li>RING_IN_USE: the uncore cycles during which the rings are in use</li> <li>NACK: no response received when the CMS sends messages</li> <li>BYPASS: packets that bypass the CMS Ingress or Egress buffer</li> <li>ADS: the Anti-Deadlock Slot was used</li> <li>SINK_STARVED: packets discarded due to starvation</li> <li>STALL: stalled due to a lack of credits</li> </ul> <p>The experimental evaluation can be found in our <a href="https://arxiv.org/pdf/2103.04533.pdf">paper</a>.</p> <h2 id="reference">Reference</h2> <ol> <li><a href="http://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf">Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring</a></li> <li><a href="https://slideplayer.com/slide/14268395/">Topology and Cache Coherence in Knights Landing and Skylake Xeon Processors</a></li> <li>MoDe-X:
Microarchitecture of a Layout-Aware Modular Decoupled Crossbar for On-Chip Interconnects, IEEE Transactions on Computers, Vol. 63, No. 3, March 2014, p. 622.</li> <li><a href="https://patents.google.com/patent/US20150006776">Intel’s patent on on-chip mesh interconnects</a></li> <li><a href="https://patents.google.com/patent/US20170019350A1/en">Intel’s patent on shared mesh</a></li> <li><a href="https://stackoverflow.com/questions/50077189/skylake-and-newer-ring-bus">CMS answer on Stack Overflow</a></li> <li><a href="https://en.wikichip.org/wiki/intel/mesh_interconnect_architecture">Mesh principles</a></li> </ol> Wed, 03 Mar 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/03/Skylake_NOC_functions/ Research Invisible Probe <h1 id="invisible-probe-timing-attacks-with-pcie-congestion-side-channel">Invisible Probe: Timing Attacks with PCIe Congestion Side-channel</h1> <h2 id="introduction">Introduction</h2> <p>In this blog, I will introduce a research project done by our group, <a href="https://www.ieee-security.org/TC/SP2021/program-papers.html"><em>Invisible Probe: Timing Attacks with PCIe Congestion Side-channel</em></a>. <a href="http://homepage.fudan.edu.cn/zz113/">My supervisor</a> led this project, and my contribution is mainly in the experimental part, which involved some exploration. The details are in the paper.</p> <h2 id="attack-surface---pcie-peripheral-component-interconnect-express-link">Attack Surface - PCIe (Peripheral Component Interconnect express) Link</h2> <p>This is the first work to focus on a side-channel attack through congestion in PCIe. If an attacker creates congestion on a PCIe link, she may perceive what is being transferred over that link.
We identify 2 threat scenarios and test them by designing 4 specific experiments.</p> <h2 id="two-threat-scenarios">Two Threat Scenarios</h2> <p><img src="/images/posts/inv_probe/topology.png" alt="" /></p> <h3 id="pch-nvme-ssd--nic">PCH: NVMe SSD &amp; NIC</h3> <p>The PCH (Platform Controller Hub) was designed to connect multiple relatively slow devices, like hard disks, sound cards, and NICs. We assume that an NVMe SSD and a NIC are both connected through the PCH, so the attacker can repeatedly access the SSD via <a href="https://spdk.io/">SPDK</a> to congest the PCH and log the interval between every 2 accesses. If a victim is browsing a website at the same time, the traffic transferred back through the NIC will increase the intervals logged by our attacker. By training deep learning models on the different logs, we can distinguish the different websites the victim is browsing.</p> <h3 id="pcie-switch-rdma-nic--gpu">PCIe Switch: RDMA NIC &amp; GPU</h3> <p>A PCIe switch allows several devices to share one interface offered by the CPU. We assume that an RDMA NIC and a GPU are connected through a PCIe switch. The attacker repeatedly accesses memory through another machine’s RDMA NIC, so traffic transferred across the PCIe switch can be discovered.
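</p>
<p>In both scenarios the raw signal is the sequence of probe intervals; a minimal detector might just flag intervals well above the idle baseline (my own simplification with a hypothetical threshold factor; the paper instead feeds the raw interval traces into deep learning models for most experiments):</p>

```c
#include <assert.h>
#include <stddef.h>

/* Flag probe intervals that exceed baseline * factor, i.e. moments when
 * the shared link looks congested by victim traffic. Returns how many
 * intervals were flagged and stores their indices in out. */
size_t flag_congested(const double *intervals, size_t n,
                      double baseline, double factor, size_t *out) {
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (intervals[i] > baseline * factor)
            out[m++] = i;
    return m;
}
```

<p>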
The victim’s GPU-related activities, like browsing websites, training models, and typing passwords, can be perceived.</p> <h2 id="four-specific-experiment">Four Specific Experiments</h2> <table> <thead> <tr> <th>NO</th> <th>Congested at</th> <th>Attacker operates on</th> <th>Victim operates on</th> <th>Information stolen</th> </tr> </thead> <tbody> <tr> <td>A</td> <td>PCH</td> <td>NVMe SSD</td> <td>NIC</td> <td>Websites</td> </tr> <tr> <td>B</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Websites</td> </tr> <tr> <td>C</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Trained models</td> </tr> <tr> <td>D</td> <td>PCIe switch</td> <td>RDMA NIC</td> <td>GPU</td> <td>Password keystrokes</td> </tr> </tbody> </table> <p>In the 4 experiments above, we first use our attack scripts to create congestion while logging the access time intervals, and then recover information in 2 ways. For A, B, and C, we collect enough data and train deep learning models. For D, we simply extract the keystrokes from the logs. When the victim types a password in Chrome, the intervals are as in Fig. 3, where the red stars are the keystrokes related to the password. <img src="/images/posts/inv_probe/strokes.png" alt="" /></p> <!-- + Congest PCH with NVMe to distinguish website + Congest PCIe switch with RDMA NIC to distinguish websites + Congest PCIe switch with RDMA NIC to distinguish trained models + Congest PCIe switch with RDMA NIC to distinguish password keystrokes --> Tue, 02 Mar 2021 00:00:00 +0000 stefan1wan.github.io/about/2021/03/invisible_probe/ Research LITE Kernel RDMA <h1 id="paper-read-lite-kernel-rdma">Paper Read: LITE Kernel RDMA</h1> <p>Next week I’ll give a presentation in <em>Advanced Network</em>, a graduate course. Our teacher provided a list of papers on computer networking, from which we each choose a paper to present and introduce in class.
The paper I chose is the 2017 <em><a href="https://www.sigops.org/s/conferences/sosp/2017/program.html">SOSP</a></em> paper <em><a href="https://cseweb.ucsd.edu/~yiying/LITE-sosp17.pdf">LITE Kernel RDMA Support for Datacenter Applications</a></em>. Here are some important points of their work.</p> <h2 id="the-abstraction-mismatch">The Abstraction Mismatch</h2> <p><img src="/images/posts/LITE_RDMA/Native_RDMA.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/Native_RDMA.png) --></p> <p>As the picture shows, with native RDMA the programmer writes code against the libraries provided by the RNIC hardware, totally bypassing the kernel. So native RDMA offers low-level, difficult-to-use APIs, while what developers want are high-level, easy-to-use APIs. Hence there is an abstraction mismatch.</p> <p>Things worked well in HPC (High-Performance Computing), which has special hardware, few applications, and relatively cheap developer effort. In datacenters, by contrast, we have commodity, cheaper hardware and handle a lot of changing applications. Resource sharing and isolation are also problems in this scenario.</p> <p>Hence, things get very complicated when trying to use native RDMA in datacenters.</p> <h2 id="what-this-paper-do-in-general">What This Paper Does In General</h2> <p>This paper adds an indirection tier in the Linux kernel to support RDMA operations at the OS (operating system) level. The name <em>LITE</em> comes from “Local Indirection TiEr”: as it suggests, the designers add one kernel layer on the local node only; the remote side is the same as in native RDMA. With the support of the OS, LITE can provide high-level APIs to userspace, so applications become simpler.
<em>LITE</em> also on-loads the <em>permission check</em> and <em>address mapping</em> operations into the kernel, so <em>LITE</em> needs simpler hardware.</p> <h2 id="design-and-abstraction-principles">Design and Abstraction Principles</h2> <p>If we want features like <em>high-level abstraction</em>, <em>resource sharing</em>, <em>performance isolation</em>, and <em>protection</em>, one easy way is to use the kernel. And as Butler Lampson says, “All problems in computer science can be solved by another level of indirection”. So what <em>LITE</em> does is add an indirection layer in the kernel. <em>LITE</em> is built on RDMA verbs, so it is easy to support different hardware. By the way, verbs are just low-level descriptions of RDMA, not APIs.</p> <p>They list three design principles:</p> <ol> <li>Indirection only at the local side for one-sided RDMA</li> <li>Avoid hardware indirection</li> <li>Hide kernel cost</li> </ol> <p>To avoid the existing hardware indirection, the authors find an API that can register physical addresses in the kernel. In this way, there is no need to cache the PTEs. <em>LITE</em> registers the whole memory at once and manages it in the kernel, so we only need to store one pair of global keys in RNIC SRAM.</p> <h2 id="some-performance">Some Performance</h2> <p>LITE scales much better than native RDMA with respect to MR size and count.
<img src="/images/posts/LITE_RDMA/MR_SC.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/MR_SC.png) --> LITE adds only a very slight overhead even when native RDMA doesn’t have scalability issues. <img src="/images/posts/LITE_RDMA/Latency.png" style="zoom:50%" /> <!-- ![](/images/posts/LITE_RDMA/Latency.png) --></p> <h2 id="in-the-end">In The End</h2> <p>In the <em><a href="https://github.com/WukLab/LITE">code</a></em> of LITE, we can see that it was written and compiled as several kernel modules, but it only supports kernel versions <em>3.11.1</em>, <em>3.10.108</em>, and <em>4.9</em>. If possible, I will read the source code and write another blog post (it’s a flag :)). But since <em>io_uring</em> has already appeared in the kernel, LITE will be harder to put into use (as my advisor says).</p> Thu, 10 Dec 2020 00:00:00 +0000 stefan1wan.github.io/about/2020/12/LITE_Kernel_RDMA/ KnowledgeShare My first blog <!-- # Hello world --> <!-- ![](/images/avatar.jpg) --> <p><img src="/images/avatar.jpg" style="zoom:30%" /></p> <h1 id="my-first-blog">My First Blog</h1> <p>This is my blog, where I share what I learn and how I think in my study and work, maybe a finished project or a recently read paper. It is also for the moments or thoughts that would be awkward to share on <em>Moments</em> or <em>Weibo</em>. Besides, I believe that writing is a good way to test whether you really understand something.
Hope it will be a long journey.</p> Wed, 02 Dec 2020 00:00:00 +0000 stefan1wan.github.io/about/2020/12/start_blog/ Misc GDB Basic Commands <h1 id="gdbbasic-commands">GDB–Basic Commands</h1> <p>Here is a simple GDB tutorial I wrote a while ago.</p> <h3 id="gdb">GDB</h3> <ul> <li>gdb level1: use gdb to debug the binary level1</li> <li>run: execute the binary</li> <li>disas f_A: disassemble function f_A</li> <li>break *0xdeadbeef: set a breakpoint at address 0xdeadbeef</li> <li>info breakpoints: list all breakpoints</li> <li>info registers: check the state of the registers</li> <li>x/wx address: examine the contents at address <ul> <li>w can be b/h/w/g for 1/2/4/8 bytes</li> <li>x/100wx: show 100 four-byte words at a time</li> <li>the second x can be u/d/s/x/i (determines how the memory is displayed) <ul> <li>u: unsigned int</li> <li>d: show as a decimal number</li> <li>x: show as a hexadecimal number</li> <li>s: show as strings</li> <li>i: show as instructions</li> </ul> </li> </ul> </li> <li>ni: execute the next instruction, stepping over calls (if it is a call, run until it returns)</li> <li>si: execute the next instruction, stepping into calls (if it is a call, the function’s first instruction executes next)</li> <li>backtrace: show all the stack frames of the call chain</li> <li>continue: run the process until it ends, crashes, or hits a breakpoint</li> <li>set *address = value <ul> <li>sets 4 bytes at address</li> <li>use char, short, int, or long to write 1, 2, 4, or 8 bytes</li> <li>e.g. set {int}0x80408000 = 666</li> </ul> </li> <li>attach [pid]: attach to a running process</li> </ul> <h3 id="programs-compiled-with-debug-symbols">programs compiled with debug symbols</h3> <ul> <li>list: list the source code</li> <li>b [line]: add a breakpoint at a source line number</li> <li>info locals: list local variables</li> <li>print var: print the value of a variable</li> </ul> <h3 id="gdb-peda">GDB-peda</h3> <ul> <li>checksec: check the protection mechanisms of the binary</li> <li>elfsymbol: get all PLT addresses (useful for ROP)</li> <li>vmmap: check all memory segments and their permissions (read, write, execute)</li> <li>readelf: check the positions of important ELF data structures (.plt, .plt.got, .bss)</li> <li>find /bin/sh: find the address of the string “/bin/sh”</li> </ul> Thu, 31 Oct 2019 00:00:00 +0000 stefan1wan.github.io/about/2019/10/GDB_Basic_Commands/ Tutorial